
Data Processing Resources: Command-line Interface (CLI) for CSV, TSV, JSON, and XML


Updates

  • 2023-12-11: dsq is no longer developed; alternative tools added
  • 2022-06-27: Added bdt
  • 2022-01-17: Added dsq
  • 2021-08-25: Added DuckDB
  • 2020-09-29: Added GNU Parallel
  • 2020-09-28: Added pspg (tabulated data aware pager), and pgcli
  • 2020-09-23: Added query-json
  • 2020-05-27: Added rows for tabular data processing
  • 2020-05-21: Added huniq
  • 2020-01-23: Added VisiData CLI data viewer and plotter
  • 2020-01-10: Added fx interactive JSON viewer and processor.
  • 2019-02-08: Added BigBash It!
  • 2019-01-03: Added GNU datamash for CSV processing
  • 2018-10-24: Added gron for JSON processing.

Sometimes you don’t want pandas, tidyverse, Excel, or PostgreSQL. You know they are very powerful and flexible, and you know that if you already use them daily, you can put them to work. But sometimes you just want to be left alone with your CSV, TSV, JSON, and XML files, process them quickly on the command line, and be done with it. And you want something a little more specialized than awk, cut, and sed. (Also, to go beyond data processing and enrich your daily command-line experience, don’t forget to check out “Command Line for the 21st Century: The Low Hanging Fruit”.)

This list is by no means complete and authoritative. I compiled it as a reference that I can come back to later. If you have other suggestions in the spirit of this article, feel free to share them by writing a comment at the end. Without further ado, here’s my list:

  • xsv: A fast CSV command-line toolkit written by the author of ripgrep. It’s useful for indexing, slicing, analyzing, splitting, and joining CSV files (a few sample invocations of xsv, q, and miller follow this list).
  • q: run SQL directly on CSV or TSV files.
  • csvkit: a suite of command-line tools for converting to and working with CSV, the king of tabular file formats.
  • textql: execute SQL against structured text like CSV or TSV.
  • miller: like awk, sed, cut, join, and sort for name-indexed data such as CSV, TSV, and tabular JSON. You get to work with your data using named fields, without needing to count positional column indices.
  • agate: a Python data analysis library that is optimized for humans instead of machines. It is an alternative to numpy and pandas that solves real-world problems with readable code.
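
To give a concrete feel for these tools, here is a minimal sketch of a few typical invocations (file, column, and field names are hypothetical):

    # xsv: quick per-column statistics, then select two columns by name
    xsv stats data.csv
    xsv select name,age data.csv

    # q: run SQL directly against a CSV file (-d sets the delimiter, -H means a header row is present)
    q -d ',' -H "SELECT name, COUNT(*) FROM ./data.csv GROUP BY name"

    # miller: mean of a numeric field per group, reading CSV and emitting JSON
    mlr --icsv --ojson stats1 -a mean -f age -g city data.csv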

Honorable mentions:

  • bdt: Boring Data Tool. A command-line (CLI) tool for viewing, querying, and converting between various file formats. Powered by DataFusion (an extensible query execution framework, written in Rust, that uses Apache Arrow as its in-memory format). This tool can work with CSV, JSON, Parquet, and Avro file formats. It can run SQL queries against such files, and also convert between file formats.
  • dsq: a command-line tool for running SQL queries against JSON, CSV, Excel, Parquet, and more. According to its GitHub page and primary author, dsq is no longer developed: “While development may continue in the future with a different architecture, for the moment you should probably instead use DuckDB, ClickHouse-local, or GlareDB (based on DataFusion). These are built on stronger analytics foundations than projects like dsq based on SQLite. For example, column-oriented storage and vectorized execution, let alone JIT-compiled expression evaluation, are possible with these other projects.” Also see this tweet and this tweet by Phil Eaton, the primary author of dsq. Those threads also reference Databend, which is advertised as “Your best alternative to Snowflake. Cost-effective and simple for massive-scale analytics.” Some example dsq invocations:
    • dsq testdata.json "SELECT * FROM {} WHERE x > 10"
    • dsq testdata.ndjson "SELECT name, AVG(time) FROM {} GROUP BY name ORDER BY AVG(time) DESC"
    • cat testdata.csv | dsq -s csv "SELECT * FROM {} LIMIT 1"
    • cat testdata.parquet | dsq -s parquet "SELECT COUNT(1) FROM {}"
    • dsq testdata/join/users.csv testdata/join/ages.json "select {0}.name, {1}.age from {0} join {1} on {0}.id = {1}.id"
  • GNU datamash: a command-line program which performs basic numeric, textual and statistical operations on input textual data files. See examples & one-liners.
  • SQLite: import a CSV file into an SQLite table, and use plain SQL to query it (see the sketch after this list).
  • DuckDB: like SQLite, but specifically designed to handle bigger data sets, using a vectorized engine optimized for analytics and parallel query processing. Being able to directly query Parquet & CSV files, without any dependencies, is another plus point (see the sketch after this list).
  • csv-mode for Emacs: sort, align, transpose, and manage rows and fields of CSV files.
  • lnav: the Log Navigator. See the tutorial in Linux Magazine.
  • jq: this one is THE tool for processing JSON on the command line. It’s like sed for JSON data – you can use it to slice and filter and map and transform structured data with the same ease that sed, awk, grep, and friends let you play with text. (A sample invocation appears in the sketch after this list.)
  • gron: transforms JSON into discrete assignments to make it easier to grep for what you want and see the absolute path to it; see the sketch after this list for an example. (Why shouldn’t you just use jq?)
  • jid: JSON Incremental Digger; drill down into JSON interactively using filtering queries, like jq.
  • jiq: jid with jq.
  • fx: interactive JSON viewer and processor.
  • JMESPath tutorial: a query language for JSON. You can extract and transform elements from a JSON document. There are many implementations listed at http://jmespath.org/libraries.html, and the CLI implementation is jp.
  • query-json: a faster re-implementation of jq, written in Reason Native and compiled to a native binary with OCaml. You can use it like:
    query-json ".store.books | filter(.price > 10)" stores.json
  • BigBash It!: converts your SQL SELECT queries into an autonomous Bash one-liner that can be executed on almost any *nix device to make quick analyses or crunch GB of log files in CSV format. Perfectly suited for Big Data tasks on your local machine. Source code available at https://github.com/Borisvl/bigbash
  • VisiData: an interactive multi-tool for tabular data. In addition to CSV, it supports a lot of formats, and has plotting capabilities inside the terminal. You can use it to convert between formats, as an interactive replacement for grep, awk, sed, cut, sort, and uniq, to create ad-hoc data pipelines on the command line, and as a generic utility with many automation capabilities.
  • huniq: Replacement for sort | uniq optimized for speed (10x faster) when sorting is not needed. huniq replaces sort | uniq (or sort -u with gnu sort) and huniq -c replaces sort | uniq -c. The order of the output is stable when in normal mode, but it is not stable when in -c/count mode.
  • rows: No matter which format your tabular data is in, rows will import it, automatically detect types, and give you high-level Python objects so you can start working with the data instead of trying to parse it. It is also locale- and Unicode-aware. More information is in the documentation. The author of miller likes rows, too!
  • Everybody who uses psql or similar command-line clients, e.g. pgcli, for interacting with databases such as PostgreSQL, MySQL, SQLite, etc. uses the venerable less pager. And that’s OK, except that it has no special support for tabular data. pspg is your turbo-charged, specialized pager for tabular data. Oh, and of course you can use it perfectly well as a standalone utility for all your CSV viewing, browsing, searching, and sorting needs (though some of the utilities already mentioned, such as VisiData, might serve you better, because pspg is really intended to be used as a pager like less). And pspg has themes… lots of themes!
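
To make a few of the honorable mentions above concrete as well, here is a minimal sketch (file, table, and field names are hypothetical):

    # jq: extract selected fields from every element of a JSON array
    jq '.items[] | {id, name}' data.json

    # gron: flatten JSON into greppable assignments showing the full path
    gron data.json | grep "email"

    # SQLite: import a CSV file into a table, then query it with plain SQL
    sqlite3 -cmd ".mode csv" -cmd ".import data.csv people" mydb.sqlite "SELECT COUNT(*) FROM people;"

    # DuckDB: query a CSV (or Parquet) file directly, no import step needed
    duckdb -c "SELECT city, AVG(age) FROM 'data.csv' GROUP BY city;"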

Finally, the CLI for XML:
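
One widely used tool in this space is xmlstarlet, which can query, transform, edit, and validate XML documents from the shell. A minimal sketch (file names and XPath expressions are hypothetical):

    # Extract the text of every <title> element via an XPath query
    xmlstarlet sel -t -v "//book/title" -n books.xml

    # Pretty-print (re-indent) an XML document
    xmlstarlet fo books.xml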

A Word on Performance and Parallel Processing

Some of the tools described might not be as performant as you want, using only one core of your many-core CPU. In that case, please keep in mind that you can combine these tools with the excellent GNU Parallel utility to make them process data in parallel, utilizing all the available CPU cores on your machine. For more information about GNU Parallel, see the examples, cheat sheet, and video tutorials.
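
As a minimal sketch of the pattern (the file name and jq filter are hypothetical), parallel’s --pipe mode chops standard input into blocks on line boundaries and feeds each block to a separate process, one per core:

    # Filter a large newline-delimited JSON file with one jq instance per CPU core
    cat big.ndjson | parallel --pipe --block 10M "jq -c 'select(.price > 10)'" > filtered.ndjson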

About the author: Emre is the co-founder & CTO of TM Data ICT Solutions in Belgium. You can read more about him on the About page of this blog.

 

Posted on July 10, 2018 in Linux, Programming

 
