Tag Archives: linux

Data Processing Resources: Command-line Interface (CLI) for CSV, TSV, JSON, and XML


Sometimes you don’t want pandas, tidyverse, Excel, or PostgreSQL. You know they are powerful and flexible, and if you already use them daily you can certainly put them to work. But sometimes you just want to be left alone with your CSV, TSV, JSON, and XML files, process them quickly on the command line, and be done with it. And you want something a little more specialized than awk, cut, and sed.

This list is by no means complete or authoritative. I compiled it as a reference that I can come back to later. If you have other suggestions in the spirit of this article, feel free to share them in a comment at the end. Without further ado, here’s my list:

  • xsv: A fast CSV command line toolkit written by the author of ripgrep. It’s useful for indexing, slicing, analyzing, splitting and joining CSV files.
  • q: run SQL directly on CSV or TSV files.
  • csvkit: a suite of command-line tools for converting to and working with CSV, the king of tabular file formats.
  • textql: execute SQL against structured text like CSV or TSV.
  • miller: like awk, sed, cut, join, and sort for name-indexed data such as CSV, TSV, and tabular JSON. You get to work with your data using named fields, without needing to count positional column indices.
  • agate: a Python data analysis library that is optimized for humans instead of machines. It is an alternative to numpy and pandas that solves real-world problems with readable code.

Honorable mentions:

  • SQLite: import a CSV file into an SQLite table, and use plain SQL to query it.
  • csv-mode for Emacs: sort, align, transpose, and manage rows and fields of CSV files.
  • lnav: the Log Navigator. See the tutorial in Linux Magazine.
  • jq: this one is THE tool for processing JSON on the command-line. It’s like sed for JSON data – you can use it to slice and filter and map and transform structured data with the same ease that sed, awk, grep and friends let you play with text.
  • JMESPath (see its tutorial): a query language for JSON. You can extract and transform elements from a JSON document. There are many implementations listed at http://jmespath.org/libraries.html, and the CLI implementation is jp.
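
The SQLite route from the list above can be sketched roughly as follows. This is only a minimal sketch, assuming the sqlite3 command-line shell is installed; the helper name count_by_city, the table name people, and the sample data are hypothetical and exist purely for illustration.

```shell
#!/bin/sh
# Hypothetical helper: count rows per city in a CSV file whose header row
# contains a "city" column. Assumes the sqlite3 command-line shell is installed.
count_by_city() {
    csv_file=$1
    # The table does not exist yet, so .import takes column names from the header row.
    sqlite3 :memory: <<EOF
.mode csv
.import "$csv_file" people
.mode list
.separator "|"
SELECT city, COUNT(*) FROM people GROUP BY city ORDER BY city;
EOF
}

# Build a small sample file (made-up data) and query it.
sample=$(mktemp)
printf '%s\n' 'name,city' 'Ada,London' 'Emre,Ankara' 'Elif,Ankara' > "$sample"
count_by_city "$sample"
rm -f "$sample"
```

Switching back to list mode before the query is a presentation choice: it prints plain pipe-separated rows instead of CSV-formatted output.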

Finally the CLI for XML:

Posted on July 10, 2018 in Linux, Programlama

Faster, RegEx! Match! Match! (Which Regular Expression Utility is the Fastest?)


When it comes to dealing with text data, regular expressions are the bread and butter of data processing, and of programming in general. Hardly a day or two passes without reaching for grep or a similar tool. Until recently, I thought that regular expressions and the tools built around them were very useful but boring, with no innovation left to offer. It turns out that I was wrong!

There are two relatively new players in town: ICGrep and ripgrep.

ICGrep uses a new, parallel bitstream technology developed by Dr. Robert D. Cameron at Simon Fraser University. It claims to be super fast for many text search and processing tasks. ICGrep is available for download from http://www.icgrep.com/downloads.htm as a binary executable for OS X / macOS. Its source code is also available if you want to build it for your operating system.

ripgrep is developed mainly by Andrew Gallant and other open source contributors, and its source code is available at https://github.com/BurntSushi/ripgrep. It is written in the Rust programming language, and claims to be very fast, Unicode-ready, and smart: ready to replace the Silver Searcher (ag) and ack.

Let’s see how they compare to the venerable regular expression utilities that we all know and love. Read the rest of this entry »

 
Posted on November 3, 2016 in Linux, Programlama, sysadmin

Is there a high quality and free Text to Speech system for Dutch that runs on GNU/Linux?


Dear Text to Speech and open source experts:

For a toy / hobby project (non-commercial), I’m trying to find a suitable Text to Speech system for Dutch that I can run on GNU/Linux. So far, the situation does not look very promising. I’ve tried eSpeak, but its Dutch output is not as good as I expected. I ran my experiment with a file “computer.txt” that has the following contents:

Een computer is een apparaat waarmee gegevens volgens formele procedures zoals algoritmen kunnen worden verwerkt. Meestal wordt met het woord computer een elektronisch, digitaal apparaat bedoeld, maar er bestaan ook mechanische en analoge computers.

$ espeak -vnl+7 -s 170 -f computer.txt

Read the rest of this entry »

 
Posted on December 3, 2015 in Linguistics, Linux

PostgreSQL 9 High Availability Cookbook


PostgreSQL 9 High Availability Cookbook is a very well written book whose primary audience is experienced DBAs and system engineers who want to take their PostgreSQL skills to the next level by diving into the details of building highly available PostgreSQL-based systems. Reading this book is like drinking from a fire hose: the signal-to-noise ratio is very high; in other words, every page is packed with important and very practical information. As a consequence, the book is not for newbies: not only do you have to know the fundamental aspects of PostgreSQL from a database administrator’s point of view, but you also need a solid GNU/Linux system administration background.

One of the strongest aspects of the book is the author’s principled and well-structured engineering approach to building a highly available PostgreSQL system. Instead of jumping to recipes to be memorized, the book teaches you basic but very important principles of capacity planning. More importantly, this planning of servers and networking is not only given as a good template: the author also explains the logic behind it, drawing attention to the reasons behind the heuristics he uses and to why some magic numbers are taken as good estimates when more case-specific information is lacking. This style is applied consistently throughout the book: each recipe is explained so that you know why you do something in addition to how you do it. Read the rest of this entry »

 
Posted on August 21, 2014 in Books, Linux, sysadmin

Do not touch that stone – Do not touch that IDE


What’s the relationship between the ancient game of Go and writing software code?

It is possible to draw various analogies between these two sophisticated, intellectual human activities, but I simply wanted to note down one connection that recently came to mind. It can be summarized as a Go proverb:

Do not touch that stone.

And it can also be summarized as a programming motto:

Do not touch that IDE.

What is the meaning of those sayings, other than being Zen-like statements? How and why did I come up with them? What are the context, story and history behind them? And most importantly, can they help us to be better programmers at all?

Read the rest of this entry »

 
Posted on November 15, 2013 in CogSci, philosophy, Programlama, psychology

GNU/Linux command line tip of the day: sum of numbers in a column


More often than not, I need to quickly see the sum of a column of numbers when I’m doing some processing on the GNU/Linux command line. For the sake of simplicity, let’s assume that you have the following output from some command line pipe:
Read the rest of this entry »
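
As a hedged illustration of the idea (not necessarily the approach taken in the full post), awk can add up a column in a single pass. The sketch below assumes whitespace-separated input with the numbers in the second column; the helper name sum_column and the sample data are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: sum the given column (default: column 1) of
# whitespace-separated text read from standard input.
sum_column() {
    # "total + 0" forces numeric output, so empty input prints 0.
    awk -v col="${1:-1}" '{ total += $col } END { print total + 0 }'
}

# Made-up sample data standing in for the output of some pipeline.
printf '%s\n' 'alpha 10' 'beta 20' 'gamma 12' | sum_column 2   # prints 42
```

Passing the column number through -v keeps the awk program itself free of shell interpolation, so the same one-liner works for any column.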

 
Posted on May 28, 2013 in awk, Linux

How to solve the ugly font problem of Java applications in Ubuntu 12.10


Upgrading from a release of Ubuntu GNU/Linux that is a few years old to the latest Ubuntu 12.10 might hurt your eyes… that is, if you happen to code in Java, develop Swing applications, or sometimes prefer IDEs such as NetBeans to Emacs. Upgrading to the latest version of Ubuntu somehow creates a font problem: in many Java applications you see very ugly, bold fonts in menus, tree labels, etc.

This has been confirmed as a bug, and you can read more details at https://bugs.launchpad.net/ubuntu/+source/openjdk-7/+bug/937200 or https://netbeans.org/bugzilla/show_bug.cgi?id=221778

Apparently, this bug is related to Wine and a font package. The solution that worked for me was simply to issue the following command:

    sudo apt-get remove fonts-unfonts-core

Well, I did not need Korean TrueType fonts anyway.

 

 
Posted on February 27, 2013 in java, Linux