
Tag Archives: cli

How fast can we count the number of lines in a text file on GNU/Linux, macOS, and MS Windows?


At first glance, the question of counting lines in a text file is super straightforward: you simply run `wc` (word count) with the -l or --lines option. And that’s exactly what I’ve been doing for more than 20 years. But what I read recently made me question whether there are faster and more efficient ways to do it. Nowadays, with very large and fast storage, you can easily have text files of 1 GB, 10 GB, or even 100 GB. Couple that with the fact that your laptop has at least 2 cores, maybe 4, which means up to 8 logical cores with hyper-threading; on a powerful server, it’s not surprising at all to have 16 or more CPU cores. So can this simple text-processing task be made more efficient, using as many cores as are available and driving them to their maximum, to return the line count in a fraction of the time?
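
For reference, the classic approach looks like this (big.txt is a hypothetical file name):

```
# -l (or --lines) makes wc count newline characters.
wc -l big.txt

# A rough single-run timing; run it twice to see the effect of the page cache.
time wc -l big.txt
```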

Here’s what I found:

In the first link above, I came across an interesting utility: turbo-linecount.

Apparently, the author of turbo-linecount decided to implement his solution in C++ (for Linux, macOS, and MS Windows). It memory-maps the text file and uses multi-threading: each thread counts the newlines (`\n`) in its own chunk of the mapped region, and the sum of those per-chunk counts is returned as the line count. Even though there are some issues with it, I think it’s still very interesting. Actually, my initial reaction was: “how come this nice utility is still not a standard package in most of the GNU/Linux software distributions such as Red Hat, Debian, Ubuntu, etc.?”
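
To be clear, the following is not turbo-linecount; it’s just a minimal sketch of the same chunk-and-sum idea, expressed with GNU parallel against a hypothetical big.txt:

```
# Split big.txt into ~100 MB chunks on newline boundaries, run one
# `wc -l` per chunk on all available cores, then add up the counts.
parallel --pipepart -a big.txt --block 100M wc -l |
    awk '{ total += $1 } END { print total }'
```

Compared with a pipeline like this, the mmap-plus-threads design avoids the extra processes and copies, since all threads share the same mapped region of the file.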

Maybe we’ll have better options soon. Or maybe we already do? Let me know if there are better ways to do this simple, yet frequently used, operation.


Posted on November 12, 2018 in Linux, Programming

 


Command Line for the 21st Century: The Low-Hanging Fruit


UPDATE (23-May-2019): Added information about moreutils.

UPDATE (9-Jan-2019): Added information about exa, hyperfine, PathPicker, and svg-term-cli utilities.

UPDATE (19-Nov-2018): Added information about the Pipe Viewer utility.

People who have used Unix since the 1980s, or GNU/Linux since the 1990s, know that they can rely on the command line and its many utilities for a lot of daily automation and data-processing tasks. As someone who’s been using Unix and GNU/Linux based systems since 1994, I’m more than happy that I can count on these tools and the stable know-how built on them. Nevertheless, I think the command line and TUIs (Text-based User Interfaces) can be a bit better in the 21st century. Therefore, in this post, I’ll list a few recent utilities that help us have a better command-line experience.

Before you dive into the list, please be aware that I’m after the low-hanging fruit, that is, tools that can make an existing Unix / Linux command-line environment a bit better with the least disruption. In other words, I will not touch on brand-new, GPU-powered terminal emulators such as alacritty and kitty, and neither will I talk about how nice it’d be if you only changed your shell from Bash to fish or elvish. (If you really want to know about alternative shells, please read https://github.com/oilshell/oil/wiki/ExternalResources and http://www.oilshell.org/blog/2018/01/28.html.) I also won’t send you down the rabbit hole and make you spend countless hours customizing your shell prompt (that deserves an article by itself, but in the meantime you can go and check Go Bullet Train (GBT), you’ve been warned!). Finally, no, I won’t be talking about tmux either, because it has entire books dedicated to it, such as “The Tao of tmux” and “tmux 2: Productive Mouse-Free Development”.

If you think something is missing and fits within the context of “low-hanging fruit” (see above), please add a comment at the end of this post. Also, for a more specialized domain, see my recent post titled “Data Processing Resources: Command-line Interface (CLI) for CSV, TSV, JSON, and XML“.

So let’s start with the cd command, and how it can be enhanced with context, history, and fuzzy matching:
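
To give a taste before the details, here’s a minimal, hypothetical helper along those lines, assuming fzf is installed (tools like z and autojump add the history/context part on top of this idea):

```
# fcd: cd into a subdirectory chosen interactively with fuzzy matching.
fcd() {
    local dir
    dir=$(find . -type d 2>/dev/null | fzf) && cd "$dir"
}
```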

Read the rest of this entry »

 

Posted on October 30, 2018 in Linux

 


Data Processing Resources: Command-line Interface (CLI) for CSV, TSV, JSON, and XML


UPDATE on 8-Feb-2019: Added BigBash It!

UPDATE on 3-Jan-2019: Added GNU datamash for CSV processing

UPDATE on 24-Oct-2018: Added gron for JSON processing.

Sometimes you don’t want pandas, tidyverse, Excel, or PostgreSQL. You know they are very powerful and flexible, and you know that if you’re already using them daily, they will do the job. But sometimes you just want to be left alone with your CSV, TSV, JSON, and XML files, process them quickly on the command line, and get done with it. And you want something a little more specialized than awk, cut, and sed.

This list is by no means complete or authoritative. I compiled it as a reference that I can come back to later. If you have other suggestions in the spirit of this article, feel free to share them by writing a comment at the end. Without further ado, here’s my list:

  • xsv: a fast CSV command-line toolkit written by the author of ripgrep. It’s useful for indexing, slicing, analyzing, splitting, and joining CSV files (see the example after this list, which also demos q).
  • q: run SQL directly on CSV or TSV files.
  • csvkit: a suite of command-line tools for converting to and working with CSV, the king of tabular file formats.
  • textql: execute SQL against structured text like CSV or TSV.
  • miller: like awk, sed, cut, join, and sort for name-indexed data such as CSV, TSV, and tabular JSON. You get to work with your data using named fields, without needing to count positional column indices.
  • agate: a Python data analysis library that is optimized for humans instead of machines. It is an alternative to numpy and pandas that solves real-world problems with readable code.
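
As a quick taste of the first two tools, here is what working with a hypothetical cities.csv might look like; the file name and its columns (city, country, population) are made up for illustration:

```
# xsv: pick two columns and align the output into a readable table.
xsv select city,population cities.csv | xsv table

# q: run SQL directly against the CSV file (-d , sets the delimiter,
# -H says the first row is a header).
q -d , -H "SELECT country, COUNT(*) FROM ./cities.csv GROUP BY country"
```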

Honorable mentions:

  • GNU datamash: a command-line program which performs basic numeric, textual and statistical operations on input textual data files. See examples & one-liners.
  • SQLite: import a CSV file into an SQLite table, and use plain SQL to query it (see the example after this list).
  • csv-mode for Emacs: sort, align, transpose, and manage rows and fields of CSV files.
  • lnav: the Log Navigator. See the tutorial in Linux Magazine.
  • jq: this one is THE tool for processing JSON on the command line. It’s like sed for JSON data: you can use it to slice, filter, map, and transform structured data with the same ease that sed, awk, grep, and friends let you play with text (demoed, along with gron, after this list).
  • gron: transforms JSON into discrete assignments to make it easier to grep for what you want and see the absolute path to it. (Why shouldn’t you just use jq?)
  • jid: JSON Incremental Digger, drill down JSON interactively by using filtering queries like jq.
  • jiq: jid with jq.
  • JMESPath tutorial: a query language for JSON. You can extract and transform elements from a JSON document. There are a lot of implementations at http://jmespath.org/libraries.html and the CLI implementation is jp.
  • BigBash It!: converts your SQL SELECT queries into an autonomous Bash one-liner that can be executed on almost any *nix device to make quick analyses or crunch GB of log files in CSV format. Perfectly suited for Big Data tasks on your local machine. Source code available at https://github.com/Borisvl/bigbash
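
A couple of these are easy to demo. Below is a minimal sketch: jq and gron on an inline JSON snippet, and the SQLite import route using the same hypothetical cities.csv from above:

```
# jq: extract every user's name from a JSON document.
echo '{"users":[{"name":"Ada"},{"name":"Lin"}]}' | jq '.users[].name'

# gron: flatten the same document into greppable assignments.
echo '{"users":[{"name":"Ada"},{"name":"Lin"}]}' | gron | grep name
```

```
# SQLite: import a CSV file into a table, then query it with plain SQL.
# When the table does not exist yet, .import takes the column names
# from the CSV header row.
sqlite3 cities.db <<'EOF'
.mode csv
.import cities.csv cities
SELECT country, COUNT(*) FROM cities GROUP BY country;
EOF
```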

Finally, the CLI for XML:

 

Posted on July 10, 2018 in Linux, Programming

 
