At first glance, counting the lines in a text file is entirely straightforward: you run `wc` (word count) with the `--lines` (`-l`) option. That’s exactly what I’ve been doing for more than 20 years. But something I read recently made me question whether there are faster and more efficient ways to do it. With today’s very large and fast storage, you can easily have text files of 1 GB, 10 GB, or even 100 GB. At the same time, your laptop has at least 2 or maybe 4 physical cores, which means up to 8 logical cores with hyper-threading, and on a powerful server it’s not surprising at all to have 16 or more CPU cores. So can this simple text-processing task be made more efficient, using as many cores as are available and pushing them to their maximum, to return the line count in a fraction of the time?
Here’s what I found:
- Parallel processing with unix tools
- Use multiple CPU Cores with your Linux commands — awk, sed, bzip2, grep, wc, etc.
- GNU Parallel example: Processing a big file using more CPUs
In the first link above, I came across an interesting utility:
- turbo-linecount: Super fast Line Counter
Apparently, the author of turbo-linecount decided to implement his solution in C++ (for Linux, macOS, and MS Windows). He uses a memory-mapping technique to map the text file into memory, and multi-threading to start threads that count the newline characters (`\n`) in different chunks of the memory region corresponding to the file contents, finally returning the sum of those counts as the line count. Even though the project has some open issues, I think it’s still very interesting. In fact, my initial reaction was: “how come this nice utility is still not a standard package in most GNU/Linux software distributions such as Red Hat, Debian, Ubuntu, etc.?”
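To make that approach concrete, here is a minimal sketch of the same idea in C++ for POSIX systems. This is my own illustration of the technique, not turbo-linecount’s actual code, and every identifier in it is made up: map the file with `mmap()`, hand each hardware thread a chunk of the mapping, count the `\n` bytes in each chunk, and sum the partial counts.

```cpp
// Sketch of mmap + multi-threaded newline counting (not turbo-linecount's code).
#include <algorithm>
#include <cstdio>
#include <numeric>
#include <thread>
#include <vector>

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char** argv) {
    if (argc != 2) {
        std::fprintf(stderr, "usage: %s FILE\n", argv[0]);
        return 1;
    }

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { std::perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0) { std::perror("fstat"); return 1; }
    size_t size = static_cast<size_t>(st.st_size);
    if (size == 0) {  // mmap rejects zero-length mappings
        std::printf("0 %s\n", argv[1]);
        return 0;
    }

    // Map the whole file read-only; the kernel pages it in on demand.
    char* data = static_cast<char*>(
        mmap(nullptr, size, PROT_READ, MAP_PRIVATE, fd, 0));
    if (data == MAP_FAILED) { std::perror("mmap"); return 1; }

    unsigned nthreads = std::max(1u, std::thread::hardware_concurrency());
    std::vector<size_t> counts(nthreads, 0);
    std::vector<std::thread> workers;

    // Round the chunk size up so the chunks cover the entire file.
    size_t chunk = size / nthreads + 1;
    for (unsigned i = 0; i < nthreads; ++i) {
        workers.emplace_back([&, i] {
            size_t begin = i * chunk;
            size_t end = std::min(begin + chunk, size);
            size_t n = 0;
            for (size_t pos = begin; pos < end; ++pos)
                if (data[pos] == '\n') ++n;
            counts[i] = n;  // each thread writes only its own slot
        });
    }
    for (auto& t : workers) t.join();

    size_t total = std::accumulate(counts.begin(), counts.end(), size_t{0});
    std::printf("%zu %s\n", total, argv[1]);

    munmap(data, size);
    close(fd);
    return 0;
}
```

Compile with something like `g++ -O2 -pthread -o linecount linecount.cpp` and point it at a large file. A production-quality counter would scan each chunk with `memchr()` or SIMD instructions rather than a byte-at-a-time loop, but the overall structure, namely splitting one mapped region across threads and summing per-thread counts, is the same.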
Maybe we’ll have better options soon. Or maybe we already do? Let me know if you know of better ways to handle this simple, yet frequently used operation.