RSS

Faster, RegEx! Match! Match! (Which Regular Expression Utility is the Fastest?)

03 Nov

When it comes to dealing with text data, regular expressions are the bread and butter of data processing, as well as programming, most of the time. Hardly a day or two passes before you use grep or a similar tool. Until recently, I thought the field of regular expressions and related tools were very useful, boring, and didn’t present any innovations. It turns out that I was wrong!

There are two relatively new players in town: ICgrep and ripgrep.

ICGrep uses a new, parallel bitstream technology, developed Dr. Robert D. Cameron at Simon Fraser University. It claims to be super fast for many text search and processing tasks. ICGrep is available for download from http://www.icgrep.com/downloads.htm as a binary executable for OS X / MacOS. Its source code is also available if you want to build it for your operating system.

ripgrep is developed mainly by Andrew Gallant and other open source contributors, and its source code is available at https://github.com/BurntSushi/ripgrep. It is developed in Rust programming language, and claims to be very fast, Unicode-ready, as well as smart; ready to replace the Silver Searcher (ag), and “ack“.

Let’s see how they compare to the venerable regular expression utilities that we all know and love.

In this totally unscientific study, I ran the same search tasks on the same data set, using my Apple MacBook laptop, and  compared the performance of `icgrep`, to `ripgrep`, as well as the venerable utilities such GNU grep, BSD grep (that comes pre-installed with OS X), and The Silver Searcher (ag).

First a few words about my environment: Operating system is “OS X El Capitain Version 10.11.4”, CPU is 2.8 GHz Intel Core i7 with a RAM of 16 GB 1600 MHz DDR3, and a storage of 500 GB SSD.

The text file that I used is the dump of Arabic Wikipedia pages downloaded from https://dumps.wikimedia.org/arwiki/latest/arwiki-latest-pages-articles.xml.bz2 (date: 01-Oct-2016 19:56. Size: 510,252,224 bytes in compressed form, and 3,311,291,975 bytes in uncompressed form, that is, 3.1 GB).

For each text search / pattern matching task, I ran the same command 5 times, and took the average of results.

Let’s start with the pre-installed BSD grep that comes with OS X:

$ /usr/bin/grep --version
grep (BSD grep) 2.5.1-FreeBSD

Time to match some patterns:

$ time /usr/bin/grep -c -i Wats.n arwiki-latest-pages-articles.xml
793

real   1m6.120s
user   1m5.550s
sys    0m0.530s

This is a simple regular expression but we’re searching for it in 3.1 GB text file, so it took about 1 minute. We can certainly do better.

Let’s try `GNU grep`:

$ /usr/local/bin/grep --version
grep (GNU grep) 2.26

$ time /usr/local/bin/grep -c -i Wats.n arwiki-latest-pages-articles.xml
793

real    0m2.628s
user    0m2.168s
sys     0m0.457s

We went from more than 1 minute down to 2 seconds. Not bad. But most of the time I use the Silver Searcher (ag) for my text searches, let’s see if it can deal with it:

$ ag --version
ag version 0.31.0

Features:
+jit +lzma +zlib

$ time ag -c -i Wats.n arwiki-latest-pages-articles.xml
ERR: Skipping /Users/emre/Downloads/arwiki-latest-pages-articles.xml: pcre_exec() can't handle files larger than 2147483647 bytes.

Apparently, we can’t use Silver Searcher for large text files, at least this version of it. Now is the time to give the new RegEx utilities a test run.

Let’s start with icgrep:

$ icgrep --version
LLVM (http://llvm.org/):
LLVM version 3.5.0svn
Optimized build.
Built Nov  4 2015 (10:54:34).
Default target: x86_64-apple-darwin15.4.0
Host CPU: core-avx2

$ time icgrep -c -i Wats.n arwiki-latest-pages-articles.xml
793

real    0m3.022s
user    0m2.028s
sys     0m0.993s

So, icgrep is almost as fast as GNU grep for this simple regular expression matching task.

Let’s try ripgrep:

$ rg --version
0.2.6

$ time rg -c -i Wats.n arwiki-latest-pages-articles.xml
793

real    0m4.340s
user    0m3.780s
sys     0m0.560s

In this case, ripgrep seems a little slower than icgrep and GNU grep.

Now that we have a very crude “feel” for the performance of these utilities, let’s try a more interesting regular expression matching task on the same text file: finding the number of Greek characters in the Arabic Wikipedia pages. Because, why not?

The pre-installed BSD grep does not support searching for Unicode patterns, and we have already seen that the installed version of Silver Searcher couldn’t handle a 3.1 GB text file, so we will test only GNU grep, icgrep, and ripgrep:

$ time /usr/local/bin/grep -P -c '\p{Greek}' arwiki-latest-pages-articles.xml
14480

real    0m27.130s
user    0m26.580s
sys     0m0.510s

It took about 27 seconds for GNU grep to find the 14.430 Greek characters in a 3.1 GB text file that includes Arabic wikipedia items.

Can we do better? Let’s try icgrep:

$ time icgrep -c '\p{Greek}' arwiki-latest-pages-articles.xml
14480

real    0m3.203s
user    0m2.168s
sys     0m1.034s

The R&D effort that went into `icgrep` shows that it is possible to do this well within a few seconds.

But what about ripgrep?

$ time rg -c '\p{Greek}' arwiki-latest-pages-articles.xml
14480

real    0m6.800s
user    0m6.230s
sys     0m0.540s

Apparently, for these kind of tasks ripgrep is two times slower than icgrep, yet it’s still much faster than GNU grep.

So, which one is the true winner? I wish life was so simple, but of course, the answer, like in most of the cases, is “it depends.” There’s clearly a lot of research put into ICgrep, and it shows, but its binary executable is only available for OS X / MacOS, so good luck trying to install it on a GNU/Linux or an MS Windows system just to give it a try.

On the other hand, ripgrep also provides great performance, and installation is straightforward for GNU/Linux, MS Windows, and OS X / MacOS systems because of the simple binary executable availability. Since its use case is the same as The Silver Searcher (ag), and “ack”, that is, being smarter and faster than GNU grep, especially when it comes to searching text in software projects, it will most probably be my preferred tool from now on for such use cases.

You can find a very interesting discussion between the author of ICgrep and the author of ripgrep at https://news.ycombinator.com/item?id=12586928, comparing the performance characteristics of both systems for different scenarios. This discussion here is also interesting.

You can also check another regular expression benchmark (though it doesn’t include ICGrep or ripgrep) at http://sljit.sourceforge.net/regex_perf.html.

 
Leave a comment

Posted by on November 3, 2016 in Linux, Programlama, sysadmin

 

Tags: , , , , , , , , ,

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

 
%d bloggers like this: