Is it possible to write a simple command line that detects whether a given piece of text is English, German, or some other natural language, using nothing but gzip and a few other traditional GNU/Linux utilities? Let’s give it a try!
First, let’s create two test files, one for English and one for German (some_de.txt), and store some English and German text in them respectively.
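The commands for this step are not shown here, so the following is only a minimal sketch. The name some_en.txt is an assumption (only some_de.txt appears in the text), and the two sentences are made-up placeholders; any short English and German text would do.

```shell
# Assumed file name for the English sample (not given in the text).
printf '%s\n' 'This is a short piece of English text used for testing.' > some_en.txt

# some_de.txt is the name used in the text; the sentence itself is a placeholder.
printf '%s\n' 'Dies ist ein kurzer deutscher Text, der zum Testen dient.' > some_de.txt
```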
Then let’s retrieve an English and a German corpus from Project Gutenberg, and name them EN and DE respectively.
And now it is time for magic, natural language processing magic. Let’s build our command line for the English test file, which results in EN, the correct answer. And let’s try a similar command line for some_de.txt, which returns DE, showing that this command line worked correctly for these two examples.
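The exact command is not reproduced here. As a rough sketch of the idea, assuming the approach is to measure how many extra compressed bytes the sample adds on top of each corpus and to pick the corpus where that cost is smallest (the function name detect_lang and its argument order are made up for this illustration):

```shell
# Sketch: print the corpus whose gzip-compressed size grows least
# when the sample text is appended to it.
# Usage: detect_lang SAMPLE CORPUS...
detect_lang() {
  sample=$1
  shift
  for corpus in "$@"; do
    # Compressed size of the corpus alone. Reading from stdin keeps the
    # gzip header identical to the piped case below (no file name stored).
    base=$(gzip -c < "$corpus" | wc -c)
    # Compressed size of the corpus followed by the sample.
    both=$(cat "$corpus" "$sample" | gzip -c | wc -c)
    # Extra bytes the sample costs when compressed after this corpus.
    echo "$((both - base)) $corpus"
  done | sort -n | head -n 1 | cut -d' ' -f2-
}
```

With the files described above it could be called as `detect_lang some_de.txt EN DE`. The intuition is that text in the same language as the corpus shares more substrings with it, so gzip needs fewer additional bytes to encode it.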
This blog entry is 100% inspired by a recent lecture by Peter Norvig at http://www.ai-class.com. Near the end of the lecture, he explains how and why it is possible to use compression (gzip) for natural language detection. He does not go into the details in his short lecture, but mentions that there are many interesting connections between understanding and compression. Unfortunately, the most exciting part of the lecture includes a problematic command line: if you take it at face value and try it directly, you have a high probability of getting a wrong result.
If you try the example given by Norvig with the same German text, you will get 71361 EN as a result, and this is not correct. Nevertheless, it is easy to fix the example (as you have already seen above), and I’m thankful to Peter Norvig, not only for inspiring a very cool command line natural language processing example, but also for drawing my attention once again to the mysteries of language, statistics, compression and comprehension.
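For readers wondering how a compression-based comparison can go wrong, here is a hedged sketch of one possible cause. It assumes, without confirmation from the lecture, that the comparison ranks the total compressed size of corpus plus sample; naive_detect is a hypothetical name. Under that assumption the smallest corpus tends to win regardless of the sample's language, because the size of the corpus dominates the total.

```shell
# Sketch of a biased comparison: rank corpora by the TOTAL compressed
# size of corpus plus sample, and print the smallest as "SIZE CORPUS".
# Usage: naive_detect SAMPLE CORPUS...
naive_detect() {
  sample=$1
  shift
  for corpus in "$@"; do
    # Total compressed size; this mostly reflects how big the corpus is.
    size=$(cat "$corpus" "$sample" | gzip -c | wc -c)
    # $((size)) strips any padding that wc may add around the number.
    echo "$((size)) $corpus"
  done | sort -n | head -n 1
}
```

Subtracting the compressed size of each corpus alone, so that only the extra bytes contributed by the sample are compared, removes this bias.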