
GODISNOWHERE: A look at a famous question using Python, Google and natural language processing


Are there any commonalities among human intelligence, Bayesian probability models, corpus linguistics, and religion? This blog entry presents a piece of light reading for people interested in a combination of those topics.
You have probably heard the famous question:

       “What do you see below?”

            GODISNOWHERE

The stream of letters can be broken down into English words in two different ways, either as “God is nowhere” or as “God is now here.” You can find an endless set of variations on this theme on the Internet, but I will deal with this example in the context of computational linguistics and big data processing.


When I first read Peter Norvig’s beautiful chapter “Natural Language Corpus Data” in the book “Beautiful Data”, I decided to run an experiment using Norvig’s code. In that chapter, Norvig presents a very concise Python program that ‘learns’ how to break a stream of letters into English words; in other words, a program capable of ‘word segmentation’.

Norvig’s code, coupled with Google’s language corpus, is powerful and impressive: it can take a character string such as

“wheninthecourseofhumaneventsitbecomesnecessary”

and return a correct segmentation:


 ['when', 'in', 'the', 'course', 'of', 'human', 'events', 'it', 'becomes', 'necessary']
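
Under the hood, segment tries every way of splitting off a first word, scores each candidate under a unigram model, and memoizes intermediate results. Here is a minimal sketch of the idea; the counts below are made up for illustration and stand in for Google’s count_1w.txt, and Norvig’s real code additionally caps the candidate word length:

 from functools import lru_cache

 # Toy unigram counts; Norvig's version loads these from count_1w.txt.
 COUNTS = {'when': 61, 'in': 90, 'the': 220, 'course': 6, 'of': 120,
           'human': 8, 'events': 4, 'it': 60, 'becomes': 3, 'necessary': 3}
 N = sum(COUNTS.values())

 def pword(word):
     # Known words get their relative frequency; unknown words get a
     # probability that shrinks with their length.
     return COUNTS.get(word, 0) / N or 1.0 / (N * 10 ** len(word))

 def pwords(words):
     # Probability of a word sequence, assuming independent words.
     p = 1.0
     for w in words:
         p *= pword(w)
     return p

 @lru_cache(maxsize=None)
 def segment(text):
     """Return the most probable segmentation as a tuple of words."""
     if not text:
         return ()
     candidates = ((text[:i],) + segment(text[i:])
                   for i in range(1, len(text) + 1))
     return max(candidates, key=pwords)

 print(segment("wheninthecourseofhumaneventsitbecomesnecessary"))

With these toy counts the sketch should recover the segmentation above; segment2 in Norvig’s code refines the same skeleton with 2-gram (bigram) probabilities.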

But how would it deal with “GODISNOWHERE”? Let’s try it out in a GNU/Linux environment:

First we need to fetch Norvig’s Python code and the relevant data files:

wget http://norvig.com/ngrams/ngrams.py
wget http://norvig.com/ngrams/count_1w.txt
wget http://norvig.com/ngrams/count_2w.txt

Since ngrams.py also includes functions that are not related to word segmentation, let’s remove them and add the following at the end of the Python script:

 print(segment("godisnowhere"))
 print(segment2("godisnowhere"))

And finally run the code:

 python ngrams.py
 ['godisnowhere']
 (-7.520585991236696, ['godisnowhere'])

Apparently, based on Google’s corpus, Norvig’s program cannot build a statistical language model that correctly segments “godisnowhere”: segment leaves the string unbroken, and segment2 returns a tuple of the base-10 log probability and the same unbroken result.

Now, let us ask the following questions:

- What kind of statistical language model would the program build if it encountered only religious texts?

- Would it be able to do word segmentation on “godisnowhere”?

To make a small experiment, let’s take the Bible as a first example. We will use Norvig’s code again, but unfortunately we don’t have count_1w.txt and count_2w.txt files derived from that text. So we will first have to format the Bible as one sentence per line, and then create n-gram files out of it. Let’s get started and see whether our program, after ‘reading’ the Bible, will be able to ‘understand’ what “godisnowhere” is about.

First let’s download the King James Version of the Bible from Project Gutenberg:

 wget http://www.gutenberg.org/ebooks/10.txt.utf-8 -O bible.txt

Now we need to reformat it so that there is one sentence per line. One could say this is a simple sed job, a bit of regular expression magic, but any student of computational linguistics knows that sentence splitting is more than a simple regular expression operation: not every dot or colon marks a sentence boundary. So let’s use another concise, lightweight, but pretty powerful Python program called ‘splitta‘, which does statistical sentence boundary detection using Bayesian methods (as well as other machine learning techniques, but we’ll stick to good old Bayes).
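
To see why naive splitting falls short, here is a quick throwaway illustration (not part of our pipeline): a regex rule that ends a sentence at every ‘.’, ‘!’ or ‘?’ followed by whitespace stumbles on abbreviations:

 import re

 text = ("And God said, Let there be light: and there was light. "
         "See Gen. 1:3 for the verse.")

 # Naive rule: a sentence ends at '.', '!' or '?' followed by whitespace.
 for sentence in re.split(r'(?<=[.!?])\s+', text):
     print(repr(sentence))

The abbreviation ‘Gen.’ produces a bogus sentence boundary. splitta learns such distinctions from data instead: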

 python ../splitta/sbd.py -m ../splitta/model_nb bible.txt -o bible_split.txt

 loading model from [../splitta/model_nb/]... done!
 reading [bible.txt]
 featurizing... done!
 NB classifying... done!

Having formatted bible.txt into bible_split.txt, that is, one sentence per line, we can now create n-gram files based on it. For that I prefer the ngram-count utility from SRILM – The SRI Language Modeling Toolkit, “a toolkit for building and applying statistical language models (LMs), primarily for use in speech recognition, statistical tagging and segmentation, and machine translation.”
SRILM is a very mature and heavily optimized set of tools for statistical linguistics with good documentation, so I opted for it instead of using another library or writing something from scratch.

Mirroring Norvig’s use of 1-grams and 2-grams, let’s create the corresponding count files by processing bible_split.txt:

 ngram-count -order 1 -text bible_split.txt > count_1w.txt
 ngram-count -order 2 -text bible_split.txt > count_2w.txt
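
By the way, if you don’t have SRILM at hand, a rough pure-Python stand-in for these two commands could look like the sketch below; SRILM adds sentence-boundary markers and applies its own tokenization, so the counts will not be byte-for-byte identical:

 from collections import Counter

 # Rough stand-in for the two ngram-count invocations above;
 # assumes bible_split.txt exists, one sentence per line.
 unigrams, bigrams = Counter(), Counter()
 with open('bible_split.txt', encoding='utf-8') as f:
     for line in f:
         tokens = line.split()
         unigrams.update(tokens)
         bigrams.update(zip(tokens, tokens[1:]))

 # One n-gram per line, followed by a tab and its count, matching the
 # format of Norvig's count_1w.txt and count_2w.txt files.
 with open('count_1w.txt', 'w', encoding='utf-8') as out:
     for w, c in unigrams.most_common():
         out.write('%s\t%d\n' % (w, c))

 with open('count_2w.txt', 'w', encoding='utf-8') as out:
     for (w1, w2), c in bigrams.most_common():
         out.write('%s %s\t%d\n' % (w1, w2, c))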

At this stage we are almost ready to run Norvig’s code again, but before proceeding we need to make one small adjustment: the number of tokens, defined as a constant in Norvig’s code, is originally given as:

 N = 1024908267229 ## Number of tokens

Since we are not using Google’s corpus but only the Bible text, we need to set N to the number of tokens in bible.txt. Counting tokens means counting not only words but also punctuation marks (not that this makes a difference in the final result of the program):

 cat bible.txt | sed 's/\([[:punct:]]\)/ \1 /g' | wc -w
 1014633
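
So the relevant line in ngrams.py becomes:

 N = 1014633 ## Number of tokens in bible.txt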

Running Norvig’s program with this value, we get:

 python ngrams.py
 ['god', 'is', 'no', 'where']
 (-11.993038046417457, ['god', 'is', 'no', 'where'])

We see that unlike the model built from Google’s corpus, the one built from the Bible’s text is able to segment “godisnowhere”, producing “god is no where.”

I was curious to run a similar experiment on another religious text, namely the Koran. For that, I used the text titled “The Koran (Al-Qur’an) by G. Margoliouth and J. M. Rodwell”. I’ll skip the intermediate steps and jump straight to the final one:

 python word_segmentation.py
 ['god', 'is', 'nowhere']
 (-8.744177619187273, ['god', 'is', 'nowhere'])

The probabilities are slightly different, and so is the segmentation (“nowhere” as one word instead of “no where”), but the reading is the same: “god is nowhere.”

We have seen that the program infers “God is now here” neither from the Bible nor from the Koran. One naturally wonders what kind of meaningful text could lead to that segmentation. It is, of course, very easy to write a short piece of text that would produce such a result, but I wonder whether any famous historical text (religious or not) would do the same.
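
Just to demonstrate how little it takes, here is a self-contained sketch in the spirit of Norvig’s segment2 (compressed and simplified, not his actual code), fed a contrived mini-corpus that is made up purely for this illustration and in which ‘now here’ is a frequent phrase:

 from functools import lru_cache
 import math

 # A contrived mini-corpus, made up purely for this illustration.
 corpus = """god is now here
 the train is now here
 the bus is now here
 he was nowhere to be found"""

 unigrams, bigrams = {}, {}
 for line in corpus.splitlines():
     tokens = line.split()
     for w in tokens:
         unigrams[w] = unigrams.get(w, 0) + 1
     for a, b in zip(tokens, tokens[1:]):
         bigrams[(a, b)] = bigrams.get((a, b), 0) + 1
 N = sum(unigrams.values())

 def p1(w):
     # Unigram probability; unknown words are penalized by length.
     return unigrams.get(w, 0) / N or 1.0 / (N * 10 ** len(w))

 def p2(w, prev):
     # Conditional bigram probability, backing off to the unigram model.
     if (prev, w) in bigrams:
         return bigrams[(prev, w)] / unigrams[prev]
     return p1(w)

 @lru_cache(maxsize=None)
 def segment2(text, prev='<S>'):
     """Most probable segmentation: (log10 probability, tuple of words)."""
     if not text:
         return 0.0, ()
     best = None
     for i in range(1, len(text) + 1):
         first, rest = text[:i], text[i:]
         rest_logp, rest_words = segment2(rest, first)
         cand = (math.log10(p2(first, prev)) + rest_logp,
                 (first,) + rest_words)
         if best is None or cand > best:
             best = cand
     return best

 print(segment2("godisnowhere"))

With the bigram ‘now here’ dominating this tiny corpus, the sketch should prefer (‘god’, ‘is’, ‘now’, ‘here’); the open question remains whether any famous historical text has such statistics.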

To finish with the initial question, let’s repeat: Are there any commonalities among human intelligence, Bayesian probability models, corpus linguistics, and religion?

The inner workings of the human mind and brain are still a mystery, and statistical language processing is a narrow and specialized look at one of the most marvelous phenomena in the universe. I know that I did not answer the original question definitively, but I hope I was able to kindle your curiosity and make you ask better questions.

UPDATE (2-Mar-2014): Yusuf Arslan drew my attention to the fact that “godisnowhere” can also be segmented into words as “God I snow here”, but he could not provide me with any Eskimo religious text that could lead to such an interesting result.


Posted on March 1, 2014 in Linguistics, Programming, Python

 


2 responses to “GODISNOWHERE: A look at a famous question using Python, Google and natural language processing”

  1. mdakin

    March 10, 2014 at 17:05

    Or, “Godi, snow here!” In a fictional language where Nimbostratus clouds are called “godi”.

    So basically your study tells us that Norvig’s algorithm is actually too simplistic and cannot identify plausible segmentation possibilities?

     
  2. Emre Sevinç

    March 10, 2014 at 17:17

    Another nice variation on a theme! :)

    Well, I wouldn’t call my short piece a ‘study’ at all (‘study’ sounds very serious and scientific), but rather a weekend hack / curiosity. From what I wrote, I think it would be fair to say that Norvig’s naive Bayes word segmentation code is good enough for many practical purposes. Actually I was more surprised by Google’s corpus not having enough relevant data to segment GODISNOWHERE (because, a) I’d expect Google’s corpus to be enough, given that in my examples I used much more limited corpora, namely two very well-known texts, and b) it has enough data to segment much more complicated examples). I also remember playing with 3-grams (and modifying the code accordingly), but not getting different results. This should not be that surprising though: the conditional probability of ‘… is now here’ seems to be relatively low (and it is definitely so in the texts of the holy books mentioned above).

     
