Are there any commonalities among human intelligence, Bayesian probability models, corpus linguistics, and religion? This blog entry presents a piece of light reading for people interested in a combination of those topics.
You have probably heard the famous question:
“What do you see below?”
GODISNOWHERE
The stream of letters can be broken down into English words in two different ways, either as “God is nowhere” or as “God is now here.” You can find an endless set of variations on this theme on the Internet, but I will deal with this example in the context of computational linguistics and big data processing.
When I first read the beautiful book chapter “Natural Language Corpus Data” by Peter Norvig, in the book “Beautiful Data”, I decided to run an experiment using Norvig’s code. In that chapter, Norvig shows a very concise Python program that ‘learns’ how to break down a stream of letters into English words; in other words, a program capable of ‘word segmentation’.
Norvig’s code, coupled with Google’s language corpus, is powerful and impressive; it is able to take a character string such as
‘wheninthecourseofhumaneventsitbecomesnecessary’
and return a correct segmentation:
‘when’, ‘in’, ‘the’, ‘course’, ‘of’, ‘human’, ‘events’, ‘it’, ‘becomes’, ‘necessary’
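To make the idea concrete, here is a minimal sketch of unigram word segmentation, written for Python 3. It is not Norvig’s actual code: the word counts are made up for illustration, the function names are hypothetical, and the penalty for unseen words is only modeled on the length-based one described in his chapter.

```python
import math
from functools import lru_cache

# Toy unigram counts; these numbers are invented for illustration only.
COUNTS = {"when": 30, "in": 200, "the": 500, "course": 10,
          "he": 40, "our": 25, "hen": 2}
N = sum(COUNTS.values())  # total number of tokens in the toy "corpus"

def log_pw(word):
    """Log10 probability of a word; unseen words are penalized by their length."""
    if word in COUNTS:
        return math.log10(COUNTS[word] / N)
    return math.log10(10.0 / (N * 10 ** len(word)))

def splits(text, max_len=20):
    """All (first, rest) pairs where first is a non-empty prefix of text."""
    return [(text[:i], text[i:]) for i in range(1, min(len(text), max_len) + 1)]

@lru_cache(maxsize=None)
def segment(text):
    """Return (log probability, words) for the most probable segmentation."""
    if not text:
        return 0.0, ()
    candidates = []
    for first, rest in splits(text):
        rest_logp, rest_words = segment(rest)
        candidates.append((log_pw(first) + rest_logp, (first,) + rest_words))
    return max(candidates)

logp, words = segment("wheninthecourse")
print(words)  # ('when', 'in', 'the', 'course')
```

The memoization (`lru_cache`) matters: without it the same suffix would be re-segmented many times over. With a real corpus, only `COUNTS` and `N` would change.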
But how would it deal with “GODISNOWHERE”? Let’s try it out in a GNU/Linux environment:
First we need to fetch Norvig’s Python code and the relevant data files:
wget http://norvig.com/ngrams/ngrams.py
wget http://norvig.com/ngrams/count_1w.txt
wget http://norvig.com/ngrams/count_2w.txt
Since ngrams.py also includes functions that are not related to word segmentation, let’s remove them and add the following at the end of the Python script:
And finally run the code:
python ngrams.py
['godisnowhere']
(-7.520585991236696, ['godisnowhere'])
Apparently, based on Google’s corpus, Norvig’s program cannot create a statistical language model that can correctly segment “godisnowhere”: it returns the whole string as a single word.
Now, let us ask the following questions:
– What kind of statistical language model would the program build if it encountered only
– Would it be able to do word segmentation on “godisnowhere”?
To make a small experiment, let’s take the Bible as a first example. In this case, we will use Norvig’s code again, but unfortunately we don’t have count_1w.txt and count_2w.txt files derived from it. So we will first have to format the Bible text as one sentence per line, and then create n-gram files out of that. So let’s get started and see whether our program, after ‘reading’ the Bible, will be able to ‘understand’ what “godisnowhere” is about.
First, let’s download the King James Version of the Bible from Project Gutenberg.
wget http://www.gutenberg.org/ebooks/10.txt.utf-8 -O bible.txt
Now we need to reformat it so that there is one sentence per line. One could say this would be a simple sed operation using regular expression magic, but any student of computational linguistics is aware that sentence splitting is more than a simple regular expression operation: not every dot or colon marks a sentence boundary. So let’s use another concise, lightweight, but pretty powerful Python program called ‘splitta’, which does statistical sentence boundary detection using Bayesian methods (as well as other machine learning techniques, but we’ll stick to good old Bayes).
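A small sketch shows why a plain regular expression is not enough. The rule and the sample sentence below are made up for illustration; this is the naive approach, not how splitta works.

```python
import re

def naive_split(text):
    """Split after '.', '!' or '?' followed by whitespace: a deliberately naive rule."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

# Two real sentences, but with abbreviations that end in a dot.
sample = "Dr. Smith read Gen. 1:1 aloud. Then he sat down."
pieces = naive_split(sample)
for piece in pieces:
    print(repr(piece))
```

The naive rule produces four pieces instead of two, because it treats the dots in “Dr.” and “Gen.” as sentence ends. A statistical splitter instead learns from context which dots really close a sentence.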
python ../splitta/sbd.py -m ../splitta/model_nb bible.txt -o bible_split.txt
loading model from [../splitta/model_nb/]... done!
reading [bible.txt]
featurizing... done!
NB classifying... done!
Having formatted bible.txt into bible_split.txt, with one sentence per line, we can now create n-gram files from it. To do that, I prefer to use the ngram-count utility from SRILM – The SRI Language Modeling Toolkit, “a toolkit for building and applying statistical language models (LMs), primarily for use in speech recognition, statistical tagging and segmentation, and machine translation.”
SRILM is a very mature and heavily optimized set of tools for statistical linguistics, and it has good documentation; I therefore opted for it instead of using another library or writing something from scratch.
Similar to Norvig’s use of 1-grams and 2-grams, let’s create the corresponding files by processing bible_split.txt:
ngram-count -order 1 -text bible_split.txt > count_1w.txt
ngram-count -order 2 -text bible_split.txt > count_2w.txt
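For readers without SRILM installed, the counting step can be approximated in a few lines of Python. This is only a rough sketch, not a replacement for ngram-count: the function names are hypothetical, lowercasing is an assumption of this sketch, and no sentence-boundary tokens are added. The two sample sentences stand in for lines of bible_split.txt.

```python
from collections import Counter

def count_ngrams(sentences, order):
    """Count whitespace-separated n-grams of the given order, one sentence at a time."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()  # lowercasing is an assumption of this sketch
        for i in range(len(tokens) - order + 1):
            counts[" ".join(tokens[i:i + order])] += 1
    return counts

def format_counts(counts):
    """Render counts as 'ngram<TAB>count' lines, most frequent first."""
    return "\n".join(f"{ngram}\t{n}" for ngram, n in counts.most_common())

# Stand-in for the lines of a one-sentence-per-line file.
sentences = ["In the beginning God created the heaven and the earth .",
             "And God said , Let there be light ."]
unigrams = count_ngrams(sentences, 1)
bigrams = count_ngrams(sentences, 2)
print(unigrams["god"], unigrams["the"], bigrams["and god"])  # 2 3 1
```

Because n-grams are counted sentence by sentence, no bigram spans a sentence boundary, which is the reason for splitting the text into one sentence per line first.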
At this stage, we are almost ready to run Norvig’s code again, but before proceeding we need to make one small adjustment: the number of tokens, defined as a constant in Norvig’s code, which is originally given as:
N = 1024908267229 ## Number of tokens
Since we are not using Google’s corpus, but rather only the Bible text, we need to set N to the number of tokens in bible.txt. Counting tokens means counting not only words but also punctuation marks (not that this makes a difference in the final result of the program):
cat bible.txt | sed 's/\([[:punct:]]\)/ \1 /g' | wc -w
1014633
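The same count can be sketched in Python. The assumption here is that `string.punctuation` (ASCII punctuation) matches what `[[:punct:]]` covers in the shell’s locale; `count_tokens` is a hypothetical helper name, and the sample verse is just a short test input.

```python
import re
import string

# Character class matching any single ASCII punctuation mark.
PUNCT = re.compile("([" + re.escape(string.punctuation) + "])")

def count_tokens(text):
    """Count whitespace-separated tokens after putting spaces around each punctuation mark."""
    return len(PUNCT.sub(r" \1 ", text).split())

sample = "And God said, Let there be light: and there was light."
print(count_tokens(sample))  # 14: eleven words plus ',', ':' and '.'
```

To count a whole file, one would read it first, for example `count_tokens(open("bible.txt", encoding="utf-8").read())`.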
If we set N to 1014633 and then run Norvig’s program, we get:
python ngrams.py
['god', 'is', 'no', 'where']
(-11.993038046417457, ['god', 'is', 'no', 'where'])
We see that, unlike the program that used Google’s corpus, the one that used the Bible’s text is able to do word segmentation on “godisnowhere” and produce the result “god is no where.”
I was curious to do another similar experiment, this time on another religious text, namely the Koran. For that, I used the text titled “The Koran (Al-Qur’an) by G. Margoliouth and J. M. Rodwell”. I’ll skip the intermediate steps and jump immediately to the final step:
python word_segmentation.py
['god', 'is', 'nowhere']
(-8.744177619187273, ['god', 'is', 'nowhere'])
The probabilities are slightly different, but the overall statistical model leads to the same reading for word segmentation: “god is nowhere”, this time as a single word rather than the Bible model’s “no where”.
We have seen that the program cannot infer “God is now here” from either the Bible or the Koran text. One naturally wonders what kind of meaningful text could lead to a word segmentation result such as “God is now here”. It is, of course, very easy to quickly write a short piece of text that would lead to such a result, but I wonder whether there are any famous historical texts (focusing on religion or not) that would do so.
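As an illustration of how easily such a text can be written, here is a minimal sketch that scores the three readings under unigram counts taken from a made-up miniature text. The text, the helper names, and the unseen-word penalty are all assumptions of this sketch; it compares only the three readings discussed above rather than searching every possible split.

```python
import math
from collections import Counter

# A made-up miniature "corpus"; not taken from any historical text.
TEXT = """god is here now . god is now here with us .
now is the time . here is the place . where is no one ."""

COUNTS = Counter(w for w in TEXT.lower().split() if w.isalpha())
N = sum(COUNTS.values())

def log_pw(word):
    """Log10 unigram probability; unseen words get a length-based penalty."""
    if word in COUNTS:
        return math.log10(COUNTS[word] / N)
    return math.log10(10.0 / (N * 10 ** len(word)))

def score(words):
    """Log10 probability of a segmentation under the unigram model."""
    return sum(log_pw(w) for w in words)

candidates = [["god", "is", "nowhere"],
              ["god", "is", "no", "where"],
              ["god", "is", "now", "here"]]
best = max(candidates, key=score)
for words in candidates:
    print(" ".join(words), round(score(words), 2))
print("best:", " ".join(best))
```

In this toy text “now” and “here” each occur three times, “no” and “where” only once, and “nowhere” never, so “god is now here” gets the highest score. The outcome simply mirrors the word frequencies of whatever text the model has ‘read’.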
To finish with the initial question, let’s repeat: Are there any commonalities among human intelligence, Bayesian probability models, corpus linguistics, and religion?
The inner workings of the human mind and brain are still a mystery, and statistical language processing is a narrow and specialized look at one of the most marvelous phenomena in the universe. I know that I did not answer the original question definitively, but I hope I was able to kindle your curiosity and make you ask
UPDATE (2-Mar-2014): Yusuf Arslan drew my attention to the fact that “godisnowhere” can also be segmented into words as “God I snow here”, but he could not provide me with any Eskimo religious text that could lead to such an interesting result.