Recently I worked on a project related to the website of the Universiteit Antwerpen (UA). As part of my task I developed a system in Python that counts the frequencies of all the words occurring throughout the UA website. After spending a few days playing with the data my system produced, I remembered an interesting mathematical law that is supposed to apply to natural languages, namely Zipf’s law:
“Zipf’s law states that given some corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table. Thus the most frequent word will occur approximately twice as often as the second most frequent word, which occurs twice as often as the fourth most frequent word, etc.
Zipf’s law is most easily observed by plotting the data on a log-log graph, with the axes being log(rank order) and log(frequency).”
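To make the quoted statement concrete, here is a minimal sketch of the ideal case, assuming the simplest form of the law in which the word at rank r occurs f(1)/r times. The function name `zipf_expected` and the top frequency of 1200 are made up for illustration; they are not taken from my system or from the UA data.

```python
def zipf_expected(top_frequency, n_ranks):
    """Ideal Zipf frequencies: the word at rank r occurs top_frequency / r times."""
    return [top_frequency / rank for rank in range(1, n_ranks + 1)]


# Illustrative numbers only: pretend the most frequent word occurs 1200 times.
ideal = zipf_expected(1200, 4)
print(ideal)  # [1200.0, 600.0, 400.0, 300.0]
```

Rank 1 occurs twice as often as rank 2, and rank 2 twice as often as rank 4, just as the quote says. Taking logarithms gives log(frequency) = log(f(1)) - log(rank), which is why the ideal case shows up as a straight line with slope -1 on a log-log graph.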
So I decided to try it on my data set and see how well it conforms to Zipf’s law. I plotted two curves: the straight line is the ideal case of Zipf’s law, and the other (blue) one is the actual data:
To compare how well another corpus conforms to Zipf’s law, you can examine this: a plot of word frequency in Wikipedia (November 27, 2006). The plot is in log-log coordinates; x is the rank of a word in the frequency table and y is the total number of the word’s occurrences.
It seems that neither the Wikipedia corpus nor my Dutch corpus conforms to Zipf’s law completely, but both come quite close to it. Moreover, I came to think that my Dutch corpus is not a very high-quality corpus (it is not very representative of everyday Dutch), since a) it comes from a university website and b) it comprises about 40,000 UA web pages, which include not only pages in Dutch but also some pages in English, German and French. Even taking all of these aspects into account, I’m still surprised that I can use Zipf’s law to check whether I created a reasonable data set. Natural languages never cease to amaze me with all their intricacies and interesting statistical properties. (For curious linguists out there, the 12 most frequent Dutch words in this corpus are: de, van, en, in, het, een, of, voor, op, met, te, is.)
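For anyone who wants to run the same sanity check on their own corpus, here is a rough sketch of the idea. It is not the system I built for the UA website: the tokenizer (lowercased runs of letters) and the function names `rank_frequencies` and `loglog_slope` are assumptions made for this example. It ranks words by frequency and then estimates the slope of log(frequency) against log(rank) with an ordinary least-squares fit; a corpus that follows Zipf’s law closely should give a slope near -1.

```python
import math
import re
from collections import Counter


def rank_frequencies(text):
    """Return (word, count) pairs sorted from most to least frequent."""
    # Assumed tokenizer: lowercase the text and keep runs of letters only.
    words = re.findall(r"[^\W\d_]+", text.lower())
    return Counter(words).most_common()


def loglog_slope(counts):
    """Least-squares slope of log(frequency) against log(rank).

    `counts` must be sorted from most to least frequent and contain
    at least two entries.
    """
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(count) for count in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


# Toy input, far too small to say anything about Zipf's law itself.
ranked = rank_frequencies("de van de en de van")
print(ranked)  # [('de', 3), ('van', 2), ('en', 1)]
```

On a real corpus you would pass the counts from `rank_frequencies` to `loglog_slope`, for example `loglog_slope([count for _, count in ranked])`, and look at how far the result is from -1. A single slope hides the deviations at the very top and the long tail of the ranking, so it complements the plot rather than replacing it.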