Comparing top 100 Dutch words to Zipf’s law

23 May

Recently I was involved in a project related to the website of the Universiteit Antwerpen (UA). As part of my task I developed a system in Python to count the frequencies of all the words that occur throughout the UA website. After spending a few days playing with the data my system produced, I remembered an interesting mathematical law that is supposed to apply to natural languages, namely Zipf’s law:

“Zipf’s law states that given some corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table. Thus the most frequent word will occur approximately twice as often as the second most frequent word, which occurs twice as often as the fourth most frequent word, etc.

Zipf’s law is most easily observed by plotting the data on a log-log graph, with the axes being log(rank order) and log(frequency).”
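To make the quoted statement concrete: in its idealized form the law says that the word of rank r has frequency f(r) = f(1) / r, so log f(r) = log f(1) - log r, which is a straight line with slope -1 on a log-log graph. The short Python sketch below illustrates this. The function name `ideal_zipf` and the top frequency of 1000 are arbitrary choices for illustration, not values from any corpus.

```python
import math

def ideal_zipf(top_frequency, n_ranks):
    """Return ideal Zipf frequencies f(r) = top_frequency / r for r = 1..n_ranks."""
    return [top_frequency / rank for rank in range(1, n_ranks + 1)]

# Arbitrary illustrative value, not measured from any corpus.
freqs = ideal_zipf(1000.0, 4)

# Rank 1 is twice as frequent as rank 2, which is twice as frequent as rank 4.
ratio_1_to_2 = freqs[0] / freqs[1]
ratio_2_to_4 = freqs[1] / freqs[3]

# Slope between ranks 1 and 4 in log-log coordinates (close to -1).
slope = (math.log(freqs[3]) - math.log(freqs[0])) / (math.log(4) - math.log(1))
print(ratio_1_to_2, ratio_2_to_4, slope)
```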

So I decided to give my data set a try and see how well it conforms to Zipf’s law. I plotted two curves, the straight line being the ideal case of Zipf’s law and the other (blue) one being the actual data:

Zipf’s law and top 100 Dutch words


To see how well another corpus conforms to Zipf’s law, you can examine this plot of word frequency in Wikipedia (November 27, 2006). The plot is in log-log coordinates: x is the rank of a word in the frequency table and y is the total number of the word’s occurrences.

A plot of word frequency in Wikipedia (November 27, 2006)
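One way to put a number on how closely a ranked frequency table follows Zipf’s law is to fit a straight line to log(frequency) against log(rank) and look at its slope; an ideal Zipf distribution gives -1. The sketch below does this with an ordinary least-squares fit using only the standard library. The function name `loglog_slope` is my own, and the input is synthetic, so it shows the technique rather than a measurement of either corpus. It needs at least two ranks and strictly positive counts.

```python
import math

def loglog_slope(frequencies):
    """Least-squares slope of log(frequency) against log(rank).

    `frequencies` must be positive counts sorted in descending order,
    so that position i (0-based) corresponds to rank i + 1.
    """
    xs = [math.log(rank) for rank in range(1, len(frequencies) + 1)]
    ys = [math.log(f) for f in frequencies]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return covariance / variance

# Synthetic check: counts that follow f(r) = 1200 / r exactly lie on a line of slope -1.
perfect = [1200 / rank for rank in range(1, 11)]
print(round(loglog_slope(perfect), 6))
```

A slope noticeably different from -1, or a curve that bends away from the fitted line at the lowest or highest ranks, indicates a departure from the ideal case.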

It seems that neither the Wikipedia corpus nor my Dutch corpus conforms totally to Zipf’s law, but both nevertheless come quite close to it. Moreover, I came to think that my Dutch corpus is not a very high-quality corpus (it is not very representative of everyday Dutch) since a) it is from a university website and b) it consists of about 40,000 web pages of UA, including pages not only in Dutch but also some in English, German and French. Even taking all of these aspects into account, I’m still surprised that I can use Zipf’s law to check whether I created a reasonable data set. Natural languages never cease to amaze me with all their intricacies and interesting statistical properties. (For curious linguists out there, the 12 most frequent Dutch words in this corpus are: de, van, en, in, het, een, of, voor, op, met, te, is.)
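A ranked word list like the one above can be produced with very little Python. The following is a minimal sketch, assuming the text has already been extracted from the web pages into a plain string; the function name `word_frequencies`, the `\w+` tokenization and the sample sentence are my assumptions for illustration and not the system that was run on the UA website.

```python
import re
from collections import Counter

def word_frequencies(text):
    """Count case-insensitive word occurrences in a plain-text string."""
    # \w+ matches runs of letters, digits and underscores (Unicode-aware in Python 3).
    words = re.findall(r"\w+", text.lower())
    return Counter(words)

# Tiny made-up sample, only to show the mechanics.
sample = "de kat en de hond en de vogel"
counts = word_frequencies(sample)

# most_common() returns (word, count) pairs sorted by descending count.
top = counts.most_common(2)
print(top)  # [('de', 3), ('en', 2)]
```

For a multilingual site like the one described, a real pipeline would also need HTML stripping and probably some language filtering, both of which are left out here.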

For curious hackers, here’s the code I used to draw the first graph (using IPython, numpy and matplotlib):
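The snippet below is not the original script but a minimal sketch of how such a comparison plot can be drawn with matplotlib alone. The function names `zipf_series` and `plot_zipf`, the output filename and the choice to anchor the ideal line at the most frequent word are all assumptions made for illustration.

```python
def zipf_series(frequencies):
    """Return (ranks, observed, ideal) lists for a Zipf comparison plot."""
    observed = sorted(frequencies, reverse=True)
    ranks = list(range(1, len(observed) + 1))
    # Ideal curve anchored at the most frequent word: f(r) = f(1) / r.
    ideal = [observed[0] / rank for rank in ranks]
    return ranks, observed, ideal

def plot_zipf(frequencies, filename="zipf.png"):
    """Draw observed and ideal frequencies on log-log axes and save the figure."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file, no display needed
    import matplotlib.pyplot as plt

    ranks, observed, ideal = zipf_series(frequencies)
    fig, ax = plt.subplots()
    ax.loglog(ranks, ideal, color="black", label="Ideal Zipf")
    ax.loglog(ranks, observed, color="blue", label="Observed")
    ax.set_xlabel("rank")
    ax.set_ylabel("frequency")
    ax.legend()
    fig.savefig(filename)
    plt.close(fig)
```

Calling `plot_zipf(top_counts)`, where `top_counts` is a hypothetical list holding the counts of the 100 most frequent words, writes the figure to `zipf.png`. On log-log axes the ideal series appears as a straight line, which makes deviations in the observed data easy to spot.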

 

Posted on May 23, 2010 in Linguistics, Programlama, python

 

2 responses to “Comparing top 100 Dutch words to Zipf’s law”

  1. R

    September 19, 2013 at 20:50

    Hello, do you know any (free) online analyser of the same kind? Like, you send a file, and it returns the frequencies? Thank you very much

     
  2. Emre Sevinç

    September 20, 2013 at 08:16

    Hello, I don’t have any experience with them, but you might try http://textalyser.net/ or http://sporkforge.com/text/word_count.php

     
