“Readability tests, readability formulas, or readability metrics are formulae for evaluating the readability of text, usually by counting syllables, words, and sentences. Readability tests are often used as an alternative to conducting an actual statistical survey of human readers of the subject text (a readability survey). Word processing applications often have readability tests in-built, which can be deployed on documents in-editing.” (Wikipedia).
For a recent project I'm trying to find a lightweight, scalable method for detecting the language difficulty (language level) of a given text. There are plenty of readability formulas and tests, including interactive ones such as this: http://www.editcentral.com/gwt1/EditCentral.html. There are also command-line utilities such as style and diction for GNU/Linux, and some of the formulae above are already implemented in NLTK (Natural Language Toolkit): http://code.google.com/p/nltk/source/browse/trunk/nltk_contrib/nltk_contrib/readability/.
One condition that such a function should satisfy is:
level(simpleText1) < level(complexText1)
e.g. simpleText1 = http://simple.wikipedia.org/wiki/Einstein
e.g. complexText1 = http://en.wikipedia.org/wiki/Albert_Einstein
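To make the condition concrete, here is a minimal sketch of a level function built on the Flesch-Kincaid grade level, one of the standard syllable/word/sentence formulas. The syllable counter is a crude vowel-group heuristic and the tokenization is a plain regular expression, both assumptions made for illustration, so the absolute scores are only approximate; the two sample strings are made-up stand-ins for the two Wikipedia pages.

```python
import re

def count_syllables(word):
    """Rough heuristic (an assumption): one syllable per vowel group."""
    lowered = word.lower()
    count = len(re.findall(r"[aeiouy]+", lowered))
    # Treat a trailing "e" as silent ("made"), except in "-le" ("table").
    if lowered.endswith("e") and not lowered.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)

def level(text):
    """Flesch-Kincaid grade level; a higher value means harder text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one word")
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words)
            - 15.59)

# Made-up sample texts standing in for the simple and the full article.
simple_text = ("Albert Einstein was a scientist. He was born in Germany. "
               "He worked on physics.")
complex_text = ("Albert Einstein was a German-born theoretical physicist who "
                "developed the theory of relativity, one of the two pillars "
                "of modern physics alongside quantum mechanics.")

print(level(simple_text) < level(complex_text))
```

A formula like this is cheap enough to run on large collections, but it only sees surface features (sentence length and syllable counts), so it can be fooled by short sentences full of rare words.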
Ideally, though, I would like to develop or find a method that outputs the language level of a text on an established scale, such as the Common European Framework of Reference for Languages (CEFR) or the ILR scale:
level(text1) = 4
level(text2) = 1
level(text3) = 2
level(text4) = 3
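One simple way to get discrete levels like these is to bucket a continuous readability score with cut points. The sketch below assumes a numeric score is already available (for example a grade-level estimate); the cut points are hypothetical placeholders, not values taken from CEFR or ILR, and would have to be calibrated on texts whose level is already known.

```python
from bisect import bisect_right

# Hypothetical cut points on a grade-level style score (assumption):
# they must be calibrated against texts with known CEFR or ILR levels.
CUTOFFS = [5.0, 9.0, 13.0]

def to_level(score, cutoffs=CUTOFFS):
    """Map a numeric score to an ordinal level in 1..len(cutoffs) + 1."""
    return bisect_right(cutoffs, score) + 1

for score in (3.2, 7.5, 10.0, 15.0):
    print(score, to_level(score))
```

With three cut points this yields four levels; matching the six CEFR bands would need five cut points.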
Here is a relevant article that I have just found:
Sorting Texts by Readability, Kumiko Tanaka-Ishii, Satoshi Tezuka, Hiroshi Terada, Computational Linguistics, June 2010, Vol. 36, No. 2, Pages 203-227, Posted Online May 11, 2010.
This article presents a novel approach for readability assessment through sorting. A comparator that judges the relative readability between two texts is generated through machine learning, and a given set of texts is sorted by this comparator. Our proposal is advantageous because it solves the problem of a lack of training data, because the construction of the comparator only requires training data annotated with two reading levels. The proposed method is compared with regression methods and a state-of-the-art classification method. Moreover, we present our application, called Terrace, which retrieves texts with readability similar to that of a given input text.
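The sorting idea in the abstract can be sketched roughly as follows. This is not the article's implementation: the comparator there is learned from data, whereas the compare function below is a stand-in that uses mean word length as a placeholder signal, and the sort is simply Python's built-in sorted. All function names are made up for illustration.

```python
import re
from functools import cmp_to_key

def mean_word_length(text):
    words = re.findall(r"[A-Za-z]+", text)
    return sum(map(len, words)) / len(words) if words else 0.0

def compare(text_a, text_b):
    """Stand-in comparator (assumption): negative if text_a looks easier.

    A trained model that judges two texts would be plugged in here.
    """
    diff = mean_word_length(text_a) - mean_word_length(text_b)
    return (diff > 0) - (diff < 0)

def sort_by_readability(texts):
    """Order texts from easiest to hardest using only pairwise judgments."""
    return sorted(texts, key=cmp_to_key(compare))

texts = [
    "Quantum electrodynamics describes electromagnetic interactions.",
    "The cat sat on the mat.",
    "Physics is the study of matter and energy.",
]
for text in sort_by_readability(texts):
    print(text)
```

The appeal of this design is that the comparator only has to answer "which of these two is harder?", so no absolute level labels are needed. One caveat: a learned comparator is not guaranteed to be transitive, in which case a general-purpose sort may give an ordering that depends on the input order.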