Readability tests and metrics

31 May

“Readability tests, readability formulas, or readability metrics are formulae for evaluating the readability of text, usually by counting syllables, words, and sentences. Readability tests are often used as an alternative to conducting an actual statistical survey of human readers of the subject text (a readability survey). Word processing applications often have readability tests in-built, which can be deployed on documents in-editing.” (Wikipedia).

For a recent project I’m trying to find a lightweight and scalable method for detecting the language difficulty (language level) of a given text. There are plenty of readability formulas and tests, including interactive ones such as http://www.editcentral.com/gwt1/EditCentral.html, and command-line utilities such as style and diction for GNU/Linux. Some of these formulae are already implemented in NLTK (Natural Language Toolkit): http://code.google.com/p/nltk/source/browse/trunk/nltk_contrib/nltk_contrib/readability/.
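To make the "counting syllables, words, and sentences" idea concrete, here is a minimal sketch of one such formula, the Flesch-Kincaid grade level. The function names and the syllable counter are my own assumptions for illustration; the syllable counter is a crude vowel-group heuristic, not a real syllabifier, so the scores are only approximate.

```python
import re


def count_syllables(word):
    """Crude heuristic (an assumption): count runs of vowels, at least 1."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def flesch_kincaid_grade(text):
    """Flesch-Kincaid grade level from word, sentence, and syllable counts."""
    # Naive tokenization: sentences end at ., ! or ?; words are letter runs.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one word")
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words)
            - 15.59)


if __name__ == "__main__":
    print(flesch_kincaid_grade("The cat sat on the mat. It was happy."))
```

Higher values mean harder text. Very short, simple inputs can even score below zero, which is one reason such formulas are usually applied to longer passages.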

One of the conditions that an example function should satisfy is:

level(simpleText1) < level(complexText1)
e.g. simpleText1 = http://simple.wikipedia.org/wiki/Einstein
e.g. complexText1 = http://en.wikipedia.org/wiki/Albert_Einstein
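The ordering condition can be checked with any candidate metric. The sketch below uses the Automated Readability Index (characters per word and words per sentence) as a stand-in for level. The two sample strings are short texts I made up to play the roles of the Simple English and English Wikipedia articles; they are not taken from those pages.

```python
import re


def level(text):
    """Automated Readability Index, used here as one possible level()."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one word")
    chars = sum(len(w) for w in words)
    return (4.71 * chars / len(words)
            + 0.5 * len(words) / len(sentences)
            - 21.43)


# Invented stand-ins for simpleText1 and complexText1 (assumptions).
simple_text = ("Einstein was a scientist. He was born in Germany. "
               "He made many new ideas.")
complex_text = ("Einstein developed the general theory of relativity, "
                "fundamentally transforming theoretical understanding "
                "of gravitation.")

# The condition from the post: the simple text must score lower.
assert level(simple_text) < level(complex_text)
```

A real test would run the same assertion over many simple/complex article pairs and report how often it holds.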

Even better would be to develop or find a method that outputs the language level of a text on an established scale, such as the Common European Framework of Reference for Languages (CEFR) or the ILR scale. For example:

level(text1) = 4
level(text2) = 1
level(text3) = 2
level(text4) = 3
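One simple way to turn a continuous readability score into discrete levels like these is to bucket it with cut points. The sketch below does that with bisect. The cut points are hypothetical placeholders that I chose for illustration; they are not calibrated against CEFR or ILR, and a real mapping would need texts with known levels.

```python
from bisect import bisect_right

# Hypothetical cut points on some readability score (an assumption,
# NOT calibrated against CEFR or ILR).
CUT_POINTS = [5.0, 9.0, 13.0]


def to_level(score, cut_points=CUT_POINTS):
    """Map a numeric score to a level from 1 to len(cut_points) + 1."""
    # bisect_right counts how many cut points are <= score.
    return bisect_right(cut_points, score) + 1


if __name__ == "__main__":
    for score in (3.2, 7.5, 11.0, 16.4):
        print(score, "->", to_level(score))
```

With three cut points the function yields four levels, matching the 1 to 4 range used in the example above.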

Here is a relevant article that I have just found:

Sorting Texts by Readability, Kumiko Tanaka-Ishii, Satoshi Tezuka, Hiroshi Terada, Computational Linguistics, June 2010, Vol. 36, No. 2, Pages 203-227, Posted Online May 11, 2010.

This article presents a novel approach for readability assessment through sorting. A comparator that judges the relative readability between two texts is generated through machine learning, and a given set of texts is sorted by this comparator. Our proposal is advantageous because it solves the problem of a lack of training data, because the construction of the comparator only requires training data annotated with two reading levels. The proposed method is compared with regression methods and a state-of-the-art classification method. Moreover, we present our application, called Terrace, which retrieves texts with readability similar to that of a given input text.
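The sorting idea can be sketched in a few lines: given any pairwise comparator, Python can sort texts with functools.cmp_to_key. Note that the comparator below is not the one from the article, which is learned from training data. It is a hand-written stand-in of my own that compares average word length, and all function names are assumptions.

```python
import re
from functools import cmp_to_key


def avg_word_length(text):
    """Mean number of letters per word; 0.0 for text without words."""
    words = re.findall(r"[A-Za-z']+", text)
    return sum(len(w) for w in words) / len(words) if words else 0.0


def compare_readability(a, b):
    """Return -1 if a looks easier than b, 1 if harder, 0 if tied.

    Stand-in heuristic (an assumption); the article trains this
    comparator with machine learning instead.
    """
    la, lb = avg_word_length(a), avg_word_length(b)
    return (la > lb) - (la < lb)


def sort_by_readability(texts):
    """Sort texts from easiest to hardest using the pairwise comparator."""
    return sorted(texts, key=cmp_to_key(compare_readability))


if __name__ == "__main__":
    docs = [
        "Quantum chromodynamics describes interactions.",
        "The cat sat.",
        "Dogs like running outside.",
    ]
    for doc in sort_by_readability(docs):
        print(doc)
```

Swapping in a trained model only requires replacing compare_readability; the sorting step stays the same.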


Posted on May 31, 2010 in Linguistics, Programlama

 

3 responses to “Readability tests and metrics”

  1. Francois

    May 14, 2012 at 18:39

    Hello,

    as you are interested in a formula using CEFR levels as output, you should check this paper:

    François T., Combining a Statistical Language Model with Logistic Regression to Predict the Lexical and Syntactic Difficulty of Texts for FFL, In Proceedings of the EACL 2009 Student Research Workshop, Athens, 2 April 2009, 19-27

     
  2. Emre Sevinc

    June 1, 2012 at 21:45

    François,

    Thank you very much for the pointer. I’m going to read your paper.

     
