
Turkish deasciifier in Python and state of the art in deasciification

19 Jul

I recently finished porting Deniz Yüret’s Turkish deasciifier, turkish-mode (originally implemented in Emacs Lisp), to Python. The source code is available at http://github.com/emres/turkish-deasciifier.

For those who are a little puzzled by the term ‘deasciification’: it is the process of converting Turkish text written using only ASCII letters into text with the correct Turkish letters. For example, if your ASCII-only Turkish text is:

“Opusmegi cagristiran catirtilar.”

Then the correct output should be:

“Öpüşmeği çağrıştıran çatırtılar.”
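To see why this is not a trivial search-and-replace, here is a small Python sketch (the names `AMBIGUOUS`, `asciify` and `candidates` are made up for illustration and are not part of the deasciifier). Going from Turkish to ASCII is a fixed letter mapping, but in the other direction every c, g, i, o, s and u has two possible readings, so a word has many candidate spellings and something has to pick the right one.

```python
from itertools import product

# Each ASCII letter below may stand for itself or for a Turkish-specific letter.
AMBIGUOUS = {
    "c": "cç", "g": "gğ", "i": "iı", "o": "oö", "s": "sş", "u": "uü",
    "C": "CÇ", "G": "GĞ", "I": "Iİ", "O": "OÖ", "S": "SŞ", "U": "UÜ",
}

# Reverse table: Turkish-specific letter -> its ASCII stand-in.
TO_ASCII = {ord(pair[1]): ascii_ch for ascii_ch, pair in AMBIGUOUS.items()}


def asciify(text):
    """Turkish -> ASCII: deterministic, one letter at a time."""
    return text.translate(TO_ASCII)


def candidates(word):
    """ASCII -> Turkish: every spelling the ASCII word could stand for."""
    options = [AMBIGUOUS.get(ch, ch) for ch in word]
    return ["".join(combo) for combo in product(*options)]


print(asciify("Öpüşmeği çağrıştıran çatırtılar."))  # Opusmegi cagristiran catirtilar.
print(candidates("aci"))                            # ['aci', 'acı', 'açi', 'açı']
print(len(candidates("Opusmegi")))                  # 32
```

Only one of the 32 candidates for “Opusmegi” is the intended word, and choosing it is the deasciifier's job.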

You may be forced to write ASCII-only Turkish text if you don’t have a Turkish keyboard, or you may be dealing with Turkish movies from IMDb whose titles do not include any Turkish letters (e.g. “Yahsi bati”, whose correct Turkish form is “Yahşi Batı”).

Even though there are systems that do similar or the same thing, I considered this implementation worthwhile because:

– Deniz Yüret’s original turkish-mode works only in Emacs. I’m an Emacs user, but that is hardly the case for the majority of users and programmers.

– The JavaScript implementation of turkish-mode, which you can try at http://turkce-karakter.appspot.com/, is practical for end users but not very practical for programmers or for people who want to use the system from the command line.

– The Zemberek-based deasciifier is available for download and also has a web-based version at http://zemberek-web.appspot.com/, but it is not very practical to install Java and then a big spell-checking library just for deasciification, if that’s the only feature you need as a programmer. Besides, Zemberek’s deasciification method is different and it fails to convert some texts.

– The deasciifier that was developed at Sabancı University by Gökhan Tür (which also inspired the current deasciifier) has some limitations: its source code is not available, it is not downloadable, and its web version has a length limit. Why should you be forced to share your data with somebody else anyway?

And finally some example usage, first within a Python program:
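To give a feel for the in-program case, here is a minimal sketch. It is not the package's actual API: `TOY_LEXICON` and `deasciify` are hypothetical names, and the plain word lookup is only a stand-in for the pattern-based method that turkish-mode uses.

```python
import re

# Hypothetical toy lexicon: ASCII-only spelling -> correct Turkish spelling.
TOY_LEXICON = {
    "Opusmegi": "Öpüşmeği",
    "cagristiran": "çağrıştıran",
    "catirtilar": "çatırtılar",
}


def deasciify(text, lexicon=TOY_LEXICON):
    """Replace every ASCII word found in the lexicon; leave everything else as it is."""
    return re.sub(r"[A-Za-z]+", lambda m: lexicon.get(m.group(0), m.group(0)), text)


print(deasciify("Opusmegi cagristiran catirtilar."))  # Öpüşmeği çağrıştıran çatırtılar.
```

A lookup table like this only covers the words it has already seen, which is exactly why a real deasciifier needs something smarter than a dictionary.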

And from the Linux command line:


$ echo "Yilanlarin Ocu" | turkish-deasciify
Yılanların Öcü

$ echo "Hic fena olmadi sanirim, ne dersin hocam?" | turkish-deasciify
Hiç fena olmadı sanırım, ne dersin hocam?

Of course, this deasciifier is not perfect either; it fails in some cases. However, as a native Turkish speaker I can say that it works most of the time for me and is good enough for nearly all practical purposes. For the theory behind it, you can read the paper written by Deniz Yüret. The coolest feature to add would be the ability to add corrections to the deasciifier on the fly (imagine a user helping the system get better). But that’s for some other time. In the meantime, my plans are to develop a web interface for this Python implementation, add it as a package to the Python Package Index, and create a stand-alone GUI application that runs on Linux and Windows.

 

Posted on July 19, 2010 in Linguistics, Programlama, python

 


10 responses to “Turkish deasciifier in Python and state of the art in deasciification”

  1. Tanya

    July 19, 2010 at 23:23

    Well done, “schatje”. This is really going to be very useful to me.

     
  2. Volkan YAZICI

    July 19, 2010 at 23:25

    I smell far too much “devil’s advocacy” in this business of porting Lisp tools to other programming languages. Tsk. Not good.

     
  3. Emre Sevinc

    July 19, 2010 at 23:38

    Volkan, if I didn’t know you I would have gone for your head, but since I do know you, I feel exactly the opposite 😉 (Go on then, if it’s that easy, develop an algorithm that can run this sentence through Textual Entailment and draw the right inferences 😉)

     
  4. Pingback: Tanya's Blog
  5. Volkan YAZICI

    August 24, 2010 at 22:31

    Today, while writing the Turkish introduction of my thesis, I needed a tr_TR.UTF-8 to LaTeX function[1]. I already had a tr_TR.UTF-8 to ASCII method[2] for the victims of Hotmail. So there you go. There is no such thing as bad publicity.

    [1] http://paste.lisp.org/display/113828
    [2] http://paste.lisp.org/display/63038

     
  6. İlker Fıçıcılar

    September 5, 2010 at 08:15

    Hello,

    Thank you… This will be really useful and easy to use. It can easily be called from within almost any code.

    Since you are interested in Turkish and NLP, I also wanted to share a link. There is a bibliography of articles on the ‘Example-Based Machine Translation’ technique here. I think it will be of interest: http://diluzerine.wordpress.com/2010/08/28/ornege-dayali-bilgisayar-cevirisi-makaleler-dizini/

    Goodbye, and thanks again.

     
  7. Emre Sevinc

    September 5, 2010 at 11:29

    Hello,

    I hope it does the job for you. If you want to try it without installing anything, you can visit http://turkceyap.appspot.com/, or if you want to install it as a Firefox add-on, you can visit https://addons.mozilla.org/en-US/firefox/addon/204311/.

     
  8. Erinc

    November 18, 2010 at 20:59

    Thank you for your effort.
    It was very useful to me.

     
  9. Emre Sevinc

    November 19, 2010 at 01:03

    I’m glad it was useful to you.

     
