I have recently finished porting Deniz Yüret’s Turkish deasciifier, turkish-mode (originally implemented in Emacs Lisp), to Python. The source code is available at http://github.com/emres/turkish-deasciifier.
For those who are a little puzzled by the term ‘deasciification’: it is the process of converting Turkish text written using only ASCII letters into text with the correct Turkish letters. For example, if your ASCII-only Turkish text is:
“Opusmegi cagristiran catirtilar.”
Then the correct output should be:
“Öpüşmeği çağrıştıran çatırtılar.”
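To see why this conversion is not a simple lookup, it helps to look at the opposite direction. The snippet below is only an illustrative sketch and is not part of the deasciifier; the function name `asciify` and the sample word pair are my own choices for the example. It shows that stripping Turkish letters down to ASCII is a many-to-one mapping, so two different Turkish words can collapse into the same ASCII string.

```python
# Illustrative sketch only: the lossy Turkish -> ASCII direction.
# `asciify` is a hypothetical helper, not a function of turkish-deasciifier.

# Each Turkish-specific letter and the ASCII letter people type instead.
TURKISH_TO_ASCII = str.maketrans("çğıöşüÇĞİÖŞÜ", "cgiosuCGIOSU")


def asciify(text):
    """Replace Turkish-specific letters with their ASCII look-alikes."""
    return text.translate(TURKISH_TO_ASCII)


if __name__ == "__main__":
    print(asciify("Öpüşmeği çağrıştıran çatırtılar."))
    # Two distinct words end up identical once asciified,
    # so the reverse step has to choose between them.
    print(asciify("açı") == asciify("acı"))
```

Because the mapping loses information, going back from ASCII to Turkish means deciding, letter by letter, whether a plain “c”, “g”, “i”, “o”, “s” or “u” should stay as it is or be turned into its Turkish counterpart.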
You may be forced to write ASCII-only Turkish text if you don’t have a Turkish keyboard, or maybe you’re dealing with Turkish movies from IMDb whose titles do not include any Turkish letters (e.g. “Yahsi bati”, whose correct Turkish form is “Yahşi Batı”).
Even though there are other systems that do the same or similar things, I considered this implementation worthwhile because:
– Deniz Yüret’s original turkish-mode works only in Emacs. I’m an Emacs user, but that is hardly the case for the majority of users and programmers.
– The Zemberek-based deasciifier is available for download and also has a web-based version at http://zemberek-web.appspot.com/, but installing Java and then a big spell-checking library is not very practical if deasciification is the only feature you need as a programmer. Besides, Zemberek’s deasciification method is different, and it fails to convert some texts.
– The deasciifier developed at Sabancı University by Gökhan Tür (which also inspired the current deasciifier) has some limitations: its source code is not available, it cannot be downloaded, and its web version limits the length of the input. Why should you be forced to share your data with somebody else anyway?
And finally some example usage, first within a Python program:
And from the Linux command line:
$ echo "Yilanlarin Ocu" | turkish-deasciify
Yılanların Öcü
$ echo "Hic fena olmadi sanirim, ne dersin hocam?" | turkish-deasciify
Hiç fena olmadı sanırım, ne dersin hocam?
Of course, this deasciifier is not perfect either; it fails in some cases. However, as a native Turkish speaker I can say that it works most of the time for me and is good enough for nearly all practical purposes. For the theory behind it, you can read the paper written by Deniz Yüret. The coolest feature to add would be the ability to add corrections to the deasciifier on the fly (imagine a user helping the system to get better). But that’s for some other time. In the meantime, my plans are to develop a web interface for this Python implementation, add it as a package to the Python Package Index, and create a stand-alone GUI application that runs on Linux and Windows.