UPDATE (2023-06-14): Now that we’re living in the world of ChatGPT and Large Language Models (LLMs), a software developer, Murat Çorlu, reported that ChatGPT performs very well at diacritics restoration (deasciification) for Turkish: https://twitter.com/muratcorlu/status/1668335101602848768. He shared his example at https://chat.openai.com/share/3bb666fd-9f35-40df-8efb-9dd0c59bb264. To see whether ChatGPT is really the best (see the accuracy benchmark given in “TABLE IV” below), a nice experiment would be to take a validated Turkish corpus, “asciify” it, feed the result to ChatGPT (e.g. via its API), retrieve the “deasciified” output, compare it to the original corpus, and check what percentage of the text matches the original. If the result turns out to be at least 1-2 percentage points higher than 97.06%, we’ll have a clear winner! 😉 Of course, care should be taken that the initial Turkish corpus is not only validated (all diacritics are correct), but also representative of Turkish usage across many domains, including multilingual texts, texts with heavy foreign terminology, abbreviations, ambiguities, etc.
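For the curious, here is a minimal Python sketch of how that experiment could be wired up. It is only an illustration under a few assumptions: `restore` is a hypothetical placeholder for whatever system is being tested (for instance, a wrapper around an API call), and accuracy is measured as the percentage of whitespace-separated words that exactly match the original, which may not be how the benchmark in “TABLE IV” defines it.

```python
from typing import Callable

# Turkish letters with diacritics (plus dotless ı and dotted İ)
# mapped to their plain ASCII look-alikes.
_ASCIIFY_TABLE = str.maketrans({
    "ç": "c", "ğ": "g", "ı": "i", "ö": "o", "ş": "s", "ü": "u",
    "Ç": "C", "Ğ": "G", "İ": "I", "Ö": "O", "Ş": "S", "Ü": "U",
})


def asciify(text: str) -> str:
    """Replace Turkish-specific letters with ASCII counterparts."""
    return text.translate(_ASCIIFY_TABLE)


def word_accuracy(original: str, restored: str) -> float:
    """Percentage of words in `restored` that exactly match `original`.

    Assumes the restorer keeps the same whitespace tokenization; if it
    does not, a proper alignment step would be needed instead.
    """
    reference = original.split()
    hypothesis = restored.split()
    if len(reference) != len(hypothesis):
        raise ValueError("token counts differ; cannot compare word by word")
    if not reference:
        return 100.0
    matches = sum(r == h for r, h in zip(reference, hypothesis))
    return 100.0 * matches / len(reference)


def evaluate(corpus: str, restore: Callable[[str], str]) -> float:
    """Asciify the corpus, run the (hypothetical) restorer, score it."""
    return word_accuracy(corpus, restore(asciify(corpus)))
```

Calling `evaluate(corpus, restore)` with a validated corpus and a function that sends the asciified text to the system under test would give a single percentage to set against the figure above. A stricter or looser metric (character-level, case-insensitive, and so on) would only require swapping out `word_accuracy`.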
People who need to write correctly in languages that have letters with diacritics, such as ‘ğ’, ‘ş’, ‘ö’, ‘ı’, etc., often struggle with US or UK standard QWERTY keyboards, because those layouts lack such letters. If you also need to switch between languages such as English and Turkish, you know what I mean.
The process of taking a piece of writing that is not spelled correctly (because it uses only standard ASCII characters, without proper diacritics) and replacing the relevant letters with the correct ones is known as “diacritics restoration” or “diacritics reconstruction” (or, colloquially, “deASCIIfication”). About 10 years ago, I wrote a Python program to help people with this: Turkish Deasciifier, a port of the Emacs Lisp code developed by Prof. Deniz Yüret. There’s also a web interface at http://turkceyap.appspot.com.