The strongest criticism targets one of the most controversial and recently popular pieces of research that used computers to understand ancient symbols. The issue was made famous by WIRED’s “Artificial Intelligence Cracks Ancient Mystery” article. Richard Sproat’s forceful critique of misusing statistical methods to detect whether a sequence of symbols constitutes a language is worth reading: “Ancient Symbols, Computational Linguistics, and the Reviewing Practices of the General Science Journals“.
— UPDATE: Rao’s answer to the following criticism can be read at Rebuttal of Sproat, Farmer, et al.’s supposed “refutation”. Also see http://indusresearch.wikidot.com/script —
“Few archaeological finds are as evocative as artifacts inscribed with symbols. Whenever an archaeologist finds a potsherd or a seal impression that seems to have symbols scratched or impressed on the surface, it is natural to want to ‘read’ the symbols. And if the symbols come from an undeciphered or previously unknown symbol system it is common to ask what language the symbols supposedly represent and whether the system can be deciphered.
Of course the first question that really should be asked is whether the symbols are in fact writing. A writing system, as linguists usually define it, is a symbol system that is used to represent language. Familiar examples are alphabets such as the Latin, Greek, Cyrillic, or Hangul alphabets, alphasyllabaries such as Devanagari or Tamil, syllabaries such as Cherokee or Kana, and morphosyllabic systems like Chinese characters. But symbol systems that do not encode language abound: European heraldry, mathematical notation, Labanotation (used to represent dance), and Boy Scout merit badges are all examples of symbol systems that represent things, but do not function as part of a system that represents language. Whether an unknown system is writing or not is a difficult question to answer.
It can only be answered definitively in the affirmative if one can develop a verifiable decipherment into some language or languages. Statistical techniques have been used in decipherment for years, but these have always been used under the assumption that the system one is dealing with is writing, and the techniques are used to uncover patterns or regularities that might aid in the decipherment. Patterns of symbol distribution might suggest that a symbol system is not linguistic: For example, odd repetition patterns might make it seem that a symbol system is unlikely to be writing. But until recently nobody had argued that statistical techniques could be used to determine that a system is linguistic.
The only problem is that these techniques are in fact useless for this purpose, and for reasons that are rather trivial and easy to demonstrate. The remainder of this article will be devoted to two points. First, in Section 2, I review the techniques from the Rao et al. (2009a) and Lee, Jonathan, and Ziman (2010) papers, and show why they don’t work. The demonstration will seem rather obvious to any reader of this journal. And this in turn brings us to the second point: How is it that papers that are so trivially and demonstrably wrong get published in journals such as Science or the Proceedings of the Royal Society? Both papers relate to statistical language modeling, which is surely one of the core techniques in computational linguistics, yet (apparently) no computational linguists were asked to review these papers. Would a paper that made some blatantly wrong claim about genetics be published in such venues? What does this say about our field and its standing in the world? And what can we do about that? Those questions are the topic of Section 3.”
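For readers curious about what kind of statistic is actually at the center of this dispute: the Rao et al. work is widely reported to have compared bigram conditional entropy (how predictable the next symbol is given the current one) across the Indus corpus, natural languages, and non-linguistic systems. The sketch below is not from either paper; it is a minimal illustration of the measure itself, with an invented toy example.

```python
# Minimal sketch of bigram conditional entropy H(next | current), the
# kind of symbol-distribution statistic debated above. Toy data only;
# this is an illustration of the measure, not anyone's published code.
from collections import Counter
from math import log2

def conditional_entropy(sequences):
    """H(Y|X) over adjacent symbol pairs, in bits."""
    pair_counts = Counter()   # counts of (current, next) pairs
    left_counts = Counter()   # counts of the 'current' symbol in pairs
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            pair_counts[(a, b)] += 1
            left_counts[a] += 1
    total = sum(pair_counts.values())
    h = 0.0
    for (a, b), n in pair_counts.items():
        p_ab = n / total                  # joint probability P(a, b)
        p_b_given_a = n / left_counts[a]  # conditional P(b | a)
        h -= p_ab * log2(p_b_given_a)
    return h

# A rigid system, where each symbol fully determines the next,
# has zero conditional entropy:
print(conditional_entropy(["abcabcabc"]))  # → 0.0
# A system where 'a' is followed by 'b' or 'c' equally often
# has one bit of conditional entropy:
print(conditional_entropy(["ab", "ac"]))   # → 1.0
```

Sproat’s point, of course, is that such a number cannot by itself establish that a symbol system encodes language, since many non-linguistic systems produce intermediate entropy values too.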