
Category Archives: Linguistics

How to confuse Google Translate by simply adding a newline?


When you run the most popular and successful computer-based translation service in the world, used by millions of people every day, it’s inevitable that some very interesting cases will be discovered. Let’s take the following question:

  • Can simply adding a “newline” character change the translation of a word?

This sounds weird, because for a human being, the obvious reaction would be:

  • What does that even mean? Probably you’ve accidentally hit ENTER or something, and that can’t possibly affect the meaning of a word, why do you even ask that?

Well, if the translation system in question is based on statistical natural language processing and neural network techniques such as deep learning, then things get a little more complex. Let’s first look at a sentence without any superfluous newline inserted:

and now, let’s hit ENTER right after the Dutch word “afzetzone”, to see the translation change magically:

The point here is not whether the word “afzetzone” is translated correctly, but rather why its translation changes when a single extra whitespace character is added after the word.
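To make the contrast with a human reader concrete, here is a minimal Python sketch. It says nothing about how Google Translate works internally; it only shows that, for software, the two inputs are different character sequences unless something explicitly normalizes them. The variable names are mine, chosen for illustration.

```python
# Two inputs that look identical to a human reader.
plain = "afzetzone"
with_newline = "afzetzone\n"  # the same word, followed by ENTER

# repr() makes the otherwise invisible trailing character visible.
print(repr(plain))
print(repr(with_newline))

# To a program these are simply two different strings...
print(plain == with_newline)

# ...unless surrounding whitespace is stripped before comparing.
print(plain == with_newline.strip())
```

Any system that consumes the raw text without such a normalization step receives two distinct inputs, so a different output is at least possible.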

If you’re a layperson, you’ll probably be baffled by this example. If you’re an NLP expert specializing in deep learning techniques, you’ll probably scratch your head and then smile. And if you’re one of the scientists or engineers actually debugging the Google Translate software, well, then you might react differently. 😉

All in all, keep in mind that in today’s technological landscape there are extremely complex systems behind simple interfaces. Such “glitches” barely scratch the surface, providing only a small, opaque glimpse into a popular Artificial Intelligence product.

 

Posted on November 8, 2019 in Linguistics, Programlama, Science

 


Lost in Google Translate: How Unreasonable Effectiveness of Data can Sometimes Lead Us Astray


I recently received an e-mail in Dutch from the Belgian teacher of my 7.5-year-old son. Even though my Dutch is more than enough to understand what his teacher wrote, I also wanted to check it with Google Translate, out of habit and because of my professional/academic background. This led to an interesting discovery and made me think once again about artificial intelligence, deep learning, automatic translation, statistical natural language processing, knowledge representation, commonsense reasoning and linguistics.

But first things first, let’s see how Google Translate translated a very ordinary Dutch sentence into English:

Interesting! It is obvious that my son’s teacher didn’t have anything to do with a grinding table (!), and even if he did, I don’t think he’d involve his class with such interesting hobbies. 🙂 Of course, he meant the “multiplication table for 3”.

Then I wanted to see what the giant search engine, Google Search itself, knows about the Dutch word “maaltafel”. I immediately saw that Google Search knows very well that “maaltafel” in Dutch means “multiplication table” in English. Not only that, but on the first page of search results you can see the expected Dutch expression occurring 47 times. Nothing surprising here: Read the rest of this entry »

 

Posted on February 8, 2019 in CogSci, Linguistics, philosophy, Science

 


Is there a high quality and free Text to Speech system for Dutch that runs on GNU/Linux?


Dear Text to Speech and open source experts:

For a toy / hobby project (non-commercial), I’m trying to find a suitable Text to Speech system for Dutch that I can run on GNU/Linux. So far, the situation does not look very promising. I’ve tried eSpeak, but its Dutch output is not as good as I expected. I ran my experiment using a file “computer.txt” with the following contents:

Een computer is een apparaat waarmee gegevens volgens formele procedures zoals algoritmen kunnen worden verwerkt. Meestal wordt met het woord computer een elektronisch, digitaal apparaat bedoeld, maar er bestaan ook mechanische en analoge computers.

$ espeak -vnl+7 -s 170 -f computer.txt

Read the rest of this entry »

 

Posted on December 3, 2015 in Linguistics, Linux

 


Is this the State of the Art for grammar checking on Linux in the 21st century?


Recently, I shared an article with a colleague of mine. The article had been published in a peer-reviewed journal and the contents were original and interesting. However, my colleague, being a meticulous reader of scientific texts, immediately spotted a few simple grammar errors. It would have been very easy to blame the authors and editors for not correcting such errors before publication, but this triggered another question:

Why don’t we have open source and very high quality grammar checking software that is already integrated into major text editors such as VIM, Emacs, etc.?

Any user of a recent version of MS Word is well aware of on-the-fly grammar checking, at least for English. But many academics use LaTeX to typeset their articles and rely on either well-known text editors such as VIM and Emacs, or specialized software for handling LaTeX easily. Therefore, telling these people “go and check your article using MS Word, or copy and paste your article text into an online grammar checking service” does not make a lot of sense. Those methods are not convenient and thus not very usable for the hundreds of thousands of scientists writing articles every day. But what would be the ideal way? The answer is simple in theory. We have high-quality open source spell checkers, at least for English, and they have already been integrated into major text editors. Scientists who write in LaTeX therefore have no excuse for spelling errors; it is simply a matter of activating the spell checker. If only they had similar software for grammar checking, it would be very straightforward and convenient to eliminate the easiest grammar errors, at least for English.

A quick search on the Internet revealed the following for grammar checking on GNU/Linux:

– Baoqiu Cui has implemented a grammar checker integration for Emacs using link-grammar, but unfortunately it is far from easily usable.


Read the rest of this entry »

 

Posted on June 10, 2014 in Emacs, Linguistics, Linux

 


GODISNOWHERE: A look at a famous question using Python, Google and natural language processing


Are there any commonalities among human intelligence, Bayesian probability models, corpus linguistics, and religion? This blog entry presents a piece of light reading for people interested in a combination of those topics.
You have probably heard the famous question:

       “What do you see below?”

            GODISNOWHERE

The stream of letters can be broken down into English words in two different ways, either as “God is nowhere” or as “God is now here.” You can find an endless set of variations on this theme on the Internet, but I will deal with this example in the context of computational linguistics and big data processing.


When I first read the beautiful book chapter titled “Natural Language Corpus Data” written by Peter Norvig, in the book “Beautiful Data”, I decided to run an experiment using Norvig’s code. In that chapter, Norvig showed a very concise Python program that ‘learned’ how to break down a stream of letters into English words; in other words, a program with the capability to do ‘word segmentation’.

Norvig’s code, coupled with Google’s language corpus, is powerful and impressive; it is able to take a character string such as

“wheninthecourseofhumaneventsitbecomesnecessary”

and return a correct segmentation:


‘when’, ‘in’, ‘the’, ‘course’, ‘of’, ‘human’, ‘events’, ‘it’, ‘becomes’, ‘necessary’

But how would it deal with “GODISNOWHERE”? Let’s try it out in a GNU/Linux environment: Read the rest of this entry »
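As a rough illustration of the word segmentation idea described above, here is a minimal sketch in Python. It is not Norvig’s code and it does not use Google’s corpus: the word counts in COUNTS are invented for this example, and the names word_logprob, splits and segment are my own. The sketch scores every way of cutting the string with a simple unigram model (each word’s probability is its share of the total count) and keeps the highest-scoring split.

```python
from functools import lru_cache
from math import log10

# Hypothetical toy counts, invented for illustration only.
# A real experiment would load counts from a large corpus instead.
COUNTS = {
    "god": 50, "is": 400, "now": 120, "here": 100,
    "nowhere": 20, "no": 150, "where": 90,
}
TOTAL = sum(COUNTS.values())


def word_logprob(word):
    """Log10 probability of a single word under the toy unigram model."""
    if word in COUNTS:
        return log10(COUNTS[word] / TOTAL)
    # Unknown words get a small probability that shrinks with their length,
    # so long runs of unknown letters are strongly penalized.
    return log10(10.0 / (TOTAL * 10 ** len(word)))


def splits(text):
    """All ways to cut text into a non-empty first word and a remainder."""
    return [(text[:i], text[i:]) for i in range(1, len(text) + 1)]


@lru_cache(maxsize=None)
def segment(text):
    """Return the most probable segmentation of text as a tuple of words."""
    if not text:
        return ()
    candidates = [(first,) + segment(rest) for first, rest in splits(text)]
    return max(candidates, key=lambda words: sum(word_logprob(w) for w in words))


print(segment("godisnowhere"))
```

With these invented counts the unsplit “nowhere” wins, purely because of the numbers I chose: raise the counts of “now” and “here” enough and the other reading comes out on top. Which reading a real corpus favours is exactly the empirical question, and this toy table cannot answer it.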

 

Posted on March 1, 2014 in Linguistics, Programlama, python

 


Notes from the event: “Open Science. The key to more scientific integrity?”


Readers of this blog could easily guess my program for this Thursday evening after reading the news “Brussels university welcomes Wikipedia founder”. I immediately registered for the “Open Science. The key to more scientific integrity?” event at Vrije Universiteit Brussel, because I didn’t want to miss the opportunity to listen to Jimmy Wales, the co-founder of Wikipedia, as well as the other notable speakers, namely Prof. Em. André Van Steirteghem and Michel Bauwens.


It was nice to be a part of an enthusiastic audience, and all of the speakers delivered interesting talks full of insights. For example, thanks to this event, I learned that emeritus Prof. André Van Steirteghem is co-secretary of COPE (Committee on Publication Ethics), and I was once again disappointed to hear about the rise of fraudulent research in Belgium, as well as in other European countries. The next speaker, Michel Bauwens, talked about his perspectives on post-capitalistic social structures and peer-to-peer production mechanisms, using interesting terminology such as metarchical capitalism. At the end of his talk, he also drew attention to Wikipedia and voiced his concerns about some of its rules, such as notability: apparently he was not found notable enough for Wikipedia. He claimed that since this ‘notability’ rule was established, the curve of contributions to Wikipedia has become almost flat, indicating a particularly problematic situation, as well as power struggles within Wikipedia.

Jimmy Wales, the co-founder of Wikipedia and keynote speaker of the event, was the final speaker, and he definitely had a great, enthusiastic presence on the stage. His presentation not only gave a brief and good summary of the history of Wikipedia, its structure, its operating principles and philosophies, but also offered interesting statistics, broken down by language and country, about one of the most popular and valuable sites on the Internet. Probably the one fact that everyone will easily remember was the following: Read the rest of this entry »

 

Posted on October 24, 2013 in events, Linguistics, Science

 


How much Dutch do you know? Discover by playing a cool linguistic game


Apparently I didn’t think words such as gezwel, boombal, ejaculaat, reflatie, troela, and roro belonged to Dutch. That is, when I took a very interesting and simple test that asked me to judge whether the word on the screen is a valid Dutch word. You can try it out yourself simply by visiting http://woordentest.ugent.be and learn your score right after the test, which you’ll complete in a few minutes. My result after the first take is 29%: in other words, I recognized 76% of the words correctly, but unfortunately I also claimed that 47% of the non-Dutch words were Dutch. The interpretation of the system is: “Dit is een behoorlijk niveau voor een Nederlandssprekende.” (This is a good level for a speaker of Dutch). My wife, a native speaker of Dutch, took the same test and her initial result was something close to 70%, and the system interpreted this as “possessing a very extensive vocabulary of Dutch”. Read the rest of this entry »
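The three figures in my result fit a simple pattern: the percentage of real words recognized minus the percentage of non-words wrongly accepted. The small Python sketch below only checks that arithmetic. That the test computes its score this way is my inference from these numbers, not something I have confirmed, and the function name vocabulary_score is made up for the example.

```python
def vocabulary_score(recognized_pct, false_alarm_pct):
    """Assumed scoring rule: real words recognized minus non-words accepted.

    Both arguments and the result are percentages (0 to 100).
    """
    return recognized_pct - false_alarm_pct


# My first attempt: 76% of real words recognized, 47% of non-words accepted.
print(vocabulary_score(76, 47))
```

Under this assumed rule, saying “yes” to everything gains nothing: every real word would be recognized, but every non-word would be accepted as well, and the two cancel out.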

 

Posted on April 28, 2013 in Linguistics, Science

 
