
Tag Archives: nlp

How to confuse Google Translate by simply adding a newline?


When you have the most popular and successful computer-based translation service in the world, used by millions of people every day, it’s inevitable that very interesting cases will be discovered. Let’s take the following question:

  • Can simply adding a “newline” character change the translation of a word?

This sounds weird, because for a human being, the obvious reaction would be:

  • What does that even mean? You probably hit ENTER by accident, and that can’t possibly affect the meaning of a word. Why do you even ask?

Well, if the translation system in question is based on statistical natural language processing and neural network algorithms such as deep learning, then things get a little more complex. Let’s first look at a sentence without any superfluous newline inserted:

and now, let’s hit ENTER right after the Dutch word “afzetzone”, to see the translation change magically:

The point here is not whether the word “afzetzone” is translated correctly, but rather why its translation changes when a single extra whitespace character is added after the word.
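To see why a trailing newline is not “nothing” to a program, here is a minimal Python sketch. It says nothing about how Google Translate works internally; it only shows that the two inputs are different character sequences, so any system that consumes raw text without normalizing whitespace receives two different inputs. The helper name `code_points` is made up for this illustration.

```python
def code_points(text):
    """Return the Unicode code point of every character in text."""
    return [ord(ch) for ch in text]


without_newline = "afzetzone"
with_newline = "afzetzone\n"  # the same word followed by ENTER

# To a human reader the two inputs look identical; to a program they are not.
print(without_newline == with_newline)

# The only difference is one extra trailing character (line feed, code point 10).
print(code_points(with_newline)[len(without_newline):])

# Stripping surrounding whitespace makes the two inputs identical again.
print(without_newline.strip() == with_newline.strip())
```

Whether a given translation system normalizes its input this way before processing is exactly what cannot be seen from the outside.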

If you’re a lay person, you’ll probably be baffled by this example. If you’re an NLP expert specializing in deep learning techniques, you’ll probably scratch your head and then smile. And if you’re one of the scientists or engineers who actually debug the Google Translate software, well, then you might react differently. 😉

All in all, keep in mind that in today’s technological landscape there are super complex systems behind simple interfaces. Such “glitches” barely scratch the surface, providing only a small and opaque glimpse into a popular Artificial Intelligence product.

 

Posted on November 8, 2019 in Linguistics, Programlama, Science

 


Lost in Google Translate: How Unreasonable Effectiveness of Data can Sometimes Lead Us Astray


I recently received an e-mail in Dutch from the Belgian teacher of my 7.5-year-old son. Even though my Dutch is more than enough to understand what his teacher wrote, I also wanted to check it with Google Translate, out of habit and because of my professional/academic background. This led to an interesting discovery and made me think once again about artificial intelligence, deep learning, automatic translation, statistical natural language processing, knowledge representation, commonsense reasoning and linguistics.

But first things first, let’s see how Google Translate translated a very ordinary Dutch sentence into English:

Interesting! It is obvious that my son’s teacher didn’t have anything to do with a grinding table (!), and even if he did, I don’t think he’d involve his class with such interesting hobbies. 🙂 Of course, he meant the “multiplication table for 3”.

Then I wanted to see what the giant search engine, Google Search itself, knows about the Dutch word “maaltafel”. I immediately saw that Google Search knows very well that “maaltafel” in Dutch means “multiplication table” in English. Not only that, but on the first page of search results you can see the expected Dutch expression occurring 47 times. Nothing surprising here.

 

Posted on February 8, 2019 in CogSci, Linguistics, philosophy, Science

 


Is this the State of the Art for grammar checking on Linux in the 21st century?


I recently shared an article with a colleague of mine. The article had been published in a peer-reviewed journal and the contents were original and interesting. However, my colleague, being a meticulous reader of scientific texts, immediately spotted a few simple grammar errors. It would have been very easy to blame the authors and editors for not correcting such errors before publication, but this triggered another question:

Why don’t we have open source and very high quality grammar checking software that is already integrated into major text editors such as VIM, Emacs, etc.?

Any user of a recent version of MS Word is well aware of on-the-fly grammar checking, at least for English. But many academics use LaTeX to typeset their articles and rely either on well-known text editors such as VIM and Emacs, or on specialized software for handling LaTeX easily. Therefore, telling these people “go and check your article using MS Word, or copy and paste your article text into an online grammar checking service” does not make a lot of sense. Those methods are not convenient and thus not very usable by the hundreds of thousands of scientists writing articles every day. But what would be the ideal way? The answer is simple in theory: we have high quality open source spell checkers, at least for English, and they have already been integrated into major text editors. Scientists who write in LaTeX therefore have no excuse for spelling errors; it is simply a matter of activating the spell checker. If only they had similar software for grammar checking, it would be very straightforward and convenient to eliminate the easiest grammar errors, at least for English.

A quick search on the Internet revealed the following for grammar checking on GNU/Linux:

– Baoqiu Cui has implemented a grammar checker integration for Emacs using link-grammar, but unfortunately it is far from easily usable.


 

Posted on June 10, 2014 in Emacs, Linguistics, Linux

 


GODISNOWHERE: A look at a famous question using Python, Google and natural language processing


Are there any commonalities among human intelligence, Bayesian probability models, corpus linguistics, and religion? This blog entry presents a piece of light reading for people interested in a combination of those topics.
You have probably heard the famous question:

       “What do you see below?”

            GODISNOWHERE

The stream of letters can be broken down into English words in two different ways, either as “God is nowhere” or as “God is now here.” You can find an endless set of variations on this theme on the Internet, but I will deal with this example in the context of computational linguistics and big data processing.


When I first read the beautiful book chapter titled “Natural Language Corpus Data”, written by Peter Norvig for the book “Beautiful Data”, I decided to run an experiment using Norvig’s code. In that chapter, Norvig showed a very concise Python program that ‘learned’ how to break down a stream of letters into English words; in other words, a program capable of ‘word segmentation’.

Norvig’s code, coupled with Google’s language corpus, is powerful and impressive; it is able to take a character string such as

“wheninthecourseofhumaneventsitbecomesnecessary”

and return a correct segmentation:


‘when’, ‘in’, ‘the’, ‘course’, ‘of’, ‘human’, ‘events’, ‘it’, ‘becomes’, ‘necessary’
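The idea behind this kind of word segmentation can be sketched in a few lines of Python. This is not Norvig’s code: it is a simplified illustration of the same general technique (pick the split whose words have the highest combined unigram probability), and the word counts in `TOY_COUNTS` are invented for the example rather than taken from Google’s corpus. The names `TOY_COUNTS`, `log_prob` and `segment` are likewise chosen only for this sketch.

```python
from functools import lru_cache
from math import log10

# Invented word counts, for illustration only; a real system would use
# counts from a very large corpus.
TOY_COUNTS = {
    "when": 100, "in": 800, "the": 1000, "course": 50, "of": 900,
    "human": 60, "events": 40, "he": 300, "hen": 5, "our": 150,
    "man": 80, "even": 70, "an": 400, "hum": 3,
}
TOTAL = sum(TOY_COUNTS.values())


def log_prob(word):
    """Log10 unigram probability; unknown words are penalized by length."""
    if word in TOY_COUNTS:
        return log10(TOY_COUNTS[word] / TOTAL)
    return log10(10.0 / (TOTAL * 10 ** len(word)))


@lru_cache(maxsize=None)
def segment(text):
    """Return the most probable split of text as a tuple of words."""
    if not text:
        return ()
    # Try every possible first word, then segment the remainder recursively.
    candidates = [
        (text[:i],) + segment(text[i:]) for i in range(1, len(text) + 1)
    ]
    return max(candidates, key=lambda words: sum(log_prob(w) for w in words))


print(segment("wheninthecourseofhumanevents"))
```

With these toy counts the sketch recovers ‘when’, ‘in’, ‘the’, ‘course’, ‘of’, ‘human’, ‘events’, even though the dictionary also contains distractors such as ‘hen’, ‘he’, ‘our’ and ‘man’. The result depends entirely on the counts, which is exactly why the choice of corpus matters.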

But how would it deal with “GODISNOWHERE”? Let’s try it out in a GNU/Linux environment.

 

Posted on March 1, 2014 in Linguistics, Programlama, python

 


What does Google think about your name?


A short excerpt from a non-existent book that may be titled ‘Virtues of statistical natural language processing’:

According to Google, my name ‘Emre’ (in Turkish) is ‘Chris’ in English: http://translate.google.com/#tr|en|%27Emre%27

According to Google, Turkish name ‘Burak’ is ‘John’ in English: http://translate.google.com/#tr|en|%27Burak%27

According to Google, Dutch name ‘Frederik’ is ‘Sam’ in English: http://translate.google.com/#nl|en|%27Frederik%27

According to Google, Turkish name ‘Ece’ is ‘James’ in English: http://translate.google.com/#tr|en|%27Ece%27

According to Google, Turkish name ‘Gökhan’ is ‘Jennifer’ in English: http://translate.google.com/#tr|en|%27G%C3%B6khan%27
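The links above all follow the same pattern: source language, target language, and the percent-encoded text, separated by `|` after the `#`. As a small sketch, assuming that pattern as it appears in this post (it reflects the URLs as written in 2010 and may no longer be honored by the service), such a link can be built with Python’s standard library. The function name `translate_url` is made up for this example.

```python
from urllib.parse import quote


def translate_url(source_lang, target_lang, text):
    """Build a link in the '#src|dst|text' form used by the URLs in this post."""
    # quote() percent-encodes the apostrophes and any non-ASCII letters (as UTF-8).
    return "http://translate.google.com/#{}|{}|{}".format(
        source_lang, target_lang, quote(text)
    )


print(translate_url("tr", "en", "'Gökhan'"))
```

The quoted name ‘Gökhan’ becomes `%27G%C3%B6khan%27`, because the apostrophe is encoded as `%27` and the letter ö as its two UTF-8 bytes.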

 

Posted on April 14, 2010 in Linguistics, Programlama

 
