Lost in Google Translate: How Unreasonable Effectiveness of Data can Sometimes Lead Us Astray

08 Feb

I recently received an e-mail in Dutch from the Belgian teacher of my 7.5-year-old son. Even though my Dutch is more than enough to understand what his teacher wrote, I also wanted to check it with Google Translate, partly out of habit and partly because of my professional and academic background. This led to an interesting discovery and made me think once again about artificial intelligence, deep learning, automatic translation, statistical natural language processing, knowledge representation, commonsense reasoning and linguistics.

But first things first, let’s see how Google Translate translated a very ordinary Dutch sentence into English:

Interesting! It is obvious that my son’s teacher had nothing to do with a “grinding table” (!), and even if he did, I doubt he would involve his class in such an interesting hobby. 🙂 Of course, he meant the “multiplication table of 3”.

Then I wanted to see what the giant search engine, Google Search itself, knows about the Dutch word “maaltafel”. I immediately saw that Google Search knows very well that “maaltafel” in Dutch means “multiplication table” in English. Not only that, but on the first page of search results you can see the expected Dutch expression occurring 47 times. Nothing surprising here:

Back to Google Translate, which is powered by state-of-the-art automatic translation, relying on cutting-edge deep learning techniques and the tons of data that Google can afford. It’s as if Google Translate isn’t aware of the context surrounding the word! It is as if it doesn’t, or can’t, care about the context, because when we look at the word by itself, we see:

Interestingly, Google Translate suggests the “more frequent” “version” of the expression, relying, as expected, on real-world data and statistics.

But if you write it as suggested, you get:

Please keep in mind that Dutch and English belong to the same language family. So it’s not as if I’m trying to translate between two languages that belong to totally unrelated families, such as Turkish and English.

But what is the nature of this error, and of other errors in the same class? What does the system “know”, not only about this particular Dutch word, but about its own knowledge? In other words, what is the meta-knowledge of Google Translate, and can we even talk about this meaningfully?

Apart from a human being explicitly labeling this as a mistake, can Google Translate learn that it made a mistake? What about the context?

Can Google itself make Google Translate learn from Google Search? They belong to the same company, after all. And we know that Google, as well as Microsoft, has been working on semantic knowledge graphs for a long time (employing the brightest and hardest-working minds from industry and academia), giving them explicit, logical structures that also power their search engines. Before AI takes over, enslaves humanity, and puts most of the workforce out of work, maybe we should start by integrating the “smart” services of a big company, by making them learn from each other and from the experience gathered in the different domains managed by that same company. How difficult can it be? We’ll see if one day Google will have enough money to solve this. Until then, maybe we should cut through the hype and re-read what various artificial intelligence and cognitive science researchers have to say, e.g. Douglas Hofstadter.
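To make the idea of one service learning from another a bit more concrete, here is a toy sketch in Python. Everything in it is an assumption made up for illustration: the two dictionaries, the counts in them, and the function names are hypothetical, and it says nothing about how Google Translate or Google Search actually work internally. It only contrasts a purely frequency-driven choice with one that first consults an explicit, curated source of knowledge.

```python
from typing import Optional

# Hypothetical data, invented for illustration only.
# Toy "statistical" lexicon: candidate English glosses with made-up counts.
STATISTICAL_LEXICON = {
    "maaltafel": {"grinding table": 12, "multiplication table": 3},
}

# Toy curated glossary, standing in for an explicit knowledge source
# (for example, something a knowledge graph could provide).
CURATED_GLOSSARY = {
    "maaltafel": "multiplication table",
}


def translate_by_frequency(word: str) -> Optional[str]:
    """Pick the gloss with the highest count, ignoring everything else."""
    candidates = STATISTICAL_LEXICON.get(word)
    if not candidates:
        return None
    return max(candidates, key=candidates.get)


def translate_with_glossary(word: str) -> Optional[str]:
    """Prefer the curated glossary; fall back to frequency otherwise."""
    if word in CURATED_GLOSSARY:
        return CURATED_GLOSSARY[word]
    return translate_by_frequency(word)


if __name__ == "__main__":
    print(translate_by_frequency("maaltafel"))
    print(translate_with_glossary("maaltafel"))
```

With these made-up counts, the frequency-only function picks the wrong gloss, while the glossary-backed one does not. The hard part in real life is, of course, building and maintaining that explicit knowledge and deciding when it should override the statistics.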

Maybe we should also keep a critical perspective on statistical and black-box deep learning approaches to fundamental domains of human reasoning, and insist on methods for more explicit, causal automated reasoning systems: systems that can tell us something about themselves, give us humans a way to point out their mistakes in a reasoned, structured way, and deal with analogies by applying the lessons learned from their mistakes to similar cases in similar classes.


Posted on February 8, 2019 in CogSci, Linguistics, philosophy, Science



4 responses to “Lost in Google Translate: How Unreasonable Effectiveness of Data can Sometimes Lead Us Astray”

  1. Saretha Naudé

    February 11, 2019 at 08:41

    I appreciate the time and effort of your valuable research. Something similar happened to me with a legal letter from the UK that the coroner’s office had translated into Dutch for me. I do not think they realize what an embarrassment the translation is. 😉

  2. Erwin Baeyens

    December 30, 2019 at 11:16

    I suspect that one of the reasons for this is the fact that, while they are officially the same language, there are subtle but fundamental differences between Dutch and “Flemish” (the version of Dutch spoken in Flanders). For instance, in Flanders we use the word “lopen” to signify running, as in a workout or a race. In the Netherlands it is used to indicate that people are going somewhere by walking.
    “Maaltafel” is what my teachers would have called a “provincialisme”, as in the ’70s and ’80s there was more pressure to try to standardise Dutch between the two regions.
    I’m guessing here, but I think that Google Search is more context-aware, whereas Google Translate is more geared towards a Netherlands-flavoured version of Dutch.

    • Emre Sevinç

      December 30, 2019 at 11:36

      Thanks a lot for enriching the discussion! Language, as you’ve indicated in those examples, is a complex business with a lot of historical baggage, and it’s still not easy for deep learning based systems to deal with such richness.

