
Lost in Google Translate: How the Unreasonable Effectiveness of Data Can Sometimes Lead Us Astray


I recently received an e-mail in Dutch from the Belgian teacher of my 7.5-year-old son, and even though my Dutch is more than enough to understand what the teacher wrote, I also wanted to run it through Google Translate, out of habit and because of my professional/academic background. This led to an interesting discovery and made me think once again about artificial intelligence, deep learning, automatic translation, statistical natural language processing, knowledge representation, commonsense reasoning, and linguistics.

But first things first, let’s see how Google Translate translated a very ordinary Dutch sentence into English:

Interesting! It is obvious that my son’s teacher didn’t have anything to do with a grinding table (!), and even if he did, I don’t think he’d involve his class in such an interesting hobby. 🙂 Of course, he meant the “multiplication table for 3”.

Then I wanted to see what the giant search engine, Google Search itself, knows about the Dutch word “maaltafel”. I immediately saw that Google Search knows very well that “maaltafel” in Dutch means “multiplication table” in English. Not only that: on the first page of search results alone, the expected Dutch expression occurs 47 times. Nothing surprising here: Read the rest of this entry »

 
1 Comment

Posted by on February 8, 2019 in CogSci, Linguistics, philosophy, Science

 


Two Laws for Systems


The first is known as Gall’s Law for systems design:

“A simple system may or may not work. A complex system that works is invariably found to have evolved from a simple system that worked. A complex system designed from scratch never works and cannot be patched up to make it work. You have to start over with a working simple system.” — John Gall

This law is essentially an argument in favour of underspecification: it can be used to explain the success of systems like the World Wide Web and the blogosphere, which grew incrementally from simple systems into complex ones, and the failure of systems like CORBA, which began with complex specifications. Gall’s Law has strong affinities to the practice of agile software development.

Read the rest of this entry »

 
Leave a comment

Posted by on November 22, 2018 in Management, philosophy, Programlama

 

How fast can we count the number of lines in a text file on GNU/Linux, macOS, and MS Windows?


At first glance, the question of counting lines in a text file is super straightforward. You simply run `wc` (word count) with the -l or --lines option. And that’s exactly what I’ve been doing for more than 20 years. But what I read recently made me question whether there are faster and more efficient ways to do it. Nowadays, with very large and fast storage, you can easily have text files of 1 GB, 10 GB, or even 100 GB. Add to that the fact that your laptop has at least 2 cores, or maybe 4, which means 8 logical cores with hyper-threading. On a powerful server, it’s not surprising at all to have 16 or more CPU cores. So can this simple text-processing task be made more efficient, using as many cores as are available and pushing them to their maximum to return the line count in a fraction of the time?
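For reference, here is the baseline I’ve relied on all these years (big.txt is just a placeholder name for a large text file):

```bash
# Classic single-process line count; --lines is the long form of -l.
wc -l big.txt
wc --lines big.txt

# A quick timing baseline to compare any "faster" alternative against.
time wc -l big.txt
```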

Here’s what I found:

In the first link above, I came across an interesting utility:

Apparently, the author of turbo-linecount decided to implement his solution in C++ (for Linux, macOS, and MS Windows). He uses memory mapping to map the text file into memory, and multi-threading to start threads that count the number of newlines (`\n`s) in different chunks of the memory region corresponding to the file contents, finally returning the sum of the newlines as the line count. Even though there are some issues with that system, I think it’s still very interesting. Actually, my initial reaction was “how come this nice utility is still not a standard package in most GNU/Linux distributions such as Red Hat, Debian, Ubuntu, etc.?”.
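The same chunk-and-sum idea can be approximated from the shell without writing any C++. Here is a rough sketch, assuming GNU parallel is installed and big.txt is again a placeholder: --pipepart hands byte ranges of the file to separate `wc` processes, and awk adds up the partial counts.

```bash
# Split big.txt into ~500 MB chunks, count newlines in each chunk in parallel,
# then sum the per-chunk counts. Chunk boundaries fall on newlines, so the
# total matches a plain `wc -l big.txt`.
parallel --pipepart --block 500M -a big.txt wc -l |
  awk '{ total += $1 } END { print total }'
```

Whether this actually beats a single `wc -l` depends heavily on whether the file is already in the page cache and on how fast your storage is; on a cold spinning disk the bottleneck is I/O, not CPU.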

Maybe we’ll have better options soon. Or maybe we already do? Let me know if there are better ways to handle this simple yet frequently used operation.

 
Leave a comment

Posted by on November 12, 2018 in Linux, Programlama

 


A Tale of Two Opera Houses: Belgium and Turkey


Today my seven-year-old son will visit the opera house in Antwerp with his classmates and teachers, see what goes on behind the scenes, and talk to the people who work there. As I dropped him off at school this morning, we talked a bit about music and what he would be doing today. After leaving him at school, some memories inevitably came back to me, along with the harsh reality I live in as of 2018.

I grew up in Istanbul, one of the oldest cities in the world, amid an immensely rich historical, cultural, and archaeological heritage. It was in Istanbul that I started going to the opera and the ballet with my friends, first as a high school student and then as a university student. In the 1990s and early 2000s, an opera ticket was usually cheaper than a cinema ticket. As some readers know well, the place I went to was the Atatürk Cultural Center. It was an important part of our collective memory. For a long time I didn’t know what state it was in, and I learned that, as of 2018, it looks like the picture below. This is how we treat collective memory:

Read the rest of this entry »

 
Leave a comment

Posted by on November 8, 2018 in Music

 


A Tale of Two Opera Houses: Belgium and Turkey


My seven-year-old son will visit the opera house in Antwerp today, together with his classmates and teachers, as part of his school activities. We talked about music and today’s activity as I was driving him to school this morning. This sent me on a trip down memory lane, and back to the harsh realities of the world I live in as of 2018.

I grew up in Istanbul, one of the oldest cities in the world, with a very rich and complex historical, cultural, and archeological heritage. In my city, I used to go to the opera and ballet, first as a high school student and then as a university student. In fact, opera tickets were generally cheaper than cinema tickets back in the 1990s and early 2000s. The opera house was named the “Atatürk Cultural Center”. It was an important part of our collective memory. It has recently been demolished, and this is how it looks as of 2018. This is how collective memory is treated: Read the rest of this entry »

 
Leave a comment

Posted by on November 8, 2018 in Music

 


Command Line for the 21st Century: The Low-Hanging Fruit


UPDATE (9-Jan-2019): Added information about exa, hyperfine, PathPicker, and svg-term-cli utilities.

UPDATE (19-Nov-2018): Added information about the Pipe Viewer utility.

People who have used Unix since the 1980s, or GNU/Linux since the 1990s, know that they can rely on the command line and many utilities for a lot of daily automation and data processing tasks. As someone who’s been using Unix and GNU/Linux based systems since 1994, I’m more than happy that I can count on these tools and the stable know-how built on them. Nevertheless, I think the command line and TUIs (Text-based User Interfaces) can be a bit better in the 21st century. Therefore, in this post, I’ll list a few recent utilities that help us have a better command-line experience.

Before you dive into the list, please be aware that I’m after the low-hanging fruit, that is, tools that can make an existing Unix / Linux command line environment a bit better with the least disruption. In other words, I will not touch on brand new, GPU-powered terminal emulators such as alacritty and kitty, nor will I talk about how nice it’d be if you just changed your shell from Bash to fish or elvish. (If you really want to know about alternative shells, please read https://github.com/oilshell/oil/wiki/ExternalResources and http://www.oilshell.org/blog/2018/01/28.html.) I also won’t send you down the rabbit hole and make you spend countless hours customizing your shell prompt (that requires an article by itself, but in the meantime you can go and check Go Bullet Train (GBT); you’ve been warned!). Finally, no, I won’t be talking about tmux either, because it has whole books dedicated to it, such as “The Tao of tmux” and “tmux 2: Productive Mouse-Free Development“.

If you think something is missing and fits within the context of “low-hanging fruit” (see above), please add a comment at the end of this post. Also, for a more specialized domain, see my recent post titled “Data Processing Resources: Command-line Interface (CLI) for CSV, TSV, JSON, and XML“.

So let’s start with the cd command, and how it can be enhanced with context and history, together with fuzzy matching:
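To make the idea concrete before the details, here is a minimal, hedged illustration of the fuzzy-matching part; it assumes fzf is installed and is not necessarily the exact tool covered in the rest of the post (the history- and context-based jumpers come later):

```bash
# A tiny helper: pick a subdirectory interactively with fuzzy matching, then cd into it.
cdf() {
  local dir
  dir=$(find . -type d -not -path '*/.git/*' 2>/dev/null | fzf) && cd "$dir"
}
```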

Read the rest of this entry »

 
1 Comment

Posted by on October 30, 2018 in Linux

 


Data Processing Resources: Command-line Interface (CLI) for CSV, TSV, JSON, and XML


UPDATE on 8-Feb-2019: Added BigBash It!

UPDATE on 3-Jan-2019: Added GNU datamash for CSV processing

UPDATE on 24-Oct-2018: Added gron for JSON processing.

Sometimes you don’t want pandas, tidyverse, Excel, or PostgreSQL. You know they are very powerful and flexible, and if you’re already using them daily, you know how to put them to work. But sometimes you just want to be left alone with your CSV, TSV, JSON, and XML files, process them quickly on the command line, and get done with it. And you want something a little more specialized than awk, cut, and sed.

This list is by no means complete or authoritative. I compiled it as a reference that I can come back to later. If you have other suggestions that are in the spirit of this article, feel free to share them by writing a comment at the end. Without further ado, here’s my list:

  • xsv: A fast CSV command line toolkit written by the author of ripgrep. It’s useful for indexing, slicing, analyzing, splitting and joining CSV files (see the short usage sketch after this list).
  • q: run SQL directly on CSV or TSV files.
  • csvkit: a suite of command-line tools for converting to and working with CSV, the king of tabular file formats.
  • textql: execute SQL against structured text like CSV or TSV.
  • miller: like awk, sed, cut, join, and sort for name-indexed data such as CSV, TSV, and tabular JSON. You get to work with your data using named fields, without needing to count positional column indices.
  • agate: a Python data analysis library that is optimized for humans instead of machines. It is an alternative to numpy and pandas that solves real-world problems with readable code.
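To give a feel for how a couple of these read in practice, here is a hedged sketch; sales.csv and its region/amount columns are made up purely for illustration:

```bash
# xsv: inspect the file and get quick per-column statistics, aligned for reading.
xsv headers sales.csv
xsv stats sales.csv | xsv table

# q: treat the CSV as a table and query it with plain SQL
# (-H: the first line is a header, -d,: comma-delimited input).
q -H -d, "SELECT region, SUM(amount) FROM ./sales.csv GROUP BY region"
```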

Honorable mentions:

  • GNU datamash: a command-line program which performs basic numeric, textual and statistical operations on input textual data files. See examples & one-liners.
  • SQLite: import a CSV File Into an SQLite Table, and use plain SQL to query it.
  • csv-mode for Emacs: sort, align, transpose, and manage rows and fields of CSV files.
  • lnav: the Log Navigator. See the tutorial in Linux Magazine.
  • jq: this one is THE tool for processing JSON on the command-line. It’s like sed for JSON data – you can use it to slice and filter and map and transform structured data with the same ease that sed, awk, grep and friends let you play with text (see the short sketch after this list).
  • gron: transforms JSON into discrete assignments to make it easier to grep for what you want and see the absolute path to it. (Why shouldn’t you just use jq?)
  • jid: JSON Incremental Digger, drill down JSON interactively by using filtering queries like jq.
  • jiq: jid with jq.
  • JMESPath tutorial: a query language for JSON. You can extract and transform elements from a JSON document. There are a lot of implementations at http://jmespath.org/libraries.html and the CLI implementation is jp.
  • BigBash It!: converts your SQL SELECT queries into an autonomous Bash one-liner that can be executed on almost any *nix device to make quick analyses or crunch GB of log files in CSV format. Perfectly suited for Big Data tasks on your local machine. Source code available at https://github.com/Borisvl/bigbash
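As with the CSV tools, a small hedged example may help; api.json and its items array are invented for illustration:

```bash
# jq: reshape each element of an array, then extract raw strings for classic Unix tools.
jq '.items[] | {id, name}' api.json
jq -r '.items[].name' api.json | sort -u

# gron: flatten the same document into greppable "path = value" assignments.
gron api.json | grep name
```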

Finally, the CLI for XML:

 
1 Comment

Posted by on July 10, 2018 in Linux, Programlama

 
