
How to get better performance from Scala by using Parallel Collections


Today I needed to download the HTML content of some articles from a newspaper, and I decided to write a quick and dirty Scala application to get the job done. All I had to do was parse a main HTML page using regular expressions, extract a list of URLs, iterate over them to fetch the contents of each, and finally write those contents to files. Thanks to Scala I was able to code it comfortably and quickly, but when I ran the code I saw that it took about 50 seconds to grab the contents of 150 URLs. Would it be possible to make it faster? Fortunately, Scala has had Parallel Collections support for a very long time, and I decided to try it out.

All I had to do was to convert the following part:

for (url <- urls) { ...

to

for (url <- urls.par) { ...

and run it again.

The result was better than I expected: the ‘normal’ version ran in the range of 30 to 50 seconds, whereas the parallelized version ran in the range of 8 to 10 seconds, that is, 3 to 5 times faster! Yet another reason to use Scala.
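For the curious, here is a minimal, self-contained sketch of the mechanism (my own illustration, not the code from the gist): the fetch function below is a stand-in that sleeps instead of hitting the network, roughly the way Source.fromURL would block on I/O. On Scala 2.13 and later, .par lives in the separate scala-parallel-collections module; at the time of this post it was part of the standard library.

```scala
// On Scala 2.13+ this import (from the scala-parallel-collections
// module) enables .par; on earlier versions .par was built in.
import scala.collection.parallel.CollectionConverters._

// Stand-in for downloading one URL: blocks ~50 ms like real I/O would.
def fetch(url: String): String = {
  Thread.sleep(50)
  s"<html>contents of $url</html>"
}

// Hypothetical URL list, standing in for the 150 article URLs.
val urls = (1 to 8).map(i => s"https://example.com/article/$i").toList

// Sequential: roughly 8 * 50 ms of wall time, one fetch at a time.
val seqResults = urls.map(fetch)

// Parallel: iterations are handed to a fork/join thread pool, so the
// blocking calls overlap and wall time shrinks accordingly.
val parResults = urls.par.map(fetch)

// Parallelism changes the schedule, not the results.
assert(parResults.toList == seqResults)
```

The speed-up is bounded by the pool size and by how much of each iteration actually blocks; for heavily I/O-bound work like this, Futures on a dedicated thread pool are another option, but .par is by far the smallest change to existing code.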

And for those who say “Gist or didn’t happen”, you can see the source code at https://gist.github.com/emres/f0f4afbb75562335063c and its relevant build.sbt file at https://gist.github.com/emres/5296a071dae8caf7ca35. Don’t take my word for it; spend a few minutes and try it yourself.

 

Posted on October 31, 2014 in Programlama

 


Functional Programming in Scala: The most advanced Scala and functional programming book for the working programmer


It is safe to say that “Functional Programming in Scala” by Chiusano and Bjarnason can be considered the most advanced Scala programming book published so far (in a sense, it can be compared to SICP). Half of one of my bookshelves is occupied by Scala books, including Scala in Depth, but none of them takes functional programming as seriously as this book, or pushes it so far to its limits. This, in turn, means that most Java programmers (including very senior ones), as well as Scala programmers with some experience, should prepare to feel very much like newbies again.

But why the need for such a book, and what’s all the noise about functional programming? Here is my favorite description of functional programming, given by Tony Morris: “Supposing a program composed of parts A, B, C, D, and a requirement for a program of parts A, B, C, and E. The effort required to construct this program should be proportional to the size of E. The extent to which this is true is the extent to which one achieves the central thesis of Functional Programming. Identifying independent program parts requires very rigorous cognitive discipline and correct concept formation. This can be very (very) difficult after exposure to sloppy thinking habits. Composable programs are easier to reason about. We may (confidently) determine program behaviour by determining the behaviour of sub-programs -> fewer bugs. Composable programs scale indefinitely, by composing more and more sub-programs. There is no distinction between a ‘small’ and a ‘large’ application; only ‘smaller than’ or ‘greater than’.”

The description above not only captures the core idea of functional programming and why it is important and useful, but also draws attention to the fact that getting used to functional program design can be difficult for people who are not used to thinking that way. Fortunately, “Functional Programming in Scala” is here to fill a huge void in that respect.
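To make Morris’s point concrete, here is a tiny illustration of my own (not an example from the book): because the parts are pure functions, a new requirement is met by composing one more part onto the pipeline rather than rewriting the existing ones.

```scala
// Pure, independent parts.
val trim:  String => String       = _.trim
val lower: String => String       = _.toLowerCase
val words: String => List[String] = _.split("\\s+").toList

// A program built by composing parts A, B, and C.
val tokenize: String => List[String] = trim andThen lower andThen words

// A new requirement (drop very short words) is a new part E,
// composed on without touching the parts above.
val longTokens: String => List[String] = tokenize andThen (_.filter(_.length > 2))
```

The effort of building longTokens is proportional to the size of the new part, which is exactly the property the quote describes.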

 

Posted on September 13, 2014 in FunctionalProgramming

 


PostgreSQL 9 High Availability Cookbook


PostgreSQL 9 High Availability Cookbook is a very well written book whose primary audience is experienced DBAs and system engineers who want to take their PostgreSQL skills to the next level by diving into the details of building highly available PostgreSQL-based systems. Reading this book is like drinking from a fire hose: the signal-to-noise ratio is very high; in other words, every single page is packed with important, critical, and very practical information. As a consequence, this also means that the book is not for newbies: not only do you have to know the fundamental aspects of PostgreSQL from a database administrator’s point of view, but you also need a solid GNU/Linux system administration background.

One of the strongest aspects of the book is the author’s principled and well-structured engineering approach to building a highly available PostgreSQL system. Instead of jumping straight into recipes to be memorized, the book teaches you basic but very important principles of capacity planning. More importantly, this planning of servers and networking is not only given as a good template; the author also explains the logic behind it, drawing attention to the reasoning behind the heuristics he uses and why some magic numbers are taken as good estimates in the absence of more case-specific information. This style is applied very consistently throughout the book: each recipe is explained so that you know why you do something in addition to how you do it.

 

Posted on August 21, 2014 in Books, Linux, sysadmin

 


Is this the State of the Art for grammar checking on Linux in the 21st century?


Recently, I shared an article with a colleague of mine. The article had been published in a peer-reviewed journal and its contents were original and interesting. On the other hand, my colleague, being a meticulous reader of scientific texts, immediately spotted a few simple grammar errors. It was very easy to blame the authors and editors for not correcting such errors before publication, but this triggered another question:

Why don’t we have open source and very high quality grammar checking software that is already integrated into major text editors such as VIM, Emacs, etc.?

Any user of a recent version of MS Word is well aware of on-the-fly grammar checking, at least for English. But as many academics know very well, many of them use LaTeX to typeset their articles and rely either on well-known text editors such as VIM and Emacs, or on specialized software for handling LaTeX easily. Therefore, telling these people “go and check your article using MS Word, or copy-paste your article text into an online grammar checking service” does not make a lot of sense. Those methods are not convenient, and thus not very usable by the hundreds of thousands of scientists writing articles every day. But what would be the ideal way? The answer is simple in theory: we have high quality open source spell checkers, at least for English, and they have already been integrated into major text editors; therefore scientists who write in LaTeX have no excuse for spelling errors, since it is simply a matter of activating the spell checker. If only they had similar software for grammar checking, it would be very straightforward and convenient to eliminate the easiest grammar errors, at least for English.

A quick search on the Internet revealed the following for grammar checking on GNU/Linux:

Baoqiu Cui has implemented a grammar checker integration for Emacs using link-grammar, but unfortunately it is far from easily usable.



 

Posted on June 10, 2014 in Emacs, Linguistics, Linux

 


GODISNOWHERE: A look at a famous question using Python, Google and natural language processing


Are there any commonalities among human intelligence, Bayesian probability models, corpus linguistics, and religion? This blog entry presents a piece of light reading for people interested in a combination of those topics.
You have probably heard the famous question:

       “What do you see below?”

            GODISNOWHERE

The stream of letters can be broken down into English words in two different ways: either as “God is nowhere” or as “God is now here.” You can find an endless set of variations on this theme on the Internet, but I will deal with this example in the context of computational linguistics and big data processing.


When I first read the beautiful book chapter titled “Natural Language Corpus Data”, written by Peter Norvig for the book “Beautiful Data“, I decided to run an experiment using Norvig’s code. In that chapter, Norvig presented a very concise Python program that ‘learned’ how to break down a stream of letters into English words; in other words, a program with the capability to do ‘word segmentation’.

Norvig’s code, coupled with Google’s language corpus, is powerful and impressive; it is able to take a character string such as

“wheninthecourseofhumaneventsitbecomesnecessary”

and return a correct segmentation:


‘when’, ‘in’, ‘the’, ‘course’, ‘of’, ‘human’, ‘events’, ‘it’, ‘becomes’, ‘necessary’

But how would it deal with “GODISNOWHERE”? Let’s try it out in a GNU/Linux environment.
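The idea behind Norvig’s program can be sketched in a few lines of Scala (his original is Python, backed by Google’s n-gram counts; the tiny count table below is a hand-made stand-in of mine, so the numbers are purely illustrative): pick the segmentation that maximizes the product of unigram word probabilities.

```scala
// Toy stand-in for the Google corpus: hand-made unigram counts.
val counts = Map(
  "god" -> 50, "is" -> 500, "now" -> 200, "here" -> 300,
  "nowhere" -> 40, "no" -> 400, "where" -> 150
)
val total = counts.values.sum.toDouble

// Unseen words get a smoothed score that shrinks fast with length.
def pWord(w: String): Double = {
  val c = counts.getOrElse(w, 0)
  if (c > 0) c / total else 1.0 / (total * math.pow(10, w.length))
}

// Memoized search over all splits, capped at 10-letter candidate words.
val memo = scala.collection.mutable.Map.empty[String, (Double, List[String])]

def segment(s: String): (Double, List[String]) = memo.get(s) match {
  case Some(cached) => cached
  case None =>
    val result =
      if (s.isEmpty) (1.0, List.empty[String])
      else (1 to math.min(s.length, 10)).map { i =>
        val (pRest, rest) = segment(s.drop(i))
        (pWord(s.take(i)) * pRest, s.take(i) :: rest)
      }.maxBy(_._1)
    memo(s) = result
    result
}

// With these toy counts, "god is nowhere" narrowly beats "god is now here".
segment("godisnowhere")._2
```

With real Google-scale counts the same shape of code segments much longer strings, and which of the two readings wins depends entirely on the corpus probabilities, which is precisely what makes the GODISNOWHERE question fun.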

 

Posted on March 1, 2014 in Linguistics, Programlama, python

 


Scala versus Python and R: software archaeology in bioinformatics


When one of the scala-user members mentioned a bioinformatics package called GATK (Genome Analysis Toolkit) and its use of Scala recently, I decided to look further into the matter. Thanks to the valuable Ohloh service, amateur software archaeology has never been easier! After a brief visit to https://www.ohloh.net/p/gatk I learned that the GATK software has had 12,871 commits made by 77 contributors within the last 5 years, representing 99,078 lines of code.

I wanted to learn more about its source code languages, and decided to drill down by visiting https://www.ohloh.net/p/gatk/analyses/latest/languages_summary. What I discovered was surprising. Let me share the facts I have found so far: the project did not have any Scala code until recently; in July 2009, for example, it had no Scala at all, whereas it contained 4,410 lines of Python and 56 lines of R code.



 

Posted on February 16, 2014 in Programlama

 


Can LinkedIn endorsements be motivating? A case for Coursera’s Machine Learning class


Any self-respecting, social-media-savvy professional knows that LinkedIn endorsements are the result of a hideous gamification experiment gone wrong (on many levels), except when they think it is a straightforward abuse of human psychology. Some computer programmers even push back by writing automated scripts that endorse profiles with totally absurd ‘skill sets‘.

On the other hand, in some unexpected cases these endorsements can be very motivating, which is what happened to me a few months ago, back in October 2013. To cut a long story short, when I came across the following endorsements by some of my LinkedIn contacts, my reaction was something that even surprised me:


It went something like this: “Machine Learning! Should I accept that endorsement? I mean, I did do small projects related to machine learning, such as a Poor Man’s TV Program Recommender that utilized Support Vector Machines, and a cross-cultural and cross-domain recommendation system using a semantic graph database such as AllegroGraph; but apart from an AI course that I took while studying for my cognitive science degree, I haven’t taken any Machine Learning course. On the other hand, Andrew Ng’s famous Machine Learning course at Coursera is about to start, so maybe that’s a nice opportunity! Why not? If I can finish the course successfully, then accepting such an endorsement will be a bit more meaningful, at least from a practical or academic point of view.”

 

Posted on January 19, 2014 in e-Learning, Programlama

 


 