
How to get better performance from Scala by using Parallel Collections

31 Oct

Today I needed to download the HTML content of some articles from a newspaper, so I decided to write a quick and dirty Scala application to get the job done. I only needed to parse a main HTML page using regular expressions, get a list of URLs, iterate over them to fetch the contents of each, and finally write those contents to files. Thanks to Scala I was able to code it comfortably and quickly, but when I ran the code I saw that it took about 50 seconds to grab the contents of 150 URLs. Would it be possible to make it faster? Fortunately, Scala has had Parallel Collections support for a very long time, so I decided to try it out.

All I had to do was change the following part:

for (url <- urls) { ...

to

for (url <- urls.par) { ...

and run it again.

The result was better than I expected: the ‘normal’ version ran in the range of 30 to 50 seconds, whereas the parallelized version ran in the range of 8 to 10 seconds, that is, 3 to 5 times faster! Yet another reason to use Scala.
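If you want to see the effect without hitting a real website, here is a minimal sketch of the same one-word change. It is not the code from my gist: `fakeFetch`, `timed`, and the example.com URLs are hypothetical names made up for illustration, and the download is simulated with a short sleep. It assumes Scala 2.12 or earlier, where `.par` is part of the standard library; on Scala 2.13 and later, parallel collections live in the separate scala-parallel-collections module and need an extra import, as noted in the comments.

```scala
// Assumes Scala 2.12 or earlier, where .par is in the standard library.
// On Scala 2.13+, add the scala-parallel-collections module to the build and
// import scala.collection.parallel.CollectionConverters._

object ParDownloadSketch {
  // Hypothetical stand-in for the real HTTP download: it sleeps to mimic
  // network latency instead of touching the network.
  def fakeFetch(url: String): String = {
    Thread.sleep(100)
    s"<html>content of $url</html>"
  }

  // Runs a block and returns its result together with the elapsed
  // wall-clock time in milliseconds.
  def timed[A](block: => A): (A, Long) = {
    val start = System.nanoTime()
    val result = block
    (result, (System.nanoTime() - start) / 1000000)
  }

  def main(args: Array[String]): Unit = {
    // Made-up URLs; nothing is actually downloaded.
    val urls = (1 to 20).map(i => s"http://example.com/article/$i")

    // Sequential version: one simulated download after another.
    val (sequential, seqMs) = timed(urls.map(fakeFetch))

    // Parallel version: the only change is .par (plus .seq to convert back).
    val (parallel, parMs) = timed(urls.par.map(fakeFetch).seq)

    println(s"sequential: $seqMs ms, parallel: $parMs ms")
    println(s"same results: ${sequential == parallel}")
  }
}
```

The exact timings will depend on your machine, since by default the parallel version uses a thread pool sized to the number of available processors. Also keep in mind that the body of a parallel loop must be safe to run concurrently; writing each article to its own file, as in my case, is fine, whereas appending to a shared mutable collection would not be.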

And for those who say “Gist or it didn’t happen”, you can see the source code at https://gist.github.com/emres/f0f4afbb75562335063c and the corresponding build.sbt file at https://gist.github.com/emres/5296a071dae8caf7ca35. Don’t take my word for it; spend a few minutes and try it yourself.

 

Posted on October 31, 2014 in Programlama

 
