
Scala versus Python and R: software archaeology in bioinformatics

16 Feb

When one of the scala-user members recently mentioned a bioinformatics package called GATK (Genome Analysis Toolkit) and its use of Scala, I decided to take a closer look. Thanks to the valuable Ohloh service, amateur software archaeology has never been easier! After a brief visit to https://www.ohloh.net/p/gatk I learned that GATK has had 12,871 commits made by 77 contributors within the last 5 years, representing 99,078 lines of code.

I wanted to learn more about the languages used in its source code, so I drilled down by visiting https://www.ohloh.net/p/gatk/analyses/latest/languages_summary. What I discovered was surprising. Here are the facts I have found so far: the project did not have any Scala code until fairly recently. For example, in July 2009 it had no Scala, whereas it contained 4410 lines of Python and 56 lines of R code:

[Figure: beforeScala, language breakdown in July 2009]

Scala was introduced around August 2009, starting with 174 lines of code. In less than 2 years, the number of lines of Scala rose to 9223, whereas Python increased to 7074 and R to 4091.

[Figure: middleScala, language breakdown around May 2011]

This seems to be the turning point for the GATK project with respect to Scala: after May 2011, roughly two years after the introduction of Scala, the share of Python and R code started to drop dramatically. Three years later, in February 2014, Scala has about 7110 lines of code, whereas Python has 36 and R has 924:

[Figure: finalScala, language breakdown in February 2014]

In other words, Scala rose from 0% to more than 7% of the code base in a few years, while the combined share of Python and R code fell below 1%.
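The percentages above can be checked with a few lines of arithmetic. The following is only a rough sketch: it assumes that the 99,078-line total quoted at the top of this post is the right denominator for the February 2014 snapshot, and the names `share`, `TOTAL_LINES` and `feb_2014` are made up for illustration.

```python
# Line counts for February 2014, as quoted in this post.
feb_2014 = {"Scala": 7110, "Python": 36, "R": 924}

# Assumption: the total reported on the Ohloh project page (99,078 lines)
# is the denominator that applies to the February 2014 snapshot.
TOTAL_LINES = 99_078


def share(lines: int, total: int) -> float:
    """Return `lines` as a percentage of `total`."""
    return 100.0 * lines / total


scala_share = share(feb_2014["Scala"], TOTAL_LINES)
python_r_share = share(feb_2014["Python"] + feb_2014["R"], TOTAL_LINES)

print(f"Scala:      {scala_share:.2f}%")
print(f"Python + R: {python_r_share:.2f}%")
```

Under that assumption, Scala comes out a little above 7% and Python plus R together come out just under 1%, which matches the figures in the text.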

It is certainly not very meaningful to jump to hard-and-fast conclusions by looking at those graphs without knowing more about the discussions that went into those changes; nevertheless, I think the trends observed in this project tell a story. One can speculate from here: if we assume 1 line of Scala code roughly corresponds to 2 to 3 lines of Java code, the GATK project, if its Java code were converted to Scala, might end up having about 100,000 to 150,000 lines of code instead of its current 200,000. This, in turn, would also mean a more homogeneous code base.

As a side note, Scala is of course not limited to the GATK project; other projects such as bigdatagenomics make heavy use of Scala nowadays. There are also companies looking for Genome Analytics Software Engineers to work with distributed data-analytics frameworks developed in Scala.

Do you know other projects like GATK that started to use Scala (instead of languages such as Python and R) and have continuously increased their use of the Scala language and its ecosystem? I would really like to know more about similar trends and the domains in which they occur.

 

Posted by on February 16, 2014 in Programlama

 


3 responses to “Scala versus Python and R: software archaeology in bioinformatics”

  1. Ercan Aydoğan

    February 8, 2015 at 15:07

    As someone trying to build some things(!) related to big data with Apache Spark and Scala, it was interesting to start from the question "R or Scala?" and run into the words ileriseviye and FZ again. I don't know whether it is because not many people know about it yet, or because we are afraid to write "well done" under a post, but posts generally just get read. I was no different, but as I said, when I saw ileriseviye and FZ I wanted both to say hello and to ask exactly which main reasons led you to choose Scala.

    My reason is Apache Spark (Scala being its default language).

     
    • Emre Sevinç

      February 9, 2015 at 09:15

      Hello,

      The post above does not say anything about me having chosen Scala 😉

      That aside, when the choice is mine and I am going to work on the JVM, I prefer Scala because of its more advanced expressive power and type inference. But, as in the project I am working on these days, there are also times when I build things on Apache Spark in Java because of the customer's requirements.

       
  2. Ercan Aydoğan

    February 9, 2015 at 10:13

    Hello,
    Thank you for your reply.

    I hope you will write more about the situations (difficulties or conveniences) you run into in the project you mentioned. As someone working with Spark in particular, I am looking forward to reading about them.

     
