When one of the scala-user members recently mentioned a bioinformatics package called GATK (Genome Analysis Toolkit) and its use of Scala, I decided to take a closer look. Thanks to the valuable Ohloh service, amateur software archaeology has never been easier! After a brief visit to https://www.ohloh.net/p/gatk I learned that GATK has had 12,871 commits made by 77 contributors within the last 5 years, representing 99,078 lines of code.
I wanted to learn more about the languages used in its source code, so I drilled down by visiting https://www.ohloh.net/p/gatk/analyses/latest/languages_summary. What I discovered was surprising. Let me share the facts I’ve found so far. The project did not have any Scala code in its early days. In July 2009, for example, it had no Scala at all, whereas it contained 4410 lines of Python and 56 lines of R code.
Scala was introduced around August 2009, starting with 174 lines of code. In less than two years, the number of lines of Scala code rose to 9223, while Python grew to 7074 and R to 4091.
This seems to be the turning point for the GATK project with respect to Scala: after May 2011, roughly two years after the introduction of Scala, the share of Python and R code started to drop dramatically. Almost three years later, in February 2014, Scala accounts for about 7110 lines of code, whereas Python has 36 and R has 924.
In other words, Scala rose from 0% to more than 7% of the code base in a few years, while the combined share of Python and R code fell below 1%.
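For readers who want to check these percentages, here is a minimal sketch of the arithmetic. It assumes the 99,078-line total from the Ohloh project page as the denominator, since the post quotes no separate total for February 2014; the variable and function names are purely illustrative.

```python
# Line counts for February 2014, as quoted in the post.
lines = {"Scala": 7110, "Python": 36, "R": 924}

# Assumption: the 99,078-line project total reported on the Ohloh page
# is used as the denominator; no per-month total is quoted in the post.
TOTAL_LINES = 99_078


def share(count, total=TOTAL_LINES):
    """Return count as a percentage of total."""
    return 100.0 * count / total


scala_share = share(lines["Scala"])
python_r_share = share(lines["Python"] + lines["R"])

print(f"Scala:      {scala_share:.2f}%")
print(f"Python + R: {python_r_share:.2f}%")
```

Under that assumption, the Scala share comes out slightly above 7% and the combined Python and R share just under 1%, which matches the figures above.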
It is certainly not very meaningful to jump to hard-and-fast conclusions by looking at those graphs without knowing more about the discussions that went into those changes. Nevertheless, I think the trends observed in this project are telling a story. One can speculate from here: if we assume 1 line of Scala code roughly corresponds to 2 to 3 lines of Java code, the GATK project, if its Java code were converted to Scala, might end up having about 100,000 to 150,000 lines of code, instead of its current 200,000 lines of code. This, in turn, would also mean a more homogeneous code base.
As a side note, Scala's use in this field is of course not limited to the GATK project; other projects such as bigdatagenomics make heavy use of Scala nowadays. There are also companies looking for Genome Analytics Software Engineers who use distributed data-analytics frameworks developed in Scala.
Do you know of other projects like GATK that started to use Scala (instead of languages such as Python and R) and have steadily increased their use of the Scala language and its ecosystem? I would really like to know more about similar trends and the domains in which they occur.