Reproducibility in Computational Biology

Posted 4th December 2017 by Jane Williams
Recently, Baker and Lithgow (Baker, 2016; Lithgow et al., 2017) highlighted the problem of reproducibility in research. Reproducibility issues affect, to varying extents, a large portion of scientific fields (Baker, 2016). Bioinformatics is becoming a key element of many biological and medical studies (Searls, 2010), and reproducibility is an important issue in this field as well (Kanwal et al., 2017; Sandve et al., 2013).
Reproducibility issues in bioinformatics may stem from the short half-life of bioinformatics software, the complexity of pipelines, uncontrolled effects induced by changes in system libraries, incomplete or imprecise workflow descriptions, and so on. Sandve et al. (2013) proposed ten good-practice rules for developing bioinformatics workflows that mitigate reproducibility issues.
A community project that fulfils many of the rules suggested by Sandve et al. is Bioconductor (Gentleman et al., 2004), which provides version control for a large number of genomics/bioinformatics packages and stores every Bioconductor release since the project's foundation. However, Bioconductor does not cover all the steps of every possible bioinformatics workflow, and it provides only a limited framework for complex pipelines.
Galaxy (Digan et al., 2017) and BaseSpace (Colombo et al., 2017; Van Neste et al., 2015) are examples of open-source and commercial solutions, respectively, that guarantee certain levels of reproducibility. However, workflows implemented in these environments cannot be heavily customised; BaseSpace, for example, imposes strict rules for application submission. It can also be difficult to fully reconstruct a Galaxy-based workflow at another site, since Galaxy does not provide information on the underlying infrastructure (e.g. system libraries) used in a specific installation. Moreover, cloud applications such as BaseSpace have to cope with legal and ethical issues (Dove et al., 2015).
Recently, Docker virtualisation technology has entered the field of bioinformatics (da Veiga Leprevost et al., 2017; Kim et al., 2017; Menegidio et al., 2017). Docker containers make it possible to run many otherwise incompatible applications on the same server, and they make it easy to package and distribute programs. Instead of virtualising hardware, containers run on top of a single operating-system instance, allowing more instances to run on the same hardware than virtual machines would allow. Menegidio et al. (2017), da Veiga Leprevost et al. (2017) and Kim et al. (2017) have provided the bioinformatics community with large collections of Docker-based tools.
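To illustrate how Docker supports reproducibility, a Dockerfile can pin the exact base image and tool versions so the same environment is rebuilt identically anywhere. The sketch below is a hypothetical example (the tool and version numbers are illustrative, not from any of the cited collections):

```dockerfile
# Hypothetical sketch: pin the base image to an exact release,
# not a moving tag like "latest", so rebuilds are deterministic.
FROM ubuntu:16.04

# Install a fixed version of a bioinformatics tool.
# The package version string here is illustrative; in practice you
# would pin whatever exact version your published analysis used.
RUN apt-get update && \
    apt-get install -y --no-install-recommends samtools && \
    rm -rf /var/lib/apt/lists/*

# Run the tool by default so the container behaves like the program itself.
ENTRYPOINT ["samtools"]
```

Anyone with Docker can then rebuild and run the identical environment with `docker build -t myanalysis .` followed by `docker run myanalysis --version`, independent of the system libraries installed on the host.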
Furthermore, a community initiative, the Reproducible Bioinformatics Project, was recently created to give bioinformaticians a controlled yet flexible framework for distributing Docker-based workflows under a common set of reproducibility rules.
Today there are many tools that help guarantee reproducibility in bioinformatics. However, many questions remain open:
- Which approach will prevail: stand-alone or cloud-based applications?
- Is Docker the technology that will revolutionise the design of bioinformatics applications?
- What is the right way to convince bioinformaticians to write and distribute their code under rigorous reproducibility rules?
Raffaele Calogero is an Associate Professor in the Department of Biotechnology and Health Sciences at the University of Torino. Raffaele led a roundtable discussion at the recent NGS Tech and Applications Strand of the 4Bio Summit.