ABSTRACT

The advent of Big Data has resulted in an urgent need for flexible analysis tools. Machine learning addresses part of this need, providing a stable of potential computational models for extracting knowledge from the flood of data. In contrast to many areas of the natural sciences, such as physics, chemistry, and biology, machine learning can be studied in an algorithmic 344and computational fashion. In principle, machine learning experiments can be precisely defined, leading to perfectly reproducible research. In fact, there have been several efforts in the past of frameworks for reproducible experiments [3,13,15,30]. Since machine learning is a data-driven approach, we need to have access to carefully structured data [25], that would enable direct comparison of the results of statistical estimation procedures. Some headway has been seen in the statistics and bioinformatics community. The success of R and Bioconductor [8,9] as well as projects such as Sweave [15] and Orgmode [29] have resulted in the possibility to embed the code that produces the results of the paper in the paper itself. The idea is to have a unified computation and presentation, with the hope that it results in reproducible research [14,24].