ABSTRACT

Businesses can derive significant value from data for decision making by constructing complex data analytics pipelines that serve a dual purpose: producing reports and feeding machine learning (ML) models. These pipelines implement a series of transformations via scripts that read raw data from different sources, aggregate, clean and transform it, and save the results back into tables. The main challenge addressed in this chapter is how to efficiently transform raw data on the fly into features to be used by ML models while minimizing the effort required to maintain the scripts in the face of changes. Building on existing solutions, this chapter proposes a hybrid approach that strikes a trade-off between supporting dependency change management and allowing partial processing, while ensuring platform independence. It uses a directed acyclic graph (DAG) to represent data and feature transformations in a way that minimizes the overall processing required and eases the maintenance of the data processing scripts. A prototype has been developed to evaluate the proposed architecture, and preliminary performance results are discussed.