ABSTRACT

As massive datasets become increasingly common, new scalable approaches to prediction are needed. Because memory and runtime are often constrained in practice, it is important to develop machine learning methods that perform well on large datasets under a fixed computational budget. Procedures that use subsets of a training set are promising tools for prediction with large-scale datasets [16], and recent research has focused on developing and evaluating various subset-based prediction procedures. A subsetting procedure constructs subsets from the available training data, trains an algorithm on each subset, and combines the results across the subsets to form a final prediction. Because training on the subsets can be massively parallelized, these methods can take advantage of modern computational resources.
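
The three steps of a subsetting procedure (construct subsets, train on each subset, combine the results) can be illustrated with a minimal sketch. The choices below are assumptions made for illustration only and are not taken from the procedures cited above: the subsets form a random partition, the base learner is a one-dimensional least-squares line, and the combination rule is a simple average. The names `fit_line` and `subset_predict` are hypothetical.

```python
import random
from statistics import fmean


def fit_line(xs, ys):
    """Illustrative base learner: least-squares line, returns (intercept, slope)."""
    x_bar, y_bar = fmean(xs), fmean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def subset_predict(xs, ys, x_new, n_subsets=4, seed=0):
    """Predict at x_new by averaging models trained on disjoint subsets."""
    rng = random.Random(seed)

    # Step 1: construct subsets (assumption: a random, roughly equal partition).
    order = list(range(len(xs)))
    rng.shuffle(order)
    subsets = [order[i::n_subsets] for i in range(n_subsets)]

    # Step 2: train the base learner on each subset independently.
    # No fit depends on another, so this loop could be run in parallel.
    models = [
        fit_line([xs[i] for i in idx], [ys[i] for i in idx])
        for idx in subsets
    ]

    # Step 3: combine the subset-specific predictions (assumption: simple average).
    return fmean(intercept + slope * x_new for intercept, slope in models)


# Synthetic data from y = 1 + 2x + noise; the noiseless value at x = 5 is 11.
data_rng = random.Random(1)
xs = [data_rng.uniform(0.0, 10.0) for _ in range(2000)]
ys = [1.0 + 2.0 * x + data_rng.gauss(0.0, 0.1) for x in xs]
y_hat = subset_predict(xs, ys, x_new=5.0)
```

In this sketch each model sees only a quarter of the training data, so the memory needed per fit is bounded by the subset size rather than by the full dataset. Other subset constructions and combination rules fit the same three-step template.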