Large datasets | 33 | Introduction to Data Science

ABSTRACT

Machine learning problems often involve datasets that are as large or larger than the MNIST dataset. There is a variety of computational techniques and statistical concepts that are useful for the analysis of large datasets. In this chapter, the authors scratch the surface of techniques and concepts by describing matrix algebra, dimension reduction, regularization and matrix factorization. The main reason for using matrices is that certain mathematical operations needed to develop efficient code can be performed using techniques from a branch of mathematics called linear algebra. The predictor space can be thought of as the collection of all possible vectors of predictors that should be considered for the machine learning challenge in question. A typical machine learning challenge will include a large number of predictors, which makes visualization somewhat challenging. Recommendation systems use ratings that users have given items to make specific recommendations.