ABSTRACT

Technically, “big data” is a part of data science: the part that deals with data that are so large that they cannot be handled by an ordinary computer. A big data problem occurs when the workflow that readers have been using to solve problems becomes infeasible because their data have grown too large. It is useful in this context to think about orders of magnitude of data. The evolution of baseball data illustrates how “big data problems” have arisen as the volume and variety of the data have increased over time. This chapter outlines some of the most important concepts for working with big data and highlights some of the tools readers are likely to see on this frontier of their working knowledge. These tools include CUDA, a parallel computing platform and application programming interface created by NVIDIA, and the OpenCL package, which provides R bindings to OpenCL, an open, general-purpose standard for GPU computing.