ABSTRACT

Spark is one of the most widely used open source platforms for big data. It is suited to large-scale data processing and analytics, and it offers APIs in multiple programming languages. Spark was designed to address several limitations of MapReduce, notably its limited support for iterative workloads and stream processing. It addresses these two limitations by introducing an in-memory computation model. Spark also provides libraries for machine learning (MLlib), stream processing (Spark Structured Streaming), interactive SQL queries (Spark SQL), and graph processing (GraphX). The chapter explains Spark's APIs through worked examples, covering both the high-level Structured APIs (DataFrames and Datasets) and the low-level RDD API.