ABSTRACT

This chapter presents RPig, a framework that integrates R and Pig to provide scalable machine learning and advanced statistical functionality, so that analytic jobs can be developed concisely in high-level languages. It describes two scenarios that neither R nor Pig can handle on its own, and it introduces the foundation frameworks: R, Hadoop, and Pig. It then explains the overall RPig framework and its components. RPig combines the deep statistical analysis capability of R with the parallel data-processing capability of Pig, and it supports parallelism for the differing requirements of different scenarios. An initial version of the framework has been implemented as a proof-of-concept prototype. Users write their analytic jobs as RPig scripts. When a task completes and its result has been returned, the RPig framework clears the data stored in the R session and kills the corresponding process.