ABSTRACT

Data warehousing is typically associated with enterprise relational databases, and therefore with operations and analytics on highly structured data. Big data, by contrast, applies order at the time of analysis to a view of unstructured data, and may also deal with structured data. To be processed usefully, big data is typically analyzed in a massively parallel environment. A big data analytics service might feature hundreds or more computational nodes, each working on a discrete section of the data. Because big data systems are usually built on large-scale distributed filesystems, they can also be node- and locality-aware: multiple copies of a file may be stored on nodes in the same rack, with an additional copy stored on a node in a physically separate rack. In the same spirit of reducing the impact of a backup operation on production infrastructure, the backup operation could be integrated into a cluster replication target in a big data system.