Chapter 4
13 Pages

Hadoop Ecosystem

By Rathinaraja Jeyaraj, Ganeshkumar Pugalendhi, Anand Paul

Heavy dependencies among the vertices of a graph require communication, coordination, and global synchronization. Data serialization is the process of packing structured and unstructured objects into a common, compact format and writing them to a stream so that they can be stored or exchanged across heterogeneous applications. Typical machine learning algorithms require multiple iterations to reach a final result. Because developing scalable machine learning algorithms is very difficult, a library of already implemented algorithms is available. Stream processing tools process unbounded data and respond in real time. Log processing is a well-suited application of MapReduce. Logs are generated monotonically across many machines and moved dynamically onto the Hadoop Distributed File System (HDFS) in real time using tools such as Chukwa and Flume. Running and coordinating a set of MapReduce jobs for graph-based applications requires manual effort.
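
To make the log-processing claim concrete, the following is a minimal sketch of the MapReduce pattern applied to logs, written in plain Python rather than against the Hadoop API. The log line format (timestamp, level, message), the sample data, and all function names are assumptions made for illustration; none of them come from the chapter. On a real cluster the shuffle step would be performed by the framework across machines, not by an in-memory dictionary.

```python
from collections import defaultdict

# Hypothetical sample log lines. The "timestamp LEVEL message" layout
# is an assumption for this sketch, not a format from the chapter.
SAMPLE_LOGS = [
    "2024-01-01T00:00:01 INFO service started",
    "2024-01-01T00:00:02 ERROR disk full",
    "2024-01-01T00:00:03 INFO request served",
    "2024-01-01T00:00:04 WARN slow response",
    "2024-01-01T00:00:05 ERROR timeout",
]


def map_phase(line):
    """Map: emit (level, 1) for a well-formed line; skip malformed lines."""
    parts = line.split(maxsplit=2)
    if len(parts) >= 2:
        yield parts[1], 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key (done in memory here)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: sum the counts collected for one log level."""
    return key, sum(values)


def count_levels(lines):
    """Run map, shuffle, and reduce in sequence over an iterable of lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(count_levels(SAMPLE_LOGS))
```

Each line is mapped independently of the others, which is why this kind of job suits MapReduce: the map work can be split across the machines that hold the log blocks, and only the small per-level counts need to be combined in the reduce step.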