ABSTRACT

CONTENTS

Introduction
VM Placement for Reducing Elephant Flow Impact
Topology Design
Conventional Networking
  Routing
  Flow Scheduling
  Limitations
Software-Defined Networking
  Software-Defined Networks
  Traffic-Aware Networking
  Application-Aware Networking
Conclusions
References

INTRODUCTION

Big Data applications play a crucial role in our evolving society. They represent a large proportion of cloud usage [1-3] because the cloud offers distributed, online storage and elastic computing services. Indeed, Big Data applications need to scale their computing and storage requirements on the fly. With the recent improvements in virtual computing, data centers can thus offer a virtualized infrastructure that fits custom requirements. This flexibility has been a decisive enabler for the success of Big Data applications in recent years. As an example, many Big Data applications rely, directly or indirectly, on Apache Hadoop [11], which is the most popular implementation of the MapReduce programming model [4]. From a general perspective, MapReduce consists in distributing computing tasks between mappers and reducers. Mappers produce intermediate results, which are aggregated in a second stage by reducers. This process is illustrated in Figure 7.1a, where the mappers send partial results (values) to specific reducers based on some keys. The reducers are then in charge of applying a function (such as sum, average, or another aggregation function) to the whole set of values corresponding to a single key. This architectural pattern is fault tolerant and scalable. Another interesting feature of this paradigm is the execution environment of the code. In Hadoop, the code is executed directly near the data it operates on, in order to limit data transfers within the cluster. However, large chunks of data are still transferred between the mappers and the reducers (the shuffle phase), which therefore requires an efficient underlying network infrastructure. It is important to note that the shuffle phase does not wait for the mappers to complete before starting, as a mapper already emits (key, value) pairs based on the partial data it has read from the source (e.g., for each line). Since failures or bottlenecks can occur, Hadoop tasks are constantly monitored. If one of the components (i.e., a mapper or a reducer) is not functioning well (e.g., it does not progress as fast as the others), it can be duplicated onto another node to balance the load. In such a case, this also leads to additional data transfers.
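To make the map/shuffle/reduce data flow concrete, the following is a minimal, single-process Python sketch of the pattern described above, using a word-count-style sum per key. The names mapper, shuffle, and reducer and the sample input are illustrative assumptions for this chapter, not Hadoop's actual API; in a real Hadoop job the map and reduce tasks run on distributed nodes and the shuffle step transfers the grouped values over the cluster network.

from collections import defaultdict
from typing import Callable, Iterable, Tuple

# A record here is simply a line of text; keys and values are illustrative.
Pair = Tuple[str, int]

def mapper(line: str) -> Iterable[Pair]:
    """Emit a (key, value) pair per word, without waiting for the whole input."""
    for word in line.split():
        yield (word, 1)

def shuffle(pairs: Iterable[Pair]) -> dict:
    """Group all values by key, as the shuffle phase does between mappers and reducers."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(key: str, values: list, aggregate: Callable[[list], int] = sum) -> Pair:
    """Apply an aggregation function (sum, average, ...) to all values of a single key."""
    return (key, aggregate(values))

if __name__ == "__main__":
    lines = ["big data on the cloud", "big data needs the network"]
    # Map: each line is processed independently (in Hadoop, near the data it resides on).
    intermediate = (pair for line in lines for pair in mapper(line))
    # Shuffle: values are regrouped by key before being handed to the responsible reducer.
    grouped = shuffle(intermediate)
    # Reduce: one aggregated result per key.
    results = [reducer(key, values) for key, values in grouped.items()]
    print(results)  # e.g., [('big', 2), ('data', 2), ('on', 1), ...]

In this sketch the shuffle step is a local dictionary, but in a cluster it is exactly the regrouping of intermediate (key, value) pairs across nodes that generates the large data transfers the rest of this chapter is concerned with.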