ABSTRACT

Machine learning (ML) and deep learning (DL) algorithms are emerging as a new driving force in the evolution of computer architecture. With the ever-wider adoption of ML/DL techniques in the Cloud and high-performance computing (HPC) domains, several new architectures (spanning from chips to entire distributed systems) have been brought to market to better support applications based on ML/DL algorithms. While HPC and Cloud long remained distinct domains with their own challenges, a growing number of new applications is pushing for their rapid convergence. In this context, many accelerators (GP-GPUs, FPGAs) and customised ASICs (e.g., Google TPUs, Intel Neural Network Processor – NNP) with dedicated functionalities have been proposed, further enlarging the data center heterogeneity landscape. Supporting such heterogeneity at scale demands an adequate software environment (an orchestration tool) able to maximise flexibility and productivity while extracting maximum performance from the underlying hardware. To this end, a comprehensive overview of the current state of the art in hardware and software heterogeneity, covering the whole spectrum of a modern Cloud/HPC system architecture, is first given. ECRAE is then presented: an orchestration solution devised to deal explicitly with heterogeneous devices deployed at scale.