ABSTRACT

There are three important dimensions of parallelism: message passing between nodes, message passing and threading within a node, and vectorization to exploit single instruction, multiple data (SIMD) units. When identifying parallelism within an application, care must be taken to avoid moving data from the local caches of one processor to those of another. On systems such as NVIDIA graphics processing units (GPUs), many more active threads are required to amortize the latency to memory: while some threads wait for operands to reach the registers, other threads can keep the functional units busy. Multicore Xeon and manycore Intel Xeon Phi systems can handle a significant amount of Message Passing Interface (MPI) traffic on the node, whereas GPU systems cannot. When an application must scale to millions of degrees of parallelism, the division of the problem's dimensions across MPI ranks, threads, and SIMD units becomes an important design criterion.