ABSTRACT

This chapter deals with the most important aspect of achieving a performance gain on hybrid multi/manycore systems: effective use of the memory hierarchy. As chips built for high-performance computing become denser, they also become more complicated and more difficult to utilize effectively. To supply operands to this increased computational power, a very complicated memory hierarchy is used to mitigate the relatively low bandwidth to main memory. When an important kernel contains multiple DO loops, it can pay to count the amount of data the loops touch and see whether strip mining can reduce the memory footprint and improve cache performance. If the operands all came from main memory, performance would be far less than the 27 GFlops, because the operation would be limited by the memory bandwidth of 4 GB/sec. When a loop has more array references than computations and there is no data reuse, the code tends to be limited by memory bandwidth.
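
The strip-mining idea mentioned above can be illustrated with a minimal sketch. It is written in C rather than as Fortran DO loops, and every name in it (`two_sweeps`, `two_sweeps_strip_mined`, `STRIP`) is hypothetical, not taken from the chapter. The strip length of 1024 elements is only an assumed placeholder; a suitable value depends on the cache sizes of the target machine. The first routine makes two full sweeps over the arrays, so for large `n` the intermediate array may be evicted from cache before the second loop reads it. The second routine runs both loops over one strip at a time, so the data touched by the first loop is more likely to still be in cache when the second loop uses it.

```c
#include <stddef.h>

/* Assumed strip length in elements (hypothetical value). Choose it so that
 * one strip of each array fits in the cache level being targeted. */
#define STRIP 1024

/* Two full sweeps: for large n, tmp[] and a[] may fall out of cache
 * between the first loop and the second. */
void two_sweeps(const double *a, double *tmp, double *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        tmp[i] = 2.0 * a[i] + 1.0;
    for (size_t i = 0; i < n; i++)
        out[i] = tmp[i] * a[i];
}

/* Strip-mined version: the outer loop walks the arrays in strips, and both
 * inner loops run over the same strip before moving on. The arithmetic per
 * element is unchanged; only the order of memory accesses differs. */
void two_sweeps_strip_mined(const double *a, double *tmp, double *out, size_t n)
{
    for (size_t start = 0; start < n; start += STRIP) {
        /* Clamp the last strip to the end of the arrays. */
        size_t end = (n - start < STRIP) ? n : start + STRIP;

        for (size_t i = start; i < end; i++)
            tmp[i] = 2.0 * a[i] + 1.0;
        for (size_t i = start; i < end; i++)
            out[i] = tmp[i] * a[i];
    }
}
```

Both routines produce identical results, so the strip-mined form can be validated against the original before any timing is done. Whether it is actually faster depends on the array sizes relative to the cache and has to be measured on the target system.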