ABSTRACT

General-purpose graphics processing units (GP-GPUs) have emerged as a powerful computing platform in the multi-core era. GP-GPUs typically employ hundreds of cores and run thousands of threads to exploit the data parallelism inherent in GP-GPU applications. L1 data caches in GPUs have gained traction in recent years, and with the emergence of GP-GPU applications they have become performance critical. Typically, an L1 data cache is shared by multiple cores and hundreds of threads. In this paper, we characterize the performance of the L1 data cache in a GPU, using general-purpose applications such as those in the Rodinia benchmark suite. We vary the cache size from 32 KB to 256 KB, the associativity from 32 to 256, and the number of banks from 1 to 8. We observe a very high miss rate for most of the applications. This high miss rate is due to the large working set sizes of GP-GPU applications, and it limits the performance gained from increasing the cache size, associativity, or number of banks.