ABSTRACT

Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, TX, USA

Onur Mutlu

Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA, USA

5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154 5.2 The Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155 5.3 Accelerated Critical Sections (ACS) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157

5.3.1 Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157 5.3.2 False Serialization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159

5.4 Trade-off Analysis: Why ACS Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160 5.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161

5.5.1 Performance at Optimal Number of Threads . . . . . . . . . . . 161 5.5.2 Performance When the Number of Threads Equals the

Number of Contexts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162 5.5.3 Application Scalability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162 5.5.4 ACS on Symmetric CMP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164 5.5.5 ACS versus Techniques to Hide Critical Section Latency 164

5.6 Contributions and Impact . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165 5.7 Related Previous Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167

5.7.1 Improving Locality of Shared Data and Locks . . . . . . . . . . 167 5.7.2 Hiding the Latency of Critical Sections . . . . . . . . . . . . . . . . . 167 5.7.3 Asymmetric CMPs and CoreFusion . . . . . . . . . . . . . . . . . . . . . 168 5.7.4 Remote Procedure Calls . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168

5.8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168

Architecture,

Improving the performance of a single program using multiple processor cores requires that the application be partitioned into threads that execute concurrently on multiple cores. The principle of mutual exclusion dictates that threads cannot be allowed to update shared data concurrently; thus, accesses to shared data are encapsulated inside critical sections. Only one thread executes a critical section at a given time; other threads wanting to execute the same critical section must wait. Critical sections can serialize threads, thereby reducing performance and scalability (i.e., the number of threads at which performance saturates). This performance loss due to critical sections can be reduced by shortening the execution time inside critical sections.