ABSTRACT

Department of Electrical and Computer Engineering, University of Connecticut, Storrs, CT, USA

Mieszko Lis, Keun Sup Shim, Myong Hyon Cho, and Srinivas Devadas

Massachusetts Institute of Technology, Cambridge, MA, USA

7.1 Background
7.2 Migration-Based Memory Coherence
  7.2.1 Remote-Access-Only (RA) Architecture
  7.2.2 The Execution Migration Machine (EM2)
  7.2.3 Hybrid EM2 Architecture
  7.2.4 Hardware-Level Migration Framework
  7.2.5 Data Placement

7.3 Analytical Models: Directory Coherence versus EM2
  7.3.1 Interconnect Traversal Costs
  7.3.2 Off-Chip Memory Access Costs
  7.3.3 EM2 Memory Access Latency
  7.3.4 Directory Coherence Memory Access Latency

7.4 Methods
  7.4.1 Architectural Simulation
  7.4.2 On-Chip Interconnect Model
  7.4.3 Application Benchmarks
  7.4.4 Directory-Based Cache Coherence Baseline Selection
  7.4.5 Remote-Access NUCA Baseline Selection
  7.4.6 Cache Size Selection
  7.4.7 Instruction Cache
  7.4.8 Area and Energy Estimation

7.5 Results and Analysis
  7.5.1 Advantages over Directory-Based Cache Coherence
  7.5.2 Advantages over Traditional NUCA (RA)
  7.5.3 Overall Area, Performance, and Energy
  7.5.4 Performance Scaling Potential for EM2 Designs

7.6 Related Work
  7.6.1 Thread Migration
  7.6.2 Remote-Access NUCA and Directory Coherence

7.7 Conclusion

We introduce the concept of deadlock-free, migration-based coherent shared memory to the Non-Uniform Cache Access (NUCA) family of architectures. Migration-based architectures move threads among cores to guarantee sequential semantics in large multicores. Using the Execution Migration Machine (EM2), we achieve performance comparable to directory-based architectures without using directories. Avoiding automatic data replication significantly reduces cache miss rates, while a fast, network-level hardware thread migration scheme exploits shared data locality to reduce the remote cache accesses that limit traditional NUCA performance.