
An Asymmetric Multi-core Architecture for Accelerating Critical Sections - PowerPoint PPT Presentation



  1. An Asymmetric Multi-core Architecture for Accelerating Critical Sections M. Aater Suleman Advisor: Yale Patt HPS Research Group The University of Texas at Austin

  2. Acknowledgements Moinuddin Qureshi (IBM Research, HPS) Onur Mutlu (Microsoft Research, HPS) Eric Sprangle (Intel, HPS) Anwar Rohillah (Intel) Anwar Ghuloum (Intel) Doug Carmean (Intel)

  3. The Asymmetric Chip Multiprocessor (ACMP) [Figure: three equal-area chip floorplans: the “Niagara” approach (all small, Niagara-like cores), the “Tile-Large” approach (a few large cores), and the ACMP approach (one large core plus many small Niagara-like cores)] ACMP approach: • Provide one large core and many small cores • Accelerate the serial part using the large core • Execute the parallel part on the small cores for high throughput

  4. The 8-Puzzle Problem [Figure: example 8-puzzle board states explored during the search]

  5. The 8-Puzzle Problem [Figure: initial and goal board configurations]
     while (problem not solved)
         SubProblem = PriorityQ.remove()      // critical section
         Solve(SubProblem)
         if (solved) break
         NewSubProblems = Partition(SubProblem)
         PriorityQ.insert(NewSubProblems)     // critical section

  6. Contention for Critical Sections [Figure: execution timelines for Threads 1–4 over t1–t7, marking parallel work, critical sections, and idle time. With contention, threads sit idle waiting for the lock; when critical sections execute 2x faster, idle time shrinks and total execution time drops.]

  7. MySQL Database
     LOCK_open -> Acquire()
     foreach (table locked by thread)
         table.lock -> release()
         table.file -> release()
         if (table.temporary)
             table.close()
     LOCK_open -> Release()

  8. Conventional ACMP [Figure: small cores P1–P4 on an on-chip interconnect; the core executing the critical section is highlighted] Critical-section code: EnterCS(); PriorityQ.insert(…); LeaveCS(). Steps: 1. P2 encounters a critical section 2. P2 sends a request for the lock 3. P2 acquires the lock 4. P2 executes the critical section 5. P2 releases the lock

  9. Accelerated Critical Sections (ACS) [Figure: large core P1 with a Critical Section Request Buffer (CSRB); small cores P2–P4 on the on-chip interconnect] Critical-section code: EnterCS(); PriorityQ.insert(…); LeaveCS(). Steps: 1. P2 encounters a critical section 2. P2 sends a CSCALL request to the CSRB 3. P1 executes the critical section 4. P1 sends a CSDONE signal

  10. Architecture Overview • ISA extensions – CSCALL LOCK_ADDR, TARGET_PC – CSRET LOCK_ADDR • Compiler/library inserts CSCALL/CSRET • On a CSCALL, the small core: – Sends a CSCALL request to the large core • Arguments: lock address, target PC, stack pointer, core ID – Stalls and waits for CSDONE • Large core – Critical Section Request Buffer (CSRB) – Executes the critical section and sends CSDONE to the requesting core

  11. “False” Serialization • Independent critical sections are used to protect disjoint data • Conventional systems can execute independent critical sections concurrently, but ACS can artificially serialize their execution • Selective Acceleration of Critical Sections (SEL) – Augment the CSRB with saturating counters that track false serialization [Figure: CSRB holding requests CSCALL(A), CSCALL(A), CSCALL(B), with saturating counters for locks A and B]

  12. Performance Trade-offs in ACS • Fewer concurrent threads – As the number of cores increases: • the marginal loss in parallel performance decreases • more threads → contention for critical sections increases, which makes their acceleration more beneficial • Overhead of CSCALL/CSDONE – Offset by fewer cache misses for the lock variable • Cache misses for private data – Offset by fewer misses for shared data; cache misses reduce if shared data > private data – The large core can tolerate cache-miss latencies better than small cores

  13. Experimental Methodology • Configurations – One large core is the size of 4 small cores – At chip area equal to N small cores: • Symmetric CMP (SCMP): N small cores, conventional locking • Asymmetric CMP (ACMP): 1 large core, N − 4 small cores, conventional locking • ACS: 1 large core, N − 4 small cores, (N − 4)-entry CSRB • Workloads – 12 critical-section-intensive applications from various domains – 7 use coarse-grain locks and 5 use fine-grain locks • Simulation parameters – x86 cycle-accurate processor simulator – Large core: similar to Pentium-M with 2-way SMT; 2 GHz, out-of-order, 128-entry, 4-wide, 12-stage – Small core: similar to Pentium 1; 2 GHz, in-order, 2-wide, 5-stage – Private 32 KB L1, private 256 KB L2, 8 MB shared L3 – On-chip interconnect: bi-directional ring

  14. Workloads with Coarse-Grain Locks Equal-area comparison; number of threads = best threads. At chip area = 16 small cores: SCMP = 16 small cores; ACMP/ACS = 1 large and 12 small cores. At chip area = 32 small cores: SCMP = 32 small cores; ACMP/ACS = 1 large and 28 small cores. [Figure: speedup results omitted]

  15. Workloads with Fine-Grain Locks [Figure: results at chip area = 16 small cores and chip area = 32 small cores]

  16. Equal-Area Comparisons [Figure: speedup over a single small core vs. chip area in small cores; number of threads = number of cores]

  17. ACS on Symmetric CMP [Figure: results omitted]

  18. Conclusion • ACS reduces average execution time by: – 34% compared to an equal-area SCMP – 23% compared to an equal-area ACMP • ACS improves scalability of 7 of the 12 workloads • Future work will examine resource allocation in ACS in the presence of multiple applications
