Cache Refill/Access Decoupling for Vector Machines


  1. Cache Refill/Access Decoupling for Vector Machines. Christopher Batten, Ronny Krashinsky, Steve Gerding, Krste Asanović. Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology. December 8, 2004.

  2. Cache Refill/Access Decoupling for Vector Machines
     • Intuition
       – Motivation and Background
       – Cache Refill/Access Decoupling
       – Vector Segment Memory Accesses
     • Evaluation
       – The SCALE Vector-Thread Processor
       – Selected Results

  3. Turning access parallelism into performance is challenging.
     [Diagram: Applications with Ample Memory Access Parallelism → Processor Architecture → Modern High-Bandwidth Memory Systems]

  4. Turning access parallelism into performance is challenging.
     Target application domain: streaming, embedded, media, graphics, and scientific workloads.
     [Diagram: Applications with Ample Memory Access Parallelism → Processor Architecture → Modern High-Bandwidth Memory Systems]

  5. Turning access parallelism into performance is challenging.
     Target application domain: streaming, embedded, media, graphics, and scientific workloads.
     Techniques for high-bandwidth memory systems: DDR interfaces, interleaved banks, extensive pipelining.
     [Diagram: Applications with Ample Memory Access Parallelism → Processor Architecture → Modern High-Bandwidth Memory Systems]

  6. Turning access parallelism into performance is challenging.
     Many architectures have difficulty turning memory access parallelism into performance because they are unable to fully saturate their memory systems.
     [Diagram: Applications with Ample Memory Access Parallelism → Processor Architecture → Modern High-Bandwidth Memory Systems]

  7. Turning access parallelism into performance is challenging.
     Memory access parallelism is poorly encoded in a scalar ISA, and supporting many in-flight accesses is very expensive.
     [Diagram: Applications with Ample Memory Access Parallelism → Processor Architecture → Modern High-Bandwidth Memory Systems]

  8. Turning access parallelism into performance is challenging.
     Supporting many in-flight accesses is very expensive.
     [Diagram: Applications with Ample Memory Access Parallelism → Vector Architecture → Modern High-Bandwidth Memory Systems]

  9. Turning access parallelism into performance is challenging.
     A data cache helps reduce off-chip bandwidth costs at the expense of additional on-chip hardware.
     [Diagram: Applications with Ample Memory Access Parallelism → Vector Architecture → Non-Blocking Data Cache → Main Memory]

  10. Each in-flight access has an associated hardware cost.
     [Diagram: Processor → Cache → Memory; a primary miss triggers a cache refill across a 100-cycle memory latency]

  11. Each in-flight access has an associated hardware cost.
     The cost includes access management state and reserved element data buffering.
     [Diagram: Processor → Cache → Memory; a primary miss triggers a cache refill across a 100-cycle memory latency]

  12. Saturating modern memory systems requires many in-flight accesses.
     Bandwidth-delay product: 1 element/cycle × 100 cycles = 100 in-flight elements.
     [Diagram: Processor → Cache → Memory; primary and secondary misses, cache refills, and the 100-cycle main memory latency]

  13. Caches increase the effective bandwidth-delay product.
     Effective bandwidth-delay product: 2 elements/cycle × 100 cycles = 200 in-flight elements.
     [Diagram: Processor → Cache → Memory; primary and secondary misses, cache refills, and the 100-cycle main memory latency]
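The arithmetic on slides 12 and 13 can be checked with a short script. The 100-cycle latency and the one- and two-element-per-cycle bandwidths are the deck's example numbers, not fixed parameters:

```python
def in_flight_elements(bandwidth_elems_per_cycle, latency_cycles):
    """Bandwidth-delay product: the number of elements that must be
    in flight to keep the memory pipeline fully utilized."""
    return bandwidth_elems_per_cycle * latency_cycles

# Slide 12: 1 element/cycle over a 100-cycle memory latency
print(in_flight_elements(1, 100))  # 100 in-flight elements

# Slide 13: cache hits double effective bandwidth to 2 elements/cycle
print(in_flight_elements(2, 100))  # 200 in-flight elements
```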

  14. Goal for this work: reduce the hardware cost of non-blocking caches in vector machines while still turning access parallelism into performance by saturating the memory system.

  15. In a basic vector machine, a single vector instruction operates on a vector of data.
     [Diagram: a Control Processor drives a Vector Processor containing four functional units (FU), vector registers vr0–vr2, and a Memory Unit connected to the Memory System]

  16. In a basic vector machine, a single vector instruction operates on a vector of data.
     Example: vlw vr2, r1 (vector load).
     [Diagram: as in slide 15, executing the vector load]

  17. In a basic vector machine, a single vector instruction operates on a vector of data.
     Example: vadd vr0, vr1, vr2 (vector add).
     [Diagram: as in slide 15, executing the vector add]

  18. In a basic vector machine, a single vector instruction operates on a vector of data.
     Example: vsw vr0, r2 (vector store).
     [Diagram: as in slide 15, executing the vector store]
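The three instructions on slides 16–18 (vector load, vector add, vector store) together compute an element-wise addition. A minimal scalar sketch of their semantics, assuming a vector length of four to match the four functional units in the diagram (the source of vr1 and the concrete addresses are illustrative, not from the deck):

```python
VL = 4  # assumed vector length

def vlw(memory, base):
    """vlw vrD, rA: load VL consecutive words starting at address rA."""
    return [memory[base + i] for i in range(VL)]

def vadd(vra, vrb):
    """vadd vrD, vrA, vrB: element-wise addition of two vector registers."""
    return [a + b for a, b in zip(vra, vrb)]

def vsw(memory, base, vr):
    """vsw vrA, rB: store VL consecutive words starting at address rB."""
    for i, v in enumerate(vr):
        memory[base + i] = v

mem = {i: i * 10 for i in range(8)}   # toy word-addressed memory
vr1 = vlw(mem, 0)          # illustrative: vr1 loaded earlier
vr2 = vlw(mem, 4)          # vlw vr2, r1
vr0 = vadd(vr1, vr2)       # vadd vr0, vr1, vr2
vsw(mem, 100, vr0)         # vsw vr0, r2
print(vr0)  # [40, 60, 80, 100]
```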

  19. In a decoupled vector machine, the vector units are connected by queues.
     [Diagram: the Control Processor feeds command queues (VEU-CmdQ, VLU-CmdQ, VSU-CmdQ) to the Vector Execution Unit (VEU), Vector Load Unit (VLU), and Vector Store Unit (VSU); the VLU fills the vector load data queue (VLDQ) and the VSU drains the vector store data queue (VSDQ), both connected to the Memory System]

  20. Non-blocking caches require extra state to manage outstanding misses.
     [Diagram: as in slide 19, with the cache between the vector units and Main Memory now showing miss tags, MSHRs, replay queues, and the tag and data arrays]

  21. The control processor issues a vector load command to the vector units.
     [Diagram: as in slide 20]

  22. The vector load unit reserves storage in the vector load data queue.
     [Diagram: as in slide 20]

  23. If the request hits in the cache, the data is written into the VLDQ.
     [Diagram: as in slide 20, with a HIT in the cache]

  24. The VEU executes a writeback command to move the data into an architectural register.
     [Diagram: as in slide 20]

  25. On a primary miss, the cache allocates a new miss tag and a replay queue entry.
     A replay queue entry holds the target register specifier, the cache line offset, and other management state.
     [Diagram: as in slide 20, with a MISS in the cache]

  26. On a primary miss, the cache allocates a new miss tag and replay queue entry, and issues a refill request to main memory.
     [Diagram: as in slide 20, with the MISS triggering a REFILL from Main Memory]

  27. On a secondary miss, the cache just allocates a new replay queue entry.
     [Diagram: as in slide 20, with the refill still outstanding]

  28. The processor is free to continue issuing requests, which may hit in the cache.
     [Diagram: as in slide 20, with a HIT while the MISS refill is outstanding]

  29. When the refill returns from memory, the cache replays each pending access.
     [Diagram: as in slide 20]

  30. When the refill returns from memory, the cache replays each pending access.
     [Diagram: as in slide 20, with a REPLAY of the queued entries]
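Slides 25–30 walk through primary misses, secondary misses, and replay. A toy model of that bookkeeping can make the state transitions concrete. The class below is an illustrative sketch under assumed parameters (4-word cache lines, word addressing), not the SCALE implementation:

```python
class NonBlockingCache:
    """Toy non-blocking cache: a primary miss allocates a miss tag plus a
    replay-queue entry; a secondary miss to the same line only appends a
    replay entry; a refill replays every pending entry for that line."""

    LINE_WORDS = 4  # assumed line size

    def __init__(self):
        self.lines = {}   # line tag -> line data (the tag/data arrays)
        self.mshrs = {}   # miss tag -> replay queue (list of entries)

    def access(self, addr, target_reg):
        tag, offset = divmod(addr, self.LINE_WORDS)
        if tag in self.lines:
            return ("hit", self.lines[tag][offset])
        entry = (target_reg, offset)        # register specifier + line offset
        if tag in self.mshrs:
            self.mshrs[tag].append(entry)   # secondary miss: replay entry only
            return ("secondary_miss", None)
        self.mshrs[tag] = [entry]           # primary miss: new miss tag
        return ("primary_miss", None)       # (refill request goes to memory)

    def refill(self, tag, data):
        self.lines[tag] = data
        # Replay each pending access now that the line is present
        return [(reg, data[off]) for reg, off in self.mshrs.pop(tag)]
```

A short walk-through mirroring the slides: two accesses to the same missing line produce one primary and one secondary miss, and the refill returns both replayed values in order.

```python
cache = NonBlockingCache()
s1, _ = cache.access(0, "vr2[0]")            # primary miss on line 0
s2, _ = cache.access(1, "vr2[1]")            # secondary miss, same line
replays = cache.refill(0, [10, 11, 12, 13])  # refill returns from memory
s3, v = cache.access(2, "vr2[2]")            # now a hit
```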

  31. Expensive hardware is required to support many in-flight accesses.
     [Diagram: as in slide 20]

  32. Effective decoupling requires command and data queuing.
     [Diagram: program-execution timeline showing the CP, VLU, VEU, and VSU running ahead of one another over the cache tags/data and Main Memory]

  33. Effective decoupling requires command and data queuing.
     [Diagram: the timeline with command queues (VLU-CmdQ, VEU-CmdQ, VSU-CmdQ) added for each unit]

  34. Effective decoupling requires command and data queuing.
     [Diagram: the timeline with command queues (VLU-CmdQ, VEU-CmdQ, VSU-CmdQ) added for each unit]

  35. Effective decoupling requires command and data queuing.
     [Diagram: the timeline with VLDQ and VSDQ data entries added alongside the command queues]
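The decoupling on slides 32–35 is a producer/consumer pattern: the control processor runs ahead issuing commands, the VLU runs ahead of the VEU filling the VLDQ, and queue capacity bounds how far ahead each unit can slip. A minimal sketch with Python deques (the command encoding, capacity, and memory contents are assumptions for illustration):

```python
from collections import deque

vlu_cmdq, veu_cmdq = deque(), deque()
vldq = deque()
VLDQ_CAPACITY = 8   # illustrative: bounds how far the VLU runs ahead

memory = {a: a * 2 for a in range(16)}   # toy word-addressed memory

# Control processor: issues decoupled commands for one vector load + use
for addr in range(4):
    vlu_cmdq.append(("load", addr))
veu_cmdq.append(("writeback", 4))        # move 4 elements into vr2

# Vector load unit: runs ahead while the VLDQ has reserved space
while vlu_cmdq and len(vldq) < VLDQ_CAPACITY:
    _, addr = vlu_cmdq.popleft()
    vldq.append(memory[addr])            # data arrives in order in the VLDQ

# Vector execution unit: drains the VLDQ into the architectural register
op, n = veu_cmdq.popleft()
vr2 = [vldq.popleft() for _ in range(n)]
print(vr2)  # [0, 2, 4, 6]
```

Because each unit only touches its own queues, the load unit can issue all four memory accesses before the execution unit consumes any data, which is exactly the slip the timeline diagrams depict.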

  36. Saturating the memory system with many misses requires additional queuing.
     [Diagram: the timeline with miss tags and replay queue entries added alongside the VLDQ/VSDQ entries and command queues]

  39. Saturating the memory system with many misses requires additional queuing.
     [Diagram: as in slide 36, with the total queued state labeled as covering the Bandwidth-Delay Product]
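One way to read slide 39: every structure holding in-flight state (VLDQ entries, miss tags, replay queue entries) must together cover the bandwidth-delay product. A back-of-the-envelope sizing sketch, where the 4-word line size and the misses-per-line figure are assumed illustrative parameters and only the 2-element/cycle, 100-cycle numbers come from the deck:

```python
def required_in_flight(bandwidth_elems_per_cycle, latency_cycles):
    """Bandwidth-delay product in elements."""
    return bandwidth_elems_per_cycle * latency_cycles

def queue_budget(line_words, in_flight_elems, misses_per_line):
    """Illustrative breakdown of where in-flight state lives when each
    in-flight cache line covers several pending element accesses."""
    lines_in_flight = in_flight_elems // line_words
    return {
        "vldq_entries": in_flight_elems,   # reserved element data buffering
        "miss_tags": lines_in_flight,      # one per in-flight line refill
        "replay_entries": lines_in_flight * misses_per_line,
    }

# 2 elements/cycle * 100 cycles, 4-word lines, 4 pending accesses per line
print(queue_budget(4, required_in_flight(2, 100), 4))
```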
