Cache Refill/Access Decoupling for Vector Machines
Christopher Batten, Ronny Krashinsky, Steve Gerding, Krste Asanović
Computer Science and Artificial Intelligence Laboratory
Massachusetts Institute of Technology
December 8, 2004
Cache Refill/Access Decoupling for Vector Machines
• Intuition
  – Motivation and Background
  – Cache Refill/Access Decoupling
  – Vector Segment Memory Accesses
• Evaluation
  – The SCALE Vector-Thread Processor
  – Selected Results
Turning access parallelism into performance is challenging
• Target application domain: applications with ample memory access parallelism
  – Streaming
  – Embedded
  – Media
  – Graphics
  – Scientific
• Techniques for high bandwidth memory systems
  – DDR interfaces
  – Interleaved banks
  – Extensive pipelining
[Figure: applications with ample memory access parallelism → processor architecture → modern high bandwidth memory system]
Turning access parallelism into performance is challenging
• Many architectures have difficulty turning memory access parallelism into performance since they are unable to fully saturate their memory systems
Turning access parallelism into performance is challenging
• Memory access parallelism is poorly encoded in a scalar ISA
• Supporting many in-flight accesses is very expensive
Turning access parallelism into performance is challenging
• A vector architecture encodes memory access parallelism compactly in the ISA
• Even so, supporting many in-flight accesses remains very expensive
Turning access parallelism into performance is challenging
• A non-blocking data cache helps reduce off-chip bandwidth costs at the expense of additional on-chip hardware
[Figure: vector architecture → non-blocking data cache → main memory]
Each in-flight access has an associated hardware cost Processor Cache Memory Primary Miss 100 Cycle Cache Refill Memory Latency
Each in-flight access has an associated hardware cost Processor Cache Memory Primary Miss 100 Cycle Cache Refill Memory Latency Access Management State Reserved Element Data Buffering
Saturating modern memory systems requires many in-flight accesses
• Main memory bandwidth-delay product:
  1 element/cycle × 100 cycles = 100 in-flight elements
Caches increase the effective bandwidth-delay product
• With the cache servicing hits alongside outstanding refills, effective bandwidth doubles:
  2 elements/cycle × 100 cycles = 200 in-flight elements
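The arithmetic on these two slides is an application of Little's law (concurrency = throughput × latency). A minimal sketch, using the illustrative numbers from the slides rather than measured hardware parameters:

```python
# Bandwidth-delay product: how many in-flight elements are needed
# to keep the memory system saturated.

def in_flight_elements(bandwidth_elems_per_cycle: float, latency_cycles: int) -> int:
    """Little's law: required concurrency = throughput x latency."""
    return int(bandwidth_elems_per_cycle * latency_cycles)

# Main memory alone: 1 element/cycle over a 100-cycle latency.
print(in_flight_elements(1, 100))  # 100 in-flight elements

# A cache that services hits in parallel with refills doubles the
# effective bandwidth, so the machine must track twice as many
# in-flight elements to stay saturated.
print(in_flight_elements(2, 100))  # 200 in-flight elements
```

This is why adding a cache, counterintuitively, increases the amount of miss-tracking hardware a machine needs to reach peak memory throughput.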
Goal For This Work
• Reduce the hardware cost of non-blocking caches in vector machines while still turning access parallelism into performance by saturating the memory system
In a basic vector machine a single vector instruction operates on a vector of data
• Example sequence:
  vlw  vr2, r1
  vadd vr0, vr1, vr2
  vsw  vr0, r2
[Figure: control processor driving a vector processor with functional units, vector registers vr0–vr2, and a memory unit connected to the memory system]
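For illustration, the three-instruction sequence on this slide implements an elementwise add over a whole vector; a minimal sketch of the same computation (the function and variable names are mine, not from the talk):

```python
# Elementwise add expressed by the slide's vector sequence:
#   vlw  vr2, r1        # vr2 <- memory starting at address r1
#   vadd vr0, vr1, vr2  # vr0 <- vr1 + vr2
#   vsw  vr0, r2        # memory starting at address r2 <- vr0
# A single vector instruction names a whole vector's worth of memory
# accesses, which is what exposes access parallelism to the hardware.

def vector_add(vr1, mem, r1, r2, vlen):
    vr2 = [mem[r1 + i] for i in range(vlen)]   # vlw
    vr0 = [a + b for a, b in zip(vr1, vr2)]    # vadd
    for i in range(vlen):                      # vsw
        mem[r2 + i] = vr0[i]
    return vr0
```

A scalar ISA would instead issue these accesses one load or store at a time, hiding the parallelism from the memory system.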
In a decoupled vector machine the vector units are connected by queues
• The control processor sends commands through the VEU-CmdQ, VLU-CmdQ, and VSU-CmdQ to the vector execution unit (VEU), vector load unit (VLU), and vector store unit (VSU)
• Load data returns through the vector load data queue (VLDQ); store data leaves through the vector store data queue (VSDQ)
Non-blocking caches require extra state to manage outstanding misses
• In addition to the tag and data arrays, the cache needs miss tags (MSHRs) and replay queues
[Figure: decoupled vector machine (control processor, VEU, VLU, VSU with command and data queues) in front of a non-blocking cache and main memory]
Walkthrough of a vector load:
• The control processor issues a vector load command to the vector units
• The vector load unit reserves storage in the vector load data queue
• If the request is a hit, data is written into the VLDQ
• The VEU executes a writeback command to move data into the architectural register
On a primary miss, the cache allocates a new miss tag and replay queue entry, and issues a refill to main memory
• Replay queue entries hold:
  – Target register specifier
  – Cache line offset
  – Other management state
On a secondary miss, the cache just allocates a new replay queue entry; the refill is already in flight
The processor is free to continue issuing requests, which may hit in the cache
When the refill returns from memory, the cache replays each pending access from the replay queue
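The primary/secondary-miss handling walked through above can be sketched as a toy model; the class and field names here are my own illustration, not SCALE's actual structures:

```python
# Toy non-blocking cache: a primary miss allocates a miss tag (MSHR)
# and issues one refill; a secondary miss to the same line only adds
# a replay queue entry, so one refill can serve many accesses.

class NonBlockingCache:
    def __init__(self):
        self.lines = set()        # cache lines currently present
        self.replay = {}          # line -> pending (target_reg, offset) entries
        self.refills_issued = 0

    def access(self, line, target_reg, offset):
        if line in self.lines:
            return "hit"
        if line in self.replay:   # secondary miss: refill already in flight
            self.replay[line].append((target_reg, offset))
            return "secondary-miss"
        self.replay[line] = [(target_reg, offset)]  # primary miss: new miss tag
        self.refills_issued += 1  # one refill per line, not per access
        return "primary-miss"

    def refill_returns(self, line):
        self.lines.add(line)
        return self.replay.pop(line)  # replay every pending access

cache = NonBlockingCache()
print(cache.access(0x40, "vr2", 0))   # primary-miss
print(cache.access(0x40, "vr2", 4))   # secondary-miss
print(cache.refill_returns(0x40))     # both pending accesses replayed
print(cache.access(0x40, "vr2", 8))   # hit
print(cache.refills_issued)           # 1
```

The cost the talk is targeting is visible here: every pending access occupies a replay queue entry (plus reserved VLDQ space) for the full memory latency.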
Expensive hardware is required to support many in-flight accesses: miss tags, replay queue entries, and reserved VLDQ/VSDQ storage all scale with the number of outstanding misses
Effective decoupling requires command and data queuing
• Command queues (VLU-CmdQ, VEU-CmdQ, VSU-CmdQ) let the control processor run ahead of the vector units
• Data queues (VLDQ and VSDQ entries) buffer load and store data in flight
[Figure: program execution timeline for CP, VLU, VEU, and VSU against the cache tags/data and main memory]
Saturating the memory system with many misses requires additional queuing
• Miss tags and replay queue entries in the cache, plus larger VLDQ and VSDQ, grow with the number of outstanding misses
• All of this queuing must cover the memory bandwidth-delay product