

  1. The Vector-Thread Architecture. Ronny Krashinsky, Christopher Batten, Mark Hampton, Steve Gerding, Brian Pharris, Jared Casper, Krste Asanovic. MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, MA, USA. ISCA 2004.

  2. Goals For Vector-Thread Architecture
  • Primary goal is efficiency: high performance with low energy and small area
  • Take advantage of whatever parallelism and locality is available: DLP, TLP, ILP
    - Allow intermixing of multiple levels of parallelism
  • The programming model is key
    - Encode parallelism and locality in a way that enables a complexity-effective implementation
    - Provide clean abstractions to simplify coding and compilation

  3. Vector and Multithreaded Architectures
  [Diagram: a control processor driving PE0..PEN under vector control, versus independently threaded PE0..PEN, both over a shared memory]
  • Vector processors provide efficient DLP execution
    - Amortize instruction control
    - Amortize loop bookkeeping overhead
    - Exploit structured memory accesses
    - Unable to execute loops with loop-carried dependencies or complex internal control flow
  • Multithreaded processors can flexibly exploit TLP
    - Unable to amortize common control overhead across threads
    - Unable to exploit structured memory accesses across threads
    - Costly memory-based synchronization and communication between threads

  4. Vector-Thread Architecture
  • VT unifies the vector and multithreaded compute models
  • A control processor interacts with a vector of virtual processors (VPs)
  • Vector-fetch: the control processor fetches instructions for all VPs in parallel
  • Thread-fetch: a VP fetches its own instructions
  • VT allows a seamless intermixing of vector and thread control
  [Diagram: control processor vector-fetching to VP0..VPN over memory, with each VP able to thread-fetch for itself]

  5. Outline
  • Vector-Thread Architectural Paradigm
    - Abstract model
    - Physical model
  • SCALE VT Processor
  • Evaluation
  • Related Work

  6. Virtual Processor Abstraction
  • VPs contain a set of registers
  • VPs execute RISC-like instructions grouped into atomic instruction blocks (AIBs)
  • VPs have no automatic program counter; AIBs must be explicitly fetched
    - VPs contain pending vector-fetch and thread-fetch addresses
  • A fetch instruction allows a VP to fetch its own AIB (sketched below)
    - May be predicated for conditional branches
  • If an AIB does not execute a fetch, the VP thread stops
  [Diagram: a VP with registers and ALUs; VP thread execution as a chain of AIBs linked by thread-fetches]
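A minimal runnable sketch of this fetch-driven execution model. The register count, the li/add/slt/fetch instruction set, and the dict-of-AIBs encoding are all invented for illustration; they are not the SCALE ISA:

```python
class VP:
    def __init__(self, aibs):
        self.regs = [0] * 8          # a VP's private register state
        self.aibs = aibs             # AIB name -> list of instructions
        self.pending_fetch = None    # pending vector-/thread-fetch address

    def run(self):
        # No automatic program counter: the thread continues only while
        # some instruction in the current AIB issued another fetch.
        while self.pending_fetch is not None:
            ops = self.aibs[self.pending_fetch]
            self.pending_fetch = None
            for op, *a in ops:
                if op == "li":    self.regs[a[0]] = a[1]
                elif op == "add": self.regs[a[0]] = self.regs[a[1]] + self.regs[a[2]]
                elif op == "slt": self.regs[a[0]] = int(self.regs[a[1]] < a[2])
                elif op == "fetch" and self.regs[a[1]]:   # predicated fetch
                    self.pending_fetch = a[0]

vp = VP({
    "init": [("li", 0, 0), ("li", 1, 3), ("li", 2, 1), ("fetch", "body", 2)],
    "body": [("add", 0, 0, 1),       # r0 += r1
             ("slt", 2, 0, 10),      # r2 = (r0 < 10)
             ("fetch", "body", 2)],  # keep looping while r2 holds
})
vp.pending_fetch = "init"            # stands in for an initial vector-fetch
vp.run()
print(vp.regs[0])                    # -> 12; the thread stopped once r0 >= 10
```

The property the sketch preserves is that execution is pulled forward only by explicit fetches; an AIB that ends without fetching is exactly how a VP thread terminates.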

  7. Virtual Processor Vector
  • A VT architecture includes a control processor and a virtual processor vector
    - Two interacting instruction sets
  • A vector-fetch command allows the control processor to fetch an AIB for all the VPs in parallel
  • Vector-load and vector-store commands transfer blocks of data between memory and the VP registers (sketched below)
  [Diagram: control processor issuing vector-fetch to VP0..VPN; a vector memory unit moving vector-load/vector-store data between memory and the VP register files]
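A companion sketch of the control-processor side, under the same caveat that the command and register names are illustrative rather than SCALE assembly:

```python
# vector-load, vector-fetch, and vector-store acting on every VP at once
NUM_VPS = 8
vps = [{"r0": 0, "r1": 0} for _ in range(NUM_VPS)]

def vector_load(block, reg):
    for i, vp in enumerate(vps):     # element i lands in VP i's register
        vp[reg] = block[i]

def vector_fetch(aib):
    for vp in vps:                   # the same AIB runs on every VP,
        aib(vp)                      # over that VP's own registers

def vector_store(block, reg):
    for i, vp in enumerate(vps):
        block[i] = vp[reg]

def aib_add(vp):                     # a one-instruction AIB: r0 = r0 + r1
    vp["r0"] = vp["r0"] + vp["r1"]

a, b, out = list(range(8)), [10] * 8, [0] * 8
vector_load(a, "r0")
vector_load(b, "r1")
vector_fetch(aib_add)
vector_store(out, "r0")
print(out)                           # [10, 11, 12, 13, 14, 15, 16, 17]
```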

  8. Cross-VP Data Transfers
  • Cross-VP connections provide fine-grain data operand communication and synchronization (sketched below)
    - VP instructions may target nextVP as a destination or use prevVP as a source
    - The crossVP queue holds wrap-around data; the control processor can push and pop
    - The restricted ring communication pattern is cheap to implement, scalable, and matches the software usage model for VPs
  [Diagram: VP0..VPN connected in a ring through the crossVP queue, with crossVP-push and crossVP-pop from the control processor]
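A sketch of the ring pattern in scalar Python, with the crossVP queue modeled as a deque that the control processor pushes a seed into and pops the wrap-around result from (all names illustrative):

```python
from collections import deque

NUM_VPS = 4
cross_vp_queue = deque([100])        # control processor pushes the seed

a = [1, 2, 3, 4]
prev_val = cross_vp_queue.popleft()  # VP0's prevVP comes from the queue
for i in range(NUM_VPS):             # VPs execute in ring order
    next_val = prev_val + a[i]       # AIB: nextVP = prevVP + a[i]
    prev_val = next_val              # VP i's nextVP is VP i+1's prevVP
cross_vp_queue.append(prev_val)      # the last VP's send wraps around

print(cross_vp_queue.pop())          # control processor pops 100+1+2+3+4 = 110
```

Because each VP talks only to its neighbors, the interconnect stays a fixed ring regardless of the vector length, which is what makes the pattern cheap and scalable.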

  9. Mapping Loops to VT
  • A broad class of loops map naturally to VT
    - Vectorizable loops
    - Loops with loop-carried dependencies
    - Loops with internal control flow
  • Each VP executes one loop iteration
    - The control processor manages the execution
    - Stripmining enables implementation-dependent vector lengths (sketched below)
  • The programmer or compiler only schedules one loop iteration on one VP
    - No cross-iteration scheduling
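A sketch of stripmining, assuming a hypothetical maximum vector length `vlmax`; the commands named in the comment follow the slides, but the lowering itself is illustrative:

```python
def stripmined_add(a, b, vlmax=16):
    out = [0] * len(a)
    i = 0
    while i < len(a):
        vl = min(vlmax, len(a) - i)   # set vector length for this strip
        # Per strip: vector-load a[i:i+vl], vector-load b[i:i+vl],
        # vector-fetch the add AIB, vector-store the results.
        out[i:i+vl] = [x + y for x, y in zip(a[i:i+vl], b[i:i+vl])]
        i += vl
    return out

print(stripmined_add(list(range(40)), list(range(40)))[-3:])  # [74, 76, 78]
```

Since the strip loop queries the vector length at run time, the same binary runs on implementations with different numbers of VPs.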

  10. Vectorizable Loops
  • Data-parallel loops with no internal control flow are mapped using vector commands (sketched below)
    - Predication handles small conditionals
  [Diagram: a loop-iteration DAG (two loads feeding a multiply, a shift, and an add, then a store) and its execution across VP0..VPN: vector-load, vector-load, vector-fetch of the compute AIB, vector-store, repeated per strip]
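The iteration DAG in the figure reads as two loads feeding a multiply, a shift, and an add, then a store; the sketch below pairs a source loop of that shape with the command sequence it lowers to. The operation mix is read off the figure, so treat it as an example rather than the paper's exact kernel:

```python
def kernel(a, b, out):
    for i in range(len(out)):       # iteration i runs on one VP
        t = a[i] * b[i]             # ld, ld, multiply
        out[i] = (t << 2) + b[i]    # shift, add, store

# Lowered to VT, each strip becomes:
#   vector-load  a-strip           (one element per VP)
#   vector-load  b-strip
#   vector-fetch {mul, sll, add}   (one three-instruction AIB for all VPs)
#   vector-store out-strip
```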

  11. Loop-Carried Dependencies
  • Loops with cross-iteration dependencies are mapped using vector commands with cross-VP data transfers (sketched below)
    - A vector-fetch introduces a chain of prevVP receives and nextVP sends
    - Vector-memory commands move the data even when the compute is non-vectorizable
  [Diagram: the same loop DAG with a cross-iteration edge; loads and stores run as vector commands while the dependent compute chains from VP to VP]
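A scalar sketch of the recurrence shape this mapping targets (the arithmetic is invented). The one serialized value is exactly what would travel the prevVP/nextVP chain; the loads and stores remain vector-memory commands:

```python
def recurrence(a, seed):
    out = [0] * len(a)
    acc = seed                     # travels the crossVP ring
    for i in range(len(a)):        # VP i: prevVP receive -> compute -> nextVP send
        acc = (acc + a[i]) >> 1    # only this chain is serialized
        out[i] = acc               # stores still issue as a vector command
    return out

print(recurrence([8, 8, 8, 8], seed=0))  # [4, 6, 7, 7]
```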

  12. Loops with Internal Control Flow
  • Data-parallel loops with large conditionals or inner loops are mapped using thread-fetches (sketched below)
    - Vector commands and thread-fetches are freely intermixed
    - Once launched, the VP threads execute to completion before the next control processor command
  [Diagram: after a vector-load and vector-fetch, each VP branches independently; some VPs thread-fetch extra AIBs while others finish early, then a vector-store collects the results]
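A sketch of the control-flow pattern, with each list element standing in for one VP's thread; the branch decides whether that VP fetches a second AIB or stops after the first. The work inside each AIB is invented for illustration:

```python
def body(x):
    # AIB 0 (vector-fetched to all VPs): compare and branch
    if x == 0:
        return x               # this VP skips the extra work and stops
    # AIB 1 (reached only via this VP's own thread-fetch)
    return x * 2

data = [3, 0, 5, 0, 1]
out = [body(x) for x in data]  # each element plays one VP thread
print(out)                     # [6, 0, 10, 0, 2]
```

VPs that take the early exit finish after one AIB, while the others keep fetching; the control processor's next command acts as the barrier that waits for all of them.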

  13. VT Physical Model
  • A vector-thread unit contains an array of lanes with physical register files and execution units
  • VPs map to lanes and share physical resources; VP execution is time-multiplexed on the lanes
  • Independent parallel lanes exploit parallelism across VPs and data-operand locality within VPs
  [Diagram: control processor plus a four-lane vector-thread unit; lane 0 holds VP0, VP4, VP8, VP12, lane 1 holds VP1, VP5, VP9, VP13, and so on, each lane with an ALU, all sharing the vector memory unit and memory]

  14. Lane Execution
  • Lanes execute decoupled from each other
  • A command management unit handles vector-fetch and thread-fetch commands
  • The execution cluster executes instructions in-order from a small AIB cache (e.g. 32 instructions)
    - AIB caches exploit locality to reduce instruction fetch energy (on par with a register read)
  • Execute directives point to AIBs and indicate which VP(s) the AIB should be executed for
    - For a thread-fetch command, the lane executes the AIB for the requesting VP
    - For a vector-fetch command, the lane executes the AIB for every VP
  • AIBs and vector-fetch commands reduce control overhead
    - 10s-100s of instructions executed per fetch-address tag-check, even for non-vectorizable loops (see the arithmetic below)
  [Diagram: lane 0's command management unit queuing vector-fetch and thread-fetch commands, an AIB fill unit servicing misses by address, and execute directives stepping VP0, VP4, VP8, VP12 through the cached AIB]
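Back-of-the-envelope arithmetic for the amortization claim, with illustrative numbers: one tag check covers an execute directive, which covers the whole AIB for one VP (thread-fetch) or for every VP on the lane (vector-fetch):

```python
aib_len = 16           # instructions in the AIB (the cache holds ~32)
vps_per_lane = 8       # illustrative VP count mapped to this lane

per_vector_fetch = aib_len * vps_per_lane   # 128 instructions per tag check
per_thread_fetch = aib_len                  # 16 instructions per tag check
print(per_vector_fetch, per_thread_fetch)   # both land in the 10s-100s range
```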

  15. VP Execution Interleaving
  • Hardware provides the benefits of loop unrolling by interleaving VPs (sketched below)
  • Time-multiplexing can hide thread-fetch, memory, and functional-unit latencies
  [Diagram: a time axis for lanes 0-3, each lane cycling through its four VPs (VP0, VP4, VP8, VP12 on lane 0, and so on) across successive vector-fetches and thread-fetches]
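A sketch of one lane's time-multiplexing, with an invented 2-cycle stall after each issue standing in for a memory or functional-unit latency. VPs whose index is congruent mod 4 share a lane, matching the figure:

```python
lane0_vps = [0, 4, 8, 12]                # VP index % 4 == 0 -> lane 0
ready_at = {vp: 0 for vp in lane0_vps}   # cycle each VP may next issue
work = {vp: 3 for vp in lane0_vps}       # instructions left per VP

cycle = 0
while any(work.values()):
    runnable = [vp for vp in lane0_vps if work[vp] and ready_at[vp] <= cycle]
    if runnable:
        vp = min(runnable, key=lambda v: ready_at[v])  # least-recently issued
        work[vp] -= 1
        ready_at[vp] = cycle + 2         # e.g. a load now in flight
        print(f"cycle {cycle:2}: lane 0 issues for VP{vp}")
    cycle += 1
```

Run it and the lane issues every cycle in a VP0, VP4, VP8, VP12 rotation: with four VPs mapped, each VP's 2-cycle stall is fully hidden behind the others' work.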

  16. VP Execution Interleaving
  • Dynamic scheduling of cross-VP data transfers automatically adapts to the software critical path (in contrast to static software pipelining)
    - No static cross-iteration scheduling
    - Tolerant of variable dynamic latencies
  [Diagram: the same four-lane time axis, with cross-VP sends and receives skewing each lane's execution to follow the dependence chain]

  17. SCALE Vector-Thread Processor
  • SCALE is designed to be a complexity-effective all-purpose embedded processor
    - Exploit all available forms of parallelism and locality to achieve high performance and low energy
  • Constrained to small area (estimated 10 mm² in 0.18 µm)
    - Reduce wire delay and complexity
    - Support tiling of multiple SCALE processors for increased throughput
  • Careful balance between software and hardware for code mapping and scheduling
    - Optimize runtime energy, area efficiency, and performance while maintaining a clean, scalable programming model

  18. SCALE Clusters
  • VPs are partitioned into four clusters to exploit ILP and allow lane implementations to optimize area, energy, and circuit delay
    - Clusters are heterogeneous: c0 can execute loads and stores, c1 can execute fetches, c3 has an integer multiplier/divider
    - Clusters execute decoupled from each other
  [Diagram: a SCALE VP spanning clusters c0-c3; each of the four lanes stacks clusters c0-c3 beside the control processor, AIB fill unit, and L1 cache]

  19. SCALE Registers and VP Configuration
  • Atomic instruction blocks allow VPs to share temporary state that is only valid within the AIB
    - VP general registers are divided into private and shared
    - Chain registers at the ALU inputs avoid reading and writing the general register file, saving energy
  • The number of VP registers in each cluster is configurable (worked through below)
    - The hardware can support more VPs when each has fewer private registers
    - Low overhead: a control processor instruction configures the VPs before entering a stripmine loop; VP state is undefined across reconfigurations
  [Diagram: a cluster register file split between shared registers and per-VP private registers, with chain registers cr0 and cr1 at the ALU inputs; example configurations give 4 VPs with 0 shared and 8 private registers, 7 VPs with 4 shared and 4 private, or 25 VPs with 7 shared and 1 private]
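The three configurations in the figure are consistent with a 32-entry register file per cluster (shared registers allocated once, private registers once per VP). Assuming that sizing, the VP count falls out directly:

```python
def vps_per_lane(shared, private, regfile=32):
    # shared regs are allocated once; each VP then needs `private` regs
    return (regfile - shared) // private

for shared, private in [(0, 8), (4, 4), (7, 1)]:
    print(f"{shared} shared, {private} private -> {vps_per_lane(shared, private)} VPs")
# -> 4, 7, and 25 VPs, matching the three configurations in the figure
```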
