ACCELERATION VIA EXPLICIT DECOUPLED DATA ORCHESTRATION

Michael Pellauer* — 1/26/2019
[Extended version to appear in: ASPLOS 2019]

In collaboration with: Yakun Sophia Shao*, Jason Clemons*, Neal Crago*, Kartik Hegde**, Rangharajan Venkatesan*, Stephen W. Keckler*, Christopher W. Fletcher**, Joel Emer*^

*NVIDIA  **UIUC  ^MIT
ACCELERATORS ARE GREAT.... BUT!

[Figure: a custom datapath connected to off-chip memory]
WHAT IS DATA ORCHESTRATION?

Feeding data to a functional unit exactly when it wants it.

[Figure: Off-Chip I/O → Staging Buffer (Large, Shared) → Staging Buffers (Small, Private) → Datapaths]

• Who: the "actors" that touch data, and their synchronization with each other
• When: data is moved over a transfer substrate
• Where: data is placed in the available staging buffers
• How: data is accessed (strides, patterns, etc.), including when it is no longer needed

ML ASICs use workload knowledge to optimize orchestration at design time, without caches.
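The who/when/where/how decisions above can be made concrete with a small illustrative sketch (all names here are invented for illustration, not from the deck): software walks off-chip data tile by tile through a two-level staging hierarchy before the datapath consumes it.

```python
# Hypothetical sketch of explicit data orchestration through a staging
# hierarchy. "stage_and_consume" and its buffers are illustrative names.

def stage_and_consume(off_chip, tile_size):
    """Walk off-chip data tile by tile through a two-level staging hierarchy."""
    results = []
    for base in range(0, len(off_chip), tile_size):   # HOW: strided access pattern
        shared = off_chip[base:base + tile_size]      # WHEN: transfer to large shared buffer
        private = list(shared)                        # WHERE: copy into small private buffer
        results.append(sum(private))                  # WHO: the datapath consumes it
        # HOW (lifetime): the tile is no longer needed after this iteration,
        # so its staging space is implicitly reclaimed here
    return results

print(stage_and_consume(list(range(8)), 4))  # two tiles: [0..3] and [4..7] -> [6, 22]
```

In a real accelerator each of these steps is a distinct hardware actor; the point of the sketch is only to separate the four orchestration questions.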
GUIDING PRINCIPLES FOR EFFICIENT DATA ORCHESTRATION

• Bandwidth efficiency: maximize data delivery rate by controlling outstanding requests
• Local reuse: data staged physically close to the consuming units
• Cross-unit use: amortize access and communication costs
• Simple structures: minimize hardware area/power
• Delivery/use overlap: the next tile should be available when the current one is done (e.g., double-buffering)
• Precise synchronization: only wait for exactly the data you need, and respond quickly (e.g., no barriers or remote polling)
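The delivery/use-overlap principle is what double-buffering provides. A minimal sketch (sequential Python, so the overlap is only schematic — in hardware the fetch and the compute proceed concurrently):

```python
# Illustrative double-buffering sketch: two buffers alternate roles so the
# next tile is fetched while the current one is being computed on.

def double_buffered(tiles, compute):
    """Overlap 'fetch next tile' with 'compute on current tile' using two buffers."""
    if not tiles:
        return []
    bufs = [None, None]
    out = []
    bufs[0] = list(tiles[0])                        # prefetch the first tile
    for i in range(len(tiles)):
        cur = bufs[i % 2]
        if i + 1 < len(tiles):
            bufs[(i + 1) % 2] = list(tiles[i + 1])  # fetch next tile while...
        out.append(compute(cur))                    # ...computing on the current one
    return out

print(double_buffered([[1, 2], [3, 4]], sum))  # [3, 7]
```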
CLASSIFYING APPROACHES: IMPLICIT VERSUS EXPLICIT

Implicit: the program issues loads/stores to a global address space, and hardware decides what to stage where (e.g., caches).
Explicit: software directs which data is staged into which buffer (e.g., scratchpads).
CLASSIFYING APPROACHES: COUPLED VERSUS DECOUPLED

Coupled: the unit that consumes the data also requests it (e.g., a core issuing its own loads — Implicit + Coupled).
Decoupled: a separate engine runs ahead and fetches data on behalf of the consumer (e.g., decoupled access/execute — Implicit + Decoupled).
EXPLICIT DECOUPLED DATA ORCHESTRATION

Combining both: software explicitly places data into staging buffers (Explicit), and a separate engine fills them ahead of the datapath (Decoupled) — e.g., a DMA engine feeding a FIFO, versus Implicit + Decoupled approaches.
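A toy software model of the EDDO pattern, using a thread as the decoupled fill engine and a bounded queue as the staging FIFO (all names here are illustrative): the consumer blocks only when its next tile has not yet arrived, which mimics the precise, encapsulated synchronization the deck advocates.

```python
# Illustrative EDDO sketch: a decoupled fill engine pushes tiles into a
# fixed-depth FIFO; the datapath stalls exactly until its next tile is ready.
import threading
import queue

def run_eddo(data, tile, depth=2):
    fifo = queue.Queue(maxsize=depth)       # staging buffer: fixed-depth FIFO
    results = []

    def fill_engine():                      # runs ahead, bounded by FIFO depth
        for base in range(0, len(data), tile):
            fifo.put(data[base:base + tile])
        fifo.put(None)                      # end-of-stream marker

    t = threading.Thread(target=fill_engine)
    t.start()
    while True:
        next_tile = fifo.get()              # blocks only if the tile is absent
        if next_tile is None:
            break
        results.append(sum(next_tile))      # datapath consumes the tile
    t.join()
    return results

print(run_eddo(list(range(8)), 4))  # [6, 22]
```

Note how the bounded `maxsize` also gives delivery/use overlap for free: the fill engine can run at most `depth` tiles ahead, never further.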
PROPERTIES OF APPROACHES

|                           | CPU + Cache                | SM + ShMem Spad            | DAE CPU + Cache     | DMA Eng. + FIFO              |
|                           | Implicit, Coupled          | Explicit, Coupled          | Implicit, Decoupled | Explicit, Decoupled          |
| Buf. Area/Energy          | High                       | Low                        | High                | Low                          |
| Placement policy          | Heuristic                  | Programmatic               | Heuristic           | Programmatic                 |
| Hier. Composable Access   | Yes                        | No                         | Yes                 | Yes                          |
| Multicast                 | No                         | No                         | Yes                 | Yes                          |
| MLP of Fills              | Complex                    | Complex                    | Cheap               | Cheap                        |
| Landing Zone Holding Time | Round-trip                 | Round-trip                 | Hop-to-Hop          | Hop-to-Hop                   |
| Data Availability Sync.   | Encapsulated (load-to-use) | Encapsulated (load-to-use) | Out-of-band         | Encapsulated (peek stalling) |
| Access order              | Arbitrary                  | Arbitrary                  | Arbitrary           | Fixed (FIFO)                 |
| In-place updates          | Yes                        | Yes                        | Yes                 | No                           |
| Removal                   | Heuristic                  | Programmatic               | Heuristic           | Dequeue/clear                |

NVIDIA CONFIDENTIAL. DO NOT DISTRIBUTE.
PROPERTIES OF APPROACHES (continued)

• The last three rows — Access order, In-place updates, Removal — are not limitations of EDDO, but of the FIFO idiom
• Buffets change these points to {Arbitrary, Yes, Programmatic (Contiguous)}
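A toy model can make the contrast with the FIFO column concrete. This sketch mimics only the visible semantics the deck attributes to buffets — arbitrary reads within the live window, in-place updates, and programmatic contiguous removal; the real hardware idiom's blocking/fill behavior and interface are defined in the ASPLOS 2019 paper and are not modeled here (all names below are invented for illustration).

```python
# Illustrative model of the buffet idiom's storage semantics (no blocking).
class ToyBuffet:
    def __init__(self):
        self.data = []
        self.head = 0                     # start of the live window

    def fill(self, value):                # producer appends new data
        self.data.append(value)

    def read(self, idx):                  # Access order: Arbitrary
        return self.data[self.head + idx]

    def update(self, idx, value):         # In-place updates: Yes
        self.data[self.head + idx] = value

    def shrink(self, n):                  # Removal: Programmatic (Contiguous)
        self.head += n                    # drop n oldest elements from the window

b = ToyBuffet()
for v in [10, 20, 30]:
    b.fill(v)
b.update(1, 99)       # overwrite in place -- impossible in a plain FIFO
b.shrink(1)           # contiguous removal of the oldest element
print(b.read(0))      # 99: reads are relative to the live window
```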
BUFFETS: COMPOSABLE IDIOM FOR E.D.D.O.

Details to appear in ASPLOS 2019 [April, Providence]
ARCHITECTURAL VISION FOR E.D.D.O.

[Figure: two compilation flows]
• Traditional JIT: Portable Code + uArch Description → JIT → uArch-Specific Code
• Data-Size Dependent JIT: Portable Code + uArch Description + Input Data Description → JIT + Mapper → Blocked, mapped uArch-Specific Code
IDEAS FOR POTENTIAL AUTOMATIC MAPPERS

• Have the program pre-select a "menu" and provide a heuristic?
• Train a neural net?
• Use tensor decomposition + tensor prediction?

Key idea: run the mapper on the accelerator itself...

Open questions:
• How to make this work with sparsity?
• What can be conveyed to the mapper in O(1) time?
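The "menu" idea above can be sketched as follows (everything here — the menu contents, the heuristic, the parameter names — is a hypothetical illustration, not a proposal from the deck): the program ships a few candidate mappings chosen at compile time, and the mapper picks one at run time from O(1) facts about the input, such as its size.

```python
# Hypothetical menu-based mapper: candidate tile sizes are fixed at compile
# time; the run-time heuristic needs only O(1) input facts to choose one.

MENU = [64, 256, 1024]  # candidate tile sizes (illustrative)

def pick_mapping(input_size, buffer_capacity):
    """Pick the largest menu tile that fits the buffer and the input."""
    fits = [t for t in MENU if t <= buffer_capacity and t <= input_size]
    return max(fits) if fits else min(MENU)   # fall back to the smallest tile

print(pick_mapping(input_size=5000, buffer_capacity=512))  # 256
print(pick_mapping(input_size=100, buffer_capacity=512))   # 64
```

Because the heuristic consults only scalar facts (sizes, capacities), it is cheap enough to run on the accelerator itself — which is exactly what makes the open question about sparsity hard, since sparsity structure is not an O(1) fact.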
MPELLAUER@NVIDIA.COM