ACCELERATION VIA EXPLICIT DECOUPLED DATA ORCHESTRATION

Michael Pellauer* — 1/26/2019
[Extended version to appear in: ASPLOS 2019]

In collaboration with: Yakun Sophia Shao*, Jason Clemons*, Neal Crago*, Kartik Hegde**, Rangharajan Venkatesan*, Stephen W. Keckler*, Christopher W. Fletcher**, Joel Emer*^

*NVIDIA  **UIUC  ^MIT
ACCELERATORS ARE GREAT.... BUT!

[Figure: a custom datapath connected to off-chip memory]
WHAT IS DATA ORCHESTRATION?

Feeding data to a functional unit exactly when it wants it.

[Figure: Off-Chip I/O → Staging Buffer (Large, Shared) → Staging Buffers (Small, Private) → Datapaths]

• Who: the "actors" that touch data, and their synchronization with each other
• When: data is moved over a transfer substrate
• Where: data is placed in the available staging buffers
• How: data is accessed (strides, patterns, etc.), including when it is no longer needed

ML ASICs use workload knowledge to optimize orchestration at design time, without caches.
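The who/when/where/how decisions above can be made concrete with a small illustrative sketch (all names here are invented for illustration, not from the deck): software walks off-chip data tile by tile through a two-level staging hierarchy before the datapath consumes it.

```python
# Hypothetical sketch of explicit data orchestration through a staging
# hierarchy. "stage_and_consume" and its buffers are illustrative names.

def stage_and_consume(off_chip, tile_size):
    """Walk off-chip data tile by tile through a two-level staging hierarchy."""
    results = []
    for base in range(0, len(off_chip), tile_size):   # HOW: strided access pattern
        shared = off_chip[base:base + tile_size]      # WHEN: transfer to large shared buffer
        private = list(shared)                        # WHERE: copy into small private buffer
        results.append(sum(private))                  # WHO: the datapath consumes it
        # HOW (lifetime): the tile is no longer needed after this iteration,
        # so its staging space is implicitly reclaimed here
    return results

print(stage_and_consume(list(range(8)), 4))  # two tiles: [0..3] and [4..7] -> [6, 22]
```

In a real accelerator each of these steps is a distinct hardware actor; the point of the sketch is only to separate the four orchestration questions.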
GUIDING PRINCIPLES FOR EFFICIENT DATA ORCHESTRATION

• Bandwidth efficiency: maximize data delivery rate by controlling outstanding requests
• Local reuse: data staged physically close to the consuming units
• Cross-unit use: amortize access and communication costs
• Simple structures: minimize hardware area/power
• Delivery/use overlap: the next tile should be available when the current one is done (e.g., double-buffering)
• Precise synchronization: only wait for exactly the data you need, and respond quickly (e.g., no barriers or remote polling)
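The delivery/use-overlap principle is what double-buffering provides. A minimal sketch (sequential Python, so the overlap is only schematic — in hardware the fetch and the compute proceed concurrently):

```python
# Illustrative double-buffering sketch: two buffers alternate roles so the
# next tile is fetched while the current one is being computed on.

def double_buffered(tiles, compute):
    """Overlap 'fetch next tile' with 'compute on current tile' using two buffers."""
    if not tiles:
        return []
    bufs = [None, None]
    out = []
    bufs[0] = list(tiles[0])                        # prefetch the first tile
    for i in range(len(tiles)):
        cur = bufs[i % 2]
        if i + 1 < len(tiles):
            bufs[(i + 1) % 2] = list(tiles[i + 1])  # fetch next tile while...
        out.append(compute(cur))                    # ...computing on the current one
    return out

print(double_buffered([[1, 2], [3, 4]], sum))  # [3, 7]
```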
CLASSIFYING APPROACHES: IMPLICIT VERSUS EXPLICIT

Implicit: the program issues loads/stores to a global address space, and hardware decides what to stage where (e.g., caches).
Explicit: software directs which data is staged into which buffer (e.g., scratchpads).
CLASSIFYING APPROACHES: COUPLED VERSUS DECOUPLED

Coupled: the unit that consumes the data also requests it (e.g., a core issuing its own loads — Implicit + Coupled).
Decoupled: a separate engine runs ahead and fetches data on behalf of the consumer (e.g., decoupled access/execute — Implicit + Decoupled).
EXPLICIT DECOUPLED DATA ORCHESTRATION

Combining both: software explicitly places data into staging buffers (Explicit), and a separate engine fills them ahead of the datapath (Decoupled) — e.g., a DMA engine feeding a FIFO, versus Implicit + Decoupled approaches.
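A toy software model of the EDDO pattern, using a thread as the decoupled fill engine and a bounded queue as the staging FIFO (all names here are illustrative): the consumer blocks only when its next tile has not yet arrived, which mimics the precise, encapsulated synchronization the deck advocates.

```python
# Illustrative EDDO sketch: a decoupled fill engine pushes tiles into a
# fixed-depth FIFO; the datapath stalls exactly until its next tile is ready.
import threading
import queue

def run_eddo(data, tile, depth=2):
    fifo = queue.Queue(maxsize=depth)       # staging buffer: fixed-depth FIFO
    results = []

    def fill_engine():                      # runs ahead, bounded by FIFO depth
        for base in range(0, len(data), tile):
            fifo.put(data[base:base + tile])
        fifo.put(None)                      # end-of-stream marker

    t = threading.Thread(target=fill_engine)
    t.start()
    while True:
        next_tile = fifo.get()              # blocks only if the tile is absent
        if next_tile is None:
            break
        results.append(sum(next_tile))      # datapath consumes the tile
    t.join()
    return results

print(run_eddo(list(range(8)), 4))  # [6, 22]
```

Note how the bounded `maxsize` also gives delivery/use overlap for free: the fill engine can run at most `depth` tiles ahead, never further.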
PROPERTIES OF APPROACHES

|                           | CPU + Cache                | SM + ShMem Spad            | DAE CPU + Cache     | DMA Eng. + FIFO              |
|                           | Implicit, Coupled          | Explicit, Coupled          | Implicit, Decoupled | Explicit, Decoupled          |
| Buf. Area/Energy          | High                       | Low                        | High                | Low                          |
| Placement policy          | Heuristic                  | Programmatic               | Heuristic           | Programmatic                 |
| Hier. Composable Access   | Yes                        | No                         | Yes                 | Yes                          |
| Multicast                 | No                         | No                         | Yes                 | Yes                          |
| MLP of Fills              | Complex                    | Complex                    | Cheap               | Cheap                        |
| Landing Zone Holding Time | Round-trip                 | Round-trip                 | Hop-to-Hop          | Hop-to-Hop                   |
| Data Availability Sync.   | Encapsulated (load-to-use) | Encapsulated (load-to-use) | Out-of-band         | Encapsulated (peek stalling) |
| Access order              | Arbitrary                  | Arbitrary                  | Arbitrary           | Fixed (FIFO)                 |
| In-place updates          | Yes                        | Yes                        | Yes                 | No                           |
| Removal                   | Heuristic                  | Programmatic               | Heuristic           | Dequeue/clear                |

NVIDIA CONFIDENTIAL. DO NOT DISTRIBUTE.
PROPERTIES OF APPROACHES (continued)

• The last three rows — Access order, In-place updates, Removal — are not limitations of EDDO, but of the FIFO idiom
• Buffets change these points to {Arbitrary, Yes, Programmatic (Contiguous)}
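A toy model can make the contrast with the FIFO column concrete. This sketch mimics only the visible semantics the deck attributes to buffets — arbitrary reads within the live window, in-place updates, and programmatic contiguous removal; the real hardware idiom's blocking/fill behavior and interface are defined in the ASPLOS 2019 paper and are not modeled here (all names below are invented for illustration).

```python
# Illustrative model of the buffet idiom's storage semantics (no blocking).
class ToyBuffet:
    def __init__(self):
        self.data = []
        self.head = 0                     # start of the live window

    def fill(self, value):                # producer appends new data
        self.data.append(value)

    def read(self, idx):                  # Access order: Arbitrary
        return self.data[self.head + idx]

    def update(self, idx, value):         # In-place updates: Yes
        self.data[self.head + idx] = value

    def shrink(self, n):                  # Removal: Programmatic (Contiguous)
        self.head += n                    # drop n oldest elements from the window

b = ToyBuffet()
for v in [10, 20, 30]:
    b.fill(v)
b.update(1, 99)       # overwrite in place -- impossible in a plain FIFO
b.shrink(1)           # contiguous removal of the oldest element
print(b.read(0))      # 99: reads are relative to the live window
```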
BUFFETS: COMPOSABLE IDIOM FOR E.D.D.O.

Details to appear in ASPLOS 2019 [April, Providence]
ARCHITECTURAL VISION FOR E.D.D.O.

[Figure: two compilation flows]
• Traditional JIT: Portable Code + uArch Description → JIT → uArch-Specific Code
• Data-Size Dependent JIT: Portable Code + uArch Description + Input Data Description → JIT + Mapper → Blocked, mapped uArch-Specific Code
IDEAS FOR POTENTIAL AUTOMATIC MAPPERS

• Have the program pre-select a "menu" and provide a heuristic?
• Train a neural net?
• Use tensor decomposition + tensor prediction?

Key idea: run the mapper on the accelerator itself...

Open questions:
• How to make this work with sparsity?
• What can be conveyed to the mapper in O(1) time?
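The "menu" idea above can be sketched as follows (everything here — the menu contents, the heuristic, the parameter names — is a hypothetical illustration, not a proposal from the deck): the program ships a few candidate mappings chosen at compile time, and the mapper picks one at run time from O(1) facts about the input, such as its size.

```python
# Hypothetical menu-based mapper: candidate tile sizes are fixed at compile
# time; the run-time heuristic needs only O(1) input facts to choose one.

MENU = [64, 256, 1024]  # candidate tile sizes (illustrative)

def pick_mapping(input_size, buffer_capacity):
    """Pick the largest menu tile that fits the buffer and the input."""
    fits = [t for t in MENU if t <= buffer_capacity and t <= input_size]
    return max(fits) if fits else min(MENU)   # fall back to the smallest tile

print(pick_mapping(input_size=5000, buffer_capacity=512))  # 256
print(pick_mapping(input_size=100, buffer_capacity=512))   # 64
```

Because the heuristic consults only scalar facts (sizes, capacities), it is cheap enough to run on the accelerator itself — which is exactly what makes the open question about sparsity hard, since sparsity structure is not an O(1) fact.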
MPELLAUER@NVIDIA.COM