Programming Systems for Specialized Architectures: Interface, Data, Approximation

Sarita Adve
With: Vikram Adve, Johnathan Alsop, Maria Kotsifakou, Sasa Misailovic, Matt Sinclair, Prakalp Srivastava
University of Illinois at Urbana-Champaign
sadve@illinois.edu
Sponsors: NSF, C-FAR, ADA (a JUMP center sponsored by SRC and DARPA)
A Modern Mobile SoC

[Diagram: CPUs with vector units and L1/L2 caches, GPU, DSPs, multimedia and A/V hardware accelerators, modem, and GPS on a shared interconnect to main memory]

• Different hardware ISAs
• Different parallelism models
• Incompatible memory systems
• Increasing diversity within and across SoCs, and in supercomputers, data centers, …

Need a common interface (abstractions): hardware-independent software development, "object code" portability
Data movement is critical: memory structures, communication, consistency, synchronization
Approximation: application-driven solution-quality trade-offs to increase efficiency
Interfaces: Back to the Future

April 7, 1964: IBM announced the System/360
• A family of machines with a common abstraction/interface/ISA
  – Programmer freedom: no reprogramming
  – Designer freedom: implementation creativity
Not unique:
• CPUs: ISAs; Internet: IP; GPUs: CUDA; Databases: SQL; …
Current Interface Levels

Level (examples) and the benefit of standardizing there:
• Domain-specific prog. language (TensorFlow, MXNet, Halide, …): app. productivity
• General-purpose prog. language (CUDA, OpenCL, OpenACC, OpenMP, Python, Julia): app. performance
• Language-level compiler IR (Delite DSL IR, DLVM, TVM, …): language innovation
• Language-neutral compiler IR (Delite IR, HPVM, OSCAR, Polly): compiler investment
• Virtual ISA (SPIR, HPVM): object-code portability
• "Hardware" ISA (IBM AS/400, Transmeta, PTX, HSAIL, codesigned virtual machines, …): hardware innovation
Hardware targets: CPUs + SIMD units, vector, DSP, FPGA, GPU, domain-specific accelerators
Source: Vikram Adve, HPVM project, https://publish.illinois.edu/hpvm-project/
Which Interface Levels Can Be Uniform?

• Domain-specific and general-purpose programming languages: too diverse to define a uniform interface
• Language-level compiler IR, language-neutral compiler IR, virtual ISA: much more uniform
• "Hardware" ISAs (CPUs + SIMD units, vector, DSP, FPGA, GPU, domain-specific accelerators): also too diverse
Source: Vikram Adve, HPVM project, https://publish.illinois.edu/hpvm-project/
One Example: HPVM, the Heterogeneous Parallel Virtual Machine [PPoPP'18]

A parallel program representation for heterogeneous parallel hardware:
• Virtual ISA: portable virtual object code, simpler translators
• Compiler IR: optimizations; a target for mapping diverse parallel languages
• Runtime representation: flexible scheduling (mapping, load balancing)
A generalization of the LLVM IR for parallel heterogeneous hardware.
PPoPP'18: results on GPU (NVIDIA), vector ISA (AVX), and multicore (Intel Xeon)
Ongoing: FPGAs, novel domain-specific SoCs
HPVM Abstractions

A hierarchical dataflow graph with side effects, whose leaf nodes contain ordinary scalar and vector LLVM code, e.g.:

  %VA = load <4 x float>* %A
  %VB = load <4 x float>* %B
  …
  %VC = fmul <4 x float> %VA, %VB

• Task, data, and vector parallelism
• Streams, pipelines
• Shared memory
• High-level optimizations
• FPGAs (more custom hardware?)
N different parallelism models, one single unified model (see the sketch below)
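To make the dataflow-graph abstraction concrete, here is a minimal, self-contained C++ sketch of the idea (not HPVM's actual API, whose intrinsics live in the compiler): a leaf node is a function replicated as a 1-D grid of dynamic instances, and a backend is free to map those instances to GPU threads, vector lanes, or a sequential loop.

  // Minimal C++ model of an HPVM-style leaf node (illustrative only):
  // one dynamic instance per data element; "launch" plays the role of a
  // backend that could instead emit GPU, AVX, or multicore code.
  #include <cstddef>
  #include <functional>
  #include <vector>

  struct LeafNode {
    std::size_t dim_x;                        // number of dynamic instances
    std::function<void(std::size_t)> body;    // body(i) = work of instance i
  };

  void launch(const LeafNode &n) {            // sequential "backend"
    for (std::size_t i = 0; i < n.dim_x; ++i) n.body(i);
  }

  int main() {
    std::vector<float> A(1024, 1.0f), B(1024, 2.0f), C(1024);
    // Data-parallel multiply node, matching the fmul vector code above.
    LeafNode vmul{A.size(), [&](std::size_t i) { C[i] = A[i] * B[i]; }};
    launch(vmul);
    return C[0] == 2.0f ? 0 : 1;              // 1.0 * 2.0
  }

In HPVM itself the graph is hierarchical (nodes contain graphs), edges carry explicit dataflow, and the same graph can be retargeted without source changes.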
Data

[Diagram: accelerators attached through per-device interfaces (cache, coherent FIFO, stash, RDMA), connected on-chip and via inter-chip interfaces]

Data movement is critical to efficiency:
• Memory structures
• Communication
• Coherence
• Consistency
• Synchronization
Goal: a uniform communication interface for hardware, abstracted up to a software interface
Application-Customized Accelerator Communication Architecture

Problem: design and integrate multiple accelerator memory systems + communication
Challenges:
‒ Friction between different app-specific specializations
‒ Inefficiencies due to deep memory hierarchies
‒ Multiple scales: on-chip to cloud
New accelerator communication architecture:
‒ Coherent, global address space
‒ App-specialized coherence, communication, storage, and solution quality
One example next, focused on coherence: Spandex [ISCA'18]
Heterogeneous Devices Have Diverse Memory Demands

[Diagram: workloads plotted along five demands: fine-grain synchronization, latency sensitivity, temporal locality, spatial locality, throughput sensitivity]

• Typical CPU workloads: fine-grain synchronization, latency sensitive
• Typical GPU workloads: spatial locality, throughput sensitive
MESI Targets CPU Workloads; GPU Coherence Fits GPU Workloads; DeNovo Fits Both

Protocol property       | MESI                      | GPU coherence             | DeNovo
Granularity             | Reads: line; writes: line | Reads: line; writes: word | Reads: flexible; writes: word
Stale-data invalidation | Writer-invalidate         | Self-invalidate           | Self-invalidate
Write propagation       | Ownership (write-back)    | Write-through             | Ownership
Good for:               | CPU                       | GPU                       | CPU or GPU

• MESI: coarse-grain (line) state exploits spatial locality but suffers false sharing. Writer-initiated invalidation and ownership capture temporal locality for both reads and writes, but the protocol's overheads limit throughput and scalability, and ownership adds indirection when locality is low.
• GPU coherence: fine-grain writes avoid false sharing but give up some spatial locality. Self-invalidation and write-through caches are simple, scalable, and low overhead, but synchronization limits both read and write reuse.
• DeNovo: flexible read granularity, word-granularity writes, self-invalidation, and ownership-based write propagation make it a good fit for both CPU and GPU workloads (a small model of this row follows below).
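As a minimal illustration of why the DeNovo column behaves this way, here is a self-contained C++ model (illustrative only, not the actual protocol implementation): words are tracked individually, writes register ownership, and an acquire self-invalidates only words that are valid but not owned, so owned data survives synchronization, which is why ownership preserves write reuse.

  #include <cstdint>
  #include <iostream>
  #include <unordered_map>

  enum class WordState { Invalid, Valid, Owned };

  struct DeNovoL1 {
    std::unordered_map<std::uint64_t, WordState> words;  // per-word state

    void read(std::uint64_t addr) {          // miss brings the word in Valid
      auto &s = words[addr];                 // absent words start Invalid
      if (s == WordState::Invalid) s = WordState::Valid;
    }
    void write(std::uint64_t addr) {         // writes register ownership
      words[addr] = WordState::Owned;
    }
    void acquire() {                         // self-invalidate at synch:
      for (auto &kv : words)                 // drop Valid, keep Owned
        if (kv.second == WordState::Valid) kv.second = WordState::Invalid;
    }
  };

  int main() {
    DeNovoL1 l1;
    l1.read(0x100);   // 0x100 becomes Valid
    l1.write(0x200);  // 0x200 becomes Owned
    l1.acquire();     // 0x100 invalidated; 0x200 still Owned
    std::cout << "0x100 valid after acquire? "
              << (l1.words[0x100] != WordState::Invalid) << "\n"   // 0
              << "0x200 owned after acquire? "
              << (l1.words[0x200] == WordState::Owned) << "\n";    // 1
  }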
Integrating Diverse Coherence Strategies

Existing solutions: MESI-based LLC
‒ Requests are forced to use MESI
‒ Added latency for inter-device communication
‒ MESI is complex: extensions are difficult

Spandex: a DeNovo-based interface [ISCA'18]
‒ Supports write-through and write-back
‒ Supports self-invalidate and writer-invalidate
‒ Supports requests of variable granularity
‒ Directly interfaces MESI, GPU-coherence, and hybrid (e.g., DeNovo) caches; a sketch of the request mapping follows below

[Diagram: CPU (MESI L1), GPU (GPU-coherence L1), and FPGA/ASIC (DeNovo L1) attached directly to a Spandex LLC, vs. all devices forced through a MESI/hybrid-L2 hierarchy onto a MESI LLC]
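A sketch of how such an interface can work, using the request-type names from the Spandex paper (reproduced from memory; granularity and state fields are simplified away): each device keeps its native policy, and the LLC sees only a small set of request types.

  #include <iostream>
  #include <string>

  enum class Device { CpuMESI, GpuCoherence, DeNovo };
  enum class Access { ReadMiss, WriteMiss };

  // Spandex request types: ReqV = valid copy (self-invalidated later),
  // ReqS = shared copy (writer-invalidated), ReqO = ownership,
  // ReqWT = write-through to the LLC.
  std::string spandex_request(Device d, Access a) {
    switch (d) {
      case Device::CpuMESI:                      // MESI keeps its semantics:
        return a == Access::ReadMiss ? "ReqS"    // reads join the sharer list,
                                     : "ReqO";   // writes obtain ownership
      case Device::GpuCoherence:                 // simple GPU coherence:
        return a == Access::ReadMiss ? "ReqV"    // self-invalidated reads,
                                     : "ReqWT";  // write-through writes
      case Device::DeNovo:                       // DeNovo hybrid:
        return a == Access::ReadMiss ? "ReqV"    // self-invalidated reads,
                                     : "ReqO";   // ownership-based writes
    }
    return "?";
  }

  int main() {
    std::cout << spandex_request(Device::GpuCoherence, Access::WriteMiss)  // ReqWT
              << " "
              << spandex_request(Device::DeNovo, Access::WriteMiss)        // ReqO
              << "\n";
  }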
Example: Collaborative Graph Applications

Vertex-centric algorithms: distribute vertices among CPU and GPU threads (both patterns are sketched below)

Pull-based PageRank
  Access pattern: read neighbor vertices, update local vertex
  Important dimension: flat LLC avoids indirection for read misses
  Results: Spandex LLC gives 37% better execution time, 9% better network traffic

Push-based Betweenness Centrality
  Access pattern: read local vertex, update (RMW) neighbor vertices
  Important dimension: ownership-based write propagation exploits locality in updates
  Results: DeNovo at the GPU gives 18% better execution time, 61% better network traffic
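The two access patterns above, sketched as self-contained C++ (illustrative, not the benchmark code): pull is read-dominated with one local write per vertex, so a flat LLC that serves reads without ownership indirection helps; push performs read-modify-writes on neighbors, which is where ownership-based write propagation pays off.

  #include <atomic>
  #include <cstddef>
  #include <vector>

  struct Graph {
    std::vector<std::vector<int>> nbrs;  // adjacency lists
  };

  // Pull-based PageRank-style step: read neighbors, update local vertex.
  void pull_step(const Graph &g, const std::vector<float> &in,
                 std::vector<float> &out) {
    for (std::size_t v = 0; v < g.nbrs.size(); ++v) {
      float sum = 0.0f;
      for (int u : g.nbrs[v]) sum += in[u];  // many remote reads
      out[v] = sum;                          // one local write
    }
  }

  // Push-based step: read local vertex, RMW neighbor vertices.
  void push_step(const Graph &g, const std::vector<float> &val,
                 std::vector<std::atomic<float>> &acc) {
    for (std::size_t v = 0; v < g.nbrs.size(); ++v) {
      float mine = val[v];                   // one local read
      for (int u : g.nbrs[v]) {              // many remote RMWs
        float old = acc[u].load();
        while (!acc[u].compare_exchange_weak(old, old + mine)) {}
      }
    }
  }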
Looking Forward…

Software innovations (HPVM + DRF consistency + ???): coarse-grain operations and their locality, producer/consumer relationships, synchronization visibility, data locality
Hardware innovations: coherent scratchpads (stash [ISCA'15]), hardware queues, hLRC adaptive laziness, Spandex, HBM caches, NVRAM, dynamic caches
(A sketch of the coherent-scratchpad idea follows below.)
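As one example from the hardware column, here is an illustrative C++ sketch of the stash idea from ISCA'15 (a simplification, not the published design): a scratchpad mapped into the coherent global address space, so local accesses hit the scratchpad directly while remote devices can still name the same data by global address.

  #include <cstdint>
  #include <optional>

  struct StashMap {
    std::uint64_t global_base;  // start of the mapped global range
    std::uint64_t size;         // bytes mapped into the stash
    std::uint32_t stash_base;   // where the range lands in the scratchpad

    // Translate a global address to a stash slot, if it falls in the map;
    // addresses outside the map would go through the normal cache path.
    std::optional<std::uint32_t> to_stash(std::uint64_t gaddr) const {
      if (gaddr >= global_base && gaddr < global_base + size)
        return stash_base + static_cast<std::uint32_t>(gaddr - global_base);
      return std::nullopt;
    }
  };

  int main() {
    StashMap m{0x1000, 256, 0};
    return m.to_stash(0x1010).value_or(999) == 0x10 ? 0 : 1;  // slot 16
  }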
Approximation

How do we express solution quality from the application to the hardware?
Integrate approximation (quality) into the interface.
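A hypothetical sketch of what such an interface could look like; all names here are invented for illustration. The application declares a quality knob, and the runtime is free to choose a cheaper loop-perforated kernel (skip-and-rescale) whenever the application permits it.

  #include <cstddef>
  #include <cstdio>
  #include <vector>

  struct QualitySpec { bool allow_approx; };  // app-declared quality knob

  double mean_exact(const std::vector<double> &v) {
    double s = 0;
    for (double x : v) s += x;
    return s / v.size();
  }

  // Loop perforation: touch every 4th element, ~1/4 the memory traffic.
  double mean_perforated(const std::vector<double> &v) {
    double s = 0;
    std::size_t n = 0;
    for (std::size_t i = 0; i < v.size(); i += 4) { s += v[i]; ++n; }
    return s / n;
  }

  // The "interface": the runtime picks an implementation per the spec.
  double mean(const std::vector<double> &v, QualitySpec q) {
    return q.allow_approx ? mean_perforated(v) : mean_exact(v);
  }

  int main() {
    std::vector<double> v(1000, 1.0);
    std::printf("%f %f\n", mean(v, {false}), mean(v, {true}));
  }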
Summary

• Interfaces: a uniform parallel program representation (HPVM)
• Data: application-specialized communication and coherence (Spandex)
• Approximation: integrate solution quality into the interface