Offload Annotations: Bringing Heterogeneous Computing to Existing Libraries and Workloads



  1. Offload Annotations: Bringing Heterogeneous Computing to Existing Libraries and Workloads. Gina Yuan, Shoumik Palkar, Deepak Narayanan, Matei Zaharia. Stanford University. USENIX ATC 2020 (July 15-17).

  2. Background: Hardware Commoditization. (Image: an NVIDIA GPU.)

  3. Background: CPUs vs. GPUs. CPUs: a few cores with control logic and cache, 4-way parallelism, 512GB memory. GPUs: 1000-way parallelism, but only 16GB memory. Costly data transfers between CPU and GPU memory over PCI-e!

  4. Background: Data Science on the CPU. Popular Python data science libraries for the CPU.

  5. Trend: Data Science on the GPU. Lots of parallel data! NEW Python data science libraries for the GPU (cuDF, cuML, etc.).

  6. Trend: CPU Libraries vs. GPU Libraries. https://cupy.chainer.org/ , https://github.com/rapidsai/cudf , https://pytorch.org/tutorials/beginner/blitz/tensor_tutorial.html , https://github.com/rapidsai/cuml

  7. Trend: CPU Libraries vs. GPU Libraries. Are GPU libraries as straightforward to use as they seem?

  8. Motivating Example: cuML.

  9. Motivating Example: cuML. Problem 1 – missing functions: not every CPU function has a cuML equivalent.

  10. Motivating Example: cuML. Problem 2 – manual data transfers:
      X_train = transfer(X_train, GPU)
      Y_train = transfer(Y_train, GPU)
      X_test = transfer(X_test, GPU)
      result = transfer(result, CPU)

  11. Motivating Example: cuML. Problem 3 – small GPU memory forces manual paging:
      for (i,j) in split(X_test):
          X_test[i,j] = transfer(X_test[i,j], GPU)
          result[i,j] = transfer(result[i,j], CPU)

  12. Motivating Example: cuML. Problem 4 – scheduling (???): on top of missing functions, manual data transfers, and small GPU memory, which device should each call run on at all?
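To make these pain points concrete, here is roughly what hand-offloading a NumPy computation looks like with PyTorch as the GPU library. The workload mirrors the arcsin/multiply/sqrt pipeline used later in the talk; the snippet is an illustration, not code from the paper:

    import numpy as np
    import torch

    a = np.ones(1 << 20, dtype='float64')
    b = np.ones(1 << 20, dtype='float64')

    # Manual data transfers: copy inputs into GPU memory over PCI-e.
    a_gpu = torch.from_numpy(a).cuda()
    b_gpu = torch.from_numpy(b).cuda()

    # Compute on the GPU using the GPU library's corresponding functions.
    out_gpu = torch.sqrt(torch.mul(torch.asin(a_gpu), b_gpu))

    # Manual transfer back: copy the result into CPU memory.
    out = out_gpu.cpu().numpy()

Every boundary between CPU and GPU code needs one of these explicit copies, and nothing here handles inputs larger than GPU memory.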

  13. Solution: Offload Annotations. The annotator writes offload annotations (OAs) for CPU libraries. An end user imports the annotated library instead of the CPU library. Our runtime, Bach, automatically schedules data transfers and pages computation.

  14-16. Goals. With less developer effort: 1. Match handwritten GPU performance. 2. Scale to data sizes larger than GPU memory. 3. Beat CPU performance.

  17-18. Step 1: Annotator – Function Annotations. Each CPU-library function is annotated with the corresponding GPU-library function:
      multiply = @oa(func=torch.mul)(np.multiply)
      sqrt = @oa(func=torch.sqrt)(np.sqrt)

  19. Step 1: Annotator – Function Annotations. Annotations also declare types for the inputs and outputs:
      arg = (NdArrayType(),)
      args = (NdArrayType(), NdArrayType())
      ret = NdArrayType()
      multiply = @oa(args, ret, func=torch.mul)(np.multiply)
      sqrt = @oa(arg, ret, func=torch.sqrt)(np.sqrt)
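The @oa notation above is slide shorthand; as runnable Python, an annotation is an ordinary higher-order function. Below is a minimal sketch of what such a wrapper could look like, with the caveat that Bach actually records calls into a lazy computation graph rather than dispatching eagerly as this toy does:

    import numpy as np
    import torch

    def oa(args, ret, func):
        """Sketch of an offload annotation: pair a CPU function with a GPU one.

        args/ret would be offload split types (e.g. NdArrayType()); this toy
        ignores them and only shows the dispatch idea.
        """
        def annotate(cpu_func):
            def dispatch(*values):
                # Bach records the call lazily; this toy dispatches eagerly:
                # use the GPU function when the inputs already live on the
                # GPU as torch tensors, else fall back to the CPU function.
                if values and all(isinstance(v, torch.Tensor) for v in values):
                    return func(*values)
                return cpu_func(*values)
            return dispatch
        return annotate

    # The slide's notation, written as plain function application:
    multiply = oa(args=None, ret=None, func=torch.mul)(np.multiply)
    sqrt = oa(args=None, ret=None, func=torch.sqrt)(np.sqrt)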

  20. Step 1: Annotator – Allocation Annotations. Allocation functions get their own annotation; allocations only have a return type:
      ones = @oa_alloc(ret, func=torch.ones)(np.ones)

  21. Step 1: Annotator – Allocation Annotations. The type used in the annotations (NdArrayType() above) is an "offload split type". What's in an offload split type?

  22. Step 1: Annotator – Offload Split Type. The offloading API:
      device(value)     - which device the value is on.
      to(value, device) - transfers [value] to [device].

  23. Step 1: Annotator – Offload Split Type. The offloading API implemented for NdArrayType():
      device(value)     - e.g., check isinstance(value, torch.Tensor)
      to(value, device) - e.g., value.to(torch.device('cpu')).numpy()
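Filling in the elided pieces of slide 23, a complete offloading API for an NdArray-like type might read as follows. This is a sketch assuming PyTorch as the GPU backend; the 'CPU'/'GPU' string tags and class layout are an illustrative convention, not Bach's exact interface:

    import numpy as np
    import torch

    class NdArrayType:
        def device(self, value):
            # A torch.Tensor on a CUDA device lives on the GPU;
            # a plain NumPy array lives on the CPU.
            if isinstance(value, torch.Tensor) and value.is_cuda:
                return 'GPU'
            return 'CPU'

        def to(self, value, device):
            # Transfer the value to the requested device.
            if device == 'GPU' and isinstance(value, np.ndarray):
                return torch.from_numpy(np.ascontiguousarray(value)).cuda()
            if device == 'CPU' and isinstance(value, torch.Tensor):
                return value.to(torch.device('cpu')).numpy()
            return value  # already on the requested device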

  24. Step 1: Annotator – Offload Split Type. The splitting API [from Mozart, SOSP '19]:
      size(value)              - number of elements in the value.
      split(start, end, value) - splits a value to enable paging.
      merge(values)            - merges split values (optional).

  25. Step 1: Annotator – Offload Split Type. The splitting API implemented for NdArrayType():
      size(value)              - return value.shape[-1]
      split(start, end, value) - return value[start:end]
      merge(values)            - return np.concatenate(values)
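These three methods are what let the runtime page oversized data through the GPU. Here is a rough sketch of such a paging loop over 1-D arrays, with slide 25's NdArrayType() implementations inlined as comments; the page size and the paged_apply helper name are assumptions, not Bach internals:

    import numpy as np
    import torch

    PAGE = 1 << 24  # elements per page; an illustrative size that fits in GPU memory

    def paged_apply(gpu_func, value):
        """Run gpu_func over a 1-D NumPy array one page at a time."""
        pieces = []
        n = value.shape[-1]                            # size(value)
        for start in range(0, n, PAGE):
            chunk = value[start:start + PAGE]          # split(start, end, value)
            gpu_chunk = torch.from_numpy(chunk).cuda() # to(chunk, GPU)
            out = gpu_func(gpu_chunk)
            pieces.append(out.cpu().numpy())           # to(out, CPU)
        return np.concatenate(pieces)                  # merge(values)

    # e.g. result = paged_apply(torch.sqrt, np.random.rand(1 << 28))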

  26. Step 1: Annotator – Offload Split Type. Example offload split types: NdArrayType(), DataFrameType(), ModelType().

  27. Step 2: End User (End User ≠ Annotator). A simple, somewhat dumb Python program:
      import numpy as np
      # Allocate
      a = np.ones(size, dtype='float64')
      b = np.ones(size, dtype='float64')
      # Compute
      np.arcsin(a, out=a)
      np.multiply(a, b, out=b)
      np.sqrt(b, out=b)

  28. Step 2: End User. Import the annotated library instead of the CPU library; the only change to the program is its first line:
      import bach.numpy as np

  29. Step 2: End User. Values are now lazy; explicitly materialize them with the included np.evaluate() function.
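Putting slides 27-29 together, the annotated program differs from the plain NumPy version only in the import and the final materialization call. np.evaluate() is the function named on the slide; its exact signature is an assumption here:

    import bach.numpy as np  # annotated library instead of plain NumPy

    size = 1 << 26
    # Allocate (lazy: nothing runs yet)
    a = np.ones(size, dtype='float64')
    b = np.ones(size, dtype='float64')
    # Compute (still lazy: Bach only records the computation graph)
    np.arcsin(a, out=a)
    np.multiply(a, b, out=b)
    np.sqrt(b, out=b)
    # Materialize: Bach picks devices, inserts transfers, pages, and runs.
    np.evaluate()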

  30. Step 3: Runtime – Scheduling. Bach generates a lazy computation graph and does a topological sort:
      1. a = np.ones()     (allocation)
      2. np.arcsin(a)
      3. b = np.ones()     (allocation)
      4. np.multiply(a,b)
      5. np.sqrt(b)
      (Figure: the graph, with each node's device, CPU or GPU, still to be assigned.)

  31. Step 3: Runtime – Scheduling. Assign each function to the CPU or GPU based on whether a GPU library implementation is provided in the annotation.

  32. Step 3: Runtime – Scheduling. Assign allocations to the CPU or GPU so they are on the same device as the first function that uses the data.
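A compressed sketch of this two-pass assignment follows. The Node representation and field names are invented for illustration; Bach's real scheduler additionally weighs transfer costs, per slides 35-37:

    from dataclasses import dataclass, field

    @dataclass
    class Node:
        name: str
        is_alloc: bool = False      # allocation (np.ones) vs. function call
        has_gpu_impl: bool = False  # did the annotation provide func=...?
        inputs: list = field(default_factory=list)
        device: str = ''

    def schedule(topo_order):
        """Assign devices to a topologically sorted lazy computation graph."""
        # Pass 1 (slide 31): functions run on the GPU only if the annotation
        # supplies a GPU implementation.
        for node in topo_order:
            if not node.is_alloc:
                node.device = 'GPU' if node.has_gpu_impl else 'CPU'
        # Pass 2 (slide 32): allocations go wherever their first consumer
        # runs, so freshly allocated data never needs an immediate transfer.
        for node in topo_order:
            if node.is_alloc:
                users = [n for n in topo_order if node in n.inputs]
                node.device = users[0].device if users else 'CPU'
        return topo_order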

  33. Step 3: Runtime – Offloading API. Bach automatically transfers data using the offloading API, inserting a transfer to the GPU before the first GPU-resident function and a transfer back to the CPU after the last one.

  34. Step 3: Runtime – Splitting API. At data size 2^28 the inputs no longer fit in GPU memory, so Bach automatically pages large datasets using the splitting API: split the data, run each piece through the transfer-compute-transfer pipeline, and merge the results.

  35. Step 3: Runtime – Scheduling Heuristics (optional). A naive cost-benefit analysis between data transfer and computation cost. At data size 2^28, GPU compute + data transfer is estimated cheaper than CPU compute, so the work is offloaded.

  36. Step 3: Runtime – Scheduling Heuristics (optional). At data size 2^10, CPU compute is estimated cheaper than GPU compute + data transfer, so everything stays on the CPU.

  37. Step 3: Runtime – Scheduling Heuristics (optional). Naive implementations of cost estimators. (Figure: estimated cost vs. data size from 2^10 to 2^28; the CPU-compute curve wins at small sizes, the GPU-compute-plus-transfer curve wins past the crossover.)
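A toy version of such an estimator, with made-up constants chosen only to reproduce the crossover the slides describe:

    # A naive cost model (all constants are invented illustrations).
    CPU_COST_PER_ELEM = 1.0       # relative cost of one element on the CPU
    GPU_COST_PER_ELEM = 0.01      # GPUs are far faster per element...
    TRANSFER_COST_PER_ELEM = 0.1  # ...but PCI-e transfers are not free
    GPU_FIXED_COST = 10_000.0     # kernel-launch / setup overhead

    def pick_device(n_elements):
        """Offload only when estimated GPU cost beats estimated CPU cost."""
        cpu = CPU_COST_PER_ELEM * n_elements
        gpu = GPU_FIXED_COST + (GPU_COST_PER_ELEM + TRANSFER_COST_PER_ELEM) * n_elements
        return 'GPU' if gpu < cpu else 'CPU'

    assert pick_device(2 ** 10) == 'CPU'  # small input: overhead dominates
    assert pick_device(2 ** 28) == 'GPU'  # large input: GPU compute wins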

  38. Evaluation: 4 library integrations and 8 data science and ML workloads.

  39. Integration Experience: ~130 LOC per library, including the offloading/splitting APIs and function annotations.

  40. Evaluation: Summary. Speedup: max 1200x, median 6.3x.

  41. Evaluation: Summary. (Figure: per-workload speedup results.)

  42-44. Evaluation: Summary. With less developer effort, Bach can: 1. Match handwritten GPU performance. 2. Scale to data sizes larger than GPU memory. 3. Beat CPU performance. (Highlighted speedups: 2.3x and 6.8x.)

  45. In-Depth Evaluation: Allocations. Crime Index saves time by eliminating the initial data transfer, while the allocation still fits in GPU memory. (Highlighted speedups: 1.1x and 4.6x.)

  46. In-Depth Evaluation: Heuristics. At smaller data sizes, TSVD schedules all computation on the CPU. (Highlighted speedup: 11x.)
