Offload Annotations: Bringing Heterogeneous Computing to Existing Libraries and Workloads



  1. Offload Annotations: Bringing Heterogeneous Computing to Existing Libraries and Workloads. Gina Yuan, Shoumik Palkar, Deepak Narayanan, Matei Zaharia. Stanford University. USENIX ATC 2020 (July 15-17).

  2. Background: Hardware Commoditization. (Image: an NVIDIA GPU.)

  3. Background: CPUs vs. GPUs. CPUs: a few cores with control logic and cache, 4-way parallelism, 512GB memory. GPUs: 1000-way parallelism, but only 16GB memory. Costly data transfers between CPU and GPU memory over PCI-e!

  4. Background: Data Science on the CPU. Popular Python data science libraries for the CPU.

  5. Trend: Data Science on the GPU. Lots of parallel data! NEW Python data science libraries for the GPU (cuDF, cuML, etc.).

  6. Trend: CPU Libraries vs. GPU Libraries. https://cupy.chainer.org/ , https://github.com/rapidsai/cudf , https://pytorch.org/tutorials/beginner/blitz/tensor_tutorial.html , https://github.com/rapidsai/cuml

  7. Trend: CPU Libraries vs. GPU Libraries. Are GPU libraries as straightforward to use as they seem?

  8. Motivating Example: cuML.

  9. Motivating Example: cuML. Problem 1 – missing functions: not every CPU function has a cuML equivalent.

  10. Motivating Example: cuML. Problem 2 – manual data transfers:
      X_train = transfer(X_train, GPU)
      Y_train = transfer(Y_train, GPU)
      X_test = transfer(X_test, GPU)
      result = transfer(result, CPU)

  11. Motivating Example: cuML. Problem 3 – small GPU memory forces manual paging:
      for (i,j) in split(X_test):
          X_test[i,j] = transfer(X_test[i,j], GPU)
          result[i,j] = transfer(result[i,j], CPU)

  12. Motivating Example: cuML. Problem 4 – scheduling (???): on top of missing functions, manual data transfers, and small GPU memory, which device should each call run on at all?
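To make these pain points concrete, here is roughly what hand-offloading a NumPy computation looks like with PyTorch as the GPU library. The workload mirrors the arcsin/multiply/sqrt pipeline used later in the talk; the snippet is an illustration, not code from the paper:

    import numpy as np
    import torch

    a = np.ones(1 << 20, dtype='float64')
    b = np.ones(1 << 20, dtype='float64')

    # Manual data transfers: copy inputs into GPU memory over PCI-e.
    a_gpu = torch.from_numpy(a).cuda()
    b_gpu = torch.from_numpy(b).cuda()

    # Compute on the GPU using the GPU library's corresponding functions.
    out_gpu = torch.sqrt(torch.mul(torch.asin(a_gpu), b_gpu))

    # Manual transfer back: copy the result into CPU memory.
    out = out_gpu.cpu().numpy()

Every boundary between CPU and GPU code needs one of these explicit copies, and nothing here handles inputs larger than GPU memory.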

  13. Solution: Offload Annotations. The annotator writes offload annotations (OAs) for CPU libraries. An end user imports the annotated library instead of the CPU library. Our runtime, Bach, automatically schedules data transfers and pages computation.

  14-16. Goals. With less developer effort: 1. Match handwritten GPU performance. 2. Scale to data sizes larger than GPU memory. 3. Beat CPU performance.

  17-18. Step 1: Annotator – Function Annotations. Each CPU-library function is annotated with the corresponding GPU-library function:
      multiply = @oa(func=torch.mul)(np.multiply)
      sqrt = @oa(func=torch.sqrt)(np.sqrt)

  19. Step 1: Annotator – Function Annotations. Annotations also declare types for the inputs and outputs:
      arg = (NdArrayType(),)
      args = (NdArrayType(), NdArrayType())
      ret = NdArrayType()
      multiply = @oa(args, ret, func=torch.mul)(np.multiply)
      sqrt = @oa(arg, ret, func=torch.sqrt)(np.sqrt)
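The @oa notation above is slide shorthand; as runnable Python, an annotation is an ordinary higher-order function. Below is a minimal sketch of what such a wrapper could look like, with the caveat that Bach actually records calls into a lazy computation graph rather than dispatching eagerly as this toy does:

    import numpy as np
    import torch

    def oa(args, ret, func):
        """Sketch of an offload annotation: pair a CPU function with a GPU one.

        args/ret would be offload split types (e.g. NdArrayType()); this toy
        ignores them and only shows the dispatch idea.
        """
        def annotate(cpu_func):
            def dispatch(*values):
                # Bach records the call lazily; this toy dispatches eagerly:
                # use the GPU function when the inputs already live on the
                # GPU as torch tensors, else fall back to the CPU function.
                if values and all(isinstance(v, torch.Tensor) for v in values):
                    return func(*values)
                return cpu_func(*values)
            return dispatch
        return annotate

    # The slide's notation, written as plain function application:
    multiply = oa(args=None, ret=None, func=torch.mul)(np.multiply)
    sqrt = oa(args=None, ret=None, func=torch.sqrt)(np.sqrt)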

  20. Step 1: Annotator – Allocation Annotations. Allocation functions get their own annotation; allocations only have a return type:
      ones = @oa_alloc(ret, func=torch.ones)(np.ones)

  21. Step 1: Annotator – Allocation Annotations. The type used in the annotations (NdArrayType() above) is an "offload split type". What's in an offload split type?

  22. Step 1: Annotator – Offload Split Type. The offloading API:
      device(value)     - which device the value is on.
      to(value, device) - transfers [value] to [device].

  23. Step 1: Annotator – Offload Split Type. The offloading API implemented for NdArrayType():
      device(value)     - e.g., check isinstance(value, torch.Tensor)
      to(value, device) - e.g., value.to(torch.device('cpu')).numpy()
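Filling in the elided pieces of slide 23, a complete offloading API for an NdArray-like type might read as follows. This is a sketch assuming PyTorch as the GPU backend; the 'CPU'/'GPU' string tags and class layout are an illustrative convention, not Bach's exact interface:

    import numpy as np
    import torch

    class NdArrayType:
        def device(self, value):
            # A torch.Tensor on a CUDA device lives on the GPU;
            # a plain NumPy array lives on the CPU.
            if isinstance(value, torch.Tensor) and value.is_cuda:
                return 'GPU'
            return 'CPU'

        def to(self, value, device):
            # Transfer the value to the requested device.
            if device == 'GPU' and isinstance(value, np.ndarray):
                return torch.from_numpy(np.ascontiguousarray(value)).cuda()
            if device == 'CPU' and isinstance(value, torch.Tensor):
                return value.to(torch.device('cpu')).numpy()
            return value  # already on the requested device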

  24. Step 1: Annotator – Offload Split Type. The splitting API [from Mozart, SOSP '19]:
      size(value)              - number of elements in the value.
      split(start, end, value) - splits a value to enable paging.
      merge(values)            - merges split values (optional).

  25. Step 1: Annotator – Offload Split Type. The splitting API implemented for NdArrayType():
      size(value)              - return value.shape[-1]
      split(start, end, value) - return value[start:end]
      merge(values)            - return np.concatenate(values)
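These three methods are what let the runtime page oversized data through the GPU. Here is a rough sketch of such a paging loop over 1-D arrays, with slide 25's NdArrayType() implementations inlined as comments; the page size and the paged_apply helper name are assumptions, not Bach internals:

    import numpy as np
    import torch

    PAGE = 1 << 24  # elements per page; an illustrative size that fits in GPU memory

    def paged_apply(gpu_func, value):
        """Run gpu_func over a 1-D NumPy array one page at a time."""
        pieces = []
        n = value.shape[-1]                            # size(value)
        for start in range(0, n, PAGE):
            chunk = value[start:start + PAGE]          # split(start, end, value)
            gpu_chunk = torch.from_numpy(chunk).cuda() # to(chunk, GPU)
            out = gpu_func(gpu_chunk)
            pieces.append(out.cpu().numpy())           # to(out, CPU)
        return np.concatenate(pieces)                  # merge(values)

    # e.g. result = paged_apply(torch.sqrt, np.random.rand(1 << 28))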

  26. Step 1: Annotator – Offload Split Type. Example offload split types: NdArrayType(), DataFrameType(), ModelType().

  27. Step 2: End User (End User ≠ Annotator). A simple, somewhat dumb Python program:
      import numpy as np
      # Allocate
      a = np.ones(size, dtype='float64')
      b = np.ones(size, dtype='float64')
      # Compute
      np.arcsin(a, out=a)
      np.multiply(a, b, out=b)
      np.sqrt(b, out=b)

  28. Step 2: End User. Import the annotated library instead of the CPU library; the only change to the program is its first line:
      import bach.numpy as np

  29. Step 2: End User. Values are now lazy; explicitly materialize them with the included np.evaluate() function.
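Putting slides 27-29 together, the annotated program differs from the plain NumPy version only in the import and the final materialization call. np.evaluate() is the function named on the slide; its exact signature is an assumption here:

    import bach.numpy as np  # annotated library instead of plain NumPy

    size = 1 << 26
    # Allocate (lazy: nothing runs yet)
    a = np.ones(size, dtype='float64')
    b = np.ones(size, dtype='float64')
    # Compute (still lazy: Bach only records the computation graph)
    np.arcsin(a, out=a)
    np.multiply(a, b, out=b)
    np.sqrt(b, out=b)
    # Materialize: Bach picks devices, inserts transfers, pages, and runs.
    np.evaluate()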

  30. Step 3: Runtime – Scheduling. Bach generates a lazy computation graph and does a topological sort:
      1. a = np.ones()     (allocation)
      2. np.arcsin(a)
      3. b = np.ones()     (allocation)
      4. np.multiply(a,b)
      5. np.sqrt(b)
      (Figure: the graph, with each node's device, CPU or GPU, still to be assigned.)

  31. Step 3: Runtime – Scheduling. Assign each function to the CPU or GPU based on whether a GPU library implementation is provided in the annotation.

  32. Step 3: Runtime – Scheduling. Assign allocations to the CPU or GPU so they are on the same device as the first function that uses the data.
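A compressed sketch of this two-pass assignment follows. The Node representation and field names are invented for illustration; Bach's real scheduler additionally weighs transfer costs, per slides 35-37:

    from dataclasses import dataclass, field

    @dataclass
    class Node:
        name: str
        is_alloc: bool = False      # allocation (np.ones) vs. function call
        has_gpu_impl: bool = False  # did the annotation provide func=...?
        inputs: list = field(default_factory=list)
        device: str = ''

    def schedule(topo_order):
        """Assign devices to a topologically sorted lazy computation graph."""
        # Pass 1 (slide 31): functions run on the GPU only if the annotation
        # supplies a GPU implementation.
        for node in topo_order:
            if not node.is_alloc:
                node.device = 'GPU' if node.has_gpu_impl else 'CPU'
        # Pass 2 (slide 32): allocations go wherever their first consumer
        # runs, so freshly allocated data never needs an immediate transfer.
        for node in topo_order:
            if node.is_alloc:
                users = [n for n in topo_order if node in n.inputs]
                node.device = users[0].device if users else 'CPU'
        return topo_order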

  33. Step 3: Runtime – Offloading API. Bach automatically transfers data using the offloading API, inserting a transfer to the GPU before the first GPU-resident function and a transfer back to the CPU after the last one.

  34. Step 3: Runtime – Splitting API. At data size 2^28 the inputs no longer fit in GPU memory, so Bach automatically pages large datasets using the splitting API: split the data, run each piece through the transfer-compute-transfer pipeline, and merge the results.

  35. Step 3: Runtime – Scheduling Heuristics (optional). A naive cost-benefit analysis between data transfer and computation cost. At data size 2^28, GPU compute + data transfer is estimated cheaper than CPU compute, so the work is offloaded.

  36. Step 3: Runtime – Scheduling Heuristics (optional). At data size 2^10, CPU compute is estimated cheaper than GPU compute + data transfer, so everything stays on the CPU.

  37. Step 3: Runtime – Scheduling Heuristics (optional). Naive implementations of cost estimators. (Figure: estimated cost vs. data size from 2^10 to 2^28; the CPU-compute curve wins at small sizes, the GPU-compute-plus-transfer curve wins past the crossover.)
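A toy version of such an estimator, with made-up constants chosen only to reproduce the crossover the slides describe:

    # A naive cost model (all constants are invented illustrations).
    CPU_COST_PER_ELEM = 1.0       # relative cost of one element on the CPU
    GPU_COST_PER_ELEM = 0.01      # GPUs are far faster per element...
    TRANSFER_COST_PER_ELEM = 0.1  # ...but PCI-e transfers are not free
    GPU_FIXED_COST = 10_000.0     # kernel-launch / setup overhead

    def pick_device(n_elements):
        """Offload only when estimated GPU cost beats estimated CPU cost."""
        cpu = CPU_COST_PER_ELEM * n_elements
        gpu = GPU_FIXED_COST + (GPU_COST_PER_ELEM + TRANSFER_COST_PER_ELEM) * n_elements
        return 'GPU' if gpu < cpu else 'CPU'

    assert pick_device(2 ** 10) == 'CPU'  # small input: overhead dominates
    assert pick_device(2 ** 28) == 'GPU'  # large input: GPU compute wins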

  38. Evaluation: 4 library integrations and 8 data science and ML workloads.

  39. Integration Experience: ~130 LOC per library, including the offloading/splitting APIs and function annotations.

  40. Evaluation: Summary. Speedup: max 1200x, median 6.3x.

  41. Evaluation: Summary. (Figure: per-workload speedup results.)

  42-44. Evaluation: Summary. With less developer effort, Bach can: 1. Match handwritten GPU performance. 2. Scale to data sizes larger than GPU memory. 3. Beat CPU performance. (Highlighted speedups: 2.3x and 6.8x.)

  45. In-Depth Evaluation: Allocations. Crime Index saves time by eliminating the initial data transfer, while the allocation still fits in GPU memory. (Highlighted speedups: 1.1x and 4.6x.)

  46. In-Depth Evaluation: Heuristics. At smaller data sizes, TSVD schedules all computation on the CPU. (Highlighted speedup: 11x.)
