
CS 744: TPU. Shivaram Venkataraman, Fall 2020



  1. Good morning! CS 744: TPU. Shivaram Venkataraman, Fall 2020

  2. Administrivia
     Next Tue: Fairness in ML, course summary
     Thu, Dec 3rd: Midterm 2
     – Papers from SCOPE to TPU
     – Similar format, etc.; details on Piazza
     Project presentations the week after: Dec 8, 10
     – 4-min talks; sign-up sheet; 3-4 slides
     – Presentation template: problem statement, approach, initial results
     – Trial run?
     Dec 17: in-progress / final report

  3. MOTIVATION
     Capacity demands on datacenters: new workloads, e.g., voice search → ML model to convert speech to text
     Metrics:
     – Latency: workloads are latency sensitive (tail latency matters)
     – Performance/operation and Power/operation
     – Total cost of ownership: buy (build) + operate
     Goal: Improve cost-performance by 10x over GPUs

  4. WORKLOAD
     Not only CNNs: MLPs alone are 61% of the workload; CNNs are only 5%
     Table columns: number of weights, ops/byte, batch size; batch size and ops/byte are correlated
     CNNs have very high ops/byte; MLPs and LSTMs have similar, much lower ops/byte
     Examples: DNN: RankBrain; LSTM: subset of GNM Translate; CNNs: Inception, DeepMind AlphaGo

  5. WORKLOAD: ML INFERENCE
     Quantization: convert model weights from 32-bit float to 8-bit integer → lower precision, lower energy use
     8-bit integer multiplies (unlike training): 6X less energy and 6X less area
     Need for predictable latency, not throughput: e.g., 7ms at the 99th percentile
     Caches, branch prediction, etc. only improve the average-case scenario; inference focuses on the tail
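The quantization step on this slide can be sketched as a symmetric linear mapping from float32 to int8. This is a minimal illustration of the idea, not the TPU's actual quantization scheme; the function names are mine.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric linear quantization: map max |w| to 127."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover an approximation of the original float weights."""
    return q.astype(np.float32) * scale

w = np.array([0.5, -1.0, 0.25, 0.9], dtype=np.float32)
q, s = quantize_int8(w)
w_hat = dequantize(q, s)   # close to w, at 1/4 the storage
```

The small reconstruction error is the precision the TPU trades for 6X less multiplier energy and area.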

  6. TPU DESIGN: CONTROL
     PCIe interconnect for compatibility, but it has limited bandwidth and latency
     Instructions are issued from the host and queued in an on-chip instruction buffer
     Simple, single-threaded design!

  7. COMPUTE
     Matrix Multiply Unit: a 256x256 array of 8-bit MACs (Multiply-ACcumulate units), about 24% of the chip area
     Handles fully connected layers and convolutions
     8-bit integer compute; products are accumulated at wider (16/32-bit) precision
     Separate unit for Activation and for Normalize/Pool
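Why accumulate at wider precision? A sketch, with numbers of my choosing: even a tiny dot product of 8-bit values overflows 8-bit (and quickly 16-bit) storage, so the MAC array's accumulators must be wide.

```python
import numpy as np

def int8_matmul(a, b):
    """8-bit operands, wide (32-bit) accumulation, as in a MAC array."""
    assert a.dtype == np.int8 and b.dtype == np.int8
    return a.astype(np.int32) @ b.astype(np.int32)

# Worst case for 4-element rows: 4 * 127 * 127 = 64516,
# which already exceeds the int16 maximum (32767).
a = np.full((4, 4), 127, dtype=np.int8)
b = np.full((4, 4), 127, dtype=np.int8)
c = int8_matmul(a, b)
```

With 256-element rows the sums grow 64x larger still, which is why the hardware accumulates into 32-bit registers.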

  8. DATA
     Models (the weights) are stored in off-chip DRAM; models up to 8GB in size can fit
     A x B = C: inputs and results live in an on-chip Unified Buffer
     Pipelined execution: fetching of weights is overlapped with the matrix multiply; intermediate results are accumulated and then stored in the Unified Buffer

  9. INSTRUCTIONS
     CISC instruction set (why?): specialized instructions that encode operations taking many cycles to run
     1. Read_Host_Memory
     2. Read_Weights
     3. MatrixMultiply/Convolve
     4. Activate
     5. Write_Host_Memory
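One inference pass maps naturally onto this five-instruction sequence. The sketch below is a hypothetical host-side view of that program; the operand fields are illustrative, not the real instruction encoding.

```python
# Hypothetical instruction stream for one inference pass.
# Each entry is one CISC instruction; MatrixMultiply alone
# can run for many cycles, which is the point of the design.
PROGRAM = [
    ("Read_Host_Memory",  {"dst": "unified_buffer"}),  # inputs over PCIe
    ("Read_Weights",      {"src": "weight_dram"}),     # weights from off-chip DRAM
    ("MatrixMultiply",    {"rows": 256}),              # long-running matrix op
    ("Activate",          {"fn": "relu"}),             # activation/pool unit
    ("Write_Host_Memory", {"src": "unified_buffer"}),  # results back to host
]
ops = [op for op, _ in PROGRAM]
```

A simple in-order queue of such instructions is enough because the device is single-threaded.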

  10. SYSTOLIC EXECUTION
      Problem: Reading a large SRAM uses much more power than arithmetic!
      Typical CPU: compute units have their inputs fed from registers and caches (L1, L2, etc.)
      TPU: wave-like propagation of data through the array
      → Data reuse for every element
      → Predictable execution, predictable performance
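The data-reuse idea can be sketched as a toy weight-stationary pass: each input element is read once and flows past a whole column of multiply-accumulate cells, instead of every cell re-reading SRAM. This is a functional simulation of the "wave", not the real hardware schedule.

```python
import numpy as np

def systolic_matvec(W, x):
    """Toy weight-stationary systolic pass: one wavefront per input
    element. x[j] propagates down column j, and each cell adds
    W[i, j] * x[j] into its row's accumulator."""
    n, m = W.shape
    acc = np.zeros(n, dtype=np.int32)
    for j in range(m):                      # one wavefront per cycle
        acc += W[:, j].astype(np.int32) * int(x[j])
    return acc

W = np.arange(6, dtype=np.int8).reshape(2, 3)
x = np.array([1, 2, 3], dtype=np.int8)
y = systolic_matvec(W, x)
```

Each x[j] is fetched exactly once but used n times, which is where the power savings over SRAM reads come from.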

  11. ROOFLINE MODEL
      x-axis: operational intensity (MAC ops per weight byte), the amount of compute per byte of data read
      y-axis: TeraOps/second
      The blue roofline comes from the hardware spec: a sloped part (memory-bandwidth bound) and a flat part (compute bound)
      Compute-intensive CNNs are compute bound, close to the peak performance of the hardware; MLPs and LSTMs are memory bound
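The roofline itself is just a min of two terms. A minimal sketch, using the TPU paper's published peak (92 TeraOps/s) and weight-memory bandwidth (34 GB/s); the example intensities are illustrative.

```python
def attainable_tops(peak_tops, mem_bw_gbs, oi):
    """Roofline: attainable throughput is the lesser of the compute
    roof and bandwidth * operational intensity (ops per byte)."""
    # GB/s * ops/byte = GOps/s; divide by 1000 for TeraOps/s
    return min(peak_tops, mem_bw_gbs * oi / 1000.0)

peak, bw = 92.0, 34.0
low_oi  = attainable_tops(peak, bw, 100)    # MLP/LSTM-like: memory bound
high_oi = attainable_tops(peak, bw, 3000)   # CNN-like: compute bound
```

At low operational intensity the bandwidth term dominates (the slope); past the ridge point the peak term dominates (the flat roof), which is exactly where CNNs sit.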

  12. HASWELL ROOFLINE
      The CPU roofline's sloped part ends at a lower ops/weight byte, and its flat part sits at a much lower TeraOps/second
      Measured points fall well below the roofline
      Operational Intensity: MAC Ops/weight byte

  13. COMPARISON WITH CPU, GPU
      CPUs spend much of their area and power on caches (L1, L2, L3); the TPU trades these for compute, giving much higher performance/Watt
      GPUs bring power down (~2x) when idle and configured for it; the CPU does not drop as much, compared to the TPU

  14. SELECTED LESSONS
      • Latency more important than throughput for inference
      • LSTMs and MLPs are more common than CNNs
      • Performance counters are helpful
      • Compilers for DNN models can also improve performance
      • Remember architecture history

  15. SUMMARY
      New workloads → new hardware requirements
      Domain-specific design (understand workloads!)
      No features to improve the average case: no caches, branch prediction, out-of-order execution, etc.
      Simple design with MACs and a Unified Buffer gives efficiency
      Drawbacks: no sparse support, no training support (TPU v2, v3); vendor specific?

  16. DISCUSSION https://forms.gle/tss99VSCMeMjZx7P6

  17. ① Larger batches have higher throughput (inferences/sec), but also higher tail latency
      ② TPU throughputs are much higher while still meeting the target latency
      ③ TPU can run at a higher batch size compared to CPU while meeting 7ms at the 99th percentile
      ④ GPU has higher IPS at the same batch size, but its higher latency relates to the average case
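Point ① can be made concrete with a toy cost model: a batch pays one fixed overhead plus a per-item cost, so throughput rises with batch size while every request's latency rises too. All numbers here are made up for illustration.

```python
def batch_tradeoff(batch, per_item_us=10.0, fixed_us=500.0):
    """Toy model: one fixed cost per batch, linear per-item cost.
    Returns (latency in microseconds, throughput in items/sec)."""
    latency_us = fixed_us + per_item_us * batch
    throughput = batch / (latency_us / 1e6)
    return latency_us, throughput

lat1, tput1 = batch_tradeoff(1)     # low latency, low throughput
lat64, tput64 = batch_tradeoff(64)  # higher latency, much higher throughput
```

Amortizing the fixed cost is why batching helps throughput, and waiting for the whole batch is why a 99th-percentile latency target caps how large the batch can be.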

  18. How would TPUs impact serving frameworks like Clipper? Discuss what specific effects it could have on distributed serving systems architecture
      ① Many models could share a TPU's 8GB of weight memory, but Clipper's one-model-per-container design might break this
      ② Stragglers are less frequent, which is helpful
      ③ Batching (auto-batching) can be very helpful

  19. NEXT STEPS No class Thursday! Happy Thanksgiving! Next week schedule: Tue: Fairness in ML, Summary Thu: Midterm 2

  20. ENERGY PROPORTIONALITY
