  1. Hi! CS 744: PYTORCH — Shivaram Venkataraman, Fall 2020

  2. ADMINISTRIVIA
     - Assignment 2 out! Due Oct 1
     - Bid on topics, submit group (1 sentence) – Oct 5 (Monday next week)
     - Project Proposal (2 pages) – Oct 16, on Piazza: Introduction, Related Work, Timeline (with eval plan)

  3. The course stack:
     - Applications: Machine Learning, SQL, Streaming, Graph (e.g., MapReduce)
     - Computational Engines (e.g., Spark)
     - Scalable Storage Systems
     - Resource Management (e.g., Mesos, DRF)
     - Datacenter Architecture

  4. EMPIRICAL RISK MINIMIZATION — fit a function (model) to data: given training examples and their labels, choose a model f that fits the data, plus regularization.
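In its standard form (not spelled out on the slide), the ERM objective picks model parameters w to minimize the average loss over the n training examples plus a regularizer:

```latex
\min_{w} \;\; \frac{1}{n} \sum_{i=1}^{n} \ell\!\left( f(x_i; w),\, y_i \right) \;+\; \lambda R(w)
```

Here f is the model, (x_i, y_i) are the training examples and labels, \ell is the loss function, and \lambda R(w) is the regularization term.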

  5. DEEP LEARNING — example model: ResNet18. Typical layer types: Convolution, ReLU, MaxPool, Fully Connected, SoftMax.

  6. STOCHASTIC GRADIENT DESCENT — for a good fit: initialize model weights w; for many iterations: Loss = forward pass f(w, input); Gradient = backward pass (chain rule); update the model. How do we parallelize this? The model is shared, and every iteration depends on the previous one.
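A minimal PyTorch sketch of that loop (the linear model and random data are stand-ins, not from the slide):

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(10, 1)                       # initialize w
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.MSELoss()
data = [(torch.randn(32, 10), torch.randn(32, 1)) for _ in range(100)]

for x, y in data:                                    # for many iterations
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)                      # loss = forward pass
    loss.backward()                                  # gradient = backward pass (chain rule)
    optimizer.step()                                 # update the (shared) model
```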

  7. DATA PARALLEL MODEL TRAINING — parallelize within one iteration. Split the data (e.g., 256 points into four batches B1..B4 of 64 each). Each worker runs a forward pass f(model, Bi) and computes a gradient on its own batch; the gradients are averaged and a single update step is applied, so every worker starts the next iteration with the same model.
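A single-process simulation of one such iteration, using the shapes from the slide (256 points, four shards of 64; the model and data are placeholders):

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(10, 1)
loss_fn = torch.nn.MSELoss()
X, Y = torch.randn(256, 10), torch.randn(256, 1)
shards = list(zip(X.chunk(4), Y.chunk(4)))           # B1..B4, 64 points each

grads = [torch.zeros_like(p) for p in model.parameters()]
for x, y in shards:                                  # each shard plays one worker
    model.zero_grad()
    loss_fn(model(x), y).backward()                  # forward + backward on Bi
    for g, p in zip(grads, model.parameters()):
        g += p.grad / len(shards)                    # average the four gradients

with torch.no_grad():                                # one shared update step
    for p, g in zip(model.parameters(), grads):
        p -= 0.01 * g                                # all workers now hold the same model
```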

  8. COLLECTIVE COMMUNICATION — MPI-style primitives: Broadcast, Scatter, Gather, Reduce. Reduce combines a vector from every process at a root, e.g., summing 5 + 2 + 7 + 4 = 18 element-wise. From https://mpitutorial.com/tutorials/
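The same primitives exist in torch.distributed. A small sketch reproducing the slide's reduce example (assumes four processes launched with torchrun; the file name is arbitrary):

```python
import torch
import torch.distributed as dist

# Run with: torchrun --nproc_per_node=4 collectives.py
dist.init_process_group("gloo")            # CPU backend; supports all four ops
rank = dist.get_rank()

# Broadcast: the root's tensor is copied to every process.
t = torch.tensor([1.0]) if rank == 0 else torch.zeros(1)
dist.broadcast(t, src=0)

# Reduce: the element-wise sum of every process's tensor lands at the root,
# e.g. 5 + 2 + 7 + 4 = 18 as in the figure.
vals = [5.0, 2.0, 7.0, 4.0]                # one value per rank (assumes 4 ranks)
mine = torch.tensor([vals[rank]])
dist.reduce(mine, dst=0, op=dist.ReduceOp.SUM)
if rank == 0:
    print(mine)                            # tensor([18.])
```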

  9. ALL REDUCE — Ring AllReduce among processes P0..P3: data flows around a ring, and every process ends up with the reduced result. From https://mpitutorial.com/tutorials/
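A single-process simulation of the ring algorithm the figure depicts (a sketch of the idea, not the NCCL implementation): each node's vector is split into N chunks; a reduce-scatter pass leaves each node holding one fully reduced chunk, then an allgather pass circulates the reduced chunks so every node ends with the complete result.

```python
import torch

def ring_allreduce(node_data):
    """Simulate ring AllReduce over n nodes; node_data[i] is node i's vector."""
    n = len(node_data)
    chunks = [list(d.clone().float().chunk(n)) for d in node_data]

    # Reduce-scatter: after n-1 steps, node i holds the fully reduced
    # chunk (i + 1) % n.
    for step in range(n - 1):
        sent = [(i, (i - step) % n) for i in range(n)]        # (sender, chunk id)
        bufs = [chunks[i][c].clone() for i, c in sent]        # snapshot before mutating
        for (i, c), buf in zip(sent, bufs):
            chunks[(i + 1) % n][c] += buf                     # add into the neighbor

    # Allgather: circulate the reduced chunks around the ring.
    for step in range(n - 1):
        sent = [(i, (i + 1 - step) % n) for i in range(n)]
        bufs = [chunks[i][c].clone() for i, c in sent]
        for (i, c), buf in zip(sent, bufs):
            chunks[(i + 1) % n][c] = buf                      # overwrite neighbor's copy

    return [torch.cat(c) for c in chunks]

data = [torch.arange(8.0) * (i + 1) for i in range(4)]        # P0..P3
out = ring_allreduce(data)
assert all(torch.allclose(o, sum(data)) for o in out)         # every node has the sum
```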

  10. DISTRIBUTED DATA PARALLEL API — only one line of code changes: wrap the local model. Non-intrusive; hooks registered by DDP do the optimizations in the background.
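The change on the slide, in code (the linear model is a placeholder; assumes a torchrun launch with one process per GPU):

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group("nccl")            # torchrun sets rank/world-size env vars
rank = dist.get_rank()
torch.cuda.set_device(rank)

model = torch.nn.Linear(10, 1).to(rank)    # the local model, unchanged
model = DDP(model, device_ids=[rank])      # the one-line change

# The training loop is untouched: hooks installed by DDP run the
# gradient AllReduce in the background during loss.backward().
```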

  11. GRADIENT BUCKETING — why do we need gradient bucketing? AllReduce on small tensors leads to greater time: every call pays a fixed latency/handoff overhead, and a model with ~60M parameters has many small gradient tensors. Why not one big bucket? Then we would have to wait for all gradients to be ready before communicating, so the AllReduce could not overlap with the backward pass.
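A back-of-the-envelope cost model makes the trade-off concrete (the latency and bandwidth figures below are assumptions for illustration, not from the paper):

```python
latency = 50e-6            # assumed fixed cost per AllReduce call (50 us)
bandwidth = 10e9           # assumed link bandwidth (10 GB/s)
grad_bytes = 60e6 * 4      # a 60M-parameter model's fp32 gradients (240 MB)

def allreduce_time(num_calls):
    # Every call pays the fixed latency; the total bytes moved are the same.
    return num_calls * latency + grad_bytes / bandwidth

print(allreduce_time(60_000))   # one call per small tensor: ~3.0 s, latency-dominated
print(allreduce_time(10))       # ~25 MB buckets: ~0.025 s
```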

  12. GRADIENT BUCKETING + ALL REDUCE — parameters (layers) become buckets, ~25 MB in size by default. As soon as all gradients in a bucket are ready, we start an AllReduce on them in the background while gradient computation continues.
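A toy version of this mechanism using per-parameter hooks (a simplified sketch of the idea, not PyTorch's actual DDP internals; needs PyTorch >= 2.1 for register_post_accumulate_grad_hook, and real DDP additionally flattens each bucket into a single tensor before the AllReduce):

```python
import torch
import torch.distributed as dist

def attach_bucket_hooks(model, world_size, bucket_cap_bytes=25 * 1024 * 1024):
    """Group parameters into ~25 MB buckets; when the last gradient in a
    bucket lands, launch an async all_reduce so communication overlaps
    with the rest of the backward pass."""
    handles = []

    # Reverse order: backward produces gradients from the last layer first,
    # so these buckets fill (and can start reducing) early.
    buckets, current, size = [], [], 0
    for p in reversed([p for p in model.parameters() if p.requires_grad]):
        current.append(p)
        size += p.numel() * p.element_size()
        if size >= bucket_cap_bytes:
            buckets, current, size = buckets + [current], [], 0
    if current:
        buckets.append(current)

    for bucket in buckets:
        remaining = {"n": len(bucket)}

        def hook(param, bucket=bucket, remaining=remaining):
            remaining["n"] -= 1
            if remaining["n"] == 0:                  # whole bucket is ready
                for p in bucket:
                    p.grad.div_(world_size)          # average across workers
                    handles.append(dist.all_reduce(p.grad, async_op=True))
                remaining["n"] = len(bucket)         # re-arm for the next iteration

        for p in bucket:
            p.register_post_accumulate_grad_hook(hook)

    return handles  # after backward(): [h.wait() for h in handles]; handles.clear()
```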

  13. GRADIENT ACCUMULATION — with no_sync(), workers skip the AllReduce and accumulate gradients locally across micro-batches (B1, B2, B3, ...), then synchronize once every few iterations, amortizing the communication cost.
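A sketch of gradient accumulation with DDP's no_sync() context manager (model, loader, criterion, optimizer, and the accumulation factor k are assumed names, continuing the earlier sketch):

```python
import contextlib

k = 4                                        # synchronize once every k micro-batches
for step, (x, y) in enumerate(loader):
    last = (step + 1) % k == 0
    ctx = contextlib.nullcontext() if last else model.no_sync()
    with ctx:
        loss = criterion(model(x), y)
        loss.backward()                      # inside no_sync(): AllReduce is skipped,
                                             # gradients accumulate locally
    if last:                                 # the k-th backward ran WITH sync
        optimizer.step()
        optimizer.zero_grad()
```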

  14. IMPLEMENTATION — bucket_cap_mb is a tunable parameter (~25 MB by default): too small, and the fixed per-AllReduce overhead dominates; too large, and there is no overlap with the backward pass. The parameter-to-bucket mapping is fixed up front and buckets fill in roughly reverse layer order, since backward produces gradients from the last layer first. Round-robin ProcessGroups spread AllReduce calls across multiple communication channels.
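bucket_cap_mb is exposed directly on the DDP constructor (model and rank as in the earlier sketch):

```python
model = DDP(model, device_ids=[rank], bucket_cap_mb=25)   # 25 MB is the default
# Smaller buckets -> more AllReduce calls, more fixed latency overhead.
# Larger buckets  -> fewer calls, but less overlap with the backward pass.
```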

  15. BREAKDOWN

  16. SUMMARY
     - PyTorch: framework for deep learning
     - DistributedDataParallel API
     - Gradient bucketing, AllReduce
     - Overlap computation and communication

  17. DISCUSSION https://forms.gle/6xhVBNBhdzsJ6gBE6

  18. Discussion notes: DDP scales well; the optimal bucket size depends on the model and the network; NCCL performs better, and with less variance, than the alternatives.

  19. Does this paper scale well? Distinguish weak scaling (per-GPU batch size fixed, e.g., B = 64, while increasing the number of GPUs up to 256, so the total batch grows) from strong scaling (total batch size fixed and split across more GPUs).

  20. What could be some challenges in implementing similar optimizations for AllReduce in Apache Spark?
     - Spark workloads often have much larger datasets; each Spark worker node needs a shuffle operation to reduce, which is more expensive than NCCL AllReduce.
     - Spark aggregates via reduce/treeReduce through the driver rather than a bandwidth-optimal ring.
     - Overlapping compute and communication is harder when tasks compete for the same resources.

  21. NEXT STEPS
     - Next class: PipeDream
     - Assignment 2 is due soon!
     - Project Proposal: groups by Oct 5, 2-pager by Oct 16
