CLUSTAR: AI Training Platform Powered by High Performance Networking


  1. CLUSTAR: AI Training Platform Powered by High Performance Networking. Junxue ZHANG, EVP CLUSTAR; PhD, SING Lab, HKUST. August 1, 2018

  2. Deep Learning Is Becoming Increasingly Important: Computer Vision, Natural Language Processing, Self-driving Cars

  3. How Does Deep Learning Work? A single neuron computes z = w · x + b, with initial parameters w = 1, b = 1. Mini batch of (x, z_true) pairs: (1, 5), (2, 7). Input Layer → Output Layer

  4. How Does Deep Learning Work? Forward pass: with w = 1, b = 1, z = w · x + b gives z = 2 for x = 1 (z_true = 5) and z = 3 for x = 2 (z_true = 7).

  5. How Does Deep Learning Work? Calculating loss: L = ½ Σᵢ (zᵢ − z_true,i)². For this mini batch, L = ½ [(2 − 5)² + (3 − 7)²] = 12.5.

  6. How Does Deep Learning Work? Backpropagation applies the chain rule:
     ∂L/∂w = (∂L/∂z) · (∂z/∂w) = Σᵢ (zᵢ − z_true,i) · xᵢ = (2 − 5) · 1 + (3 − 7) · 2 = −11
     ∂L/∂b = (∂L/∂z) · (∂z/∂b) = Σᵢ (zᵢ − z_true,i) = (2 − 5) + (3 − 7) = −7
     Gradient descent with learning rate η = 0.1:
     w ← w − η · ∂L/∂w = 1 − 0.1 × (−11) = 2.1
     b ← b − η · ∂L/∂b = 1 − 0.1 × (−7) = 1.7

  7. How Does Deep Learning Work? Next iteration: with the updated parameters w = 2.1, b = 1.7, the next mini batch (x, z_true) = (3, 9), (5, 13) goes through the same forward pass, loss calculation, and backpropagation.
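The single-neuron iteration on the slides above can be reproduced in a few lines of Python (a minimal sketch; the variable names are mine, the numbers are from the slides):

```python
# One SGD iteration on the toy neuron z = w * x + b.
# Mini batch (x, z_true) = (1, 5), (2, 7); initial w = 1, b = 1; learning rate 0.1.
batch = [(1, 5), (2, 7)]
w, b, lr = 1.0, 1.0, 0.1

# Forward pass: z = 2 for x = 1 and z = 3 for x = 2.
preds = [w * x + b for x, _ in batch]

# Loss: L = 1/2 * sum((z - z_true)^2) = 0.5 * (9 + 16) = 12.5
loss = 0.5 * sum((z - zt) ** 2 for z, (_, zt) in zip(preds, batch))

# Backpropagation: dL/dw = sum((z - z_true) * x) = -11, dL/db = sum(z - z_true) = -7
dw = sum((z - zt) * x for z, (x, zt) in zip(preds, batch))
db = sum(z - zt for z, (_, zt) in zip(preds, batch))

# Gradient descent update: w moves to 2.1 and b to 1.7, matching the slides.
w -= lr * dw
b -= lr * db
```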

  8. How Does Deep Learning Work? Input Layer → Hidden Layer → Output Layer

  9. How Does Deep Learning Work? In a multi-layer network, the forward pass runs layer by layer from the input layer through the hidden layer to the output layer; the loss is calculated at the output; backpropagation then passes each layer's gradients (Δw per layer) back toward the input.
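Slide 9's layer-by-layer picture can be sketched with a stack of scalar linear layers (a hedged illustration; the function names and the scalar simplification are mine, not from the slides):

```python
# Forward pass: run the input through each (w, b) layer, keeping every
# intermediate activation, since backpropagation needs them.
def forward(layers, x):
    acts = [x]
    for w, b in layers:
        acts.append(w * acts[-1] + b)
    return acts

# Backpropagation: walk the layers in reverse, applying the chain rule.
# For L = 1/2 * (z_out - target)^2 the output gradient is (z_out - target).
def backward(layers, acts, target, lr=0.01):
    grad = acts[-1] - target           # dL/dz at the output layer
    updated = []
    for (w, b), a in zip(reversed(layers), reversed(acts[:-1])):
        dw, db = grad * a, grad        # dL/dw = dL/dz * input, dL/db = dL/dz
        grad = grad * w                # propagate dL/dz to the previous layer
        updated.append((w - lr * dw, b - lr * db))
    return list(reversed(updated))
```

One forward/backward round moves the output toward the target, mirroring the single-neuron iteration on the earlier slides.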

  10. Big Data Drives a New Paradigm for Training: 1. the data is too large to fit in a single machine; 2. the training time is too long (Uber: it usually takes weeks or longer to complete [1]).

  11. Networking Plays an Important Role. A parameter server holds the model parameters x₁, x₂; the training data is split into Partition 1 on Worker 1 and Partition 2 on Worker 2, connected to the server by the network.

  12. Networking Plays an Important Role. Step 1: each worker pulls the current parameters x₁, x₂ from the parameter server over the network.

  13. Networking Plays an Important Role. Step 2: each worker runs the forward pass on the inputs from its own data partition.

  14. Networking Plays an Important Role. Step 3: each worker calculates the loss for its partition.

  15. Networking Plays an Important Role. Step 4: each worker runs backpropagation, producing local gradients for x₁, x₂.

  16. Networking Plays an Important Role. Step 5: each worker pushes its parameter updates to the server over the network.

  17. Networking Plays an Important Role. Every iteration repeats this pull/compute/push cycle over the network, so networking is critical to performance!
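The pull/compute/push cycle on slides 11-17 can be condensed into one hedged sketch (the names are illustrative, and a real system ships gradients over the network rather than calling local functions):

```python
# One parameter-server round: every worker pulls the current parameters,
# computes a gradient on its own data partition, and pushes it back; the
# server aggregates the gradients and updates the parameters.
def ps_round(params, partitions, grad_fn, lr=0.1):
    grads = [grad_fn(params, part) for part in partitions]        # pull + local compute
    agg = [sum(g[i] for g in grads) for i in range(len(params))]  # push + aggregate
    return [p - lr * g for p, g in zip(params, agg)]

# Local gradient for the toy neuron z = w * x + b used earlier in the deck.
def toy_grad(params, part):
    w, b = params
    return [sum((w * x + b - zt) * x for x, zt in part),  # dL/dw
            sum((w * x + b - zt) for x, zt in part)]      # dL/db

# Splitting the slide's mini batch across two workers reproduces the same
# update as single-machine training: w -> 2.1, b -> 1.7.
new_params = ps_round([1.0, 1.0], [[(1, 5)], [(2, 7)]], toy_grad)
```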

  18. Networking Plays an Important Role. The speedup achieved after utilizing the 40Gbps networking bandwidth with CLUSTAR:

      Model    Logistic Regression  Multi-layer Perceptron  AlexNet  VGG-16  ResNet-50
      Speedup  2.59x                3.45x                   1.6x     1.33x   1.03x

  19. CLUSTAR: AI Training Platform Powered by High Performance Networking. Networking is as important to an AI system as the traffic system is to a city: between 2 machines (wider roads), multiple machines (traffic scheduling), AI protocol (new traffic rules for AI). Key technologies (world-leading research achievements):
      - GDR: towards zero-copy data flow; utilizes RDMA and GPUDirect; integrated with TensorFlow
      - ParaExpress: resilient and adaptive parameter aggregation; tackles the disadvantages of Parameter Server & Ring AllReduce
      - MLT: utilizes the SGD nature of AI training; semi-loss tolerance; model quality awareness
      - Smart Networking Scheduling: co-flow scheduling; elephant & mice flow scheduling

  22. ๅฏ็ผ–็จ‹โฝน็ฝ’็ปœ Broadcom FPGA ASIC RDMAโฝน็ฝ’็ปœ Intel Nvidia AMD ๅฏ’ๆญฆ็บช Mellanox P4 CPU E8 Storage Clustar AI Fabrics RoCE ๆ™บ่ƒฝโฝน็ฝ’ๅก Sparkไผ˜ๅŒ– TensorFlowไผ˜ๅŒ– ๅฎนๅ™จ๏จน็ผ–ๆŽ’ๅผ•ๆ“Ž ไบคไบ’็ผ–็จ‹็•Œโพฏ้ฃ GPU ๅ…จ้—ชๅญ˜ๅญ˜ๅ‚จ ไปถ โพƒ่‡ซๅŠจ้ฉพ้ฉถ ๅŸบ ็ก€ ่ฎพ ๆ–ฝ ๅบ”โฝค็”ฉ ็กฌ โพฆ้‡‘๏ค‹่žโพ่กŒ๏จ‰ไธšๅบ”โฝค็”ฉ ่ฏญโพณ้Ÿด่ฏ†ๅˆซ โพƒ่‡ซ็„ถ่ฏญโพ”่จๅค„็†๏งฅ ่ฎก็ฎ—ๆœบ่ง†่ง‰ ๆ™บ่ƒฝๅๆฌบ่ฏˆ ๆ™บ่ƒฝโฝ†ๆ—กโผˆไบปๆœบ ๅฎ‰้˜ฒโพ่กŒ๏จ‰ไธšๅบ”โฝค็”ฉ ไบ’่”โฝน็ฝ’โพ่กŒ๏จ‰ไธšๅบ”โฝค็”ฉ ๅˆถ้€ ไธšโพ่กŒ๏จ‰ไธšๅบ”โฝค็”ฉ ๅŒป็–—โพ่กŒ๏จ‰ไธšๅบ”โฝค็”ฉ โฝค็”ฉ ้€š ๆ”ฟๅบœโพ่กŒ๏จ‰ไธšๅบ”โฝค็”ฉ CLUSTAR Platform ๆ˜Ÿ ๆ•ฐๆฎ้ข„ๅค„็† ็ฆป็บฟ่ฎญ็ปƒ ๅœจ็บฟ่ฎญ็ปƒ ๅคš็งŸๆˆท็ฎก็† ไปปๅŠก่ฐƒๅบฆ ่ฟ็ปด็›‘ๆŽง ไบ‘ ๅนณ ๅฐ 47

  23. GDR: Towards Zero Copy Data Flow. In a plain RDMA setup (Server 1 and Server 2, each with two CPU sockets, host memory, GPUs, and an RDMA NIC connected over the data center network), data must be copied between GPU RAM and host memory before the RNIC can send or receive it. This unnecessary copy enlarges latency, degrades throughput, and burns CPU cycles.

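The copy path the GDR slides criticize can be made concrete with a toy model (the hop lists are an illustration of the slides' diagram, not measured behavior):

```python
# Data path for sending a tensor from GPU RAM on Server 1 to GPU RAM on
# Server 2. Without GPUDirect, every transfer bounces through host memory.
PLAIN_RDMA = ["GPU RAM -> host memory", "host memory -> RNIC", "network",
              "RNIC -> host memory", "host memory -> GPU RAM"]

# With GPUDirect RDMA, the RNIC reads and writes GPU RAM directly.
GPUDIRECT_RDMA = ["GPU RAM -> RNIC", "network", "RNIC -> GPU RAM"]

def host_copies(path):
    # The copies through host memory are the ones that add latency and burn CPU.
    return sum("host memory" in hop for hop in path)
```

Counting the host-memory hops shows four avoidable copies on the plain path and none on the GPUDirect path, which is the slides' point.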
