CLUSTAR: AI Training Platform Powered by High Performance Networking
Junxue Zhang, EVP, CLUSTAR; PhD, SING Lab, HKUST
August 1, 2018
Deep Learning Is Becoming Increasingly Important
Computer vision, natural language processing, self-driving cars.
How Does Deep Learning Work?
Model: $z = w \cdot x + b$, starting with $w = 1$, $b = 1$.
Mini batch: $(x, z_{true}) = (1, 5), (2, 7)$.
Input layer $\rightarrow$ output layer.
How Does Deep Learning Work?
Forward pass: with $w = 1$, $b = 1$, the mini batch $(1, 5), (2, 7)$ gives outputs $z(1) = 2$ and $z(2) = 3$.
How Does Deep Learning Work?
Calculating loss: $C = \frac{1}{2}\sum (z - z_{true})^2$.
With the forward-pass outputs $z = 2, 3$ against targets $z_{true} = 5, 7$: $C = \frac{1}{2}\left((2-5)^2 + (3-7)^2\right) = 12.5$.
How Does Deep Learning Work?
Backpropagation: by the chain rule,
$\frac{\partial C}{\partial w} = \frac{\partial C}{\partial z} \cdot \frac{\partial z}{\partial w} = \sum (z - z_{true}) \cdot x = -11$
$\frac{\partial C}{\partial b} = \frac{\partial C}{\partial z} \cdot \frac{\partial z}{\partial b} = \sum (z - z_{true}) = -7$
Gradient descent updates with learning rate $\eta = 0.1$:
$w \leftarrow w - \eta \frac{\partial C}{\partial w} = 1 - 0.1 \cdot (-11) = 2.1$
$b \leftarrow b - \eta \frac{\partial C}{\partial b} = 1 - 0.1 \cdot (-7) = 1.7$
How Does Deep Learning Work?
Next iteration: with the updated parameters $w = 2.1$, $b = 1.7$, training repeats on the next mini batch $(3, 9), (5, 13)$.
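To make the iteration concrete, here is a minimal sketch that reproduces the numbers from these slides, assuming the loss $C = \frac{1}{2}\sum (z - z_{true})^2$ and learning rate $0.1$ used above:

```python
# Minimal sketch of one SGD iteration from the slides:
# model z = w*x + b, starting at w = 1, b = 1,
# mini batch {(1, 5), (2, 7)}, learning rate 0.1,
# loss C = 1/2 * sum((z - z_true)^2).

def sgd_step(w, b, batch, lr=0.1):
    # Forward pass: z = w*x + b for every sample in the mini batch.
    preds = [(x, w * x + b, z_true) for x, z_true in batch]

    # Loss: C = 1/2 * sum((z - z_true)^2).
    loss = 0.5 * sum((z - z_true) ** 2 for _, z, z_true in preds)

    # Backpropagation: dC/dw = sum((z - z_true) * x), dC/db = sum(z - z_true).
    dw = sum((z - z_true) * x for x, z, z_true in preds)
    db = sum(z - z_true for _, z, z_true in preds)

    # Gradient descent update.
    return w - lr * dw, b - lr * db, loss

w, b, loss = sgd_step(1.0, 1.0, [(1, 5), (2, 7)])
print(w, b)  # 2.1, 1.7 -- matching the updates on the slide
```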
How Does Deep Learning Work?
Real networks stack layers: input layer, hidden layer, output layer.
How Does Deep Learning Work?
With hidden layers, each iteration runs the forward pass layer by layer, calculates the loss at the output, then backpropagates gradients $\partial w$ through every layer's weights in turn.
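A short sketch of how the single-neuron procedure generalizes to a hidden layer; the layer sizes and the linear activations here are illustrative assumptions, not from the slides:

```python
import numpy as np

# Illustrative two-layer network (sizes and linear activations are
# assumptions): x -> hidden -> output.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 1)), np.zeros((4, 1))   # input -> hidden
W2, b2 = rng.normal(size=(1, 4)), np.zeros((1, 1))   # hidden -> output

x = np.array([[2.0]])
z_true = np.array([[7.0]])

# Forward pass, layer by layer.
h = W1 @ x + b1
z = W2 @ h + b2

# Loss: C = 1/2 * (z - z_true)^2, as in the single-neuron example.
loss = 0.5 * (z - z_true).item() ** 2

# Backpropagation: gradients flow from the output layer back to the input.
dz = z - z_true          # dC/dz
dW2 = dz @ h.T           # dC/dW2
dh = W2.T @ dz           # gradient passed back to the hidden layer
dW1 = dh @ x.T           # dC/dW1

# Gradient descent update for every layer's weights.
lr = 0.1
W2 -= lr * dW2; b2 -= lr * dz
W1 -= lr * dW1; b1 -= lr * dh
```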
Big Data Drives a New Paradigm for Training
1. The data is too large to fit on a single machine.
2. The training time is too long. Uber: it usually takes weeks or longer to complete [1].
Networking Plays an Important Role
A parameter server holds the parameters $w_1, w_2$; the training data is split into Partition 1 on Worker 1 and Partition 2 on Worker 2, all connected by the network.
Networking Plays an Important Role
Step 1: each worker pulls the parameters $w_1, w_2$ from the servers over the network.
Networking Plays an Important Role
Step 2: each worker runs the forward pass on inputs from its own data partition.
Networking Plays an Important Role
Step 3: each worker calculates the loss locally.
Networking Plays an Important Role
Step 4: each worker backpropagates, producing local gradients $\partial w_1, \partial w_2$.
Networking Plays an Important Role
Step 5: each worker pushes its gradients $\partial w_1, \partial w_2$ back to the servers, which aggregate the parameter updates, over the network.
Networking Plays an Important Role
Every iteration crosses the network twice (pull and push), so networking is critical to performance!
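A minimal single-process sketch of the pull, forward, loss, backpropagation, push cycle from these slides, using the linear model $z = w \cdot x + b$ from earlier. The ParameterServer class is a local stand-in for a networked service, and the two workers here run sequentially rather than in parallel:

```python
# Sketch of the data-parallel training loop. In a real deployment,
# pull() and push() are RPCs that cross the data center network.

class ParameterServer:
    def __init__(self):
        self.w, self.b = 1.0, 1.0   # the shared parameters

    def pull(self):
        return self.w, self.b       # workers pull parameters

    def push(self, dw, db, lr=0.1):
        self.w -= lr * dw           # server applies the aggregated
        self.b -= lr * db           # gradient updates

def worker_step(ps, partition):
    w, b = ps.pull()                # 1. pull parameters from servers
    x, z_true = partition
    z = w * x + b                   # 2. forward pass on local data
    dz = z - z_true                 # 3. loss gradient + backpropagation
    ps.push(dz * x, dz)             # 4. push gradients to servers

ps = ParameterServer()
for partition in [(1.0, 5.0), (2.0, 7.0)]:  # Worker 1 and Worker 2 data
    worker_step(ps, partition)
```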
Networking Plays an Important Role
The speedup achieved after utilizing the 40Gbps network bandwidth with CLUSTAR:

Model:   Logistic Regression | Multi-layer Perceptron | AlexNet | VGG-16 | ResNet-50
Speedup: 2.59x               | 3.45x                  | 1.6x    | 1.33x  | 1.03x
CLUSTAR: AI Training Platform Powered by High Performance Networking
Networking matters to an AI system the way the traffic system matters to a city:
- Between 2 machines: wider roads
- Multiple machines: traffic scheduling
- AI protocol: new traffic rules for AI

Key technologies (world-leading research achievements):
- GDR: towards zero-copy data flow; utilizes RDMA and GPUDirect; integrated with TensorFlow.
- ParaExpress: resilient and adaptive parameter aggregation; tackles the disadvantages of Parameter Server and Ring AllReduce.
- MLT: exploits the SGD nature of AI training; semi-loss tolerant; model-quality aware.
- Smart networking scheduling: co-flow scheduling; elephant and mice flow scheduling.
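For readers unfamiliar with the aggregation pattern ParaExpress targets, here is a pure-Python simulation of textbook ring all-reduce (reduce-scatter followed by all-gather); this is an illustrative sketch of the general technique, not CLUSTAR's ParaExpress implementation, and real systems overlap these transfers with computation:

```python
import numpy as np

# Ring all-reduce: n workers, each holding a local gradient split into
# n chunks. After reduce-scatter plus all-gather, every worker holds
# the elementwise sum of all workers' gradients.

def ring_allreduce(local_grads):
    n = len(local_grads)                          # number of workers
    chunks = [np.array_split(g, n) for g in local_grads]

    # Reduce-scatter: in step t, worker i sends chunk (i - t) % n to
    # worker (i + 1) % n, which adds it to its own copy.
    for t in range(n - 1):
        sends = [chunks[i][(i - t) % n].copy() for i in range(n)]
        for i in range(n):
            chunks[(i + 1) % n][(i - t) % n] += sends[i]

    # All-gather: each fully reduced chunk circulates around the ring.
    for t in range(n - 1):
        sends = [chunks[i][(i + 1 - t) % n].copy() for i in range(n)]
        for i in range(n):
            chunks[(i + 1) % n][(i + 1 - t) % n] = sends[i]

    return [np.concatenate(c) for c in chunks]

grads = [np.ones(8) * (i + 1) for i in range(4)]  # workers' gradients
print(ring_allreduce(grads)[0])                   # all 10s: 1+2+3+4
```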
CLUSTAR Platform
- Infrastructure layer: programmable networking (Broadcom, FPGA, ASIC); RDMA networking (Intel, Nvidia, AMD, Cambricon, Mellanox, P4, CPU, E8 Storage); Clustar AI Fabrics; RoCE; smart NICs; Spark optimization; TensorFlow optimization; container orchestration engine; interactive programming environment; GPU; all-flash storage.
- Cloud platform layer: data preprocessing, offline training, online training, multi-tenant management, task scheduling, operations monitoring.
- Application layer: autonomous driving, speech recognition, natural language processing, computer vision, intelligent anti-fraud, intelligent drones; industry applications for finance, security, internet, manufacturing, healthcare, and government.
GDR: Towards Zero Copy Data Flow
Setup: two servers, each with two CPU sockets with attached memory, multiple GPUs, and an RDMA NIC, connected over the data center network.
The unnecessary copies between the RDMA NIC and host memory, and between GPU RAM and host memory, enlarge latency, degrade throughput, and burn CPU cycles.
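A hedged sketch of the staged data path the slide criticizes, assuming a CUDA-capable machine with PyTorch; it shows the copies that GDR (RDMA plus GPUDirect) is designed to eliminate, not the GDR implementation itself:

```python
import torch

# Without GPUDirect RDMA, sending GPU-resident tensors to a peer goes
# through host memory: GPU RAM -> host memory -> NIC send buffer.

grad = torch.randn(1024, 1024, device="cuda")  # gradients on the GPU

staged = grad.cpu()                  # copy 1: GPU RAM -> host memory
payload = staged.numpy().tobytes()   # copy 2: host memory -> send buffer
# sock.sendall(payload)              # NIC then DMAs from the send buffer

# With GDR, the RDMA NIC reads grad's GPU memory directly, so both
# copies above, and the CPU cycles they burn, drop out of the data path.
```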