CLUSTAR: AI Training Platform Powered by High Performance Networking
Junxue Zhang, EVP, CLUSTAR; PhD, SING Lab, HKUST
August 1, 2018
Deep Learning Is Becoming Increasingly Important
Computer vision, natural language processing, self-driving cars.
How Does Deep Learning Work?
Model: $z = w \cdot x + b$, starting with $w = 1$, $b = 1$.
Mini batch: $(x, z_{true}) = (1, 5), (2, 7)$.
Input layer $\rightarrow$ output layer.
How Does Deep Learning Work?
Forward pass: with $w = 1$, $b = 1$, the mini batch $(1, 5), (2, 7)$ gives outputs $z(1) = 2$ and $z(2) = 3$.
How Does Deep Learning Work?
Calculating loss: $C = \frac{1}{2}\sum (z - z_{true})^2$.
With the forward-pass outputs $z = 2, 3$ against targets $z_{true} = 5, 7$: $C = \frac{1}{2}\left((2-5)^2 + (3-7)^2\right) = 12.5$.
How Does Deep Learning Work?
Backpropagation: by the chain rule,
$\frac{\partial C}{\partial w} = \frac{\partial C}{\partial z} \cdot \frac{\partial z}{\partial w} = \sum (z - z_{true}) \cdot x = -11$
$\frac{\partial C}{\partial b} = \frac{\partial C}{\partial z} \cdot \frac{\partial z}{\partial b} = \sum (z - z_{true}) = -7$
Gradient descent updates with learning rate $\eta = 0.1$:
$w \leftarrow w - \eta \frac{\partial C}{\partial w} = 1 - 0.1 \cdot (-11) = 2.1$
$b \leftarrow b - \eta \frac{\partial C}{\partial b} = 1 - 0.1 \cdot (-7) = 1.7$
How Does Deep Learning Work?
Next iteration: with the updated parameters $w = 2.1$, $b = 1.7$, training repeats on the next mini batch $(3, 9), (5, 13)$.
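To make the iteration concrete, here is a minimal sketch that reproduces the numbers from these slides, assuming the loss $C = \frac{1}{2}\sum (z - z_{true})^2$ and learning rate $0.1$ used above:

```python
# Minimal sketch of one SGD iteration from the slides:
# model z = w*x + b, starting at w = 1, b = 1,
# mini batch {(1, 5), (2, 7)}, learning rate 0.1,
# loss C = 1/2 * sum((z - z_true)^2).

def sgd_step(w, b, batch, lr=0.1):
    # Forward pass: z = w*x + b for every sample in the mini batch.
    preds = [(x, w * x + b, z_true) for x, z_true in batch]

    # Loss: C = 1/2 * sum((z - z_true)^2).
    loss = 0.5 * sum((z - z_true) ** 2 for _, z, z_true in preds)

    # Backpropagation: dC/dw = sum((z - z_true) * x), dC/db = sum(z - z_true).
    dw = sum((z - z_true) * x for x, z, z_true in preds)
    db = sum(z - z_true for _, z, z_true in preds)

    # Gradient descent update.
    return w - lr * dw, b - lr * db, loss

w, b, loss = sgd_step(1.0, 1.0, [(1, 5), (2, 7)])
print(w, b)  # 2.1, 1.7 -- matching the updates on the slide
```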
How Does Deep Learning Work?
Real networks stack layers: input layer, hidden layer, output layer.
How Does Deep Learning Work?
With hidden layers, each iteration runs the forward pass layer by layer, calculates the loss at the output, then backpropagates gradients $\partial w$ through every layer's weights in turn.
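A short sketch of how the single-neuron procedure generalizes to a hidden layer; the layer sizes and the linear activations here are illustrative assumptions, not from the slides:

```python
import numpy as np

# Illustrative two-layer network (sizes and linear activations are
# assumptions): x -> hidden -> output.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 1)), np.zeros((4, 1))   # input -> hidden
W2, b2 = rng.normal(size=(1, 4)), np.zeros((1, 1))   # hidden -> output

x = np.array([[2.0]])
z_true = np.array([[7.0]])

# Forward pass, layer by layer.
h = W1 @ x + b1
z = W2 @ h + b2

# Loss: C = 1/2 * (z - z_true)^2, as in the single-neuron example.
loss = 0.5 * (z - z_true).item() ** 2

# Backpropagation: gradients flow from the output layer back to the input.
dz = z - z_true          # dC/dz
dW2 = dz @ h.T           # dC/dW2
dh = W2.T @ dz           # gradient passed back to the hidden layer
dW1 = dh @ x.T           # dC/dW1

# Gradient descent update for every layer's weights.
lr = 0.1
W2 -= lr * dW2; b2 -= lr * dz
W1 -= lr * dW1; b1 -= lr * dh
```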
Big Data Drives a New Paradigm for Training
1. The data is too large to fit on a single machine.
2. The training time is too long. Uber: it usually takes weeks or longer to complete [1].
Networking Plays an Important Role
A parameter server holds the parameters $w_1, w_2$; the training data is split into Partition 1 on Worker 1 and Partition 2 on Worker 2, all connected by the network.
Networking Plays an Important Role
Step 1: each worker pulls the parameters $w_1, w_2$ from the servers over the network.
Networking Plays an Important Role
Step 2: each worker runs the forward pass on inputs from its own data partition.
Networking Plays an Important Role
Step 3: each worker calculates the loss locally.
Networking Plays an Important Role
Step 4: each worker backpropagates, producing local gradients $\partial w_1, \partial w_2$.
Networking Plays an Important Role
Step 5: each worker pushes its gradients $\partial w_1, \partial w_2$ back to the servers, which aggregate the parameter updates, over the network.
Networking Plays an Important Role
Every iteration crosses the network twice (pull and push), so networking is critical to performance!
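A minimal single-process sketch of the pull, forward, loss, backpropagation, push cycle from these slides, using the linear model $z = w \cdot x + b$ from earlier. The ParameterServer class is a local stand-in for a networked service, and the two workers here run sequentially rather than in parallel:

```python
# Sketch of the data-parallel training loop. In a real deployment,
# pull() and push() are RPCs that cross the data center network.

class ParameterServer:
    def __init__(self):
        self.w, self.b = 1.0, 1.0   # the shared parameters

    def pull(self):
        return self.w, self.b       # workers pull parameters

    def push(self, dw, db, lr=0.1):
        self.w -= lr * dw           # server applies the aggregated
        self.b -= lr * db           # gradient updates

def worker_step(ps, partition):
    w, b = ps.pull()                # 1. pull parameters from servers
    x, z_true = partition
    z = w * x + b                   # 2. forward pass on local data
    dz = z - z_true                 # 3. loss gradient + backpropagation
    ps.push(dz * x, dz)             # 4. push gradients to servers

ps = ParameterServer()
for partition in [(1.0, 5.0), (2.0, 7.0)]:  # Worker 1 and Worker 2 data
    worker_step(ps, partition)
```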
Networking Plays an Important Role
The speedup achieved after utilizing the 40Gbps network bandwidth with CLUSTAR:

Model:   Logistic Regression | Multi-layer Perceptron | AlexNet | VGG-16 | ResNet-50
Speedup: 2.59x               | 3.45x                  | 1.6x    | 1.33x  | 1.03x
CLUSTAR: AI Training Platform Powered by High Performance Networking
Networking matters to an AI system the way the traffic system matters to a city:
- Between 2 machines: wider roads
- Multiple machines: traffic scheduling
- AI protocol: new traffic rules for AI

Key technologies (world-leading research achievements):
- GDR: towards zero-copy data flow; utilizes RDMA and GPUDirect; integrated with TensorFlow.
- ParaExpress: resilient and adaptive parameter aggregation; tackles the disadvantages of Parameter Server and Ring AllReduce.
- MLT: exploits the SGD nature of AI training; semi-loss tolerant; model-quality aware.
- Smart networking scheduling: co-flow scheduling; elephant and mice flow scheduling.
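For readers unfamiliar with the aggregation pattern ParaExpress targets, here is a pure-Python simulation of textbook ring all-reduce (reduce-scatter followed by all-gather); this is an illustrative sketch of the general technique, not CLUSTAR's ParaExpress implementation, and real systems overlap these transfers with computation:

```python
import numpy as np

# Ring all-reduce: n workers, each holding a local gradient split into
# n chunks. After reduce-scatter plus all-gather, every worker holds
# the elementwise sum of all workers' gradients.

def ring_allreduce(local_grads):
    n = len(local_grads)                          # number of workers
    chunks = [np.array_split(g, n) for g in local_grads]

    # Reduce-scatter: in step t, worker i sends chunk (i - t) % n to
    # worker (i + 1) % n, which adds it to its own copy.
    for t in range(n - 1):
        sends = [chunks[i][(i - t) % n].copy() for i in range(n)]
        for i in range(n):
            chunks[(i + 1) % n][(i - t) % n] += sends[i]

    # All-gather: each fully reduced chunk circulates around the ring.
    for t in range(n - 1):
        sends = [chunks[i][(i + 1 - t) % n].copy() for i in range(n)]
        for i in range(n):
            chunks[(i + 1) % n][(i + 1 - t) % n] = sends[i]

    return [np.concatenate(c) for c in chunks]

grads = [np.ones(8) * (i + 1) for i in range(4)]  # workers' gradients
print(ring_allreduce(grads)[0])                   # all 10s: 1+2+3+4
```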
CLUSTAR Platform
- Infrastructure layer: programmable networking (Broadcom, FPGA, ASIC); RDMA networking (Intel, Nvidia, AMD, Cambricon, Mellanox, P4, CPU, E8 Storage); Clustar AI Fabrics; RoCE; smart NICs; Spark optimization; TensorFlow optimization; container orchestration engine; interactive programming environment; GPU; all-flash storage.
- Cloud platform layer: data preprocessing, offline training, online training, multi-tenant management, task scheduling, operations monitoring.
- Application layer: autonomous driving, speech recognition, natural language processing, computer vision, intelligent anti-fraud, intelligent drones; industry applications for finance, security, internet, manufacturing, healthcare, and government.
GDR: Towards Zero Copy Data Flow
Setup: two servers, each with two CPU sockets with attached memory, multiple GPUs, and an RDMA NIC, connected over the data center network.
The unnecessary copies between the RDMA NIC and host memory, and between GPU RAM and host memory, enlarge latency, degrade throughput, and burn CPU cycles.
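A hedged sketch of the staged data path the slide criticizes, assuming a CUDA-capable machine with PyTorch; it shows the copies that GDR (RDMA plus GPUDirect) is designed to eliminate, not the GDR implementation itself:

```python
import torch

# Without GPUDirect RDMA, sending GPU-resident tensors to a peer goes
# through host memory: GPU RAM -> host memory -> NIC send buffer.

grad = torch.randn(1024, 1024, device="cuda")  # gradients on the GPU

staged = grad.cpu()                  # copy 1: GPU RAM -> host memory
payload = staged.numpy().tobytes()   # copy 2: host memory -> send buffer
# sock.sendall(payload)              # NIC then DMAs from the send buffer

# With GDR, the RDMA NIC reads grad's GPU memory directly, so both
# copies above, and the CPU cycles they burn, drop out of the data path.
```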