PhD Defense
Optimizing Communication for Clusters of GPUs
Michael LeBeane (mlebeane@utexas.edu)
Advisor: Lizy K. John
Problem Statement
GPUs and Networks in the Wild
▪ GPUs are everywhere in HPC, Big Data, Machine Learning, and beyond
– Excellent performance/watt for many classes of data-parallel computation
▪ Many GPUs are required to solve the biggest computational problems
– Can only fit so many GPUs in a single node!
– GPUs need to talk to each other through Network Interface Controllers (NICs)
– The path between GPU and NIC needs to be efficient
▪ Vendors are selling machines filled with many GPUs and NICs:
– Nvidia's DGX-2: 16 Tesla V100 GPUs, 8 Mellanox 100G NICs + 2 Ethernet NICs, 2 Xeon Platinum CPUs (1.6:1 GPU/NIC ratio)
– AMD's Project 47 node: 4 Radeon Instinct GPUs, 2 Mellanox 100G NICs, 1 EPYC 7601 32-core CPU (2:1 GPU/NIC ratio)
Problem Statement
Today's GPU Networks
▪ Largely focused on an optimized data plane
– The path taken by the application data that needs to be transferred over the network
– Industry technologies such as ROCn RDMA and GPUDirect RDMA allow peer-to-peer data transfers between GPU memory and the NIC
[Figure: initiator and target nodes, each with a CPU, GPU, IO Controller (IOC), caches, and memories; the NICs move data across the network directly to/from GPU memory]
Problem Statement
Challenges with Today's GPU Networks
▪ The control plane is unoptimized!
– Focused on a host-centric model where only the CPU can coordinate network transfers
– Very high latencies to perform networking from the GPU
[Figure: the same initiator/target nodes as before; the control path for a GPU-initiated transfer must detour through the CPU before reaching the NIC]
Problem Statement
Motivating Example for Control Plane Optimizations
▪ GPU Allreduce Computation
– Many communication/computation phases (see the sketch below)
– Scaling out increases the number of phases
[Figure: timeline of an allreduce across nodes/GPUs, alternating communication and compute phases over per-node buffers, from initial values to the fully reduced result]
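To make the scaling effect concrete, here is a minimal CPU-side sketch of a recursive-doubling allreduce, one common schedule for this pattern. It illustrates why the phase count grows with node count; it is not the dissertation's GPU implementation, and it assumes the rank count is a power of two.

```c
#include <mpi.h>

/* Recursive-doubling allreduce sketch: log2(nprocs) phases, each an
 * exchange (communication) followed by a local reduction (compute).
 * Assumes nprocs is a power of two; buffers hold 'count' floats. */
void allreduce_sum(float *data, float *recv, int count, MPI_Comm comm)
{
    int rank, nprocs;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nprocs);

    for (int mask = 1; mask < nprocs; mask <<= 1) {
        int peer = rank ^ mask;                       /* partner this phase */
        MPI_Sendrecv(data, count, MPI_FLOAT, peer, 0, /* communication      */
                     recv, count, MPI_FLOAT, peer, 0,
                     comm, MPI_STATUS_IGNORE);
        for (int i = 0; i < count; i++)               /* computation        */
            data[i] += recv[i];
    }
}
```

Doubling the node count adds one more communication/compute phase pair, and every phase pays the full GPU control-plane overhead.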
Problem Statement
Thesis Statement
GPU networking can be improved by both software and hardware enhancements that enable GPUs to more directly interface with the network control plane.
▪ Proposed Solutions
– Extended Task Queuing
• Direct NIC-to-GPU active messaging
– Command Processor Networking
• Dynamic communication using the on-chip GPU Command Processor
– GPU Triggered Networking
• Initiate messages without a CPU on the critical path
Outline
▪ Introduction
▪ Contribution 1: Extended Task Queuing
▪ Contribution 2: Command Processor Networking
▪ Contribution 3: GPU Triggered Networking
▪ Conclusion
Contribution 1: Extended Task Queuing (XTQ)
Local GPU Work Dispatch
▪ GPUs consume work through in-memory command queues
– Queue format standardized through the Heterogeneous System Architecture (HSA)
– Any device can produce work for another device
– Assumes a unified virtual address space
▪ Can we extend this across a node?
– The NIC doesn't know how to talk to HSA queues
– The initiator doesn't know the virtual addresses of resources at the target
[Figure: a producer CPU writes a command packet into an in-memory command queue; a consumer GPU pops packets from the queue through shared virtual memory]
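For reference, producing work through an HSA queue looks roughly like the following abridged sketch using the standard HSA runtime API. Kernel object and kernarg setup, error handling, and queue-full checks are omitted for brevity.

```c
#include <hsa/hsa.h>

/* Abridged HSA user-mode dispatch: write an AQL packet into the
 * in-memory command queue, then ring the doorbell to notify the GPU. */
void dispatch(hsa_queue_t *q, uint64_t kernel_object, void *kernargs,
              hsa_signal_t completion, uint32_t grid_size)
{
    /* Reserve a slot in the queue's ring buffer. */
    uint64_t idx = hsa_queue_add_write_index_relaxed(q, 1);
    hsa_kernel_dispatch_packet_t *pkt =
        (hsa_kernel_dispatch_packet_t *)q->base_address + (idx % q->size);

    pkt->setup = 1 << HSA_KERNEL_DISPATCH_PACKET_SETUP_DIMENSIONS; /* 1-D */
    pkt->workgroup_size_x = 64;
    pkt->workgroup_size_y = pkt->workgroup_size_z = 1;
    pkt->grid_size_x = grid_size;
    pkt->grid_size_y = pkt->grid_size_z = 1;
    pkt->kernel_object = kernel_object;
    pkt->kernarg_address = kernargs;
    pkt->completion_signal = completion;

    /* Publish the packet header last, then notify the consumer GPU. */
    __atomic_store_n(&pkt->header,
                     HSA_PACKET_TYPE_KERNEL_DISPATCH << HSA_PACKET_HEADER_TYPE,
                     __ATOMIC_RELEASE);
    hsa_signal_store_relaxed(q->doorbell_signal, idx);
}
```

Because the queue, packet, and doorbell are all plain memory operations, any device that can write memory could in principle be the producer; XTQ teaches the NIC to play that role.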
Contribution 1: Extended Task Queuing (XTQ)
Extended Task Queuing (XTQ) Overview
▪ XTQ allows direct access to remote GPU queues
– Teach NICs how to speak with HSA queues
▪ Enables active messaging without target CPU involvement
– Improves latency and frees CPU service thread(s)
[Figure: initiator and target nodes with XTQ-enabled NICs; the target NIC writes directly into the GPU's command queue, bypassing the target CPU]
M. LeBeane, B. Potter, A. Pan, A. Dutu, V. Agarwala, W. Lee, D. Majeti, B. Ghimire, E. Van Tassell, S. Wasmundt, B. Benton, M. Breternitz, M. L. Chu, M. Thottethodi, L. K. John, and S. K. Reinhardt, "Extended task queuing: active messages for heterogeneous systems," in Proc. of the Intl. Conf. for High Performance Computing, Networking, Storage and Analysis (SC), 2016.
Contribution 1: Extended Task Queuing (XTQ)
Target-side XTQ Operation
▪ Payload data streams into the target-side receive buffer
▪ The command descriptor is placed into the command queue
▪ The NIC notifies the GPU using a memory-mapped doorbell
▪ The GPU reads the command packet
▪ The GPU reads the transferred data
▪ The GPU writes a shared-memory completion signal
[Figure: tightly coupled XTQ NIC, GPU, and CPU sharing virtual memory; the NIC delivers the payload data and a command-queue entry, rings the doorbell, and the GPU writes the completion signal]
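Putting the steps together, the target-side sequence amounts to something like the sketch below. Every type and helper here (xtq_message_t, nic_dma_write, the lookup_* calls, hsa_enqueue, ring_doorbell) is a hypothetical stand-in for XTQ NIC hardware behavior, not an API from the dissertation.

```c
#include <hsa/hsa.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical message format carried by an XTQ-enhanced put. */
typedef struct {
    uint32_t pid, queue_idx, kernel_idx, buffer_idx; /* coordinated indices */
    hsa_kernel_dispatch_packet_t cmd_packet;         /* initiator-built AQL */
    const void *payload;
    size_t payload_len;
} xtq_message_t;

/* Stand-ins for NIC hardware actions and table walks (see next slide). */
extern void *lookup_buffer_va(uint32_t pid, uint32_t idx);
extern uint64_t lookup_kernel_va(uint32_t pid, uint32_t idx);
extern hsa_queue_t *lookup_queue_va(uint32_t pid, uint32_t idx);
extern void nic_dma_write(void *dst, const void *src, size_t len);
extern uint64_t hsa_enqueue(hsa_queue_t *q,
                            const hsa_kernel_dispatch_packet_t *pkt);
extern void ring_doorbell(hsa_queue_t *q, uint64_t idx);

void xtq_handle_put(const xtq_message_t *msg)
{
    /* 1. Stream the payload into the registered receive buffer. */
    void *recv_buf = lookup_buffer_va(msg->pid, msg->buffer_idx);
    nic_dma_write(recv_buf, msg->payload, msg->payload_len);

    /* 2. Rewrite the carried command descriptor with target-side
     *    virtual addresses and place it in the GPU's command queue. */
    hsa_kernel_dispatch_packet_t pkt = msg->cmd_packet;
    pkt.kernel_object   = lookup_kernel_va(msg->pid, msg->kernel_idx);
    pkt.kernarg_address = recv_buf;
    hsa_queue_t *q = lookup_queue_va(msg->pid, msg->queue_idx);
    uint64_t idx = hsa_enqueue(q, &pkt);

    /* 3. Ring the GPU's memory-mapped doorbell; the GPU reads the packet
     *    and data, runs the kernel, and writes the completion signal --
     *    the target CPU never touches the critical path. */
    ring_doorbell(q, idx);
}
```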
Contribution 1: Extended Task Queuing (XTQ)
XTQ Coordinated Indices
▪ How does the initiator learn the remote virtual addresses at the target?
▪ It doesn't: the initiator specifies coordinated indices instead
▪ Lookup tables mapping indices to addresses are populated by the target-side XTQ library
[Figure: example queue lookup — the queue index in the RDMA header selects an entry in the target's per-PID queue lookup table, whose base address register yields the command packet's destination in unified virtual memory]
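A hedged sketch of the translation those tables support. The two-level layout (a per-PID entry pointing at a per-queue table) follows the figure, but the field names and exact structure are illustrative assumptions, not the paper's hardware design.

```c
#include <stdint.h>

/* Illustrative two-level lookup: target PID -> queue lookup table ->
 * queue descriptor base address in unified virtual memory. */
typedef struct {
    uint64_t queue_desc_va;   /* VA of the HSA queue descriptor   */
    uint64_t doorbell_va;     /* VA of its memory-mapped doorbell */
} queue_entry_t;

typedef struct {
    queue_entry_t *queue_table;   /* populated by the XTQ library */
    uint32_t       num_queues;
} pid_entry_t;

/* NIC-side translation of the coordinated index carried in the
 * incoming RDMA header into a target-side virtual address. */
static uint64_t xtq_lookup_queue(const pid_entry_t *pid_table,
                                 uint32_t pid, uint32_t queue_index)
{
    const pid_entry_t *p = &pid_table[pid];
    return (queue_index < p->num_queues)
               ? p->queue_table[queue_index].queue_desc_va
               : 0;  /* invalid index: drop or NACK the message */
}
```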
Contribution 1: Extended Task Queuing (XTQ)
XTQ Runtime API
▪ XTQ Put is implemented as a simple extension to the standard RDMA Put operation
– Compatible with many low-level RDMA transports (e.g., InfiniBand, RoCE, Portals 4, iWARP)
▪ The XTQ registration API provides the index-to-address translations
Put command fields (regular RDMA Put): Target NID/PID, Send Buffer Ptr., Send Buffer Length
Additional XTQ fields: Remote Queue Index, Remote Function/Kernel Index, Target Buffer Index, Kernel/Function Launch Parameters, GPU command packet, transport-specific metadata
Registration API:
– Register Queue: Queue Descriptor VA
– Register Function: Function Ptr. VA, Target-Side Buffer VA
– Register Kernel: Kernel Ptr. VA, Target-Side Buffer VA, Kernel Argument Size, Completion Signal VA
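Concretely, the field lists above suggest host-side signatures along these lines. The names are hypothetical, modeled on the Portals 4 style put used in the evaluation; they are not the dissertation's verbatim API.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical XTQ host API sketch. Registration happens at the target
 * and returns coordinated indices that initiators later use in place of
 * target-side virtual addresses. */
int xtq_register_queue(void *queue_desc_va, uint32_t *queue_index);
int xtq_register_function(void *function_ptr_va, void *target_buffer_va,
                          uint32_t *function_index);
int xtq_register_kernel(void *kernel_ptr_va, void *target_buffer_va,
                        size_t kernarg_size, void *completion_signal_va,
                        uint32_t *kernel_index);

/* Extra fields an XTQ-enhanced put carries beyond a regular RDMA put
 * (target NID/PID, send buffer pointer, send buffer length). */
typedef struct {
    uint32_t remote_queue_index;   /* which HSA queue at the target  */
    uint32_t remote_kernel_index;  /* which registered kernel to run */
    uint32_t target_buffer_index;  /* where the payload should land  */
    uint32_t grid_size[3];         /* kernel launch parameters       */
    uint32_t workgroup_size[3];
} xtq_put_args_t;

int xtq_put(uint32_t target_nid, uint32_t target_pid,
            const void *send_buf, size_t send_len,
            const xtq_put_args_t *args);
```

An xtq_put thus carries both the ordinary RDMA fields and the coordinated indices obtained at registration time, which is what lets the target NIC rewrite the command packet without any CPU help.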
Contribution 1: Extended Task Queuing (XTQ)
Experimental Setup
▪ CPU: standard CPU-only system
– Baseline non-accelerated system
▪ HSA: currently available GPU system
– Involves the CPU runtime on the target
▪ XTQ: Extended Task Queuing
– Enables efficient active-messaging-style communication that bypasses the CPU on the target

CPU and Memory Configuration
– Type: 4-wide OOO, x86, 8 cores @ 4GHz
– I,D-Cache: 64KB, 2-way, 2 cycles
– L2-Cache: 2MB, 8-way, 8 cycles
– L3-Cache: 16MB, 16-way, 20 cycles
– DRAM: DDR3, 8 channels, 800MHz

GPU Configuration
– Type: AMD GCN3 @ 1GHz
– CU Config: 24 CUs with 4 SIMD-16 engines
– Wavefronts: 40 waves per SIMD (64 lanes)
– V-Cache: 32KB, 16-way, 12 cycles, per CU
– K-Cache: 32KB, 8-way, 12 cycles, per 4 CUs
– I-Cache: 64KB, 8-way, 12 cycles, per 4 CUs
– L2-Cache: 1MB, 16-way, 8 banks, 100 cycles

NIC Configuration
– Link Latency/Speed: 100ns / 100Gbps
– Topology: Star
Contribution 1: Extended Task Queuing (XTQ)
Results
▪ Latency decomposition (smaller is better): XTQ cuts end-to-end active-message latency by 19% for 64B payloads and 15% for 4KB payloads relative to HSA
[Figure: stacked latency breakdown (CPU PtlPut, NIC Initiator Put, Network, NIC Target Put, GPU Launch, GPU Kernel Execution, CPU Completion) for the CPU, HSA, and XTQ systems at 64B and 4KB message sizes]
▪ MPI Accumulate speedup (bigger is better) and MPI Allreduce runtime (smaller is better), each comparing CPU, HSA, and XTQ
[Figure: MPI Accumulate speedup vs. data size (1 to 1M 4-byte integers); MPI Allreduce runtime (µs) vs. node count (up to 64 nodes)]