


  1. DCS: A Fast, Scalable, Flexible Device-Centric Server Architecture
     Jangwoo Kim
     E-mail: jangwoo@snu.ac.kr
     Web: https://hpcs.snu.ac.kr/~jangwoo
     High Performance Computer System (HPCS) Lab, Department of Electrical and Computer Engineering, Seoul National University

  2. Major IT companies run datacenters
     The datacenter infrastructure market is huge.

  3. All other companies use those datacenters
     • They buy a SW/HW platform as a service (Clients A–E in the figure).
     Again, the datacenter infrastructure market is huge.

  4. Moore's Law is dead
     • What is the use of the extra transistors? We can't build a faster CPU due to the power ceiling.

  5. The CPU is NOT the first-class citizen any more
     • "Un-CPU" devices now dominate performance, power, and cost.

  6. Every company now deals with big data
     • The storage infrastructure market is EVEN larger!

  7. The neuromorphic computer is coming
     • Brain-inspired computing → a new world?

  8. Message #1 (for system engineers): We must build a datacenter-friendly, intelligent server (e.g., for cloud, big data, artificial intelligence).
     Message #2 (for system engineers): The advantage must come from emerging devices (e.g., memory, SSD, GPU, ASIC, ...).

  9. My solution: let's use our intelligent server architecture, "DCS: Device-Centric Server Architecture"
     Three papers appeared in:
     - 2018 ACM/IEEE International Symposium on Computer Architecture (ISCA)
     - 2017 ACM/IEEE International Symposium on Microarchitecture (MICRO)
     - 2015 ACM/IEEE International Symposium on Microarchitecture (MICRO)

  10. Existing servers do not work
     • Host-centric device management
       − The host manages every device invocation
       − Frequent host-involved layer crossings → increased latency and management cost
     [Figure: host-centric datapath and metadata/command path crossing the application, userspace, per-device kernel stacks and drivers A–C, down to devices A–C]
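To make the layer crossings concrete, here is a minimal sketch (not from the slides) of the conventional host-centric path: the host stages every chunk of file data in a user buffer between the storage read and the NIC send, crossing the user/kernel boundary twice per chunk. Only standard POSIX calls are used.

```c
/* Host-centric storage-to-NIC transfer: every chunk crosses the
 * user/kernel boundary twice and is staged in a host buffer. */
#include <stdlib.h>
#include <unistd.h>
#include <sys/socket.h>

ssize_t host_centric_sendfile(int file_fd, int sock_fd)
{
    char buf[64 * 1024];                  /* host staging buffer */
    ssize_t total = 0, n;

    while ((n = read(file_fd, buf, sizeof(buf))) > 0) {        /* storage -> host */
        ssize_t off = 0;
        while (off < n) {
            ssize_t sent = send(sock_fd, buf + off, n - off, 0); /* host -> NIC */
            if (sent < 0)
                return -1;
            off += sent;
        }
        total += n;
    }
    return (n < 0) ? -1 : total;
}
```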

  11. Latency: high software overhead
     • A single sendfile operation: storage read & NIC send
       − The faster the devices, the larger the share of software overhead
     [Figure: normalized latency decomposition into software, storage, and NIC components; the software share grows from 7% (HDD + 10Gb NIC) to 50% (NVMe + 10Gb NIC), 77% (PCM + 10Gb NIC), and 82% (PCM + 100Gb NIC)]
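The breakdown above is from the authors' measurements; a simple way to observe the host-side cost of the same operation is to time sendfile(2) around the system call, as in the sketch below (standard Linux APIs; the file and connected-socket setup is assumed to be done by the caller).

```c
/* Time a single sendfile(2) call in microseconds.
 * file_fd and sock_fd are assumed to be already opened/connected. */
#define _GNU_SOURCE
#include <time.h>
#include <sys/sendfile.h>

double time_sendfile_us(int sock_fd, int file_fd, size_t len)
{
    struct timespec t0, t1;
    off_t off = 0;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    ssize_t sent = sendfile(sock_fd, file_fd, &off, len);  /* storage read + NIC send */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    if (sent < 0)
        return -1.0;
    return (t1.tv_sec - t0.tv_sec) * 1e6 +
           (t1.tv_nsec - t0.tv_nsec) / 1e3;
}
```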

  12. Cost: high host resource demand
     • sendfile under host resource (CPU) contention
       − The faster the devices, the higher the host resource consumption
     [Figure: sendfile bandwidth and CPU usage, measured with an NVMe SSD and a 10Gb NIC; bandwidth drops from 100% (34% CPU usage) with no contention to 14% (6% CPU usage) under high contention]

  13. Limitations of existing work
     • Single-device optimization
       − Does not address inter-device communication
       − e.g., Moneta (SSD), DCA (NIC), mTCP (NIC), Arrakis (generic)
     • Inter-device communication
       − Not applicable to unsupported devices
       − e.g., GPUnet (GPU-NIC), GPUDirect RDMA (GPU-InfiniBand)
     • Integrated devices
       − Custom devices and protocols, limited applicability
       − e.g., QuickSAN (SSD+NIC), BlueDBM (Accelerator+SSD+NIC)
     → Need for fast, scalable, and generic inter-device communication

  14. Our solution: the Device-Centric Server
     • Minimize host involvement & data movement
     [Figure: DCS architecture; the application calls the DCS library in userspace, a DCS driver sits beside the existing device drivers and kernel stacks, and a DCS engine in hardware connects devices A–C directly on the datapath]
     • A single command → an optimized multi-device invocation (see the API sketch below)
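The slides do not show the DCS library interface itself; the declarations below are a hypothetical illustration of the "single command → multi-device invocation" idea, with invented names (dcs_open, dcs_sendfile, dcs_encrypted_sendfile) standing in for whatever the real library exposes.

```c
/* Hypothetical illustration only: one library call that the DCS library
 * turns into a single compound command for the DCS engine, instead of
 * the host orchestrating storage read + NIC send itself.
 * All names below are invented, not the real DCS API. */
#include <stddef.h>
#include <sys/types.h>

typedef struct dcs_handle dcs_handle_t;          /* opaque engine handle */

dcs_handle_t *dcs_open(void);                    /* attach to the DCS engine */
int dcs_close(dcs_handle_t *h);

/* One call: the engine reads `len` bytes from the file behind `file_fd`
 * and streams them to the connection behind `sock_fd`, device to device. */
ssize_t dcs_sendfile(dcs_handle_t *h, int sock_fd, int file_fd,
                     off_t offset, size_t len);

/* Variant named in the slides: data is encrypted on an accelerator
 * on its way from storage to the NIC. */
ssize_t dcs_encrypted_sendfile(dcs_handle_t *h, int sock_fd, int file_fd,
                               off_t offset, size_t len);
```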

  15. DCS: benefits
     • Selective, device-to-device (D2D) transfers
       − Faster data delivery, lower total operation latency
     • Better host performance/efficiency
       − Resources and time once spent on device management become available to other applications
     • High applicability
       − Relies on existing drivers, kernel support, and interfaces
       − Easy to extend to cover more devices

  16. Device-Centric Server components
     • DCS engine − a custom HW device that selectively connects devices
     • DCS drivers − convert commodity devices to work with DCS engines
     • DCS library − an OS library that hooks into the existing system calls (see the interposition sketch below)
     • DCS applications − applications developed or tuned for DCS systems
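One common way for a userspace library to hook existing system calls, as the DCS library is described as doing, is symbol interposition via LD_PRELOAD. The sketch below illustrates that general technique for sendfile(2); it is not the actual DCS library code, and dcs_supported()/dcs_sendfile() are hypothetical placeholders.

```c
/* Hooking sendfile(2) with LD_PRELOAD interposition.
 * Build: gcc -shared -fPIC -o libdcs_hook.so dcs_hook.c -ldl
 * Run:   LD_PRELOAD=./libdcs_hook.so ./your_app
 * dcs_supported()/dcs_sendfile() are placeholders. */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <sys/sendfile.h>
#include <sys/types.h>

extern int dcs_supported(int out_fd, int in_fd);          /* placeholder */
extern ssize_t dcs_sendfile(int out_fd, int in_fd,        /* placeholder */
                            off_t *offset, size_t count);

ssize_t sendfile(int out_fd, int in_fd, off_t *offset, size_t count)
{
    /* Route to the device-centric path when both endpoints are DCS devices. */
    if (dcs_supported(out_fd, in_fd))
        return dcs_sendfile(out_fd, in_fd, offset, count);

    /* Otherwise fall back to the original libc implementation. */
    ssize_t (*real_sendfile)(int, int, off_t *, size_t) =
        dlsym(RTLD_NEXT, "sendfile");
    return real_sendfile(out_fd, in_fd, offset, count);
}
```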

  17. DCS: architecture overview
     • Userspace: the application and the DCS library (sendfile(), encrypted sendfile())
     • Kernel: the DCS driver (command generator, driver & kernel communicator) alongside the existing kernel stack
     • Hardware: the DCS engine (prototyped on a NetFPGA NIC) with a command interpreter, per-device command managers, and queues, connected to the NVMe SSD, GPU, and NetFPGA NIC through a PCIe switch
     • Fully compatible with existing systems
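As an illustration of what the command generator might hand to the command interpreter, the struct below sketches a hypothetical compound-command descriptor; the field layout is invented for illustration and is not taken from the DCS papers.

```c
/* Hypothetical compound command: "read these blocks from the SSD,
 * optionally run them through the GPU, then send them on the NIC."
 * The layout is illustrative only. */
#include <stdint.h>

enum dcs_dev { DCS_DEV_SSD, DCS_DEV_GPU, DCS_DEV_NIC };

struct dcs_stage {
    uint8_t  device;     /* enum dcs_dev: which device handles this stage     */
    uint8_t  opcode;     /* device-specific operation (read, encrypt, send)   */
    uint64_t addr;       /* block address, GPU buffer, or connection id       */
    uint32_t length;     /* bytes handled by this stage                       */
};

struct dcs_command {
    uint32_t id;         /* matches a completion back to the issuing call     */
    uint8_t  num_stages; /* stages are chained device-to-device in order      */
    struct dcs_stage stages[4];
};
```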

  18. Communicating with storage
     • Userspace: the application calls the DCS library (hook / API call), which passes the file descriptor down
     • Kernel: the DCS driver asks the (virtual) filesystem for the block addresses (on the device) or the buffer addresses (if cached)
     • Hardware: the DCS engine pulls the data from the source (the NVMe SSD, or the VFS cache for cached data) and delivers it to the target device
     • Data consistency is guaranteed
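The slides only say that the DCS driver translates a file descriptor into block addresses via the (virtual) filesystem. On Linux, one standard way to obtain such a mapping is the FIEMAP ioctl, sketched below as a general illustration rather than the actual DCS driver code (which does this translation in the kernel).

```c
/* Resolve a file's first extent to an on-device byte address via FIEMAP. */
#include <string.h>
#include <sys/ioctl.h>
#include <linux/fs.h>
#include <linux/fiemap.h>

long long first_extent_physical(int fd, unsigned long long length)
{
    char buf[sizeof(struct fiemap) + sizeof(struct fiemap_extent)];
    struct fiemap *fm = (struct fiemap *)buf;

    memset(buf, 0, sizeof(buf));
    fm->fm_start        = 0;
    fm->fm_length       = length;
    fm->fm_flags        = FIEMAP_FLAG_SYNC;  /* flush so extents are stable   */
    fm->fm_extent_count = 1;                 /* ask only for the first extent */

    if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0 || fm->fm_mapped_extents == 0)
        return -1;
    return (long long)fm->fm_extents[0].fe_physical;  /* byte offset on device */
}
```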

  19. Communicating with the network interface
     • Userspace: the application calls the DCS library (hook / API call), which passes the socket descriptor down
     • Kernel: the DCS driver obtains the connection information from the network stack
     • Hardware: the DCS engine hands the data buffer to the NetFPGA NIC, which generates and sends the packets (HW PacketGen)
     • HW-assisted packet generation
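The "connection information" is essentially the TCP 4-tuple of the established connection. The sketch below shows one way to read it for an already-connected IPv4 socket from userspace, purely as an illustration of the state a hardware packet generator needs; it is not the DCS driver's in-kernel mechanism.

```c
/* Print the local/remote IPv4 address and port of a connected socket. */
#include <stdio.h>
#include <arpa/inet.h>
#include <sys/socket.h>
#include <netinet/in.h>

int print_connection_tuple(int sock_fd)
{
    struct sockaddr_in local, remote;
    socklen_t len = sizeof(local);

    if (getsockname(sock_fd, (struct sockaddr *)&local, &len) < 0)
        return -1;
    len = sizeof(remote);
    if (getpeername(sock_fd, (struct sockaddr *)&remote, &len) < 0)
        return -1;

    char lip[INET_ADDRSTRLEN], rip[INET_ADDRSTRLEN];
    inet_ntop(AF_INET, &local.sin_addr, lip, sizeof(lip));
    inet_ntop(AF_INET, &remote.sin_addr, rip, sizeof(rip));
    printf("%s:%u -> %s:%u\n", lip, ntohs(local.sin_port),
           rip, ntohs(remote.sin_port));
    return 0;
}
```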

  20. Communicating with the accelerator
     • Userspace: the application calls the DCS library, which works with the GPU user library for kernel invocation and memory allocation
     • Kernel: the DCS driver gets the memory mapping from the GPU kernel driver
     • Hardware: the DCS engine moves data from the source device into GPU memory (DMA / NVMe transfer) and the GPU processes it (kernel launch)
     • Direct data loading without memcpy
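The sketch below illustrates the "direct data loading without memcpy" idea using the standard CUDA runtime C API for allocation, plus hypothetical dcs_dma_to_gpu() and gpu_launch_process() calls; it is not the actual DCS/GPU driver integration.

```c
/* Allocate a GPU buffer and hand its device address to a (hypothetical)
 * engine call that DMAs data straight from the source device into GPU
 * memory, so no host staging buffer or cudaMemcpy is needed.
 * cudaMalloc/cudaFree/cudaDeviceSynchronize are the real CUDA runtime
 * C API; dcs_dma_to_gpu() and gpu_launch_process() are placeholders. */
#include <stddef.h>
#include <cuda_runtime.h>

extern int dcs_dma_to_gpu(void *gpu_dst, int src_fd, size_t len); /* placeholder */
extern int gpu_launch_process(void *gpu_buf, size_t len);         /* placeholder:
                                                                     launches the kernel
                                                                     via the GPU library */

int process_on_gpu(int src_fd, size_t len)
{
    void *dev_buf = NULL;
    if (cudaMalloc(&dev_buf, len) != cudaSuccess)
        return -1;

    /* Direct data loading: the engine writes GPU memory, no host memcpy. */
    if (dcs_dma_to_gpu(dev_buf, src_fd, len) != 0 ||
        gpu_launch_process(dev_buf, len) != 0) {
        cudaFree(dev_buf);
        return -1;
    }

    cudaDeviceSynchronize();   /* wait for the GPU kernel to finish */
    cudaFree(dev_buf);
    return 0;
}
```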

  21. The DCS system in the big picture!

  22. Experimental setup
     • Host: power-efficient system
       − Core 2 Duo @ 2.00GHz, 2MB LLC
       − 2GB DDR2 DRAM
     • Devices: off-the-shelf emerging devices
       − Storage: Samsung XS1715 NVMe SSD
       − NIC: NetFPGA with Xilinx Virtex-5 (up to 1Gb bandwidth)
       − Accelerator: NVIDIA Tesla K20m
       − Device interconnect: Cyclone Microsystems PCIe2-2707 (Gen 2 switch, 5 slots, up to 80Gbps)

  23. DCS prototype implementation
     • Our 4-node DCS prototype
       − Can support many devices per host
     • A working prototype of the Device-Centric Server (DCS)!

  24. Reducing device utilization latency
     • A single sendfile operation: storage read & NIC send
       − Host-centric: per-device layer crossings
       − DCS: batched management in the HW layer
     [Figure: latency (µs) decomposed into SW and HW components for host-centric vs. DCS; the SW portion shrinks (79 µs → 39 µs) while the HW portion (75 µs) is unchanged]
     • Up to 2x latency improvement (with low-latency devices)

  25. Host-independent performance
     • sendfile under host resource (CPU) contention
       − Host-centric: host-dependent, high management cost
       − DCS: host-independent, low management cost
     [Figure: sendfile bandwidth and CPU busy time. No contention: host-centric 100% BW at 70% CPU busy, DCS 100% BW at 29% CPU busy. High contention: host-centric 13% BW at 10% CPU busy, DCS 71% BW at 11% CPU busy]
     • High performance even on weak hosts

  26. Multi-device invocation
     • Encrypted sendfile (SSD → GPU → NIC, 512MB)
       − DCS provides much more efficient data movement to the GPU
       − The current bottleneck is the NIC (1Gbps)
     [Figure: normalized processing time decomposed into NVIDIA driver, GPU data loading, GPU processing, and network send; DCS reduces total time by 14% with a 1Gb NIC and by 38% with a 10Gb NIC]

  27. Real-world workload: Hadoop-grep
     • Hadoop-grep (10GB)
       − Faster input delivery & lower host resource consumption
     [Figure: map/reduce progress (%) over time for the host-centric server vs. DCS]
     • 40% faster processing

  28. Scalability: more devices per host
     • Doubling the number of devices in a single host (SSD + NIC → SSDx2 + NICx2)
       − Host-centric: 1.3x total device throughput (normalized), CPU utilization 60% → 100%
       − DCS: 2x total device throughput (normalized), CPU utilization 22% → 37%
     • Scalable many-device support

  29. 1st prototype in 2015 [MICRO 2015]
     • A new server architecture: DCS!
       − Device latency reduction: ~25%
       − Host resource savings: ~61%
       − Hadoop speed improvement: ~40%

  30. Wait. We can do even better!

  31. Limitations of existing D2D communication: P2P communication
     • Direct data transfers through PCI Express → D2D communication
     • But the slow, high-overhead control path becomes the killer
     [Figure: latency (µs) and CPU utilization (%), decomposed into control, data copy, and others, for the SW path vs. the P2P-optimized path; P2P optimization removes the data copies but the control-path cost remains]

  32. Limitations of existing D2D communication: integrated devices
     • Integrating heterogeneous devices → D2D communication
       − Fast data & control transfers
       − But a fixed, inflexible, and expensive ($$$) aggregate implementation
     [Figure: devices A–C integrated with the CPU and controllers into a single new custom device]
