Nov 14-16, 2019 // Denver, Colorado, USA

MCC: a Predictable and Scalable Massive Client Load Generator

Wenqing Wu*, Xiao Feng*, Wenli Zhang, Mingyu Chen
1 State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences
2 University of Chinese Academy of Sciences
3 Peng Cheng Laboratory, Shenzhen, China
Outline
I. Background/Motivation
II. Design
III. Evaluation
IV. Conclusion
Background

• A load generator drives traffic into a DUT (Device under Test), e.g. a data center or IoT deployment.

• Stateless load generation:
  - No connections, uni-directional
  - Supported: bandwidth test, packet loss test

• Stateful load generation:
  - TCP connection oriented, bi-directional
  - Supported: bandwidth test, packet loss test, latency analysis, NAT/firewall test, application interaction

• Network load to simulate: stateful, tremendous in scale, following various distributions
State of the Art

• Hardware-based load generators:
  - OSNT (NetFPGA, $2,000)
  - Ixia, Spirent (specialized devices, > $100,000)

✓ Precise & accurate
✓ High throughput
✘ Stateless
✘ Inflexible (firmware, not open source)
✘ Expensive
State of the Art

• Software-based load generators:

  Load generator | Description                          | Stateful | Comments
  trafgen        | Packet generator based on AF_PACKET  | ✘        |
  MoonGen        | Packet generator fueled by DPDK      | ✘        |
  D-ITG          | Distributed, multi-platform framework| ✘        |
  Iperf          | Bandwidth and jitter analysis        | ✓        | Closed-loop, no concurrency support
  wrk            | HTTP benchmarking tool               | ✓        | Limited throughput, poor scalability

✓ Cheap (run on commodity hardware)
✓ Some are flexible (open source, easy to add new features)
✘ Stateful generators cannot achieve microsecond precision
✘ Stateful generators show poor scalability
Imprecision in Stateful Load Generation

• Why are software-based stateful load generators less precise?
  - Scheduling policies in the operating system (OS)
    E.g., sleep() does not guarantee one-microsecond precision.
  - POSIX blocking I/O interfaces
    E.g., select() introduces at least 20 µs of deviation to timed tasks.
  - Heavy kernel protocol stack
    Uncertain stack processing time poisons the precision of timed operations running in the application layer.
MCC (Massive Client Connections)

• Design goal: a predictable and scalable massive client load generator
  - Stateful: TCP connection oriented
  - Predictable load generation: a two-stage timer mechanism achieves one-microsecond precision
  - High throughput while performing flow-level simulation: a lightweight user-level protocol stack simplifies packet processing, with high-speed I/O via kernel bypass
  - Scalability in multi-core systems: shared-nothing multi-threaded model
MCC Overview

• Load generation model of MCC:
  1. Data modeling: adjust packet size, add timestamps
  2. Load modeling: adjust packet inter-departure times
• Reactor NIO pattern (control logic + connection manager + app timer)
• User-level TCP/IP stack
• Customized I/O layer
• Two-stage timer mechanism to control packet I/O precisely

Data path: application layer (control logic, connection manager, reactor model, app timer) → TCP/IP stack → I/O layer (controllable I/O, I/O timer) → device driver.
User-level Load Generator

• MCC runs fully in user space, so the full path of load generation can be optimized.
  - Kernel-based solutions: LoadGen application (user) → socket API → kernel stack → device driver
  - MCC's approach: LoadGen thread (stateful, socket-like API) → user-level stack thread → stateless I/O thread (customized I/O layer, packet I/O) → DPDK I/O library → device driver
• A dedicated I/O thread is added to achieve precise control.
Two-stage Timer

• The two-stage timer helps to generate predictable load.
  - App timer: flow-level, controls the send() operation in the LoadGen thread
  - I/O timer: packet-level, controls the xmit() operation according to the timestamp added in the application layer

• Steps:
  1. Initialize the load generation thread.
  2. Initialize the I/O thread.
  3. Send data, carrying a timestamp, according to the app timer.
  4. The stack thread encapsulates the packet and enqueues it into the register queue (RQ).
  5. The I/O thread xmits the packet according to the I/O timer, via the DPDK I/O library.
App Timer

• Polling-based application-layer timer: avoids the imprecision resulting from OS scheduling policies.

• Structures:
  - Timed task: 2-tuple (timestamp, function)
  - Task set: red-black tree (fast insertion/deletion)
• User API: sched(ts, func) adds a timed task.

• Steps:
  1. Register: add a timed task.
  2. Trigger timeout: a non-blocking event loop polls for expired timestamps.
  3. Execute: run the callback function.

• Achieves one-microsecond precision.
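The app timer above can be sketched in a few lines. This is a minimal illustration, not MCC's implementation: the paper stores tasks in an RB-tree, while a sorted linked list stands in here for brevity; `sched` and the task tuple follow the slide, everything else is assumed:

```c
/* Minimal sketch of a polling-based app timer.
 * Assumption: a sorted list replaces the paper's RB-tree. */
#include <stdint.h>
#include <stdlib.h>

typedef void (*task_fn)(void *arg);

struct timed_task {
    uint64_t ts_us;            /* absolute deadline in microseconds */
    task_fn fn;
    void *arg;
    struct timed_task *next;   /* kept sorted by ts_us, earliest first */
};

static struct timed_task *task_set = NULL;

/* Register a timed task (the slide's sched(ts, func)). */
void sched(uint64_t ts_us, task_fn fn, void *arg) {
    struct timed_task *t = malloc(sizeof *t);
    t->ts_us = ts_us; t->fn = fn; t->arg = arg;
    struct timed_task **p = &task_set;
    while (*p && (*p)->ts_us <= ts_us) p = &(*p)->next;
    t->next = *p; *p = t;
}

/* One iteration of the non-blocking event loop:
 * run every task whose deadline has passed. */
void poll_timers(uint64_t now_us) {
    while (task_set && task_set->ts_us <= now_us) {
        struct timed_task *t = task_set;
        task_set = t->next;
        t->fn(t->arg);
        free(t);
    }
}

/* Example callback: increment an int counter. */
void incr(void *arg) { ++*(int *)arg; }
```

Because poll_timers() is called from a busy loop rather than a blocking wait, firing time is bounded by one loop iteration, which is how the design avoids scheduler-induced jitter.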
I/O Timer

• A novel I/O timer in the customized I/O layer eliminates the timing error introduced by the protocol stack (tens of microseconds).
  - Dedicated I/O thread
  - Lockless register queue (RQ) between the stack thread and the I/O thread

• Steps:
  1. The stack thread inserts the encapsulated packet into the RQ.
  2. The I/O thread polls the RQ.
  3. The packet is sent out at its specified time, via the DPDK ring queue to the NIC.

• Achieves one-microsecond precision.
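A plausible shape for the register queue is a single-producer/single-consumer ring: it is lockless because each index is written by exactly one thread. This is a sketch under that assumption, not MCC's code; the actual DPDK transmit call is elided:

```c
/* Sketch of the register queue (RQ): an SPSC ring of timestamped
 * packets. The stack thread is the only writer of `tail`, the I/O
 * thread the only writer of `head`, so no locks are needed.
 * (Assumption: strong memory ordering, as on x86; weaker ISAs would
 * need release/acquire fences.) */
#include <stdbool.h>
#include <stdint.h>

#define RQ_SIZE 1024                 /* must be a power of two */

struct rq_entry { uint64_t ts_us; const void *pkt; };

struct rq {
    struct rq_entry e[RQ_SIZE];
    volatile uint32_t head;          /* advanced by the I/O thread */
    volatile uint32_t tail;          /* advanced by the stack thread */
};

/* Stack thread: enqueue an encapsulated packet with its departure time. */
bool rq_push(struct rq *q, uint64_t ts_us, const void *pkt) {
    uint32_t t = q->tail;
    if (((t + 1) & (RQ_SIZE - 1)) == q->head) return false;  /* full */
    q->e[t].ts_us = ts_us;
    q->e[t].pkt = pkt;
    q->tail = (t + 1) & (RQ_SIZE - 1);
    return true;
}

/* I/O thread: pop the head packet once its timestamp has expired;
 * the caller would then hand it to the DPDK transmit routine. */
bool rq_poll(struct rq *q, uint64_t now_us, const void **pkt_out) {
    uint32_t h = q->head;
    if (h == q->tail || q->e[h].ts_us > now_us) return false;
    *pkt_out = q->e[h].pkt;
    q->head = (h + 1) & (RQ_SIZE - 1);
    return true;
}
```

Gating the transmit on the timestamp carried in the entry, rather than on when the stack finished processing, is what removes the stack's variable latency from the departure time.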
Scalable Multi-threaded Model

• Shared-nothing multi-threaded model:
  - Per-core threading: per-core listening queue, per-core file descriptors, run-to-completion model
  - Core affinity
  - RSS (Receive-Side Scaling): the NIC steers flows to per-core queues (queue0, queue1, queue2, ...)
  - Distributor thread: parses/distributes configuration, aggregates statistics

• Each worker core runs its own LoadGen thread and stack thread on top of a dedicated DPDK queue, which yields scalability.
Scalable Multi-threaded Model

• Message-passing model between the Distributor and the Workers:
  - Avoids synchronization primitives (locks, memory barriers, atomic operations, ...)
  - Easy to extend to multiple Workers

• The Distributor pushes tasks into a task queue that each Worker pulls from; each Worker pushes results (statistics, state) into a result queue that the Distributor pulls from.
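The distributor/worker link described above can be modeled as a pair of one-directional channels per worker. A minimal sketch, with message contents reduced to a 32-bit payload (the real task and statistics structures are not specified on the slide):

```c
/* Sketch of one Distributor<->Worker link: a task channel and a
 * result channel, each single-producer/single-consumer, so no locks,
 * barriers, or atomics are required between the two threads.
 * (Assumption: strong memory ordering, as on x86.) */
#include <stdbool.h>
#include <stdint.h>

#define CH_SIZE 256                  /* must be a power of two */

struct channel {
    uint32_t msg[CH_SIZE];
    volatile uint32_t head;          /* advanced by the consumer */
    volatile uint32_t tail;          /* advanced by the producer */
};

bool ch_push(struct channel *c, uint32_t m) {
    uint32_t t = c->tail;
    if (((t + 1) & (CH_SIZE - 1)) == c->head) return false;  /* full */
    c->msg[t] = m;
    c->tail = (t + 1) & (CH_SIZE - 1);
    return true;
}

bool ch_pull(struct channel *c, uint32_t *m) {
    uint32_t h = c->head;
    if (h == c->tail) return false;  /* empty */
    *m = c->msg[h];
    c->head = (h + 1) & (CH_SIZE - 1);
    return true;
}

/* One link: the Distributor pushes into task_q and pulls from
 * result_q; the Worker does the opposite. Extending to N workers
 * is just an array of these links. */
struct worker_link { struct channel task_q, result_q; };
```

Because every channel has exactly one producer and one consumer, adding workers means adding links, not adding contention, which is the scalability argument on the slide.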
Evaluation
Experimental Setup

• Machines (client/server):
  - CPU: Intel(R) Xeon(R) CPU E5645, 12 cores @ 2.40 GHz
  - Memory: 96 GB
  - NIC: Intel 82599ES 10 Gb Ethernet adapter

• Experiments:
  - Microbenchmark: precision with different timers
  - Predictable load generation
  - Throughput & scalability
Precision with Different Timers

• The timers in MCC bring one-microsecond precision.
• Baseline: sleep() + Linux kernel stack ("Linux" in the table indicates the Linux kernel stack).

[Table: precision of the load generator when generating constant bit rate (CBR) traffic]
Predictable Load Generation

• MCC is able to generate traffic following the analytical Poisson distribution.

[Figure: Poisson traffic generation; probability density of packet intervals (µs, 0-120), comparing the app timer alone and the app timer + I/O timer against the analytic distribution]
Throughput & Scalability

• 2.4x ~ 3.5x higher throughput than wrk (a kernel-based HTTP benchmark)
• Almost linear scalability before reaching line rate

[Figure: HTTP load generation (file size: 64 B); requests per second (x10^5) vs. number of CPU cores (1, 2, 4, 6, 8, 10) for wrk and MCC; MCC peaks at 30.83 x 10^5 requests per second]
Conclusion

MCC: a predictable and scalable massive client load generator
• Predictable load generation: two-stage timer mechanism
• High throughput: lightweight user-level stack, kernel bypass
• Scalability in multi-core systems: shared-nothing multi-threaded model
Thank you! Please feel free to email us at wuwenqing@ict.ac.cn if you have any questions.