Nov 14-16, 2019 // Denver, Colorado, USA

MCC: a Predictable and Scalable Massive Client Load Generator

Wenqing Wu*, Xiao Feng*, Wenli Zhang, Mingyu Chen
1 State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences
2 University of Chinese Academy of Sciences
3 Peng Cheng Laboratory, Shenzhen, China
Outline
I. Background/Motivation
II. Design
III. Evaluation
IV. Conclusion
Background

• A load generator drives traffic into a DUT (Device under Test), e.g. a data center or IoT deployment.

• Stateless load generation:
  - No connections, uni-directional
  - Supported: bandwidth test, packet loss test

• Stateful load generation:
  - TCP connection oriented, bi-directional
  - Supported: bandwidth test, packet loss test, latency analysis, NAT/firewall test, application interaction

• Network load to simulate: stateful, tremendous in scale, following various distributions
State of the Art

• Hardware-based load generators:
  - OSNT (NetFPGA, $2,000)
  - Ixia, Spirent (specialized devices, > $100,000)

✓ Precise & accurate
✓ High throughput
✘ Stateless
✘ Inflexible (firmware, not open source)
✘ Expensive
State of the Art

• Software-based load generators:

  Load generator | Description                          | Stateful | Comments
  trafgen        | Packet generator based on AF_PACKET  | ✘        |
  MoonGen        | Packet generator fueled by DPDK      | ✘        |
  D-ITG          | Distributed, multi-platform framework| ✘        |
  Iperf          | Bandwidth and jitter analysis        | ✓        | Closed-loop, no concurrency support
  wrk            | HTTP benchmarking tool               | ✓        | Limited throughput, poor scalability

✓ Cheap (run on commodity hardware)
✓ Some are flexible (open source, easy to add new features)
✘ Stateful generators cannot achieve microsecond precision
✘ Stateful generators show poor scalability
Imprecision in Stateful Load Generation

• Why are software-based stateful load generators less precise?
  - Scheduling policies in the operating system (OS)
    E.g., sleep() does not guarantee one-microsecond precision.
  - POSIX blocking I/O interfaces
    E.g., select() introduces at least 20 µs of deviation to timed tasks.
  - Heavy kernel protocol stack
    Uncertain stack processing time poisons the precision of timed operations running in the application layer.
MCC (Massive Client Connections)

• Design goal: a predictable and scalable massive client load generator
  - Stateful: TCP connection oriented
  - Predictable load generation: a two-stage timer mechanism achieves one-microsecond precision
  - High throughput while performing flow-level simulation: a lightweight user-level protocol stack simplifies packet processing, with high-speed I/O via kernel bypass
  - Scalability in multi-core systems: shared-nothing multi-threaded model
MCC Overview

• Load generation model of MCC:
  1. Data modeling: adjust packet size, add timestamps
  2. Load modeling: adjust packet inter-departure times
• Reactor NIO pattern (control logic + connection manager + app timer)
• User-level TCP/IP stack
• Customized I/O layer
• Two-stage timer mechanism to control packet I/O precisely

Data path: application layer (control logic, connection manager, reactor model, app timer) → TCP/IP stack → I/O layer (controllable I/O, I/O timer) → device driver.
User-level Load Generator

• MCC runs fully in user space, so the full path of load generation can be optimized.
  - Kernel-based solutions: LoadGen application (user) → socket API → kernel stack → device driver
  - MCC's approach: LoadGen thread (stateful, socket-like API) → user-level stack thread → stateless I/O thread (customized I/O layer, packet I/O) → DPDK I/O library → device driver
• A dedicated I/O thread is added to achieve precise control.
Two-stage Timer

• The two-stage timer helps to generate predictable load.
  - App timer: flow-level, controls the send() operation in the LoadGen thread
  - I/O timer: packet-level, controls the xmit() operation according to the timestamp added in the application layer

• Steps:
  1. Initialize the load generation thread.
  2. Initialize the I/O thread.
  3. Send data, carrying a timestamp, according to the app timer.
  4. The stack thread encapsulates the packet and enqueues it into the register queue (RQ).
  5. The I/O thread xmits the packet according to the I/O timer, via the DPDK I/O library.
App Timer

• Polling-based application-layer timer: avoids the imprecision resulting from OS scheduling policies.

• Structures:
  - Timed task: 2-tuple (timestamp, function)
  - Task set: red-black tree (fast insertion/deletion)
• User API: sched(ts, func) adds a timed task.

• Steps:
  1. Register: add a timed task.
  2. Trigger timeout: a non-blocking event loop polls for expired timestamps.
  3. Execute: run the callback function.

• Achieves one-microsecond precision.
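The app timer above can be sketched in a few lines. This is a minimal illustration, not MCC's implementation: the paper stores tasks in an RB-tree, while a sorted linked list stands in here for brevity; `sched` and the task tuple follow the slide, everything else is assumed:

```c
/* Minimal sketch of a polling-based app timer.
 * Assumption: a sorted list replaces the paper's RB-tree. */
#include <stdint.h>
#include <stdlib.h>

typedef void (*task_fn)(void *arg);

struct timed_task {
    uint64_t ts_us;            /* absolute deadline in microseconds */
    task_fn fn;
    void *arg;
    struct timed_task *next;   /* kept sorted by ts_us, earliest first */
};

static struct timed_task *task_set = NULL;

/* Register a timed task (the slide's sched(ts, func)). */
void sched(uint64_t ts_us, task_fn fn, void *arg) {
    struct timed_task *t = malloc(sizeof *t);
    t->ts_us = ts_us; t->fn = fn; t->arg = arg;
    struct timed_task **p = &task_set;
    while (*p && (*p)->ts_us <= ts_us) p = &(*p)->next;
    t->next = *p; *p = t;
}

/* One iteration of the non-blocking event loop:
 * run every task whose deadline has passed. */
void poll_timers(uint64_t now_us) {
    while (task_set && task_set->ts_us <= now_us) {
        struct timed_task *t = task_set;
        task_set = t->next;
        t->fn(t->arg);
        free(t);
    }
}

/* Example callback: increment an int counter. */
void incr(void *arg) { ++*(int *)arg; }
```

Because poll_timers() is called from a busy loop rather than a blocking wait, firing time is bounded by one loop iteration, which is how the design avoids scheduler-induced jitter.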
I/O Timer

• A novel I/O timer in the customized I/O layer eliminates the timing error introduced by the protocol stack (tens of microseconds).
  - Dedicated I/O thread
  - Lockless register queue (RQ) between the stack thread and the I/O thread

• Steps:
  1. The stack thread inserts the encapsulated packet into the RQ.
  2. The I/O thread polls the RQ.
  3. The packet is sent out at its specified time, via the DPDK ring queue to the NIC.

• Achieves one-microsecond precision.
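A plausible shape for the register queue is a single-producer/single-consumer ring: it is lockless because each index is written by exactly one thread. This is a sketch under that assumption, not MCC's code; the actual DPDK transmit call is elided:

```c
/* Sketch of the register queue (RQ): an SPSC ring of timestamped
 * packets. The stack thread is the only writer of `tail`, the I/O
 * thread the only writer of `head`, so no locks are needed.
 * (Assumption: strong memory ordering, as on x86; weaker ISAs would
 * need release/acquire fences.) */
#include <stdbool.h>
#include <stdint.h>

#define RQ_SIZE 1024                 /* must be a power of two */

struct rq_entry { uint64_t ts_us; const void *pkt; };

struct rq {
    struct rq_entry e[RQ_SIZE];
    volatile uint32_t head;          /* advanced by the I/O thread */
    volatile uint32_t tail;          /* advanced by the stack thread */
};

/* Stack thread: enqueue an encapsulated packet with its departure time. */
bool rq_push(struct rq *q, uint64_t ts_us, const void *pkt) {
    uint32_t t = q->tail;
    if (((t + 1) & (RQ_SIZE - 1)) == q->head) return false;  /* full */
    q->e[t].ts_us = ts_us;
    q->e[t].pkt = pkt;
    q->tail = (t + 1) & (RQ_SIZE - 1);
    return true;
}

/* I/O thread: pop the head packet once its timestamp has expired;
 * the caller would then hand it to the DPDK transmit routine. */
bool rq_poll(struct rq *q, uint64_t now_us, const void **pkt_out) {
    uint32_t h = q->head;
    if (h == q->tail || q->e[h].ts_us > now_us) return false;
    *pkt_out = q->e[h].pkt;
    q->head = (h + 1) & (RQ_SIZE - 1);
    return true;
}
```

Gating the transmit on the timestamp carried in the entry, rather than on when the stack finished processing, is what removes the stack's variable latency from the departure time.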
Scalable Multi-threaded Model

• Shared-nothing multi-threaded model:
  - Per-core threading: per-core listening queue, per-core file descriptors, run-to-completion model
  - Core affinity
  - RSS (Receive-Side Scaling): the NIC steers flows to per-core queues (queue0, queue1, queue2, ...)
  - Distributor thread: parses/distributes configuration, aggregates statistics

• Each worker core runs its own LoadGen thread and stack thread on top of a dedicated DPDK queue, which yields scalability.
Scalable Multi-threaded Model

• Message-passing model between the Distributor and the Workers:
  - Avoids synchronization primitives (locks, memory barriers, atomic operations, ...)
  - Easy to extend to multiple Workers

• The Distributor pushes tasks into a task queue that each Worker pulls from; each Worker pushes results (statistics, state) into a result queue that the Distributor pulls from.
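The distributor/worker link described above can be modeled as a pair of one-directional channels per worker. A minimal sketch, with message contents reduced to a 32-bit payload (the real task and statistics structures are not specified on the slide):

```c
/* Sketch of one Distributor<->Worker link: a task channel and a
 * result channel, each single-producer/single-consumer, so no locks,
 * barriers, or atomics are required between the two threads.
 * (Assumption: strong memory ordering, as on x86.) */
#include <stdbool.h>
#include <stdint.h>

#define CH_SIZE 256                  /* must be a power of two */

struct channel {
    uint32_t msg[CH_SIZE];
    volatile uint32_t head;          /* advanced by the consumer */
    volatile uint32_t tail;          /* advanced by the producer */
};

bool ch_push(struct channel *c, uint32_t m) {
    uint32_t t = c->tail;
    if (((t + 1) & (CH_SIZE - 1)) == c->head) return false;  /* full */
    c->msg[t] = m;
    c->tail = (t + 1) & (CH_SIZE - 1);
    return true;
}

bool ch_pull(struct channel *c, uint32_t *m) {
    uint32_t h = c->head;
    if (h == c->tail) return false;  /* empty */
    *m = c->msg[h];
    c->head = (h + 1) & (CH_SIZE - 1);
    return true;
}

/* One link: the Distributor pushes into task_q and pulls from
 * result_q; the Worker does the opposite. Extending to N workers
 * is just an array of these links. */
struct worker_link { struct channel task_q, result_q; };
```

Because every channel has exactly one producer and one consumer, adding workers means adding links, not adding contention, which is the scalability argument on the slide.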
Evaluation
Experimental Setup

• Machines (client/server):
  - CPU: Intel(R) Xeon(R) CPU E5645, 12 cores @ 2.40 GHz
  - Memory: 96 GB
  - NIC: Intel 82599ES 10 Gb Ethernet adapter

• Experiments:
  - Microbenchmark: precision with different timers
  - Predictable load generation
  - Throughput & scalability
Precision with Different Timers

• The timers in MCC bring one-microsecond precision.
• Baseline: sleep() + Linux kernel stack ("Linux" in the table indicates the Linux kernel stack).

[Table: precision of the load generator when generating constant bit rate (CBR) traffic]
Predictable Load Generation

• MCC is able to generate traffic following the analytical Poisson distribution.

[Figure: Poisson traffic generation; probability density of packet intervals (µs, 0-120), comparing the app timer alone and the app timer + I/O timer against the analytic distribution]
Throughput & Scalability

• 2.4x ~ 3.5x higher throughput than wrk (a kernel-based HTTP benchmark)
• Almost linear scalability before reaching line rate

[Figure: HTTP load generation (file size: 64 B); requests per second (x10^5) vs. number of CPU cores (1, 2, 4, 6, 8, 10) for wrk and MCC; MCC peaks at 30.83 x 10^5 requests per second]
Conclusion

MCC: a predictable and scalable massive client load generator
• Predictable load generation: two-stage timer mechanism
• High throughput: lightweight user-level stack, kernel bypass
• Scalability in multi-core systems: shared-nothing multi-threaded model
Thank you! Please feel free to email us at wuwenqing@ict.ac.cn if you have any questions.