FlexSC: Flexible System Call Scheduling with Exception-Less System Calls
Livio Soares and Michael Stumm, University of Toronto
2 Motivation ➔ The synchronous system call interface is a legacy from the single core era ➔ Expensive! Costs are: ➔ direct: mode-switch ➔ indirect: processor structure pollution ➔ FlexSC implements efficient and flexible system calls for the multicore era
3 FlexSC overview ➔ Two contributions: FlexSC and FlexSC-Threads ➔ Results in: 1) MySQL throughput increase of up to 40% and latency reduction of 30% 2) Apache throughput increase of up to 115% and latency reduction of 50%
4 Performance impact of synchronous syscalls ➔ Xalan from SPEC CPU 2006 ➔ Virtually no time in the OS ➔ Linux on Intel Core i7 (Nehalem) ➔ Injected exceptions with varying frequencies ➔ Direct: emulate null system call ➔ Indirect: emulate "write()" system call ➔ Measured only user-mode time ➔ Kernel time ignored ➔ Ideally, user-mode performance is unaltered
5 Degradation due to sync. syscalls [Figure: Xalan (SPEC CPU 2006); y-axis: degradation, 0% to 70% (lower is faster); x-axis: user-mode instructions between exceptions, 1K to 100K (log scale); series: Direct, Indirect; MySQL and Apache marked on the plot] System calls can halve processor efficiency; the indirect cause is the major contributor
6 Processor state pollution ➔ Key source of performance impact ➔ On a Linux write() call: ➔ up to 2/3 of the L1 data cache and data TLB are evicted ➔ Kernel performance equally affected ➔ Processor efficiency for OS code is also cut in half
7 Synchronous system calls are expensive ➔ Traditional system calls are synchronous and use exceptions to cross domains [Diagram: User / Kernel]
8 Alternative: side-step the boundary ➔ Exception-less syscalls remove synchronicity by decoupling invocation from execution [Diagram: User / Kernel]
9 Benefits of exception-less system calls ➔ Significantly reduce direct costs ➔ Fewer mode switches ➔ Allow for batching ➔ Reduce indirect costs ➔ Allow for dynamic multicore specialization ➔ Further reduce direct and indirect costs
10 Exception-less interface: syscall page

    /* legacy synchronous call */
    write(fd, buf, 4096);

    /* the same call posted through the syscall page */
    entry = free_syscall_entry();

    /* write syscall */
    entry->syscall = 1;
    entry->num_args = 3;
    entry->args[0] = fd;
    entry->args[1] = buf;
    entry->args[2] = 4096;
    entry->status = SUBMIT;

    /* poll for completion while doing other work */
    while (entry->status != DONE)
        do_something_else();

    return entry->return_code;
11 Exception-less interface: syscall page (animation step: same code as slide 10; the syscall page entry is shown with status SUBMIT)
12 Exception-less interface: syscall page (animation step: same code as slide 10; the syscall page entry is shown with status DONE)
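The slide code above can be fleshed out into something compilable. The following is a minimal sketch under several assumptions, not the FlexSC implementation: the struct layout, the `ENTRIES_PER_PAGE` size, the three-state status field, and the names `syscall_page_init`, `post_write`, `wait_for_entry`, and `fake_syscall_thread` are all hypothetical. In particular, `fake_syscall_thread` is a user-space stand-in that pretends every write succeeds in full; in FlexSC that work would be done by a kernel syscall thread.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: the slides show only the field names used below. */
enum { ENTRY_FREE = 0, ENTRY_SUBMIT, ENTRY_DONE };

struct syscall_entry {
    int        syscall;      /* syscall number; 1 is write on x86-64 Linux */
    int        num_args;
    long       args[6];
    long       return_code;
    atomic_int status;       /* FREE -> SUBMIT (user side) -> DONE (kernel side) */
};

#define ENTRIES_PER_PAGE 64  /* assumed size, not taken from the slides */

struct syscall_page {
    struct syscall_entry entries[ENTRIES_PER_PAGE];
};

void syscall_page_init(struct syscall_page *page)
{
    assert(page != NULL);
    for (size_t i = 0; i < ENTRIES_PER_PAGE; i++)
        atomic_init(&page->entries[i].status, ENTRY_FREE);
}

/* Return a free entry, or NULL if every entry is in flight. */
struct syscall_entry *free_syscall_entry(struct syscall_page *page)
{
    for (size_t i = 0; i < ENTRIES_PER_PAGE; i++) {
        struct syscall_entry *e = &page->entries[i];
        if (atomic_load_explicit(&e->status, memory_order_acquire) == ENTRY_FREE)
            return e;
    }
    return NULL;
}

/* Post write(fd, buf, len) without entering the kernel. */
struct syscall_entry *post_write(struct syscall_page *page, int fd,
                                 const void *buf, size_t len)
{
    struct syscall_entry *entry = free_syscall_entry(page);
    if (entry == NULL)
        return NULL;
    entry->syscall  = 1;
    entry->num_args = 3;
    entry->args[0]  = fd;
    entry->args[1]  = (long)(intptr_t)buf;
    entry->args[2]  = (long)len;
    /* Release ordering publishes the arguments before the status flip. */
    atomic_store_explicit(&entry->status, ENTRY_SUBMIT, memory_order_release);
    return entry;
}

/* Stand-in for a kernel syscall thread (ctx is a struct syscall_page *).
 * It performs no I/O and reports every write as fully successful. */
void fake_syscall_thread(void *ctx)
{
    struct syscall_page *page = ctx;
    for (size_t i = 0; i < ENTRIES_PER_PAGE; i++) {
        struct syscall_entry *e = &page->entries[i];
        if (atomic_load_explicit(&e->status, memory_order_acquire) != ENTRY_SUBMIT)
            continue;
        e->return_code = e->args[2];
        atomic_store_explicit(&e->status, ENTRY_DONE, memory_order_release);
    }
}

/* Poll for completion, running other work meanwhile, then recycle the entry. */
long wait_for_entry(struct syscall_entry *entry,
                    void (*do_something_else)(void *), void *ctx)
{
    while (atomic_load_explicit(&entry->status, memory_order_acquire) != ENTRY_DONE)
        do_something_else(ctx);
    long rc = entry->return_code;
    atomic_store_explicit(&entry->status, ENTRY_FREE, memory_order_release);
    return rc;
}
```

In a single-threaded experiment, passing `fake_syscall_thread` as the `do_something_else` callback lets the polling loop terminate; with a real syscall thread on another core, the callback would be unrelated application work.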
13 Syscall threads ➔ Kernel-only threads ➔ Part of application process ➔ Execute requests from syscall page ➔ Schedulable on a per-core basis
14 System call batching ➔ Request as many system calls as possible ➔ Switch to kernel-mode ➔ Start executing all posted system calls ➔ Avoids direct and indirect costs, even on a single core
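The batching steps above can be illustrated with a small simulation that counts kernel entries. This is a sketch under stated assumptions: `batch_page`, `post_call`, `flush_batch`, `take_result`, and `fake_syscall_body` are hypothetical names, the "kernel" is an ordinary function, and `kernel_entries` is a simulated counter rather than a measured one.

```c
#include <assert.h>
#include <stddef.h>

#define SLOTS 32  /* assumed capacity, not taken from the slides */

enum { SLOT_FREE = 0, SLOT_SUBMIT, SLOT_DONE };

struct slot {
    int  status;
    long arg;
    long result;
};

struct batch_page {
    struct slot   slots[SLOTS];
    unsigned long kernel_entries;  /* simulated user-to-kernel switches */
};

/* Stand-in for the work a real system call would do. */
static long fake_syscall_body(long arg)
{
    return 2 * arg;
}

/* Synchronous model: every call pays for its own kernel entry. */
long sync_call(struct batch_page *page, long arg)
{
    assert(page != NULL);
    page->kernel_entries++;
    return fake_syscall_body(arg);
}

/* Post a request without entering the kernel.
 * Returns the slot index, or -1 if the page is full. */
int post_call(struct batch_page *page, long arg)
{
    for (int i = 0; i < SLOTS; i++) {
        if (page->slots[i].status == SLOT_FREE) {
            page->slots[i].arg = arg;
            page->slots[i].status = SLOT_SUBMIT;
            return i;
        }
    }
    return -1;
}

/* One kernel entry executes every posted request; returns how many ran. */
int flush_batch(struct batch_page *page)
{
    int executed = 0;
    page->kernel_entries++;
    for (int i = 0; i < SLOTS; i++) {
        struct slot *s = &page->slots[i];
        if (s->status != SLOT_SUBMIT)
            continue;
        s->result = fake_syscall_body(s->arg);
        s->status = SLOT_DONE;
        executed++;
    }
    return executed;
}

/* Collect a finished result and free its slot. */
long take_result(struct batch_page *page, int slot)
{
    assert(slot >= 0 && slot < SLOTS);
    assert(page->slots[slot].status == SLOT_DONE);
    page->slots[slot].status = SLOT_FREE;
    return page->slots[slot].result;
}
```

Posting eight requests with `post_call` and then calling `flush_batch` once increments `kernel_entries` a single time, whereas eight `sync_call` invocations increment it eight times. The sketch models only the direct cost; the reduction in cache and TLB pollution described earlier is not captured by a counter.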
15 Dynamic multicore specialization ➔ FlexSC makes specializing cores simple ➔ Dynamically adapts to workload needs
16 What programs can benefit from FlexSC? Event-driven servers (e.g., memcached, nginx webserver) ➔ Use asynchronous calls, similar to FlexSC ➔ Can use FlexSC directly ➔ Mix sync and exception-less system calls Multi-threaded servers: FlexSC-Threads ➔ Thread library, compatible with Pthreads ➔ No changes to app. code or recompilation required ➔ Transparently converts legacy syscalls into exception-less ones
17 FlexSC-Threads library ➔ Hybrid (M-on-N) threading model ➔ One kernel-visible thread per core ➔ Many user threads per kernel-visible thread ➔ Redirects system calls (libc wrappers) ➔ Posts exception-less syscall to syscall page ➔ Switches to other user-level thread ➔ Resumes thread upon syscall completion ➔ Benefits of exception-less syscalls while maintaining sequential syscall interface
18 FlexSC-Threads in action [Diagram: User]
19 FlexSC-Threads in action ➔ On a syscall: ➔ Post request to system call page ➔ Block user-level thread
20 FlexSC-Threads in action ➔ On a syscall: ➔ Post request to system call page ➔ Block user-level thread ➔ Switch to next ready thread
21 FlexSC-Threads in action ➔ If all user-level threads become blocked: 1) enter kernel 2) wait for completion of at least 1 syscall
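The scheduling policy on the last few slides (post, block, switch to the next ready thread, enter the kernel only when every thread is blocked) can be sketched as a small simulation. Everything here is an assumption made for illustration: user-level threads are modeled as state machines rather than real contexts, each thread owns one entry, and `flexsc_wait_stub` stands in for the kernel by completing every posted entry at once, whereas the slides only require waiting for at least one completion.

```c
#include <assert.h>
#include <stddef.h>

enum ut_state { UT_READY = 0, UT_BLOCKED, UT_FINISHED };
enum { E_FREE = 0, E_SUBMIT, E_DONE };

struct entry {
    int  status;
    long arg;
    long ret;
};

struct uthread {
    enum ut_state state;
    int           calls_left;  /* syscalls this thread still wants to issue */
    long          sum;         /* accumulated syscall results */
    struct entry  slot;        /* one outstanding syscall per user thread */
};

struct sched {
    struct uthread *threads;
    int             n;
    unsigned long   kernel_entries;  /* simulated mode switches */
};

/* Run one thread until its next syscall: collect any finished result,
 * then either finish or post the next request and block. */
static void uthread_step(struct uthread *t)
{
    if (t->slot.status == E_DONE) {
        t->sum += t->slot.ret;
        t->slot.status = E_FREE;
    }
    if (t->calls_left == 0) {
        t->state = UT_FINISHED;
        return;
    }
    t->slot.arg = t->calls_left--;
    t->slot.status = E_SUBMIT;   /* post to the syscall page */
    t->state = UT_BLOCKED;       /* block instead of trapping */
}

/* Stand-in for entering the kernel and waiting for completions.
 * The fake syscall simply echoes its argument back as the result. */
static void flexsc_wait_stub(struct sched *s)
{
    s->kernel_entries++;
    for (int i = 0; i < s->n; i++) {
        struct entry *e = &s->threads[i].slot;
        if (e->status == E_SUBMIT) {
            e->ret = e->arg;
            e->status = E_DONE;
        }
    }
}

/* Run every user-level thread to completion. */
void sched_run(struct sched *s)
{
    assert(s != NULL && s->threads != NULL);
    for (;;) {
        int ran = 0, unfinished = 0;
        for (int i = 0; i < s->n; i++) {
            struct uthread *t = &s->threads[i];
            if (t->state == UT_BLOCKED && t->slot.status == E_DONE)
                t->state = UT_READY;       /* resume on syscall completion */
            if (t->state == UT_READY) {
                uthread_step(t);
                ran++;
            }
            if (t->state != UT_FINISHED)
                unfinished++;
        }
        if (unfinished == 0)
            return;
        if (ran == 0)                      /* all threads blocked */
            flexsc_wait_stub(s);
    }
}
```

With three threads that issue 3, 1, and 2 calls respectively, this simulation services six system calls with three kernel entries, because each entry completes whatever the blocked threads have posted in the meantime. A real thread library would need actual context switching and libc wrappers, which this sketch omits.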
22 Evaluation ➔ Linux 2.6.33 ➔ Nehalem (Core i7) server, 2.3GHz ➔ 4 cores on a chip ➔ Clients connected on 1 Gbps network ➔ Workloads ➔ Sysbench on MySQL (80% user, 20% kernel) ➔ ApacheBench on Apache (50% user, 50% kernel) ➔ Default Linux NPTL ("sync") vs. FlexSC-Threads ("flexsc")
23 Sysbench: "OLTP" on MySQL (1 core) [Figure: throughput (requests/sec.) vs. request concurrency, 0 to 300; series: flexsc, sync] 15% improvement
24 Sysbench: "OLTP" on MySQL (4 cores) [Figure: throughput (requests/sec.) vs. request concurrency, 0 to 300; series: flexsc, sync] 40% improvement
25 MySQL latency per client request (256 connections) [Figure: average and 95th-percentile latency (ms) for sync and flexsc on 1, 2, and 4 cores; one off-scale value labeled 1900] Up to 30% reduction of average request latencies
26 MySQL processor metrics, SysBench (4 cores) [Figure: relative performance (flexsc/sync) in user and kernel mode; metrics: IPC, L2, L3, d-cache, i-cache, TLB, Branch] Performance improvements are a consequence of more efficient processor execution
27 ApacheBench throughput (1 core) [Figure: throughput (requests/sec.) vs. request concurrency, 0 to 1000; series: flexsc, sync] 80-90% improvement
28 ApacheBench throughput (4 cores) [Figure: throughput (requests/sec.) vs. request concurrency, 0 to 1000; series: flexsc, sync] 115% improvement
29 Apache latency per client request (256 concurrent requests) [Figure: average and 99th-percentile latency (ms) for sync and flexsc on 1, 2, and 4 cores; one off-scale value labeled 238] Up to 50% reduction of average request latencies
30 Apache processor metrics, Apache (1 core) [Figure: relative performance (flexsc/sync) in user and kernel mode; metrics: IPC, L2, L3, d-cache, i-cache, TLB, Branch] Processor efficiency doubles for kernel and user-mode execution
31 Discussion ➔ New OS architecture not necessary ➔ Exception-less syscalls can coexist with legacy ones ➔ Foundation for non-blocking system calls ➔ select() / poll() in user-space ➔ Interesting case of non-blocking free() ➔ Multicore ultra-specialization ➔ TCP Servers (Rutgers; Iftode et al.), FS Servers ➔ Single-ISA asymmetric cores ➔ OS-friendly cores (HP Labs; Mogul et al.)
32 Concluding Remarks ➔ System calls degrade server performance ➔ Processor pollution is inherent to synchronous system calls ➔ Exception-less syscalls ➔ Flexible and efficient system call execution ➔ FlexSC-Threads ➔ Leverages exception-less syscalls ➔ No modifications to multi-threaded applications ➔ Throughput & latency gains ➔ 2x throughput improvement for Apache and BIND ➔ 1.4x throughput improvement for MySQL