The Case for the Vector Operating System
Vijay Vasudevan, David G. Andersen, Michael Kaminsky
Carnegie Mellon University and Intel Labs
A webserver

Requests are handled one after another, each issuing its own sequence of system calls:

  Req1: accept(...), stat(f1), open(f1), fcntl(...), fcntl(...), ...
  Req2: accept(...), stat(f2), open(f2), fcntl(...), fcntl(...), ...

A scalable, parallel webserver

The same per-request call streams now run concurrently, one per request:

  Req1: accept(...), stat(f1), open(f1), ...
  Req2: accept(...), stat(f2), open(f2), ...
  Req3: accept(...), stat(f3), open(f3), ...
A scalable, parallel webserver

Batching the same call across requests: vec_accept(...) replaces the three accept(...) calls, and vec_stat([f1, f2, f3]) replaces stat(f1), stat(f2), stat(f3).

Each individual stat(fN) or open(fN) hides the same kernel work, repeated once per request:

  { context switch; alloc(); copy(fN); path_resolve(fN); acl_check(fN);
    h = hash(fN); lookup(h); read(fN); dealloc(); context switch }

vec_open([f1, f2, f3]) performs that work once over the whole batch:

  { context switch; vec_alloc(); vec_copy([f1,f2,f3]);
    vec_path_resolve([f1,f2,f3]); acl_check([f1,f2,f3]);
    h = hash([f1,f2,f3]); lookup(h); vec_read([f1,f2,f3]);
    dealloc(); context switch }
A vectored webserver

  vec_accept(...)
  vec_stat([f1, f2, f3])
  vec_open([f1, f2, f3]) {
    context switch                      <- eliminate N-1 context switches
    vec_alloc()
    vec_copy([f1,f2,f3])
    vec_path_resolve([f1,f2,f3])        <- reduce path resolutions
    acl_check([f1,f2,f3])
    h = hash([f1,f2,f3])                <- use SSE to hash filenames
    lookup(h)                           <- search dentry list once
    vec_read([f1,f2,f3])
    dealloc()
    context switch
  }
VOS core ideas

Known: batching syscalls improves throughput
๏ Amortizes the fixed per-call cost over the batch
๏ Applies regardless of how similar the batched work is

"SIMD" vectorization improves efficiency
๏ Eliminates redundant instructions in parallel execution
๏ Frees up resources, allowing more work to be done
๏ Enables algorithmic optimizations

One concrete example: mprotect
One difficult challenge: managing divergence
One possible implementation path
Speeding up memory protection

[Bar chart: page protections/sec, scale 0 to 1,500,000, comparing mprotect against vec_mprotect. vec_mprotect delivers a 3x performance improvement. Data courtesy of Iulian Moraru.]

vec_mprotect techniques:
๏ Amortize context switches (async batching)
๏ Optimized data structure allocation (sorting): ~30%
๏ Eliminate TLB flush per individual call: ~170%
One difficult challenge

Handling convergence and divergence:

  vec_open([f1, f2, f3]) {
    context switch
    vec_alloc()
    vec_copy([f1,f2,f3])
    -- diverge: fork()? messages?
    vec_path_resolve([f1])   |  vec_path_resolve([f2,f3])
    acl_check([f1])          |  acl_check([f2,f3])
    -- converge: is this worth joining for?
    h = hash([f1,f2,f3])
    -- diverge again
    lookup(h[0])             |  lookup(h[1..2])
    read([f1])               |  read([f2,f3])
    -- join()?
    dealloc()
    context switch
  }
OS as staged event system

The ideal interface for vectorization:
๏ Use message passing as the underlying primitive

Stage pipeline: on packet -> is_new_connection -> accept -> process

Open questions:
๏ Programming interface? The event abstraction is nice.
๏ Who vectorizes? Static analysis, the compiler, the OS, or the app developer?
๏ Efficiency vs. latency
Summary of VOS

Don't let embarrassingly parallel become embarrassingly wasteful.

Vectors fundamentally improve efficiency by
๏ Collecting similar requests
๏ Eliminating redundant work
๏ Remaining parallel when code diverges

Challenges
๏ Programming vector abstractions
๏ Identifying what to coalesce and how to diverge