The Case for the Vector Operating System
Vijay Vasudevan, David G. Andersen, Michael Kaminsky
Carnegie Mellon University and Intel Labs
A webserver

Requests are handled one after another, each issuing its own sequence of system calls:

  Req1: accept(...), stat(f1), open(f1), fcntl(...), fcntl(...), ...
  Req2: accept(...), stat(f2), open(f2), fcntl(...), fcntl(...), ...

A scalable, parallel webserver

The same per-request call streams now run concurrently, one per request:

  Req1: accept(...), stat(f1), open(f1), ...
  Req2: accept(...), stat(f2), open(f2), ...
  Req3: accept(...), stat(f3), open(f3), ...
A scalable, parallel webserver

Batching the same call across requests: vec_accept(...) replaces the three accept(...) calls, and vec_stat([f1, f2, f3]) replaces stat(f1), stat(f2), stat(f3).

Each individual stat(fN) or open(fN) hides the same kernel work, repeated once per request:

  { context switch; alloc(); copy(fN); path_resolve(fN); acl_check(fN);
    h = hash(fN); lookup(h); read(fN); dealloc(); context switch }

vec_open([f1, f2, f3]) performs that work once over the whole batch:

  { context switch; vec_alloc(); vec_copy([f1,f2,f3]);
    vec_path_resolve([f1,f2,f3]); acl_check([f1,f2,f3]);
    h = hash([f1,f2,f3]); lookup(h); vec_read([f1,f2,f3]);
    dealloc(); context switch }
A vectored webserver

  vec_accept(...)
  vec_stat([f1, f2, f3])
  vec_open([f1, f2, f3]) {
    context switch                      <- eliminate N-1 context switches
    vec_alloc()
    vec_copy([f1,f2,f3])
    vec_path_resolve([f1,f2,f3])        <- reduce path resolutions
    acl_check([f1,f2,f3])
    h = hash([f1,f2,f3])                <- use SSE to hash filenames
    lookup(h)                           <- search dentry list once
    vec_read([f1,f2,f3])
    dealloc()
    context switch
  }
VOS core ideas

Known: batching syscalls improves throughput
๏ Amortizes the fixed per-call cost over the batch
๏ Applies regardless of how similar the batched work is

"SIMD" vectorization improves efficiency
๏ Eliminates redundant instructions in parallel execution
๏ Frees up resources, allowing more work to be done
๏ Enables algorithmic optimizations

One concrete example: mprotect
One difficult challenge: managing divergence
One possible implementation path
Speeding up memory protection

[Bar chart: page protections/sec, scale 0 to 1,500,000, comparing mprotect against vec_mprotect. vec_mprotect delivers a 3x performance improvement. Data courtesy of Iulian Moraru.]

vec_mprotect techniques:
๏ Amortize context switches (async batching)
๏ Optimized data structure allocation (sorting): ~30%
๏ Eliminate TLB flush per individual call: ~170%
One difficult challenge

Handling convergence and divergence:

  vec_open([f1, f2, f3]) {
    context switch
    vec_alloc()
    vec_copy([f1,f2,f3])
    -- diverge: fork()? messages?
    vec_path_resolve([f1])   |  vec_path_resolve([f2,f3])
    acl_check([f1])          |  acl_check([f2,f3])
    -- converge: is this worth joining for?
    h = hash([f1,f2,f3])
    -- diverge again
    lookup(h[0])             |  lookup(h[1..2])
    read([f1])               |  read([f2,f3])
    -- join()?
    dealloc()
    context switch
  }
OS as staged event system

The ideal interface for vectorization:
๏ Use message passing as the underlying primitive

Stage pipeline: on packet -> is_new_connection -> accept -> process

Open questions:
๏ Programming interface? The event abstraction is nice.
๏ Who vectorizes? Static analysis, the compiler, the OS, or the app developer?
๏ Efficiency vs. latency
Summary of VOS

Don't let embarrassingly parallel become embarrassingly wasteful.

Vectors fundamentally improve efficiency by
๏ Collecting similar requests
๏ Eliminating redundant work
๏ Remaining parallel when code diverges

Challenges
๏ Programming vector abstractions
๏ Identifying what to coalesce and how to diverge