Solving the Linux storage scalability bottlenecks
Jens Axboe, Software Engineer
Vault 2016
What are the issues?
• Devices went from “hundreds of IOPS” to “hundreds of thousands of IOPS”
• Increases in core count, and NUMA
• Existing IO stack has a lot of data sharing
  ● Between applications
  ● And between submission and completion
• Existing heuristics and optimizations centered around slower storage
Observed problems
• The old stack had severe scaling issues
  ● Even negative scaling
  ● Wasting lots of CPU cycles
• This also led to much higher latencies
• But where are the real scaling bottlenecks hidden?
IO stack
[Diagram] Filesystem → BIO layer (struct bio) → Block layer (struct request) → either the SCSI stack + SCSI driver, a request_fn driver, or a bypass driver
Seen from the application
[Diagram] Apps on CPU A–D each submit through the Filesystem and BIO layer, all converging on the shared Block layer before reaching the Driver
Seen from the application
[Diagram] Same picture, with a “Hmmm!” pointing at the Block layer, where the per-CPU submission paths converge
Testing the theory
• At this point we may have a suspicion of where the bottleneck might be. Let's run a test and see if it backs up the theory.
• We use null_blk
  ● queue_mode=1 completion_nsec=0 irqmode=0
• Fio
  ● Each thread does pread(2), 4k, randomly, O_DIRECT
• Each added thread alternates between the two available NUMA nodes (2 socket system, 32 threads)
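For reference, a minimal userspace sketch (not from the slides) of what each fio worker effectively does in this test: 4k random pread(2) with O_DIRECT against the null_blk device. The device size and iteration count are assumptions; fio itself handles the NUMA-node alternation, threading, and measurement.

```c
/*
 * null_blk loaded as on the slide:
 *   modprobe null_blk queue_mode=1 completion_nsec=0 irqmode=0
 * Error handling trimmed; the O_DIRECT buffer must be 4k aligned.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

#define BS      4096ULL                 /* 4k blocks, as in the test */
#define NBLOCKS (1024ULL * 1024ULL)     /* assumed: stay within the first 4GB of the device */

int main(void)
{
	int fd = open("/dev/nullb0", O_RDONLY | O_DIRECT);  /* default null_blk node */
	void *buf;

	if (fd < 0 || posix_memalign(&buf, 4096, BS))
		return 1;

	for (unsigned long i = 0; i < 1000000; i++) {
		off_t off = (off_t)(rand() % NBLOCKS) * BS; /* random 4k-aligned offset */
		if (pread(fd, buf, BS, off) != (ssize_t)BS)
			return 1;
	}

	free(buf);
	close(fd);
	return 0;
}
```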
That looks like a lot of lock contention… Fio reports spending 95% of the time in the kernel, and it looks like ~75% of that time is spinning on locks. Looking at call graphs, it's a good mix of queue vs completion, and queue vs queue (and queue-to-block vs queue-to-driver).
[Diagram] Apps on CPU A–D all feed a single Block layer, where requests are placed for processing, retrieved by the driver, and their completion is signaled == lots of shared state! The Driver sits below.
Problem areas
• We have good scalability until we reach the block layer
  ● The shared state is a massive issue
• A bypass mode driver could work around the problem
• We need a real and future-proof solution!
Enter block multiqueue
• Shares its basic name with similar networking functionality, but was built from scratch
• Basic idea is to separate shared state
  ● Between applications
  ● Between completion and submission
• Improving scaling on non-mq hardware was also a criterion
• Provide a full pool of helper functionality
  ● Implement and debug once
• Become THE queuing model, not “the 3rd one”
History
• Started in 2011
• Original design reworked, finalized around 2012
• Merged in 3.13
[Diagram] Apps on CPU A–F → per-cpu software queues (blk_mq_ctx) → hardware mapping queues (blk_mq_hw_ctx) → hardware queues → hardware and driver
• Application touches private per-cpu queue
  ● Software queues
  ● Submission is now almost fully privatized
[Diagram] Apps on CPU A/B → per-cpu software queues (blk_mq_ctx) → hardware mapping queues (blk_mq_hw_ctx) → hardware queue → hardware and driver; submissions flow down, completions flow back up
• Software queues map M:N to hardware queues
  ● There are always as many software queues as CPUs
  ● With enough hardware queues, it's a 1:1 mapping
  ● Fewer, and we map based on topology of the system
[Diagram] Same picture: per-cpu software queues (blk_mq_ctx) feeding the hardware mapping queues (blk_mq_hw_ctx) and hardware queue
• Hardware queues handle dispatch to hardware and completions
[Diagram] Same picture, highlighting the hardware mapping queues (blk_mq_hw_ctx) and hardware queue between the per-cpu software queues and the hardware/driver
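A rough illustration of the mapping idea described above, using a simplified helper rather than the kernel's actual mapping code (blk_mq_map_queues in newer kernels), which also accounts for sibling and NUMA topology:

```c
/*
 * Simplified sketch of the software-queue to hardware-queue mapping:
 * every CPU owns a software queue (blk_mq_ctx); with enough hardware
 * queues the mapping is 1:1, otherwise CPUs are spread across the
 * available hardware queues. Topology awareness is ignored here.
 */
static unsigned int cpu_to_hw_queue(unsigned int cpu,
				    unsigned int nr_cpus,
				    unsigned int nr_hw_queues)
{
	if (nr_hw_queues >= nr_cpus)
		return cpu;			/* 1:1 software to hardware mapping */
	return cpu % nr_hw_queues;		/* M:N, topology-agnostic fallback */
}
```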
Features
• Efficient and fast versions of:
  ● Tagging
  ● Timeout handling
  ● Allocation eliminations
  ● Local completions
• Provides intelligent queue ↔ CPU mappings
  ● Can be used for IRQ mappings as well
• Clean API
  ● Driver conversions generally remove more code than they add
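As a sketch of how small the driver-facing API is, here is a minimal, null-device-like blk-mq registration. Names follow kernels newer than the original 3.13 merge (blk_status_t, struct blk_mq_queue_data); details differ across versions and error handling is trimmed.

```c
#include <linux/blk-mq.h>
#include <linux/err.h>
#include <linux/module.h>
#include <linux/numa.h>

/* Complete every request immediately, like a null device would. */
static blk_status_t sketch_queue_rq(struct blk_mq_hw_ctx *hctx,
				    const struct blk_mq_queue_data *bd)
{
	struct request *rq = bd->rq;

	blk_mq_start_request(rq);
	/* A real driver would map the request and hand it to hardware here. */
	blk_mq_end_request(rq, BLK_STS_OK);
	return BLK_STS_OK;
}

static const struct blk_mq_ops sketch_mq_ops = {
	.queue_rq = sketch_queue_rq,
};

static struct blk_mq_tag_set sketch_tag_set;
static struct request_queue *sketch_queue;

static int __init sketch_init(void)
{
	sketch_tag_set.ops          = &sketch_mq_ops;
	sketch_tag_set.nr_hw_queues = 1;	/* more if the hardware has them */
	sketch_tag_set.queue_depth  = 64;
	sketch_tag_set.numa_node    = NUMA_NO_NODE;
	sketch_tag_set.flags        = BLK_MQ_F_SHOULD_MERGE;

	if (blk_mq_alloc_tag_set(&sketch_tag_set))
		return -ENOMEM;

	sketch_queue = blk_mq_init_queue(&sketch_tag_set);
	if (IS_ERR(sketch_queue)) {
		blk_mq_free_tag_set(&sketch_tag_set);
		return PTR_ERR(sketch_queue);
	}
	return 0;
}
module_init(sketch_init);
MODULE_LICENSE("GPL");
```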
blk-mq IO flow
[Diagram] Allocate bio → Find free request (or sleep on a free request) → Map bio to request → Insert into software queue → queue run? → Hardware queue runs → Submit to hardware → Signal hardware → Hardware IRQ event → Complete IO → Free resources (bio, mark rq as free)
Block layer IO flow
[Diagram] Allocate bio → Allocate request (or sleep on resources) → Map bio to request → Insert into queue → Signal driver (?) → Driver runs → Pull request off block layer queue → Allocate driver command and SG list → Allocate tag → Submit to hardware → Hardware IRQ event → Complete IO → Free resources (request, bio, tag, command and SG list, hardware, etc.)
Completions
• Want completions as local as possible
  ● Even without queue shared state, there's still the request
• Particularly for fewer/single hardware queue designs, care must be taken to minimize sharing
• If the completion queue can place the event, we use that
  ● If not, IPI
[Diagram] Completion path: hardware IRQ fires → is the IRQ in the right location? Yes: complete the IO from the IRQ; No: IPI to the right CPU, then complete. Completions flow back up through the hardware mapping queues (blk_mq_hw_ctx) and the per-cpu software queues (blk_mq_ctx) to the apps.
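In pseudo-C, the decision the diagram describes looks roughly like the sketch below. complete_locally() and send_ipi_to() are hypothetical helpers, not kernel API; in the kernel, blk_mq_complete_request() makes a similar check and bounces the completion via IPI when needed.

```c
/*
 * Illustrative sketch of the completion-steering decision shown above.
 * cpus_share_cache() is a real kernel helper; the other two functions
 * are placeholders for "run the completion here" and "run it over there".
 */
static void steer_completion(struct request *rq, int irq_cpu, int submit_cpu)
{
	if (irq_cpu == submit_cpu || cpus_share_cache(irq_cpu, submit_cpu)) {
		/* IRQ already landed in the right place: complete here */
		complete_locally(rq);
	} else {
		/* Bounce to the submitting CPU so the request's cachelines
		 * stay local to the submitter */
		send_ipi_to(submit_cpu, complete_locally, rq);
	}
}
```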
Tagging
• Almost all hardware uses tags to identify IO requests
  ● Must get a free tag on request issue
  ● Must return tag to pool on completion
[Diagram] Driver → Hardware: “This is a request identified by tag=0x13”; Hardware → Driver: “This is the completion event for the request identified by tag=0x13”
Tag support
• Must-have features:
  ● Efficient at or near tag exhaustion
  ● Efficient for shared tag maps
• Blk-mq implements a novel bitmap tag approach
  ● Software queue hinting (sticky)
  ● Sparse layout
  ● Rolling wakeups
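A deliberately simplified sketch of bitmap tag allocation, to show the basic get/put cycle; the real blk-mq tag code (later split out into lib/sbitmap.c) adds the per-queue hinting, sparse cacheline-spread layout, and rolling wakeups listed above.

```c
#include <linux/bitops.h>

/*
 * Minimal bitmap tag allocator sketch. find_first_zero_bit(),
 * test_and_set_bit_lock() and clear_bit_unlock() are standard kernel
 * bitmap helpers; 'map' would be a bitmap of 'depth' bits owned by the
 * tag set and shared by all submitters.
 */
static int get_tag(unsigned long *map, unsigned int depth)
{
	unsigned int tag;

	do {
		tag = find_first_zero_bit(map, depth);
		if (tag >= depth)
			return -1;	/* exhausted: caller sleeps until a tag is freed */
	} while (test_and_set_bit_lock(tag, map));

	return tag;
}

static void put_tag(unsigned long *map, unsigned int tag)
{
	clear_bit_unlock(tag, map);	/* pairs with test_and_set_bit_lock() */
}
```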