Solving the Linux storage scalability bottlenecks
Jens Axboe, Software Engineer
Vault 2016
What are the issues?
• Devices went from “hundreds of IOPS” to “hundreds of thousands of IOPS”
• Increases in core count, and NUMA
• Existing IO stack has a lot of data sharing
  ● Between applications
  ● And between submission and completion
• Existing heuristics and optimizations centered around slower storage
Observed problems
• The old stack had severe scaling issues
  ● Even negative scaling
  ● Wasting lots of CPU cycles
• This also led to much higher latencies
• But where are the real scaling bottlenecks hidden?
IO stack
[Diagram] Filesystem → BIO layer (struct bio) → Block layer (struct request) → either the SCSI stack + SCSI driver, a request_fn driver, or a bypass driver
Seen from the application
[Diagram] Apps on CPU A–D each submit through the Filesystem and BIO layer, all converging on the shared Block layer before reaching the Driver
Seen from the application
[Diagram] Same picture, with a “Hmmm!” pointing at the Block layer, where the per-CPU submission paths converge
Testing the theory
• At this point we may have a suspicion of where the bottleneck might be. Let's run a test and see if it backs up the theory.
• We use null_blk
  ● queue_mode=1 completion_nsec=0 irqmode=0
• Fio
  ● Each thread does pread(2), 4k, randomly, O_DIRECT
• Each added thread alternates between the two available NUMA nodes (2 socket system, 32 threads)
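For reference, a minimal userspace sketch (not from the slides) of what each fio worker effectively does in this test: 4k random pread(2) with O_DIRECT against the null_blk device. The device size and iteration count are assumptions; fio itself handles the NUMA-node alternation, threading, and measurement.

```c
/*
 * null_blk loaded as on the slide:
 *   modprobe null_blk queue_mode=1 completion_nsec=0 irqmode=0
 * Error handling trimmed; the O_DIRECT buffer must be 4k aligned.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

#define BS      4096ULL                 /* 4k blocks, as in the test */
#define NBLOCKS (1024ULL * 1024ULL)     /* assumed: stay within the first 4GB of the device */

int main(void)
{
	int fd = open("/dev/nullb0", O_RDONLY | O_DIRECT);  /* default null_blk node */
	void *buf;

	if (fd < 0 || posix_memalign(&buf, 4096, BS))
		return 1;

	for (unsigned long i = 0; i < 1000000; i++) {
		off_t off = (off_t)(rand() % NBLOCKS) * BS; /* random 4k-aligned offset */
		if (pread(fd, buf, BS, off) != (ssize_t)BS)
			return 1;
	}

	free(buf);
	close(fd);
	return 0;
}
```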
That looks like a lot of lock contention… Fio reports spending 95% of the time in the kernel, and it looks like ~75% of that time is spinning on locks. Looking at call graphs, it's a good mix of queue vs completion, and queue vs queue (and queue-to-block vs queue-to-driver).
[Diagram] Apps on CPU A–D all feed a single Block layer, where requests are placed for processing, retrieved by the driver, and their completion is signaled == lots of shared state! The Driver sits below.
Problem areas
• We have good scalability until we reach the block layer
  ● The shared state is a massive issue
• A bypass mode driver could work around the problem
• We need a real and future-proof solution!
Enter block multiqueue
• Shares its basic name with similar networking functionality, but was built from scratch
• Basic idea is to separate shared state
  ● Between applications
  ● Between completion and submission
• Improving scaling on non-mq hardware was also a criterion
• Provide a full pool of helper functionality
  ● Implement and debug once
• Become THE queuing model, not “the 3rd one”
History
• Started in 2011
• Original design reworked, finalized around 2012
• Merged in 3.13
[Diagram] Apps on CPU A–F → per-cpu software queues (blk_mq_ctx) → hardware mapping queues (blk_mq_hw_ctx) → hardware queues → hardware and driver
• Application touches private per-cpu queue
  ● Software queues
  ● Submission is now almost fully privatized
[Diagram] Apps on CPU A/B → per-cpu software queues (blk_mq_ctx) → hardware mapping queues (blk_mq_hw_ctx) → hardware queue → hardware and driver; submissions flow down, completions flow back up
• Software queues map M:N to hardware queues
  ● There are always as many software queues as CPUs
  ● With enough hardware queues, it's a 1:1 mapping
  ● Fewer, and we map based on topology of the system
[Diagram] Same picture: per-cpu software queues (blk_mq_ctx) feeding the hardware mapping queues (blk_mq_hw_ctx) and hardware queue
• Hardware queues handle dispatch to hardware and completions
[Diagram] Same picture, highlighting the hardware mapping queues (blk_mq_hw_ctx) and hardware queue between the per-cpu software queues and the hardware/driver
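A rough illustration of the mapping idea described above, using a simplified helper rather than the kernel's actual mapping code (blk_mq_map_queues in newer kernels), which also accounts for sibling and NUMA topology:

```c
/*
 * Simplified sketch of the software-queue to hardware-queue mapping:
 * every CPU owns a software queue (blk_mq_ctx); with enough hardware
 * queues the mapping is 1:1, otherwise CPUs are spread across the
 * available hardware queues. Topology awareness is ignored here.
 */
static unsigned int cpu_to_hw_queue(unsigned int cpu,
				    unsigned int nr_cpus,
				    unsigned int nr_hw_queues)
{
	if (nr_hw_queues >= nr_cpus)
		return cpu;			/* 1:1 software to hardware mapping */
	return cpu % nr_hw_queues;		/* M:N, topology-agnostic fallback */
}
```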
Features
• Efficient and fast versions of:
  ● Tagging
  ● Timeout handling
  ● Allocation eliminations
  ● Local completions
• Provides intelligent queue ↔ CPU mappings
  ● Can be used for IRQ mappings as well
• Clean API
  ● Driver conversions generally remove more code than they add
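As a sketch of how small the driver-facing API is, here is a minimal, null-device-like blk-mq registration. Names follow kernels newer than the original 3.13 merge (blk_status_t, struct blk_mq_queue_data); details differ across versions and error handling is trimmed.

```c
#include <linux/blk-mq.h>
#include <linux/err.h>
#include <linux/module.h>
#include <linux/numa.h>

/* Complete every request immediately, like a null device would. */
static blk_status_t sketch_queue_rq(struct blk_mq_hw_ctx *hctx,
				    const struct blk_mq_queue_data *bd)
{
	struct request *rq = bd->rq;

	blk_mq_start_request(rq);
	/* A real driver would map the request and hand it to hardware here. */
	blk_mq_end_request(rq, BLK_STS_OK);
	return BLK_STS_OK;
}

static const struct blk_mq_ops sketch_mq_ops = {
	.queue_rq = sketch_queue_rq,
};

static struct blk_mq_tag_set sketch_tag_set;
static struct request_queue *sketch_queue;

static int __init sketch_init(void)
{
	sketch_tag_set.ops          = &sketch_mq_ops;
	sketch_tag_set.nr_hw_queues = 1;	/* more if the hardware has them */
	sketch_tag_set.queue_depth  = 64;
	sketch_tag_set.numa_node    = NUMA_NO_NODE;
	sketch_tag_set.flags        = BLK_MQ_F_SHOULD_MERGE;

	if (blk_mq_alloc_tag_set(&sketch_tag_set))
		return -ENOMEM;

	sketch_queue = blk_mq_init_queue(&sketch_tag_set);
	if (IS_ERR(sketch_queue)) {
		blk_mq_free_tag_set(&sketch_tag_set);
		return PTR_ERR(sketch_queue);
	}
	return 0;
}
module_init(sketch_init);
MODULE_LICENSE("GPL");
```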
blk-mq IO flow
[Diagram] Allocate bio → Find free request (or sleep on a free request) → Map bio to request → Insert into software queue → queue run? → Hardware queue runs → Submit to hardware → Signal hardware → Hardware IRQ event → Complete IO → Free resources (bio, mark rq as free)
Block layer IO flow
[Diagram] Allocate bio → Allocate request (or sleep on resources) → Map bio to request → Insert into queue → Signal driver (?) → Driver runs → Pull request off block layer queue → Allocate driver command and SG list → Allocate tag → Submit to hardware → Hardware IRQ event → Complete IO → Free resources (request, bio, tag, command and SG list, hardware, etc.)
Completions
• Want completions as local as possible
  ● Even without queue shared state, there's still the request
• Particularly for fewer/single hardware queue designs, care must be taken to minimize sharing
• If the completion queue can place the event, we use that
  ● If not, IPI
[Diagram] Completion path: hardware IRQ fires → is the IRQ in the right location? Yes: complete the IO from the IRQ; No: IPI to the right CPU, then complete. Completions flow back up through the hardware mapping queues (blk_mq_hw_ctx) and the per-cpu software queues (blk_mq_ctx) to the apps.
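In pseudo-C, the decision the diagram describes looks roughly like the sketch below. complete_locally() and send_ipi_to() are hypothetical helpers, not kernel API; in the kernel, blk_mq_complete_request() makes a similar check and bounces the completion via IPI when needed.

```c
/*
 * Illustrative sketch of the completion-steering decision shown above.
 * cpus_share_cache() is a real kernel helper; the other two functions
 * are placeholders for "run the completion here" and "run it over there".
 */
static void steer_completion(struct request *rq, int irq_cpu, int submit_cpu)
{
	if (irq_cpu == submit_cpu || cpus_share_cache(irq_cpu, submit_cpu)) {
		/* IRQ already landed in the right place: complete here */
		complete_locally(rq);
	} else {
		/* Bounce to the submitting CPU so the request's cachelines
		 * stay local to the submitter */
		send_ipi_to(submit_cpu, complete_locally, rq);
	}
}
```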
Tagging
• Almost all hardware uses tags to identify IO requests
  ● Must get a free tag on request issue
  ● Must return tag to pool on completion
[Diagram] Driver → Hardware: “This is a request identified by tag=0x13”; Hardware → Driver: “This is the completion event for the request identified by tag=0x13”
Tag support
• Must-have features:
  ● Efficient at or near tag exhaustion
  ● Efficient for shared tag maps
• Blk-mq implements a novel bitmap tag approach
  ● Software queue hinting (sticky)
  ● Sparse layout
  ● Rolling wakeups
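A deliberately simplified sketch of bitmap tag allocation, to show the basic get/put cycle; the real blk-mq tag code (later split out into lib/sbitmap.c) adds the per-queue hinting, sparse cacheline-spread layout, and rolling wakeups listed above.

```c
#include <linux/bitops.h>

/*
 * Minimal bitmap tag allocator sketch. find_first_zero_bit(),
 * test_and_set_bit_lock() and clear_bit_unlock() are standard kernel
 * bitmap helpers; 'map' would be a bitmap of 'depth' bits owned by the
 * tag set and shared by all submitters.
 */
static int get_tag(unsigned long *map, unsigned int depth)
{
	unsigned int tag;

	do {
		tag = find_first_zero_bit(map, depth);
		if (tag >= depth)
			return -1;	/* exhausted: caller sleeps until a tag is freed */
	} while (test_and_set_bit_lock(tag, map));

	return tag;
}

static void put_tag(unsigned long *map, unsigned int tag)
{
	clear_bit_unlock(tag, map);	/* pairs with test_and_set_bit_lock() */
}
```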