  1. Solving the Linux storage scalability bottlenecks Jens Axboe Software Engineer Vault 2016

  2. What are the issues? • Devices went from “hundreds of IOPS” to “hundreds of thousands of IOPS” • Increases in core count, and NUMA • Existing IO stack has a lot of data sharing ● For applications ● And between submission and completion • Existing heuristics and optimizations centered around slower storage

  3. Observed problems • The old stack had severe scaling issues ● Even negative scaling ● Wasting lots of CPU cycles • This also led to much higher latencies • But where are the real scaling bottlenecks hidden?

  4. IO stack [diagram] Filesystem → BIO layer (struct bio) → Block layer (struct request) → SCSI stack → SCSI driver. Alternate paths: a request_fn driver sits directly below the block layer; a bypass driver sits directly below the BIO layer.

  5. Seen from the application [diagram] Apps on CPUs A–D all funnel into one shared stack: Filesystem → BIO layer → Block layer → Driver.

  6. Seen from the application [diagram] Same picture, with the Block layer flagged (“Hmmmm!”) as the suspected shared bottleneck.

  7. Testing the theory • At this point we may have a suspicion of where the bottleneck might be. Let's run a test and see if it backs up the theory. • We use null_blk ● queue_mode=1 completion_nsec=0 irqmode=0 • Fio ● Each thread does pread(2), 4k, randomly, O_DIRECT • Each added thread alternates between the two available NUMA nodes (2 socket system, 32 threads)
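The setup on this slide might be reproduced with a job file along these lines (a reconstruction for illustration; the null_blk parameters, pread(2), 4k random reads, O_DIRECT, and the 32-thread count come from the slide — all other option choices are assumptions):

```ini
; Load the null block driver first, with the parameters from the slide:
;   modprobe null_blk queue_mode=1 completion_nsec=0 irqmode=0
[global]
ioengine=psync       ; pread(2)-based IO
rw=randread          ; random reads
bs=4k                ; 4k blocks
direct=1             ; O_DIRECT
filename=/dev/nullb0

[readers]
numjobs=32           ; threads pinned alternately to the two NUMA nodes
thread
```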

  8. That looks like a lot of lock contention… Fio reports spending 95% of the time in the kernel; ~75% of that time is spinning on locks. Looking at call graphs, it's a good mix of queue vs completion and queue vs queue (and queue-to-block vs queue-to-driver).

  9. [diagram] Apps on CPUs A–D → Block layer → Driver. In the block layer: requests placed for processing, requests retrieved by the driver, request completion signaled == lots of shared state!

  10. Problem areas • We have good scalability until we reach the block layer ● The shared state is a massive issue • A bypass mode driver could work around the problem • We need a real and future proof solution!

  11. Enter block multiqueue • Shares basic name with similar networking functionality, but was built from scratch • Basic idea is to separate shared state ● Between applications ● Between completion and submission • Improving scaling on non-mq hardware was also a criterion • Provide a full pool of helper functionality ● Implement and debug once • Become THE queuing model, not “the 3rd one”

  12. History • Started in 2011 • Original design reworked, finalized around 2012 • Merged in 3.13

  13. [diagram] Apps on CPUs A–F → per-CPU software queues (blk_mq_ctx) → hardware mapping queues (blk_mq_hw_ctx) → hardware queues (hardware and driver).

  14. Application touches private per-CPU queue ● Software queues ● Submission is now almost fully privatized [diagram: Apps on CPUs A and B → per-CPU software queues (blk_mq_ctx) → hardware mapping queues (blk_mq_hw_ctx) → hardware queue; submissions flow down, completions flow back up]

  15. Software queues map M:N to hardware queues ● There are always as many software queues as CPUs ● With enough hardware queues, it's a 1:1 mapping ● With fewer, we map based on the topology of the system [diagram as before: per-CPU software queues (blk_mq_ctx) → hardware mapping queues (blk_mq_hw_ctx) → hardware queue]

  16. Hardware queues handle dispatch to hardware and completions [diagram as before: Apps on CPUs A and B → per-CPU software queues (blk_mq_ctx) → hardware mapping queues (blk_mq_hw_ctx) → hardware queue; submissions down, completions up]

  17. Features • Efficient and fast versions of: ● Tagging ● Timeout handling ● Allocation eliminations ● Local completions • Provides intelligent queue ↔ CPU mappings ● Can be used for IRQ mappings as well • Clean API ● Driver conversions generally remove more code than they add

  18. blk-mq IO flow [flow diagram] Allocate bio → Find free request (or sleep on a free request) → Map bio to request → Insert into software queue → Signal hardware queue run? → Hardware queue runs → Submit to hardware → Hardware IRQ event → Complete IO → Free resources (bio, mark rq as free)

  19. Block layer IO flow [flow diagram] Allocate bio → Allocate request (or sleep on resources) → Map bio to request → Allocate tag → Insert into queue → Signal driver (?) → Driver runs → Pull request off block layer queue → Allocate driver resources → Submit to hardware → Hardware IRQ event → Complete IO → Free resources (request, bio, tag, command and SG list, etc)

  20. Completions • Want completions as local as possible ● Even without queue shared state, there's still the request • Particularly for fewer/single hardware queue design, care must be taken to minimize sharing • If completion queue can place event, we use that ● If not, IPI

  21. [diagram] Hardware queue raises an IRQ → is the IRQ in the right location? Yes: complete IO on that CPU; No: IPI to the right CPU. (Shown against the same stack: apps on CPUs A and B, per-CPU software queues (blk_mq_ctx), hardware mapping queues (blk_mq_hw_ctx), hardware queue.)

  22. Tagging • Almost all hardware uses tags to identify IO requests ● Must get a free tag on request issue ● Must return tag to pool on completion [diagram] Driver → Hardware: “This is a request identified by tag=0x13” / Hardware → Driver: “This is the completion event for the request identified by tag=0x13”

  23. Tag support • Must have features: ● Efficient at or near tag exhaustion ● Efficient for shared tag maps • Blk-mq implements a novel bitmap tag approach ● Software queue hinting (sticky) ● Sparse layout ● Rolling wakeups
