
TLB misses - The Missing Issue of Adaptive Radix Tree?
Petrie Wong, Ziqiang Feng, Wenjian Xu, Eric Lo, Ben Kao
Department of Computer Science, The University of Hong Kong
Department of Computing, The Hong Kong Polytechnic University


  1. TLB misses - The Missing Issue of Adaptive Radix Tree? (title slide)

  2. Motivation
• In-memory databases: H-Store, Hekaton
• Efficient in-memory index structures: Cache-Sensitive B+-Tree (CSB+-Tree), Fast Architecture Sensitive Tree (FAST), Adaptive Radix Tree (ART)
TLB misses - the Missing Issue of Adaptive Radix Tree? - presented by Petrie Wong

  3. Why the Adaptive Radix Tree?
• Outperforms existing index structures in both search and update
• Has a small memory footprint
• Avoids cache misses
• Leverages SIMD data parallelism
• Reduces branch mis-prediction
• Adopts a radix tree structure (V. Leis, A. Kemper et al., ICDE'13)

  4. What is the Adaptive Radix Tree?
• Small node type (Node4) for nodes with few child pointers: a 4-entry key array searched in parallel with a 4-entry child-pointer array
• Medium node type (Node48): a 256-entry index array mapping key bytes to one of 48 child pointers
• Large node type (Node256) for nodes with many child pointers: a direct 256-entry pointer array
[Figure: example tree mixing Node4, Node48, and Node256 nodes on the path down to the data]
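The adaptive node types on this slide can be sketched as plain C++ structs. This is an illustrative reconstruction, not the authors' code; real ART nodes also carry a header with path-compression prefix information and grow/shrink between types:

```cpp
#include <cstdint>

// Illustrative reconstruction of ART's adaptive node layouts (names follow
// the paper; header fields and node-growth logic are simplified assumptions).
struct Node { uint8_t type; };             // common header: node type tag

struct Node4 : Node {                      // few children: parallel key/pointer arrays
    uint8_t count = 0;
    uint8_t keys[4];
    Node*   children[4];
    Node* find(uint8_t k) {
        for (int i = 0; i < count; i++)    // short linear scan (SIMD-friendly)
            if (keys[i] == k) return children[i];
        return nullptr;
    }
};

struct Node48 : Node {                     // medium fan-out: 256-entry index into 48 slots
    uint8_t index[256] = {};               // 0 = empty, otherwise slot + 1
    Node*   children[48] = {};
    Node* find(uint8_t k) {
        return index[k] ? children[index[k] - 1] : nullptr;
    }
};

struct Node256 : Node {                    // large fan-out: direct pointer array
    Node* children[256] = {};
    Node* find(uint8_t k) { return children[k]; }
};
```

A lookup descends one node per key byte, calling `find()` with the next byte of the key until it reaches a leaf.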

  5. Does TLB miss matter in ART?
• Translation Look-aside Buffer (TLB): the CPU's cache of page table entries
• Gives the CPU a fast way to translate a virtual memory address to a physical memory address while executing an instruction
• An in-memory index structure like ART already incurs few cache misses and few branch mis-predictions, and is SIMD-friendly
• Open question: would TLB misses become a bottleneck? If the answer is positive:
  • what measures can alleviate them?
  • how effective are those measures?

  6. Does TLB miss matter in ART? (Setup)
• Experiment: measure the % of stall time due to TLB misses
• System: Intel Core i7-2630QM CPU, 2.00 GHz clock rate, 2.9 GHz turbo frequency
• Per core: 32KB L1i cache, 32KB L1d cache, 256KB unified L2 cache
• Shared: 6MB L3 cache, 16GB 1600 MHz RAM

  7. Does TLB miss matter in ART? (Data)
• 1,000,000 integer keys
• Dense: keys from 1 to n (19MB in RAM)
• Sparse: random numbers in a 32-bit domain (22MB in RAM)
• Both indexes are larger than the 6MB L3 cache

  8. Does TLB miss matter in ART? (Workload)
• 256M lookups
• Varying skewness: from zipf=0 (each key is uniformly accessed) to zipf=3 (a few very hot keys and many non-hot keys)
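A workload like this needs a Zipf-distributed key sampler. The paper does not show its generator, so the following is a hypothetical sketch using inverse-CDF sampling, where rank i is drawn with probability proportional to 1/(i+1)^s:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical Zipf-distributed lookup generator (assumption, not the paper's
// code): precompute the CDF over key ranks, then sample by binary search.
struct ZipfSampler {
    std::vector<double> cdf;               // cumulative probability per rank
    std::mt19937_64 rng{42};               // fixed seed for reproducibility
    ZipfSampler(uint64_t n, double s) : cdf(n) {
        double sum = 0;
        for (uint64_t i = 0; i < n; i++) sum += 1.0 / std::pow(i + 1.0, s);
        double acc = 0;
        for (uint64_t i = 0; i < n; i++) {
            acc += 1.0 / std::pow(i + 1.0, s) / sum;
            cdf[i] = acc;
        }
        cdf.back() = 1.0;                  // guard against rounding drift
    }
    uint64_t next() {                      // returns a key rank in [0, n)
        double u = std::uniform_real_distribution<double>(0.0, 1.0)(rng);
        return std::lower_bound(cdf.begin(), cdf.end(), u) - cdf.begin();
    }
};
```

With s=0 every rank is equally likely (the uniform end of the x-axis); at s=2 and beyond, rank 0 alone receives most of the probability mass (the very-skewed end).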

  9. Does TLB miss matter in ART?
• No, when key access is very skewed (zipf=2 to 3): stall time due to TLB misses is only 0% to 2% of index lookup latency
• A few very hot search keys occupy very few page table entries in the TLB
• So very few TLB misses are incurred
[Figure: stall time due to TLB miss / index lookup latency (%), y-axis 0-25, vs. zipf (0 to 3), for the Dense and Sparse key sets]

  10. Does TLB miss matter in ART?
• Not much, when the workload is not skewed (zipf=0 to 1): TLB stalls are 5% to 7% of lookup latency
• Each key is uniformly accessed, so there is no spatial locality
• Lots of cache misses dominate the latency instead
[Figure: same stall-time plot; the uniform region sits at 5% to 7%]

  11. Does TLB miss matter in ART?
• YES, when the workload possesses realistic skewness (zipf=1 to 2): TLB stalls reach up to 23% of lookup latency
• Key accesses have a certain spatial locality, so the cache miss rate is not high
• TLB misses now matter (up to 23% of stall time)
[Figure: same stall-time plot; the zipf=1 to 2 region peaks near 23%]

  12. What measures can we take to alleviate TLB misses?
• Use of huge pages
• Workload-conscious node-to-page reorganization

  13. Measure 1: use of huge pages

  14. What is a Huge Page?
• Memory allocation cuts the memory space into pages to eliminate fragmentation over the whole memory space
• Regular page size: 4KB, the OS default (in most processors, e.g. Intel Sandy Bridge / Xeon E5)
• Huge page sizes (e.g. Sandy Bridge): 2MB or 1GB
• Huge pages are a good tactic to reduce TLB misses
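The slides do not show how huge pages were enabled; on Linux, one common mechanism is mmap with MAP_HUGETLB. The sketch below is an assumption about how an ART node pool could be backed by a 2MB huge page, with a fallback to regular pages:

```cpp
#include <cstddef>
#include <sys/mman.h>

// Hypothetical sketch (not from the paper): back a node pool with a 2MB huge
// page via mmap(MAP_HUGETLB); fall back to regular 4KB pages when the system
// has no huge pages reserved.
void* alloc_pool(std::size_t bytes, bool* used_huge) {
    void* p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    *used_huge = (p != MAP_FAILED);
    if (p == MAP_FAILED)                   // fallback: regular pages
        p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p;
}
```

When huge pages are reserved (e.g. via `sysctl vm.nr_hugepages`), a 2MB pool allocated this way needs a single TLB entry instead of 512 regular-page entries.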

  15. Why Huge Pages?
• Applying huge pages in ART reduces the number of pages spanned by ART nodes
• Fewer pages reduce the pressure on the TLB
• Fewer TLB misses, so throughput increases

  16. Why Huge Pages?
• Page table entries are not only stored in the TLB; they also occupy space in the L1/L2/L3 caches and in RAM
• With huge pages, fewer page table entries occupy less space in the processor's caches
• Fewer cache misses, so throughput increases
[Figure: L2 cache contents with regular vs. huge pages; with huge pages, page table entries leave more cache room for ART data]

  17. Does a Huge Page Always Help?
• But... the number of TLB entries differs for different page sizes
• There are fewer TLB entries for huge pages than for regular pages
• In Xeon E5: 64 DTLB and 512 STLB entries for regular pages, but only 32 DTLB entries for huge pages
• (Note: 32 entries x 2MB still cover 64MB of address space, versus 512 x 4KB = 2MB, so the risk arises mainly when lookups touch more than 32 distinct huge pages)
• With fewer TLB entries available for huge pages, throughput may decrease

  18. Can Huge Pages Help?
• Yes, when the workload is uniform to moderately skewed (zipf < 2): both TLB misses and cache misses are reduced, and throughput increases as expected
• When the workload is extremely skewed (zipf > 2): there are very few TLB misses and cache misses to begin with, so no further improvement
[Figure: throughput improvement (%), y-axis 0-40, vs. zipf (0 to 3), for the Dense and Sparse key sets]

  19. Measure 2: workload-conscious node-to-page reorganization

  20. What is Workload-Conscious Node-to-Page Reorganization?
• By default, tree nodes in ART are allocated by dynamic memory allocation under the OS's default scheme, which eliminates fragmentation over the whole memory space
• Workload-conscious allocation (R. Stoica and A. Ailamaki, DaMoN'13) takes over the OS's control and organizes the hot ART nodes into the same page

  21. Why Workload-Conscious Node-to-Page Reorganization?
• OLTP workloads are skewed: some keys are hot and accessed frequently
• If all hot nodes are put into one (huge) page, the page table entry of that hot page will be kept in the TLB
• No TLB miss when accessing hot keys, so throughput increases

  22. How Workload-Conscious Node-to-Page Reorganization Works
• During query execution: log key accesses
• Analyze the access logs: sort the keys by their access frequencies
• Node-to-page reorganization: according to access frequencies, hot nodes are placed in the same page
[Figure: nodes scattered over pages P1 and P2 are reorganized into a hot page (P_hot) and a cold page (P_cold)]
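The three steps above (log, analyze, reorganize) can be sketched as follows. This is a hedged simplification: the paper's actual allocator moves real ART nodes between memory pages, whereas here nodes and pages are modeled as plain ids and fixed-capacity buckets:

```cpp
#include <algorithm>
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

// Simplified model of workload-conscious reorganization: count accesses per
// node, rank nodes by frequency, and pack the hottest nodes into page 0.
struct Reorganizer {
    std::unordered_map<uint64_t, uint64_t> freq;   // step 1: access counters

    void log_access(uint64_t node_id) { freq[node_id]++; }

    // Steps 2+3: sort nodes by descending access frequency, then assign the
    // hottest `per_page` nodes to page 0 (the hot page), the next to page 1...
    std::unordered_map<uint64_t, std::size_t> assign_pages(std::size_t per_page) {
        std::vector<std::pair<uint64_t, uint64_t>> ranked(freq.begin(), freq.end());
        std::sort(ranked.begin(), ranked.end(),
                  [](auto& a, auto& b) { return a.second > b.second; });
        std::unordered_map<uint64_t, std::size_t> page_of;
        for (std::size_t i = 0; i < ranked.size(); i++)
            page_of[ranked[i].first] = i / per_page;
        return page_of;
    }
};
```

In a real system, page 0 would be a (huge) page whose single TLB entry then serves the bulk of the skewed workload's accesses.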

  23. Can Workload-Conscious Node-to-Page Reorganization Help?
• Yes, when the data is sparse and the workload is skewed
• With sparse data, each node contains few children, so small nodes (Node4, 36 bytes) are used
• Many small nodes, not so condensed: more space and more pages, hence more page table entries, and TLB misses matter
[Figure: throughput (lookups/s), y-axis 0-50M, vs. zipf (0 to 3), for ART with and without reorganization]
