
TLB misses - The Missing Issue of Adaptive Radix Tree?
Petrie Wong, Ziqiang Feng, Wenjian Xu, Eric Lo, Ben Kao
Department of Computer Science, The University of Hong Kong
Department of Computing, The Hong Kong Polytechnic University


  1. TLB misses - The Missing Issue of Adaptive Radix Tree? (title slide)

  2. Motivation
• In-memory databases: H-Store, Hekaton
• Efficient in-memory index structures: Cache-Sensitive B+-Tree (CSB+-Tree), Fast Architecture Sensitive Tree (FAST), Adaptive Radix Tree (ART)
TLB misses - the Missing Issue of Adaptive Radix Tree? - presented by Petrie Wong

  3. Why the Adaptive Radix Tree?
• Outperforms existing index structures in both search and update
• Has a small memory footprint
• Avoids cache misses
• Leverages SIMD data parallelism
• Reduces branch mis-prediction
• Adopts a radix tree structure (V. Leis, A. Kemper et al., ICDE'13)

  4. What is the Adaptive Radix Tree?
• Small node type (Node4) for nodes with few child pointers: a 4-entry key array searched in parallel with a 4-entry child-pointer array
• Medium node type (Node48): a 256-entry index array mapping key bytes to one of 48 child pointers
• Large node type (Node256) for nodes with many child pointers: a direct 256-entry pointer array
[Figure: example tree mixing Node4, Node48, and Node256 nodes on the path down to the data]
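The adaptive node types on this slide can be sketched as plain C++ structs. This is an illustrative reconstruction, not the authors' code; real ART nodes also carry a header with path-compression prefix information and grow/shrink between types:

```cpp
#include <cstdint>

// Illustrative reconstruction of ART's adaptive node layouts (names follow
// the paper; header fields and node-growth logic are simplified assumptions).
struct Node { uint8_t type; };             // common header: node type tag

struct Node4 : Node {                      // few children: parallel key/pointer arrays
    uint8_t count = 0;
    uint8_t keys[4];
    Node*   children[4];
    Node* find(uint8_t k) {
        for (int i = 0; i < count; i++)    // short linear scan (SIMD-friendly)
            if (keys[i] == k) return children[i];
        return nullptr;
    }
};

struct Node48 : Node {                     // medium fan-out: 256-entry index into 48 slots
    uint8_t index[256] = {};               // 0 = empty, otherwise slot + 1
    Node*   children[48] = {};
    Node* find(uint8_t k) {
        return index[k] ? children[index[k] - 1] : nullptr;
    }
};

struct Node256 : Node {                    // large fan-out: direct pointer array
    Node* children[256] = {};
    Node* find(uint8_t k) { return children[k]; }
};
```

A lookup descends one node per key byte, calling `find()` with the next byte of the key until it reaches a leaf.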

  5. Does TLB miss matter in ART?
• Translation Look-aside Buffer (TLB): the CPU's cache of page table entries
• Gives the CPU a fast way to translate a virtual memory address to a physical memory address while executing an instruction
• An in-memory index structure like ART already incurs few cache misses and few branch mis-predictions, and is SIMD-friendly
• Open question: would TLB misses become a bottleneck? If the answer is positive:
  • what measures can alleviate them?
  • how effective are those measures?

  6. Does TLB miss matter in ART? (Setup)
• Experiment: measure the % of stall time due to TLB misses
• System: Intel Core i7-2630QM CPU, 2.00 GHz clock rate, 2.9 GHz turbo frequency
• Per core: 32KB L1i cache, 32KB L1d cache, 256KB unified L2 cache
• Shared: 6MB L3 cache, 16GB 1600 MHz RAM

  7. Does TLB miss matter in ART? (Data)
• 1,000,000 integer keys
• Dense: keys from 1 to n (19MB in RAM)
• Sparse: random numbers in a 32-bit domain (22MB in RAM)
• Both indexes are larger than the 6MB L3 cache

  8. Does TLB miss matter in ART? (Workload)
• 256M lookups
• Varying skewness: from zipf=0 (each key is uniformly accessed) to zipf=3 (a few very hot keys and many non-hot keys)
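A workload like this needs a Zipf-distributed key sampler. The paper does not show its generator, so the following is a hypothetical sketch using inverse-CDF sampling, where rank i is drawn with probability proportional to 1/(i+1)^s:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical Zipf-distributed lookup generator (assumption, not the paper's
// code): precompute the CDF over key ranks, then sample by binary search.
struct ZipfSampler {
    std::vector<double> cdf;               // cumulative probability per rank
    std::mt19937_64 rng{42};               // fixed seed for reproducibility
    ZipfSampler(uint64_t n, double s) : cdf(n) {
        double sum = 0;
        for (uint64_t i = 0; i < n; i++) sum += 1.0 / std::pow(i + 1.0, s);
        double acc = 0;
        for (uint64_t i = 0; i < n; i++) {
            acc += 1.0 / std::pow(i + 1.0, s) / sum;
            cdf[i] = acc;
        }
        cdf.back() = 1.0;                  // guard against rounding drift
    }
    uint64_t next() {                      // returns a key rank in [0, n)
        double u = std::uniform_real_distribution<double>(0.0, 1.0)(rng);
        return std::lower_bound(cdf.begin(), cdf.end(), u) - cdf.begin();
    }
};
```

With s=0 every rank is equally likely (the uniform end of the x-axis); at s=2 and beyond, rank 0 alone receives most of the probability mass (the very-skewed end).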

  9. Does TLB miss matter in ART?
• No, when key access is very skewed (zipf=2 to 3): stall time due to TLB misses is only 0% to 2% of index lookup latency
• A few very hot search keys occupy very few page table entries in the TLB
• So very few TLB misses are incurred
[Figure: stall time due to TLB miss / index lookup latency (%), y-axis 0-25, vs. zipf (0 to 3), for the Dense and Sparse key sets]

  10. Does TLB miss matter in ART?
• Not much, when the workload is not skewed (zipf=0 to 1): TLB stalls are 5% to 7% of lookup latency
• Each key is uniformly accessed, so there is no spatial locality
• Lots of cache misses dominate the latency instead
[Figure: same stall-time plot; the uniform region sits at 5% to 7%]

  11. Does TLB miss matter in ART?
• YES, when the workload possesses realistic skewness (zipf=1 to 2): TLB stalls reach up to 23% of lookup latency
• Key accesses have a certain spatial locality, so the cache miss rate is not high
• TLB misses now matter (up to 23% of stall time)
[Figure: same stall-time plot; the zipf=1 to 2 region peaks near 23%]

  12. What measures can we take to alleviate TLB misses?
• Use of huge pages
• Workload-conscious node-to-page reorganization

  13. Measure 1: use of huge pages

  14. What is a Huge Page?
• Memory allocation cuts the memory space into pages to eliminate fragmentation over the whole memory space
• Regular page size: 4KB, the OS default (in most processors, e.g. Intel Sandy Bridge / Xeon E5)
• Huge page sizes (e.g. Sandy Bridge): 2MB or 1GB
• Huge pages are a good tactic to reduce TLB misses
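The slides do not show how huge pages were enabled; on Linux, one common mechanism is mmap with MAP_HUGETLB. The sketch below is an assumption about how an ART node pool could be backed by a 2MB huge page, with a fallback to regular pages:

```cpp
#include <cstddef>
#include <sys/mman.h>

// Hypothetical sketch (not from the paper): back a node pool with a 2MB huge
// page via mmap(MAP_HUGETLB); fall back to regular 4KB pages when the system
// has no huge pages reserved.
void* alloc_pool(std::size_t bytes, bool* used_huge) {
    void* p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    *used_huge = (p != MAP_FAILED);
    if (p == MAP_FAILED)                   // fallback: regular pages
        p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p;
}
```

When huge pages are reserved (e.g. via `sysctl vm.nr_hugepages`), a 2MB pool allocated this way needs a single TLB entry instead of 512 regular-page entries.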

  15. Why Huge Pages?
• Applying huge pages in ART reduces the number of pages spanned by ART nodes
• Fewer pages reduce the pressure on the TLB
• Fewer TLB misses, so throughput increases

  16. Why Huge Pages?
• Page table entries are not only stored in the TLB; they also occupy space in the L1/L2/L3 caches and in RAM
• With huge pages, fewer page table entries occupy less space in the processor's caches
• Fewer cache misses, so throughput increases
[Figure: L2 cache contents with regular vs. huge pages; with huge pages, page table entries leave more cache room for ART data]

  17. Does a Huge Page Always Help?
• But... the number of TLB entries differs for different page sizes
• There are fewer TLB entries for huge pages than for regular pages
• In Xeon E5: 64 DTLB and 512 STLB entries for regular pages, but only 32 DTLB entries for huge pages
• (Note: 32 entries x 2MB still cover 64MB of address space, versus 512 x 4KB = 2MB, so the risk arises mainly when lookups touch more than 32 distinct huge pages)
• With fewer TLB entries available for huge pages, throughput may decrease

  18. Can Huge Pages Help?
• Yes, when the workload is uniform to moderately skewed (zipf < 2): both TLB misses and cache misses are reduced, and throughput increases as expected
• When the workload is extremely skewed (zipf > 2): there are very few TLB misses and cache misses to begin with, so no further improvement
[Figure: throughput improvement (%), y-axis 0-40, vs. zipf (0 to 3), for the Dense and Sparse key sets]

  19. Measure 2: workload-conscious node-to-page reorganization

  20. What is Workload-Conscious Node-to-Page Reorganization?
• By default, tree nodes in ART are allocated by dynamic memory allocation under the OS's default scheme, which eliminates fragmentation over the whole memory space
• Workload-conscious allocation (R. Stoica and A. Ailamaki, DaMoN'13) takes over the OS's control and organizes the hot ART nodes into the same page

  21. Why Workload-Conscious Node-to-Page Reorganization?
• OLTP workloads are skewed: some keys are hot and accessed frequently
• If all hot nodes are put into one (huge) page, the page table entry of that hot page will be kept in the TLB
• No TLB miss when accessing hot keys, so throughput increases

  22. How Workload-Conscious Node-to-Page Reorganization Works
• During query execution: log key accesses
• Analyze the access logs: sort the keys by their access frequencies
• Node-to-page reorganization: according to access frequencies, hot nodes are placed in the same page
[Figure: nodes scattered over pages P1 and P2 are reorganized into a hot page (P_hot) and a cold page (P_cold)]
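The three steps above (log, analyze, reorganize) can be sketched as follows. This is a hedged simplification: the paper's actual allocator moves real ART nodes between memory pages, whereas here nodes and pages are modeled as plain ids and fixed-capacity buckets:

```cpp
#include <algorithm>
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

// Simplified model of workload-conscious reorganization: count accesses per
// node, rank nodes by frequency, and pack the hottest nodes into page 0.
struct Reorganizer {
    std::unordered_map<uint64_t, uint64_t> freq;   // step 1: access counters

    void log_access(uint64_t node_id) { freq[node_id]++; }

    // Steps 2+3: sort nodes by descending access frequency, then assign the
    // hottest `per_page` nodes to page 0 (the hot page), the next to page 1...
    std::unordered_map<uint64_t, std::size_t> assign_pages(std::size_t per_page) {
        std::vector<std::pair<uint64_t, uint64_t>> ranked(freq.begin(), freq.end());
        std::sort(ranked.begin(), ranked.end(),
                  [](auto& a, auto& b) { return a.second > b.second; });
        std::unordered_map<uint64_t, std::size_t> page_of;
        for (std::size_t i = 0; i < ranked.size(); i++)
            page_of[ranked[i].first] = i / per_page;
        return page_of;
    }
};
```

In a real system, page 0 would be a (huge) page whose single TLB entry then serves the bulk of the skewed workload's accesses.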

  23. Can Workload-Conscious Node-to-Page Reorganization Help?
• Yes, when the data is sparse and the workload is skewed
• With sparse data, each node contains few children, so small nodes (Node4, 36 bytes) are used
• Many small nodes, not so condensed: more space and more pages, hence more page table entries, and TLB misses matter
[Figure: throughput (lookups/s), y-axis 0-50M, vs. zipf (0 to 3), for ART with and without reorganization]
