

  1. Lists Revisited: Cache Conscious STL lists Leonor Frias, Jordi Petit, Salvador Roura Departament de Llenguatges i Sistemes Informàtics, Universitat Politècnica de Catalunya.

  2. Overview Goal: improve STL list performance in the most common settings using a cache-conscious data structure. Previous work: either ✷ double-linked list implementations: easily cope with the standard requirements ✷ theoretical cache-conscious data structures: do not take any of these requirements into account. Main contribution: merging both approaches. Main problem: dealing with the STL list's iterator functionality. Work done: analysis, design, implementation and a comprehensive experimental study.

  3. Index 1. Introduction and motivation 2. Problem and our approach 3. Design 4. Experiments 5. Conclusions and further work

  4. Standard Template Library (STL) Core of the C++ standard library [International Standard ISO/IEC 14882:1998]. Elements: ✷ containers: list, vector, map... ✷ iterators: high-level pointers ✷ algorithms: sort, reverse, find... Implementation: classical literature on algorithms and data structures.

  5. Improve performance Use the memory hierarchy effectively for known/regular access patterns → cache-conscious algorithms & data structures. General idea: organize data such that the logical access pattern ≈ physical memory locations. Models: ✷ cache-aware ✷ cache-oblivious [Frigo et al. 1999]

  6. STL lists A forward- and backward-traversal container that supports insertion and deletion in constant time. STL list iterator properties: ✷ arbitrary number of iterators ✷ operations cannot invalidate them. Straightforward implementation: a double-linked list. This is what all known STL implementations do!

  7. Double-linked lists cache performance Pointer-based data structures cannot guarantee good cache performance. [Figure: traversal using libstdc++; scaled time (µs) vs. list size (×10⁴, 1–512); series: no-modification, 1/2/4/8/16/32/64-insert-erase, sort.] It is worth trying a cache-conscious approach!

  8. Index 1. Introduction and motivation 2. Problem and our approach 3. Design 4. Experiments 5. Conclusions and further work

  9. Previous work on cache-conscious lists [Demaine 2002] Cache-aware: partition into Θ(n/B) pieces holding between B/2 and B elements each. ✷ Traversal: O(n/B) amortized ✷ Update: constant. Cache-oblivious: uses the packed-memory structure, an array of Θ(n) size with uniformly distributed gaps. ✷ Traversal: O(n/B) amortized ✷ Update: O((log² n)/B) (lower by partitioning the array) ⊲ Amortized constant with self-organizing structures (updates may break the uniformity until the list is reorganized when traversed).

  10. Problem Pointers + a cache-conscious data structure: physical and logical locations are no longer independent. Without plain pointers, every iterator must be reached and updated whenever a modification occurs. Main issue: an unbounded number of iterators may point to the same element. Achieving Θ(1) operations requires either: ✷ arbitrarily restricting the number of iterators, or ✷ letting iterators pointing to the same element share some data. STL lists are not traversed as a whole but step by step ⇒ NO self-organizing strategies.

  11. Our approach Efficient data access + full iterator functionality + (constant) worst-case costs compliant with the Standard. Base: a cache-aware solution. Common list usages(*): ✷ only a few iterators per list instance ✷ many traversals, due to sequential access ✷ frequent modifications at any position ✷ small/plain-old-data (POD) types. (*) Implicit or explicit in the general cache-conscious literature

  12. Index 1. Introduction and motivation 2. Problem and our approach 3. Design 4. Experiments 5. Conclusions and further work

  13. Basic design A double-linked list of buckets. What more? 1. how to arrange the elements inside a bucket 2. how to reorganize the buckets on insertion/deletion 3. how to manage iterators 4. bucket capacity? → determined experimentally

  14. Arrangement of elements

  15. Reorganization of buckets Preserve the data structure invariant after each modification: ✷ minimum bucket occupancy ✷ arrangement coherency ✷ ... Main issue: keeping a balance between ✷ high occupancy ✷ few buckets accessed ✷ few element movements

  16. Iterator management Key idea: all the iterators referring to an element are identified with a dynamic node (relayer) that points to it. Figure 1: bucket of pairs. Figure 2: 2-level list.

  17. Index 1. Introduction and motivation 2. Problem and our approach 3. Design 4. Experiments 5. Conclusions and further work

  18. Set up Our three implementations: ✷ bucket-pairs ✷ 2-level-cont ✷ 2-level-link, against libstdc++ in GCC 4.01. Basic environment: ✷ 64-bit Sun workstation, AMD Opteron CPU at 2.4 GHz ✷ 1 GB main memory ✷ 64 KB + 64 KB 2-way associative L1 cache, 1024 KB 16-way associative L2 cache, 64 bytes per cache line. Other: Pentium 4, 3.06 GHz hyper-threading CPU, 900 MB main memory and 512 KB L2 cache.

  19. Which experiments Performance measures: ✷ wall-clock times ✷ cache performance data: Pin [Luk et al. 2005]. Types of experiments: ✷ lists with no iterators ✷ lists with iterators ✷ lists with several bucket capacities ✷ LEDA. Lists measured before and after element reorganization (by sorting).

  20. Traversal before [Figure: traversal before shuffling (0% iterator load, bucket capacity 100); scaled time (µs) vs. list size (×10⁴, 1–512); series: gcc, bucket-pairs, 2-level-cont, 2-level-link.]

  21. Traversal after [Figure: traversal after shuffling (0% iterator load, bucket capacity 100); scaled time (µs) vs. list size (×10⁴, 1–512); series: gcc, bucket-pairs, 2-level-cont, 2-level-link.]

  22. Pin Traversal after [Figure: traversal after shuffling (0% iterator load, bucket capacity 100); scaled number of L2 cache accesses (log scale) vs. list size (×10⁴, 1–512); series: misses and totals for gcc, bucket-pairs, 2-level-cont, 2-level-link.]

  23. Insert before [Figure: insert traversal before shuffling (0% iterator load, bucket capacity 100); scaled time (µs) vs. list size (×10⁴, 1–512); series: gcc, bucket-pairs, 2-level-cont, 2-level-link.]

  24. Insert after [Figure: insert traversal after shuffling (0% iterator load, bucket capacity 100); scaled time (µs) vs. list size (×10⁴, 1–512); series: gcc, bucket-pairs, 2-level-cont, 2-level-link.]

  25. Intensive insertion [Figure: insert after shuffling (0% iterator load, bucket capacity 100); scaled time (µs) vs. list size (×10⁴, 1–512); series: gcc, bucket-pairs, 2-level-cont, 2-level-link.]

  26. Internal sort [Figure: sort (0% iterator load, bucket capacity 100); scaled time (µs) vs. list size (×10⁴, 1–512); series: gcc, bucket-pairs, 2-level-cont, 2-level-link.]

  27. Effect of bucket capacity [Figure: insert traversal after shuffling (list size 486×10⁴, 0% iterator load); scaled time (µs) vs. bucket capacity (10–1000); series: gcc, bucket-pairs, 2-level-cont, 2-level-link.]

  28. Iterators [Figure: traversal after shuffling (list size 486×10⁴, bucket capacity 100); scaled time (µs) vs. percentage of iterator load (0–100); series: gcc, bucket-pairs, 2-level-cont, 2-level-link.]

  29. LEDA [Figure: traversal after shuffling (0% iterator load, bucket capacity 100); scaled time (µs) vs. list size (×10⁴, 1–512); series: gcc, leda, 2-level-link.]

  30. Index 1. Introduction and motivation 2. Problem and our approach 3. Design 4. Experiments 5. Conclusions and further work

  31. Conclusions (1) Pioneering work showing the importance of porting existing theory and practice on cache-conscious data structures to standard libraries such as the STL. Provided three standard-compliant cache-conscious list implementations. This is not straightforward, although based on simple existing data structures. ✷ Complied with the standard requirements, in particular with iterators: we have provided two standard-compliant iterator designs. ✷ The algorithms involved must be designed carefully to preserve certain properties.

  32. Conclusions (2) Provided a comprehensive experimental study. Our implementations are preferable in many (common) situations to classical double-linked list implementations, such as GCC's (or LEDA's). Specifically: ✷ 5–10 times faster traversals ✷ 3–5 times faster internal sort ✷ still competitive with an (unusually) big load of iterators ✷ bucket capacity is not a critical parameter. Among our implementations: ✷ the 2-level linked implementation ✷ the linked bucket implementation

  33. What next? My webpage: www.lsi.upc.edu/~lfrias Extended article: the reorganization algorithm analysed in detail. ✷ Using amortized analysis, we show that the number of created/destroyed buckets is asymptotically optimal.

  34. Thank you Questions?
