In-Place Data Structures: Which Complexity Measures Do Matter?


  1. In-Place Data Structures: Which Complexity Measures Do Matter?

     Jyrki Katajainen (1, 2), Jingsen Chen (3), Stefan Edelkamp (4), Amr Elmasry (5), Max Stenmark (2)

     (1) Københavns Universitet
     (2) Jyrki Katajainen and Company
     (3) Luleå Tekniska Universitet
     (4) Universität Bremen
     (5) Alexandria University

     © Performance Engineering Laboratory. ARCO meeting at ITU, fall 2012

  2. Model of computation

     Available
     • An infinite array a suitable for storing elements
     • O(1) number of other memory locations for storing elements
     • O(1) number of other variables (counters, indices, bit strings of length ⌈lg(1 + n)⌉)

     [Figure: the array a with n = 8, indices 0 to 7, and the workspace]

     Requirement
     • If the data structure stores n elements, these elements must be kept in the first n locations of a.

  3. Coverage

     In-place data structures
     • Binary heaps
     • Static search trees

     Complexity measures
     • Space utilization
     • # Element comparisons
     • # Element moves
     • # Cache misses
     • # Branch mispredictions
     • Running time

     Aha! What is important?

     The whole cycle: design, analysis, implementation, experimentation

  4. Binary heaps

     [Figure: a binary min-heap with n = 8, stored as a = 8 10 26 75 12 46 75 80 at indices 0 to 7]

     construct()
       for (i = parent(n − 1); i ≥ 0; −−i)
         siftdown(i)

     minimum()
       return a[0]

     insert(x)
       a[n] = x
       siftup(n)
       n += 1

     extract-min()
       min = a[0]
       n −= 1
       a[0] = a[n]
       siftdown(0)
       return min

     left-child(i)
       return 2i + 1

     right-child(i)
       return 2i + 2

     parent(i)
       return ⌊(i − 1) / 2⌋
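The pseudocode above can be sketched in C++ roughly as follows. This is a minimal illustration, not the presenters' code: it assumes `int` elements, a min-heap order, and a `std::vector<int>` standing in for the array a (so memory growth is handled by the vector, outside the slide's model). The function names mirror the slide; `siftdown` and `siftup` are written here with swaps for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Index arithmetic from the slide (0-based array layout).
inline std::size_t left_child(std::size_t i) { return 2 * i + 1; }
inline std::size_t right_child(std::size_t i) { return 2 * i + 2; }
inline std::size_t parent(std::size_t i) { return (i - 1) / 2; }  // requires i > 0

// Move a[i] down until neither child is smaller (min-heap order).
inline void siftdown(std::vector<int>& a, std::size_t i) {
  std::size_t const n = a.size();
  while (left_child(i) < n) {
    std::size_t j = left_child(i);
    if (right_child(i) < n && a[right_child(i)] < a[j]) j = right_child(i);
    if (!(a[j] < a[i])) break;  // heap order already holds here
    std::swap(a[i], a[j]);
    i = j;
  }
}

// Move a[i] up until its parent is not larger.
inline void siftup(std::vector<int>& a, std::size_t i) {
  while (i > 0 && a[i] < a[parent(i)]) {
    std::swap(a[i], a[parent(i)]);
    i = parent(i);
  }
}

// Bottom-up construction: sift down every node that has a child.
inline void construct(std::vector<int>& a) {
  if (a.size() < 2) return;
  for (std::size_t i = parent(a.size() - 1) + 1; i-- > 0;) siftdown(a, i);
}

inline int minimum(std::vector<int> const& a) { return a[0]; }

inline void insert(std::vector<int>& a, int x) {
  a.push_back(x);  // a[n] = x; n += 1
  siftup(a, a.size() - 1);
}

inline int extract_min(std::vector<int>& a) {
  int const min = a[0];
  a[0] = a.back();
  a.pop_back();  // n -= 1
  if (!a.empty()) siftdown(a, 0);
  return min;
}
```

After `construct`, the n elements still occupy the first n positions of the vector, which is the in-place requirement stated in the model of computation.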

  5. Experimental setup

     Standard benchmark
     • construct a heap of size n

     Input data
     • All elements are of type int

     Repetitions
     • Repeat each experiment r times, r = 2^26 / n

     Reported value
     • Measurement result divided by r × n

     Processor
     • Intel® Core™ i5-2520M CPU @ 2.50GHz × 4

     Memory system
     • 12-way-associative L3 cache: 3 MB
     • cache lines: 64 B
     • main memory: 3.8 GB

     Operating system
     • Ubuntu 12.04 (Linux kernel 3.2.0-29-generic)

     Compiler
     • g++ compiler (gcc version 4.6.3) with optimization -O3
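A harness following the repetition rule above might look like the sketch below. Everything beyond r = 2^26 / n and the division by r × n is an assumption made for illustration: the slides do not say how the input was generated, how it was restored between repetitions, or which clock was used. Here `std::make_heap` stands in for the routine under test, the input is pseudo-random, and the copy that restores the input is excluded from the timing.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <random>
#include <vector>

// Number of repetitions for input size n, as stated on the slide: r = 2^26 / n.
inline std::size_t repetitions(std::size_t n) {
  return (std::size_t(1) << 26) / n;
}

// Average heap-construction time in nanoseconds per element.
// Assumptions (not from the slides): random int input, fixed seed,
// steady_clock timing, input restored by an untimed copy.
inline double construction_ns_per_element(std::size_t n) {
  std::size_t const r = repetitions(n);
  std::mt19937 gen(12345);
  std::uniform_int_distribution<int> dist;
  std::vector<int> input(n);
  for (int& x : input) x = dist(gen);

  std::vector<int> work(n);
  std::chrono::steady_clock::duration total(0);
  for (std::size_t k = 0; k < r; ++k) {
    work = input;  // restore the unsorted input (not timed)
    auto const start = std::chrono::steady_clock::now();
    std::make_heap(work.begin(), work.end());  // routine under test
    total += std::chrono::steady_clock::now() - start;
  }
  double const ns = std::chrono::duration<double, std::nano>(total).count();
  return ns / (static_cast<double>(r) * static_cast<double>(n));
}
```

Dividing by r × n makes results for different n directly comparable, since every experiment then processes the same total of 2^26 elements.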

  6. Reduce # element comparisons

     Inventor          construct   insert      extract-min       Extra space
     Williams/Floyd    2n          ∼ lg n      ∼ 2 lg n          O(1) words
     Gonnet & Munro    1.625n                                    Θ(n) words
     Gonnet & Munro                ∼ lg lg n   ∼ lg n + log* n   O(1) words
     Lower bounds      ∼ 1.37n     Ω(1)        ∼ lg n            Ω(1) words

     • construct: Use a binomial tree in the construction
     • insert: Binary search on the siftup path
     • extract-min: lg n − lg lg n levels down along the siftdown path, siftup or recur further down
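The insert idea, binary search on the siftup path, can be illustrated as follows. This is a sketch under assumptions, not the Gonnet and Munro implementation: it uses a 0-based min-heap of `int` in a `std::vector`, and the names `insert_binary_search` and `is_min_heap` are made up for the example. The path from the root to the new leaf is sorted, so the final position of x can be found with about lg lg n element comparisons; the number of element moves stays the same as in an ordinary siftup.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Insert x into a 0-based min-heap, locating its position by binary search
// over the depths of the root-to-leaf path that ends at the new leaf.
inline void insert_binary_search(std::vector<int>& a, int x) {
  a.push_back(x);
  std::size_t const p = a.size();  // 1-based index of the new leaf
  unsigned depth = 0;              // depth of the new leaf: floor(lg p)
  while ((p >> (depth + 1)) != 0) ++depth;

  // The ancestor of the leaf at depth d has 1-based index p >> (depth - d).
  // Find the smallest depth lo whose element is greater than x; if there is
  // none, lo == depth and x stays in the leaf.
  unsigned lo = 0, hi = depth;
  while (lo < hi) {
    unsigned const mid = lo + (hi - lo) / 2;
    std::size_t const anc = (p >> (depth - mid)) - 1;  // 0-based position
    if (x < a[anc]) hi = mid; else lo = mid + 1;
  }

  // Shift the path elements at depths lo .. depth-1 one level down.
  for (unsigned d = depth; d > lo; --d)
    a[(p >> (depth - d)) - 1] = a[(p >> (depth - d + 1)) - 1];
  a[(p >> (depth - lo)) - 1] = x;
}

// Helper for checking the result (min-heap order).
inline bool is_min_heap(std::vector<int> const& a) {
  return std::is_heap(a.begin(), a.end(), std::greater<int>());
}
```

The binary search never reads the leaf slot itself, so the placeholder written by `push_back` does not influence any comparison.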

  7. Floyd's heap-construction program

     [Figure: the example heap with n = 8, stored as a = 8 10 26 75 12 46 75 80 at indices 0 to 7]

         template <typename position, typename index, typename comparator>
         void siftdown(position a, index i, index n, comparator less) {
           typedef typename std::iterator_traits<position>::value_type element;
           element copy = a[i];
         loop:
           index j = 2 * i;
           if (j <= n) {
             if (j < n)
               if (less(a[j], a[j + 1]))
                 j = j + 1;
             if (less(copy, a[j])) {
               a[i] = a[j];
               i = j;
               goto loop;
             }
           }
           a[i] = copy;
         }

         template <typename position, typename comparator>
         void make_heap(position first, position beyond, comparator less) {
           typedef typename std::iterator_traits<position>::difference_type index;
           position const a = first - 1;
           index const n = beyond - first;
           for (index i = n / 2; i > 0; --i)
             siftdown(a, i, n, less);
         }

     [Floyd 1964]

  8. Remove an easy-to-predict if

     opt 1: Make sure that siftdown is always called with an odd n

         if (j < n) ...

         for (index i = n / 2; i > 0; --i)
           siftdown(a, i, n, less);

     →

         template <typename position, typename index, typename comparator>
         void siftup(position a, index j, comparator less) {
           ...
         }

         index const m = (n & 1) ? n : n - 1;
         for (index i = m / 2; i > 0; --i)
           siftdown(a, i, m, less);
         siftup(a, n, less);

     Construction time [ns]
     n      F     F_1
     2^10   7.5   7.1
     2^15   7.4   7.0
     2^20   8.2   7.9
     2^25   8.9   8.4
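A complete version of opt 1 could look like the sketch below. The body of `siftup` is elided on the slide, so the one here is an assumption (a standard hole-based siftup for a max-heap ordered by `less`), and a `std::vector` with the 1-based position p stored in `v[p - 1]` replaces the slide's `first - 1` pointer trick. The names `siftdown_odd` and `make_heap_opt1` are hypothetical.

```cpp
#include <algorithm>   // std::is_heap, handy for checking the result
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// siftdown without the "j < n" test. Valid only for odd n: then every node
// with a left child 2i also has a right child 2i + 1.
template <typename T, typename Less>
void siftdown_odd(std::vector<T>& v, std::size_t i, std::size_t n, Less less) {
  T copy = v[i - 1];
  for (std::size_t j = 2 * i; j <= n; j = 2 * i) {
    if (less(v[j - 1], v[j])) j = j + 1;  // a[j] < a[j + 1]: take right child
    if (!less(copy, v[j - 1])) break;
    v[i - 1] = v[j - 1];
    i = j;
  }
  v[i - 1] = copy;
}

// Assumed siftup: move the element at 1-based position j towards the root.
template <typename T, typename Less>
void siftup(std::vector<T>& v, std::size_t j, Less less) {
  T copy = v[j - 1];
  while (j > 1 && less(v[j / 2 - 1], copy)) {
    v[j - 1] = v[j / 2 - 1];
    j /= 2;
  }
  v[j - 1] = copy;
}

// Build a heap on the first m elements (m odd), then add the last one.
template <typename T, typename Less>
void make_heap_opt1(std::vector<T>& v, Less less) {
  std::size_t const n = v.size();
  if (n < 2) return;
  std::size_t const m = (n & 1) ? n : n - 1;
  for (std::size_t i = m / 2; i > 0; --i) siftdown_odd(v, i, m, less);
  siftup(v, n, less);  // harmless when n is odd: the heap is already complete
}
```

When n is even, the last element is left out of the siftdown phase and inserted afterwards, which costs at most about lg n extra comparisons in total.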

  9. Remove a hard-to-predict if

     opt 2: Interpret the result of a comparison as an integer and use this value in normal index arithmetic

         if (condition) {
           j = j + 1;
         }

     →

         j = j + condition;

     Construction time [ns]
     n      F_1   F_12
     2^10   7.1   4.8
     2^15   7.0   4.9
     2^20   7.9   6.3
     2^25   8.4   7.2
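The transformation can be seen in isolation in the small sketch below, which selects the larger of two sibling elements at positions j and j + 1. The function names are made up for the example. In C++ a `bool` converts to 0 or 1, so the comparison result can be added to the index directly; whether the compiler really emits branch-free machine code for either version depends on the compiler and its flags.

```cpp
#include <cassert>
#include <cstddef>

// Branching version: the outcome depends on the data and is hard to predict.
inline std::size_t larger_child_branch(int const* a, std::size_t j) {
  if (a[j] < a[j + 1]) {
    j = j + 1;
  }
  return j;
}

// Arithmetic version: the comparison result (0 or 1) is added to the index.
inline std::size_t larger_child_arith(int const* a, std::size_t j) {
  return j + static_cast<std::size_t>(a[j] < a[j + 1]);
}
```

Both functions return the same index for every input; only the control flow differs.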

  10. Commercial break: Lean programs

     • A program has a constant number of unnested loops.
     • Each loop is branch-free, except the final conditional branch at the end.
     • A branch predictor is static: forward branches are not taken and backward branches are taken.
     • Each such program induces O(1) branch mispredictions in this model.

     Theorem. Let P be a program of length κ, measured in the number of assembly-language instructions. Assume that the running time of P is t(n) for an input of size n. There exists a program Q of length O(κ) that is equivalent to P, runs in O(κ t(n)) time for the same input as P, and induces O(1) branch mispredictions. [Elmasry, Katajainen 2012]

  11. Reduce # element moves

     opt 3: Do not make any element moves when the element at the root stays in its original location

         element copy = a[i];

     →

         element copy;
         index k = 2 * i;
         k = k + less(a[k], a[k + 1]);
         if (less(a[i], a[k])) {
           copy = a[i];
           a[i] = a[k];
         }
         else {
           return;
         }
         i = k;

     Aha! Loop unrolling

     Construction time [ns]
     n      F_12   F_123
     2^10   4.8    4.3
     2^15   4.9    4.6
     2^20   6.3    5.9
     2^25   7.2    6.9

     Element moves
     n      F      F_123
     2^10   1.73   1.52
     2^15   1.74   1.53
     2^20   1.74   1.53
     2^25   1.74   1.52
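Put together with opt 1 and opt 2, a siftdown that follows opt 3 might read as in this sketch. It is an illustration under assumptions, not the presenters' code: the heap is 1-based with position p stored in `v[p - 1]`, n must be odd (opt 1), and `copy` is created at the first move instead of being default-constructed up front as on the slide. The name `siftdown_123` is hypothetical.

```cpp
#include <algorithm>   // std::is_heap, handy for checking the result
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// siftdown with opt 1 (n odd, no "j < n" test), opt 2 (comparison result
// used in index arithmetic) and opt 3 (first iteration peeled off so that
// nothing is moved when the element at i already dominates its children).
template <typename T, typename Less>
void siftdown_123(std::vector<T>& v, std::size_t i, std::size_t n, Less less) {
  std::size_t k = 2 * i;
  if (k > n) return;                          // i is a leaf
  k = k + less(v[k - 1], v[k]);               // larger of a[k], a[k + 1]
  if (!less(v[i - 1], v[k - 1])) return;      // element stays: zero moves
  T copy = v[i - 1];
  v[i - 1] = v[k - 1];
  i = k;
  for (k = 2 * i; k <= n; k = 2 * i) {        // remaining iterations
    k = k + less(v[k - 1], v[k]);
    if (!less(copy, v[k - 1])) break;
    v[i - 1] = v[k - 1];
    i = k;
  }
  v[i - 1] = copy;
}
```

For an even number of elements, the last element would be handled separately by a siftup, as in opt 1.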

  12. Reduce # cache misses

     opt 4: Visit the nodes in reverse depth-first order instead of reverse breadth-first order [Bojesen et al. 2000]

         for (index i = n / 2; i > 0; --i)
           siftdown(a, i, n, less);

     →

         // As written, every node is handled after both of its children
         // when n + 1 is a power of two.
         index j = n / 2;
         index const i = j / 2;
         while (j > i) {
           siftdown(a, j, n, less);
           index z = j;
           while ((z & 1) == 0) {
             z /= 2;
             siftdown(a, z, n, less);
           }
           --j;
         }

     Construction time [ns]
     n      F     F_123   F_1-4
     2^10   7.4   4.3     5.2
     2^15   7.4   4.6     5.1
     2^20   8.2   5.9     5.2
     2^25   8.7   6.9     5.1
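The reverse depth-first visiting order can be tried out with the sketch below. It is a variation on the loop shown on the slide, not a copy of it: to handle every n, it walks over the deepest level of internal nodes as if that level were full and skips positions that have no children. This padding is an assumption made for the example; for n + 1 a power of two the nodes are visited in the same order as on the slide. The heap is 1-based with position p stored in `v[p - 1]`, and `make_heap_dfs` is a hypothetical name.

```cpp
#include <algorithm>   // std::is_heap, handy for checking the result
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Floyd-style siftdown for a max-heap ordered by less, valid for any n.
template <typename T, typename Less>
void siftdown(std::vector<T>& v, std::size_t i, std::size_t n, Less less) {
  T copy = v[i - 1];
  for (std::size_t j = 2 * i; j <= n; j = 2 * i) {
    if (j < n && less(v[j - 1], v[j])) j = j + 1;  // larger child
    if (!less(copy, v[j - 1])) break;
    v[i - 1] = v[j - 1];
    i = j;
  }
  v[i - 1] = copy;
}

// Heap construction in reverse depth-first order: a node is sifted down
// right after its left subtree is finished (the right one is done earlier).
template <typename T, typename Less>
void make_heap_dfs(std::vector<T>& v, Less less) {
  std::size_t const n = v.size();
  std::size_t const m = n / 2;           // nodes 1..m have at least one child
  if (m == 0) return;
  std::size_t first = 1;                 // first node of the deepest
  while (2 * first <= m) first *= 2;     // level containing internal nodes
  for (std::size_t j = 2 * first - 1; j >= first; --j) {
    if (j <= m) siftdown(v, j, n, less); // skip positions without children
    std::size_t z = j;
    while ((z & 1) == 0) {               // z is a left child: parent is next
      z /= 2;
      siftdown(v, z, n, less);
    }
  }
}
```

Because a subtree is completed before the traversal moves on, the elements touched by consecutive siftdown calls tend to be close together in the tree, which is the source of the reduced cache misses reported by Bojesen et al.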

  13. Making the GM algorithm in-place

     [Figure: a top tree of size ∼ n / lg n above bottom trees of size ∼ lg n]

     1. Improve GM: O(n) words → O(n) bits
     2. Apply the improved algorithm for all bottom trees; keep the bits needed compactly in a word
     3. Use F's siftdown approach for the top tree.

     Element comparisons: ∼ 2n → ∼ 1.625n
     Element moves: ∼ 2n → ∼ 2.125n
     Cache misses: ∼ (n lg B) / B → ∼ n / B, assuming that B lg n << M (B block size; M memory size)

     Construction time [ns]
     n      F     GM
     2^10   7.4   8.0
     2^15   7.4   7.7
     2^20   8.2   7.7
     2^25   8.7   7.7

  14. Heap construction: Summary

     Construction time [ns]
     n      std    F     F_123   F_1-4   GM
     2^10   10.7   7.4   4.3     5.2     8.0
     2^15   10.4   7.4   4.6     5.1     7.7
     2^20   11.0   8.2   5.9     5.2     7.7
     2^25   11.5   8.7   6.9     5.1     7.7

     Instructions
     std    F      F_123   F_1-4   GM
     35.5   20.8   13.4    16.2    42.9

     Element comparisons
     n      std/F   GM
     2^10   1.98    1.80
     2^15   1.99    1.66
     2^20   1.99    1.63
     2^25   2       1.63

     Element moves
     n      std    F      GM
     2^10   3.99   1.99   2.15
     2^15   3.99   1.99   2.39
     2^20   4      1.99   2.38
     2^25   4      2      2.38

     Branches | mispredictions
     n      std           F             F_123         F_1-4         GM
     2^10   5.39 | 0.96   4.53 | 0.81   2.17 | 0.27   2.42 | 0.47   3.60 | 0.66
     2^15   5.40 | 0.89   2.43 | 0.78   2.18 | 0.24   2.43 | 0.47   2.39 | 0.38
     2^20   5.41 | 0.89   4.57 | 0.78   2.18 | 0.24   2.43 | 0.47   – | –
     2^25   5.41 | 0.89   4.56 | 0.78   2.18 | 0.24   2.43 | 0.47   – | –

     I/Os | misses (per n/B)
     n      std/F         F_1-4         GM
     2^10   1.00 | 1.00   1.00 | 1.00   0.95 | 0.95
     2^15   5.66 | 1.00   1.03 | 1.00   1.03 | 1.00
     2^20   5.87 | 4.94   1.04 | 1.00   – | –
     2^25   5.87 | 5.84   1.04 | 0.99   – | –

  15. Static search trees

     [Figure: a search tree over the sorted array a = 8 10 12 26 46 75 75 80, n = 8, indices 0 to 7; index 4 (element 46) is at the root, with indices 2 and 6 below it]

     construct()
       sort(a, a + n)

     is-member(x)
       i = 0
       k = n
       while i ≠ k
         if x < a[i]
           k = i
           i = left-child(i)
         else if a[i] < x
           i = right-child(i)
         else
           return yes
       return no

     left-child(i)
       return ...

     right-child(i)
       return ...
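Since the bodies of left-child and right-child are elided above, the sketch below does not try to reproduce the slide's layout. It swaps in the simplest static search structure over the same sorted array, an ordinary binary search on a half-open index range, with the same three-way test (x < a[i], a[i] < x, otherwise found). The names `construct` and `is_member` follow the slide; the `std::vector<int>` container and the range variables `lo` and `hi` are assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// construct(): sorting the array in place is the whole construction.
inline void construct(std::vector<int>& a) {
  std::sort(a.begin(), a.end());
}

// is-member(x): binary search over the half-open range [lo, hi).
inline bool is_member(std::vector<int> const& a, int x) {
  std::size_t lo = 0;
  std::size_t hi = a.size();
  while (lo != hi) {
    std::size_t const i = lo + (hi - lo) / 2;  // root of the current subtree
    if (x < a[i]) {
      hi = i;        // continue in the left part
    } else if (a[i] < x) {
      lo = i + 1;    // continue in the right part
    } else {
      return true;
    }
  }
  return false;
}
```

The search uses no memory beyond a few indices, so the structure satisfies the in-place requirement: the n elements are all that is stored.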
