Cache-Oblivious Algorithms

Cache-Oblivious Model
The Unknown Machine

• Algorithm → C program → (gcc) → object code → (linux) → execution. Can be executed on machines with a specific class of CPUs.
• Algorithm → Java program → (javac) → Java bytecode → (java) → interpretation. Can be executed on any machine with a Java interpreter.

Goal: Develop algorithms that are optimized w.r.t. memory hierarchies without knowing the parameters.
Cache-Oblivious Model

[Figure: CPU, Memory, and Disk, with I/O between Memory and Disk]

• I/O model
• Algorithms do not know the parameters B and M
• Optimal off-line cache replacement strategy

Frigo et al. 1999
Justification of the ideal-cache model

Optimal replacement
• LRU + 2 × cache size ⇒ at most 2 × cache misses (Sleator and Tarjan, 1985)

Corollary
• $T_{M,B}(N) = O(T_{2M,B}(N))$ ⇒ #cache misses using LRU is $O(T_{M,B}(N))$

Two memory levels
• Optimal cache-oblivious algorithm satisfying $T_{M,B}(N) = O(T_{2M,B}(N))$ ⇒ optimal #cache misses on each level of a multilevel cache using LRU

Fully associative cache
• Simulation of LRU
  – Direct mapped cache
  – Explicit memory management
  – Dictionary (2-universal hash functions) of cache lines in memory
  – Expected $O(1)$ access time to a cache line in memory
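The LRU versus optimal-replacement comparison can be tried out on small request sequences. The sketch below is only an illustration: the function names `lru_misses` and `opt_misses` and the cyclic test sequence are assumptions made for this example, and the off-line strategy is implemented as "evict the block whose next use is farthest in the future".

```python
from collections import OrderedDict

def lru_misses(requests, capacity):
    """Count misses of an LRU cache that holds `capacity` blocks."""
    cache = OrderedDict()
    misses = 0
    for block in requests:
        if block in cache:
            cache.move_to_end(block)          # mark as most recently used
        else:
            misses += 1
            if len(cache) == capacity:
                cache.popitem(last=False)     # evict the least recently used block
            cache[block] = True
    return misses

def opt_misses(requests, capacity):
    """Count misses of the off-line strategy that evicts the block reused farthest ahead."""
    cache = set()
    misses = 0
    for i, block in enumerate(requests):
        if block in cache:
            continue
        misses += 1
        if len(cache) == capacity:
            def next_use(b):
                for j in range(i + 1, len(requests)):
                    if requests[j] == b:
                        return j
                return len(requests)          # never requested again
            cache.remove(max(cache, key=next_use))
        cache.add(block)
    return misses

# Hypothetical workload: a cyclic scan over 5 blocks, which is bad for LRU with 4 slots.
requests = [i % 5 for i in range(40)]
print("LRU, 4 slots:", lru_misses(requests, 4))
print("OPT, 4 slots:", opt_misses(requests, 4))
print("LRU, 8 slots:", lru_misses(requests, 8))
```

With 4 slots LRU misses on every request of this scan, while doubling the cache to 8 slots leaves only the 5 cold misses, which is in line with the resource-augmentation statement on the slide.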
Matrix Multiplication

Matrix Multiplication

Problem: $C = A \cdot B$, where $c_{ij} = \sum_{k=1..N} a_{ik} \cdot b_{kj}$

Layout of matrices (memory position of each entry of an 8 × 8 matrix)

Row major
   0  1  2  3  4  5  6  7
   8  9 10 11 12 13 14 15
  16 17 18 19 20 21 22 23
  24 25 26 27 28 29 30 31
  32 33 34 35 36 37 38 39
  40 41 42 43 44 45 46 47
  48 49 50 51 52 53 54 55
  56 57 58 59 60 61 62 63

Column major
   0  8 16 24 32 40 48 56
   1  9 17 25 33 41 49 57
   2 10 18 26 34 42 50 58
   3 11 19 27 35 43 51 59
   4 12 20 28 36 44 52 60
   5 13 21 29 37 45 53 61
   6 14 22 30 38 46 54 62
   7 15 23 31 39 47 55 63

4 × 4-blocked
   0  1  2  3 16 17 18 19
   4  5  6  7 20 21 22 23
   8  9 10 11 24 25 26 27
  12 13 14 15 28 29 30 31
  32 33 34 35 48 49 50 51
  36 37 38 39 52 53 54 55
  40 41 42 43 56 57 58 59
  44 45 46 47 60 61 62 63

Bit interleaved
   0  1  4  5 16 17 20 21
   2  3  6  7 18 19 22 23
   8  9 12 13 24 25 28 29
  10 11 14 15 26 27 30 31
  32 33 36 37 48 49 52 53
  34 35 38 39 50 51 54 55
  40 41 44 45 56 57 60 61
  42 43 46 47 58 59 62 63
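The four layouts can be written as index functions that map an entry (i, j) to its position in memory. This is a minimal sketch with 0-based indices; the function names are chosen for this example only, and the blocked layout is assumed to store the blocks in row-major order with each block row major internally, which matches the 4 × 4-blocked numbering on the slide.

```python
def row_major(i, j, n):
    """Position of entry (i, j) of an n x n matrix stored row by row."""
    return i * n + j

def column_major(i, j, n):
    """Position of entry (i, j) of an n x n matrix stored column by column."""
    return j * n + i

def blocked(i, j, n, s):
    """Position of (i, j) when s x s blocks are stored one after another."""
    blocks_per_row = n // s
    block_index = (i // s) * blocks_per_row + (j // s)
    return block_index * s * s + (i % s) * s + (j % s)

def bit_interleaved(i, j, bits):
    """Interleave the bits of i and j; each row bit sits just above the matching column bit."""
    index = 0
    for b in range(bits):
        index |= ((j >> b) & 1) << (2 * b)
        index |= ((i >> b) & 1) << (2 * b + 1)
    return index

# Print the bit-interleaved numbering of an 8 x 8 matrix (3 bits per coordinate).
for i in range(8):
    print(" ".join(f"{bit_interleaved(i, j, 3):2d}" for j in range(8)))
```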
Matrix Multiplication

Algorithm 1: Nested loops

    for i = 1 to N
      for j = 1 to N
        c_ij = 0
        for k = 1 to N
          c_ij = c_ij + a_ik · b_kj

– Row major
– Reading a column of B uses N I/Os
– Total $O(N^3)$ I/Os

Algorithm 2: Blocked algorithm (cache-aware)

[Figure: 8 × 8 matrix partitioned into s × s blocks]

– Partition A and B into blocks of size $s \times s$ where $s = \Theta(\sqrt{M})$
– Apply Algorithm 1 to the $\frac{N}{s} \times \frac{N}{s}$ matrices where elements are $s \times s$ matrices
– $s \times s$-blocked or (row major and $M = \Omega(B^2)$):

\[
O\!\left(\left(\frac{N}{s}\right)^{3} \cdot \frac{s^2}{B}\right) = O\!\left(\frac{N^3}{s \cdot B}\right) = O\!\left(\frac{N^3}{B\sqrt{M}}\right) \text{ I/Os}
\]

– Optimal (Hong & Kung, 1981)
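Algorithms 1 and 2 can be sketched side by side as follows. The names `multiply_naive` and `multiply_blocked` are assumptions made for this example, and the tile side `s` must be supplied by the caller, which is what makes the second version cache-aware. Python lists of lists do not give the contiguous layouts discussed on the slides, so the sketch shows the loop structure only, not the I/O behaviour.

```python
def multiply_naive(A, B):
    """Algorithm 1: three nested loops over n x n matrices."""
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]
    return C

def multiply_blocked(A, B, s):
    """Algorithm 2: the same loops applied to s x s tiles.

    s is assumed to be chosen so that a few s x s tiles fit in memory,
    i.e. s = Theta(sqrt(M)).
    """
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for i0 in range(0, n, s):
        for j0 in range(0, n, s):
            for k0 in range(0, n, s):
                # Multiply tile (i0, k0) of A by tile (k0, j0) of B into tile (i0, j0) of C.
                for i in range(i0, min(i0 + s, n)):
                    for j in range(j0, min(j0 + s, n)):
                        acc = C[i][j]
                        for k in range(k0, min(k0 + s, n)):
                            acc += A[i][k] * B[k][j]
                        C[i][j] = acc
    return C

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(multiply_naive(A, B))       # [[19, 22], [43, 50]]
print(multiply_blocked(A, B, 1))  # same product, computed tile by tile
```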
Matrix Multiplication

Algorithm 3: Recursive algorithm (cache-oblivious)

\[
\begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}
\begin{pmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{pmatrix}
=
\begin{pmatrix}
A_{11}B_{11} + A_{12}B_{21} & A_{11}B_{12} + A_{12}B_{22} \\
A_{21}B_{11} + A_{22}B_{21} & A_{21}B_{12} + A_{22}B_{22}
\end{pmatrix}
\]

– 8 recursive $\frac{N}{2} \times \frac{N}{2}$ matrix multiplications + 4 $\frac{N}{2} \times \frac{N}{2}$ matrix sums
– # I/Os if bit interleaved or (row major and $M = \Omega(B^2)$):

\[
T(N) \le
\begin{cases}
O\!\left(\frac{N^2}{B}\right) & \text{if } N \le \varepsilon\sqrt{M} \\
8 \cdot T\!\left(\frac{N}{2}\right) + O\!\left(\frac{N^2}{B}\right) & \text{otherwise}
\end{cases}
\]

\[
T(N) \le O\!\left(\frac{N^3}{B\sqrt{M}}\right)
\]

– Optimal (Hong & Kung, 1981)
– Non-square matrices (Frigo et al., 1999)
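The recursion of Algorithm 3 can be sketched as below. The names `multiply_cache_oblivious` and `_recurse` and the `base` cutoff are assumptions made for this example, and N is assumed to be a power of two. Instead of forming the eight products and adding them with four separate matrix sums, this variant accumulates each product directly into the matching quadrant of C, which performs the same arithmetic. Note that no parameter depends on B or M.

```python
def _recurse(A, B, C, ai, aj, bi, bj, ci, cj, n, base):
    """Add the product of the n x n submatrices of A and B into the n x n submatrix of C."""
    if n <= base:
        for i in range(n):
            for j in range(n):
                for k in range(n):
                    C[ci + i][cj + j] += A[ai + i][aj + k] * B[bi + k][bj + j]
        return
    h = n // 2
    # Eight recursive half-size multiplications: C_ij += A_ik * B_kj.
    for di in (0, h):
        for dj in (0, h):
            for dk in (0, h):
                _recurse(A, B, C,
                         ai + di, aj + dk,
                         bi + dk, bj + dj,
                         ci + di, cj + dj,
                         h, base)

def multiply_cache_oblivious(A, B, base=2):
    """Multiply two n x n matrices recursively; n is assumed to be a power of two."""
    n = len(A)
    assert n >= 1 and n & (n - 1) == 0, "this sketch assumes n is a power of two"
    C = [[0] * n for _ in range(n)]
    _recurse(A, B, C, 0, 0, 0, 0, 0, 0, n, base)
    return C

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(multiply_cache_oblivious(A, B, base=1))  # [[19, 22], [43, 50]]
```

The `base` cutoff only trims recursion overhead in this sketch; the I/O analysis on the slide does not rely on it.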
Matrix Multiplication

Algorithm 4: Strassen's algorithm (cache-oblivious)

– 7 recursive $\frac{N}{2} \times \frac{N}{2}$ matrix multiplications + $O(1)$ matrix sums

\[
\begin{pmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{pmatrix}
=
\begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}
\begin{pmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{pmatrix}
\]

\[
\begin{aligned}
m_1 &:= (a_{21} + a_{22} - a_{11})(b_{22} - b_{12} + b_{11}) \\
m_2 &:= a_{11} b_{11} \\
m_3 &:= a_{12} b_{21} \\
m_4 &:= (a_{11} - a_{21})(b_{22} - b_{12}) \\
m_5 &:= (a_{21} + a_{22})(b_{12} - b_{11}) \\
m_6 &:= (a_{12} - a_{21} + a_{11} - a_{22})\, b_{22} \\
m_7 &:= a_{22}\,(b_{11} + b_{22} - b_{12} - b_{21})
\end{aligned}
\qquad
\begin{aligned}
c_{11} &:= m_2 + m_3 \\
c_{12} &:= m_1 + m_2 + m_5 + m_6 \\
c_{21} &:= m_1 + m_2 + m_4 - m_7 \\
c_{22} &:= m_1 + m_2 + m_4 + m_5
\end{aligned}
\]

– # I/Os if bit interleaved or (row major and $M = \Omega(B^2)$):

\[
T(N) \le
\begin{cases}
O\!\left(\frac{N^2}{B}\right) & \text{if } N \le \varepsilon\sqrt{M} \\
7 \cdot T\!\left(\frac{N}{2}\right) + O\!\left(\frac{N^2}{B}\right) & \text{otherwise}
\end{cases}
\]

\[
T(N) \le O\!\left(\frac{N^{\log_2 7}}{B\sqrt{M}}\right), \qquad \log_2 7 \approx 2.81
\]
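The seven products and four output quadrants above can be checked with a direct recursive sketch. The helper names `add`, `sub`, `split`, and `strassen` are assumptions made for this example, N is assumed to be a power of two, and the recursion goes down to 1 × 1 matrices for simplicity. The quadrants are copied into new lists, so this illustrates the arithmetic rather than a memory-efficient layout.

```python
def add(X, Y):
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def sub(X, Y):
    return [[x - y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def split(X):
    """Return the four quadrants X11, X12, X21, X22 of a square matrix."""
    h = len(X) // 2
    return ([r[:h] for r in X[:h]], [r[h:] for r in X[:h]],
            [r[:h] for r in X[h:]], [r[h:] for r in X[h:]])

def strassen(A, B):
    """Multiply n x n matrices with 7 recursive products; n is assumed to be a power of two."""
    n = len(A)
    if n == 1:
        return [[A[0][0] * B[0][0]]]
    a11, a12, a21, a22 = split(A)
    b11, b12, b21, b22 = split(B)
    # The seven products m1..m7 as listed on the slide.
    m1 = strassen(sub(add(a21, a22), a11), add(sub(b22, b12), b11))
    m2 = strassen(a11, b11)
    m3 = strassen(a12, b21)
    m4 = strassen(sub(a11, a21), sub(b22, b12))
    m5 = strassen(add(a21, a22), sub(b12, b11))
    m6 = strassen(sub(add(sub(a12, a21), a11), a22), b22)
    m7 = strassen(a22, sub(sub(add(b11, b22), b12), b21))
    # The four output quadrants.
    c11 = add(m2, m3)
    c12 = add(add(add(m1, m2), m5), m6)
    c21 = sub(add(add(m1, m2), m4), m7)
    c22 = add(add(add(m1, m2), m4), m5)
    top = [left + right for left, right in zip(c11, c12)]
    bottom = [left + right for left, right in zip(c21, c22)]
    return top + bottom

print(strassen([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```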
Cache-Oblivious Search Trees
Static Cache-Oblivious Trees

Recursive memory layout ≡ van Emde Boas layout

[Figure: a tree of height h is split into a top tree A of height ⌊h/2⌋ and bottom trees B_1, …, B_k of height ⌈h/2⌉; in memory, A is stored first, followed by B_1, …, B_k, each laid out recursively]

• Degree $O(1)$
• Searches use $O(\log_B N)$ I/Os
• Range reporting uses $O\!\left(\log_B N + \frac{k}{B}\right)$ I/Os, where k is the number of reported elements

Prokop 1999

Best possible: $(\log_2 e + o(1)) \log_B N$
Bender, Brodal, Fagerberg, Ge, He, Hu, Iacono, López-Ortiz 2003
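The recursive layout can be sketched for a complete binary tree as follows. Nodes are identified by their heap (BFS) index, with the root as 1 and the children of v as 2v and 2v + 1. The names `veb_order`, `build`, and `search` are assumptions made for this example, and the search keeps an explicit index-to-slot map for clarity, where a real implementation would compute the positions or store child pointers.

```python
def veb_order(root, height):
    """Heap indices of a complete binary subtree, listed in van Emde Boas order."""
    if height == 1:
        return [root]
    top = height // 2             # floor(h/2) levels in the top tree A
    bottom = height - top         # ceil(h/2) levels in each bottom tree B_i
    order = veb_order(root, top)
    first = root << top           # leftmost descendant of root at depth `top`
    for r in range(first, first + (1 << top)):
        order += veb_order(r, bottom)
    return order

def build(sorted_keys, height):
    """Store 2**height - 1 sorted keys as a search tree in vEB order."""
    order = veb_order(1, height)
    pos = {node: slot for slot, node in enumerate(order)}  # heap index -> array slot
    layout = [None] * len(order)

    def fill(node, lo, hi):
        if lo >= hi:
            return
        mid = (lo + hi) // 2
        layout[pos[node]] = sorted_keys[mid]   # median of the range goes to this node
        fill(2 * node, lo, mid)
        fill(2 * node + 1, mid + 1, hi)

    fill(1, 0, len(sorted_keys))
    return layout, pos

def search(layout, pos, key):
    """Standard binary-search-tree descent; only the memory positions differ."""
    node = 1
    while node in pos:
        k = layout[pos[node]]
        if key == k:
            return True
        node = 2 * node + (1 if key > k else 0)
    return False

print(veb_order(1, 4))
# [1, 2, 3, 4, 8, 9, 5, 10, 11, 6, 12, 13, 7, 14, 15]
```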