Covariance Matrices & All-pairs Similarity
Reza Zadeh
April 2015, Stanford DAO

Outline: Introduction, First Pass, DIMSUM, Analysis, Experiments, Spark, More Results

Reza Zadeh (Stanford), Covariance Matrices and All-pairs similarity, April 2014

Notation for matrix A

Given an $m \times n$ matrix $A$, with $m \gg n$.

\[
A = \begin{pmatrix}
a_{1,1} & a_{1,2} & \cdots & a_{1,n} \\
a_{2,1} & a_{2,2} & \cdots & a_{2,n} \\
\vdots  & \vdots  & \ddots & \vdots  \\
a_{m,1} & a_{m,2} & \cdots & a_{m,n}
\end{pmatrix}
\]

A is tall and skinny, example values $m = 10^{12}$, $n = \{10^{4}, 10^{6}\}$.
A has sparse rows, each row has at most $L$ nonzeros.
A is stored across hundreds of machines and cannot be streamed through a single machine.

Computing $A^T A$

We compute $A^T A$.
$A^T A$ is $n \times n$, considerably smaller than $A$.
$A^T A$ is dense.
It holds the dot products between all pairs of columns of $A$.

Guarantees

There is a knob $\gamma$ which can be turned to preserve similarities and singular values, at the price of $O(nL\gamma)$ communication cost and $O(\gamma)$ computation cost.
With a low setting of $\gamma$, preserve similar entries of $A^T A$ (via Cosine, Dice, Overlap, and Jaccard similarity).
With a high setting of $\gamma$, preserve singular values of $A^T A$.

Computing All Pairs of Cosine Similarities

We have to find dot products between all pairs of columns of $A$.
We prove results for general matrices, but can do better for those entries with $\cos(i, j) \ge s$.
Cosine similarity: a widely used definition for "similarity" between two vectors.

\[
\cos(i, j) = \frac{c_i^T c_j}{\|c_i\| \, \|c_j\|}
\]

$c_i$ is the $i$'th column of $A$.

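As a small illustration of the cosine formula above, the following sketch computes the similarity between two columns of a toy matrix. The matrix values and the helper names `column` and `cosine` are made up for this example, and the code assumes every column has at least one nonzero entry.

```python
import math

def column(A, j):
    """Return column j of a matrix stored as a list of rows."""
    return [row[j] for row in A]

def cosine(ci, cj):
    """Cosine similarity c_i^T c_j / (||c_i|| ||c_j||) of two equal-length vectors.

    Assumes both vectors are nonzero (otherwise the norm product is 0).
    """
    dot = sum(x * y for x, y in zip(ci, cj))
    norm_i = math.sqrt(sum(x * x for x in ci))
    norm_j = math.sqrt(sum(y * y for y in cj))
    return dot / (norm_i * norm_j)

# Toy 4 x 3 matrix with made-up values (rows: users, columns: movies).
A = [
    [1, 1, 0],
    [1, 1, 0],
    [0, 1, 1],
    [1, 0, 0],
]

# Columns 0 and 1 share two rows and each has three nonzeros.
sim_01 = cosine(column(A, 0), column(A, 1))
```

Computing every pair this way on one machine is exactly what becomes infeasible at the scale described earlier, which motivates the distributed formulation.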
Example matrix

Rows: users.
Columns: movies.

Distributed Computing Environment

With such large datasets, we must use many machines.
Algorithm code is available in Spark and Scalding.
The algorithm is described with Maps and Reduces so that the framework takes care of distributing the computation.

Naive Implementation

1. Given row $r_i$, Map with NaiveMapper (Algorithm 1)
2. Reduce using the NaiveReducer (Algorithm 2)

Algorithm 1 NaiveMapper($r_i$)
    for all pairs $(a_{ij}, a_{ik})$ in $r_i$ do
        Emit $((j, k) \to a_{ij} a_{ik})$
    end for

Algorithm 2 NaiveReducer($(i, j), \langle v_1, \ldots, v_R \rangle$)
    output $c_i^T c_j \to \sum_{i=1}^{R} v_i$

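The two naive algorithms can be simulated on a single machine as follows. This is only a sketch: the sparse-row representation (a dict from column index to nonzero value), the in-memory "shuffle", the function names, and the choice to emit only pairs with j < k are assumptions made for illustration, not part of the slides.

```python
from collections import defaultdict
from itertools import combinations

def naive_mapper(row):
    """Emit ((j, k), a_ij * a_ik) for every pair of nonzeros in one sparse row.

    row: dict {column index: nonzero value}. Only pairs with j < k are
    emitted here (an assumption; the slide does not fix the pair order).
    """
    for (j, a_ij), (k, a_ik) in combinations(sorted(row.items()), 2):
        yield (j, k), a_ij * a_ik

def naive_reducer(values):
    """Sum all values received for one key, giving the dot product c_j^T c_k."""
    return sum(values)

def run_naive(rows):
    """Simulate map, shuffle, and reduce in memory."""
    shuffle = defaultdict(list)          # key (j, k) -> list of emitted values
    for row in rows:
        for key, value in naive_mapper(row):
            shuffle[key].append(value)
    return {key: naive_reducer(vals) for key, vals in shuffle.items()}

# Two made-up sparse rows.
rows = [
    {0: 2.0, 1: 3.0},
    {0: 1.0, 1: 1.0, 2: 4.0},
]
dots = run_naive(rows)
```

Every co-occurring pair in every row produces one emitted value, which is why the shuffle grows with the number of rows.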
Analysis for First Pass

Very easy analysis:
1) Shuffle size: $O(mL^2)$
2) Largest reduce-key: $O(m)$

Both depend on $m$, the larger dimension, and are intractable for $m = 10^{12}$, $L = 100$.
We'll bring both down via clever sampling, assuming column norms are known or estimates are available.

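To make "intractable" concrete, plugging the example values into the two bounds gives the following rough worst-case figures (constants ignored):

```latex
\[
\text{shuffle size: } mL^2 = 10^{12} \cdot 100^2 = 10^{16} \text{ emitted values},
\qquad
\text{largest reduce-key: } m = 10^{12} \text{ values for a single key.}
\]
```

The second figure arises when two columns are nonzero in nearly every row, so that one key receives a value from almost all $m$ rows.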
Dimension Independent Matrix Square using MapReduce

Algorithm 3 DIMSUMMapper($r_i$)
    for all pairs $(a_{ij}, a_{ik})$ in $r_i$ do
        With probability $\min\left(1, \gamma \frac{1}{\|c_j\| \, \|c_k\|}\right)$
            emit $((j, k) \to a_{ij} a_{ik})$
    end for

Algorithm 4 DIMSUMReducer($(i, j), \langle v_1, \ldots, v_R \rangle$)
    if $\frac{\gamma}{\|c_i\| \, \|c_j\|} > 1$ then
        output $b_{ij} \to \frac{1}{\|c_i\| \, \|c_j\|} \sum_{i=1}^{R} v_i$
    else
        output $b_{ij} \to \frac{1}{\gamma} \sum_{i=1}^{R} v_i$
    end if

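A single-machine sketch of Algorithms 3 and 4 follows. The sparse-row dicts, the in-memory shuffle, the helper `column_norms`, the seeded random generator, and the restriction to pairs with j < k are all assumptions for illustration; a real deployment would run the mapper and reducer inside a framework such as Spark or Scalding.

```python
import math
import random
from collections import defaultdict
from itertools import combinations

def column_norms(rows, n):
    """Euclidean norm of each of the n columns, from sparse rows."""
    sq = [0.0] * n
    for row in rows:
        for j, a in row.items():
            sq[j] += a * a
    return [math.sqrt(s) for s in sq]

def dimsum_mapper(row, norms, gamma, rng):
    """Emit a_ij * a_ik with probability min(1, gamma / (||c_j|| ||c_k||))."""
    for (j, a_ij), (k, a_ik) in combinations(sorted(row.items()), 2):
        p = min(1.0, gamma / (norms[j] * norms[k]))
        if rng.random() < p:
            yield (j, k), a_ij * a_ik

def dimsum_reducer(key, values, norms, gamma):
    """Rescale the sampled sum so that it estimates cos(i, j)."""
    i, j = key
    scale = norms[i] * norms[j]
    if gamma / scale > 1:
        return sum(values) / scale   # nothing was dropped: exact cosine
    return sum(values) / gamma       # sampled: unbiased estimate of cosine

def dimsum(rows, n, gamma, seed=0):
    """Simulate map, shuffle, and reduce in memory."""
    rng = random.Random(seed)
    norms = column_norms(rows, n)
    shuffle = defaultdict(list)
    for row in rows:
        for key, value in dimsum_mapper(row, norms, gamma, rng):
            shuffle[key].append(value)
    return {k: dimsum_reducer(k, v, norms, gamma) for k, v in shuffle.items()}

# Made-up {0,1} rows; with a very large gamma every pair is emitted.
rows = [{0: 1, 1: 1}, {0: 1, 1: 1}, {1: 1, 2: 1}, {0: 1}]
B = dimsum(rows, n=3, gamma=1e9)
```

With a huge `gamma` the sampling probability is 1 everywhere and the output is the exact cosine matrix; lowering `gamma` trades accuracy on low-similarity pairs for a smaller shuffle.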
Analysis for DIMSUM

The algorithm outputs $b_{ij}$, which is a matrix of cosine similarities, call it $B$.
Four things to prove:
1. Shuffle size: $O(nL\gamma)$
2. Largest reduce-key: $O(\gamma)$
3. The sampling scheme preserves similarities when $\gamma = \Omega(\log(n)/s)$
4. The sampling scheme preserves singular values when $\gamma = \Omega(n/\epsilon^2)$

Shuffle size for DIMSUM

Theorem. For $\{0, 1\}$ matrices, the expected shuffle size for DIMSUMMapper is $O(nL\gamma)$.

Proof. Write $\#(c_i)$ for the number of nonzeros in column $c_i$ and $\#(c_i, c_j)$ for the number of rows in which $c_i$ and $c_j$ are both nonzero. The expected contribution from each pair of columns will constitute the shuffle size:

\[
\sum_{i=1}^{n} \sum_{j=i+1}^{n} \sum_{k=1}^{\#(c_i, c_j)} \Pr[\mathrm{DIMSUMEmit}(c_i, c_j)]
= \sum_{i=1}^{n} \sum_{j=i+1}^{n} \#(c_i, c_j) \, \Pr[\mathrm{DIMSUMEmit}(c_i, c_j)]
\]

Shuffle size for DIMSUM

Proof (continued).

\[
\begin{aligned}
&\le \sum_{i=1}^{n} \sum_{j=i+1}^{n} \gamma \, \frac{\#(c_i, c_j)}{\sqrt{\#(c_i)} \sqrt{\#(c_j)}} \\
\text{(by AM-GM)} \quad &\le \frac{\gamma}{2} \sum_{i=1}^{n} \sum_{j=i+1}^{n} \#(c_i, c_j) \left( \frac{1}{\#(c_i)} + \frac{1}{\#(c_j)} \right) \\
&\le \gamma \sum_{i=1}^{n} \frac{1}{\#(c_i)} \sum_{j=1}^{n} \#(c_i, c_j) \\
&\le \gamma \sum_{i=1}^{n} \frac{1}{\#(c_i)} \, L \, \#(c_i) = \gamma L n
\end{aligned}
\]

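The bound can be sanity-checked numerically. The sketch below builds a random $\{0,1\}$ matrix with exactly $L$ nonzeros per row, computes the exact expected number of DIMSUM emissions, and compares it with $\gamma L n$ and with the naive emission count. The matrix sizes, the value of `gamma`, and the function name `expected_shuffle_size` are arbitrary choices for illustration.

```python
import math
import random
from itertools import combinations

def expected_shuffle_size(rows, n, gamma):
    """Exact expected number of DIMSUM emissions for a {0,1} matrix.

    rows: list of sorted lists of column indices that are nonzero in each row.
    For {0,1} data, ||c_j|| = sqrt(#(c_j)), the square root of the nonzero count.
    """
    nnz = [0] * n   # #(c_j): nonzeros per column
    co = {}         # #(c_j, c_k): co-occurrence counts for j < k
    for row in rows:
        for j in row:
            nnz[j] += 1
        for j, k in combinations(row, 2):
            co[(j, k)] = co.get((j, k), 0) + 1
    return sum(
        count * min(1.0, gamma / math.sqrt(nnz[j] * nnz[k]))
        for (j, k), count in co.items()
    )

rng = random.Random(0)
m, n, L, gamma = 2000, 50, 5, 4.0
rows = [sorted(rng.sample(range(n), L)) for _ in range(m)]

expected = expected_shuffle_size(rows, n, gamma)
naive = m * L * (L - 1) // 2   # emissions of the naive mapper (pairs j < k)
bound = gamma * L * n          # the bound from the theorem
```

Increasing `m` in this sketch grows `naive` linearly while `bound` stays fixed, which is the dimension independence the theorem describes.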
Shuffle size for DIMSUM

$O(nL\gamma)$ has no dependence on the dimension $m$; this is the heart of DIMSUM.
This happens because higher magnitude columns are sampled with lower probability:

\[
\gamma \, \frac{1}{\|c_1\| \, \|c_2\|}
\]

Shuffle size for DIMSUM

For matrices with real entries, we can still get a bound.
Let $H$ be the smallest nonzero entry in magnitude, after all entries of $A$ have been scaled to be in $[-1, 1]$.
E.g. for $\{0, 1\}$ matrices, we have $H = 1$.
Shuffle size is bounded by $O(nL\gamma / H^2)$.

Largest reduce key for DIMSUM

Each reduce key receives at most $\gamma$ values (the oversampling parameter).
Immediately get that reduce-key complexity is $O(\gamma)$.
Also independent of dimension $m$. Happens because high magnitude columns are sampled with lower probability.

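One way to see the $\gamma$ bound, sketched here for $\{0, 1\}$ matrices and in expectation, uses the notation $\#(c_i)$ for the number of nonzeros in column $c_i$ and $\#(c_i, c_j)$ for the number of rows where both columns are nonzero. This derivation is an added illustration under those assumptions rather than a step taken from the slides.

```latex
\[
\mathbb{E}\big[\text{values for key } (i, j)\big]
= \#(c_i, c_j) \, \min\!\left(1, \frac{\gamma}{\sqrt{\#(c_i)} \sqrt{\#(c_j)}}\right)
\le \gamma \, \frac{\#(c_i, c_j)}{\sqrt{\#(c_i) \, \#(c_j)}}
\le \gamma ,
\]
\[
\text{since } \#(c_i, c_j) \le \min\big(\#(c_i), \#(c_j)\big) \le \sqrt{\#(c_i) \, \#(c_j)} .
\]
```

The middle ratio is exactly $\cos(i, j)$ for $\{0, 1\}$ columns, so a key receives about $\gamma \cos(i, j)$ values in expectation whenever sampling is active.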