
Dimension Independent Matrix Square using MapReduce - PowerPoint PPT Presentation



  1. Dimension Independent Matrix Square using MapReduce. Reza Bosagh Zadeh. STOC 2013.

  2. Outline: (1) Introduction: The Problem, Why Bother, MapReduce. (2) First Pass: Naive, Analysis. (3) DIMSUM: Algorithm, Shuffle Size, Correctness, Singular values, Similarities. (4) Experiments: Large, Small. (5) More Results.

  3. Computing $A^T A$: Given an $m \times n$ matrix $A$ with entries in $[0, 1]$ and $m \gg n$, compute $A^T A$.
     $$A = \begin{pmatrix} a_{1,1} & a_{1,2} & \cdots & a_{1,n} \\ a_{2,1} & a_{2,2} & \cdots & a_{2,n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m,1} & a_{m,2} & \cdots & a_{m,n} \end{pmatrix}$$
     $A$ is tall and skinny, example values $m = 10^{12}$, $n = 10^{6}$. $A$ has sparse rows: each row has at most $L$ nonzeros. $A$ is stored across thousands of machines and cannot be streamed through a single machine.

  4. Guarantees: Preserve singular values of $A^T A$ with $\epsilon$ relative error, paying shuffle size $O(n^2/\epsilon^2)$ and reduce-key complexity $O(n/\epsilon^2)$, i.e. independent of $m$. If we only need to preserve specific entries of $A^T A$, we can reduce the shuffle size to $O(n \log(n)/s)$ and the reduce-key complexity to $O(\log(n)/s)$, where $s$ is the minimum similarity for the entries being estimated. Similarity can be via Cosine, Dice, Overlap, or Jaccard.

  5. Computing All Pairs of Cosine Similarities: We have to find dot products between all pairs of columns of $A$. We prove results for general matrices, but can do better for those entries with $\cos(i, j) \ge s$. Cosine similarity is a widely used definition of "similarity" between two vectors:
     $$\cos(i, j) = \frac{c_i^T c_j}{\|c_i\| \, \|c_j\|}$$
     where $c_i$ is the $i$'th column of $A$.
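As a small illustration of the definition on slide 5, the following sketch computes $\cos(j, k)$ for every pair of columns of a tiny dense matrix on a single machine. The function name `cosine_similarities` and the list-of-rows input are assumptions made for this example; they are not part of the slides, and this is not the distributed algorithm the talk develops.

```python
import math

def cosine_similarities(rows):
    """Return {(j, k): cos(j, k)} for all column pairs j < k.

    `rows` is a dense matrix given as a list of equal-length rows
    (a hypothetical input format chosen for this sketch).
    """
    n = len(rows[0])
    # Column norms ||c_j||.
    norms = [math.sqrt(sum(r[j] ** 2 for r in rows)) for j in range(n)]
    sims = {}
    for j in range(n):
        for k in range(j + 1, n):
            dot = sum(r[j] * r[k] for r in rows)  # entry (j, k) of A^T A
            if norms[j] > 0 and norms[k] > 0:     # skip all-zero columns
                sims[(j, k)] = dot / (norms[j] * norms[k])
    return sims

A = [[1, 0, 1],
     [1, 1, 0],
     [0, 1, 0]]
sims = cosine_similarities(A)
```

The double loop over columns touches every row for every pair, which is exactly what becomes infeasible at the scale the slides describe.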

  6. Ubiquitous problem

  7. MapReduce: With such large datasets (e.g. $m = 10^{12}$), we must use many machines. The biggest clusters of computers use MapReduce; it is the tool of choice in such distributed systems. With so many machines (around 1000), CPU power is abundant, but communication is expensive. A 2-minute description of MapReduce follows.

  8. MapReduce

  9. MapReduce

  10. MapReduce: Input gets dished out to the mappers roughly equally. Two performance measures: (1) Shuffle size: shuffling the data output by the mappers to the correct reducer is expensive. (2) Largest reduce-key: we can't send too much of the data to a single reducer. Next, a first pass at implementing $\cos(i, j)$ in MapReduce.

  11. Naive Implementation: (1) Given row $r_i$, map with NaiveMapper (Algorithm 1). (2) Reduce using the NaiveReducer (Algorithm 2).
      Algorithm 1 NaiveMapper($r_i$)
        for all pairs $(a_{ij}, a_{ik})$ in $r_i$ do
          Emit $((c_j, c_k) \to a_{ij} a_{ik})$
        end for
      Algorithm 2 NaiveReducer($(c_i, c_j), \langle v_1, \ldots, v_R \rangle$)
        output $c_i^T c_j \to \sum_{i=1}^{R} v_i$
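A minimal single-process sketch of Algorithms 1 and 2 may help make the data flow concrete. It assumes each row is a sparse dict mapping column index to nonzero value, simulates the shuffle with an in-memory dictionary, and emits only pairs with $j < k$; all three choices, and the names `naive_mapper`, `naive_reducer`, and `run_naive`, are assumptions of this example rather than anything stated on the slide.

```python
from collections import defaultdict
from itertools import combinations

def naive_mapper(row):
    """Emit ((j, k), a_ij * a_ik) for every pair of nonzeros in one row."""
    for (j, a_ij), (k, a_ik) in combinations(sorted(row.items()), 2):
        yield (j, k), a_ij * a_ik

def naive_reducer(values):
    """Sum all values that arrived under one (j, k) key."""
    return sum(values)

def run_naive(rows):
    # Stand-in for the shuffle phase: group emitted values by key.
    shuffle = defaultdict(list)
    for row in rows:
        for key, value in naive_mapper(row):
            shuffle[key].append(value)
    return {key: naive_reducer(vals) for key, vals in shuffle.items()}

rows = [{0: 1.0, 2: 1.0}, {0: 1.0, 1: 1.0}, {1: 1.0}]
naive_result = run_naive(rows)
```

Here every pair of nonzeros in every row crosses the shuffle, so the number of emitted records grows with the number of rows.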

  12. Analysis for First Pass: Very easy analysis. (1) Shuffle size: $O(mL^2)$. (2) Largest reduce-key: $O(m)$. Both depend on $m$, the larger dimension, and are intractable for $m = 10^{12}$, $L = 100$. We'll bring both down via clever sampling.
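To see why these bounds are called intractable, one can plug the example values into them directly. The arithmetic below ignores the constants hidden by the $O(\cdot)$ notation, so it is only an order-of-magnitude estimate:

```latex
m L^2 = 10^{12} \cdot 100^2 = 10^{16} \ \text{emitted pairs (shuffle size)},
\qquad
m = 10^{12} \ \text{values for the largest reduce-key}.
```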

  13. DIMSUM Algorithm
      Algorithm 3 DIMSUMMapper($r_i$)
        for all pairs $(a_{ij}, a_{ik})$ in $r_i$ do
          With probability $\min\left(1, \frac{\gamma}{\|c_j\| \, \|c_k\|}\right)$
            emit $((c_j, c_k) \to a_{ij} a_{ik})$
        end for
      Algorithm 4 DIMSUMReducer($(c_i, c_j), \langle v_1, \ldots, v_R \rangle$)
        if $\frac{\gamma}{\|c_i\| \, \|c_j\|} > 1$ then
          output $b_{ij} \to \frac{1}{\|c_i\| \, \|c_j\|} \sum_{i=1}^{R} v_i$
        else
          output $b_{ij} \to \frac{1}{\gamma} \sum_{i=1}^{R} v_i$
        end if
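The mapper and reducer above can be simulated in a single process as follows. This is a sketch under several assumptions that are not on the slide: rows are sparse dicts holding only nonzero entries, the column norms are computed up front in a separate pass, only pairs with $j < k$ are emitted, and the function name `dimsum` and its `seed` parameter are hypothetical.

```python
import math
import random
from collections import defaultdict
from itertools import combinations

def dimsum(rows, gamma, seed=0):
    """Simulate DIMSUMMapper and DIMSUMReducer; return {(j, k): b_jk}."""
    rng = random.Random(seed)

    # Column norms ||c_j||, assumed to be available before mapping.
    sq = defaultdict(float)
    for row in rows:
        for j, a in row.items():
            sq[j] += a * a
    norm = {j: math.sqrt(s) for j, s in sq.items()}

    # Mapper: emit each pair with probability min(1, gamma / (||c_j|| ||c_k||)).
    shuffle = defaultdict(list)
    for row in rows:
        for (j, a_ij), (k, a_ik) in combinations(sorted(row.items()), 2):
            p = min(1.0, gamma / (norm[j] * norm[k]))
            if rng.random() < p:
                shuffle[(j, k)].append(a_ij * a_ik)

    # Reducer: rescale the sum according to which sampling regime applied.
    estimates = {}
    for (j, k), vals in shuffle.items():
        if gamma / (norm[j] * norm[k]) > 1:
            estimates[(j, k)] = sum(vals) / (norm[j] * norm[k])
        else:
            estimates[(j, k)] = sum(vals) / gamma
    return estimates

rows = [{0: 1.0, 2: 1.0}, {0: 1.0, 1: 1.0}, {1: 1.0}]
exact = dimsum(rows, gamma=1e9)  # gamma so large that nothing is dropped
```

With a very large `gamma` every pair is emitted with probability 1 and the output coincides with the exact cosine similarities; smaller values trade accuracy for fewer emitted records.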

  14. Analysis for DIMSUM: Four things to prove: (1) Shuffle size: $O(nL\gamma)$. (2) Largest reduce-key: $O(\gamma)$. (3) The sampling scheme preserves similarities when $\gamma = \Omega(\log(n)/s)$. (4) The sampling scheme preserves singular values when $\gamma = \Omega(n/\epsilon^2)$.

  15. Analysis for DIMSUM: Some notation: (1) $\#(c_i, c_j)$ is the number of times columns $i$ and $j$ have a nonzero in the same dimension. (2) $\#(c_i)$ is the number of nonzeros in the vector $c_i$. (3) The theorem will be about $\{0, 1\}$ matrices, but can be generalized.
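For a $\{0, 1\}$ matrix, both counts in this notation can be tallied in one pass over the rows. The sketch below assumes each row is given as a set of the column indices that hold a 1; that representation and the name `cooccurrence_counts` are choices made for this example only.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(rows):
    """Return (single, pair) where single[i] = #(c_i) and pair[(i, j)] = #(c_i, c_j), i < j."""
    single = Counter()
    pair = Counter()
    for cols in rows:
        single.update(cols)                          # one nonzero per listed column
        pair.update(combinations(sorted(cols), 2))   # one co-occurrence per column pair
    return single, pair

rows = [{0, 2}, {0, 1}, {1}]
single, pair = cooccurrence_counts(rows)
```

For this matrix, columns 0 and 1 each have two nonzeros, and they share exactly one row.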

  16. Shuffle size for DIMSUM. Theorem: For $\{0, 1\}$ matrices, the expected shuffle size for DIMSUMMapper is $O(nL\gamma)$. Proof: The expected contribution from each pair of columns will constitute the shuffle size:
      $$\sum_{i=1}^{n} \sum_{j=i+1}^{n} \sum_{k=1}^{\#(c_i, c_j)} \Pr[\text{DIMSUMSampleEmit}(c_i, c_j)] = \sum_{i=1}^{n} \sum_{j=i+1}^{n} \#(c_i, c_j) \, \Pr[\text{DIMSUMSampleEmit}(c_i, c_j)]$$

  17. Shuffle size for DIMSUM. Proof (continued):
      $$\le \gamma \sum_{i=1}^{n} \sum_{j=i+1}^{n} \frac{\#(c_i, c_j)}{\sqrt{\#(c_i)} \, \sqrt{\#(c_j)}}$$

  18. Shuffle size for DIMSUM. Proof (continued):
      $$\le \gamma \sum_{i=1}^{n} \sum_{j=i+1}^{n} \frac{\#(c_i, c_j)}{\sqrt{\#(c_i)} \, \sqrt{\#(c_j)}}$$
      $$\text{(by AM-GM)} \quad \le \gamma \sum_{i=1}^{n} \sum_{j=i+1}^{n} \#(c_i, c_j) \left( \frac{1}{\#(c_i)} + \frac{1}{\#(c_j)} \right)$$

  19. Shuffle size for DIMSUM. Proof (continued):
      $$\le \gamma \sum_{i=1}^{n} \sum_{j=i+1}^{n} \frac{\#(c_i, c_j)}{\sqrt{\#(c_i)} \, \sqrt{\#(c_j)}}$$
      $$\text{(by AM-GM)} \quad \le \gamma \sum_{i=1}^{n} \sum_{j=i+1}^{n} \#(c_i, c_j) \left( \frac{1}{\#(c_i)} + \frac{1}{\#(c_j)} \right)$$
      $$\le \gamma \sum_{i=1}^{n} \frac{1}{\#(c_i)} \sum_{j=1}^{n} \#(c_i, c_j)$$
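The transcript ends at this step of the proof. One way the bound could be closed, sketched here as an assumption about the remaining step rather than a quotation from the slides, uses the row sparsity from slide 3: each of the $\#(c_i)$ rows with a 1 in column $i$ has at most $L$ nonzeros, so it contributes at most $L$ to the inner sum (taking $\#(c_i, c_i) = \#(c_i)$ and ignoring all-zero columns).

```latex
\sum_{j=1}^{n} \#(c_i, c_j) \le L \, \#(c_i)
\quad\Longrightarrow\quad
\gamma \sum_{i=1}^{n} \frac{1}{\#(c_i)} \sum_{j=1}^{n} \#(c_i, c_j)
\le \gamma \sum_{i=1}^{n} \frac{L \, \#(c_i)}{\#(c_i)}
= \gamma L n .
```

This matches the $O(nL\gamma)$ bound stated in the theorem on slide 16.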
