Dimensionality Reduction and JL Lemma
Lecture 12, February 21, 2019


  1. CS 498ABD: Algorithms for Big Data, Spring 2019. Dimensionality Reduction and JL Lemma. Lecture 12, February 21, 2019. Chandra (UIUC).

  2. F_2 estimation in the turnstile setting.
  AMS-ℓ2-Estimate: Let $Y_1, Y_2, \ldots, Y_n$ be $\{-1, +1\}$ random variables that are 4-wise independent.
      z ← 0
      While (stream is not empty) do
          a_j = (i_j, ∆_j) is the current update
          z ← z + ∆_j · Y_{i_j}
      endWhile
      Output z²
  Claim: The output estimates $\|x\|_2^2$ where x is the vector at the end of the stream of updates.
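
  A minimal Python sketch of this estimator (an illustration, not code from the lecture): the stream is assumed to be a list of (index, delta) pairs, and the 4-wise independent signs come from a random degree-3 polynomial over a prime field, one standard construction (the sign extraction here has only negligible bias).

```python
import random

PRIME = (1 << 61) - 1  # a large prime; the polynomial hash is evaluated modulo this

class FourWiseSign:
    """4-wise independent {-1, +1} variables via a random degree-3 polynomial mod PRIME."""
    def __init__(self, rng):
        self.coeffs = [rng.randrange(PRIME) for _ in range(4)]

    def __call__(self, i):
        h = 0
        for c in self.coeffs:          # Horner evaluation of the polynomial at i
            h = (h * i + c) % PRIME
        return 1 if h & 1 else -1      # fold the hash value down to a +/-1 sign

def ams_f2_estimate(stream, rng=random):
    """One AMS estimator: returns z^2 where z = sum over updates of delta_j * Y_{i_j}."""
    sign = FourWiseSign(rng)
    z = 0
    for i_j, delta_j in stream:        # turnstile update a_j = (i_j, delta_j)
        z += delta_j * sign(i_j)
    return z * z

# Example: the updates build x = (3, -1, 2, 0, ...), so E[estimate] = ||x||_2^2 = 14.
stream = [(0, 3), (1, 2), (2, 2), (1, -3)]
print(ams_f2_estimate(stream, random.Random(0)))
```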

  3. Analysis. $Z = \sum_{i=1}^{n} x_i Y_i$ and the output is $Z^2$.
  $Z^2 = \sum_i x_i^2 Y_i^2 + 2\sum_{i \ne j} x_i x_j Y_i Y_j$. Since $Y_i^2 = 1$ and $\mathbf{E}[Y_i Y_j] = 0$ for $i \ne j$ (4-wise independence implies pairwise independence), we get $\mathbf{E}[Z^2] = \sum_i x_i^2 = \|x\|_2^2$.
  One can show that $\mathrm{Var}(Z^2) \le 2\,(\mathbf{E}[Z^2])^2$.
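
  A sketch of the calculation behind the stated variance bound, filling in steps the slide omits (it uses only the 4-wise independence of the Y_i):

```latex
% Expand Z^4; by 4-wise independence, expectations of products of up to four Y's factor,
% and any term containing some Y_i with odd multiplicity vanishes because E[Y_i] = 0.
\begin{align*}
\mathbf{E}[Z^4] &= \sum_i x_i^4\,\mathbf{E}[Y_i^4]
   + 3\sum_{i \ne j} x_i^2 x_j^2\,\mathbf{E}[Y_i^2]\,\mathbf{E}[Y_j^2]
   \;=\; \sum_i x_i^4 + 3\sum_{i \ne j} x_i^2 x_j^2,\\
\big(\mathbf{E}[Z^2]\big)^2 &= \Big(\sum_i x_i^2\Big)^2
   \;=\; \sum_i x_i^4 + \sum_{i \ne j} x_i^2 x_j^2,\\
\mathrm{Var}(Z^2) &= \mathbf{E}[Z^4] - \big(\mathbf{E}[Z^2]\big)^2
   \;=\; 2\sum_{i \ne j} x_i^2 x_j^2 \;\le\; 2\,\big(\mathbf{E}[Z^2]\big)^2.
\end{align*}
```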

  4. Linear Sketching View. Recall that we average independent estimators and then take a median to reduce error. Can we view all of this as a sketch?
  AMS-ℓ2-Sketch: Let $k = c \log(1/\delta)/\epsilon^2$. Let M be a $k \times n$ matrix with entries in $\{-1, +1\}$ such that (i) the rows are independent and (ii) within each row the entries are 4-wise independent.
      z is a k × 1 vector initialized to 0
      While (stream is not empty) do
          a_j = (i_j, ∆_j) is the current update
          z ← z + ∆_j · M e_{i_j}
      endWhile
      Output the vector z as the sketch.
  M is compactly represented via k hash functions, one per row, chosen independently from a 4-wise independent hash family.
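
  A minimal Python sketch of maintaining z = Mx under turnstile updates, reusing the same polynomial-hash idea for each row; the averaging in estimate_f2 is one simple way to combine the rows (the class name, parameters, and averaging choice are illustrative, not from the course):

```python
import random
import statistics

P = (1 << 61) - 1  # large prime for the per-row polynomial hash

def four_wise_sign(coeffs, i):
    """Degree-3 polynomial hash mod P, mapped to a +/-1 sign (4-wise independent per row)."""
    h = 0
    for c in coeffs:
        h = (h * i + c) % P
    return 1 if h & 1 else -1

class AMSSketch:
    """Maintains z = Mx, where row r of M is given implicitly by its own sign hash."""
    def __init__(self, k, rng=random):
        self.rows = [[rng.randrange(P) for _ in range(4)] for _ in range(k)]
        self.z = [0.0] * k

    def update(self, i, delta):                  # stream update a_j = (i_j, delta_j)
        for r, coeffs in enumerate(self.rows):
            self.z[r] += delta * four_wise_sign(coeffs, i)

    def estimate_f2(self):
        # each z_r^2 is an unbiased estimate of ||x||_2^2; averaging reduces the variance
        return statistics.mean(zr * zr for zr in self.z)

# Usage: same stream as the earlier example, with 200 rows.
sk = AMSSketch(200, random.Random(1))
for i, d in [(0, 3), (1, 2), (2, 2), (1, -3)]:
    sk.update(i, d)
print(sk.estimate_f2())   # should be close to ||x||_2^2 = 14
```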

  5-6. Geometric Interpretation. Given a vector $x \in \mathbb{R}^n$ and the random sketch matrix M, the map $z = Mx$ has the following features:
  $\mathbf{E}[z_i] = 0$ and $\mathbf{E}[z_i^2] = \|x\|_2^2$ for each $1 \le i \le k$, where k is the number of rows of M.
  Thus each $z_i^2$ is an estimate of $\|x\|_2^2$, the squared Euclidean length of x.
  When $k = \Theta(\frac{1}{\epsilon^2}\log(1/\delta))$ one can obtain a $(1 \pm \epsilon)$ estimate of $\|x\|_2$ by the averaging and median ideas.
  Thus we are able to compress x into a k-dimensional vector z such that z contains enough information to estimate $\|x\|_2$ accurately.
  Question: Do we need the median trick? Will averaging do?

  7. Distributional JL Lemma.
  Lemma (Distributional JL Lemma). Fix a vector $x \in \mathbb{R}^d$ and let $\Pi \in \mathbb{R}^{k \times d}$ be a matrix where each entry $\Pi_{ij}$ is chosen independently according to the standard normal distribution N(0, 1). If $k = \Omega(\frac{1}{\epsilon^2}\log(1/\delta))$, then with probability $(1 - \delta)$,
      $\big\|\tfrac{1}{\sqrt{k}} \Pi x\big\|_2 = (1 \pm \epsilon)\,\|x\|_2$.
  One can choose the entries from $\{-1, +1\}$ as well. Note: unlike in ℓ2 estimation, the entries of Π are fully independent.
  Letting $z = \tfrac{1}{\sqrt{k}} \Pi x$, we have projected x from d dimensions down to $k = O(\frac{1}{\epsilon^2}\log(1/\delta))$ dimensions while preserving its length to within a $(1 \pm \epsilon)$ factor.
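
  A quick numerical illustration of the lemma, assuming NumPy; the dimensions and seed below are arbitrary demo choices:

```python
import numpy as np

def djl_project(x, k, rng):
    """Project x from R^d to R^k with a random Gaussian matrix, scaled by 1/sqrt(k)."""
    d = x.shape[0]
    Pi = rng.standard_normal((k, d))      # entries i.i.d. N(0, 1)
    return Pi @ x / np.sqrt(k)

rng = np.random.default_rng(0)
x = rng.standard_normal(10_000)
z = djl_project(x, k=2_000, rng=rng)      # k of order (1/eps^2) log(1/delta)
print(np.linalg.norm(z) / np.linalg.norm(x))   # should be close to 1
```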

  8. Dimensionality reduction.
  Theorem (Metric JL Lemma). Let $v_1, v_2, \ldots, v_n$ be any n points/vectors in $\mathbb{R}^d$. For any $\epsilon \in (0, 1/2)$, there is a linear map $f : \mathbb{R}^d \to \mathbb{R}^k$ where $k \le 8 \ln n / \epsilon^2$ such that for all $1 \le i < j \le n$,
      $(1 - \epsilon)\,\|v_i - v_j\|_2 \le \|f(v_i) - f(v_j)\|_2 \le \|v_i - v_j\|_2$.
  Moreover, f can be obtained in randomized polynomial time.
  The linear map f is simply given by a random matrix Π: $f(v) = \Pi v$. It follows from the Distributional JL Lemma applied to the at most $\binom{n}{2}$ difference vectors $v_i - v_j$ with failure probability of order $1/n^2$, together with a union bound.
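
  A small NumPy check that pairwise distances are roughly preserved for a random point set (the parameters are arbitrary demo choices, not the k from the theorem):

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
n, d, k = 50, 2_000, 800                   # generous k for a clean-looking demo
V = rng.standard_normal((n, d))            # n points in R^d
Pi = rng.standard_normal((k, d)) / np.sqrt(k)
W = V @ Pi.T                               # projected points in R^k

# ratio of projected distance to original distance, over all pairs
ratios = [np.linalg.norm(W[i] - W[j]) / np.linalg.norm(V[i] - V[j])
          for i, j in combinations(range(n), 2)]
print(min(ratios), max(ratios))            # both should be reasonably close to 1
```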

  9. Normal Distribution.
  Density function: $f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$.
  Standard normal: N(0, 1) is the case µ = 0, σ = 1.

  10. Normal Distribution.
  Cumulative distribution function for the standard normal (no closed form):
      $\Phi(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-t^2/2}\, dt$.
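
  Although Φ has no closed form, it is easy to evaluate numerically; a small example using the error function from Python's standard library:

```python
import math

def std_normal_cdf(x):
    """Phi(x) for the standard normal: Phi(x) = (1 + erf(x / sqrt(2))) / 2."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

print(std_normal_cdf(0.0))    # 0.5
print(std_normal_cdf(1.96))   # about 0.975
```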

  11-12. Sum of independent normally distributed variables.
  Lemma. Let X and Y be independent random variables with $X \sim N(\mu_X, \sigma_X^2)$ and $Y \sim N(\mu_Y, \sigma_Y^2)$. Let Z = X + Y. Then $Z \sim N(\mu_X + \mu_Y,\ \sigma_X^2 + \sigma_Y^2)$.
  Corollary. Let X and Y be independent random variables with $X \sim N(0, 1)$ and $Y \sim N(0, 1)$. Let Z = aX + bY. Then $Z \sim N(0,\ a^2 + b^2)$.
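
  A quick simulation sanity-check of the corollary (NumPy assumed; the constants a and b are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = 3.0, 4.0
X = rng.standard_normal(1_000_000)
Y = rng.standard_normal(1_000_000)
Z = a * X + b * Y
print(Z.mean(), Z.std())   # mean ~ 0, standard deviation ~ sqrt(a^2 + b^2) = 5
```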

  13. Concentration of the sum of squares of normally distributed variables.
  Lemma. Let $Z_1, Z_2, \ldots, Z_k$ be independent N(0, 1) random variables and let $Y = \sum_i Z_i^2$. Then, for $\epsilon \in (0, 1/2)$, there is a constant c such that
      $\Pr[(1-\epsilon)^2 k \le Y \le (1+\epsilon)^2 k] \ge 1 - 2 e^{-c\epsilon^2 k}$.
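
  An empirical look at this concentration bound (NumPy assumed; k, ε, and the number of trials are arbitrary demo values):

```python
import numpy as np

rng = np.random.default_rng(0)
k, eps, trials = 400, 0.1, 10_000
Y = (rng.standard_normal((trials, k)) ** 2).sum(axis=1)   # chi-square with k degrees of freedom
inside = np.mean(((1 - eps) ** 2 * k <= Y) & (Y <= (1 + eps) ** 2 * k))
print(inside)   # empirical probability of landing in the window; very close to 1 here
```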

  14. χ² distribution: density function (figure).

  15. χ² distribution: cumulative distribution function (figure).
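
  The plots themselves are not reproduced here. For reference, the standard density of the χ² distribution with k degrees of freedom (a well-known formula, stated for completeness rather than taken from the slides) is

```latex
f_k(y) \;=\; \frac{1}{2^{k/2}\,\Gamma(k/2)}\; y^{k/2 - 1}\, e^{-y/2}, \qquad y > 0.
```

  Its cumulative distribution function has no elementary closed form; it is given by the regularized lower incomplete gamma function.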

  16-20. Proof of DJL Lemma.
  Without loss of generality assume $\|x\|_2 = 1$ (a unit vector).
  Let $Z_i = \sum_{j=1}^{d} \Pi_{ij} x_j$ for $1 \le i \le k$. By the corollary on sums of independent normals, $Z_i \sim N(0, \sum_j x_j^2) = N(0, 1)$.
  Let $Y = \sum_{i=1}^{k} Z_i^2$. The distribution of Y is χ² with k degrees of freedom since $Z_1, \ldots, Z_k$ are iid N(0, 1).
  Hence $\Pr[(1-\epsilon)^2 k \le Y \le (1+\epsilon)^2 k] \ge 1 - 2 e^{-c\epsilon^2 k}$.
  Since $k = \Omega(\frac{1}{\epsilon^2}\log(1/\delta))$ we have $\Pr[(1-\epsilon)^2 k \le Y \le (1+\epsilon)^2 k] \ge 1 - \delta$.
  Therefore $\|z\|_2 = \sqrt{Y/k}$ has the property that with probability $(1 - \delta)$, $\|z\|_2 = (1 \pm \epsilon)\,\|x\|_2$.

  21. JL lower bounds.
  Question: Are the bounds achieved by the lemmas tight, or can we do better? What about non-linear maps?
  Answer: They are essentially optimal up to constant factors for worst-case point sets.

  22-23. Fast JL and Sparse JL.
  The projection matrix Π is dense, hence computing Πx takes Θ(kd) time.
  Question: Can we choose Π to improve the time bound? Two scenarios: x is dense; x is sparse.
  Main ideas:
  Choose $\Pi_{ij}$ to be −1, 0, +1 with probabilities 1/6, 2/3, 1/6 (suitably rescaled). This also works, and roughly 2/3 of the entries are 0.
  Fast JL: Choose Π in a dependent, structured way so that Πx can be computed in $O(d \log d)$ time.
  Sparse JL: Choose Π so that each column is s-sparse. The best known bound is $s = O(\frac{1}{\epsilon}\log(1/\delta))$.
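
  A small NumPy illustration of the {−1, 0, +1} construction; the √(3/k) scaling is the usual normalization for this entry distribution, and the sizes below are arbitrary demo choices:

```python
import numpy as np

def sparse_pm1_matrix(k, d, rng):
    """Entries are +1, 0, -1 with probabilities 1/6, 2/3, 1/6; the sqrt(3/k) factor
    makes each row's inner product with x an unbiased estimator of ||x||_2^2 / k."""
    vals = rng.choice([-1.0, 0.0, 1.0], size=(k, d), p=[1/6, 2/3, 1/6])
    return np.sqrt(3.0 / k) * vals

rng = np.random.default_rng(0)
x = rng.standard_normal(5_000)
Pi = sparse_pm1_matrix(1_000, 5_000, rng)
print(np.linalg.norm(Pi @ x) / np.linalg.norm(x))   # should be close to 1
```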

  24-26. Subspace Embedding.
  Question: Suppose we have a linear subspace E of $\mathbb{R}^d$ of dimension ℓ. Can we find a projection $\Pi : \mathbb{R}^d \to \mathbb{R}^k$ such that for every $x \in E$, $\|\Pi x\|_2 = (1 \pm \epsilon)\,\|x\|_2$?
  Not possible if k < ℓ. Why? Π maps E into a space of lower dimension, so by rank-nullity some non-zero vector $x \in E$ is mapped to 0 (illustrated below).
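
  A small NumPy illustration of why k < ℓ fails: restricted to E, the map Π is a k × ℓ linear map, so it has a nontrivial kernel (all matrices and dimensions below are arbitrary demo choices):

```python
import numpy as np

rng = np.random.default_rng(0)
d, ell, k = 50, 10, 8                        # k < ell on purpose
B = rng.standard_normal((d, ell))            # columns of B span the subspace E
Pi = rng.standard_normal((k, d)) / np.sqrt(k)

# Find a nonzero c with (Pi B) c = 0: the last right singular vector of the
# k x ell matrix Pi B lies in its null space since k < ell.
_, _, Vt = np.linalg.svd(Pi @ B)
c = Vt[-1]
x = B @ c                                    # a nonzero vector in E
print(np.linalg.norm(x), np.linalg.norm(Pi @ x))   # first is > 0, second is ~ 0
```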
