

  1. Sparser Johnson-Lindenstrauss Transforms. Jelani Nelson (Princeton). February 16, 2012. Joint work with Daniel Kane (Stanford).

  2. Random Projections
  • x ∈ R^d, d huge
  • store y = Sx, where S is a k × d matrix (compression)
  • compressed sensing (recover x from y when x is (near-)sparse)
  • group testing (as above, but Sx is Boolean multiplication)
  • recover properties of x (entropy, heavy hitters, ...)
  • approximate norm preservation (want ‖y‖ ≈ ‖x‖)
  • motif discovery (slightly different; randomly project discrete x onto a subset of its coordinates) [Buhler-Tompa]
  • In many of these applications, random S is either required or obtains better parameters than deterministic constructions.
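
A minimal sketch of the compression y = Sx above (an illustration, not from the talk; the dimensions, seed, and choice of Gaussian entries are my own): S is drawn with i.i.d. N(0, 1/k) entries so that E‖Sx‖₂² = ‖x‖₂², and the norm of a fixed vector is roughly preserved.

    import numpy as np

    rng = np.random.default_rng(0)
    d, k = 10_000, 1_000                                  # ambient and target dimensions (arbitrary)
    x = rng.normal(size=d)

    S = rng.normal(scale=1.0 / np.sqrt(k), size=(k, d))   # dense random matrix, entries N(0, 1/k)
    y = S @ x                                             # the stored compression y = Sx

    print(np.linalg.norm(y) / np.linalg.norm(x))          # typically within a few percent of 1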

  3. Metric Johnson-Lindenstrauss lemma
  Metric JL (MJL) Lemma, 1984: Every set of n points in Euclidean space can be embedded into O(ε⁻² log n)-dimensional Euclidean space so that all pairwise distances are preserved up to a 1 ± ε factor.
  Uses:
  • Speed up geometric algorithms by first reducing the dimension of the input [Indyk-Motwani, 1998], [Indyk, 2001]
  • Low-memory streaming algorithms for linear algebra problems [Sarlós, 2006], [LWMRT, 2007], [Clarkson-Woodruff, 2009]
  • Essentially equivalent to RIP matrices from compressed sensing [Baraniuk et al., 2008], [Krahmer-Ward, 2011] (used for recovery of sparse signals)
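
A quick empirical illustration of the MJL statement (a sketch under my own assumptions: the constant 8 in the choice of k, the ±1 entries, and all dimensions are arbitrary, not the lemma's constants): embed n random points and check that every pairwise distance is preserved to within a 1 ± ε factor.

    import numpy as np
    from itertools import combinations

    rng = np.random.default_rng(1)
    n, d, eps = 200, 5_000, 0.25
    k = int(np.ceil(8 * eps**-2 * np.log(n)))              # k = O(eps^-2 log n), constant chosen ad hoc

    X = rng.normal(size=(n, d))                            # n points in R^d
    S = rng.choice([-1.0, 1.0], size=(k, d)) / np.sqrt(k)  # +-1/sqrt(k) entries
    Y = X @ S.T                                            # embedded points, one per row

    worst = max(
        abs(np.linalg.norm(Y[i] - Y[j]) / np.linalg.norm(X[i] - X[j]) - 1)
        for i, j in combinations(range(n), 2)
    )
    print(k, worst)                                        # worst pairwise distortion; should fall below eps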

  4. How to prove the JL lemma
  Distributional JL (DJL) lemma: For any 0 < ε, δ < 1/2 there exists a distribution D_{ε,δ} on R^{k×d} with k = O(ε⁻² log(1/δ)) so that for any x of unit norm
      Pr_{S ∼ D_{ε,δ}} [ | ‖Sx‖₂² − 1 | > ε ] < δ.
  Proof of MJL: Set δ = 1/n² in DJL and let x be the (normalized) difference vector of a pair of points. Union bound over the C(n, 2) pairs.
  Theorem (Alon, 2003): For every n, there exists a set of n points requiring target dimension k = Ω((ε⁻² / log(1/ε)) · log n).
  Theorem (Jayram-Woodruff, 2011; Kane-Meka-N., 2011): For DJL, k = Θ(ε⁻² log(1/δ)) is optimal.
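
For completeness, the union-bound arithmetic behind the proof sketch above (spelled out here; it is only implicit on the slide):

    Pr[ some pair has its distance distorted by more than a 1 ± ε factor ] ≤ C(n, 2) · δ = C(n, 2) / n² < 1/2,

so a single draw of S from D_{ε,δ} preserves all C(n, 2) pairwise distances simultaneously with probability greater than 1/2, at target dimension k = O(ε⁻² log(1/δ)) = O(ε⁻² log n).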

  5. Proving the JL lemma: older proofs
  • [Johnson-Lindenstrauss, 1984], [Frankl-Maehara, 1988]: Random rotation, then projection onto the first k coordinates.
  • [Indyk-Motwani, 1998], [Dasgupta-Gupta, 2003]: Random matrix with independent Gaussian entries.
  • [Achlioptas, 2001]: Independent ±1 entries.
  • [Clarkson-Woodruff, 2009]: O(log(1/δ))-wise independent ±1 entries.
  • [Arriaga-Vempala, 1999], [Matousek, 2008]: Independent entries having mean 0, variance 1/k, and subGaussian tails.
  Downside: Performing the embedding is a dense matrix-vector multiplication, taking O(k · ‖x‖₀) time.

  6. Fast JL Transforms
  • [Ailon-Chazelle, 2006]: x ↦ PHDx, O(d log d + k³) time. P is a random sparse matrix, H is the Hadamard transform, and D has random ±1 entries on its diagonal.
  • [Ailon-Liberty, 2008]: O(d log k + k²) time, also based on the fast Hadamard transform.
  • [Ailon-Liberty, 2011] and [Krahmer-Ward, 2011]: O(d log d) time for MJL, but with suboptimal k = O(ε⁻² log n · log⁴ d).
  Downside: Slow to embed sparse vectors: the running time is Ω(min{ k · ‖x‖₀, d log d }).
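
A rough sketch of the fast-Hadamard idea (a simplification: here P merely subsamples k coordinates, giving a "subsampled randomized Hadamard transform", whereas the P of [Ailon-Chazelle, 2006] is a sparse random matrix; dimensions and seed are arbitrary). A real implementation would use a vectorized or library Walsh-Hadamard transform rather than the Python loops below.

    import numpy as np

    def fwht(a):
        # In-place, unnormalized fast Walsh-Hadamard transform; len(a) must be a power of 2.
        h, n = 1, len(a)
        while h < n:
            for i in range(0, n, 2 * h):
                for j in range(i, i + h):
                    a[j], a[j + h] = a[j] + a[j + h], a[j] - a[j + h]
            h *= 2
        return a

    rng = np.random.default_rng(2)
    d, k = 2**13, 256                             # d must be a power of 2 here
    x = rng.normal(size=d)

    D = rng.choice([-1.0, 1.0], size=d)           # random +-1 signs on the diagonal of D
    z = fwht(D * x) / np.sqrt(d)                  # (1/sqrt(d)) * H * D * x, an orthonormal transform
    rows = rng.integers(0, d, size=k)             # simplified P: sample k coordinates uniformly
    y = np.sqrt(d / k) * z[rows]                  # rescale so that E||y||^2 = ||x||^2

    print(np.linalg.norm(y) / np.linalg.norm(x))  # typically close to 1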

  7. Where Do Sparse Vectors Show Up?
  • Document as a bag of words: x_i = number of occurrences of word i. Compare documents using cosine similarity. d = lexicon size; most documents aren't dictionaries.
  • Network traffic: x_{i,j} = number of bytes sent from i to j. d = 2⁶⁴ (2²⁵⁶ in IPv6); most servers don't talk to each other.
  • User ratings: x_i is a user's score for movie i on Netflix. d = number of movies; most people haven't rated all movies.
  • Streaming: x receives a stream of updates of the form "add v to x_i". Maintaining Sx requires calculating v · Se_i (see the sketch below).
  • ...
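
A small sketch of the streaming bullet above (illustrative; any fixed sketching matrix S works here, and the per-update cost is the number of nonzeros in the touched column of S): maintaining y = Sx under an update "add v to x_i" simply adds v · Se_i, that is, v times column i of S.

    import numpy as np

    rng = np.random.default_rng(3)
    d, k = 10_000, 200
    S = rng.normal(scale=1.0 / np.sqrt(k), size=(k, d))  # any fixed sketching matrix

    x = np.zeros(d)
    y = np.zeros(k)                                      # maintained sketch y = Sx
    for _ in range(1_000):                               # stream of updates "add v to x_i"
        i, v = rng.integers(d), rng.normal()
        x[i] += v
        y += v * S[:, i]                                 # add v * (column i of S), i.e. v * S e_i

    print(np.allclose(y, S @ x))                         # True: the maintained sketch equals Sx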

  8. Sparse JL transforms
  One way to embed sparse vectors faster: use sparse matrices.
  Let s = number of non-zero entries per column of the embedding matrix (so the embedding time is s · ‖x‖₀).

  reference                      value of s             type
  [JL84], [FM88], [IM98], ...    k ≈ 4ε⁻² log(1/δ)      dense
  [Achlioptas01]                 k/3                    sparse Bernoulli
  [WDALS09]                      no proof               hashing
  [DKS10]                        Õ(ε⁻¹ log³(1/δ))       hashing
  [KN10a], [BOR10]               Õ(ε⁻¹ log²(1/δ))       hashing
  [KN12]                         O(ε⁻¹ log(1/δ))        hashing (random codes)

  9. Other related work
  • The CountSketch of [Charikar-Chen-FarachColton] gives s = O(log(1/δ)) (see [Thorup-Zhang]).
  • It can recover (1 ± ε)‖x‖₂ from Sx, but not as ‖Sx‖₂ (it is not an embedding into ℓ₂).
  • Hence it is not applicable in certain situations, e.g. in some nearest-neighbor data structures, and when learning classifiers over projected vectors via stochastic gradient descent.
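
A sketch of a CountSketch-style ℓ₂ estimator in the spirit of [Thorup-Zhang] (my rendering; the bucket width, repetition count, and fully random hashes are assumptions): the norm estimate is a median over independent repetitions, which is exactly why it does not have the form ‖Sx‖₂ and is not an embedding into ℓ₂.

    import numpy as np

    rng = np.random.default_rng(4)
    d, width, reps = 50_000, 200, 15                     # reps plays the role of O(log(1/delta))

    x = np.zeros(d)
    support = rng.choice(d, size=300, replace=False)
    x[support] = rng.normal(size=300)                    # a sparse test vector

    buckets = rng.integers(0, width, size=(reps, d))     # hash of each coordinate, per repetition
    signs = rng.choice([-1.0, 1.0], size=(reps, d))      # random sign of each coordinate, per repetition

    sketch = np.zeros((reps, width))
    for r in range(reps):
        np.add.at(sketch[r], buckets[r], signs[r] * x)   # CountSketch row r of x

    est = np.median((sketch ** 2).sum(axis=1))           # median of per-repetition squared norms
    print(np.sqrt(est) / np.linalg.norm(x))              # close to 1, typically within ~10%

The median is taken across repetitions rather than reading off the norm of a single linear image of x, which is the distinction the slide draws.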

  10. Sparse JL Constructions
  • [DKS, 2010]: s = Θ̃(ε⁻¹ log²(1/δ))
  • [this work]: s = Θ(ε⁻¹ log(1/δ))
  [Figure: pictures of the sparse embedding matrices for the two results, with the block height k/s marked.]

  11. Sparse JL Constructions (in matrix form)
  [Figure: two k × d matrices acting on x. In the first, each column has s nonzero cells placed in arbitrary rows; in the second, each column is divided into blocks of height k/s with one nonzero cell per block. Each black (nonzero) cell is ±1/√s at random.]

  12. Sparse JL Constructions (nicknames)
  • "Graph" construction: the s nonzeros of each column sit in s distinct, arbitrarily chosen rows.
  • "Block" construction: each column is divided into s blocks of height k/s, with one nonzero per block.
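
A small sketch contrasting how the two sparsity patterns might be sampled (an illustration based on the descriptions above, not code from the talk; the parameters are arbitrary and s is assumed to divide k):

    import numpy as np

    rng = np.random.default_rng(5)
    d, k, s = 1_000, 120, 8                       # s must divide k for the block construction
    block_height = k // s

    def graph_column():
        # "Graph" construction: s distinct rows per column, chosen at random.
        return rng.choice(k, size=s, replace=False)

    def block_column():
        # "Block" construction: one row inside each of the s blocks of height k/s.
        return np.arange(s) * block_height + rng.integers(0, block_height, size=s)

    S = np.zeros((k, d))
    for j in range(d):
        rows = block_column()                     # swap in graph_column() for the other pattern
        S[rows, j] = rng.choice([-1.0, 1.0], size=s) / np.sqrt(s)

    x = rng.normal(size=d)
    print(np.linalg.norm(S @ x) / np.linalg.norm(x))   # roughly 1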

  13. Sparse JL notation (block construction)
  • Let h(j, r), σ(j, r) be the random hash location and random sign for the copy of x_j in the r-th block.
  • (Sx)_i = (1/√s) · Σ_{h(j,r)=i} x_j · σ(j, r)
  • ‖Sx‖₂² = ‖x‖₂² + (1/s) · Σ_{r=1}^{s} Σ_{i≠j} x_i x_j · σ(i, r) σ(j, r) · 1[h(i, r) = h(j, r)]
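
A direct implementation of the block construction as defined above (a sketch: h and σ are stored here as fully random tables for simplicity, whereas the construction in the talk uses more structured randomness, cf. the random-codes view on the next slide; all parameters are arbitrary):

    import numpy as np

    rng = np.random.default_rng(6)
    d, k, s = 10_000, 512, 16                     # s must divide k
    block_height = k // s

    # h[j, r]: offset within block r of the copy of x_j (the slide's h(j, r) is the global row,
    # i.e. r*(k/s) + h[j, r]); sigma[j, r]: its random sign.
    h = rng.integers(0, block_height, size=(d, s))
    sigma = rng.choice([-1.0, 1.0], size=(d, s))

    def embed(x):
        # Computes (Sx)_i = (1/sqrt(s)) * sum over (j, r) hashing to row i of x_j * sigma(j, r).
        nz = np.nonzero(x)[0]                     # only nonzero coordinates contribute
        y = np.zeros(k)
        for r in range(s):
            rows = r * block_height + h[nz, r]    # global row of each copy in block r
            np.add.at(y, rows, x[nz] * sigma[nz, r])
        return y / np.sqrt(s)

    x = np.zeros(d)
    support = rng.choice(d, size=100, replace=False)
    x[support] = rng.normal(size=100)             # sparse input
    y = embed(x)
    print(np.linalg.norm(y) / np.linalg.norm(x))  # close to 1

Beyond initializing the length-k output, the work is s · ‖x‖₀ multiplications and additions, which is the point of the s values in the table above.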

  14. Sparse JL via Codes
  [Figure: the same two matrix pictures, with each column's nonzero pattern viewed as a codeword.]
  • Graph construction: constant-weight binary code of weight s (the support of each column is a codeword).
  • Block construction: code over a q-ary alphabet with q = k/s (the r-th symbol of column j's codeword is the row chosen within block r).
