compsci 514: algorithms for data science
Cameron Musco, University of Massachusetts Amherst. Spring 2020. Lecture 11.
logistics

• Problem Set 2 is due this upcoming Sunday 3/8.
• Midterm is next Thursday, 3/12. See the webpage for the study guide/practice questions.
• Let me know ASAP if you need accommodations (e.g., extended time).
• My office hours next Tuesday will focus on exam review. I will hold them at the usual time, and also before class at 10:15am.
• I am rearranging the next two lectures to spend more time on the JL Lemma and randomized methods, before moving on to spectral methods (PCA, spectral clustering, etc.).
midterm assessment process

Thanks for your feedback! Some specifics:
• More details in proofs and slower pace. Will try to find a balance with this.
• Recap at the end of class.
• I will post 'compressed' versions of the slides. Not perfect, but looking into ways to improve.
• After the midterm, I might split the homework assignments into more, smaller assignments to spread out the work more.
summary

Last Class:
• Low-distortion embeddings for any set of points via random projection.
• Started on the proof of the JL Lemma via the Distributional JL Lemma.

This Class:
• Finish up the proof of the JL Lemma.
• Example applications to classification and clustering.
• Discuss connections to high dimensional geometry.
the johnson-lindenstrauss lemma

Johnson-Lindenstrauss Lemma: For any set of points x_1, ..., x_n ∈ R^d and ϵ > 0 there exists a linear map Π : R^d → R^m such that m = O(log n / ϵ²) and, letting x̃_i = Π x_i, for all i, j:

(1 − ϵ)∥x_i − x_j∥_2 ≤ ∥x̃_i − x̃_j∥_2 ≤ (1 + ϵ)∥x_i − x_j∥_2.

Further, if Π ∈ R^{m×d} has each entry chosen i.i.d. as N(0, 1/m) and m = O(log(n/δ) / ϵ²), then Π satisfies the guarantee with probability ≥ 1 − δ.
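As a quick illustration (my own demo, not from the lecture), here is a minimal numpy sketch that checks the JL guarantee empirically: project n random points into m = O(log n / ϵ²) dimensions with a Gaussian Π and measure the worst pairwise distance distortion. The constant 8 and all problem sizes are arbitrary choices for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, eps = 200, 10_000, 0.2
m = int(8 * np.log(n) / eps**2)          # compressed dimension (constant 8 is arbitrary)

X = rng.normal(size=(n, d))              # n arbitrary points, one per row
Pi = rng.normal(scale=1/np.sqrt(m), size=(m, d))   # entries i.i.d. N(0, 1/m)
X_tilde = X @ Pi.T                       # compressed points in R^m

worst = 0.0                              # worst multiplicative distortion over all pairs
for i in range(n):
    for j in range(i + 1, n):
        true_dist = np.linalg.norm(X[i] - X[j])
        proj_dist = np.linalg.norm(X_tilde[i] - X_tilde[j])
        worst = max(worst, abs(proj_dist / true_dist - 1))

print(f"m = {m}, worst relative distortion = {worst:.3f}, target eps = {eps}")
```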
random projection

• Can store x̃_1, ..., x̃_n in n · m rather than n · d space. What about Π?
• Often don't need to store Π explicitly – compute it on the fly:
• For i = 1, ..., d: x̃_j := x̃_j + h(i) · x_j(i), where h : [d] → R^m is a random hash function outputting vectors (the columns of Π).
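A minimal sketch of this on-the-fly idea, assuming a seeded pseudo-random generator keyed by the coordinate index i stands in for the hash function h (an illustrative choice, not the course's construction):

```python
import numpy as np

def column(i: int, m: int) -> np.ndarray:
    """'h(i)': a pseudo-random column of Pi with entries ~ N(0, 1/m),
    recomputable from the index i alone, so Pi is never stored."""
    rng = np.random.default_rng(i)
    return rng.normal(scale=1/np.sqrt(m), size=m)

def project(x: np.ndarray, m: int) -> np.ndarray:
    """Compute x_tilde = Pi x one coordinate of x at a time."""
    x_tilde = np.zeros(m)
    for i in range(x.shape[0]):          # x_tilde := x_tilde + h(i) * x(i)
        x_tilde += column(i, m) * x[i]
    return x_tilde

# usage: project a single 10,000-dimensional vector down to 500 dimensions
x = np.random.default_rng(42).normal(size=10_000)
print(np.linalg.norm(project(x, 500)) / np.linalg.norm(x))   # should be close to 1
```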
distributional jl

We showed that the Johnson-Lindenstrauss Lemma follows from:

Distributional JL Lemma: Let Π ∈ R^{m×d} have each entry chosen i.i.d. as N(0, 1/m). If we set m = O(log(1/δ) / ϵ²), then for any y ∈ R^d, with probability ≥ 1 − δ,

(1 − ϵ)∥y∥_2 ≤ ∥Πy∥_2 ≤ (1 + ϵ)∥y∥_2.

Main Idea: Union bound over the (n choose 2) difference vectors y_ij = x_i − x_j.

Π ∈ R^{m×d}: random projection matrix, d: original dimension, m: compressed dimension, ϵ: embedding error, δ: embedding failure prob.
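A rough empirical check of the distributional guarantee (my own demo, with arbitrary constants): fix one vector y, draw Π many times, and estimate how often ∥Πy∥_2 falls outside (1 ± ϵ)∥y∥_2.

```python
import numpy as np

rng = np.random.default_rng(1)
d, m, eps, trials = 1_000, 400, 0.15, 300
y = rng.normal(size=d)                   # one fixed, arbitrary vector
norm_y = np.linalg.norm(y)

failures = 0
for _ in range(trials):
    Pi = rng.normal(scale=1/np.sqrt(m), size=(m, d))   # fresh Pi each trial
    ratio = np.linalg.norm(Pi @ y) / norm_y
    failures += not (1 - eps <= ratio <= 1 + eps)      # count norm-distortion failures

print("fraction of trials outside (1 ± eps):", failures / trials)
```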
distributional jl proof

• Let ỹ denote Πy and let Π(j) denote the j-th row of Π.
• For any j, ỹ(j) = ⟨Π(j), y⟩ = Σ_{i=1}^d (g_i / √m) · y(i), where g_i ∼ N(0, 1).
• g_i · y(i) ∼ N(0, y(i)²): a normal distribution with variance y(i)².
• So ỹ(j) is also Gaussian, with ỹ(j) ∼ N(0, ∥y∥_2² / m): a sum of independent Gaussians is Gaussian, and the 1/√m scaling divides the total variance Σ_i y(i)² = ∥y∥_2² by m.

y ∈ R^d: arbitrary vector, ỹ ∈ R^m: compressed vector, Π ∈ R^{m×d}: random projection mapping y → ỹ, Π(j): j-th row of Π, d: original dimension, m: compressed dimension, g_i: normally distributed random variable.
distributional jl proof

Upshot: Each entry of our compressed vector ỹ is Gaussian: ỹ(j) ∼ N(0, ∥y∥_2² / m).

E[∥ỹ∥_2²] = E[Σ_{j=1}^m ỹ(j)²] = Σ_{j=1}^m E[ỹ(j)²] = Σ_{j=1}^m ∥y∥_2² / m = ∥y∥_2².

So ỹ has the right norm in expectation.

How is ∥ỹ∥_2² distributed? Does it concentrate?

y ∈ R^d: arbitrary vector, ỹ ∈ R^m: compressed vector, Π ∈ R^{m×d}: random projection mapping y → ỹ, d: original dimension, m: compressed dimension, g_i: normally distributed random variable.
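A small numerical sanity check of these two facts (sizes are hypothetical, not from the lecture): the per-coordinate variance of ỹ should be ∥y∥_2²/m, and ∥ỹ∥_2² should match ∥y∥_2² on average.

```python
import numpy as np

rng = np.random.default_rng(2)
d, m, trials = 500, 50, 2_000
y = rng.normal(size=d)

first_coords, sq_norms = [], []
for _ in range(trials):
    Pi = rng.normal(scale=1/np.sqrt(m), size=(m, d))   # entries i.i.d. N(0, 1/m)
    y_tilde = Pi @ y
    first_coords.append(y_tilde[0])                    # track one fixed coordinate
    sq_norms.append(y_tilde @ y_tilde)                 # track the squared norm

print("Var[y_tilde(1)] ~", np.var(first_coords), " vs  ||y||^2 / m =", (y @ y) / m)
print("E[||y_tilde||^2] ~", np.mean(sq_norms), " vs  ||y||^2 =", y @ y)
```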
distributional jl proof

So Far: Each entry of our compressed vector ỹ is Gaussian, with ỹ(j) ∼ N(0, ∥y∥_2² / m) and E[∥ỹ∥_2²] = ∥y∥_2².

∥ỹ∥_2² = Σ_{j=1}^m ỹ(j)² is a Chi-Squared random variable with m degrees of freedom (a sum of m squared independent Gaussians).

Lemma (Chi-Squared Concentration): Letting Z be a Chi-Squared random variable with m degrees of freedom,
Pr[|Z − E[Z]| ≥ ϵ E[Z]] ≤ 2e^{−mϵ²/8}.

If we set m = O(log(1/δ) / ϵ²), then with probability ≥ 1 − δ:
(1 − ϵ)∥y∥_2² ≤ ∥ỹ∥_2² ≤ (1 + ϵ)∥y∥_2².

Gives the distributional JL Lemma and thus the classic JL Lemma!

y ∈ R^d: arbitrary vector, ỹ ∈ R^m: compressed vector, Π ∈ R^{m×d}: random projection mapping y → ỹ, d: original dimension, m: compressed dimension, ϵ: embedding error, δ: embedding failure prob.
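For completeness, a worked version of the last step (my own filling-in, assuming the constant 8 from the concentration bound): apply the lemma to Z = (m/∥y∥_2²) · ∥ỹ∥_2², which is Chi-Squared with E[Z] = m, so

```latex
\Pr\left[\,\big|\|\tilde{y}\|_2^2 - \|y\|_2^2\big| \ge \epsilon\,\|y\|_2^2\,\right]
= \Pr\big[\,|Z - \mathbb{E}[Z]| \ge \epsilon\,\mathbb{E}[Z]\,\big]
\le 2e^{-m\epsilon^2/8} \le \delta
\quad\text{once}\quad
m \ge \frac{8\ln(2/\delta)}{\epsilon^2}.
```

A (1 ± ϵ) bound on ∥ỹ∥_2² then gives a (1 ± ϵ) bound on ∥ỹ∥_2, up to adjusting ϵ by a constant factor.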
example application: svm

Support Vector Machines: A classic ML algorithm, where data in two classes A and B is classified with a hyperplane.
• For any point a in A: ⟨a, w⟩ ≥ c + m.
• For any point b in B: ⟨b, w⟩ ≤ c − m.
• Assume all vectors have unit norm. (Here m denotes the margin, not the compressed dimension.)

The JL Lemma implies that after projection into O(log n / m²) dimensions, we still have ⟨ã, w̃⟩ ≥ c + m/2 and ⟨b̃, w̃⟩ ≤ c − m/2.

Upshot: Can randomly project and run SVM (much more efficiently) in the lower dimensional space to find the separator w̃.
example application: svm

Claim: After random projection into O(log n / m²) dimensions, if ⟨a, w⟩ ≥ c + m ≥ 0 then ⟨ã, w̃⟩ ≥ c + m/2.

By the JL Lemma applied with ϵ = m/4:

∥ã − w̃∥_2² ≤ (1 + m/4)∥a − w∥_2²
∥ã∥_2² + ∥w̃∥_2² − 2⟨ã, w̃⟩ ≤ (1 + m/4)(∥a∥_2² + ∥w∥_2² − 2⟨a, w⟩)
(1 + m/4) · 2⟨a, w⟩ − 4 · (m/4) ≤ 2⟨ã, w̃⟩   (using ∥a∥_2 = ∥w∥_2 = 1 and ∥ã∥_2², ∥w̃∥_2² ≥ 1 − m/4)
⟨a, w⟩ − m/2 ≤ ⟨ã, w̃⟩   (since ⟨a, w⟩ ≥ 0)
c + m − m/2 ≤ ⟨ã, w̃⟩.
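An illustrative numerical check of this claim (all data and constants below are made up for the sketch; the hyperplane offset is c = 0 and 'marg' plays the role of the margin m): generate unit-norm points with ⟨a, w⟩ = marg, project, and look at the smallest projected margin.

```python
import numpy as np

rng = np.random.default_rng(3)
d, n, marg = 5_000, 100, 0.2
m = int(np.ceil(16 * np.log(n) / marg**2))   # O(log n / margin^2); constant 16 is arbitrary

w = rng.normal(size=d)
w /= np.linalg.norm(w)                       # unit-norm separator

# build unit-norm points a with <a, w> exactly equal to marg (so c = 0 here)
A = rng.normal(size=(n, d))
A -= (A @ w)[:, None] * w                    # remove the w-component
A /= np.linalg.norm(A, axis=1, keepdims=True)
A = np.sqrt(1 - marg**2) * A + marg * w      # unit norm, margin exactly marg

Pi = rng.normal(scale=1/np.sqrt(m), size=(m, d))   # entries i.i.d. N(0, 1/m)
A_tilde, w_tilde = A @ Pi.T, Pi @ w

print("compressed dimension m:", m)
print("smallest projected margin:", np.min(A_tilde @ w_tilde), "(claim: >=", marg / 2, ")")
```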
example application: k-means clustering

Goal: Separate n points in d-dimensional space into k groups.

k-means Objective: Cost(C_1, ..., C_k) = min_{C_1,...,C_k} Σ_{j=1}^k Σ_{x ∈ C_j} ∥x − μ_j∥_2², where μ_j is the mean of the points in C_j.

Write in terms of distances:

Cost(C_1, ..., C_k) = min_{C_1,...,C_k} Σ_{j=1}^k 1/(2|C_j|) Σ_{x_1, x_2 ∈ C_j} ∥x_1 − x_2∥_2².
example application: k-means clustering

k-means Objective: Cost(C_1, ..., C_k) = min_{C_1,...,C_k} Σ_{j=1}^k 1/(2|C_j|) Σ_{x_1, x_2 ∈ C_j} ∥x_1 − x_2∥_2².

If we randomly project to m = O(log n / ϵ²) dimensions, then for all pairs x_1, x_2:

(1 − ϵ)∥x̃_1 − x̃_2∥_2² ≤ ∥x_1 − x_2∥_2² ≤ (1 + ϵ)∥x̃_1 − x̃_2∥_2²

⟹ Letting C̃ost(C_1, ..., C_k) = min_{C_1,...,C_k} Σ_{j=1}^k 1/(2|C_j|) Σ_{x̃_1, x̃_2 ∈ C_j} ∥x̃_1 − x̃_2∥_2²,

(1 − ϵ) C̃ost(C_1, ..., C_k) ≤ Cost(C_1, ..., C_k) ≤ (1 + ϵ) C̃ost(C_1, ..., C_k).

Upshot: Can cluster in the m-dimensional space (much more efficiently) and minimize C̃ost(C_1, ..., C_k). The optimal set of clusters found there will have true cost within a 1 + cϵ factor of the true optimum.
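A sketch of this recipe (assuming scikit-learn's KMeans as an off-the-shelf solver; any k-means implementation would do, and the synthetic data and constants are my own): cluster both in the original space and in the projected space, then compare the costs measured in the original space.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(4)
n, d, k, eps = 500, 2_000, 5, 0.3
m = int(8 * np.log(n) / eps**2)              # compressed dimension (constant 8 arbitrary)

# synthetic data: k well-separated Gaussian blobs in d dimensions
centers = rng.normal(scale=5, size=(k, d))
X = centers[rng.integers(k, size=n)] + rng.normal(size=(n, d))

Pi = rng.normal(scale=1/np.sqrt(m), size=(m, d))
X_tilde = X @ Pi.T                           # projected copies of the points

def cost(points, labels, k):
    """k-means cost of a clustering, measured in whatever space 'points' lives in."""
    return sum(((points[labels == j] - points[labels == j].mean(axis=0)) ** 2).sum()
               for j in range(k))

labels_full = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
labels_proj = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_tilde)

print("cost of clusters found in d dimensions:        ", cost(X, labels_full, k))
print("cost of clusters found in m dims, eval in d:   ", cost(X, labels_proj, k))
```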
The Johnson-Lindenstrauss Lemma and High Dimensional Geometry

• High-dimensional Euclidean space looks very different from low-dimensional space. So how can JL work?
• Are distances in high-dimensional space meaningless, making JL useless?
orthogonal vectors

What is the largest set of mutually orthogonal unit vectors in d-dimensional space?

Answer: d.
nearly orthogonal vectors

What is the largest set of unit vectors in d-dimensional space that have all pairwise dot products |⟨x, y⟩| ≤ ϵ? (think ϵ = .01)

1. d    2. Θ(d)    3. Θ(d²)    4. 2^{Θ(d)}

In fact, an exponentially large set of random vectors will be nearly pairwise orthogonal with high probability!

Proof: Let x_1, ..., x_t each have independent random entries set to ±1/√d.
• x_i is always a unit vector.
• E[⟨x_i, x_j⟩] = 0.
• By a Chernoff bound, Pr[|⟨x_i, x_j⟩| ≥ ϵ] ≤ 2e^{−ϵ²d/3}.
• If we choose t = (1/2) e^{ϵ²d/6}, then using a union bound over all ≤ t² = (1/4) e^{ϵ²d/3} possible pairs, with probability ≥ 1/2 all will be nearly orthogonal (failure probability ≤ t² · 2e^{−ϵ²d/3} = 1/2).
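A quick numerical illustration of this claim (sizes chosen arbitrarily for the demo): draw t random ±1/√d vectors and look at the largest pairwise dot product in magnitude.

```python
import numpy as np

rng = np.random.default_rng(5)
d, t, eps = 10_000, 1_000, 0.05
X = rng.choice([-1.0, 1.0], size=(t, d)) / np.sqrt(d)   # t random unit vectors

G = X @ X.T                       # all pairwise dot products (Gram matrix)
np.fill_diagonal(G, 0.0)          # ignore the diagonal <x_i, x_i> = 1
print("largest |<x_i, x_j>| over all pairs:", np.abs(G).max(), "(eps =", eps, ")")
```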
curse of dimensionality

Upshot: In d-dimensional space, a set of 2^{Θ(ϵ²d)} random unit vectors have all pairwise dot products at most ϵ (think ϵ = .01). So:

∥x_i − x_j∥_2² = ∥x_i∥_2² + ∥x_j∥_2² − 2⟨x_i, x_j⟩ ≥ 1 + 1 − 2(.01) = 1.98.

• Even with an exponential number of random vector samples, we don't see any nearby vectors.
• Curse of dimensionality for sampling/learning functions in high dimensional space – samples are very 'sparse' unless we have a huge amount of data.
• Can make methods like nearest neighbor classification or clustering useless.
• Only hope is if we have lots of structure (which we typically do...).