Randomized methods for machine learning. David Lopez-Paz, FAIR. May 17, 2016. http://tinyurl.com/randomized-practical
Some random examples: Building atomic bombs · Truncated SVD · Dimensionality reduction · Kernel methods for big data · Nonlinear component analysis · Dependence measurement · Low-dimensional kernel mean embeddings
It all starts with a big bang...
... By some smart people.
The problem: $\int f(x)\,dx$
The problem, simplified: $\int p(x) f(x)\,dx$
The solution: $\int p(x) f(x)\,dx \approx \frac{1}{m} \sum_{i=1}^{m} f(x_i), \quad x_i \sim p$
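A minimal numpy sketch of this Monte Carlo estimator (the density p and integrand f below are illustrative choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# Estimate E_{x ~ p}[f(x)] = ∫ p(x) f(x) dx with m i.i.d. samples from p.
# Illustrative choice: p = standard normal, f(x) = x^2, so the true value is 1.
m = 100_000
x = rng.standard_normal(m)      # x_i ~ p
estimate = np.mean(x ** 2)      # (1/m) * sum_i f(x_i)
print(estimate)                 # close to 1, with O(m^{-1/2}) error
```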
From [Eck87].
Exercise
Example: computing π. $O(m^{-1/2})$ convergence regardless of the dimensionality of x! Why?
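A possible sketch of the π example: the area of the quarter unit disc is π/4, so a uniform sample over the unit square estimates π.

```python
import numpy as np

rng = np.random.default_rng(0)

# π/4 is the probability that a uniform point in [0, 1]^2 lands inside the
# quarter unit disc, so π ≈ 4 * (fraction of samples inside).
m = 1_000_000
xy = rng.uniform(size=(m, 2))
inside = (xy ** 2).sum(axis=1) <= 1.0
pi_hat = 4.0 * inside.mean()
print(pi_hat)   # error shrinks like O(m^{-1/2}), independently of the dimension
```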
Monte Carlo model selection. Still cross-validating over grids? From [BB12], a.k.a. "the rule of 59" [NLB16]: $P(F_\mu(\min(x_1, \dots, x_T)) \le \alpha) = 1 - (1 - \alpha)^T$. With $\alpha = 0.05$ and $T = 59$ random trials, the best trial lands in the best 5% of configurations with probability about 0.95.
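A quick check of the formula, plus a hedged sketch of what random hyperparameter search looks like in practice (the search ranges below are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

# Probability that the best of T random trials lands within the best
# alpha-fraction of configurations: 1 - (1 - alpha)^T.
alpha, T = 0.05, 59
print(1 - (1 - alpha) ** T)   # ≈ 0.95, hence "the rule of 59"

# Hypothetical random search: draw T configurations instead of walking a grid.
learning_rate = 10 ** rng.uniform(-5, -1, size=T)
weight_decay = 10 ** rng.uniform(-6, 0, size=T)
# ... train one model per (learning_rate[i], weight_decay[i]) and keep the best.
```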
Some random examples: Building atomic bombs · Truncated SVD · Dimensionality reduction · Kernel methods for big data · Nonlinear component analysis · Dependence measurement · Low-dimensional kernel mean embeddings
Truncated SVD. From research.facebook.com/blog/fast-randomized-svd/. Complexity of $O(mn^2)$! [GVL12]
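For reference, a minimal numpy version of the exact truncated SVD this slide refers to (full SVD first, then keep the leading r components):

```python
import numpy as np

def truncated_svd(A, r):
    """Exact rank-r SVD of A via a full SVD: O(m n^2) for A of shape (m, n), m >= n."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :r], s[:r], Vt[:r, :]
```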
Randomized SVD [HMT11]. Computation of the rank-r SVD of $A \in \mathbb{R}^{m \times n}$:
1. Compute a column-orthonormal $Q \in \mathbb{R}^{m \times (r+p)}$ s.t. $A \approx QQ^\top A$.
2. Construct $B = Q^\top A$, now $B \in \mathbb{R}^{(r+p) \times n}$.
3. Compute the SVD of $B = S \Sigma V^\top$ (cost $O((r+p) n^2)$).
4. Note $A \approx QQ^\top A = QB = Q(S \Sigma V^\top)$.
5. Taking $U = QS$, return the SVD $A \approx U \Sigma V^\top$.
Hey, but how do I compute Q? At random! :)
1. Take $Y = A\Omega$, where $\Omega_{ij} \sim \mathcal{N}(0, 1)$.
2. $\Omega \in \mathbb{R}^{n \times (r+p)}$, but allows efficient multiplication.
3. Compute the QR factorization $Y = QR$ (cost $O(m(r+p)^2)$).
4. Return $Q \in \mathbb{R}^{m \times (r+p)}$.
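A sketch of the two steps above in numpy (the oversampling p and the test sizes are illustrative defaults, not prescribed by the slides):

```python
import numpy as np

def randomized_svd(A, r, p=10, seed=0):
    """Approximate rank-r SVD of A ∈ R^{m×n} via a Gaussian sketch with oversampling p."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    Omega = rng.standard_normal((n, r + p))   # random test matrix Ω
    Y = A @ Omega                             # Y = AΩ samples the range of A
    Q, _ = np.linalg.qr(Y)                    # column-orthonormal Q, O(m(r+p)^2)
    B = Q.T @ A                               # small (r+p) × n matrix
    S, sigma, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ S                                 # lift back: A ≈ QB = UΣV^T
    return U[:, :r], sigma[:r], Vt[:r, :]

# Sanity check on a synthetic low-rank matrix.
rng = np.random.default_rng(0)
A = rng.standard_normal((2000, 50)) @ rng.standard_normal((50, 500))
U, s, Vt = randomized_svd(A, r=50)
print(np.linalg.norm(A - (U * s) @ Vt) / np.linalg.norm(A))   # ≈ 0
```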
Exercise
Some random examples: Building atomic bombs · Truncated SVD · Dimensionality reduction · Kernel methods for big data · Nonlinear component analysis · Dependence measurement · Low-dimensional kernel mean embeddings
Dimensionality reduction. Random projections offer fast and efficient dimensionality reduction.
[Figure: a random projection $w \in \mathbb{R}^{40500 \times 100}$ maps points $x_1, x_2 \in \mathbb{R}^{40500}$ at distance $\delta$ to points $y_1, y_2 \in \mathbb{R}^{100}$ at distance $\delta(1 \pm \epsilon)$.]
$(1 - \epsilon)\|x_1 - x_2\|^2 \le \|y_1 - y_2\|^2 \le (1 + \epsilon)\|x_1 - x_2\|^2$
This result is formalized in the Johnson-Lindenstrauss lemma.
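A small numerical illustration of the picture above (dimensions taken from the slide, the data is synthetic):

```python
import numpy as np

rng = np.random.default_rng(0)

# Project N = 40500 dimensions down to k = 100 with f(x) = (1/sqrt(k)) A x;
# squared pairwise distances are approximately preserved.
N, k = 40_500, 100
A = rng.standard_normal((k, N))
x1, x2 = rng.standard_normal(N), rng.standard_normal(N)
y1, y2 = A @ x1 / np.sqrt(k), A @ x2 / np.sqrt(k)
print(np.linalg.norm(x1 - x2) ** 2, np.linalg.norm(y1 - y2) ** 2)   # close to each other
```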
The Johnson-Lindenstrauss Lemma. The proof is one example of Erdős' probabilistic method (1947). Paul Erdős (1913-1996), Joram Lindenstrauss (1936-2012), William Johnson (1944-). See §12.5 of Foundations of Machine Learning (Mohri et al., 2012).
Auxiliary Lemma 1. Let $Q$ be a random variable following a $\chi^2$ distribution with $k$ degrees of freedom. Then, for any $0 < \epsilon < 1/2$:
$\Pr[(1 - \epsilon)k \le Q \le (1 + \epsilon)k] \ge 1 - 2e^{-(\epsilon^2 - \epsilon^3)k/4}$.
Proof: start with Markov's inequality ($\Pr[X \ge a] \le \mathbb{E}[X]/a$):
$\Pr[Q \ge (1 + \epsilon)k] = \Pr[e^{\lambda Q} \ge e^{\lambda(1 + \epsilon)k}] \le \frac{\mathbb{E}[e^{\lambda Q}]}{e^{\lambda(1 + \epsilon)k}} = \frac{(1 - 2\lambda)^{-k/2}}{e^{\lambda(1 + \epsilon)k}}$,
where $\mathbb{E}[e^{\lambda Q}] = (1 - 2\lambda)^{-k/2}$ is the mgf of a $\chi^2$ distribution, $\lambda < \frac{1}{2}$.
To tighten the bound we minimize the rhs with $\lambda = \frac{\epsilon}{2(1 + \epsilon)}$:
$\Pr[Q \ge (1 + \epsilon)k] \le \frac{(1 - \frac{\epsilon}{1 + \epsilon})^{-k/2}}{e^{\epsilon k/2}} = \frac{(1 + \epsilon)^{k/2}}{(e^\epsilon)^{k/2}} = \left(\frac{1 + \epsilon}{e^\epsilon}\right)^{k/2}$.
Auxiliary Lemma 1. Using $1 + \epsilon \le e^{\epsilon - (\epsilon^2 - \epsilon^3)/2}$ yields
$\Pr[Q \ge (1 + \epsilon)k] \le \left(\frac{1 + \epsilon}{e^\epsilon}\right)^{k/2} \le \left(\frac{e^{\epsilon - \frac{\epsilon^2 - \epsilon^3}{2}}}{e^\epsilon}\right)^{k/2} = e^{-\frac{k}{4}(\epsilon^2 - \epsilon^3)}$.
$\Pr[Q \le (1 - \epsilon)k]$ is bounded similarly, and the lemma follows by the union bound.
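A quick empirical check of the bound by simulation (the parameters are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Check Pr[(1 - eps) k <= Q <= (1 + eps) k] >= 1 - 2 exp(-(eps^2 - eps^3) k / 4)
# by simulating Q ~ chi^2_k.
k, eps, trials = 100, 0.25, 100_000
Q = rng.chisquare(k, size=trials)
empirical = np.mean(((1 - eps) * k <= Q) & (Q <= (1 + eps) * k))
bound = 1 - 2 * np.exp(-(eps ** 2 - eps ** 3) * k / 4)
print(empirical, bound)   # the empirical probability should sit above the bound
```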
Auxiliary Lemma 2. Let $x \in \mathbb{R}^N$, $k < N$, and $A \in \mathbb{R}^{k \times N}$ with $A_{ij} \sim \mathcal{N}(0, 1)$. Then, for any $0 \le \epsilon \le 1/2$:
$\Pr[(1 - \epsilon)\|x\|^2 \le \|\frac{1}{\sqrt{k}} Ax\|^2 \le (1 + \epsilon)\|x\|^2] \ge 1 - 2e^{-(\epsilon^2 - \epsilon^3)k/4}$.
Proof: let $\hat{x} = Ax$. Then, $\mathbb{E}[\hat{x}_j] = 0$, and
$\mathbb{E}[\hat{x}_j^2] = \mathbb{E}\left[\left(\sum_{i=1}^N A_{ji} x_i\right)^2\right] = \mathbb{E}\left[\sum_{i=1}^N A_{ji}^2 x_i^2\right] = \sum_{i=1}^N x_i^2 = \|x\|^2$.
Note that $T_j = \hat{x}_j / \|x\| \sim \mathcal{N}(0, 1)$. Then, $Q = \sum_{j=1}^k T_j^2 \sim \chi^2_k$. Remember the previous lemma?
Auxiliary Lemma 2. Remember: $\hat{x} = Ax$, $T_j = \hat{x}_j / \|x\| \sim \mathcal{N}(0, 1)$, $Q = \sum_{j=1}^k T_j^2 \sim \chi^2_k$:
$\Pr[(1 - \epsilon)\|x\|^2 \le \|\frac{1}{\sqrt{k}} Ax\|^2 \le (1 + \epsilon)\|x\|^2]$
$= \Pr[(1 - \epsilon)\|x\|^2 \le \frac{\|\hat{x}\|^2}{k} \le (1 + \epsilon)\|x\|^2]$
$= \Pr[(1 - \epsilon)k \le \frac{\|\hat{x}\|^2}{\|x\|^2} \le (1 + \epsilon)k]$
$= \Pr[(1 - \epsilon)k \le \sum_{j=1}^k T_j^2 \le (1 + \epsilon)k]$
$= \Pr[(1 - \epsilon)k \le Q \le (1 + \epsilon)k] \ge 1 - 2e^{-(\epsilon^2 - \epsilon^3)k/4}$.
The Johnson-Lindenstrauss Lemma. For any $0 < \epsilon < 1/2$ and any integer $m > 4$, let $k = \frac{20 \log m}{\epsilon^2}$. Then, for any set $V$ of $m$ points in $\mathbb{R}^N$ there exists $f : \mathbb{R}^N \to \mathbb{R}^k$ such that for all $u, v \in V$:
$(1 - \epsilon)\|u - v\|^2 \le \|f(u) - f(v)\|^2 \le (1 + \epsilon)\|u - v\|^2$.
Proof: let $f = \frac{1}{\sqrt{k}} A$, with $A \in \mathbb{R}^{k \times N}$, $k < N$, and $A_{ij} \sim \mathcal{N}(0, 1)$.
◮ Apply the previous lemma with $x = u - v$ to lower bound the success probability by $1 - 2e^{-(\epsilon^2 - \epsilon^3)k/4}$.
◮ Union bound over the $m^2$ pairs in $V$ with $k = \frac{20 \log m}{\epsilon^2}$ and $\epsilon < 1/2$ to obtain:
$\Pr[\text{success}] \ge 1 - 2m^2 e^{-(\epsilon^2 - \epsilon^3)k/4} = 1 - 2m^{5\epsilon - 3} > 1 - 2m^{-1/2} > 0$.
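A numerical sanity check of the lemma with k chosen by the formula above (the point set is synthetic and the sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

m, N, eps = 100, 5_000, 0.25
k = int(np.ceil(20 * np.log(m) / eps ** 2))   # ≈ 1474 target dimensions
V = rng.standard_normal((m, N))               # m points in R^N
A = rng.standard_normal((k, N))
W = V @ A.T / np.sqrt(k)                      # f(v) = (1/sqrt(k)) A v for every point

ratios = []
for i in range(m):
    for j in range(i + 1, m):
        ratios.append(np.sum((W[i] - W[j]) ** 2) / np.sum((V[i] - V[j]) ** 2))
print(min(ratios), max(ratios))   # should lie inside [1 - eps, 1 + eps]
```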
Exercise
Some random examples: Building atomic bombs · Truncated SVD · Dimensionality reduction · Kernel methods for big data · Nonlinear component analysis · Dependence measurement · Low-dimensional kernel mean embeddings
The kernel trick? $k(x, x') = \langle \phi(x), \phi(x') \rangle_{\mathcal{H}}$, and $f(x) \approx \sum_{i=1}^n \alpha_i k(x, x_i)$.
The kernel trap! To compute $\{\alpha_i\}_{i=1}^n$, construct the $n \times n$ monster $K$, with $K_{ij} = k(x_i, x_j)$.
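A sketch of the trap, using an RBF kernel and kernel ridge regression as an illustrative setting (the slides do not fix a specific kernel or learning problem):

```python
import numpy as np

rng = np.random.default_rng(0)

def rbf_kernel(X, Z, gamma=1.0):
    # k(x, x') = exp(-gamma * ||x - x'||^2)
    d2 = (X ** 2).sum(1)[:, None] + (Z ** 2).sum(1)[None, :] - 2 * X @ Z.T
    return np.exp(-gamma * d2)

n, d, lam = 2_000, 10, 1e-3
X = rng.standard_normal((n, d))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(n)

K = rbf_kernel(X, X)                              # the n × n monster: O(n^2) memory
alpha = np.linalg.solve(K + lam * np.eye(n), y)   # e.g. kernel ridge: O(n^3) to fit
f_new = rbf_kernel(X[:5], X) @ alpha              # f(x) = sum_i alpha_i k(x, x_i)
```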
Mercer's theorem. Theorem (Mercer's condition [Mer09]): under mild technical assumptions, $k$ admits a representation
$k(x, x') = \sum_{j=1}^\infty \lambda_j \phi_{\lambda_j}(x) \phi_{\lambda_j}(x')$.
If $\|\lambda\|_1 := \sum_j |\lambda_j| < \infty$, we can cast the previous as
$k(x, x') = \|\lambda\|_1 \, \mathbb{E}_{\lambda \sim p(\lambda)}\left[\phi_\lambda(x) \phi_\lambda(x')\right]$.
Any ideas? :)
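One natural idea, hedged as my reading of where this leads rather than content shown on the slide: approximate the expectation above with the Monte Carlo estimator from the start of the talk, i.e. sample a few features instead of building the full kernel matrix. The sketch below uses random Fourier features for the RBF kernel (Rahimi and Recht) as one concrete instance:

```python
import numpy as np

def random_fourier_features(X, D=2_000, gamma=1.0, seed=0):
    """Monte Carlo feature map z(x) with z(x)^T z(x') ≈ exp(-gamma ||x - x'||^2)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.standard_normal((d, D)) * np.sqrt(2 * gamma)   # frequencies w ~ N(0, 2*gamma*I)
    b = rng.uniform(0, 2 * np.pi, size=D)                  # random phases
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

rng = np.random.default_rng(0)
X = rng.standard_normal((1_000, 10))
Z = random_fourier_features(X)
K_approx = Z @ Z.T   # ≈ the exact n × n kernel matrix, from an n × D feature matrix
```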