Subspace Embeddings and ℓp-Regression Using Exponential Random Variables
David P. Woodruff and Qin Zhang
IBM Research Almaden
COLT'13, June 12, 2013
Subspace embeddings

A subspace embedding is a distribution over linear maps Π : R^n → R^m such that, for any fixed d-dimensional subspace of R^n (represented by the column span of M), with probability 0.99,

    ‖Mx‖_p ≤ ‖ΠMx‖_q ≤ κ · ‖Mx‖_p    simultaneously for all vectors x ∈ R^d.

Goal: minimize
1. m: the dimension of the subspace embedding.
2. κ: the distortion of the embedding.
3. t: the time to compute ΠM.

Applications: ℓp-regression (next slide), low-rank approximation, quantile regression, ...
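For the classical p = q = 2 case the distortion guarantee is easy to check numerically: with an orthonormal basis U of the subspace, every ratio ‖ΠMx‖₂/‖Mx‖₂ is bracketed by the extreme singular values of ΠU. A minimal sketch, using a dense Gaussian Π (a standard ℓ2 subspace embedding, not the construction of this paper; all sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 2000, 5, 200

# A fixed d-dimensional subspace of R^n, represented by M.
M = rng.standard_normal((n, d))

# Dense Gaussian sketch: a classical l2 subspace embedding (p = q = 2).
Pi = rng.standard_normal((m, n)) / np.sqrt(m)

# Every ratio ||Pi M x||_2 / ||M x||_2 lies between the extreme
# singular values of Pi @ U, where U is an orthonormal basis of col(M).
U, _ = np.linalg.qr(M)
s = np.linalg.svd(Pi @ U, compute_uv=False)
print(s.min(), s.max())  # distortion bounds holding for ALL x simultaneously
```

With m around 200 and d = 5, both extremes land well inside [1/2, 3/2], matching the distortion regime the definition asks for.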
All matter: embedding time, dimension, and distortion

Using an ℓp subspace embedding (SE) to solve ℓp regression:

    min_{x ∈ R^{d−1}} ‖M̄x − b‖_p

For convenience, let M̄ ∈ R^{n×(d−1)} and M = [M̄, −b] ∈ R^{n×d}, with n ≫ d.
Let Π be an SE with dimension m, distortion κ, and embedding time t.

1. Compute ΠM. (cost t)
2. Use ΠM to compute a change-of-basis matrix R ∈ R^{d×d} such that MR has some good properties. (cost ↑ if m ↑)
3. Given R, find a sampling matrix Π₁ ∈ R^{m′×n}. (m′ ↑ if κ ↑)
4. Compute the solution x̂ of the sub-sampled problem min_x ‖Π₁M̄x − Π₁b‖_p. (cost ↑ if m′ ↑ or κ ↑)

Total running time ↑ if m ↑, κ ↑, or t ↑.
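The flavor of this pipeline can be seen in the simplest setting, p = 2, where a single sketch already suffices and the conditioning/sampling steps 2–3 can be skipped. The following is a simplified sketch-and-solve analogue with a Gaussian sketch, not the paper's ℓp algorithm; the matrix sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, m = 5000, 6, 300

# A tall regression instance: n >> d.
Mbar = rng.standard_normal((n, d))
b = Mbar @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

# Exact solution of min_x ||Mbar x - b||_2, for reference.
x_opt, *_ = np.linalg.lstsq(Mbar, b, rcond=None)
opt = np.linalg.norm(Mbar @ x_opt - b)

# Sketch-and-solve: compress to m rows, then solve the small problem.
Pi = rng.standard_normal((m, n)) / np.sqrt(m)
x_hat, *_ = np.linalg.lstsq(Pi @ Mbar, Pi @ b, rcond=None)
approx = np.linalg.norm(Mbar @ x_hat - b)

print(approx / opt)  # close to 1: a (1 + eps)-approximation
```

The residual of x̂, measured on the original tall problem, is within a small factor of the optimum, while the expensive solve only touches an m × d matrix. This is exactly the trade-off the slide tracks: larger m or κ makes the small problem bigger or the approximation looser.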
ℓ1 regression

ℓ1 regression: min_x ‖M̄x − b‖₁ (M̄ ∈ R^{n×(d−1)}).

• Can be solved by linear programming, in time superlinear in n.
• Clarkson 2005 gave an n · poly(d) solution.
• ...

Allowing a (1 + ε)-approximation:

• Sohler & Woodruff 2011 used an ℓ1 subspace embedding (SE), giving O(nd^{ω−1}) + poly(d/ε) time. (ω < 3 is the exponent of matrix multiplication)
• Clarkson et al. 2012 used a more structured ℓ1 SE, giving O(nd log n) + poly(d/ε).
• Clarkson & Woodruff / Meng & Mahoney 2012 used other ℓ1 SEs, giving O(nnz(M) log n) + poly(d/ε), where nnz(M) is the number of non-zero entries of M.

This paper: further improves the ℓ1 SE, and thus also ℓ1 regression.
Our results

ℓp subspace embeddings: improved all previous results for every p ∈ [1, ∞) \ {2}. (The case p = 2 has already been made optimal by Clarkson and Woodruff '12.)

In particular, for p = 1:

               Time                     Distortion               Dimension
  SW           nd^{ω−1}                 Õ(d)                     Õ(d)
  C+           nd log d                 Õ(d^{2+γ})               Õ(d^5)
  MM           nnz(M)                   Õ(d^3)                   Õ(d^5)
  This paper   nnz(M) + Õ(d^{2+γ})     Õ(d)                     Õ(d^2)
  This paper   nnz(M) + Õ(d^{2+γ})     O(d^{3/2} log^{1/2} n)   Õ(d)

SW: Sohler & Woodruff '11; C+: Clarkson et al. '12; MM: Meng & Mahoney '12; ω < 3 is the exponent of matrix multiplication; γ = 0.0000001.

ℓp regression: improved all previous results for every p ∈ [1, ∞) \ {2}. The algorithms have efficient distributed implementations.
Our subspace embedding matrices

(m, s)-ℓ2-SE (oblivious subspace embedding for the ℓ2 norm): a distribution over linear maps S : R^n → R^m such that, for any fixed d-dimensional subspace of R^n, with probability 0.99,

    1/2 · ‖Mx‖₂ ≤ ‖SMx‖₂ ≤ 3/2 · ‖Mx‖₂    for all x ∈ R^d,

where s = O(1) is the maximum number of non-zero entries in any column of S.

Our ℓp subspace embedding matrix: Π = S · D ∈ R^{m×n}, where S ∈ R^{m×n} is an ℓ2-SE and D ∈ R^{n×n} is diagonal with entries 1/u_i^{1/p}, the u_i being i.i.d. exponentials.

Use different ℓ2-SEs (from CW12, MM12, Nelson & Nguyen 12) for 1 ≤ p < 2 and p > 2. ΠM can be computed in O(nnz(M)) time.
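A sketch of the construction Π = S · D, assuming a CountSketch-style ℓ2-SE for S (one ±1 entry per column, so s = 1). Applying ΠM coordinate-by-coordinate through the hash takes O(nnz(M)) time; this only illustrates the construction and its running time, not the distortion guarantee:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, m, p = 10000, 4, 50, 1.0

M = rng.standard_normal((n, d))

# D: diagonal with entries 1 / u_i^{1/p}, u_i i.i.d. standard exponentials.
u = rng.exponential(size=n)
DM = M / (u ** (1.0 / p))[:, None]

# S: CountSketch-style l2-SE -- one nonzero (+-1) per column, so s = 1.
row = rng.integers(0, m, size=n)      # hash: target row of each coordinate
sign = rng.choice([-1.0, 1.0], size=n)

# Apply Pi M = S D M in O(nnz(M)) time without materializing S.
PiM = np.zeros((m, d))
np.add.at(PiM, row, sign[:, None] * DM)

# Sanity check against the dense multiply S @ (D @ M).
S = np.zeros((m, n))
S[row, np.arange(n)] = sign
assert np.allclose(PiM, S @ DM)
print(PiM.shape)
```

Each row of M contributes to exactly one row of ΠM, which is why the embedding time is dominated by a single pass over the non-zeros of M.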
Two distributions

Exponential distribution: PDF f(x) = e^{−x}, CDF F(x) = 1 − e^{−x}. (Recently used by Andoni (2012) for approximating frequency moments.)

(Max-stability) If u₁, ..., u_n are i.i.d. exponentials and α = (α₁, ..., α_n) ∈ R₊^n, then

    max{α₁/u₁, ..., α_n/u_n} ≃ ‖α‖₁ / u,    where u is exponential.

p-stable distribution: the previous pet for subspace embeddings. D_p is p-stable if, for any vector α = (α₁, ..., α_n) ∈ R^n and v₁, ..., v_n i.i.d. ∼ D_p, we have

    Σ_{i ∈ [n]} α_i v_i ≃ ‖α‖_p · v,    where v ∼ D_p.

E.g., for p = 2 it is the Gaussian distribution; for p = 1 it is the Cauchy distribution.

Similar embedding matrix: Π = S · D′ ∈ R^{m×n}, where S ∈ R^{m×n} is an ℓ2-SE and D′ ∈ R^{n×n} is diagonal with i.i.d. p-stable entries v_i.
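Both stability properties are easy to verify by Monte Carlo: the median of max_i α_i/u_i should match the median of ‖α‖₁/u, which is ‖α‖₁/ln 2, and in the 1-stable (Cauchy) case the median of |Σ_i α_i v_i| should be about ‖α‖₁, since the median of |Cauchy| is 1. A quick check (the weights α are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
alpha = np.array([0.5, 2.0, 1.0, 3.5])   # arbitrary positive weights
l1 = alpha.sum()                          # ||alpha||_1
T = 200_000

# Max-stability: max_i alpha_i/u_i =_d ||alpha||_1 / u, u exponential.
u = rng.exponential(size=(T, alpha.size))
max_samples = (alpha / u).max(axis=1)
# Median of ||alpha||_1 / u is ||alpha||_1 / ln 2.
print(np.median(max_samples), l1 / np.log(2))

# 1-stability: sum_i alpha_i v_i =_d ||alpha||_1 * v, v Cauchy.
v = rng.standard_cauchy(size=(T, alpha.size))
sums = np.abs(v @ alpha)
# Median of |Cauchy| is 1, so the median of |sum| should be ~ ||alpha||_1.
print(np.median(sums), l1)
```

Medians (rather than means) are used deliberately: 1/u and Cauchy sums both have heavy tails, so their means do not exist or converge badly, while medians concentrate well.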
The exponential distribution is superior to p-stables

Why is the exponential distribution better?

1. p-stable distributions only exist for p ∈ (0, 2], whereas exponentials can be used to build ℓp-SEs for all p ≥ 1.
2. The lower tail of the reciprocal of an exponential decreases faster than that of a p-stable, while its upper tail is similar to that of a p-stable.