Random Projections, Margins, Kernels and Feature Selection
Adithya Pediredla
Rice University, Electrical and Computer Engineering
SVM: Revision

$f(x_i) = w^T x_i + b$

Primal: $\min_{w \in \mathbb{R}^d} \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \max(0,\, 1 - y_i f(x_i))$;  cost $O(nd^2 + d^3)$

Dual: $\max_{\alpha_i \ge 0} \sum_i \alpha_i - \frac{1}{2} \sum_{j,k} \alpha_j \alpha_k \, y_j y_k \, (x_j^T x_k)$;  cost $O(dn^2 + n^3)$
s.t. $0 \le \alpha_i \le C,\ \forall i$; $\sum_i \alpha_i y_i = 0$

Only inner products matter.
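Since the dual touches the data only through $x_j^T x_k$, the whole problem can be handed to a solver as an $n \times n$ Gram matrix. A minimal sketch of that idea (the toy data and the use of scikit-learn's precomputed-kernel SVC are my choices, not from the slides):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Toy two-class data: n points in d dimensions, labelled by a random hyperplane.
n, d = 200, 50
X = rng.normal(size=(n, d))
y = np.sign(X @ rng.normal(size=d))

# The dual only needs the inner products x_j^T x_k, i.e. the n x n Gram matrix.
G = X @ X.T
clf = SVC(kernel="precomputed", C=1.0).fit(G, y)

# Prediction also needs only inner products of test points with training points.
X_test = rng.normal(size=(10, d))
print(clf.predict(X_test @ X.T))
```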
Decreasing computations

Only inner products matter.

Can we approximate $x_i$ with $z_i$ so that $\dim(z_i) \ll \dim(x_i)$ and $x_i^T x_j \approx z_i^T z_j$?

One way: $z_i = A x_i$. Any comment on the rows vs. columns of $A$?

It turns out that a random $A$ is good!
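A quick numerical sanity check of the $z_i = A x_i$ idea; the dimensions and the $1/\sqrt{d_{\text{new}}}$ scaling of the Gaussian matrix are my choices (the scaling makes the projected inner products unbiased):

```python
import numpy as np

rng = np.random.default_rng(0)
D, d_new, n = 10_000, 500, 100          # original dim, reduced dim, number of points

X = rng.normal(size=(n, D))
A = rng.normal(size=(d_new, D)) / np.sqrt(d_new)   # random projection z_i = A x_i
Z = X @ A.T

# Compare inner products before vs. after projection.
orig = X @ X.T
proj = Z @ Z.T
print("median relative error of z_i.z_j vs x_i.x_j:",
      np.median(np.abs(proj - orig) / np.abs(orig)))
```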
Johnson-Lindenstrauss Lemma

If $d_{\text{new}} = \Omega\!\left(\frac{1}{\gamma^2} \log n\right)$, relative angles are preserved up to $1 \pm \gamma$.

How big can $\gamma$ be?
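To get a feel for the $\frac{1}{\gamma^2}\log n$ dependence, scikit-learn ships a helper that evaluates the JL bound; the particular values of $n$ and $\gamma$ below are just examples:

```python
from sklearn.random_projection import johnson_lindenstrauss_min_dim

# Target dimension guaranteed by the JL bound: it grows like log(n) in the
# number of points but like 1/gamma^2 in the allowed distortion.
for n in (1_000, 100_000, 10_000_000):
    for gamma in (0.5, 0.1):
        d = johnson_lindenstrauss_min_dim(n_samples=n, eps=gamma)
        print(f"n={n:>10}, gamma={gamma}: d_new >= {d}")
```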
Which data set can have a higher γ?

[Figure: scatter plots of example two-dimensional data sets (axes from -20 to 20) with different separations between the point clouds.]
How else can a big margin help

A simple weak learner whose speed is proportional to the margin:

Step 1: Pick a random h.
Step 2: Evaluate the error of the h from step 1. If error $< \frac{1}{2} - \frac{\gamma}{4}$, stop; else go to step 1.

The bigger the margin, the fewer the iterations (see the sketch below).
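A minimal sketch of this weak learner (the function name, the trick of also trying $-h$ when its error exceeds $1/2$, the iteration cap, and the toy data are my additions):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_halfspace_weak_learner(X, y, gamma, max_iters=10_000):
    """Keep drawing random unit vectors h until one has training error below
    1/2 - gamma/4 (the stopping rule from the slide).
    X: (n, d) data, y: labels in {-1, +1}, gamma: assumed margin."""
    n, d = X.shape
    for t in range(1, max_iters + 1):
        h = rng.normal(size=d)
        h /= np.linalg.norm(h)
        err = np.mean(np.sign(X @ h) != y)
        if err > 0.5:                    # -h does better; use it instead
            h, err = -h, 1.0 - err
        if err < 0.5 - gamma / 4:
            return h, err, t             # also report how many draws were needed
    raise RuntimeError("no weak hypothesis found within max_iters draws")

# Toy data with margin roughly 0.2 around a known separator w.
X = rng.normal(size=(500, 20))
X /= np.linalg.norm(X, axis=1, keepdims=True)
w = rng.normal(size=20)
w /= np.linalg.norm(w)
keep = np.abs(X @ w) > 0.2
X, y = X[keep], np.sign(X[keep] @ w)
print(random_halfspace_weak_learner(X, y, gamma=0.2)[1:])
```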
Dimensionality reduction: random projection

Coming back to random projection: how do we pick $A_{d \times D}$?

1. Choose the rows of $A$ to be $d$ random orthogonal unit-length vectors.
2. Choose each entry of $A$ independently from a standard Gaussian.
3. Choose each entry of $A$ to be $+1$ or $-1$ independently at random.

For (2) and (3), with $u' = Au$ and $v' = Av$:
$\Pr_A\left[(1-\gamma)\|u-v\|^2 \le \|u'-v'\|^2 \le (1+\gamma)\|u-v\|^2\right] \ge 1 - 2e^{-(\gamma^2-\gamma^3)d/4}$

Can we do better?
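A small sketch comparing constructions (2) and (3); the $1/\sqrt{d}$ scaling is my addition (the slide leaves it implicit) and makes the squared distances unbiased:

```python
import numpy as np

rng = np.random.default_rng(0)
D, d = 5_000, 300                       # original and projected dimensions
u = rng.normal(size=D)
v = rng.normal(size=D)

# (2) Gaussian entries and (3) random +/-1 entries, both scaled by 1/sqrt(d)
# so that E||Au - Av||^2 = ||u - v||^2.
A_gauss = rng.normal(size=(d, D)) / np.sqrt(d)
A_sign = rng.choice([-1.0, 1.0], size=(d, D)) / np.sqrt(d)

true_sq = np.sum((u - v) ** 2)
for name, A in [("gaussian", A_gauss), ("+/-1", A_sign)]:
    proj_sq = np.sum((A @ u - A @ v) ** 2)
    print(f"{name:9s} ratio ||u'-v'||^2 / ||u-v||^2 = {proj_sq / true_sq:.3f}")
```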
Can we do better

If we only need $\Pr(\text{error} > \epsilon) < \delta$, then $d = O\!\left(\frac{1}{\gamma^2}\log\frac{1}{\epsilon\delta}\right)$ is sufficient.
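To see how mild the dependence on $\epsilon$ and $\delta$ is, a quick evaluation of the bound (the $O(\cdot)$ hides a constant, and the specific values below are arbitrary examples):

```python
import numpy as np

# Plugging numbers into d = O((1/gamma^2) log(1/(eps*delta))):
# the margin gamma dominates; eps and delta only enter logarithmically.
for gamma, eps, delta in [(0.5, 0.1, 0.1), (0.1, 0.1, 0.1), (0.1, 0.01, 0.01)]:
    d = (1 / gamma ** 2) * np.log(1 / (eps * delta))
    print(f"gamma={gamma}, eps={eps}, delta={delta}:  d ~ {d:,.0f} x constant")
```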
Kernel functions

What if we know that $K(x_1, x_2) = \phi(x_1) \cdot \phi(x_2)$? What if we do not?

Finding inner products approximately is enough.

We need to know the distribution of the data set (in practice, to have samples from it).
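A quick sanity check of the "we know $\phi$" case, using the quadratic kernel in two dimensions; this specific kernel and feature map are a standard textbook example, not something stated on the slide:

```python
import numpy as np

# For K(a, b) = (a . b)^2 in 2-D, an explicit feature map is
# phi(x) = (x1^2, x2^2, sqrt(2) x1 x2), so K(a, b) = phi(a) . phi(b).
def phi(x):
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

a = np.array([1.0, 2.0])
b = np.array([3.0, -1.0])

print((a @ b) ** 2)          # kernel value computed directly: (1*3 + 2*(-1))^2 = 1
print(phi(a) @ phi(b))       # same value via the explicit feature map
```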
Mapping-1

Lemma: Consider any distribution over labelled data $(x, \ell(x))$. Assume $\exists\, w$ such that $\Pr[\ell(x)\,(w \cdot x) < \gamma] = 0$.

If we draw $z_1, z_2, \ldots, z_d$ i.i.d. from the distribution with $d \ge \frac{8}{\gamma^2}\left[\frac{1}{\epsilon} + \ln\frac{1}{\delta}\right]$, then with probability $\ge 1 - \delta$ there exists $w' \in \mathrm{span}(z_1, z_2, \ldots, z_d)$ such that $\Pr[\ell(x)\,(w' \cdot x) < \gamma/2] < \epsilon$.

Therefore, if such a $w$ exists in $\phi$-space, then by sampling $x_1, x_2, \ldots, x_d$ we are guaranteed a
$w' = \alpha_1 \phi(x_1) + \alpha_2 \phi(x_2) + \cdots + \alpha_d \phi(x_d)$.
Hence $w' \cdot \phi(x) = \alpha_1 K(x, x_1) + \alpha_2 K(x, x_2) + \cdots + \alpha_d K(x, x_d)$.

If we define $F_1(x) = (K(x, x_1), \ldots, K(x, x_d))$, then with high probability the vector $(\alpha_1, \ldots, \alpha_d)$ is an approximate linear separator in this new feature space.
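A minimal sketch of Mapping-1 on made-up data: sample $d$ anchor points, use kernel values against them as features $F_1(x)$, and fit a linear separator there. The RBF kernel, the toy ring data, and the helper names are my choices; the construction itself works for any valid $K$:

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

def rbf_kernel(a, B, sigma=1.0):
    """K(a, b) = exp(-||a - b||^2 / (2 sigma^2)) for a single point a against rows of B."""
    d2 = np.sum((B - a) ** 2, axis=1)
    return np.exp(-d2 / (2 * sigma ** 2))

# Toy data that is not linearly separable in the original space:
# +1 inside a disk of radius 1.2, -1 outside.
n = 500
X = rng.uniform(-2, 2, size=(n, 2))
y = np.where(np.linalg.norm(X, axis=1) < 1.2, 1, -1)

# Mapping-1: pick d random sample points and use kernel values against them
# as the new features F_1(x) = (K(x, x_1), ..., K(x, x_d)).
d = 50
anchors = X[rng.choice(n, size=d, replace=False)]
F1 = np.array([rbf_kernel(x, anchors) for x in X])

clf = LinearSVC(C=1.0, max_iter=10_000).fit(F1, y)   # a linear separator in F_1-space
print("training accuracy in F_1-space:", clf.score(F1, y))
```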
Mapping-2

We can normalize $K(x, x_i)$ and get better bounds:
Compute a factorization $K = U^T U$ of the $d \times d$ kernel matrix on the sampled points;
Compute $F_2(x) = F_1(x)\, U^{-1}$.

Under $F_2$, the data is linearly separable with error at most $\epsilon$ at margin $\gamma/2$.
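Continuing the Mapping-1 sketch above (so `rbf_kernel`, `anchors`, `F1`, and `y` come from there); the small ridge added before the factorization is my addition for numerical stability:

```python
import numpy as np
from sklearn.svm import LinearSVC

# d x d kernel matrix on the sampled anchor points: K_ij = K(x_i, x_j).
K_dd = np.array([rbf_kernel(a, anchors) for a in anchors])

# Factor K = U^T U: Cholesky gives K = L L^T, so take U = L^T. Then whiten F_1.
L = np.linalg.cholesky(K_dd + 1e-8 * np.eye(len(anchors)))
U = L.T
F2 = F1 @ np.linalg.inv(U)               # F_2(x) = F_1(x) U^{-1}

# The data should still be (approximately) linearly separable under F_2.
print("training accuracy in F_2-space:",
      LinearSVC(max_iter=10_000).fit(F2, y).score(F2, y))
```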
Key takeaways

- Inner products are enough.
- Random projections are good.
- The higher the margin, the lower the dimension needed.
- If we are okay with some error, we can project to a much lower dimension.
- When using kernels, randomly drawn data points act as good features.