Efficient Algorithms for Learning Sparse Models from High Dimensional Data
Yoram Singer, Google Research
Machine Learning Summer School, UC Santa Cruz, July 18, 2012
Sparsity?
[Example news snippet with topic-relevant words highlighted: "THE HIGHER MINIMUM WAGE THAT WAS SIGNED INTO LAW ... WILL BE WELCOME RELIEF OF WORKERS ... THE 90 CENT-AN-HOUR INCREASE ..." — topics: REGULATIONS, LABOUR, ECONOMICS]
• In many applications there are numerous input features (e.g. all words in a dictionary, all possible html tokens)
• However, only a fraction of the features are highly relevant to the task at hand
• Keeping all the features might also be computationally infeasible
Sparsity?
• A large collection of investment tools (stocks, bonds, ETFs, cash, options, ...)
• Cannot afford and/or maintain investments in all possible financial instruments across the globe
• Need to select a relatively small number of financial instruments to achieve a certain goal (e.g. a volatility-return profile)
High Dimensional Sparse Data
• Web search and advertisement placement employ a large number of boolean predicates
• Most predicates evaluate to false most of the time
• Example: the user types [flowers] and sees the ad "Fernando's Flower Shop"
• Instantiated features (which predicates are important?):
  query: "flowers"
  query: "flowers" && creative_keyword: "flower"
  lang: "en-US"
• Resulting instance: x ∈ {0,1}^n, e.g. x = (0, 0, 1, 0, 1, 0, ..., 0, 0, 1, 0)
Methods to Achieve Compact Models
• Forward greedy feature induction (bottom-up)
• Backward feature pruning (top-down)
• Combination (FoBa) - alternate between: (i) feature induction, (ii) model fitting, (iii) feature pruning
• This tutorial: efficient algorithms for learning "compact" linear models from large amounts of high dimensional data
Linear Models
• Input: $x = (x_1, \dots, x_n)$
• Weights: $w = (w_1, \dots, w_n)$
• Prediction: $\hat{y} = w \cdot x = \sum_{j=1}^{n} w_j x_j$
• True target $y$, loss function $\ell(y, \hat{y})$
• Examples of losses: squared error $\ell(y, \hat{y}) = (y - \hat{y})^2$; exponential loss $\ell(y, \hat{y}) = e^{-y \hat{y}}$
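As a concrete illustration (a minimal sketch, not code from the slides), the linear prediction and the two losses above in NumPy:

```python
import numpy as np

def predict(w, x):
    """Linear prediction: y_hat = w . x"""
    return np.dot(w, x)

def squared_loss(y, y_hat):
    return (y - y_hat) ** 2

def exponential_loss(y, y_hat):
    # assumes y in {-1, +1}, as is standard for the exponential loss
    return np.exp(-y * y_hat)

# toy usage
w = np.array([0.5, -1.0, 0.0, 2.0])
x = np.array([1.0, 0.0, 1.0, 1.0])
y = 1.0
y_hat = predict(w, x)
print(y_hat, squared_loss(y, y_hat), exponential_loss(y, y_hat))
```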
Empirical Loss Minimization
• Training set (sample): $S = \{(x_i, y_i)\}_{i=1}^{m}$
• Goal: find $w$ that attains low loss $L(w)$ on $S$, where $L(w) = \frac{1}{m} \sum_{i=1}^{m} \ell(y_i, w \cdot x_i)$, ... and performs well on unseen data
• Empirical Risk Minimization (ERM) balances between loss minimization and the "complexity" of $w$
Two Forms of ERM
• Penalized empirical risk, e.g. $\arg\min_{w} \; \sigma \|w\|^2 + \frac{1}{m} \sum_{i=1}^{m} [1 - y_i (w \cdot x_i)]_{+}$, and in general $\arg\min_{w} \; R(w) + \mathbb{E}_{(x,y) \sim D}\left[\ell(w; (x, y))\right]$
• Domain-constrained empirical risk, e.g. $\arg\min_{w} \; -\sum_{t=1}^{T} \log(w \cdot x_t) \;\; \text{s.t.} \;\; w \in \Delta$, and in general $\arg\min_{w} \; \mathbb{E}_{(x,y) \sim D}\left[\ell(w; (x, y))\right] \;\; \text{s.t.} \;\; w \in \Omega$
Sparse Linear Models
[Figure: a dense weight vector $(w_1, \dots, w_8)$ with several coordinates set exactly to zero]
• Use regularization functions or domain constraints that promote sparsity
• Base tool: penalize or constrain the 1-norm of $w$: $\|w\|_1 = \sum_{j=1}^{n} |w_j|$
• Use the base tools to build (promote) models with structural (block) sparsity
Why 1-norm?
[Figure: in the $(w_1, w_2)$ plane, a level set $z = L(w)$ of the loss meets the 1-norm ball $\|w\|_1$ at a "corner", where some coordinates are exactly zero]
1-norm as Proxy to "0-norm"
• The "0-norm" counts the number of non-zero coefficients (it is not a norm): $\|w\|_0 = |\{ j : w_j \neq 0 \}|$
• ERM with 0-norm constraints is NP-hard
• The 1-norm is a relaxation of the 0-norm
• Under (mild to restrictive) conditions the 1-norm "behaves like" the 0-norm (Candès '06, Donoho '06, ...)
L1 and Generalization
• By constraining or penalizing the 1-norm of the weights we prevent excessively "complex" predictors
• The penalties / constraints are especially important for very high-dimensional data with binary features: x ∈ {0,1}^n, e.g. (0, 0, 1, 0, 1, 0, ..., 0, 0, 1, 0)
• By Hölder's inequality, $\|w\|_1 \le z$ and $\|x\|_\infty = 1$ imply $|w \cdot x| \le z$
• A 1-norm constraint therefore caps the maximal value of the predictions
Rough Outline
• Algorithms with sparsity promoting domain constraints
• Algorithms with sparsity promoting regularization
• Efficient implementation in high dimensions
• Structural sparsity from base algorithms
• Improved algorithms from base algorithms
• Coordinate descent w/ sparsity promoting regularization [time permitting]
• A few experimental results
[Slide legend marking the items: "Work really well" / "Try yourself" / "Trust me"]
Loss Minimization & Gradient Descent
[Figure: the loss $L(w)$ with iterates $w_1, w_2, w_3$ and their gradients $\nabla L(w_1), \nabla L(w_2), \nabla L(w_3)$]
• Gradient descent main loop:
  • Compute the gradient $\nabla_t L = \frac{1}{|S|} \sum_{i \in S} \left. \frac{\partial}{\partial w} \ell(w; (x_i, y_i)) \right|_{w = w_t}$
  • Update $w_{t+1} \leftarrow w_t - \eta_t \nabla_t L$, with step size $\eta_t \sim \frac{1}{t}$ or $\eta_t \sim \frac{1}{\sqrt{t}}$
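A minimal sketch of this loop, assuming a squared-error loss for concreteness (illustrative code, not from the slides):

```python
import numpy as np

def gradient_descent(X, y, steps=100, eta0=0.1):
    """Plain (batch) gradient descent for the mean squared error L(w) = mean((Xw - y)^2)."""
    m, n = X.shape
    w = np.zeros(n)
    for t in range(1, steps + 1):
        grad = (2.0 / m) * (X.T @ (X @ w - y))   # gradient of the mean squared error
        eta = eta0 / np.sqrt(t)                  # step size ~ 1/sqrt(t)
        w = w - eta * grad
    return w
```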
Stochastic Gradient
• Often when the training set is large we can use an estimate of the gradient computed on a subset $S' \subset S$: $\hat{\nabla}_t L = \frac{1}{|S'|} \sum_{i \in S'} \left. \frac{\partial}{\partial w} \ell(w; (x_i, y_i)) \right|_{w = w_t}$
Projection Onto A Convex Set
• $\Pi_\Omega(w) = \arg\min_{v \in \Omega} \|v - w\|$
• Example: for $\Omega$ a Euclidean ball of radius $z$ (and $\|w\| > z$), $u = \Pi_\Omega(w) = \frac{z}{\|w\|} w$, i.e. $u_j = \frac{z}{\|w\|} w_j$
Gradient Descent with Domain Constraints
• Loop:
  • Compute the (stochastic) gradient $\hat{\nabla}_t L = \frac{1}{|S'|} \sum_{i \in S'} \left. \frac{\partial}{\partial w} \ell(w; (x_i, y_i)) \right|_{w = w_t}$
  • Update $w_{t+1} = \Pi_\Omega\!\left( w_t - \eta_t \hat{\nabla}_t L \right)$
• Similar convergence guarantees to GD
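A sketch of the constrained update, again assuming a squared-error loss and, for concreteness, a Euclidean-ball constraint; the ℓ₁-ball projection developed next in the slides would simply replace project_l2_ball (illustrative code, not from the slides):

```python
import numpy as np

def project_l2_ball(w, z):
    """Project w onto the Euclidean ball of radius z."""
    norm = np.linalg.norm(w)
    return w if norm <= z else (z / norm) * w

def projected_sgd(X, y, z=1.0, steps=1000, batch=32, eta0=0.1, seed=0):
    rng = np.random.default_rng(seed)
    m, n = X.shape
    w = np.zeros(n)
    for t in range(1, steps + 1):
        idx = rng.choice(m, size=min(batch, m), replace=False)   # S' subset of S
        Xb, yb = X[idx], y[idx]
        grad = (2.0 / len(idx)) * (Xb.T @ (Xb @ w - yb))
        w = project_l2_ball(w - (eta0 / np.sqrt(t)) * grad, z)   # gradient step, then projection
    return w
```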
Gradient Descent with 1-norm Constraint
[Figure: iterates $w_t, w_{t+1}, w_{t+2}$ of projected gradient descent, each projected back onto the 1-norm ball]
Projection Onto 1-norm Ball
[Figure: the components are shifted down by a threshold $\theta$: $v_1 := v_1 - \theta$, $v_2 := v_2 - \theta$]
ℓ₁ Projection Onto the Ball
[Figure: the shift is clipped at zero: $v_1 := \max\{0, v_1 - \theta\}$, $v_2 := \max\{0, v_2 - \theta\}$]
ℓ₁ Projection Onto the Ball
• For general (signed) vectors, each coordinate is soft-thresholded: $v_j := \mathrm{sign}(v_j) \max\{0, |v_j| - \theta\}$
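A one-line NumPy version of this coordinate-wise soft-thresholding (illustrative, not from the slides):

```python
import numpy as np

def soft_threshold(v, theta):
    """Apply sign(v_j) * max(0, |v_j| - theta) to every coordinate of v."""
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)
```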
Algebraic-Geometric View
[Figure: the large components $v_1, v_2, v_4, v_5$ are each reduced by $\theta$, while the small components $v_3, v_6, v_7$ are set to 0]
Algebraic-Geometric View
• The surviving (shifted) components must sum to the radius $z$:
  $(v_1 - \theta) + (v_2 - \theta) + (v_4 - \theta) + (v_5 - \theta) = z \;\Rightarrow\; \theta = \frac{v_1 + v_2 + v_4 + v_5 - z}{4}$
Chicken and Egg Problem
• Had we known the threshold, we could have found all the zero elements
• Had we known which elements become zero, we could have calculated the threshold
From Egg to Omelet
• If $v_j < v_k$ and the k'th component is zero after the projection, then the j'th component must be zero as well
The Omelet
• If two feasible solutions exist, one with k and one with k+1 non-zero elements, then the solution with k+1 elements attains a lower loss
Calculating the ℓ₁ Projection
• Sort the vector to be projected: $v \Rightarrow \mu$ s.t. $\mu_1 \ge \mu_2 \ge \mu_3 \ge \dots \ge \mu_n$
• Index $j$ is feasible (i.e. $\mu_j > \theta$) iff $\mu_j - \frac{1}{j}\left( \sum_{r=1}^{j} \mu_r - z \right) > 0$
• Number of non-zero elements: $\rho = \max\left\{ j : \mu_j - \frac{1}{j}\left( \sum_{r=1}^{j} \mu_r - z \right) > 0 \right\}$
Calculating the Projection
• Worked example with sorted components $v_4 \ge v_5 \ge v_2 \ge \dots$:
  $v_4 - (v_4 - z) > 0$, $\;\; v_5 - \tfrac{1}{2}(v_4 + v_5 - z) > 0$, $\;\dots$
• Here the largest feasible index is $\rho = 3$, so $\theta = \tfrac{1}{3}(v_2 + v_4 + v_5 - z)$
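Putting the sorting-based recipe together, an illustrative NumPy sketch (handles signed vectors via the soft-threshold shown earlier; not code from the slides):

```python
import numpy as np

def project_l1_ball(v, z):
    """Project v onto the L1 ball of radius z via the sort-based procedure (O(n log n))."""
    if np.sum(np.abs(v)) <= z:
        return v.copy()                          # already inside the ball
    mu = np.sort(np.abs(v))[::-1]                # mu_1 >= mu_2 >= ... >= mu_n
    cumsum = np.cumsum(mu)
    j = np.arange(1, len(v) + 1)
    feasible = mu - (cumsum - z) / j > 0         # feasibility test for each index j
    rho = j[feasible][-1]                        # largest feasible index
    theta = (cumsum[rho - 1] - z) / rho          # threshold from the rho largest components
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)
```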
More Efficient Procedure
• Assume we know the number of elements greater than or equal to $v_j$: $\rho(v_j) = |\{ v_i : v_i \ge v_j \}|$
• Assume we know the sum of those elements: $s(v_j) = \sum_{i : v_i \ge v_j} v_i$
• Then we can check the status of $v_j$ in constant time: $v_j > \theta \;\Leftrightarrow\; v_j > \frac{1}{\rho(v_j)}\left( s(v_j) - z \right) \;\Leftrightarrow\; s(v_j) - \rho(v_j)\, v_j < z$
• A randomized median-like search then takes O(n) expected time instead of O(n log n)
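A sketch of one such randomized pivot search (assuming a nonnegative vector v with sum(v) > z, i.e. the point lies strictly outside the ball; illustrative code, not from the slides):

```python
import random

def find_theta_linear(v, z, seed=0):
    """Expected-O(n) random-pivot search for the threshold theta."""
    rng = random.Random(seed)
    U = list(range(len(v)))                   # candidate indices
    s, rho = 0.0, 0                           # running sum / count of elements known to exceed theta
    while U:
        k = U[rng.randrange(len(U))]          # random pivot index
        G = [j for j in U if v[j] >= v[k]]    # candidates at least as large as the pivot
        L = [j for j in U if v[j] < v[k]]
        ds, drho = sum(v[j] for j in G), len(G)
        if (s + ds) - (rho + drho) * v[k] < z:
            s, rho = s + ds, rho + drho       # pivot (and all of G) lies in the support
            U = L
        else:
            U = [j for j in G if j != k]      # pivot is outside the support; keep only larger candidates
    return (s - z) / rho

# usage: theta = find_theta_linear(v, z); then w_j = max(v_j - theta, 0)
```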
Efficient Implementation in High Dimensions
• In many applications the dimension is very high [text applications: dictionaries of 20+ million words; web data: often > 10^10 different html tokens]
• Each example has a small number of non-zero elements [text applications: a news document contains 1000s of words; web data: a web page is often short, fewer than 10^4 html tokens]
• Online/stochastic updates only modify the weights corresponding to the non-zero features in the example (see the sketch below)
• Use a red-black (RB) tree to store only the non-zero weights, plus an additional data structure and lazy evaluation
• Upon projection, removal of a whole sub-tree is performed in log time with Tarjan's (1983) algorithm for splitting RB trees
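A toy sketch of the "touch only the non-zero features" idea, using a plain dict in place of the RB-tree machinery described above (lazy evaluation and sub-tree removal are not shown; squared-error loss is assumed for concreteness):

```python
def sparse_prediction(w, example):
    """Sparse dot product: only the non-zero features of the example contribute."""
    return sum(w.get(j, 0.0) * x_j for j, x_j in example.items())

def sparse_sgd_step(w, example, y, eta):
    """One SGD step for squared-error loss on a sparse example.
    w       : dict feature index -> weight (absent means 0)
    example : dict feature index -> value (only non-zeros stored)
    """
    residual = sparse_prediction(w, example) - y
    for j, x_j in example.items():               # touch only features present in the example
        w_j = w.get(j, 0.0) - eta * 2.0 * residual * x_j
        if w_j != 0.0:
            w[j] = w_j
        else:
            w.pop(j, None)                       # keep the weight map truly sparse
    return w
```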
Empirical Results for Digit Recognition
• 60,000 training examples, 28x28 pixel images
• 25,000 engineered features
• Multiclass logistic regression, trained with:
  • Gradient descent with $L_1$ projection
  • Exponentiated Gradient (EG, mirror descent) by Prof. Manfred and colleagues: $w_{t+1,j} = \frac{w_{t,j} \, e^{-\eta_t \hat{\nabla} L_{t,j}}}{Z_t}$
• Both batch (deterministic) and stochastic GD & EG
GD+L1 vs. EG on MNIST
[Plots: $f - f^*$ on a log scale vs. number of stochastic subgradient evaluations (stochastic variants) and vs. number of gradient evaluations (deterministic variants), for EG and L1-projected GD]
Sparsity “on-the-fly”
[Plot: % sparsity (as % of total features and % of total features seen) vs. number of training examples; text classification, 800,000 documents]
Penalized Risk Minimization & L1
• Penalized empirical risk minimization: $\min_{w} \; L(w) + \lambda \|w\|_1$
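One standard way to attack this penalized objective is proximal gradient / iterative soft-thresholding (ISTA): take a gradient step on L(w), then apply the soft-threshold operator from earlier with threshold ηλ. This is an illustrative option, not necessarily the algorithm developed in these slides; the sketch assumes a squared-error L(w):

```python
import numpy as np

def ista(X, y, lam, steps=200, eta=None):
    """Proximal gradient (ISTA) for min_w mean((Xw - y)^2) + lam * ||w||_1."""
    m, n = X.shape
    if eta is None:
        eta = m / (2.0 * np.linalg.norm(X, 2) ** 2)              # <= 1 / Lipschitz constant of the gradient
    w = np.zeros(n)
    for _ in range(steps):
        grad = (2.0 / m) * (X.T @ (X @ w - y))
        w = w - eta * grad                                       # gradient step on the smooth part
        w = np.sign(w) * np.maximum(np.abs(w) - eta * lam, 0.0)  # soft-threshold = prox of lam*||.||_1
    return w
```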
Subgradients
• The subgradient set of a (convex) function $f$ at $x_0$: $\partial f(x_0) = \left\{ g : f(x) \ge f(x_0) + g^\top (x - x_0) \;\; \forall x \right\}$
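As a standard worked example (not from the slide): for the absolute value $f(x) = |x|$, the building block of the 1-norm,

$$\partial f(x_0) = \begin{cases} \{+1\} & x_0 > 0, \\ [-1, +1] & x_0 = 0, \\ \{-1\} & x_0 < 0, \end{cases}$$

so a subgradient of $\|w\|_1$ is obtained by taking $\mathrm{sign}(w_j)$ on each non-zero coordinate and any value in $[-1, +1]$ on each zero coordinate.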