Efficient Algorithms for Learning Sparse Models from High Dimensional Data


  1. Efficient Algorithms for Learning Sparse Models from High Dimensional Data. Yoram Singer, Google Research. Machine Learning Summer School, UC Santa Cruz, July 18, 2012.

  2. Sparsity? [Example slide: a news sentence, "THE HIGHER MINIMUM WAGE THAT WAS SIGNED INTO LAW ... WILL BE WELCOME RELIEF OF WORKERS ... THE 90 CENT-AN-HOUR INCREASE ...", tagged with the topics REGULATIONS and LABOUR ECONOMICS.] • In many applications there are numerous input features (e.g. all words in a dictionary, all possible HTML tokens) • However, only a fraction of the features are highly relevant to the task at hand • Keeping all the features might also be computationally infeasible

  3. Sparsity? • A large collection of investment tools (stocks, bonds, ETFs, cash, options, ...) • Cannot afford and/or maintain investments in all possible financial instruments across the globe • Need to select a relatively small number of financial instruments to achieve a certain goal (e.g. a volatility-return profile)

  4. High Dimensional Sparse Data • Web search and advertisement placement employ a large number of boolean predicates • Most predicates evaluate to false most of the time • Example: a user types [flowers] and sees "Fernando’s Flower Shop" • Instantiated features: query:"flowers"; query:"flowers" && creative_keyword:"flower"; lang:"en-US" • Which predicates are important? • Resulting instance: $x \in \{0,1\}^n$, e.g. $(0, 0, 1, 0, 1, 0, \ldots, 0, 0, 1, 0)$

  5. Methods to Achieve Compact Models • Forward greedy feature induction (bottom-up) • Backward feature pruning (top-down) • Combination (FoBa): alternate between (i) feature induction, (ii) model fitting, (iii) feature pruning • This tutorial: efficient algorithms for learning "compact" linear models from large amounts of high dimensional data

  6. Linear Models • Input $x = (x_1, \ldots, x_n)$, weights $w = (w_1, \ldots, w_n)$ • Prediction: $\hat{y} = w \cdot x = \sum_{j=1}^{n} w_j x_j$ • True target $y$, loss function $\ell(y, \hat{y})$ • Examples of losses: squared error $\ell(y, \hat{y}) = (y - \hat{y})^2$, exponential loss $\ell(y, \hat{y}) = e^{-y \hat{y}}$
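
As a small illustration of the model and the two losses above, here is a minimal sketch in Python; the vectors and target below are made-up toy values, not data from the talk:

```python
import numpy as np

# Toy weights, input, and target (illustrative values only).
w = np.array([0.5, -1.0, 0.0, 2.0])
x = np.array([1.0, 0.0, 1.0, 1.0])
y = 1.0

y_hat = w @ x                          # prediction: w . x = sum_j w_j x_j

squared_loss = (y - y_hat) ** 2        # (y - y_hat)^2
exponential_loss = np.exp(-y * y_hat)  # e^{-y * y_hat}
```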

  7. Empirical Loss Minimization • Training set (sample) $S = \{(x_i, y_i)\}_{i=1}^{m}$ • Goal: find $w$ that attains low loss $L(w)$ on $S$, $L(w) = \frac{1}{m} \sum_{i=1}^{m} \ell(y_i, w \cdot x_i)$, ... and performs well on unseen data • Empirical Risk Minimization (ERM) balances between loss minimization and the "complexity" of $w$

  8. Two Forms of ERM • Penalized empirical risk: $\arg\min_{w} \; \sigma \|w\|^2 + \frac{1}{m} \sum_{i=1}^{m} [1 - y_i (w \cdot x_i)]_+$, and more generally $\arg\min_{w} \; \lambda R(w) + \mathbb{E}_{(x,y) \sim D} [\ell(w; (x,y))]$ • Domain constrained empirical risk: $\sum_{t=1}^{T} \log(w \cdot x_t)$ s.t. $w \in \Delta$, and more generally $\arg\min_{w} \; \mathbb{E}_{(x,y) \sim D} [\ell(w; (x,y))]$ s.t. $w \in \Omega$

  9. Sparse Linear Models • Sparse weight vector: many of the $w_j$ are exactly zero, e.g. $w = (w_1, 0, w_3, w_4, 0, w_6, 0, 0)$ • Use regularization functions or domain constraints that promote sparsity • Base tools: penalize or constrain the 1-norm of $w$, $\|w\|_1 = \sum_{j=1}^{n} |w_j|$ • Use the base tools to build (promote) models with structural (block) sparsity

  10. Why 1-norm? [Figure: the level set $z = L(w)$ touching the 1-norm ball $\|w\|_1$ at a "corner", i.e. at a sparse point.]

  11. 1-norm as Proxy to "0-norm" • $L_0$ counts the number of non-zero coefficients (not a norm): $\|w\|_0 = |\{ j : w_j \neq 0 \}|$ • ERM with 0-norm constraints is NP-hard • 1-norm is a (convex) relaxation of the 0-norm • Under (mild to restrictive) conditions the 1-norm "behaves like" the 0-norm (Candes '06, Donoho '06, ...)

  12. $L_1$ and Generalization • By constraining or penalizing the 1-norm of the weights we prevent excessively "complex" predictors • The penalties / constraints are especially important in very high-dimensional data with binary features $x \in \{0,1\}^n$, e.g. $(0, 0, 1, 0, 1, 0, \ldots, 0, 0, 1, 0)$ • By Hölder's inequality, $\|w\|_1 \le z$ and $\|x\|_\infty = 1$ imply $|w \cdot x| \le z$, so the 1-norm constraint caps the maximal value of predictions

  13. Rough Outline • Algorithms with sparsity promoting domain constraints • Algorithms with sparsity promoting regularization • Efficient implementation in high dimensions • Structural sparsity from base algorithms • Improved algorithms from base algorithms • Coordinate descent w/ sparsity promoting regularization [time permitting] • A few experimental results • (Slide annotations: "Work really well", "Try Yourself", "Trust Me")

  14. Loss Minimization & Gradient Descent [Figure: iterates $w_1, w_2, w_3, \ldots$ descending the level sets of $L(w)$ along the negative gradients $\nabla L(w_1), \nabla L(w_2), \nabla L(w_3)$.] • Gradient descent main loop: • Compute the gradient $\nabla_t L = \frac{1}{|S|} \sum_{i \in S} \frac{\partial}{\partial w} \ell(w; (x_i, y_i)) \big|_{w = w_t}$ • Update $w_{t+1} \leftarrow w_t - \eta_t \nabla_t L$ with step size $\eta_t \sim \frac{1}{t}$ or $\eta_t \sim \frac{1}{\sqrt{t}}$
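
A minimal sketch of this loop in Python, assuming a per-example gradient function is supplied (here the squared-error gradient, one of the losses mentioned earlier); the $1/\sqrt{t}$ step-size schedule follows the slide:

```python
import numpy as np

def squared_loss_grad(w, x, y):
    # Gradient of (y - w.x)^2 with respect to w.
    return 2.0 * (w @ x - y) * x

def gradient_descent(X, y, loss_grad=squared_loss_grad, T=100):
    """Batch gradient descent: average per-example gradients over the whole
    sample S, then step with a decaying step size eta_t ~ 1/sqrt(t)."""
    m, n = X.shape
    w = np.zeros(n)
    for t in range(1, T + 1):
        grad = np.mean([loss_grad(w, X[i], y[i]) for i in range(m)], axis=0)
        w = w - grad / np.sqrt(t)
    return w
```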

  15. Stochastic Gradient • Often, when the training set is large, we can use an estimate of the gradient computed on a subset $S' \subset S$: $\hat{\nabla}_t L = \frac{1}{|S'|} \sum_{i \in S'} \frac{\partial}{\partial w} \ell(w; (x_i, y_i)) \big|_{w = w_t}$
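
The only change to the sketch above is how the gradient is formed: average over a random subset S' instead of all of S. Batch size and sampling scheme below are choices not specified on the slide:

```python
import numpy as np

def minibatch_gradient(w, X, y, loss_grad, batch_size, rng):
    """Stochastic gradient estimate: average per-example gradients over a
    random subset S' of the sample S."""
    idx = rng.choice(len(y), size=batch_size, replace=False)
    return np.mean([loss_grad(w, X[i], y[i]) for i in idx], axis=0)
```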

  16. Projection Onto a Convex Set • $\Pi_\Omega(w) = \arg\min_{v \in \Omega} \|v - w\|$ • Example: projecting $w$ onto a ball of radius $z$ gives $u = \Pi_\Omega(w) = \frac{z}{\|w\|} w$, i.e. $u_j = \frac{z}{\|w\|} w_j$
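
For the Euclidean-ball example on the slide the projection is just a rescaling; a small sketch:

```python
import numpy as np

def project_onto_l2_ball(w, z):
    """Projection onto {v : ||v||_2 <= z}: if w is already inside the ball it
    is unchanged, otherwise it is rescaled to radius z (u = (z/||w||) w)."""
    norm = np.linalg.norm(w)
    return w if norm <= z else (z / norm) * w
```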

  17. Gradient Descent with Domain Constraints • Loop: • Compute the (estimated) gradient $\hat{\nabla}_t L = \frac{1}{|S'|} \sum_{i \in S'} \frac{\partial}{\partial w} \ell(w; (x_i, y_i)) \big|_{w = w_t}$ • Update $w_{t+1} = \Pi_\Omega \left( w_t - \eta_t \hat{\nabla}_t L \right)$ • Similar convergence guarantees to GD
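
A minimal sketch of the projected update, assuming a gradient-estimate function and a projection onto $\Omega$ are supplied (the function names are illustrative):

```python
import numpy as np

def projected_gradient_descent(grad_estimate, project, w0, T=100):
    """Projected (stochastic) gradient descent: gradient step, then project
    back onto the feasible set: w_{t+1} = Pi_Omega(w_t - eta_t * grad)."""
    w = w0
    for t in range(1, T + 1):
        eta = 1.0 / np.sqrt(t)
        w = project(w - eta * grad_estimate(w, t))
    return w
```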

  18. Gradient Descent with 1-norm Constraint [Figure: iterates $w_t, w_{t+1}, w_{t+2}$ being projected back onto the 1-norm ball after each gradient step.]

  19. Projection Onto the 1-norm Ball [Figure: in 2D, each coordinate is shifted toward zero by a threshold $\theta$:] $v_1 := v_1 - \theta$, $v_2 := v_2 - \theta$

  20. Projection Onto the $\ell_1$ Ball • Coordinates that would cross zero are clipped: $v_1 := \max\{0, v_1 - \theta\}$, $v_2 := \max\{0, v_2 - \theta\}$

  21. Projection Onto the $\ell_1$ Ball • In general, for coordinates of either sign: $v_j := \mathrm{sign}(v_j) \max\{0, |v_j| - \theta\}$
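
In code this componentwise shrinkage is a one-liner (the threshold $\theta$ itself is computed as on the following slides):

```python
import numpy as np

def soft_threshold(v, theta):
    """Componentwise shrinkage: sign(v_j) * max(0, |v_j| - theta)."""
    return np.sign(v) * np.maximum(0.0, np.abs(v) - theta)
```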

  22. Algebraic-Geometric View [Figure: the components of $v$ compared against the threshold $\theta$; the components above $\theta$ ($v_1, v_2, v_4, v_5$) are each reduced by $\theta$, while the rest ($v_3, v_6, v_7$) are set to 0.]

  23. Algebraic-Geometric View • The surviving components must sum to $z$: $(v_1 - \theta) + (v_2 - \theta) + (v_4 - \theta) + (v_5 - \theta) = z \;\Rightarrow\; \theta = \frac{v_1 + v_2 + v_4 + v_5 - z}{4}$

  24. Chicken and Egg Problem • Had we known the threshold we could have found all the zero elements • Had we known the elements that become zero we could have calculated the threshold

  25. From Egg to Omelet • If $v_j < v_k$ and the $k$'th component is zero after the projection, then the $j$'th component must be zero as well

  26. The Omelet • If two feasible solutions exist, one with $k$ and one with $k+1$ non-zero elements, then the solution with $k+1$ non-zero elements attains a lower loss (a smaller distance to the vector being projected)

  27. Calculating the $\ell_1$ Projection • Sort the vector to be projected: $v \Rightarrow \mu$ s.t. $\mu_1 \ge \mu_2 \ge \mu_3 \ge \ldots \ge \mu_n$ • If $j$ is a feasible index then $\mu_j > \theta \;\Rightarrow\; \mu_j > \frac{1}{j} \left( \sum_{r=1}^{j} \mu_r - z \right)$ • Number of non-zero elements: $\rho = \max \left\{ j : \mu_j - \frac{1}{j} \left( \sum_{r=1}^{j} \mu_r - z \right) > 0 \right\}$

  28. Calculating the Projection [Worked example with sorted values $v_4 \ge v_5 \ge v_2 \ge \ldots$:] • $v_4 - (v_4 - z) > 0$, $v_5 - \frac{1}{2}(v_4 + v_5 - z) > 0$, ..., giving $\rho = 3$ and $\theta = \frac{1}{3}(v_2 + v_4 + v_5 - z)$
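
Putting slides 21 and 27 together gives the $O(n \log n)$ sort-based projection; a minimal sketch:

```python
import numpy as np

def project_onto_l1_ball(v, z):
    """Euclidean projection of v onto {w : ||w||_1 <= z} via sorting: find the
    number of survivors rho and the threshold theta, then soft-threshold."""
    u = np.abs(v)
    if u.sum() <= z:
        return v.copy()                              # already feasible
    mu = np.sort(u)[::-1]                            # mu_1 >= ... >= mu_n
    cumsum = np.cumsum(mu)
    j = np.arange(1, len(v) + 1)
    rho = int(np.max(j[mu - (cumsum - z) / j > 0]))  # number of non-zeros
    theta = (cumsum[rho - 1] - z) / rho              # threshold
    return np.sign(v) * np.maximum(0.0, u - theta)
```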

  29. More Efficient Procedure • Assume we know the number of elements greater than or equal to $v_j$: $\rho(v_j) = |\{ v_i : v_i \ge v_j \}|$ • Assume we know the sum of those elements: $s(v_j) = \sum_{i : v_i \ge v_j} v_i$ • Then we can check the status of $v_j$ in constant time: $v_j > \theta \;\Leftrightarrow\; v_j > \frac{1}{\rho(v_j)} \left( s(v_j) - z \right) \;\Leftrightarrow\; s(v_j) - \rho(v_j) v_j < z$ • A randomized, median-like search gives $O(n)$ expected time instead of $O(n \log n)$
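
One way to realize this randomized, median-like search is sketched below (in the spirit of the slide; the talk's exact implementation may differ): keep a set of undecided values, pick a random pivot, and use the constant-time test above to decide whether the pivot survives, accumulating $\rho$ and $s$ as values are resolved.

```python
import numpy as np

def project_onto_l1_ball_linear(v, z, rng=None):
    """Expected O(n) projection onto {w : ||w||_1 <= z} via randomized pivoting."""
    u = np.abs(v)
    if u.sum() <= z:
        return v.copy()
    rng = np.random.default_rng() if rng is None else rng
    undecided = u              # values whose status (zeroed or not) is unknown
    s, rho = 0.0, 0            # sum / count of values known to exceed theta
    while undecided.size > 0:
        pivot = undecided[rng.integers(undecided.size)]
        geq = undecided[undecided >= pivot]
        if (s + geq.sum()) - (rho + geq.size) * pivot < z:
            # pivot survives, hence so does everything >= it
            s += geq.sum()
            rho += geq.size
            undecided = undecided[undecided < pivot]
        else:
            # pivot is zeroed out; only strictly larger values can survive
            undecided = geq[geq > pivot]
    theta = (s - z) / rho
    return np.sign(v) * np.maximum(0.0, u - theta)
```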

  30. Efficient Implementation in High Dimensions • In many applications the dimension is very high [text applications: dictionaries of 20+ million words; web data: often > $10^{10}$ different HTML tokens] • Small number of non-zero elements in each example [text applications: a news document contains 1000s of words; web data: a web page is often short, less than $10^4$ HTML tokens] • Online/stochastic updates only modify the weights corresponding to non-zero features in the example • Use a red-black (RB) tree to store only the non-zero weights + an additional data structure + lazy evaluation • Upon projection, removal of a whole sub-tree is performed in logarithmic time with Tarjan's (1983) algorithm for splitting an RB tree
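
A toy sketch of the "touch only the non-zero features" idea, with the weights in a plain dict keyed by feature id; the talk's actual implementation (red-black tree, extra bookkeeping, lazy evaluation of the projection) is more involved and is not reproduced here:

```python
def sparse_sgd_step(weights, example, label, dloss_dpred, eta):
    """One stochastic step that only reads/writes weights for the features
    present in `example` (a dict {feature_id: value} of non-zero features).
    `dloss_dpred(y_hat, y)` returns the scalar d(loss)/d(prediction)."""
    y_hat = sum(weights.get(j, 0.0) * x_j for j, x_j in example.items())
    g = dloss_dpred(y_hat, label)
    for j, x_j in example.items():
        w_j = weights.get(j, 0.0) - eta * g * x_j
        if w_j == 0.0:
            weights.pop(j, None)      # keep the map restricted to non-zero weights
        else:
            weights[j] = w_j
    return weights
```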

  31. Empirical Results for Digit Recognition • 60,000 training examples, 28x28 pixel images • Engineered 25,000 features • Multiclass logistic regression • Gradient descent with $L_1$ projection • Exponentiated Gradient (EG, mirror descent) by Prof. Manfred and colleagues: $w_{t+1,j} = \frac{w_{t,j} \, e^{-\eta_t \hat{\nabla}_j L_t}}{Z_t}$ • Both batch (deterministic) and stochastic GD & EG
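
The EG update on the slide, as a short sketch ($Z_t$ is just the normalization that keeps the weights summing to one):

```python
import numpy as np

def eg_step(w, grad, eta):
    """One Exponentiated Gradient step: w_{t+1,j} = w_{t,j} exp(-eta * grad_j) / Z_t."""
    w_new = w * np.exp(-eta * grad)
    return w_new / w_new.sum()
```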

  32. GD+$L_1$ vs. EG on MNIST [Two plots of $f - f^*$ (log scale) comparing EG and $L_1$-projected GD: the stochastic setting over 50-400 subgradient evaluations, and the deterministic setting over 2-20 gradient evaluations.]

  33. Sparsity "on-the-fly" [Plot: % sparsity, both as % of total features and as % of features seen so far, vs. number of training examples (up to 8 x 10^5); text classification with 800,000 documents.]

  34. Penalized Risk Minimization & $L_1$ • Penalized empirical risk minimization: $\min_{w} \; L(w) + \lambda \|w\|_1$
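
One standard way to minimize this penalized objective is a proximal-gradient (ISTA-style) step: a gradient step on $L$ followed by soft-thresholding with $\eta \lambda$. This is a hedged sketch of that common approach; the tutorial's own treatment of the penalized problem (e.g. via coordinate descent) may differ:

```python
import numpy as np

def proximal_gradient_step(w, grad_L, eta, lam):
    """Gradient step on L(w), then the proximal operator of lambda*||.||_1,
    i.e. componentwise soft-thresholding with threshold eta*lambda."""
    v = w - eta * grad_L
    return np.sign(v) * np.maximum(0.0, np.abs(v) - eta * lam)
```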

  35. Subgradients • The subgradient set of a function $f$ at $x_0$: $\partial f(x_0) = \left\{ g : f(x) \ge f(x_0) + g^\top (x - x_0) \;\; \forall x \right\}$
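
As a concrete instance of this definition (a standard fact, stated here for reference): $\partial |x| = \{\mathrm{sign}(x)\}$ for $x \neq 0$ and $\partial |x| = [-1, 1]$ at $x = 0$; hence $g \in \partial \|w\|_1$ iff $g_j = \mathrm{sign}(w_j)$ when $w_j \neq 0$ and $g_j \in [-1, 1]$ when $w_j = 0$.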
