A majorization-minimization algorithm for (multiple) hyperparameter learning



  1. A majorization-minimization algorithm for (multiple) hyperparameter learning
     Chuan-Sheng Foo, Chuong B. Do, Andrew Y. Ng
     Stanford University
     ICML 2009, Montreal, Canada, 17th June 2009

  2. Supervised learning
     • Training set of m IID examples; labels may be real-valued, discrete, or structured
     • Probabilistic model
     • Estimate parameters
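The training-set and estimation formulas on this slide did not survive the transcript; as a sketch of the standard setup it describes (the symbols x^{(i)}, y^{(i)}, w, and p are my notation, not taken from the slide):

    \text{training set: } \{(x^{(i)}, y^{(i)})\}_{i=1}^{m} \ \text{drawn IID}, \qquad
    \hat{w} = \arg\max_{w} \sum_{i=1}^{m} \log p\left(y^{(i)} \mid x^{(i)}; w\right)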

  3. Regularization prevents overfitting
     • Regularized maximum likelihood estimation: the data log-likelihood plus an L2 regularization term controlled by a hyperparameter C (e.g., L2-regularized logistic regression)
     • Equivalently, maximum a posteriori (MAP) estimation: the data log-likelihood plus a log-prior over the model parameters
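The objective itself appeared as an image on the slide; reconstructed in LaTeX from the labels above (data log-likelihood, L2 regularizer with hyperparameter C, log-prior), a plausible form is:

    \hat{w} = \arg\max_{w} \;
      \underbrace{\sum_{i=1}^{m} \log p\left(y^{(i)} \mid x^{(i)}; w\right)}_{\text{data log-likelihood}}
      \;-\;
      \underbrace{\frac{C}{2}\,\|w\|_2^2}_{\text{regularization}}

Read as MAP estimation, the second term is the log of a Gaussian prior on w, \log p(w \mid C) = -\tfrac{C}{2}\|w\|_2^2 + \text{const}.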

  4. How to select the hyperparameter(s)?
     • Grid search: (+) simple to implement; (−) scales exponentially with the number of hyperparameters
     • Gradient-based algorithms: (+) scale well with the number of hyperparameters; (−) non-trivial to implement
     Can we get the best of both worlds?

  5. Our contribution
     • Striking ease of implementation
     • Simple, closed-form updates for C
     • Leverages existing solvers
     • Scales well to the multiple-hyperparameter case
     • Applicable to a wide range of models

  6. Outline 1. Problem definition 2. The “integrate out” strategy 3. The Majorization-Minimization algorithm 4. Experiments 5. Discussion

  7. The “integrate out” strategy • Treat hyperparameter C as a random variable • Analytically integrate out C • Need a convenient prior p(C)

  8. Integrating out a single hyperparameter
     • For L2 regularization, the conditional prior on w given C is Gaussian
     • A convenient prior on C: a Gamma distribution
     • The result: (1) C is gone; (2) the new prior is neither convex nor concave in w
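Neither the prior nor the integrated result survived in the transcript; the following is a sketch of the computation, assuming the Gaussian prior on w implied by L2 regularization and the Gamma prior on C mentioned on the Discussion slide (n denotes the number of parameters in w):

    p(w \mid C) \propto C^{n/2} \exp\left(-\tfrac{C}{2}\|w\|_2^2\right), \qquad
    p(C) \propto C^{\alpha} e^{-\beta C}

    p(w) = \int_0^{\infty} p(w \mid C)\, p(C)\, dC
         \;\propto\; \left(\beta + \tfrac{1}{2}\|w\|_2^2\right)^{-(\alpha + n/2 + 1)}

The negative log of this prior is proportional to \log\left(\beta + \tfrac{1}{2}\|w\|_2^2\right): C is gone, and the log makes the term neither convex nor concave in w, consistent with the two points above and with the \log(0.5x^2 + 1) curve visualized on slide 12 (α = 0, β = 1).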

  9. The Majorization-Minimization (MM) algorithm
     • Replace the hard problem with a series of easier ones
     • EM-like; two steps:
       1. Majorization: upper-bound the objective function
       2. Minimization: minimize the upper bound
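In symbols, the MM template described here is (f is the objective to be minimized and g the surrogate; this notation is mine, not the slide's):

    \text{choose } g(w; w_t) \text{ such that } g(w; w_t) \ge f(w) \ \forall w
    \ \text{ and } \ g(w_t; w_t) = f(w_t), \qquad
    w_{t+1} = \arg\min_{w} g(w; w_t)

Each iteration can only decrease the objective, since f(w_{t+1}) \le g(w_{t+1}; w_t) \le g(w_t; w_t) = f(w_t).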

  10. MM1: Upper-bounding the new prior
      • New prior: the log term obtained by integrating out C (slide 8)
      • Linearize the log
      [Plot: log(x) with its tangent-line expansions at x = 1, 1.5, and 2]

  11. MM2: Solving the resultant optimization problem
      • The linearized prior splits into a quadratic term in w plus terms independent of w
      • The result is standard L2 regularization, so existing solvers can be used
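A sketch of how the two MM steps play out on the integrated-out prior derived above: because the log is concave, its tangent line at the current iterate w_t is an upper bound (majorization), and substituting it leaves a standard L2-regularized problem (minimization):

    \log\left(\beta + \tfrac{1}{2}\|w\|_2^2\right)
      \;\le\; \log\left(\beta + \tfrac{1}{2}\|w_t\|_2^2\right)
      + \frac{\tfrac{1}{2}\|w\|_2^2 - \tfrac{1}{2}\|w_t\|_2^2}{\beta + \tfrac{1}{2}\|w_t\|_2^2}

Only the \tfrac{1}{2}\|w\|_2^2 term depends on w, so the bound acts as an L2 regularizer with effective hyperparameter

    C_t = \frac{\alpha + n/2 + 1}{\beta + \tfrac{1}{2}\|w_t\|_2^2}

(the exact constant in the numerator depends on how the Gamma prior is parameterized, which the transcript does not spell out).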

  12. Visualization of the upper bound
      [Plots: left, log(x) with tangent-line expansions at x = 1, 1.5, and 2; right, log(0.5x² + 1) with the corresponding upper bounds at the same points]

  13. Overall algorithm
      1. Closed-form updates for C
      2. Leverage existing solvers
      Converges to a local minimum
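A minimal sketch of the loop this slide summarizes, under the assumptions above. The function solve_l2_regularized stands in for any existing L2-regularized solver; its name and signature, the defaults alpha=0 and beta=1, the iteration count, and the initial C are illustrative choices, not taken from the slides.

    import numpy as np

    def mm_hyperparameter_learning(X, y, solve_l2_regularized,
                                   alpha=0.0, beta=1.0, n_iters=20):
        """Alternate an off-the-shelf L2-regularized fit with a
        closed-form update of the regularization hyperparameter C."""
        n = X.shape[1]                          # number of parameters
        C = 1.0                                 # arbitrary starting value
        w = solve_l2_regularized(X, y, C)       # initial fit
        for _ in range(n_iters):
            # Majorization: the tangent-line bound on log(beta + ||w||^2 / 2)
            # yields a standard L2 problem with this effective C.
            C = (alpha + n / 2.0 + 1.0) / (beta + 0.5 * float(np.dot(w, w)))
            # Minimization: reuse the existing solver with the updated C.
            w = solve_l2_regularized(X, y, C)
        return w, C

Any off-the-shelf solver (for example, an L2-regularized logistic regression routine) can play the role of solve_l2_regularized, converting C to that solver's own regularization convention.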

  14. What about multiple hyperparameters?
      • Regularization groups: partition the weights, w = (w1, w2, w3, w4, w5), and map each weight to a group with its own hyperparameter, C = (C1, C2)
      • NLP example ("To C or not to C. That is the question…"): unigram feature weights in one group, bigram feature weights in another
      • RNA secondary structure prediction example: hairpin-loop weights in one group, bulge-loop weights in another

  15. What about multiple hyperparameters?
      • Separately update each regularization group
      • Sum weights in each group
      • Weighted L2-regularization
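A sketch of the per-group update, assuming each group g (with n_g weights w_g) is given its own hyperparameter C_g under the same Gamma prior; the derivation mirrors the single-hyperparameter case:

    C_g^{(t)} = \frac{\alpha + n_g/2 + 1}{\beta + \tfrac{1}{2}\|w_g^{(t)}\|_2^2}, \qquad
    \text{surrogate regularizer: } \sum_{g} \frac{C_g^{(t)}}{2}\,\|w_g\|_2^2

"Sum weights in each group" then refers to the squared norm \|w_g\|_2^2, and the minimization step is a weighted L2-regularized problem that existing solvers handle directly. As before, the numerator constant depends on the Gamma parameterization.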

  16. Experiments
      • 4 probabilistic models
        – Linear regression (too easy, not shown)
        – Binary logistic regression
        – Multinomial logistic regression
        – Conditional log-linear model
      • 3 competing algorithms
        – Grid search
        – Gradient-based algorithm (Do et al., 2007)
        – Direct optimization of the new objective
      • Algorithm run with α = 0, β = 1

  17. Accuracy Results: Binary Logistic Regression
      [Bar chart: accuracy (50–100%) of Grid, Grad, Direct, and MM on the australian, breast-cancer, diabetes, german-numer, heart, ionosphere, liver-disorders, mushrooms, sonar, splice, and w1a datasets]

  18. Accuracy Results: Multinomial Logistic Regression
      [Bar chart: accuracy (30–100%) of Grid, Grad, Direct, and MM on the connect-4, dna, glass, iris, letter, mnist1, satimage, segment, svmguide2, usps, vehicle, vowel, and wine datasets]

  19. Results: Conditional Log-Linear Models
      • RNA secondary structure prediction
      • Multiple hyperparameters
      [Bar chart: ROC area (0.58–0.65) for Gradient, Direct, and MM, each with a single hyperparameter and with grouped hyperparameters]
      [Figure: an example RNA sequence and its predicted secondary structure in dot-bracket notation]

  20. Discussion
      • How to choose α, β in the Gamma prior?
        – Sensitivity experiments
        – A simple choice is reasonable
        – Further investigation required
      • Simple assumptions are sometimes wrong, but performance is competitive with Grid and Grad
      • Well suited for "quick-and-dirty" implementations

  21. Thank you!
