A Blueprint of Standardized and Composable ML (Eric Xing and Zhiting Hu, Petuum & Carnegie Mellon)


  1. A Blueprint of Standardized and Composable ML. Eric Xing and Zhiting Hu, Petuum & Carnegie Mellon

  2. The universe of problems ML/AI is trying to solve

  3. Data and experiences of all kinds: data examples (e.g., "Type-2 diabetes is 90% more common than type-1"), rewards, constraints, auxiliary agents, adversaries, … and all combinations of those

  4. How do human beings solve them ALL?

  5. The Zoo of ML/AI Models
  ● Neural networks: ◯ Convolutional networks (AlexNet, GoogleNet, ResNet) ◯ Recurrent networks, LSTM ◯ Transformers (BERT, GPT-2)
  ● Kernel machines: ◯ Radial basis function networks ◯ Gaussian processes ◯ Deep kernel learning ◯ Maximum margin, SVMs
  ● Graphical models: ◯ Bayesian networks ◯ Markov random fields ◯ Topic models, LDA ◯ HMM, CRF
  ● Decision trees
  ● PCA, probabilistic PCA, kernel PCA, ICA
  ● Boosting

  6. The Zoo of algorithms and heuristics: maximum likelihood estimation, reinforcement learning as inference, data re-weighting, inverse RL, active learning, policy optimization, reward-augmented maximum likelihood, data augmentation, actor-critic, softmax policy gradient, label smoothing, imitation learning, adversarial domain adaptation, posterior regularization, GANs, constraint-driven learning, knowledge distillation, intrinsic reward, generalized expectation, prediction minimization, regularized Bayes, learning from measurements, energy-based GANs, weak/distant supervision

  7. Really hard to navigate, and to realize
  ● Depends on individual expertise and creativity
  ● Bespoke, delicate pieces of art
  ● Like an airport with a different runway for every different type of aircraft

  8. Physics in the 1800's
  ● Electricity & magnetism: ◯ Coulomb's law, Ampère, Faraday, ...
  ● Theory of light beams: ◯ Particle theory: Isaac Newton, Laplace, Planck ◯ Wave theory: Grimaldi, Christiaan Huygens, Thomas Young, Maxwell
  ● Law of gravity: ◯ Aristotle, Galileo, Newton, …

  9. Maxwell's equations: diverse electromagnetic theories, unified in the original form, simplified w/ rotational symmetry, further simplified w/ the symmetry of special relativity:
  $\partial_\nu F^{\mu\nu} = \frac{4\pi}{c}\, j^\mu, \qquad \epsilon^{\mu\nu\kappa\lambda}\, \partial_\nu F_{\kappa\lambda} = 0$

  10. How about a blueprint of ML?
  ● Loss
  ● Optimization solver
  ● Model architecture
  $\min_\theta\; \mathcal{L}(\theta)$

  11. How about a blueprint of ML?
  ● Loss: $\min_{q,\theta}$ over three terms: Experience, Divergence, and Uncertainty ($\mathbb{H}$)
  ● Optimization solver
  ● Model architecture

  12. MLE at a close look
  ● The most classical learning algorithm
  ● Supervised: observe data $\mathcal{D} = \{(x^*, y^*)\}$, then $\min_\theta\; -\,\mathbb{E}_{(x^*, y^*)\sim\mathcal{D}}\big[\log p_\theta(y^*|x^*)\big]$
    ◯ Solve with SGD (a minimal sketch follows below)
  ● Unsupervised: observe $\mathcal{D} = \{x^*\}$, with $y$ a latent variable, then $\min_\theta\; -\,\mathbb{E}_{x^*\sim\mathcal{D}}\big[\log \sum_y p_\theta(x^*, y)\big]$
    ◯ Posterior $p_\theta(y|x)$
    ◯ Solve with EM:
      § E-step imputes the latent variable $y$ through the expectation of the complete likelihood
      § M-step: supervised MLE
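To make the supervised case concrete, here is a minimal sketch of MLE solved with SGD, assuming a simple logistic-regression model on synthetic data; the model, data, and step size are illustrative choices, not taken from the slides.

```python
# Minimal sketch of supervised MLE with SGD: logistic regression on toy data.
# Model, data, and learning rate are illustrative assumptions, not from the slides.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                       # observed inputs x*
w_true = np.array([1.5, -2.0, 0.5])
y = (X @ w_true + 0.1 * rng.normal(size=200) > 0).astype(float)  # observed labels y*

w = np.zeros(3)
lr = 0.1
for epoch in range(100):
    for i in rng.permutation(len(X)):               # stochastic updates
        p = 1.0 / (1.0 + np.exp(-X[i] @ w))         # p_theta(y=1 | x*)
        grad = (p - y[i]) * X[i]                    # gradient of -log p_theta(y*|x*)
        w -= lr * grad                              # SGD step on the negative log-likelihood
print("estimated weights:", w)
```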

  13. MLE as Entropy Maximization
  ● Duality between supervised MLE and maximum entropy, when $p_\theta$ is an exponential family:
    $\min_q\; -\,\mathbb{H}\big(q(x, y)\big)$   s.t.   $\mathbb{E}_q\big[T(x, y)\big] = \mathbb{E}_{(x^*, y^*)\sim\mathcal{D}}\big[T(x, y)\big]$   (Shannon entropy; features $T(x, y)$; data as constraints)
  ⇒ Solve w/ the Lagrangian method, with Lagrangian multipliers $\theta$:
    $p_\theta(x, y) = \exp\{\theta \cdot T(x, y)\}\,/\,Z(\theta)$
    $\min_\theta\; -\,\mathbb{E}_{(x^*, y^*)\sim\mathcal{D}}\big[\theta \cdot T(x, y)\big] + \log Z(\theta)$   (negative log-likelihood)
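A sketch of the Lagrangian step behind this duality, using the feature notation $T(x, y)$ from above: the Lagrange dual of the constrained MaxEnt problem is exactly the negative log-likelihood of the exponential family.

```latex
% MaxEnt primal:  min_q -H(q)  s.t.  E_q[T(x,y)] = E_{(x*,y*)~D}[T(x,y)]  and  q normalized.
% Stationarity of the Lagrangian in q (theta = multipliers of the moment constraints):
%   log q(x,y) + 1 - theta . T(x,y) + const = 0   =>   q(x,y) = exp{theta . T(x,y)} / Z(theta) = p_theta(x,y)
% Lagrange dual function and dual problem:
\begin{aligned}
g(\theta) &= \theta \cdot \mathbb{E}_{(x^*, y^*)\sim\mathcal{D}}\big[T(x, y)\big] - \log Z(\theta) \\
\max_\theta\, g(\theta) &\iff \min_\theta\; -\,\mathbb{E}_{(x^*, y^*)\sim\mathcal{D}}\big[\theta \cdot T(x, y)\big] + \log Z(\theta)
\end{aligned}
```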

  14. MLE as Entropy Maximization
  ● Unsupervised MLE can be achieved by maximizing the negative free energy:
    ◯ Introduce an auxiliary distribution $q(y|x)$ (and then play with its entropy and cross entropy, etc.)
    $\log \sum_y p_\theta(x^*, y) \;=\; \mathbb{H}\big(q(y|x^*)\big) + \mathbb{E}_{q(y|x^*)}\big[\log p_\theta(x^*, y)\big] + \mathrm{KL}\big(q(y|x^*)\,\|\,p_\theta(y|x^*)\big) \;\ge\; \mathbb{H}\big(q(y|x^*)\big) + \mathbb{E}_{q(y|x^*)}\big[\log p_\theta(x^*, y)\big]$
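The decomposition above follows by adding and subtracting $\log q(y|x^*)$ inside the expectation; a one-line sketch:

```latex
\begin{aligned}
\log \textstyle\sum_y p_\theta(x^*, y)
  &= \mathbb{E}_{q(y|x^*)}\big[\log p_\theta(x^*, y) - \log q(y|x^*)\big]
   + \mathbb{E}_{q(y|x^*)}\big[\log q(y|x^*) - \log p_\theta(y|x^*)\big] \\
  &= \underbrace{\mathbb{H}\big(q(y|x^*)\big) + \mathbb{E}_{q(y|x^*)}\big[\log p_\theta(x^*, y)\big]}_{\text{ELBO / negative free energy}}
   + \underbrace{\mathrm{KL}\big(q(y|x^*)\,\|\,p_\theta(y|x^*)\big)}_{\ge\, 0}
\end{aligned}
```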

  15. Algorithms for Unsupervised MLE
  $\min_\theta\; -\,\mathbb{E}_{x^*\sim\mathcal{D}}\big[\log \sum_y p_\theta(x^*, y)\big]$
  1) Solve with EM
  $\log \sum_y p_\theta(x^*, y) \;\ge\; \mathbb{H}\big(q(y|x^*)\big) + \mathbb{E}_{q(y|x^*)}\big[\log p_\theta(x^*, y)\big] \;=:\; \mathcal{L}(q, \theta)$
  § E-step: maximize $\mathcal{L}(q, \theta)$ w.r.t. $q$, equivalent to minimizing the KL term by setting $q(y|x^*) = p_{\theta^{\text{old}}}(y|x^*)$
  § M-step: maximize $\mathcal{L}(q, \theta)$ w.r.t. $\theta$: $\max_\theta\; \mathbb{E}_{q(y|x^*)}\big[\log p_\theta(x^*, y)\big]$
  (A toy EM sketch follows below.)
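As a concrete instance of these two steps, here is a minimal EM sketch for a two-component 1-D Gaussian mixture; the mixture model and all numbers are illustrative assumptions, not the slides' model.

```python
# Minimal EM sketch for a 2-component 1-D Gaussian mixture (illustrative model):
# E-step sets q(y|x*) = p_theta_old(y|x*); M-step maximizes E_q[log p_theta(x*, y)]
# in closed form.
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 1, 700)])  # observed x*

pi, mu, sigma = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])
for _ in range(50):
    # E-step: responsibilities q(y|x*) under the current theta
    dens = pi * np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    q = dens / dens.sum(axis=1, keepdims=True)
    # M-step: closed-form maximization of E_q[log p_theta(x*, y)]
    Nk = q.sum(axis=0)
    pi = Nk / len(x)
    mu = (q * x[:, None]).sum(axis=0) / Nk
    sigma = np.sqrt((q * (x[:, None] - mu) ** 2).sum(axis=0) / Nk)
print("weights:", pi, "means:", mu, "stds:", sigma)
```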

  16. Algorithms for Unsupervised MLE (cont'd)
  $\log \sum_y p_\theta(x^*, y) \;\ge\; \mathbb{H}\big(q(y|x^*)\big) + \mathbb{E}_{q(y|x^*)}\big[\log p_\theta(x^*, y)\big] \;=\; \mathcal{L}(q, \theta)$
  2) When the model $p_\theta$ is complex, directly working with the true posterior $p_\theta(y|x^*)$ is intractable ⇒ Variational EM
  § Consider a sufficiently restricted family $\mathcal{Q}$ of $q(y|x)$ so that minimizing the KL is tractable, e.g., parametric distributions, factorized distributions
  § E-step: maximize $\mathcal{L}(q, \theta)$ w.r.t. $q \in \mathcal{Q}$, equivalent to minimizing the KL
  § M-step: maximize $\mathcal{L}(q, \theta)$ w.r.t. $\theta$: $\max_\theta\; \mathbb{E}_{q(y|x^*)}\big[\log p_\theta(x^*, y)\big]$
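Equivalently, the restricted E-step can be read as a KL projection of the exact posterior onto the family $\mathcal{Q}$; a sketch (the two forms differ only by the constant $\log p_\theta(x^*)$):

```latex
q^{\text{new}} \;=\; \arg\min_{q \in \mathcal{Q}}\; \mathrm{KL}\big(q(y|x^*)\,\|\,p_\theta(y|x^*)\big)
            \;=\; \arg\max_{q \in \mathcal{Q}}\; \mathbb{H}\big(q(y|x^*)\big) + \mathbb{E}_{q(y|x^*)}\big[\log p_\theta(x^*, y)\big]
```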

  17. Algorithms for Unsupervised MLE (cont'd)
  $\log \sum_y p_\theta(x^*, y) \;\ge\; \mathbb{H}\big(q(y|x^*)\big) + \mathbb{E}_{q(y|x^*)}\big[\log p_\theta(x^*, y)\big] \;=\; \mathcal{L}(q, \theta)$
  3) When $q$ is complex, e.g., deep NNs, optimizing $q$ in the E-step is difficult (e.g., high variance) ⇒ Wake-Sleep algorithm [Hinton et al., 1995]
  • Sleep phase (E-step): reverse KL, $\min_\phi\; \mathrm{KL}\big(p_\theta(y|x^*)\,\|\,q_\phi(y|x^*)\big)$
  • Wake phase (M-step): maximize $\mathcal{L}(q, \theta)$ w.r.t. $\theta$: $\max_\theta\; \mathbb{E}_{q_\phi(y|x^*)}\big[\log p_\theta(x^*, y)\big]$
  Other tricks: reparameterization in VAEs ('2014), control variates in NVIL ('2014)
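A sketch of why the sleep phase is tractable: the reverse KL only needs samples from the generative model ("dreams"), on which the recognition model $q_\phi$ is trained by maximum likelihood ($\phi$ denotes the recognition parameters, as above).

```latex
\begin{aligned}
\text{sleep (E-like step):}\quad
  &\min_\phi\; \mathrm{KL}\big(p_\theta(y|x)\,\|\,q_\phi(y|x)\big)
   \;=\; \min_\phi\; -\,\mathbb{E}_{p_\theta(y|x)}\big[\log q_\phi(y|x)\big] + \text{const}, \\
  &\text{estimated on samples } (x, y) \sim p_\theta(x, y) \text{ (``dreams''), so no inference through } p_\theta \text{ is needed;} \\
\text{wake (M-step):}\quad
  &\max_\theta\; \mathbb{E}_{q_\phi(y|x^*)}\big[\log p_\theta(x^*, y)\big].
\end{aligned}
```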

  18. Quick summary of MLE
  ● Supervised: ◯ Duality with MaxEnt ◯ Solve with SGD, IPF, …
  ● Unsupervised: ◯ Lower bounded by the negative free energy ◯ Solve with EM, VEM, Wake-Sleep, …
  ● Close connections to MaxEnt
  ● With MaxEnt, algorithms (e.g., EM) arise naturally

  19. Posterior Regularization (PR)
  ● Make use of constraints in Bayesian learning
    ◯ An auxiliary posterior distribution $q(y)$
    ◯ Slack variable $\xi$, constant weights $\alpha = \beta > 0$
  $\min_{q, \xi}\; -\,\alpha \mathbb{H}(q) - \beta\, \mathbb{E}_q\big[\log p_\theta(x, y)\big] + \xi$   s.t.   $-\,\mathbb{E}_q\big[f(x, y)\big] \le \xi$   [Ganchev et al., 2010]
    ◯ E.g., max-margin constraints for linear regression [Jaakkola et al., 1999] and general models (e.g., LDA, NNs) [Zhu et al., 2014] (more later)
  ● Solution for $q$: $q^*(y) = \exp\big\{\big(\beta \log p_\theta(x, y) + f(x, y)\big)/\alpha\big\}\,/\,Z$   (a toy sketch follows below)
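For intuition, the closed-form solution for $q$ is just a constraint-reweighted softmax over the model's joint; a minimal sketch on a small discrete domain, where all numbers are toy placeholders rather than values from the slides:

```python
# Sketch of the PR solution on a small discrete domain: the closed-form
# q(y) ∝ exp{(beta * log p_theta(x, y) + f(x, y)) / alpha}. Toy values only.
import numpy as np

alpha, beta = 1.0, 1.0
log_p = np.log(np.array([0.6, 0.3, 0.1]))   # log p_theta(x, y) for 3 candidate y's
f = np.array([0.0, 2.0, -1.0])              # constraint/experience scores f(x, y)

logits = (beta * log_p + f) / alpha
q = np.exp(logits - logits.max())           # stable softmax
q /= q.sum()                                # normalized posterior-regularized q(y)
print(q)
```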

  20. More general learning leveraging PR
  ● No need to limit to Bayesian learning
  ● E.g., complex rule constraints on general models [Hu et al., 2016], where
    ◯ $q$ can be over arbitrary variables, e.g., $q(x, y)$
    ◯ $p_\theta(x, y)$ is an NN of arbitrary architecture with parameters $\theta$
    ◯ E.g., $f(x, y)$ is a 1st-order logical rule: if sentence $x$ contains the word "but", its sentiment $y$ is the same as the sentiment after "but" (a toy scoring sketch follows below)
  $\min_{q, \theta, \xi}\; -\,\alpha \mathbb{H}(q) - \beta\, \mathbb{E}_q\big[\log p_\theta(x, y)\big] + \xi$   s.t.   $\mathbb{E}_q\big[1 - f(x, y)\big] \le \xi$
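Below is an illustrative, made-up scoring function for that "but" rule, returning $f(x, y) = 1$ when the pair satisfies the rule (or the rule doesn't fire) and 0 otherwise; the toy lexicon and helper are assumptions for the sketch, not the implementation of [Hu et al., 2016].

```python
# Illustrative soft-constraint score f(x, y) for the "but" rule from the slide:
# if the sentence contains "but", the sentence-level sentiment y should match the
# sentiment after "but" (approximated here by a toy lexicon; all names are made up).
POS, NEG = {"good", "great", "delicious"}, {"bad", "boring", "bland"}

def rule_f(sentence: str, y: int) -> float:
    """Return 1.0 if (sentence, y) satisfies the rule (or the rule doesn't fire), else 0.0."""
    if " but " not in sentence.lower():
        return 1.0                                    # rule doesn't apply
    after = sentence.lower().split(" but ", 1)[1].split()
    pos = sum(w in POS for w in after)
    neg = sum(w in NEG for w in after)
    y_after = 1 if pos >= neg else 0                  # crude sentiment after "but"
    return 1.0 if y == y_after else 0.0

print(rule_f("The food was bland but the service was great", 1))  # satisfies -> 1.0
```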

  21. EM for the general PR
  ● Rewrite without the slack variable:
  $\min_{q, \theta}\; -\,\alpha \mathbb{H}(q) - \beta\, \mathbb{E}_q\big[\log p_\theta(x, y)\big] - \mathbb{E}_q\big[f(x, y)\big]$
    ◯ Solve with EM
      § E-step: $q(x, y) = \exp\big\{\big(\beta \log p_\theta(x, y) + f(x, y)\big)/\alpha\big\}\,/\,Z$
      § M-step: $\max_\theta\; \mathbb{E}_q\big[\log p_\theta(x, y)\big]$

  22. Reformulating unsupervised MLE with PR
  $\log \sum_y p_\theta(x^*, y) \;\ge\; \mathbb{H}\big(q(y|x^*)\big) + \mathbb{E}_{q(y|x^*)}\big[\log p_\theta(x^*, y)\big]$
  ● Introduce an arbitrary auxiliary distribution $q$:
  $\min_{q, \theta, \xi}\; -\,\alpha \mathbb{H}(q) - \beta\, \mathbb{E}_q\big[\log p_\theta(x, y)\big] + \xi$   s.t.   $-\,\mathbb{E}_q\big[f_d(x; \mathcal{D})\big] \le \xi$
    ◯ Data as constraint: given $x \sim \mathcal{D}$, this constraint doesn't influence the solution of $q$ and $\theta$
    ◯ $f_d(x; \mathcal{D}) := \log \mathbb{E}_{x^*\sim\mathcal{D}}\big[\mathbb{I}_{x^*}(x)\big]$
      § A constraint saying $x$ must equal one of the true data points
      § Or alternatively, the (log) expected similarity of $x$ to the dataset $\mathcal{D}$, with $\mathbb{I}(\cdot)$ as the similarity measure (we'll come back to this later)
    ◯ $\alpha = \beta = 1$

  23. The standard equation
  (Recall the blueprint: Loss, Optimization solver, Model architecture; $\min_\theta \mathcal{L}(\theta)$.)
  $\min_{q, \theta, \xi \ge 0}\; -\,\alpha \mathbb{H}(q) + \beta\, \mathbb{D}\big(q(x, y),\, p_\theta(x, y)\big) + \xi$   s.t.   $-\,\mathbb{E}_q\big[f(x, y)\big] \le \xi$
  Equivalently:
  $\min_{q, \theta}\; -\,\mathbb{E}_q\big[f(x, y)\big] + \beta\, \mathbb{D}\big(q(x, y),\, p_\theta(x, y)\big) - \alpha \mathbb{H}(q)$
  3 terms (a toy evaluation sketch follows below):
  ● Experiences (exogenous regularizations), e.g., data examples, rules
  ● Divergence (fitness), e.g., cross entropy
  ● Uncertainty (self-regularization), e.g., Shannon entropy
  Teacher $q(x, y)$, student $p_\theta(x, y)$, textbook/experience $f(x, y)$
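A toy numerical reading of the three terms, taking the divergence to be cross entropy as in the slide and setting $\alpha = \beta = 1$; all distributions and scores here are illustrative assumptions.

```python
# Toy evaluation of the standard-equation objective on a discrete (x, y) domain:
# -E_q[f] + beta * D(q, p_theta) - alpha * H(q), with D as cross entropy.
# All distributions and scores are illustrative assumptions.
import numpy as np

alpha, beta = 1.0, 1.0
q = np.array([0.7, 0.2, 0.1])                 # teacher q(x, y) over 3 joint configurations
p = np.array([0.5, 0.3, 0.2])                 # student p_theta(x, y)
f = np.array([1.0, 0.0, -1.0])                # experience scores f(x, y)

experience  = -np.sum(q * f)                  # -E_q[f]
divergence  = -beta * np.sum(q * np.log(p))   # beta * cross entropy D(q, p_theta)
uncertainty = alpha * np.sum(q * np.log(q))   # -alpha * H(q)
print("objective:", experience + divergence + uncertainty)
```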

  24. Re-visit unsupervised MLE under SE
  $\min_{q, \theta}\; -\,\alpha \mathbb{H}(q) - \mathbb{E}_q\big[f\big] - \beta\, \mathbb{E}_q\big[\log p_\theta(x, y)\big]$
  $\alpha = \beta = 1$
  $f := f_d(x; \mathcal{D}) = \log \mathbb{E}_{x^*\sim\mathcal{D}}\big[\mathbb{I}_{x^*}(x)\big]$
  $q = q(y|x)$
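A sketch of why these choices recover unsupervised MLE, following the slide's definitions: the data constraint $f_d$ pins the teacher's $x$ to observed data points, and what remains of the objective per data point is exactly the negative ELBO, so the EM/VEM solutions from the earlier slides apply unchanged.

```latex
% With alpha = beta = 1, q = q(y|x), and f = f_d(x; D), the objective at each observed x* becomes
-\,\mathbb{H}\big(q(y|x^*)\big) \;-\; \mathbb{E}_{q(y|x^*)}\big[\log p_\theta(x^*, y)\big]
\;=\; -\,\mathcal{L}(q, \theta)
\;\ge\; -\,\log \textstyle\sum_y p_\theta(x^*, y)
% so minimizing over (q, theta) tightens and maximizes the unsupervised log-likelihood bound.
```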

  25. Re-visit supervised MLE under SE
  $\min_{q, \theta}\; -\,\alpha \mathbb{H}(q) - \mathbb{E}_q\big[f\big] - \beta\, \mathbb{E}_q\big[\log p_\theta(x, y)\big]$
  $\alpha = 1,\; \beta = \epsilon$
  $f := f_d(x, y; \mathcal{D}) = \log \mathbb{E}_{(x^*, y^*)\sim\mathcal{D}}\big[\mathbb{I}_{(x^*, y^*)}(x, y)\big]$
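And a sketch of why $\alpha = 1$, $\beta = \epsilon$ with the joint data constraint recovers supervised MLE, again following the slide's definitions: in the teacher (E) step the $\epsilon$-weighted model term vanishes, leaving the empirical distribution, and the student (M) step is then ordinary supervised MLE.

```latex
\begin{aligned}
\text{teacher:}\quad q(x, y) &\propto \exp\big\{\epsilon \log p_\theta(x, y) + f_d(x, y; \mathcal{D})\big\}
  \;\xrightarrow{\;\epsilon \to 0\;}\; \tilde{p}_{\mathcal{D}}(x, y) \quad \text{(empirical distribution)} \\
\text{student:}\quad \min_\theta\; -\,\epsilon\, \mathbb{E}_{\tilde{p}_{\mathcal{D}}}\big[\log p_\theta(x, y)\big]
  &\;\equiv\; \min_\theta\; -\,\mathbb{E}_{(x^*, y^*)\sim\mathcal{D}}\big[\log p_\theta(x^*, y^*)\big]
  \quad \text{(same minimizer, up to the positive constant } \epsilon\text{)}
\end{aligned}
```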
