
Review: Linear separability (and use of features), class probabilities for linear discriminants



  1. Review
     • Linear separability (and use of features)
     • Class probabilities for linear discriminants
       – sigmoid (logistic) function
     • Applications: USPS, fMRI
     [figure from book: class probability (0 to 1, 0.5 level marked) over feature space (φ1, φ2)]
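
     For reference, a sketch of the logistic model this slide refers to, in standard textbook notation (the feature map φ and weight vector w follow the usual convention rather than anything specific on the slide):

         \[
           P(Y = 1 \mid x, w) = \sigma\big(w^\top \phi(x)\big),
           \qquad
           \sigma(a) = \frac{1}{1 + e^{-a}} .
         \]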

  2. Review
     • Generative vs. discriminative
       – maximum conditional likelihood
     • Logistic regression
     • Weight space
       – each example adds a penalty to all weight vectors that misclassify it
       – penalty is approximately piecewise linear
     [figure: penalty over weight space]
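
     One way to see the "approximately piecewise linear" claim, as a sketch using the ±1 label convention (the course may use 0/1 labels instead): the penalty a single example adds is its negative conditional log-likelihood,

         \[
           -\log P(y \mid x, w) = \log\!\big(1 + e^{-y\, w^\top \phi(x)}\big)
           \approx
           \begin{cases}
             0 & \text{if } y\, w^\top \phi(x) \gg 0 \\
             -y\, w^\top \phi(x) & \text{if } y\, w^\top \phi(x) \ll 0 ,
           \end{cases}
         \]

     i.e. roughly zero for weight vectors that classify the example correctly with a large margin, and roughly linear in the margin for weight vectors that misclassify it badly.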

  3. Example
     [figure]

  4. −log P(Y_{1..3} | X_{1..3}, W)
     [figure]

  5. Generalization: multiple classes
     • One weight vector per class: Y ∈ {1, 2, …, C}
       – P(Y = k) =
       – Z =
     • In 2-class case:
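
     The blanks here were presumably filled in during lecture; a sketch of the standard softmax form implied by "one weight vector per class" (writing w_k for class k's weight vector) is

         \[
           P(Y = k \mid x, W) = \frac{\exp(w_k^\top x)}{Z},
           \qquad
           Z = \sum_{j=1}^{C} \exp(w_j^\top x) .
         \]

     With C = 2 this collapses to the sigmoid of (w_1 - w_2)^\top x, recovering the 2-class case from the earlier slides.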

  6. Multiclass example
     [figure from book]

  7. Priors and conditional MAP
     • P(Y | X, W) =
       – Z =
     • As in linear regression, can put prior on W
       – common priors: L2 (ridge), L1 (sparsity)
     • max_w P(W = w | X, Y)
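
     As a sketch of how the prior enters (λ is an illustrative regularization weight, not notation from the slide), an L2 (ridge) prior turns conditional MAP into

         \[
           \max_w \; \log P(Y \mid X, w) \;-\; \lambda\,\|w\|_2^2 ,
         \]

     and an L1 prior replaces \|w\|_2^2 with \|w\|_1, which encourages sparse weight vectors.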

  8. Software
     • Logistic regression software is easily available: most stats packages provide it
       – e.g., glm function in R
       – or, http://www.cs.cmu.edu/~ggordon/IRLS-example/
     • Most common algorithm: Newton’s method on log-likelihood (or L2-penalized version)
       – called “iteratively reweighted least squares” (IRLS)
       – for L1, slightly harder (less software available)
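
     A minimal NumPy sketch of the Newton / IRLS update for L2-penalized logistic regression (not the course's reference implementation; the toy data, the penalty strength lam, and all variable names are illustrative):

         import numpy as np

         def sigmoid(a):
             return 1.0 / (1.0 + np.exp(-a))

         def irls_logistic(X, y, lam=1e-3, n_iter=20):
             """Newton's method (IRLS) for L2-penalized logistic regression.

             X   : (n, d) design matrix (include a column of ones for a bias term)
             y   : (n,) labels in {0, 1}
             lam : strength of the penalty lam * ||w||^2
             """
             d = X.shape[1]
             w = np.zeros(d)
             for _ in range(n_iter):
                 p = sigmoid(X @ w)                     # current predicted probabilities
                 grad = X.T @ (y - p) - 2 * lam * w     # gradient of penalized log-likelihood
                 R = p * (1 - p)                        # IRLS weights (Bernoulli variances)
                 H = X.T @ (X * R[:, None]) + 2 * lam * np.eye(d)   # negative Hessian
                 w = w + np.linalg.solve(H, grad)       # Newton update
             return w

         # Toy usage: two Gaussian blobs in 2D.
         rng = np.random.default_rng(0)
         X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(+1, 1, (50, 2))])
         X = np.hstack([X, np.ones((100, 1))])          # bias column
         y = np.r_[np.zeros(50), np.ones(50)]
         print("learned weights:", irls_logistic(X, y))

     Keeping lam > 0 also keeps the Hessian well conditioned; with lam = 0 and linearly separable data the weights would grow without bound, as slide 2's picture suggests.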

  9. Historical application: Fisher iris data
     [figure: P(I. virginica) vs. petal length]

  10. [figure]

  11. Bayesian regression
     • In linear and logistic regression, we’ve looked at
       – conditional MLE: max_w P(Y | X, w)
       – conditional MAP: max_w P(W = w | X, Y)
     • But of course, a true Bayesian would turn up their nose at both
       – why?

  12. Sample from posterior
     [figure]

  13. Predictive distribution
     [figure]
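
     The object slides 11–13 are after, written out as a sketch in standard notation (y*, x* denote a test point): rather than committing to a single MLE or MAP weight vector, average predictions over the whole posterior,

         \[
           P(y^* \mid x^*, X, Y) \;=\; \int P(y^* \mid x^*, w)\, P(w \mid X, Y)\, dw .
         \]

     This integral is exactly the kind of high-dimensional integration the rest of the lecture turns to.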

  14. Overfitting
     • Overfit: training likelihood ≫ test likelihood
       – often a result of overconfidence
     • Overfitting is an indicator that the MLE or MAP approximation is a bad one
     • Bayesian inference rarely overfits
       – may still lead to bad results for other reasons!
       – e.g., not enough data, bad model class, …

  15. So, we want the predictive distribution
     • Most of the time…
       – Graphical model is big and highly connected
       – Variables are high-arity or continuous
     • Can’t afford exact inference
       – Inference reduces to numerical integration (and/or summation)
       – We’ll look at randomized algorithms

  16. Numerical integration
     [figure]

  17. 2D is 2 easy!
     • We care about high-D problems
     • Often, much of the mass is hidden in a tiny fraction of the volume
       – simultaneously try to discover it and estimate its amount

  18. Application: SLAM

  19. Integrals in multi-million-D
     [figure from Eliazar and Parr, IJCAI-03]

  20. Simple 1D problem
     [figure]

  21. Uniform sampling
     [figure]

  22. Uniform sampling
     • E(f(X)) =
     • So, … is the desired integral
     • But standard deviation can be big
     • Can reduce it by averaging many samples
     • But only at rate 1/√N
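
     Filling in the standard algebra behind this slide as a sketch (V is the volume of the region we integrate over, matching the Q = 1/V remark on slide 26): for X drawn uniformly,

         \[
           E(f(X)) = \frac{1}{V}\int f(x)\,dx
           \qquad\Longrightarrow\qquad
           \int f(x)\,dx \;\approx\; \frac{V}{N}\sum_{i=1}^{N} f(X_i),
         \]

     and the standard deviation of this estimate shrinks only at the 1/√N rate noted on the slide.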

  23. Importance sampling
     • Instead of X ~ uniform, use X ~ Q(x)
     • Q =
     • Should have Q(x) large where f(x) is large
     • Problem:

  24. Importance sampling
     • Instead of X ~ uniform, use X ~ Q(x)
     • Q =
     • Should have Q(x) large where f(x) is large
     • Problem: E_Q(f(X)) = ∫ Q(x) f(x) dx, which is not the ∫ f(x) dx we want

  25. Importance sampling
     • Define h(x) ≡ f(x)/Q(x)
     • E_Q(h(X)) = ∫ Q(x) h(x) dx
                 = ∫ Q(x) f(x)/Q(x) dx
                 = ∫ f(x) dx

  26. Importance sampling
     • So, take samples of h(X) instead of f(X)
     • W_i = 1/Q(X_i) is the importance weight
     • Q = 1/V yields uniform sampling
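
     A runnable sketch contrasting uniform sampling with importance sampling on a 1D integral (the integrand f, the interval [-1, 1], and the Gaussian proposal Q are illustrative choices, not taken from the slides):

         import numpy as np

         rng = np.random.default_rng(0)

         # Integrand: a narrow bump centred at 0.3; its integral over the real line is
         # 0.05 * sqrt(2*pi) ≈ 0.1253, and essentially all of that mass lies in [-1, 1].
         def f(x):
             return np.exp(-0.5 * ((x - 0.3) / 0.05) ** 2)

         true_value = 0.05 * np.sqrt(2 * np.pi)
         N = 10_000

         # Uniform sampling on [-1, 1]: Q(x) = 1/V with V = 2, so h(x) = f(x)/Q(x) = 2 f(x)
         x_unif = rng.uniform(-1, 1, N)
         est_unif = np.mean(2.0 * f(x_unif))

         # Importance sampling: Q = Normal(0.3, 0.1), chosen to be large where f is large
         mu, sigma = 0.3, 0.1
         x_is = rng.normal(mu, sigma, N)
         q = np.exp(-0.5 * ((x_is - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
         est_is = np.mean(f(x_is) / q)      # samples of h(X) = f(X)/Q(X)

         print(f"true ≈ {true_value:.4f}, uniform: {est_unif:.4f}, importance: {est_is:.4f}")

     Because Q is large wherever f is large, the weighted samples h(X_i) = f(X_i)/Q(X_i) vary far less than f itself does, so the importance-sampling estimate has much lower variance than the uniform one for the same N.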

  27. Importance sampling
     [figure]

  28. Variance
     • How does this help us control variance?
     • Suppose:
       – f big
       – Q small
     • Then h = f/Q:
     • Variance of each weighted sample is
     • Optimal Q?
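
     The blanks here have standard answers; as a sketch, for the weighted sample h(X) = f(X)/Q(X),

         \[
           \mathrm{Var}_Q\big(h(X)\big) \;=\; \int \frac{f(x)^2}{Q(x)}\,dx \;-\; \Big(\int f(x)\,dx\Big)^{2},
         \]

     so wherever f is big but Q is small, f²/Q blows up and each weighted sample has huge variance. The variance-minimizing proposal is Q*(x) ∝ |f(x)|, which drives the variance to zero when f is nonnegative (but requires knowing essentially the answer we are after).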

  29. Importance sampling, part II
     • Suppose we want ∫ f(x) dx = ∫ P(x) g(x) dx = E_P(g(X))
     • Pick N samples X_i from proposal Q(X)
     • Average W_i g(X_i), where the importance weight is
       – W_i =

  30. Importance sampling, part II
     • Suppose we want ∫ f(x) dx = ∫ P(x) g(x) dx = E_P(g(X))
     • Pick N samples X_i from proposal Q(X)
     • Average W_i g(X_i), where the importance weight is
       – W_i = P(X_i)/Q(X_i)
     • E_Q(W g(X)) = ∫ Q(x) [P(x)/Q(x)] g(x) dx = ∫ P(x) g(x) dx

  31. Two variants of IS
     • Same algorithm, different terminology
       – want ∫ f(x) dx vs. E_P(f(X))
       – W = 1/Q vs. W = P/Q

  32. Parallel importance sampling
     • Suppose we want ∫ f(x) dx = ∫ P(x) g(x) dx = E_P(g(X))
     • But P(x) is unnormalized (e.g., represented by a factor graph): we know only Z·P(x)

  33. Parallel IS
     • Pick N samples X_i from proposal Q(X)
     • If we knew W_i = P(X_i)/Q(X_i), we could do IS
     • Instead, set … and …
     • Then:

  34. Parallel IS
     • Final estimate:
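
     The slides leave the construction blank; what they appear to be building toward is the standard self-normalized importance sampling estimate. A runnable sketch, assuming we can evaluate the unnormalized density p̃(x) = Z·P(x) pointwise (here p̃ is an unnormalized standard normal and g(x) = x², both illustrative):

         import numpy as np

         rng = np.random.default_rng(0)

         # Unnormalized target: p_tilde(x) = Z * P(x).  Here P is a standard normal,
         # so Z = sqrt(2*pi) and the true value of E_P(g(X)) = E_P(X^2) is 1.
         def p_tilde(x):
             return np.exp(-0.5 * x ** 2)

         def g(x):
             return x ** 2

         # Proposal Q we can both sample from and evaluate: Normal(0, 2)
         mu_q, sigma_q = 0.0, 2.0
         N = 100_000
         x = rng.normal(mu_q, sigma_q, N)
         q = np.exp(-0.5 * ((x - mu_q) / sigma_q) ** 2) / (sigma_q * np.sqrt(2 * np.pi))

         w = p_tilde(x) / q      # unnormalized weights w_i = Z P(X_i) / Q(X_i)
         W = w / w.sum()         # self-normalized weights W_i (sum to 1); unknown Z cancels

         estimate = np.sum(W * g(x))   # final estimate of E_P(g(X)); should be close to 1
         Z_est = w.mean()              # free by-product: estimates Z = sqrt(2*pi) ≈ 2.507
         print(f"E_P(g) ≈ {estimate:.3f}, Z ≈ {Z_est:.3f}")

     Normalizing the weights cancels the unknown Z, which is why this works when P is given only up to a constant (e.g., by a factor graph), and the mean of the unnormalized weights is itself an estimate of Z.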
