Probability and Statistical Decision Theory


  1. Tufts COMP 135: Introduction to Machine Learning (https://www.cs.tufts.edu/comp/135/2019s/). Probability and Statistical Decision Theory. Many slides attributable to: Prof. Mike Hughes; Erik Sudderth (UCI); Finale Doshi-Velez (Harvard); James, Witten, Hastie, Tibshirani (ISL/ESL books).

  2. Logistics • Recitation tonight: 7:30-8:30pm, Halligan 111B • More on pipelines and feature transforms • Cross validation

  3. Unit Objectives • Probability basics: discrete random variables, continuous random variables • Decision theory: making optimal predictions • Limits of learning: the curse of dimensionality, the bias-variance tradeoff

  4. What will we learn? [Diagram: the supervised learning pipeline. Training data are label pairs $\{x_n, y_n\}_{n=1}^N$; learning produces a predictor mapping data $x$ to label $y$; a performance measure evaluates predictions on the task. Unsupervised learning and reinforcement learning appear alongside.]

  5. Task: Regression. $y$ is a numeric variable, e.g. sales in $$. [Plot: scatter of $y$ against $x$ with a fitted regression curve.]

  6. Model Complexity vs Error [Plot: error vs. model complexity, with underfitting at low complexity and overfitting at high complexity.]

  7. Today: Bias and Variance. Credit: Scott Fortmann-Roe, http://scott.fortmann-roe.com/docs/BiasVariance.html

  8. Model Complexity vs Error [Plot: the error curve again, annotated with the high-bias (low complexity) and high-variance (high complexity) regimes.]

  9. Discrete Random Variables. Examples: • Coin flip! Heads or tails? • Dice roll! 1 or 2 or ... 6? In general, a discrete random variable is defined by: • A countable set of all possible outcomes • A probability value for each outcome

  10. Probability Mass Function. Notation: $X$ is a random variable; $x$ is a particular observed value; the probability of the observation is $p(X = x)$. The function $p$ is a probability mass function (pmf): it maps possible values to probabilities in $[0, 1]$ and must sum to one over the domain of $X$.
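
A minimal sketch (plain Python; the dict representation and helper name are mine, not the course's) of a pmf for a fair die, with a check of both pmf conditions:

    # Hypothetical representation: a pmf as a dict mapping outcomes to probabilities.
    fair_die = {face: 1.0 / 6.0 for face in range(1, 7)}

    def is_valid_pmf(pmf, tol=1e-9):
        """Check the two pmf conditions: values in [0, 1], total mass equal to 1."""
        nonneg = all(0.0 <= p <= 1.0 for p in pmf.values())
        sums_to_one = abs(sum(pmf.values()) - 1.0) <= tol
        return nonneg and sums_to_one

    print(is_valid_pmf(fair_die))  # True
    print(fair_die[3])             # p(X = 3) = 0.1666...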

  11. Pair exercise • Draw the pmf for a normal 6-sided dice roll • Draw the pmf if there are: 2 sides with 1 pip, 0 sides with 2 pips

  12. Expected Values. What is the expected value of a dice roll? Expected means probability-weighted average: $E[X] = \sum_x p(X = x)\, x$
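
As a quick worked check (my example, not the slide's): for a fair die, $E[X] = \frac{1}{6}(1 + 2 + \dots + 6) = 3.5$.

    # Probability-weighted average for a fair six-sided die.
    fair_die = {face: 1.0 / 6.0 for face in range(1, 7)}
    expected_value = sum(p * x for x, p in fair_die.items())
    print(expected_value)  # 3.5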

  13. Joint Probability. [Table: joint distribution of candidate preference $X$ and voter age $Y$.] Example: $p(X = \text{candidate A} \text{ AND } Y = \text{young})$

  14. Marginal Probability. [Table: the same joint table of $X$ and $Y$, with marginal $p(X)$ and marginal $p(Y)$ obtained by summing out the other variable.]

  15. Conditional Probability. What is the probability of support for candidate A, if we assume that the voter is young? Goal: $p(X = \text{candidate A} \mid Y = \text{young})$. [Same joint table of $X$ and $Y$ with marginals.] Try it with your partner!

  16. Conditional Probability. What is the probability of support for candidate A, if we assume that the voter is young? Goal: $p(X = \text{candidate A} \mid Y = \text{young})$. [Same table, with the answer worked out.]

  17. The Rules of Probability. Product rule: $p(X, Y) = p(X \mid Y)\, p(Y)$
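
A small sketch of these rules on a toy table (assuming numpy; the numbers are invented for illustration, since the slides' actual voter counts are not reproduced here):

    import numpy as np

    # Hypothetical joint table p(X, Y): rows are candidates (A, B),
    # columns are age groups (young, old). Entries must sum to 1.
    joint = np.array([[0.20, 0.25],   # X = candidate A
                      [0.30, 0.25]])  # X = candidate B

    p_X = joint.sum(axis=1)  # marginal p(X): sum out Y
    p_Y = joint.sum(axis=0)  # marginal p(Y): sum out X

    # Conditional via the product rule: p(X | Y) = p(X, Y) / p(Y)
    p_A_given_young = joint[0, 0] / p_Y[0]
    print(p_X, p_Y, p_A_given_young)  # [0.45 0.55] [0.5 0.5] 0.4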

  18. Continuous Random Variables. Any r.v. whose possible outcomes are not a discrete set, but instead take values on a number line. Examples: a uniform draw between 0 and 1; a draw from a Gaussian "bell curve" distribution.

  19. Probability Density Function • Generalizes the pmf for discrete r.v.s to the continuous case • Any pdf $p(x)$ must satisfy two properties: $\forall x: p(x) \ge 0$ and $\int_x p(x)\, dx = 1$

  20. Example: consider a uniform distribution over the entire real line (from $-\infty$ to $+\infty$). Draw the pdf, and verify whether it can meet the required conditions (nonnegative, integrates to one). Is there a problem here?

  21. Plots of Gaussian pdfs. [Plots: Gaussian pdfs at several scales.] What do you notice about the y-axis values? Is there a problem here?

  22. Probability Density Function • Generalizes the pmf for discrete r.v.s to the continuous case • Any pdf $p(x)$ must satisfy two properties: $\forall x: p(x) \ge 0$ and $\int_x p(x)\, dx = 1$ • The value of $p(x)$ can take ANY value > 0, even sometimes larger than 1 • Should NOT be interpreted as "probability of drawing exactly $x$" • Should be interpreted as "density at a vanishingly small interval around $x$" • Remember: density = mass / volume
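
To see a density exceed 1 concretely (a sketch assuming numpy and scipy; the specific sigma is my choice), evaluate a narrow Gaussian at its mean and confirm the total area is still 1:

    import numpy as np
    from scipy.stats import norm

    # A Gaussian with small sigma concentrates its mass in a tiny interval,
    # so the density at the mean exceeds 1 even though the total area is 1.
    sigma = 0.1
    print(norm.pdf(0.0, loc=0.0, scale=sigma))  # ~3.989, a legal density value

    # Numerical check that the pdf still integrates to one (Riemann sum).
    xs = np.linspace(-1.0, 1.0, 100_001)
    dx = xs[1] - xs[0]
    print((norm.pdf(xs, 0.0, sigma) * dx).sum())  # ~1.0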

  23. Continuous Expectations: $E[X] = \int_{x \in \text{domain}(X)} x\, p(x)\, dx$, and for a function $h$: $E[h(X)] = \int_{x \in \text{domain}(X)} h(x)\, p(x)\, dx$

  24. Approximating Expectations. Use "Monte Carlo": the average of a sample! • 1) Draw $S$ i.i.d. samples from the distribution: $x_1, x_2, \ldots, x_S \sim p(x)$ • 2) Compute the mean of these sampled values: $E[h(X)] \approx \frac{1}{S} \sum_{s=1}^{S} h(x_s)$. For any function $h$, this random estimator is unbiased (its mean equals the true expectation). As the number of samples $S$ increases, the variance of the estimator decreases.
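
A minimal sketch of this recipe (assuming numpy; the test function $h(x) = x^2$ and the standard normal are my choices), where the true value is $E[X^2] = 1$:

    import numpy as np

    rng = np.random.default_rng(0)

    def mc_expectation(h, sampler, num_samples):
        """Monte Carlo estimate: mean of h over i.i.d. samples from p(x)."""
        samples = sampler(num_samples)
        return h(samples).mean()

    # E[X^2] under a standard normal is Var(X) + E[X]^2 = 1.
    for S in (100, 10_000, 1_000_000):
        est = mc_expectation(lambda x: x ** 2, rng.standard_normal, S)
        print(S, est)  # estimates concentrate near 1.0 as S grows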

  25. Statistical Decision Theory • See the ESL textbook, Ch. 2 and Ch. 7

  26. How do we predict best if we know the conditional probability? Assume we have: a specific input $x$ of interest; a known "true" conditional $p(Y \mid X)$; an error metric we care about. How should we set our predictor $\hat{y}$? Minimize the expected error! $\min_{\hat{y}} E[\text{err}(Y, \hat{y}) \mid X = x]$. Key ideas: the prediction will be a scalar; the conditional distribution $p(Y \mid X)$ tells us everything we need to know.

  27. Expected $y$ at a given fixed $x$: $E[Y \mid X = x] = \int_y y\, p(y \mid X = x)\, dy$

  28. Recall from HW1 • Two constant-value estimators: the mean of the training set, the median of the training set • Two possible error metrics: squared error, absolute error. Which estimator did best under which error metric?

  29. Minimize expected squared error. Assume we have: a specific input $x$ of interest; a known "true" conditional $p(y \mid x)$. $E[\text{err}(Y, \hat{y}) \mid X = x] = \int_y (y - \hat{y})^2\, p(y \mid X = x)\, dy$. How should we set our predictor $\hat{y}$ to minimize the expected error, $\min_{\hat{y}} E[\text{err}(Y, \hat{y}) \mid X = x]$? What is your intuition from HW1? Express it in terms of $p(Y \mid X = x)$...

  30. Minimize expected squared error. Assume we have: a specific input $x$ of interest; a known "true" conditional $p(y \mid x)$. $E[\text{err}(Y, \hat{y}) \mid X = x] = \int_y (y - \hat{y})^2\, p(y \mid X = x)\, dy$. Optimal predictor for squared error: the mean $y$ value under $p(Y \mid X = x)$, i.e. $\hat{y}^* = E[Y \mid X = x]$. In practice: the mean of sampled $y$ values at/around $x$.
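
One line of calculus (not shown on the slide, but standard) confirms this: differentiate the expected squared error with respect to $\hat{y}$ and set it to zero.

\[
\frac{d}{d\hat{y}} \int_y (y - \hat{y})^2\, p(y \mid X = x)\, dy
= -2 \int_y (y - \hat{y})\, p(y \mid X = x)\, dy
= -2 \big( E[Y \mid X = x] - \hat{y} \big) = 0
\;\Rightarrow\; \hat{y}^* = E[Y \mid X = x]
\]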

  31. Minimize expected absolute error. Assume we have: a specific input $x$ of interest; a known "true" conditional $p(y \mid x)$. $E[\text{err}(Y, \hat{y}) \mid X = x] = \int_y |y - \hat{y}|\, p(y \mid X = x)\, dy$. How should we set our predictor $\hat{y}$ to minimize the expected error? What is your intuition from HW1?

  32. Minimize expected absolute error. Assume we have: a specific input $x$ of interest; a known "true" conditional $p(y \mid x)$. $E[\text{err}(Y, \hat{y}) \mid X = x] = \int_y |y - \hat{y}|\, p(y \mid X = x)\, dy$. Optimal predictor for absolute error: the median $y$ value under $p(Y \mid X = x)$, i.e. $\hat{y}^* = \text{median}(p(Y \mid X = x))$. In practice: the median of sampled $y$ values at/around $x$.
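
A quick simulation echoing the HW1 comparison (my sketch, assuming numpy; the skewed lognormal sample is an arbitrary choice that separates mean from median):

    import numpy as np

    rng = np.random.default_rng(0)
    y = rng.lognormal(mean=0.0, sigma=1.0, size=10_000)  # skewed: mean != median

    for name, y_hat in [("mean", y.mean()), ("median", np.median(y))]:
        mse = ((y - y_hat) ** 2).mean()
        mae = np.abs(y - y_hat).mean()
        print(f"{name:6s} squared error: {mse:.3f}  absolute error: {mae:.3f}")
    # The mean wins on squared error; the median wins on absolute error.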

  33. Minimizing error with K-NN. Ideal: know the "true" conditional $p(y \mid x)$. Approximation: use a neighborhood around $x$ and take the average of the $y$ values in that neighborhood (see the sketch below). If we have enough training data, K-NN is a good approximation; some theorems say the K-NN estimate becomes ideal as the number of examples $N$ gets infinitely large. Problem in practice: we never have enough data, especially if the feature dimension is large.
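
A minimal sketch of that neighborhood average (assuming numpy; the 1-D feature and sine target are my simplifications):

    import numpy as np

    def knn_regress(x_query, X_train, y_train, k=5):
        """Approximate E[Y | X = x] by averaging y over the k nearest neighbors."""
        dists = np.abs(X_train - x_query)   # 1-D feature for simplicity
        nearest = np.argsort(dists)[:k]
        return y_train[nearest].mean()

    rng = np.random.default_rng(0)
    X_train = rng.uniform(0, 2 * np.pi, size=500)
    y_train = np.sin(X_train) + rng.normal(0, 0.1, size=500)
    print(knn_regress(np.pi / 2, X_train, y_train))  # close to sin(pi/2) = 1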

  34. Curse of Dimensionality
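
A standard illustration of the curse (my numbers, not the slide's): if features are uniform on a unit hypercube, the edge length $r$ of a sub-cube that captures a fraction $p$ of the data is $r = p^{1/d}$, which approaches 1 as the dimension $d$ grows, so "local" neighborhoods stop being local.

    # Edge length needed for a sub-cube of the unit hypercube to hold
    # a fraction p of uniformly distributed data: r = p ** (1/d).
    p = 0.10
    for d in (1, 2, 10, 100):
        print(d, p ** (1.0 / d))
    # d=1: 0.100   d=2: 0.316   d=10: 0.794   d=100: 0.977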

  35. MSE as dimension increases. [Plot: test MSE vs. number of feature dimensions for linear regression (--) and K-neighbors regression (•). Credit: ISL textbook, Fig 3.20]

  36. Write MSE via Bias & Variance. Here $y$ is the known "true" response value at a given fixed input $x$, and $\hat{y}$ is a random variable obtained by fitting an estimator to a random sample of $N$ training examples $(x^{tr}, y^{tr})$, then predicting at the fixed $x$: $E\big[(\hat{y}(x^{tr}, y^{tr}) - y)^2\big] = E\big[(\hat{y} - y)^2\big] = E\big[\hat{y}^2 - 2\hat{y}y + y^2\big] = E\big[\hat{y}^2\big] - 2\bar{y}\,y + y^2$, where $\bar{y} \triangleq E[\hat{y}]$.

  37. Write MSE via Bias & Variance (continued). Add a net value of zero (pick $0 = -a + a$ with $a = \bar{y}^2$): $E\big[(\hat{y} - y)^2\big] = E\big[\hat{y}^2\big] - \bar{y}^2 + \bar{y}^2 - 2\bar{y}\,y + y^2 = \big(E[\hat{y}^2] - \bar{y}^2\big) + (\bar{y} - y)^2 = \text{Var}(\hat{y}) + \text{Bias}(\hat{y})^2$.
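
A simulation sketch of this identity (assumptions mine: a sine truth, noisy random training sets, and polynomial fits of two degrees), checking numerically that MSE equals squared bias plus variance at a fixed query point:

    import numpy as np

    rng = np.random.default_rng(0)
    x_test, f = 1.0, np.sin          # fixed query point and "true" function

    def fit_and_predict(degree):
        """Fit a polynomial to one random training set; predict at x_test."""
        X = rng.uniform(0, 2 * np.pi, 30)
        y = f(X) + rng.normal(0, 0.3, 30)
        coef = np.polyfit(X, y, degree)
        return np.polyval(coef, x_test)

    for degree in (1, 9):
        preds = np.array([fit_and_predict(degree) for _ in range(2000)])
        bias_sq = (preds.mean() - f(x_test)) ** 2
        var = preds.var()
        mse = ((preds - f(x_test)) ** 2).mean()
        print(degree, mse, bias_sq + var)  # the two MSE columns match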
