

  1. Regression: Probabilistic perspective
CE-717: Machine Learning, Sharif University of Technology
M. Soleymani, Fall 2018

  2. Curve fitting: probabilistic perspective
- Describing uncertainty over the value of the target variable as a probability distribution
- Example: a fitted curve $f(x; \boldsymbol{w})$, its value $f(x_0; \boldsymbol{w})$ at an input $x_0$, and the conditional density $p(t \mid x_0, \boldsymbol{w}, \sigma)$ centered on it

  3. The learning diagram including noisy target
- The deterministic target function $h: \mathcal{X} \to \mathcal{T}$ is replaced by a target distribution $P(t \mid \boldsymbol{x})$
- Training examples $(\boldsymbol{x}^{(1)}, t^{(1)}), \ldots, (\boldsymbol{x}^{(N)}, t^{(N)})$ are drawn from the joint distribution $P(\boldsymbol{x}, t) = P(\boldsymbol{x}) \, P(t \mid \boldsymbol{x})$: the target distribution combined with the distribution on the features
- Learning produces a final hypothesis $f: \mathcal{X} \to \mathcal{T}$ with $f(\boldsymbol{x}) \approx h(\boldsymbol{x})$
[Y. S. Abu-Mostafa, 2012]

  4. Curve fitting: probabilistic perspective (example)
- Special case: observed output = function + noise
$t = f(\boldsymbol{x}; \boldsymbol{w}) + \varepsilon$, e.g., $\varepsilon \sim \mathcal{N}(0, \sigma^2)$
- Noise: whatever we cannot capture with our chosen family of functions

  5. Curve fitting: probabilistic perspective (example)
- Best regression: $\mathbb{E}[t \mid \boldsymbol{x}] = \mathbb{E}[f(\boldsymbol{x}; \boldsymbol{w}) + \varepsilon] = f(\boldsymbol{x}; \boldsymbol{w})$, with $\varepsilon \sim \mathcal{N}(0, \sigma^2)$
- $f(\boldsymbol{x}; \boldsymbol{w})$ is trying to capture the mean of the observations $t$ given the input $\boldsymbol{x}$
- $\mathbb{E}[t \mid \boldsymbol{x}]$: conditional expectation of $t$ given $\boldsymbol{x}$, evaluated according to the model (not according to the underlying distribution $P$)
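A minimal NumPy sketch of this data-generating model (the linear choice of $f$, the parameter values, and all function names here are illustrative assumptions, not from the slides): averaging observations near a fixed input empirically recovers $\mathbb{E}[t \mid x] = f(x; \boldsymbol{w})$.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x, w):
    # Deterministic part of the model; here an assumed univariate line w0 + w1*x
    return w[0] + w[1] * x

# Observed output = function + noise: t = f(x; w) + eps, eps ~ N(0, sigma^2)
w_true, sigma = np.array([1.0, 2.0]), 0.3
x = rng.uniform(0.0, 1.0, size=100_000)
t = f(x, w_true) + rng.normal(0.0, sigma, size=x.shape)

# E[t | x] = f(x; w): the average of t over inputs near x0 approaches f(x0; w)
x0 = 0.5
near = np.abs(x - x0) < 0.01
print(t[near].mean(), f(x0, w_true))  # both close to 2.0
```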

  6. Curve fitting using probabilistic estimation
- Maximum Likelihood (ML) estimation
- Maximum A Posteriori (MAP) estimation
- Bayesian approach

  7. Maximum likelihood estimation
- Given observations $\mathcal{D} = \left\{ \left(\boldsymbol{x}^{(i)}, t^{(i)}\right) \right\}_{i=1}^{N}$
- Find the parameters that maximize the (conditional) likelihood of the outputs:
$L(\mathcal{D}; \boldsymbol{\theta}) = p(\boldsymbol{t} \mid \boldsymbol{X}, \boldsymbol{\theta}) = \prod_{i=1}^{N} p\left(t^{(i)} \mid \boldsymbol{x}^{(i)}, \boldsymbol{\theta}\right)$
where $\boldsymbol{t} = \left[t^{(1)}, \ldots, t^{(N)}\right]^T$ and the design matrix $\boldsymbol{X}$ has rows $\left[1, x_1^{(i)}, \ldots, x_d^{(i)}\right]$

  8. Maximum likelihood estimation (cont'd)
$t = f(\boldsymbol{x}; \boldsymbol{w}) + \varepsilon$, $\varepsilon \sim \mathcal{N}(0, \sigma^2)$
- $t$ given $\boldsymbol{x}$ is normally distributed with mean $f(\boldsymbol{x}; \boldsymbol{w})$ and variance $\sigma^2$:
$p(t \mid \boldsymbol{x}, \boldsymbol{w}, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{1}{2\sigma^2} \left( t - f(\boldsymbol{x}; \boldsymbol{w}) \right)^2 \right)$
- We model the uncertainty in the predictions, not just the mean

  9. Maximum likelihood estimation (cont'd)
- Example: univariate linear function
$p(t \mid x, \boldsymbol{w}, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{1}{2\sigma^2} \left( t - w_0 - w_1 x \right)^2 \right)$
- Why is a line far from the data a bad fit according to the likelihood criterion? $p(t \mid x, \boldsymbol{w}, \sigma^2)$ will be near zero for most of the points (as they are far from the line)
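As a sketch of the likelihood criterion on this slide (the data, parameter values, and function name are assumptions for illustration): computing the per-point Gaussian densities directly shows that a line far from the data drives most densities toward zero, collapsing their product.

```python
import numpy as np

def gaussian_likelihood(t, x, w, sigma):
    # p(t | x, w, sigma^2) for the univariate linear model: mean w0 + w1*x
    mean = w[0] + w[1] * x
    return np.exp(-(t - mean) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 20)
t = 1.0 + 2.0 * x + rng.normal(0.0, 0.1, 20)

good_line, bad_line = np.array([1.0, 2.0]), np.array([3.0, -2.0])
# Points far from the bad line get near-zero density, so its likelihood collapses
print(gaussian_likelihood(t, x, good_line, 0.1).prod())  # comparatively large
print(gaussian_likelihood(t, x, bad_line, 0.1).prod())   # essentially zero
```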

  10. Maximum likelihood estimation (cont'd)
- Maximize the likelihood of the outputs (i.i.d.):
$L(\mathcal{D}; \boldsymbol{w}, \sigma^2) = \prod_{i=1}^{N} p\left(t^{(i)} \mid \boldsymbol{x}^{(i)}, \boldsymbol{w}, \sigma^2\right)$
$\widehat{\boldsymbol{w}} = \operatorname*{arg\,max}_{\boldsymbol{w}} L(\mathcal{D}; \boldsymbol{w}, \sigma^2) = \operatorname*{arg\,max}_{\boldsymbol{w}} \prod_{i=1}^{N} p\left(t^{(i)} \mid \boldsymbol{x}^{(i)}, \boldsymbol{w}, \sigma^2\right)$

  11. Maximum likelihood estimation (cont'd)
- It is often easier (but equivalent) to maximize the log-likelihood:
$\widehat{\boldsymbol{w}} = \operatorname*{arg\,max}_{\boldsymbol{w}} \ln p(\boldsymbol{t} \mid \boldsymbol{X}, \boldsymbol{w}, \sigma^2)$
$\ln \prod_{i=1}^{N} p\left(t^{(i)} \mid \boldsymbol{x}^{(i)}, \boldsymbol{w}, \sigma^2\right) = \sum_{i=1}^{N} \ln \mathcal{N}\left(t^{(i)} \mid f(\boldsymbol{x}^{(i)}; \boldsymbol{w}), \sigma^2\right) = -N \ln \sigma - \frac{N}{2} \ln 2\pi - \frac{1}{2\sigma^2} \sum_{i=1}^{N} \left( t^{(i)} - f(\boldsymbol{x}^{(i)}; \boldsymbol{w}) \right)^2$
- The last sum is the sum-of-squares error

  12. Maximum likelihood estimation (cont'd)
- Maximizing the log-likelihood (when we assume $t = f(\boldsymbol{x}; \boldsymbol{w}) + \varepsilon$, $\varepsilon \sim \mathcal{N}(0, \sigma^2)$) is equivalent to minimizing the SSE
- Let $\widehat{\boldsymbol{w}}$ be the maximum likelihood (here least squares) setting of the parameters
- What is the maximum likelihood estimate of $\sigma^2$? Setting $\frac{\partial \ln L(\mathcal{D}; \boldsymbol{w}, \sigma^2)}{\partial \sigma^2} = 0$ gives
$\widehat{\sigma}^2 = \frac{1}{N} \sum_{i=1}^{N} \left( t^{(i)} - f(\boldsymbol{x}^{(i)}; \widehat{\boldsymbol{w}}) \right)^2$
i.e., the mean squared prediction error
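A minimal sketch of both results on slides 11 and 12 (the synthetic data and all names are assumptions): maximizing the Gaussian log-likelihood over $\boldsymbol{w}$ reduces to least squares, and the ML noise variance is the mean squared residual.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 50)
t = 1.0 + 2.0 * x + rng.normal(0.0, 0.3, 50)

# Design matrix with a bias column: row i is [1, x^(i)]
X = np.column_stack([np.ones_like(x), x])

# Maximizing the Gaussian log-likelihood over w = solving least squares
w_hat, *_ = np.linalg.lstsq(X, t, rcond=None)

# ML estimate of sigma^2: the mean squared prediction error on the training set
residuals = t - X @ w_hat
sigma2_hat = np.mean(residuals ** 2)
print(w_hat, sigma2_hat)  # near [1, 2] and near 0.09
```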

  13. Maximum likelihood estimation (cont'd)
- Generally, maximizing the log-likelihood is equivalent to minimizing the empirical loss when the loss is defined as the negative log-probability:
$\mathrm{Loss}\left( t^{(i)}, f(\boldsymbol{x}^{(i)}; \boldsymbol{w}) \right) = -\ln p\left( t^{(i)} \mid \boldsymbol{x}^{(i)}, \boldsymbol{\theta} \right)$
- More general distributions for $p(t \mid \boldsymbol{x})$ can be considered
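A sketch of the negative log-probability loss for the Gaussian case (function name assumed): it differs from the squared error only by an additive constant and a positive scale, which is why minimizing the empirical NLL reproduces least squares for this noise model.

```python
import numpy as np

def nll_loss(t, prediction, sigma):
    # Loss(t, f(x; w)) = -ln p(t | x, w, sigma^2) under Gaussian noise
    return 0.5 * np.log(2 * np.pi * sigma ** 2) + (t - prediction) ** 2 / (2 * sigma ** 2)

# The sigma term is constant in the prediction, so only the squared error
# matters when optimizing w; here the difference is 0.5**2 / 2 = 0.125
print(nll_loss(1.0, 0.5, sigma=1.0) - nll_loss(1.0, 1.0, sigma=1.0))
```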

  14. Maximum A Posteriori (MAP) estimation
- MAP: given observations $\mathcal{D}$, find the parameters that maximize the probability of the parameters after observing the data (the posterior):
$\boldsymbol{\theta}^{\mathrm{MAP}} = \operatorname*{arg\,max}_{\boldsymbol{\theta}} p(\boldsymbol{\theta} \mid \mathcal{D})$
- Since $p(\boldsymbol{\theta} \mid \mathcal{D}) \propto p(\mathcal{D} \mid \boldsymbol{\theta}) \, p(\boldsymbol{\theta})$:
$\boldsymbol{\theta}^{\mathrm{MAP}} = \operatorname*{arg\,max}_{\boldsymbol{\theta}} p(\mathcal{D} \mid \boldsymbol{\theta}) \, p(\boldsymbol{\theta})$

  15. Maximum A Posteriori (MAP) estimation
- Given observations $\mathcal{D} = \left\{ \left(\boldsymbol{x}^{(i)}, t^{(i)}\right) \right\}_{i=1}^{N}$:
$\max_{\boldsymbol{w}} p(\boldsymbol{w} \mid \boldsymbol{X}, \boldsymbol{t}), \qquad p(\boldsymbol{w} \mid \boldsymbol{X}, \boldsymbol{t}) \propto p(\boldsymbol{t} \mid \boldsymbol{X}, \boldsymbol{w}) \, p(\boldsymbol{w})$
- Gaussian prior on the weights:
$p(\boldsymbol{w}) = \mathcal{N}(\boldsymbol{0}, \alpha^2 \boldsymbol{I}) = \frac{1}{\left(\sqrt{2\pi}\,\alpha\right)^{d+1}} \exp\left( -\frac{1}{2\alpha^2} \boldsymbol{w}^T \boldsymbol{w} \right)$

  16. Maximum A Posteriori (MAP) estimation
- Given observations $\mathcal{D} = \left\{ \left(\boldsymbol{x}^{(i)}, t^{(i)}\right) \right\}_{i=1}^{N}$:
$\max_{\boldsymbol{w}} \ln \left[ p(\boldsymbol{t} \mid \boldsymbol{X}, \boldsymbol{w}, \sigma^2) \, p(\boldsymbol{w}) \right] \;\equiv\; \min_{\boldsymbol{w}} \frac{1}{\sigma^2} \sum_{i=1}^{N} \left( t^{(i)} - f(\boldsymbol{x}^{(i)}; \boldsymbol{w}) \right)^2 + \frac{1}{\alpha^2} \boldsymbol{w}^T \boldsymbol{w}$
- Equivalent to regularized SSE with $\lambda = \frac{\sigma^2}{\alpha^2}$
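A sketch of the MAP estimate under the $\mathcal{N}(\boldsymbol{0}, \alpha^2 \boldsymbol{I})$ prior (the data and names are assumed for illustration): it is exactly ridge regression with $\lambda = \sigma^2 / \alpha^2$.

```python
import numpy as np

def map_estimate(X, t, sigma2, alpha2):
    # Minimizer of (1/sigma^2)*SSE + (1/alpha^2)*w'w: ridge with lam = sigma2/alpha2
    lam = sigma2 / alpha2
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ t)

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 50)
X = np.column_stack([np.ones_like(x), x])
t = 1.0 + 2.0 * x + rng.normal(0.0, 0.3, 50)

print(map_estimate(X, t, sigma2=0.09, alpha2=1.0))  # shrunk toward 0 relative to ML
```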

  17. Bayesian approach
- Given observations $\mathcal{D} = \left\{ \left(\boldsymbol{x}^{(i)}, t^{(i)}\right) \right\}_{i=1}^{N}$
- Instead of a point estimate of the parameters, average the predictions over the posterior distribution of the parameters:
$p(t \mid \boldsymbol{x}, \mathcal{D}) = \int p(t \mid \boldsymbol{w}, \boldsymbol{x}) \, p(\boldsymbol{w} \mid \mathcal{D}) \, d\boldsymbol{w}$
- Example of prior distribution: $p(\boldsymbol{w}) = \mathcal{N}(\boldsymbol{0}, \alpha^2 \boldsymbol{I})$
- In this case the posterior is Gaussian, $p(\boldsymbol{w} \mid \mathcal{D}) = \mathcal{N}\left(\boldsymbol{m}_N, \boldsymbol{T}_N^{-1}\right)$, with
$\boldsymbol{m}_N = \frac{1}{\sigma^2} \boldsymbol{T}_N^{-1} \boldsymbol{X}^T \boldsymbol{t}, \qquad \boldsymbol{T}_N = \frac{1}{\alpha^2} \boldsymbol{I} + \frac{1}{\sigma^2} \boldsymbol{X}^T \boldsymbol{X}$

  18. Bayesian approach (cont'd)
- Likelihood of the linear-Gaussian model:
$p(\mathcal{D} \mid \boldsymbol{w}) = L(\mathcal{D}; \boldsymbol{w}, \sigma^2) = \prod_{i=1}^{N} p\left( t^{(i)} \mid \boldsymbol{w}^T \boldsymbol{x}^{(i)}, \sigma^2 \right), \qquad p\left( t^{(i)} \mid f(\boldsymbol{x}^{(i)}; \boldsymbol{w}), \sigma^2 \right) = \mathcal{N}\left( t^{(i)} \mid \boldsymbol{w}^T \boldsymbol{x}^{(i)}, \sigma^2 \right)$
- Prior: $p(\boldsymbol{w}) = \mathcal{N}(\boldsymbol{0}, \alpha^2 \boldsymbol{I})$; posterior: $p(\boldsymbol{w} \mid \mathcal{D}) \propto p(\mathcal{D} \mid \boldsymbol{w}) \, p(\boldsymbol{w}) = \mathcal{N}\left( \boldsymbol{m}_N, \boldsymbol{T}_N^{-1} \right)$ with
$\boldsymbol{m}_N = \frac{1}{\sigma^2} \boldsymbol{T}_N^{-1} \boldsymbol{X}^T \boldsymbol{t}, \qquad \boldsymbol{T}_N = \frac{1}{\alpha^2} \boldsymbol{I} + \frac{1}{\sigma^2} \boldsymbol{X}^T \boldsymbol{X}$
- Predictive distribution:
$p(t \mid \boldsymbol{x}, \mathcal{D}) = \int p(t \mid \boldsymbol{w}, \boldsymbol{x}) \, p(\boldsymbol{w} \mid \mathcal{D}) \, d\boldsymbol{w} = \mathcal{N}\left( t \mid \boldsymbol{m}_N^T \boldsymbol{x}, \ \sigma_N^2(\boldsymbol{x}) \right), \qquad \sigma_N^2(\boldsymbol{x}) = \sigma^2 + \boldsymbol{x}^T \boldsymbol{T}_N^{-1} \boldsymbol{x}$
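A sketch of the posterior and predictive distribution from these two slides (the data and function names are assumptions; the formulas follow the slide, with $\boldsymbol{T}_N$ as the posterior precision):

```python
import numpy as np

def posterior(X, t, sigma2, alpha2):
    # p(w | D) = N(m_N, T_N^{-1}); T_N = I/alpha^2 + X'X/sigma^2, m_N = T_N^{-1} X't / sigma^2
    d = X.shape[1]
    T_N = np.eye(d) / alpha2 + X.T @ X / sigma2   # posterior precision
    S_N = np.linalg.inv(T_N)                      # posterior covariance
    m_N = S_N @ X.T @ t / sigma2                  # posterior mean
    return m_N, S_N

def predictive(x_new, m_N, S_N, sigma2):
    # p(t | x, D) = N(m_N'x, sigma^2 + x' T_N^{-1} x)
    return m_N @ x_new, sigma2 + x_new @ S_N @ x_new

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 20)
X = np.column_stack([np.ones_like(x), x])
t = 1.0 + 2.0 * x + rng.normal(0.0, 0.3, 20)

m_N, S_N = posterior(X, t, sigma2=0.09, alpha2=1.0)
print(predictive(np.array([1.0, 0.5]), m_N, S_N, 0.09))  # mean near 2.0, variance > sigma^2
```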

  19. Predictive distribution: example
- Example: sinusoidal data, 9 Gaussian basis functions
- The red curve shows the mean of the predictive distribution; the pink region spans one standard deviation on either side of the mean
[Bishop]
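The Bishop figure itself is not in the transcript, but the setup it describes can be reproduced as a sketch (the noise level, basis width, and all names are assumptions): Bayesian linear regression on sinusoidal data with 9 Gaussian basis functions, giving the predictive mean (red curve) and the one-standard-deviation band (pink region); sampling weights from the posterior gives curves like those on the next slide.

```python
import numpy as np

def gaussian_basis(x, centers, s=0.1):
    # Feature map: one Gaussian bump per center (9 of them, as on the slide)
    return np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * s ** 2))

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 25)
t = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, 25)

centers = np.linspace(0.0, 1.0, 9)
Phi = gaussian_basis(x, centers)

sigma2, alpha2 = 0.04, 1.0
T_N = np.eye(9) / alpha2 + Phi.T @ Phi / sigma2  # posterior precision
S_N = np.linalg.inv(T_N)                         # posterior covariance
m_N = S_N @ Phi.T @ t / sigma2                   # posterior mean

# Predictive mean and +/- one std. dev. band on a dense grid of inputs
grid = np.linspace(0.0, 1.0, 100)
Phi_g = gaussian_basis(grid, centers)
mean = Phi_g @ m_N
std = np.sqrt(sigma2 + np.einsum('ij,jk,ik->i', Phi_g, S_N, Phi_g))

# Curves with parameters sampled from p(w | D), as on the next slide
samples = rng.multivariate_normal(m_N, S_N, size=5) @ Phi_g.T
```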

  20. Predictive distribution: example
- Functions whose parameters are sampled from the posterior $p(\boldsymbol{w} \mid \mathcal{D})$
[Bishop]

  21. Resource
- C. Bishop, "Pattern Recognition and Machine Learning", Chapter 3.3
