Regression: Probabilistic Perspective
CE-717: Machine Learning
Sharif University of Technology
M. Soleymani, Fall 2018
Curve fitting: probabilistic perspective
- Describing the uncertainty over the value of the target variable as a probability distribution
- Example (figure): a regression curve $f(x; \mathbf{w})$, with the Gaussian conditional $p(y \mid x_0, \mathbf{w}, \sigma)$ drawn around the predicted value $f(x_0; \mathbf{w})$ at an input $x_0$
The learning diagram including a noisy target
- (Diagram:) an unknown target distribution $P(y \mid \mathbf{x})$, i.e., a target function $f: \mathcal{X} \to \mathcal{Y}$ plus noise; an unknown distribution $P(\mathbf{x})$ on the features; training examples $(\mathbf{x}^{(1)}, y^{(1)}), \dots, (\mathbf{x}^{(n)}, y^{(n)})$ drawn from $P(\mathbf{x}, y) = P(\mathbf{x})\, P(y \mid \mathbf{x})$; and a final hypothesis $g: \mathcal{X} \to \mathcal{Y}$ approximating $f$
[Y. S. Abu-Mostafa, 2012]
Curve fitting: probabilistic perspective (Example)
- Special case: observed output = function + noise
  $y = f(\mathbf{x}; \mathbf{w}) + \varepsilon$,  e.g., $\varepsilon \sim \mathcal{N}(0, \sigma^2)$
- Noise: whatever we cannot capture with our chosen family of functions
Curve fitting: probabilistic perspective (Example)
- Best regression: $\mathbb{E}[y \mid \mathbf{x}] = \mathbb{E}[f(\mathbf{x}; \mathbf{w}) + \varepsilon] = f(\mathbf{x}; \mathbf{w})$, since $\varepsilon \sim \mathcal{N}(0, \sigma^2)$
- $f(\mathbf{x}; \mathbf{w})$ is trying to capture the mean of the observations $y$ given the input $\mathbf{x}$:
  - $\mathbb{E}[y \mid \mathbf{x}]$: the conditional expectation of $y$ given $\mathbf{x}$
  - evaluated according to the model (not according to the true underlying distribution $P$)
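A minimal sketch of this noise model (the "true" function, weights, and noise level below are assumed values for illustration, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative "true" linear function f(x; w) = w0 + w1 * x (assumed values)
w_true = np.array([1.0, 2.0])   # [w0, w1]
sigma = 0.3                     # assumed noise standard deviation

n = 10_000
x = rng.uniform(-1.0, 1.0, size=n)
eps = rng.normal(0.0, sigma, size=n)   # epsilon ~ N(0, sigma^2)
y = w_true[0] + w_true[1] * x + eps    # y = f(x; w) + epsilon

# Empirically, the mean of y for inputs near a fixed x should be close to f(x; w)
mask = np.abs(x - 0.5) < 0.05          # points with x near 0.5
print(y[mask].mean())                  # ~ f(0.5; w) = 1 + 2 * 0.5 = 2.0
```

The point of the sketch is simply that, under this model, the regression function is the conditional mean of $y$ given $\mathbf{x}$.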
Curve fitting using probabilistic estimation
- Maximum Likelihood (ML) estimation
- Maximum A Posteriori (MAP) estimation
- Bayesian approach
Maximum likelihood estimation
- Given observations $D = \{(\mathbf{x}^{(i)}, y^{(i)})\}_{i=1}^{n}$
- Find the parameters that maximize the (conditional) likelihood of the outputs:
  $p(D; \boldsymbol{\theta}) = p(\mathbf{y} \mid \mathbf{X}, \boldsymbol{\theta}) = \prod_{i=1}^{n} p(y^{(i)} \mid \mathbf{x}^{(i)}, \boldsymbol{\theta})$
  $\mathbf{X} = \begin{bmatrix} 1 & x_1^{(1)} & \cdots & x_d^{(1)} \\ 1 & x_1^{(2)} & \cdots & x_d^{(2)} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_1^{(n)} & \cdots & x_d^{(n)} \end{bmatrix}, \qquad \mathbf{y} = \begin{bmatrix} y^{(1)} \\ \vdots \\ y^{(n)} \end{bmatrix}$
Maximum likelihood estimation (cont'd)
$y = f(\mathbf{x}; \mathbf{w}) + \varepsilon$, $\varepsilon \sim \mathcal{N}(0, \sigma^2)$
- $y$ given $\mathbf{x}$ is normally distributed with mean $f(\mathbf{x}; \mathbf{w})$ and variance $\sigma^2$:
  - we model the uncertainty in the predictions, not just the mean
  $p(y \mid \mathbf{x}, \mathbf{w}, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{ -\frac{1}{2\sigma^2} \left( y - f(\mathbf{x}; \mathbf{w}) \right)^2 \right\}$
Maximum likelihood estimation (cont'd)
- Example: univariate linear function
  $p(y \mid x, \mathbf{w}, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{ -\frac{1}{2\sigma^2} \left( y - w_0 - w_1 x \right)^2 \right\}$
- (Figure: a line placed far from the data points.) Why is such a line a bad fit according to the likelihood criterion?
  - $p(y \mid x, \mathbf{w}, \sigma^2)$ is near zero for most of the points (they are far from the line), so the likelihood of the data is very small
Maximum likelihood estimation (cont'd)
- Maximize the likelihood of the outputs (i.i.d.):
  $p(D; \mathbf{w}, \sigma^2) = \prod_{i=1}^{n} p(y^{(i)} \mid \mathbf{x}^{(i)}, \mathbf{w}, \sigma^2)$
  $\widehat{\mathbf{w}} = \underset{\mathbf{w}}{\operatorname{argmax}}\; p(D; \mathbf{w}, \sigma^2) = \underset{\mathbf{w}}{\operatorname{argmax}} \prod_{i=1}^{n} p(y^{(i)} \mid \mathbf{x}^{(i)}, \mathbf{w}, \sigma^2)$
Maximum likelihood estimation (cont'd)
- It is often easier (but equivalent) to maximize the log-likelihood:
  $\widehat{\mathbf{w}} = \underset{\mathbf{w}}{\operatorname{argmax}}\; \ln p(\mathbf{y} \mid \mathbf{X}, \mathbf{w}, \sigma^2)$
  $\ln \prod_{i=1}^{n} p(y^{(i)} \mid \mathbf{x}^{(i)}, \mathbf{w}, \sigma^2) = \sum_{i=1}^{n} \ln \mathcal{N}\!\left(y^{(i)} \mid f(\mathbf{x}^{(i)}; \mathbf{w}), \sigma^2\right)$
  $= -n \ln \sigma - \frac{n}{2} \ln 2\pi - \frac{1}{2\sigma^2} \sum_{i=1}^{n} \left( y^{(i)} - f(\mathbf{x}^{(i)}; \mathbf{w}) \right)^2$
  where the last term contains the sum-of-squares error
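A small numerical check of this identity, as a sketch (the data and the candidate weights below are arbitrary, assumed only for illustration):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

n, sigma = 50, 0.4
x = rng.uniform(-1, 1, size=n)
w0, w1 = 0.5, -1.5                      # arbitrary candidate parameters
y = rng.normal(w0 + w1 * x, sigma)      # targets drawn around that line

f = w0 + w1 * x                         # model predictions f(x; w)

# Log-likelihood as a sum of Gaussian log-densities
ll_direct = norm.logpdf(y, loc=f, scale=sigma).sum()

# Closed form: -n ln(sigma) - n/2 ln(2 pi) - SSE / (2 sigma^2)
sse = np.sum((y - f) ** 2)
ll_closed = -n * np.log(sigma) - n / 2 * np.log(2 * np.pi) - sse / (2 * sigma**2)

print(np.isclose(ll_direct, ll_closed))  # True: maximizing the log-likelihood
                                         # over w means minimizing the SSE
```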
Maximum likelihood estimation (cont'd)
- Maximizing the log-likelihood (when we assume $y = f(\mathbf{x}; \mathbf{w}) + \varepsilon$, $\varepsilon \sim \mathcal{N}(0, \sigma^2)$) is equivalent to minimizing the SSE
- Let $\widehat{\mathbf{w}}$ be the maximum likelihood (here, least squares) setting of the parameters.
- What is the maximum likelihood estimate of $\sigma^2$?
  $\frac{\partial}{\partial \sigma^2} \log p(D; \mathbf{w}, \sigma^2) = 0$
  $\Rightarrow \widehat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} \left( y^{(i)} - f(\mathbf{x}^{(i)}; \widehat{\mathbf{w}}) \right)^2$
  (the mean squared prediction error)
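A sketch of both maximum likelihood estimates for a linear model (synthetic data; the generating weights and noise level are assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic data from an assumed line with Gaussian noise
n, true_w, true_sigma = 200, np.array([1.0, -2.0]), 0.5
x = rng.uniform(-1, 1, size=n)
y = true_w[0] + true_w[1] * x + rng.normal(0, true_sigma, size=n)

# Design matrix with a bias column, as on the earlier slide
X = np.column_stack([np.ones(n), x])

# ML estimate of w = least squares solution
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# ML estimate of sigma^2 = mean squared prediction error
residuals = y - X @ w_hat
sigma2_hat = np.mean(residuals ** 2)

print(w_hat)       # close to [1.0, -2.0]
print(sigma2_hat)  # close to 0.25 (= true_sigma ** 2)
```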
Maximum likelihood estimation (cont'd)
- Generally, maximizing the log-likelihood is equivalent to minimizing an empirical loss when the loss is defined as
  $\operatorname{loss}\!\left(y^{(i)}, f(\mathbf{x}^{(i)}; \mathbf{w})\right) = -\ln p(y^{(i)} \mid \mathbf{x}^{(i)}, \mathbf{w}, \boldsymbol{\theta})$
- Loss: negative log-probability
- More general distributions for $p(y \mid \mathbf{x})$ can be considered (one example is sketched below)
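As one illustration of such a more general choice (not on the slide, only a hedged example): if the noise is taken to be Laplace rather than Gaussian, the negative log-probability becomes an absolute-error loss, so maximum likelihood turns into least-absolute-deviations regression.

```python
import numpy as np

def laplace_nll(y, f_pred, b=1.0):
    """Negative log-likelihood under y ~ Laplace(f_pred, b).

    -ln p(y | x) = |y - f_pred| / b + ln(2 b): an absolute-error loss
    plus a constant that does not depend on the prediction.
    """
    return np.sum(np.abs(y - f_pred) / b + np.log(2 * b))
```

So the choice of noise distribution determines which empirical loss maximum likelihood estimation corresponds to.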
Maximum A Posteriori (MAP) estimation
- MAP:
  - Given observations $D$
  - Find the parameters that maximize the probability of the parameters after observing the data (the posterior):
    $\boldsymbol{\theta}_{MAP} = \underset{\boldsymbol{\theta}}{\operatorname{argmax}}\; p(\boldsymbol{\theta} \mid D)$
  - Since $p(\boldsymbol{\theta} \mid D) \propto p(D \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta})$:
    $\boldsymbol{\theta}_{MAP} = \underset{\boldsymbol{\theta}}{\operatorname{argmax}}\; p(D \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta})$
Maximum A Posteriori (MAP) estimation
- Given observations $D = \{(\mathbf{x}^{(i)}, y^{(i)})\}_{i=1}^{n}$:
  $\max_{\mathbf{w}}\; p(\mathbf{w} \mid \mathbf{y}, \mathbf{X}) \propto p(\mathbf{y} \mid \mathbf{X}, \mathbf{w})\, p(\mathbf{w})$
- Prior: $p(\mathbf{w}) = \mathcal{N}(\mathbf{0}, \beta^2 \mathbf{I}) = \frac{1}{\left(\sqrt{2\pi}\,\beta\right)^{d+1}} \exp\left\{ -\frac{1}{2\beta^2}\, \mathbf{w}^\top \mathbf{w} \right\}$
Maximum A Posteriori (MAP) estimation
- Given observations $D = \{(\mathbf{x}^{(i)}, y^{(i)})\}_{i=1}^{n}$:
  $\max_{\mathbf{w}}\; \ln\left[ p(\mathbf{y} \mid \mathbf{X}, \mathbf{w}, \sigma^2)\, p(\mathbf{w}) \right]$
  $\equiv\; \min_{\mathbf{w}}\; \frac{1}{\sigma^2} \sum_{i=1}^{n} \left( y^{(i)} - f(\mathbf{x}^{(i)}; \mathbf{w}) \right)^2 + \frac{1}{\beta^2}\, \mathbf{w}^\top \mathbf{w}$
- Equivalent to regularized SSE with $\lambda = \frac{\sigma^2}{\beta^2}$ (a sketch is given below)
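A sketch of the resulting MAP / ridge estimate in closed form for a linear model (synthetic data; the prior and noise scales are assumed values for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic linear data, as before (assumed generating values)
n = 100
x = rng.uniform(-1, 1, size=n)
y = 1.0 - 2.0 * x + rng.normal(0, 0.5, size=n)
X = np.column_stack([np.ones(n), x])

sigma2 = 0.25          # assumed noise variance
beta2 = 1.0            # assumed prior variance on the weights
lam = sigma2 / beta2   # lambda = sigma^2 / beta^2, as on the slide

# MAP estimate = ridge solution: (X^T X + lambda I)^-1 X^T y
d = X.shape[1]
w_map = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
print(w_map)
```

Note that as $\beta^2 \to \infty$ (a flat prior) the penalty vanishes and the MAP estimate reduces to the ML / least squares estimate.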
Bayesian approach
- Given observations $D = \{(\mathbf{x}^{(i)}, y^{(i)})\}_{i=1}^{n}$
- Rather than a single point estimate of the parameters, keep the whole posterior $p(\mathbf{w} \mid D)$ and average the predictions over it:
  $p(y \mid \mathbf{x}, D) = \int p(y \mid \mathbf{x}, \mathbf{w})\, p(\mathbf{w} \mid D)\, d\mathbf{w}$
- Example of a prior distribution: $p(\mathbf{w}) = \mathcal{N}(\mathbf{0}, \beta^2 \mathbf{I})$
- In this case: $p(\mathbf{w} \mid D) = \mathcal{N}(\mathbf{m}_n, \mathbf{S}_n^{-1})$, with
  $\mathbf{m}_n = \frac{1}{\sigma^2}\, \mathbf{S}_n^{-1} \mathbf{X}^\top \mathbf{y}$,  $\mathbf{S}_n = \frac{1}{\beta^2}\mathbf{I} + \frac{1}{\sigma^2}\mathbf{X}^\top \mathbf{X}$
Bayesian approach (cont'd)
- Given observations $D = \{(\mathbf{x}^{(i)}, y^{(i)})\}_{i=1}^{n}$, for a linear model:
  $p(D \mid \mathbf{w}) = p(\mathbf{y} \mid \mathbf{X}, \mathbf{w}, \sigma^2) = \prod_{i=1}^{n} p(y^{(i)} \mid \mathbf{x}^{(i)}, \mathbf{w}, \sigma^2)$, with $p(y^{(i)} \mid \mathbf{x}^{(i)}, \mathbf{w}, \sigma^2) = \mathcal{N}(y^{(i)} \mid \mathbf{w}^\top \mathbf{x}^{(i)}, \sigma^2)$
- Prior: $p(\mathbf{w}) = \mathcal{N}(\mathbf{0}, \beta^2 \mathbf{I})$
- Posterior: $p(\mathbf{w} \mid D) \propto p(D \mid \mathbf{w})\, p(\mathbf{w}) = \mathcal{N}(\mathbf{m}_n, \mathbf{S}_n^{-1})$, with $\mathbf{m}_n$ and $\mathbf{S}_n$ as on the previous slide
- Predictive distribution:
  $p(y \mid \mathbf{x}, D) = \int p(y \mid \mathbf{x}, \mathbf{w})\, p(\mathbf{w} \mid D)\, d\mathbf{w} = \mathcal{N}\!\left(y \mid \mathbf{m}_n^\top \mathbf{x},\; \sigma_n^2(\mathbf{x})\right)$
  $\sigma_n^2(\mathbf{x}) = \sigma^2 + \mathbf{x}^\top \mathbf{S}_n^{-1} \mathbf{x}$
  (a sketch of these updates is given below)
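A compact sketch of these formulas (synthetic data; $\sigma$, $\beta$, and the generating line are assumed values):

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic data from an assumed line y = 1 - 2x + noise
n, sigma, beta = 30, 0.5, 1.0
x = rng.uniform(-1, 1, size=n)
y = 1.0 - 2.0 * x + rng.normal(0, sigma, size=n)
X = np.column_stack([np.ones(n), x])      # design matrix with a bias column
d = X.shape[1]

# Posterior precision S_n and mean m_n, following the slide
S_n = np.eye(d) / beta**2 + X.T @ X / sigma**2
S_n_inv = np.linalg.inv(S_n)              # posterior covariance
m_n = S_n_inv @ X.T @ y / sigma**2        # posterior mean

# Predictive distribution at a new input x*
x_star = np.array([1.0, 0.3])             # [1, x*] including the bias term
pred_mean = m_n @ x_star
pred_var = sigma**2 + x_star @ S_n_inv @ x_star   # sigma_n^2(x*)
print(pred_mean, np.sqrt(pred_var))
```

The predictive variance has two parts: the irreducible noise $\sigma^2$ and the term $\mathbf{x}^\top \mathbf{S}_n^{-1} \mathbf{x}$, which reflects the remaining uncertainty about $\mathbf{w}$ and shrinks as more data arrive.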
Predictive distribution: example
- Example: sinusoidal data, 9 Gaussian basis functions
- (Figure:) the red curve shows the mean of the predictive distribution; the pink region spans one standard deviation on either side of the mean
[Bishop]
Predictive distribution: example
- (Figure:) functions whose parameters are sampled from the posterior $p(\mathbf{w} \mid D)$
[Bishop]
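A sketch of how such curves can be produced: draw weight vectors from the Gaussian posterior and evaluate the corresponding functions (Gaussian basis functions are used here; the basis centres, widths, and data are assumed values, not Bishop's exact setup):

```python
import numpy as np

rng = np.random.default_rng(5)

def gaussian_basis(x, centers, width=0.1):
    """Feature map: a bias term plus Gaussian bumps at the given centers."""
    phi = np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * width**2))
    return np.column_stack([np.ones(len(x)), phi])

# Sinusoidal toy data, as in the figure (assumed noise level)
n, sigma, beta = 15, 0.2, 1.0
x = rng.uniform(0, 1, size=n)
y = np.sin(2 * np.pi * x) + rng.normal(0, sigma, size=n)

centers = np.linspace(0, 1, 9)            # 9 Gaussian basis functions
Phi = gaussian_basis(x, centers)

# Posterior over weights, same formulas as on the previous slides
S_n = np.eye(Phi.shape[1]) / beta**2 + Phi.T @ Phi / sigma**2
S_n_inv = np.linalg.inv(S_n)
m_n = S_n_inv @ Phi.T @ y / sigma**2

# Sample a few weight vectors and evaluate the corresponding curves
x_grid = np.linspace(0, 1, 100)
Phi_grid = gaussian_basis(x_grid, centers)
w_samples = rng.multivariate_normal(m_n, S_n_inv, size=5)
curves = w_samples @ Phi_grid.T           # each row is one sampled function
print(curves.shape)                       # (5, 100)
```

Plotting each row of `curves` against `x_grid` reproduces the kind of posterior function samples shown in the figure.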
Resource
- C. Bishop, "Pattern Recognition and Machine Learning", Chapter 3.3.