Regression: Probabilistic Perspective
CE-717: Machine Learning
Sharif University of Technology
M. Soleymani, Fall 2018
Curve fitting: probabilistic perspective
- Describing the uncertainty over the value of the target variable as a probability distribution
- Example (figure): a regression curve $f(x; \mathbf{w})$, with the Gaussian conditional $p(y \mid x_0, \mathbf{w}, \sigma)$ drawn around the predicted value $f(x_0; \mathbf{w})$ at an input $x_0$
The learning diagram including a noisy target
- (Diagram:) an unknown target distribution $P(y \mid \mathbf{x})$, i.e., a target function $f: \mathcal{X} \to \mathcal{Y}$ plus noise; an unknown distribution $P(\mathbf{x})$ on the features; training examples $(\mathbf{x}^{(1)}, y^{(1)}), \dots, (\mathbf{x}^{(n)}, y^{(n)})$ drawn from $P(\mathbf{x}, y) = P(\mathbf{x})\, P(y \mid \mathbf{x})$; and a final hypothesis $g: \mathcal{X} \to \mathcal{Y}$ approximating $f$
[Y. S. Abu-Mostafa, 2012]
Curve fitting: probabilistic perspective (Example)
- Special case: observed output = function + noise
  $y = f(\mathbf{x}; \mathbf{w}) + \varepsilon$,  e.g., $\varepsilon \sim \mathcal{N}(0, \sigma^2)$
- Noise: whatever we cannot capture with our chosen family of functions
Curve fitting: probabilistic perspective (Example)
- Best regression: $\mathbb{E}[y \mid \mathbf{x}] = \mathbb{E}[f(\mathbf{x}; \mathbf{w}) + \varepsilon] = f(\mathbf{x}; \mathbf{w})$, since $\varepsilon \sim \mathcal{N}(0, \sigma^2)$
- $f(\mathbf{x}; \mathbf{w})$ is trying to capture the mean of the observations $y$ given the input $\mathbf{x}$:
  - $\mathbb{E}[y \mid \mathbf{x}]$: the conditional expectation of $y$ given $\mathbf{x}$
  - evaluated according to the model (not according to the true underlying distribution $P$)
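A minimal sketch of this noise model (the "true" function, weights, and noise level below are assumed values for illustration, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative "true" linear function f(x; w) = w0 + w1 * x (assumed values)
w_true = np.array([1.0, 2.0])   # [w0, w1]
sigma = 0.3                     # assumed noise standard deviation

n = 10_000
x = rng.uniform(-1.0, 1.0, size=n)
eps = rng.normal(0.0, sigma, size=n)   # epsilon ~ N(0, sigma^2)
y = w_true[0] + w_true[1] * x + eps    # y = f(x; w) + epsilon

# Empirically, the mean of y for inputs near a fixed x should be close to f(x; w)
mask = np.abs(x - 0.5) < 0.05          # points with x near 0.5
print(y[mask].mean())                  # ~ f(0.5; w) = 1 + 2 * 0.5 = 2.0
```

The point of the sketch is simply that, under this model, the regression function is the conditional mean of $y$ given $\mathbf{x}$.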
Curve fitting using probabilistic estimation
- Maximum Likelihood (ML) estimation
- Maximum A Posteriori (MAP) estimation
- Bayesian approach
Maximum likelihood estimation
- Given observations $D = \{(\mathbf{x}^{(i)}, y^{(i)})\}_{i=1}^{n}$
- Find the parameters that maximize the (conditional) likelihood of the outputs:
  $p(D; \boldsymbol{\theta}) = p(\mathbf{y} \mid \mathbf{X}, \boldsymbol{\theta}) = \prod_{i=1}^{n} p(y^{(i)} \mid \mathbf{x}^{(i)}, \boldsymbol{\theta})$
  $\mathbf{X} = \begin{bmatrix} 1 & x_1^{(1)} & \cdots & x_d^{(1)} \\ 1 & x_1^{(2)} & \cdots & x_d^{(2)} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_1^{(n)} & \cdots & x_d^{(n)} \end{bmatrix}, \qquad \mathbf{y} = \begin{bmatrix} y^{(1)} \\ \vdots \\ y^{(n)} \end{bmatrix}$
Maximum likelihood estimation (cont'd)
$y = f(\mathbf{x}; \mathbf{w}) + \varepsilon$, $\varepsilon \sim \mathcal{N}(0, \sigma^2)$
- $y$ given $\mathbf{x}$ is normally distributed with mean $f(\mathbf{x}; \mathbf{w})$ and variance $\sigma^2$:
  - we model the uncertainty in the predictions, not just the mean
  $p(y \mid \mathbf{x}, \mathbf{w}, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{ -\frac{1}{2\sigma^2} \left( y - f(\mathbf{x}; \mathbf{w}) \right)^2 \right\}$
Maximum likelihood estimation (cont'd)
- Example: univariate linear function
  $p(y \mid x, \mathbf{w}, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{ -\frac{1}{2\sigma^2} \left( y - w_0 - w_1 x \right)^2 \right\}$
- (Figure: a line placed far from the data points.) Why is such a line a bad fit according to the likelihood criterion?
  - $p(y \mid x, \mathbf{w}, \sigma^2)$ is near zero for most of the points (they are far from the line), so the likelihood of the data is very small
Maximum likelihood estimation (cont'd)
- Maximize the likelihood of the outputs (i.i.d.):
  $p(D; \mathbf{w}, \sigma^2) = \prod_{i=1}^{n} p(y^{(i)} \mid \mathbf{x}^{(i)}, \mathbf{w}, \sigma^2)$
  $\widehat{\mathbf{w}} = \underset{\mathbf{w}}{\operatorname{argmax}}\; p(D; \mathbf{w}, \sigma^2) = \underset{\mathbf{w}}{\operatorname{argmax}} \prod_{i=1}^{n} p(y^{(i)} \mid \mathbf{x}^{(i)}, \mathbf{w}, \sigma^2)$
Maximum likelihood estimation (cont'd)
- It is often easier (but equivalent) to maximize the log-likelihood:
  $\widehat{\mathbf{w}} = \underset{\mathbf{w}}{\operatorname{argmax}}\; \ln p(\mathbf{y} \mid \mathbf{X}, \mathbf{w}, \sigma^2)$
  $\ln \prod_{i=1}^{n} p(y^{(i)} \mid \mathbf{x}^{(i)}, \mathbf{w}, \sigma^2) = \sum_{i=1}^{n} \ln \mathcal{N}\!\left(y^{(i)} \mid f(\mathbf{x}^{(i)}; \mathbf{w}), \sigma^2\right)$
  $= -n \ln \sigma - \frac{n}{2} \ln 2\pi - \frac{1}{2\sigma^2} \sum_{i=1}^{n} \left( y^{(i)} - f(\mathbf{x}^{(i)}; \mathbf{w}) \right)^2$
  where the last term contains the sum-of-squares error
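A small numerical check of this identity, as a sketch (the data and the candidate weights below are arbitrary, assumed only for illustration):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

n, sigma = 50, 0.4
x = rng.uniform(-1, 1, size=n)
w0, w1 = 0.5, -1.5                      # arbitrary candidate parameters
y = rng.normal(w0 + w1 * x, sigma)      # targets drawn around that line

f = w0 + w1 * x                         # model predictions f(x; w)

# Log-likelihood as a sum of Gaussian log-densities
ll_direct = norm.logpdf(y, loc=f, scale=sigma).sum()

# Closed form: -n ln(sigma) - n/2 ln(2 pi) - SSE / (2 sigma^2)
sse = np.sum((y - f) ** 2)
ll_closed = -n * np.log(sigma) - n / 2 * np.log(2 * np.pi) - sse / (2 * sigma**2)

print(np.isclose(ll_direct, ll_closed))  # True: maximizing the log-likelihood
                                         # over w means minimizing the SSE
```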
Maximum likelihood estimation (cont'd)
- Maximizing the log-likelihood (when we assume $y = f(\mathbf{x}; \mathbf{w}) + \varepsilon$, $\varepsilon \sim \mathcal{N}(0, \sigma^2)$) is equivalent to minimizing the SSE
- Let $\widehat{\mathbf{w}}$ be the maximum likelihood (here, least squares) setting of the parameters.
- What is the maximum likelihood estimate of $\sigma^2$?
  $\frac{\partial}{\partial \sigma^2} \log p(D; \mathbf{w}, \sigma^2) = 0$
  $\Rightarrow \widehat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} \left( y^{(i)} - f(\mathbf{x}^{(i)}; \widehat{\mathbf{w}}) \right)^2$
  (the mean squared prediction error)
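A sketch of both maximum likelihood estimates for a linear model (synthetic data; the generating weights and noise level are assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic data from an assumed line with Gaussian noise
n, true_w, true_sigma = 200, np.array([1.0, -2.0]), 0.5
x = rng.uniform(-1, 1, size=n)
y = true_w[0] + true_w[1] * x + rng.normal(0, true_sigma, size=n)

# Design matrix with a bias column, as on the earlier slide
X = np.column_stack([np.ones(n), x])

# ML estimate of w = least squares solution
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# ML estimate of sigma^2 = mean squared prediction error
residuals = y - X @ w_hat
sigma2_hat = np.mean(residuals ** 2)

print(w_hat)       # close to [1.0, -2.0]
print(sigma2_hat)  # close to 0.25 (= true_sigma ** 2)
```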
Maximum likelihood estimation (cont'd)
- Generally, maximizing the log-likelihood is equivalent to minimizing an empirical loss when the loss is defined as
  $\operatorname{loss}\!\left(y^{(i)}, f(\mathbf{x}^{(i)}; \mathbf{w})\right) = -\ln p(y^{(i)} \mid \mathbf{x}^{(i)}, \mathbf{w}, \boldsymbol{\theta})$
- Loss: negative log-probability
- More general distributions for $p(y \mid \mathbf{x})$ can be considered (one example is sketched below)
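As one illustration of such a more general choice (not on the slide, only a hedged example): if the noise is taken to be Laplace rather than Gaussian, the negative log-probability becomes an absolute-error loss, so maximum likelihood turns into least-absolute-deviations regression.

```python
import numpy as np

def laplace_nll(y, f_pred, b=1.0):
    """Negative log-likelihood under y ~ Laplace(f_pred, b).

    -ln p(y | x) = |y - f_pred| / b + ln(2 b): an absolute-error loss
    plus a constant that does not depend on the prediction.
    """
    return np.sum(np.abs(y - f_pred) / b + np.log(2 * b))
```

So the choice of noise distribution determines which empirical loss maximum likelihood estimation corresponds to.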
Maximum A Posteriori (MAP) estimation
- MAP:
  - Given observations $D$
  - Find the parameters that maximize the probability of the parameters after observing the data (the posterior):
    $\boldsymbol{\theta}_{MAP} = \underset{\boldsymbol{\theta}}{\operatorname{argmax}}\; p(\boldsymbol{\theta} \mid D)$
  - Since $p(\boldsymbol{\theta} \mid D) \propto p(D \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta})$:
    $\boldsymbol{\theta}_{MAP} = \underset{\boldsymbol{\theta}}{\operatorname{argmax}}\; p(D \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta})$
Maximum A Posteriori (MAP) estimation
- Given observations $D = \{(\mathbf{x}^{(i)}, y^{(i)})\}_{i=1}^{n}$:
  $\max_{\mathbf{w}}\; p(\mathbf{w} \mid \mathbf{y}, \mathbf{X}) \propto p(\mathbf{y} \mid \mathbf{X}, \mathbf{w})\, p(\mathbf{w})$
- Prior: $p(\mathbf{w}) = \mathcal{N}(\mathbf{0}, \beta^2 \mathbf{I}) = \frac{1}{\left(\sqrt{2\pi}\,\beta\right)^{d+1}} \exp\left\{ -\frac{1}{2\beta^2}\, \mathbf{w}^\top \mathbf{w} \right\}$
Maximum A Posteriori (MAP) estimation
- Given observations $D = \{(\mathbf{x}^{(i)}, y^{(i)})\}_{i=1}^{n}$:
  $\max_{\mathbf{w}}\; \ln\left[ p(\mathbf{y} \mid \mathbf{X}, \mathbf{w}, \sigma^2)\, p(\mathbf{w}) \right]$
  $\equiv\; \min_{\mathbf{w}}\; \frac{1}{\sigma^2} \sum_{i=1}^{n} \left( y^{(i)} - f(\mathbf{x}^{(i)}; \mathbf{w}) \right)^2 + \frac{1}{\beta^2}\, \mathbf{w}^\top \mathbf{w}$
- Equivalent to regularized SSE with $\lambda = \frac{\sigma^2}{\beta^2}$ (a sketch is given below)
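A sketch of the resulting MAP / ridge estimate in closed form for a linear model (synthetic data; the prior and noise scales are assumed values for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic linear data, as before (assumed generating values)
n = 100
x = rng.uniform(-1, 1, size=n)
y = 1.0 - 2.0 * x + rng.normal(0, 0.5, size=n)
X = np.column_stack([np.ones(n), x])

sigma2 = 0.25          # assumed noise variance
beta2 = 1.0            # assumed prior variance on the weights
lam = sigma2 / beta2   # lambda = sigma^2 / beta^2, as on the slide

# MAP estimate = ridge solution: (X^T X + lambda I)^-1 X^T y
d = X.shape[1]
w_map = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
print(w_map)
```

Note that as $\beta^2 \to \infty$ (a flat prior) the penalty vanishes and the MAP estimate reduces to the ML / least squares estimate.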
Bayesian approach
- Given observations $D = \{(\mathbf{x}^{(i)}, y^{(i)})\}_{i=1}^{n}$
- Rather than a single point estimate of the parameters, keep the whole posterior $p(\mathbf{w} \mid D)$ and average the predictions over it:
  $p(y \mid \mathbf{x}, D) = \int p(y \mid \mathbf{x}, \mathbf{w})\, p(\mathbf{w} \mid D)\, d\mathbf{w}$
- Example of a prior distribution: $p(\mathbf{w}) = \mathcal{N}(\mathbf{0}, \beta^2 \mathbf{I})$
- In this case: $p(\mathbf{w} \mid D) = \mathcal{N}(\mathbf{m}_n, \mathbf{S}_n^{-1})$, with
  $\mathbf{m}_n = \frac{1}{\sigma^2}\, \mathbf{S}_n^{-1} \mathbf{X}^\top \mathbf{y}$,  $\mathbf{S}_n = \frac{1}{\beta^2}\mathbf{I} + \frac{1}{\sigma^2}\mathbf{X}^\top \mathbf{X}$
Bayesian approach (cont'd)
- Given observations $D = \{(\mathbf{x}^{(i)}, y^{(i)})\}_{i=1}^{n}$, for a linear model:
  $p(D \mid \mathbf{w}) = p(\mathbf{y} \mid \mathbf{X}, \mathbf{w}, \sigma^2) = \prod_{i=1}^{n} p(y^{(i)} \mid \mathbf{x}^{(i)}, \mathbf{w}, \sigma^2)$, with $p(y^{(i)} \mid \mathbf{x}^{(i)}, \mathbf{w}, \sigma^2) = \mathcal{N}(y^{(i)} \mid \mathbf{w}^\top \mathbf{x}^{(i)}, \sigma^2)$
- Prior: $p(\mathbf{w}) = \mathcal{N}(\mathbf{0}, \beta^2 \mathbf{I})$
- Posterior: $p(\mathbf{w} \mid D) \propto p(D \mid \mathbf{w})\, p(\mathbf{w}) = \mathcal{N}(\mathbf{m}_n, \mathbf{S}_n^{-1})$, with $\mathbf{m}_n$ and $\mathbf{S}_n$ as on the previous slide
- Predictive distribution:
  $p(y \mid \mathbf{x}, D) = \int p(y \mid \mathbf{x}, \mathbf{w})\, p(\mathbf{w} \mid D)\, d\mathbf{w} = \mathcal{N}\!\left(y \mid \mathbf{m}_n^\top \mathbf{x},\; \sigma_n^2(\mathbf{x})\right)$
  $\sigma_n^2(\mathbf{x}) = \sigma^2 + \mathbf{x}^\top \mathbf{S}_n^{-1} \mathbf{x}$
  (a sketch of these updates is given below)
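A compact sketch of these formulas (synthetic data; $\sigma$, $\beta$, and the generating line are assumed values):

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic data from an assumed line y = 1 - 2x + noise
n, sigma, beta = 30, 0.5, 1.0
x = rng.uniform(-1, 1, size=n)
y = 1.0 - 2.0 * x + rng.normal(0, sigma, size=n)
X = np.column_stack([np.ones(n), x])      # design matrix with a bias column
d = X.shape[1]

# Posterior precision S_n and mean m_n, following the slide
S_n = np.eye(d) / beta**2 + X.T @ X / sigma**2
S_n_inv = np.linalg.inv(S_n)              # posterior covariance
m_n = S_n_inv @ X.T @ y / sigma**2        # posterior mean

# Predictive distribution at a new input x*
x_star = np.array([1.0, 0.3])             # [1, x*] including the bias term
pred_mean = m_n @ x_star
pred_var = sigma**2 + x_star @ S_n_inv @ x_star   # sigma_n^2(x*)
print(pred_mean, np.sqrt(pred_var))
```

The predictive variance has two parts: the irreducible noise $\sigma^2$ and the term $\mathbf{x}^\top \mathbf{S}_n^{-1} \mathbf{x}$, which reflects the remaining uncertainty about $\mathbf{w}$ and shrinks as more data arrive.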
Predictive distribution: example
- Example: sinusoidal data, 9 Gaussian basis functions
- (Figure:) the red curve shows the mean of the predictive distribution; the pink region spans one standard deviation on either side of the mean
[Bishop]
Predictive distribution: example
- (Figure:) functions whose parameters are sampled from the posterior $p(\mathbf{w} \mid D)$
[Bishop]
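A sketch of how such curves can be produced: draw weight vectors from the Gaussian posterior and evaluate the corresponding functions (Gaussian basis functions are used here; the basis centres, widths, and data are assumed values, not Bishop's exact setup):

```python
import numpy as np

rng = np.random.default_rng(5)

def gaussian_basis(x, centers, width=0.1):
    """Feature map: a bias term plus Gaussian bumps at the given centers."""
    phi = np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * width**2))
    return np.column_stack([np.ones(len(x)), phi])

# Sinusoidal toy data, as in the figure (assumed noise level)
n, sigma, beta = 15, 0.2, 1.0
x = rng.uniform(0, 1, size=n)
y = np.sin(2 * np.pi * x) + rng.normal(0, sigma, size=n)

centers = np.linspace(0, 1, 9)            # 9 Gaussian basis functions
Phi = gaussian_basis(x, centers)

# Posterior over weights, same formulas as on the previous slides
S_n = np.eye(Phi.shape[1]) / beta**2 + Phi.T @ Phi / sigma**2
S_n_inv = np.linalg.inv(S_n)
m_n = S_n_inv @ Phi.T @ y / sigma**2

# Sample a few weight vectors and evaluate the corresponding curves
x_grid = np.linspace(0, 1, 100)
Phi_grid = gaussian_basis(x_grid, centers)
w_samples = rng.multivariate_normal(m_n, S_n_inv, size=5)
curves = w_samples @ Phi_grid.T           # each row is one sampled function
print(curves.shape)                       # (5, 100)
```

Plotting each row of `curves` against `x_grid` reproduces the kind of posterior function samples shown in the figure.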
Resource
- C. Bishop, "Pattern Recognition and Machine Learning", Chapter 3.3.