Bayesian Interpretations of Regularization
Charlie Frogner
9.520 Class 17, April 6, 2011
The Plan

Regularized least squares maps {(x_i, y_i)}_{i=1}^n to a function that minimizes the regularized loss:

f_S = \arg\min_{f \in \mathcal{H}} \frac{1}{2} \sum_{i=1}^n (y_i - f(x_i))^2 + \frac{\lambda}{2} \|f\|_{\mathcal{H}}^2

Can we interpret RLS from a probabilistic point of view?
Some notation

Training set: S = {(x_1, y_1), ..., (x_n, y_n)}.
Inputs: X = {x_1, ..., x_n}.
Labels: Y = {y_1, ..., y_n}.
Parameters: \theta \in \mathbb{R}^p.
p(Y | X, \theta) is the joint distribution over the labels Y, given the inputs X and the parameters.
Where do probabilities show up?

\frac{1}{2} \sum_{i=1}^n V(y_i, f(x_i)) + \frac{\lambda}{2} \|f\|_{\mathcal{H}}^2

becomes

p(Y | f, X) \cdot p(f)

Likelihood, a.k.a. noise model: p(Y | f, X).
  Gaussian: y_i \sim N(f^*(x_i), \sigma_i^2)
  Poisson: y_i \sim Pois(f^*(x_i))
Prior: p(f).
Estimation

The estimation problem: given data {(x_i, y_i)}_{i=1}^n and a model p(Y | f, X), p(f), find a good f to explain the data.
The Plan

Maximum likelihood estimation for ERM
MAP estimation for linear RLS
MAP estimation for kernel RLS
Transductive model
Infinite dimensions get more complicated
Maximum likelihood estimation

Given data {(x_i, y_i)}_{i=1}^n and a model p(Y | f, X), p(f), a good f is one that maximizes p(Y | f, X).
Maximum likelihood and least squares

For least squares, the noise model is:

y_i | f, x_i \sim N(f(x_i), \sigma^2)

a.k.a.

Y | f, X \sim N(f(X), \sigma^2 I)

So

p(Y | f, X) = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left( -\frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - f(x_i))^2 \right)
Maximum likelihood and least squares

Maximum likelihood: maximize

p(Y | f, X) = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left( -\frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - f(x_i))^2 \right)

Empirical risk minimization: minimize

\sum_{i=1}^n (y_i - f(x_i))^2
...

Minimizing

\sum_{i=1}^n (y_i - f(x_i))^2

is the same as maximizing

\exp\left( -\frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - f(x_i))^2 \right)

since the exponential is a decreasing function of the empirical risk.
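To make the correspondence concrete, here is a minimal numerical sketch (not from the slides; it assumes a toy linear model f(x) = \langle x, \theta \rangle and synthetic data): the Gaussian log-likelihood is a decreasing affine function of the sum of squared errors, so the \theta that maximizes the likelihood is exactly the \theta that minimizes the empirical risk.

```python
# Hedged sketch: toy linear model and synthetic data, chosen only for illustration.
import numpy as np

rng = np.random.default_rng(0)
n, d, sigma = 50, 3, 0.5
X = rng.normal(size=(n, d))
theta_true = rng.normal(size=d)
Y = X @ theta_true + sigma * rng.normal(size=n)

def sum_sq_err(theta):
    """Empirical risk: sum of squared errors."""
    r = Y - X @ theta
    return r @ r

def log_likelihood(theta):
    """Gaussian log-likelihood log p(Y | f, X) with noise variance sigma^2."""
    return -0.5 * n * np.log(2 * np.pi * sigma**2) - sum_sq_err(theta) / (2 * sigma**2)

# The two objectives differ only by a constant and a negative scaling, so the
# theta that maximizes the likelihood is the theta that minimizes the risk.
for theta in (theta_true, np.zeros(d), rng.normal(size=d)):
    print(sum_sq_err(theta), log_likelihood(theta))
```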
What about regularization?

RLS:

\arg\min_f \frac{1}{2} \sum_{i=1}^n (y_i - f(x_i))^2 + \frac{\lambda}{2} \|f\|_{\mathcal{H}}^2

Is there a model of Y and f that yields RLS? Yes:

\exp\left( -\frac{1}{2\sigma_\varepsilon^2} \sum_{i=1}^n (y_i - f(x_i))^2 - \frac{\lambda}{2} \|f\|_{\mathcal{H}}^2 \right)
  = \exp\left( -\frac{1}{2\sigma_\varepsilon^2} \sum_{i=1}^n (y_i - f(x_i))^2 \right) \cdot \exp\left( -\frac{\lambda}{2} \|f\|_{\mathcal{H}}^2 \right)

which has the form p(Y | f, X) \cdot p(f).
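As a sanity check of this factorization, here is a small sketch (assuming a linear f(x) = \langle x, \theta \rangle with \|f\|_{\mathcal{H}}^2 = \|\theta\|^2, Gaussian noise of variance \sigma_\varepsilon^2, and a standard normal prior; the data are synthetic and chosen only for illustration). It shows that the RLS objective with \lambda = \sigma_\varepsilon^2 is just \sigma_\varepsilon^2 times the negative log of p(Y | f, X) \cdot p(f), so the two have the same minimizer.

```python
# Hedged sketch: linear f, ||f||_H^2 = ||theta||^2, standard normal prior (assumptions).
import numpy as np

rng = np.random.default_rng(1)
n, d, sigma_eps = 40, 4, 0.3
X = rng.normal(size=(n, d))
Y = X @ rng.normal(size=d) + sigma_eps * rng.normal(size=n)
lam = sigma_eps**2  # regularization weight implied by this particular model

def rls_objective(theta):
    """(1/2) sum of squared errors + (lambda/2) ||theta||^2."""
    r = Y - X @ theta
    return 0.5 * r @ r + 0.5 * lam * theta @ theta

def neg_log_model(theta):
    """-log [ p(Y | f, X) p(f) ], dropping theta-independent constants."""
    r = Y - X @ theta
    return (r @ r) / (2 * sigma_eps**2) + 0.5 * theta @ theta

# For any theta, the two differ only by the positive factor sigma_eps^2,
# so minimizing one is the same as minimizing the other.
for theta in (np.zeros(d), rng.normal(size=d)):
    print(rls_objective(theta), sigma_eps**2 * neg_log_model(theta))
```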
Posterior function estimates

Given data {(x_i, y_i)}_{i=1}^n and a model p(Y | f, X), p(f), find a good f to explain the data. (If we can get p(f | X, Y):)

Bayes least squares estimate: \hat{f}_{BLS} = E_{f | X, Y}[f], i.e. the mean of the posterior.

MAP estimate: \hat{f}_{MAP}(Y | X) = \arg\max_f p(f | X, Y), i.e. a mode of the posterior.
A posterior on functions?

How to find p(f | Y, X)? Bayes' rule:

p(f | X, Y) = \frac{p(Y | X, f) \cdot p(f)}{p(Y | X)} = \frac{p(Y | X, f) \cdot p(f)}{\int p(Y | X, f) \, dp(f)}

When is this well-defined?
A posterior on functions?

Functions vs. parameters: \mathcal{H} \cong \mathbb{R}^p.

Represent functions in \mathcal{H} by their coordinates with respect to a basis:

f \in \mathcal{H} \leftrightarrow \theta \in \mathbb{R}^p

Assume (for the moment) that p < \infty.
A posterior on functions?

Mercer's theorem:

K(x_i, x_j) = \sum_k \nu_k \psi_k(x_i) \psi_k(x_j)

where \nu_k \psi_k(\cdot) = \int K(\cdot, y) \psi_k(y) \, dy for all k.

The functions \{\sqrt{\nu_k} \psi_k(\cdot)\} form an orthonormal basis for \mathcal{H}_K. Let \phi(\cdot) = [\sqrt{\nu_1} \psi_1(\cdot), ..., \sqrt{\nu_p} \psi_p(\cdot)]. Then:

\mathcal{H}_K = \{ \phi(\cdot) \theta \mid \theta \in \mathbb{R}^p \}
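A concrete finite-dimensional illustration (this uses the explicit feature map of a degree-2 polynomial kernel, chosen here for simplicity, rather than the Mercer eigenfunctions of an integral operator): K(x, x') = \langle \phi(x), \phi(x') \rangle, and every f in \mathcal{H}_K can be written as f(x) = \phi(x) \theta for some \theta \in \mathbb{R}^p, here with p = 3.

```python
# Hedged sketch: degree-2 polynomial kernel on R^2, an assumed illustrative choice.
import numpy as np

def K(x, xp):
    """Degree-2 homogeneous polynomial kernel K(x, x') = (<x, x'>)^2."""
    return float(np.dot(x, xp)) ** 2

def phi(x):
    """Explicit feature map for this kernel on R^2 (so p = 3 here)."""
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

x, xp = np.array([1.0, -2.0]), np.array([0.5, 3.0])
print(K(x, xp), phi(x) @ phi(xp))   # the two numbers agree: 30.25 and 30.25

theta = np.array([0.7, -1.2, 0.4])  # coordinates of some f in H_K w.r.t. this basis
print(phi(x) @ theta)               # evaluating f(x) = <phi(x), theta>
```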
Prior on infinite-dimensional space

Problem: there is no such thing as \theta \sim N(0, I) when \theta \in \mathbb{R}^\infty!
Posterior for linear RLS

Linear function: f(x) = \langle x, \theta \rangle

Noise model: Y | X, \theta \sim N(X\theta, \sigma_\varepsilon^2 I)

Add a prior: \theta \sim N(0, I)
Posterior for linear RLS

Model: Y | X, \theta \sim N(X\theta, \sigma_\varepsilon^2 I), \theta \sim N(0, I)

Joint distribution over Y and \theta:

\begin{bmatrix} Y \\ \theta \end{bmatrix} \sim N\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} XX^T + \sigma_\varepsilon^2 I & X \\ X^T & I \end{bmatrix} \right)

Condition on Y.
Posterior for linear RLS

Posterior: \theta | X, Y \sim N(\mu_{\theta|X,Y}, \Sigma_{\theta|X,Y}), where

\mu_{\theta|X,Y} = X^T (XX^T + \sigma_\varepsilon^2 I)^{-1} Y
\Sigma_{\theta|X,Y} = I - X^T (XX^T + \sigma_\varepsilon^2 I)^{-1} X

This is Gaussian, so

\hat{\theta}_{MAP}(Y | X) = \hat{\theta}_{BLS}(Y | X) = X^T (XX^T + \sigma_\varepsilon^2 I)^{-1} Y
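A numerical sketch of this posterior on synthetic data (the data and dimensions are assumptions made only for illustration): build \mu and \Sigma from the formulas above, then check them against the equivalent "primal" expressions obtained from the push-through and Woodbury identities.

```python
# Hedged sketch: synthetic data; verifies the dual and primal forms of the posterior agree.
import numpy as np

rng = np.random.default_rng(2)
n, d, sigma_eps = 30, 5, 0.4
X = rng.normal(size=(n, d))
Y = X @ rng.normal(size=d) + sigma_eps * rng.normal(size=n)

G = X @ X.T + sigma_eps**2 * np.eye(n)           # XX^T + sigma_eps^2 I
mu = X.T @ np.linalg.solve(G, Y)                 # posterior mean (also the MAP/BLS estimate)
Sigma = np.eye(d) - X.T @ np.linalg.solve(G, X)  # posterior covariance

# Equivalent forms: mu = (X^T X + sigma^2 I)^{-1} X^T Y,  Sigma = (I + X^T X / sigma^2)^{-1}
A = X.T @ X + sigma_eps**2 * np.eye(d)
print(np.allclose(mu, np.linalg.solve(A, X.T @ Y)))
print(np.allclose(Sigma, np.linalg.inv(np.eye(d) + X.T @ X / sigma_eps**2)))
```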
Linear RLS as a MAP estimator

Model: Y | X, \theta \sim N(X\theta, \sigma_\varepsilon^2 I), \theta \sim N(0, I)

\hat{\theta}_{MAP}(Y | X) = X^T (XX^T + \sigma_\varepsilon^2 I)^{-1} Y

Recall the linear RLS solution:

\hat{\theta}_{RLS}(Y | X) = \arg\min_\theta \frac{1}{2} \sum_{i=1}^n (y_i - \langle x_i, \theta \rangle)^2 + \frac{\lambda}{2} \|\theta\|_2^2
                          = X^T (XX^T + \lambda I)^{-1} Y

So what's \lambda?
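A quick numerical check on synthetic data (again an assumed toy setup): with \lambda = \sigma_\varepsilon^2, the ridge/RLS solution (X^T X + \lambda I)^{-1} X^T Y coincides with the MAP estimate above. Comparing the two closed forms answers the question: \lambda plays the role of the noise variance \sigma_\varepsilon^2, relative to the unit-variance prior on \theta.

```python
# Hedged sketch: synthetic data; ridge with lambda = sigma_eps^2 matches the MAP estimate.
import numpy as np

rng = np.random.default_rng(3)
n, d, sigma_eps = 30, 5, 0.4
X = rng.normal(size=(n, d))
Y = X @ rng.normal(size=d) + sigma_eps * rng.normal(size=n)
lam = sigma_eps**2

theta_map = X.T @ np.linalg.solve(X @ X.T + sigma_eps**2 * np.eye(n), Y)  # MAP / BLS estimate
theta_rls = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)           # ridge regression

print(np.allclose(theta_map, theta_rls))  # True: lambda corresponds to sigma_eps^2
```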