Minimum Stein Discrepancy Estimators
François-Xavier Briol
University of Cambridge & The Alan Turing Institute
ICML Workshop on "Stein's Method for Machine Learning and Statistics"
15th June 2019
Collaborators

Alessandro Barp (ICL), Andrew Duncan (ICL), Mark Girolami (U. Cambridge), Lester Mackey (Microsoft)

Barp, A., Briol, F.-X., Duncan, A., Girolami, M., Mackey, L. (2019). Minimum Stein Discrepancy Estimators. Preprint available at https://fxbriol.github.io
Statistical Inference for Unnormalised Models

Motivation: Suppose we observe some data $\{x_1, \ldots, x_n\}$. Given a parametric family of distributions $\{P_\theta : \theta \in \Theta\}$ with densities denoted $p_\theta$, we seek $\theta^* \in \Theta$ which best approximates the empirical distribution
$$Q_n = \frac{1}{n} \sum_{i=1}^{n} \delta_{x_i}.$$

Challenge: For complex models, we often only have access to the likelihood in unnormalised form:
$$p_\theta(x) = \frac{\tilde{p}_\theta(x)}{C},$$
where $C > 0$ is unknown and $\tilde{p}_\theta$ can be evaluated pointwise. Examples include models of natural images, large graphical models, deep energy models, etc.
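To make the setting concrete, here is a minimal sketch of such a model (my illustration, not from the talk; the toy energy function and the names `energy` and `log_p_tilde` are hypothetical): the unnormalised log-density is computable pointwise, but the likelihood itself is not, because $C$ is an intractable integral.

```python
import numpy as np

def energy(x, theta):
    """Energy of a toy 1D model; theta = (mu1, mu2).
    The normalised density is exp(-energy(x)) / C for an unknown C."""
    mu1, mu2 = theta
    # Soft-min of two quadratic wells: a mixture-like energy landscape.
    return -np.logaddexp(-0.5 * (x - mu1) ** 2, -0.5 * (x - mu2) ** 2)

def log_p_tilde(x, theta):
    """Unnormalised log-density log p_tilde(x). Evaluating it needs no
    normalising constant, but log p(x) = log_p_tilde(x) - log C does."""
    return -energy(x, theta)

x = np.linspace(-5.0, 5.0, 11)
print(log_p_tilde(x, theta=(-2.0, 2.0)))  # fine pointwise: no C needed
```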
Minimum Discrepancy Estimators

Let $D$ be a function such that $D(Q \| P_\theta) \geq 0$ measures the discrepancy between the empirical distribution $Q$ and $P_\theta$. We say that $\hat{\theta}_n \in \Theta$ is a minimum discrepancy estimator if
$$\hat{\theta}_n \in \operatorname{argmin}_{\theta \in \Theta} D(Q_n \| P_\theta).$$

This includes, but is not limited to:
1. KL divergence or other Bregman divergences
2. Wasserstein distance or Sinkhorn divergence
3. Maximum mean discrepancy
4. ...

Question: Which discrepancy should we use for unnormalised models?
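Before turning to that question, here is a minimal sketch of the generic recipe above (my example, not from the talk): fitting the location of a Gaussian by minimising a V-statistic estimate of the squared maximum mean discrepancy between the data and samples from the model. All names and the fixed-seed resampling trick are assumptions for illustration.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
data = rng.normal(loc=1.5, scale=1.0, size=200)  # draws from Q_n

def mmd2(x, y, h=1.0):
    """V-statistic estimate of squared MMD with a Gaussian kernel."""
    k = lambda a, b: np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * h ** 2))
    return k(x, x).mean() - 2.0 * k(x, y).mean() + k(y, y).mean()

def objective(theta):
    # Samples from P_theta = N(theta, 1); a fixed seed keeps the
    # objective deterministic in theta for the optimiser.
    model = np.random.default_rng(1).normal(loc=theta, scale=1.0, size=200)
    return mmd2(data, model)

theta_hat = minimize_scalar(objective, bounds=(-5.0, 5.0), method="bounded").x
print(theta_hat)  # close to the true location 1.5
```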
Score Matching Estimators

The score matching estimator [Hyvärinen, 2005] is based on the Fisher divergence:
$$\mathrm{SM}(Q \| P_\theta) := \int_{\mathcal{X}} \|\nabla \log q(x) - \nabla \log p_\theta(x)\|_2^2 \, Q(dx) = \int_{\mathcal{X}} \left( \|\nabla \log p_\theta(x)\|_2^2 + 2 \Delta \log p_\theta(x) \right) Q(dx) + Z,$$
where $Z \in \mathbb{R}$ is independent of $\theta$. The second expression follows by integration by parts and, crucially, involves neither the normalising constant of $p_\theta$ nor the unknown data score $\nabla \log q$.

This is one of the most competitive methods to date, with applications to inference for natural images, deep energy models and directional statistics.

Several failure modes: this approach requires second-order derivatives and struggles with heavy-tailed data [Swersky et al., 2011].
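A minimal sketch of score matching in practice (my example, not from the talk): for $p_\theta = N(\mu, \sigma^2)$, the score is $-(x-\mu)/\sigma^2$ and the Laplacian of the log-density is $-1/\sigma^2$, so the empirical objective above can be minimised directly, recovering the sample mean and standard deviation without ever touching a normalising constant.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=1000)  # observed data

def sm_objective(theta):
    """Empirical score matching objective for N(mu, sigma^2):
    grad log p(x) = -(x - mu)/sigma^2,  Laplacian log p(x) = -1/sigma^2."""
    mu, log_sigma = theta
    sigma2 = np.exp(2.0 * log_sigma)  # parameterise sigma > 0 via its log
    score = -(x - mu) / sigma2
    laplacian = -1.0 / sigma2
    return np.mean(score ** 2 + 2.0 * laplacian)

res = minimize(sm_objective, x0=np.array([0.0, 0.0]))
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])
print(mu_hat, sigma_hat)  # approx. the sample mean and std of x
```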
Minimum Stein Discrepancy Estimators

Let $\Gamma(\mathcal{Y}) := \{f : \mathcal{X} \to \mathcal{Y}\}$. A function class $\mathcal{G} \subset \Gamma(\mathbb{R}^d)$ is a Stein class, with corresponding Stein operator $\mathcal{S}_{P_\theta} : \mathcal{G} \subset \Gamma(\mathbb{R}^d) \to \Gamma(\mathbb{R}^d)$, if
$$\int_{\mathcal{X}} \mathcal{S}_{P_\theta}[f] \, dP_\theta = 0 \quad \forall f \in \mathcal{G}.$$

This leads to the notion of Stein discrepancy (SD) [Gorham & Mackey, 2015]:
$$\mathrm{SD}_{\mathcal{S}_{P_\theta}[\mathcal{G}]}(Q \| P_\theta) := \sup_{f \in \mathcal{S}_{P_\theta}[\mathcal{G}]} \left| \int_{\mathcal{X}} f \, dP_\theta - \int_{\mathcal{X}} f \, dQ \right| = \sup_{g \in \mathcal{G}} \left| \int_{\mathcal{X}} \mathcal{S}_{P_\theta}[g] \, dQ \right|,$$
on which we base our minimum Stein discrepancy estimators:
$$\hat{\theta}_n \in \operatorname{argmin}_{\theta \in \Theta} \mathrm{SD}_{\mathcal{S}_{P_\theta}[\mathcal{G}]}(Q_n \| P_\theta).$$
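When $\mathcal{G}$ is the unit ball of a reproducing kernel Hilbert space, the supremum has a closed form, giving the kernel Stein discrepancy of Chwialkowski et al. (2016) and Liu et al. (2016), a connection the slide does not spell out. Below is a minimal 1D sketch of that special case (my example, not from the talk), assuming a Gaussian model $P = N(\mu, \sigma^2)$ and a Gaussian kernel.

```python
import numpy as np

def ksd2_gauss(x, mu, sigma, h=1.0):
    """V-statistic estimate of the squared kernel Stein discrepancy
    between samples x and P = N(mu, sigma^2) in 1D, Gaussian kernel.
    Uses the Langevin Stein kernel
      u_p(a, b) = s(a)s(b)k + s(a) d_b k + s(b) d_a k + d_a d_b k,
    where s = grad log p is the score of p."""
    s = -(x - mu) / sigma ** 2                    # score at each sample
    d = x[:, None] - x[None, :]                   # pairwise differences a - b
    k = np.exp(-d ** 2 / (2.0 * h ** 2))          # kernel matrix k(a, b)
    dk_da = -d / h ** 2 * k                       # d/da k(a, b)
    dk_db = d / h ** 2 * k                        # d/db k(a, b)
    d2k = (1.0 / h ** 2 - d ** 2 / h ** 4) * k    # d^2/(da db) k(a, b)
    u = (s[:, None] * s[None, :] * k + s[:, None] * dk_db
         + s[None, :] * dk_da + d2k)
    return u.mean()

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=500)
print(ksd2_gauss(x, mu=0.0, sigma=1.0))  # near zero: model fits the data
print(ksd2_gauss(x, mu=2.0, sigma=1.0))  # larger: model misfits the data
```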
Score Matching Estimators are Minimum Stein Discrepancy Estimators

Consider the Stein operator $\mathcal{S}^m_{p_\theta}[g] := \frac{1}{p_\theta} \nabla \cdot (p_\theta \, g)$ and the Stein class
$$\mathcal{G} = \left\{ g = (g_1, \ldots, g_d) \in C^1(\mathcal{X}, \mathbb{R}^d) \cap L^2(\mathcal{X}; Q) : \|g\|_{L^2(\mathcal{X}; Q)} \leq 1 \right\}.$$
In this case, the Stein discrepancy is the score matching divergence:
$$\mathrm{SD}_{\mathcal{S}_{P_\theta}[\mathcal{G}]}(Q \| P_\theta) = \mathrm{SM}(Q \| P_\theta).$$

Our paper also shows that several other popular estimators for unnormalised models, including contrastive divergence and minimum probability flow, are minimum SD estimators.
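As a sanity check on the operator above (my sketch, not from the paper): in 1D it reduces to $\mathcal{S}_p[g](x) = g'(x) + g(x)\,\nabla \log p(x)$, since $\frac{1}{p}(p g)' = g' + g\, p'/p$. Its expectation under $P$ should vanish for suitably decaying $g$, which is easy to confirm by Monte Carlo for a standard Gaussian.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=1_000_000)  # samples from P = N(0, 1)

# A test function g and its derivative; g decays fast enough at infinity
# for the Stein identity (integration by parts) to hold.
g = lambda t: np.sin(t) * np.exp(-t ** 2 / 4)
g_prime = lambda t: (np.cos(t) - t / 2 * np.sin(t)) * np.exp(-t ** 2 / 4)

score = -x                         # grad log p for a standard Gaussian
stein = g_prime(x) + g(x) * score  # 1D operator S_p[g] = (1/p) d/dx (p g)
print(stein.mean())                # approx. 0, confirming E_P[S_p[g]] = 0
```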