Studying Model Asymptotics with Singular Learning Theory


  1. Studying Model Asymptotics with Singular Learning Theory. Shaowei Lin (UC Berkeley), shaowei@math.berkeley.edu. Joint work with Russell Steele (McGill). 13 July 2012, MMDS 2012, Stanford University, Workshop on Algorithms for Modern Massive Data Sets.

  2. Outline: Sparsity Penalties (Regression, BIC), Integral Asymptotics, Singular Learning, RLCTs. This section: Sparsity Penalties.

  3. Linear Regression.
     • Model: Y = ω · X + ε, with ω, X ∈ R^d, Y ∈ R, ε ∼ N(0, 1).
     • Data: (Y_1, X_1), ..., (Y_N, X_N).
     • Least squares: min_ω Σ_{i=1}^N |Y_i − ω · X_i|².
     • Penalized regression: min_ω Σ_{i=1}^N |Y_i − ω · X_i|² + π(ω).
     • LASSO: π(ω) = β · |ω|_1. Bayesian Information Criterion (BIC): π(ω) = |ω|_0 · log N.
     The parameter space is partitioned into regions (submodels).
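As a concrete illustration of the penalties above, here is a minimal sketch (not from the talk; the synthetic data, dimensions, and sparse true coefficient vector are assumptions) that fits the linear model by least squares on each support and scores each submodel with the ℓ0-style penalty |ω|_0 · log N from the slide.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
N, d = 200, 5
X = rng.normal(size=(N, d))
omega_true = np.array([1.5, 0.0, -2.0, 0.0, 0.0])   # assumed sparse truth for the demo
Y = X @ omega_true + rng.normal(size=N)

def penalized_score(support):
    """Residual sum of squares plus the BIC-style penalty |omega|_0 * log N."""
    if not support:
        return np.sum(Y ** 2)
    Xs = X[:, support]
    coef, *_ = np.linalg.lstsq(Xs, Y, rcond=None)
    rss = np.sum((Y - Xs @ coef) ** 2)
    return rss + len(support) * np.log(N)

# The parameter space is partitioned into 2^d submodels, one per support.
supports = [list(s) for k in range(d + 1) for s in itertools.combinations(range(d), k)]
best = min(supports, key=penalized_score)
print("support selected by the l0 / log N penalty:", best)
```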

  4. Bayesian Information Criterion.
     • Given a region Ω of parameters and a prior ϕ(ω) dω on Ω, the marginal likelihood of the data is proportional to
         Z_N = ∫_Ω e^{−N f(ω)} ϕ(ω) dω,
       where f(ω) = (1/(2N)) Σ_{i=1}^N |Y_i − ω · X_i|².
     • Laplace approximation: asymptotically as the sample size N → ∞,
         −log Z_N ≈ N f(ω*) + (d/2) log N + O(1),
       where ω* = argmin_{ω ∈ Ω} f(ω) and d = dim Ω.
     • Studying model asymptotics allows us to derive the BIC. But the Laplace approximation only works when the model is regular. Many models in machine learning are singular, e.g. mixtures, neural networks, hidden variables.
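To see the Laplace approximation at work, the following sketch (illustrative, not part of the slides; the one-dimensional model, the uniform prior on [−5, 5], and the sample sizes are assumptions) compares −log Z_N computed by numerical integration with N f(ω*) + (d/2) log N for a regular regression model with d = 1. The gap should stay O(1) as N grows.

```python
import numpy as np
from scipy.integrate import quad

rng = np.random.default_rng(1)

def compare(N):
    # One-dimensional regular model: Y = omega * X + noise, uniform prior on [-5, 5].
    X = rng.normal(size=N)
    Y = 2.0 * X + rng.normal(size=N)
    f = lambda w: 0.5 / N * np.sum((Y - w * X) ** 2)        # f(omega) from the slide
    w_star = np.sum(X * Y) / np.sum(X * X)                  # argmin of f (least squares)
    f_star = f(w_star)
    # Integrate e^{-N (f - f*)} against the prior to avoid underflow, then restore N f*.
    Z_shifted, _ = quad(lambda w: np.exp(-N * (f(w) - f_star)) / 10.0,
                        -5.0, 5.0, points=[w_star])
    exact = N * f_star - np.log(Z_shifted)                   # -log Z_N
    laplace = N * f_star + 0.5 * np.log(N)                   # N f(w*) + (d/2) log N, d = 1
    return exact, laplace

for N in [50, 200, 1000, 5000]:
    exact, laplace = compare(N)
    print(N, round(exact, 2), round(laplace, 2), "gap:", round(exact - laplace, 2))
```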

  5. Outline: Sparsity Penalties, Integral Asymptotics (Estimation, RLCT, Geometry, Desingularization, Algorithm), Singular Learning, RLCTs. This section: Integral Asymptotics.

  6. Estimating Integrals. Generally, there are three ways to estimate statistical integrals.
     1. Exact methods: compute a closed-form formula for the integral, e.g. (Lin · Sturmfels · Xu, 2009).
     2. Numerical methods: approximate using Markov chain Monte Carlo (MCMC) and other sampling techniques.
     3. Asymptotic methods: analyze how the integral behaves for large samples.

  7. Real Log Canonical Threshold. Asymptotic theory (Arnol'd · Guseĭn-Zade · Varchenko, 1985) states that for a Laplace integral,
       Z(N) = ∫_Ω e^{−N f(ω)} ϕ(ω) dω ≈ e^{−N f*} · C N^{−λ} (log N)^{θ−1}
     asymptotically as N → ∞, for some positive constants C, λ, θ, where f* = min_{ω ∈ Ω} f(ω). The pair (λ, θ) is the real log canonical threshold of f(ω) with respect to the measure ϕ(ω) dω.
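A quick numerical sanity check of this asymptotic (illustrative only; f(ω) = ω⁴, Ω = [−1, 1], and ϕ ≡ 1 are assumptions, chosen so that the RLCT is known to be (λ, θ) = (1/4, 1)): the slope of −log Z(N) against log N should approach λ.

```python
import numpy as np
from scipy.integrate import quad

def Z(N):
    # Z(N) = integral over [-1, 1] of exp(-N * w^4) dw; here f* = 0 and theta = 1.
    val, _ = quad(lambda w: np.exp(-N * w ** 4), -1.0, 1.0)
    return val

Ns = np.array([1e2, 1e3, 1e4, 1e5, 1e6])
neg_logZ = np.array([-np.log(Z(N)) for N in Ns])
slope = np.polyfit(np.log(Ns), neg_logZ, 1)[0]
print("estimated lambda:", slope)   # should be close to 1/4
```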

  8. Geometry of the Integral.
       Z(N) = ∫_Ω e^{−N f(ω)} ϕ(ω) dω ≈ e^{−N f*} · C N^{−λ} (log N)^{θ−1}
     The integral asymptotics depend on the minimum locus of the exponent f(ω).
     [Figure: plots of the integrand e^{−N f(x,y)} for N = 1 and N = 10, for f(x, y) = x² + y², f(x, y) = (xy)², and f(x, y) = (y² − x³)².]

  9. Desingularizations. Let Ω ⊂ R^d and let f: Ω → R be a real analytic function. We say ρ: U → Ω desingularizes f if
     1. U is a d-dimensional real analytic manifold covered by coordinate patches U_1, ..., U_s (≃ subsets of R^d);
     2. ρ is a proper real analytic map that is an isomorphism onto the subset {ω ∈ Ω : f(ω) ≠ 0};
     3. for each restriction ρ: U_i → Ω,
          f ∘ ρ(μ) = a(μ) μ^κ,   det ∂ρ(μ) = b(μ) μ^τ,
        where a(μ) and b(μ) are nonzero on U_i.
     Hironaka (1964) proved that desingularizations always exist.
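As a small worked example (a standard blow-up chart, assumed here rather than taken from the slides), the following sympy sketch checks condition 3 for f(x, y) = x² + y² and the chart ρ(u, v) = (u, uv): the pullback factors as a(μ) μ^κ and the Jacobian as b(μ) μ^τ.

```python
import sympy as sp

u, v, x, y = sp.symbols('u v x y', real=True)

f = x**2 + y**2                      # regular minimum at the origin
rho = {x: u, y: u * v}               # one chart of the blow-up of the origin

f_pulled = sp.factor(f.subs(rho))    # u**2 * (v**2 + 1): kappa = (2, 0), a = v**2 + 1 (nonzero)
J = sp.Matrix([[sp.diff(comp, s) for s in (u, v)] for comp in (u, u * v)])
jac = sp.simplify(J.det())           # u: tau = (1, 0), b = 1

print(f_pulled, jac)
# The AGV formula on the next slide then gives lambda = (1 + 1)/2 = 1 and theta = 1
# on this chart, i.e. d/2 for this regular two-dimensional example.
```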

  10. Algorithm for Computing RLCTs.
     • We know how to find RLCTs of monomial functions (AGV, 1985):
         ∫_Ω e^{−N ω_1^{κ_1} ··· ω_d^{κ_d}} ω_1^{τ_1} ··· ω_d^{τ_d} dω ≈ C N^{−λ} (log N)^{θ−1},
       where λ = min_i (τ_i + 1)/κ_i and θ = |{i : (τ_i + 1)/κ_i = λ}|.
     • To compute the RLCT of any function f(ω):
       1. Find the minimum f* of f over Ω.
       2. Find a desingularization ρ for f − f*.
       3. Use the AGV theorem to find (λ_i, θ_i) on each patch U_i.
       4. Set λ = min{λ_i}, θ = max{θ_i : λ_i = λ}.
     • The difficult part is finding a desingularization, e.g. (Bravo · Encinas · Villamayor, 2005).
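The monomial case is simple enough to code directly. Below is a minimal sketch (not from the talk) of step 3: computing (λ, θ) for a monomial ω^κ with respect to the measure ω^τ dω via the AGV formula above.

```python
from fractions import Fraction

def monomial_rlct(kappa, tau):
    """RLCT (lambda, theta) of the monomial w^kappa w.r.t. w^tau dw (AGV, 1985).

    lambda = min_i (tau_i + 1) / kappa_i over coordinates with kappa_i > 0,
    theta  = number of indices attaining that minimum.
    """
    ratios = [Fraction(t + 1, k) for k, t in zip(kappa, tau) if k > 0]
    lam = min(ratios)
    theta = sum(r == lam for r in ratios)
    return lam, theta

# f = (xy)^2 = x^2 y^2 from slide 8, with the measure dx dy:
print(monomial_rlct([2, 2], [0, 0]))   # (1/2, 2)
# The blow-up chart of x^2 + y^2 from the sketch after slide 9:
print(monomial_rlct([2, 0], [1, 0]))   # (1, 1)
```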

  11. Outline: Sparsity Penalties, Integral Asymptotics, Singular Learning (Sumio Watanabe, Bayesian Statistics, Standard Form, Learning Coefficient, Geometry, AIC and DIC), RLCTs. This section: Singular Learning Theory.

  12. Sumio Watanabe. [Photos: Sumio Watanabe, Heisuke Hironaka.] In 1998, Sumio Watanabe discovered how to study the asymptotic behavior of singular models. His insight was to use a deep result in algebraic geometry known as Hironaka's Resolution of Singularities. Heisuke Hironaka proved this celebrated result in 1964; the accomplishment won him the Fields Medal in 1970.

  13. Bayesian Statistics.
     • X: random variable with state space X (e.g. {1, 2, ..., k}, R^k).
     • Δ: space of probability distributions on X.
     • M ⊂ Δ: statistical model, the image of p: Ω → Δ.
     • Ω: parameter space.
     • p(x | ω) dx: distribution at ω ∈ Ω.
     • ϕ(ω) dω: prior distribution on Ω.
     Suppose samples X_1, ..., X_N are drawn from a true distribution q ∈ M.
     • Marginal likelihood: Z_N = ∫_Ω Π_{i=1}^N p(X_i | ω) ϕ(ω) dω.
     • Kullback-Leibler function: K(ω) = ∫_X q(x) log (q(x) / p(x | ω)) dx.
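For a concrete feel for Z_N, here is a toy sketch (an assumed Bernoulli example with a uniform prior, not from the slides) that estimates the marginal likelihood by naive Monte Carlo over the prior and compares it to the exact conjugate answer.

```python
import numpy as np
from scipy.special import betaln

rng = np.random.default_rng(2)

# Toy instance: state space X = {0, 1}, p(x | w) = w^x (1 - w)^(1 - x),
# uniform prior phi(w) = 1 on Omega = [0, 1], true parameter 0.7 (all assumed).
N = 30
data = rng.binomial(1, 0.7, size=N)
k = int(data.sum())

# Naive Monte Carlo: Z_N = E_{w ~ phi}[ prod_i p(X_i | w) ].
w = rng.uniform(size=100_000)
Z_mc = np.mean(w ** k * (1 - w) ** (N - k))

# Exact marginal likelihood for this conjugate pair: Beta(k + 1, N - k + 1).
Z_exact = np.exp(betaln(k + 1, N - k + 1))

print(Z_mc, Z_exact)
```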

  14. Standard Form of Log Likelihood Ratio. Define the log likelihood ratio
       K_N(ω) = (1/N) Σ_{i=1}^N log (q(X_i) / p(X_i | ω)).
     Note that its expectation is K(ω).
     Standard Form of the Log Likelihood Ratio (Watanabe): if ρ: U → Ω desingularizes K(ω), then on each patch U_i,
       K_N ∘ ρ(μ) = μ^{2κ} − (1/√N) μ^κ ξ_N(μ),
     where ξ_N(μ) converges in law to a Gaussian process on U. For regular models, this is a Central Limit Theorem.

  15. Learning Coefficient. Define the empirical entropy S_N = −(1/N) Σ_{i=1}^N log q(X_i).
     Convergence of stochastic complexity (Watanabe): the stochastic complexity has the asymptotic expansion
       −log Z_N = N S_N + λ_q log N − (θ_q − 1) log log N + O_p(1),
     where λ_q, θ_q describe the asymptotics of the deterministic integral
       Z(N) = ∫_Ω e^{−N K(ω)} ϕ(ω) dω ≈ C N^{−λ_q} (log N)^{θ_q − 1}.
     For regular models, this is the Bayesian Information Criterion. Various names for (λ_q, θ_q): in statistics, the learning coefficient of the model M at q; in algebraic geometry, the real log canonical threshold of K(ω).
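To connect this with slide 4, here is the regular case spelled out (a standard consequence; the identification λ_q = d/2, θ_q = 1 for regular models is assumed rather than shown on this slide): the log log N term drops out and the expansion becomes

```latex
-\log Z_N \;=\; N S_N \;+\; \frac{d}{2}\log N \;+\; O_p(1),
```

which is the Bayesian Information Criterion, since for a regular model N S_N agrees with the minimized negative log likelihood up to O_p(1).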

  16. Geometry of Singular Models. [Figures only on this slide.]

  17. AIC and DIC.
     • Bayes generalization error B_N: the Kullback-Leibler distance from the true distribution q(x) to the predictive distribution p(x | D).
     • Asymptotically, B_N is equivalent to:
       - Akaike Information Criterion for regular models: AIC = −Σ_{i=1}^N log p(X_i | ω*) + d.
       - Akaike Information Criterion for singular models: AIC = −Σ_{i=1}^N log p(X_i | ω*) + 2 · (singular fluctuation).
     • Numerically, B_N can be estimated using MCMC methods:
       - Deviance Information Criterion for regular models: DIC = E_X[log p(X | E_ω[ω])] − 2 E_ω[E_X[log p(X | ω)]].
       - Widely Applicable Information Criterion for singular models: WAIC = E_X[log E_ω[p(X | ω)]] − 2 E_ω[E_X[log p(X | ω)]].
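As a usage sketch of the last two formulas (illustrative; the Gaussian-mean model, its conjugate posterior used in place of an actual MCMC run, and the sample sizes are all assumptions), reading E_X as the empirical average over the observed sample and E_ω as an average over posterior draws:

```python
import numpy as np

rng = np.random.default_rng(3)

# Assumed toy model: X_i ~ N(mu, 1), prior mu ~ N(0, 100). The posterior is Gaussian,
# so we draw posterior samples directly instead of running an MCMC chain.
N = 100
data = rng.normal(0.5, 1.0, size=N)
post_var = 1.0 / (N + 1.0 / 100)
post_mean = post_var * data.sum()
mu_draws = rng.normal(post_mean, np.sqrt(post_var), size=4000)

# Pointwise log likelihoods: rows = posterior draws, columns = data points.
log_p = -0.5 * np.log(2 * np.pi) - 0.5 * (data[None, :] - mu_draws[:, None]) ** 2

E_w_E_X_log = np.mean(log_p)                                     # E_w[ E_X[ log p(X|w) ] ]
E_X_log_E_w = np.mean(np.log(np.mean(np.exp(log_p), axis=0)))    # E_X[ log E_w[ p(X|w) ] ]
E_X_log_plugin = np.mean(-0.5 * np.log(2 * np.pi)
                         - 0.5 * (data - post_mean) ** 2)        # E_X[ log p(X | E_w[w]) ]

dic = E_X_log_plugin - 2 * E_w_E_X_log
waic = E_X_log_E_w - 2 * E_w_E_X_log
print("DIC:", dic, "WAIC:", waic)
```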

  18. Outline: Sparsity Penalties, Integral Asymptotics, Singular Learning, RLCTs (Sparsity Penalty, Newton Polyhedra, Upper Bounds). This section: Real Log Canonical Thresholds.
