Studying Model Asymptotics with Singular Learning Theory
Shaowei Lin (UC Berkeley), shaowei@math.berkeley.edu
Joint work with Russell Steele (McGill)
13 July 2012, MMDS 2012, Stanford University
Workshop on Algorithms for Modern Massive Data Sets
Sparsity Penalties
Linear Regression

Model: ω, X ∈ R^d, Y ∈ R, Y = ω·X + ε, ε ~ N(0, 1)
Data: (Y_1, X_1), ..., (Y_N, X_N)
Least squares: min_ω Σ_{i=1}^N |Y_i − ω·X_i|^2
Penalized regression: min_ω Σ_{i=1}^N |Y_i − ω·X_i|^2 + π(ω)
LASSO: π(ω) = |ω|_1 · β
Bayesian Information Criterion (BIC): π(ω) = |ω|_0 · log N

The parameter space is partitioned into regions (submodels).
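To make the two penalties concrete, here is a minimal numpy sketch (my own illustration, not from the talk); the toy data, variable names, and the choice β = 1 are assumptions.

```python
import numpy as np

def lasso_objective(w, X, Y, beta):
    """Penalized least squares with an L1 (LASSO) penalty pi(w) = |w|_1 * beta."""
    return np.sum((Y - X @ w) ** 2) + beta * np.sum(np.abs(w))

def bic_objective(w, X, Y, N):
    """Penalized least squares with an L0 (BIC-style) penalty pi(w) = |w|_0 * log N."""
    return np.sum((Y - X @ w) ** 2) + np.count_nonzero(w) * np.log(N)

# Toy data: d = 5 features, only the first two are active in the true model.
rng = np.random.default_rng(0)
N, d = 100, 5
X = rng.normal(size=(N, d))
w_true = np.array([2.0, -1.0, 0.0, 0.0, 0.0])
Y = X @ w_true + rng.normal(size=N)

w_ls = np.linalg.lstsq(X, Y, rcond=None)[0]   # unpenalized least-squares fit
print(lasso_objective(w_ls, X, Y, beta=1.0))
print(bic_objective(w_ls, X, Y, N))
```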
Bayesian Information Criterion

• Given a region Ω of parameters and a prior ϕ(ω) dω on Ω, the marginal likelihood of the data is proportional to
  Z_N = ∫_Ω e^{−N f(ω)} ϕ(ω) dω
  where f(ω) = (1/2N) Σ_{i=1}^N |Y_i − ω·X_i|^2.
• Laplace approximation: asymptotically as the sample size N → ∞,
  −log Z_N ≈ N f(ω*) + (d/2) log N + O(1)
  where ω* = argmin_{ω ∈ Ω} f(ω) and d = dim Ω.
• Studying model asymptotics allows us to derive the BIC. But the Laplace approximation only works when the model is regular. Many models in machine learning are singular, e.g. mixtures, neural networks, hidden-variable models.
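A quick numerical check of the Laplace/BIC approximation. This is a minimal scipy sketch (my own, not from the slides), assuming a one-dimensional regular model with a uniform prior on [−3, 3]; the specific data and prior range are illustrative.

```python
import numpy as np
from scipy.integrate import quad

rng = np.random.default_rng(1)
N = 500
X = rng.normal(size=N)
Y = 0.7 * X + rng.normal(size=N)            # true weight 0.7, regular 1-D model

def f(w):
    """f(w) = (1/2N) * sum_i |Y_i - w*X_i|^2, the scaled negative log-likelihood."""
    return np.sum((Y - w * X) ** 2) / (2 * N)

w_star = np.sum(X * Y) / np.sum(X ** 2)     # least-squares minimizer of f

# Marginal likelihood Z_N with a uniform prior (density 1/6) on [-3, 3];
# the peak is very narrow, so tell quad where it is.
Z_N, _ = quad(lambda w: np.exp(-N * f(w)) / 6.0, -3, 3, points=[w_star])

exact = -np.log(Z_N)
laplace = N * f(w_star) + 0.5 * np.log(N)   # N f(w*) + (d/2) log N with d = 1
print(exact, laplace)                        # should agree up to an O(1) constant
```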
Integral Asymptotics
Estimating Integrals

Generally, there are three ways to estimate statistical integrals.
1. Exact methods: compute a closed-form formula for the integral, e.g. (Lin · Sturmfels · Xu, 2009).
2. Numerical methods: approximate using Markov chain Monte Carlo (MCMC) and other sampling techniques.
3. Asymptotic methods: analyze how the integral behaves for large samples.
Real Log Canonical Threshold

• Asymptotic theory (Arnol'd · Guseĭn-Zade · Varchenko, 1985) states that for a Laplace integral,
  Z(N) = ∫_Ω e^{−N f(ω)} ϕ(ω) dω ≈ e^{−N f*} · C N^{−λ} (log N)^{θ−1}
  asymptotically as N → ∞, for some positive constants C, λ, θ and where f* = min_{ω ∈ Ω} f(ω).
• The pair (λ, θ) is the real log canonical threshold of f(ω) with respect to the measure ϕ(ω) dω.
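The scaling can be checked numerically. Below is a small scipy sketch (my own illustration, not from the slides) for f(x, y) = (xy)^2 with the uniform measure on the unit square; by the monomial formula on the algorithm slide its RLCT is (λ, θ) = (1/2, 2), so Z(N)·√N / log N should level off (slowly) at a constant.

```python
import numpy as np
from scipy.integrate import dblquad

# f(x, y) = (x*y)^2 on the unit square, so f* = 0 and the theory predicts
# Z(N) ≈ C * N^(-1/2) * log N, i.e. (lambda, theta) = (1/2, 2).
def Z(N):
    val, _ = dblquad(lambda y, x: np.exp(-N * (x * y) ** 2), 0, 1, 0, 1)
    return val

for N in [1e2, 1e3, 1e4, 1e5]:
    # If the asymptotics hold, this ratio approaches the constant C.
    print(int(N), Z(N) * np.sqrt(N) / np.log(N))
```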
Geometry of the Integral

Z(N) = ∫_Ω e^{−N f(ω)} ϕ(ω) dω ≈ e^{−N f*} · C N^{−λ} (log N)^{θ−1}

The integral asymptotics depend on the minimum locus of the exponent f(ω).

[Figure: plots of the integrand e^{−N f(x,y)} for N = 1 and N = 10, for f(x, y) = x^2 + y^2, f(x, y) = (xy)^2, and f(x, y) = (y^2 − x^3)^2.]
Desingularizations

Let Ω ⊂ R^d and let f : Ω → R be a real analytic function. We say ρ : U → Ω desingularizes f if
1. U is a d-dimensional real analytic manifold covered by coordinate patches U_1, ..., U_s (≃ subsets of R^d);
2. ρ is a proper real analytic map that is an isomorphism onto the subset {ω ∈ Ω : f(ω) ≠ 0};
3. for each restriction ρ : U_i → Ω,
   f ∘ ρ(µ) = a(µ) µ^κ,   det ∂ρ(µ) = b(µ) µ^τ
   where a(µ) and b(µ) are nonzero on U_i.

• Hironaka (1964) proved that desingularizations always exist.
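As a tiny worked example (my own, not from the slides): on the blow-up chart ρ(u, v) = (u, uv) at the origin, f(x, y) = x^2 + y^2 takes the monomial-times-unit form required by condition 3; the second chart (uv, v) is handled symmetrically. A sympy sketch:

```python
import sympy as sp

u, v = sp.symbols('u v', real=True)

# Blow-up chart at the origin: rho(u, v) = (u, u*v).
x, y = u, u * v
f = x**2 + y**2

# On this chart f ∘ rho(u, v) = u^2 * (1 + v^2): the monomial u^2 times a unit a(mu).
print(sp.factor(f))                        # u**2*(v**2 + 1)

# Jacobian determinant of rho is u, i.e. b(mu) = 1 and tau = (1, 0).
J = sp.Matrix([[sp.diff(x, u), sp.diff(x, v)],
               [sp.diff(y, u), sp.diff(y, v)]])
print(sp.simplify(J.det()))                # u
```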
Algorithm for Computing RLCTs

• We know how to find RLCTs of monomial functions (AGV, 1985):
  ∫_Ω e^{−N ω_1^{κ_1} ··· ω_d^{κ_d}} ω_1^{τ_1} ··· ω_d^{τ_d} dω ≈ C N^{−λ} (log N)^{θ−1}
  where λ = min_i (τ_i + 1)/κ_i and θ = |{i : (τ_i + 1)/κ_i = λ}|.
• To compute the RLCT of any function f(ω):
  1. Find the minimum f* of f over Ω.
  2. Find a desingularization ρ for f − f*.
  3. Use the AGV theorem to find (λ_i, θ_i) on each patch U_i.
  4. Then λ = min{λ_i} and θ = max{θ_i : λ_i = λ}.
• The difficult part is finding a desingularization, e.g. (Bravo · Encinas · Villamayor, 2005).
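Steps 3–4 are mechanical once a desingularization is known. Here is a small Python helper for the monomial formula above (an illustration, assuming every exponent κ_i ≥ 1).

```python
from fractions import Fraction

def monomial_rlct(kappa, tau):
    """RLCT (lambda, theta) of the monomial w^kappa with respect to w^tau dw:
    lambda = min_i (tau_i + 1)/kappa_i, theta = number of indices attaining it.
    Assumes every kappa_i >= 1."""
    ratios = [Fraction(t + 1, k) for k, t in zip(kappa, tau)]
    lam = min(ratios)
    theta = sum(r == lam for r in ratios)
    return lam, theta

# f(x, y) = (x*y)^2 = x^2 y^2 with Lebesgue measure (tau = (0, 0)):
print(monomial_rlct([2, 2], [0, 0]))   # (Fraction(1, 2), 2)
```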
Singular Learning Theory
Sumio Watanabe

[Photos: Sumio Watanabe and Heisuke Hironaka]

In 1998, Sumio Watanabe discovered how to study the asymptotic behavior of singular models. His insight was to use a deep result in algebraic geometry known as Hironaka's Resolution of Singularities. Heisuke Hironaka proved this celebrated result in 1964; the accomplishment won him the Fields Medal in 1970.
Bayesian Statistics

X: random variable with state space X (e.g. {1, 2, ..., k}, R^k)
∆: space of probability distributions on X
M ⊂ ∆: statistical model, the image of p : Ω → ∆
Ω: parameter space
p(x|ω) dx: distribution at ω ∈ Ω
ϕ(ω) dω: prior distribution on Ω

Suppose samples X_1, ..., X_N are drawn from a true distribution q ∈ M.

Marginal likelihood: Z_N = ∫_Ω ∏_{i=1}^N p(X_i|ω) ϕ(ω) dω.
Kullback-Leibler function: K(ω) = ∫_X q(x) log (q(x)/p(x|ω)) dx.
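For concreteness, here is a small scipy sketch (my own toy example, not from the slides) of the Kullback-Leibler function for the singular model p(x|ω) = N(ω², 1) with truth q = N(0, 1); for Gaussians with unit variance, K(ω) reduces to ω⁴/2.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

# Toy singular model: p(x|w) = Normal(w^2, 1); the truth is q = Normal(0, 1).
def K(w):
    """Kullback-Leibler function K(w) = ∫ q(x) log(q(x)/p(x|w)) dx."""
    integrand = lambda x: norm.pdf(x) * (norm.logpdf(x) - norm.logpdf(x, loc=w**2))
    val, _ = quad(integrand, -10, 10)
    return val

for w in [0.0, 0.5, 1.0]:
    print(w, K(w), w**4 / 2)   # numerical value vs. the closed form w^4 / 2
```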
Standard Form of Log Likelihood Ratio

Define the log likelihood ratio
  K_N(ω) = (1/N) Σ_{i=1}^N log (q(X_i)/p(X_i|ω)).
Note that its expectation is K(ω).

Standard Form of Log Likelihood Ratio (Watanabe)
If ρ : U → Ω desingularizes K(ω), then on each patch U_i,
  K_N ∘ ρ(µ) = µ^{2κ} − (1/√N) µ^κ ξ_N(µ)
where ξ_N(µ) converges in law to a Gaussian process on U.

For regular models, this is a Central Limit Theorem.
Learning Coefficient

Define the empirical entropy S_N = −(1/N) Σ_{i=1}^N log q(X_i).

Convergence of stochastic complexity (Watanabe)
The stochastic complexity has the asymptotic expansion
  −log Z_N = N S_N + λ_q log N − (θ_q − 1) log log N + O_p(1)
where λ_q, θ_q describe the asymptotics of the deterministic integral
  Z(N) = ∫_Ω e^{−N K(ω)} ϕ(ω) dω ≈ C N^{−λ_q} (log N)^{θ_q − 1}.

For regular models, this is the Bayesian Information Criterion.

Various names for (λ_q, θ_q):
  statistics: learning coefficient of the model M at q
  algebraic geometry: real log canonical threshold of K(ω)
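Continuing the toy example from the Bayesian Statistics slide (my own illustration): for K(ω) = ω⁴/2 with a uniform prior on [−1, 1], the learning coefficient is λ_q = 1/4 with θ_q = 1, half of the regular-model value 1/2. A crude log-log fit of the deterministic integral recovers it.

```python
import numpy as np
from scipy.integrate import quad

# Deterministic integral Z(N) = ∫ exp(-N*K(w)) * prior(w) dw for the toy model,
# with K(w) = w^4 / 2 and a uniform prior (density 1/2) on [-1, 1].
def Z(N):
    val, _ = quad(lambda w: np.exp(-N * w**4 / 2) * 0.5, -1, 1)
    return val

Ns = np.array([1e2, 1e3, 1e4, 1e5, 1e6])
logZ = np.log([Z(N) for N in Ns])

# The slope of log Z(N) against log N estimates -lambda_q; expect lambda_q = 1/4.
slope = np.polyfit(np.log(Ns), logZ, 1)[0]
print(-slope)    # ≈ 0.25
```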
Geometry of Singular Models

[Figure: geometry of singular models.]
AIC and DIC

Bayes generalization error B_N: the Kullback-Leibler distance from the true distribution q(x) to the predictive distribution p(x|D).

Asymptotically, B_N is equivalent to
• Akaike Information Criterion for regular models:
  AIC = −Σ_{i=1}^N log p(X_i|ω*) + d
• Akaike Information Criterion for singular models:
  AIC = −Σ_{i=1}^N log p(X_i|ω*) + 2 · (singular fluctuation)

Numerically, B_N can be estimated using MCMC methods.
• Deviance Information Criterion for regular models:
  DIC = E_X[log p(X | E_ω[ω])] − 2 E_ω[E_X[log p(X|ω)]]
• Widely Applicable Information Criterion for singular models:
  WAIC = E_X[log E_ω[p(X|ω)]] − 2 E_ω[E_X[log p(X|ω)]]
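A minimal numpy sketch (my own, not from the slides) of the WAIC expression as written above, using the toy model p(x|ω) = N(ω², 1) from earlier; the "posterior draws" here are an illustrative stand-in for genuine MCMC output.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
data = rng.normal(0, 1, size=200)            # samples from the truth q = N(0, 1)
post_w = rng.normal(0, 0.3, size=2000)       # assumed posterior draws of w (illustrative)

# log p(X_i | w_s) for every data point i and posterior draw s
loglik = norm.logpdf(data[:, None], loc=post_w[None, :] ** 2, scale=1)

term1 = np.mean(np.log(np.mean(np.exp(loglik), axis=1)))   # E_X[ log E_w[ p(X|w) ] ]
term2 = np.mean(loglik)                                     # E_w[ E_X[ log p(X|w) ] ]
waic = term1 - 2 * term2
print(waic)
```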
Real Log Canonical Thresholds