Computational and Statistical Aspects of Statistical Machine Learning

John Lafferty
Department of Statistics Retreat, Gleacher Center
Outline

• "Modern" nonparametric inference for high dimensional data
  ◮ Nonparametric reduced rank regression
• Risk-computation tradeoffs
  ◮ Covariance-constrained linear regression
• Other research and teaching activities
Context for High Dimensional Nonparametrics

Great progress in recent years on high dimensional linear models. Many problems have important nonlinear structure.

We've been studying "purely functional" methods for high dimensional, nonparametric inference:
• no basis expansions
• no Mercer kernels
Additive Models

Fully nonparametric models appear hopeless:
• Logarithmic scaling, p = log n (e.g., "Rodeo," Lafferty and Wasserman (2008))

Additive models are a useful compromise:
• Exponential scaling, p = exp(n^c) (e.g., "SpAM," Ravikumar, Lafferty, Liu and Wasserman (2009))
Additive Models

[Figure: estimated additive component functions for the predictors Age, Bmi, Map, and Tc.]
Multivariate Regression

Y ∈ R^q and X ∈ R^p. Regression function m(X) = E(Y | X).

Linear model: Y = BX + ε, where B ∈ R^{q×p}. Reduced rank regression: r = rank(B) ≤ C.

Recent work has studied properties and high dimensional scaling of reduced rank regression, where the nuclear norm ‖B‖_* is used as a convex surrogate for the rank constraint (Yuan et al., 2007; Negahban and Wainwright, 2011). E.g.,

$\|\hat{B}_n - B^*\|_F = O_P\Bigl(\sqrt{\tfrac{\mathrm{Var}(\epsilon)\, r\,(p+q)}{n}}\Bigr)$
Low-Rank Matrices and Convex Relaxation

[Figure: the nonconvex set of low rank matrices, rank(X) ≤ t, and its convex relaxation, the nuclear norm ball ‖X‖_* ≤ t.]
Nuclear Norm Regularization

Algorithms for nuclear norm minimization are a lot like iterative soft thresholding for lasso problems. To project a matrix B onto the nuclear norm ball ‖X‖_* ≤ t:
• Compute the SVD: B = U diag(σ) V^T
• Soft threshold the singular values: B ← U diag(Soft_λ(σ)) V^T
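As a concrete illustration of this step, here is a minimal NumPy sketch of singular-value soft thresholding (our own code, not from the talk); the function name, threshold level, and example matrix are ours.

```python
import numpy as np

def soft_threshold_singular_values(B, lam):
    """Return U diag(Soft_lam(sigma)) V^T, shrinking each singular value by lam."""
    U, sigma, Vt = np.linalg.svd(B, full_matrices=False)
    sigma_shrunk = np.maximum(sigma - lam, 0.0)  # Soft_lam(sigma) = (sigma - lam)_+
    return U @ np.diag(sigma_shrunk) @ Vt

# Example: shrink the singular values of a random 5 x 3 matrix
B = np.random.randn(5, 3)
B_low_rank = soft_threshold_singular_values(B, lam=0.5)
```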
Nonparametric Reduced Rank Regression

Foygel, Horrell, Drton and Lafferty (NIPS 2012)

Nonparametric multivariate regression: m(X) = (m_1(X), ..., m_q(X))^T.

Each component is an additive model:
$m_k(X) = \sum_{j=1}^p m_{kj}(X_j)$

What is the nonparametric analogue of the ‖B‖_* penalty?
Low Rank Functions

What does it mean for a set of functions m_1(x), ..., m_q(x) to be low rank?

Let x_1, ..., x_n be a collection of points. We require that the n × q matrix M(x_{1:n}) = [m_k(x_i)] is low rank. Stochastic setting: M = [m_k(X_i)].

A natural penalty is
$\frac{1}{\sqrt{n}}\|M\|_* = \frac{1}{\sqrt{n}}\sum_{s=1}^q \sigma_s(M) = \sum_{s=1}^q \sqrt{\lambda_s\bigl(\tfrac{1}{n} M^T M\bigr)}$

Population version:
$|||M|||_* := \bigl\|\Sigma(M)^{1/2}\bigr\|_*, \quad \Sigma(M) = \mathrm{Cov}(M(X))$
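A quick numerical sanity check of the identity above, using some hypothetical component functions evaluated at sample points (the functions and variable names are ours, purely for illustration):

```python
import numpy as np

n, q = 200, 4
X = np.random.rand(n)
# Hypothetical component functions m_1, ..., m_q evaluated at the sample points
M = np.column_stack([np.sin(2 * np.pi * k * X) for k in range(1, q + 1)])

# (1/sqrt(n)) ||M||_*  equals  sum_s sqrt(lambda_s((1/n) M^T M))
scaled_nuclear = np.linalg.svd(M, compute_uv=False).sum() / np.sqrt(n)
eigs = np.linalg.eigvalsh(M.T @ M / n)
assert np.allclose(scaled_nuclear, np.sqrt(np.clip(eigs, 0.0, None)).sum())
```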
Constrained Rank Additive Models (CRAM)

Let Σ_j = Cov(M_j). Two natural penalties:
$\|\Sigma_1^{1/2}\|_* + \|\Sigma_2^{1/2}\|_* + \cdots + \|\Sigma_p^{1/2}\|_*$
$\bigl\|(\Sigma_1^{1/2}\ \Sigma_2^{1/2}\ \cdots\ \Sigma_p^{1/2})\bigr\|_*$

Population risk (first penalty):
$\tfrac{1}{2}\,\mathbb{E}\bigl\|Y - \sum_j M_j(X_j)\bigr\|_2^2 + \lambda \sum_j |||M_j|||_*$

Linear case:
$\sum_{j=1}^p \|\Sigma_j^{1/2}\|_* = \sum_{j=1}^p \|B_j\|_2$
$\bigl\|(\Sigma_1^{1/2}\ \Sigma_2^{1/2}\ \cdots\ \Sigma_p^{1/2})\bigr\|_* = \|B\|_*$
CRAM Backfitting Algorithm (Penalty 1)

Input: Data (X_i, Y_i), regularization parameter λ.

Iterate until convergence. For each j = 1, ..., p:
• Compute the residual: $R_j = Y - \sum_{k \neq j} \hat{M}_k(X_k)$
• Estimate the projection $P_j = E(R_j \mid X_j)$ by smoothing: $\hat{P}_j = S_j R_j$
• Compute the SVD: $\tfrac{1}{n} \hat{P}_j^T \hat{P}_j = U\,\mathrm{diag}(\tau)\,U^T$
• Soft-threshold: $\hat{M}_j = \hat{P}_j\, U\,\mathrm{diag}\bigl([1 - \lambda/\sqrt{\tau}]_+\bigr)\, U^T$

Output: Estimator $\hat{M}(X_i) = \sum_j \hat{M}_j(X_{ij})$.
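A rough NumPy sketch of this backfitting loop, under our own simplifying assumptions (a fixed Gaussian-kernel smoother for each S_j, zero initialization, a fixed number of passes); it is not the authors' implementation.

```python
import numpy as np

def nadaraya_watson_smoother(x, bandwidth=0.1):
    """n x n kernel smoothing matrix S_j with rows normalized to sum to one."""
    d = x[:, None] - x[None, :]
    W = np.exp(-0.5 * (d / bandwidth) ** 2)
    return W / W.sum(axis=1, keepdims=True)

def cram_backfit(X, Y, lam, n_iters=50, bandwidth=0.1):
    """CRAM backfitting (penalty 1) sketch: X is n x p, Y is n x q."""
    n, p = X.shape
    q = Y.shape[1]
    S = [nadaraya_watson_smoother(X[:, j], bandwidth) for j in range(p)]
    M = [np.zeros((n, q)) for _ in range(p)]            # fitted M_j(X_j), each n x q
    for _ in range(n_iters):
        for j in range(p):
            R = Y - sum(M[k] for k in range(p) if k != j)    # residual R_j
            P = S[j] @ R                                     # smooth: P_j = S_j R_j
            tau, U = np.linalg.eigh(P.T @ P / n)             # (1/n) P_j^T P_j = U diag(tau) U^T
            shrink = np.maximum(1.0 - lam / np.sqrt(np.maximum(tau, 1e-12)), 0.0)
            M[j] = P @ U @ np.diag(shrink) @ U.T             # soft-threshold step
    return M  # component fits evaluated at the sample points
```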
Scaling of Estimation Error

Using a "double covering" technique (½-parametric, ½-nonparametric), we bound the deviation between the empirical and population functional covariance matrices in spectral norm:
$\sup_V \bigl\|\Sigma(V) - \hat{\Sigma}_n(V)\bigr\|_{\mathrm{sp}} = O_P\Bigl(\sqrt{\tfrac{q + \log(pq)}{n}}\Bigr)$

This allows us to bound the excess risk of the empirical estimator relative to an oracle.
Summary

• Variations on additive models enjoy most of the good statistical and computational properties of sparse or low-rank linear models.
• We're building a toolbox for large scale, high dimensional nonparametric inference.
Computation-Risk Tradeoffs

• In "traditional" computational learning theory, the dividing line between learnable and non-learnable is polynomial vs. exponential time
• Valiant's PAC model
• Mostly negative results: it is not possible to efficiently learn in natural settings
• Claim: distinctions in polynomial time matter most
Analogy: Numerical Optimization

In numerical optimization, it is understood how to trade off computation for speed of convergence:
• First order methods: linear cost, linear convergence
• Quasi-Newton methods: quadratic cost, superlinear convergence
• Newton's method: cubic cost, quadratic convergence

Are similar tradeoffs possible in statistical learning?
Hints of a Computation-Risk Tradeoff

Graph estimation:
• Our method for estimating the graph of an Ising model: n = Ω(d^3 log p), T = O(p^4) for graphs with p nodes and maximum degree d
• Information-theoretic lower bound: n = Ω(d log p)
Statistical vs. Computational Efficiency

Challenge: Understand how families of estimators with different computational efficiencies can yield different statistical efficiencies.

$\mathrm{Rate}_{\mathcal{H},\mathcal{F}}(n) = \inf_{\hat{m}_n \in \mathcal{H}}\ \sup_{m \in \mathcal{F}}\ \mathrm{Risk}(\hat{m}_n, m)$

• H: computationally constrained hypothesis class
• F: smoothness constraints on the "true" model
Computation-Risk Tradeoffs for Linear Regression

Dinah Shender has been studying such a tradeoff in the setting of high dimensional linear regression.
Computation-Risk Tradeoffs for Linear Regression

The standard ridge estimator solves
$\hat{\beta}_\lambda = \bigl(\tfrac{1}{n} X^T X + \lambda_n I\bigr)^{-1} \tfrac{1}{n} X^T Y$

Sparsify the sample covariance to get the estimator
$\hat{\beta}_{t,\lambda} = \bigl(T_t[\hat{\Sigma}] + \lambda_n I\bigr)^{-1} \tfrac{1}{n} X^T Y$

where $T_t[\hat{\Sigma}]$ is the hard-thresholded sample covariance:
$T_t([m_{ij}]) = \bigl[m_{ij}\, 1(|m_{ij}| > t)\bigr]$

Recent advance in theoretical CS (Spielman et al.): solving a symmetric diagonally-dominant linear system with m nonzero matrix entries can be done in time O(m log^2 p).
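A small sketch (our own, for illustration) of this covariance-thresholded ridge estimator; in practice the point of thresholding is that the resulting sparse system can be handed to a fast solver, whereas the dense solve below is just for clarity. The data-generating example and parameter values are ours.

```python
import numpy as np

def thresholded_ridge(X, Y, t, lam):
    """Covariance-thresholded ridge: hard-threshold (1/n) X^T X, then solve."""
    n, p = X.shape
    Sigma_hat = X.T @ X / n                                   # sample covariance
    T = np.where(np.abs(Sigma_hat) > t, Sigma_hat, 0.0)       # hard threshold T_t[Sigma_hat]
    # Dense solve for illustration; a sparse/fast solver would be used in practice.
    return np.linalg.solve(T + lam * np.eye(p), X.T @ Y / n)

# Usage: t = 0 recovers the standard ridge estimator
n, p = 500, 50
X = np.random.randn(n, p)
beta_star = np.zeros(p); beta_star[:5] = 1.0
Y = X @ beta_star + 0.1 * np.random.randn(n)
beta_ridge = thresholded_ridge(X, Y, t=0.0, lam=0.1)
beta_sparse = thresholded_ridge(X, Y, t=0.05, lam=0.1)
```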
Computation-Risk Tradeoffs for Linear Regression

Dinah has recently proved that the statistical error scales as
$\frac{\|\hat{\beta}_{t,\lambda} - \beta^*\|}{\|\beta^*\|} = O_P\bigl(\|T_t(\Sigma) - \Sigma\|_2\bigr) = O(t^{1-q})$

for the class of covariance matrices with rows in sparse ℓ_q balls (as studied by Bickel and Levina).

• Combined with the computational advance, this gives us an explicit, fine-grained risk/computation tradeoff.
Simulation

[Figure: risk as a function of the regularization parameter lambda.]
Some Other Projects

• Minhua Chen: Convex optimization for dictionary learning
• Eric Janofsky: Nonparanormal component analysis
• Min Xu: High dimensional conditional density and graph estimation
Courses in the Works

• Winter 2013: Nonparametric Inference (Undergraduate and Masters)
• Spring 2013: Machine Learning for Big Data (Undergraduate Statistics and Computer Science)

Charles Cary: Developing cloud-based infrastructure for the course. Candidate data: 80 million images, Yahoo! clickthrough data, Science journal articles, City of Chicago datasets.