Computational and Statistical Aspects of Statistical Machine Learning

  1. Computational and Statistical Aspects of Statistical Machine Learning
     John Lafferty
     Department of Statistics Retreat, Gleacher Center

  2. Outline
     • “Modern” nonparametric inference for high dimensional data
       ◮ Nonparametric reduced rank regression
     • Risk-computation tradeoffs
       ◮ Covariance-constrained linear regression
     • Other research and teaching activities

  3. Context for High Dimensional Nonparametrics
     Great progress in recent years on high dimensional linear models.
     Many problems have important nonlinear structure.
     We’ve been studying “purely functional” methods for high dimensional, nonparametric inference:
     • no basis expansions
     • no Mercer kernels

  4. Additive Models
     Fully nonparametric models appear hopeless
     • Logarithmic scaling, $p = \log n$ (e.g., “Rodeo”, Lafferty and Wasserman (2008))
     Additive models are a useful compromise
     • Exponential scaling, $p = \exp(n^c)$ (e.g., “SpAM”, Ravikumar, Lafferty, Liu and Wasserman (2009))

  5. Additive Models
     [Figure: four panels with horizontal axes Age, Bmi, Map, and Tc.]

  6. Multivariate Regression
     $Y \in \mathbb{R}^q$ and $X \in \mathbb{R}^p$. Regression function $m(X) = E(Y \mid X)$.
     Linear model $Y = BX + \epsilon$ where $B \in \mathbb{R}^{q \times p}$.
     Reduced rank regression: $r = \mathrm{rank}(B) \le C$.
     Recent work has studied properties and high dimensional scaling of reduced rank regression where the nuclear norm $\|B\|_*$ is used as a convex surrogate for the rank constraint (Yuan et al., 2007; Negahban and Wainwright, 2011). E.g.,
     $\|\hat{B}_n - B^*\|_F = O_P\left(\sqrt{\frac{\mathrm{Var}(\epsilon)\, r (p + q)}{n}}\right)$

  7. Low-Rank Matrices and Convex Relaxation
     [Figure: the low rank matrices, $\mathrm{rank}(X) \le t$, and their convex hull, $\|X\|_* \le t$.]

  8. Nuclear Norm Regularization
     Algorithms for nuclear norm minimization are a lot like iterative soft thresholding for lasso problems.
     To project a matrix $B$ onto the nuclear norm ball $\|X\|_* \le t$:
     • Compute the SVD: $B = U \,\mathrm{diag}(\sigma)\, V^T$
     • Soft threshold the singular values: $B \leftarrow U \,\mathrm{diag}(\mathrm{Soft}_\lambda(\sigma))\, V^T$
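The two steps above can be sketched in NumPy as follows. This is a minimal illustration rather than code from the talk: the function names `soft_threshold` and `singular_value_threshold` are made up here, and the threshold `lam` is taken as given. Soft thresholding at a fixed `lam` is the proximal step for the penalty $\lambda \|B\|_*$; an exact projection onto the ball $\|X\|_* \le t$ would also need `lam` chosen so that the shrunken singular values sum to at most $t$, which this sketch does not attempt.

```python
import numpy as np


def soft_threshold(sigma, lam):
    """Soft_lam(sigma): shrink each value toward zero by lam, clipping at zero."""
    return np.maximum(sigma - lam, 0.0)


def singular_value_threshold(B, lam):
    """Return U diag(Soft_lam(sigma)) V^T, where B = U diag(sigma) V^T."""
    U, sigma, Vt = np.linalg.svd(B, full_matrices=False)
    # Multiplying U column-wise by the shrunken singular values avoids
    # building the diagonal matrix explicitly.
    return (U * soft_threshold(sigma, lam)) @ Vt


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    B = rng.normal(size=(6, 4))
    B_shrunk = singular_value_threshold(B, lam=1.0)
    print("nuclear norm before:", np.linalg.norm(B, "nuc"))
    print("nuclear norm after: ", np.linalg.norm(B_shrunk, "nuc"))
    print("rank after:", np.linalg.matrix_rank(B_shrunk))
```

As with the lasso, larger values of `lam` zero out more singular values, so the output tends to have lower rank.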

  9. Nonparametric Reduced Rank Regression
     Foygel, Horrell, Drton and Lafferty (NIPS 2012)
     Nonparametric multivariate regression: $m(X) = (m^1(X), \ldots, m^q(X))^T$
     Each component is an additive model: $m^k(X) = \sum_{j=1}^{p} m^k_j(X_j)$
     What is the nonparametric analogue of the $\|B\|_*$ penalty?

  10. Low Rank Functions
      What does it mean for a set of functions $m^1(x), \ldots, m^q(x)$ to be low rank?
      Let $x_1, \ldots, x_n$ be a collection of points. We require that the $n \times q$ matrix $M(x_{1:n}) = [m^k(x_i)]$ is low rank.
      Stochastic setting: $M = [m^k(X_i)]$. Natural penalty is
      $\frac{1}{\sqrt{n}} \|M\|_* = \frac{1}{\sqrt{n}} \sum_{s=1}^{q} \sigma_s(M) = \sum_{s=1}^{q} \sqrt{\lambda_s\left(\tfrac{1}{n} M^T M\right)}$
      Population version:
      $|||M|||_* := \left\| \sqrt{\mathrm{Cov}(M(X))} \right\|_* = \left\| \Sigma(M)^{1/2} \right\|_*$
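The equality between the singular-value form and the eigenvalue form of the empirical penalty can be checked numerically. The sketch below is illustrative only: the function names and the toy example (four functions built from two underlying ones) are assumptions made here, not taken from the talk.

```python
import numpy as np


def empirical_low_rank_penalty(M):
    """(1/sqrt(n)) * nuclear norm of the n x q matrix M = [m^k(x_i)]."""
    n = M.shape[0]
    return np.linalg.svd(M, compute_uv=False).sum() / np.sqrt(n)


def penalty_from_second_moment(M):
    """Sum over s of sqrt(lambda_s((1/n) M^T M)), the equivalent eigenvalue form."""
    n = M.shape[0]
    eigenvalues = np.linalg.eigvalsh(M.T @ M / n)
    # Clip tiny negative eigenvalues caused by floating point round-off.
    return np.sqrt(np.clip(eigenvalues, 0.0, None)).sum()


rng = np.random.default_rng(0)
# Hypothetical example: q = 4 functions that all lie in the span of 2 functions.
x = rng.uniform(-1.0, 1.0, size=200)
basis = np.column_stack([np.sin(3 * x), x ** 2])  # n x 2 matrix of function values
M = basis @ rng.normal(size=(2, 4))               # n x 4 matrix of rank 2

if __name__ == "__main__":
    print("rank of M:", np.linalg.matrix_rank(M))
    print("singular-value form:", empirical_low_rank_penalty(M))
    print("eigenvalue form:    ", penalty_from_second_moment(M))
```

The functions are "low rank" here in exactly the sense of the slide: evaluated at any set of points, their values form a matrix whose rank is at most two.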

  11. Constrained Rank Additive Models (CRAM)
      Let $\Sigma_j = \mathrm{Cov}(M_j)$. Two natural penalties:
      $\left\|\Sigma_1^{1/2}\right\|_* + \left\|\Sigma_2^{1/2}\right\|_* + \cdots + \left\|\Sigma_p^{1/2}\right\|_*$
      $\left\|\left(\Sigma_1^{1/2}\ \Sigma_2^{1/2}\ \cdots\ \Sigma_p^{1/2}\right)\right\|_*$
      Population risk (first penalty):
      $\frac{1}{2} E\left\|Y - \sum_j M_j(X_j)\right\|_2^2 + \lambda \sum_j |||M_j|||_*$
      Linear case:
      $\sum_{j=1}^{p} \left\|\Sigma_j^{1/2}\right\|_* = \sum_{j=1}^{p} \|B_j\|_2$
      $\left\|\left(\Sigma_1^{1/2}\ \Sigma_2^{1/2}\ \cdots\ \Sigma_p^{1/2}\right)\right\|_* = \|B\|_*$
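The linear-case identities can be verified in a few lines. The derivation below is a sketch under two assumptions that are not stated on the slide: each covariate is standardized so that $\mathrm{Var}(X_j) = 1$, and $M_j(X_j) = B_j X_j$ with $B_j \in \mathbb{R}^q$ the (nonzero) $j$-th column of $B$.

```latex
% Assumptions (not from the slide): Var(X_j) = 1 and M_j(X_j) = B_j X_j, B_j \neq 0.
\begin{align*}
\Sigma_j &= \mathrm{Cov}(B_j X_j) = \mathrm{Var}(X_j)\, B_j B_j^T = B_j B_j^T, \\
\Sigma_j^{1/2} &= \frac{B_j B_j^T}{\|B_j\|_2}
  \quad\text{since}\quad
  \left(\frac{B_j B_j^T}{\|B_j\|_2}\right)^{2}
  = \frac{B_j (B_j^T B_j) B_j^T}{\|B_j\|_2^2} = B_j B_j^T, \\
\left\|\Sigma_j^{1/2}\right\|_* &= \mathrm{tr}\left(\Sigma_j^{1/2}\right)
  = \frac{\|B_j\|_2^2}{\|B_j\|_2} = \|B_j\|_2 . \\[4pt]
\text{With } A &= \left(\Sigma_1^{1/2}\ \cdots\ \Sigma_p^{1/2}\right): \quad
A A^T = \sum_{j=1}^{p} \Sigma_j = \sum_{j=1}^{p} B_j B_j^T = B B^T, \\
&\text{so } A \text{ and } B \text{ have the same nonzero singular values and }
\|A\|_* = \|B\|_* .
\end{align*}
```

Under these assumptions the first penalty reduces to a group-lasso-type sum of column norms of $B$, and the second reduces to the nuclear norm of $B$.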

  12. CRAM Backfitting Algorithm (Penalty 1)
      Input: Data $(X_i, Y_i)$, regularization parameter $\lambda$.
      Iterate until convergence:
      For each $j = 1, \ldots, p$:
      • Compute residual: $R_j = Y - \sum_{k \ne j} \hat{M}_k(X_k)$
      • Estimate projection $P_j = E(R_j \mid X_j)$, smooth: $\hat{P}_j = S_j R_j$
      • Compute SVD: $\frac{1}{n} \hat{P}_j \hat{P}_j^T = U \,\mathrm{diag}(\tau)\, U^T$
      • Soft-threshold: $\hat{M}_j = U \,\mathrm{diag}\left([1 - \lambda/\sqrt{\tau}]_+\right) U^T \hat{P}_j$
      Output: Estimator $\hat{M}(X_i) = \sum_j \hat{M}_j(X_{ij})$.
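The loop above can be sketched in NumPy as follows. This is an illustrative sketch under several assumptions that are not in the talk: the smoother $S_j$ is taken to be a Gaussian-kernel Nadaraya-Watson smoother with a fixed `bandwidth`, the data are stored with samples in rows (so `Y` is $n \times q$ and the slide's $\hat{P}_j \hat{P}_j^T$ becomes `P.T @ P`), no centering is performed, and the function names and the iteration cap are hypothetical.

```python
import numpy as np


def kernel_smoother(x, bandwidth):
    """Nadaraya-Watson smoother matrix S (n x n) for one covariate (assumed choice)."""
    diffs = (x[:, None] - x[None, :]) / bandwidth
    weights = np.exp(-0.5 * diffs ** 2)
    return weights / weights.sum(axis=1, keepdims=True)


def shrink_component(P, lam):
    """Soft-threshold step with samples in rows: P U diag([1 - lam/sqrt(tau)]_+) U^T."""
    n = P.shape[0]
    tau, U = np.linalg.eigh(P.T @ P / n)       # (1/n) P^T P = U diag(tau) U^T
    root = np.sqrt(np.clip(tau, 0.0, None))
    scale = np.zeros_like(root)
    keep = root > lam                           # [1 - lam/sqrt(tau)]_+ > 0 iff sqrt(tau) > lam
    scale[keep] = 1.0 - lam / root[keep]
    return P @ (U * scale) @ U.T


def cram_backfit(X, Y, lam, bandwidth=0.3, n_iter=50, tol=1e-8):
    """Backfitting for penalty 1; returns fitted components of shape (p, n, q)."""
    n, p = X.shape
    q = Y.shape[1]
    smoothers = [kernel_smoother(X[:, j], bandwidth) for j in range(p)]
    M = np.zeros((p, n, q))
    for _ in range(n_iter):
        M_old = M.copy()
        for j in range(p):
            R = Y - (M.sum(axis=0) - M[j])      # residual leaving out component j
            P = smoothers[j] @ R                # smoothed estimate of E(R_j | X_j)
            M[j] = shrink_component(P, lam)
        if np.max(np.abs(M - M_old)) < tol:
            break
    return M


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.uniform(-1.0, 1.0, size=(100, 3))
    signal = np.sin(3 * X[:, 0])
    Y = np.column_stack([signal, 2 * signal]) + 0.1 * rng.normal(size=(100, 2))
    M = cram_backfit(X, Y, lam=0.05)
    fitted = M.sum(axis=0)                      # estimate of M(X_i) = sum_j M_j(X_ij)
    print("mean squared residual:", np.mean((Y - fitted) ** 2))
```

The shrinkage step acts on the singular values of $\hat{P}_j / \sqrt{n}$ exactly as soft thresholding does: each one is reduced by $\lambda$ and clipped at zero, so a large enough $\lambda$ removes a component entirely.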

  13. Scaling of Estimation Error
      Using a “double covering” technique (½-parametric, ½-nonparametric), we bound the deviation between empirical and population functional covariance matrices in spectral norm:
      $\sup_V \left\| \Sigma(V) - \hat{\Sigma}_n(V) \right\|_{\mathrm{sp}} = O_P\left(\sqrt{\frac{q + \log(pq)}{n}}\right)$
      This allows us to bound the excess risk of the empirical estimator relative to an oracle.

  14. Summary
      • Variations on additive models enjoy most of the good statistical and computational properties of sparse or low-rank linear models.
      • We’re building a toolbox for large scale, high dimensional nonparametric inference.

  15. Computation-Risk Tradeoffs
      • In “traditional” computational learning theory, the dividing line between learnable and non-learnable is polynomial vs. exponential time
      • Valiant’s PAC model
      • Mostly negative results: it is not possible to learn efficiently in natural settings
      • Claim: Distinctions within polynomial time matter most

  16. Analogy: Numerical Optimization
      In numerical optimization, it is understood how to trade off computation for speed of convergence
      • First order methods: linear cost, linear convergence
      • Quasi-Newton methods: quadratic cost, superlinear convergence
      • Newton’s method: cubic cost, quadratic convergence
      Are similar tradeoffs possible in statistical learning?

  17. Hints of a Computation-Risk Tradeoff
      Graph estimation:
      • Our method for estimating the graph for Ising models: $n = \Omega(d^3 \log p)$, $T = O(p^4)$ for graphs with $p$ nodes and maximum degree $d$
      • Information-theoretic lower bound: $n = \Omega(d \log p)$

  18. Statistical vs. Computational Efficiency
      Challenge: Understand how families of estimators with different computational efficiencies can yield different statistical efficiencies
      $\mathrm{Rate}_{\mathcal{H},\mathcal{F}}(n) = \inf_{\hat{m}_n \in \mathcal{H}} \sup_{m \in \mathcal{F}} \mathrm{Risk}(\hat{m}_n, m)$
      • $\mathcal{H}$: computationally constrained hypothesis class
      • $\mathcal{F}$: smoothness constraints on “true” model

  19. Computation-Risk Tradeoffs for Linear Regression
      Dinah Shender has been studying such a tradeoff in the setting of high dimensional linear regression.

  20. Computation-Risk Tradeoffs for Linear Regression
      Standard ridge estimator solves
      $\hat{\beta}_\lambda = \left(\frac{1}{n} X^T X + \lambda_n I\right)^{-1} \frac{1}{n} X^T Y$
      Sparsify the sample covariance to get the estimator
      $\hat{\beta}_{t,\lambda} = \left(T_t[\hat{\Sigma}] + \lambda_n I\right)^{-1} \frac{1}{n} X^T Y$
      where $T_t[\hat{\Sigma}]$ is the hard-thresholded sample covariance:
      $T_t([m_{ij}]) = \left[m_{ij}\, 1(|m_{ij}| > t)\right]$
      Recent advance in theoretical CS (Spielman et al.): Solving a symmetric diagonally-dominant linear system with $m$ nonzero matrix entries can be done in time $O(m \log^2 p)$
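The two estimators above can be compared directly in NumPy. This is a minimal sketch, with assumptions flagged: the function names are hypothetical, `lam` plays the role of $\lambda_n$, $\hat{\Sigma}$ is taken to be $\frac{1}{n} X^T X$ as in the ridge formula, and the system is solved with a dense `np.linalg.solve` rather than the fast diagonally-dominant solver mentioned on the slide. Hard thresholding does not guarantee a positive semidefinite matrix, so for some choices of `t` and `lam` the system may be ill-conditioned.

```python
import numpy as np


def hard_threshold(S, t):
    """T_t([m_ij]) = [m_ij * 1(|m_ij| > t)]: zero out entries of magnitude at most t."""
    return np.where(np.abs(S) > t, S, 0.0)


def ridge(X, Y, lam):
    """Standard ridge estimator ((1/n) X^T X + lam I)^{-1} (1/n) X^T Y."""
    n, p = X.shape
    return np.linalg.solve(X.T @ X / n + lam * np.eye(p), X.T @ Y / n)


def thresholded_ridge(X, Y, t, lam):
    """Ridge estimator with the sample covariance replaced by its hard-thresholded version."""
    n, p = X.shape
    sigma_hat = X.T @ X / n
    sparse_sigma = hard_threshold(sigma_hat, t)
    # Dense solve for illustration only; the sparsity of sparse_sigma is not exploited.
    return np.linalg.solve(sparse_sigma + lam * np.eye(p), X.T @ Y / n)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, p = 200, 10
    X = rng.normal(size=(n, p))
    beta_star = rng.normal(size=p)              # hypothetical true coefficients
    Y = X @ beta_star + 0.5 * rng.normal(size=n)
    for t in (0.0, 0.1, 0.3):
        beta_hat = thresholded_ridge(X, Y, t, lam=0.1)
        kept = np.count_nonzero(hard_threshold(X.T @ X / n, t))
        rel_err = np.linalg.norm(beta_hat - beta_star) / np.linalg.norm(beta_star)
        print(f"t={t}: nonzero covariance entries={kept}, relative error={rel_err:.3f}")
```

Raising `t` leaves fewer nonzero entries in the matrix to be solved against, which is where the computational saving would come from when a sparse solver is used.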

  21. Computation-Risk Tradeoffs for Linear Regression
      Dinah has recently proved that the statistical error scales as
      $\frac{\|\hat{\beta}_{t,\lambda} - \beta^*\|}{\|\beta^*\|} = O_P\left(\|T_t(\Sigma) - \Sigma\|_2\right) = O(t^{1-q})$
      for a class of covariance matrices with rows in sparse $\ell_q$ balls (as studied by Bickel and Levina).
      • Combined with the computational advance, this gives us an explicit, fine-grained risk/computation tradeoff

  22. Simulation
      [Figure: risk (vertical axis) plotted against lambda (horizontal axis).]

  23. Some Other Projects
      • Minhua Chen: Convex optimization for dictionary learning
      • Eric Janofsky: Nonparanormal component analysis
      • Min Xu: High dimensional conditional density and graph estimation

  24. Courses in the Works
      • Winter 2013: Nonparametric Inference (Undergraduate and Masters)
      • Spring 2013: Machine Learning for Big Data (Undergraduate Statistics and Computer Science)
      Charles Cary: Developing Cloud-based infrastructure for the course. Candidate data: 80 million images, Yahoo! clickthrough data, Science journal articles, City of Chicago datasets.

