High-dimensional statistics: Some progress and challenges ahead
Martin Wainwright
UC Berkeley, Departments of Statistics and EECS
University College London, Master Class: Lecture 3
Joint work with: Alekh Agarwal, Arash Amini, Po-Ling Loh, Sahand Negahban, Garvesh Raskutti, Pradeep Ravikumar, Bin Yu
Non-parametric regression

Goal: How to predict an output from covariates?
given covariates $(x_1, x_2, x_3, \ldots, x_p)$
output variable $y$
want to predict $y$ based on $(x_1, \ldots, x_p)$

Different models, ordered in terms of complexity/richness:
linear
non-linear but still parametric
semi-parametric
non-parametric

Challenge: How to control statistical and computational complexity for a large number of predictors $p$?
High dimensions and sample complexity

Possible models:
ordinary linear regression: $y = \underbrace{\sum_{j=1}^{p} \theta_j x_j}_{\langle \theta,\, x \rangle} + w$
general non-parametric model: $y = f(x_1, x_2, \ldots, x_p) + w$

Sample complexity: How many samples $n$ for reliable prediction?
linear models
◮ without any structure: sample size $n \asymp p/\epsilon^2$ necessary/sufficient (linear in $p$)
◮ with sparsity $s \ll p$: sample size $n \asymp (s \log p)/\epsilon^2$ necessary/sufficient (logarithmic in $p$)
non-parametric models: $p$-dimensional, smoothness $\alpha$
Curse of dimensionality: $n \asymp (1/\epsilon)^{2 + p/\alpha}$ (exponential in $p$)
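To make these scalings concrete, here is a small illustrative Python sketch (not part of the lecture) that plugs numbers into the three rates with all hidden constants set to 1; only the orders of magnitude are meaningful.

```python
# Illustrative only: back-of-the-envelope sample-size scalings from the slide,
# with all constants set to 1 (the true statements are order-of-magnitude results).
import math

def n_linear(p, eps):
    """Unstructured linear regression: n ~ p / eps^2."""
    return math.ceil(p / eps**2)

def n_sparse_linear(p, s, eps):
    """s-sparse linear regression: n ~ (s log p) / eps^2."""
    return math.ceil(s * math.log(p) / eps**2)

def n_nonparametric(p, alpha, eps):
    """alpha-smooth nonparametric regression in p dimensions: n ~ (1/eps)^(2 + p/alpha)."""
    return math.ceil((1.0 / eps) ** (2 + p / alpha))

if __name__ == "__main__":
    p, s, alpha, eps = 1000, 10, 2.0, 0.1
    print(n_linear(p, eps))                # ~1e5: linear in p
    print(n_sparse_linear(p, s, eps))      # ~7e3: logarithmic in p
    print(n_nonparametric(p, alpha, eps))  # astronomically large: curse of dimensionality
```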
Structure in non-parametric regression

Upshot: Essential to impose structural constraints for high-dimensional non-parametric models.

Reduced dimension models:
dimension-reducing function: $\varphi: \mathbb{R}^p \to \mathbb{R}^k$, where $k \ll p$
lower-dimensional function: $g: \mathbb{R}^k \to \mathbb{R}$
composite function: $f: \mathbb{R}^p \to \mathbb{R}$ with $f(x_1, x_2, \ldots, x_p) = g\big(\varphi(x_1, x_2, \ldots, x_p)\big)$

Example: Regression on a $k$-dimensional manifold
Form of model: $f(x_1, x_2, \ldots, x_p) = g\big(\varphi(x_1, x_2, \ldots, x_p)\big)$, where $\varphi$ is the co-ordinate mapping.
[Figure: data lying on a low-dimensional manifold embedded in a higher-dimensional space]

Example: Ridge functions
Form of model: $f(x_1, x_2, \ldots, x_p) = \sum_{j=1}^{k} g_j\big(\langle a_j, x \rangle\big)$
Dimension-reducing mapping: $\varphi(x_1, \ldots, x_p) = Ax$ for some $A \in \mathbb{R}^{k \times p}$.
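For intuition, the following minimal sketch (an illustration with arbitrary choices, not the lecture's example) simulates data from a ridge-function model with $k = 2$: the dimension reduction is the linear map $\varphi(x) = Ax$, followed by univariate links $g_j$.

```python
# A minimal sketch of data drawn from a ridge-function model
# f(x) = sum_j g_j(<a_j, x>); all specific choices below are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 200, 50, 2

A = rng.standard_normal((k, p))           # rows a_1, ..., a_k of the reducing map
g = [np.sin, np.tanh]                     # arbitrary univariate link functions g_1, ..., g_k

X = rng.standard_normal((n, p))
Z = X @ A.T                               # phi(x_i) = A x_i, an n-by-k matrix
f = sum(g[j](Z[:, j]) for j in range(k))  # f(x_i) = sum_j g_j(<a_j, x_i>)
y = f + 0.1 * rng.standard_normal(n)      # noisy responses y_i = f(x_i) + w_i
```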
Remainder of lecture
1 Sparse additive models
◮ formulation, applications
◮ families of estimators
◮ efficient implementation as an SOCP
2 Statistical rates
◮ kernel complexity
◮ subset selection plus univariate function estimation
3 Minimax lower bounds
◮ statistics as channel coding
◮ metric entropy and lower bounds
Sparse additive models

additive models: $f(x_1, x_2, \ldots, x_p) = \sum_{j=1}^{p} f_j(x_j)$ (Stone, 1985; Hastie & Tibshirani, 1990)

additivity with sparsity: $f(x_1, x_2, \ldots, x_p) = \sum_{j \in S} f_j(x_j)$ for an unknown subset $S$ of cardinality $|S| = s$

studied by previous authors:
◮ Lin & Zhang, 2006: COSSO relaxation
◮ Ravikumar et al., 2007: SpAM back-fitting procedure, consistency
◮ Bach et al., 2008: multiple kernel learning (MKL), consistency in the classical setting
◮ Meier et al., 2007: $L^2(\mathbb{P}_n)$ regularization
◮ Koltchinskii & Yuan, 2008, 2010
◮ Raskutti, Wainwright & Yu, 2009: minimax lower bounds
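As a concrete instance, here is a hypothetical simulation of such a model: the active set $S$ and the univariate components $f_j$ below are arbitrary choices made for illustration, not taken from the lecture.

```python
# Simulate a sparse additive model: only |S| = s of the p coordinates enter the
# regression function, each through its own smooth univariate component f_j.
import numpy as np

rng = np.random.default_rng(1)
n, p, s = 500, 1000, 4

S = rng.choice(p, size=s, replace=False)    # unknown active set
f_j = [np.sin, np.cos, np.abs, np.tanh]     # arbitrary univariate components

X = rng.uniform(-1.0, 1.0, size=(n, p))
f_star = sum(f_j[idx](X[:, j]) for idx, j in enumerate(S))
y = f_star + 0.5 * rng.standard_normal(n)   # y_i = f*(x_i) + w_i
```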
Application: Copula methods and graphical models

transform $X_j \mapsto Z_j = f_j(X_j)$
model $(Z_1, \ldots, Z_p)$ as a jointly Gaussian Markov random field:
$\mathbb{P}(z_1, z_2, \ldots, z_p) \propto \exp\Big( \sum_{(s,t) \in E} \theta_{st}\, z_s z_t \Big)$
[Figure: undirected graph on nodes $X_s, X_{t_1}, \ldots, X_{t_5}$ illustrating the Markov structure]

exploit Markov properties: neighborhood-based selection for learning graphs (Besag, 1974; Meinshausen & Buhlmann, 2006)
combined with copula method: semi-parametric approach to graphical model learning (Liu, Lafferty & Wasserman, 2009)
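A rough sketch of this pipeline is given below, assuming a rank-based estimate of the marginal transforms $f_j$ (in the spirit of Liu, Lafferty & Wasserman's nonparanormal) and per-node lasso neighborhood selection (Meinshausen & Buhlmann); the regularization level `lam` and the thresholding rule are placeholders, not the papers' tuning choices.

```python
# Sketch: Gaussianize marginals by ranks, then recover the graph by regressing each
# transformed variable on the others with the lasso (nonzero coefficients give edges).
import numpy as np
from scipy.stats import norm, rankdata
from sklearn.linear_model import Lasso

def gaussianize(X):
    """Marginal transform X_j -> Z_j = Phi^{-1}(rank/(n+1)), a crude estimate of f_j."""
    n = X.shape[0]
    return norm.ppf(rankdata(X, axis=0) / (n + 1))

def neighborhood_selection(Z, lam=0.1):
    """Lasso regression of each node on all others; union of supports gives the edge set."""
    p = Z.shape[1]
    edges = set()
    for j in range(p):
        others = np.delete(np.arange(p), j)
        coef = Lasso(alpha=lam).fit(Z[:, others], Z[:, j]).coef_
        for k, c in zip(others, coef):
            if abs(c) > 1e-8:
                edges.add(tuple(sorted((j, k))))
    return edges
```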
Sparse and smooth

Noisy samples $y_i = f^*(x_{i1}, x_{i2}, \ldots, x_{ip}) + w_i$ for $i = 1, 2, \ldots, n$ of an unknown function $f^*$ with:
sparse representation: $f^* = \sum_{j \in S} f^*_j$
univariate functions are smooth: $f_j \in \mathcal{H}_j$

Disregarding computational cost:
$\min_{|S| \le s} \; \min_{\substack{f = \sum_{j \in S} f_j \\ f_j \in \mathcal{H}_j}} \; \underbrace{\frac{1}{n} \sum_{i=1}^{n} \big( y_i - f(x_i) \big)^2}_{\|y - f\|_n^2}$

1-Hilbert-norm as convex surrogate:
$\|f\|_{1,\mathcal{H}} := \sum_{j=1}^{p} \|f_j\|_{\mathcal{H}_j}$

1-$L^2(\mathbb{P}_n)$-norm as convex surrogate:
$\|f\|_{1,n} := \sum_{j=1}^{p} \|f_j\|_{L^2(\mathbb{P}_n)}$, where $\|f_j\|^2_{L^2(\mathbb{P}_n)} := \frac{1}{n} \sum_{i=1}^{n} f_j^2(x_{ij})$.
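In code, the empirical surrogate norms reduce to a few lines; the sketch below assumes the univariate components have already been evaluated at the design points and stored columnwise as an $n \times p$ matrix (an assumption of this illustration, not a step in the lecture).

```python
# Empirical norms from the slide, for components stored as F[i, j] = f_j(x_ij).
import numpy as np

def empirical_l2(fj_vals):
    """||f_j||_{L2(Pn)} = sqrt( (1/n) * sum_i f_j(x_ij)^2 )."""
    return np.sqrt(np.mean(fj_vals ** 2))

def one_n_norm(F):
    """||f||_{1,n} = sum_j ||f_j||_{L2(Pn)}."""
    return sum(empirical_l2(F[:, j]) for j in range(F.shape[1]))
```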
A family of estimators

Noisy samples $y_i = f^*(x_{i1}, x_{i2}, \ldots, x_{ip}) + w_i$ for $i = 1, 2, \ldots, n$ of an unknown function $f^* = \sum_{j \in S} f^*_j$.

Estimator:
$\widehat{f} \in \arg\min_{f = \sum_{j=1}^{p} f_j} \Big\{ \frac{1}{n} \sum_{i=1}^{n} \Big( y_i - \sum_{j=1}^{p} f_j(x_{ij}) \Big)^2 + \rho_n \|f\|_{1,\mathcal{H}} + \mu_n \|f\|_{1,n} \Big\}$

Two kinds of regularization:
$\|f\|_{1,n} = \sum_{j=1}^{p} \|f_j\|_{L^2(\mathbb{P}_n)} = \sum_{j=1}^{p} \sqrt{\frac{1}{n} \sum_{i=1}^{n} f_j^2(x_{ij})}$, and
$\|f\|_{1,\mathcal{H}} = \sum_{j=1}^{p} \|f_j\|_{\mathcal{H}_j}$.
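A heavily simplified sketch of this estimator appears below. It assumes each $f_j$ is represented by a small polynomial basis (the lecture works with general reproducing kernel Hilbert spaces, where a kernel expansion would be used instead) and uses $\|\beta_j\|_2$ as a stand-in for the Hilbert norm of $f_j$; both are assumptions of this sketch, not the lecture's construction. The resulting convex program is a second-order cone program, here handed to cvxpy.

```python
# Simplified doubly regularized sparse additive estimator (polynomial-basis surrogate).
import cvxpy as cp
import numpy as np

def fit_sparse_additive(X, y, rho, mu, degree=3):
    n, p = X.shape
    # Per-coordinate design: columns x_j, x_j^2, ..., x_j^degree, centered.
    Phi = [np.column_stack([X[:, j] ** d for d in range(1, degree + 1)]) for j in range(p)]
    Phi = [B - B.mean(axis=0) for B in Phi]
    beta = [cp.Variable(degree) for _ in range(p)]
    resid = y - sum(Phi[j] @ beta[j] for j in range(p))
    hilbert_pen = sum(cp.norm(beta[j], 2) for j in range(p))        # surrogate for ||f||_{1,H}
    empirical_pen = sum(cp.norm(Phi[j] @ beta[j], 2)
                        for j in range(p)) / np.sqrt(n)             # ||f||_{1,n}
    obj = cp.Minimize(cp.sum_squares(resid) / n
                      + rho * hilbert_pen + mu * empirical_pen)
    cp.Problem(obj).solve()
    return [b.value for b in beta]
```

The block structure of the two penalties is what produces sparsity at the level of whole coordinates: a component $f_j$ is dropped exactly when its entire coefficient block is set to zero.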