Some bias and a pinch of variance
Sara van de Geer
November 2, 2016
Joint work with: Andreas Elsener, Alan Muro, Jana Janková, Benjamin Stucky
... this talk is about theory for machine learning algorithms ...
... for high-dimensional data ...
... it is about prediction performance of algorithms trained on random data ... it is not about the scripts used
Outline
◦ Problem statement
◦ Norm-penalized empirical risk minimization
◦ Adaptation
Concepts: sparsity, effective sparsity, margin curvature, triangle property
Detour: exact recovery
Problem: Let $f : \mathcal{X} \to \mathbb{R}$, $\mathcal{X} \subset \mathbb{R}^m$. Find $\min_{x \in \mathcal{X}} f(x)$.
Severe problem: the function $f$ is unknown!
What we do know:
$$f(x) = \int \ell(x, y)\, dP(y) =: f_P(x)$$
where
◦ $\ell(x, y)$ is a given "loss" function: $\ell : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$
◦ $P$ is an unknown probability measure on the space $\mathcal{Y}$
Example
◦ $\mathcal{X}$ := the persons you consider marrying
◦ $\mathcal{Y}$ := possible states of the world
◦ $\ell(x, y)$ := the loss when marrying $x$ in world $y$
◦ $P$ := the distribution of possible states of the world
◦ $f(x) = \int \ell(x, y)\, dP(y)$ := the "risk" of marrying $x$
Let $Q$ be a given probability measure on $\mathcal{Y}$. We replace $P$ by $Q$:
$$f_Q(x) := \int \ell(x, y)\, dQ(y)$$
and estimate $x_P := \arg\min_{x \in \mathcal{X}} f_P(x)$ by $x_Q := \arg\min_{x \in \mathcal{X}} f_Q(x)$.
Question: How "good" is this estimate?
[Figure: the empirical risk $f_Q(x)$ and the theoretical risk $f_P(x)$ as functions of $x$, with their minimizers $x_Q$ and $x_P$ and the excess risk indicated.]
Question: Is $x_Q$ close to $x_P$? Is $f(x_Q)$ close to $f(x_P)$?
... in our setup ... we have to regularize: accept some bias to reduce variance
Our setup: $Q$ corresponds to a sample $Y_1, \ldots, Y_n$ from $P$, with $n$ := sample size. Thus
$$f_Q(x) := \hat{f}_n(x) = \frac{1}{n} \sum_{i=1}^n \ell(x, Y_i), \quad x \in \mathcal{X} \subset \mathbb{R}^m$$
(a random function)
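As a small illustration of this setup (my own sketch, not from the slides): take squared-error loss $\ell(x, y) = (x - y)^2$ with a scalar parameter, so $f_P$ is minimized at the mean of $P$ and $\hat{f}_n$ near the sample mean. The distribution and sample size below are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy illustration: squared-error loss, scalar parameter.
# f_P(x) = E (x - Y)^2 is minimized at x_P = E Y;
# the empirical risk f_n(x) = (1/n) sum_i (x - Y_i)^2 is minimized near the sample mean.
n = 50
Y = rng.normal(loc=1.0, scale=2.0, size=n)   # sample Y_1, ..., Y_n from P (assumed normal here)

def empirical_risk(x, Y):
    """f_Q(x) = (1/n) sum_i loss(x, Y_i) with squared-error loss."""
    return np.mean((x - Y) ** 2)

x_grid = np.linspace(-3, 5, 1001)
risks = np.array([empirical_risk(x, Y) for x in x_grid])
x_Q = x_grid[np.argmin(risks)]               # empirical risk minimizer (grid search)

print("x_P (true minimizer)     :", 1.0)
print("x_Q (empirical minimizer):", x_Q)     # close to the sample mean, hence to x_P
```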
number of parameters: $m$; number of observations: $n$. High-dimensional statistics: $m \gg n$.
DATA: $Y_1, \ldots, Y_n$ ↓ $\hat{x} \in \mathbb{R}^m$
In our setup with $m \gg n$ we need to regularize. That is: accept some bias to be able to reduce the variance.
Regularized empirical risk minimization
Target:
$$x_P := x^0 = \arg\min_{x \in \mathcal{X} \subset \mathbb{R}^m} \underbrace{f_P(x)}_{\text{unobservable risk}}$$
Estimator based on the sample:
$$x_Q := \hat{x} := \arg\min_{x \in \mathcal{X} \subset \mathbb{R}^m} \Big\{ \underbrace{f_Q(x)}_{\text{empirical risk}} + \underbrace{\mathrm{pen}(x)}_{\text{regularization penalty}} \Big\}$$
Example: Let $Z \in \mathbb{R}^{n \times m}$ be a given design matrix and $b^0 \in \mathbb{R}^n$ an unobserved vector. Let $\|v\|_2^2 := \sum_{i=1}^n v_i^2$ and
$$x^0 \in \arg\min_{x \in \mathbb{R}^m} \underbrace{\|b^0 - Zx\|_2^2}_{f_P(x)}$$
Sample: $Y = b^0 + \epsilon$, with $\epsilon \in \mathbb{R}^n$ noise.
"Lasso" with "tuning parameter" $\lambda \geq 0$:
$$\hat{x} := \arg\min_{x \in \mathbb{R}^m} \Big\{ \underbrace{\|Y - Zx\|_2^2}_{f_Q(x)} + 2\lambda \underbrace{\|x\|_1}_{\sum_{j=1}^m |x_j|} \Big\}$$
$n$ := number of observations, $m$ := number of parameters. High-dimensional: $m \gg n$.
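A minimal numerical sketch of this Lasso (my own illustration, not the authors' code): iterative soft-thresholding applied to the objective $\|Y - Zx\|_2^2 + 2\lambda\|x\|_1$ above. The design, noise level, tuning parameter and iteration count are arbitrary choices made only for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(1)

n, m, s0 = 50, 200, 3                     # high-dimensional: m >> n, sparse x^0
Z = rng.normal(size=(n, m))
x0 = np.zeros(m); x0[:s0] = 1.0           # s0 active parameters
Y = Z @ x0 + 0.5 * rng.normal(size=n)     # Y = b^0 + noise, with b^0 = Z x^0

def lasso_ista(Y, Z, lam, n_iter=2000):
    """Minimize ||Y - Z x||_2^2 + 2*lam*||x||_1 by iterative soft-thresholding (ISTA)."""
    step = 1.0 / (2 * np.linalg.norm(Z, 2) ** 2)   # 1 / Lipschitz constant of the gradient
    x = np.zeros(Z.shape[1])
    for _ in range(n_iter):
        grad = -2 * Z.T @ (Y - Z @ x)              # gradient of the squared-error part
        u = x - step * grad
        x = np.sign(u) * np.maximum(np.abs(u) - 2 * lam * step, 0.0)  # soft-threshold
    return x

lam = 0.5 * np.sqrt(2 * n * np.log(m))    # heuristic: of order sigma * sqrt(n log m) for this objective
x_hat = lasso_ista(Y, Z, lam)
print("estimated active set:", np.flatnonzero(np.abs(x_hat) > 1e-6))
print("prediction error ||Z(x_hat - x0)||_2^2 / n:", np.sum((Z @ (x_hat - x0)) ** 2) / n)
```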
Definition
We call $j$ an active parameter if (roughly speaking) $x^0_j \neq 0$.
We say $x^0$ is sparse if the number of active parameters is small.
We write the active set of $x^0$ as $S_0 := \{ j : x^0_j \neq 0 \}$.
We call $s_0 := |S_0|$ the sparsity of $x^0$.
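In code, the active set and the sparsity of a coefficient vector are one-liners; a short sketch (the example vector is hypothetical):

```python
import numpy as np

x0 = np.array([1.5, 0.0, -0.3, 0.0, 0.0, 2.0])  # example coefficient vector
S0 = np.flatnonzero(x0 != 0.0)                  # active set S_0 = {j : x0_j != 0}
s0 = S0.size                                    # sparsity s_0 = |S_0|
print("S_0 =", S0.tolist(), " s_0 =", s0)
```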
Goal:
◦ derive oracle inequalities for norm-penalized empirical risk minimizers
oracle: an estimator that knows the "true" sparsity
oracle inequalities: adaptation to unknown sparsity
Benchmark
Low-dimensional:
$$\hat{x} = \arg\min_{x \in \mathcal{X} \subset \mathbb{R}^m} \hat{f}_n(x)$$
Then typically
$$f_P(\hat{x}) - f_P(x^0) \sim \frac{m}{n} = \frac{\text{number of parameters}}{\text{number of observations}}$$
High-dimensional:
$$\hat{x} = \arg\min_{x \in \mathcal{X} \subset \mathbb{R}^m} \big\{ \hat{f}_n(x) + \mathrm{pen}(x) \big\}$$
Aim is adaptation:
$$f_P(\hat{x}) - f_P(x^0) \sim \frac{s_0}{n} = \frac{\text{number of active parameters}}{\text{number of observations}}$$
Outline
◦ Problem statement
◦ Norm-penalized empirical risk minimization
◦ Adaptation
Concepts: sparsity, effective sparsity, margin curvature, triangle property
Detour: exact recovery
Exact recovery
Let $Z \in \mathbb{R}^{n \times m}$ and $b^0 \in \mathbb{R}^n$ be given, with $m \gg n$. Consider the system $Zx^0 = b^0$ of $n$ equations with $m$ unknowns.
Basis pursuit:
$$x^* := \arg\min_{x \in \mathbb{R}^m} \big\{ \|x\|_1 : Zx = b^0 \big\}$$
Notation
Active set: $S_0 := \{ j : x^0_j \neq 0 \}$
Sparsity: $s_0 := |S_0|$
Effective sparsity:
$$\Gamma_0^2 := \frac{s_0}{\hat{\phi}^2(S_0)} = \max \Big\{ \frac{\|x_{S_0}\|_1^2}{\|Zx\|_2^2 / n} : \underbrace{\|x_{-S_0}\|_1 \leq \|x_{S_0}\|_1}_{\text{"cone condition"}} \Big\}$$
Compatibility constant: $\hat{\phi}^2(S_0)$
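The maximum in the definition of $\Gamma_0^2$ runs over the cone $\{\|x_{-S_0}\|_1 \leq \|x_{S_0}\|_1\}$ and is not easy to compute exactly. The sketch below (my own crude Monte Carlo, not from the slides) only samples random directions in the cone and therefore yields a lower bound on $\Gamma_0^2$.

```python
import numpy as np

rng = np.random.default_rng(2)

def effective_sparsity_lower_bound(Z, S0, n_draws=20000):
    """Crude Monte Carlo lower bound on Gamma_0^2:
    max ||x_{S0}||_1^2 / (||Z x||_2^2 / n) over the cone ||x_{-S0}||_1 <= ||x_{S0}||_1."""
    n, m = Z.shape
    S0 = np.asarray(S0)
    notS0 = np.setdiff1d(np.arange(m), S0)
    best = 0.0
    for _ in range(n_draws):
        x = np.zeros(m)
        x[S0] = rng.normal(size=S0.size)
        # random off-support part, rescaled so the cone condition holds
        v = rng.normal(size=notS0.size)
        scale = rng.uniform() * np.sum(np.abs(x[S0])) / max(np.sum(np.abs(v)), 1e-12)
        x[notS0] = scale * v
        ratio = np.sum(np.abs(x[S0])) ** 2 / (np.sum((Z @ x) ** 2) / n)
        best = max(best, ratio)
    return best

Z = rng.normal(size=(50, 200))
print("lower bound on Gamma_0^2:", effective_sparsity_lower_bound(Z, S0=[0, 1, 2]))
```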
The compatibility constant is a canonical correlation ... in the $\ell_1$-world.
The effective sparsity $\Gamma_0^2$ is $\approx$ the sparsity $s_0$, but taking into account the correlation between the variables.
Compatibility constant (in $\mathbb{R}^2$):
[Figure: geometric illustration with columns $Z_1, Z_2, \ldots, Z_m$ of the compatibility constant $\hat{\phi}(1, \{1\})$; here $\hat{\phi}(S) = \hat{\phi}(1, S)$ for the case $S = \{1\}$.]
Basis pursuit
$Z$ a given $n \times m$ matrix with $m \gg n$. Let $x^0$ be the sparsest solution of $Zx = b^0$.
Basis pursuit [Chen, Donoho and Saunders (1998)]:
$$x^* := \arg\min \big\{ \|x\|_1 : Zx = b^0 \big\}$$
Exact recovery: $\Gamma(S_0) < \infty \;\Rightarrow\; x^* = x^0$
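Basis pursuit becomes a linear program after splitting $x$ into positive and negative parts; a small sketch with scipy (my own illustration). Whether $x^* = x^0$ in a given instance of course depends on $Z$ and on the sparsity of $x^0$.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(3)

n, m, s0 = 30, 100, 3
Z = rng.normal(size=(n, m))
x0 = np.zeros(m); x0[:s0] = np.array([2.0, -1.0, 0.5])   # sparse truth
b0 = Z @ x0

# min ||x||_1 s.t. Z x = b0, via x = u - v with u, v >= 0:
# minimize sum(u) + sum(v) subject to [Z, -Z] [u; v] = b0.
c = np.ones(2 * m)
A_eq = np.hstack([Z, -Z])
res = linprog(c, A_eq=A_eq, b_eq=b0, bounds=(0, None), method="highs")
x_star = res.x[:m] - res.x[m:]

print("exact recovery:", np.allclose(x_star, x0, atol=1e-6))
```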
Outline
◦ Problem statement
◦ Norm-penalized empirical risk minimization
◦ Adaptation
Concepts: sparsity, effective sparsity, margin curvature, triangle property
Detour: exact recovery
General norms
Let $\Omega$ be a norm on $\mathbb{R}^m$: the $\Omega$-world.
Norm-regularized empirical risk minimization
$$x_Q := \hat{x} := \arg\min_{x \in \mathcal{X} \subset \mathbb{R}^m} \Big\{ \underbrace{f_Q(x)}_{\text{empirical risk}} + \lambda \underbrace{\Omega(x)}_{\text{regularization penalty}} \Big\}$$
where
◦ $\Omega$ is a given norm on $\mathbb{R}^m$,
◦ $\lambda > 0$ is a tuning parameter
Examples of norms
$\ell_1$-norm: $\Omega(x) = \|x\|_1 := \sum_{j=1}^m |x_j|$
OSCAR: given $\tilde{\lambda} > 0$,
$$\Omega(x) := \sum_{j=1}^m \big( \tilde{\lambda}(j-1) + 1 \big) |x|_{(j)} \quad \text{where } |x|_{(1)} \geq \cdots \geq |x|_{(m)}$$
[Bondell and Reich 2008]
sorted $\ell_1$-norm: given $\lambda_1 \geq \cdots \geq \lambda_m > 0$,
$$\Omega(x) := \sum_{j=1}^m \lambda_j |x|_{(j)} \quad \text{where } |x|_{(1)} \geq \cdots \geq |x|_{(m)}$$
[Bogdan et al. 2013]
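Both of these norms are straightforward to evaluate once the absolute entries are sorted; a short sketch of my own (the weights and the test vector are illustrative only):

```python
import numpy as np

def oscar_norm(x, lam_tilde):
    """OSCAR: sum_j (lam_tilde*(j-1) + 1) * |x|_(j), with |x|_(1) >= |x|_(2) >= ..."""
    a = np.sort(np.abs(x))[::-1]
    weights = lam_tilde * np.arange(len(x)) + 1.0
    return float(weights @ a)

def sorted_l1_norm(x, lam):
    """Sorted l1-norm: sum_j lam_j * |x|_(j), for lam_1 >= lam_2 >= ... > 0."""
    a = np.sort(np.abs(x))[::-1]
    return float(np.sort(lam)[::-1] @ a)

x = np.array([0.2, -3.0, 1.0, 0.0])
print(oscar_norm(x, lam_tilde=0.5))
print(sorted_l1_norm(x, lam=np.array([2.0, 1.5, 1.0, 0.5])))
```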
norms generated from cones: for a convex cone $\mathcal{A} \subset \mathbb{R}_+^m$,
$$\Omega(x) := \min_{a \in \mathcal{A}} \frac{1}{2} \sum_{j=1}^m \Big( \frac{x_j^2}{a_j} + a_j \Big)$$
[Micchelli et al. 2010] [Jenatton et al. 2011] [Bach et al. 2012]
[Figures: unit ball for the wedge norm, $\mathcal{A} = \{ a : a_1 \geq a_2 \geq \cdots \}$, and unit ball for the group Lasso norm.]
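As a sanity check on this variational formula (my own, not one of the structured examples above): with $\mathcal{A} = (0, \infty)^m$, minimizing $\tfrac{1}{2}(x_j^2/a_j + a_j)$ over $a_j > 0$ gives $a_j = |x_j|$ and value $|x_j|$, so the formula recovers the $\ell_1$-norm. A numeric confirmation:

```python
import numpy as np

def cone_norm_unconstrained(x, grid=np.linspace(1e-4, 10, 100000)):
    """Omega(x) = min over a > 0 (coordinatewise) of (1/2) * sum_j (x_j^2 / a_j + a_j),
    evaluated here by a fine grid search per coordinate (cone A = (0, inf)^m)."""
    total = 0.0
    for xj in x:
        total += 0.5 * np.min(xj ** 2 / grid + grid)
    return total

x = np.array([0.3, -2.0, 1.5])
print(cone_norm_unconstrained(x))      # approximately ...
print(np.sum(np.abs(x)))               # ... the l1-norm of x
```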
nuclear norm for matrices: $X \in \mathbb{R}^{m_1 \times m_2}$,
$$\Omega(X) := \|X\|_{\text{nuclear}} := \mathrm{trace}\big( \sqrt{X^T X} \big)$$
nuclear norm for tensors: $X \in \mathbb{R}^{m_1 \times m_2 \times m_3}$,
$$\Omega(X) := \text{dual norm of } \Omega_*, \qquad \Omega_*(W) := \max_{\|u_1\|_2 = \|u_2\|_2 = \|u_3\|_2 = 1} \mathrm{trace}\big( W^T (u_1 \otimes u_2 \otimes u_3) \big), \quad W \in \mathbb{R}^{m_1 \times m_2 \times m_3}$$
[Yuan and Zhang 2014]
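The matrix nuclear norm is simply the sum of the singular values; a brief numeric check (my own) of the $\mathrm{trace}(\sqrt{X^T X})$ expression against an SVD:

```python
import numpy as np
from scipy.linalg import sqrtm

rng = np.random.default_rng(4)
X = rng.normal(size=(5, 3))

nuclear_svd = np.sum(np.linalg.svd(X, compute_uv=False))   # sum of singular values
nuclear_trace = np.trace(sqrtm(X.T @ X)).real              # trace of the matrix square root

print(nuclear_svd, nuclear_trace)   # the two expressions agree up to numerical error
```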
Some concepts
Let $\dot{f}_P(x) := \frac{\partial}{\partial x} f_P(x)$.
The Bregman divergence is
$$D(x \,\|\, \hat{x}) = f_P(x) - f_P(\hat{x}) - \dot{f}_P(\hat{x})^T (x - \hat{x})$$
[Figure: the Bregman divergence $D(x \,\|\, \hat{x})$ as the gap between $f_P(x)$ and the tangent to $f_P$ at $\hat{x}$.]
Definition (Property of $f_P$)
We have margin curvature $G$ if
$$D(x^* \,\|\, \hat{x}) \geq G\big( \tau(x^* - \hat{x}) \big)$$
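For the squared-error risk of the Lasso example, $f_P(x) = \|b^0 - Zx\|_2^2$, the Bregman divergence works out to $D(x \,\|\, \hat{x}) = \|Z(x - \hat{x})\|_2^2$, so one convenient choice is $G(u) = u^2$ with $\tau(x) = \|Zx\|_2$. A quick numeric check of my own:

```python
import numpy as np

rng = np.random.default_rng(5)

n, m = 20, 8
Z = rng.normal(size=(n, m))
b0 = rng.normal(size=n)

def f_P(x):
    return np.sum((b0 - Z @ x) ** 2)

def grad_f_P(x):
    return -2 * Z.T @ (b0 - Z @ x)

x, x_hat = rng.normal(size=m), rng.normal(size=m)

# Bregman divergence D(x || x_hat) = f_P(x) - f_P(x_hat) - grad_f_P(x_hat)^T (x - x_hat)
D = f_P(x) - f_P(x_hat) - grad_f_P(x_hat) @ (x - x_hat)
print(D, np.sum((Z @ (x - x_hat)) ** 2))   # both equal ||Z(x - x_hat)||_2^2
```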
Definition (Property of $\Omega$)
The triangle property holds at $x^*$ if there exist semi-norms $\Omega^+$ and $\Omega^-$ such that
$$\Omega(x^*) - \Omega(x) \leq \Omega^+(x - x^*) - \Omega^-(x)$$
Definition
The effective sparsity at $x^*$ is
$$\Gamma_*^2(L) := \max \Big\{ \Big( \frac{\Omega^+(x)}{\tau(x)} \Big)^2 : \underbrace{\Omega^-(x) \leq L\, \Omega^+(x)}_{\text{"cone condition"}} \Big\}$$
$L \geq 1$ is a stretching factor.
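For the $\ell_1$-norm a natural choice is $\Omega^+(x) = \|x_{S^*}\|_1$ and $\Omega^-(x) = \|x_{-S^*}\|_1$, where $S^*$ is the support of $x^*$; the triangle property then follows from the ordinary triangle inequality. The numeric check below is my own sketch.

```python
import numpy as np

rng = np.random.default_rng(6)

m = 10
x_star = np.zeros(m); x_star[:3] = rng.normal(size=3)      # sparse x*, support S*
S = np.flatnonzero(x_star != 0.0)
notS = np.setdiff1d(np.arange(m), S)

ok = True
for _ in range(10000):
    x = rng.normal(size=m)
    lhs = np.sum(np.abs(x_star)) - np.sum(np.abs(x))                  # Omega(x*) - Omega(x)
    rhs = np.sum(np.abs((x - x_star)[S])) - np.sum(np.abs(x[notS]))   # Omega+(x - x*) - Omega-(x)
    ok &= (lhs <= rhs + 1e-12)
print("triangle property holds on all draws:", bool(ok))
```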
Outline
◦ Problem statement
◦ Norm-penalized empirical risk minimization
◦ Adaptation
Concepts: sparsity, effective sparsity, margin curvature, triangle property
Detour: exact recovery
Norm-regularized empirical risk minimization
$$x_Q := \hat{x} := \arg\min_{x \in \mathcal{X} \subset \mathbb{R}^m} \Big\{ \underbrace{f_Q(x)}_{\text{empirical risk}} + \lambda \underbrace{\Omega(x)}_{\text{regularization penalty}} \Big\}$$
where
◦ $\Omega$ is a given norm on $\mathbb{R}^m$,
◦ $\lambda > 0$ is a tuning parameter
A sharp oracle inequality
Theorem [vdG, 2016]
Let
$$\lambda > \lambda_\epsilon \geq \underbrace{\Omega_*}_{\text{dual norm}} \big( (\dot{f}_Q - \dot{f}_P)(\hat{x}) \big)$$
($\lambda_\epsilon$ measures how close $Q$ is to $P$, i.e. it removes most of the variance).
Define $\underline{\lambda} := \lambda - \lambda_\epsilon$, $\bar{\lambda} := \lambda + \lambda_\epsilon$, $L := \bar{\lambda} / \underline{\lambda}$, and let $H$ := the convex conjugate of $G$.
Then (recall $\hat{x} = x_Q$, $x^0 = x_P$)
$$f_P(\hat{x}) - f_P(x^0) \leq \min_{x^* \in \mathcal{X}} \Big\{ \underbrace{f_P(x^*) - f_P(x^0)}_{\text{"bias"}} + \underbrace{H\big( \bar{\lambda}\, \Gamma_*(L) \big)}_{\text{pinch of "variance"}} \Big\}$$
that is: adaptation.
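A sketch of my own of how this reads in the quadratic-margin case suggested by the Lasso example above (assumptions: $G(u) = u^2$, and the usual normalization of the design and of $\lambda$):

```latex
% Sketch (not from the slides): the quadratic-margin special case of the theorem.
% If G(u) = u^2, its convex conjugate is H(v) = sup_u { u v - u^2 } = v^2 / 4, so
\[
  f_P(\hat{x}) - f_P(x^0)
  \;\le\;
  \min_{x^* \in \mathcal{X}}
  \Big\{ \underbrace{f_P(x^*) - f_P(x^0)}_{\text{"bias"}}
         \;+\;
         \underbrace{\tfrac{1}{4}\,\bar{\lambda}^{2}\, \Gamma_*^{2}(L)}_{\text{pinch of "variance"}}
  \Big\}.
\]
```

With the usual normalization, $\bar{\lambda}$ is of order $\sqrt{\log m / n}$ and $\Gamma_*^2(L)$ is of the order of the sparsity $s_*$ of the candidate oracle $x^*$, so the variance term is of the adaptive order $s_* \log m / n$, matching the benchmark slide up to the logarithmic factor.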