Mixability in Statistical Learning
Tim van Erven
Joint work with: Peter Grünwald, Mark Reid, Bob Williamson
SMILE Seminar, 24 September 2012
Summary
• Stochastic mixability ⇒ fast rates of convergence in different settings:
  • statistical learning (margin condition)
  • sequential prediction (mixability)
Outline
• Part 1: Statistical learning
  • Stochastic mixability (definition)
  • Equivalence to margin condition
• Part 2: Sequential prediction
• Part 3: Convexity interpretation for stochastic mixability
• Part 4: Grünwald's idea for adaptation to the margin
Notation
• Data: (X_1, Y_1), ..., (X_n, Y_n)
• Predict Y from X: F = {f : X → A}
• Loss: ℓ : Y × A → [0, ∞]

Classification: Y = {0, 1}, A = {0, 1},
  ℓ(y, a) = 0 if y = a, 1 if y ≠ a

Density estimation: A = density functions on Y,
  ℓ(y, p) = − log p(y)

Without X: F ⊂ A
Statistical Learning
(X_1, Y_1), ..., (X_n, Y_n) iid ∼ P*

  f* = arg min_{f ∈ F} E[ℓ(Y, f(X))]

  d(f̂, f*) = E[ℓ(Y, f̂(X)) − ℓ(Y, f*(X))] = O(n^{−?})

• Two factors that determine the rate of convergence:
  1. complexity of F
  2. the margin condition
Definition of Stochastic Mixability
• Let η ≥ 0. Then (ℓ, F, P*) is η-stochastically mixable if there exists an f* ∈ F such that

  E[ e^{−ηℓ(Y, f(X))} / e^{−ηℓ(Y, f*(X))} ] ≤ 1   for all f ∈ F.

• Stochastically mixable: this holds for some η > 0
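The condition can be checked exactly when Y and F are finite. Below is a minimal sketch (not from the talk); the outcome distribution, the loss table, and the helper name are made-up toy values for illustration only.

```python
import numpy as np

# Toy setup (hypothetical values): two outcomes y in {0, 1} with
# P*(Y=1) = 0.7, no X, and three predictors with known losses.
p_star = np.array([0.3, 0.7])           # P*(Y=0), P*(Y=1)
losses = np.array([[1.0, 0.0],          # loss of f_1 on y=0, y=1
                   [0.0, 1.0],          # loss of f_2
                   [0.6, 0.4]])         # loss of f_3

def is_stochastically_mixable(losses, p_star, eta):
    """Check E[exp(-eta*(loss(f) - loss(f*)))] <= 1 for all f,
    taking f* to be the risk minimizer in the class."""
    risks = losses @ p_star
    f_star = np.argmin(risks)
    ratios = np.exp(-eta * (losses - losses[f_star]))   # per-outcome ratios
    return bool(np.all(ratios @ p_star <= 1 + 1e-12))

for eta in [0.1, 0.5, 1.0, 2.0]:
    print(eta, is_stochastically_mixable(losses, p_star, eta))
```

In this toy example the condition holds for the smaller values of η but fails for the larger ones, matching the remark below that larger η is a stronger requirement.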
Immediate Consequences

  E[ e^{−ηℓ(Y, f(X))} / e^{−ηℓ(Y, f*(X))} ] ≤ 1   for all f ∈ F

• f* minimizes risk over F: f* = arg min_{f ∈ F} E[ℓ(Y, f(X))]
  (by Jensen's inequality, e^{−η E[Z]} ≤ E[e^{−ηZ}] ≤ 1 for Z = ℓ(Y, f(X)) − ℓ(Y, f*(X)), so E[Z] ≥ 0)
• The larger η, the stronger the property of being η-stochastically mixable
Density estimation example 1
• Log-loss: ℓ(y, p) = − log p(y), F = {p_θ | θ ∈ Θ}
• Suppose the true density p_θ* ∈ F
• Then for η = 1 and any p_θ ∈ F:

  E[ e^{−ηℓ(Y, p_θ)} / e^{−ηℓ(Y, p_θ*)} ] = ∫ (p_θ(y) / p_θ*(y)) P*(dy) = 1
Density estimation example 2
• Normal location family with fixed variance σ²:
  P* = N(µ*, τ²),  F = {N(µ, σ²) | µ ∈ ℝ}
• η-stochastically mixable for η = σ²/τ²:

  E[ e^{−ηℓ(Y, p_µ)} / e^{−ηℓ(Y, p_µ*)} ]
    = (1/√(2πτ²)) ∫ e^{−(η/2σ²)(y−µ)² + (η/2σ²)(y−µ*)² − (1/2τ²)(y−µ*)²} dy
    = (1/√(2πτ²)) ∫ e^{−(1/2τ²)(y−µ)²} dy = 1

• If f̂ is the empirical mean: E[d(f̂, f*)] = τ²/(2σ²n) = η^{−1}/(2n)
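A quick Monte Carlo sanity check of the claimed constant η = σ²/τ² (a sketch; the parameter values and alternative means below are arbitrary made-up choices, and the exact calculation is the display above):

```python
import numpy as np

rng = np.random.default_rng(0)
mu_star, tau2 = 1.0, 2.0      # mean and variance of P* (made-up values)
sigma2 = 0.5                  # fixed model variance
eta = sigma2 / tau2           # claimed stochastic mixability constant

def log_loss_normal(y, mu, var):
    # -log density of N(mu, var) at y
    return 0.5 * np.log(2 * np.pi * var) + (y - mu) ** 2 / (2 * var)

y = rng.normal(mu_star, np.sqrt(tau2), size=1_000_000)
for mu in [-1.0, 0.0, 2.5]:   # arbitrary alternative means
    ratio = np.exp(-eta * (log_loss_normal(y, mu, sigma2)
                           - log_loss_normal(y, mu_star, sigma2)))
    print(mu, ratio.mean())   # each estimate should be close to 1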
Outline
• Part 1: Statistical learning
  • Stochastic mixability (definition)
  • Equivalence to margin condition
• Part 2: Sequential prediction
• Part 3: Convexity interpretation for stochastic mixability
• Part 4: Grünwald's idea for adaptation to the margin
Margin condition

  c_0 V(f, f*)^κ ≤ d(f, f*)   for all f ∈ F,

where
  d(f, f*) = E[ℓ(Y, f(X)) − ℓ(Y, f*(X))]
  V(f, f*) = E[(ℓ(Y, f(X)) − ℓ(Y, f*(X)))²]
  κ ≥ 1, c_0 > 0

• For 0/1-loss, implies rate of convergence O(n^{−κ/(2κ−1)}) [Tsybakov, 2004]
• So smaller κ is better
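A toy numerical illustration (not from the talk): for 0/1-loss, if |2P(Y=1|X=x) − 1| ≥ h for every x (Massart's noise condition), the margin condition holds with κ = 1 and c_0 = h. The feature distribution and conditional probabilities below are made up.

```python
import itertools
import numpy as np

# Toy setup (made-up values): X takes 4 values with marginal p_x and
# conditional probabilities eta_x = P(Y=1 | X=x).
p_x   = np.array([0.4, 0.3, 0.2, 0.1])
eta_x = np.array([0.9, 0.2, 0.65, 0.35])
h = np.min(np.abs(2 * eta_x - 1))        # Massart margin, here 0.3

f_star = (eta_x > 0.5).astype(int)       # Bayes classifier

def cond_risk(f):
    """Conditional 0/1-risk of predicting f(x), for each x."""
    return np.where(f == 1, 1 - eta_x, eta_x)

worst = np.inf
for f in itertools.product([0, 1], repeat=4):    # all classifiers on 4 points
    f = np.array(f)
    d = np.dot(p_x, cond_risk(f) - cond_risk(f_star))   # excess risk d(f, f*)
    V = np.dot(p_x, (f != f_star).astype(float))        # E[(loss difference)^2]
    if V > 0:
        worst = min(worst, d / V)
print("smallest d/V over all f:", worst, ">= h =", h)    # kappa = 1 margin
```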
Stochastic mixability ⟺ margin condition

  c_0 V(f, f*)^κ ≤ d(f, f*)   for all f ∈ F

• Thm [κ = 1]: Suppose ℓ takes values in [0, V]. Then (ℓ, F, P*) is stochastically mixable if and only if there exists c_0 > 0 such that the margin condition is satisfied with κ = 1.
Margin condition with κ > 1

  F_ε = {f*} ∪ {f ∈ F | d(f, f*) ≥ ε}

• Thm [all κ ≥ 1]: Suppose ℓ takes values in [0, V]. Then the margin condition is satisfied if and only if there exists a constant C > 0 such that, for all ε > 0, (ℓ, F_ε, P*) is η-stochastically mixable for η = C ε^{(κ−1)/κ}.
Outline
• Part 1: Statistical learning
• Part 2: Sequential prediction
• Part 3: Convexity interpretation for stochastic mixability
• Part 4: Grünwald's idea for adaptation to the margin
Sequential Prediction with Expert Advice
• For rounds t = 1, ..., n:
  • K experts predict f̂_t^1, ..., f̂_t^K
  • Predict by choosing f̂_t
  • Observe (x_t, y_t)
• Regret = (1/n) Σ_{t=1}^n ℓ(y_t, f̂_t(x_t)) − min_k (1/n) Σ_{t=1}^n ℓ(y_t, f̂_t^k(x_t))
• Game-theoretic (minimax) analysis: want to guarantee small regret against adversarial data
• Worst-case regret = O(1/n) iff the loss is mixable! [Vovk, 1995] (see the sketch below)
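To make the protocol concrete, here is a minimal sketch (not the talk's algorithm) of exponential-weights aggregation for log-loss, which is 1-mixable: predicting with the weighted mixture keeps the cumulative regret below (ln K)/η, i.e. O(1/n) averaged regret. The experts and the data stream are made-up toy choices.

```python
import numpy as np

rng = np.random.default_rng(1)
n, K = 1000, 3
expert_p1 = np.array([0.2, 0.5, 0.8])    # experts: fixed Bernoulli densities p_k(y=1)
y = rng.binomial(1, 0.75, size=n)        # data stream (made-up distribution)

eta = 1.0                                 # log-loss is 1-mixable
log_w = np.zeros(K)                       # log-weights of the experts
cum_loss_alg, cum_loss_experts = 0.0, np.zeros(K)

for t in range(n):
    w = np.exp(log_w - log_w.max())
    w /= w.sum()
    p1 = np.dot(w, expert_p1)             # predict with the mixture density
    losses = -np.log(np.where(y[t] == 1, expert_p1, 1 - expert_p1))
    cum_loss_alg += -np.log(p1 if y[t] == 1 else 1 - p1)
    cum_loss_experts += losses
    log_w -= eta * losses                 # exponential weight update

regret = (cum_loss_alg - cum_loss_experts.min()) / n
print("average regret:", regret, "<= ln(K)/(eta*n) =", np.log(K) / (eta * n))
```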
Mixability
• A loss ℓ : Y × A → [0, ∞] is η-mixable if for any distribution π on A there exists an action a_π ∈ A such that

  E_{A∼π}[ e^{−ηℓ(y, A)} / e^{−ηℓ(y, a_π)} ] ≤ 1   for all y.

• Vovk: fast O(1/n) rates if and only if the loss is mixable
(Stochastic) Mixability
• A loss ℓ : Y × A → [0, ∞] is η-mixable if for any distribution π on A there exists an action a_π ∈ A such that

  E_{A∼π}[ e^{−ηℓ(y, A)} / e^{−ηℓ(y, a_π)} ] ≤ 1   for all y.

• (ℓ, F, P*) is η-stochastically mixable if

  E_{X,Y∼P*}[ e^{−ηℓ(Y, f(X))} / e^{−ηℓ(Y, f*(X))} ] ≤ 1   for all f ∈ F.
(Stochastic) Mixability
• A loss ℓ : Y × A → [0, ∞] is η-mixable if for any distribution π on A there exists an action a_π ∈ A such that

  ℓ(y, a_π) ≤ −(1/η) ln ∫ e^{−ηℓ(y, a)} π(da)   for all y.

• Thm: (ℓ, F, P*) is η-stochastically mixable iff for any distribution π on F there exists f* ∈ F such that

  E[ℓ(Y, f*(X))] ≤ E[ −(1/η) ln ∫ e^{−ηℓ(Y, f(X))} π(df) ]
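A standard worked example (not on the slide) of the first inequality: log-loss is 1-mixable, with the mixture density serving as the action a_π.

```latex
% Log-loss is 1-mixable: for \ell(y,p) = -\log p(y) and \eta = 1,
% choose a_\pi = p_\pi, the mixture density p_\pi(y) = \int p(y)\,\pi(\mathrm{d}p).
% Then for every y,
\ell(y, p_\pi)
  = -\log \int p(y)\,\pi(\mathrm{d}p)
  = -\frac{1}{\eta} \ln \int e^{-\eta\,\ell(y,p)}\,\pi(\mathrm{d}p),
% so the defining inequality holds (with equality) for every y.
```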
Equivalence of Stochastic Mixability and Ordinary Mixability

  F_full = {all functions from X to A}

• Thm: Suppose ℓ is a proper loss and X is discrete. Then ℓ is η-mixable if and only if (ℓ, F_full, P*) is η-stochastically mixable for all P*.