Mixability in Statistical Learning
Tim van Erven
Joint work with: Peter Grünwald, Mark Reid, Bob Williamson
SMILE Seminar, 24 September 2012
Summary
• Stochastic mixability ⇒ fast rates of convergence in different settings:
  • statistical learning (margin condition)
  • sequential prediction (mixability)
Outline
• Part 1: Statistical learning
  • Stochastic mixability (definition)
  • Equivalence to margin condition
• Part 2: Sequential prediction
• Part 3: Convexity interpretation for stochastic mixability
• Part 4: Grünwald's idea for adaptation to the margin
Notation
• Data: (X_1, Y_1), ..., (X_n, Y_n)
• Predict Y from X: F = {f : X → A}
• Loss: ℓ : Y × A → [0, ∞]

Classification: Y = {0, 1}, A = {0, 1},
  ℓ(y, a) = 0 if y = a, 1 if y ≠ a

Density estimation: A = density functions on Y,
  ℓ(y, p) = − log p(y)

Without X: F ⊂ A
Statistical Learning
(X_1, Y_1), ..., (X_n, Y_n) iid ∼ P*

  f* = arg min_{f ∈ F} E[ℓ(Y, f(X))]

  d(f̂, f*) = E[ℓ(Y, f̂(X)) − ℓ(Y, f*(X))] = O(n^{−?})

• Two factors that determine the rate of convergence:
  1. complexity of F
  2. the margin condition
Definition of Stochastic Mixability
• Let η ≥ 0. Then (ℓ, F, P*) is η-stochastically mixable if there exists an f* ∈ F such that

  E[ e^{−ηℓ(Y, f(X))} / e^{−ηℓ(Y, f*(X))} ] ≤ 1   for all f ∈ F.

• Stochastically mixable: this holds for some η > 0
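The condition can be checked exactly when Y and F are finite. Below is a minimal sketch (not from the talk); the outcome distribution, the loss table, and the helper name are made-up toy values for illustration only.

```python
import numpy as np

# Toy setup (hypothetical values): two outcomes y in {0, 1} with
# P*(Y=1) = 0.7, no X, and three predictors with known losses.
p_star = np.array([0.3, 0.7])           # P*(Y=0), P*(Y=1)
losses = np.array([[1.0, 0.0],          # loss of f_1 on y=0, y=1
                   [0.0, 1.0],          # loss of f_2
                   [0.6, 0.4]])         # loss of f_3

def is_stochastically_mixable(losses, p_star, eta):
    """Check E[exp(-eta*(loss(f) - loss(f*)))] <= 1 for all f,
    taking f* to be the risk minimizer in the class."""
    risks = losses @ p_star
    f_star = np.argmin(risks)
    ratios = np.exp(-eta * (losses - losses[f_star]))   # per-outcome ratios
    return bool(np.all(ratios @ p_star <= 1 + 1e-12))

for eta in [0.1, 0.5, 1.0, 2.0]:
    print(eta, is_stochastically_mixable(losses, p_star, eta))
```

In this toy example the condition holds for the smaller values of η but fails for the larger ones, matching the remark below that larger η is a stronger requirement.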
Immediate Consequences

  E[ e^{−ηℓ(Y, f(X))} / e^{−ηℓ(Y, f*(X))} ] ≤ 1   for all f ∈ F

• f* minimizes risk over F: f* = arg min_{f ∈ F} E[ℓ(Y, f(X))]
  (by Jensen's inequality, e^{−η E[Z]} ≤ E[e^{−ηZ}] ≤ 1 for Z = ℓ(Y, f(X)) − ℓ(Y, f*(X)), so E[Z] ≥ 0)
• The larger η, the stronger the property of being η-stochastically mixable
Density estimation example 1
• Log-loss: ℓ(y, p) = − log p(y), F = {p_θ | θ ∈ Θ}
• Suppose the true density p_θ* ∈ F
• Then for η = 1 and any p_θ ∈ F:

  E[ e^{−ηℓ(Y, p_θ)} / e^{−ηℓ(Y, p_θ*)} ] = ∫ (p_θ(y) / p_θ*(y)) P*(dy) = 1
Density estimation example 2
• Normal location family with fixed variance σ²:
  P* = N(µ*, τ²),  F = {N(µ, σ²) | µ ∈ ℝ}
• η-stochastically mixable for η = σ²/τ²:

  E[ e^{−ηℓ(Y, p_µ)} / e^{−ηℓ(Y, p_µ*)} ]
    = (1/√(2πτ²)) ∫ e^{−(η/2σ²)(y−µ)² + (η/2σ²)(y−µ*)² − (1/2τ²)(y−µ*)²} dy
    = (1/√(2πτ²)) ∫ e^{−(1/2τ²)(y−µ)²} dy = 1

• If f̂ is the empirical mean: E[d(f̂, f*)] = τ²/(2σ²n) = η^{−1}/(2n)
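A quick Monte Carlo sanity check of the claimed constant η = σ²/τ² (a sketch; the parameter values and alternative means below are arbitrary made-up choices, and the exact calculation is the display above):

```python
import numpy as np

rng = np.random.default_rng(0)
mu_star, tau2 = 1.0, 2.0      # mean and variance of P* (made-up values)
sigma2 = 0.5                  # fixed model variance
eta = sigma2 / tau2           # claimed stochastic mixability constant

def log_loss_normal(y, mu, var):
    # -log density of N(mu, var) at y
    return 0.5 * np.log(2 * np.pi * var) + (y - mu) ** 2 / (2 * var)

y = rng.normal(mu_star, np.sqrt(tau2), size=1_000_000)
for mu in [-1.0, 0.0, 2.5]:   # arbitrary alternative means
    ratio = np.exp(-eta * (log_loss_normal(y, mu, sigma2)
                           - log_loss_normal(y, mu_star, sigma2)))
    print(mu, ratio.mean())   # each estimate should be close to 1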
Outline
• Part 1: Statistical learning
  • Stochastic mixability (definition)
  • Equivalence to margin condition
• Part 2: Sequential prediction
• Part 3: Convexity interpretation for stochastic mixability
• Part 4: Grünwald's idea for adaptation to the margin
Margin condition

  c_0 V(f, f*)^κ ≤ d(f, f*)   for all f ∈ F,

where
  d(f, f*) = E[ℓ(Y, f(X)) − ℓ(Y, f*(X))]
  V(f, f*) = E[(ℓ(Y, f(X)) − ℓ(Y, f*(X)))²]
  κ ≥ 1, c_0 > 0

• For 0/1-loss, implies rate of convergence O(n^{−κ/(2κ−1)}) [Tsybakov, 2004]
• So smaller κ is better
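A toy numerical illustration (not from the talk): for 0/1-loss, if |2P(Y=1|X=x) − 1| ≥ h for every x (Massart's noise condition), the margin condition holds with κ = 1 and c_0 = h. The feature distribution and conditional probabilities below are made up.

```python
import itertools
import numpy as np

# Toy setup (made-up values): X takes 4 values with marginal p_x and
# conditional probabilities eta_x = P(Y=1 | X=x).
p_x   = np.array([0.4, 0.3, 0.2, 0.1])
eta_x = np.array([0.9, 0.2, 0.65, 0.35])
h = np.min(np.abs(2 * eta_x - 1))        # Massart margin, here 0.3

f_star = (eta_x > 0.5).astype(int)       # Bayes classifier

def cond_risk(f):
    """Conditional 0/1-risk of predicting f(x), for each x."""
    return np.where(f == 1, 1 - eta_x, eta_x)

worst = np.inf
for f in itertools.product([0, 1], repeat=4):    # all classifiers on 4 points
    f = np.array(f)
    d = np.dot(p_x, cond_risk(f) - cond_risk(f_star))   # excess risk d(f, f*)
    V = np.dot(p_x, (f != f_star).astype(float))        # E[(loss difference)^2]
    if V > 0:
        worst = min(worst, d / V)
print("smallest d/V over all f:", worst, ">= h =", h)    # kappa = 1 margin
```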
Stochastic mixability ⟺ margin condition

  c_0 V(f, f*)^κ ≤ d(f, f*)   for all f ∈ F

• Thm [κ = 1]: Suppose ℓ takes values in [0, V]. Then (ℓ, F, P*) is stochastically mixable if and only if there exists c_0 > 0 such that the margin condition is satisfied with κ = 1.
Margin condition with κ > 1

  F_ε = {f*} ∪ {f ∈ F | d(f, f*) ≥ ε}

• Thm [all κ ≥ 1]: Suppose ℓ takes values in [0, V]. Then the margin condition is satisfied if and only if there exists a constant C > 0 such that, for all ε > 0, (ℓ, F_ε, P*) is η-stochastically mixable for η = C ε^{(κ−1)/κ}.
Outline
• Part 1: Statistical learning
• Part 2: Sequential prediction
• Part 3: Convexity interpretation for stochastic mixability
• Part 4: Grünwald's idea for adaptation to the margin
Sequential Prediction with Expert Advice
• For rounds t = 1, ..., n:
  • K experts predict f̂_t^1, ..., f̂_t^K
  • Predict by choosing f̂_t
  • Observe (x_t, y_t)
• Regret = (1/n) Σ_{t=1}^n ℓ(y_t, f̂_t(x_t)) − min_k (1/n) Σ_{t=1}^n ℓ(y_t, f̂_t^k(x_t))
• Game-theoretic (minimax) analysis: want to guarantee small regret against adversarial data
• Worst-case regret = O(1/n) iff the loss is mixable! [Vovk, 1995] (see the sketch below)
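To make the protocol concrete, here is a minimal sketch (not the talk's algorithm) of exponential-weights aggregation for log-loss, which is 1-mixable: predicting with the weighted mixture keeps the cumulative regret below (ln K)/η, i.e. O(1/n) averaged regret. The experts and the data stream are made-up toy choices.

```python
import numpy as np

rng = np.random.default_rng(1)
n, K = 1000, 3
expert_p1 = np.array([0.2, 0.5, 0.8])    # experts: fixed Bernoulli densities p_k(y=1)
y = rng.binomial(1, 0.75, size=n)        # data stream (made-up distribution)

eta = 1.0                                 # log-loss is 1-mixable
log_w = np.zeros(K)                       # log-weights of the experts
cum_loss_alg, cum_loss_experts = 0.0, np.zeros(K)

for t in range(n):
    w = np.exp(log_w - log_w.max())
    w /= w.sum()
    p1 = np.dot(w, expert_p1)             # predict with the mixture density
    losses = -np.log(np.where(y[t] == 1, expert_p1, 1 - expert_p1))
    cum_loss_alg += -np.log(p1 if y[t] == 1 else 1 - p1)
    cum_loss_experts += losses
    log_w -= eta * losses                 # exponential weight update

regret = (cum_loss_alg - cum_loss_experts.min()) / n
print("average regret:", regret, "<= ln(K)/(eta*n) =", np.log(K) / (eta * n))
```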
Mixability
• A loss ℓ : Y × A → [0, ∞] is η-mixable if for any distribution π on A there exists an action a_π ∈ A such that

  E_{A∼π}[ e^{−ηℓ(y, A)} / e^{−ηℓ(y, a_π)} ] ≤ 1   for all y.

• Vovk: fast O(1/n) rates if and only if the loss is mixable
(Stochastic) Mixability
• A loss ℓ : Y × A → [0, ∞] is η-mixable if for any distribution π on A there exists an action a_π ∈ A such that

  E_{A∼π}[ e^{−ηℓ(y, A)} / e^{−ηℓ(y, a_π)} ] ≤ 1   for all y.

• (ℓ, F, P*) is η-stochastically mixable if

  E_{X,Y∼P*}[ e^{−ηℓ(Y, f(X))} / e^{−ηℓ(Y, f*(X))} ] ≤ 1   for all f ∈ F.
(Stochastic) Mixability
• A loss ℓ : Y × A → [0, ∞] is η-mixable if for any distribution π on A there exists an action a_π ∈ A such that

  ℓ(y, a_π) ≤ −(1/η) ln ∫ e^{−ηℓ(y, a)} π(da)   for all y.

• Thm: (ℓ, F, P*) is η-stochastically mixable iff for any distribution π on F there exists f* ∈ F such that

  E[ℓ(Y, f*(X))] ≤ E[ −(1/η) ln ∫ e^{−ηℓ(Y, f(X))} π(df) ]
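A standard worked example (not on the slide) of the first inequality: log-loss is 1-mixable, with the mixture density serving as the action a_π.

```latex
% Log-loss is 1-mixable: for \ell(y,p) = -\log p(y) and \eta = 1,
% choose a_\pi = p_\pi, the mixture density p_\pi(y) = \int p(y)\,\pi(\mathrm{d}p).
% Then for every y,
\ell(y, p_\pi)
  = -\log \int p(y)\,\pi(\mathrm{d}p)
  = -\frac{1}{\eta} \ln \int e^{-\eta\,\ell(y,p)}\,\pi(\mathrm{d}p),
% so the defining inequality holds (with equality) for every y.
```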
Equivalence of Stochastic Mixability and Ordinary Mixability

  F_full = {all functions from X to A}

• Thm: Suppose ℓ is a proper loss and X is discrete. Then ℓ is η-mixable if and only if (ℓ, F_full, P*) is η-stochastically mixable for all P*.