A trichotomy of rates in supervised learning
Amir Yehudayoff (Technion), Olivier Bousquet (Google), Steve Hanneke (TTIC), Shay Moran (Technion & Google), Ramon van Handel (Princeton)
background: learning theory
PAC learning is the standard definition, but it sometimes fails to provide valuable information about
– specific algorithms (nearest neighbor, neural nets, ...)
– specific problems
this motivates studying learning rates
framework
input: a sample of size n, S = ((x_1, y_1), ..., (x_n, y_n)) ∈ (X × {0,1})^n
output: a hypothesis h ∈ {0,1}^X
a learning algorithm A maps S ↦ h = A(S)
generalization goal: PAC learning
if S = ((x_1, y_1), ..., (x_n, y_n)) is i.i.d. from an unknown µ,
then h = A(S) is typically close to µ
closeness is measured by err(h) = Pr_{(x,y)∼µ}[h(x) ≠ y]
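The setup can be made concrete with a small numerical sketch. The class, distribution, and learner below are hypothetical illustrations (a threshold class with an ERM-style algorithm), not objects from the talk:

```python
import random

# Hypothetical class H of threshold hypotheses h_t(x) = 1 iff x >= t.
def make_threshold(t):
    return lambda x: 1 if x >= t else 0

H = [make_threshold(t / 10) for t in range(11)]

# Unknown distribution mu: x uniform on [0, 1], labeled by the true
# threshold 0.5 (so mu is realizable for H).
def sample_mu(n, rng):
    return [(x, 1 if x >= 0.5 else 0) for x in (rng.random() for _ in range(n))]

# err(h) = Pr_{(x,y)~mu}[h(x) != y], estimated on a fresh test sample.
def estimate_err(h, rng, m=10_000):
    return sum(h(x) != y for x, y in sample_mu(m, rng)) / m

# A learning algorithm A: empirical risk minimization over H.
def A(S):
    return min(H, key=lambda h: sum(h(x) != y for x, y in S))

rng = random.Random(0)
h = A(sample_mu(100, rng))
print(estimate_err(h, rng))  # typically small: h is close to mu
```

Running A on a sample of 100 points typically recovers the true threshold, so the estimated error is near zero.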
context
without "context", learning is "impossible": what is the next element of 1, 2, 3, 4, 5, ...?
there are a few possible definitions of context; for a class H, the distribution µ is realizable if
inf{ err(h) : h ∈ H } = 0, where err(h) = Pr_{(x,y)∼µ}[h(x) ≠ y]
PAC learning
the error of algorithm A for sample size n is
ERR_n(A, H) = sup{ E_{S∼µ^n} err(A(S)) : µ is H-realizable }
the class H is PAC learnable if there is an algorithm A such that lim_{n→∞} ERR_n(A, H) = 0
VC theory
theorem [Vapnik-Chervonenkis, Blumer-Ehrenfeucht-Haussler-Warmuth, ...]: H is PAC learnable ⇔ the VC dimension of H is finite
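For intuition, the VC dimension of a small finite class can be computed by brute force. A hypothetical sketch (the function and class names are made up for illustration) that checks all subsets of the domain for shattering:

```python
from itertools import combinations

def vc_dimension(H, X):
    """Largest d such that some d-subset of X is shattered by H.

    H is a list of functions X -> {0, 1}; X is a finite list of points.
    A set of points is shattered if H realizes all 2^d labelings on it.
    """
    best = 0
    for d in range(1, len(X) + 1):
        shattered_some = False
        for pts in combinations(X, d):
            patterns = {tuple(h(x) for x in pts) for h in H}
            if len(patterns) == 2 ** d:
                shattered_some = True
                break
        if shattered_some:
            best = d
        else:
            break
    return best

# Thresholds on {0, 1, 2, 3}: monotone labelings only, so VC dimension 1.
thresholds = [lambda x, t=t: 1 if x >= t else 0 for t in range(5)]
print(vc_dimension(thresholds, [0, 1, 2, 3]))  # -> 1
```

The double-exponential cost is fine here; this is only meant to make the definition of shattering concrete.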
learning curve [Schuurmans]
the error "should" decrease as more examples are seen; this improvement is important (to predict, to estimate, ...)
rates
usually µ is unknown but fixed; we want a definition that captures this
the rate of algorithm A with respect to µ is
rate(n) = rate_{A,µ}(n) = E_S err(A(S)), where err(h) = Pr_{(x,y)∼µ}[h(x) ≠ y] and |S| = n
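rate_{A,µ}(n) can be estimated by Monte Carlo: draw many independent samples S of size n, run A on each, and average the test error. A hypothetical sketch with a toy threshold setup (all names are illustrative assumptions):

```python
import random

# Hypothetical toy setup: threshold class on [0, 1], target threshold 0.5,
# so the distribution mu is realizable.
def sample_mu(n, rng):
    return [(x, 1 if x >= 0.5 else 0) for x in (rng.random() for _ in range(n))]

# A: empirical risk minimization over 11 candidate thresholds.
def A(S):
    ts = [t / 10 for t in range(11)]
    best = min(ts, key=lambda t: sum((x >= t) != y for x, y in S))
    return lambda x: 1 if x >= best else 0

def estimate_rate(n, trials=200, test_size=2000, seed=0):
    """Monte Carlo estimate of rate(n) = E_S err(A(S)) with |S| = n."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        h = A(sample_mu(n, rng))
        test = sample_mu(test_size, rng)
        total += sum(h(x) != y for x, y in test) / test_size
    return total / trials

# The learning curve: the estimated rate decreases as n grows.
print(estimate_rate(10), estimate_rate(100))
```

Plotting estimate_rate against n traces out the learning curve for this fixed µ.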
VC classes
theorem: the upper envelope is ≈ VC/n [Vapnik-Chervonenkis, Blumer-Ehrenfeucht-Haussler-Warmuth, ...]
experiments: rate(n) ≲ exp(−n) for fixed µ [Cohn-Tesauro]
rate of a class
R : N → [0, 1] is a rate function
the class H has rate ≤ R if ∃A ∀µ ∃C ∀n: E err(A(S)) < C · R(n/C)
the class H has rate ≥ R if ∃C ∀A ∃µ: E err(A(S)) > R(Cn)/C for ∞ many n
the class H has rate R if both hold
rates: comments
rate ≤ R if ∃A ∀µ ∃C ∀n: E err(A(S)) < C · R(n/C)
the algorithm A does not know the distribution µ
the "complexity" of µ is captured by the delay factor C = C(µ)
trichotomy
theorem*: the rate of H can only be
– exponential (e^{−n})
– linear (1/n)
– arbitrarily slow (for every R → 0, at least R)
* realizable case, |H| > 2, standard measurability assumptions
trichotomy: comments
e.g., rate 2^{−√n} is not an option
Schuurmans proved a special case (a dichotomy for chains)
the higher the complexity of H, the slower the rate; the complexity is characterized by "shattering capabilities"
exponential rate
proposition: the rate of H is exponential iff H does not shatter an infinite Littlestone tree
exponential rate
lower bound: if |H| > 2 then the rate is ≥ e^{−n}
upper bound: if H does not shatter an infinite Littlestone tree then the rate is ≤ e^{−n}:
∃A ∀µ ∃C ∀n: E err(A(S)) < C e^{−n/C}
need: no tree ⇒ algorithm
duality (LP, games, ...)
no tree ⇒ algorithm
simplest example: no point in the intersection of two convex bodies ⇒ there is a separating hyperplane
duality for Gale-Stewart games: one of the players has a winning strategy
problem: how complex is this strategy?
measurability
the value of a position is an ordinal; it measures "how many steps to victory" (n steps to mate [Evans, Hamkins])
the Littlestone dimension of H is the ordinal
LD(H) = 0 if |H| ≤ 1; ∞ if H shatters an infinite tree;
sup_{x∈X} min_{y∈{0,1}} ( LD(H_{x↦y}) + 1 ) otherwise, where H_{x↦y} = {h ∈ H : h(x) = y}
theorem (relies on [Kunen-Martin]): if H is measurable* then LD(H) is countable
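For a finite class on a finite domain, the recursion defining LD(H) can be evaluated directly: the sup and min range over finite sets, and no infinite tree can be shattered, so the ordinal is just a natural number. A hypothetical sketch, representing each hypothesis as a tuple of labels indexed by the points of X; points that do not split H are skipped, since there one restriction is empty and the min over y contributes nothing to the sup:

```python
def littlestone_dim(H, X):
    """Evaluate the LD recursion for a finite class (illustrative sketch).

    Hypotheses are tuples of labels, with h[i] = h(X[i]).  Each recursive
    call restricts H to H_{x -> y} = {h in H : h(x) = y}; only points
    that split H into two nonempty parts can increase the dimension.
    """
    if len(H) <= 1:
        return 0
    best = 0
    for i in range(len(X)):
        H0 = [h for h in H if h[i] == 0]  # H_{x -> 0}
        H1 = [h for h in H if h[i] == 1]  # H_{x -> 1}
        if H0 and H1:
            best = max(best, min(littlestone_dim(H0, X),
                                 littlestone_dim(H1, X)) + 1)
    return best

# All 4 labelings of two points: a depth-2 Littlestone tree is shattered.
print(littlestone_dim([(0, 0), (0, 1), (1, 0), (1, 1)], [0, 1]))  # -> 2
```

Each split strictly shrinks H, so the recursion terminates; this finite picture is what the ordinal-valued definition extends to infinite classes.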
summary
learning rates capture distribution-specific performance
there are 3 possible learning rates in the realizable case
the rate is characterized by shattering capabilities:
– shattering ⇒ a hard distribution via a construction
– no shattering ⇒ an algorithm via duality
the complexity of the algorithm is controlled via ordinals, etc.
to do
– agnostic case
– accurate bounds on rates
– applications of the shattering framework