Statistical Learning Theory: A Hitchhiker’s Guide
John Shawe-Taylor (UCL) · Omar Rivasplata (UCL / DeepMind)
December 2018, Neural Information Processing Systems
NeurIPS 2018 Slide 1 / 52
Why SLT
NeurIPS 2018 Slide 2 / 52
Error distribution picture
[Figure: distribution of test errors for a Parzen window classifier and a linear SVM, with the mean and a 95% confidence region marked for each; error on the horizontal axis from 0 to 1.]
NeurIPS 2018 Slide 3 / 52
SLT is about high confidence
For a fixed algorithm, function class and sample size, generating random samples −→ distribution of test errors
• Focusing on the mean of the error distribution?
⊲ can be misleading: the learner only has one sample
• Statistical Learning Theory: tail of the distribution
⊲ finding bounds which hold with high probability over random samples of size m
• Compare to a statistical test – at 99% confidence level
⊲ chances of the conclusion not being true are less than 1%
• PAC: probably approximately correct
⊲ P^m[large error] ≤ δ
⊲ Use a ‘confidence parameter’ δ: δ is the probability of being misled by the training set
• Hence high confidence: P^m[approximately correct] ≥ 1 − δ
NeurIPS 2018 Slide 4 / 52
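A minimal sketch of the picture behind these slides: fix an algorithm and a sample size, draw many random training sets, and look at the spread of test errors rather than just the mean. The data source (a two-Gaussian mixture) and the nearest-centroid classifier are our own illustrative choices, not from the tutorial.

```python
# Sketch: the "error distribution picture" on synthetic data.
# Assumptions (ours): a 2-class Gaussian data source and a nearest-centroid
# classifier stand in for "a fixed algorithm and function class".
import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    """Draw n labelled points from a simple two-Gaussian mixture over R^2."""
    y = rng.integers(0, 2, size=n)
    x = rng.normal(loc=y[:, None] * 2.0, scale=1.5, size=(n, 2))
    return x, y

def nearest_centroid(x_train, y_train):
    """Return a classifier h: x -> {0, 1} given a training set."""
    c0 = x_train[y_train == 0].mean(axis=0)
    c1 = x_train[y_train == 1].mean(axis=0)
    return lambda x: (np.linalg.norm(x - c1, axis=1)
                      < np.linalg.norm(x - c0, axis=1)).astype(int)

m, n_test, n_repeats = 50, 20_000, 500
x_test, y_test = sample(n_test)            # large test set ~ out-of-sample error

test_errors = []
for _ in range(n_repeats):                 # many random training sets of size m
    x_tr, y_tr = sample(m)
    h = nearest_centroid(x_tr, y_tr)
    test_errors.append(np.mean(h(x_test) != y_test))

lo, hi = np.quantile(test_errors, [0.025, 0.975])
print(f"mean test error = {np.mean(test_errors):.3f}, "
      f"95% of runs in [{lo:.3f}, {hi:.3f}]")
```

The learner only ever sees one of these runs, which is why the tutorial focuses on high-probability statements about the tail rather than the mean.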
Error distribution picture (revisited)
[Figure repeated from Slide 3: distribution of test errors for a Parzen window classifier and a linear SVM, with the mean and a 95% confidence region marked for each.]
NeurIPS 2018 Slide 5 / 52
Overview
NeurIPS 2018 Slide 6 / 52
The Plan
• Definitions and Notation (John)
⊲ risk measures, generalization
• First generation SLT (Omar)
⊲ worst-case uniform bounds
⊲ Vapnik–Chervonenkis characterization
• Second generation SLT (John)
⊲ hypothesis-dependent complexity
⊲ SRM, margin, PAC-Bayes framework
• Next generation SLT? (Omar)
⊲ stability, deep NNs, future directions
NeurIPS 2018 Slide 7 / 52
What to expect
We will...
⊲ focus on aims, methods and key ideas
⊲ outline some proofs
⊲ be a hitchhiker’s guide!
We will not...
⊲ give detailed proofs or survey the full literature (apologies!)
⊲ give a complete history or cover other learning paradigms
⊲ attempt encyclopaedic coverage of SLT
NeurIPS 2018 Slide 8 / 52
Definitions and Notation
NeurIPS 2018 Slide 9 / 52
Mathematical formalization
Learning algorithm A : Z^m → H
• Z = X × Y
• X = set of inputs
• Y = set of labels
• H = hypothesis class = set of predictors (e.g. classifiers)
Training set (aka sample): S_m = ((X_1, Y_1), ..., (X_m, Y_m)), a finite sequence of input–label examples.
SLT assumptions:
• A data-generating distribution P over Z.
• The learner doesn’t know P, only sees the training set.
• The training set examples are i.i.d. from P: S_m ∼ P^m
⊲ these can be relaxed (but that is beyond the scope of this tutorial)
NeurIPS 2018 Slide 10 / 52
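To make these objects concrete, here is a toy instantiation of the formal setup; the choice of X, Y, the finite class of threshold classifiers, and empirical risk minimisation as the learning rule are all our own illustrative assumptions, not part of the slide.

```python
# Sketch of the formal objects: Z = X x Y with X = R, Y = {-1,+1};
# H = a finite set of threshold classifiers; the algorithm A maps a
# sample S_m to a hypothesis by empirical risk minimisation.
import numpy as np

thresholds = np.linspace(-3, 3, 61)
H = [(t, s) for t in thresholds for s in (-1, +1)]   # h_{t,s}(x) = s * sign(x - t)

def predict(h, x):
    t, s = h
    return s * np.sign(x - t)

def A(S_m):
    """Learning algorithm A: Z^m -> H (here: empirical risk minimisation)."""
    x, y = S_m
    risks = [np.mean(predict(h, x) != y) for h in H]
    return H[int(np.argmin(risks))]

# A data-generating distribution P over Z, unknown to the learner,
# from which the m examples are drawn i.i.d.
rng = np.random.default_rng(1)
m = 100
x = rng.normal(size=m)
y = np.where(x + 0.3 * rng.normal(size=m) > 0.5, 1, -1)   # noisy threshold at 0.5

h_hat = A((x, y))
print("chosen hypothesis (threshold, sign):", h_hat)
```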
What to achieve from the sample?
Use the available sample to:
(1) learn a predictor
(2) certify the predictor’s performance
Learning a predictor:
• algorithm driven by some learning principle
• informed by prior knowledge resulting in an inductive bias
Certifying performance:
• what happens beyond the training set
• generalization bounds
Actually these two goals interact with each other!
NeurIPS 2018 Slide 11 / 52
Risk (aka error) measures
A loss function ℓ(h(X), Y) is used to measure the discrepancy between a predicted label h(X) and the true label Y.
Empirical risk (in-sample):  R_in(h) = (1/m) Σ_{i=1}^m ℓ(h(X_i), Y_i)
Theoretical risk (out-of-sample):  R_out(h) = E[ℓ(h(X), Y)]
Examples:
• ℓ(h(X), Y) = 1[h(X) ≠ Y] : 0-1 loss (classification)
• ℓ(h(X), Y) = (Y − h(X))² : square loss (regression)
• ℓ(h(X), Y) = (1 − Y h(X))_+ : hinge loss
• ℓ(h(X), Y) = − log(h(X)) : log loss (density estimation)
NeurIPS 2018 Slide 12 / 52
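A short sketch of the empirical risk under a few of the losses listed above; the toy labels and scores are our own, and the 0-1 loss is applied to the sign of a real-valued score for convenience.

```python
# Sketch: empirical risk R_in(h) = (1/m) * sum_i loss(h(x_i), y_i)
# under some of the losses on the slide (toy data, our own).
import numpy as np

def zero_one(pred, y):
    return (np.sign(pred) != y).astype(float)      # classification error
def square(pred, y):
    return (y - pred) ** 2                         # regression
def hinge(pred, y):
    return np.maximum(0.0, 1.0 - y * pred)         # margin-based

def empirical_risk(loss, preds, y):
    return float(np.mean(loss(preds, y)))

y = np.array([+1, -1, +1, +1])            # true labels
preds = np.array([0.8, -0.3, -0.1, 2.0])  # real-valued scores h(x_i)

for loss in (zero_one, square, hinge):
    print(loss.__name__, empirical_risk(loss, preds, y))
```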
Generalization
If classifier h does well on the in-sample (X, Y) pairs... will it still do well on out-of-sample pairs?
Generalization gap:  ∆(h) = R_out(h) − R_in(h)
Upper bounds: w.h.p.  ∆(h) ≤ ε(m, δ), i.e. R_out(h) ≤ R_in(h) + ε(m, δ)
Lower bounds: w.h.p.  ∆(h) ≥ ε̃(m, δ)
Flavours:
• distribution-free vs. distribution-dependent
• algorithm-free vs. algorithm-dependent
NeurIPS 2018 Slide 13 / 52
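Since R_out is an expectation under the unknown distribution P, it cannot be computed exactly; a rough sketch, with our own toy setup, is to approximate it with a large held-out sample and look at the resulting gap.

```python
# Sketch: estimating the generalization gap Delta(h) = R_out(h) - R_in(h).
# R_out is approximated by Monte Carlo with a large held-out sample
# (setup and numbers are ours, for illustration only).
import numpy as np

rng = np.random.default_rng(2)

def sample(n):
    x = rng.uniform(-1, 1, size=n)
    y = np.where(x > 0.1, 1, -1)
    flip = rng.random(n) < 0.1               # 10% label noise
    return x, np.where(flip, -y, y)

x_tr, y_tr = sample(50)                      # training sample S_m
x_te, y_te = sample(100_000)                 # proxy for the distribution P

h = lambda x: np.where(x > np.median(x_tr), 1, -1)   # a data-dependent rule

R_in = np.mean(h(x_tr) != y_tr)
R_out = np.mean(h(x_te) != y_te)             # Monte Carlo estimate of R_out(h)
print(f"R_in = {R_in:.3f}, R_out ~ {R_out:.3f}, gap ~ {R_out - R_in:.3f}")
```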
First generation SLT
NeurIPS 2018 Slide 14 / 52
Building block: one single function
For one fixed (non data-dependent) h:
• E[R_in(h)] = E[(1/m) Σ_{i=1}^m ℓ(h(X_i), Y_i)] = R_out(h)
• P^m[∆(h) > ε] = P^m[E[R_in(h)] − R_in(h) > ε]  — a deviation inequality
• the ℓ(h(X_i), Y_i) are independent random variables
• If 0 ≤ ℓ(h(X), Y) ≤ 1, Hoeffding’s inequality gives  P^m[∆(h) > ε] ≤ exp(−2mε²) = δ
• Given δ ∈ (0, 1), equate the RHS to δ and solve for ε:
  P^m[∆(h) > √((1/2m) log(1/δ))] ≤ δ
• With probability ≥ 1 − δ:  R_out(h) ≤ R_in(h) + √((1/2m) log(1/δ))
NeurIPS 2018 Slide 15 / 52
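A small sketch of this single-function bound: compute ε(m, δ) = √(log(1/δ)/(2m)) and check by simulation, for a classifier fixed before seeing any data, that the deviation event is rare. The data distribution and the fixed classifier are our own toy choices.

```python
# Sketch: the single-function Hoeffding bound eps(m, delta) = sqrt(log(1/delta)/(2m)),
# checked by simulation for one fixed classifier (toy setup, ours).
import numpy as np

def hoeffding_eps(m, delta):
    return np.sqrt(np.log(1.0 / delta) / (2.0 * m))

rng = np.random.default_rng(3)
m, delta = 200, 0.05
h = lambda x: np.where(x > 0.0, 1, -1)        # fixed BEFORE seeing any data

def sample(n):
    x = rng.uniform(-1, 1, size=n)
    clean = np.where(x > 0, 1, -1)
    y = np.where(rng.random(n) < 0.9, clean, -clean)   # 10% label noise
    return x, y

R_out = 0.1                                   # true 0-1 error of h under this P
eps = hoeffding_eps(m, delta)

violations = 0
for _ in range(10_000):
    x, y = sample(m)
    R_in = np.mean(h(x) != y)
    violations += (R_out - R_in > eps)        # the one-sided deviation on the slide
print(f"eps = {eps:.3f}, empirical P[Delta > eps] = {violations / 10_000:.4f} <= {delta}")
```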
Finite function class
Algorithm A : Z^m → H, function class H with |H| < ∞
Aim for a uniform bound:  P^m[∀ f ∈ H, ∆(f) ≤ ε] ≥ 1 − δ
Basic tool:  P^m(E_1 or E_2 or ···) ≤ P^m(E_1) + P^m(E_2) + ···, known as the union bound (aka countable sub-additivity)
P^m[∃ f ∈ H, ∆(f) > ε] ≤ Σ_{f ∈ H} P^m[∆(f) > ε] ≤ |H| exp(−2mε²) = δ
With probability ≥ 1 − δ:  ∀ h ∈ H,  R_out(h) ≤ R_in(h) + √((1/2m) log(|H|/δ))
NeurIPS 2018 Slide 16 / 52
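The uniform bound above is easy to evaluate numerically; a quick sketch of how the penalty scales with the size of the class (the particular values of m, δ and |H| are arbitrary):

```python
# Sketch: the finite-class uniform bound on the slide,
# eps(m, |H|, delta) = sqrt(log(|H| / delta) / (2 m)).
import numpy as np

def finite_class_eps(m, card_H, delta):
    """With prob >= 1 - delta, R_out(h) <= R_in(h) + this, simultaneously for all h in H."""
    return np.sqrt(np.log(card_H / delta) / (2.0 * m))

for card_H in (10, 1_000, 10**6):
    print(card_H, round(finite_class_eps(m=1_000, card_H=card_H, delta=0.05), 4))
# The penalty grows only logarithmically in |H|.
```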
Uncountably infinite function class?
Algorithm A : Z^m → H, function class H with |H| ≥ |N|
• Double sample trick: introduce a second ‘ghost sample’
⊲ true error ↔ empirical error on the ghost sample
⊲ hence reduce to a finite number of behaviours
⊲ make a union bound, but with bad events grouped together
• Symmetrization: bound the probability of good performance on one sample but bad performance on the other
⊲ by swapping examples between the actual and ghost samples
• Growth function of class H:  G_H(m) = largest number of dichotomies (±1 labellings) generated by the class H on any m points
• VC dimension of class H:  VC(H) = largest m such that G_H(m) = 2^m
NeurIPS 2018 Slide 17 / 52
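A brute-force sketch of the growth function and shattering for one simple uncountable class; the choice of 1-D threshold classifiers (and the dense grid standing in for all thresholds) is our own illustration.

```python
# Sketch: counting dichotomies for H = 1-D threshold classifiers h_t(x) = sign(x - t).
import numpy as np

def dichotomies(points, thresholds):
    """All distinct +/-1 labellings of `points` realised by the threshold class."""
    labellings = set()
    for t in thresholds:
        labellings.add(tuple(np.where(points > t, 1, -1)))
    return labellings

points = np.array([0.2, 0.5, 0.9])
thresholds = np.linspace(-1, 2, 3001)        # dense grid stands in for all t
D = dichotomies(points, thresholds)

print("number of dichotomies on 3 points:", len(D))   # 4, not 2^3 = 8
print("shattered:", len(D) == 2 ** len(points))       # False: VC dim of this class is 1
```

Even though this class is uncountable, it realises only m + 1 dichotomies on any m points, which is exactly the kind of finite behaviour count the double sample trick exploits.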
VC upper bound
Vapnik & Chervonenkis: for any m, for any δ ∈ (0, 1), with probability ≥ 1 − δ:
  ∀ h ∈ H,  ∆(h) ≤ √((8/m) log(4 G_H(2m) / δ))
Bounding the growth function → Sauer’s Lemma:
• If d = VC(H) is finite, then G_H(m) ≤ Σ_{k=0}^{d} (m choose k) for all m,
  which implies G_H(m) ≤ (em/d)^d (polynomial in m).
For H with d = VC(H) finite, for any m, for any δ ∈ (0, 1), with probability ≥ 1 − δ:
  ∀ h ∈ H,  ∆(h) ≤ √((8d/m) log(2em/d) + (8/m) log(4/δ))
NeurIPS 2018 Slide 18 / 52
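A quick sketch evaluating the last bound numerically (the values of d, m and δ are arbitrary, chosen only to show the scale of the bound):

```python
# Sketch: evaluating the VC bound on this slide,
# eps(m, d, delta) = sqrt( (8 d / m) log(2 e m / d) + (8 / m) log(4 / delta) ).
import numpy as np

def vc_bound(m, d, delta):
    return np.sqrt((8.0 * d / m) * np.log(2.0 * np.e * m / d)
                   + (8.0 / m) * np.log(4.0 / delta))

for m in (10**2, 10**4, 10**6):
    print(f"m = {m:>8}: eps = {vc_bound(m, d=10, delta=0.05):.3f}")
# Vacuous (> 1) for m = 100, and it shrinks roughly like sqrt((d/m) log m).
```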
PAC learnability
VC upper bound:
• Note that the bound is the same for all functions in the class (uniform over H) and the same for all distributions (uniform over P).
VC lower bound:
• The VC dimension characterises learnability in the PAC setting: there exist distributions such that, with large probability over m random examples, the gap between the risk and the best possible risk achievable over the class is at least √(d/m).
NeurIPS 2018 Slide 19 / 52
Limitations of the VC framework
• The theory is certainly valid and tight – the lower and upper bounds match!
• VC bounds motivate Empirical Risk Minimization (ERM), as they apply to a whole hypothesis space and are not hypothesis-dependent.
• Practical algorithms often do not search a fixed hypothesis space but regularise to trade complexity against empirical error, e.g. k-NN, SVMs or DNNs.
• Mismatch between theory and practice.
• Let’s illustrate this with SVMs...
NeurIPS 2018 Slide 20 / 52
SVM with Gaussian kernel
κ(x, z) = exp(−‖x − z‖² / (2σ²))
[Figure: distribution of test errors for a Parzen window classifier and a Gaussian-kernel SVM; error on the horizontal axis from 0 to 1.]
NeurIPS 2018 Slide 21 / 52
SVM with Gaussian kernel: a case study
• The VC dimension is infinite,
• but the observed performance is often excellent.
• VC bounds aren’t able to explain this;
• the lower bounds appear to contradict the observations.
• How to resolve this apparent contradiction?
• Coming up: large margin ⊲ the distribution may not be worst-case.
NeurIPS 2018 Slide 22 / 52
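A minimal sketch of the case study, assuming scikit-learn is available; the dataset (two moons) and the kernel width are our own choices, not from the tutorial.

```python
# Sketch: a Gaussian-kernel SVM generalises well in practice even though the
# VC dimension of the induced function class is infinite.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.svm import SVC

x, y = make_moons(n_samples=2_000, noise=0.2, random_state=0)
x_tr, y_tr, x_te, y_te = x[:200], y[:200], x[200:], y[200:]

# kernel k(x, z) = exp(-||x - z||^2 / (2 sigma^2)); sklearn's gamma = 1 / (2 sigma^2)
sigma = 0.5
svm = SVC(kernel="rbf", gamma=1.0 / (2.0 * sigma**2), C=1.0).fit(x_tr, y_tr)

print("train error:", 1.0 - svm.score(x_tr, y_tr))
print("test  error:", 1.0 - svm.score(x_te, y_te))
# A distribution-free VC bound cannot be informative here, yet the observed test
# error is small: margin-based (second-generation) bounds are needed to explain it.
```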
Hitchhiker’s guide
[Cartoon contrasting the two sides so far: the theory is ‘nice and complete’ yet ‘right but wrong’ for practice, and its practical usefulness is ‘not so much’.]
NeurIPS 2018 Slide 23 / 52