Risk bounds for some classification and regression models that interpolate
Daniel Hsu, Columbia University
Joint work with: Misha Belkin (The Ohio State University) and Partha Mitra (Cold Spring Harbor Laboratory)
(Breiman, 1995)
When is "interpolation" justified in ML?
• Supervised learning: use training examples to find a function that predicts accurately on new examples.
• Interpolation: find a function that perfectly fits the training examples. Some call this "overfitting".
• PAC learning (Valiant, 1984; Blumer, Ehrenfeucht, Haussler, & Warmuth, 1987; …):
  • realizable, noise-free setting
  • bounded-capacity hypothesis class
• Regression models:
  • Can interpolate if there is no noise!
  • E.g., linear models with d ≥ n (see the sketch below).
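(Not on the slides.) A minimal numpy sketch of that last bullet: with noise-free labels and at least as many parameters as training examples (d ≥ n), the minimum-norm least-squares solution fits the training data exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 200                        # more parameters than training examples
X = rng.standard_normal((n, d))
w_star = rng.standard_normal(d)
y = X @ w_star                        # noise-free labels

# Minimum-norm solution of the underdetermined system X w = y.
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(X @ w_hat, y))      # True: the linear model interpolates
```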
Overfitting
Some observations from the field (Zhang, Bengio, Hardt, Recht, & Vinyals, 2017)
• Can fit any training data, given enough time and large enough network.
• Can generalize even when training data has substantial amount of label noise.
More observations from the field (Belkin, Ma, & Mandal, 2018)
[Figure: MNIST experiments]
• Can fit any training data, given enough time and rich enough feature space.
• Can generalize even when training data has substantial amount of label noise.
Summary of some empirical observations
• Training produces a function f̂ that perfectly fits the noisy training data.
• f̂ is likely a very complex function!
• Yet, the test error of f̂ is non-trivial: e.g., noise rate + 5%.
Can theory explain these observations?
"Classical" learning theory Generalization : 0 true error rate ≤ training error rate + deviation bound • Deviation bound : depends on "complexity" of learned function • Capacity control, regularization, smoothing, algorithmic stability, margins, … • None known to be non-trivial for functions interpolating noisy data. • E.g., function is chosen from class rich enough to express all possible ways to label Ω(%) training examples. • Bound must exploit specific properties of chosen function. 8
Even more observations from the field (Wyner, Olson, Bleich, & Mease, 2017)
• Some "local interpolation" methods are robust to label noise.
• Can limit influence of noisy points in other parts of data space.
What is known in theory?
Nearest neighbor (Cover & Hart, 1967):
• Predict with the label of the nearest training example.
• Interpolates training data.
• Not always consistent, but almost.
Hilbert kernel (Devroye, Györfi, & Krzyżak, 1998):
• Bandwidth-free(!) Nadaraya-Watson smoothing kernel regression, with weights w(x, x_i) = 1/‖x − x_i‖^d.
• Interpolates training data.
• Consistent(!!), but no rates.
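A minimal sketch (mine, not from the talk) of the Hilbert kernel estimate: a Nadaraya-Watson average over all training points with the bandwidth-free, singular weight w(x, x_i) = ‖x − x_i‖^(−d), which forces interpolation of the training labels.

```python
import numpy as np

def hilbert_kernel_estimate(X_train, y_train, x):
    """Bandwidth-free Nadaraya-Watson estimate with weights ||x - x_i||^(-d).

    The weight blows up as x approaches a training point, so that point's
    label dominates the average: the estimate interpolates the training data.
    """
    d = X_train.shape[1]
    dist = np.linalg.norm(X_train - x, axis=1)
    if np.any(dist == 0.0):            # exact hit on a training point
        return float(np.asarray(y_train)[dist == 0.0][0])
    w = dist ** (-float(d))
    return float(np.dot(w, y_train) / w.sum())
```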
Our goals
• Counter the "conventional wisdom" re: interpolation
  Show interpolation methods can be consistent (or almost consistent) for classification & regression problems
• Identify some useful properties of certain local prediction methods
• Suggest connections to practical methods
Our new results
Analyses of two new interpolation schemes:
1. Simplicial interpolation
  • Natural linear interpolation based on multivariate triangulation
  • Asymptotic advantages compared to nearest neighbor rule
2. Weighted k-NN interpolation
  • Consistency + non-asymptotic convergence rates
1. Simplicial interpolation
Interpolation via multivariate triangulation
• IID training examples (x_1, y_1), …, (x_n, y_n) ∈ ℝ^d × [0,1].
• Partition C ≔ conv(x_1, …, x_n) into simplices with the x_i as vertices via the Delaunay triangulation.
• Define η̂(x) on each simplex by affine interpolation of the vertices' labels.
• The result is piecewise linear on C. (Punt on what happens outside of C.)
• For classification (y ∈ {0,1}), let ĝ be the plug-in classifier based on η̂, i.e., ĝ(x) = 1{η̂(x) > 1/2}.
What happens on a single simplex
• Simplex on vertices x_1, …, x_{d+1} with corresponding labels y_1, …, y_{d+1}.
• Test point x in the simplex, with barycentric coordinates (λ_1, …, λ_{d+1}).
• Linear interpolation at x (i.e., the least squares fit, evaluated at x):
  η̂(x) = Σ_{i=1}^{d+1} λ_i y_i
Key idea: aggregates information from all vertices to make the prediction. (C.f. nearest neighbor rule.)
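Below is a minimal sketch of this scheme (the function name and structure are my own): Delaunay-triangulate the training points, locate the simplex containing each test point, and average the vertex labels with barycentric weights. For reference, scipy.interpolate.LinearNDInterpolator implements the same piecewise-linear rule.

```python
import numpy as np
from scipy.spatial import Delaunay

def simplicial_interpolate(X_train, y_train, X_test):
    """Piecewise-linear interpolation over the Delaunay triangulation of X_train.

    Returns NaN for test points outside conv(X_train) (the talk punts on that case).
    """
    y_train = np.asarray(y_train)
    tri = Delaunay(X_train)
    simplex = tri.find_simplex(X_test)              # -1 for points outside the hull
    preds = np.full(len(X_test), np.nan)
    inside = simplex >= 0

    # Barycentric coordinates of each inside point w.r.t. its containing simplex.
    T = tri.transform[simplex[inside]]              # shape (m, d+1, d)
    delta = X_test[inside] - T[:, -1]               # offset from the reference vertex
    lam = np.einsum("ijk,ik->ij", T[:, :-1], delta)
    lam = np.hstack([lam, 1 - lam.sum(axis=1, keepdims=True)])

    # Linear interpolation = barycentric-weighted average of the vertices' labels.
    vertex_labels = y_train[tri.simplices[simplex[inside]]]
    preds[inside] = np.sum(lam * vertex_labels, axis=1)
    return preds
```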
Comparison to nearest neighbor rule
• Suppose η(x) = Pr(Y = 1 | X = x) < 1/2 for all points in a simplex.
  • The Bayes-optimal prediction is 0 for all points in the simplex.
• Suppose y_1 = ⋯ = y_d = 0, but y_{d+1} = 1 (due to "label noise").
[Figure: two panels over a triangle with vertices x_1, x_2, x_3 labeled 0, 0, 1, with the region "predict 1 here" marked. Left panel: nearest neighbor rule. Right panel: simplicial interpolation.]
• Effect even more pronounced in high dimensions!
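To put rough numbers on the cartoon (my illustration, not from the slides): on a standard simplex whose first d vertices are labeled 0 and whose last vertex is noisily labeled 1, with Bayes predicting 0 everywhere and the simplex vertices treated as the only nearby training points, 1-NN predicts 1 on roughly a 1/(d+1) fraction of the simplex, while the simplicial-interpolation plug-in predicts 1 on only roughly a 2^(−d) fraction.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 10                        # the simplex has d+1 vertices
n_test = 200_000              # Monte Carlo points, uniform in the standard simplex

# Barycentric coordinates of uniform points in a simplex are Dirichlet(1, ..., 1).
lam = rng.dirichlet(np.ones(d + 1), size=n_test)

# Vertex labels: the first d vertices are 0, the last one is (noisily) 1, while
# eta < 1/2 everywhere, so the Bayes prediction is 0 on the whole simplex.
# Simplicial interpolation predicts lam[:, d]; the plug-in classifier thresholds at 1/2.
simplicial_err = np.mean(lam[:, d] > 0.5)

# For the standard simplex, the nearest vertex is the one with the largest
# barycentric coordinate, so 1-NN predicts 1 on that vertex's whole cell.
nn_err = np.mean(lam.argmax(axis=1) == d)

print(f"1-NN disagreement with Bayes:    {nn_err:.4f}   (theory: 1/(d+1) = {1/(d+1):.4f})")
print(f"Simplicial plug-in disagreement: {simplicial_err:.6f} (theory: 2^-d = {0.5**d:.6f})")
```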
Asymptotic risk
Theorem: Assume the distribution of X is uniform on some convex set and η is Hölder smooth. Then the simplicial interpolation estimate satisfies
  limsup_{n→∞} E[(η̂(X) − η(X))^2] ≤ O(σ^2/d),
where σ^2 is the label-noise variance, and the plug-in classifier's asymptotic risk exceeds the Bayes risk by at most a term that decays exponentially with the dimension d.
• Near-consistency in high dimension: Bayes optimal + exponentially small (in d) excess.
• C.f. nearest neighbor classifier: asymptotic risk ≤ twice the Bayes risk.
• "Blessing" of dimensionality (with a caveat about the convergence rate).
2. Weighted k-NN interpolation
Weighted k-NN scheme
• For a given test point x, let x_(1), …, x_(k) be its k nearest neighbors in the training data, and let y_(1), …, y_(k) be the corresponding labels. Define
  η̂(x) = Σ_{i=1}^{k} w(x, x_(i)) y_(i) / Σ_{i=1}^{k} w(x, x_(i)),
  where w(x, x_(i)) = ‖x − x_(i)‖^(−δ), δ > 0.
• Interpolation: η̂(x) → y_(i) as x → x_(i).
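A minimal sketch of this weighted k-NN interpolation scheme (the function name and defaults are mine; sklearn is used only for the neighbor search):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def weighted_knn_interpolate(X_train, y_train, X_test, k=10, delta=1.0):
    """Nadaraya-Watson estimate over the k nearest neighbors, with the singular
    weight w(x, x_i) = ||x - x_i||^(-delta); the analysis needs 0 < delta < d/2.
    """
    nn = NearestNeighbors(n_neighbors=k).fit(X_train)
    dist, idx = nn.kneighbors(X_test)               # both have shape (m, k)
    labels = np.asarray(y_train)[idx]

    preds = np.empty(len(X_test))
    exact = dist[:, 0] == 0.0                       # test point coincides with a training point
    preds[exact] = labels[exact, 0]                 # interpolation: return that label

    w = dist[~exact] ** (-delta)                    # singular but finite weights here
    preds[~exact] = (w * labels[~exact]).sum(axis=1) / w.sum(axis=1)
    return preds
```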
Comparison to Hilbert kernel estimate
Weighted k-NN:
  η̂(x) = Σ_{i=1}^{k} w(x, x_i) y_i / Σ_{i=1}^{k} w(x, x_i), with w(x, x_i) = ‖x − x_i‖^(−δ).
  Our analysis needs 0 < δ < d/2.
Hilbert kernel (Devroye, Györfi, & Krzyżak, 1998):
  η̂(x) = Σ_{i=1}^{n} w(x, x_i) y_i / Σ_{i=1}^{n} w(x, x_i), with w(x, x_i) = ‖x − x_i‖^(−d).
  MUST have δ = d for consistency.
Localization makes it possible to prove a non-asymptotic rate.
Convergence rates
Theorem: Assume the distribution of X is uniform on some compact set satisfying a regularity condition, and η is α-Hölder smooth. For an appropriate setting of k, the weighted k-NN estimate satisfies
  E[(η̂(X) − η(X))^2] ≤ O(n^(−2α/(2α+d))).
If the Tsybakov noise condition with parameter β > 0 also holds, then the plug-in classifier, with an appropriate setting of k, converges to the Bayes risk at a faster polynomial rate whose exponent improves with β.
Closing thoughts
Connections to models used in practice
• Kernel ridge regression: simplicial interpolation is like the Laplace kernel in ℝ^d.
• Random forests: large ensembles with random thresholds may approximate locally-linear interpolation (Cutler & Zhao, 2001).
• Neural nets: many recent empirical studies find similarities between neural nets and k-NN in terms of performance and noise-robustness (Drory, Avidan, & Giryes, 2018; Cohen, Sapiro, & Giryes, 2018).
"Adversarial examples" • Interpolation works because mass of region immediately around noisily-labeled training examples is small in high-dimensions. But also a great source of adversarial examples -- easy to find using local optimization around training examples. 24
Open problems
• A generalization theory to explain the behavior of interpolation methods.
• Kernel methods:
  min_{f ∈ ℋ} ‖f‖_ℋ  subject to f(x_i) = y_i for all i = 1, …, n.
  When does this work (with noisy labels)?
  • Very recent work by T. Liang and A. Rakhlin (2018+) provides some analysis in some regimes.
• Benefits of interpolation?
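A minimal sketch of the displayed problem for one concrete RKHS (a Gaussian kernel, my choice for illustration): by the representer theorem, the minimum-norm interpolant is f(x) = k(x, X) K^(−1) y, i.e., kernel ridge regression with the ridge term set to zero.

```python
import numpy as np

def min_norm_kernel_interpolant(X_train, y_train, bandwidth=1.0):
    """Minimum-RKHS-norm interpolant for a Gaussian kernel: ridgeless kernel
    regression. The fitted function satisfies f(x_i) = y_i (up to conditioning).
    """
    def kernel(A, B):
        sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-sq / (2 * bandwidth ** 2))

    K = kernel(X_train, X_train)
    alpha = np.linalg.solve(K, np.asarray(y_train))   # no ridge term: exact interpolation
    return lambda X_test: kernel(X_test, X_train) @ alpha
```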
Acknowledgements
• National Science Foundation
• Sloan Foundation
• Simons Institute for the Theory of Computing
arxiv.org/abs/1806.05161