Risk bounds for some classification and regression models that interpolate
Daniel Hsu, Columbia University
Joint work with: Misha Belkin (The Ohio State University) and Partha Mitra (Cold Spring Harbor Laboratory)
(Breiman, 1995)
When is "interpolation" justified in ML?
• Supervised learning: use training examples to find a function that predicts accurately on new examples.
• Interpolation: find a function that perfectly fits the training examples. Some call this "overfitting".
• PAC learning (Valiant, 1984; Blumer, Ehrenfeucht, Haussler, & Warmuth, 1987; …):
  • realizable, noise-free setting
  • bounded-capacity hypothesis class
• Regression models:
  • Can interpolate if there is no noise!
  • E.g., linear models with d ≥ n (see the sketch below).
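(Not on the slides.) A minimal numpy sketch of that last bullet: with noise-free labels and at least as many parameters as training examples (d ≥ n), the minimum-norm least-squares solution fits the training data exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 200                        # more parameters than training examples
X = rng.standard_normal((n, d))
w_star = rng.standard_normal(d)
y = X @ w_star                        # noise-free labels

# Minimum-norm solution of the underdetermined system X w = y.
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(X @ w_hat, y))      # True: the linear model interpolates
```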
Overfitting
Some observations from the field (Zhang, Bengio, Hardt, Recht, & Vinyals, 2017)
• Can fit any training data, given enough time and large enough network.
• Can generalize even when training data has substantial amount of label noise.
More observations from the field (Belkin, Ma, & Mandal, 2018)
[Figure: MNIST experiments]
• Can fit any training data, given enough time and rich enough feature space.
• Can generalize even when training data has substantial amount of label noise.
Summary of some empirical observations
• Training produces a function f̂ that perfectly fits the noisy training data.
• f̂ is likely a very complex function!
• Yet, the test error of f̂ is non-trivial: e.g., noise rate + 5%.
Can theory explain these observations?
"Classical" learning theory Generalization : 0 true error rate ≤ training error rate + deviation bound • Deviation bound : depends on "complexity" of learned function • Capacity control, regularization, smoothing, algorithmic stability, margins, … • None known to be non-trivial for functions interpolating noisy data. • E.g., function is chosen from class rich enough to express all possible ways to label Ω(%) training examples. • Bound must exploit specific properties of chosen function. 8
Even more observations from the field (Wyner, Olson, Bleich, & Mease, 2017)
• Some "local interpolation" methods are robust to label noise.
• Can limit influence of noisy points in other parts of data space.
What is known in theory?
Nearest neighbor (Cover & Hart, 1967):
• Predict with the label of the nearest training example.
• Interpolates training data.
• Not always consistent, but almost.
Hilbert kernel (Devroye, Györfi, & Krzyżak, 1998):
• Bandwidth-free(!) Nadaraya-Watson smoothing kernel regression, with weights w(x, x_i) = 1/‖x − x_i‖^d.
• Interpolates training data.
• Consistent(!!), but no rates.
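A minimal sketch (mine, not from the talk) of the Hilbert kernel estimate: a Nadaraya-Watson average over all training points with the bandwidth-free, singular weight w(x, x_i) = ‖x − x_i‖^(−d), which forces interpolation of the training labels.

```python
import numpy as np

def hilbert_kernel_estimate(X_train, y_train, x):
    """Bandwidth-free Nadaraya-Watson estimate with weights ||x - x_i||^(-d).

    The weight blows up as x approaches a training point, so that point's
    label dominates the average: the estimate interpolates the training data.
    """
    d = X_train.shape[1]
    dist = np.linalg.norm(X_train - x, axis=1)
    if np.any(dist == 0.0):            # exact hit on a training point
        return float(np.asarray(y_train)[dist == 0.0][0])
    w = dist ** (-float(d))
    return float(np.dot(w, y_train) / w.sum())
```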
Our goals
• Counter the "conventional wisdom" re: interpolation
  Show interpolation methods can be consistent (or almost consistent) for classification & regression problems
• Identify some useful properties of certain local prediction methods
• Suggest connections to practical methods
Our new results
Analyses of two new interpolation schemes:
1. Simplicial interpolation
  • Natural linear interpolation based on multivariate triangulation
  • Asymptotic advantages compared to nearest neighbor rule
2. Weighted k-NN interpolation
  • Consistency + non-asymptotic convergence rates
1. Simplicial interpolation
Interpolation via multivariate triangulation
• IID training examples (x_1, y_1), …, (x_n, y_n) ∈ ℝ^d × [0,1].
• Partition C ≔ conv(x_1, …, x_n) into simplices with the x_i as vertices via the Delaunay triangulation.
• Define η̂(x) on each simplex by affine interpolation of the vertices' labels.
• The result is piecewise linear on C. (Punt on what happens outside of C.)
• For classification (y ∈ {0,1}), let ĝ be the plug-in classifier based on η̂, i.e., ĝ(x) = 1{η̂(x) > 1/2}.
What happens on a single simplex
• Simplex on vertices x_1, …, x_{d+1} with corresponding labels y_1, …, y_{d+1}.
• Test point x in the simplex, with barycentric coordinates (λ_1, …, λ_{d+1}).
• Linear interpolation at x (i.e., the least squares fit, evaluated at x):
  η̂(x) = Σ_{i=1}^{d+1} λ_i y_i
Key idea: aggregates information from all vertices to make the prediction. (C.f. nearest neighbor rule.)
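Below is a minimal sketch of this scheme (the function name and structure are my own): Delaunay-triangulate the training points, locate the simplex containing each test point, and average the vertex labels with barycentric weights. For reference, scipy.interpolate.LinearNDInterpolator implements the same piecewise-linear rule.

```python
import numpy as np
from scipy.spatial import Delaunay

def simplicial_interpolate(X_train, y_train, X_test):
    """Piecewise-linear interpolation over the Delaunay triangulation of X_train.

    Returns NaN for test points outside conv(X_train) (the talk punts on that case).
    """
    y_train = np.asarray(y_train)
    tri = Delaunay(X_train)
    simplex = tri.find_simplex(X_test)              # -1 for points outside the hull
    preds = np.full(len(X_test), np.nan)
    inside = simplex >= 0

    # Barycentric coordinates of each inside point w.r.t. its containing simplex.
    T = tri.transform[simplex[inside]]              # shape (m, d+1, d)
    delta = X_test[inside] - T[:, -1]               # offset from the reference vertex
    lam = np.einsum("ijk,ik->ij", T[:, :-1], delta)
    lam = np.hstack([lam, 1 - lam.sum(axis=1, keepdims=True)])

    # Linear interpolation = barycentric-weighted average of the vertices' labels.
    vertex_labels = y_train[tri.simplices[simplex[inside]]]
    preds[inside] = np.sum(lam * vertex_labels, axis=1)
    return preds
```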
Comparison to nearest neighbor rule
• Suppose η(x) = Pr(Y = 1 | X = x) < 1/2 for all points in a simplex.
  • The Bayes-optimal prediction is 0 for all points in the simplex.
• Suppose y_1 = ⋯ = y_d = 0, but y_{d+1} = 1 (due to "label noise").
[Figure: two panels over a triangle with vertices x_1, x_2, x_3 labeled 0, 0, 1, with the region "predict 1 here" marked. Left panel: nearest neighbor rule. Right panel: simplicial interpolation.]
• Effect even more pronounced in high dimensions!
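To put rough numbers on the cartoon (my illustration, not from the slides): on a standard simplex whose first d vertices are labeled 0 and whose last vertex is noisily labeled 1, with Bayes predicting 0 everywhere and the simplex vertices treated as the only nearby training points, 1-NN predicts 1 on roughly a 1/(d+1) fraction of the simplex, while the simplicial-interpolation plug-in predicts 1 on only roughly a 2^(−d) fraction.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 10                        # the simplex has d+1 vertices
n_test = 200_000              # Monte Carlo points, uniform in the standard simplex

# Barycentric coordinates of uniform points in a simplex are Dirichlet(1, ..., 1).
lam = rng.dirichlet(np.ones(d + 1), size=n_test)

# Vertex labels: the first d vertices are 0, the last one is (noisily) 1, while
# eta < 1/2 everywhere, so the Bayes prediction is 0 on the whole simplex.
# Simplicial interpolation predicts lam[:, d]; the plug-in classifier thresholds at 1/2.
simplicial_err = np.mean(lam[:, d] > 0.5)

# For the standard simplex, the nearest vertex is the one with the largest
# barycentric coordinate, so 1-NN predicts 1 on that vertex's whole cell.
nn_err = np.mean(lam.argmax(axis=1) == d)

print(f"1-NN disagreement with Bayes:    {nn_err:.4f}   (theory: 1/(d+1) = {1/(d+1):.4f})")
print(f"Simplicial plug-in disagreement: {simplicial_err:.6f} (theory: 2^-d = {0.5**d:.6f})")
```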
Asymptotic risk
Theorem: Assume the distribution of X is uniform on some convex set and η is Hölder smooth. Then the simplicial interpolation estimate satisfies
  limsup_{n→∞} E[(η̂(X) − η(X))^2] ≤ O(σ^2/d),
where σ^2 is the label-noise variance, and the plug-in classifier's asymptotic risk exceeds the Bayes risk by at most a term that decays exponentially with the dimension d.
• Near-consistency in high dimension: Bayes optimal + exponentially small (in d) excess.
• C.f. nearest neighbor classifier: asymptotic risk ≤ twice the Bayes risk.
• "Blessing" of dimensionality (with a caveat about the convergence rate).
2. Weighted k-NN interpolation
Weighted k-NN scheme
• For a given test point x, let x_(1), …, x_(k) be its k nearest neighbors in the training data, and let y_(1), …, y_(k) be the corresponding labels. Define
  η̂(x) = Σ_{i=1}^{k} w(x, x_(i)) y_(i) / Σ_{i=1}^{k} w(x, x_(i)),
  where w(x, x_(i)) = ‖x − x_(i)‖^(−δ), δ > 0.
• Interpolation: η̂(x) → y_(i) as x → x_(i).
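A minimal sketch of this weighted k-NN interpolation scheme (the function name and defaults are mine; sklearn is used only for the neighbor search):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def weighted_knn_interpolate(X_train, y_train, X_test, k=10, delta=1.0):
    """Nadaraya-Watson estimate over the k nearest neighbors, with the singular
    weight w(x, x_i) = ||x - x_i||^(-delta); the analysis needs 0 < delta < d/2.
    """
    nn = NearestNeighbors(n_neighbors=k).fit(X_train)
    dist, idx = nn.kneighbors(X_test)               # both have shape (m, k)
    labels = np.asarray(y_train)[idx]

    preds = np.empty(len(X_test))
    exact = dist[:, 0] == 0.0                       # test point coincides with a training point
    preds[exact] = labels[exact, 0]                 # interpolation: return that label

    w = dist[~exact] ** (-delta)                    # singular but finite weights here
    preds[~exact] = (w * labels[~exact]).sum(axis=1) / w.sum(axis=1)
    return preds
```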
Comparison to Hilbert kernel estimate
Weighted k-NN:
  η̂(x) = Σ_{i=1}^{k} w(x, x_i) y_i / Σ_{i=1}^{k} w(x, x_i), with w(x, x_i) = ‖x − x_i‖^(−δ).
  Our analysis needs 0 < δ < d/2.
Hilbert kernel (Devroye, Györfi, & Krzyżak, 1998):
  η̂(x) = Σ_{i=1}^{n} w(x, x_i) y_i / Σ_{i=1}^{n} w(x, x_i), with w(x, x_i) = ‖x − x_i‖^(−d).
  MUST have δ = d for consistency.
Localization makes it possible to prove a non-asymptotic rate.
Convergence rates
Theorem: Assume the distribution of X is uniform on some compact set satisfying a regularity condition, and η is α-Hölder smooth. For an appropriate setting of k, the weighted k-NN estimate satisfies
  E[(η̂(X) − η(X))^2] ≤ O(n^(−2α/(2α+d))).
If the Tsybakov noise condition with parameter β > 0 also holds, then the plug-in classifier, with an appropriate setting of k, converges to the Bayes risk at a faster polynomial rate whose exponent improves with β.
Closing thoughts
Connections to models used in practice
• Kernel ridge regression: simplicial interpolation is like the Laplace kernel in ℝ^d.
• Random forests: large ensembles with random thresholds may approximate locally-linear interpolation (Cutler & Zhao, 2001).
• Neural nets: many recent empirical studies find similarities between neural nets and k-NN in terms of performance and noise-robustness (Drory, Avidan, & Giryes, 2018; Cohen, Sapiro, & Giryes, 2018).
"Adversarial examples" • Interpolation works because mass of region immediately around noisily-labeled training examples is small in high-dimensions. But also a great source of adversarial examples -- easy to find using local optimization around training examples. 24
Open problems
• A generalization theory to explain the behavior of interpolation methods.
• Kernel methods:
  min_{f ∈ ℋ} ‖f‖_ℋ  subject to f(x_i) = y_i for all i = 1, …, n.
  When does this work (with noisy labels)?
  • Very recent work by T. Liang and A. Rakhlin (2018+) provides some analysis in some regimes.
• Benefits of interpolation?
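A minimal sketch of the displayed problem for one concrete RKHS (a Gaussian kernel, my choice for illustration): by the representer theorem, the minimum-norm interpolant is f(x) = k(x, X) K^(−1) y, i.e., kernel ridge regression with the ridge term set to zero.

```python
import numpy as np

def min_norm_kernel_interpolant(X_train, y_train, bandwidth=1.0):
    """Minimum-RKHS-norm interpolant for a Gaussian kernel: ridgeless kernel
    regression. The fitted function satisfies f(x_i) = y_i (up to conditioning).
    """
    def kernel(A, B):
        sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-sq / (2 * bandwidth ** 2))

    K = kernel(X_train, X_train)
    alpha = np.linalg.solve(K, np.asarray(y_train))   # no ridge term: exact interpolation
    return lambda X_test: kernel(X_test, X_train) @ alpha
```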
Acknowledgements
• National Science Foundation
• Sloan Foundation
• Simons Institute for the Theory of Computing
arxiv.org/abs/1806.05161