Loss minimization and parameter estimation with heavy tails

Sivan Sabato #†    Daniel Hsu *

* Department of Computer Science, Columbia University
# Microsoft Research New England
† On the job market - don't miss this amazing hiring opportunity!
Outline

1. Introduction
2. Warm-up: estimating a scalar mean
3. Linear regression with heavy-tail distributions
4. Concluding remarks
1. Introduction
Heavy-tail distributions

A distribution whose "tail" is "heavier" than that of the exponential distribution. For random vectors, consider the distribution of $\|X\|$.
Multivariate heavy-tail distributions

Heavy-tail distributions for random vectors $X \in \mathbb{R}^d$:
- Marginal distributions of the $X_i$ have heavy tails, or
- Strong dependencies between the $X_i$.

Can we use the same procedures originally designed for distributions without heavy tails, or do we need new procedures?
Minimax optimal but not deviation optimal

The empirical mean achieves the minimax rate for estimating $E(X)$, but it is suboptimal when deviations are concerned: for some distribution, the squared error of the empirical mean is

  $\Omega\!\left(\frac{\sigma^2}{n\delta}\right)$  with probability $\geq 2\delta$.

($n$ = sample size, $\sigma^2 = \mathrm{var}(X) < \infty$.)

Note: if the data were Gaussian, the squared error would be $O\!\left(\frac{\sigma^2 \log(1/\delta)}{n}\right)$.
Main result

A new computationally efficient estimator for least squares linear regression when the distributions of $X \in \mathbb{R}^d$ and $Y \in \mathbb{R}$ may have heavy tails.

Assuming bounded $(4+\epsilon)$-order moments and regularity conditions, the convergence rate is

  $O\!\left(\frac{\sigma^2 d \log(1/\delta)}{n}\right)$  with probability $\geq 1 - \delta$,

as soon as $n \geq \tilde{O}(d \log(1/\delta) + \log^2(1/\delta))$. ($n$ = sample size, $\sigma^2$ = optimal squared error.)

Previous state-of-the-art: [Audibert and Catoni, AoS 2011], essentially the same conditions and rate, but computationally inefficient.

General technique with many other applications: ridge, Lasso, matrix approximation, etc.
2. Warm-up: estimating a scalar mean
Warm-up: estimating a scalar mean

Forget $X$; how do we estimate $E(Y)$?
(Set $\mu := E(Y)$ and $\sigma^2 := \mathrm{var}(Y)$; assume $\sigma^2 < \infty$.)
Empirical mean

Let $Y_1, Y_2, \ldots, Y_n$ be iid copies of $Y$, and set

  $\hat\mu := \frac{1}{n}\sum_{i=1}^{n} Y_i$  (empirical mean).

There exist distributions for $Y$ with $\sigma^2 < \infty$ such that

  $P\!\left((\hat\mu - \mu)^2 \geq \frac{\sigma^2}{2n\delta}\left(1 - \frac{2e\delta}{n}\right)^{n-1}\right) \geq 2\delta$.   (Catoni, 2012)
Median-of-means
[Nemirovsky and Yudin, 1983; Alon, Matias, and Szegedy, JCSS 1999]

1. Split the sample $\{Y_1, \ldots, Y_n\}$ into $k$ parts $S_1, S_2, \ldots, S_k$ of equal size (say, randomly).
2. For each $i = 1, 2, \ldots, k$: set $\hat\mu_i := \mathrm{mean}(S_i)$.
3. Return $\hat\mu := \mathrm{median}(\{\hat\mu_1, \hat\mu_2, \ldots, \hat\mu_k\})$.

Theorem (Folklore). Set $k := 4.5 \ln(1/\delta)$. With probability at least $1 - \delta$,

  $(\hat\mu - \mu)^2 \leq O\!\left(\frac{\sigma^2 \log(1/\delta)}{n}\right)$.
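A minimal Python sketch of this procedure, assuming the block count $k = \lceil 4.5 \ln(1/\delta) \rceil$ from the theorem (the function name and random splitting are illustrative choices, not from the slides):

```python
import numpy as np

def median_of_means(y, delta, rng=None):
    """Median-of-means estimate of E(Y) with target failure probability delta."""
    rng = np.random.default_rng() if rng is None else rng
    y = np.asarray(y, dtype=float)
    # Number of blocks suggested by the theorem: k = 4.5 * ln(1/delta).
    k = max(1, int(np.ceil(4.5 * np.log(1.0 / delta))))
    # Randomly split the sample into k (nearly) equal-size blocks.
    blocks = np.array_split(rng.permutation(y), k)
    # Take the mean within each block, then the median across blocks.
    return np.median([b.mean() for b in blocks])
```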
Analysis of median-of-means

1. Assume $|S_i| = n/k$ for simplicity. By Chebyshev's inequality, for each $i = 1, 2, \ldots, k$:

     $\Pr\!\left(|\hat\mu_i - \mu| \leq \sqrt{\frac{6\sigma^2 k}{n}}\right) \geq 5/6$.

2. Let $b_i := 1\{|\hat\mu_i - \mu| \leq \sqrt{6\sigma^2 k / n}\}$. By Hoeffding's inequality,

     $\Pr\!\left(\sum_{i=1}^{k} b_i > k/2\right) \geq 1 - \exp(-k/4.5)$.

3. In the event that more than half of the $\hat\mu_i$ are within $\sqrt{6\sigma^2 k/n}$ of $\mu$, the median $\hat\mu$ is as well.
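As a sanity check of the deviation behavior, here is a small simulation sketch comparing the empirical mean against median-of-means on a heavy-tailed sample; the Pareto distribution, block count, and repetition count are arbitrary illustrative choices, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, trials = 900, 9, 2000     # k ~ 4.5 * ln(1/delta) for delta ~ 0.13
true_mean = 1.5                 # Pareto(a=3) on [1, inf) has mean a/(a-1) = 1.5

err_mean, err_mom = [], []
for _ in range(trials):
    y = rng.pareto(3.0, size=n) + 1.0      # heavy-tailed, finite variance
    err_mean.append((y.mean() - true_mean) ** 2)
    blocks = np.array_split(rng.permutation(y), k)
    mom = np.median([b.mean() for b in blocks])
    err_mom.append((mom - true_mean) ** 2)

# Compare the tails of the squared-error distributions (99th percentile).
print(np.quantile(err_mean, 0.99), np.quantile(err_mom, 0.99))
```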
Alternative: minimize a robust loss function

An alternative is to minimize a "robust" loss function [Catoni, 2012]:

  $\hat\mu := \arg\min_{\mu \in \mathbb{R}} \sum_{i=1}^{n} \ell\!\left(\frac{\mu - Y_i}{\sigma}\right)$.

Example: $\ell(z) := \log\cosh(z)$. Optimal rate and constants.
Catch: need to know $\sigma^2$.
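A sketch of this alternative with $\ell(z) = \log\cosh(z)$, assuming the scale $\sigma$ is known as noted above (the function name and the bounded scalar search are my own choices):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def robust_mean_logcosh(y, sigma):
    """Estimate E(Y) by minimizing sum_i log cosh((mu - y_i) / sigma) over mu."""
    y = np.asarray(y, dtype=float)

    def objective(mu):
        z = (mu - y) / sigma
        # Numerically stable log cosh: log((e^z + e^-z) / 2) = logaddexp(z, -z) - log 2.
        return np.sum(np.logaddexp(z, -z) - np.log(2.0))

    # The objective is convex in mu and its minimizer lies within the data range,
    # so a bounded scalar minimization suffices.
    res = minimize_scalar(objective, bounds=(float(y.min()), float(y.max())),
                          method="bounded")
    return res.x
```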
3. Linear regression with heavy-tail distributions
Linear regression (for out-of-sample prediction)

1. Response variable: random variable $Y \in \mathbb{R}$.
2. Covariates: random vector $X \in \mathbb{R}^d$. (Assume $\Sigma := E[XX^\top] \succ 0$.)
3. Given: sample $S$ of $n$ iid copies of $(X, Y)$.
4. Goal: find $\hat\beta = \hat\beta(S) \in \mathbb{R}^d$ to minimize the population loss $L(\beta) := E(Y - \beta^\top X)^2$.

Recall: let $\beta_\star := \arg\min_{\beta' \in \mathbb{R}^d} L(\beta')$. For any $\beta \in \mathbb{R}^d$,

  $L(\beta) - L(\beta_\star) = \left\|\Sigma^{1/2}(\beta - \beta_\star)\right\|^2 =: \|\beta - \beta_\star\|_\Sigma^2$.
Generalization of median-of-means

1. Split the sample $S$ into $k$ parts $S_1, S_2, \ldots, S_k$ of equal size (say, randomly).
2. For each $i = 1, 2, \ldots, k$: set $\hat\beta_i := \text{ordinary least squares}(S_i)$.
3. Return $\hat\beta := \text{select good one}(\{\hat\beta_1, \hat\beta_2, \ldots, \hat\beta_k\})$.

Questions:
1. Guarantees for $\hat\beta_i = \mathrm{OLS}(S_i)$?
2. How to select a good $\hat\beta_i$?
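A sketch of the split-and-estimate step, assuming each group has more rows than columns so OLS is well posed (the helper name is illustrative; the selection step is the subject of the next slides):

```python
import numpy as np

def groupwise_ols(X, y, k, rng=None):
    """Randomly split (X, y) into k groups and run OLS on each group.

    Returns the list of candidate estimates beta_1, ..., beta_k.
    """
    rng = np.random.default_rng() if rng is None else rng
    X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=float)
    groups = np.array_split(rng.permutation(len(y)), k)
    betas = []
    for idx in groups:
        # Ordinary least squares on the i-th group.
        beta_i, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
        betas.append(beta_i)
    return betas
```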
Ordinary least squares

Under moment conditions*, $\hat\beta_i := \mathrm{OLS}(S_i)$ satisfies

  $\left\|\hat\beta_i - \beta_\star\right\|_\Sigma = O\!\left(\sqrt{\frac{\sigma^2 d}{|S_i|}}\right)$

with probability at least $5/6$ as soon as $|S_i| \geq O(d \log d)$.**

Upshot: if $k := O(\log(1/\delta))$, then with probability $\geq 1 - \delta$, more than half of the $\hat\beta_i$ will be within $\varepsilon := \sqrt{\sigma^2 d \log(1/\delta)/n}$ of $\beta_\star$.

* Requires a kurtosis condition for this simplified bound.
** Can replace $d \log d$ with $d$ under some regularity conditions [Srivastava and Vershynin, AoP 2013].
Selecting a good $\hat\beta_i$ assuming $\Sigma$ is known

Consider the metric $\rho(a, b) := \|a - b\|_\Sigma$.

1. For each $i = 1, 2, \ldots, k$: let $r_i := \mathrm{median}\{\rho(\hat\beta_i, \hat\beta_j) : j = 1, 2, \ldots, k\}$.
2. Let $i_\star := \arg\min_i r_i$.
3. Return $\hat\beta := \hat\beta_{i_\star}$.

Claim: if more than half of the $\hat\beta_i$ are within distance $\varepsilon$ of $\beta_\star$, then $\hat\beta$ is within distance $3\varepsilon$ of $\beta_\star$.
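A sketch of this selection rule for the known-$\Sigma$ case (the function name is my own; the median over $j$ includes $j = i$, matching the slide):

```python
import numpy as np

def select_candidate(betas, Sigma):
    """Return the candidate whose median Sigma-distance to all candidates is smallest."""
    betas = np.asarray(betas, dtype=float)        # shape (k, d)

    def rho(a, b):
        diff = a - b
        return np.sqrt(diff @ Sigma @ diff)       # ||a - b||_Sigma

    k = len(betas)
    r = [np.median([rho(betas[i], betas[j]) for j in range(k)]) for i in range(k)]
    return betas[int(np.argmin(r))]
```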
Selecting a good $\hat\beta_i$ when $\Sigma$ is unknown

General case: $\Sigma$ is unknown; can't compute the distances $\|a - b\|_\Sigma$.

Solution: estimate the $\binom{k}{2}$ pairwise distances using fresh (unlabeled) samples.
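One way to realize this, as a hedged sketch: since $\|a - b\|_\Sigma^2 = E[((a - b)^\top X)^2]$, a fresh unlabeled sample of covariates gives a plug-in distance estimate. The function name is illustrative, and the exact scheme for splitting off fresh samples per pairwise comparison is not shown on the slide:

```python
import numpy as np

def estimated_sigma_distance(a, b, X_fresh):
    """Estimate ||a - b||_Sigma from a fresh unlabeled sample of covariates.

    Uses ||a - b||_Sigma^2 = E[((a - b)^T X)^2], replacing the expectation
    with an empirical average over the rows of X_fresh.
    """
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return np.sqrt(np.mean((np.asarray(X_fresh, dtype=float) @ diff) ** 2))
```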