Loss minimization and parameter estimation with heavy tails

Sivan Sabato #†    Daniel Hsu *

* Department of Computer Science, Columbia University
# Microsoft Research New England
† On the job market - don't miss this amazing hiring opportunity!
Outline

1. Introduction
2. Warm-up: estimating a scalar mean
3. Linear regression with heavy-tail distributions
4. Concluding remarks
1. Introduction
Heavy-tail distributions

A distribution whose "tail" is "heavier" than that of the exponential distribution. For random vectors, consider the distribution of $\|X\|$.
Multivariate heavy-tail distributions

Heavy-tail distributions for random vectors $X \in \mathbb{R}^d$:
- Marginal distributions of the $X_i$ have heavy tails, or
- Strong dependencies between the $X_i$.

Can we use the same procedures originally designed for distributions without heavy tails, or do we need new procedures?
Minimax optimal but not deviation optimal

The empirical mean achieves the minimax rate for estimating $E(X)$, but it is suboptimal when deviations are concerned: for some distribution, the squared error of the empirical mean is

  $\Omega\!\left(\frac{\sigma^2}{n\delta}\right)$  with probability $\geq 2\delta$.

($n$ = sample size, $\sigma^2 = \mathrm{var}(X) < \infty$.)

Note: if the data were Gaussian, the squared error would be $O\!\left(\frac{\sigma^2 \log(1/\delta)}{n}\right)$.
Main result

A new computationally efficient estimator for least squares linear regression when the distributions of $X \in \mathbb{R}^d$ and $Y \in \mathbb{R}$ may have heavy tails.

Assuming bounded $(4+\epsilon)$-order moments and regularity conditions, the convergence rate is

  $O\!\left(\frac{\sigma^2 d \log(1/\delta)}{n}\right)$  with probability $\geq 1 - \delta$,

as soon as $n \geq \tilde{O}(d \log(1/\delta) + \log^2(1/\delta))$. ($n$ = sample size, $\sigma^2$ = optimal squared error.)

Previous state-of-the-art: [Audibert and Catoni, AoS 2011], essentially the same conditions and rate, but computationally inefficient.

General technique with many other applications: ridge, Lasso, matrix approximation, etc.
2. Warm-up: estimating a scalar mean
Warm-up: estimating a scalar mean

Forget $X$; how do we estimate $E(Y)$?
(Set $\mu := E(Y)$ and $\sigma^2 := \mathrm{var}(Y)$; assume $\sigma^2 < \infty$.)
Empirical mean

Let $Y_1, Y_2, \ldots, Y_n$ be iid copies of $Y$, and set

  $\hat\mu := \frac{1}{n}\sum_{i=1}^{n} Y_i$  (empirical mean).

There exist distributions for $Y$ with $\sigma^2 < \infty$ such that

  $P\!\left((\hat\mu - \mu)^2 \geq \frac{\sigma^2}{2n\delta}\left(1 - \frac{2e\delta}{n}\right)^{n-1}\right) \geq 2\delta$.   (Catoni, 2012)
Median-of-means
[Nemirovsky and Yudin, 1983; Alon, Matias, and Szegedy, JCSS 1999]

1. Split the sample $\{Y_1, \ldots, Y_n\}$ into $k$ parts $S_1, S_2, \ldots, S_k$ of equal size (say, randomly).
2. For each $i = 1, 2, \ldots, k$: set $\hat\mu_i := \mathrm{mean}(S_i)$.
3. Return $\hat\mu := \mathrm{median}(\{\hat\mu_1, \hat\mu_2, \ldots, \hat\mu_k\})$.

Theorem (Folklore). Set $k := 4.5 \ln(1/\delta)$. With probability at least $1 - \delta$,

  $(\hat\mu - \mu)^2 \leq O\!\left(\frac{\sigma^2 \log(1/\delta)}{n}\right)$.
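A minimal Python sketch of this procedure, assuming the block count $k = \lceil 4.5 \ln(1/\delta) \rceil$ from the theorem (the function name and random splitting are illustrative choices, not from the slides):

```python
import numpy as np

def median_of_means(y, delta, rng=None):
    """Median-of-means estimate of E(Y) with target failure probability delta."""
    rng = np.random.default_rng() if rng is None else rng
    y = np.asarray(y, dtype=float)
    # Number of blocks suggested by the theorem: k = 4.5 * ln(1/delta).
    k = max(1, int(np.ceil(4.5 * np.log(1.0 / delta))))
    # Randomly split the sample into k (nearly) equal-size blocks.
    blocks = np.array_split(rng.permutation(y), k)
    # Take the mean within each block, then the median across blocks.
    return np.median([b.mean() for b in blocks])
```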
Analysis of median-of-means

1. Assume $|S_i| = n/k$ for simplicity. By Chebyshev's inequality, for each $i = 1, 2, \ldots, k$:

     $\Pr\!\left(|\hat\mu_i - \mu| \leq \sqrt{\frac{6\sigma^2 k}{n}}\right) \geq 5/6$.

2. Let $b_i := 1\{|\hat\mu_i - \mu| \leq \sqrt{6\sigma^2 k / n}\}$. By Hoeffding's inequality,

     $\Pr\!\left(\sum_{i=1}^{k} b_i > k/2\right) \geq 1 - \exp(-k/4.5)$.

3. In the event that more than half of the $\hat\mu_i$ are within $\sqrt{6\sigma^2 k/n}$ of $\mu$, the median $\hat\mu$ is as well.
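As a sanity check of the deviation behavior, here is a small simulation sketch comparing the empirical mean against median-of-means on a heavy-tailed sample; the Pareto distribution, block count, and repetition count are arbitrary illustrative choices, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, trials = 900, 9, 2000     # k ~ 4.5 * ln(1/delta) for delta ~ 0.13
true_mean = 1.5                 # Pareto(a=3) on [1, inf) has mean a/(a-1) = 1.5

err_mean, err_mom = [], []
for _ in range(trials):
    y = rng.pareto(3.0, size=n) + 1.0      # heavy-tailed, finite variance
    err_mean.append((y.mean() - true_mean) ** 2)
    blocks = np.array_split(rng.permutation(y), k)
    mom = np.median([b.mean() for b in blocks])
    err_mom.append((mom - true_mean) ** 2)

# Compare the tails of the squared-error distributions (99th percentile).
print(np.quantile(err_mean, 0.99), np.quantile(err_mom, 0.99))
```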
Alternative: minimize a robust loss function

An alternative is to minimize a "robust" loss function [Catoni, 2012]:

  $\hat\mu := \arg\min_{\mu \in \mathbb{R}} \sum_{i=1}^{n} \ell\!\left(\frac{\mu - Y_i}{\sigma}\right)$.

Example: $\ell(z) := \log\cosh(z)$. Optimal rate and constants.
Catch: need to know $\sigma^2$.
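A sketch of this alternative with $\ell(z) = \log\cosh(z)$, assuming the scale $\sigma$ is known as noted above (the function name and the bounded scalar search are my own choices):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def robust_mean_logcosh(y, sigma):
    """Estimate E(Y) by minimizing sum_i log cosh((mu - y_i) / sigma) over mu."""
    y = np.asarray(y, dtype=float)

    def objective(mu):
        z = (mu - y) / sigma
        # Numerically stable log cosh: log((e^z + e^-z) / 2) = logaddexp(z, -z) - log 2.
        return np.sum(np.logaddexp(z, -z) - np.log(2.0))

    # The objective is convex in mu and its minimizer lies within the data range,
    # so a bounded scalar minimization suffices.
    res = minimize_scalar(objective, bounds=(float(y.min()), float(y.max())),
                          method="bounded")
    return res.x
```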
3. Linear regression with heavy-tail distributions
Linear regression (for out-of-sample prediction)

1. Response variable: random variable $Y \in \mathbb{R}$.
2. Covariates: random vector $X \in \mathbb{R}^d$. (Assume $\Sigma := E[XX^\top] \succ 0$.)
3. Given: sample $S$ of $n$ iid copies of $(X, Y)$.
4. Goal: find $\hat\beta = \hat\beta(S) \in \mathbb{R}^d$ to minimize the population loss $L(\beta) := E(Y - \beta^\top X)^2$.

Recall: let $\beta_\star := \arg\min_{\beta' \in \mathbb{R}^d} L(\beta')$. For any $\beta \in \mathbb{R}^d$,

  $L(\beta) - L(\beta_\star) = \left\|\Sigma^{1/2}(\beta - \beta_\star)\right\|^2 =: \|\beta - \beta_\star\|_\Sigma^2$.
Generalization of median-of-means

1. Split the sample $S$ into $k$ parts $S_1, S_2, \ldots, S_k$ of equal size (say, randomly).
2. For each $i = 1, 2, \ldots, k$: set $\hat\beta_i := \text{ordinary least squares}(S_i)$.
3. Return $\hat\beta := \text{select good one}(\{\hat\beta_1, \hat\beta_2, \ldots, \hat\beta_k\})$.

Questions:
1. Guarantees for $\hat\beta_i = \mathrm{OLS}(S_i)$?
2. How to select a good $\hat\beta_i$?
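A sketch of the split-and-estimate step, assuming each group has more rows than columns so OLS is well posed (the helper name is illustrative; the selection step is the subject of the next slides):

```python
import numpy as np

def groupwise_ols(X, y, k, rng=None):
    """Randomly split (X, y) into k groups and run OLS on each group.

    Returns the list of candidate estimates beta_1, ..., beta_k.
    """
    rng = np.random.default_rng() if rng is None else rng
    X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=float)
    groups = np.array_split(rng.permutation(len(y)), k)
    betas = []
    for idx in groups:
        # Ordinary least squares on the i-th group.
        beta_i, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
        betas.append(beta_i)
    return betas
```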
Ordinary least squares

Under moment conditions*, $\hat\beta_i := \mathrm{OLS}(S_i)$ satisfies

  $\left\|\hat\beta_i - \beta_\star\right\|_\Sigma = O\!\left(\sqrt{\frac{\sigma^2 d}{|S_i|}}\right)$

with probability at least $5/6$ as soon as $|S_i| \geq O(d \log d)$.**

Upshot: if $k := O(\log(1/\delta))$, then with probability $\geq 1 - \delta$, more than half of the $\hat\beta_i$ will be within $\varepsilon := \sqrt{\sigma^2 d \log(1/\delta)/n}$ of $\beta_\star$.

* Requires a kurtosis condition for this simplified bound.
** Can replace $d \log d$ with $d$ under some regularity conditions [Srivastava and Vershynin, AoP 2013].
Selecting a good $\hat\beta_i$ assuming $\Sigma$ is known

Consider the metric $\rho(a, b) := \|a - b\|_\Sigma$.

1. For each $i = 1, 2, \ldots, k$: let $r_i := \mathrm{median}\{\rho(\hat\beta_i, \hat\beta_j) : j = 1, 2, \ldots, k\}$.
2. Let $i_\star := \arg\min_i r_i$.
3. Return $\hat\beta := \hat\beta_{i_\star}$.

Claim: if more than half of the $\hat\beta_i$ are within distance $\varepsilon$ of $\beta_\star$, then $\hat\beta$ is within distance $3\varepsilon$ of $\beta_\star$.
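A sketch of this selection rule for the known-$\Sigma$ case (the function name is my own; the median over $j$ includes $j = i$, matching the slide):

```python
import numpy as np

def select_candidate(betas, Sigma):
    """Return the candidate whose median Sigma-distance to all candidates is smallest."""
    betas = np.asarray(betas, dtype=float)        # shape (k, d)

    def rho(a, b):
        diff = a - b
        return np.sqrt(diff @ Sigma @ diff)       # ||a - b||_Sigma

    k = len(betas)
    r = [np.median([rho(betas[i], betas[j]) for j in range(k)]) for i in range(k)]
    return betas[int(np.argmin(r))]
```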
Selecting a good $\hat\beta_i$ when $\Sigma$ is unknown

General case: $\Sigma$ is unknown; can't compute the distances $\|a - b\|_\Sigma$.

Solution: estimate the $\binom{k}{2}$ pairwise distances using fresh (unlabeled) samples.
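One way to realize this, as a hedged sketch: since $\|a - b\|_\Sigma^2 = E[((a - b)^\top X)^2]$, a fresh unlabeled sample of covariates gives a plug-in distance estimate. The function name is illustrative, and the exact scheme for splitting off fresh samples per pairwise comparison is not shown on the slide:

```python
import numpy as np

def estimated_sigma_distance(a, b, X_fresh):
    """Estimate ||a - b||_Sigma from a fresh unlabeled sample of covariates.

    Uses ||a - b||_Sigma^2 = E[((a - b)^T X)^2], replacing the expectation
    with an empirical average over the rows of X_fresh.
    """
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return np.sqrt(np.mean((np.asarray(X_fresh, dtype=float) @ diff) ** 2))
```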