Online Learning with Model Selection
Lizhe Sun, Adrian Barbu
Florida State University
abarbu@stat.fsu.edu
October 16, 2019
Outline
1 Introduction
2 Literature Review
3 Online Learning Algorithms by Running Averages
4 Theoretical Analysis
5 Numerical Results
6 Future Work
Introduction: Online Learning
1 In big data learning, we often encounter datasets so large that they cannot fit in the computer memory.
2 Online learning methods are capable of addressing these issues by constructing the model sequentially, one example at a time.
3 We assume that the samples are i.i.d. or adversarial.
Introduction: The Framework for an Online Learning Algorithm
Assume $w_1 = 0$, and that we can only access the data samples $\{(x_i, y_i) : i = 1, 2, \cdots\}$ streaming in one at a time.
for $i = 1, 2, \cdots$
  Receive observation $x_i \in \mathbb{R}^p$
  Predict $\hat{y}_i$
  Receive the true value $y_i \in \mathbb{R}$
  Suffer the loss $f(w_i; z_i)$, where $z_i = (x_i, y_i)$
  Update $w_{i+1}$ from $w_i$ and $z_i$
end
Target: minimize the cumulative loss $\frac{1}{n}\sum_{i=1}^n f(w_i; z_i)$.
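A minimal Python sketch of this protocol (not from the slides): the squared loss and SGD-style update below are only illustrative placeholders for the generic "suffer the loss" and "update" steps, and the names `stream`, `eta`, and `online_learning_loop` are assumptions.

```python
import numpy as np

def online_learning_loop(stream, p, eta=0.01):
    """Generic online learning protocol; the SGD update on squared loss
    is only a placeholder for the abstract 'Update w_{i+1}' step."""
    w = np.zeros(p)                        # w_1 = 0
    cumulative_loss = 0.0
    for i, (x_i, y_i) in enumerate(stream, start=1):
        y_hat = w @ x_i                    # predict
        loss = 0.5 * (y_hat - y_i) ** 2    # suffer the loss f(w_i; z_i)
        cumulative_loss += loss
        w = w - eta * (y_hat - y_i) * x_i  # update w_{i+1} from w_i and z_i
    return w, cumulative_loss / i          # (1/n) sum_i f(w_i; z_i)
```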
Introduction: Regret
In the theoretical analysis of online learning, it is of interest to bound the regret
$$R_n = \frac{1}{n}\sum_{i=1}^n f(w_i; z_i) - \min_{w} \frac{1}{n}\sum_{i=1}^n f(w; z_i),$$
which measures what is lost compared to offline learning, in a way measuring the convergence speed of online algorithms.
Literature Review: Stochastic Gradient Descent (SGD)
- SGD is the most widely used method in traditional online learning.
- The original idea can be traced back to Robbins and Monro (1951) and Kiefer and Wolfowitz (1952).
- However, the SGD algorithm cannot select features.
Literature Review: Online Learning with Sparsity
To learn a better model, we need to consider feature selection in online learning.
- Langford et al. (2009) proposed the truncated gradient framework.
- Shalev-Shwartz and Tewari (2011) designed stochastic mirror descent.
- Truncated second-order methods appear in Fan et al. (2018), Langford et al. (2009), and Wu et al. (2017).
Literature Review: OPG and RDA
Two main frameworks for online learning with regularization:
1 Online Proximal Gradient Descent (OPG)
2 Regularized Dual Averaging (RDA)
OPG was designed by Duchi and Singer (2009) and Duchi et al. (2010), and RDA was proposed by Xiao (2010).
Some variants were designed by Suzuki (2013) and Ouyang et al. (2013):
1 OPG-SADMM
2 RDA-SADMM
Literature Review: Hazan et al. (2007)
An online Newton method that:
- Uses an idea similar to running averages to update the inverse of the Hessian matrix.
- Has $O(p^2)$ computational complexity.
- Did not address the issues of variable standardization and feature selection.
Literature Review: Summary
- Classical online learning algorithms, such as SGD, cannot select features.
- In recent years, many new online learning algorithms have been proposed to select features.
- However, neither in theory nor in numerical experiments can these algorithms recover the true features.
- This concern motivates us to develop our running averages framework.
Online Learning Algorithms by Running Averages: Framework
Figure: The running averages are updated as the data is received. The model is extracted from the running averages only when desired.
Online Learning Algorithms by Running Averages: Running Averages
Given samples $x_i = (x_{i1}, x_{i2}, \cdots, x_{ip})^T \in \mathbb{R}^p$ and responses $y_i \in \mathbb{R}$, with sample size $n$, we can compute the running averages:
$$S_x = \mu_x = \frac{1}{n}\sum_{i=1}^n x_i, \qquad S_y = \mu_y = \frac{1}{n}\sum_{i=1}^n y_i,$$
$$S_{xx} = \frac{1}{n}\sum_{i=1}^n x_i x_i^T, \qquad S_{xy} = \frac{1}{n}\sum_{i=1}^n y_i x_i, \qquad S_{yy} = \frac{1}{n}\sum_{i=1}^n y_i^2.$$
They can be updated online, e.g.
$$\mu_x^{(n+1)} = \frac{n}{n+1}\,\mu_x^{(n)} + \frac{1}{n+1}\, x_{n+1}.$$
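A minimal sketch of these online updates in Python; the class name and structure are assumptions, but each update follows the rule $\text{avg}^{(n+1)} = \frac{n}{n+1}\text{avg}^{(n)} + \frac{1}{n+1}(\text{new term})$ from the slide.

```python
import numpy as np

class RunningAverages:
    """Maintains mu_x, mu_y, S_xx, S_xy, S_yy over a data stream."""
    def __init__(self, p):
        self.n = 0
        self.mu_x = np.zeros(p)
        self.mu_y = 0.0
        self.S_xx = np.zeros((p, p))
        self.S_xy = np.zeros(p)
        self.S_yy = 0.0

    def update(self, x, y):
        """Incorporate one sample (x, y): avg_new = avg_old + (new_term - avg_old) / n."""
        self.n += 1
        w = 1.0 / self.n
        self.mu_x += w * (x - self.mu_x)
        self.mu_y += w * (y - self.mu_y)
        self.S_xx += w * (np.outer(x, x) - self.S_xx)
        self.S_xy += w * (y * x - self.S_xy)
        self.S_yy += w * (y * y - self.S_yy)
```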
Online Learning Algorithms by Running Averages: Standardization of Running Averages
Standardization of the data matrix $X$ and vector $y$:
$$\tilde{X} = (X - \mathbf{1}_n \mu_x^T) D, \qquad \tilde{y} = y - \mu_y \mathbf{1}_n,$$
where $D$ is the diagonal matrix with the inverses of the standard deviations of the $X_j$ on the diagonal.
The equivalent standardization using running averages:
$$S_{\tilde{x}\tilde{y}} = \frac{1}{n}\tilde{X}^T \tilde{y} = \frac{1}{n} D X^T y - \mu_y D \mu_x = D S_{xy} - \mu_y D \mu_x,$$
$$S_{\tilde{x}\tilde{x}} = \frac{1}{n}\tilde{X}^T \tilde{X} = D\Big(\frac{1}{n} X^T X - \mu_x \mu_x^T\Big) D = D (S_{xx} - \mu_x \mu_x^T) D.$$
We will assume the data is standardized in all algorithms below.
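A short sketch of this standardization computed directly from the raw running averages; the function name is an assumption.

```python
import numpy as np

def standardized_averages(S_xx, S_xy, mu_x, mu_y):
    """Running averages of the standardized data from those of the raw data."""
    var_x = np.diag(S_xx) - mu_x ** 2            # per-feature variances
    D = np.diag(1.0 / np.sqrt(var_x))            # inverse standard deviations
    S_xx_std = D @ (S_xx - np.outer(mu_x, mu_x)) @ D
    S_xy_std = D @ S_xy - mu_y * (D @ mu_x)
    return S_xx_std, S_xy_std
```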
Online Learning Algorithms by Running Averages: Online Least Squares (OLS)
The normal equations are
$$\frac{1}{n} X^T X \beta = \frac{1}{n} X^T y.$$
Since $\frac{1}{n} X^T X$ and $\frac{1}{n} X^T y$ can be computed from the running averages, we obtain
$$S_{xx} \beta = S_{xy}.$$
Thus, online least squares is equivalent to offline least squares.
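A one-line sketch, assuming the running averages above are available: solving this linear system recovers the offline OLS solution without revisiting the data.

```python
import numpy as np

def online_least_squares(S_xx, S_xy):
    """Solve S_xx @ beta = S_xy; identical to the offline OLS solution."""
    return np.linalg.solve(S_xx, S_xy)
```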
Online Learning Algorithms by Running Averages: Online Least Squares with Thresholding (OLSth)
Aimed at solving the constrained minimization problem
$$\min_{\beta,\ \|\beta\|_0 \le k} \frac{1}{2n} \|y - X\beta\|^2,$$
a non-convex and NP-hard problem because of the sparsity constraint.
Algorithm 1: Online Least Squares with Thresholding (OLSth)
Input: Running averages $S_{xx}$, $S_{xy}$, sample size $n$, sparsity level $k$.
Output: Trained regression parameter vector $\beta$ with $\|\beta\|_0 \le k$.
1: Fit the model by OLS, obtaining $\hat{\beta}$.
2: Keep only the $k$ variables with largest $|\hat{\beta}_j|$.
3: Fit the model on the selected features by OLS.
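A minimal Python sketch of Algorithm 1; the function name and use of `np.linalg.solve` are assumptions.

```python
import numpy as np

def olsth(S_xx, S_xy, k):
    """Online least squares with thresholding (Algorithm 1)."""
    beta_hat = np.linalg.solve(S_xx, S_xy)        # step 1: full OLS fit
    selected = np.argsort(-np.abs(beta_hat))[:k]  # step 2: keep k largest |beta_j|
    beta = np.zeros_like(beta_hat)                # step 3: refit OLS on selected features
    beta[selected] = np.linalg.solve(S_xx[np.ix_(selected, selected)], S_xy[selected])
    return beta
```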
Online Learning Algorithms by Running Averages: Online Feature Selection with Annealing (OFSA)
- An iterative thresholding algorithm (Barbu et al., 2017).
- Can simultaneously estimate coefficients and select features.
- Uses the gradient $\frac{\partial}{\partial \beta}\, \frac{1}{2n}\|y - X\beta\|^2 = S_{xx}\beta - S_{xy}$, which can be updated online.
- Uses an annealing schedule $M_e$ to gradually remove features.
Figure: Different annealing schedules $M_e$ vs. iteration number $e$.
Algorithm 2: Online Feature Selection with Annealing (OFSA)
Input: Running averages $S_{xx}$, $S_{xy}$, sample size $n$, sparsity level $k$, annealing parameter $\mu$.
Output: Trained regression parameter vector $\beta$ with $\|\beta\|_0 \le k$.
Initialize $\beta = 0$.
for $t = 1$ to $N_{iter}$ do
  Update $\beta \leftarrow \beta - \eta (S_{xx}\beta - S_{xy})$
  Keep only the $M_t$ variables with highest $|\beta_j|$ and renumber them $1, ..., M_t$.
end for
Fit the model on the selected features by OLS.
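A Python sketch of Algorithm 2 from the running averages. The particular schedule $M_t$ used below is one common choice from the FSA literature and should be treated as an assumption, as are the default values of `n_iter`, `eta`, and `mu`.

```python
import numpy as np

def ofsa(S_xx, S_xy, k, n_iter=100, eta=0.01, mu=10.0):
    """Online Feature Selection with Annealing (Algorithm 2), sketch."""
    p = len(S_xy)
    beta = np.zeros(p)
    active = np.arange(p)                      # indices of currently kept features
    for t in range(1, n_iter + 1):
        # gradient step restricted to the active features
        grad = S_xx[np.ix_(active, active)] @ beta[active] - S_xy[active]
        beta[active] -= eta * grad
        # annealing schedule M_t: how many features to keep at iteration t (assumed form)
        M_t = int(k + (p - k) * max(0.0, (n_iter - 2 * t) / (2 * t * mu + n_iter)))
        keep = active[np.argsort(-np.abs(beta[active]))[:M_t]]
        beta[np.setdiff1d(active, keep)] = 0.0
        active = keep
    # final refit by OLS on the selected features
    beta_final = np.zeros(p)
    beta_final[active] = np.linalg.solve(S_xx[np.ix_(active, active)], S_xy[active])
    return beta_final
```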
Online Learning Algorithms by Running Averages: Online Lasso and Online Adaptive Lasso
The Lasso estimator, proposed in (Tibshirani, 1996), solves the optimization problem
$$\arg\min_{\beta} \frac{1}{2n} \|y - X\beta\|^2 + \lambda \sum_{j=1}^p |\beta_j|,$$
where $\lambda > 0$ is a tuning parameter.
However, because the Lasso estimator does not always recover the true features, Zou (2006) proposed the adaptive Lasso, which solves the weighted Lasso problem
$$\arg\min_{\beta} \frac{1}{2} \|y - X\beta\|^2 + \lambda_n \sum_{j=1}^p w_j |\beta_j|,$$
where $w_j$ is the weight for $\beta_j$, $j = 1, 2, \cdots, p$. We can build the weights from the OLS coefficients when $n > p$, e.g. $w_j = 1/|\hat{\beta}_j^{ols}|$.
Algorithm 3: Online Adaptive Lasso (OALa)
Input: Running averages $S_{xx}$, $S_{xy}$, sample size $n$, penalty $\lambda$.
Output: Trained sparse regression parameter vector $\beta$.
1: Compute the OLS estimate $\hat{\beta}_{ols}$.
2: Define a $p \times p$ diagonal weight matrix $\Sigma_w$ with the $|\hat{\beta}_{ols}|$ as diagonal entries.
3: Denote $S^w_{xx} = \Sigma_w S_{xx} \Sigma_w$ and $S^w_{xy} = \Sigma_w S_{xy}$.
4: Initialize $\beta = 0$.
5: for $t = 1$ to $N_{iter}$ do
6:   Update $\beta \leftarrow \beta - \eta (S^w_{xx}\beta - S^w_{xy})$
7:   Update $\beta \leftarrow S_{\eta\lambda}(\beta)$, where $S_{\eta\lambda}(\cdot)$ is the soft-thresholding operator.
8: end for
9: Fit the model on the selected features by OLS.
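A Python sketch of Algorithm 3; the default values of `n_iter` and `eta` and the final refit convention (features with nonzero coefficients after thresholding) are assumptions.

```python
import numpy as np

def soft_threshold(v, t):
    """Soft-thresholding operator S_t(v), applied elementwise."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def online_adaptive_lasso(S_xx, S_xy, lam, n_iter=500, eta=0.01):
    """Online Adaptive Lasso from running averages (Algorithm 3), sketch."""
    p = len(S_xy)
    beta_ols = np.linalg.solve(S_xx, S_xy)        # step 1: OLS estimate
    Sigma_w = np.diag(np.abs(beta_ols))           # step 2: diagonal weight matrix
    S_xx_w = Sigma_w @ S_xx @ Sigma_w             # step 3: reweighted running averages
    S_xy_w = Sigma_w @ S_xy
    beta = np.zeros(p)
    for _ in range(n_iter):
        beta -= eta * (S_xx_w @ beta - S_xy_w)    # gradient step
        beta = soft_threshold(beta, eta * lam)    # proximal (soft-thresholding) step
    selected = np.flatnonzero(beta)               # features surviving the penalty
    beta_final = np.zeros(p)
    if selected.size > 0:                         # refit OLS on the selected features
        beta_final[selected] = np.linalg.solve(S_xx[np.ix_(selected, selected)], S_xy[selected])
    return beta_final
```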