Coresets for Data-efficient Training of Machine Learning Models
Baharan Mirzasoleiman, Jeff Bilmes, Jure Leskovec
Machine Learning Becomes Mainstream
• Personalized medicine
• Robotics
• Finance
• Autonomous cars
Data is the Fuel for Machine Learning
• Example: object detection
• [figure: object detection performance, mAP@[.5, .95] on COCO minival]
Problem 1: Training on Large Data is Expensive
• Example: training a single deep model for NLP (with neural architecture search) [SGM’19]
• [figure: 11.4 days of training, 3.2M, 5.3x the yearly energy consumption of the average American, CO2 emissions of 5x the lifetime of a car]
Problem 2: What We Care About is Underrepresented
• Example: self-driving data
• [figure: highly imbalanced data distribution, e.g. 80%, 14%, 5%, 1%]
How can we find the “right” data for efficient machine learning?
Setting: Training Machine Learning Models
• Training often reduces to minimizing a regularized empirical risk function
• Training data: {(x_i, y_i), i ∈ V}, where x_i is the feature vector and y_i the label
• w* ∈ arg min_{w ∈ 𝒳} f(w),   f(w) = ∑_{i ∈ V} f_i(w) + r(w),   f_i(w) = l(w, (x_i, y_i))
• f_i(w) is the loss associated with training example i ∈ V, and r(w) is the regularizer
• Examples:
  • Convex f(w): linear regression, logistic regression, ridge regression, regularized support vector machines (SVM)
  • Non-convex f(w): neural networks
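Below is a minimal, illustrative Python sketch of this setup for one convex instance, L2-regularized binary logistic regression; the variable names (X, y, lam) and the choice of loss are assumptions for illustration, not taken from the talk.

```python
# Hypothetical sketch: regularized empirical risk f(w) = sum_i f_i(w) + r(w),
# instantiated for binary logistic regression with labels y_i in {-1, +1}.
import numpy as np

def f_i(w, x_i, y_i):
    """Per-example loss f_i(w) = l(w, (x_i, y_i)): the logistic loss."""
    return np.log1p(np.exp(-y_i * x_i.dot(w)))

def f(w, X, y, lam=1e-3):
    """Full objective: sum of per-example losses plus an L2 regularizer r(w)."""
    losses = np.log1p(np.exp(-y * X.dot(w)))    # f_i(w) for every i in V
    return losses.sum() + 0.5 * lam * w.dot(w)  # + r(w)
```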
Setting: Training Machine Learning Models
• Incremental gradient (stochastic gradient) methods are used to train on large data
• They sequentially step along the gradient of individual functions f_i:
  w_k^i = w_k^{i−1} − α_k ∇f_i(w_k^{i−1})
• Every ∇f_i(.) is treated as an unbiased estimate of the full gradient ∇f(.) = ∑_{i ∈ V} ∇f_i(.)
• Therefore, they are slow to converge
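A minimal sketch of one incremental-gradient pass under the same logistic-regression assumption as above; the random visiting order and the step size alpha_k are illustrative choices, not the talk's prescription.

```python
import numpy as np

def grad_f_i(w, x_i, y_i):
    """Gradient of the logistic loss for a single example (y_i in {-1, +1})."""
    return -y_i * x_i / (1.0 + np.exp(y_i * x_i.dot(w)))

def incremental_gradient_epoch(w, X, y, alpha_k):
    """One pass over the data: w <- w - alpha_k * grad f_i(w), one example at a time."""
    for i in np.random.permutation(len(y)):
        w = w - alpha_k * grad_f_i(w, X[i], y[i])
    return w
```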
Problem: How to Find the "Right" Data for Machine Learning?
• Find the most informative subset S* = arg max_{S ⊆ V, |S| ≤ k} F(S)
• What is a good choice for F(S)?
• [figure: the full dataset V = { } and a selected subset S* = { }]
• If we can find S*, we get a speedup of |V| / |S*| by only training on S*
Finding S* is Challenging
1. How to choose an informative subset for training?
  • Points close to the decision boundary vs. a diverse subset?
2. Finding S* must be fast
  • Otherwise we don’t get any speedup
3. We also need to decide on the step sizes
4. We need theoretical guarantees
  • For the quality of the trained model
  • For convergence of incremental gradient methods on the subset
Our Approach: Learning from Coresets
• Idea: select the smallest subset S* and weights γ that closely estimate the full gradient:
  S*, γ = arg min_{S ⊆ V, γ_j ≥ 0 ∀j} |S|   s.t.   max_{w ∈ 𝒳} ∥ ∑_{i ∈ V} ∇f_i(w) − ∑_{j ∈ S} γ_j ∇f_j(w) ∥ ≤ ϵ
  (full gradient vs. weighted gradient of S)
• Solution: for every w ∈ 𝒳, S* is the set of exemplars of all the data points in the gradient space
• Training data: {(x_i, y_i), i ∈ V};  gradients at w: V′ = {∇f_i(w), i ∈ V}
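The constrained quantity can be checked directly. The sketch below is an illustration rather than the paper's code: it evaluates the gradient estimation error of a candidate weighted subset at a fixed w, reusing grad_f_i from the previous sketch.

```python
import numpy as np

def gradient_estimation_error(w, X, y, S, gamma):
    """|| sum_{i in V} grad f_i(w) - sum_{j in S} gamma_j * grad f_j(w) ||."""
    full_grad = sum(grad_f_i(w, X[i], y[i]) for i in range(len(y)))
    coreset_grad = sum(g * grad_f_i(w, X[j], y[j]) for j, g in zip(S, gamma))
    return np.linalg.norm(full_grad - coreset_grad)
```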
Our Approach: Learning from Coresets
• How can we find exemplars in big datasets?
• Exemplar clustering is submodular!
  F(S*) = ∑_{i ∈ V} min_{j ∈ S*} ∥∇f_i(w) − ∇f_j(w)∥ ≤ ϵ
• Submodularity is a natural diminishing-returns property:
  ∀ A ⊆ B and x ∉ B:  F(A ∪ {x}) − F(A) ≥ F(B ∪ {x}) − F(B)
• A simple greedy algorithm can find the exemplars S* in large datasets
• However, S* depends on w!
  • We would have to update S* after every SGD update. Slow! :(
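A plain greedy selection sketch, assuming a precomputed pairwise dissimilarity matrix D between the (gradients of) examples; this is a simplified stand-in for the greedy algorithm mentioned above, not the talk's exact implementation.

```python
import numpy as np

def greedy_exemplars(D, k):
    """Greedily pick k exemplars: each step adds the point that most reduces
    sum_i min_{j in S} D[i, j]; the diminishing-returns structure is what
    makes this simple rule near-optimal."""
    n = D.shape[0]
    S = []
    min_dist = np.full(n, np.inf)  # distance of each point to its nearest exemplar so far
    for _ in range(k):
        # objective value if candidate j were added: sum_i min(min_dist[i], D[i, j])
        scores = np.minimum(min_dist[:, None], D).sum(axis=0)
        if S:
            scores[S] = np.inf     # never re-pick an already selected exemplar
        j = int(np.argmin(scores))
        S.append(j)
        min_dist = np.minimum(min_dist, D[:, j])
    return S, min_dist
```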
Our Approach: Learning from Coresets
• Can we find a subset S* that bounds the estimation error for all w ∈ 𝒳?
  F(S*) = ∑_{i ∈ V} min_{j ∈ S*} ∥∇f_i(w) − ∇f_j(w)∥ ≤ ϵ
• Idea: consider a worst-case approximation of the estimation error over the entire parameter space 𝒳:
  F(S*) = ∑_{i ∈ V} min_{j ∈ S*} ∥∇f_i(w) − ∇f_j(w)∥ ≤ ∑_{i ∈ V} min_{j ∈ S*} max_{w ∈ 𝒳} ∥∇f_i(w) − ∇f_j(w)∥ ≤ ϵ
• d_ij: an upper bound on the gradient difference over the entire parameter space 𝒳
Our Approach: Learning from Coresets
• How can we efficiently find the upper bounds d_ij?
• Convex f(w) (linear/logistic/ridge regression, regularized SVM):
  d_ij ≤ const ⋅ ∥x_i − x_j∥  (feature vectors), so S* can be found as a preprocessing step
• Non-convex f(w) (neural networks):
  d_ij ≤ const ⋅ ∥∇_{z_i^(L)} f_i(w) − ∇_{z_j^(L)} f_j(w)∥  [KF’19]  (gradients w.r.t. the input to the last layer)
  d_ij is cheap to compute, but we have to update S*
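A hedged sketch of the preprocessing step for the convex case, where d_ij is bounded by a constant times the feature distance; the constant is problem-dependent and left as a parameter here. For neural networks, the same routine would be applied to the cheap last-layer gradients instead of the raw features and refreshed periodically during training.

```python
import numpy as np

def pairwise_upper_bounds(X, const=1.0):
    """D[i, j] ~ const * ||x_i - x_j||: upper bounds on gradient differences
    for convex losses (linear/logistic/ridge regression, regularized SVM)."""
    sq_norms = (X ** 2).sum(axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    return const * np.sqrt(np.maximum(sq_dists, 0.0))
```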
Our Approach: CRAIG
• Idea: select a weighted subset that closely estimates the full gradient
• Algorithm:
  1. Use the greedy algorithm to find the set of exemplars S* from the dataset V
  2. Weight every element of S* by the size of the corresponding cluster
  3. Apply weighted incremental gradient descent on S*
• [figure: gradients of data points i ∈ V clustered around weighted exemplars, e.g. w = 0.2, 0.3, 0.05, 0.1; one epoch over the weighted subset covers the loss function]
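An end-to-end sketch of the three steps above, reusing the earlier sketches (pairwise_upper_bounds, greedy_exemplars, grad_f_i); the epoch count, step-size schedule, and constants are illustrative assumptions rather than the talk's exact settings.

```python
import numpy as np

def craig_weights(D, S):
    """Step 2: weight each exemplar by the size of the cluster it represents."""
    nearest = np.argmin(D[:, S], axis=1)           # index into S of each point's nearest exemplar
    return np.bincount(nearest, minlength=len(S))  # gamma_j = cluster size

def train_on_coreset(w, X, y, k, epochs=10, alpha0=0.1, tau=0.5):
    D = pairwise_upper_bounds(X)                   # bounds d_ij (convex case, preprocessing)
    S, _ = greedy_exemplars(D, k)                  # step 1: greedy exemplar selection
    gamma = craig_weights(D, S)
    step = 0
    for _ in range(epochs):                        # step 3: weighted incremental GD on S*
        for j, g in zip(S, gamma):
            step += 1
            alpha_k = alpha0 / step ** tau         # decaying step size ~ 1/k^tau, tau < 1
            w = w - alpha_k * g * grad_f_i(w, X[j], y[j])
    return w
```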
Our Approach: CRAIG
• Weighted incremental gradient descent on the subset S ⊆ V of exemplars in the gradient space
• [figure: weighted exemplars, e.g. w = 0.2, 0.3, 0.05, 0.1]
• Theorem: For a μ-strongly convex loss function, CRAIG with decaying step size Θ(1/k^τ), τ < 1, converges to a 2ϵ/μ neighborhood of the optimal solution, at a rate of O(1/k^τ)
• We get up to a |V| / |S| speedup!
Existing Techniques: Speeding up Stochastic Gradient Methods
• Variance reduction techniques [JZ’13, DB’14, A’18]
• Choosing better step sizes [KB’14, DHS’11, Z’12]
• Importance sampling [NSW’13, ZZ’14, KF’18]
• CRAIG is complementary to all the above methods
Application of CRAIG to Logistic Regression
• Training on subsets of size 10% of Covtype (581K points)
• Up to 6x faster than training on the full data, with the same accuracy
Application of CRAIG to Logistic Regression
• Training on subsets of various sizes of Ijcnn1 (50K points, imbalanced)
• [figure: accuracy vs. training time for subsets of 10%-90%, with SGD on all data for reference]
• Up to 7x faster than training on the full data, with the same accuracy
Application of CRAIG to Neural Networks
• Training a 2-layer neural network on MNIST (50K points)
• 2x-3x faster than training on the full data, with better generalization
Application of CRAIG to Deep Networks
• Training ResNet20 on subsets of various sizes of CIFAR10 (50K points)
• CRAIG is data-efficient
Summary
• We developed the first rigorous method for data-efficient training of general machine learning models
  • Converges to a near-optimal solution
  • Similar convergence rate as incremental gradient methods
  • Speeds up training by up to 7x for logistic regression and 3x for deep neural networks
• Come to our poster for more details!