Large-scale learning for image classification


  1. Large-scale learning for image classification
     Zaid Harchaoui, CVML'13, July 2013

  2. Large-scale image datasets
     From "The Promise and Perils of Benchmark Datasets and Challenges", D. Forsyth, A. Efros, F.-F. Li, A. Torralba and A. Zisserman, talk at "Frontiers of Computer Vision".

  3. Large-scale supervised learning: large-scale image classification
     - Let $(x_1, y_1), \ldots, (x_n, y_n) \in \mathbb{R}^d \times \{1, \ldots, k\}$ be labelled training images.
     - Minimize over $W \in \mathbb{R}^{d \times k}$ the objective
       $\lambda\,\Omega(W) + \frac{1}{n} \sum_{i=1}^{n} L\!\left(y_i, W^\top x_i\right)$
     - Problem: minimizing such objectives in the large-scale setting $n \gg 1$, $d \gg 1$, $k \gg 1$.

  4. Large-scale supervised learning: large-scale image classification
     - Let $(x_1, y_1), \ldots, (x_n, y_n) \in \mathbb{R}^d \times \{1, \ldots, k\}$ be labelled training images.
     - Minimize over $W \in \mathbb{R}^{d \times k}$ the objective
       $\lambda\,\Omega(W) + \frac{1}{n} \sum_{i=1}^{n} L\!\left(y_i, W^\top x_i\right)$
     - Problem: minimizing such objectives in the large-scale setting $n \gg 1$, $d \gg 1$, $k \gg 1$.
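
     For concreteness, here is a minimal NumPy sketch of this objective under two assumptions that the slides leave open: $L$ is the multinomial logistic (softmax) loss, $\Omega(W) = \frac{1}{2}\|W\|_F^2$, and labels are encoded as $0, \ldots, k-1$.

```python
import numpy as np

def objective(W, X, y, lam):
    """J(W) = lam * Omega(W) + (1/n) * sum_i L(y_i, W^T x_i), with softmax loss."""
    n = X.shape[0]
    scores = X @ W                                         # rows are W^T x_i, shape (n, k)
    scores = scores - scores.max(axis=1, keepdims=True)    # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    data_loss = -log_probs[np.arange(n), y].mean()         # empirical risk (1/n) sum_i L
    reg = 0.5 * np.sum(W * W)                              # Omega(W) = ||W||_F^2 / 2
    return lam * reg + data_loss
```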

  5. Machine learning cuboid
     [Figure: the "machine learning cuboid", whose three axes are the number of examples $n$, the feature dimension $d$, and the number of categories $k$.]

  6. Working example: the ImageNet dataset
     - Large number of examples: $n = 17$ million
     - Large feature dimension: $d = 4 \cdot 10^3, \ldots, 2 \cdot 10^5$
     - Large number of categories: $k = 10\,000$
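
     To see why this regime is challenging, note that the parameter matrix $W \in \mathbb{R}^{d \times k}$ alone can reach $2 \cdot 10^5 \times 10^4 = 2 \cdot 10^9$ entries at the upper end of these sizes, i.e. roughly 8 GB in single precision, before any of the $n = 17$ million training images is even loaded.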

  7. General strategy for large-scale problems
     Most approaches boil down to a general "divide-and-conquer" strategy: break the large learning problem into small, easy pieces.

  8. Machine learning cuboid
     [Same figure as slide 5: the cuboid with axes $n$, $d$, $k$.]

  9. Decomposition principle
     - Decomposition over examples: stochastic/incremental gradient descent
     - Decomposition over features: (primal) regular coordinate descent
     - Decomposition over categories: one-versus-rest strategy
     - Decomposition over latent structure: atomic decomposition

  10. Decomposition principle
     - Decomposition over examples: stochastic/incremental gradient descent
     - Decomposition over features: (primal) coordinate descent
     - Decomposition over categories: one-versus-rest strategy
     - Decomposition over latent structure: atomic decomposition

  11. Decomposition over examples: stochastic/incremental gradient descent
     - Bru, 1890: an algorithm to adjust the slant $\theta$ of a cannon in order to obtain a specified range $r$ by trial and error, firing one shell after another:
       $\theta_t = \theta_{t-1} - \frac{\gamma_0}{t}\,(r - r_t)$
     - Perceptron (Rosenblatt, 1957):
       $w_t = w_{t-1} + \gamma_t\, y_t\, \phi(x_t)$ if $y_t\, w_{t-1}^\top \phi(x_t) \le 0$, and $w_t = w_{t-1}$ otherwise.
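
     A minimal sketch of the perceptron update above, assuming labels $y_t \in \{-1, +1\}$, an identity feature map $\phi(x) = x$, and a fixed step size $\gamma_t = 1$ (none of which the slide pins down):

```python
import numpy as np

def perceptron(X, y, n_epochs=10):
    """Perceptron with y_t in {-1, +1}, phi(x) = x, and gamma_t = 1."""
    w = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        for x_t, y_t in zip(X, y):
            if y_t * (w @ x_t) <= 0:      # misclassified (or on the boundary)
                w = w + y_t * x_t         # w_t = w_{t-1} + gamma_t * y_t * phi(x_t)
            # otherwise w is left unchanged
    return w
```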

  12. Decomposition over examples: stochastic/incremental gradient descent
     - Bru, 1890: algorithm to adjust the slant $\theta$ of a cannon to obtain a specified range $r$ by trial and error
     - Perceptron, Rosenblatt, 1957
     - 60s-70s: extensions in learning, optimal control, and adaptive signal processing
     - 80s-90s: extensions to non-convex learning problems
     - See "Efficient backprop" in Neural Networks: Tricks of the Trade, LeCun et al., 1998, for wise advice and an overview of SGD algorithms.

  13. Decomposition over examples: stochastic/incremental gradient descent
     - Initialize: $W = 0$
     - Iterate: pick an example $(x_t, y_t)$ and update
       $W_{t+1} = W_t - \gamma_t\, \nabla_W Q(W_t; x_t, y_t)$   (one example at a time)
     - Why? Where do these update rules come from?

  14. Decomposition over examples: plain gradient descent
     - Plain gradient descent versus stochastic/incremental gradient descent.
     - Grouping the regularization penalty and the empirical risk:
       $\nabla_W J(W) = \nabla_W\, \frac{1}{n}\left[\, n\lambda\,\Omega(W) + \sum_{i=1}^{n} L\!\left(y_i, W^\top x_i\right) \right]$

  15. Decomposition over examples: plain gradient descent
     - Grouping the regularization penalty and the empirical risk, and expanding the sum over the examples:
       $\nabla_W J(W) = \nabla_W\, \frac{1}{n}\left[\, n\lambda\,\Omega(W) + \sum_{i=1}^{n} L\!\left(y_i, W^\top x_i\right) \right] = \frac{1}{n} \sum_{i=1}^{n} \nabla_W Q(W; x_i, y_i)$
       with $Q(W; x_i, y_i) = \lambda\,\Omega(W) + L\!\left(y_i, W^\top x_i\right)$.

  16. Decomposition over examples: plain gradient descent
     - Initialize: $W = 0$
     - Iterate:
       $W_{t+1} = W_t - \gamma_t\, \nabla_W J(W_t) = W_t - \gamma_t\, \frac{1}{n} \sum_{i=1}^{n} \nabla_W Q(W_t; x_i, y_i)$

  17. Decomposition over examples: plain gradient descent
     - Initialize: $W = 0$
     - Iterate:
       $W_{t+1} = W_t - \gamma_t\, \nabla_W J(W_t) = W_t - \gamma_t\, \frac{1}{n} \sum_{i=1}^{n} \nabla_W Q(W_t; x_i, y_i)$   (sum over all examples!)
     - Strengths and weaknesses:
       Strength: robust to the setting of the step-size sequence (line-search).
       Weakness: demanding disk/memory requirements.
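
     As a sketch of this full-gradient update, reusing the softmax-loss assumption from the objective sketch above: grad_Q below is the gradient of the hypothetical per-example term $Q(W; x_i, y_i) = \lambda\,\Omega(W) + L(y_i, W^\top x_i)$, and the later sketches reuse it as well.

```python
import numpy as np

def grad_Q(W, x, y, lam):
    """Gradient of Q(W; x, y) = lam * ||W||_F^2 / 2 + softmax loss, for one example."""
    scores = W.T @ x                           # shape (k,)
    scores = scores - scores.max()             # numerical stability
    p = np.exp(scores) / np.exp(scores).sum()  # predicted class probabilities
    p[y] -= 1.0                                # d(loss)/d(scores) = p - e_y
    return lam * W + np.outer(x, p)            # shape (d, k)

def plain_gradient_descent(X, y, k, lam=1e-4, gamma=0.1, n_iters=100):
    n, d = X.shape
    W = np.zeros((d, k))
    for _ in range(n_iters):
        # full gradient: one pass over *all* n examples per single update
        G = sum(grad_Q(W, X[i], y[i], lam) for i in range(n)) / n
        W = W - gamma * G
    return W
```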

  18. Decomposition over examples: stochastic/incremental gradient descent
     - Leveraging the decomposable structure over examples:
       $\nabla_W J(W) = \frac{1}{n} \sum_{i=1}^{n} \nabla_W Q(W; x_i, y_i) = \frac{1}{n}\left[\, \nabla_W Q(W; x_1, y_1) + \cdots + \nabla_W Q(W; x_n, y_n) \right]$

  19. Decomposition over examples: stochastic/incremental gradient descent
     - Leveraging the decomposable structure over examples:
       $\nabla_W J(W) = \frac{1}{n}\left[\, \underbrace{\nabla_W Q(W; x_1, y_1)}_{\text{cheap to compute}} + \cdots + \underbrace{\nabla_W Q(W; x_n, y_n)}_{\text{cheap to compute}} \right]$
     - Make incremental gradient steps along $\nabla_W Q(W; x_t, y_t)$ at each iteration $t$, instead of full gradient steps along $\nabla_W J(W)$ at each iteration.
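
     In terms of cost, with the linear model sketched above an incremental step touches a single example and costs on the order of $d \cdot k$ operations, whereas a full gradient step costs on the order of $n \cdot d \cdot k$; at the ImageNet scale of slide 6, that is roughly a factor $n = 1.7 \cdot 10^7$ more work per parameter update.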

  20. Decomposition over examples: stochastic/incremental gradient descent
     - Initialize: $W = 0$
     - Iterate: pick an example $(x_t, y_t)$ and update
       $W_{t+1} = W_t - \gamma_t\, \nabla_W Q(W_t; x_t, y_t)$

  21. Decomposition over examples: stochastic/incremental gradient descent
     - Initialize: $W = 0$
     - Iterate: pick an example $(x_t, y_t)$ and update
       $W_{t+1} = W_t - \gamma_t\, \nabla_W Q(W_t; x_t, y_t)$   (one example at a time)
     - Strengths and weaknesses:
       Strength: little disk requirement.
       Weakness: may be sensitive to the setting of the step-size sequence.
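
     A minimal sketch of this loop with a fixed step size, reusing the hypothetical grad_Q from the plain-gradient-descent sketch above:

```python
import numpy as np

def sgd(X, y, k, lam=1e-4, gamma=0.1, n_iters=10000, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = np.zeros((d, k))
    for _ in range(n_iters):
        i = rng.integers(n)                          # pick one example (x_t, y_t)
        W = W - gamma * grad_Q(W, X[i], y[i], lam)   # one example at a time
    return W
```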

  22. Decomposition over examples: stochastic/incremental gradient descent
     - What is "stochastic" in this algorithm? Look at the objective as a stochastic approximation of the expected training error:
       $\nabla_W J(W) = \frac{1}{n} \sum_{i=1}^{n} \nabla_W Q(W; x_i, y_i) = \frac{1}{n}\left[\, \nabla_W Q(W; x_1, y_1) + \cdots + \nabla_W Q(W; x_n, y_n) \right]$

  23. Decomposition over examples: stochastic/incremental gradient descent
     - What is "stochastic" in this algorithm?
       $\nabla_W J(W) = \frac{1}{n} \sum_{i=1}^{n} \nabla_W Q(W; x_i, y_i) \approx \mathbb{E}_{x, y}\!\left[ \nabla_W Q(W; x, y) \right]$
     - Practical consequences:
       Shuffle the examples before launching the algorithm, in case they form a correlated sequence.
       Perform several passes/epochs over the training data, shuffling the examples before each pass/epoch.
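
     A sketch of these two practical consequences (shuffling, several epochs), again with the hypothetical grad_Q from the earlier sketch:

```python
import numpy as np

def sgd_epochs(X, y, k, lam=1e-4, gamma=0.1, n_epochs=5, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = np.zeros((d, k))
    for _ in range(n_epochs):
        for i in rng.permutation(n):                 # re-shuffle before every pass/epoch
            W = W - gamma * grad_Q(W, X[i], y[i], lam)
    return W
```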

  24. Decomposition over examples: mini-batch extensions
     - Regular stochastic gradient descent: extreme decomposition strategy, picking one example at a time.
     - Mini-batch extensions: decomposition onto mini-batches of size $B_t$ at iteration $t$.
     - When to choose one or the other? Regular stochastic gradient descent converges for simple objectives with "moderate non-smoothness". For more sophisticated objectives, SGD does not converge, and mini-batch SGD is a must.
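
     A sketch of the mini-batch variant: each iteration averages the per-example gradients over a batch of size B (here a fixed size, whereas the slide allows a size $B_t$ that varies with $t$), reusing the hypothetical grad_Q from above.

```python
import numpy as np

def minibatch_sgd(X, y, k, lam=1e-4, gamma=0.1, n_iters=1000, B=128, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = np.zeros((d, k))
    for _ in range(n_iters):
        batch = rng.choice(n, size=min(B, n), replace=False)         # mini-batch indices
        G = sum(grad_Q(W, X[i], y[i], lam) for i in batch) / len(batch)
        W = W - gamma * G
    return W
```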

  25. Decomposition over examples: theory digest
     - Fixed stepsize $\gamma_t \equiv \gamma$ $\rightarrow$ stable convergence.
     - Decreasing stepsize $\gamma_t = \frac{\gamma_0}{t + t_0}$ $\rightarrow$ faster local convergence, with $\gamma_0$ and $t_0$ properly set.
     - Note: stochastic gradient descent is an extreme decomposition strategy, picking one example at a time.
     - In practice: pick a random batch of reasonable size and find the best pair $(\gamma_0, t_0)$ through cross-validation, then run stochastic gradient descent with the decreasing stepsize sequence $\gamma_t = \frac{\gamma_0}{t + t_0}$.
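
     A sketch of the decreasing-stepsize recipe $\gamma_t = \gamma_0 / (t + t_0)$, again reusing the hypothetical grad_Q; in practice one would run this function on a small random subsample for a grid of candidate $(\gamma_0, t_0)$ pairs and keep the pair that gives the lowest objective, as the slide suggests.

```python
import numpy as np

def sgd_decreasing(X, y, k, gamma0, t0, lam=1e-4, n_iters=10000, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = np.zeros((d, k))
    for t in range(1, n_iters + 1):
        gamma_t = gamma0 / (t + t0)                  # gamma_t = gamma_0 / (t + t_0)
        i = rng.integers(n)
        W = W - gamma_t * grad_Q(W, X[i], y[i], lam)
    return W
```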

  26. Decomposition over examples: tricks of the trade, life is simpler in large-scale settings
     - Shuffle the examples before launching the algorithm, and process the examples in a manner balanced w.r.t. the categories.
     - Regularization through early stopping: perform only a few passes/epochs over the training data, and stop when the accuracy on a held-out validation set no longer increases.
     - A fixed step size works fine: find the best $\gamma$ through cross-validation on a small batch.
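
     A sketch of regularization through early stopping, with an assumed held-out validation split (X_val, y_val) and the hypothetical grad_Q from above; training stops as soon as validation accuracy no longer improves.

```python
import numpy as np

def sgd_early_stopping(X, y, X_val, y_val, k, lam=0.0, gamma=0.1, max_epochs=10, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = np.zeros((d, k))
    best_W, best_acc = W.copy(), -np.inf
    for _ in range(max_epochs):
        for i in rng.permutation(n):                          # one shuffled pass/epoch
            W = W - gamma * grad_Q(W, X[i], y[i], lam)
        acc = np.mean((X_val @ W).argmax(axis=1) == y_val)    # held-out accuracy
        if acc > best_acc:
            best_W, best_acc = W.copy(), acc
        else:
            break                                             # accuracy stopped increasing
    return best_W
```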
