Guided Learning of Nonconvex Models through Successive Functional Gradient Optimization

Rie Johnson (RJ Research Consulting) and Tong Zhang (Hong Kong University of Science and Technology)
Training Deep Neural Networks

Challenge: a nonconvex optimization problem; training can converge to a local minimum with sub-optimal generalization.

This work: how to find a local minimum with better generalization.

Idea: restricting the search space leads to better generalization.
Method: guided functional gradient training (the guide restricts the search space).
Problem Formulation

Supervised learning:

  θ̂ = argmin_θ (1/|S|) Σ_{(x,y)∈S} L(f(θ; x), y) + R(θ)

- x: input; y: output
- f(θ; x): vector function that predicts y from x
- θ: model parameters
- S: training data
- L: loss function
- R(θ): regularizer, such as weight decay λ‖θ‖²₂

Example: K-class classification with y ∈ {1, 2, …, K}; f(θ; x) is K-dimensional and linked to the conditional probabilities.
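The objective above can be made concrete with a small sketch. This is a hypothetical minimal example, not the paper's code: a linear K-class model with softmax cross-entropy (the loss linked to conditional probabilities) and weight decay; `regularized_loss` and the toy data are illustrative names.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def regularized_loss(theta, X, y, lam):
    """(1/|S|) sum L(f(theta;x), y) + lam * ||theta||^2 for a linear
    K-class model f(theta; x) = theta^T x with cross-entropy loss."""
    logits = X @ theta                       # shape (n, K)
    probs = softmax(logits)
    n = X.shape[0]
    ce = -np.log(probs[np.arange(n), y]).mean()
    return ce + lam * np.sum(theta ** 2)

# Toy data: 4 samples, 3 features, K = 2 classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))
y = np.array([0, 1, 1, 0])
theta = np.zeros((3, 2))
# With theta = 0 all class probabilities are 1/K, so the loss is log K.
print(regularized_loss(theta, X, y, lam=0.0))
```

At θ = 0 the predicted distribution is uniform, so the unregularized loss equals log K, a quick sanity check on the formulation.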
GULF: GUided Learning through Functional gradient

General GULF procedure (f: the model we are training):

(Step 1) Generate a guide function f*: apply a functional gradient step to reduce the loss of the current model f, so that f* is an improvement over f in terms of loss but not too far from f.
(Step 2) Move the model f towards the guide function f* using SGD, according to some distance measure.

The guide serves as a restriction of the model-parameter search space.

Motivation: functional gradient learning of additive models in gradient boosting (Friedman, 2001), known to have good generalization; a natural idea is to use functional gradient learning to guide SGD.

Result: worse training error but better test error.
Step 1: Move the Guide Ahead

We formulate Step 1 as

  f*(x, y) := argmin_q [ α ∇L_y(f(x))ᵀ q + D_h(q, f(x)) ],    (1)

where the functional-gradient term moves the guide ahead and the divergence term keeps it near the previous model; α is a meta-parameter, and the Bregman divergence D_h is defined by

  D_h(u, v) = h(u) − h(v) − ∇h(v)ᵀ(u − v).

(1) is equivalent to mirror descent in function space:

  ∇h(f*(x, y)) = ∇h(f(x)) − α ∇L_y(f(x)),    (2)

i.e., the new guide is the previous model minus a functional-gradient step, taken in the mirror map ∇h.
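In the Euclidean special case h(u) = ½‖u‖², the mirror map ∇h is the identity and Eq. (2) gives the guide in closed form: f* = f(x) − α ∇L_y(f(x)). A minimal sketch under that assumption, using cross-entropy on logits (where ∇L_y(f) = softmax(f) − onehot(y)); `guide` is an illustrative name, not from the paper:

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax for a 1-D logit vector.
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def guide(f_x, y, alpha, num_classes):
    """Guide f*(x, y) from Eq. (2) with h(u) = 0.5 * ||u||^2, so that
    grad h is the identity and f* = f(x) - alpha * grad L_y(f(x)).
    For cross-entropy on logits, grad L_y(f) = softmax(f) - onehot(y)."""
    onehot = np.eye(num_classes)[y]
    grad_L = softmax(f_x) - onehot
    return f_x - alpha * grad_L

f_x = np.array([1.0, -1.0, 0.0])   # current model's output (logits) at x
f_star = guide(f_x, y=2, alpha=0.5, num_classes=3)
# The guide raises the true-class score and lowers the others: an
# improvement in loss that stays near f(x) for small alpha.
print(f_star)
```

With α small the guide is a conservative improvement over the current model, which is exactly the "ahead of f, but not too far" property Step 1 asks for.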
Step 2: Following the Guide

Update the network parameters θ to reduce

  Σ_{(x,y)∈S} D_h(f(θ; x), f*(x, y)) + R(θ),    (3)

where the divergence term keeps the next model near the guide and R(θ) is the regularizer, applying SGD repeatedly to improve the model f(θ; ·):

  θ ← θ − η ∇_θ [ Σ_{(x,y)∈B} D_h(f(θ; x), f*(x, y)) + R(θ) ],    (4)

where B is a mini-batch sampled from the training set S.

Remarks:
- f(θ; ·) moves towards the guide function f* in Bregman divergence.
- R(θ) is the regularization term.
- The guide f*(x, y) restricts the SGD search space → better generalization.
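The update in Eq. (4) can be sketched for the simplest instantiation: a linear model, the squared-distance Bregman divergence D_h(u, v) = ½‖u − v‖², and weight decay. This is a hypothetical illustration of the update rule, not the paper's training code; `sgd_step` and `objective` are made-up names.

```python
import numpy as np

def sgd_step(theta, X_batch, f_star_batch, eta, lam):
    """One update of Eq. (4) for f(theta; x) = theta^T x,
    D_h(u, v) = 0.5 * ||u - v||^2, and R(theta) = lam * ||theta||^2."""
    pred = X_batch @ theta                      # next model's outputs on B
    grad = X_batch.T @ (pred - f_star_batch)    # grad of the D_h sum
    grad += 2.0 * lam * theta                   # grad of the regularizer
    return theta - eta * grad

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))          # mini-batch B of inputs
f_star = rng.normal(size=(8, 2))     # guide outputs f*(x, y) on B
theta = np.zeros((3, 2))

def objective(theta):
    # The Step-2 objective (3) restricted to this batch.
    return 0.5 * np.sum((X @ theta - f_star) ** 2) + 1e-3 * np.sum(theta ** 2)

before = objective(theta)
for _ in range(20):
    theta = sgd_step(theta, X, f_star, eta=0.05, lam=1e-3)
# Repeated steps move the model towards the guide in Bregman divergence.
print(objective(theta) < before)
```

The same structure carries over to deep networks: only the gradient computation through f(θ; x) changes, not the objective being descended.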
Convergence Result

Define the α-regularized loss

  ℓ_α(θ) := Σ_{(x,y)∈S} L(f(θ; x), y) + (1/α) R(θ).    (5)

Theorem. Under appropriate assumptions, consider the GULF algorithm with sufficiently small α and η. Assume that θ_{t+1} is an improvement of θ_t with respect to minimizing

  Q_t(θ) := Σ_{(x,y)∈S} D_h(f(θ; x), f*(x, y)) + R(θ),

so that Q_t(θ_{t+1}) ≤ Q_t(θ_t − η ∇Q_t(θ_t)). Then GULF finds a local minimum of ℓ_α(·): ∇ℓ_α(θ_t) → 0.
Remarks

GULF is very different from standard training of the α-regularized loss: the better generalization comes from the guide restricting the search space.

For h = L_y(f) with the cross-entropy loss for classification, Step 2 becomes the self-distillation parameter update:

  θ ← θ − η ∇_θ Σ_{(x,y)∈S} [ (1 − α) L(f_θ, prob(f_{θ_t})) + α L_y(f_θ) ],

where the first term is distillation with the old model and the second is the ordinary training loss.

Our result gives a convergence proof of self-distillation, and generalizes it to other loss functions.
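The self-distillation objective above can be sketched directly: a convex combination of cross-entropy against the old model's soft predictions prob(f_{θ_t}) and cross-entropy against the labels. A minimal illustration assuming per-batch averaging; `self_distill_loss` is an illustrative name, not from the paper.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def self_distill_loss(logits_new, logits_old, y, alpha):
    """(1 - alpha) * cross-entropy to the old model's soft targets
    + alpha * cross-entropy to the labels (the training loss)."""
    p_new = softmax(logits_new)
    p_old = softmax(logits_old)          # soft targets from the old model
    n = logits_new.shape[0]
    distill = -np.sum(p_old * np.log(p_new)) / n
    train = -np.log(p_new[np.arange(n), y]).mean()
    return (1.0 - alpha) * distill + alpha * train

logits_old = np.array([[2.0, 0.0], [0.0, 1.0]])   # old model f_{theta_t}
logits_new = np.array([[1.5, 0.5], [0.5, 0.5]])   # current model f_theta
y = np.array([0, 1])
loss = self_distill_loss(logits_new, logits_old, y, alpha=0.3)
# alpha = 1 recovers plain training; alpha = 0 is pure self-distillation.
print(loss)
```

Setting α = 1 recovers the ordinary training loss, which makes the interpolation between distillation and standard training explicit.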
Empirical Results

Methods compared:
- (ini:random) GULF starting from random initialization
- (ini:base) GULF starting from initialization by regular training
- (base-λ/α) standard training with the α-regularized loss
- (base-loop) standard training with learning-rate resets
- (label-smoothing) training with noisy labels

The first three converge to local-minimum solutions of the α-regularized loss.
Results

                      C10     C100    SVHN
  baselines
    1 base model      6.42    30.90   1.86 / 1.64
    2 base-λ/α        6.60    30.24   1.78 / 1.67
    3 base-loop       6.20    30.09   1.93 / 1.53
    4 label smooth    6.66    30.52   1.71 / 1.60
  GULF2
    5 ini:random      5.91    28.83   1.71 / 1.53
    6 ini:base        5.75    29.12   1.65 / 1.56

Table: Test error (%), median of 3 runs. ResNet-28 (0.4M parameters) for CIFAR-10/100 and WRN-16-4 (2.7M parameters) for SVHN. The two SVHN numbers are without / with dropout.

Similar results hold with larger models and on ImageNet.
Analysis: Worse Training Loss but Better Generalization

Figure: Test loss in relation to training loss (both log-scale) for (a) GULF2 (ini:random, ini:base) and (b) regular training; the arrows indicate the direction of time flow. CIFAR-100, ResNet-28.

GULF solution properties:
- worse training loss but better test loss (better generalization)
- different weight-decay behavior in the regularizer
Summary

Background:
- Nonconvex optimization gets stuck in local minima.
- We want to find a local minimum with better generalization.

Method:
- Guided learning through successive functional gradient optimization.
- Finds local solutions with worse training loss but better generalization.

Why it works:
- A restricted search space → better generalization.
- Our method generalizes self-distillation.