Efficient Algorithms for Smooth Minimax Optimization (NeurIPS 2019)
Kiran Koshy Thekumparampil†, Prateek Jain‡, Praneeth Netrapalli‡, Sewoong Oh±
†University of Illinois at Urbana-Champaign, ‡Microsoft Research, India, ±University of Washington, Seattle
Oct 27, 2019
Outline
- Minimax optimization problem
- Efficient algorithm for the nonconvex–concave minimax problem
- Optimal algorithm for the strongly-convex–concave minimax problem
Minimax problem
Consider the general minimax problem
    min_{x ∈ X} max_{y ∈ Y} g(x, y)
- Two-player game: y tries to maximize and x tries to minimize.
- The order of min and max, i.e., who plays first (x above), is important:
    max_{y ∈ Y} min_{x ∈ X} g(x, y) ≤ min_{x ∈ X} max_{y ∈ Y} g(x, y)
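A minimal numerical sketch of why the order matters (using a toy payoff g(x, y) = (x − y)² on [0, 1]², chosen for illustration and not taken from the talk):

```python
import numpy as np

# Toy payoff g(x, y) = (x - y)^2 on [0, 1] x [0, 1] (illustrative choice only).
grid = np.linspace(0.0, 1.0, 201)
X, Y = np.meshgrid(grid, grid, indexing="ij")   # x varies along axis 0, y along axis 1
G = (X - Y) ** 2

max_min = np.max(np.min(G, axis=0))   # max_y min_x g(x, y): x can copy y, so this is 0
min_max = np.min(np.max(G, axis=1))   # min_x max_y g(x, y): best x is 1/2, so this is 1/4
print(max_min, min_max)               # 0.0 0.25 -> max-min <= min-max, here strictly
```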
Examples of minimax problems
1. GAN: min_G max_D V(G, D):
    min_G max_D  E_{x ∼ P_X}[log D(x)] + E_{z ∼ Q_Z}[log(1 − D(G(z)))]
   The inner maximum equals JS(P_X || Q_X) up to constants, where Q_X is the distribution of G(z), z ∼ Q_Z.
2. Constrained optimization: min_x f(x) s.t. f_i(x) ≤ 0 for all i ∈ [m]:
    min_x max_{y ≥ 0}  L(x, y) = f(x) + Σ_{i=1}^m y_i f_i(x)
3. Robust estimation/optimization:
    min_x max_{ẑ_i}  Σ_i f(x, ẑ_i)   s.t.  Δ(ẑ_i, z_i) ≤ ε, ∀ i ∈ [m]
Nonconvex minimax
- In general g(x, y) is nonconvex in both x and y, e.g., neural-network-based GANs. Very few works address nonconvex minimax.
- We focus on the smooth nonconvex–concave minimax problem, i.e., g(x, ·) is concave and g is L-smooth:
    max_{a ∈ {x, y}} ‖∇_a g(x, y) − ∇_a g(x′, y′)‖ ≤ L (‖x − x′‖ + ‖y − y′‖)
  E.g., smooth constrained optimization.
- In general, max_{y ∈ Y} min_{x ∈ X} g(x, y) < min_{x ∈ X} max_{y ∈ Y} g(x, y).
- We focus on the resulting primal problem f(x) = max_y g(x, y), which is nonsmooth and nonconvex.
f(x) = max_{y ∈ Y} g(x, y) is nonsmooth and weakly convex
- f is nonsmooth due to the maximization over y.
ρ-weakly convex function
- We say that f is ρ-weakly convex if f + (ρ/2)‖·‖² is convex, i.e.,
    f(x) + ⟨u_x, x′ − x⟩ − (ρ/2)‖x′ − x‖² ≤ f(x′)
  for all Fréchet subgradients u_x ∈ ∂f(x) and all x, x′ ∈ X.
- Example: f(x) = max{|x|, 1 − x²/2} is 1-weakly convex, as f + ‖·‖²/2 is convex (a numerical check follows below).
[Figure: plot of f(x) = max{|x|, 1 − x²/2} and f(x) + x²/2 on [−1, 1].]
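A small numerical check of the slide's example (assumption: convexity of f + x²/2 is certified on a uniform grid via nonnegative discrete second differences):

```python
import numpy as np

xs = np.linspace(-1.0, 1.0, 2001)
f = np.maximum(np.abs(xs), 1.0 - xs ** 2 / 2.0)   # the 1-weakly convex example
g = f + xs ** 2 / 2.0                             # f + (rho/2)|x|^2 with rho = 1

# Discrete second differences are >= 0 (up to rounding) for a convex function.
print((g[2:] - 2.0 * g[1:-1] + g[:-2]).min() >= -1e-12)   # True: f + |x|^2/2 is convex
print((f[2:] - 2.0 * f[1:-1] + f[:-2]).min())             # ~ -1e-6: f itself is not convex
```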
f(x) = max_{y ∈ Y} g(x, y) is nonsmooth and weakly convex
- f is nonsmooth due to the maximization over y.
ρ-weakly convex function
- We say that f is ρ-weakly convex if f + (ρ/2)‖·‖² is convex, i.e.,
    f(x) + ⟨u_x, x′ − x⟩ − (ρ/2)‖x′ − x‖² ≤ f(x′)
  for all Fréchet subgradients u_x ∈ ∂f(x) and all x, x′ ∈ X.
- Any L-smooth function is L-weakly convex:
    f(x) + ⟨∇f(x), x′ − x⟩ − (L/2)‖x′ − x‖² ≤ f(x′)
- −‖x‖ is not weakly convex (due to the upward-pointing cusp).
f(x) = max_{y ∈ Y} g(x, y) is nonsmooth and weakly convex
- f is nonsmooth due to the maximization over y.
ρ-weakly convex function
- We say that f is ρ-weakly convex if f + (ρ/2)‖·‖² is convex, i.e.,
    f(x) + ⟨u_x, x′ − x⟩ − (ρ/2)‖x′ − x‖² ≤ f(x′)
  for all Fréchet subgradients u_x ∈ ∂f(x) and all x, x′ ∈ X.
- f(x) = max_{y ∈ Y} g(x, y) is L-weakly convex if g is L-smooth:
    g(x, y) + ⟨∇_x g(x, y), x′ − x⟩ − (L/2)‖x′ − x‖² ≤ g(x′, y)
    ⟹ f(x) + ⟨u_x, x′ − x⟩ − (L/2)‖x′ − x‖² ≤ f(x′)
- We cannot define an approximate stationary point directly using subgradients.
First-order stationary point of a weakly convex function
Moreau envelope f_λ of an L-weakly convex function (L < 1/λ):
    f_λ(x) = min_{x′} f(x′) + (1/(2λ))‖x − x′‖²
f_λ is a smooth lower bound of f: ∇f_λ(x) = 0 ⟹ 0 ∈ ∂f(x).
[Figure: plot of f(x) = max{|x|, 1 − x²/2} and its Moreau envelope f_{0.5}(x) on [−1, 1].]
First-order stationary point of a weakly convex function
Moreau envelope f_λ of an L-weakly convex function (L < 1/λ):
    f_λ(x) = min_{x′} f(x′) + (1/(2λ))‖x − x′‖²
f_λ is a smooth lower bound of f: ∇f_λ(x) = 0 ⟹ 0 ∈ ∂f(x).
ε-first-order stationary point (ε-FOSP)
- We say that x is an ε-first-order stationary point of an L-weakly convex f if ‖∇f_{1/2L}(x)‖ ≤ ε.
- This implies that there exists x̂ s.t. ‖x̂ − x‖ ≤ ε/(2L) and min_{u ∈ ∂f(x̂)} ‖u‖ ≤ ε.
- Algorithm complexity is the number of first-order oracle calls needed to obtain an ε-FOSP. The convergence rate is ε_k if after k oracle calls we get an ε_k-FOSP. (A numerical sketch follows below.)
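A rough numerical sketch of the envelope and the ε-FOSP certificate (assumptions: the same 1-D example, L = 1, λ = 1/2L, and a brute-force grid search standing in for the inner minimization):

```python
import numpy as np

L, lam = 1.0, 0.5                                   # lambda = 1/(2L)
xs = np.linspace(-1.5, 1.5, 3001)
f = np.maximum(np.abs(xs), 1.0 - xs ** 2 / 2.0)     # 1-weakly convex example from the slides

def prox(x):
    # argmin_{x'} f(x') + (1/(2*lam)) * (x - x')^2, by brute force over the grid
    return xs[np.argmin(f + (xs - x) ** 2 / (2.0 * lam))]

def grad_envelope_norm(x):
    # ||grad f_lambda(x)|| = ||x - prox(x)|| / lambda; a small value certifies an eps-FOSP
    return abs(x - prox(x)) / lam

print(grad_envelope_norm(0.9))               # ~0.34: not yet stationary
print(grad_envelope_norm(np.sqrt(3) - 1.0))  # ~0: x* = sqrt(3) - 1 minimizes f
```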
Smooth nonconvex–concave minimax results

Setting                                      | Previous state-of-the-art | Our result
max_y g(x, y)                                | O(ε⁻⁵) [1]                | Õ(ε⁻³)
max_i f_i(x) = max_{y ∈ Δ_m} Σ_i y_i f_i(x)  | O(ε⁻⁴) [2]                | Õ(ε⁻³)

Δ_m is the simplex of dimension m.

[1] Jin, C., Netrapalli, P., & Jordan, M. I. (2019). Minmax optimization: Stable limit points of gradient descent ascent are locally optimal. arXiv preprint arXiv:1902.00618.
[2] Davis, D., & Drusvyatskiy, D. (2018). Stochastic subgradient method converges at the rate O(k^{−1/4}) on weakly convex functions. arXiv preprint arXiv:1802.02988.
Baseline: Subgradient method, O(ε⁻⁵) [1, 2]
Apply the (inexact) subgradient method:
    u_{x_k} = ∇_x g(x_k, y_k), where y_k ≈ y*(x_k) = argmax_{y ∈ Y} g(x_k, y)
    x_{k+1} = P_X(x_k − η u_{x_k})
Sufficient condition: max_y g(x_k, y) − g(x_k, y_k) ≤ O(ε²) [1].

Setting        | Per-step # iterations (AGD) | # iterations (subgrad. method) | Total complexity
max_y g(x, y)  | O(ε⁻¹)                      | O(ε⁻⁴)                         | O(ε⁻⁵)
max_i f_i(x)   | O(1)                        | O(ε⁻⁴)                         | O(ε⁻⁴)

This baseline does not utilize the smooth minimax structure of f(x) = max_y g(x, y). (A minimal sketch follows below.)
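A minimal sketch of this baseline (assumptions: generic gradient oracles for g, plain projected gradient ascent in place of AGD for the inner maximization, and hand-picked step sizes):

```python
import numpy as np

def inexact_subgradient(grad_x, grad_y, proj_X, proj_Y, x0, y0,
                        eta=0.05, inner_lr=0.1, inner_steps=200, outer_steps=500):
    """Baseline for min_x max_y g(x, y): approximately solve the inner max over y,
    then take one projected (sub)gradient step in x."""
    x, y = x0.copy(), y0.copy()
    for _ in range(outer_steps):
        # Inner loop: y_k ~ argmax_y g(x_k, y); gradient ascent stands in for AGD here.
        for _ in range(inner_steps):
            y = proj_Y(y + inner_lr * grad_y(x, y))
        # u = grad_x g(x_k, y_k) is an inexact subgradient of f(x) = max_y g(x, y).
        x = proj_X(x - eta * grad_x(x, y))
    return x, y
```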
Proximal Point Method (PPM)
(Inexact) proximal point method:
    x_{k+1} ≈ argmin_{x ∈ X} f(x) + L‖x − x_k‖²  ⟺  x_{k+1} ≈ x_k − u_{x_{k+1}}/(2L), u_{x_{k+1}} ∈ ∂f(x_{k+1})
Iteration complexity to get an ε-FOSP is O(1/ε²).
Proof sketch.
- L-weak convexity implies
    f(x_{k+1}) + ⟨u_{x_{k+1}}, x_k − x_{k+1}⟩ − (L/2)‖x_k − x_{k+1}‖² ≤ f(x_k)
- Using the update x_{k+1} = x_k − u_{x_{k+1}}/(2L) we get a descent lemma:
    f(x_{k+1}) − f(x_k) ≤ −(3L/2)‖x_{k+1} − x_k‖² = −(3/(8L))‖u_{x_{k+1}}‖²
- After O((f(x_0) − min_x f(x))/ε²) steps, min_k ‖u_{x_{k+1}}‖ = O(ε).
- This generalizes to ‖∇f_{1/2L}(x_k)‖ to handle the inexact update and the nonsmoothness of f.
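A compact sketch of the inexact PPM outer loop (assumption: solve_prox_subproblem is a hypothetical placeholder for any inner solver returning x_{k+1} ≈ argmin_x f(x) + L‖x − x_k‖² to O(ε²) accuracy; a candidate inner solver for the minimax case is sketched after the next slide):

```python
import numpy as np

def inexact_ppm(solve_prox_subproblem, x0, L, eps, max_outer=1000):
    """Inexact proximal point method on an L-weakly convex f.
    Each step computes x_{k+1} ~ argmin_x f(x) + L * ||x - x_k||^2; since this prox uses
    lambda = 1/(2L), the scaled displacement 2L * ||x_k - x_{k+1}|| approximates
    ||grad f_{1/2L}(x_k)||, so a small displacement certifies an eps-FOSP."""
    x = x0.copy()
    for _ in range(max_outer):
        x_next = solve_prox_subproblem(x)               # inner accuracy O(eps^2) suffices
        if 2.0 * L * np.linalg.norm(x - x_next) <= eps:
            return x_next
        x = x_next
    return x
```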
Per-step complexity of PPM
- L-weakly convex + 2L-strongly convex = L-strongly convex:
    f(x) + L‖x − x_k‖²
- Each iteration solves an L-strongly-convex–concave problem:
    x_{k+1} = argmin_{x ∈ X} max_{y ∈ Y} [ g̃_k(x, y) = g(x, y) + L‖x − x_k‖² ]
- A primal-dual gap of O(ε²) is sufficient:
    max_{y ∈ Y} g̃_k(x_{k+1}, y) − min_{x ∈ X} g̃_k(x, y_{k+1}) = O(ε²)

Algorithm for min_x max_y g̃_k(x, y) | Rate    | Per-step complexity | Total complexity
Convex–concave [Mirror-Prox, 3]     | O(k⁻¹)  | O(ε⁻²)              | O(ε⁻⁴)
Strongly-convex–concave [ours]      | O(k⁻²)  | O(ε⁻¹)              | O(ε⁻³)

[3] A. Nemirovski. "Prox-method with rate of convergence O(1/t) for variational inequalities with Lipschitz continuous monotone operators and smooth convex–concave saddle point problems". SIAM Journal on Optimization 15.1 (2004), pp. 229–251.
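For concreteness, a generic extragradient sketch (Euclidean Mirror-Prox style, in the spirit of [3]) for one prox subproblem; this is not the accelerated strongly-convex–concave algorithm proposed in the talk, and it assumes unconstrained x and y and a known smoothness constant for g:

```python
import numpy as np

def extragradient_prox_step(grad_x, grad_y, xk, y0, L, smooth, steps=1000):
    """Approximately solve min_x max_y g_tilde_k(x, y) = g(x, y) + L * ||x - x_k||^2
    by extragradient; grad_x, grad_y are gradient oracles for g."""
    x, y = xk.copy(), y0.copy()
    eta = 1.0 / (2.0 * (smooth + 2.0 * L))    # conservative step size for the regularized problem
    for _ in range(steps):
        gx = grad_x(x, y) + 2.0 * L * (x - xk)    # gradient of g_tilde_k w.r.t. x
        gy = grad_y(x, y)
        xh, yh = x - eta * gx, y + eta * gy       # prediction step
        gx = grad_x(xh, yh) + 2.0 * L * (xh - xk)
        gy = grad_y(xh, yh)
        x, y = x - eta * gx, y + eta * gy         # correction step
    return x, y
```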
Nonconvex–concave experiment
    min_{x ∈ R²} f(x) = max_{1 ≤ i ≤ m=9} f_i(x), where f_i(x) = (a_i/2)‖x − b_i‖² + c_i
[Figure: ‖∇f_{1/2L}(x_k)‖ versus the number of gradient oracle accesses k, comparing the subgradient method, PPM (ours), and Adaptive PPM (ours).]
(A sketch of this objective and the subgradient baseline follows below.)
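A sketch of this experimental objective and the subgradient baseline (assumptions: the coefficients a_i, b_i, c_i are random placeholders rather than the values used in the talk, and the diminishing step size is hand-picked):

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 9, 2
a = rng.uniform(0.5, 2.0, size=m)      # hypothetical coefficients, not from the talk
b = rng.normal(size=(m, d))
c = rng.normal(size=m)

def f_all(x):
    # f_i(x) = (a_i / 2) * ||x - b_i||^2 + c_i, evaluated for all i at once
    return 0.5 * a * np.sum((x - b) ** 2, axis=1) + c

def subgrad_f(x):
    # A subgradient of f(x) = max_i f_i(x): the gradient of any maximizing piece
    i = int(np.argmax(f_all(x)))
    return a[i] * (x - b[i])

x = np.zeros(d)
for k in range(10_000):
    x = x - 0.5 / np.sqrt(k + 1) * subgrad_f(x)   # diminishing-step subgradient method
print(np.max(f_all(x)))                           # final objective value f(x)
```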