Efficient Algorithms for Smooth Minimax Optimization (NeurIPS 2019)
Kiran Koshy Thekumparampil†, Prateek Jain‡, Praneeth Netrapalli‡, Sewoong Oh±
†University of Illinois at Urbana-Champaign, ‡Microsoft Research, India, ±University of Washington, Seattle
Oct 27, 2019
Outline
- Minimax optimization problem
- Efficient algorithm for the nonconvex–concave minimax problem
- Optimal algorithm for the strongly-convex–concave minimax problem
Minimax problem
Consider the general minimax problem
    min_{x ∈ X} max_{y ∈ Y} g(x, y)
- Two-player game: y tries to maximize and x tries to minimize.
- The order of min and max, i.e., who plays first (x above), is important:
    max_{y ∈ Y} min_{x ∈ X} g(x, y) ≤ min_{x ∈ X} max_{y ∈ Y} g(x, y)
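A minimal numerical sketch of why the order matters (using a toy payoff g(x, y) = (x − y)² on [0, 1]², chosen for illustration and not taken from the talk):

```python
import numpy as np

# Toy payoff g(x, y) = (x - y)^2 on [0, 1] x [0, 1] (illustrative choice only).
grid = np.linspace(0.0, 1.0, 201)
X, Y = np.meshgrid(grid, grid, indexing="ij")   # x varies along axis 0, y along axis 1
G = (X - Y) ** 2

max_min = np.max(np.min(G, axis=0))   # max_y min_x g(x, y): x can copy y, so this is 0
min_max = np.min(np.max(G, axis=1))   # min_x max_y g(x, y): best x is 1/2, so this is 1/4
print(max_min, min_max)               # 0.0 0.25 -> max-min <= min-max, here strictly
```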
Examples of minimax problems
1. GAN: min_G max_D V(G, D):
    min_G max_D  E_{x ∼ P_X}[log D(x)] + E_{z ∼ Q_Z}[log(1 − D(G(z)))]
   The inner maximum equals JS(P_X || Q_X) up to constants, where Q_X is the distribution of G(z), z ∼ Q_Z.
2. Constrained optimization: min_x f(x) s.t. f_i(x) ≤ 0 for all i ∈ [m]:
    min_x max_{y ≥ 0}  L(x, y) = f(x) + Σ_{i=1}^m y_i f_i(x)
3. Robust estimation/optimization:
    min_x max_{ẑ_i}  Σ_i f(x, ẑ_i)   s.t.  Δ(ẑ_i, z_i) ≤ ε, ∀ i ∈ [m]
Nonconvex minimax
- In general g(x, y) is nonconvex in both x and y, e.g., neural-network-based GANs. Very few works address nonconvex minimax.
- We focus on the smooth nonconvex–concave minimax problem, i.e., g(x, ·) is concave and g is L-smooth:
    max_{a ∈ {x, y}} ‖∇_a g(x, y) − ∇_a g(x′, y′)‖ ≤ L (‖x − x′‖ + ‖y − y′‖)
  E.g., smooth constrained optimization.
- In general, max_{y ∈ Y} min_{x ∈ X} g(x, y) < min_{x ∈ X} max_{y ∈ Y} g(x, y).
- We focus on the resulting primal problem f(x) = max_y g(x, y), which is nonsmooth and nonconvex.
f(x) = max_{y ∈ Y} g(x, y) is nonsmooth and weakly convex
- f is nonsmooth due to the maximization over y.
ρ-weakly convex function
- We say that f is ρ-weakly convex if f + (ρ/2)‖·‖² is convex, i.e.,
    f(x) + ⟨u_x, x′ − x⟩ − (ρ/2)‖x′ − x‖² ≤ f(x′)
  for all Fréchet subgradients u_x ∈ ∂f(x) and all x, x′ ∈ X.
- Example: f(x) = max{|x|, 1 − x²/2} is 1-weakly convex, as f + ‖·‖²/2 is convex (a numerical check follows below).
[Figure: plot of f(x) = max{|x|, 1 − x²/2} and f(x) + x²/2 on [−1, 1].]
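A small numerical check of the slide's example (assumption: convexity of f + x²/2 is certified on a uniform grid via nonnegative discrete second differences):

```python
import numpy as np

xs = np.linspace(-1.0, 1.0, 2001)
f = np.maximum(np.abs(xs), 1.0 - xs ** 2 / 2.0)   # the 1-weakly convex example
g = f + xs ** 2 / 2.0                             # f + (rho/2)|x|^2 with rho = 1

# Discrete second differences are >= 0 (up to rounding) for a convex function.
print((g[2:] - 2.0 * g[1:-1] + g[:-2]).min() >= -1e-12)   # True: f + |x|^2/2 is convex
print((f[2:] - 2.0 * f[1:-1] + f[:-2]).min())             # ~ -1e-6: f itself is not convex
```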
f(x) = max_{y ∈ Y} g(x, y) is nonsmooth and weakly convex
- f is nonsmooth due to the maximization over y.
ρ-weakly convex function
- We say that f is ρ-weakly convex if f + (ρ/2)‖·‖² is convex, i.e.,
    f(x) + ⟨u_x, x′ − x⟩ − (ρ/2)‖x′ − x‖² ≤ f(x′)
  for all Fréchet subgradients u_x ∈ ∂f(x) and all x, x′ ∈ X.
- Any L-smooth function is L-weakly convex:
    f(x) + ⟨∇f(x), x′ − x⟩ − (L/2)‖x′ − x‖² ≤ f(x′)
- −‖x‖ is not weakly convex (due to the upward-pointing cusp).
f(x) = max_{y ∈ Y} g(x, y) is nonsmooth and weakly convex
- f is nonsmooth due to the maximization over y.
ρ-weakly convex function
- We say that f is ρ-weakly convex if f + (ρ/2)‖·‖² is convex, i.e.,
    f(x) + ⟨u_x, x′ − x⟩ − (ρ/2)‖x′ − x‖² ≤ f(x′)
  for all Fréchet subgradients u_x ∈ ∂f(x) and all x, x′ ∈ X.
- f(x) = max_{y ∈ Y} g(x, y) is L-weakly convex if g is L-smooth:
    g(x, y) + ⟨∇_x g(x, y), x′ − x⟩ − (L/2)‖x′ − x‖² ≤ g(x′, y)
    ⟹ f(x) + ⟨u_x, x′ − x⟩ − (L/2)‖x′ − x‖² ≤ f(x′)
- We cannot define an approximate stationary point directly using subgradients.
First-order stationary point of a weakly convex function
Moreau envelope f_λ of an L-weakly convex function (L < 1/λ):
    f_λ(x) = min_{x′} f(x′) + (1/(2λ))‖x − x′‖²
f_λ is a smooth lower bound of f: ∇f_λ(x) = 0 ⟹ 0 ∈ ∂f(x).
[Figure: plot of f(x) = max{|x|, 1 − x²/2} and its Moreau envelope f_{0.5}(x) on [−1, 1].]
First-order stationary point of a weakly convex function
Moreau envelope f_λ of an L-weakly convex function (L < 1/λ):
    f_λ(x) = min_{x′} f(x′) + (1/(2λ))‖x − x′‖²
f_λ is a smooth lower bound of f: ∇f_λ(x) = 0 ⟹ 0 ∈ ∂f(x).
ε-first-order stationary point (ε-FOSP)
- We say that x is an ε-first-order stationary point of an L-weakly convex f if ‖∇f_{1/2L}(x)‖ ≤ ε.
- This implies that there exists x̂ s.t. ‖x̂ − x‖ ≤ ε/(2L) and min_{u ∈ ∂f(x̂)} ‖u‖ ≤ ε.
- Algorithm complexity is the number of first-order oracle calls needed to obtain an ε-FOSP. The convergence rate is ε_k if after k oracle calls we get an ε_k-FOSP. (A numerical sketch follows below.)
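A rough numerical sketch of the envelope and the ε-FOSP certificate (assumptions: the same 1-D example, L = 1, λ = 1/2L, and a brute-force grid search standing in for the inner minimization):

```python
import numpy as np

L, lam = 1.0, 0.5                                   # lambda = 1/(2L)
xs = np.linspace(-1.5, 1.5, 3001)
f = np.maximum(np.abs(xs), 1.0 - xs ** 2 / 2.0)     # 1-weakly convex example from the slides

def prox(x):
    # argmin_{x'} f(x') + (1/(2*lam)) * (x - x')^2, by brute force over the grid
    return xs[np.argmin(f + (xs - x) ** 2 / (2.0 * lam))]

def grad_envelope_norm(x):
    # ||grad f_lambda(x)|| = ||x - prox(x)|| / lambda; a small value certifies an eps-FOSP
    return abs(x - prox(x)) / lam

print(grad_envelope_norm(0.9))               # ~0.34: not yet stationary
print(grad_envelope_norm(np.sqrt(3) - 1.0))  # ~0: x* = sqrt(3) - 1 minimizes f
```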
Smooth nonconvex–concave minimax results

Setting                                      | Previous state-of-the-art | Our result
max_y g(x, y)                                | O(ε⁻⁵) [1]                | Õ(ε⁻³)
max_i f_i(x) = max_{y ∈ Δ_m} Σ_i y_i f_i(x)  | O(ε⁻⁴) [2]                | Õ(ε⁻³)

Δ_m is the simplex of dimension m.

[1] Jin, C., Netrapalli, P., & Jordan, M. I. (2019). Minmax optimization: Stable limit points of gradient descent ascent are locally optimal. arXiv preprint arXiv:1902.00618.
[2] Davis, D., & Drusvyatskiy, D. (2018). Stochastic subgradient method converges at the rate O(k^{−1/4}) on weakly convex functions. arXiv preprint arXiv:1802.02988.
Baseline: Subgradient method, O(ε⁻⁵) [1, 2]
Apply the (inexact) subgradient method:
    u_{x_k} = ∇_x g(x_k, y_k), where y_k ≈ y*(x_k) = argmax_{y ∈ Y} g(x_k, y)
    x_{k+1} = P_X(x_k − η u_{x_k})
Sufficient condition: max_y g(x_k, y) − g(x_k, y_k) ≤ O(ε²) [1].

Setting        | Per-step # iterations (AGD) | # iterations (subgrad. method) | Total complexity
max_y g(x, y)  | O(ε⁻¹)                      | O(ε⁻⁴)                         | O(ε⁻⁵)
max_i f_i(x)   | O(1)                        | O(ε⁻⁴)                         | O(ε⁻⁴)

This baseline does not utilize the smooth minimax structure of f(x) = max_y g(x, y). (A minimal sketch follows below.)
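A minimal sketch of this baseline (assumptions: generic gradient oracles for g, plain projected gradient ascent in place of AGD for the inner maximization, and hand-picked step sizes):

```python
import numpy as np

def inexact_subgradient(grad_x, grad_y, proj_X, proj_Y, x0, y0,
                        eta=0.05, inner_lr=0.1, inner_steps=200, outer_steps=500):
    """Baseline for min_x max_y g(x, y): approximately solve the inner max over y,
    then take one projected (sub)gradient step in x."""
    x, y = x0.copy(), y0.copy()
    for _ in range(outer_steps):
        # Inner loop: y_k ~ argmax_y g(x_k, y); gradient ascent stands in for AGD here.
        for _ in range(inner_steps):
            y = proj_Y(y + inner_lr * grad_y(x, y))
        # u = grad_x g(x_k, y_k) is an inexact subgradient of f(x) = max_y g(x, y).
        x = proj_X(x - eta * grad_x(x, y))
    return x, y
```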
Proximal Point Method (PPM)
(Inexact) proximal point method:
    x_{k+1} ≈ argmin_{x ∈ X} f(x) + L‖x − x_k‖²  ⟺  x_{k+1} ≈ x_k − u_{x_{k+1}}/(2L), u_{x_{k+1}} ∈ ∂f(x_{k+1})
Iteration complexity to get an ε-FOSP is O(1/ε²).
Proof sketch.
- L-weak convexity implies
    f(x_{k+1}) + ⟨u_{x_{k+1}}, x_k − x_{k+1}⟩ − (L/2)‖x_k − x_{k+1}‖² ≤ f(x_k)
- Using the update x_{k+1} = x_k − u_{x_{k+1}}/(2L) we get a descent lemma:
    f(x_{k+1}) − f(x_k) ≤ −(3L/2)‖x_{k+1} − x_k‖² = −(3/(8L))‖u_{x_{k+1}}‖²
- After O((f(x_0) − min_x f(x))/ε²) steps, min_k ‖u_{x_{k+1}}‖ = O(ε).
- This generalizes to ‖∇f_{1/2L}(x_k)‖ to handle the inexact update and the nonsmoothness of f.
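A compact sketch of the inexact PPM outer loop (assumption: solve_prox_subproblem is a hypothetical placeholder for any inner solver returning x_{k+1} ≈ argmin_x f(x) + L‖x − x_k‖² to O(ε²) accuracy; a candidate inner solver for the minimax case is sketched after the next slide):

```python
import numpy as np

def inexact_ppm(solve_prox_subproblem, x0, L, eps, max_outer=1000):
    """Inexact proximal point method on an L-weakly convex f.
    Each step computes x_{k+1} ~ argmin_x f(x) + L * ||x - x_k||^2; since this prox uses
    lambda = 1/(2L), the scaled displacement 2L * ||x_k - x_{k+1}|| approximates
    ||grad f_{1/2L}(x_k)||, so a small displacement certifies an eps-FOSP."""
    x = x0.copy()
    for _ in range(max_outer):
        x_next = solve_prox_subproblem(x)               # inner accuracy O(eps^2) suffices
        if 2.0 * L * np.linalg.norm(x - x_next) <= eps:
            return x_next
        x = x_next
    return x
```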
Per-step complexity of PPM
- L-weakly convex + 2L-strongly convex = L-strongly convex:
    f(x) + L‖x − x_k‖²
- Each iteration solves an L-strongly-convex–concave problem:
    x_{k+1} = argmin_{x ∈ X} max_{y ∈ Y} [ g̃_k(x, y) = g(x, y) + L‖x − x_k‖² ]
- A primal-dual gap of O(ε²) is sufficient:
    max_{y ∈ Y} g̃_k(x_{k+1}, y) − min_{x ∈ X} g̃_k(x, y_{k+1}) = O(ε²)

Algorithm for min_x max_y g̃_k(x, y) | Rate    | Per-step complexity | Total complexity
Convex–concave [Mirror-Prox, 3]     | O(k⁻¹)  | O(ε⁻²)              | O(ε⁻⁴)
Strongly-convex–concave [ours]      | O(k⁻²)  | O(ε⁻¹)              | O(ε⁻³)

[3] A. Nemirovski. "Prox-method with rate of convergence O(1/t) for variational inequalities with Lipschitz continuous monotone operators and smooth convex–concave saddle point problems". SIAM Journal on Optimization 15.1 (2004), pp. 229–251.
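For concreteness, a generic extragradient sketch (Euclidean Mirror-Prox style, in the spirit of [3]) for one prox subproblem; this is not the accelerated strongly-convex–concave algorithm proposed in the talk, and it assumes unconstrained x and y and a known smoothness constant for g:

```python
import numpy as np

def extragradient_prox_step(grad_x, grad_y, xk, y0, L, smooth, steps=1000):
    """Approximately solve min_x max_y g_tilde_k(x, y) = g(x, y) + L * ||x - x_k||^2
    by extragradient; grad_x, grad_y are gradient oracles for g."""
    x, y = xk.copy(), y0.copy()
    eta = 1.0 / (2.0 * (smooth + 2.0 * L))    # conservative step size for the regularized problem
    for _ in range(steps):
        gx = grad_x(x, y) + 2.0 * L * (x - xk)    # gradient of g_tilde_k w.r.t. x
        gy = grad_y(x, y)
        xh, yh = x - eta * gx, y + eta * gy       # prediction step
        gx = grad_x(xh, yh) + 2.0 * L * (xh - xk)
        gy = grad_y(xh, yh)
        x, y = x - eta * gx, y + eta * gy         # correction step
    return x, y
```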
Nonconvex–concave experiment
    min_{x ∈ R²} f(x) = max_{1 ≤ i ≤ m=9} f_i(x), where f_i(x) = (a_i/2)‖x − b_i‖² + c_i
[Figure: ‖∇f_{1/2L}(x_k)‖ versus the number of gradient oracle accesses k, comparing the subgradient method, PPM (ours), and Adaptive PPM (ours).]
(A sketch of this objective and the subgradient baseline follows below.)
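A sketch of this experimental objective and the subgradient baseline (assumptions: the coefficients a_i, b_i, c_i are random placeholders rather than the values used in the talk, and the diminishing step size is hand-picked):

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 9, 2
a = rng.uniform(0.5, 2.0, size=m)      # hypothetical coefficients, not from the talk
b = rng.normal(size=(m, d))
c = rng.normal(size=m)

def f_all(x):
    # f_i(x) = (a_i / 2) * ||x - b_i||^2 + c_i, evaluated for all i at once
    return 0.5 * a * np.sum((x - b) ** 2, axis=1) + c

def subgrad_f(x):
    # A subgradient of f(x) = max_i f_i(x): the gradient of any maximizing piece
    i = int(np.argmax(f_all(x)))
    return a[i] * (x - b[i])

x = np.zeros(d)
for k in range(10_000):
    x = x - 0.5 / np.sqrt(k + 1) * subgrad_f(x)   # diminishing-step subgradient method
print(np.max(f_all(x)))                           # final objective value f(x)
```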