Improved Zeroth-Order Variance Reduced Algorithms and Analysis for Nonconvex Optimization
Kaiyi Ji¹, Zhe Wang¹, Yi Zhou², Yingbin Liang¹
¹Ohio State University, ²Duke University
ICML 2019

Zeroth-order (Gradient-free) Nonconvex Optimization

• Problem formulation (a toy instance is sketched below):
      min_{x ∈ R^d} f(x) := (1/n) Σ_{i=1}^n f_i(x)
  ◮ f_i(·): individual nonconvex loss function
  ◮ The gradient of f_i(·) is unknown
  ◮ Only the function value of f_i(·) is accessible
  ◮ Examples:
      Generation of black-box adversarial samples
      Parameter optimization for black-box systems
      Action exploration in reinforcement learning

(Figure: generating black-box adversarial samples)

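As a minimal illustration (not from the paper), the sketch below builds a toy finite-sum objective in which each f_i is a nonconvex sigmoid least-squares loss and only function values are queried; the data A, b, the sizes n and d, and the function names are hypothetical.

    # Toy finite-sum black-box objective: only function values f_i(x) are available.
    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 100, 20                    # number of components and dimension (illustrative)
    A = rng.standard_normal((n, d))   # hypothetical data matrix
    b = rng.integers(0, 2, size=n)    # hypothetical binary labels

    def f_i(i, x):
        """Nonconvex sigmoid least-squares loss of the i-th component (black box)."""
        z = 1.0 / (1.0 + np.exp(-A[i] @ x))
        return (z - b[i]) ** 2

    def f(x):
        """Full objective f(x) = (1/n) * sum_i f_i(x), queried only through values."""
        return float(np.mean([f_i(i, x) for i in range(n)]))
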
Zeroth-order (Gradient-free) Nonconvex Optimization

      min_{x ∈ R^d} f(x) := (1/n) Σ_{i=1}^n f_i(x)

• Standard assumptions on f(·):
  ◮ f(·) is bounded below, i.e., f* = inf_{x ∈ R^d} f(x) > −∞
  ◮ f_i(·) is L-smooth, i.e., ‖∇f_i(x) − ∇f_i(y)‖ ≤ L‖x − y‖
  ◮ (Online case) ∇f_i(·) has bounded variance, i.e., there exists σ > 0 such that
        (1/n) Σ_{i=1}^n ‖∇f_i(x) − ∇f(x)‖² ≤ σ²
• Optimization goal: find an ε-accurate stationary solution, i.e., E‖∇f(x)‖² ≤ ε

Existing Zeroth-Order SVRG

ZO-SVRG (Liu et al., 2018)
• Each outer-loop iteration estimates the gradient by ĝ^s = ∇̂_rand f(x_0^s, u_0^s)
• Each inner-loop iteration computes
      v̂_t^s = (1/|B|) Σ_{i ∈ B} [ ∇̂_rand f_i(x_t^s; u_t^s) − ∇̂_rand f_i(x_0^s; u_0^s) ] + ĝ^s
• Two-point gradient estimator (sketched below):
      ∇̂_rand f_i(x_t^s, u_t^s) = (d/β) ( f_i(x_t^s + β u_t^s) − f_i(x_t^s) ) u_t^s
• u_t^s: smoothing vector; β: smoothing parameter

  Algorithms | Convergence rate | # of function queries
  ZO-SGD     | O(√(d/T))        | O(dε^{-2})
  ZO-SVRG    | O(d/T + 1/|B|)   | O(dε^{-2} + nε^{-1})

  ◮ Issue: ZO-SVRG has a worse query complexity than ZO-SGD

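A minimal sketch of the two-point random-direction estimator above, reusing f_i, d, and rng from the toy objective on slide 2; drawing u uniformly from the unit sphere and the value β = 1e-3 are illustrative assumptions, not necessarily the paper's choices.

    # Two-point random gradient estimator: (d / beta) * (f_i(x + beta*u) - f_i(x)) * u.
    def grad_rand(i, x, u, beta=1e-3):
        """Two-point random-direction gradient estimate of f_i at x (2 function queries)."""
        return (d / beta) * (f_i(i, x + beta * u) - f_i(i, x)) * u

    # Usage: one estimate at x = 0 along a random unit direction.
    x0 = np.zeros(d)
    u = rng.standard_normal(d)
    u /= np.linalg.norm(u)
    g_est = grad_rand(0, x0, u)
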
ZO-SVRG-Coord-Rand vs ZO-SVRG

ZO-SVRG-Coord-Rand (this paper)
• Each outer-loop iteration estimates the gradient by ĝ^s = ∇̂_coord f_S(x_0^s)
  ◮ As a comparison, ZO-SVRG uses ĝ^s = ∇̂_rand f(x_0^s, u_0^s)
• Each inner-loop iteration computes
      v̂_t^s = (1/|B|) Σ_{i ∈ B} [ ∇̂_rand f_i(x_t^s; u_{i,t}^s) − ∇̂_rand f_i(x_0^s; u_{i,t}^s) ] + ĝ^s
  ◮ ZO-SVRG instead uses u_t^s and u_0^s in these two estimators
• ∇̂_coord f(·): coordinate-wise gradient estimator (sketched below)

  Algorithms          | Convergence rate | Function query complexity
  ZO-SGD              | O(√(d/T))        | O(dε^{-2})
  ZO-SVRG             | O(d/T + 1/|B|)   | O(dε^{-2} + nε^{-1})
  ZO-SVRG-Coord-Rand  | O(1/T)           | O(min{dε^{-5/3}, dn^{2/3}ε^{-1}})

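A minimal sketch of a coordinate-wise gradient estimator, again reusing f_i, np, and d from the toy objective; central differences with an assumed smoothing parameter mu = 1e-3 are used here for illustration and may differ from the exact form in the paper.

    # Coordinate-wise gradient estimator: finite differences along each basis vector e_j.
    def grad_coord(i, x, mu=1e-3):
        """Deterministic coordinate-wise gradient estimate of f_i at x (2d queries)."""
        g = np.zeros(d)
        for j in range(d):
            e = np.zeros(d)
            e[j] = mu
            g[j] = (f_i(i, x + e) - f_i(i, x - e)) / (2 * mu)
        return g

    def grad_coord_batch(S, x, mu=1e-3):
        """Outer-loop estimate: average of coordinate-wise estimates over a sample set S."""
        return np.mean([grad_coord(i, x, mu) for i in S], axis=0)

Each component costs O(d) function queries instead of two, but the resulting estimate is far more accurate, which is what allows the larger stepsize and faster rate in the tables above.
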
Sharp Analysis for ZO-SVRG-Coord (Liu et al., 2018)

ZO-SVRG-Coord (Liu et al., 2018)
• Each outer-loop iteration estimates the gradient by ĝ^s = ∇̂_coord f_S(x_0^s)
• Each inner-loop iteration computes (a sketch of one epoch follows this slide)
      v̂_t^s = (1/|B|) Σ_{i ∈ B} [ ∇̂_coord f_i(x_t^s; u_{i,t}^s) − ∇̂_coord f_i(x_0^s; u_{i,t}^s) ] + ĝ^s

  Algorithms                    | Stepsize | Convergence rate | Function query complexity
  ZO-SVRG-Coord                 | O(1/d)   | O(d/T)           | O((dn + d²)ε^{-1} + dn)
  ZO-SVRG-Coord (our analysis)  | O(1)     | O(1/T)           | O(min{dε^{-5/3}, dn^{2/3}ε^{-1}})

Key idea:
• Coordinate-wise gradient estimator → high estimation accuracy → faster convergence rate

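A minimal sketch of one variance-reduced epoch in the style of ZO-SVRG-Coord, built on the toy objective and estimators above; the stepsize eta, epoch length m, mini-batch size, and the function name zo_svrg_coord_epoch are illustrative choices, not the parameters of either analysis.

    # One SVRG-style epoch with coordinate-wise zeroth-order gradient estimates.
    def zo_svrg_coord_epoch(x_start, eta=0.1, m=10, batch=5):
        g_hat = grad_coord_batch(range(n), x_start)   # outer-loop estimate at the anchor point
        x = x_start.copy()
        for _ in range(m):
            B = rng.choice(n, size=batch, replace=False)
            correction = np.mean(
                [grad_coord(i, x) - grad_coord(i, x_start) for i in B], axis=0)
            v = correction + g_hat                    # variance-reduced gradient estimate
            x = x - eta * v                           # inner-loop descent step
        return x

    # Usage: a few epochs starting from the origin.
    x = np.zeros(d)
    for s in range(3):
        x = zo_svrg_coord_epoch(x)
    print(f"f(x) after 3 epochs: {f(x):.4f}")
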
More Results

• Develop a faster zeroth-order SPIDER-type algorithm
• Develop improved zeroth-order algorithms for
  ◮ nonconvex nonsmooth optimization
  ◮ convex smooth optimization
  ◮ optimization under the Polyak-Łojasiewicz (PL) condition
• Experiments: generating black-box adversarial examples for DNNs
  (Figure: loss versus number of iterations and number of function queries for ZO-SGD, ZO-SVRG-Ave, SPIDER-SZO, ZO-SVRG-Coord, ZO-SVRG-Coord-Rand, and ZO-SPIDER-Coord)

Thanks!