Stochastic Optimization for DC Functions and Non-smooth Non-convex Regularizers with Non-asymptotic Convergence

Yi Xu (1), Qi Qi (1), Qihang Lin (1), Rong Jin (2), Tianbao Yang (1)
(1) The University of Iowa; (2) Damo Academy at Alibaba

June 12, 2019, ICML, Long Beach, CA

Yi Xu (CS@UI) SSDC June 12, 2019

Non-Convex and Non-smooth Optimization

A family of non-convex non-smooth optimization problems:

\[ \min_{x \in \mathbb{R}^d} F(x) := g(x) - h(x) + r(x), \tag{1} \]

◮ $g(\cdot)$, $h(\cdot)$: real-valued lower-semicontinuous convex
◮ $r(\cdot)$: proper lower-semicontinuous
◮ Stochastic form: $g(x) = \mathbb{E}_{\xi}[g(x; \xi)]$, $h(x) = \mathbb{E}_{\varsigma}[h(x; \varsigma)]$
◮ Finite-sum (a special case): $g(x) = \frac{1}{n_1}\sum_{i=1}^{n_1} g_i(x)$, $h(x) = \frac{1}{n_2}\sum_{j=1}^{n_2} h_j(x)$

It covers many applications:
◮ Non-Convex Sparsity-Promoting Regularizers: LSP, MCP, SCAD, capped $\ell_1$, transformed $\ell_1$
◮ Weakly convex
◮ Least-squares Regression with $\ell_{1-2}$ Regularization
◮ Positive-Unlabeled (PU) Learning

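To make the DC structure concrete, here is a minimal sketch for one of the regularizers listed above, the capped $\ell_1$ penalty in one dimension. It uses the standard identity $\min(|x|, \theta) = |x| - \max(|x| - \theta, 0)$, in which both terms on the right are convex. The function names and the threshold `theta` are illustrative assumptions, not notation from the slides.

```python
def capped_l1(x, theta):
    """Capped l1 penalty: min(|x|, theta). Non-convex and non-smooth."""
    return min(abs(x), theta)


def convex_part(x, theta):
    """First convex piece of the decomposition: |x| (theta is unused)."""
    return abs(x)


def subtracted_convex_part(x, theta):
    """Second convex piece: max(|x| - theta, 0), which is subtracted."""
    return max(abs(x) - theta, 0.0)


# Check the identity capped_l1 = convex_part - subtracted_convex_part
# on a few sample points (theta = 1.0 is an arbitrary choice).
theta = 1.0
points = [-3.0, -1.0, -0.25, 0.0, 0.5, 1.0, 2.5]
gaps = [
    abs(capped_l1(x, theta) - (convex_part(x, theta) - subtracted_convex_part(x, theta)))
    for x in points
]
```

Every entry of `gaps` should be zero up to floating-point rounding, which is what "difference of convex functions" means for this penalty.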
Main Goal

Critical Point: a point $\bar{x}$ s.t.

\[ \partial h(\bar{x}) \cap \hat{\partial}(g + r)(\bar{x}) \neq \emptyset. \]

◮ $\hat{\partial} f(x)$: Fréchet subgradient; $\partial f(x)$: limiting subgradient

An $\epsilon$-Critical Point: a point $\bar{x}$ s.t.

\[ \mathrm{dist}\big(\partial h(\bar{x}), \hat{\partial}(g + r)(\bar{x})\big) \le \epsilon. \]

◮ If $g + r$ is non-differentiable, finding an $\epsilon$-critical point is challenging.
◮ An example: $g = |x|$, $h = r = 0$; then $\mathrm{dist}(0, \partial |x|) = 1$ whenever $x \neq 0$.

Goal: finding a Nearly $\epsilon$-Critical Point $x$, i.e., a point for which there exists $\bar{x}$ such that

\[ \|x - \bar{x}\| \le O(\epsilon), \qquad \mathrm{dist}\big(\partial h(\bar{x}), \hat{\partial}(g + r)(\bar{x})\big) \le \epsilon. \tag{2} \]

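The $|x|$ example can be checked in a few lines. The sketch below hard-codes the subdifferential of the absolute value ($[-1, 1]$ at the origin, $\{\mathrm{sign}(x)\}$ elsewhere) and contrasts the $\epsilon$-critical test with the nearly $\epsilon$-critical test for $g = |x|$, $h = r = 0$. The helper names are hypothetical, and the constant hidden in $O(\epsilon)$ is taken to be 1 purely for illustration.

```python
def dist_zero_to_subdiff_abs(x):
    """Distance from 0 to the subdifferential of |.| at x."""
    if x == 0.0:
        return 0.0  # the subdifferential is [-1, 1], which contains 0
    return 1.0      # the subdifferential is {sign(x)}, at distance 1 from 0


def is_eps_critical(x, eps):
    """eps-critical for g = |x|, h = r = 0: dist(0, subdiff |x|) <= eps."""
    return dist_zero_to_subdiff_abs(x) <= eps


def is_nearly_eps_critical(x, eps):
    """Nearly eps-critical: some x_bar within eps of x is eps-critical.

    For this example the only candidate is x_bar = 0 (assumed O(eps) = eps).
    """
    x_bar = 0.0
    return abs(x - x_bar) <= eps and is_eps_critical(x_bar, eps)


eps = 1e-3
x = 1e-6  # extremely close to the minimizer, yet not eps-critical
status = (is_eps_critical(x, eps), is_nearly_eps_critical(x, eps))
```

Here `status` is `(False, True)`: a point arbitrarily close to the minimizer still fails the $\epsilon$-critical test, which is why the weaker notion in (2) is the target.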
Stagewise Stochastic DC algorithm (SSDC-A)

When $r(x)$ is convex, assume that the proximal mapping of $r(x)$ can be easily computed:

\[ \mathrm{prox}_{\eta r}(y) = \arg\min_{x \in \mathbb{R}^d} \frac{1}{2\eta}\|x - y\|^2 + r(x). \]

Basic idea: solve a convex majorant function stage by stage.

Stagewise Stochastic DC (SSDC) Algorithm [1, 2, 3]
for $k = 1, \ldots, K$ do
    $F^{\gamma}_{x_k}(x) = g(x) + r(x) - \big(h(x_k) + \partial h(x_k)^{\top}(x - x_k)\big) + \frac{\gamma}{2}\|x - x_k\|^2$
    $x_{k+1} = \mathcal{A}(F^{\gamma}_{x_k})$
end for

◮ $\mathcal{A}$: a stochastic algorithm (e.g., SPG, AdaGrad, SVRG) applied to $F^{\gamma}_{x_k}(x)$
◮ $\mathcal{A}$ finds $x_{k+1}$ such that $\mathbb{E}\big[F^{\gamma}_{x_k}(x_{k+1}) - \min_{x \in \mathbb{R}^d} F^{\gamma}_{x_k}(x)\big] \le c/k$.

[1] Dinh, T. P., Souad, E. B. North-Holland Mathematics Studies, pp. 249-271, 1986.
[2] Thi, H. A. L., Le, H. M., Phan, D. N., and Tran, B. In ICML, pp. 3394-3403, 2017.
[3] Wen, B., Chen, X., and Pong, T. K. Computational Optimization and Applications, 69(2):297-324, 2018.

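The stagewise loop can be sketched on a toy instance of least-squares regression with $\ell_{1-2}$ regularization, taking $g(x) = \frac{1}{2n}\sum_i (a_i^{\top}x - b_i)^2$, $h(x) = \lambda\|x\|_2$ and $r(x) = \lambda\|x\|_1$, with a stochastic proximal gradient (SPG) inner solver. Everything below is an assumption made for illustration: the synthetic data, the constant step size `eta`, the inner schedule of `50 * k` iterations, and returning the last inner iterate. None of these choices is taken from the slides or tuned to match the stated complexity.

```python
import math
import random


def soft_threshold(v, t):
    """Proximal mapping of t * ||.||_1: coordinate-wise shrinkage."""
    return [math.copysign(max(abs(vi) - t, 0.0), vi) for vi in v]


def objective(x, data, lam):
    """F(x) = g(x) - h(x) + r(x) for the toy l_{1-2} least-squares problem."""
    g = sum((sum(a * xi for a, xi in zip(row, x)) - b) ** 2
            for row, b in data) / (2 * len(data))
    h = lam * math.sqrt(sum(xi * xi for xi in x))
    r = lam * sum(abs(xi) for xi in x)
    return g - h + r


def ssdc_spg(data, lam, x0, stages=20, gamma=1.0, eta=0.05, seed=0):
    """Stagewise DC loop with an SPG inner solver (illustrative sketch)."""
    rng = random.Random(seed)
    x_k = list(x0)
    for k in range(1, stages + 1):
        # Subgradient of h at x_k; 0 is a valid choice at the origin.
        norm = math.sqrt(sum(xi * xi for xi in x_k))
        v = [lam * xi / norm for xi in x_k] if norm > 0 else [0.0] * len(x_k)
        x = list(x_k)
        for _ in range(50 * k):  # assumed schedule: more work in later stages
            row, b = rng.choice(data)  # sample one component of g
            resid = sum(a * xi for a, xi in zip(row, x)) - b
            # Stochastic gradient of the smooth part of the stage objective.
            grad = [resid * a - vi + gamma * (xi - ci)
                    for a, vi, xi, ci in zip(row, v, x, x_k)]
            # Proximal step handles r(x) = lam * ||x||_1.
            x = soft_threshold([xi - eta * gi for xi, gi in zip(x, grad)],
                               eta * lam)
        x_k = x
    return x_k


# Synthetic sparse regression data (hypothetical, for demonstration only).
data_rng = random.Random(1)
x_true = [2.0, -1.5, 0.0, 0.0, 0.0]
data = []
for _ in range(200):
    row = [data_rng.uniform(-1.0, 1.0) for _ in x_true]
    b = sum(a * xi for a, xi in zip(row, x_true)) + data_rng.gauss(0.0, 0.01)
    data.append((row, b))

x0 = [0.0] * len(x_true)
x_hat = ssdc_spg(data, lam=0.1, x0=x0)
f_start, f_end = objective(x0, data, 0.1), objective(x_hat, data, 0.1)
```

Each stage linearizes $h$ at $x_k$, so the inner problem is strongly convex and any stochastic solver for convex composite problems could replace the SPG loop. Comparing `f_end` with `f_start` gives a quick sanity check that the objective decreased.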
Summary of Results ($r$ is convex)

Table: Summary of results for finding a (nearly) $\epsilon$-critical point of the problem (1)

g     h     r         Algorithm A     Complexity
-     SM    CX        SPG, AdaGrad    $O(1/\epsilon^4)$
SM    SM    CX        SVRG            $O(n/\epsilon^2)$
SM    -     CX, SM    SPG, AdaGrad    $O(1/\epsilon^4)$
SM    -     CX, SM    SVRG            $O(n/\epsilon^2)$

SM: smooth; CX: convex. $n$: the total number of components in a finite-sum problem.

Non-Smooth Non-Convex Regularization

When $r(x)$ is non-convex, the challenge is the presence of the non-smooth non-convex function $r$.

The Moreau envelope of $r$ ($\mu > 0$) is a DC function [4]:

\[ r_{\mu}(x) = \min_{y \in \mathbb{R}^d} \left\{ \frac{1}{2\mu}\|y - x\|^2 + r(y) \right\} = \frac{1}{2\mu}\|x\|^2 - \underbrace{\max_{y \in \mathbb{R}^d} \left\{ \frac{1}{\mu} y^{\top}x - \frac{1}{2\mu}\|y\|^2 - r(y) \right\}}_{R_{\mu}(x)}. \]

Key idea: solving the following DC problem,

\[ \min_{x \in \mathbb{R}^d} F_{\mu}(x) := g(x) - h(x) + \frac{1}{2\mu}\|x\|^2 - R_{\mu}(x). \]

[4] Liu, T., Pong, T. K., and Takeda, A. Mathematical Programming, 2018.

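The DC form of the Moreau envelope can be verified numerically in one dimension. The sketch below takes $r(y) = \min(|y|, 1)$ (a capped $\ell_1$ penalty, chosen here only as an example of a non-convex, non-smooth $r$), approximates both the minimization and the maximization over $y$ by a search on a fixed grid, and compares the two sides of the identity. The grid range, its resolution, and $\mu = 0.5$ are arbitrary assumptions.

```python
def r(y, theta=1.0):
    """Example non-convex, non-smooth regularizer: capped l1."""
    return min(abs(y), theta)


# Fixed search grid for y on [-5, 5]; stands in for the exact min and max.
GRID = [i / 100.0 for i in range(-500, 501)]


def moreau_envelope(x, mu):
    """r_mu(x) = min_y { (y - x)^2 / (2 mu) + r(y) }, minimized over GRID."""
    return min((y - x) ** 2 / (2 * mu) + r(y) for y in GRID)


def R_mu(x, mu):
    """R_mu(x) = max_y { y x / mu - y^2 / (2 mu) - r(y) }, maximized over GRID.

    A pointwise maximum of functions affine in x, hence convex in x.
    """
    return max(y * x / mu - y * y / (2 * mu) - r(y) for y in GRID)


mu = 0.5
samples = [-2.0, -0.3, 0.0, 0.7, 2.5]
# Gap between r_mu(x) and x^2 / (2 mu) - R_mu(x) at each sample point.
identity_gaps = [
    abs(moreau_envelope(x, mu) - (x * x / (2 * mu) - R_mu(x, mu)))
    for x in samples
]
# Midpoint convexity check for R_mu on one pair of points.
a, b = -1.5, 2.0
midpoint_slack = (R_mu(a, mu) + R_mu(b, mu)) / 2 - R_mu((a + b) / 2, mu)
```

The entries of `identity_gaps` vanish up to rounding because expanding $\frac{1}{2\mu}(y - x)^2$ turns the minimization into the stated maximization term by term, and a non-negative `midpoint_slack` is consistent with $R_{\mu}$ being convex.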
Summary of Results ($r$ is non-convex)

Table: Summary of results for finding a (nearly) $\epsilon$-critical point of the problem (1)

g     h     r                 Algorithm A    Complexity
SM    SM    NC, NS, LP        SPG            $O(1/\epsilon^8)$
SM    SM    NC, NS, FV, LB    SPG            $O(1/\epsilon^{12})$
SM    SM    NC, NS, LP        SVRG           $O(n/\epsilon^8)$
SM    SM    NC, NS, FV, LB    SVRG           $O(n/\epsilon^6)$
SM    SM    NC, NS, FVC       SVRG           $O(n/\epsilon^6)$

SM: smooth; CX: convex; NC: non-convex; NS: non-smooth; LP: Lipschitz continuous function; LB: lower bounded over $\mathbb{R}^d$; FV: finite-valued over $\mathbb{R}^d$; FVC: finite-valued over a compact set.

Thank You!
Poster #109, Pacific Ballroom, 06:30-09:00 PM