Liege University: Francqui Chair 2011-2012
Lecture 3: Huge-scale optimization problems
Yurii Nesterov, CORE/INMA (UCL)
March 9, 2012
Outline
1. Problem sizes
2. Random coordinate search
3. Confidence level of solutions
4. Sparse Optimization problems
5. Sparse updates for linear operators
6. Fast updates in computational trees
7. Simple subgradient methods
8. Application examples
Nonlinear Optimization: problem sizes

Class         Operations   Dimension        Iter. cost    Memory
Small-size    All          10^0 - 10^2      n^4 -> n^3    Kilobyte: 10^3
Medium-size   A^{-1}       10^3 - 10^4      n^3 -> n^2    Megabyte: 10^6
Large-scale   Ax           10^5 - 10^7      n^2 -> n      Gigabyte: 10^9
Huge-scale    x + y        10^8 - 10^12     n -> log n    Terabyte: 10^12

Sources of huge-scale problems:
Internet (new)
Telecommunications (new)
Finite-element schemes (old)
Partial differential equations (old)
Very old optimization idea: Coordinate Search

Problem: $\min_{x \in \mathbb{R}^n} f(x)$ ($f$ is convex and differentiable).

Coordinate relaxation algorithm. For $k \ge 0$ iterate:
1. Choose active coordinate $i_k$.
2. Update $x_{k+1} = x_k - h_k \nabla_{i_k} f(x_k)\, e_{i_k}$, ensuring $f(x_{k+1}) \le f(x_k)$.
($e_i$ is the $i$-th coordinate vector in $\mathbb{R}^n$.)

Main advantage: very simple implementation.
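To make the scheme concrete, here is a minimal Python sketch of coordinate relaxation; the partial-derivative oracle grad_i, the fixed step h = 1/L, and the toy quadratic test problem are illustrative assumptions, not part of the slides.

    import numpy as np

    def coordinate_search(grad_i, x0, L, n_iters=500, rule="greedy"):
        """Coordinate relaxation: x_{k+1} = x_k - h * grad_i(x_k) * e_{i_k}.

        grad_i(x, i) -- oracle returning the i-th partial derivative (assumed)
        L            -- Lipschitz constant of the gradient; fixed step h = 1/L
        rule         -- 'cyclic' or 'greedy' (maximal directional derivative)
        """
        x = x0.copy()
        n = x.size
        for k in range(n_iters):
            if rule == "cyclic":
                i = k % n
            else:  # greedy choice needs all partials, i.e. the whole gradient
                g = np.array([grad_i(x, j) for j in range(n)])
                i = int(np.argmax(np.abs(g)))
            x[i] -= grad_i(x, i) / L      # monotone step: f(x_{k+1}) <= f(x_k)
        return x

    # Toy example: f(x) = 0.5 x^T A x - b^T x, so grad_i(x) = (A x - b)_i
    A = np.array([[3.0, 1.0], [1.0, 2.0]])
    b = np.array([1.0, 1.0])
    x = coordinate_search(lambda x, i: A[i] @ x - b[i], np.zeros(2),
                          L=np.linalg.norm(A, 2))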
Possible strategies

1. Cyclic moves. (Difficult to analyze.)
2. Random choice of coordinate. (Why?)
3. Choose the coordinate with the maximal directional derivative.

Complexity estimate: assume $\|\nabla f(x) - \nabla f(y)\| \le L\, \|x - y\|$ for all $x, y \in \mathbb{R}^n$, and choose $h_k = \frac{1}{L}$. Then
\[
f(x_k) - f(x_{k+1}) \;\ge\; \frac{1}{2L}\, |\nabla_{i_k} f(x_k)|^2 \;\ge\; \frac{1}{2nL}\, \|\nabla f(x_k)\|^2 \;\ge\; \frac{1}{2nLR^2}\, \bigl(f(x_k) - f^*\bigr)^2 .
\]
Hence $f(x_k) - f^* \le \frac{2nLR^2}{k}$ for $k \ge 1$. (For the Gradient Method, drop the factor $n$.)

This is the only known theoretical result for CDM!
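Two standard facts fill in the middle and last inequalities above (a reminder, not spelled out on the slide): for the maximal-directional-derivative rule, and for convex $f$ with level-set radius $R$,
\[
|\nabla_{i_k} f(x_k)| \;=\; \|\nabla f(x_k)\|_\infty \;\ge\; \tfrac{1}{\sqrt{n}}\, \|\nabla f(x_k)\|,
\qquad
f(x_k) - f^* \;\le\; \langle \nabla f(x_k),\, x_k - x^* \rangle \;\le\; R\, \|\nabla f(x_k)\| .
\]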
Criticism

Theoretical justification: complexity bounds are not known for most of the schemes. The only justified scheme needs computation of the whole gradient. (Why not use the Gradient Method then?)

Computational complexity: fast differentiation — if a function is defined by a sequence of operations, then $C(\nabla f) \le 4\, C(f)$. Can we do anything without computing the function's values?

Result: CDMs are almost out of computational practice.
Google problem

Let $E \in \mathbb{R}^{n \times n}$ be an incidence matrix of a graph. Denote $e = (1, \dots, 1)^T$ and
\[
\bar{E} = E \cdot \operatorname{diag}(E^T e)^{-1}, \qquad \text{so that } \bar{E}^T e = e .
\]
Our problem is as follows: find $x^* \ge 0$ such that $\bar{E} x^* = x^*$.

Optimization formulation:
\[
f(x) \;\stackrel{\mathrm{def}}{=}\; \tfrac{1}{2}\, \|\bar{E} x - x\|^2 + \tfrac{\gamma}{2}\, [\langle e, x\rangle - 1]^2 \;\to\; \min_{x \in \mathbb{R}^n}.
\]
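A tiny Python sketch of this formulation, assuming a dense 0/1 matrix E with nonzero column sums; the helper names and the toy graph are illustrative only — at the scale discussed here E would of course be sparse.

    import numpy as np

    def make_Ebar(E):
        """Column-normalize: Ebar = E * diag(E^T e)^{-1}, so that Ebar^T e = e."""
        col_sums = E.sum(axis=0)          # (E^T e)_j, assumed nonzero
        return E / col_sums               # divide column j by its sum

    def f(x, Ebar, gamma):
        """Objective 0.5*||Ebar x - x||^2 + 0.5*gamma*(<e,x> - 1)^2."""
        return (0.5 * np.linalg.norm(Ebar @ x - x) ** 2
                + 0.5 * gamma * (x.sum() - 1.0) ** 2)

    E = np.array([[0., 1., 1.],
                  [1., 0., 0.],
                  [1., 1., 0.]])
    Ebar = make_Ebar(E)
    print(f(np.ones(3) / 3, Ebar, gamma=1.0))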
Huge-scale problems

Main features:
The size is very big ($n \ge 10^7$).
The data is distributed in space.
The requested parts of the data are not always available.
The data is changing in time.

Consequences — even the simplest operations become expensive or infeasible:
Update of the full vector of variables.
Matrix-vector multiplication.
Computation of the objective function's value, etc.
Structure of the Google Problem

Let us look at the gradient of the objective:
\[
\nabla_i f(x) = \langle a_i, g(x)\rangle + \gamma\, [\langle e, x\rangle - 1], \quad i = 1, \dots, n,
\qquad g(x) = \bar{E} x - x \in \mathbb{R}^n, \quad \bar{E} = (a_1, \dots, a_n).
\]

Main observations:
The coordinate move $x_+ = x - h_i \nabla_i f(x)\, e_i$ needs $O(p_i)$ arithmetic operations ($p_i$ is the number of nonzero elements in $a_i$).
The diagonal entries of $\nabla^2 f \stackrel{\mathrm{def}}{=} \bar{E}^T \bar{E} + \gamma\, e e^T$, namely $d_i = \gamma + \frac{1}{p_i}$, are available. We can use them for choosing the step sizes ($h_i = \frac{1}{d_i}$).

Reasonable coordinate choice strategy? Random!
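The O(p_i) claim is easiest to see in code. Below is a hedged sketch (my own illustration, not from the slides): we keep g = Ebar x - x and s = <e, x> as running quantities, so one coordinate move only touches the p_i nonzeros of column a_i. Ebar_csc is assumed to be a scipy.sparse CSC matrix and d the vector of diagonal entries from this slide.

    import numpy as np

    def coordinate_step(i, x, g, s, Ebar_csc, gamma, d):
        """One coordinate move on f; cost O(p_i), where p_i = nnz of column a_i."""
        start, end = Ebar_csc.indptr[i], Ebar_csc.indptr[i + 1]
        rows, vals = Ebar_csc.indices[start:end], Ebar_csc.data[start:end]
        grad_i = vals @ g[rows] + gamma * (s - 1.0)   # <a_i, g(x)> + gamma*(<e,x> - 1)
        delta = -grad_i / d[i]                        # step size h_i = 1 / d_i
        x[i] += delta
        g[rows] += delta * vals                       # update the (Ebar x) part of g
        g[i] -= delta                                 # update the "- x" part of g
        return s + delta                              # new value of <e, x>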
Random coordinate descent methods (RCDM)

$\min_{x \in \mathbb{R}^N} f(x)$ ($f$ is convex and differentiable).

Main Assumption:
\[
|f'_i(x + h_i e_i) - f'_i(x)| \;\le\; L_i\, |h_i|, \qquad h_i \in \mathbb{R}, \quad i = 1, \dots, N,
\]
where $e_i$ is a coordinate vector. Then
\[
f(x + h_i e_i) \;\le\; f(x) + f'_i(x)\, h_i + \frac{L_i}{2}\, h_i^2, \qquad x \in \mathbb{R}^N, \ h_i \in \mathbb{R}.
\]
Define the coordinate steps: $T_i(x) \stackrel{\mathrm{def}}{=} x - \frac{1}{L_i} f'_i(x)\, e_i$. Then
\[
f(x) - f(T_i(x)) \;\ge\; \frac{1}{2 L_i}\, [f'_i(x)]^2, \qquad i = 1, \dots, N.
\]
Random coordinate choice

We need a special random counter $R_\alpha$, $\alpha \in \mathbb{R}$:
\[
\mathrm{Prob}[i] \;=\; p_\alpha^{(i)} \;=\; L_i^\alpha \cdot \Bigl[\sum_{j=1}^{N} L_j^\alpha\Bigr]^{-1}, \qquad i = 1, \dots, N.
\]
Note: $R_0$ generates the uniform distribution.

Method RCDM($\alpha$, $x_0$). For $k \ge 0$ iterate:
1) Choose $i_k = R_\alpha$.
2) Update $x_{k+1} = T_{i_k}(x_k)$.
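Putting the last two slides together, here is a minimal Python sketch of RCDM(alpha, x0); the partial-derivative oracle grad_i, the vector of coordinate constants L, and the toy quadratic are assumptions made for illustration.

    import numpy as np

    def rcdm(grad_i, x0, L, alpha=0.0, n_iters=10000, seed=0):
        """RCDM(alpha, x0): sample i_k from the counter R_alpha, then x <- T_{i_k}(x)."""
        rng = np.random.default_rng(seed)
        p = L**alpha / np.sum(L**alpha)      # Prob[i] = L_i^alpha / sum_j L_j^alpha
        x = x0.copy()
        for _ in range(n_iters):
            i = rng.choice(x.size, p=p)      # i_k = R_alpha
            x[i] -= grad_i(x, i) / L[i]      # x_{k+1} = T_{i_k}(x_k)
        return x

    # Toy quadratic: f(x) = 0.5 x^T A x - b^T x, with coordinate constants L_i = A_ii
    A = np.array([[4.0, 1.0], [1.0, 2.0]]); b = np.array([1.0, 0.0])
    x = rcdm(lambda x, i: A[i] @ x - b[i], np.zeros(2), np.diag(A).copy(), alpha=1.0)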
Complexity bounds for RCDM

We need to introduce the following norms for $x, g \in \mathbb{R}^N$:
\[
\|x\|_\alpha = \Bigl[\sum_{i=1}^{N} L_i^\alpha\, [x^{(i)}]^2\Bigr]^{1/2},
\qquad
\|g\|_\alpha^* = \Bigl[\sum_{i=1}^{N} \frac{1}{L_i^\alpha}\, [g^{(i)}]^2\Bigr]^{1/2}.
\]
After $k$ iterations, RCDM($\alpha$, $x_0$) generates a random output $x_k$, which depends on $\xi_k = \{i_0, \dots, i_k\}$. Denote $\phi_k = E_{\xi_{k-1}} f(x_k)$.

Theorem. For any $k \ge 1$ we have
\[
\phi_k - f^* \;\le\; \frac{2}{k} \cdot \Bigl[\sum_{j=1}^{N} L_j^\alpha\Bigr] \cdot R_{1-\alpha}^2(x_0),
\]
where $R_\beta(x_0) = \max_x \bigl\{ \max_{x^* \in X^*} \|x - x^*\|_\beta : f(x) \le f(x_0) \bigr\}$.
Interpretation

1. $\alpha = 0$. Then $S_0 = N$, and we get
\[
\phi_k - f^* \;\le\; \frac{2N}{k} \cdot R_1^2(x_0).
\]
Note:
We use the metric $\|x\|_1^2 = \sum_{i=1}^{N} L_i\, [x^{(i)}]^2$.
A matrix with diagonal $\{L_i\}_{i=1}^{N}$ can have its norm equal to $N$.
Hence, for GM we can guarantee the same bound. But its cost of iteration is much higher!
Interpretation

2. $\alpha = \frac{1}{2}$. Denote
\[
D_\infty(x_0) = \max_x \Bigl\{ \max_{y \in X^*} \max_{1 \le i \le N} |x^{(i)} - y^{(i)}| : f(x) \le f(x_0) \Bigr\}.
\]
Then $R_{1/2}^2(x_0) \le S_{1/2}\, D_\infty^2(x_0)$, and we obtain
\[
\phi_k - f^* \;\le\; \frac{2}{k} \cdot \Bigl[\sum_{i=1}^{N} L_i^{1/2}\Bigr]^2 \cdot D_\infty^2(x_0).
\]
Note: for first-order methods, the worst-case complexity of minimizing over a box depends on $N$. Since $S_{1/2}$ can be bounded, RCDM can be applied in situations where the usual GM fails.
Interpretation

3. $\alpha = 1$. Then $R_0(x_0)$ is the size of the initial level set in the standard Euclidean norm. Hence,
\[
\phi_k - f^* \;\le\; \frac{2}{k} \cdot \Bigl[\sum_{i=1}^{N} L_i\Bigr] \cdot R_0^2(x_0)
\;\equiv\; \frac{2N}{k} \cdot \Bigl[\frac{1}{N} \sum_{i=1}^{N} L_i\Bigr] \cdot R_0^2(x_0).
\]
The rate of convergence of GM can be estimated as
\[
f(x_k) - f^* \;\le\; \frac{\gamma}{k}\, R_0^2(x_0),
\]
where $\gamma$ satisfies the condition $f''(x) \preceq \gamma \cdot I$, $x \in \mathbb{R}^N$.

Note: the maximal eigenvalue of a symmetric matrix can reach its trace. In the worst case, the rate of convergence of GM is the same as that of RCDM.
Minimizing strongly convex functions

Theorem. Let $f(x)$ be strongly convex with respect to $\|\cdot\|_{1-\alpha}$ with convexity parameter $\sigma_{1-\alpha} > 0$. Then, for $\{x_k\}$ generated by RCDM($\alpha$, $x_0$) we have
\[
\phi_k - f^* \;\le\; \Bigl(1 - \frac{\sigma_{1-\alpha}}{S_\alpha}\Bigr)^k \bigl(f(x_0) - f^*\bigr).
\]
Proof: Let $x_k$ be generated by RCDM after $k$ iterations. Let us estimate the expected result of the next iteration:
\[
f(x_k) - E_{i_k}\bigl(f(x_{k+1})\bigr)
= \sum_{i=1}^{N} p_\alpha^{(i)} \bigl[f(x_k) - f(T_i(x_k))\bigr]
\;\ge\; \sum_{i=1}^{N} \frac{p_\alpha^{(i)}}{2 L_i}\, [f'_i(x_k)]^2
= \frac{1}{2 S_\alpha}\, \bigl(\|f'(x_k)\|^*_{1-\alpha}\bigr)^2
\;\ge\; \frac{\sigma_{1-\alpha}}{S_\alpha}\, \bigl(f(x_k) - f^*\bigr).
\]
It remains to compute the expectation in $\xi_{k-1}$.
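The last inequality in the proof uses the standard consequence of strong convexity in the norm $\|\cdot\|_{1-\alpha}$ (a reminder added here, not stated on the slide):
\[
f(x) - f^* \;\le\; \frac{1}{2\, \sigma_{1-\alpha}}\, \bigl(\|f'(x)\|^*_{1-\alpha}\bigr)^2, \qquad x \in \mathbb{R}^N,
\]
so that $\frac{1}{2 S_\alpha}\, (\|f'(x_k)\|^*_{1-\alpha})^2 \ge \frac{\sigma_{1-\alpha}}{S_\alpha}\, (f(x_k) - f^*)$.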
Confidence level of the answers

Note: we have proved that the expected values of the random $f(x_k)$ are good. Can we guarantee anything after a single run?

Confidence level: the probability $\beta \in (0, 1)$ that some statement about the random output is correct.

Main tool: Chebyshev inequality ($\xi \ge 0$):
\[
\mathrm{Prob}[\xi \ge T] \;\le\; \frac{E(\xi)}{T}.
\]
Our situation:
\[
\mathrm{Prob}[f(x_k) - f^* \ge \epsilon] \;\le\; \frac{1}{\epsilon}\, [\phi_k - f^*] \;\le\; 1 - \beta.
\]
We need $\phi_k - f^* \le \epsilon \cdot (1 - \beta)$. Too expensive for $\beta \to 1$?
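To see why $\beta \to 1$ is expensive here (a numerical illustration, not on the slide): plugging the sublinear bound of the complexity theorem into the requirement $\phi_k - f^* \le \epsilon\, (1 - \beta)$ gives
\[
k \;\ge\; \frac{2\, S_\alpha\, R_{1-\alpha}^2(x_0)}{\epsilon\, (1 - \beta)},
\]
so asking for $\beta = 1 - 10^{-p}$ multiplies the iteration count by $10^{p}$. The regularization trick on the next slide replaces this $\frac{1}{1-\beta}$ factor by $\ln \frac{1}{1-\beta}$.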
Regularization technique

Consider $f_\mu(x) = f(x) + \frac{\mu}{2}\, \|x - x_0\|_{1-\alpha}^2$. It is strongly convex. Therefore, we can obtain $\phi_k - f_\mu^* \le \epsilon \cdot (1 - \beta)$ in
\[
O\Bigl(\frac{1}{\mu}\, S_\alpha\, \ln \frac{1}{\epsilon\, (1 - \beta)}\Bigr) \quad \text{iterations.}
\]

Theorem. Define $\alpha = 1$, $\mu = \frac{\epsilon}{4 R_0^2(x_0)}$, and choose
\[
k \;\ge\; 1 + \frac{8\, S_1\, R_0^2(x_0)}{\epsilon} \Bigl[\ln \frac{2\, S_1\, R_0^2(x_0)}{\epsilon} + \ln \frac{1}{1 - \beta}\Bigr].
\]
Let $x_k$ be generated by RCDM($1$, $x_0$) as applied to $f_\mu$. Then
\[
\mathrm{Prob}\bigl(f(x_k) - f^* \le \epsilon\bigr) \;\ge\; \beta.
\]
Note: $\beta = 1 - 10^{-p}$ $\Rightarrow$ $\ln \frac{1}{1-\beta} = p \ln 10 \approx 2.3\, p$.
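A quick numerical illustration of the theorem's bound (the values of S_1, R_0^2 and epsilon below are made up): the iteration count grows only logarithmically with the confidence level.

    import math

    def rcdm_iterations(S1, R0_sq, eps, beta):
        """Bound from the theorem above: 1 + (8 S1 R0^2/eps) * [ln(2 S1 R0^2/eps) + ln(1/(1-beta))]."""
        A = 2.0 * S1 * R0_sq / eps
        return 1 + 4.0 * A * (math.log(A) + math.log(1.0 / (1.0 - beta)))

    S1, R0_sq, eps = 1.0e6, 1.0, 1.0e-3
    for p in (1, 3, 6, 9):                      # beta = 1 - 10^{-p}
        beta = 1.0 - 10.0 ** (-p)
        print(p, f"{rcdm_iterations(S1, R0_sq, eps, beta):.3e}")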