Big Data Optimization: Randomized lock-free methods for minimizing partially separable convex functions

Peter Richtárik, School of Mathematics, The University of Edinburgh
Joint work with Martin Takáč (Edinburgh)
Les Houches, January 11, 2013
Lock-Free (Asynchronous) Updates

Between the time when $x$ is read by any given processor and an update is computed and applied to $x$ by it, other processors apply their own updates:
$$x_6 \leftarrow x_5 + \mathrm{update}(x_3)$$

[Figure: timeline of iterates $x_2, \ldots, x_7$ from the viewpoint of a single processor — it reads the current iterate ($x_3$), computes an update while other processors write $x_4$ and $x_5$, and then writes to the then-current iterate ($x_5$), producing $x_6$.]
Generic Parallel Lock-Free Algorithm

In general:
$$x_{j+1} = x_j + \mathrm{update}(x_{r(j)})$$
◮ $r(j)$ = index of the iterate current at reading time
◮ $j$ = index of the iterate current at writing time

Assumption: $j - r(j) \leq \tau$, where $\tau + 1 \approx$ # processors.
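To make this timing model concrete, here is a minimal Python sketch of such a lock-free loop (my illustration, not code from the talk; the update rule and all constants are made up). Several threads share one iterate and write to it without any locking, so each update may be applied to a newer iterate than the one it was computed from — exactly the $x_{j+1} = x_j + \mathrm{update}(x_{r(j)})$ pattern above.

```python
import threading
import numpy as np

n, n_threads, steps = 100, 4, 5_000
x = np.random.randn(n)  # shared iterate, accessed by all threads without locks

def update(x_read):
    # toy update rule: a small gradient-like step for f(x) = ||x||^2 / 2
    return -0.001 * x_read

def worker():
    for _ in range(steps):
        x_read = x.copy()   # read: may already be stale when the write happens
        delta = update(x_read)
        x[:] += delta       # write with no synchronization; other threads
                            # may have written to x in the meantime

threads = [threading.Thread(target=worker) for _ in range(n_threads)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("final ||x|| =", np.linalg.norm(x))  # much smaller than at the start
```

The analysis only assumes the staleness bound $j - r(j) \leq \tau$; a real implementation would additionally rely on atomic per-coordinate writes, which this sketch glosses over.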
The Problem and Its Structure

$$\underset{x \in \mathbb{R}^{|V|}}{\text{minimize}} \quad \Big[\, f(x) \equiv \sum_{e \in E} f_e(x) \,\Big] \qquad (\mathrm{OPT})$$

◮ Set of vertices/coordinates: $V$  ($x = (x_v,\, v \in V)$, $\dim x = |V|$)
◮ Set of edges: $E \subset 2^V$
◮ Set of blocks: $B$ (a collection of sets forming a partition of $V$)
◮ Assumption: $f_e$ depends on $x_v$, $v \in e$, only

Example (convex $f : \mathbb{R}^5 \to \mathbb{R}$):
$$f(x) = \underbrace{7(x_1 + x_3)^2}_{f_{e_1}(x)} + \underbrace{5(x_2 - x_3 + x_4)^2}_{f_{e_2}(x)} + \underbrace{(x_4 - x_5)^2}_{f_{e_3}(x)}$$
$$V = \{1,2,3,4,5\}, \quad |V| = 5, \quad e_1 = \{1,3\}, \quad e_2 = \{2,3,4\}, \quad e_3 = \{4,5\}$$
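As an illustration (mine, not from the slides), the five-dimensional example can be coded directly as a list of edges, each carrying the coordinates it touches and its own function:

```python
import numpy as np

# f(x) = 7(x1+x3)^2 + 5(x2-x3+x4)^2 + (x4-x5)^2, with 0-based indexing:
# e1 = {0, 2}, e2 = {1, 2, 3}, e3 = {3, 4}
edges = [
    ([0, 2],    lambda z: 7 * (z[0] + z[1]) ** 2),
    ([1, 2, 3], lambda z: 5 * (z[0] - z[1] + z[2]) ** 2),
    ([3, 4],    lambda z: (z[0] - z[1]) ** 2),
]

def f(x):
    # each f_e sees only the coordinates x_v with v in e
    return sum(f_e(x[e]) for e, f_e in edges)

x = np.array([1.0, 2.0, -1.0, 0.5, 0.0])
print(f(x))  # 7*0^2 + 5*3.5^2 + 0.5^2 = 61.5
```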
Applications

◮ structured stochastic optimization (via Sample Average Approximation)
◮ learning
◮ sparse least-squares
◮ sparse SVMs, matrix completion, graph cuts (see Niu-Recht-Ré-Wright (2011))
◮ truss topology design
◮ optimal statistical designs
PART 1: LOCK-FREE HYBRID SGD/RCD METHODS

Based on: P. Richtárik and M. Takáč, Lock-free randomized first order methods, manuscript, 2013.
Problem-Specific Constants

◮ Edge-Vertex Degree: $\omega_e = |e| = |\{v \in V : v \in e\}|$ (# vertices incident with an edge); average $\bar\omega$, maximum $\omega'$ (relevant if $|B| = |V|$)
◮ Edge-Block Degree: $\sigma_e = |\{b \in B : b \cap e \neq \emptyset\}|$ (# blocks incident with an edge); average $\bar\sigma$, maximum $\sigma'$ (relevant if $|B| > 1$)
◮ Vertex-Edge Degree: $\delta_v = |\{e \in E : v \in e\}|$ (# edges incident with a vertex); average $\bar\delta$, maximum $\delta'$ (not needed!)
◮ Edge-Edge Degree: $\rho_e = |\{e' \in E : e' \cap e \neq \emptyset\}|$ (# edges incident with an edge); average $\bar\rho$, maximum $\rho'$ (relevant if $|E| > 1$)

Remarks:
◮ Our results depend on $\bar\sigma$ (avg Edge-Block degree) and $\bar\rho$ (avg Edge-Edge degree)
◮ The first and second rows are identical if $|B| = |V|$ (blocks correspond to vertices/coordinates)
Example

$$A = \begin{pmatrix} A_1^T \\ A_2^T \\ A_3^T \\ A_4^T \end{pmatrix} = \begin{pmatrix} 5 & 0 & -3 \\ 1.5 & 2.1 & 0 \\ 0 & 0 & 6 \\ 4 & 0 & 0 \end{pmatrix} \in \mathbb{R}^{4 \times 3}, \qquad f(x) = \tfrac{1}{2}\|Ax\|_2^2 = \tfrac{1}{2}\sum_{i=1}^{4}(A_i^T x)^2, \qquad |E| = 4, \ |V| = 3$$

Computation of $\bar\omega$ and $\bar\rho$ (× marks the nonzeros of each row, i.e. the vertices of each edge):

            v1    v2    v3   |  ω_e   ρ_e
    e1      ×           ×    |   2     4
    e2      ×     ×          |   2     3
    e3                  ×    |   1     2
    e4      ×                |   1     3
    δ_v     3     1     2    |

$$\bar\omega = \frac{2+2+1+1}{4} = 1.5, \qquad \bar\rho = \frac{4+3+2+3}{4} = 3$$

Recall: $\rho_e = |\{e' \in E : e' \cap e \neq \emptyset\}|$, $\omega_e = |e|$, $\delta_v = |\{e \in E : v \in e\}|$.
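The whole table can be checked mechanically from the sparsity pattern of $A$; a short Python verification (my sketch, not from the slides):

```python
import numpy as np

A = np.array([[5.0, 0.0, -3.0],
              [1.5, 2.1, 0.0],
              [0.0, 0.0, 6.0],
              [4.0, 0.0, 0.0]])

# edge e_i = set of coordinates that row A_i actually touches
edges = [set(np.flatnonzero(row)) for row in A]

omega = [len(e) for e in edges]                                  # omega_e = |e|
rho = [sum(bool(e & e2) for e2 in edges) for e in edges]         # edge-edge degrees
delta = [sum(v in e for e in edges) for v in range(A.shape[1])]  # vertex-edge degrees

print(omega, np.mean(omega))  # [2, 2, 1, 1] 1.5
print(rho, np.mean(rho))      # [4, 3, 2, 3] 3.0
print(delta)                  # [3, 1, 2]
```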
Algorithm

Iteration $j+1$ looks as follows:
$$x_{j+1} = x_j - \gamma\,|E|\,\sigma_e\,\nabla_b f_e(x_{r(j)})$$

Viewpoint of the processor performing this iteration:
◮ Pick edge $e \in E$, uniformly at random
◮ Pick block $b$ intersecting edge $e$, uniformly at random
◮ Read current $x$ (enough to read $x_v$ for $v \in e$)
◮ Compute $\nabla_b f_e(x)$
◮ Apply update: $x \leftarrow x - \alpha\,\nabla_b f_e(x)$ with $\alpha = \gamma |E| \sigma_e$ and $\gamma > 0$
◮ Do not wait (no synchronization!) and start again!

It is easy to show that $\mathbf{E}[\,|E|\,\sigma_e\,\nabla_b f_e(x)\,] = \nabla f(x)$: edge $e$ is chosen with probability $1/|E|$ and block $b$ with probability $1/\sigma_e$, and the block gradients $\nabla_b f_e$ over blocks intersecting $e$ sum to $\nabla f_e$, so the scaling $|E|\sigma_e$ exactly cancels the sampling probabilities.
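Here is a serial Python sketch of this sampling scheme on a least-squares instance (a hypothetical illustration of the update rule only; the actual method runs this loop concurrently on many processors with no synchronization, and the instance, blocks, and constants below are my own choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# f(x) = (1/2) * sum_i (A_i^T x)^2; edge e_i = sparsity pattern of row i
m, n = 200, 50
A = rng.standard_normal((m, n)) * (rng.random((m, n)) < 0.1)
A[np.arange(m), rng.integers(0, n, size=m)] = rng.standard_normal(m)  # no empty rows
edges = [np.flatnonzero(A[i]) for i in range(m)]

# blocks: contiguous partition of {0, ..., n-1} into pieces of size 5
blocks = [np.arange(j, min(j + 5, n)) for j in range(0, n, 5)]
hit = [[b for b in blocks if np.intersect1d(e, b).size > 0] for e in edges]
sigma = [len(h) for h in hit]  # sigma_e = number of blocks intersecting edge e

gamma, x = 1e-5, rng.standard_normal(n)
print("initial f(x) =", 0.5 * np.sum((A @ x) ** 2))
for _ in range(100_000):
    i = rng.integers(m)                 # pick edge uniformly at random
    b = hit[i][rng.integers(sigma[i])]  # pick an intersecting block uniformly
    g = (A[i] @ x) * A[i, b]            # nabla_b f_e(x), supported on block b
    x[b] -= gamma * m * sigma[i] * g    # step alpha = gamma * |E| * sigma_e
print("final f(x)   =", 0.5 * np.sum((A @ x) ** 2))
```

With a constant stepsize the iterates settle in a neighborhood of the minimizer, so the final objective should be markedly smaller but not exactly zero.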
Main Result

Setup:
◮ $c$ = strong convexity parameter of $f$
◮ $L$ = Lipschitz constant of $\nabla f$
◮ $\|\nabla f(x)\|_2 \leq M$ for all $x$ visited by the method
◮ Starting point: $x_0 \in \mathbb{R}^{|V|}$
◮ $0 < \epsilon < \frac{L}{2}\|x_0 - x_*\|_2^2$
◮ Constant stepsize: $\gamma := \dfrac{c\,\epsilon}{(\bar\sigma + 2\tau\bar\rho/|E|)\,L^2 M^2}$

Result: Under the above assumptions, for
$$k \;\geq\; \left(\bar\sigma + \frac{2\tau\bar\rho}{|E|}\right)\frac{2 L M^2}{c^2 \epsilon}\,\log\!\left(\frac{L\,\|x_0 - x_*\|_2^2}{\epsilon}\right) \;-\; 1,$$
we have
$$\min_{0 \leq j \leq k} \mathbf{E}\,[\,f(x_j) - f^*\,] \leq \epsilon.$$
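A hypothetical Python helper (all constants made up) that plugs numbers into the stepsize and iteration bound above, here with the degree constants of the small example and blocks equal to coordinates, so $\bar\sigma = \bar\omega = 1.5$:

```python
import math

def stepsize_and_bound(c, L, M, eps, sigma_bar, rho_bar, E, tau, dist0_sq):
    """Evaluate gamma and k from the main result; every input is an assumption."""
    Lam = sigma_bar + 2 * tau * rho_bar / E            # the factor Lambda
    gamma = c * eps / (Lam * L ** 2 * M ** 2)          # constant stepsize
    k = Lam * 2 * L * M ** 2 / (c ** 2 * eps) * math.log(L * dist0_sq / eps) - 1
    return gamma, k

print(stepsize_and_bound(c=0.5, L=4.0, M=2.0, eps=1e-2,
                         sigma_bar=1.5, rho_bar=3.0, E=4, tau=3, dist0_sq=1.0))
```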
Special Cases

General result:
$$\underbrace{\left(\bar\sigma + \frac{2\tau\bar\rho}{|E|}\right)}_{\Lambda}\;\underbrace{\frac{2 L M^2}{c^2 \epsilon}\,\log\!\left(\frac{L\,\|x_0 - x_*\|_2^2}{\epsilon}\right) - 1}_{\text{common to all special cases}}$$

special case     | lock-free parallel version of ...                                          | Λ
$|E| = 1$        | Randomized Block Coordinate Descent                                         | $|B| + 2\tau$
$|B| = 1$        | Incremental Gradient Descent (Hogwild! as implemented)                      | $1 + 2\tau\bar\rho/|E|$
$|B| = |V|$      | RAINCODE: RAndomized INcremental COordinate DEscent (Hogwild! as analyzed)  | $\bar\omega + 2\tau\bar\rho/|E|$
$|E| = |B| = 1$  | Gradient Descent                                                            | $1 + 2\tau$
Analysis via a New Recurrence

Let $a_j = \frac{1}{2}\mathbf{E}[\|x_j - x_*\|^2]$.

Nemirovski-Juditsky-Lan-Shapiro:
$$a_{j+1} \leq (1 - 2c\gamma_j)\,a_j + \tfrac{1}{2}\gamma_j^2 M^2$$

Niu-Recht-Ré-Wright (Hogwild!):
$$a_{j+1} \leq (1 - c\gamma)\,a_j + \gamma^2\big(\sqrt{2}\,c\,\omega' M \tau (\delta')^{1/2}\big)\,a_j^{1/2} + \tfrac{1}{2}\gamma^2 M^2 Q,$$
where
$$Q = \omega' + \frac{2\tau\rho'}{|E|} + \frac{4\omega'\rho'\tau}{|E|} + 2\tau^2(\omega')^2(\delta')^{1/2}$$

R.-Takáč:
$$a_{j+1} \leq (1 - 2c\gamma)\,a_j + \tfrac{1}{2}\gamma^2\left(\bar\sigma + \frac{2\tau\bar\rho}{|E|}\right) M^2$$
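To get a feel for the difference, one can iterate the two constant-stepsize recurrences side by side. This is purely illustrative (my code, with arbitrary toy constants, not numbers from the talk):

```python
# Iterate the Hogwild! and R.-Takac recurrences for a_j = (1/2) E||x_j - x*||^2
# with the same gamma, c, M, tau, |E|; the maxima (omega', rho', delta') are
# chosen larger than the averages (sigma_bar, rho_bar), as they must be.
c, M, gamma, tau, E = 0.1, 1.0, 1e-3, 2, 1000
omega_p, rho_p, delta_p = 5, 10, 4   # maxima, used by the Hogwild! bound
sigma_bar, rho_bar = 3.0, 6.0        # averages, used by the R.-Takac bound

Q = (omega_p + 2 * tau * rho_p / E + 4 * omega_p * rho_p * tau / E
     + 2 * tau ** 2 * omega_p ** 2 * delta_p ** 0.5)

a_h = a_rt = 1.0
for _ in range(50_000):
    a_h = ((1 - c * gamma) * a_h
           + gamma ** 2 * (2 ** 0.5) * c * omega_p * M * tau * delta_p ** 0.5 * a_h ** 0.5
           + 0.5 * gamma ** 2 * M ** 2 * Q)
    a_rt = ((1 - 2 * c * gamma) * a_rt
            + 0.5 * gamma ** 2 * (sigma_bar + 2 * tau * rho_bar / E) * M ** 2)
print(f"Hogwild! recurrence:  a = {a_h:.5f}")   # settles near 2, above the start
print(f"R.-Takac recurrence:  a = {a_rt:.5f}")  # contracts to well below 0.01
```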
Parallelization Speedup Factor

$$\mathrm{PSF} = \frac{\Lambda \text{ of serial version}}{(\Lambda \text{ of parallel version})/\tau} = \frac{\bar\sigma}{\left(\bar\sigma + \frac{2\tau\bar\rho}{|E|}\right)/\tau} = \frac{1}{\frac{1}{\tau} + \frac{2\bar\rho}{\bar\sigma |E|}}$$

Three modes:
◮ Brute force (many processors; $\tau$ very large): $\mathrm{PSF} \approx \dfrac{\bar\sigma |E|}{2\bar\rho}$
◮ Favorable structure ($\frac{\bar\rho}{\bar\sigma |E|} \ll \frac{1}{\tau}$; fixed $\tau$): $\mathrm{PSF} \approx \tau$
◮ Special $\tau$ ($\tau = |E|/\bar\rho$): $\mathrm{PSF} = \dfrac{\bar\sigma |E|}{(\bar\sigma + 2)\bar\rho} \approx \tau$
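The three regimes are easy to explore numerically; a small sketch (mine, with invented problem sizes):

```python
def psf(tau, sigma_bar, rho_bar, E):
    # PSF = 1 / (1/tau + 2*rho_bar/(sigma_bar*E))
    return 1.0 / (1.0 / tau + 2.0 * rho_bar / (sigma_bar * E))

sigma_bar, rho_bar, E = 10.0, 100.0, 500_000
for tau in (10, 100, E // 100, 10 ** 9):  # E // 100 equals |E| / rho_bar = 5000
    print(f"tau = {tau:>10}: PSF = {psf(tau, sigma_bar, rho_bar, E):.1f}")

# tau = 100 (favorable structure): PSF ~ tau = 100
# tau = |E|/rho_bar = 5000:        PSF = sigma_bar*|E|/((sigma_bar+2)*rho_bar) ~ 4166.7
# tau very large (brute force):    PSF -> sigma_bar*|E|/(2*rho_bar) = 25000
```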
Improvements vs Hogwild!

If $|B| = |V|$ (blocks = coordinates), then our method coincides with Hogwild! (as analyzed in Niu et al.), up to the stepsize choice:
$$x_{j+1} = x_j - \gamma\,|E|\,\omega_e\,\nabla_v f_e(x_{r(j)})$$

Niu-Recht-Ré-Wright (Hogwild!, 2011):
$$\Lambda = 4\omega' + \frac{24\tau\rho'}{|E|} + 24\tau^2\omega'(\delta')^{1/2}$$

R.-Takáč:
$$\Lambda = \bar\omega + \frac{2\tau\bar\rho}{|E|}$$

Advantages of our approach:
◮ Dependence on averages and not maxima! ($\omega' \to \bar\omega$, $\rho' \to \bar\rho$)
◮ Better constants ($4 \to 1$, $24 \to 2$)
◮ The third large term is not present (no dependence on $\tau^2$ and $\delta'$)
◮ Introduction of blocks ($\Rightarrow$ covers also block coordinate descent, gradient descent, SGD)
◮ Simpler analysis
Modified Algorithm: Global Reads and Local Writes*

Partition the vertices (coordinates) into $\tau + 1$ blocks, $V = b_1 \cup b_2 \cup \cdots \cup b_{\tau+1}$, and assign block $b_i$ to processor $i$, $i = 1, 2, \ldots, \tau + 1$. Processor $i$ will (asynchronously) do:
◮ Pick edge $e \in \{e' \in E : e' \cap b_i \neq \emptyset\}$, uniformly at random (an edge intersecting the block owned by processor $i$)
◮ Update: $x_{j+1} = x_j - \alpha\,\nabla_{b_i} f_e(x_{r(j)})$

Pros and cons:
◮ + good if global reads and local writes are cheap but global writes are expensive (NUMA = Non-Uniform Memory Access)
◮ − we do not have an analysis

*Idea proposed by Ben Recht.
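A Python sketch of this ownership pattern (my own illustration; per the slide there is no analysis, and the code only shows the memory-access discipline: every thread reads all of $x$ but writes only its own block):

```python
import threading
import numpy as np

rng = np.random.default_rng(1)
m, n, n_proc = 400, 60, 4                      # n_proc = tau + 1
A = rng.standard_normal((m, n)) * (rng.random((m, n)) < 0.2)
x = rng.standard_normal(n)                     # shared iterate: global reads
blocks = np.array_split(np.arange(n), n_proc)  # partition of the coordinates
alpha = 1e-3

def worker(b):
    local_rng = np.random.default_rng(int(b[0]))
    own = [i for i in range(m) if np.any(A[i, b])]  # edges intersecting my block
    for _ in range(20_000):
        i = own[local_rng.integers(len(own))]  # pick such an edge uniformly
        g = (A[i] @ x) * A[i, b]               # global read; nabla_{b_i} f_e(x)
        x[b] -= alpha * g                      # local write: only I touch x[b]

threads = [threading.Thread(target=worker, args=(b,)) for b in blocks]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("f(x) =", 0.5 * np.sum((A @ x) ** 2))
```

Because block ownership is exclusive, no two threads ever write the same coordinate, which is why atomic operations are not needed here.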
Experiment 1: rcv1

Dataset size = 1.2 GB; features: $|V| = 47{,}236$; training examples: $|E| = 677{,}399$; testing: 20,242.

[Figure: train error vs. epoch (0 to 2) for 1, 4, and 16 CPUs, each run asynchronously and synchronously; the train error falls from about 0.12 toward 0.03.]
Experiment 2

Artificial problem instance:
$$\text{minimize} \quad f(x) = \tfrac{1}{2}\|Ax\|_2^2 = \tfrac{1}{2}\sum_{i=1}^{m}(A_i^T x)^2.$$
$A \in \mathbb{R}^{m \times n}$; $m = |E| = 500{,}000$; $n = |V| = 50{,}000$.

Three methods:
◮ Synchronous, all = parallel synchronous method with $|B| = 1$
◮ Asynchronous, all = parallel asynchronous method with $|B| = 1$
◮ Asynchronous, block = parallel asynchronous method with $|B| = \tau$ (no need for atomic operations ⇒ additional speedup)

We measure the elapsed time needed to perform $20m$ iterations (20 epochs).
Uniform instance: $|e| = 10$ for all edges
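For reference, a sketch of how such a uniform instance can be generated (my code, with sizes scaled down from the experiment's $m = 500{,}000$, $n = 50{,}000$):

```python
import numpy as np
from scipy.sparse import csr_matrix

rng = np.random.default_rng(0)
m, n, k = 5_000, 500, 10  # scaled down; the experiment uses m = 500_000, n = 50_000

# every edge (row) touches exactly |e| = k coordinates, chosen at random
rows = np.repeat(np.arange(m), k)
cols = np.concatenate([rng.choice(n, size=k, replace=False) for _ in range(m)])
vals = rng.standard_normal(m * k)
A = csr_matrix((vals, (rows, cols)), shape=(m, n))

x = rng.standard_normal(n)
r = A @ x
print("f(x) =", 0.5 * r @ r)
```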