Coordinate Update for Large Scale Optimization (via Asynchronous Parallel Computing)
Wotao Yin (UCLA)
Joint with: Y. T. Chow, B. Edmunds, R. Hannah, Z. Peng, T. Wu (UCLA), Y. Xu (Alabama), M. Yan (Michigan State)
VALSE – September 2016
How much do we need parallel computing?
Back in 1993 (image slide)
2006 (image slide)
2014–2016 (image slides)
35 Years of CPU Trend (figure: number of CPUs, performance per core, and cores per CPU, 1995–2015). Source: D. Henty, Emerging Architectures and Programming Models for Parallel Computing, 2012.
Today: 4x AMD 16-core 3.5 GHz CPUs (64 cores total)
Today: Tesla K80 GPU (2496 cores)
Today: Octa-Core Handsets
Free lunch is over!
How to use all the cores available?
Parallel computing (diagram: one problem divided among agents, which work on pieces t_1, t_2, ..., t_N in parallel)
Parallel speedup
• definition: speedup = serial time / parallel time
• Amdahl's Law: with N agents and ρ the fraction of the computation that is parallel,
  ideal speedup = 1 / (ρ/N + (1 − ρ))
(figure: ideal speedup vs. number of processors for ρ = 25%, 50%, 90%, 95%)
Parallel speedup
• ε := parallel overhead (e.g., startup, synchronization, collection)
• real world: speedup = 1 / (ρ/N + (1 − ρ) + ε)
(figures: speedup vs. number of processors for ρ = 25%, 50%, 90%, 95%, when ε = N (left) and when ε = log(N) (right))
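To make the two formulas concrete, here is a minimal sketch; the overhead scaling constant 1e-4 in the ε = c·log(N) case is an assumption for illustration, not from the slides:

```cpp
#include <cmath>
#include <cstdio>

// Amdahl's law with an optional parallel-overhead term eps:
// speedup = 1 / (rho/N + (1 - rho) + eps).
double speedup(double rho, double N, double eps) {
    return 1.0 / (rho / N + (1.0 - rho) + eps);
}

int main() {
    const double rho = 0.95;  // 95% of the work is parallelizable
    for (double N : {1.0, 10.0, 100.0, 1000.0, 10000.0}) {
        printf("N = %6.0f  ideal = %6.2f  with c*log(N) overhead = %6.2f\n",
               N,
               speedup(rho, N, 0.0),                    // ideal Amdahl speedup
               speedup(rho, N, 1e-4 * std::log(N)));    // c = 1e-4 assumed
    }
}
```

Even a tiny overhead that grows with N eventually caps, then reverses, the speedup, which is the point of the two plots.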
Sync versus Async (diagram: three agents; synchronous execution waits for the slowest agent, leaving idle time; asynchronous execution is non-stop, with no waiting)
Sync versus Async

                      Sync   Async
  sync wait            ✓
  latency               ✓
  bus contention        ✓
  memory contention     ✓
  theory                ✓
  scalability          good   better
Compute more, communicate less
CPU speed ≫ streaming speed = O(bandwidth) ≫ response speed = 1/latency
Decompose-to-parallelize optimization models
Large-sum decomposition
  minimize over x ∈ R^m:  r(x) + (1/N) Σ_{i=1}^N f_i(x)
• interested in large N
• nice structures: the f_i's are smooth and r is proximable
• stochastic approximation methods: SG, SAG, SAGA, SVRG, Finito
  pro: faster than batch methods
  con: they update the entire x ∈ R^m; the model is restricted
Coordinate descent (CD) decomposition
  minimize over x ∈ R^m:  f(x_1, ..., x_m) + Σ_{i=1}^m r_i(x_i)
• f is smooth, the r_i can be nonsmooth
• update variables in a (shuffled) cyclic, random, greedy, or parallel fashion
  pro: faster than the full-update method
  con: nonsmooth functions need to be separable
Solution overview
1. Reformulate the problem as x = T(x) (using dual variables, operator splitting)
2. Apply coordinate update (CU): at iteration k, select i_k and set
   x_i^{k+1} = (T(x^k))_i  if i = i_k,   x_i^{k+1} = x_i^k  otherwise
   (a serial sketch of this step follows below)
3. Parallelize CU without sync or locking
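A serial version of step 2 might look like the following sketch; `Ti` is a hypothetical callback that returns (T(x))_i for whatever operator the reformulation in step 1 produced, and coordinates are selected uniformly at random:

```cpp
#include <functional>
#include <random>
#include <vector>

using Vec = std::vector<double>;

// Serial coordinate update for x = T(x): at each iteration pick one
// coordinate i_k and replace x_{i_k} by (T(x))_{i_k}, leaving the rest.
void coordinate_update(const std::function<double(const Vec&, int)>& Ti,
                       Vec& x, int num_iters, unsigned seed = 0) {
    std::mt19937 gen(seed);
    std::uniform_int_distribution<int> pick(0, static_cast<int>(x.size()) - 1);
    for (int k = 0; k < num_iters; ++k) {
        int i = pick(gen);   // select coordinate i_k
        x[i] = Ti(x, i);     // x_i <- (T(x))_i; all other coordinates unchanged
    }
}
```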
Async-parallel coordinate update
Brief history of async-parallel fixed-point algorithms
• 1969 – a linear equation solver by Chazan and Miranker;
• 1978 – extended to the fixed-point problem by Baudet under an absolute-contraction¹ type of assumption;
• for 20–30 years, used mainly to solve linear, nonlinear, and differential equations, by many people;
• 1989 – Parallel and Distributed Computation: Numerical Methods by Bertsekas and Tsitsiklis; 2000 – review by Frommer and Szyld;
• 1991 – gradient-projection iteration assuming a local linear-error bound, by Tseng;
• 2001 – domain decomposition assuming strong convexity, by Tai and Tseng.
¹ An operator T : R^n → R^n is absolute-contractive if |T(x) − T(y)| ≤ P|x − y| component-wise, where |x| denotes the vector with components |x_i|, i = 1, ..., n, and P ∈ R^{n×n}_+ with ρ(P) < 1.
Simple demo: the x_2 update is delayed; the distance to the solution increases!
Simple demo: if x_1 is updated much more frequently than x_2, then divergence is likely.
Previous theory: absolute-contractive operator
• Absolute-contractive operator T : R^n → R^n: |T(x) − T(y)| ≤ P|x − y| component-wise, where |x| denotes the vector with components |x_i|, i = 1, ..., n, and P ∈ R^{n×n}_+ with ρ(P) < 1
• Interpretation: the full update x^{k+1} = Tx^k must produce iterates x^1, x^2, ... that lie in nested, shrinking boxes
• Result: stable to async-parallelize x^{k+1} = Tx^k; some, but few, applications
Randomized coordinate selection
• select x_i to update with probability p_i, where min_i p_i > 0
• benefits:
  • many more applications
  • automatic load balancing
• drawbacks:
  • requires either global memory or communication
  • pseudo-random number generation takes time
• practice:
  • despite the theory, full randomization is unnecessary
  • it is enough to shuffle the coordinates once (see the sketch below)
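The shuffle-once practice can be as simple as the following sketch (a hypothetical helper; the point is that `std::shuffle` runs once, up front, not once per iteration):

```cpp
#include <algorithm>
#include <numeric>
#include <random>
#include <vector>

// Shuffle the coordinate order once, then sweep it cyclically; in practice
// this behaves like randomized selection while avoiding per-iteration RNG.
std::vector<int> shuffled_order(int m, unsigned seed = 0) {
    std::vector<int> order(m);
    std::iota(order.begin(), order.end(), 0);  // 0, 1, ..., m-1
    std::shuffle(order.begin(), order.end(), std::mt19937(seed));
    return order;
}
```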
Proposed method and theory
ARock²: Async-parallel coordinate update
• problem: x = T(x)
• x = (x_1, ..., x_m) ∈ H_1 × ... × H_m
• sub-operator: S_i(x) := x_i − (T(x))_i
• algorithm: each agent randomly picks i_k ∈ {1, ..., m} and sets
  x_i^{k+1} ← x_i^k − η_k S_i(x^{k−d_k})  if i = i_k,   x_i^{k+1} ← x_i^k  otherwise
• assumptions: nonexpansive T, no locking (d_k is a vector), atomic update
• guarantee: almost-sure weak convergence under proper η_k
² Peng–Xu–Yan–Yin, SISC'16
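A toy shared-memory illustration of this update rule with C++11 threads follows; this is only a sketch of the idea, not TMAC. Here `Si` is an assumed callback evaluating the sub-operator S_i from a possibly stale read of x, and the compare-exchange loop stands in for the atomic-update assumption:

```cpp
#include <atomic>
#include <functional>
#include <random>
#include <thread>
#include <vector>

using AtomicVec = std::vector<std::atomic<double>>;

// Toy ARock: each agent repeatedly picks a random coordinate i, evaluates
// S_i at whatever (possibly stale) x it reads, and atomically applies
// x_i <- x_i - eta * S_i(x_hat). No locks, no barriers, no waiting.
void arock(const std::function<double(const AtomicVec&, int)>& Si,
           AtomicVec& x, double eta, int num_agents, int iters_per_agent) {
    std::vector<std::thread> agents;
    for (int a = 0; a < num_agents; ++a) {
        agents.emplace_back([&, a] {
            std::mt19937 gen(a);  // per-agent RNG
            std::uniform_int_distribution<int> pick(0, (int)x.size() - 1);
            for (int k = 0; k < iters_per_agent; ++k) {
                int i = pick(gen);
                double s = Si(x, i);       // computed from an inconsistent read
                double old = x[i].load();  // atomic single-coordinate update:
                while (!x[i].compare_exchange_weak(old, old - eta * s)) {}
            }
        });
    }
    for (auto& t : agents) t.join();
}
```

Other agents keep writing to x while `Si` reads it, which is exactly the delayed iterate x^{k−d_k} in the algorithm above.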
Nonexpansive operator T
Problem: find x such that x = T(x), or 0 = S(x) where T = I − S.
Assumption (nonexpansiveness): the operator T : H → H is nonexpansive, i.e., ‖T(x) − T(y)‖ ≤ ‖x − y‖ for all x, y ∈ H.
Assumption (existence of a solution): Fix T := {x ∈ H : x = T(x)} is nonempty.
Krasnosel'skiĭ–Mann (KM) iteration
• fixed-point problem: find x such that x = T(x)
• KM iteration: T is nonexpansive; pick η ∈ [ε, 1 − ε] and iterate
  x^{k+1} = x^k − η (I − T)(x^k),   where S := I − T
• why important: it generalizes gradient descent, the proximal-point algorithm, prox-gradient, operator-splitting algorithms such as alternating projection, Douglas–Rachford and ADMM, parallel coordinate descent, ...
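In code, the KM iteration is just a damped fixed-point loop; a minimal sketch, where `T` is any nonexpansive operator the user supplies:

```cpp
#include <functional>
#include <vector>

using Vec = std::vector<double>;

// KM iteration: x^{k+1} = x^k - eta * (I - T)(x^k).
void km_iterate(const std::function<Vec(const Vec&)>& T,
                Vec& x, double eta, int num_iters) {
    for (int k = 0; k < num_iters; ++k) {
        Vec Tx = T(x);
        for (size_t i = 0; i < x.size(); ++i)
            x[i] -= eta * (x[i] - Tx[i]);  // S(x) = x - T(x)
    }
}
```

With eta = 1 this is the plain fixed-point iteration x^{k+1} = T(x^k); eta < 1 is the averaging that makes convergence work for merely nonexpansive T.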
• weak case: if T is nonexpansive and has a fixed point, then x^k converges weakly to a fixed point, with ‖Sx^k‖² = o(1/k)
• strong case: if T is contractive, then it has a unique fixed point and convergence is linear
ARock convergence
notation:
• m = # coordinates
• τ = max async delay
• for simplicity, uniform selection p_i ≡ 1/m
Theorem (known max delay): Assume that T is nonexpansive and has a fixed point. Use step sizes η_k ∈ [ε, 1/(2τ/√m + 1)) for all k. Then, with probability one, x^k ⇀ x* ∈ Fix T.
Consequences:
• O(1) step size if τ ∼ √m
• no need to sync until the number of agents exceeds O(√m)
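The consequence in numbers, as a small helper based on the step-size bound reconstructed above (the example values m = 10^6, τ = 1000 are illustrative):

```cpp
#include <cmath>
#include <cstdio>

// Step-size upper bound from the known-max-delay theorem:
// eta_max = 1 / (2*tau/sqrt(m) + 1).
double eta_max(double m, double tau) {
    return 1.0 / (2.0 * tau / std::sqrt(m) + 1.0);
}

int main() {
    // m = 1e6 coordinates and tau = 1000 = sqrt(m) give eta_max = 1/3:
    // an O(1) step size despite updates being up to 1000 iterations stale.
    printf("eta_max = %.4f\n", eta_max(1e6, 1000));
}
```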
Stochastic (unbounded) delays
• j_{k,i}: delay of x_i at iteration k
• P_ℓ := Pr[max_i {j_{k,i}} ≥ ℓ]: iteration-independent distribution of the max delay
• ∃B such that ∀k, |j_{k,i} − j_{k,i′}| < B: the delays of different coordinates are comparably old at each iteration
Theorem: Assume that T is nonexpansive and has a fixed point. Fix c ∈ (0, 1). Use the fixed step size η = cH for either of the following cases:
1. if Σ_ℓ (ℓ P_ℓ)^{1/2} < ∞, set H = (1 + (1/√m) Σ_ℓ (ℓ^{1/2} + ℓ^{−1/2}) P_ℓ^{1/2})^{−1};
2. if Σ_ℓ ℓ P_ℓ^{1/2} < ∞, set H = (1 + (2/√m) Σ_ℓ ℓ P_ℓ^{1/2})^{−1}.
Then, with probability one, x^k ⇀ x* ∈ Fix T.
Arbitrary unbounded delays
• j_{k,i}: async delay of x_i at iteration k
• j_k = max_i {j_{k,i}}: max delay at iteration k
• lim inf_k j_k < ∞: infinitely many iterations have a bounded delay
Theorem: Assume that T is nonexpansive and has a fixed point. Fix c ∈ (0, 1) and R > 1. Use step sizes
  η_k = c (1 + R^{j_k − 1/2} / (√m (R − 1)))^{−1}.
Then, with probability one, x^{k_bnd} ⇀ x* ∈ Fix T along the bounded-delay iterations k_bnd.
• Optionally, optimize R based on {j_k}.
Numerical results
TMAC: A Toolbox of Async-Parallel, Coordinate, Splitting, and Stochastic Methods
• C++11 multi-threading (Matlab has no shared-memory parallelism)
• plug in your operators; coordinate update and async-parallelism come for free
• github.com/uclaopt/tmac
• committers: Brent Edmunds, Zhimin Peng
• contributors: Yerong Li, Yezheng Li, Tianyu Wu
• supports Windows, Mac, and Linux
ℓ1 logistic regression
• model:
  minimize over x ∈ R^n:  λ‖x‖_1 + (1/N) Σ_{i=1}^N log(1 + exp(−b_i · a_i^T x))   (1)
• sparse numerical linear algebra is used for the datasets news20 and url
(figures: speedup vs. number of threads for sync and async versions against the ideal line, on datasets "news20" and "url")
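For concreteness, here is a dense, serial sketch of one forward-backward step for model (1): a gradient step on the logistic loss followed by soft-thresholding, the prox of λ‖·‖_1. The actual experiments use sparse linear algebra and TMAC's async machinery; `fb_step` and the dense data layout are illustrative assumptions:

```cpp
#include <cmath>
#include <vector>

using Vec = std::vector<double>;

// One forward-backward step for
//   min_x  lambda*||x||_1 + (1/N) sum_i log(1 + exp(-b_i * a_i' x)).
// A stores the rows a_i densely; b holds labels b_i in {-1, +1}.
void fb_step(const std::vector<Vec>& A, const Vec& b,
             Vec& x, double lambda, double step) {
    const size_t N = A.size(), n = x.size();
    Vec grad(n, 0.0);
    for (size_t i = 0; i < N; ++i) {
        double dot = 0.0;
        for (size_t j = 0; j < n; ++j) dot += A[i][j] * x[j];
        double sigma = 1.0 / (1.0 + std::exp(b[i] * dot));  // sigmoid(-b_i a_i'x)
        for (size_t j = 0; j < n; ++j) grad[j] += -b[i] * sigma * A[i][j] / N;
    }
    for (size_t j = 0; j < n; ++j) {
        double z = x[j] - step * grad[j];                 // forward: gradient step
        double t = step * lambda;
        x[j] = (z > t) ? z - t : (z < -t) ? z + t : 0.0;  // backward: soft-threshold
    }
}
```

In the coordinate-update framework, one coordinate of this step (one j in the second loop, with the dot products maintained incrementally) is what each async agent applies.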