
Resource Oblivious Parallel Computing, Vijaya Ramachandran (PowerPoint presentation)



  1. Resource Oblivious Parallel Computing. Vijaya Ramachandran, Department of Computer Science, University of Texas at Austin. Joint work with Richard Cole. Reference: R. Cole, V. Ramachandran, “Efficient Resource Oblivious Algorithms for Multicores”. http://arxiv.org/abs/1103.4071

  2. THE MULTICORE ERA
• Chip Multiprocessors (CMP) or Multicores: due to power consumption and other reasons, microprocessors are being built with multiple cores on a chip. Dual-cores are already on most desktops, and the number of cores is expected to increase (dramatically) for the foreseeable future.
• The multicore era represents a paradigm shift in general-purpose computing.
• Computer science research needs to address the multitude of challenges that come with this shift.

  3. ALGORITHMS: VON NEUMANN ERA VS. MULTICORE
To move successfully from the von Neumann era to the emerging multicore era, we need to develop algorithms that:
• exploit both parallelism and cache-efficiency;
• are portable, i.e., independent of machine parameters.
Even better would be a resource oblivious computation, where both the algorithm and the run-time system are independent of machine parameters.

  4. MULTICORE COMPUTATION MODEL
We model the multicore computation with:
– a multithreaded algorithm that generates parallel tasks (“threads”);
– a run-time scheduler that schedules parallel tasks across cores (our scheduler has a distributed implementation);
– a shared memory with caches;
– data organized in blocks, with cache coherence to enforce data consistency across cores;
– communication cost in terms of cache miss costs, including costs incurred through false sharing.
Our main results are for multicores with private caches.

  5. OUR RESULTS
• The class of Hierarchical Balanced Parallel (HBP) algorithms.
• HBP algorithms for scans, matrix computations, FFT, etc., building on known algorithms.
• A new HBP sorting algorithm, SPMS: Sample, Partition, and Merge Sort.
• Techniques to reduce the adverse effects of false sharing: limited access writes, O(1) block sharing, and gapping.

  6. OUR RESULTS (CONTINUED)
• The Priority Work Stealing Scheduler (PWS).
• The cache miss overhead of HBP algorithms, when scheduled by PWS, is bounded by the sequential cache complexity, even when the cost of false sharing is included, given a suitable ‘tall cache’ (for large inputs that do not fit in the caches).
At the end of the talk, we address multi-level cache hierarchies [Chowdhury-Silvestri-B-R’10] and other parallel models.

  7. [Figure-only slide.]

  8. ROAD MAP
• Background on multithreaded computations and work stealing.
• Cache and block misses.
• Hierarchical Balanced Parallel (HBP) computations.
• Priority Work Stealing (PWS) Scheduler.
• An example with Strassen’s matrix multiplication algorithm.
• Discussion.

  9. MULTITHREADED COMPUTATIONS
M-Sum(A[1..n], s)   % Returns s = Σ_{i=1}^{n} A[i]
  if n = 1 then return s := A[1] end if
  fork( M-Sum(A[1..n/2], s1) ; M-Sum(A[n/2+1..n], s2) )
  join: return s = s1 + s2
• Sequential execution computes recursively in a dfs traversal of this computation tree.

  10. MULTITHREADED COMPUTATIONS
M-Sum(A[1..n], s)   % Returns s = Σ_{i=1}^{n} A[i]
  if n = 1 then return s := A[1] end if
  fork( M-Sum(A[1..n/2], s1) ; M-Sum(A[n/2+1..n], s2) )
  join: return s = s1 + s2
• Sequential execution computes recursively in a dfs traversal of this computation tree.
• Forked tasks can run in parallel.
• Runs on p cores in O(n/p + log p) parallel steps by forking log p times to generate p parallel tasks.
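The fork/join pattern in M-Sum maps directly onto modern threading facilities. Below is a minimal C++ sketch of the same computation using std::async; the sequential grain-size cutoff GRAIN is an addition of mine to keep the number of spawned threads bounded, and is not part of the slide's pseudocode.

```cpp
#include <future>
#include <vector>

// Minimal sketch of M-Sum with std::async (illustrative; the GRAIN
// cutoff is an assumption added here, not in the original pseudocode).
constexpr std::size_t GRAIN = 1024;

long msum(const std::vector<long>& a, std::size_t lo, std::size_t hi) {
    if (hi - lo <= GRAIN) {               // base case: sum sequentially
        long s = 0;
        for (std::size_t i = lo; i < hi; ++i) s += a[i];
        return s;
    }
    std::size_t mid = lo + (hi - lo) / 2;
    // fork: the first half may run on another core...
    auto s1 = std::async(std::launch::async, msum, std::cref(a), lo, mid);
    // ...while this thread continues with the second half.
    long s2 = msum(a, mid, hi);
    return s1.get() + s2;                 // join: combine the partial sums
}
```

With std::launch::async every fork gets its own thread; a work-stealing runtime, described next, instead multiplexes forks onto a fixed pool of worker cores.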

  11. WORK-STEALING PARALLEL EXECUTION
M-Sum(A[1..n], s)   % Returns s = Σ_{i=1}^{n} A[i]
  if n = 1 then return s := A[1] end if
  fork( M-Sum(A[1..n/2], s1) ; M-Sum(A[n/2+1..n], s2) )
  join: return s = s1 + s2

  12. WORK-STEALING PARALLEL EXECUTION
M-Sum(A[1..n], s)   % Returns s = Σ_{i=1}^{n} A[i]
  if n = 1 then return s := A[1] end if
  fork( M-Sum(A[1..n/2], s1) ; M-Sum(A[n/2+1..n], s2) )
  join: return s = s1 + s2
• Computation starts at the first core C.
• At each fork, the second forked task is placed on C’s task queue T.
• Computation continues at C (in sequential order), with tasks popped from the tail of T as needed.
• The task at the head of T is available to be stolen by other cores that are idle.
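A hedged sketch of the task-queue discipline just described: the owner core pushes and pops at the tail, while idle cores steal from the head. The coarse single-mutex locking and the names below are simplifications of mine; practical work stealers use lock-free deques such as Chase-Lev.

```cpp
#include <deque>
#include <functional>
#include <mutex>
#include <optional>

// Illustrative owner/thief task queue (coarse locking for clarity only).
class TaskQueue {
    std::deque<std::function<void()>> q_;
    std::mutex m_;
public:
    // Owner: at each fork, push the second forked task at the tail.
    void push(std::function<void()> t) {
        std::lock_guard<std::mutex> g(m_);
        q_.push_back(std::move(t));
    }
    // Owner: continue in sequential order, popping from the tail.
    std::optional<std::function<void()>> pop() {
        std::lock_guard<std::mutex> g(m_);
        if (q_.empty()) return std::nullopt;
        auto t = std::move(q_.back());
        q_.pop_back();
        return t;
    }
    // Thief: an idle core steals the oldest task from the head.
    std::optional<std::function<void()>> steal() {
        std::lock_guard<std::mutex> g(m_);
        if (q_.empty()) return std::nullopt;
        auto t = std::move(q_.front());
        q_.pop_front();
        return t;
    }
};
```

Stealing from the head matters: the oldest queued tasks correspond to the largest unexecuted subtrees of the computation, so each steal transfers substantial work and steals remain rare.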

  13. [Figure-only slide.]

  14. WORK-STEALING
• Work-stealing is a well-known scheduling method, with various heuristics used for the stealing protocol.
• Randomized work-stealing (RWS) has provably good parallel speed-up on fairly general computation dags [Blumofe-Leiserson 1999].
• Caching bounds for RWS are derived in [ABB02, Frigo-Strumpen10, BGN10], and more recently in [Cole-R11]. None of these cache miss bounds are optimal.

  15. ROAD MAP
• Background on multithreaded computations and work stealing.
• Cache and block misses.
• Hierarchical Balanced Parallel (HBP) computations.
• Priority Work Stealing (PWS) Scheduler.
• An example with Strassen’s matrix multiplication algorithm.
• Discussion.

  16. CACHE MISSES
Definition. Let τ be a task that accesses r data items (i.e., words) during its execution. We say that r = |τ| is the size of τ. τ is f-cache friendly if these data items are contained in O(r/B + f(r)) blocks. A multithreaded computation C is f-cache friendly if every task in C is f-cache friendly.

  17. CACHE MISSES
Definition. Let τ be a task that accesses r data items (i.e., words) during its execution. We say that r = |τ| is the size of τ. τ is f-cache friendly if these data items are contained in O(r/B + f(r)) blocks. A multithreaded computation C is f-cache friendly if every task in C is f-cache friendly.
Lemma. A stolen task τ incurs an additional O(min{M, |τ|}/B + f(|τ|)) cache misses compared to the steal-free sequential execution. If f(|τ|) = O(|τ|/B) and |τ| ≥ 2M, there is zero asymptotic excess, i.e., the excess is bounded by the sequential cache miss cost.
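To make the lemma concrete, here is a worked instance under assumptions of my own (a scan-like task, not an example from the talk):

```latex
% Worked instance of the lemma under illustrative assumptions:
% a scan-like task with f(r) = r/B, and |\tau| >= 2M.
\[
  O\!\left(\frac{\min\{M,\,|\tau|\}}{B} + f(|\tau|)\right)
  \;=\; O\!\left(\frac{M}{B} + \frac{|\tau|}{B}\right)
  \;=\; O\!\left(\frac{|\tau|}{B}\right)
  \qquad\text{since } M \le \tfrac{|\tau|}{2}.
\]
% Any execution must incur \Omega(|\tau|/B) misses just to read |\tau|
% words, so the steal adds at most a constant factor: zero asymptotic
% excess over the sequential cache miss cost.
```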

  18. FALSE SHARING
• False sharing, and more generally block misses, occur when there is at least one write to a shared block.
• In such shared block accesses, delay is incurred by the participating cores: control of the block is given to a writing core, and the other cores wait for the block to be updated with the value of the write.
• A typical cache coherence protocol invalidates the copies of the block at the remaining cores when it transfers control of the block to the writing core. The delay at the remaining cores is at least that of one cache miss, and could be more.
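The effect is easy to reproduce. The sketch below is an illustration of mine, not from the talk: two threads each write only their own counter, but because both counters occupy one block, the coherence protocol ping-pongs the block between the two cores; padding each counter to its own block (assuming 64-byte blocks) removes the effect.

```cpp
#include <atomic>
#include <thread>

// Two counters that share one cache block: writes by different cores
// invalidate each other's copy even though no word is actually shared.
struct SharedBlock {
    std::atomic<long> c0{0};
    std::atomic<long> c1{0};              // same block as c0: false sharing
};

// Padded variant: alignas(64) assumes a 64-byte block, an assumption
// of this sketch rather than something the talk specifies.
struct PaddedBlock {
    alignas(64) std::atomic<long> c0{0};
    alignas(64) std::atomic<long> c1{0};  // own block: no false sharing
};

template <typename Counters>
void hammer(Counters& c) {
    std::thread t0([&] { for (int i = 0; i < 1'000'000; ++i) c.c0++; });
    std::thread t1([&] { for (int i = 0; i < 1'000'000; ++i) c.c1++; });
    t0.join();
    t1.join();
}
```

Timing hammer on the two layouts typically shows the padded version running several times faster, even though both perform identical writes.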

  19. [Figure-only slide.]

  20. [Figure-only slide.]

  21. BLOCK MISS COST MEASURE
Definition. Suppose that block β is moved m times from one cache to another (due to cache or block misses) during a time interval T = [t1, t2]. Then m is defined to be the block delay incurred by β during T. The block wait cost incurred by a task τ on a block β is the delay incurred during the execution of τ due to block misses when accessing β, measured in units of cache misses.

  22. BLOCK MISS COST MEASURE
Definition. Suppose that block β is moved m times from one cache to another (due to cache or block misses) during a time interval T = [t1, t2]. Then m is defined to be the block delay incurred by β during T. The block wait cost incurred by a task τ on a block β is the delay incurred during the execution of τ due to block misses when accessing β, measured in units of cache misses.
The block wait cost could be much larger than B if multiple writes to the same location are allowed. In most of our analysis, we will use the block delay of β within a time interval T as the block wait cost of every task that accesses β during T.

  23. BLOCK MISS COST MEASURE
Definition. Suppose that block β is moved m times from one cache to another (due to cache or block misses) during a time interval T = [t1, t2]. Then m is defined to be the block delay incurred by β during T. The block wait cost incurred by a task τ on a block β is the delay incurred during the execution of τ due to block misses when accessing β, measured in units of cache misses.
The block wait cost could be much larger than B if multiple writes to the same location are allowed. In most of our analysis, we will use the block delay of β within a time interval T as the block wait cost of every task that accesses β during T.
This cost measure is highly pessimistic; hence upper bounds obtained under it are likely to hold for other cost measures for block misses.
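As a worked illustration of the measure (a scenario I am assuming, not one from the talk):

```latex
% Illustrative scenario: p tasks each write once to a distinct word of
% block \beta during interval T, and the coherence protocol moves \beta
% to the writer's cache on every write.
\[
  m \;\le\; p
  \qquad\text{(each write moves } \beta \text{ at most once),}
\]
\[
  \text{block wait cost charged to each task accessing } \beta
  \;=\; m \;\le\; p \;\le\; B
  \quad\text{cache-miss units, since at most } B \text{ words fit in } \beta.
\]
% The charge is pessimistic: a task that touches \beta only briefly is
% still charged the full block delay m for the interval.
```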

  24. REDUCING BLOCK MISS COSTS: ALGORITHMIC TECHNIQUES
1. We enforce limited access writes: an algorithm is limited access if each of its writable variables is accessed O(1) times.

  25. REDUCING BLOCK MISS COSTS: ALGORITHMIC TECHNIQUES
1. We enforce limited access writes: an algorithm is limited access if each of its writable variables is accessed O(1) times.
2. We attempt to obtain O(1)-block sharing in our algorithms.
Definition. A task τ of size r is L-block sharing if there are O(L(r)) blocks which τ can share with all other tasks that could be scheduled in parallel with τ and could access a location in the block. A computation is L-block sharing if every task in it is L-block sharing.
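To illustrate the limited-access discipline with a toy example of my own (not from the paper): instead of having every task repeatedly update one shared accumulator, each task writes its partial result exactly once into its own slot.

```cpp
#include <algorithm>
#include <future>
#include <numeric>
#include <vector>

// Toy illustration of limited-access writes: each of k tasks writes its
// partial sum exactly once into its own slot, so every shared writable
// cell partial[i] is accessed O(1) times, as the rule requires.
long limited_access_sum(const std::vector<long>& a, std::size_t k) {
    std::vector<long> partial(k, 0);
    std::vector<std::future<void>> fs;
    std::size_t chunk = (a.size() + k - 1) / k;
    for (std::size_t i = 0; i < k; ++i) {
        fs.push_back(std::async(std::launch::async, [&, i] {
            long s = 0;                       // thread-local accumulator
            std::size_t lo = i * chunk;
            std::size_t hi = std::min(a.size(), lo + chunk);
            for (std::size_t j = lo; j < hi; ++j) s += a[j];
            partial[i] = s;                   // the ONE write to partial[i]
        }));
    }
    for (auto& f : fs) f.get();
    // Final pass reads each slot once.
    return std::accumulate(partial.begin(), partial.end(), 0L);
}
```

For small k the slots of partial occupy only a constant number of blocks shared among the k tasks, echoing the O(1)-block-sharing pattern the second technique asks for.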
