Multithreaded Algorithms

Architecture Evolution
- We've come a long way since we blamed von Neumann for putting that bottleneck in our computers
- Memory contains both data and programs
- The computer fetches instructions sequentially from memory and executes them
Architecture Evolution
- On one hand, processor power has improved
- On the other hand, how the processor interacts with memory has changed

Data & Instruction Streams
- Flynn's taxonomy classifies architectures by how many instruction and data streams they process at once:
  - SISD: Single Instruction, Single Data
  - SIMD: Single Instruction, Multiple Data
  - MISD: Multiple Instruction, Single Data
  - MIMD: Multiple Instruction, Multiple Data
Processor Evolution
- Quite a ride of improving processor speeds, doubling every 18 months or so
- Memory speed, by contrast, has been doubling only every six years
- However, processor scaling has reached limits due to various engineering constraints, mostly the difficulty of dissipating heat at current packing densities
- Right around 2003 we started seeing multiple cores
- This is the direction we're headed now

Moore's Law
- Predicted that the capacity of chips would double every two years
- The trend of making a single processor faster has flattened
- Now we're on a trend to add more cores
- This trend will accelerate in the next several years
Not Your Father's Environment
- Multicore processors pose some real challenges for programmers
- On a single processor, multithreading is really multitasking
- On a multicore processor, multithreading is on steroids
- Each core pipelines instructions through multiple threads
- Threads get to memory more rapidly
- Increased possibility of contention

You've Been Drawn In
- You've been drawn into this war
- You may argue that your application does not need threading
- But you have no choice when you are on a multicore machine
- Remember how a sequential program can break when run with multiple threads
- Programs will likely break when run on multiple cores
- Memory is much slower than the CPU, so cores tend to rely more on caches to improve performance
Cache Reliance
- Caching generally makes sense because programs exhibit data locality
- However, on a multicore machine you have multiple layers of cache
- Not all caches are visible to all cores!
- So when multiple threads access data that may live in different caches, what happens to data correctness?

A Change in Paradigm
- As the memory-to-CPU gap widens, this problem gets acute
- The compiler will take care of quite a few things
- The JVM will take care of quite a few things
- But you have to pull a bigger share of the load for correctness
Programming Got Complex
- You have to be more vigilant
- You need to synchronize more often
- You have to know what is visible to other threads and what is not (see the sketch below)
- If you don't, you'll see odd, unpredictable results
- Not much fun when that happens

Rethink Your Programming
- We are used to mutable shared state
- Mutability by itself is not too bad
- Sharing data for reads is not bad either
- But if you try to share mutable data, real trouble lies ahead
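A minimal Java sketch of the visibility point above (an assumed illustration, not from the slides): without volatile, the reader thread may spin forever on a stale cached value.

    public class VisibilityDemo {
        // Try removing 'volatile': on many JVMs the reader below may never
        // observe the write and will spin forever
        private static volatile boolean done = false;

        public static void main(String[] args) throws InterruptedException {
            Thread reader = new Thread(() -> {
                while (!done) { /* spin until the write becomes visible */ }
                System.out.println("Saw the update");
            });
            reader.start();
            Thread.sleep(100);   // let the reader start spinning
            done = true;         // publish the update to the other thread
            reader.join();
        }
    }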
Challenges
- Breaking a problem into threads is hard
- What is an optimal partitioning?
- How do you schedule these threads?
- How do you communicate between these threads?

Dynamic Multithreaded Programming
- You rely on a platform that takes care of details such as load balancing and scheduling
- You expect two features to be available:
  - Nested parallelism: you can spawn subroutines, and the caller and the spawned subroutines can proceed independently
  - Parallel loops: iterations of the loop can execute concurrently
Benefits
- A simple extension to the serial model with the keywords parallel, spawn, and sync
- Easy to convert a parallel algorithm to a sequential one by removing these keywords
- Easy to quantify parallelism based on work and span
- Divide and conquer lends itself well to this model

Dynamic MT Basics
- Serialization of a multithreaded algorithm is achieved by deleting the keywords spawn, sync, and parallel
- spawn indicates only that the scheduler may (not must) run the subroutine concurrently
- sync indicates that the algorithm can't proceed until the spawned subroutine has completed and its result has been received
- Every procedure executes an implicit sync before it returns, ensuring all spawned subroutines terminate before the procedure does
Example: Fibonacci Number

Parallel version:

    FIB(n)
        if n <= 1
            return n
        else
            x = spawn FIB(n-1)
            y = FIB(n-2)
            sync
            return x + y

Serialized version (delete spawn and sync):

    FIB(n)
        if n <= 1
            return n
        else
            x = FIB(n-1)
            y = FIB(n-2)
            return x + y

Example: Count the number of primes (the slide's code is not reproduced here)
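As a concrete (assumed) mapping of the FIB example onto a real platform, here is a sketch in Java's fork/join framework, where fork() plays the role of spawn and join() the role of sync:

    import java.util.concurrent.ForkJoinPool;
    import java.util.concurrent.RecursiveTask;

    public class Fib extends RecursiveTask<Long> {
        private final int n;
        Fib(int n) { this.n = n; }

        @Override
        protected Long compute() {
            if (n <= 1) return (long) n;
            Fib x = new Fib(n - 1);
            x.fork();                            // "spawn": may run concurrently
            long y = new Fib(n - 2).compute();   // the caller keeps working
            return x.join() + y;                 // "sync": wait for the spawned task
        }

        public static void main(String[] args) {
            System.out.println(new ForkJoinPool().invoke(new Fib(30)));
        }
    }

Replacing fork()/join() with a direct compute() call serializes the algorithm, just as deleting spawn and sync does in the pseudocode.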
A Model for MT Execution
- Strand: a chain of instructions with no parallel control
- MT computations can be represented as a computation DAG
- Vertices represent instructions or strands
- Edges represent dependencies between instructions
- If the DAG has a directed path from one strand to another, the two are logically in series; otherwise they are logically in parallel

Performance Measures
- T_p is the runtime of an algorithm on p processors
- Work and span are useful for calculating theoretical efficiencies
- Work (T_1): the total time to execute the entire computation on one processor, i.e., the sum of the times taken by each strand
- Span (T_∞): the longest time to execute the strands along any path in the DAG
- The number of processors factors in as well
Performance Measures
- Work and span provide lower bounds on the runtime
- In one step, an ideal parallel computer with p processors can do at most p units of work
- In T_p time, it can perform at most p·T_p units of work
- The total work to do is T_1 (the work done on one processor)
- So we have p·T_p >= T_1
- Work law: T_p >= T_1 / p

Performance Measures
- A machine with an unlimited number of processors can emulate a p-processor machine by using just p of its processors, so it is at least as fast: T_∞ <= T_p
- Span law: T_p >= T_∞
- Adding more processors than needed does not help
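To see the two laws at work, take some made-up numbers (not from the slides): work T_1 = 100, span T_∞ = 10, and p = 4 processors. Then

    T_4 \ge T_1 / 4 = 25 \qquad \text{(work law)}
    T_4 \ge T_\infty = 10 \qquad \text{(span law)}

so T_4 >= max(25, 10) = 25, and the speedup T_1/T_4 can be at most 4.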
Speedup
- Speedup on p processors is T_1 / T_p
- From the work law we have T_p >= T_1 / p
- So T_1 / T_p <= p: the speedup on p processors can be at most p
- When T_1 / T_p = Θ(p), you have linear speedup
- When T_1 / T_p = p, you have perfect linear speedup

Limits on Speedup
- You simply can't throw processors at a problem and expect speedup
- Amdahl's law: the speedup that can be realized is limited by the sequential fraction of the computation
- The overall speedup is 1 / ((1 - P) + P/S), where P is the fraction of the computation that can enjoy a speedup of S
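Amdahl's law is easy to evaluate directly; this small Java sketch (the values of P and S are made-up illustrations) shows how the serial fraction caps the overall gain:

    public class Amdahl {
        // Overall speedup = 1 / ((1 - p) + p / s), where p is the parallelizable
        // fraction and s the speedup that fraction enjoys
        static double speedup(double p, double s) {
            return 1.0 / ((1.0 - p) + p / s);
        }

        public static void main(String[] args) {
            // Even if 90% of the work speeds up 10x, the serial 10% caps the total:
            System.out.printf("P=0.90, S=10   -> %.2fx%n", speedup(0.90, 10));   // ~5.26x
            // With effectively unlimited speedup on that 90%, the limit is 1/(1-P) = 10x:
            System.out.printf("P=0.90, S=1e9  -> %.2fx%n", speedup(0.90, 1e9));  // ~10x
        }
    }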
Parallelism
- Parallelism is T_1 / T_∞
- Represents the average amount of work that can be done in parallel for each step along the critical path (span)
- Represents an upper bound: the maximum possible speedup that can be achieved
- It limits the possibility of attaining linear speedup; beyond it, you can't throw more processors at the problem to improve speedup

Analyzing MT Algorithms
- Compute T_1(n)
- To compute T_∞(n), analyze the span:
  - If two subcomputations are done in sequence, their spans add to form the span of their composition
  - If they are joined in parallel, take the maximum of the two spans
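Applying these rules to the FIB example above gives the standard analysis (the slides state the rules but not this derivation):

    T_1(n) = T_1(n-1) + T_1(n-2) + \Theta(1) = \Theta(\phi^n)
    T_\infty(n) = \max\big(T_\infty(n-1),\, T_\infty(n-2)\big) + \Theta(1) = T_\infty(n-1) + \Theta(1) = \Theta(n)
    T_1(n) / T_\infty(n) = \Theta(\phi^n / n)

where φ is the golden ratio: the two recursive calls run in parallel, so their spans combine by max, and the deeper FIB(n-1) call dominates. The parallelism Θ(φ^n / n) grows enormously with n.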
Parallel Loops
- Algorithms may benefit from executing loop iterations in parallel
- The parallel keyword before a for loop conveys this

Matrix-Vector Multiplication
- Compute y = A x; each entry y_i depends only on row i of A and on x, so the outer loop over rows can be a parallel for (the slide's code is not reproduced here)
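In place of the slide's code, here is a hedged Java sketch of the parallel-loop idea for y = A x (an assumed translation, using parallel streams in place of the parallel keyword):

    import java.util.stream.IntStream;

    public class MatVec {
        static double[] multiply(double[][] a, double[] x) {
            int n = a.length;
            double[] y = new double[n];
            // The "parallel for": iterations are independent because
            // iteration i reads only row i and writes only y[i]
            IntStream.range(0, n).parallel().forEach(i -> {
                double sum = 0.0;
                for (int j = 0; j < x.length; j++) {
                    sum += a[i][j] * x[j];
                }
                y[i] = sum;
            });
            return y;
        }
    }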
Matrix-Vector Mult: Divide & Conquer

Computational Efficiency
- T_1(n) = Θ(n^2), from the serialization of MAT-VEC
- T_∞(n) = Θ(lg n) + max over 1 <= i <= n of iter_∞(i)
- Overall the span is dominated by the Θ(n) inner loop, so T_∞(n) = Θ(n)
- Parallelism is Θ(n^2) / Θ(n) = Θ(n)
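The divide-and-conquer version is also not reproduced on the slide; here is an assumed Java sketch of the idea: split the row range in half, spawn one half, and recurse on the other. The recursion tree has depth Θ(lg n), which is where the Θ(lg n) term in the span comes from.

    import java.util.concurrent.RecursiveAction;

    class MatVecTask extends RecursiveAction {
        private final double[][] a;
        private final double[] x, y;
        private final int lo, hi;   // rows [lo, hi), assumed non-empty

        MatVecTask(double[][] a, double[] x, double[] y, int lo, int hi) {
            this.a = a; this.x = x; this.y = y; this.lo = lo; this.hi = hi;
        }

        @Override
        protected void compute() {
            if (hi - lo == 1) {                 // base case: one row, Θ(n) work
                double sum = 0.0;
                for (int j = 0; j < x.length; j++) sum += a[lo][j] * x[j];
                y[lo] = sum;
            } else {
                int mid = (lo + hi) / 2;
                MatVecTask lower = new MatVecTask(a, x, y, lo, mid);
                lower.fork();                                 // spawn the lower half
                new MatVecTask(a, x, y, mid, hi).compute();   // recurse on the upper half
                lower.join();                                 // sync
            }
        }
    }

Invoked as new ForkJoinPool().invoke(new MatVecTask(a, x, y, 0, a.length)); the serial Θ(n) inner loop in the base case is what keeps the span at Θ(n) overall.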
Race Conditions
- Deterministic behavior is critical
- Non-deterministic means unpredictable and unreliable
- Shared state is OK
- Mutable state is OK
- Shared mutable state is not OK

What Leads to a Race Condition?
- Partly it is the timing
- It is also about visibility
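To make the timing point concrete, here is a minimal (assumed) Java demonstration of a race: count++ is a read-modify-write, so two threads can interleave and overwrite each other's updates.

    public class RaceDemo {
        static int count = 0;

        public static void main(String[] args) throws InterruptedException {
            Runnable work = () -> {
                for (int i = 0; i < 100_000; i++) count++;  // unsynchronized read-modify-write
            };
            Thread t1 = new Thread(work), t2 = new Thread(work);
            t1.start(); t2.start();
            t1.join();  t2.join();
            System.out.println(count);  // nondeterministic: usually less than 200000
        }
    }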
Avoid Race Conditions
- Ensure your algorithm does not have race conditions
- It is better to avoid them at the root by using immutability (see the sketch below)
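A small sketch of the immutability remedy (illustrative, not from the slides): an immutable object can be shared across threads freely, because no thread can ever observe it mid-update; to "change" it you create a new object instead.

    public final class Point {
        private final double x, y;   // final fields, assigned exactly once

        public Point(double x, double y) { this.x = x; this.y = y; }

        public double x() { return x; }
        public double y() { return y; }

        // Mutation is replaced by construction: callers get a new Point
        public Point translate(double dx, double dy) {
            return new Point(x + dx, y + dy);
        }
    }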