B3CC: Concurrency 08: Parallelism from Concurrency - Trevor L. McDonell, Utrecht University, B2 2020-2021

  1. B3CC: Concurrency 08: Parallelism from Concurrency Trevor L. McDonell Utrecht University, B2 2020-2021

  2. Recap • Concurrency: dealing with lots of things at once - Collection of independently executing processes - Two or more threads are making progress • Parallelism: doing lots of things at once - Simultaneous execution of (possibly related) computations - At least two threads are executing simultaneously 2

  3. Recap • So far we have discussed concurrency as a means to write modular code with multiple interactions - Example: a network server that interacts with multiple clients simultaneously • Sometimes this can speed up the program by overlapping I/O or the time spent waiting for clients to respond, but that speedup does not require multiple processors • In many cases we can use the same method to achieve real parallelism - From now on, we will talk about some of the considerations for doing this well 3
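
  The slides contain no code, but a minimal Haskell sketch (Haskell being the course language; the function both and the example workloads are illustrative, not from the slides) shows the idea: the same forkIO/MVar machinery used for concurrent interactions also yields parallelism when the program is compiled with -threaded and run with +RTS -N.

      import Control.Concurrent (forkIO)
      import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)

      -- Run two independent computations on separate threads and wait for
      -- both results.  With the threaded runtime and more than one
      -- capability, the forked threads may execute on different cores.
      both :: a -> b -> IO (a, b)
      both x y = do
        mx <- newEmptyMVar
        my <- newEmptyMVar
        _  <- forkIO (putMVar mx $! x)   -- force x on the first child thread
        _  <- forkIO (putMVar my $! y)   -- force y on the second child thread
        rx <- takeMVar mx                -- block until both results arrive
        ry <- takeMVar my
        return (rx, ry)

      main :: IO ()
      main = do
        (s, t) <- both (sum [1 .. 50000000 :: Int]) (sum [1 .. 60000000 :: Int])
        print (s + t)

  The $! matters here: without it the forked threads would only hand back unevaluated thunks, and the actual work would still happen sequentially in the parent thread.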

  4. Motivations 4

  5. The free lunch is over • “The free lunch is over” (2005) - Today virtually all processors include multiple cores/processing elements - This has become the primary method for increasing performance - This has consequences for the programmer http://www.gotw.ca/publications/concurrency-ddj.htm 5

  6. Why? [Figure: 48 Years of Microprocessor Trend Data - transistor counts (thousands) from 1970 to 2020. Original data up to 2010 collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, and C. Batten; data for 2010-2019 by K. Rupp. https://github.com/karlrupp/microprocessor-trend-data] 6

  7. Why? • Moore's curve (1965) - Observation that the number of transistors in an integrated circuit doubles roughly every two years - In particular, at the point where the cost per transistor is minimised - Not a law in any sense of the word (don't call it that) 7

  8. Why? • Dennard scaling - As transistors get smaller, power density remains constant - Combined with shrinking transistors, this implies performance per watt grows at roughly the same rate as transistor density • signal delay decreases (clock frequency increases) • voltage and current decrease (power density remains constant) 8
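
  A short scaling argument, not on the slide but a standard way of stating Dennard's observation, makes the constant-power-density claim concrete. Take the dynamic power of a transistor as P ≈ C · V² · f and shrink every linear dimension by a factor κ > 1:

      capacitance           C → C/κ,   voltage  V → V/κ,   frequency  f → κ·f
      power per transistor  P → (C/κ) · (V/κ)² · (κ·f) = C·V²·f / κ²
      area per transistor   A → A/κ²

  Power and area both fall by κ², so power per unit area stays constant while the clock frequency rises, which is exactly the claim on the slide.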

  9. Why? [Figure: 48 Years of Microprocessor Trend Data - transistors (thousands), frequency (MHz), and typical power (W), 1970-2020. Data: M. Horowitz et al. (to 2010) and K. Rupp (2010-2019), https://github.com/karlrupp/microprocessor-trend-data] 9

  10. Why? • Since ~2005, Dennard scaling has broken down - Static (leakage) power losses increased faster than the savings from decreasing voltage & current - Consequence: we can no longer improve performance through frequency scaling alone 10

  11. Why? [Figure: 48 Years of Microprocessor Trend Data - transistors (thousands), single-thread performance (SpecINT × 10³), frequency (MHz), and typical power (W), 1970-2020. Data: M. Horowitz et al. (to 2010) and K. Rupp (2010-2019), https://github.com/karlrupp/microprocessor-trend-data] 11

  12. Why? • Traditional approaches to increasing CPU performance: - Frequency scaling - Caches - Micro-architectural improvements • Out of order execution (increase utilisation of execution hardware) • Branch prediction (guess the outcome of control flow) • Speculative execution (do work before knowing if it will be needed) 12

  13. Why? • Frequency scaling: The Power Wall - Power consumption of transistors does not decrease as fast as density increases - Performance limited by power consumption (& dissipation) [Figure: transistor density, transistor power, and total power over time] 13

  14. Why? • Caches: The Memory Wall - Memory speed does not increase as fast as computing speed - Increasingly difficult to hide memory latency [Figure: the growing performance gap between compute and memory over time] 14

  15. Why? • Microarchitecture improvements: The Instruction Level Parallelism Wall - Law of diminishing returns - Pollack's rule: performance ∝ √complexity (equivalently, complexity ∝ performance²) [Figure: cost versus serial performance] 15
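
  A quick worked comparison, not on the slide, shows why this pushes designs towards multiple cores. Under Pollack's rule, doubling the transistor budget of a single core buys only about 1.4× serial performance, whereas spending the same budget on a second identical core can give up to 2× throughput on work that parallelises:

      Pollack's rule:             serial performance ∝ √complexity
      2× transistors, one core:   √2 ≈ 1.41× serial performance
      2× transistors, two cores:  up to 2× throughput (if the work parallelises)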

  16. Why? [Figure: 48 Years of Microprocessor Trend Data - transistors (thousands), single-thread performance (SpecINT × 10³), frequency (MHz), typical power (W), and number of logical cores, 1970-2020. Data: M. Horowitz et al. (to 2010) and K. Rupp (2010-2019), https://github.com/karlrupp/microprocessor-trend-data] 16

  17. Why? [Figure: the same microprocessor trend data repeated from the previous slide] 17

  18. Aside: more cores ≠ more performance https://arstechnica.com/gadgets/2020/11/a-history-of-intel-vs-amd-desktop-performance-with-cpu-charts-galore/ 18

  19. Considerations 19

  20. Parallelism • Improving application performance through parallelisation means: - Reducing the total time to compute a single result (latency) - Increasing the rate at which a series of results are computed (throughput) - Reducing the power consumption of a computation 20

  21. Problem • To make the program run faster, we need to gain more from parallelisation than we lose due to the overhead of adding it - Granularity: If the tasks are too small, the overhead of managing the tasks outweighs any benefit you might get from running them in parallel - Data dependencies: When one task depends on another, they must be performed sequentially 21
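
  The slide has no code, but a small Haskell sketch (the chunk count and names such as parSum are illustrative) shows both points at once: coarse chunks keep the thread-management overhead small relative to the useful work, and the final combining step is a data dependency that can only run after every partial result is available.

      import Control.Concurrent (forkIO)
      import Control.Concurrent.MVar (MVar, newEmptyMVar, putMVar, takeMVar)
      import Data.List (foldl')

      -- Fork one worker thread that computes the strict sum of its chunk
      forkSum :: [Int] -> IO (MVar Int)
      forkSum c = do
        v <- newEmptyMVar
        _ <- forkIO (putMVar v $! foldl' (+) 0 c)
        return v

      -- Sum a large list by splitting it into a few coarse chunks, one
      -- thread per chunk.  A thread per element would spend far more time
      -- forking and synchronising than adding.
      parSum :: Int -> [Int] -> IO Int
      parSum nChunks xs = do
        let size = max 1 (length xs `div` nChunks)
        vars     <- mapM forkSum (chunk size xs)
        partials <- mapM takeMVar vars        -- data dependency: the final sum
        return (foldl' (+) 0 partials)        -- needs every partial result

      -- Split a list into pieces of at most n elements
      chunk :: Int -> [a] -> [[a]]
      chunk _ [] = []
      chunk n ys = let (h, t) = splitAt n ys in h : chunk n t

      main :: IO ()
      main = parSum 8 [1 .. 10000000] >>= print

  In practice the chunk count would follow the number of capabilities (getNumCapabilities) rather than a hard-coded 8, but the trade-off is the same: too few chunks limits parallelism, too many lets the overhead dominate.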
