
TDDD56: Multicore and GPU programming
Lesson 1: Introduction to laboratory work
Nicolas Melot, nicolas.melot (at) liu.se
Linköping University (Sweden)
November 4, 2015


The ABA problem

The scenario starts:
- Thread 0 pops A but is preempted before its CAS(head, old=A, new=B).
  (Thread 0: old → A, new → B, pool → A → null)
- Thread 1 pops A, succeeds.
- Thread 2 pops B, succeeds.
  (Thread 2: old → B, new → C, pool → B → null)
- Thread 1 pushes A back, succeeds.
  (Thread 1: old → C, new → A, pool → null)
- Thread 0 resumes and performs CAS(head, old=A, new=B); the CAS
  succeeds because head points to A again.

The shared stack should be empty, but it points to B in Thread 2's
recycling bin: shared → B → null.

The ABA problem

[Figure: the shared stack head and the recycling pool of threads 0, 1
and 2. The shared stack should be empty, but points to B in Thread 2's
recycling bin.]

Lab 2: Directions

- Implement a stack and protect it using locks.
- Implement a CAS-based stack.
  ◮ A CAS assembly implementation is provided in the lab skeleton.
- Use pthread synchronization to make several threads preempt each
  other in order to play out one ABA scenario.
- Use an ABA-free performance test to compare the performance of the
  lock-based and the CAS-based concurrent stacks.
- Get more details and hints in the lab compendium.

Lab 3: Parallel sorting

- Implement, or optimize, an existing sequential sort implementation.
- Parallelize it with a shared-memory approach (pthreads or OpenMP).
- Parallelize it with dataflow (Drake).
- Test your sorting implementation in various situations:
  ◮ random, ascending, descending or constant input;
  ◮ small and big input sizes;
  ◮ other tricky situations you may imagine.
- Built-in sorting functions (qsort(), std::sort()) are forbidden.
  ◮ You may rewrite them for better performance.
- Lab demo: describe the important techniques that accelerate your
  implementation.

Base sequential sort

pivot = array[size / 2];
for (i = 0; i < size; i++)
{
    if (array[i] < pivot)
    {
        left[left_size] = array[i];
        left_size++;
    }
    else if (array[i] > pivot)
    {
        right[right_size] = array[i];
        right_size++;
    }
    else
        pivot_count++;
}

simple_quicksort(left, left_size);
simple_quicksort(right, right_size);

memcpy(array, left, left_size * sizeof(int));
for (i = left_size; i < left_size + pivot_count; i++)
    array[i] = pivot;
memcpy(array + left_size + pivot_count, right,
       right_size * sizeof(int));

Base sequential sort

[Figure: processing time against data index for the sequential sort;
legend: sequential task, local sort, partition, merge.]

Parallelization opportunities

- Parallelization opportunities:
  ◮ recursive calls;
  ◮ computing pivots;
  ◮ merging, if necessary.
- Smart solutions are challenging to implement:
  ◮ in-place quicksort: false sharing;
  ◮ parallel sampling/merging: synchronization;
  ◮ follow the KISS rule.
- Avoid spawning more threads than the computer has cores.
- Exploit data locality with caches and cache lines.

Simple parallelization

[Figure: animation of a simple parallelization of the sort.]

Parallel quicksort with 3 cores

- Quicksort can only use a power-of-two number of cores efficiently.
  How to use three cores efficiently?
- Choose the pivot to divide the buffer into unequal parts.
- Partition and recurse into 3 parts (sample sort).
  ◮ Makes the implementation harder.

Mergesort

[Figure: processing time against data index for sequential mergesort;
legend: sequential task, local sort, partition, merge.]

Simple mergesort parallelization

[Figure: animation of a simple parallelization of mergesort.]

Parallel mergesort with 3 cores

- Mergesort can only use a power-of-two number of cores efficiently.
  How to use three cores efficiently?
- Divide the buffer into 2 unequal parts.
- Partition and recurse into 3 parts, then perform a 3-way merge.

Pipelined parallel mergesort

- Classic parallelism: start a task when the previous one is done.
  [Figure: schedule on cores 1-4.]
- Pipeline parallelism: run the next merging task as soon as possible.
  [Figure: schedule on cores 1-4.]
- Even more speedup.
- Difficult to implement manually.

Pipeline parallelism

- Related research since the '60s:
  ◮ program verifiability;
  ◮ parallelism is a mere "consequence".
- Sequential tasks communicating through channels.
- Theories: Kahn process networks, (Synchronous) Data Flow,
  Communicating Sequential Processes.
- Languages: StreamIt, CAL, Esterel.

[Figure: example dataflow graph of adder (+), delay (D) and
multiply-by-constant (*k) actors, each channel with rate 1.]

Classic versus stream

Most programming languages are unsuitable for parallelism:
- they abstract a single, universal instruction pointer;
- they abstract a single, universal address space;
- they are difficult to read with several threads in mind;
- annotations (OpenMP) do not help with a high number of cores.

Stream programming:
- (mostly) sequential tasks;
- actual parallelism is a matter of scheduling;
- no universal shared memory;
- a natural fit for pipeline parallelism;
- communication through on-chip memories: on-chip pipelining.

Back to parallel merge

[Figure: schedules on cores 1-3 for classic parallelism, pipelining
with 4 initial sorting tasks, and pipelining with 8 initial sorting
tasks.]
