
Parallel Programming and Heterogeneous Computing: Shared-Memory Hardware



  1. Parallel Programming and Heterogeneous Computing: Shared-Memory Hardware
     Max Plauth, Sven Köhler, Felix Eberhardt, Lukas Wenzel and Andreas Polze
     Operating Systems and Middleware Group

  2. Recap: Types of Parallelism (ParProg 2020 B3, Shared-Memory Hardware, Lukas Wenzel, Chart 2)
     Data Level Parallelism
     ■ The same operation is applied in parallel to multiple units of data.
     Task Level Parallelism
     ■ Multiple operations are executed in parallel.
     □ Instruction Level Parallelism (ILP): between operations in a task
     □ Thread Level Parallelism (TLP): between multiple tasks within a workload
     □ Request Level Parallelism: between multiple workloads
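The difference between the first two categories can be sketched with Python's standard `concurrent.futures` module. The helper names (`square`, `total`, `largest`) and the input list are invented for this illustration and do not come from the lecture. The sketch shows how the work is structured rather than any speedup, since standard CPython threads do not run Python bytecode simultaneously.

```python
from concurrent.futures import ThreadPoolExecutor

data = [1, 2, 3, 4, 5, 6, 7, 8]  # hypothetical input


def square(x):
    return x * x


def total(values):
    return sum(values)


def largest(values):
    return max(values)


with ThreadPoolExecutor(max_workers=4) as pool:
    # Data level parallelism: the same operation applied to many units of data.
    squares = list(pool.map(square, data))

    # Task level parallelism: different operations submitted side by side.
    total_future = pool.submit(total, data)
    largest_future = pool.submit(largest, data)
    data_total = total_future.result()
    data_largest = largest_future.result()

print(squares, data_total, data_largest)
```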

  3. Exploiting Instruction Level Parallelism (Chart 3)
     ■ ILP arises naturally within a workload
     □ Programmers think in terms of a single instruction sequence
     ■ TLP is explicitly encoded within a workload
     □ Programmers designate parallel operations using multiple tasks
     Why consider ILP in a parallel programming lecture? Knowledge of common ILP mechanisms and assumptions enables performance optimization at single-thread granularity!

  4. Exploiting Instruction Level Parallelism: Pipelining (Chart 4)
     ■ Instruction execution phases (e.g. Instruction Fetch, Decode, Execute, Memory Access, Writeback) employ distinct hardware units
     □ Without pipelining only one unit would operate each clock cycle
     ■ Pipelining increases throughput by utilizing all units in every cycle
     ■ Latency per instruction remains the same
     [Figure: three instructions, each passing through the stages F D E M W. Without pipelining: 15 cycles, 20% utilization. With pipelining: 7 cycles, approaching 100% utilization.]
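The cycle counts on this slide follow from simple arithmetic, sketched below for an ideal pipeline without hazards. The function names are made up for this example. Note that with only three instructions the pipelined utilization is still well below 100%, because the pipeline has to fill and drain; it only approaches 100% for long instruction sequences.

```python
def cycles_unpipelined(n_instructions, n_stages):
    # Each instruction passes through all stages before the next one starts.
    return n_instructions * n_stages


def cycles_pipelined(n_instructions, n_stages):
    # The first instruction needs n_stages cycles; every further instruction
    # completes one cycle later (ideal pipeline, no hazards).
    return n_stages + n_instructions - 1


def utilization(n_instructions, n_stages, total_cycles):
    # Fraction of all (stage, cycle) slots that do useful work.
    return (n_instructions * n_stages) / (total_cycles * n_stages)


stages = 5
for n in (3, 1000):
    seq = cycles_unpipelined(n, stages)
    pipe = cycles_pipelined(n, stages)
    print(n, seq, utilization(n, stages, seq), pipe, utilization(n, stages, pipe))
```

For three instructions and five stages this gives 15 cycles at 20% utilization without pipelining and 7 cycles with pipelining, matching the figure.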

  5. Exploiting Instruction Level Parallelism: Pipelining Example (Data Hazards), Cycle 1 (ParProg 2019, Shared-Memory Hardware, Lukas Wenzel, Chart 5.1)
     Stages: Fetch, Decode, Execute, Memory, Writeback
     Program: MOV R0,#1 | ADD R1,R0,#3 | LD R2,[R1] | LD R3,[R0] | ADD R0,R0,R3 | LD R3,[R1]
     Pipeline: MOV R0,#1 (R0 ← 0x01)
     Registers: R0 = 0x00, R1 = 0x00, R2 = 0x00, R3 = 0x00

  6. Pipelining Example (Data Hazards), Cycle 2 (Chart 5.2)
      Pipeline: MOV R0,#1 (R0 ← 0x01) | ADD R1,R0,#3 (R1 ← R0 + 0x03)
      Registers: R0 = 0x00, R1 = 0x00, R2 = 0x00, R3 = 0x00

  7. Pipelining Example (Data Hazards), Cycle 3 (Chart 5.3)
      Annotation: Forward
      Pipeline: MOV R0,#1 (R0 ← 0x01) | ADD R1,R0,#3 (R1 ← R0 + 0x03 / R1 ← 0x04) | LD R2,[R1] (R2 ← [R1])
      Registers: R0 = 0x00, R1 = 0x00, R2 = 0x00, R3 = 0x00

  8. Pipelining Example (Data Hazards), Cycle 4 (Chart 5.4)
      Annotation: Forward
      Pipeline: MOV R0,#1 (R0 ← 0x01) | ADD R1,R0,#3 (R1 ← 0x04) | LD R2,[R1] (R2 ← [R1] / R2 ← [0x04]) | LD R3,[R0] (R3 ← [R0])
      Registers: R0 = 0x01, R1 = 0x00, R2 = 0x00, R3 = 0x00

  9. Pipelining Example (Data Hazards), Cycle 5 (Chart 5.5)
      Annotation: Operand Fetch
      Pipeline: ADD R1,R0,#3 (R1 ← 0x04) | LD R2,[R1] (R2 ← [0x04] / R2 ← 0xd4) | LD R3,[R0] (R3 ← [R0] / R3 ← [0x01]) | ADD R0,R0,R3 (R0 ← R0 + R3, Dependency)
      Registers: R0 = 0x01, R1 = 0x04, R2 = 0x00, R3 = 0x00

  10. Pipelining Example (Data Hazards), Cycle 6 (Chart 5.6)
      Pipeline: LD R2,[R1] (R2 ← 0xd4) | LD R3,[R0] (R3 ← [0x01] / R3 ← 0xd1) | ADD R0,R0,R3 (R0 ← R0 + R3, Bubble)
      Registers: R0 = 0x01, R1 = 0x04, R2 = 0xd4, R3 = 0x00

  11. Pipelining Example (Data Hazards), Cycle 7 (Chart 5.7)
      Annotation: Operand Fetch
      Pipeline: LD R3,[R0] (R3 ← 0xd1) | ADD R0,R0,R3 (R0 ← R0 + R3 / R0 ← 0xd2, Bubble) | LD R3,[R1] (R3 ← [R1])
      Registers: R0 = 0x01, R1 = 0x04, R2 = 0xd4, R3 = 0xd1

  12. Pipelining Example (Data Hazards), Cycle 8 (Chart 5.8)
      Annotation: Operand Fetch
      Pipeline: ADD R0,R0,R3 (R0 ← 0xd2) | LD R3,[R1] (R3 ← [R1] / R3 ← [0x04])
      Registers: R0 = 0x01, R1 = 0x04, R2 = 0xd4, R3 = 0xd1

  13. Pipelining Example (Data Hazards), Cycle 9 (Chart 5.9)
      Pipeline: ADD R0,R0,R3 (R0 ← 0xd2) | LD R3,[R1] (R3 ← [0x04] / R3 ← 0xd4)
      Registers: R0 = 0xd2, R1 = 0x04, R2 = 0xd4, R3 = 0xd1

  14. Pipelining Example (Data Hazards), Cycle 10 (Chart 5.10)
      Pipeline: LD R3,[R1] (R3 ← 0xd4)
      Registers: R0 = 0xd2, R1 = 0x04, R2 = 0xd4, R3 = 0xd4
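The data hazard walkthrough can be cross-checked with a small sketch. It executes the six instructions one at a time (without modelling the pipeline) to confirm the final register values, and it lists read-after-write dependencies between neighbouring instructions. The tuple encoding and function names are invented for this illustration. The memory contents are the two values visible on the slides. The forward/bubble classification is an assumption read off the slides (MOV and ADD results are forwarded, a loaded value costs a bubble), and looking only at directly adjacent instructions is a simplification.

```python
# Program from the slides as (opcode, destination, source registers, immediate).
PROGRAM = [
    ("MOV", "R0", [], 1),               # R0 <- 1
    ("ADD", "R1", ["R0"], 3),           # R1 <- R0 + 3
    ("LD",  "R2", ["R1"], None),        # R2 <- [R1]
    ("LD",  "R3", ["R0"], None),        # R3 <- [R0]
    ("ADD", "R0", ["R0", "R3"], None),  # R0 <- R0 + R3
    ("LD",  "R3", ["R1"], None),        # R3 <- [R1]
]
# Memory contents read off the slides: [0x01] = 0xd1, [0x04] = 0xd4.
MEMORY = {0x01: 0xD1, 0x04: 0xD4}


def run_sequentially(program, memory):
    """Execute the program one instruction at a time (no pipeline)."""
    regs = {"R0": 0, "R1": 0, "R2": 0, "R3": 0}
    for op, dst, srcs, imm in program:
        values = [regs[s] for s in srcs]
        if op == "MOV":
            regs[dst] = imm
        elif op == "ADD":
            regs[dst] = sum(values) + (imm or 0)
        elif op == "LD":
            regs[dst] = memory[values[0]]
    return regs


def adjacent_raw_hazards(program):
    """List read-after-write dependencies between neighbouring instructions."""
    hazards = []
    for i in range(1, len(program)):
        prev_op, prev_dst, _, _ = program[i - 1]
        srcs = program[i][2]
        if prev_dst in srcs:
            # Assumption taken from the slides: MOV/ADD results can be
            # forwarded, a loaded value arrives too late and costs a bubble.
            kind = "bubble" if prev_op == "LD" else "forward"
            hazards.append((i - 1, i, prev_dst, kind))
    return hazards


final_regs = run_sequentially(PROGRAM, MEMORY)
hazards = adjacent_raw_hazards(PROGRAM)
print({name: hex(value) for name, value in final_regs.items()})
print(hazards)
```

The final registers agree with cycle 10 (R0 = 0xd2, R1 = 0x04, R2 = 0xd4, R3 = 0xd4), and the single load-use dependency found is the `ADD R0,R0,R3` that the slides mark with a bubble.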

  15. Exploiting Instruction Level Parallelism: Pipelining Example (Control Hazard), Cycle 1 (Chart 6.1)
      Stage labels: Fetch, Decode, Execute, Memory, Writeback, Branch
      Program: LD R0,[#1] | MOV R1,#108 | BEQ R0,R1,L1 | LD R1,[#2] | ADD R0,R0,R1 | L1: ST R0,[#4]
      Pipeline: LD R0,[#1] (R0 ← [0x01])
      Registers: R0 = 0x00, R1 = 0x00

  16. Pipelining Example (Control Hazard), Cycle 2 (Chart 6.2)
      Pipeline: LD R0,[#1] (R0 ← [0x01]) | MOV R1,#108 (R1 ← 0x6c)
      Registers: R0 = 0x00, R1 = 0x00

  17. Pipelining Example (Control Hazard), Cycle 3 (Chart 6.3)
      Pipeline: LD R0,[#1] (R0 ← [0x01] / R0 ← 0x6c) | MOV R1,#108 (R1 ← 0x6c) | BEQ R0,R1,L1 (R1 - R0 = 0: L1)
      Registers: R0 = 0x00, R1 = 0x00

  18. Pipelining Example (Control Hazard), Cycle 4 (Chart 6.4)
      Pipeline: LD R0,[#1] (R0 ← 0x6c) | MOV R1,#108 (R1 ← 0x6c) | BEQ R0,R1,L1 (R1 - R0 = 0: L1 / 0x6c - 0x6c = 0: L1) | LD R1,[#2] (R1 ← [0x02])
      Registers: R0 = 0x6c, R1 = 0x00

  19. Pipelining Example (Control Hazard), Cycle 5 (Chart 6.5)
      Pipeline: MOV R1,#108 (R1 ← 0x6c) | BEQ R0,R1,L1 (0x6c - 0x6c = 0: L1 / TRUE: L1) | LD R1,[#2] (R1 ← [0x02]) | ADD R0,R0,R1 (R0 ← R0 + R1)
      Registers: R0 = 0x6c, R1 = 0x6c

  20. Pipelining Example (Control Hazard), Cycle 6 (Chart 6.6)
      Annotation: FETCH L1 | FLUSH
      Pipeline: BEQ R0,R1,L1 (TRUE: L1) | LD R1,[#2] (R1 ← [0x02] / R1 ← 0x12) | ADD R0,R0,R1 (R0 ← R0 + R1 / R0 ← 0x6c + 0x12) | L1: ST R0,[#4] ([0x04] ← R0)
      Registers: R0 = 0x6c, R1 = 0x6c

  21. Pipelining Example (Control Hazard), Cycle 7 (Chart 6.7)
      Pipeline: L1: ST R0,[#4] ([0x04] ← R0)
      Registers: R0 = 0x6c, R1 = 0x6c

  22. Pipelining Example (Control Hazard), Cycle 8 (Chart 6.8)
      Pipeline: L1: ST R0,[#4] ([0x04] ← R0 / [0x04] ← 0x6c)
      Registers: R0 = 0x6c, R1 = 0x6c
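The outcome of the control hazard example can be checked the same way. The sketch below executes the program sequentially and records which instructions lie between the taken branch and its target. The tuple encoding, the `labels` mapping and the function name are assumptions made for this illustration. The memory contents ([0x01] = 0x6c, [0x02] = 0x12) are the values the slides show being loaded.

```python
# Program from the slides; L1 labels the final store.
PROGRAM = [
    ("LD", "R0", 0x01),           # R0 <- [0x01]
    ("MOV", "R1", 108),           # R1 <- 108 (0x6c)
    ("BEQ", "R0", "R1", "L1"),    # branch to L1 if R0 == R1
    ("LD", "R1", 0x02),           # R1 <- [0x02]
    ("ADD", "R0", "R0", "R1"),    # R0 <- R0 + R1
    ("ST", "R0", 0x04),           # L1: [0x04] <- R0
]
LABELS = {"L1": 5}
MEMORY = {0x01: 0x6C, 0x02: 0x12}  # values read off the slides


def run_sequentially(program, labels, memory):
    """Execute without a pipeline and collect instructions a taken branch skips."""
    regs = {"R0": 0, "R1": 0}
    mem = dict(memory)
    skipped = []
    pc = 0
    while pc < len(program):
        ins = program[pc]
        op = ins[0]
        if op == "LD":
            regs[ins[1]] = mem[ins[2]]
        elif op == "MOV":
            regs[ins[1]] = ins[2]
        elif op == "ADD":
            regs[ins[1]] = regs[ins[2]] + regs[ins[3]]
        elif op == "ST":
            mem[ins[2]] = regs[ins[1]]
        elif op == "BEQ" and regs[ins[1]] == regs[ins[2]]:
            target = labels[ins[3]]
            skipped.extend(program[pc + 1:target])
            pc = target
            continue
        pc += 1
    return regs, mem, skipped


regs, mem, skipped = run_sequentially(PROGRAM, LABELS, MEMORY)
print(regs, hex(mem[0x04]), skipped)
```

In this program the two skipped instructions, `LD R1,[#2]` and `ADD R0,R0,R1`, are the ones the slides show in flight when the branch resolves in cycle 6. They are flushed, so R1 keeps the value 0x6c and 0x6c is stored to address 0x04. How many instructions a real pipeline has to flush depends on how late the branch is resolved, which this sketch does not model.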
