SLIDE 1

A Scalable Architecture for Ordered Parallelism

Mark Jeffrey, Suvinay Subramanian, Cong Yan, Joel Emer, Daniel Sanchez

MICRO 2015

SLIDE 2

Multicores Target Easy Parallelism

• Regular parallelism (known tasks and data): well supported ✓
• Irregular parallelism (unknown tasks and data):
  - Unordered tasks: partly supported ≈ (load-balancing and synchronization challenges)
  - Ordered tasks: unsupported ✗

Ordering is a simple and general form of synchronization.
Support for order enables widespread parallelism.

SLIDE 3

Outline

• Understanding Ordered Parallelism
• Swarm
• Evaluation

SLIDE 4

Example: Parallelism in Dijkstra's Algorithm

Dijkstra's algorithm finds the shortest-path tree on a graph with weighted edges.

[Figure: animated example on a five-vertex weighted graph (A-E) explored from a source vertex. Each vertex visit is a task; tasks A, C, B, B, D, D, E, E are created dynamically and ordered by distance from the source (positions 1-8).]

Order = distance from the source node.
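
For reference, here is a minimal sequential sketch of the algorithm (illustrative C++ with assumed types; none of this is the paper's code). Each pop from the priority queue is one "task", and tasks execute in order of tentative distance from the source, matching the order above:

    // Illustrative sequential Dijkstra: each priority-queue pop is one task.
    #include <functional>
    #include <limits>
    #include <queue>
    #include <utility>
    #include <vector>

    struct Graph {
        // neighbors[v] = list of (neighbor vertex, edge weight)
        std::vector<std::vector<std::pair<int, int>>> neighbors;
    };

    std::vector<int> dijkstra(const Graph& g, int source) {
        const int kUnvisited = std::numeric_limits<int>::max();
        std::vector<int> dist(g.neighbors.size(), kUnvisited);
        using Task = std::pair<int, int>;  // (distance, vertex)
        std::priority_queue<Task, std::vector<Task>, std::greater<Task>> tasks;
        tasks.push({0, source});
        while (!tasks.empty()) {
            auto [d, v] = tasks.top();  // earliest-order task
            tasks.pop();
            if (dist[v] != kUnvisited) continue;  // vertex already settled
            dist[v] = d;                          // first task to reach v wins
            for (auto [u, w] : g.neighbors[v])
                tasks.push({d + w, u});           // create child tasks
        }
        return dist;
    }

The Swarm version on SLIDE 11 mirrors this structure, replacing the software priority queue with timestamped hardware tasks.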

SLIDE 5

Parallelism in Dijkstra's Algorithm

Independent tasks can execute out of order.

[Figure: the task sequence A, C, B, B, D, D, E, E ordered by distance from the source (positions 1-8), its data dependences, and a valid schedule that runs independent tasks in parallel.]

• 2x parallelism in this small example (more in larger graphs)
• Tasks and dependences are unknown in advance

Need speculative execution to elide order constraints.

SLIDE 6

Insights about Ordered Parallelism

1. With perfect speculation, parallelism is plentiful.
2. Tasks are tiny: 32 instructions on average.
3. Independent tasks are far away in program order.

[Figure: ideal schedule of tasks A, C, B, B, D, D, E, E, and an N-task window that can execute up to N tasks ahead of the earliest active task.]

Parallelism vs. speculation window:

    Window        Parallelism
    unbounded     800x (max)
    1K tasks      180x
    64 tasks      26x

Need a large window of speculation.
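
To make the window constraint concrete, the following sketch (illustrative only, assuming unit-time tasks and unlimited cores) counts the steps needed to run a dependence graph when execution may proceed at most N tasks ahead of the earliest unfinished task:

    // Illustrative window-limited scheduler: dep[i] lists the earlier tasks
    // that task i needs. Returns the number of steps; parallelism ~ n / steps.
    #include <cstddef>
    #include <vector>

    size_t stepsWithWindow(const std::vector<std::vector<size_t>>& dep, size_t N) {
        size_t n = dep.size(), steps = 0, head = 0;  // head = earliest unfinished
        std::vector<bool> done(n, false);
        while (head < n) {
            ++steps;
            std::vector<size_t> ready;
            // Eligible: unfinished, within the N-task window, all deps done.
            for (size_t i = head; i < n && i < head + N; ++i) {
                if (done[i]) continue;
                bool ok = true;
                for (size_t d : dep[i]) ok = ok && done[d];
                if (ok) ready.push_back(i);
            }
            for (size_t i : ready) done[i] = true;  // all run this step
            while (head < n && done[head]) ++head;  // slide the window
        }
        return steps;
    }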
SLIDE 7

Prior Work Can't Mine Ordered Parallelism

• Thread-Level Speculation (TLS) parallelizes loops and function calls in sequential programs, but extracts little of the available parallelism (800x max parallelism vs. 1.1x for TLS), because:
  - Execution order ≠ creation order
  - Task-scheduling priority queues introduce false data dependences
• Sophisticated parallel algorithms yield limited speedup

[Figure: speedup (y-axis 1 to 64) of software-parallel versions at 1, 32, and 64 cores for bfs, sssp, astar, msf, des, and silo.]
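
To illustrate the priority-queue problem, consider a minimal sketch (assumed code, not from the paper) of the software scheduler such algorithms rely on. Every push and pop mutates the shared heap, so even tasks that touch disjoint graph data serialize on the scheduler itself:

    // Illustrative lock-based task scheduler. The shared heap creates data
    // dependences between otherwise-independent tasks.
    #include <functional>
    #include <mutex>
    #include <optional>
    #include <queue>
    #include <vector>

    template <typename Task>  // Task must be ordered (operator>)
    class LockedTaskQueue {
        std::priority_queue<Task, std::vector<Task>, std::greater<Task>> heap_;
        std::mutex m_;  // serializes every scheduling operation
    public:
        void push(Task t) {
            std::lock_guard<std::mutex> g(m_);
            heap_.push(std::move(t));
        }
        std::optional<Task> pop() {
            std::lock_guard<std::mutex> g(m_);
            if (heap_.empty()) return std::nullopt;
            Task t = heap_.top();
            heap_.pop();
            return t;
        }
    };

Swarm sidesteps this by making task queuing a hardware primitive, so enqueues and dequeues do not touch shared program data.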

SLIDE 8

Swarm Mines Ordered Parallelism

• Execution model based on timestamped tasks
• Architecture executes tasks speculatively and out of order
  - Leverages the execution model to scale

[Figure: Swarm speedup at 1, 32, and 64 cores for bfs, sssp, astar, msf, des, and silo; bfs reaches 117x.]

SLIDE 9

Outline

• Understanding Ordered Parallelism
• Swarm
• Evaluation

SLIDE 10

Swarm Execution Model

Programs consist of timestamped tasks:
  - Tasks can create child tasks with a timestamp >= their own
  - Tasks appear to execute in timestamp order
  - Programmed with an implicitly parallel task API:

        swarm::enqueue(fptr, ts, args...);

[Figure: a tree of tasks with timestamps 2, 3, 4, 4, 5, 6, 7; children have timestamps >= their parent's.]

This conveys new work to the hardware as soon as possible.
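
The deck shows only the call form. A plausible C++ declaration (an assumption for illustration; the actual Swarm runtime interface may differ) is:

    // Hypothetical declaration of the task API above.
    #include <cstdint>

    namespace swarm {

    using Timestamp = uint64_t;

    // Enqueue a task that runs taskFn(ts, args...) as if at time ts.
    // Hardware guarantees tasks appear to execute in timestamp order.
    template <typename... Args>
    void enqueue(void (*taskFn)(Timestamp, Args...), Timestamp ts, Args... args);

    // Run all enqueued tasks to completion.
    void run();

    }  // namespace swarm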

SLIDE 11

Swarm Task Example: Dijkstra

    void ssspTask(Timestamp dist, Vertex& v) {   // timestamp = distance from source
      if (!v.isVisited()) {
        v.distance = dist;
        for (Vertex& u : v.neighbors) {
          Timestamp uDist = dist + edgeWeight(v, u);
          swarm::enqueue(&ssspTask, uDist, u);   // child task, later timestamp
        }
      }
    }

    swarm::enqueue(ssspTask, 0, sourceVertex);
    swarm::run();

SLIDE 12

Swarm Architecture Overview

[Figure: tiled multicore with memory controllers on the chip edges. Each tile has four cores with private L1I/D caches, a shared L2, an L3 bank, a router, and a task unit (task queue TQ + commit queue CQ).]

Per-tile task units:
• Task queue: holds task descriptors
• Commit queue: holds the speculative state of finished tasks

Commit queues provide the window of speculation.

SLIDE 13

Task Unit Queues

• Task queue: holds task descriptors
• Commit queue: holds the speculative state of finished tasks

Task states: IDLE (I), RUNNING (R), FINISHED (F)

[Figure: animation of one task unit. A new task (timestamp=7, taskFn, args) enters the task queue IDLE alongside tasks 2, 3, 8, 9, 10; a core picks it up (7 becomes RUNNING); when it finishes (7 becomes FINISHED), its speculative state sits in the commit queue with tasks 2 and 3 until it can commit.]

Similar to a reorder buffer, but at the task level.
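
A minimal sketch of the state a task unit tracks (assumed layout, for illustration only):

    // Assumed, simplified model of per-tile task-unit state.
    #include <cstdint>
    #include <vector>

    using Timestamp = uint64_t;

    enum class TaskState { IDLE, RUNNING, FINISHED };  // states from the slide

    struct TaskDescriptor {
        Timestamp ts;                       // position in timestamp order
        void (*taskFn)(Timestamp, void*);   // task function
        void* args;                         // task arguments
        TaskState state = TaskState::IDLE;
    };

    struct TaskUnit {
        // Task queue: descriptors in any state (64 entries/core in the paper).
        std::vector<TaskDescriptor> taskQueue;
        // Commit queue: FINISHED tasks whose speculative state awaits commit
        // in timestamp order, a reorder buffer at task granularity
        // (16 entries/core in the paper).
        std::vector<TaskDescriptor*> commitQueue;
    };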

SLIDE 14

High-Throughput Ordered Commits

• Suppose 64-cycle tasks execute on 64 cores:
  - Need 1 task commit per cycle to scale
  - TLS commit schemes (successor lists, commit token) are too slow
• We adapt "virtual time" [Jefferson, TOPLAS 1985]:
  - Tiles periodically communicate with a GVT arbiter to find the earliest unfinished task
  - Tiles then commit all tasks that precede it

[Figure: a GVT arbiter connected to tiles 1 through N.]

With large commit queues, many tasks commit at once, amortizing commit costs among many tasks.
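
A simplified sketch of one arbiter round (assumed code; the hardware protocol is more involved):

    // Assumed, simplified model of the global-virtual-time (GVT) commit
    // protocol: the arbiter computes the minimum timestamp of any unfinished
    // task across tiles, then every tile commits all finished tasks that
    // precede that bound.
    #include <algorithm>
    #include <cstdint>
    #include <limits>
    #include <vector>

    using Timestamp = uint64_t;
    constexpr Timestamp kNone = std::numeric_limits<Timestamp>::max();

    struct Tile {
        std::vector<Timestamp> unfinished;  // idle or running tasks
        std::vector<Timestamp> finished;    // speculative, awaiting commit

        Timestamp earliestUnfinished() const {
            auto it = std::min_element(unfinished.begin(), unfinished.end());
            return it == unfinished.end() ? kNone : *it;
        }
        void commitBefore(Timestamp gvt) {  // commit = drop speculative state here
            finished.erase(std::remove_if(finished.begin(), finished.end(),
                               [gvt](Timestamp ts) { return ts < gvt; }),
                           finished.end());
        }
    };

    // One periodic arbiter round: many tasks can commit at once.
    void gvtRound(std::vector<Tile>& tiles) {
        Timestamp gvt = kNone;
        for (const Tile& t : tiles)
            gvt = std::min(gvt, t.earliestUnfinished());
        for (Tile& t : tiles)
            t.commitBefore(gvt);
    }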

SLIDE 15

Speculative Execution Example

• Tasks can execute even if their parent is still speculative
  - Uncovers more parallelism
  - May trigger cascading (but selective) aborts

[Figure: timeline of tasks 1-5 on cores 0-2, shown against timestamp order. A data dependence between two tasks aborts the later task and its dependents, which then re-execute.]

SLIDE 16

Swarm Speculation Mechanisms

• Key requirements for speculative execution:
  - Fast commits
  - Large speculative window → small per-task speculative state
• Eager versioning + timestamp-based conflict detection
  - Bloom filters for cheap read/write sets [Yen, HPCA 2007]
  - Uses the hierarchical memory system to filter conflict checks
• Enables two helpful properties:
  1. Forwarding of still-speculative data
  2. On rollback, corrective writes abort dependent tasks only
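
A minimal sketch of Bloom-filter read/write sets (assumed hash scheme, loosely following the 256-byte, 8-way filters in the evaluation):

    // Assumed, simplified Bloom-filter access sets for conflict detection.
    // False positives can cause unnecessary aborts but never miss a conflict.
    #include <bitset>
    #include <cstddef>
    #include <cstdint>

    class AddressSet {
        static constexpr size_t kBits = 2048;  // 256 bytes
        static constexpr int kHashes = 8;      // "8-way"
        std::bitset<kBits> bits_;

        static size_t hash(uint64_t addr, uint64_t seed) {
            return ((addr ^ seed) * 0x9E3779B97F4A7C15ull >> 17) % kBits;
        }
    public:
        void insert(uint64_t lineAddr) {
            for (int i = 0; i < kHashes; ++i) bits_.set(hash(lineAddr, i));
        }
        bool mayContain(uint64_t lineAddr) const {
            for (int i = 0; i < kHashes; ++i)
                if (!bits_.test(hash(lineAddr, i))) return false;
            return true;
        }
    };

    struct SpeculativeTask {
        AddressSet readSet, writeSet;
    };

    // An access by an earlier-timestamp task conflicts with a later
    // speculative task if it may violate timestamp order: a write conflicts
    // with the later task's reads or writes; a read conflicts with its writes.
    bool mustAbort(const SpeculativeTask& later, uint64_t line, bool isWrite) {
        if (isWrite)
            return later.readSet.mayContain(line) ||
                   later.writeSet.mayContain(line);
        return later.writeSet.mayContain(line);
    }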

SLIDE 17

Outline

• Understanding Ordered Parallelism
• Swarm
• Evaluation

SLIDE 18

Evaluation Methodology

• Event-driven, sequential, Pin-based simulator
• Target system: 64-core, 16-tile chip
  - 16 MB shared L3 (1 MB/tile)
  - 256 KB per-tile L2s
  - 32 KB per-core L1s
  - 4096 task queue entries (64/core)
  - 1024 commit queue entries (16/core)
  - 256-byte, 8-way Bloom filters
• Scalability experiments from 1 to 64 cores
  - Scaled-down systems have fewer tiles

SLIDE 19

Swarm vs. Software Versions

[Figure: Swarm speedup at 1, 32, and 64 cores for bfs, sssp, astar, msf, des, and silo; bfs peaks at 117x.]

• 43x-117x faster than serial versions
• 3x-18x faster than parallel versions
• With simple, implicitly parallel code

SLIDE 20

Swarm Uses Resources Efficiently

[Figure: breakdown of core cycles (%) into committed work, aborted work, and queue stalls for bfs, sssp, astar, msf, des, and silo; and average task/commit queue entries used (up to 2.3K-2.7K task queue entries).]

• Most time is spent executing tasks that commit
• Swarm speculates 200-800 tasks ahead on average
• Speculation adds moderate energy overheads:
  - 15% extra network traffic
  - Conflict-check logic triggered in 9-16% of cycles

SLIDE 21

Conclusions

• Swarm exploits ordered parallelism efficiently
  - Necessary to parallelize many key algorithms
  - Simplifies parallel programming in general
• Conventional wisdom: ordering limits parallelism
  - In fact, an expressive execution model + a large window mean only true data dependences limit parallelism
• Conventional wisdom: speculation is wasteful
  - In fact, speculation unlocks plentiful ordered parallelism
  - Parallelism can be traded for efficiency (e.g., simpler cores)

SLIDE 22

Thanks for your attention! Questions?

A Scalable Architecture for Ordered Parallelism
Mark Jeffrey, Suvinay Subramanian, Cong Yan, Joel Emer, Daniel Sanchez