

  1. Adaptive Scheduling for Systems with Asymmetric Memory Hierarchies. Po-An Tsai, Changping Chen, and Daniel Sanchez

  2. Die-stacking has enabled near-data processing

  3. Die-stacking has enabled near-data processing. Conventional multicore processors use a multi-level, deep cache hierarchy to reduce data movement. [Figure: cores, private caches, and a shared LLC]

  4. Die-stacking has enabled near-data processing. Conventional multicore processors use a multi-level, deep cache hierarchy to reduce data movement. Near-data processors place cores close to main memory to reduce data movement. [Figure: processor die with cores, private caches, and a shared LLC; stacked DRAM dies over a logic layer with vault controllers and an NDP core that has a private cache only (a shallow hierarchy)]

  5. Die-stacking has enabled near-data processing. Conventional multicore processors use a multi-level, deep cache hierarchy to reduce data movement. Near-data processors place cores close to main memory to reduce data movement. [Figure: processor die with cores, private caches, and a shared LLC; stacked DRAM dies over a logic layer with vault controllers and an NDP core that has a private cache only (a shallow hierarchy)] Neither shallow nor deep hierarchies work well for all applications…

  6. Asymmetric hierarchies get the best of both worlds

  8. Asymmetric hierarchies get the best of both worlds. Prior work proposes hybrid systems with asymmetric memory hierarchies to get the best of both [Ahn et al., ISCA’15] [Gao et al., PACT’15] [Hsieh et al., ISCA’16] [Boroumand et al., ASPLOS’18]

  9. Applications have strong hierarchy preferences

  10. Applications have strong hierarchy preferences. [Chart: access latency (ns) of a deep-hierarchy LLC hit, the shallow hierarchy, and a deep-hierarchy LLC miss]

  11. Applications have strong hierarchy preferences. [Charts: access latency (ns) of a deep-hierarchy LLC hit, the shallow hierarchy, and a deep-hierarchy LLC miss; normalized performance/J of milc on the deep and shallow hierarchies]

  12. Applications have strong hierarchy preferences. [Charts: access latency (ns) of a deep-hierarchy LLC hit, the shallow hierarchy, and a deep-hierarchy LLC miss; normalized performance/J of milc and of xalanc on the deep and shallow hierarchies]

  13. Applications have strong hierarchy preferences. [Charts: access latency (ns) of a deep-hierarchy LLC hit, the shallow hierarchy, and a deep-hierarchy LLC miss; normalized performance/J of milc and of xalanc on the deep and shallow hierarchies] How well each application can use the shared LLC is critical to its preference.

  14. Scheduling programs to the right hierarchy is hard

  15. Scheduling programs to the right hierarchy is hard. [Chart: performance/J of gems over time] Many applications prefer different hierarchies over time because they have different phases.

  16. Scheduling programs to the right hierarchy is hard. [Charts: performance/J of gems over time; normalized performance/J of xalanc on the shallow hierarchy and on deep hierarchies with 2MB, 4MB, 8MB, and 16MB LLCs] Many applications prefer different hierarchies over time because they have different phases. Applications may also prefer different hierarchies due to resource contention with other applications.

  17. Prior schedulers focus on different systems and constraints

  18. Prior schedulers focus on different systems and constraints. Contention-aware scheduling (Bubble-up [Mars, MICRO’11], CRUISE [Jaleel, ASPLOS’12]) focuses on symmetric memory systems (multi-socket LLCs/NUMA). [Figure: two sockets with 8MB LLCs, LLC 1 and LLC 2]

  19. Prior schedulers focus on different systems and constraints. Contention-aware scheduling (Bubble-up [Mars, MICRO’11], CRUISE [Jaleel, ASPLOS’12]) focuses on symmetric memory systems (multi-socket LLCs/NUMA). [Figure: two sockets with 8MB LLCs, LLC 1 and LLC 2] Heterogeneous core-aware scheduling (PIE [Van Craeynest, ISCA’12] [Cong, ISLPED’11]) focuses on asymmetric core microarchitectures (big.LITTLE systems). [Figure: OoO cores and in-order cores]

  20. Prior schedulers focus on different systems and constraints. Contention-aware scheduling (Bubble-up [Mars, MICRO’11], CRUISE [Jaleel, ASPLOS’12]) focuses on symmetric memory systems (multi-socket LLCs/NUMA). [Figure: two sockets with 8MB LLCs, LLC 1 and LLC 2] Heterogeneous core-aware scheduling (PIE [Van Craeynest, ISCA’12] [Cong, ISLPED’11]) focuses on asymmetric core microarchitectures (big.LITTLE systems). [Figure: OoO cores and in-order cores] NDP-aware workload partitioning (PIM-enabled Instructions [Ahn, ISCA’15], TOM [Hsieh, ISCA’16]) focuses on single workloads and requires software modifications or compiler support.

  21. Prior schedulers focus on different systems and constraints. Contention-aware scheduling (Bubble-up [Mars, MICRO’11], CRUISE [Jaleel, ASPLOS’12]) focuses on symmetric memory systems (multi-socket LLCs/NUMA). [Figure: two sockets with 8MB LLCs, LLC 1 and LLC 2] Heterogeneous core-aware scheduling (PIE [Van Craeynest, ISCA’12] [Cong, ISLPED’11]) focuses on asymmetric core microarchitectures (big.LITTLE systems). [Figure: OoO cores and in-order cores] NDP-aware workload partitioning (PIM-enabled Instructions [Ahn, ISCA’15], TOM [Hsieh, ISCA’16]) focuses on single workloads and requires software modifications or compiler support. By contrast, our goal is to schedule threads considering both memory and core asymmetries, with no program modifications and transparently to users.

  22. AMS: An asymmetry-aware scheduler. [Diagram: hardware utility monitors sample accesses and produce miss curves (misses vs. cache size); software uses these curves to schedule threads] First contribution: an analytical model that estimates performance under different hierarchies. Second contribution: two thread placement algorithms (AMS-Greedy/AMS-DP) that extend techniques originally designed for cache partitioning.
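To make the flow on this slide concrete, here is a minimal Python sketch of the per-interval scheduling loop. All names here (sample_miss_curve, estimate_latencies, place_threads, the thread dictionaries) are hypothetical stand-ins for the hardware utility monitors, the analytical model, and the AMS-Greedy/AMS-DP placers; only the overall structure comes from the slide.

```python
# Hypothetical sketch of the AMS control loop: hardware monitors produce
# miss curves, a software model scores hierarchies, a placer maps threads.

LLC_SIZES_MB = [2, 4, 6, 8]  # candidate LLC allocations tracked by monitors

def sample_miss_curve(thread):
    """Stand-in for a hardware utility monitor: misses per interval
    at each candidate LLC allocation (non-increasing with capacity)."""
    return {size: thread["misses"][size] for size in LLC_SIZES_MB}

def schedule_interval(threads, estimate_latencies, place_threads):
    # 1. Hardware: sample accesses and produce per-thread miss curves.
    curves = {t["name"]: sample_miss_curve(t) for t in threads}
    # 2. Software (first contribution): the analytical model turns each
    #    miss curve into estimated latency under each hierarchy.
    scores = {name: estimate_latencies(curve) for name, curve in curves.items()}
    # 3. Software (second contribution): AMS-Greedy or AMS-DP places
    #    threads on processor-die vs. NDP cores using those scores.
    return place_threads(scores)
```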

  23. AMS analytical model

  24. AMS analytical model. AMS estimates application preferences using total memory access latency.

  25. AMS analytical model. AMS estimates application preferences using total memory access latency. [Chart: miss curve from hardware monitors, # misses vs. LLC capacity (MB)]

  26. AMS analytical model. AMS estimates application preferences using total memory access latency. The deep hierarchy has a shared LLC: Lat = (# accesses x Latency of LLC) + (# misses x Latency of deep mem). [Chart: miss curve from hardware monitors, # misses vs. LLC capacity (MB)]

  27. AMS analytical model. AMS estimates application preferences using total memory access latency. The deep hierarchy has a shared LLC: Lat = (# accesses x Latency of LLC) + (# misses x Latency of deep mem), where # misses is a function of LLC capacity. [Chart: miss curve from hardware monitors, # misses vs. LLC capacity (MB)]

  28. AMS analytical model. AMS estimates application preferences using total memory access latency. The deep hierarchy has a shared LLC: Lat = (# accesses x Latency of LLC) + (# misses x Latency of deep mem), where # misses is a function of LLC capacity. [Charts: miss curve from hardware monitors (# misses vs. LLC capacity in MB); latency curve model for the processor-die core (latency vs. LLC capacity in MB)]

  29. AMS analytical model. AMS estimates application preferences using total memory access latency. The deep hierarchy has a shared LLC: Lat = (# accesses x Latency of LLC) + (# misses x Latency of deep mem), where # misses is a function of LLC capacity. The shallow hierarchy has no shared LLC: Lat = # accesses x Latency of shallow mem. [Charts: miss curve from hardware monitors (# misses vs. LLC capacity in MB); latency curve model with a flat curve for the NDP core and a decreasing curve for the processor-die core]

  30. AMS analytical model. AMS estimates application preferences using total memory access latency. The deep hierarchy has a shared LLC: Lat = (# accesses x Latency of LLC) + (# misses x Latency of deep mem), where # misses is a function of LLC capacity. The shallow hierarchy has no shared LLC: Lat = # accesses x Latency of shallow mem. [Charts: miss curve from hardware monitors (# misses vs. LLC capacity in MB); latency curve model with a flat curve for the NDP core and a decreasing curve for the processor-die core; where the processor-die curve is lower, use a processor-die core, otherwise use an NDP core]
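The two latency formulas above are simple enough to compute directly from a miss curve. The sketch below does so in Python; the latency constants and the example miss curve are made-up illustrative numbers, not values from the talk.

```python
# Illustrative latencies (ns); the real values depend on the system.
LLC_HIT_NS     = 20   # assumed shared-LLC hit latency (deep hierarchy)
DEEP_MEM_NS    = 70   # assumed main-memory latency behind the deep hierarchy
SHALLOW_MEM_NS = 35   # assumed latency of near-data (shallow) memory

def deep_latency(accesses, miss_curve):
    """Deep hierarchy: Lat = accesses * Lat_LLC + misses(size) * Lat_deepMem.
    Returns a latency curve, one total per candidate LLC allocation."""
    return {size: accesses * LLC_HIT_NS + misses * DEEP_MEM_NS
            for size, misses in miss_curve.items()}

def shallow_latency(accesses):
    """Shallow hierarchy: no shared LLC, so Lat = accesses * Lat_shallowMem,
    a flat curve independent of LLC capacity."""
    return accesses * SHALLOW_MEM_NS

# Example: a cache-friendly thread whose misses drop quickly with capacity.
miss_curve = {2: 900, 4: 300, 6: 120, 8: 50}   # misses per 1000 accesses
deep = deep_latency(1000, miss_curve)           # {2: 83000, ..., 8: 23500}
shallow = shallow_latency(1000)                 # 35000
# With an 8MB share, the deep hierarchy wins: prefer a processor-die core.
print("deep" if deep[8] < shallow else "shallow")
```

A streaming thread with a flat, high miss curve would flip the comparison and prefer the shallow hierarchy, matching the crossover shown on the slide.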

  31. Handling heterogeneous cores. Combine the model from prior work (PIE) with our memory latency model.

  32. Handling heterogeneous cores. Combine the model from prior work (PIE) with our memory latency model. [Chart: memory latency curves for the NDP core and the processor-die core vs. LLC capacity (MB)]

  33. Handling heterogeneous cores. Combine the model from prior work (PIE) with our memory latency model. [Charts: memory latency curves for the NDP core and the processor-die core vs. LLC capacity (MB), weighted by MLP to produce memory stall curves (memory stalls vs. LLC capacity)]
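One simple way to realize the "weigh by MLP" step on this slide is to divide the total latency at each LLC capacity by the thread's measured memory-level parallelism, since overlapped misses hide a proportional share of the latency. The sketch below assumes exactly that reading; the MLP value and latency curves are illustrative, and the PIE-style core model that the talk combines this with is not reproduced here.

```python
def stall_curve(latency_curve, mlp):
    """Weigh a memory latency curve by MLP: with roughly `mlp` misses
    outstanding at once, the core stalls for about latency / mlp."""
    return {size: lat / mlp for size, lat in latency_curve.items()}

# Illustrative curves (ns per interval), e.g. from the previous sketch:
deep_lat = {2: 83000, 4: 41000, 6: 28400, 8: 23500}  # processor-die core
ndp_lat  = {size: 35000 for size in deep_lat}        # NDP core: flat curve

# A thread with MLP = 2 hides half of each curve's latency in stalls.
print(stall_curve(deep_lat, mlp=2.0))
print(stall_curve(ndp_lat, mlp=2.0))
```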
