Online Auto-Tuning (Ray S. Chen, Jeffrey K. Hollingsworth)



SLIDE 1

ANGEL: A Hierarchical Approach to Online Auto-Tuning

Ray S. Chen Jeffrey K. Hollingsworth

SLIDE 2

Motivation

  • HPC systems will require online auto-tuning

– Managing billion-way parallelism is non-trivial

  • Cannot myopically focus on wall-time

– 20MW power goal represents additional hurdle

  • Need an auto-tuner that is:

– Coordinated (Managed by the runtime OS)
– Online (Optimization occurs without training runs)
– Multi-objective (Handle power as well as wall-time)

SLIDE 3

Dealing with Multiple Objectives

  • Multi-objective problems have a set of solutions

– Each solution in the set is equivalent (no one solution is best in every objective)

  • Optimal solution is subjective

– Tuner cannot choose for the user

  • Online tuning even harder

– Cannot pause for user input
– Must limit overhead of testing
– Use as few evaluations as possible
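
The "set of solutions" idea can be illustrated with a small non-dominated filter. This is a minimal sketch, assuming both objectives are minimized; the function names and the (wall-time, energy) measurements are hypothetical and do not come from the slides.

```python
from typing import List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a is no worse than b in every objective and strictly better in one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def non_dominated(points: List[Sequence[float]]) -> List[Sequence[float]]:
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points if q is not p)]


# Hypothetical (wall-time in seconds, energy in joules) measurements.
candidates = [(10.0, 500.0), (12.0, 400.0), (11.0, 550.0), (15.0, 390.0)]
front = non_dominated(candidates)
print(front)
```

Three of the four points survive: each trades time against energy, so choosing among them needs a user preference.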

SLIDE 4

ANGEL Inputs

  • Two values per objective collected from the user a priori

– Priority Rank

  • Orders each objective from highest to lowest
  • Each rank must be unique

– Leeway Percentage

  • Amount ANGEL may stray from this objective’s best
  • Used to find improvements in other objectives
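
One way to hold these two inputs is a small record per objective. This is a sketch only: the `ObjectiveSpec` type, the `validate` helper, the objective names, and the choice to store leeway as a fraction (0.10 for 10%) are assumptions, not ANGEL's actual interface.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ObjectiveSpec:
    name: str
    rank: int      # priority rank; 1 is the highest priority (assumed convention)
    leeway: float  # leeway as a fraction, e.g. 0.10 for 10% (assumed encoding)


def validate(specs: List[ObjectiveSpec]) -> List[ObjectiveSpec]:
    """Check the rules from the slide and return specs ordered by priority."""
    ranks = [s.rank for s in specs]
    if len(set(ranks)) != len(ranks):
        raise ValueError("each priority rank must be unique")
    for s in specs:
        if s.leeway < 0.0:
            raise ValueError(f"leeway for {s.name} must be non-negative")
    return sorted(specs, key=lambda s: s.rank)


specs = validate([
    ObjectiveSpec("energy", rank=2, leeway=0.0),
    ObjectiveSpec("wall_time", rank=1, leeway=0.10),
])
print([s.name for s in specs])
```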

SLIDE 5

ANGEL Algorithm

  • Begin with highest priority objective

– Use single-objective algorithm for this objective alone
– Record all value ranges (min, max) during sub-search
– Repeat with next highest objective until all are searched

  • Penalize sub-searches to maintain leeway preference

– Applied when higher priority objective exceeds leeway
– Allows upper level sub-searches to guide lower levels

  • Result of final sub-search is the overall solution
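
The steps above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: an exhaustive scan over a candidate list stands in for the single-objective algorithm, the leeway threshold is taken to be the sub-search's best value plus the leeway fraction of the recorded (min, max) range, and the additive penalty with a fixed `penalty_weight` is hypothetical. The two toy objectives are invented for the example.

```python
from typing import Callable, List, Sequence, Tuple

Objective = Callable[[float], float]


def hierarchical_search(
    objectives: Sequence[Objective],  # ordered from highest to lowest priority
    leeways: Sequence[float],         # leeway fraction for each objective
    candidates: Sequence[float],
    penalty_weight: float = 1e3,      # hypothetical constant
) -> float:
    limits: List[Tuple[Objective, float]] = []  # thresholds from earlier sub-searches
    best_x = candidates[0]
    for f, leeway in zip(objectives, leeways):
        lo, hi = float("inf"), float("-inf")
        best_x, best_score = candidates[0], float("inf")
        for x in candidates:  # stand-in for a real single-objective search
            value = f(x)
            lo, hi = min(lo, value), max(hi, value)  # record the value range
            # Penalize points where a higher-priority objective exceeds its threshold.
            # (A real tuner would measure all objectives in one run, not re-evaluate.)
            excess = sum(max(0.0, g(x) - limit) for g, limit in limits)
            score = value + penalty_weight * excess
            if score < best_score:
                best_x, best_score = x, score
        # Assumed threshold: best value found plus leeway times the observed range.
        limits.append((f, f(best_x) + leeway * (hi - lo)))
    return best_x  # result of the final sub-search


def time_cost(x: float) -> float:    # toy objective, best at x = 0.5
    return (x - 0.5) ** 2


def energy_cost(x: float) -> float:  # toy objective, best at x = 1.5
    return (x - 1.5) ** 2


grid = [i / 10 for i in range(21)]   # x in [0, 2]
result = hierarchical_search([time_cost, energy_cost], [0.10, 0.0], grid)
print(result)
```

With a 10% leeway on the toy time objective, the second sub-search moves toward the energy optimum but stops at the last grid point that stays under the time threshold; with zero leeway it stays at the time optimum.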

SLIDE 6

ANGEL Penalty Function

  • One-dimensional example with two objectives

SLIDE 7

Numerical Testsuite Experiments

  • Tests from multi-objective optimization literature

– Designed to be difficult, but not pathological

  • Compared against ParEGO

– Represents best evolutionary algorithm for our case
– Strives to use very few function evaluations
– Geared towards (relatively) low-dimensional objectives

  • Compared against random

– Must ensure our algorithm does something intelligent

SLIDE 8

Testsuite Results – Quality

  • Quality is a measure of the converged solution.

– Distance from the best solution discovered by hand.

  • ANGEL wins on two-thirds of testsuite.

[Chart: Converged Distance from Optimal (Normalized) for Random, ParEGO, and ANGEL on KNO1, OKA1, OKA2, VLMOP2, VLMOP3, DTLZ1a, DTLZ2a, DTLZ4a, and DTLZ7a. Axis ticks run from 0.5 to 5; a value of 43.39 is annotated separately.]

SLIDE 9

Testsuite Results – Efficiency

  • Efficiency is a measure of search overhead.

– Critically important to keep low for online auto-tuning.

  • ANGEL wins on all but one test.

[Chart: Distance from Optimal per Evaluation (Normalized) for Random, ParEGO, and ANGEL on KNO1, OKA1, OKA2, VLMOP2, VLMOP3, DTLZ1a, DTLZ2a, DTLZ4a, and DTLZ7a. Axis ticks run from 0.5 to 4; a value of 7.04 is annotated separately.]

SLIDE 10

LULESH Experiments

  • Lawrence Livermore’s LULESH proxy application

– Unstructured hex mesh problem

  • Tuning two input variables:

– OpenACC loop vector length
– GPU clock frequency

  • Two objectives:

– Minimize running time
– Minimize energy consumption
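
A tuning setup of this shape could be expressed as a two-variable search space plus a function that measures both objectives in one run. This is a sketch under assumptions: the specific vector lengths and clock values, the `run` callback, and the `read_energy_joules` cumulative-energy counter are all hypothetical placeholders, not the experiment's actual settings or tooling.

```python
import itertools
import time
from typing import Callable, Dict, Iterator, Tuple

VECTOR_LENGTHS = (32, 64, 128, 256)  # hypothetical OpenACC vector lengths
GPU_CLOCKS_MHZ = (600, 700, 800)     # hypothetical GPU clock settings

Config = Dict[str, int]


def search_space() -> Iterator[Config]:
    """Enumerate every combination of the two tuning variables."""
    for vec, clock in itertools.product(VECTOR_LENGTHS, GPU_CLOCKS_MHZ):
        yield {"vector_length": vec, "gpu_clock_mhz": clock}


def evaluate(
    config: Config,
    run: Callable[[Config], None],           # hypothetical: runs the application
    read_energy_joules: Callable[[], float]  # hypothetical: cumulative energy counter
) -> Tuple[float, float]:
    """Return (running time in seconds, energy in joules) for one run."""
    energy_before = read_energy_joules()
    start = time.perf_counter()
    run(config)
    elapsed = time.perf_counter() - start
    return elapsed, read_energy_joules() - energy_before


# Usage with stand-in callbacks; a real setup would launch the application here.
fake_counter = iter([100.0, 130.0])
elapsed, joules = evaluate(
    {"vector_length": 64, "gpu_clock_mhz": 700},
    run=lambda cfg: None,
    read_energy_joules=lambda: next(fake_counter),
)
print(len(list(search_space())), joules)
```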

SLIDE 11

LULESH Objective Landscapes

SLIDE 12

Changing the Threshold

  • ANGEL behaves properly for changing leeways

– Energy usage declines along with leeway
– Shows proper behavior for real HPC data

SLIDE 13

Conclusion and Future Work

  • ANGEL is a step towards runtime system auto-tuning

– Uses an iterative and hierarchical approach
– Controlled by simple user inputs provided a priori
– Performs well on numerical testsuite
– Shown to work correctly on real HPC data

  • Future work

– Power (rather than energy) studies
– Alternate underlying single-objective algorithms
– Explore avenues for parallelism
