A Mixture of Experts Approach for Runtime Mapping in Dynamic Environments


  1. A Mixture of Experts Approach for Runtime Mapping in Dynamic Environments. Murali Emani, School of Informatics, University of Edinburgh

  2. Modern computing hardware: diverse, stochastic, evolving

  3. Parallelism Mapping: program, computation steps, hardware

  4. Parallelism Mapping: program, hardware; environment: workloads, software, data, hardware

  5. Parallelism Mapping: program, hardware; environment: workloads, software, data, hardware. Program performance is sensitive to the environment.

  6. What exactly is the problem? Optimal partitioning of the parallel work is not static and is non-trivial.

  7. What exactly is the problem? Existing approaches are based on a one-size-fits-all policy.

  8. What exactly is the problem? Existing approaches are based on a one-size-fits-all policy ➔ Not suitable for dynamic environments ➔ Hard to extend and update

  9. Goals ➔ Determine optimal resources for a parallel program: avoid under-subscription / over-subscription ➔ Enable program auto-tuning: adapt smartly to varying resources ➔ Program and platform aware: generic and portable

  10. Where does it fit in the stack? Application, Runtime, Operating System, Hardware

  11. State Space

  12. Idea ➔ Identify the best mapping policy in each set

  13. Idea ➔ Identify the best mapping policy in each set (experts E1, E2, ..., Ek-1, Ek)

  14. Idea ➔ Collect these policies into an ensemble E1, E2, ..., Ek

  15. Idea ➔ Choose the best policy from the ensemble E1, E2, ..., Ek based on the current state

  16. Idea ➔ Choose the best policy from the ensemble E1, E2, ..., Ek based on the current state

  17. Mixture of Experts based Mapping ➔ Ensemble of experts (mapping policies) ➔ Smart way to select the best expert at runtime ➔ Combine offline prior models with online learning
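
A minimal sketch of the idea in Python (hypothetical; the class and function names are invented for illustration, not the speaker's implementation): each expert is a mapping policy that proposes a thread count for a given environment, and a selector picks one expert per decision.

    # Hypothetical sketch of a mixture-of-experts mapper.
    class Expert:
        """One mapping policy, trained offline for one region of the state space."""
        def __init__(self, name, policy):
            self.name = name
            self.policy = policy              # function: environment features -> # threads

        def propose_threads(self, env):
            return self.policy(env)

    class MixtureOfExpertsMapper:
        def __init__(self, experts, selector):
            self.experts = experts            # ensemble of mapping policies
            self.selector = selector          # function: environment -> index of the best expert

        def map(self, env):
            return self.experts[self.selector(env)].propose_threads(env)

    # Toy ensemble: one expert for an idle machine, one for a loaded machine.
    idle = Expert("idle", lambda env: env["processors"])
    loaded = Expert("loaded", lambda env: max(1, env["processors"] - env["workload_threads"]))
    mapper = MixtureOfExpertsMapper(
        [idle, loaded],
        selector=lambda env: 0 if env["workload_threads"] == 0 else 1)
    print(mapper.map({"processors": 32, "workload_threads": 8}))   # -> 24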

  18. Mixture of Experts based Mapping: each expert (Expert 1, Expert 2, ..., Expert k) predicts a # threads

  19. Mixture of Experts based Mapping. How to select the best expert? It is expensive to evaluate the program with the # threads of all experts.

  20. Mixture of Experts based Mapping. How to select the best expert? It is expensive to evaluate the program with the # threads of all experts; introduce an environment predictor.

  21. Mixture of Experts based Mapping. How to select the best expert? It is expensive to evaluate the program with the # threads of all experts; the environment predictor supplies the environment input to each expert.

  22. Predictive Modelling. Environment predictor: what should the environment look like? Thread predictor: what is the best # threads?

  23. Predictive Modelling. Environment predictor: what should the environment look like? Thread predictor: what is the best # threads? Input feature vector = <code, environment>, i.e. f = (c, e)
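
As a rough illustration (the concrete feature names below are assumptions, not the exact features used in the talk), the predictor input concatenates static code features c with dynamic environment features e:

    # Hypothetical feature vector f = (c, e).
    code_features = {              # c: static, extracted from the program
        "num_instructions": 1.2e6,
        "num_branches": 4.0e4,
        "num_load_store": 3.1e5,
    }
    env_features = {               # e: dynamic, sampled from the OS at runtime
        "workload_threads": 8,
        "run_queue_size": 12,
        "cpu_load": 0.75,
    }
    f = list(code_features.values()) + list(env_features.values())   # input to both predictors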

  24. Approach – Machine Learning

  25. Approach – Machine Learning ➔ Hand-crafted solutions are infeasible. Pipeline: training data → pre-processing → learning algorithm → model → prediction on new input

  26. Approach – Machine Learning ➔ Hand-crafted solutions are infeasible. Pipeline: training data → pre-processing → learning algorithm → model → prediction on new input ➔ Train offline, deploy online ➔ Supervised learning, cross-validated ➔ Trained on NAS, evaluated on additional benchmarks (* Training overhead: one-off cost of 9216 experiments)

  27. Training phase ➔ Various configurations of program pairs and # threads: 9216 experiments; 3 weeks for runs; 1.1 GB log ➔ Feature-space dimensionality reduction via information gain: a rich subset of 10 out of 154 features ➔ Linear regression models
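
A minimal sketch of that training recipe, assuming scikit-learn and synthetic placeholder data (mutual information stands in here for the information-gain criterion):

    # Sketch only: pick the 10 most informative of 154 features, then fit a linear model.
    import numpy as np
    from sklearn.feature_selection import SelectKBest, mutual_info_regression
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    X = rng.random((9216, 154))        # one row per training experiment (placeholder data)
    y = rng.integers(1, 33, 9216)      # target: best observed # threads (placeholder)

    selector = SelectKBest(mutual_info_regression, k=10).fit(X, y)
    X_small = selector.transform(X)    # the 10 / 154 feature subset

    model = LinearRegression()
    print(cross_val_score(model, X_small, y, cv=10).mean())   # cross-validated fit
    model.fit(X_small, y)              # final model used by the runtime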

  28. Features. Static (code): # instructions, # branches, # load/store. Dynamic (environment): # workload threads, # processors, run queue size, CPU load, page free-list rate, cached memory
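
The dynamic side can be sampled cheaply from the operating system; a rough Linux-only sketch (the exact sources the runtime reads are an assumption):

    # Hypothetical sampling of the dynamic environment features on Linux.
    import os

    def sample_environment():
        env = {"processors": os.cpu_count()}
        with open("/proc/loadavg") as f:
            load1, _, _, runq, _ = f.read().split()[:5]
            env["cpu_load"] = float(load1)                       # 1-minute load average
            env["run_queue_size"] = int(runq.split("/")[0])      # currently runnable tasks
        with open("/proc/meminfo") as f:
            info = dict(line.split(":")[:2] for line in f)
            env["free_memory_kb"] = int(info["MemFree"].split()[0])
            env["cached_memory_kb"] = int(info["Cached"].split()[0])
        return env

    print(sample_environment())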

  29. How to select the best expert? Online expert selector: select expert 'k'

  30. How to select the best expert? Online expert selector: select expert 'k'. Use the 'Environment predictor' as a proxy to select the best mapping policy.
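
A hedged sketch of how such a selector could work, reusing the Expert class and sample_environment() from the sketches above (region_score is an invented placeholder for "how well this expert's region of the state space matches the predicted state"):

    # Hypothetical online expert selector: the environment predictor acts as a proxy.
    def select_expert(experts, env_predictor, current_env):
        predicted_env = env_predictor(current_env)     # what the environment will look like
        # Pick the expert trained for the predicted region of the state space,
        # instead of trial-running the program once per expert (too expensive).
        return max(experts, key=lambda e: e.region_score(predicted_env))

    def map_parallel_region(program, experts, env_predictor):
        env = sample_environment()                     # dynamic features
        expert = select_expert(experts, env_predictor, env)
        n_threads = expert.propose_threads(env)
        program.run(num_threads=n_threads)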

  31. All put together...

  32. How many experts?

  33. How many experts? Open question

  34. Started with 4 experts

  35. Evaluation. Platform: 32-core Intel Xeon. Benchmarks: NAS, SpecOMP, Parsec (OpenMP). Comparison: OpenMP default, Online, Offline, Analytic. Workloads: small (light), large (heavy). Hardware: low, high frequency. Online: "Parcae: A System for Flexible Parallel Execution", A. Raman, A. Zaks, J. W. Lee, and D. I. August, PLDI'12. Offline: "Smart, Adaptive Mapping of Parallelism in the Presence of External Workload", Murali Krishna Emani, Zheng Wang and Michael O'Boyle, CGO'13. Analytic: "Adaptive, Efficient, Parallel Execution of Parallel Programs", S. Sridharan, G. Gupta, and G. S. Sohi, PLDI'14.

  36. Results: 1.17x over analytic, 1.26x over offline, 1.38x over online

  37. Why multiple experts? Why not a single model M instead of the experts E1, E2, ..., Ek?

  38. Why multiple experts? Why not a single model? Multiple experts outperform a single model.

  39. Can this approach be used with other optimization techniques?

  40. Can this approach be used with other optimization techniques? Affinity-based scheduling

  41. To sum up... Developed an approach for smart parallelism mapping ➔ Adaptive to dynamic environments ➔ Predictive modelling at its heart ➔ Environment predictor as a proxy to select the best mapping policy

  42. What next? ➔ Integrating this concept in CnC ➔ Focus on the tuning component ➔ Runtime and application tuning ➔ Dynamic partitioning of resources to steps

  43. Idea: instances of computations (steps): Step 1, Step 2, Step 3, Step 4 ➔ Varying resource requirements for steps ➔ Mapping depends on when data is ready
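
Purely to illustrate that future direction (this is not a real CnC API; all names are invented): available threads could be re-partitioned among the step instances whose input data is ready, weighted by each step's estimated demand.

    # Hypothetical dynamic partitioning of threads among CnC-style steps.
    def partition_threads(ready_steps, total_threads):
        total_demand = sum(s["demand"] for s in ready_steps) or 1
        return {s["name"]: max(1, round(total_threads * s["demand"] / total_demand))
                for s in ready_steps}

    steps = [{"name": "step1", "demand": 4, "ready": True},
             {"name": "step2", "demand": 1, "ready": True},
             {"name": "step3", "demand": 2, "ready": False}]    # data not ready yet
    print(partition_threads([s for s in steps if s["ready"]], total_threads=32))
    # -> {'step1': 26, 'step2': 6}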

  44. Take away ➔ One-size-fits-none ➔ A bag of multiple policies is more practical than one ➔ Machine learning can help! Thank you. Murali Emani, University of Edinburgh, m.k.emani@sms.ed.ac.uk

  45. Backup

  46. Adaptive Parallelism Mapping ➔ Program performance is sensitive to the environment ● Inherent behavior: various characteristics, compute/memory/disk bound ● Target hardware: large number of components, increased chances of failure ● Software: recurring upgrades, version compatibility ● Data: varying amount of I/O, scalability issues

  48. All experts use the same features; the features vary in importance across experts.

  50. Evaluation. Platform: 32-core Intel Xeon, 4 one-socket nodes, 8 cores/socket, 3.7.10 kernel. Compiler: gcc 4.6 "-O3 -fopenmp". Benchmarks: NAS, SpecOMP, Parsec (OpenMP). Comparison: OpenMP default, Online, Offline, Analytic. Workloads: small (light), large (heavy). Hardware: low, high frequency. Online: "Parcae: A System for Flexible Parallel Execution", A. Raman, A. Zaks, J. W. Lee, and D. I. August, PLDI'12. Offline: "Smart, Adaptive Mapping of Parallelism in the Presence of External Workload", Murali Krishna Emani, Zheng Wang and Michael O'Boyle, CGO'13. Analytic: "Adaptive, Efficient, Parallel Execution of Parallel Programs", S. Sridharan, G. Gupta, and G. S. Sohi, PLDI'14.

  51. What is the effect of increasing # experts? Graceful addition of experts. What about # experts > 4? Needs more analysis.
