Benchmark Design for Robust Profile-Directed Optimization
SPEC Workshop 2007, Paul Berube and José Nelson Amaral, University of Alberta


  1. Benchmark Design for Robust Profile-Directed Optimization
     SPEC Workshop 2007
     Paul Berube and José Nelson Amaral, University of Alberta
     NSERC, Alberta Ingenuity, iCore
     January 21, 2007

  2. In this talk
     • SPEC: SPEC CPU
     • PDF: offline, profile-guided optimization
     • Test: evaluate
     • Data/Inputs: program input data

  3. PDF in Research
     • SPEC benchmarks and inputs are used, but the rules are seldom followed exactly
       – PDF will continue regardless of its admissibility in reported results
     • Some degree of profiling is taken as a given in much recent compiler and architecture work

  4. An Opportunity to Improve
     • No PDF for base in CPU2006
       – An opportunity to step back and consider
     • The current evaluation methodology for PDF is not rigorous
       – Dictated by the inputs and rules provided in SPEC CPU
       – Usually followed when reporting PDF research

  5. Current Methodology – Static optimization
     [Diagram] Flag tuning drives the optimizing compiler; the resulting binary is tested on input.ref, producing peak_static.

  6. Current Methodology – PDF optimization
     [Diagram] The instrumenting compiler builds a binary that is trained on input.train to collect a profile; flag tuning and the profile drive the PDF optimizing compiler; the resulting binary is tested on input.ref, producing peak_pdf.

  7. Current Methodology – PDF optimization
     [Diagram as on slide 6]
     if (peak_pdf > peak_static)
         peak := peak_pdf;

  8. Current Methodology – PDF optimization
     [Diagram as on slide 6]
     if (peak_pdf > peak_static)
         peak := peak_pdf;
     else
         peak := peak_static;

  9. Current Methodology – PDF optimization
     [Same diagram and peak-selection pseudocode as slide 8]
     • Is this comparison sound? (peak_pdf > peak_static), (peak_pdf > other_pdf)
     • Does 1 training and 1 test input predict PDF performance?

  10. Current Methodology – PDF optimization
     [Same diagram and peak-selection pseudocode as slide 8]
     • Is this comparison sound? (peak_pdf > peak_static), (peak_pdf > other_pdf)
     • Does 1 training and 1 test input predict PDF performance?
     • Variance between inputs can be larger than reported improvements!

  11. bzip2 – Train on xml, speedup vs. static (combined)
     [Chart] Per-input speedup vs. static varies widely (roughly -6% to 12%, with one input above 14%) across the inputs: compressed, docs, gap, graphic, jpeg, log, mp3, mpeg, pdf, program, random, reuters, source, xml.

  12. PDF is like Machine Learning
     • Complex parameter space
     • Limited observed data (training)
     • Adjust parameters to match the observed data
       – Maximize expected performance

  13. Evaluation of Learning Systems
     • Must take sensitivity to training and evaluation inputs into account
       – PDF specializes code according to the training data
       – Changing inputs can greatly alter performance
     • Performance results must have statistical significance measures
       – Differentiate between gains/losses and noise

  14. Overfitting
     • Specializing for the training data too closely
     • Exploiting particular properties of the training data that do not generalize
     • Causes:
       – Insufficient quantity of training data
       – Insufficient variation among training data
       – Deficient learning system

  15. Overfitting
     • Currently:
       ✗ Engineer the compiler not to overfit the single training input (underfitting)
       ✗ No clear rules for input selection
       ✗ Some benchmark authors replicate data between train and ref
     • Overfitting can be rewarded!

  16. Criteria for Evaluation
     • Predict expected future performance
     • Measure performance variance
     • Do not reward overfitting
     • Same evaluation criteria as ML
       – Cross-validation addresses these criteria

  17. Cross-Validation
     • Split a collection of inputs into two or more non-overlapping sets
     • Train on one set, test on the other set(s)
     • Repeat, using a different set for training
     [Diagram: inputs partitioned into Train and Test sets]

  18. Leave-one-out Cross-Validation
     • If little data is available, reduce the test set to 1 input
       – Leave N out: only N inputs in the test set
     [Diagram: all inputs but one in Train, the remaining input in Test]

  19. Cross-Validation
     • The same data is NEVER in both the training and the testing set
       – Overfitting will not enhance performance
     • Multiple evaluations allow statistical measures to be calculated on the results
       – Standard deviation, confidence intervals...
     • A set of training inputs allows the system to exploit commonalities between inputs
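A minimal sketch of the split-train-test loop described on slides 17-19, in Python. The helper name cross_validation_splits is illustrative and not from the talk; the fold contents below are the sets A, B, C used in the example on slide 23.

    from typing import Iterable, List, Tuple

    def cross_validation_splits(folds: List[List[str]]) -> Iterable[Tuple[List[str], List[str]]]:
        """Given non-overlapping sets of inputs, yield (train, test) pairs:
        train on one set, test on the union of the other sets, then repeat
        with a different set used for training. No input ever appears on
        both the training and the testing side of the same split."""
        for i, train in enumerate(folds):
            test = [name for j, fold in enumerate(folds) for name in fold if j != i]
            yield list(train), test

    # The three sets from the example on slide 23:
    folds = [["jpeg", "xml", "pdf"],       # A
             ["mpeg", "html", "source"],   # B
             ["text", "doc", "program"]]   # C
    for train, test in cross_validation_splits(folds):
        print("train on", train, "-> test on", test)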

  20. Proposed Methodology
     • PDFPeak score, distinct from peak
       – Report with standard deviation
     • Provide a PDF workload
       – Inputs are used for both training and evaluation, so they are "medium" sized (~2 min running time)
       – 9 inputs are needed for meaningful statistical measures

  21. Proposed Methodology
     • Split the inputs into 3 sets (at design time)
     • For each input in each evaluation, calculate the speedup compared to (non-PDF) peak
     • Calculate (over all evaluations):
       – mean speedup
       – standard deviation of the speedups
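A minimal sketch of this scoring step, assuming the per-input speedups (in percent, relative to the non-PDF peak binary) have already been measured across all cross-validation runs; the function name pdfpeak_score is illustrative, not from the talk.

    import statistics

    def pdfpeak_score(speedups_percent):
        """Aggregate every cross-validation result into the proposed score:
        the mean speedup over all evaluations, reported together with its
        sample standard deviation so readers can judge how stable the
        improvement is."""
        mean = statistics.mean(speedups_percent)
        stdev = statistics.stdev(speedups_percent)  # sample standard deviation (n - 1)
        return mean, stdev

Applied to the 18 results of the worked example on slides 26-30, this yields the 2.33% mean and 2.30% standard deviation reported there (see the check after slide 30).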

  22. Example
     PDF Workload (9 inputs): jpeg, mpeg, xml, html, text, doc, pdf, source, program

  23. Example – Split the workload
     PDF Workload (9 inputs): jpeg, mpeg, xml, html, text, doc, pdf, source, program
     • Set A: jpeg, xml, pdf
     • Set B: mpeg, html, source
     • Set C: text, doc, program

  24. Example – Train and Run
     [Diagram] The instrumenting compiler builds a binary that is trained on set A.

  25. Example – Train and Run
     [Diagram] Training on set A produces Profile(A), which is fed to the PDF optimizing compiler.

  26. Example – Train and Run
     Train on A, test on B+C (using Profile(A)):
       mpeg     1%
       html     5%
       text     4%
       doc     -3%
       source   4%
       program  2%

  27. Example – Train and Run
     Train on B, test on A+C (using Profile(B)):
       jpeg     4%
       xml     -1%
       pdf      4%
       text     5%
       doc      1%
       program  1%
     (The slide also carries over the cumulative results from the previous run.)

  28. Example – Train and Run
     Train on C, test on A+B (using Profile(C)):
       jpeg     5%
       xml      2%
       pdf      3%
       mpeg    -1%
       html     3%
       source   3%
     (The slide also carries over the cumulative results from the previous runs.)

  29. Example – Evaluate
     All 18 results (each input is tested twice):
       doc 1%, -3%; html 3%, 5%; jpeg 5%, 4%; mpeg -1%, 1%; pdf 3%, 4%;
       program 1%, 2%; source 3%, 4%; text 5%, 4%; xml -1%, 2%
     Average: 2.33

  30. Example – Evaluate
     All 18 results (each input is tested twice):
       doc 1%, -3%; html 3%, 5%; jpeg 5%, 4%; mpeg -1%, 1%; pdf 3%, 4%;
       program 1%, 2%; source 3%, 4%; text 5%, 4%; xml -1%, 2%
     Average: 2.33
     Std. Dev: 2.30
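A quick check of these statistics, reproducing the reported mean and (sample) standard deviation from the 18 results listed above; nothing here is from the talk beyond the numbers themselves.

    import statistics

    # The 18 per-input speedups (percent) from the example on slides 26-30.
    results = [ 1, -3,   # doc
                3,  5,   # html
                5,  4,   # jpeg
               -1,  1,   # mpeg
                3,  4,   # pdf
                1,  2,   # program
                3,  4,   # source
                5,  4,   # text
               -1,  2]   # xml

    print(f"{statistics.mean(results):.2f}")   # 2.33
    print(f"{statistics.stdev(results):.2f}")  # 2.30 (sample standard deviation)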

  31. Example – Evaluate
     (Results and statistics as on slide 30: average 2.33, std. dev. 2.30.)
     PDF improves performance:
     • 2.33 ± 2.30%, 17 times out of 25
     • 2.33 ± 4.60%, 19 times out of 20

  32. Example – Evaluate
     PDF improves performance:
     • 2.33 ± 2.30%, 17 times out of 25
     • 2.33 ± 4.60%, 19 times out of 20
     (peak_pdf > peak_static)? (new_pdf > other_pdf)?
     Depends on the mean and variance of both!
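The "17 times out of 25" and "19 times out of 20" odds line up with the usual one- and two-standard-deviation coverage of a normal distribution. A quick check of that reading, assuming the speedups are roughly normally distributed (an assumption, not something the slides state):

    import math

    # For a normal distribution, P(|X - mean| <= k * stdev) = erf(k / sqrt(2)).
    for k, odds in [(1, "17/25"), (2, "19/20")]:
        p = math.erf(k / math.sqrt(2))
        print(f"within {k} std. dev.: {p:.3f}  (slide quotes about {odds})")
    # Prints roughly 0.683 and 0.954, i.e. about 17 out of 25 and 19 out of 20.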

  33. Pieces of Effective Evaluation
     • A workload of inputs
     • Education about input selection
       – Rules and guidelines for benchmark authors
     • Adoption of a new methodology for PDF evaluation

  34. Practical Concerns
     • Benchmark user
       – Many additional runs, but on smaller inputs
       – Two additional program compilations
     • Benchmark author
       – Most INT benchmarks already use multiple inputs, and/or additional data is easily available
       – The PDF input set could be used for REF

  35. Conclusion
     • PDF is here: important for compilers and architecture, in research and in practice
     • The current methodology for PDF evaluation is not reliable
     • Proposed a methodology for meaningful evaluation

  36. Thanks. Questions?
