  1. Performance Assessment in Optimization Anne Auger, CMAP & Inria

  2. Visualization and presentation of single runs

  3. Displaying 3 runs (three trials)

  4. Displaying 3 runs (three trials)

  5. Displaying 3 runs (three trials)

  6. Displaying 51 runs

  7. Which Statistics?

  8. More problems with averages / expectations (from Hansen, GECCO 2019 Experimentation tutorial)

  9. Which Statistics?

  10. Implications (from Hansen, GECCO 2019 Experimentation tutorial)

  11. Benchmarking Black-Box Optimizers. Benchmarking: running an algorithm on several test functions in order to evaluate its performance

  12. Why Numerical Benchmarking? Evaluate the performance of optimization algorithms; compare the performance of different algorithms; understand strengths and weaknesses of algorithms; help in the design of new algorithms

  13. On performance measures …

  14. Performance measure - What to measure? CPU time (to reach a given target) has drawbacks: it depends on the implementation, the language, and the machine, and time is spent on code optimization instead of science ("Testing Heuristics: We Have It All Wrong", J.N. Hooker, Journal of Heuristics, 1995). Prefer an "absolute" value: the number of function evaluations to reach a given target, under the assumption that the internal cost of the algorithm is negligible or measured independently
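
Not part of the original slides: a minimal Python sketch of the fixed-target measurement just described, counting function evaluations until a target f-value is reached. The objective (a sphere function), the target, and the random-search loop are placeholders for illustration only.

    # Sketch only: count evaluations until f(x) <= target (first hitting time).
    # The objective, target and "optimizer" (pure random search) are placeholders.
    import numpy as np

    def sphere(x):
        return float(np.sum(np.asarray(x) ** 2))

    def fevals_to_target(objective, target, dimension, budget, seed=None):
        """Number of evaluations until objective(x) <= target, or None if the
        budget is exhausted (unsuccessful run)."""
        rng = np.random.default_rng(seed)
        for evals in range(1, budget + 1):
            x = rng.uniform(-5, 5, dimension)   # stand-in for an optimizer's proposal
            if objective(x) <= target:
                return evals                    # fixed-target runtime of this run
        return None

    print(fevals_to_target(sphere, target=0.1, dimension=2, budget=10000, seed=1))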

  15. On performance measures - Requirements "Algorithm A is 10/100 times faster than Algorithm B to solve this type of problem"

  16. On performance measures - Requirements "Algorithm A is 10/100 times faster than Algorithm B to solve this type of problem": quantitative measures. As opposed to what is displayed here: mean f-value after 3·10^5 f-evaluations (51 runs), bold meaning statistically significant, concluding "EFWA significantly better than EFWA-NG". Source: Dynamic search in fireworks algorithm, Shaoqiu Zheng, Andreas Janecek, Junzhi Li and Ying Tan, CEC 2014

  17. On performance measures - Requirements. A performance measure should be quantitative (on a ratio scale), well-interpretable (with a meaning relevant in the "real world"), and simple

  18. Fixed Cost versus Fixed Budget - Collecting Data

  19. Fixed Cost versus Fixed Budget - Collecting Data. Collect, for a given target (or several targets), the number of function evaluations needed to reach the target. Repeat several times: if the algorithm is stochastic, never draw a conclusion from a single run; if the algorithm is deterministic, repeat by (randomly) changing the initial conditions
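
Not in the slides: a sketch of this data-collection step, recording for each of several targets the first hitting time of each independent (differently seeded) run. The objective, targets, budget and toy random search are assumptions for illustration.

    # Sketch: collect, for several targets, the evaluations each run needed.
    import numpy as np

    def run_once(objective, targets, dimension, budget, seed):
        """One toy random-search run; returns {target: first hitting time or None}."""
        rng = np.random.default_rng(seed)
        best = float("inf")
        hits = {t: None for t in targets}
        for evals in range(1, budget + 1):
            best = min(best, objective(rng.uniform(-5, 5, dimension)))
            for t in targets:
                if hits[t] is None and best <= t:
                    hits[t] = evals             # first time target t is reached
        return hits

    sphere = lambda x: float(np.sum(x ** 2))
    targets = [10.0 ** e for e in range(1, -4, -1)]                          # several targets
    runs = [run_once(sphere, targets, 2, 5000, seed) for seed in range(15)]  # 15 runs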

  20. ECDF: Empirical Cumulative Distribution Function of the Runtime

  21. Definition of an ECDF. Let $X_1, \ldots, X_n$ be real random variables. Then the empirical cumulative distribution function (ECDF) is defined as $F_n(t) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}_{\{X_i \le t\}}$.
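
For concreteness, a direct transcription of this definition (not from the slides):

    # F_n(t) = (1/n) * sum_i 1{X_i <= t}
    import numpy as np

    def ecdf(data, t):
        """Empirical CDF of the sample `data`, evaluated at t."""
        data = np.asarray(data, dtype=float)
        return float(np.mean(data <= t))

    print(ecdf([3.0, 1.0, 4.0, 1.5], 2.0))   # 0.5: two of the four values are <= 2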

  22. We display the ECDF of the runtime to reach target function values (see next slides for illustrations)

  23. A Convergence Graph

  24. First Hitting Time is Monotone

  25. 15 Runs

  26. 15 Runs, ≤ 15 Runtime Data Points (first hitting times of a given target)

  27. Empirical Cumulative Distribution. The ECDF of run lengths to reach the target: has for each data point a vertical step of constant size; displays for each x-value (budget) the count of observations to the left (first hitting times)

  28. Empirical Cumulative Distribution. Possible interpretations: 80% of the runs reached the target; e.g. 60% of the runs need between 2000 and 4000 evaluations
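
Not part of the slides: a sketch of how such an ECDF step plot of first hitting times can be produced with matplotlib. The runtime values below are invented; 12 of the assumed 15 runs are successful, so the curve ends at 0.8, matching the 80% reading above.

    # Made-up first hitting times of the 12 successful runs (out of 15 runs).
    import numpy as np
    import matplotlib.pyplot as plt

    runtimes = np.array([1200, 1500, 2100, 2300, 2600, 3100, 3400, 3900,
                         4400, 5200, 6100, 7500])
    n_runs = 15                                  # 3 runs never reached the target

    x = np.sort(runtimes)
    y = np.arange(1, len(x) + 1) / n_runs        # each step has height 1/n_runs
    plt.step(x, y, where="post")
    plt.xlabel("number of function evaluations")
    plt.ylabel("proportion of runs that reached the target")
    plt.ylim(0, 1)
    plt.show()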

  29. Aggregation 15 runs

  30. Aggregation 15 runs 50 targets

  31. Aggregation 15 runs 50 targets

  32. Aggregation 15 runs 50 targets ECDF with 750 steps

  33. We can aggregate over: • different targets • different functions and targets. We should not aggregate over dimension, as functions of different dimensions typically have very different runtimes
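
Not in the slides: a sketch of the target aggregation of slides 29-32, pooling the first hitting times of all (run, target) pairs into a single ECDF. The input format {target: evaluations or None} matches the collection sketch after slide 19; pairs whose target was never reached contribute no step and only cap the final height of the curve.

    import numpy as np

    def aggregated_ecdf_points(hits_per_run, targets):
        """x-values (sorted runtimes) and ECDF heights over all (run, target) pairs."""
        pooled = [hits[t] for hits in hits_per_run for t in targets
                  if hits[t] is not None]
        n_total = len(hits_per_run) * len(targets)   # e.g. 15 runs x 50 targets = 750
        x = np.sort(np.asarray(pooled, dtype=float))
        y = np.arange(1, len(x) + 1) / n_total
        return x, y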

  34. ECDF aggregated over targets - single functions ECDF for 3 different algorithms

  35. ECDF aggregated over targets - single function: ECDF for a single algorithm, different dimensions

  36. ERT/ART: Average Runtime

  37. Which performance measure?

  38. Which performance measure?

  39. Expected Running Time (restart algorithm): $\mathrm{ERT} = \mathbb{E}[RT_r] = \frac{1 - p_s}{p_s}\,\mathbb{E}[RT_{\mathrm{unsuccessful}}] + \mathbb{E}[RT_{\mathrm{successful}}]$. Estimator for ERT: $\hat{p}_s = \#\mathrm{succ} / \#\mathrm{Runs}$, $\widehat{RT}_{\mathrm{unsucc}}$ = average evaluations of unsuccessful runs, $\widehat{RT}_{\mathrm{succ}}$ = average evaluations of successful runs, which gives $\mathrm{ART} = \#\mathrm{Evals} / \#\mathrm{success}$
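
A transcription of the ART estimator above (not from the slides): the total number of evaluations of all runs, successful or not, divided by the number of successes.

    def average_runtime(evals_successful, evals_unsuccessful):
        """evals_successful: evaluations of each successful run (until the target);
        evals_unsuccessful: evaluations of each unsuccessful run (its full budget)."""
        n_succ = len(evals_successful)
        if n_succ == 0:
            return float("inf")              # no success: the estimate is infinite
        return (sum(evals_successful) + sum(evals_unsuccessful)) / n_succ

    print(average_runtime([1200, 1800, 1500], [5000, 5000]))   # (4500 + 10000) / 3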

  40. Example: scaling behavior (ART plot)

  41. On Test functions

  42. Test functions. A function testbed (set of test functions) should "reflect reality": it should model typical difficulties one is willing to solve. Example: the BBOB testbed (implemented in the COCO platform); the test functions are mainly non-convex and non-separable, scalable with the search space dimension, and not too easy to solve, yet comprehensible
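
For illustration only (not the exact bbob implementation): a scalable, non-convex, non-separable test function, here the Rosenbrock function, which the bbob suite contains in a shifted/rotated form.

    import numpy as np

    def rosenbrock(x):
        """Non-separable (consecutive coordinates interact), non-convex, any dimension."""
        x = np.asarray(x, dtype=float)
        return float(np.sum(100.0 * (x[1:] - x[:-1] ** 2) ** 2 + (1.0 - x[:-1]) ** 2))

    print(rosenbrock(np.ones(10)))   # 0.0 at the optimum, in any dimension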

  43. Test functions (cont.) When aggregating results over all functions from a testbed (through an ECDF), one needs to be careful that some difficulties are not over-represented and that not too many easy functions are present

  44. The bbob Testbed • 24 functions in 5 groups • 6 dimensions: 2, 3, 5, 10, 20 (40 optional)

  45. BBOB testbed: Black-Box Optimization Benchmarking test suite, http://coco.gforge.inria.fr/doku.php?id=start; noiseless and noisy testbeds

  46. COCO platform: automating the benchmarking process

  47. https://github.com/numbbo/coco Step 1: download COCO

  48. https://github.com/numbbo/coco Step 2: installation of post-processing

  49. http://coco.gforge.inria.fr/doku.php?id=algorithms Step 3: downloading data (for the moment: IPOP-CMA-ES)

  50. https://github.com/numbbo/coco postprocess: python -m bbob_pproc IPOP-CMA-ES
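
Not part of the slides: a rough sketch of how an algorithm can be run on the bbob suite through COCO's Python experiment module cocoex (shipped with the repository above). Class and option names may differ between COCO versions; the random-search "algorithm" and the tiny budget are placeholders. The resulting folder is what the post-processing command on this slide consumes.

    import numpy as np
    import cocoex   # Python experiment module from https://github.com/numbbo/coco

    suite = cocoex.Suite("bbob", "", "")                     # 24 noiseless functions
    observer = cocoex.Observer("bbob", "result_folder: my-random-search")

    for problem in suite:
        problem.observe_with(observer)                       # log data for post-processing
        rng = np.random.default_rng(12345)
        for _ in range(100 * problem.dimension):             # tiny budget, illustration only
            problem(rng.uniform(-5, 5, problem.dimension))   # each evaluation is recorded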
