Performance Assessment in Optimization Anne Auger, CMAP & Inria
Visualization and presentation of single runs
Displaying 3 runs (three trials)
Displaying 51 runs
Which Statistics?
More problems with averages / expectations (from Hansen, GECCO 2019 Experimentation tutorial)
Which Statistics?
Implications (from Hansen, GECCO 2019 Experimentation tutorial)
Benchmarking Black-Box Optimizers
Benchmarking: running an algorithm on several test functions in order to evaluate its performance
Why Numerical Benchmarking?
• evaluate the performance of optimization algorithms
• compare the performance of different algorithms
• understand strengths and weaknesses of algorithms
• help in the design of new algorithms
On performance measures …
Performance measure - What to measure?
CPU time (to reach a given target)
• drawbacks: depends on the implementation, on the language, on the machine
• time is spent on code optimization instead of science
(Testing heuristics: we have it all wrong, J. N. Hooker, Journal of Heuristics, 1995)
Prefer an “absolute” value: # of function evaluations to reach a given target
• assumption: the internal cost of the algorithm is negligible or measured independently
On performance measures - Requirements
“Algorithm A is 10/100 times faster than Algorithm B to solve this type of problem”
→ quantitative measures
As opposed to what is displayed here: mean f-value after 3·10^5 f-evals (51 runs), bold: statistically significant, concluded: “EFWA significantly better than EFWA-NG”
Source: Dynamic search in fireworks algorithm, Shaoqiu Zheng, Andreas Janecek, Junzhi Li and Ying Tan, CEC 2014
On performance measures - Requirements
A performance measure should be
• quantitative, with a ratio scale
• well-interpretable, with a meaning relevant in the “real world”
• simple
Fixed Budget versus Fixed Target - Collecting Data
Fixed Budget versus Fixed Target - Collecting Data
Collect, for a given target (or several targets), the number of function evaluations needed to reach the target.
Repeat several times:
• if the algorithm is stochastic: never draw a conclusion from a single run
• if the algorithm is deterministic: repeat by changing (randomly) the initial conditions
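For illustration, a minimal sketch (not from the slides) of fixed-target data collection: a toy random search on the sphere function, where the target value, budget, and search domain are arbitrary choices for the example.

```python
import numpy as np

def random_search(f, dim, target, max_evals, rng):
    """Run pure random search on f; return the number of evaluations at
    which f(x) <= target was first reached, or np.inf if never reached."""
    for evals in range(1, max_evals + 1):
        x = rng.uniform(-5, 5, dim)
        if f(x) <= target:
            return evals
    return np.inf  # unsuccessful run

def sphere(x):
    return float(np.sum(np.asarray(x) ** 2))

rng = np.random.default_rng(1)
# repeat several times: never draw a conclusion from a single run
runtimes = [random_search(sphere, dim=3, target=1e-2, max_evals=10**5, rng=rng)
            for _ in range(15)]
print(runtimes)
```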
ECDF: Empirical Cumulative Distribution Function of the Runtime
Definition of an ECDF
Let $X_1, \dots, X_n$ be real random variables. Then the empirical cumulative distribution function (ECDF) is defined as
$$\hat{F}(t) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}_{\{X_i \le t\}}$$
We display the ECDF of the runtime to reach target function values (see next slides for illustrations)
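A hedged sketch of how such an ECDF of runtimes can be computed and plotted with numpy/matplotlib; the runtime values below are arbitrary example numbers, with np.inf marking runs that never reached the target.

```python
import numpy as np
import matplotlib.pyplot as plt

def ecdf_points(values):
    """Return sorted values and the cumulative proportions for an ECDF."""
    x = np.sort(np.asarray(values, dtype=float))
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y

# first hitting times (in evaluations) of 15 runs; np.inf = target never reached
runtimes = np.array([1200, 1500, 1800, 2300, 2600, 3100, 3500, 3900,
                     4200, 5000, 6400, 8000, np.inf, np.inf, np.inf])

x, y = ecdf_points(runtimes[np.isfinite(runtimes)])
y *= np.isfinite(runtimes).mean()   # normalize by the total number of runs

plt.step(x, y, where='post')        # one vertical step of constant size per data point
plt.xscale('log')
plt.xlabel('number of function evaluations')
plt.ylabel('proportion of runs reaching the target')
plt.show()
```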
A Convergence Graph
The First Hitting Time is Monotone
15 Runs
15 Runs ≤ 15 Runtime Data Points (for a given target)
Empirical Cumulative Distribution (Empirical CDF)
The ECDF of run lengths to reach the target
• has for each data point a vertical step of constant size
• displays for each x-value (budget) the count of observations to the left (first hitting times)
Empirical Cumulative Distribution (Empirical CDF)
Possible interpretations:
• 80% of the runs reached the target
• e.g. 60% of the runs need between 2000 and 4000 evaluations
Aggregation: 15 runs
Aggregation: 15 runs, 50 targets
Aggregation: 15 runs, 50 targets → ECDF with 750 steps
We can aggregate over:
• different targets
• different functions and targets
We should not aggregate over dimension, as functions of different dimensions typically have very different runtimes.
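The sketch below (an illustration, not COCO code) shows aggregation over targets: from each run's best-so-far f-values, record one first hitting time per target and pool all of them into a single ECDF; `list_of_run_histories` is a hypothetical placeholder for the recorded convergence data.

```python
import numpy as np

def first_hitting_times(f_history, targets):
    """For one run, f_history[i] is the best f-value found after i+1 evaluations.
    Return, for each target, the evaluation count at which it is first reached
    (np.inf if it is never reached)."""
    f_history = np.asarray(f_history, dtype=float)
    hits = []
    for target in targets:
        reached = np.nonzero(f_history <= target)[0]
        hits.append(reached[0] + 1 if reached.size else np.inf)
    return hits

# e.g. 50 targets uniform on a log scale between 10^2 and 10^-8
targets = 10.0 ** np.linspace(2, -8, 50)

# pooling 15 runs and 50 targets gives up to 15 * 50 = 750 data points,
# i.e. an ECDF with up to 750 steps
all_hits = []
for f_history in list_of_run_histories:   # hypothetical: one best-so-far history per run
    all_hits.extend(first_hitting_times(f_history, targets))
```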
ECDF aggregated over targets - single functions: ECDF for 3 different algorithms
ECDF aggregated over targets - single function: ECDF for a single algorithm in different dimensions
ERT/ART: Average Runtime
Which performance measure?
Expected Running Time (restart algorithm)
$$\mathrm{ERT} = \mathbb{E}[RT^r] = \frac{1 - p_s}{p_s}\,\mathbb{E}[RT_{\text{unsuccessful}}] + \mathbb{E}[RT_{\text{successful}}]$$
Estimator for ERT:
$$\hat{p}_s = \frac{\#\text{succ}}{\#\text{Runs}}, \qquad \widehat{RT}_{\text{unsucc}} = \text{average evals of unsuccessful runs}, \qquad \widehat{RT}_{\text{succ}} = \text{average evals of successful runs}$$
$$\mathrm{ART} = \frac{\#\text{Evals}}{\#\text{success}}$$
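A minimal sketch of the ART estimator above: the total number of evaluations over all runs (successful and unsuccessful) divided by the number of successes; the numbers are placeholders.

```python
import numpy as np

def average_runtime(evals, success):
    """ART = (sum of evaluations over all runs) / (number of successful runs)."""
    evals = np.asarray(evals, dtype=float)
    success = np.asarray(success, dtype=bool)
    n_success = success.sum()
    return evals.sum() / n_success if n_success else np.inf

# placeholder data: evaluations spent by each run and whether it reached the target
evals = [2000, 3500, 10000, 10000, 4200]   # unsuccessful runs count with their full budget
success = [True, True, False, False, True]
print(average_runtime(evals, success))     # ART estimate in number of evaluations
```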
Example: scaling behavior (ART)
On Test functions
Test functions
The testbed (set of test functions) should “reflect reality”: it should model the typical difficulties one is willing to solve.
Example: the BBOB testbed (implemented in the COCO platform)
• the test functions are mainly non-convex and non-separable
• scalable with the search space dimension
• not too easy to solve, yet comprehensible
Test functions (cont.)
When aggregating results over all functions of a testbed (through an ECDF), one needs to be careful that some difficulties are not over-represented and that not too many easy functions are present.
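As an illustration of such properties, a short sketch of the classical Rosenbrock function: non-convex, non-separable, and scalable with the dimension (a shifted/rotated variant of it is part of the bbob suite).

```python
import numpy as np

def rosenbrock(x):
    """Rosenbrock function: non-convex, non-separable, defined for any dimension >= 2."""
    x = np.asarray(x, dtype=float)
    return float(np.sum(100.0 * (x[1:] - x[:-1] ** 2) ** 2 + (1.0 - x[:-1]) ** 2))

print(rosenbrock(np.ones(20)))   # 0.0 at the optimum x = (1, ..., 1), here in dimension 20
```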
The bbob Testbed
• 24 functions in 5 groups
• 6 dimensions: 2, 3, 5, 10, 20, (40 optional)
BBOB testbed: Black-Box Optimization Benchmarking test suite
http://coco.gforge.inria.fr/doku.php?id=start
noiseless testbed / noisy testbed
COCO platform: automating the benchmarking process
Step 1: download COCO (https://github.com/numbbo/coco)
Step 2: install the post-processing (https://github.com/numbbo/coco)
Step 3: download algorithm data, for the moment IPOP-CMA-ES (http://coco.gforge.inria.fr/doku.php?id=algorithms)
Step 4: post-process the data (https://github.com/numbbo/coco): python -m bbob_pproc IPOP-CMA-ES
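For completeness, a minimal sketch of running a benchmarking experiment with COCO's Python module cocoex, following the pattern of the platform's example experiment; the exact module, class, and option names are assumptions that may differ between COCO versions.

```python
import cocoex            # COCO experiment module from https://github.com/numbbo/coco
import scipy.optimize    # example solver; any black-box optimizer can be plugged in

suite = cocoex.Suite("bbob", "", "")                          # noiseless bbob test suite
observer = cocoex.Observer("bbob", "result_folder: MYALGO")   # records the benchmarking data

for problem in suite:                 # loops over all functions, dimensions, instances
    problem.observe_with(observer)    # log every evaluation of this problem
    scipy.optimize.fmin(problem, problem.initial_solution, disp=False)

# post-process the recorded data afterwards, e.g.
#   python -m bbob_pproc MYALGO      (older COCO versions, as on the slide)
#   python -m cocopp exdata/MYALGO   (newer versions)
```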