Bayesian Analysis for Algorithm Performance Comparison Is it possible to compare optimization algorithms without hypothesis testing? Josu Ceberio
Is there a reproducibility crisis?
Source: Monya Baker (2016). Is there a reproducibility crisis? Nature, 533, 452-454.
● Hypothesis: Idea for solving a set of problems more efficiently.
● Questions: Is my algorithm better than the state-of-the-art? On which problems is my algorithm better? Why is my algorithm better (or worse)?
● Experimentation: Compare the performance of my algorithm with the state-of-the-art on some benchmark of problems. The analysis of the results should take into account the associated uncertainty.
● Conclusions: What conclusions do we draw from the experimentation? How do we answer the formulated questions?
The Questions
● How likely is my proposal to be the best algorithm to solve a problem?
● How likely is my proposal to be the best algorithm from the compared ones?
The Point
[Figure: Unknown Behaviour, Observed Sample, Statistical Analysis of Experimental Results]
Null hypothesis statistical testing (NHST). What NHST computes: $p(t(x) > \tau \mid H_0)$
The controversy with NHST
We assume the null hypothesis $H_0$: the average performance of the compared methods is the same. Then the observed difference is computed from the data, and the probability of observing such a difference (or a bigger one) under $H_0$ is estimated: the p-value.
The p-value is often misread as the probability of erroneously assuming that there are differences when actually there are none. It is also often used to measure the magnitude of the difference, because it decreases when the difference increases.
● What NHST computes: $p(t(x) > \tau \mid H_0)$, and its complement $1 - p(t(x) > \tau \mid H_0) = p(t(x) < \tau \mid H_0)$
● What we would like to know: $p(H_0 \mid x)$, and its complement $1 - p(H_0 \mid x) = p(H_1 \mid x)$
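To make the quantity that NHST computes concrete, here is a minimal sketch of an exact one-sided sign-flip permutation test on paired differences. The choice of test, the function name `sign_flip_p_value`, and the data are assumptions made for illustration; the slides do not say which test is used. The statistic t(x) is the sum of the differences, τ is its observed value, and the tail includes equality, as is usual for permutation tests.

```python
from itertools import product


def sign_flip_p_value(differences):
    """Exact one-sided p-value of a sign-flip permutation test.

    Under H0 (no difference between the two methods) each paired
    difference is equally likely to be positive or negative, so all
    2**n sign assignments are enumerated and the fraction whose sum is
    at least as large as the observed sum is returned.
    """
    observed = sum(differences)
    at_least_as_extreme = 0
    total = 0
    for signs in product((1, -1), repeat=len(differences)):
        total += 1
        flipped_sum = sum(s * d for s, d in zip(signs, differences))
        if flipped_sum >= observed:
            at_least_as_extreme += 1
    return at_least_as_extreme / total


# Hypothetical paired differences (competitor minus proposal, minimization),
# one value per benchmark instance. Not taken from the slides.
paired_differences = [10, -5, 12, 8, 3, 7]
p_value = sign_flip_p_value(paired_differences)
print(f"p(t(x) >= tau | H0) = {p_value}")
```

The number returned is a statement about the data given $H_0$. It says nothing directly about $p(H_0 \mid x)$, which is the quantity the questions above ask for.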
[Figure: Unknown Behaviour, Observed Sample]
Many alternatives to handle uncertainty associated with empirical results:
[Book cover: Statistical Analysis Handbook: A Comprehensive Handbook of Statistical Concepts, Techniques and Software Tools, … Edition, Dr Michael J de Smith]
Statistical analysis of experimental results:
● Null hypothesis statistical testing (NHST). What NHST computes: $p(t(x) > \tau \mid H_0)$
● Bayesian statistical analysis
The Bayesian Approach
The method focuses on estimating relevant information about the underlying parametric performance distribution, which is represented by a set of parameters θ.
It assesses the distribution of θ conditioned on a sample s drawn from the performance distribution:
$P(\theta \mid s) \propto P(s \mid \theta)\, P(\theta)$
● $P(\theta \mid s)$: posterior distribution of the parameters
● $P(s \mid \theta)$: likelihood function
● $P(\theta)$: prior distribution of the parameters
Instead of having a single probability distribution to model the underlying performance, Bayesian statistics considers all possible distributions and assigns a probability to each.
How do we compare multiple algorithms?
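The proportionality between posterior, likelihood and prior can be sketched with a grid approximation. The model below is a deliberately simple stand-in and not the one used in the talk: θ is assumed to be the probability that the proposal beats a competitor on a benchmark instance, the sample s is a hypothetical count of wins and losses, and the prior is assumed uniform. The name `grid_posterior` is illustrative.

```python
def grid_posterior(wins, losses, grid_size=1001):
    """Approximate P(theta | s) on an evenly spaced grid over [0, 1].

    theta: probability that the proposal wins on an instance (assumed model).
    s:     the observed numbers of wins and losses.
    """
    thetas = [i / (grid_size - 1) for i in range(grid_size)]
    prior = [1.0] * grid_size                                # uniform prior (assumption)
    likelihood = [t ** wins * (1 - t) ** losses for t in thetas]
    unnormalized = [lik * pri for lik, pri in zip(likelihood, prior)]
    total = sum(unnormalized)
    posterior = [u / total for u in unnormalized]            # normalize so it sums to 1
    return thetas, posterior


# Hypothetical sample: the proposal wins on 7 of 10 instances.
thetas, posterior = grid_posterior(wins=7, losses=3)

# Posterior probability that the proposal wins more often than it loses.
prob_better = sum(p for t, p in zip(thetas, posterior) if t > 0.5)

# Grid point with the highest posterior probability.
mode = max(zip(posterior, thetas))[1]
print(f"P(theta > 0.5 | s) is about {prob_better:.3f}, posterior mode at {mode}")
```

The result is a distribution over θ, so a question such as "how likely is my proposal to be better?" is answered by summing posterior mass rather than by comparing against a significance threshold.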
From Results to Rankings
Observed sample: minimizing some instances of a problem, or minimizing a given instance of a problem.

Objective values (lower is better):
Algorithm    f1    f2    f3    f4    f5
GA          100   130    37   566   256
PSO          90    80   352   756   125
ILP         135   135    19   101    89
SA          105    30   100    56   369
GP           95   300    10    57    36

Rankings (permutations), where σk is obtained by ranking the column fk:
Algorithm    σ1    σ2    σ3    σ4    σ5
GA            3     3     3     4     4
PSO           1     2     5     5     3
ILP           5     4     2     3     2
SA            4     1     4     1     5
GP            2     5     1     2     1
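The step from results to rankings can be sketched as follows. The values are those of the five result tables on this slide, arranged so that column k holds the results behind ranking σk. That pairing is inferred from the slide layout and should be treated as an assumption, as should the helper name `rank_column`. Ties are not handled because the example contains none.

```python
# Objective values per algorithm (minimization), one entry per column f1..f5.
results = {
    "GA":  [100, 130, 37, 566, 256],
    "PSO": [90, 80, 352, 756, 125],
    "ILP": [135, 135, 19, 101, 89],
    "SA":  [105, 30, 100, 56, 369],
    "GP":  [95, 300, 10, 57, 36],
}


def rank_column(values):
    """Return 1-based ranks; the smallest value gets rank 1 (no ties assumed)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for position, index in enumerate(order, start=1):
        ranks[index] = position
    return ranks


algorithms = list(results)
n_columns = len(results["GA"])

# rankings[k][a] is the rank of algorithms[a] in column k.
rankings = []
for k in range(n_columns):
    column = [results[name][k] for name in algorithms]
    rankings.append(rank_column(column))

for k, ranking in enumerate(rankings, start=1):
    print(f"sigma_{k}:", dict(zip(algorithms, ranking)))
```

Only the order of the algorithms within each column survives this transformation. The size of the gaps between objective values is discarded.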
Plackett-Luce Model
$$P(\sigma) = \prod_{i=1}^{n} \frac{w_{\sigma_i}}{\sum_{j=i}^{n} w_{\sigma_j}}$$
Here $n$ is the number of algorithms, $\sigma_i$ is the algorithm placed at position $i$ of the ranking, and $w_{\sigma_i}$ is its weight.
● Each algorithm in the comparison has an associated weight.
● The weights sum to 1.
● The weight associated with an algorithm represents its probability of being ranked first.
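The Plackett-Luce probability of a ranking can be evaluated directly from its product form. The sketch below uses made-up weights chosen only so that they sum to 1. They are not estimates from any experiment, and the function name `plackett_luce_probability` is illustrative. The ordering is given best first, so `ordering[i]` plays the role of σ at position i+1.

```python
from itertools import permutations


def plackett_luce_probability(ordering, weights):
    """Probability of an ordering (best first) under the Plackett-Luce model.

    At each position, the algorithm placed there is chosen with probability
    equal to its weight divided by the total weight of the algorithms that
    have not been placed yet.
    """
    probability = 1.0
    for i, name in enumerate(ordering):
        remaining_weight = sum(weights[other] for other in ordering[i:])
        probability *= weights[name] / remaining_weight
    return probability


# Hypothetical weights (assumption): positive and summing to 1.
weights = {"GA": 0.10, "PSO": 0.20, "ILP": 0.15, "SA": 0.25, "GP": 0.30}

example = ["GP", "SA", "PSO", "ILP", "GA"]
print("P(sigma) =", plackett_luce_probability(example, weights))

# Sanity checks: probabilities over all orderings sum to 1, and the
# probability of an algorithm being ranked first equals its weight.
all_orderings = list(permutations(weights))
total = sum(plackett_luce_probability(o, weights) for o in all_orderings)
gp_first = sum(
    plackett_luce_probability(o, weights) for o in all_orderings if o[0] == "GP"
)
print("sum over orderings =", total, "| P(GP first) =", gp_first)
```

The second check illustrates the last bullet: summing the model's probability over every ranking that starts with a given algorithm recovers that algorithm's weight.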