Power-Law Distributions in Empirical Data Article for Advanced Methods in Applied Statistics Christian Anker Rosiek 8th March 2018 Christian Anker Rosiek Power-Law Distributions in Empirical Data 1 / 14
SIAM REVIEW ? 2009 Society for Industrial and Applied Mathematics Vol. 51, No. 4, pp. 661-703 Power-Law Distributions in Empirical Data* Aaron Claused Cosma Rohilla Shalizi* M. E. J. Newman^ Abstract. Power-law distributions occur in many situations of scientific interest and have significant consequences for our understanding of natural and man-made phenomena. Unfortunately, the detection and characterization of power laws is complicated by the large fluctuations that occur in the tail of the distribution?the part of the distribution representing large but rare events?and by the difficulty of identifying the range over which power-law behav ior holds. Commonly used methods for analyzing power-law data, such as least-squares fitting, can produce substantially inaccurate estimates of parameters for power-law dis tributions, and even in cases where such methods return accurate answers they are still unsatisfactory because they give no indication of whether the data obey a power law at all. Here we present a principled statistical framework for discerning and quantifying power-law behavior in empirical data. Our approach combines maximum-likelihood fitting methods with goodness-of-fit tests based on the Kolmogorov-Smirnov (KS) statistic and likelihood ratios. We evaluate the effectiveness of the approach with tests on synthetic data and give critical comparisons to previous approaches. We also apply the proposed methods to twenty-four real-world data sets from a range of different disciplines, each of which has been conjectured to follow a power-law distribution. In some cases we find these conjectures to be consistent with the data, while in others the power law is ruled out. Key words, power-law distributions, Pareto, Zipf, maximum likelihood, heavy-tailed distributions, doi:10.1137/070710111 http://tuvalu.santafe.edu/~aaronc/powerlaws/ likelihood ratio test, model selection AMS subject classifications. 62-07, 62P99, 65C05, 62F99 Christian Anker Rosiek Power-Law Distributions in Empirical Data 2 / 14 DOI. 10.1137/070710111 I. Introduction. Many empirical quantities cluster around a typical value. The speeds of cars on a highway, the weights of apples in a store, air pressure, sea level, the temperature in New York at noon on a midsummer's day: all of these things vary somewhat, but their distributions place a negligible amount of probability far from the typical value, making the typical value representative of most observations. For instance, it is a useful statement to say that an adult male American is about 180cm tall because no one deviates very far from this height. Even the largest deviations, which are exceptionally rare, are still only about a factor of two from the mean in * Received by the editors December 2, 2007; accepted for publication (in revised form) February 2, 2009; published electronically November 6, 2009. This work was supported in part by the Santa Fe Institute (AC) and by grants from the James S. McDonnell Foundation (CRS and MEJN) and the National Science Foundation (MEJN). htt p: / / www. siam. org / j our nals / sirev /51-4/71011.html f Santa Fe Institute, 1399 Hyde Park Road, Santa Fe, NM 87501, and Department of Computer Science, University of New Mexico, Albuquerque, NM 87131. * Department of Statistics, Carnegie Mellon University, Pittsburgh, PA 15213. ? Department of Physics and Center for the Study of Complex Systems, University of Michigan, Ann Arbor, MI 48109. 661
Power law distributions Continuous distribution � x � − α p ( x ) = α − 1 (1) x min x min Discrete distribution x − α p ( x ) = (2) ζ ( α, x min ) Christian Anker Rosiek Power-Law Distributions in Empirical Data 3 / 14
Power-law histogram (continuous distribution) 1 . 0 0 . 8 0 . 6 p ( x ) 0 . 4 0 . 2 0 . 0 0 50 100 150 x n = 10 000 , α = 3 . 5 , x min = 1 . Christian Anker Rosiek Power-Law Distributions in Empirical Data 4 / 14
Power-law histogram (continuous distribution) 10 − 1 10 − 3 p ( x ) 10 − 5 10 − 7 10 0 10 1 10 2 x n = 10 000 , α = 3 . 5 , x min = 1 . Christian Anker Rosiek Power-Law Distributions in Empirical Data 4 / 14
Linear least squares fit 10 0 LS + PDF True α 10 − 2 p ( x ) 10 − 4 10 − 6 10 0 10 1 10 2 x n = 10 000 , α = 3 . 5 , x min = 1 → α LS = 3 . 34(10) . ˆ Christian Anker Rosiek Power-Law Distributions in Empirical Data 5 / 14
Maximum likelihood parameter estimation Continuous distribution � n � − 1 ln x i � α MLE = 1 + n ˆ . (3) x min i =1 Discrete distribution α MLE = argmax ˆ L (4a) α with n � L = − n ln ζ ( α, x min ) − α ln x i . (4b) i =1 Christian Anker Rosiek Power-Law Distributions in Empirical Data 6 / 14
Maximum likelihood parameter estimation 10 0 LS + PDF True α Cont. MLE 10 − 2 p ( x ) 10 − 4 10 − 6 10 0 10 1 10 2 x n = 10 000 , α = 3 . 5 , x min = 1 → α MLE = 3 . 51(2) . ˆ Christian Anker Rosiek Power-Law Distributions in Empirical Data 7 / 14
Parameter estimation comparison (a) 3.5 3 est. α 2.5 2 1.5 1.5 2 2.5 3 3.5 (b) 3.5 Disc. MLE 3 Cont. MLE LS + PDF est. α LS + CDF 2.5 2 1.5 1.5 2 2.5 3 3.5 true α Article [1] Figure 3.2. Different α -estimators used with (a) discrete and (b) continuous power-laws. Christian Anker Rosiek Power-Law Distributions in Empirical Data 8 / 14
US city population 10 0 10 − 1 P ( x ) 10 − 2 10 − 3 � ∞ x p ( x ′ ) dx ′ P ( x ) = 10 − 4 10 0 10 1 10 2 10 3 10 4 10 5 10 6 10 7 City population x Christian Anker Rosiek Power-Law Distributions in Empirical Data 9 / 14
Estimating cut-off x min 4.5 4 3.5 estimated α 3 2.5 2 1.5 1 0 1 2 3 4 10 10 10 10 10 estimated x min Article [1] Figure 3.3. 5000 samples with α = 2 . 5 , x min = 100 averaged over 2500 trials. Christian Anker Rosiek Power-Law Distributions in Empirical Data 10 / 14
Estimating cut-off x min 5 4 Estimated α 3 2 1 10 0 10 1 10 2 10 3 10 4 Estimated x min 5000 samples with α = 2 . 5 , x min = 100 . 10 individual trials. Christian Anker Rosiek Power-Law Distributions in Empirical Data 11 / 14
Estimating cut-off x min One method: Maximize similarity between measured data distribution and best-fit distribution. Similarity is here measured with Kolmogorov-Smirnov test statistic: D = max x ≥ x min | S ( x ) − P ( x ) | (5) where P ( x ) is measured data CDF and S ( x ) is best-fit CDF. Additionally, proposed Monte Carlo GOF: Sample a large number of artificial observations from distributions with the best-fit parameters. p -value is now the ratio of simulated samples that have worse D . (Note: Greater p -value is better.) Christian Anker Rosiek Power-Law Distributions in Empirical Data 12 / 14
US city population 10 0 10 − 1 10 − 2 P ( x ) 10 − 3 � ∞ P ( x ) = x p ( x ′ ) dx ′ 10 − 4 10 0 10 1 10 2 10 3 10 4 10 5 10 6 10 7 City population x Christian Anker Rosiek Power-Law Distributions in Empirical Data 13 / 14
Rounding off Not covered here: Model comparison using likelihood ratios. Application to real-world datasets. Appendices: Mathematical and computational details, e.g. MLE convergence, sampling from power-law distributions, etc. Follow-up article [2]: Power-law distributions in binned empirical data. References: [1] Aaron Clauset, Cosma Rohilla Shalizi, and M.E.J. Newman. Power-law distributions in empirical data. SIAM Review 51 , 661–703 (2009). [2] Y. Virkar, and A. Clauset. Power-law distributions in binned empirical data. The Annals of Applied Statistics 8 , 89–119 (2014). Christian Anker Rosiek Power-Law Distributions in Empirical Data 14 / 14
Recommend
More recommend