A Brief History of Lognormal and Power Law Distributions and an Application to File Size Distributions — PowerPoint presentation by Michael Mitzenmacher, Harvard University


  1. A Brief History of Lognormal and Power Law Distributions and an Application to File Size Distributions Michael Mitzenmacher Harvard University

  2. Motivation: General • Power laws now everywhere in computer science. – See the popular texts Linked by Barabasi or Six Degrees by Watts. – File sizes, download times, Internet topology, Web graph, etc. • Other sciences have known about power laws for a long time. – Economics, physics, ecology, linguistics, etc. • We should know history before diving in.

  3. Motivation: Specific • Recent work on file size distributions – Downey (2001): file sizes have lognormal distribution (model and empirical results). – Barford et al. (1999): file sizes have lognormal body and Pareto (power law) tail. (empirical) • Understanding file sizes important for – Simulation tools: SURGE – Explaining network phenomena: power law for file sizes may explain self-similarity of network traffic. • Wanted to settle discrepancy. • Found rich (and insufficiently cited) history. • Helped lead to new file size model.

  4. Power Law Distribution • A power law distribution satisfies Pr[X ≥ x] ~ c x^(−α). • Pareto distribution: Pr[X ≥ x] = (x/k)^(−α) – Log-complementary cumulative distribution function (ccdf) is exactly linear: ln Pr[X ≥ x] = −α ln x + α ln k • Properties – Infinite mean/variance possible
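As a quick sanity check on the slide above: the log-ccdf of a power law should be (near) linear with slope −α. A minimal sketch in Python, with arbitrary parameter values, sampling a Pareto by inversion:

```python
import bisect
import math
import random

# Sanity check: for a Pareto distribution Pr[X >= x] = (x/k)^(-alpha), the
# log-ccdf ln Pr[X >= x] = -alpha*ln x + alpha*ln k is exactly linear with
# slope -alpha.  Sample by inversion (X = k * U^(-1/alpha) for uniform U)
# and estimate the slope empirically.  Parameter values are arbitrary.
random.seed(0)
alpha, k = 1.5, 1.0
samples = sorted(k * random.random() ** (-1.0 / alpha) for _ in range(200_000))
n = len(samples)

def emp_ccdf(x):
    # Empirical Pr[X >= x] over the sorted sample.
    return (n - bisect.bisect_left(samples, x)) / n

x1, x2 = 2.0, 20.0
slope = (math.log(emp_ccdf(x2)) - math.log(emp_ccdf(x1))) / (math.log(x2) - math.log(x1))
print(round(slope, 2))  # should be close to -alpha = -1.5
```

The same two-point slope estimate on a log-ccdf plot is the usual visual test mentioned on slide 6.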

  5. Lognormal Distribution • X is lognormally distributed if Y = ln X is normally distributed. • Density function: f(x) = (1/(√(2π) σ x)) e^(−(ln x − μ)²/(2σ²)) • Properties: – Finite mean/variance. – Skewed: mean > median > mode – Multiplicative: X1 lognormal, X2 lognormal implies X1·X2 lognormal.
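The multiplicative property can be checked directly: logs of independent lognormals add, so their product is again lognormal. A small sketch with arbitrary parameters:

```python
import math
import random

# Multiplicative closure in code: if Y_i = ln X_i is Normal(mu_i, sigma_i^2),
# then ln(X1 * X2) = ln X1 + ln X2 is Normal(mu1 + mu2, sigma1^2 + sigma2^2),
# so X1 * X2 is lognormal.  Parameter values below are arbitrary.
random.seed(1)
mu1, s1, mu2, s2 = 0.5, 1.0, -0.2, 0.7
prods = [random.lognormvariate(mu1, s1) * random.lognormvariate(mu2, s2)
         for _ in range(100_000)]
logs = [math.log(x) for x in prods]
m = sum(logs) / len(logs)
v = sum((y - m) ** 2 for y in logs) / len(logs)
print(round(m, 1), round(v, 1))  # near mu1+mu2 = 0.3 and s1^2+s2^2 = 1.49
```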

  6. Similarity • Easily seen by looking at log-densities. • Pareto has linear log-density: ln f(x) = −(α + 1) ln x + α ln k + ln α • For large σ, lognormal has nearly linear log-density: ln f(x) = −ln x − ln(√(2π) σ) − (ln x − μ)²/(2σ²) • Similarly, both have near-linear log-ccdfs. – Log-ccdfs usually used for empirical, visual tests of power law behavior. • Question: how to differentiate them empirically?

  7. Lognormal vs. Power Law • Question: Is this distribution lognormal or a power law? – Reasonable follow-up: Does it matter? • Primarily in economics – Income distribution. – Stock prices. (Black-Scholes model.) • But also papers in ecology, biology, astronomy, etc.

  8. History • Power laws – Pareto: income distribution, 1897 – Zipf-Auerbach: city sizes, 1913/1940’s – Zipf-Estoup: word frequency, 1916/1940’s – Lotka: bibliometrics, 1926 – Mandelbrot: economics/information theory, 1950’s+ • Lognormal – McAlister, Kapteyn: 1879, 1903. – Gibrat: multiplicative processes, 1930’s.

  9. Generative Models: Power Law • Preferential attachment – Dates back to Yule (1924), Simon (1955). • Yule: species and genera. • Simon: income distribution, city population distributions, word frequency distributions. – Web page degrees: more likely to link to page with many links. • Optimization based – Mandelbrot (1953): optimize information per character. – HOT model for file sizes. Zhu et al. (2001)

  10. Preferential Attachment • Consider dynamic Web graph. – Pages join one at a time. – Each page has one outlink. • Let X j ( t ) be the number of pages of degree j at time t . • New page links: – With probability α , link to a random page. – With probability (1- α ), a link to a page chosen proportionally to indegree. (Copy a link.)

  11. Simple Analysis • dX_0/dt = 1 − α X_0/t • dX_j/dt = α X_{j−1}/t − α X_j/t + (1−α)(j−1) X_{j−1}/t − (1−α) j X_j/t • Assume limiting distribution where X_j = c_j t. Then c_j/c_{j−1} ~ 1 − ((2−α)/(1−α))·(1/j), so c_j ~ j^(−(2−α)/(1−α))
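The limiting behavior claimed above can be checked numerically: substituting X_j = c_j t into the differential equations gives a simple recurrence for the c_j, and iterating it should reproduce the stated power law exponent. A sketch (α = 0.5 is an arbitrary choice):

```python
import math

# Plugging X_j = c_j * t into the differential equations gives the recurrence
#   c_j * (1 + alpha + (1 - alpha)*j) = c_{j-1} * (alpha + (1 - alpha)*(j - 1)),
# with c_0 = 1/(1 + alpha) from dX_0/dt = 1 - alpha*X_0/t.  Iterating, the
# log-log slope of c_j should approach -(2 - alpha)/(1 - alpha).
alpha = 0.5
c = [1.0 / (1.0 + alpha)]
for j in range(1, 20_001):
    c.append(c[-1] * (alpha + (1 - alpha) * (j - 1)) / (1 + alpha + (1 - alpha) * j))

# Local slope of ln c_j vs. ln j at large j:
slope = (math.log(c[20_000]) - math.log(c[10_000])) / (math.log(20_000) - math.log(10_000))
print(round(slope, 2))  # should approach -(2 - alpha)/(1 - alpha) = -3.0
```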

  12. Optimization Model: Power Law • Mandelbrot experiment: design a language over a d-ary alphabet to optimize information per character. – Probability of j-th most frequently used word is p_j. – Length of j-th most frequently used word is c_j. • Average information per word: H = −Σ_j p_j log₂ p_j • Average characters per word: C = Σ_j p_j c_j

  13. Optimization Model: Power Law • Optimize ratio A = C/H, with C = Σ_j p_j c_j and H = −Σ_j p_j log₂ p_j. • dA/dp_j = (c_j H + C log₂(e p_j)) / H² • dA/dp_j = 0 when p_j = 2^(−H c_j / C) / e • If c_j ≈ log_d j, a power law results.
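The derivative formula above can be verified against a finite-difference approximation on a toy distribution. All probabilities and word lengths below are made-up values for illustration:

```python
import math

# Numerically check the slide's derivative of A = C/H with respect to p_j:
#   dA/dp_j = (c_j * H + C * log2(e * p_j)) / H^2,
# which follows from dC/dp_j = c_j and dH/dp_j = -log2(e * p_j).
p = [0.4, 0.3, 0.2, 0.1]      # word probabilities (assumed toy values)
cost = [1.0, 2.0, 2.0, 3.0]   # word lengths c_j (assumed toy values)

def C(q):
    return sum(qj * cj for qj, cj in zip(q, cost))

def H(q):
    return -sum(qj * math.log2(qj) for qj in q)

j, eps = 1, 1e-7
analytic = (cost[j] * H(p) + C(p) * math.log2(math.e * p[j])) / H(p) ** 2
bumped = p.copy()
bumped[j] += eps              # a partial derivative needs no renormalization
numeric = (C(bumped) / H(bumped) - C(p) / H(p)) / eps
print(abs(analytic - numeric) < 1e-4)  # True
```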

  14. Monkeys Typing Randomly • Miller (psychologist, 1957) suggests the following: monkeys type randomly at a keyboard. – Hit each of n characters with probability p. – Hit the space bar with probability 1 − np > 0. – A word is a sequence of characters separated by a space. • Resulting distribution of word frequencies follows a power law. • Conclusion: Mandelbrot’s “optimization” is not required for languages to have power law behavior.

  15. Miller’s Argument • All words with k letters appear with probability p^k (1 − np). • There are n^k words of length k. – Words of length k have frequency ranks [(n^k − 1)/(n − 1) + 1, (n^(k+1) − 1)/(n − 1)]. • Manipulation yields power law behavior: (1 − np) p^(log_n j + 1) ≤ p_j ≤ (1 − np) p^(log_n j) • Recently extended by Conrad, Mitzenmacher to the case of unequal letter probabilities. – Non-trivial: requires complex analysis.
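Miller's word frequencies can be computed exactly rather than simulated, since all n^k words of length k tie. A sketch (n = 3 and p = 0.2 are arbitrary choices) fitting the log-log slope of the resulting rank-frequency pairs:

```python
import math

# Miller's monkey model, computed exactly: with n equally likely letters
# (prob p each) and space prob 1 - n*p, every length-k word has probability
# p^k * (1 - n*p) and there are n^k of them.  Rank vs. frequency on a log-log
# scale should therefore follow a power law with exponent log(p)/log(n).
n, p = 3, 0.2                 # assumed: 3 letters, space probability 0.4
assert 1 - n * p > 0

pairs = []                    # (rank of first word of length k, its frequency)
rank = 1
for k in range(1, 16):
    pairs.append((rank, (1 - n * p) * p ** k))
    rank += n ** k            # skip over the n^k tied words of this length

(r1, f1), (r2, f2) = pairs[5], pairs[14]
slope = (math.log(f2) - math.log(f1)) / (math.log(r2) - math.log(r1))
print(round(slope, 2))  # near log(p)/log(n) = -1.46
```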

  16. Generative Models: Lognormal • Start with an organism of size X_0. • At each time step, size changes by a random multiplicative factor: X_t = F_t X_{t−1} • If F_t is taken from a lognormal distribution, each X_t is lognormal. • If the F_t are independent and identically distributed, then (by the CLT) X_t converges to a lognormal distribution.
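A quick CLT sanity check for the multiplicative process, using an arbitrary doubling-or-halving factor:

```python
import math
import random
import statistics

# For X_t = F_t * X_{t-1}, ln X_t is a sum of i.i.d. terms.  Here each F_t
# doubles or halves the size with equal probability (an arbitrary choice),
# so ln X_T is a sum of T steps of +/- ln 2 and should be approximately
# Normal(0, T * (ln 2)^2) for large T.
random.seed(3)
T, runs = 100, 20_000
logs = [sum(random.choice((math.log(2), -math.log(2))) for _ in range(T))
        for _ in range(runs)]
m = statistics.mean(logs)
s = statistics.pstdev(logs)
# Mean should be near 0; std should be near ln(2) * sqrt(T).
print(round(m, 1), round(s / (math.log(2) * math.sqrt(T)), 2))
```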

  17. BUT! • If there exists a lower bound: X_t = max(ε, F_t X_{t−1}) then X_t converges to a power law distribution. (Champernowne, 1953) • The lognormal model is easily pushed to a power law model.

  18. Example • At each time interval, suppose the size either increases by a factor of 2 with probability 1/3, or decreases by a factor of 1/2 with probability 2/3. – Limiting distribution is lognormal. – But if the size has a lower bound, a power law. [Figures: histograms of log₂ size]

  19. Example continued • After n steps, the distribution of (increases − decreases) becomes normal (CLT). • Limiting distribution with a lower bound: Pr[X ≥ x] ~ 2^(−x) for X = log₂ size, so Pr[size ≥ x] ~ 1/x. [Figure: limiting distribution of log₂ size]
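The example's conclusion can be checked by simulation: with the lower bound, the log-scale process is a random walk reflected at 0, whose stationary distribution is geometric, giving a power law for the size. A sketch:

```python
import random

# Slide-18 process with a lower bound: size doubles w.p. 1/3, halves w.p. 2/3,
# but never drops below 1 (size = max(1, F * size)).  On the log2 scale this
# is a reflected random walk with a geometric stationary distribution, so
# Pr[size >= 2^k] = 2^(-k), i.e. Pr[size >= x] ~ 1/x.
random.seed(4)
size, steps, count_ge_8 = 1.0, 500_000, 0
for _ in range(steps):
    size = max(1.0, size * (2.0 if random.random() < 1 / 3 else 0.5))
    if size >= 8.0:
        count_ge_8 += 1
print(round(count_ge_8 / steps, 3))  # should be near Pr[size >= 8] = 1/8
```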

  20. Double Pareto Distributions • Consider the continuous version of the lognormal generative model. – At time t, log X_t is normal with mean μt and variance σ²t. • Suppose the observation time is randomly distributed. – Income model: observation time depends on age, generations in the country, etc.

  21. Double Pareto Distributions • Reed (2000, 2001) analyzes the case where the time is exponentially distributed: f(x) = ∫₀^∞ λ e^(−λt) · (1/(x σ √(2πt))) e^(−(ln x − μt)²/(2σ²t)) dt – Also Adamic, Huberman (1999). • Simplest case, μ = 0, σ = 1: f(x) = (√(2λ)/2) x^(−1−√(2λ)) for x ≥ 1; f(x) = (√(2λ)/2) x^(−1+√(2λ)) for x ≤ 1
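Reed's closed form for the μ = 0, σ = 1 case can be checked against direct numerical integration of the mixture. A sketch using a simple midpoint rule (x = 2 and λ = 1 are arbitrary test values):

```python
import math

# Reed's mixture: observe a lognormal process at an Exp(lambda) time.  With
# mu = 0, sigma = 1 the density is
#   f(x) = int_0^inf lam*e^(-lam*t) * e^(-(ln x)^2/(2t)) / (x*sqrt(2*pi*t)) dt,
# which the slide evaluates to (sqrt(2*lam)/2) * x^(-1-sqrt(2*lam)) for x >= 1.
lam, x = 1.0, 2.0
N, T = 200_000, 60.0          # midpoint rule on [0, T]; the tail beyond is tiny
dt = T / N
numeric = sum(
    lam * math.exp(-lam * t) * math.exp(-math.log(x) ** 2 / (2 * t))
    / (x * math.sqrt(2 * math.pi * t))
    for t in (dt * (i + 0.5) for i in range(N))
) * dt
closed = math.sqrt(2 * lam) / 2 * x ** (-1 - math.sqrt(2 * lam))
print(abs(numeric - closed) < 1e-4)  # True
```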

  22. Double Pareto Behavior • Double Pareto density: – On a log-log plot, the density is two straight lines. – Intermediate between lognormal (curved) and power law (one line). • Can have a lognormal-shaped body and a Pareto tail. – The ccdf has a Pareto tail; it is linear on log-log plots. – But the cdf is also linear on log-log plots.

  23. Lognormal vs. Double Pareto

  24. Double Pareto File Sizes • Reed used Double Pareto to explain income distribution – Appears to have lognormal body, Pareto tail. • Double Pareto shape closely matches empirical file size distribution. – Appears to have lognormal body, Pareto tail. • Is there a reasonable model for file sizes that yields a Double Pareto Distribution?

  25. Downey’s Ideas • Most files are derived from others by copying, editing, or filtering. • Start with a single file. • Each new file is derived from an old file: New file size = F × Old file size • Like the lognormal generative process. – Individual file sizes converge to lognormal.

  26. Problems • “Global” distribution not lognormal. – Mixture of lognormal distributions. • Everything derived from single file. – Not realistic. – Large correlation: one big file near root affects everybody. • Deletions not handled.

  27. Recursive Forest File Size Model • Keep Downey’s basic process. • At each time step, either – a completely new file is generated (prob. p), with size distribution F1, or – a new file is derived from an old file (prob. 1 − p): New file size = F2 × Old file size • Simplifying assumptions: – Distribution F1 = F2 = F is lognormal. – Old file chosen uniformly at random.
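A minimal sketch of this model in Python (the lognormal parameters and p are arbitrary). One consequence worth noting: under uniform choice of the old file, a file's depth converges to a geometric distribution, the discrete analogue of Reed's exponentially distributed observation time, which is why a double Pareto shape emerges:

```python
import random

# Recursive Forest sketch: with prob p start a brand-new file (a new tree
# root, depth 0); with prob 1-p copy a uniformly chosen existing file and
# multiply its size by a fresh lognormal factor F (depth = parent + 1).
# A depth-d file is a product of d+1 lognormal factors, and the depth
# distribution converges to geometric: Pr[depth = d] = p * (1 - p)^d.
random.seed(5)
p, steps = 0.25, 200_000
sizes, depths = [], []
for _ in range(steps):
    f = random.lognormvariate(0.0, 1.0)    # multiplicative factor F (assumed)
    if not sizes or random.random() < p:
        sizes.append(f)
        depths.append(0)                   # completely new file
    else:
        i = random.randrange(len(sizes))   # derive from a uniform old file
        sizes.append(sizes[i] * f)
        depths.append(depths[i] + 1)

frac0 = depths.count(0) / steps
frac1 = depths.count(1) / steps
print(round(frac0, 2), round(frac1, 2))  # near p = 0.25 and p*(1-p) = 0.1875
```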

  28. Recursive Forest [Diagram: forest of derivation trees; depth 0 = new files, with derived files at depths 1, 2, …]
