Analysing data with GNU R


Analysing data with GNU R. Nils Kammenhuber, Technische Universität München. Sources for some of these slides: Gilberto Câmara (Instituto Nacional de Pesquisas Espaciais) and Manfred Jobmann (Technische Universität München).


  1. Installing additional R packages from CRAN
• To install additional packages:
• Ubuntu, Debian, etc.: First, try the shell command apt-cache search pkgname
• Others / if there’s no Ubuntu/Debian package: You might need to have C, C++, Fortran (!) compilers etc. installed on your machine. Start R on your computer. At the R prompt, type:
> chooseCRANmirror()
> install.packages(c("pkg1", "pkg2"), dependencies=TRUE)
• After the package has been installed, make R use it:
> library(pkg1)
• Now you can use the new functions etc.
• If it didn’t work, try re-starting R

  2. Remainder of this talk: Typical workflow
• Read in the file
• Exploratory analysis
• Some basic statistics
• Some basic plot types
• Save the plots
• Advanced topics
• More complex analyses and plots
• More on R syntax

  3. Very simple input
• Task: Read a file into a vector
• Input file looks like this: 1 2 17.5 99
• Read this into vector x:
> x <- scan("inputfile.txt")
• There are more options → help(scan)

  4. Some examples
> length(x)
[1] 100
> min(x)
[1] -2.4923
> max(x)
[1] 2.969

  5. “What values does x typically have?” (1)

  6. If there’s heavy variation in the data…
• What’s the typical income for persons in this room? Now Bill Gates walks in. What’s the typical income now?
• Mode (“Modalwert”): The value that occurs most often
• Of course, usually only defined for categorical or discrete values. And what about local maxima? Or multiple maxima? → Use with care
• Unfortunately, not a built-in function in R. Instead use:
> tab.x <- table(x); names(tab.x[which.max(tab.x)])
• > median(x) → the 50% quantile: 50% of values are greater than the median, 50% are smaller
• Way less sensitive to outliers than the mean (cf. Bill Gates example; see the sketch below)
• Estimator for the mean if the x are symmetrically distributed
• > mean(x, trim=0.05) → trimmed mean: disregard the 5% smallest and 5% largest values of x
• Idea: Outliers are not representative / not important for the system
• Dangerous! Very often, that is a wrong assumption!
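A minimal sketch of the Bill Gates example in R (the income numbers are made up):
> incomes <- rnorm(30, mean=40000, sd=5000)   # 30 people in the room
> mean(incomes); median(incomes)              # both around 40,000
> incomes <- c(incomes, 5e10)                 # Bill Gates walks in
> mean(incomes); median(incomes)              # mean explodes; median barely moves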

  7. “Unfortunately, this is not a built-in function” → Writing your own functions in R
> harmean <- function(x) {
+ return(length(x) / sum(x))
+ }
Fix an error using an external text editor (N.B. it should be 1/x, not x):
> fix(harmean)
Syntax error […blah…] in line 2. Use x <- edit()
> fix(harmean)
Baaaad! Now all of your changes are lost! Instead you should have done what R told you:
> harmean <- edit()
Now that we’re at it, let’s also fix our geometric mean:
> geomean <- edit()
Baaaad again! Now geomean becomes a copy of whatever you last edit()ed. → Should have used fix(geomean) here
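For reference, a hedged sketch of the corrected helper functions (standard textbook definitions, not taken from the slides):
> harmean <- function(x) length(x) / sum(1/x)   # harmonic mean
> geomean <- function(x) exp(mean(log(x)))      # geometric mean, for x > 0
> harmean(c(1, 2, 4))   # 1.714286
> geomean(c(1, 2, 4))   # 2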

  8. Saving your own functions in R
• Method 1: Automatic saving
> quit()
Save workspace image? [y/n/c]:
• If you answer “y” now, your own functions will re-appear the next time that you start R
• …but only if you start it from the same working directory! And where is that?
> getwd()
[1] "/home/blabla/research/mydata"
• Method 2: Edit an external text file with, e.g., Emacs, then
> source("my-functions.R")
• What about the other way round? (i.e., R → text file)
> dump(list=c("fnc1", "fnc2"), file="my-fun.R")

  9. Back to the data. What about other statistics?
• quantile(x, probs=c(0.05, 0.1, 0.9, 0.95))
• The values below which 5%, 10%, 90%, or 95% of the data lie
• > summary(x)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-9.38 -3.51 -0.03 0.54 3.96 11.67
• The 25% and 75% quantiles are called the quartiles
• Help! summary() behaves strangely and doesn’t calculate quantiles and mean!
• Probably some non-numerical values slipped into your data → R will not view the data as numbers
• Typical culprits: “-”, “N/A”, “*”, “unknown” etc. in your file (see the sketch below)
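A hedged sketch for hunting down such culprits (file name and tokens are illustrative):
> raw <- scan("inputfile.txt", what=character())
> bad <- unique(raw[is.na(suppressWarnings(as.numeric(raw)))])
> bad                                  # e.g. "-", "N/A", "unknown"
> x <- as.numeric(raw[!raw %in% bad])  # keep only the numeric entries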

  10. Measuring dispersion
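The slide’s formulas are not in the transcript; the usual dispersion measures as built into R:
> var(x)    # sample variance
> sd(x)     # sample standard deviation (= sqrt(var(x)))
> IQR(x)    # interquartile range: 75% quantile minus 25% quantile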

  11. Why is there also the standard deviation if we already have the variance?

  12. Standard deviation, standard error
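The formulas themselves are not in the transcript; a minimal sketch using the standard definitions:
> sd(x)                    # standard deviation of the data
> sd(x) / sqrt(length(x))  # standard error of the mean: shrinks as n grows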

  13. Comparing dispersions

  14. And if I want to compare the standard errors, not the standard deviations?

  15. What about Bill Gates and dispersion…?
• Remove the “constant=1” to get an estimate for the standard deviation (then you don’t have to do the 0.6745 rescaling yourself)
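Presumably this slide used mad(); a minimal sketch of the two variants (R’s default constant is 1.4826 ≈ 1/0.6745):
> mad(x, constant=1)   # the raw median absolute deviation
> mad(x)               # rescaled: a robust estimate of the standard deviation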

  16. Robust statistics
• Classical statistics: mean, variance, …
• Very sensitive to outliers
• More exact if there are no outliers
• Readers (and paper reviewers!) are accustomed to it
• Robust statistics: median, MAD, …
• Not sensitive to outliers
• Less exact: One single datapoint represents the entire data set!
• Readers and reviewers may not know them (“The median, ah, yes, what’s that again… and WTF is MAD?”)
• When in doubt, use robust statistics
• But that’s just my personal opinion!

  17. Enough. Let’s plot!
• Obvious approach: Scatter plot
> plot(x)
> plot(x, type="l")

  18. Plotting (2)
Combination of type="p" and type="l":
> plot(x, type="b")

  19. But does this plot type actually reveal what we want to see?
• Always ask this question!
• So, are you really interested in the temporal development? Apparently, the values fluctuate randomly and are not really correlated…
• Better alternatives:
• Box–whisker plot
• Histogram plot
• Density plot
• CDF plot
• CCDF plot
• Scatterplot against other data
• QQ plot against a postulated theoretical distribution
• QQ plot against other data

  20. What you should not plot
• A real researcher never, ever uses pie charts!
• There’s a good reason why R has no method for drawing them
• More on this in the slideset “How to lie with statistics”
• Also, you should try to avoid 3D charts at all costs
• A 3D plot is plotted on 2D paper or a 2D screen → often very difficult to make out the 3D structure
• Much more difficult to understand (“so what does it actually mean if a data point is in the upper left rear corner?”)
• Only use 3D plots as a very last resort
• Always remember:
• You don’t make the plot just for you
• You don’t make the plot to impress the reader
• You make the plot to help the reader (and the reviewer!) understand your point

  21. Box–whisker plot
Tersely shows the distribution of your data:
> boxplot(x)
The box marks the 25% quantile, the 50% quantile (i.e., the median) and the 75% quantile; the whisker boundaries are calculated from the quantile ranges (the last data points that are still within 1.5 · IQR of the box); anything beyond is drawn as individual outlier points.

  22. Histogram plot: This is what you usually want
> hist(x, breaks=20)
> hist(x)

  23. Density plots (PDF)
• “A histogram with an infinite number of (interpolated) bars”
• Interpolation using a density kernel function
> plot(density(x))

  24. Cumulative distribution function plots (CDF)
• Basically, it’s just the integral of the density function
• But without the interpolation part, so it’s actually a sum, not an integral
> plot(ecdf(x))
• Why a CDF instead of a density or histogram?
• You can immediately read off quantiles (e.g., the 20% quantile)
• More expressive if the points spread out very wide

  25. Logarithmic axes
• Use logarithmic scaling if the data is very asymmetric
• Always a good idea to play around with axis scaling!
> plot(ecdf(x), log="x")
Error: logarithmic axis must have positive boundaries
> plot(ecdf(x), log="x", xlim=range(x))

  26. atanh, asinh axes (Ralph: “The Kammenhuber scale” – but I was not the first to discover it ;-)
• Properties of logarithmic scaling:
• Visual magnification towards 0
• Visual compression towards +∞
• Cannot display the value 0.0 or negative values
• But what if we wanted logarithm-like scaling for negative and for positive values in one plot?
• Two types of hyperbolic scaling:
• asinh axis scaling: visual compression towards −∞, visual magnification towards 0, visual compression towards +∞
• [shifted and scaled] atanh axis scaling: visual magnification towards 0.0, visual compression around 0.5, visual magnification towards 1.0

  27. Linear vs. asinh X axis
• The linear scale hides the fact that we have two peaks!
• N.B. The interval boundaries are different

  28. Linear vs. atanh Y axis
• The linear scale hides the fact that the temporal development towards larger key lengths also took place at very short and very large key lengths, not only at the popular key lengths

  29. Why does it work? atanh scale (shifted); asinh scale

  30. How to do it in R
• Rather tricky! R has no built-in support for this; you need to fiddle around with it manually
• Basic idea for an asinh-scaled X axis:
> plot(x=asinh(data), y=…, xaxt="n", …)
> axis(side=1, at=asinh(c(-100, -10, -1, 0, 1, 10, 100)), labels=c(-100, -10, -1, 0, 1, 10, 100))
• Basic idea for an atanh-scaled Y axis:
> atanhS <- function(x) atanh(2*x-1)
> plot(x=…, y=atanhS(data), yaxt="n", …)
> axis(side=2, at=atanhS(c(0.01, 0.1, 0.5, 0.9, 0.99)), labels=c(0.01, 0.1, 0.5, 0.9, 0.99))
• Cumbersome → I wrote some R scripts that are about to be released (a self-contained sketch follows below)
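A self-contained sketch of the asinh idea above, with made-up data (the variable names are illustrative only):
> data <- c(-80, -5, -0.5, 0, 0.3, 7, 90)
> plot(x=asinh(data), y=seq_along(data), xaxt="n", xlab="value (asinh scale)", ylab="index")
> ticks <- c(-100, -10, -1, 0, 1, 10, 100)
> axis(side=1, at=asinh(ticks), labels=ticks)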

  31. Nomenclature
• Apparently, these axis scales have been known for decades or centuries
• Apparently, there is no consistent nomenclature for them
• I call them “two-ended pseudo-logarithmic scales”, since that describes what they look like
• Hard to understand if you explain it purely orally: ‘But hey, a double-logarithmic plot is easy to do in R, just say plot(…, log="xy")’
• That’s true, but a double-logarithmic plot is a plot with a logarithmic X and a logarithmic Y axis – that’s an entirely different thing!

  32. Further information
• Nils Kammenhuber: Two-ended pseudo-logarithmic scaling. Technical Report, NET 2012, to appear

  33. Combining plots
• Now I have multiple measurements that I want to compare in one single plot
> plot(first.data, type="l", col="black", lty=1)
> lines(second.data, col="blue", lty=2)
• Or with points instead of lines:
> plot(first.data, type="p", col="black", pch=1)
> points(second.data, col="blue", pch=2)

  34. Adjusting axis limits
• Problem: The axis limits are not readjusted after the first call to plot()
> plot(first.data, type="l", col="black", lty=1, xlim=range(c(first.data, second.data)), ylim=range(c(first.data, second.data)))
> lines(second.data, col="blue", lty=2)
• Alternatively, you can specify the axis limits manually, e.g., for probabilities:
> plot(first.data, …, xlim=range(c(first.data, second.data)), ylim=c(0.0, 1.0))

  35. “I just ran the experiment for half an hour and I feel that I have enough data now.”
• Maybe… maybe not. The readers expect error bars or something like that
• How do I get error bars? Some researchers just use the arithmetic mean ± 2 standard errors (or ± 3)
• About 95.5% of all data falls within ±2σ of the “real” mean
• About 99.7% of all data falls within ±3σ of the “real” mean
• …but only if the data is actually normally distributed!
• Better: calculate a confidence interval on the mean of your data
• Semantics: “With a probability of 99% (→ confidence level; adjustable), the process that generated this data has an arithmetic mean that lies within this interval”
• Higher confidence → interval gets larger
• Higher variability → interval gets larger
• More data points → interval gets smaller

  36. How to calculate confidence intervals
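The calculation itself is on the (untranscribed) slide; a minimal sketch of the usual Student-t interval in R:
> t.test(x, conf.level=0.99)$conf.int   # 99% confidence interval on the mean
Or by hand, from the standard formula:
> n <- length(x)
> mean(x) + qt(c(0.005, 0.995), df=n-1) * sd(x) / sqrt(n)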

  37. Confidence intervals: Be careful!
1. Your data should be (roughly) normally distributed
• …at least, not very asymmetric
• Check it → QQ plots, and maybe statistical tests
• If not the case → group the data and apply the Central Limit Theorem
2. Your data should be independently distributed
• Often not the case!
• Check it → (1) Plot the data and look for a trend, (2) ACF plots
• If not the case → group the data into larger groups and check the ACF again
3. The Central Limit Theorem only works if the system that creates your samples has finite variance
• More on this later

  38. “Here, the error roughly looks like a Gauß curve, so it must be normally distributed”
• How do I prove that my data is normally distributed?
• Naïve approach:
• Plot a histogram and density of your data
• Plot the density function of the normal distribution into the same diagram
• “Here, they look the same”
> hist(mydata, freq=FALSE, xlim=c(0.0, 0.3))
> curve(dnorm(x, mean=mean(mydata), sd=sd(mydata)), col="blue", lty=2, add=TRUE)
> legend( … more about this later… )
• It looks normally distributed, but this data actually comes from the sum of a Student-t and an exponential distribution…

  39. What’s wrong with that approach
1. You can never “prove” that data is normally (or exponentially, t-, Pareto-, …) distributed
• You can only present graphs, calculations, statistical tests etc. that do not contradict that assumption
2. The eye is easily fooled – many symmetrical distributions look like Gauß’ bell curve
3. There are way better ways to analyse this:
• QQ plots
• PP plots (not shown)
• Statistical tests: χ² test, Kolmogorov–Smirnov test, … (not shown – they usually are “too picky” ;-)

  40. Digression #1: Theoretical probability distributions in R
• A lot of distributions are built into R
• Naming convention: For every probability distribution xy, there are the following four functions:
• dxy() – the density function (I used that in the histogram)
• pxy() – the distribution function (≈ cumulative probability) (you could use that for comparison in a CDF plot of real data)
• qxy() – the quantile function (useful for QQ plots and some statistical tests)
• rxy() – draw random numbers from that distribution
• Example:
• rt(50, df=3) + rexp(50) - rnorm(50, sd=3): Create 50 random variables from a combination of t, exponential and normal distributions

  41. Digression #2: Random number generators

  42. Random number generators: Literature
• The I–8 Discrete Event Simulation slideset on random number generation
• L’Ecuyer, Simard: TestU01: A C library for empirical testing of random number generators. ACM Transactions on Mathematical Software, Vol. 33, No. 4, 2007

  43. Better alternative instead of comparing histogram vs. density curve: QQ plots
• Short for quantile–quantile plot: compare the quantiles of two distributions
• Imagine you have two distributions (e.g., #1 – empirical observations; #2 – a postulated theoretical distribution, e.g., the normal distribution)
• Calculate the 1%, 2%, … 100% quantile for each distribution
• Take the quantiles of the 1st distribution as Y coordinates and the quantiles of the 2nd distribution as X coordinates
• How to read them:
• A straight (i.e., unbent) line means that the two distributions are similar
• “In a sense, the only function that the human mind really understands well is a straight line. One good strategy for making a graph easy to understand is to make it as linear as possible.” – John P. Boyd
• Be careful about the ends
• Also note the axis labels – they might be shifted

  44. QQ plot example
• Histogram comparison: looks ~normal
• QQ plot: visible differences – not a straight line

  45. How to do a QQ plot in R
• QQ plot against the normal distribution:
> qqnorm(mydata)
> qqline(mydata) # auxiliary straight line
• QQ plot against some other distribution xy:
> qqplot(qxy(ppoints(data), …), data)
• Nicer:
> qqplot(qxy(ppoints(data), …), data, xlab="Theoretical quantiles", ylab="Sample quantiles", main="Comparison with xy distribution")
• QQ plots can also be used to compare different data sets:
> qqplot(data.internet, data.mysimulation)
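A concrete instance of the qxy pattern above, against an exponential distribution (the rate is fitted from the data; this is an illustrative sketch, not from the slides):
> data <- rexp(200, rate=2)
> qqplot(qexp(ppoints(data), rate=1/mean(data)), data, xlab="Theoretical quantiles", ylab="Sample quantiles", main="Comparison with exponential distribution")
> abline(0, 1, lty=2)   # reference line: points near it mean a good fit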

  46. QQ plots with type="l" or type="b"
> qqnorm(data, type="b")
> qqnorm(sort(data), type="b")
• Huh!?? (The catch: qqnorm() plots the points in the original data order, so with type="b" the connecting lines zig-zag unless the data is sorted first)

  47. Be careful when reading QQ plots! Don’t let yourselves be fooled!
• DAX returns vs. the normal distribution:
• Good model for normal trading days
• Bad model for exceptional trading days

  48. Confidence intervals: Be careful!
1. Your data should be (roughly) normally distributed
• …at least, not very asymmetric
• Check it → QQ plots, and maybe statistical tests
• If not the case → group the data and apply the Central Limit Theorem
2. Your data should be independently distributed
• Often not the case!
• Check it → (1) Plot the data and look for a trend, (2) ACF plots
• If not the case → group the data into larger groups and check the ACF again
3. The Central Limit Theorem only works if the system that creates your samples has finite variance
• More on this later

  49. Statistical tests
• If you really want to be correct, then use a statistical test once you’re satisfied with the QQ plot
• Millions of statistical tests can be used to check “if the data follows some given theoretical distribution” (e.g., Gauß’ normal distribution):
• χ² test: Very universal (i.e., can be used with just about any distribution), but less powerful (i.e., needs more data)
• Kolmogorov–Smirnov test: only for a number of popular distributions, but more powerful (also works with less data)
• Anderson–Darling test, …
• Most of them are built into R; check the literature to see how to use them (a minimal example follows below)
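For instance, the Kolmogorov–Smirnov test is built in as ks.test(); a hedged sketch (note that fitting mean and sd from the same data, as done here, strictly calls for a Lilliefors-type correction):
> ks.test(x, "pnorm", mean(x), sd(x))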

  50. Statistical tests: General observations
• A statistical test can never prove that your data actually follows some distribution XY! Rather, it checks whether it finds evidence against (!) this assumption.
• If no such evidence is found: “acquitted due to lack of evidence”, i.e., “could be XY-distributed” (but not: “acquitted due to obvious innocence”!)
• If such evidence is found: “seems guilty from my point of view”, i.e., “to me, it doesn’t look XY-distributed” – but in fact you might simply not have collected enough data yet, or you should have used a more powerful test:
• The more restricted your test (the fewer distributions you can use it for etc.), the better its power
• As with confidence intervals, you select a confidence level
• Semantics: “With an α = 5% error rate, no evidence could be found that contradicts the assumption that the data is XY-distributed”
• In fact, confidence intervals are a special way to formulate a Student t-test

  51. Confidence intervals: Be careful!
1. Your data should be (roughly) normally distributed
• …at least, not very asymmetric
• Check it → QQ plots, and maybe statistical tests
• If not the case → group the data and apply the Central Limit Theorem
2. Your data should be independently distributed
• Often not the case!
• Check it → (1) Plot the data and look for a trend, (2) ACF plots
• If not the case → group the data into larger groups and check the ACF again
3. The Central Limit Theorem only works if the system that creates your samples has finite variance
• More on this later

  52. Forcing our data to be normally distributed: Making use of the Central Limit Theorem
• Oftentimes, your data is not normally distributed
• Central Limit Theorem, pragmatic version “for dummies”:
• You have a large number of samples that do not at all look normally distributed
• Group them together into chunks of about 30–100 samples
• Use groups of 30 if the data looks symmetric and roughly similar to a Gauß bell curve
• Use groups of 100 if the data looks very asymmetric
• For each group, calculate the arithmetic mean of that group
• Now a miracle happens: The group means [roughly] follow a normal distribution!
• Try it out in R yourselves! (a sketch follows below)
• Now you can use the group means as input for your statistical tool that expects normally distributed data (e.g., confidence intervals)
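“Try it out yourselves”: a minimal sketch with clearly non-normal (exponential) raw data:
> raw <- rexp(3000)                                          # very asymmetric raw data
> group.means <- rowMeans(matrix(raw, ncol=100, byrow=TRUE)) # 30 groups of 100 samples
> hist(group.means)                                          # roughly bell-shaped
> qqnorm(group.means); qqline(group.means)                   # close to a straight line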

  53. Confidence intervals: Be careful!
1. Your data should be (roughly) normally distributed
• …at least, not very asymmetric
• Check it → QQ plots, and maybe statistical tests
• If not the case → group the data and apply the Central Limit Theorem
2. Your data should be independently distributed
• Often not the case!
• Check it → (1) Plot the data and look for a trend, (2) ACF plots
• If not the case → group the data into larger groups and check the ACF again
3. The Central Limit Theorem only works if the system that creates your samples has finite variance

  54. Central Limit Theorem: limits
• The Central Limit Theorem only holds if the process that creates your data has finite variance!
• Of course, the sample variance is always finite: sd(mydata) or var(mydata) will always give a finite value (…unless mydata contains a +Inf value)
• But the process that created your samples can have an infinite variance
• “Infinite variance? Come on, that’s an esoteric special case!”
• No, it’s not – it can easily happen with power-law distributions…
• …and power-law distributions are ubiquitous!
• More on this on later slides

  55. Confidence intervals: Be careful!
1. Your data should be (roughly) normally distributed
• …at least, not very asymmetric
• Check it → QQ plots, and maybe statistical tests
• If not the case → group the data and apply the Central Limit Theorem
2. Your data should be independently distributed
• Often not the case!
• Check it → (1) Plot the data and look for a trend, (2) ACF plots
• If not the case → group the data into larger groups and check the ACF again
3. The Central Limit Theorem only works if the system that creates your samples has finite variance
• More on this later

  56. Data must not contain significant trends
• Simply do plot(data) and visually check if there’s a trend:
• e.g., values seem to grow over time…
• …or seem to decay…
• …or you see a curved shape (e.g., first upwards, then downwards)
• …etc.
• AFAIK there is no really good way to check all of these with a statistical test
• Even if you can’t make out a trend, you’ll still have to check independence using an ACF plot

  57. ACF plots: Are your samples really independent?
• Consider the temporal development of the queue length at a router:
• If it is large now → it cannot change much in the next milliseconds
• If it is small now → it is unlikely to change dramatically
• If your data shows such autocorrelation → confidence intervals will be too small → if you don’t check this, you’re a cheater!
• Check autocorrelation using ACF plots (cf. next slides)
• If the plots suggest unacceptable autocorrelation, then…
• Group the samples (as with the Central Limit Theorem)
• If they are grouped already, make the groups larger
• “Should I include these ACF plots in my paper?”
• No – just say that “using ACF plots, we ensured that our samples (or sample means) were sufficiently uncorrelated”

  58. Autocorrelation

  59. How to do and how to read ACF (autocorrelation function) plots
• How to do them in R (IMO rather silly for lag = 0, but R does it…):
> acf(mydata)
• What it shows:
• The autocorrelation for lag = 0, 1, 2, 3, …
• Boundaries for “random” autocorrelation: an autocorrelation within these boundaries is very likely to be caused by random fluctuations, not by an actual systemic autocorrelation
• How to read it:
• GOOD: If all bars (except the silly first one) stay within these boundaries, then we can be happy and assume the sample does not contain unacceptably large autocorrelation
• BAD (i.e., we have an indication of an unacceptable level of autocorrelation): If further bars at the beginning cross the boundaries (erratic single bars further towards the right might be okay)
• (A synthetic good/bad example follows below)
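A hedged sketch that reproduces both cases with synthetic data (arima.sim is part of R’s stats package):
> iid <- rnorm(500)                                    # independent samples
> correlated <- arima.sim(model=list(ar=0.8), n=500)   # strongly autocorrelated
> acf(iid)          # good: bars stay inside the boundaries
> acf(correlated)   # bad: slowly decaying bars cross the boundaries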

  60. The data (plot)

  61. Good ACF plot

  62. The data (plot)

  63. Bad ACF plot (annotations: “With a grain of salt, we might tolerate this… but not this”)

  64. The data (plot)

  65. Terrible ACF plot

  66. This number shuffling is not only good for confidence intervals…
• Many statistical tools require that your data be uncorrelated (“iid”)
• …and some require it to be normally distributed
• You can apply the same tools in these cases (ACF plots, grouping, QQ plots, Central Limit Theorem, etc.)

  67. Digression: How to present a plot in a presentation
I am a bad example, since I don’t adhere to these steps in this talk, but…:
1. What figures does this plot talk about? Why do you show it? “This plot compares the development of the gnaricrosticity in relation to the bullawomptleness and will show that this paper is great”
2. Explain the axes. (People tend to forget this, but it’s important!)
• “The X axis displays the bullawomptleness in square Hertz per packet”
• “The Y axis displays the gnaricrosticity in cubic degrees per nanobyte”
3. How do you read your plot? (Don’t interpret the data points yet!)
• “Data points in the upper right corner are good, because blah.”
• “Data points in the lower right corner are unfavourable, ….”
4. Why are there multiple lines/colours/… in your plot? “The blue dots represent measurements in Munich, the red dots represent those in Shanghai.”
5. Only now can you start interpreting what the plot shows: “As you can see, we only have data points in the upper left corner, and only one in the lower right. Therefore I deserve the Fields Medal.”

  68. CCDF plots
• CDF plot: shows the cumulative probability on the Y axis
• CCDF plot: shows 1 − cumulative probability on the Y axis
• What’s the point? If we use a logarithmic Y axis, it makes a huge difference whether we do a CDF plot or a CCDF plot
• CCDFs are almost always plotted with logscale X and logscale Y axes
• Can be used to spot power laws (cf. later slides)
• Alternative names: survival plot, survival function plot, …
• How to do them in R? Unfortunately, not a built-in function (maybe I shouldn’t have mentioned that fact about the pie charts…)
• Use my crude plot.ccdf.R
• or download Fabian Schneider’s plot.ccdf.R
• or try to find a package on CRAN (or hand-roll it; see the sketch below)
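Not the plot.ccdf.R mentioned above, but a minimal hand-rolled sketch of the same idea:
> plot.ccdf <- function(x, ...) {
+   x <- sort(x)
+   p <- 1 - (seq_along(x) - 1) / length(x)   # empirical P(X >= x)
+   plot(x, p, log="xy", type="s", xlab="x", ylab="CCDF", ...)
+ }
> plot.ccdf(rlnorm(1000))   # a heavy-ish tail shows as a slowly falling curve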

  69. A CDF plot with linear scale

  70. CDF plot, logarithmic X axis (same data)

  71. CDF plot, log X and Y axes (same data)

  72. CCDF plot, log X and Y axes (same data!!)

  73. Why should we care about power laws?
• Not everything out there is normally or exponentially distributed
• Actually, power laws and similar distributions are rather widespread (more on this later)
• Surprising and nasty properties! (see next slides)
• En vogue in the scientific community since about the 1980s/1990s (depending on the subject):
Every time you see something resembling a straight line on a plot with doubly logarithmic axes, you are expected to instantly yell “A power law! A power law!”, regardless of whether that actually makes sense. The more log-log-straight lines in your paper, the merrier.

  74. Power laws: Surprising properties
• Depending on the exponent parameter, a power-law distribution may have…
• …infinite variance → very high variability
• You cannot just measure, e.g., 100 values, take their mean, and assume that the sample mean is a good estimator for the actual mean of the generating process – and the median is only a good estimator for the mean of a symmetric distribution!
• The Central Limit Theorem will not work → you cannot simply group a couple of values and trust that they will be normally distributed around their mean
• …or even an infinite mean!
• Warning: Of course, the sample variance and sample mean are always defined! (unless you have Inf or NaN values in your data)
• Other surprising properties (more on this later):
• Self-similarity, fractal character
• Long-range dependence

  75. Nomenclature confusion

  76. Some notions in the power-law context
• Used almost interchangeably: Power Law, Power Tail, Long Tail, Fat Tail, Heavy Tail
• Pareto distribution
• Zipf’s Law
• 80 : 20 rule, Pareto principle
• (Pseudo-|Asymptotic|Stochastic) self-similarity (of 1st, 2nd degree)
• Fractal
• Long-Range Dependency (LRD)
• Slowly decaying variance
• 1/f noise
• Hurst parameter > 0.5
• …but of course most of them do not describe exactly the same systems – there sometimes are subtle differences!
• N.B. We won’t describe them with much mathematical rigour…

  77. Self-similarity, fractals
Scale-free distribution: Not just a curiosity, but generates samples that are self-similar or fractal, i.e., more or less look the same on all kinds of scales:
• Real world / power-law traffic: variability remains the same across many time scales
• Unrealistic / non-power-law traffic: variability is quickly averaged out on larger time scales

  78. 80 : 20 rule (a.k.a. “Pareto principle”)
• The fraction W of the wealth in the hands of the richest fraction P of the population is given by W = P^((α−2)/(α−1))
• Example: US wealth: α = 2.1 → the richest 20% of the population holds 86% of the wealth
• Of course, not only restricted to wealth, but a fundamental property of power-law-distributed variables
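A quick check of the slide’s example in R:
> P <- 0.2; alpha <- 2.1
> P^((alpha - 2) / (alpha - 1))   # ≈ 0.864: the richest 20% hold about 86%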

  79. Power laws are seemingly everywhere (1)
(Collage of log–log plots; panel captions: scientific papers 1981–1997; AOL users visiting sites ’97; in Moby Dick; bestsellers 1895–1965; AT&T customers on 1 day; California 1910–1992)

  80. Power laws are seemingly everywhere (2)
(Collage of log–log plots; panel captions: wars 1816–1980; solar flares; Moon; richest individuals 2003; US family names 1990; US cities 2003)
