Analysing data with GNU R
Nils Kammenhuber, Technische Universität München
Sources for some of these slides: Gilberto Câmara (Instituto Nacional de Pesquisas Espaciais), Manfred Jobmann (Technische Universität München)
Installing additional R packages from CRAN
• To install additional packages:
• Ubuntu, Debian, etc.: first, try the shell command apt-cache search pkgname
• Others / if there's no Ubuntu/Debian package:
• You might need to have C, C++, Fortran (!) compilers etc. installed on your machine
• Start R on your computer. At the R prompt, type:
  > chooseCRANmirror()
  > install.packages(c("pkg1", "pkg2"), dependencies=TRUE)
• After the package has been installed, make R use it:
  > library(pkg1)
• Now you can use the new functions etc.
• If it didn't work, try re-starting R
Remainder of this talk: Typical workflow
• Read in the file
• Explorative analysis
• Some basic statistics
• Some basic plot types
• Save the plots
• Advanced topics
• More complex analyses and plots
• More on R syntax
Very simple input
• Task: Read a file into a vector
• Input file looks like this:
  1
  2
  17.5
  99
• Read this into vector x:
  x <- scan("inputfile.txt")
• There are more options: help(scan)
Some examples
> length(x)
[1] 100
> min(x)
[1] -2.4923
> max(x)
[1] 2.969
"What values does x typically have?" (1)
If there's heavy variation in the data…
• What's the typical income for persons in this room? Now Bill Gates walks in. What's the typical income now?
• Mode ("Modalwert"): The value that occurs most often
• Of course, usually only defined for categorical or discrete values
• And what about local maxima? Or multiple maxima? → Use with care
• Unfortunately, not a built-in function in R. Instead use:
  > tab.x <- table(x); names(tab.x[which.max(tab.x)])
• > median(x) – the 50% quantile
• 50% of values are greater than the median, 50% are lesser
• Way less sensitive to outliers than the mean (cf. Bill Gates example)
• Estimator for the mean if the x are symmetrically distributed
• > mean(x, trim=0.05) – trimmed mean
• Disregard the 5% smallest and 5% largest values of x
• Idea: Outliers are not representative / not important for the system
• Dangerous! Very often, that is a wrong assumption!
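The three "typical value" estimators above can be tried out on a small, hypothetical income vector (the values here are made up for illustration):

```r
# Hypothetical income data with one Bill-Gates-style outlier:
x <- c(30, 32, 30, 35, 31, 30, 2000)

# Mode: most frequent value (only meaningful for discrete data)
tab.x <- table(x)
mode.x <- as.numeric(names(tab.x)[which.max(tab.x)])
mode.x                 # 30, the most common income

median(x)              # 31: barely affected by the outlier
mean(x)                # ~312.6: dragged way up by the outlier
mean(x, trim = 0.15)   # drops the smallest/largest 15% before averaging
```

Note how the median and the trimmed mean stay near the bulk of the data while the plain mean explodes.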
"Unfortunately, this is not a built-in function" – writing your own functions in R
> harmean <- function(x) {
+   return(length(x) / sum(x))
+ }
Fix an error using an external text editor (N.B. should be 1/x, not x):
> fix(harmean)
Syntax error […blah…] in line 2. Use x <- edit()
> fix(harmean)
Baaaad! Now all of your changes are lost! Instead you should have done what R told you:
> harmean <- edit()
Now that we're at it, let's also fix our geometric mean:
> geomean <- edit()
Baaaad again! Now geomean becomes a copy of whatever you last edit()ed.
→ Should have used fix(geomean) here
Saving your own functions in R
• Method 1: Automatic saving
  > quit()
  Save workspace image? [y/n/c]:
• If you answer "y" now, your own functions will re-appear the next time that you start R
• …but only if you start it from the same working directory!
• And where is that?
  > getwd()
  [1] "/home/blabla/research/mydata"
• Method 2:
• Edit an external text file with, e.g., Emacs
  > source("my-functions.R")
• What about the other way round? (i.e., R → text file)
  > dump(list=c("fnc1", "fnc2"), file="my-fun.R")
Back to the data. What about other statistics?
• quantile(x, probs=c(0.05, 0.1, 0.9, 0.95))
• The values below which only 5% or 10% of the data lie (or above which only 10% or 5% lie)
• > summary(x)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  -9.38   -3.51   -0.03    0.54    3.96   11.67
• The 25% and 75% quantiles are called the quartiles
• Help! summary() behaves strangely and doesn't calculate quantiles and mean!
• Probably some non-numerical values slipped into your data → R will not view the data as numbers
• Typical culprits: "-", "N/A", "*", "unknown" etc. in your file
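A minimal sketch of the "non-numerical culprits" situation: the file content below is hypothetical, with "N/A" standing in for a missing value. Coercing to numeric turns the culprit into NA, which summary() then reports explicitly:

```r
# Hypothetical raw file content with one non-numeric culprit
raw <- c("1.5", "2.0", "N/A", "3.5")

# as.numeric() turns "N/A" into NA (with a coercion warning)
x <- suppressWarnings(as.numeric(raw))

summary(x)              # now shows an NA count instead of misbehaving
mean(x, na.rm = TRUE)   # drop the NA when computing statistics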
Measuring dispersion
Why is there also the standard deviation if we already have the variance?
Standard deviation, standard error
Comparing dispersions
And if I want to compare the standard errors, not the standard deviations?
What about Bill Gates and dispersion…?
• (Example on the slide: mad(x, constant=1), the median absolute deviation.)
• Remove the "constant=1" to get an estimate for the std. dev. (then you don't have to divide by 0.6745)
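A sketch of sd() vs. mad() on the Bill Gates example from earlier (the income values are hypothetical). mad()'s default constant is 1.4826 (= 1/0.6745), which scales the raw MAD so that it estimates the standard deviation for normally distributed data:

```r
income <- c(30, 32, 30, 35, 31, 30)

sd(income)                   # small spread
sd(c(income, 2000))          # explodes once the outlier walks in

mad(c(income, 2000))         # robust: barely moves
mad(income, constant = 1)    # raw median absolute deviation
```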
Robust statistics
• Classical statistics: mean, variance, …
• Very sensitive to outliers
• More exact if there are no outliers
• Readers (and paper reviewers!) are accustomed to it
• Robust statistics: median, MAD, …
• Not sensitive to outliers
• Less exact: One single datapoint represents the entire data set!
• Readers and reviewers may not know them ("The median, ah, yes, what's that again… and WTF is MAD?")
• When in doubt, use robust statistics
• But that's just my personal opinion!
Enough. Let's plot!
• Obvious approach: Scatter plot
  > plot(x)
  > plot(x, type="l")
Plotting (2)
Combination of type="p" and type="l":
> plot(x, type="b")
But does this plot type actually reveal what we want to see?
• Always ask this question!
• So, are you really interested in the temporal development? Apparently, the values fluctuate randomly and are not really correlated…
• Better alternatives:
• Box–whisker plot
• Histogram plot
• Density plot
• CDF plot
• CCDF plot
• Scatterplot against other data
• QQ plot against postulated theoretical distribution
• QQ plot against other data
What you should not plot
• A real researcher never, ever uses pie charts!
• Even the help page of R's pie() function warns against them
• More on this in the slideset "How to lie with statistics"
• Also, you should try avoiding 3D charts at all costs
• A 3D plot is plotted on 2D paper or a 2D screen → often very difficult to make out the 3D structure
• Much more difficult to understand ("so what does it actually mean if a data point is in the upper left rear corner?")
• Only use 3D plots as a very last resort
• Always remember:
• You don't make the plot just for you
• You don't make the plot to impress the reader
• You make the plot to help the reader (and the reviewer!) understand your point
Box–whisker plot
Tersely shows the distribution of your data:
> boxplot(x)
• Box: 25% quantile, 50% quantile (i.e., median), 75% quantile
• Whisker boundaries calculated from quantile ranges: the last data points that are still within 1.5 · IQR of the box
• Points beyond the whiskers are drawn as outliers
Histogram plot: This is what you usually want
> hist(x)
> hist(x, breaks=20)
Density plots (PDF)
• "A histogram with an infinite number of (interpolated) bars"
• Interpolation using a density kernel function
> plot(density(x))
Cumulative distribution function plots (CDF)
• Basically, it's just the integral of the density function
• But without the interpolation part
• So it's actually a sum, not an integral
> plot(ecdf(x))
• Why CDF instead of density or histogram?
• Can immediately see quantiles (e.g., the 20% quantile)
• More expressive if points spread out very wide
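The "can immediately see quantiles" point can be sketched directly in R: ecdf() returns a function, so quantile-style questions can be answered without any plotting (the data vector here is made up):

```r
x  <- c(1, 2, 2, 3, 5, 8, 13, 21)   # hypothetical sample
Fx <- ecdf(x)                        # ecdf() returns a step function

Fx(5)              # fraction of values <= 5: 5 out of 8 = 0.625
quantile(x, 0.2)   # the 20% quantile mentioned on the slide
```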
Logarithmic axes
• Use logarithmic scaling if the data is very asymmetric
• Always a good idea to play around with axis scaling!
> plot(ecdf(x), log="x")
Error: logarithmic axis must have positive boundaries
> plot(ecdf(x), log="x", xlim=range(x))
atanh, asinh axes (Ralph: "The Kammenhuber scale" – but I was not the first to discover it ;-))
• Properties of logarithmic scaling:
• Visual magnification towards 0
• Visual compression towards +∞
• Cannot display the value 0.0 or negative values
• But what if we wanted logarithm-like scaling for negative and for positive values in one plot?
• Two types of hyperbolic scaling:
• asinh axis scaling:
  • Visual compression towards −∞
  • Visual magnification towards 0
  • Visual compression towards +∞
• [shifted and scaled] atanh axis scaling:
  • Visual magnification towards 0.0
  • Visual compression around 0.5
  • Visual magnification towards 1.0
Linear vs. asinh X axis
• The linear scale hides the fact that we have two peaks!
• N.B. Interval boundaries are different
Linear vs. atanh Y axis
• The linear scale hides the fact that the temporal development towards larger key lengths also took place at very short and very large key lengths, not only for popular key lengths
Why does it work? atanh scale (shifted); asinh scale
How to do it in R
• Rather tricky!
• R has no builtin support for this
• Need to manually fiddle around with it
• Basic idea for an asinh-scaled X axis:
  > plot(x=asinh(data), y=…, xaxt="n", …)
  > axis(side=1, at=asinh(c(-100, -10, -1, 0, 1, 10, 100)),
         labels=c(-100, -10, -1, 0, 1, 10, 100))
• Basic idea for an atanh-scaled Y axis:
  > atanhS <- function(x) atanh(2*x-1)
  > plot(x=…, y=atanhS(data), yaxt="n", …)
  > axis(side=2, at=atanhS(c(0.01, 0.1, 0.5, 0.9, 0.99)),
         labels=c(0.01, 0.1, 0.5, 0.9, 0.99))
• Cumbersome → I wrote some R scripts that are about to be released
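The asinh idea above can be sketched as a complete, runnable example (the data vector is made up; the essential trick is: transform the data, suppress the default axis with xaxt="n", then draw tick marks at transformed positions with untransformed labels):

```r
data  <- c(-200, -15, -2, 0, 3, 40, 500)   # hypothetical sample
ticks <- c(-100, -10, -1, 0, 1, 10, 100)   # where we want labels

plot(x = asinh(data), y = seq_along(data), xaxt = "n",
     xlab = "value (asinh scale)", ylab = "index")
axis(side = 1, at = asinh(ticks), labels = ticks)
```

Unlike log(), asinh() is defined for 0 and for negative values, which is exactly why this works where log="x" fails.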
Nomenclature
• Apparently, these axis scales have been known for decades or centuries
• Apparently, there is no consistent nomenclature for them
• I call them "two-ended pseudo-logarithmic scale", since that describes what they look like
• Hard to understand if you explain it purely orally:
• "But hey, a double-logarithmic plot is easy to do in R, just say plot(…, log="xy")"
• That's true, but a double-logarithmic plot is a plot with logarithmic X and logarithmic Y axis; that's an entirely different thing!
Further information
• Nils Kammenhuber: Two-ended pseudo-logarithmic scaling. Technical Report, NET 2012, to appear
Combining plots
• Now I have multiple measurements that I want to compare in one single plot
  > plot(first.data, type="l", col="black", lty=1)
  > lines(second.data, col="blue", lty=2)
• Or with points instead of lines:
  > plot(first.data, type="p", col="black", pch=1)
  > points(second.data, col="blue", pch=2)
Adjusting axis limits
• Problem: The axis limits are not readjusted after the first call to plot()
  > plot(first.data, type="l", col="black", lty=1,
         xlim=range(c(first.data, second.data)),
         ylim=range(c(first.data, second.data)))
  > lines(second.data, col="blue", lty=2)
• Alternatively, can manually specify axis limits, e.g., for probabilities:
  > plot(first.data, …, xlim=range(c(first.data, second.data)), ylim=c(0.0, 1.0))
"I just ran the experiment for half an hour and I feel that I have enough data now."
• Maybe… maybe not. The readers expect error bars or something like that
• How do I get error bars?
• Some researchers just use arith. mean ± 2 standard errors (or ±3)
• About 95.5% of all data falls within ±2σ of the "real" mean
• About 99.7% of all data falls within ±3σ of the "real" mean
• …but only if the data is actually normally distributed!
• Better: calculate a confidence interval on the mean of your data
• Semantics: "With a probability of 99% (= confidence level; adjustable), the output of the process that generated this data has an arithmetic mean that lies within this interval"
• Higher confidence → interval gets larger
• Higher variability → interval gets larger
• More data points → interval gets smaller
How to calculate confidence intervals
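The formula on this slide is shown as a figure; as a sketch of the same calculation (not the slide's own derivation), R's t.test() reports a confidence interval on the mean directly, with an adjustable confidence level:

```r
# Hypothetical data: 50 samples from a process with true mean 10
set.seed(42)
x <- rnorm(50, mean = 10, sd = 2)

# 99% confidence interval on the mean, read off t.test()
ci <- t.test(x, conf.level = 0.99)$conf.int
ci
```

Higher conf.level widens the interval; more samples narrow it, matching the rules of thumb above.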
Confidence intervals: Be careful!
1. Your data should be (roughly) normally distributed
• …at least, not very asymmetric
• Check it → QQ plots, and maybe statistical tests
• If not the case → group the data and apply the Central Limit Theorem
2. Your data should be independently distributed
• Often not the case!
• Check it → (1) Plot the data and look for a trend, (2) ACF plots
• If not the case → group the data into larger groups and check ACF again
3. The Central Limit Theorem only works if the system that creates your samples has finite variance
• More on this later
"Here, the error roughly looks like a Gauß curve, so it must be normally distributed"
• How do I prove that my data is normally distributed?
• Naïve approach:
• Plot histogram and density of your data
• Plot the density function of the normal distribution into the same diagram
• "Here, they look the same"
  > hist(mydata, freq=FALSE, xlim=c(0.0, 0.3))
  > curve(dnorm(x, mean=mean(mydata), sd=sd(mydata)), col="blue", lty=2, add=TRUE)
  > legend( … more about this later… )
• It looks normally distributed, but this data actually comes from the sum of a Student-t and an exponential distribution…
What's wrong with that approach
1. You can never "prove" that data is normally (or exponentially, t, Pareto, …) distributed
• You can only present graphs, calculations, statistical tests etc. that do not contradict that assumption
2. The eye is easily fooled: many symmetrical distributions look like Gauß' bell curve
3. There are way better ways to analyse this:
• QQ plots
• PP plots (not shown)
• Statistical tests: χ² test, Kolmogorov–Smirnov test, … (not shown; they usually are "too picky" ;-))
Digression #1: Theoretical probability distributions in R
• A lot of distributions are built into R
• Naming convention: For every probability distribution xy, there are the following four functions:
• dxy(): the density function (I used that in the histogram)
• pxy(): the probability function (= cumulative distribution) (you could use that for comparison in a CDF plot of real data)
• qxy(): the quantile function (useful for QQ plots and some statistical tests)
• rxy(): draw random numbers from that distribution
• Example:
• rt(50, df=3) + rexp(50) − rnorm(50, sd=3): Create 50 random variables from a combination of t, exponential and normal distribution
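The naming convention can be sketched for the normal distribution (xy = norm), showing how the four functions relate to each other:

```r
dnorm(0)     # density at 0: 1/sqrt(2*pi), the peak of the bell curve
pnorm(0)     # P(X <= 0) = 0.5 for the standard normal
qnorm(0.5)   # quantile function inverts pnorm: gives 0 back
rnorm(3)     # three random draws from the same distribution
```

The same pattern holds for dexp/pexp/qexp/rexp, dt/pt/qt/rt, and so on.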
Digression #2: Random number generators
Random number generators: Literature
• The Discrete Event Simulation slideset on random number generation
• L'Ecuyer, Simard: TestU01: a C library for empirical testing of random number generators. ACM Transactions on Mathematical Software, Volume 33, No. 4, 2007
Better alternative instead of comparing histogram vs. density curve: QQ plots
• Short for quantile–quantile plot
• Compare the quantiles of two distributions
• Imagine you have two distributions (e.g., #1: empirical observations; #2: postulated theoretical distribution, e.g., normal distribution)
• Calculate the 1%, 2%, … 100% quantile for each distribution
• Take the quantiles of the 1st distribution as Y coordinates and the quantiles of the 2nd distribution as X coordinates
• How to read them:
• A straight (i.e., unbent) line means that the two distributions are similar
• "In a sense, the only function that the human mind really understands well is a straight line. One good strategy for making a graph easy to understand is to make it as linear as possible." (John P. Boyd)
• Be careful about the ends
• Also note the axis labels; they might be shifted
QQ plot example
• Histogram comparison: looks ~normal
• QQ plot: visible differences, not a straight line
How to do a QQ plot in R
• QQ plot against the normal distribution:
  > qqnorm(mydata)
  > qqline(mydata)  # auxiliary straight line
• QQ plot against some other distribution xy:
  > qqplot(qxy(ppoints(data), …), data)
• Nicer:
  > qqplot(qxy(ppoints(data), …), data, xlab="Theoretical quantiles",
           ylab="Sample quantiles", main="Comparison with xy distribution")
• QQ plots can be used to compare different data sets:
  > qqplot(data.internet, data.mysimulation)
QQ plots with type="l" or type="b"
> qqnorm(data, type="b")
> qqnorm(sort(data), type="b")
• Huh!?? (qqnorm() draws the connecting lines in the order of the input data, so they zig-zag wildly unless you sort the data first)
Be careful when reading QQ plots!
• Don't let yourselves be fooled!
• DAX revenues vs. normal distribution:
• Good model for normal trading days
• Bad model for exceptional trading days
Statistical tests
• If you really want to be correct, then use a statistical test once you're satisfied with the QQ plot
• Millions of statistical tests can be used to check "if the data follows some given theoretical distribution" (e.g., Gauß' normal distribution):
• χ² test: Very universal (i.e., can be used with just about any distribution), but less powerful (i.e., needs more data)
• Kolmogorov–Smirnov test: only for a number of popular distributions, but more powerful (also works with less data)
• Anderson–Darling test, …
• Most of them are built into R
• Check the literature to see how to use them
Statistical tests: General observations
• A statistical test can never prove that your data actually follows some distribution XY! Rather, it checks if it finds evidence against (!) this assumption.
• If no such evidence is found: "acquitted due to lack of evidence", i.e., "could be XY-distributed" (but not: "acquitted due to obvious innocence"!)
• If such evidence is found: "seems guilty from my point of view", i.e., "to me, it doesn't look XY-distributed". But in fact you might simply not have collected enough data yet, or you should have used a more powerful test:
• The more restricted your test (the fewer distributions you can use it for etc.), the better its power
• As with confidence intervals, you select a confidence level
• Semantics: "With an α = 5% error rate, no evidence could be found that contradicts the assumption that the data is XY-distributed"
• In fact, confidence intervals are a special way to formulate a Student's t-test
Forcing our data to be normally distributed: Making use of the Central Limit Theorem
• Oftentimes, your data is not normally distributed
• Central Limit Theorem, pragmatic version "for dummies":
• You have a large number of samples that do not at all look normally distributed
• Group them together into chunks of about 30–100 samples
  • Use groups of 30 if the data looks symmetric and roughly similar to a Gauß bell curve
  • Use groups of 100 if the data looks very asymmetric
• For each group, calculate the arithmetic mean of that group
• Now a miracle happens: The group means [roughly] follow a normal distribution!
• Try it out in R yourselves!
• Now you can use the group means as input for your statistical tool that expects normally distributed data (e.g., confidence intervals)
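"Try it out in R yourselves" can be sketched like this: exponential samples are very asymmetric, but the means of groups of 100 cluster symmetrically around the true mean, just as the recipe promises:

```r
set.seed(1)
samples <- rexp(10000, rate = 1)          # skewed raw data, true mean = 1

groups <- matrix(samples, ncol = 100)     # 100 groups of 100 samples each
group.means <- rowMeans(groups)           # one arithmetic mean per group

# The group means now look roughly normal around 1; inspect with
# hist(group.means) or qqnorm(group.means)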
Confidence intervals: Be careful! (recap)
1. Your data should be (roughly) normally distributed: check with QQ plots; if not, group the data and apply the Central Limit Theorem
2. Your data should be independently distributed: check with ACF plots; if not, group the data into larger groups
3. The Central Limit Theorem only works if the system that creates your samples has finite variance
Central limit theorem: limits
• The central limit theorem only holds if the process that creates your data has finite variance!
• Of course, the sample variance is always finite: sd(mydata) or var(mydata) will always give a finite value (…unless mydata contains an +Inf value)
• But the process that created your samples can have an infinite variance
• "Infinite variance? Come on, that's an esoteric special case!"
• No, it's not: it can easily happen with power law distributions…
• …and power law distributions are ubiquitous!
• More on this on later slides
Data must not contain significant trends
• Simply do plot(data) and visually check if there's a trend
• e.g., values seem to grow over time…
• …or seem to decay…
• …or you see a curved shape (e.g., first upwards, then downwards)
• …etc.
• AFAIK there is no really good way to check all these with a statistical test
• Even if you can't make out a trend, you'll have to check independence using an ACF plot
ACF plots: Are your samples really independent?
• Consider the temporal development of the queue length at a router:
• If large now → cannot change much in the next milliseconds
• If small now → unlikely that it changes dramatically
• If your data shows such autocorrelation → confidence intervals will be too small → if you don't check this, you're a cheater!
• Check autocorrelation using ACF plots (cf. next slides)
• If the plots suggest unacceptable autocorrelation, then…
• Group the samples (as with the central limit theorem)
• If they are grouped already, make the groups larger
• "Should I include these ACF plots in my paper?"
• No: just say that "using ACF plots, we ensured that our samples (or sample means) were sufficiently uncorrelated"
Autocorrelation
How to do and how to read ACF (autocorrelation function) plots
• How to do them in R:
  > acf(mydata)
• What it shows:
• Calculates the autocorrelation for lag = 0, 1, 2, 3, … (IMO rather silly for lag = 0, but R does it…)
• Shows boundaries for "random" autocorrelation: an autocorrelation within these boundaries is very likely to be caused by random fluctuations, not by an actual systemic autocorrelation
• How to read it:
• GOOD: If all bars (except the silly first one) stay within these boundaries, then we can be happy and assume the sample does not contain unacceptably large autocorrelation
• BAD (i.e., we have an indication for an unacceptable level of autocorrelation): If further bars at the beginning cross the boundaries (erratic single bars further towards the right might be okay)
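A sketch of the good/bad contrast described above, using simulated data (iid noise vs. a strongly autocorrelated AR(1) series):

```r
set.seed(7)
iid <- rnorm(500)                                        # independent samples
ar1 <- as.numeric(arima.sim(model = list(ar = 0.9), n = 500))  # autocorrelated

acf(iid)   # GOOD: bars beyond lag 0 stay inside the dashed boundaries
acf(ar1)   # BAD: bars at the first lags clearly cross the boundaries
```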
The data
Good ACF plot
The data
Bad ACF plot
• With a grain of salt, we might tolerate this…
• …but not this
The data
Terrible ACF plot
This number shuffling is not only good for confidence intervals…
• Many statistical tools require that your data be uncorrelated ("iid"), and some require it to be normally distributed
• You can apply the same tools in these cases (ACF plots, grouping, QQ plots, central limit theorem, etc.)
Digression: How to present a plot in a presentation
I am a bad example, since I don't adhere to these steps in this talk, but…:
1. What figures does this plot talk about? Why do you show it?
• "This plot compares the development of the gnaricrosticity in relation to the bullawomptleness and will show this paper is great"
2. Explain the axes. (People tend to forget this, but it's important!)
• "The X axis displays the bullawomptleness in square Hertz per packet"
• "The Y axis displays the gnaricrosticity in cubic degrees per nanobyte"
3. How do you read your plot? (Don't interpret the data points yet!)
• "Data points in the upper right corner are good, because blah."
• "Data points in the lower right corner are unfavourable, …"
4. Why are there multiple lines/colours/… in your plot?
• "The blue dots represent measurements in Munich, the red dots represent those in Shanghai."
5. Only now you can start interpreting what the plot shows:
• "As you can see, we only have data points in the upper left corner, and only one in the lower right. Therefore I deserve the Fields medal."
CCDF plots
• CDF plot: shows the cumulative probability on the Y axis
• CCDF plot: shows 1 − cumulative probability on the Y axis
• What's the point?
• If we use a logarithmic Y axis, it makes a huge difference whether we do a CDF plot or a CCDF plot
• Almost always plotted with logscale X and logscale Y axes
• Can be used to spot Power Laws (cf. later slides)
• Alternative names: Survival plot, survival function plot, …
• How to do them in R? Unfortunately, not a builtin function (maybe I shouldn't have mentioned that fact about the pie charts…)
• Use my crude plot.ccdf.R
• or download Fabian Schneider's plot.ccdf.R
• or try to find a package at CRAN
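The external plot.ccdf.R scripts mentioned above are not shown in the deck; as a hypothetical minimal sketch of the same idea (not the author's script), one can plot the empirical P(X ≥ x) on doubly logarithmic axes like this:

```r
# Hypothetical minimal CCDF helper (not the plot.ccdf.R referred to above).
# Uses P(X >= x[i]) rather than P(X > x[i]) so the last point stays
# positive and survives the logarithmic Y axis.
plot.ccdf <- function(x, ...) {
  x <- sort(x)
  ccdf <- (length(x) - seq_along(x) + 1) / length(x)  # P(X >= x[i])
  plot(x, ccdf, log = "xy", type = "s",
       xlab = "x", ylab = "P(X >= x)", ...)
  invisible(ccdf)
}

plot.ccdf(rexp(1000))   # an exponential tail bends downwards on log-log
```

A power-law tail would instead look like a straight line on these axes, which is exactly what the later slides warn about over-interpreting.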
A CDF plot with linear scale
CDF plot, logarithmic X axis (same data)
CDF plot, log X and Y axes (same data)
CCDF plot, log X and Y axes (same data!!)
Why should we care about Power Laws?
• Not everything out there is normally or exponentially distributed
• Actually, Power Laws and similar distributions are rather widespread (more on this later)
• Surprising and nasty properties! (see next slides)
• En vogue in the scientific community since about the 1980s/1990s (depending on subject):
  "Every time you see something resembling a straight line on a plot with doubly logarithmic axes, you are expected to instantly yell 'A power law! A power law!', regardless of whether that actually makes sense. The more log-log-straight lines in your paper, the merrier."
Power Laws: Surprising properties
• Depending on the exponent parameter, a power-law distribution may have…
• …infinite variance → very high variability
  • Cannot just measure, e.g., 100 values, take their mean, and assume that the sample mean is a good estimator for the actual mean of the generating process. And the median is only a good estimator for the mean of a symmetric distribution!
  • The Central Limit Theorem will not work → cannot simply group a couple of values and trust they will be normally distributed around their mean
• …or even an infinite mean!
• Warning: Of course, the sample variance and sample mean are always defined! (unless you have Inf or NaN values in your data)
• Other surprising properties (more on this later):
• Self-similarity, fractal character
• Long-range dependence
Nomenclature confusion
Some notions in the power law context
• Used almost interchangeably:
• Power Law, Power Tail, Long Tail, Fat Tail, Heavy Tail
• Pareto distribution
• Zipf's Law
• 80:20 rule, Pareto principle
• (Pseudo|Asymptotic|Stochastic) Self-similarity (of 1st, 2nd degree)
• Fractal
• Long-Range Dependency (LRD)
• Slowly decaying variance
• 1/f noise
• Hurst parameter > 0.5
• …but of course most of them do not exactly describe the same systems; there sometimes are subtle differences!
• N.B. We won't describe them with much mathematical rigour…
Self-similarity, fractals
Scale-free distribution: Not just a curiosity, but generates samples that are self-similar or fractal, i.e., more or less look the same on all kinds of scales:
• Real world / Power law traffic: Variability remains the same across many time scales
• Unrealistic / non-Power-Law traffic: Variability is quickly averaged out on larger time scales
80:20 rule (a.k.a. "Pareto principle")
• The fraction W of the wealth in the hands of the richest P of the population is given by W = P^((α−2)/(α−1))
• Example: US wealth: α = 2.1
• The richest 20% of the population holds 86% of the wealth
• Of course, not only restricted to wealth, but a fundamental property of Power-Law-distributed variables
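The US-wealth example above can be checked with one line of arithmetic, plugging α = 2.1 and P = 0.2 into the slide's formula:

```r
alpha <- 2.1
P <- 0.2
W <- P^((alpha - 2) / (alpha - 1))   # W = P^((alpha-2)/(alpha-1))
round(W, 2)   # 0.86: the richest 20% hold about 86% of the wealth
```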
Power laws are seemingly everywhere (1)
(Figure captions:) scientific papers 1981–1997 · AOL users visiting sites '97 · in Moby Dick · bestsellers 1895–1965 · AT&T customers on 1 day · California 1910–1992
Power Laws are seemingly everywhere (2)
(Figure captions:) wars (1816–1980) · solar flares · Moon · richest individuals 2003 · US family names 1990 · US cities 2003