Analyzing data using Python
Eric Marsden <eric.marsden@risk-engineering.org>

“The purpose of computing is insight, not numbers.” – Richard Hamming
Where does this fit into risk engineering?

[diagram: data → curve fitting → probabilistic model (event probabilities) and consequence model (event consequences) → risks → costs → decision-making criteria; “these slides” cover the data analysis and curve fitting steps]
Descriptive statistics
Descriptive statistics

▷ Descriptive statistics allow you to summarize information about observations
  • organize and simplify data to help understand it
▷ Inferential statistics use observations (data from a sample) to make inferences about the total population
  • generalize from a sample to a population
Measures of central tendency

▷ Central tendency (the “middle” of your data) is measured either by the median or the mean
▷ The median is the point in the distribution where half the values are lower, and half higher
  • it’s the 0.5 quantile
▷ The (arithmetic) mean (also called the average or the mathematical expectation) is the “center of mass” of the distribution
  • continuous case: $\mathbb{E}(X) = \int_a^b x\, f(x)\, dx$
  • discrete case: $\mathbb{E}(X) = \sum_i x_i\, P(x_i)$
▷ The mode is the element that occurs most frequently in your data
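The following slides compute the mean and median with numpy; the mode has no dedicated numpy function, but it can be obtained from scipy.stats. The snippet below is a minimal sketch (not from the original slides) using a small made-up array:

import numpy
import scipy.stats

data = numpy.array([1, 2, 2, 3, 3, 3, 4])
print(numpy.mean(data))              # arithmetic mean: 2.571...
print(numpy.median(data))            # median: 3.0
print(scipy.stats.mode(data).mode)   # most frequent value: 3 (result shape depends on the SciPy version)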
Illustration: fatigue life of aluminium sheeting

Measurements of fatigue life (thousands of cycles until rupture) of strips of 6061-T6 aluminium sheeting, subjected to loads of 21 000 PSI. Data from Birnbaum and Saunders (1958).

> import numpy
> cycles = numpy.array([370, 1016, 1235, [...] 1560, 1792])
> cycles.mean()
1400.9108910891089
> numpy.mean(cycles)
1400.9108910891089
> numpy.median(cycles)
1416.0

Source: Birnbaum, Z. W. and Saunders, S. C. (1958), A statistical model for life-length of materials, Journal of the American Statistical Association, 53(281)
Aside: sensitivity to outliers

▷ the median is what’s called a robust measure of central tendency

> import numpy
> weights = numpy.random.normal(80, 10, 1000)
> numpy.mean(weights)
79.83294314806949
> numpy.median(weights)
79.69717178759265
> numpy.percentile(weights, 50)   # 50th percentile = 0.5 quantile = median
79.69717178759265
> weights = numpy.append(weights, [10001, 101010])   # outliers
> numpy.mean(weights)
190.4630171138418    # <-- big change
> numpy.median(weights)
79.70768232050916    # <-- almost unchanged

Note: the mean is quite sensitive to outliers, the median much less.
Measures of central tendency

If the distribution of data is symmetrical, then the mean is equal to the median. If the distribution is asymmetric (skewed), the mean is pulled further in the direction of the skew (towards the long tail) than the median. The degree of asymmetry is measured by skewness (Python: scipy.stats.skew()).

[figure: negatively skewed and positively skewed distributions, showing the relative positions of mean and median]
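As an added illustration (not in the original slides), the relationship between skewness, mean and median can be checked on synthetic data, here drawn from a lognormal distribution, which has positive skew:

import numpy
import scipy.stats

sample = numpy.random.lognormal(mean=0, sigma=1, size=10_000)   # right-skewed synthetic data
print(scipy.stats.skew(sample))                   # positive: long tail on the right
print(numpy.mean(sample) > numpy.median(sample))  # True: the mean is pulled towards the tail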
Measures of variability

▷ Variance measures the dispersion (spread) of observations around the mean
  • $\mathrm{Var}(X) = \mathbb{E}\left[(X - \mathbb{E}[X])^2\right]$
  • continuous case: $\sigma^2 = \int (x - \mu)^2 f(x)\, dx$, where $f(x)$ is the probability density function of $X$
  • discrete case: $\sigma^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \mu)^2$
  • note: if observations are in metres, variance is measured in m²
  • Python: array.var() or numpy.var(array)
▷ Standard deviation is the square root of the variance
  • it has the same units as the mean
  • Python: array.std() or numpy.std(array)
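One practical point worth noting (my addition, not in the original slides): numpy’s var() and std() divide by n by default; pass ddof=1 to obtain the n − 1 (sample) versions, matching the discrete formula above:

import numpy

obs = numpy.random.normal(80, 10, 50)   # hypothetical sample
print(obs.var())         # divides by n   (ddof=0, numpy's default)
print(obs.var(ddof=1))   # divides by n-1 (sample variance)
print(obs.std(ddof=1))   # corresponding sample standard deviation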
Exercise: simple descriptive statistics

Task: Choose randomly 1000 integers from a uniform distribution between 100 and 200. Calculate the mean, min, max, variance and standard deviation of this sample.

> import numpy
> obs = numpy.random.randint(100, 201, 1000)
> obs.mean()
149.49199999999999
> obs.min()
100
> obs.max()
200
> obs.var()
823.99793599999998
> obs.std()
28.705364237368595
Histograms: plots of variability

Histograms are a sort of bar graph that shows the distribution of data values. The vertical axis displays raw counts or proportions.

To build a histogram:
1 Subdivide the observations into several equal classes or intervals (called “bins”)
2 Count the number of observations in each interval
3 Plot the number of observations in each interval

Note: the width of the bins is important to obtain a “reasonable” histogram, but is subjective (see the sketch after this slide).

# our Birnbaum and Saunders failure data
import matplotlib.pyplot as plt
plt.hist(cycles)
plt.xlabel("Cycles until failure")

[figure: histogram of the fatigue-life data, x axis “Cycles until failure”]
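The sketch below (not in the original slides) illustrates the bin-width sensitivity mentioned above, using synthetic normally distributed data:

import numpy
import matplotlib.pyplot as plt

data = numpy.random.normal(80, 10, 1000)           # synthetic data
fig, axes = plt.subplots(1, 3, figsize=(12, 3))
for ax, bins in zip(axes, [5, 20, 100]):
    ax.hist(data, bins=bins)                       # same data, very different visual impression
    ax.set_title(f"{bins} bins")
plt.show()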
Quartiles

▷ A quartile is the value that marks one of the divisions that breaks a dataset into four equal parts
▷ The first quartile, at the 25th percentile, divides the first ¼ of cases from the latter ¾
▷ The second quartile, the median, divides the dataset in half
▷ The third quartile, the 75th percentile, divides the first ¾ of cases from the latter ¼
▷ The interquartile range (IQR) is the distance between the first and third quartiles
  • the 25th percentile and the 75th percentile

[figure: a distribution split into four groups of 25% of observations, with the interquartile range marked]
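A minimal sketch (not from the original slides) of computing the quartiles and the IQR with numpy.percentile, on synthetic data:

import numpy

data = numpy.random.normal(80, 10, 1000)           # hypothetical sample
q1, q2, q3 = numpy.percentile(data, [25, 50, 75])
iqr = q3 - q1
print(q1, q2, q3, iqr)                             # q2 equals numpy.median(data)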
Box and whisker plot

A “box and whisker” plot or boxplot shows the spread of the data:
▷ the median (horizontal line)
▷ lower and upper quartiles Q1 and Q3 (the box)
▷ upper whisker: last datum < Q3 + 1.5 × IQR
▷ lower whisker: first datum > Q1 − 1.5 × IQR
▷ any data beyond the whiskers are typically called outliers

Note that some people plot whiskers differently, to represent the 5th and 95th percentiles for example, or even the min and max values…

import matplotlib.pyplot as plt
plt.boxplot(cycles)
plt.xlabel("Cycles until failure")

[figure: boxplot of the fatigue-life data, y axis “Cycles until failure”]
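To make the whisker rule concrete, here is a small sketch (my addition) that computes the Q1 − 1.5×IQR and Q3 + 1.5×IQR fences and flags the points beyond them, on synthetic data with two artificial outliers:

import numpy

data = numpy.append(numpy.random.normal(80, 10, 1000), [200, 250])  # two artificial outliers
q1, q3 = numpy.percentile(data, [25, 75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = data[(data < lower_fence) | (data > upper_fence)]
print(outliers)    # should include the two appended values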
Violin plot

Adds a kernel density estimation to a boxplot.

import seaborn as sns
sns.violinplot(cycles, orient="v")
plt.xlabel("Cycles until failure")

[figure: violin plot of the fatigue-life data, y axis “Cycles until failure”]
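The kernel density estimate that the violin plot adds can also be computed directly; the following sketch (not in the original slides) overlays a Gaussian KDE on a normalized histogram of synthetic data:

import numpy
import scipy.stats
import matplotlib.pyplot as plt

data = numpy.random.normal(80, 10, 1000)     # hypothetical sample
kde = scipy.stats.gaussian_kde(data)
xs = numpy.linspace(data.min(), data.max(), 200)
plt.hist(data, bins=30, density=True, alpha=0.4)
plt.plot(xs, kde(xs))                        # smooth estimate of the probability density
plt.xlabel("value")
plt.show()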
Bias and precision

A good estimator should be unbiased, precise and consistent (converge as sample size increases).

[figure: 2 × 2 target diagrams illustrating precise/imprecise versus biased/unbiased]
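As a concrete added illustration of bias: the variance estimator that divides by n is biased for small samples, while the n − 1 version is not. A quick simulation sketch, assuming normally distributed data:

import numpy

rng = numpy.random.default_rng(0)
biased, unbiased = [], []
for _ in range(10_000):
    sample = rng.normal(80, 10, size=5)      # true variance is 100; tiny samples exaggerate the bias
    biased.append(sample.var(ddof=0))        # divides by n
    unbiased.append(sample.var(ddof=1))      # divides by n-1
print(numpy.mean(biased))                    # ≈ 80, systematically too low: biased
print(numpy.mean(unbiased))                  # ≈ 100: unbiased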
Estimating values

▷ In engineering, providing a point estimate is not enough: we also need to know the associated uncertainty
  • especially for risk engineering!
▷ One option is to report the standard error
  • $\hat{\sigma}/\sqrt{n}$, where $\hat{\sigma}$ is the sample standard deviation (an estimator for the population standard deviation) and $n$ is the size of the sample
  • difficult to interpret without making assumptions about the distribution of the error (often assumed to be normal)
▷ Alternatively, we might report a confidence interval
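A minimal sketch (not from the original slides) of computing the standard error of the mean on a hypothetical sample; scipy.stats.sem gives the same result as the formula above:

import numpy
import scipy.stats

sample = numpy.random.normal(80, 10, 100)            # hypothetical sample
se = sample.std(ddof=1) / numpy.sqrt(len(sample))    # sigma_hat / sqrt(n)
print(se)
print(scipy.stats.sem(sample))                       # same quantity computed by scipy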
Confidence intervals

▷ A two-sided confidence interval is an interval [L, U] such that C% of the time, the parameter of interest will be included in that interval
  • most commonly, 95% confidence intervals are used
▷ Confidence intervals are used to describe the uncertainty in a point estimate
  • a wider confidence interval means greater uncertainty
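As a sketch (my addition, assuming approximately normal errors), a 95% two-sided confidence interval for the mean can be computed from the standard error using Student’s t distribution:

import numpy
import scipy.stats

sample = numpy.random.normal(80, 10, 100)    # hypothetical sample
mean = sample.mean()
se = scipy.stats.sem(sample)
low, high = scipy.stats.t.interval(0.95, len(sample) - 1, loc=mean, scale=se)
print(low, high)                             # 95% confidence interval for the population mean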
Interpreting confidence intervals

A 90% confidence interval means that 10% of the time, the parameter of interest will not be included in that interval. Here, for a two-sided confidence interval.

[figure: repeated sample means with their two-sided confidence intervals plotted around the population mean; most intervals cover it, a few do not]
Interpreting confidence intervals

A 90% confidence interval means that 10% of the time, the parameter of interest will not be included in that interval. Here, for a one-sided confidence interval.

[figure: repeated sample means with their one-sided confidence intervals plotted around the population mean]
Illustration: fatigue life of aluminium sheeting

Confidence intervals can be displayed graphically on a barplot, as “error lines”. Note however that this graphical presentation is ambiguous, because some authors represent the standard deviation on error bars. The caption should always state what the error bars represent.

import seaborn as sns
sns.barplot(cycles, ci=95, capsize=0.1)
plt.xlabel("Cycles until failure (95% CI)")

Data from Birnbaum and Saunders (1958)

[figure: bar with error bars, “Cycles until failure, with 95% confidence interval”]
Statistical inference

Statistical inference means deducing information about a population by examining only a subset of the population (the sample). We use a sample statistic to estimate a population parameter.

[figure: a sample drawn from a larger population]