INTRODUCTION TO DATA ANALYSIS SUMMARY STATISTICS

INTRODUCTION TO DATA ANALYSIS FINAL EXAM ▸ Friday February 7 2020 ::: 4-8pm ▸ 66/E33 & 66/E34 ▸ no class at noon on that day

INTRODUCTION TO DATA ANALYSIS HOW (NOT) TO PERFORM OPTIMALLY IN THIS COURSE ▸ use the script, not the slides ▸ individual practice at home essential

INTRODUCTION TO DATA ANALYSIS LEARNING GOALS ▸ understand what a “summary statistic” is ▸ understand and be able to compute the following: ▸ counts and frequencies for categorical data ▸ measures of central tendency: mean, mode & median ▸ measures of dispersion: variance, standard deviation & quantiles ▸ bootstrapped confidence intervals for an estimate ▸ covariance & correlation

INTRODUCTION TO DATA ANALYSIS SUMMARY STATISTICS ▸ usually: what we analyze ≠ what we actually measured ▸ data observations are always already interpreted abstractions over a much richer reality ▸ e.g., we record whether a coin landed heads or tails, not where it landed ▸ summary statistic: a single number that represents one aspect of the data ▸ useful for communicating about and understanding the data at hand ▸ e.g., counting observations of a particular type / calculating the mean of some numeric observations

INTRODUCTION TO DATA ANALYSIS BIO-LOGIC JAZZ-METAL ▸ 102 participants from this course [THANKS FOR DOING THIS!] ▸ everybody got three 2-alternative forced-choice questions (in random order): “If you have to choose between the following two options, which one do you prefer?” 1. Biology vs Logic 2. Jazz vs Metal 3. Mountains vs Beach ▸ no sane person would defend serious scientific hypotheses about this study, but the lecturer conjectures irresponsibly that a certain musical taste may be correlated with a particular preference for academic subjects

INTRODUCTION TO DATA ANALYSIS INSPECTING THE DATA participant with ID 379 prefers: ‣ beaches over mountains ‣ logic over biology ‣ metal over jazz

INTRODUCTION TO DATA ANALYSIS COUNTING OBSERVATIONS ▸ functions `n`, `count`, and `tally` from `dplyr` package ▸ caveats: ▸ different versions of `dplyr` package implement `count` differently ▸ several packages define a `count` function; use `dplyr::count` explicitly to be sure ▸ functions `table` and `prop.table` from base R

INTRODUCTION TO DATA ANALYSIS COUNTING OBSERVATIONS ▸ `n` works only in `mutate` and `summarize` ▸ `n` essentially counts rows (useful after grouping!)

INTRODUCTION TO DATA ANALYSIS COUNTING OBSERVATIONS ▸ `count` and `tally` are wrappers around `n` ▸ `count` implicitly groups/ungroups ▸ `tally` does not tinker with existing grouping
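The three counting helpers can be compared on a small made-up data set. This is only a sketch: the data frame `choices` and its column names are assumptions for illustration, not the course data, and it assumes a `dplyr` version of 1.0.0 or later (for the `.groups` argument).

```r
library(dplyr)

# Hypothetical forced-choice data (not the actual course data)
choices <- data.frame(
  music   = c("Jazz", "Metal", "Metal", "Jazz", "Metal", "Metal"),
  subject = c("Biology", "Logic", "Logic", "Logic", "Biology", "Logic")
)

# n() counts the rows of the current group inside summarize()
by_n <- choices %>%
  group_by(music, subject) %>%
  summarize(n = n(), .groups = "drop")

# count() groups by the given columns, counts, and ungroups again
by_count <- choices %>% dplyr::count(music, subject)

# tally() counts rows within whatever grouping already exists
by_tally <- choices %>% group_by(music) %>% tally()

print(by_count)
print(by_tally)
```

Calling `dplyr::count` with the package prefix follows the caveat above about several packages defining a `count` function.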

INTRODUCTION TO DATA ANALYSIS COUNTS OF CHOICE PAIRS

INTRODUCTION TO DATA ANALYSIS PROPORTIONS OF CHOICE PAIRS
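Counts and proportions of choice pairs can also be obtained with the base R functions `table` and `prop.table`. The vectors below are invented for illustration and are not the course data.

```r
# Hypothetical forced-choice answers (not the actual course data)
music   <- c("Jazz", "Metal", "Metal", "Jazz", "Metal", "Metal")
subject <- c("Biology", "Logic", "Logic", "Logic", "Biology", "Logic")

# Counts of each choice pair as a contingency table
pair_counts <- table(music, subject)

# Proportions of each choice pair; all cells together sum to 1
pair_props <- prop.table(pair_counts)

print(pair_counts)
print(round(pair_props, 2))
```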

INTRODUCTION TO DATA ANALYSIS

INTRODUCTION TO DATA ANALYSIS MEASURES OF CENTRAL TENDENCY & DISPERSION ▸ central tendency: where is “the center” of the data observations ▸ dispersion: how widely are the values spread around “the center”

INTRODUCTION TO DATA ANALYSIS AVOCADO DATA ▸ data released by Hass Avocado Board (plucked from kaggle)

INTRODUCTION TO DATA ANALYSIS MEAN

INTRODUCTION TO DATA ANALYSIS MEAN :: EXAMPLE

INTRODUCTION TO DATA ANALYSIS CALCULATING THE MEAN IN R

INTRODUCTION TO DATA ANALYSIS EXCURSION :: MEAN AS EXPECTED VALUE ▸ the mean can be conceptualized also as the value you would expect to gain when you sample once from the observed data ▸ useful later to link this to the expected value of a random variable (but not important right now)
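The link between the mean and the expected value of a single draw can be sketched in a few lines of base R. The vector `x` is a made-up example, not the avocado data.

```r
# Hypothetical observations (not the avocado data)
x <- c(1, 2, 2, 3, 3, 3)

# Arithmetic mean: sum of the observations divided by their number
mean_by_hand <- sum(x) / length(x)

# Expected value of one random draw from x:
# each distinct value weighted by its relative frequency
rel_freq <- prop.table(table(x))
values <- as.numeric(names(rel_freq))
expected_value <- sum(values * as.numeric(rel_freq))

# Both agree with the built-in mean()
all.equal(mean_by_hand, mean(x))    # TRUE
all.equal(expected_value, mean(x))  # TRUE
```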

INTRODUCTION TO DATA ANALYSIS MEDIAN

INTRODUCTION TO DATA ANALYSIS MEDIAN :: EXAMPLE

INTRODUCTION TO DATA ANALYSIS CALCULATING THE MEDIAN IN R

INTRODUCTION TO DATA ANALYSIS MEAN VS MEDIAN ▸ mean is more susceptible to outliers ▸ choice of mean vs. median is great for manipulation: ▸ “How to mislead with statistics”
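How differently the two measures react to an outlier can be seen in a small example; the price values are invented for illustration.

```r
# Hypothetical prices; the value added below is a deliberate outlier
prices <- c(1.0, 1.1, 1.2, 1.3, 1.4)
prices_outlier <- c(prices, 50)

mean(prices)            # 1.2
median(prices)          # 1.2
mean(prices_outlier)    # about 9.33, pulled far upwards by one value
median(prices_outlier)  # 1.25, barely moved
```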

INTRODUCTION TO DATA ANALYSIS MODE ▸ the mode is the value that occurs most frequently in the data ▸ often not applicable to metric data (where each measurement, if fine-grained enough, occurs only once) ▸ good for nominal and ordinal measures ▸ there is no built-in function in R to calculate the mode ▸ caveat: function `mode` exists but is unrelated (it returns the storage mode of an object)
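Since R has no built-in function for the statistical mode, one possible workaround is a small helper built on `table`. The name `stat_mode` is hypothetical, and returning all tied values is a design choice of this sketch, not something prescribed by the course.

```r
# Hypothetical helper: returns the most frequent value(s) of a vector
# as character strings. Ties are all returned, since data can have
# several modes.
stat_mode <- function(x) {
  counts <- table(x)
  names(counts)[as.vector(counts) == max(counts)]
}

stat_mode(c("Jazz", "Metal", "Metal", "Jazz", "Metal"))  # "Metal"

# The built-in mode() is unrelated: it reports how an object is stored
mode(c(1, 2, 2))  # "numeric"
```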

INTRODUCTION TO DATA ANALYSIS VARIANCE

INTRODUCTION TO DATA ANALYSIS VARIANCE :: EXAMPLE

INTRODUCTION TO DATA ANALYSIS VARIANCE :: BIASED AND UNBIASED ESTIMATORS ▸ biased estimator (unless mean is known): $\frac{1}{n} \sum_{i=1}^{n} (x_i - \mu_x)^2$ ▸ unbiased estimator (if mean is estimated from data as well): $\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \mu_x)^2$ ▸ R’s built-in function `var` calculates the unbiased estimator!
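The difference between the two estimators is only the divisor, which a short base R sketch makes concrete. The vector `x` is a made-up example.

```r
# Hypothetical observations
x <- c(2, 4, 4, 6, 9)
n <- length(x)

# Biased estimator: divide the summed squared deviations by n
var_biased <- sum((x - mean(x))^2) / n

# Unbiased estimator: divide by n - 1 instead
var_unbiased <- sum((x - mean(x))^2) / (n - 1)

# The built-in var() matches the unbiased version
all.equal(var(x), var_unbiased)  # TRUE
```

For these values the summed squared deviations are 28, so the biased estimate is 5.6 and the unbiased estimate is 7.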

INTRODUCTION TO DATA ANALYSIS STANDARD DEVIATION

INTRODUCTION TO DATA ANALYSIS VARIANCE & STANDARD DEVIATION :: EXAMPLE

INTRODUCTION TO DATA ANALYSIS QUANTILE ▸ the k% quantile is a value such that k% of the data are smaller than it
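Quantiles can be computed with the base R function `quantile`. The sketch below uses the integers 0 to 100 as made-up data and relies on R's default interpolation method; other methods can give slightly different values on small samples.

```r
# Hypothetical observations: the integers 0 to 100
x <- 0:100

# 25%, 50% and 75% quantiles; the 50% quantile is the median
quartiles <- quantile(x, probs = c(0.25, 0.5, 0.75))
print(quartiles)

# Share of observations strictly below the 25% quantile
q25 <- quantile(x, probs = 0.25)
share_below <- mean(x < q25)
print(share_below)  # close to 0.25
```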

INTRODUCTION TO DATA ANALYSIS CONFIDENCE ESTIMATES VIA BOOTSTRAPPING ▸ variance & standard deviation tell us how widely the data are spread around the mean ▸ they do not tell us how good our estimate of the mean is ▸ we can use bootstrapping, a special instance of resampling methods, for this purpose

INTRODUCTION TO DATA ANALYSIS BOOTSTRAPPING 95 % CONFIDENCE INTERVALS FOR THE MEAN

INTRODUCTION TO DATA ANALYSIS BOOTSTRAPPING 95% CONFIDENCE INTERVALS [Figure: original data, resample 1, resample 2, and the collected measures of interest for each resample. Image credit: Fish: Water vector created by brgfx, www.freepik.com]

INTRODUCTION TO DATA ANALYSIS BOOTSTRAPPING 95% CONFIDENCE INTERVALS [Figure: original data with the 95% bootstrapped CI]

INTRODUCTION TO DATA ANALYSIS BOOTSTRAPPING IN R [Code examples: full data example; partial data example]
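The bootstrapping procedure (resample with replacement, record the mean of each resample, read off the 2.5% and 97.5% quantiles) could be sketched as follows. The helper name `bootstrap_ci_mean`, the number of resamples and the simulated data are all assumptions of this sketch, not the code used in the lecture.

```r
set.seed(2020)  # for reproducibility

# Hypothetical data (not the avocado prices)
x <- rnorm(100, mean = 1.4, sd = 0.4)

# Hypothetical helper: bootstrapped 95% CI for the mean
bootstrap_ci_mean <- function(data, n_resamples = 1000) {
  # Resample with replacement, same size as the data, and
  # record the mean of each resample
  resample_means <- replicate(
    n_resamples,
    mean(sample(data, size = length(data), replace = TRUE))
  )
  # The 2.5% and 97.5% quantiles of these means bound the interval
  c(
    lower = unname(quantile(resample_means, 0.025)),
    mean  = mean(data),
    upper = unname(quantile(resample_means, 0.975))
  )
}

ci <- bootstrap_ci_mean(x)
print(ci)
```

The interval narrows as the data set grows, which is the sense in which it reflects how good the estimate of the mean is.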

INTRODUCTION TO DATA ANALYSIS NESTED TIBBLES FOR GROUP SUMMARIES

INTRODUCTION TO DATA ANALYSIS NESTING TABLES

INTRODUCTION TO DATA ANALYSIS UNNESTING NESTED TABLES

INTRODUCTION TO DATA ANALYSIS COVARIANCE ▸ covariance measures the degree to which two associated measurements show similar deviation from their respective means $\mathrm{Cov}(\vec{x}, \vec{y}) = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \mu_x)(y_i - \mu_y)$

INTRODUCTION TO DATA ANALYSIS COVARIANCE :: EXAMPLE $\mathrm{Cov}(\vec{x}, \vec{y}) = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \mu_x)(y_i - \mu_y)$

[Figures: scatter plots of weight against size, showing variance as the sum of squared deviations from the mean and covariance as the sum of products of paired deviations from both means, with the quadrants around the two means marked + and −. Slides by Maria Pershina and Jona Carmon, 24.11.2019]

INTRODUCTION TO DATA ANALYSIS COVARIANCE :: INTERPRETATION $\mathrm{Cov}(\vec{x}, \vec{y}) = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \mu_x)(y_i - \mu_y)$ ▸ summands are positive when $x_i$ and $y_i$ deviate “in the same direction” from their respective means ▸ positive (negative) covariance therefore reflects an overall tendency that the higher $x_i$, the higher (lower) $y_i$ ▸ this is a descriptive property of the data, not an evidential indicator of a causal relation

INTRODUCTION TO DATA ANALYSIS COVARIANCE :: SCALE VARIANCE ▸ covariance is not invariant under positive linear transformation
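Both the covariance formula and its dependence on the measurement scale can be checked in base R. The size and weight values are invented for illustration.

```r
# Hypothetical paired measurements
size   <- c(150, 160, 170, 180, 190)  # in cm
weight <- c(52, 58, 69, 77, 84)       # in kg

# Covariance by hand: summed products of deviations over n - 1
n <- length(size)
cov_by_hand <- sum((size - mean(size)) * (weight - mean(weight))) / (n - 1)
all.equal(cov_by_hand, cov(size, weight))  # TRUE

# Rescaling size from cm to m divides the covariance by 100
cov_in_metres <- cov(size / 100, weight)
print(c(cm = cov_by_hand, m = cov_in_metres))
```

The same data yield a covariance of 207.5 when size is measured in cm and 2.075 when it is measured in m, so the raw number is hard to interpret on its own.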

INTRODUCTION TO DATA ANALYSIS PRODUCT-MOMENT CORRELATION ▸ Bravais-Pearson product-moment correlation coefficient is defined as covariance standardized by std. deviations $r_{\vec{x}\vec{y}} = \frac{\mathrm{Cov}(\vec{x}, \vec{y})}{\mathrm{SD}(\vec{x}) \, \mathrm{SD}(\vec{y})}$

INTRODUCTION TO DATA ANALYSIS CORRELATION :: EXAMPLE ▸ correlation is invariant under positive linear transformation
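The definition of the correlation coefficient and its invariance under positive linear transformation can be verified with base R. The size and weight values are invented for illustration.

```r
# Hypothetical paired measurements
size   <- c(150, 160, 170, 180, 190)  # in cm
weight <- c(52, 58, 69, 77, 84)       # in kg

# Correlation by hand: covariance standardized by both standard deviations
r_by_hand <- cov(size, weight) / (sd(size) * sd(weight))
all.equal(r_by_hand, cor(size, weight))  # TRUE

# A positive linear transformation (cm to m, then shifted by 1)
# leaves the correlation unchanged
r_transformed <- cor(size / 100 + 1, weight)
all.equal(r_transformed, r_by_hand)  # TRUE
```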

INTRODUCTION TO DATA ANALYSIS CORRELATION :: EXAMPLE ▸ negative correlation indicates an overall negative association: the higher total-volume-sold, the lower the average price

INTRODUCTION TO DATA ANALYSIS CORRELATION :: PROPERTIES & INTERPRETATION ▸ r lies in [-1;1] ▸ r = 0 indicates no correlation at all ▸ r = 1 indicates perfect positive correlation ▸ r = -1 indicates perfect negative correlation ▸ r >= 0.5 suggests noteworthy (pos.) correlation ▸ r <= -0.5 suggests noteworthy (neg.) correlation ▸ $r^2$ also interpretable as “variance explained” in a regression model (later)
