Descriptive and Summary Statistics BIO5312 FALL2017 STEPHANIE J. SPIELMAN, PHD

Logistics
All course materials will be hosted here: http://sjspielman.org/bio5312_fall2017
Submit assignments via Canvas: https://templeu.instructure.com
Please bring your laptop to class!
Office: SERC 643
◦ Weekly office hours: Friday 1-3, ground floor of SERC ← vote?

Course goals
The primary goal is to analyze, interpret, and visualize data in the biological sciences.
Achieved via statistical analysis and data science techniques in R.
This is not a course in statistical theory.

Course topics
◦ Descriptive and summary statistics
◦ Data visualization
◦ Fundamentals in probability, distributions
◦ Statistical inference: hypothesis testing and confidence intervals
◦ Linear modeling
◦ Multiple testing
◦ Binary classification
◦ Clustering methods
◦ Special topics in current biological data analysis

But first, what are we doing here?
Statistics is the study of the collection, analysis, interpretation, presentation, and organization of data.
We use statistics to make inferences about phenomena using samples and to quantify the uncertainty in data.
Biostatistics is (surprisingly!) a branch of applied statistics geared towards medical and biological problems.

Populations and samples
Populations are the entire collection of individuals/units/etc. a researcher is interested in
◦ Generally we can never know the true composition of a population
◦ Populations are described with parameters
Samples are subsets of individuals/units from populations
◦ We use hypothesis testing to (try to) draw population-level conclusions from samples
◦ Samples are described with estimates
Parameters and estimates use different notations, as we will see

What makes a good sample?
In an ideal world, a sample is unbiased and features low sampling error
◦ Bias is a systematic discrepancy between estimate and parameter
Samples should be randomly chosen
◦ Each population unit should have an equal and independent chance of being chosen for a given sample
[Figure: precise vs. imprecise and accurate vs. inaccurate samples, arranged by bias and sampling error; the ideal is low bias and low sampling error]

Pop quiz: Is it random?
◦ A researcher selects the first 58 student volunteers that sign up for a study.
◦ A computer program numbers all residents in a community, and then uses a random-number generator to select 26 residents.
◦ A researcher vigorously shakes a box containing equally sized balls and takes the first 3 that fall out of the box.
◦ A researcher selects all study participants whose first name starts with an A, B, K, M, or O.
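The second quiz scenario (number every resident, then let a random-number generator pick 26) can be sketched in a few lines of Python. This is only an illustration: the community size of 500 and the seed are assumptions made for the example, not values from the slides.

```python
import random

# Hypothetical community: residents numbered 1..500 (the size is an assumption).
residents = list(range(1, 501))

# A fixed seed only makes the sketch reproducible; drop it for a real draw.
rng = random.Random(5312)

# Simple random sample without replacement: every resident is equally
# likely to end up among the 26 chosen.
chosen = rng.sample(residents, k=26)

print(sorted(chosen))
```

By contrast, taking "the first 58 volunteers" involves no random draw at all, so nothing comparable to `rng.sample` appears in that design.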

Descriptive and Summary Statistics
Tools to concisely describe data, numerically and visually
Generally the first step in data exploration and statistical analysis
◦ Identify missing values, outliers, etc.
◦ Check assumptions required to fit models or perform statistical tests
◦ Identify trends that merit further study

Types of data
How you analyze and visualize data depends on the type of data you have
Quantitative data
◦ Continuous
◦ Discrete (includes count data)
Categorical data
◦ Nominal
◦ Ordinal
◦ Binary*

Quantitative data
Continuous
◦ Any real-number value within some range
Discrete
◦ Values are in indivisible units, i.e. whole or counting numbers
◦ Includes count data (number of cups of coffee per day, number of amino acids in a protein…)

Categorical data
Nominal
◦ Hair color, eye color, sex genotypes (XX, XY, XXY, XYY, XO)
Ordinal: categories with a natural ordering
◦ Bad, fair, good, excellent
◦ A, B, C, D
Binary
◦ Yes/No
◦ True/False
Bonus: names of sex genotypes?

Measures of Location
Continuous
Mean
◦ $\bar{Y} = \frac{1}{n} \sum_{i=1}^{n} Y_i$
Median
◦ For odd n, the $\frac{n+1}{2}$th observation
◦ For even n, the average of the $\frac{n}{2}$th and $\left(\frac{n}{2} + 1\right)$th observations
Discrete
Mode
◦ The most frequently appearing observation in the distribution (commonly used for discrete data)
◦ 1, 2, 2, 2, 3, 4, 4, 5, 6 → 2
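As a rough illustration of the three location measures, the sketch below uses Python's standard `statistics` module on the mode example from the slide. The four-value list used for the even-n median is a made-up example, and the median is always taken over the sorted observations.

```python
import statistics

observations = [1, 2, 2, 2, 3, 4, 4, 5, 6]  # example values from the slide

mean_value = statistics.mean(observations)      # sum of the values divided by n
median_value = statistics.median(observations)  # middle value of the sorted data (odd n)
mode_value = statistics.mode(observations)      # most frequently appearing value

# Even n: the median is the average of the two middle sorted values.
even_example = [1, 2, 3, 7]  # hypothetical values
even_median = statistics.median(even_example)   # (2 + 3) / 2

print(mean_value, median_value, mode_value, even_median)
```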

Measures of location in distributions http://i.imgur.com/YSEYhha.jpg

Measures of spread
◦ Range
◦ Standard deviation and variance
◦ Interquartile range

Range
Difference between the largest and smallest value in a distribution
◦ 1, 2, 3, 7, 9 → 8
◦ 1, 2, 3, 7, 9, 500 → 499
Range is very sensitive to extreme observations and becomes very unwieldy very quickly.
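The two range examples can be reproduced with a small helper. The function name `value_range` is just an illustrative choice for this sketch.

```python
def value_range(values):
    """Difference between the largest and smallest value."""
    return max(values) - min(values)

small = [1, 2, 3, 7, 9]
with_extreme = small + [500]  # one extreme observation added

print(value_range(small))         # 8
print(value_range(with_extreme))  # 499
```

A single added observation moves the range from 8 to 499, which is the sensitivity the slide warns about.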

Standard deviation and variance
Generally discussed in the context of the mean $\bar{Y}$
Deviance describes how each of the n data points deviates from the mean:
◦ $Y_1 - \bar{Y},\ Y_2 - \bar{Y},\ Y_3 - \bar{Y},\ \ldots,\ Y_n - \bar{Y}$
Standard deviation of a sample
◦ $s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (Y_i - \bar{Y})^2}$
Variance
◦ $s^2$
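To make the sample formula concrete, here is a minimal sketch that computes the deviations, the variance (dividing by n - 1), and the standard deviation by hand, then compares the result with `statistics.stdev`. The sample values are hypothetical.

```python
import math
import statistics

sample = [1, 2, 3, 7, 9]  # hypothetical sample
n = len(sample)

mean = sum(sample) / n
deviations = [y - mean for y in sample]  # Y_i minus the sample mean

# Sample variance: sum of squared deviations divided by n - 1.
variance = sum(d ** 2 for d in deviations) / (n - 1)
sd = math.sqrt(variance)

print(variance, sd)
print(statistics.stdev(sample))  # library value for comparison
```

The deviations themselves always sum to zero (up to rounding), which is why they are squared before averaging.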

Interquartile range
Generally discussed in the context of the median
Quartiles divide the data into four equal parts (“quar”!)
Interquartile range (IQR) is the difference between the third and first quartiles
◦ How much of the data does the IQR encompass?
[Figure: sorted data with the first quartile, median, third quartile, and interquartile range marked]
1.25 1.64 1.91 2.31 2.37 2.38 2.84 2.87 2.93 2.94 2.98 3.00 3.09 3.22 3.41 3.55
Five number summary: min, Q1, median, Q3, max
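The sixteen values on this slide can be summarized with Python's `statistics.quantiles`. Note the assumption: several quartile conventions exist, and this sketch uses the module's default ("exclusive") method, so other software, including R with its defaults, may report slightly different Q1 and Q3 values.

```python
import statistics

values = [1.25, 1.64, 1.91, 2.31, 2.37, 2.38, 2.84, 2.87,
          2.93, 2.94, 2.98, 3.00, 3.09, 3.22, 3.41, 3.55]

# Three cut points that split the sorted data into four equal parts.
q1, median, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1

five_number = (min(values), q1, median, q3, max(values))

# Count how many observations lie between the first and third quartile.
inside_iqr = sum(1 for v in values if q1 <= v <= q3)

print(five_number)
print(iqr, inside_iqr, len(values))
```

Counting the observations between Q1 and Q3 is one way to approach the question the slide poses about how much of the data the IQR covers.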

Mean or median?
The median is much more robust to outliers compared to the mean.
[Figure: distributions with the position of the mean marked]
Which would you choose for a symmetric distribution and why?
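A small numerical sketch of this robustness claim, using made-up values: adding one extreme observation shifts the mean a great deal and the median only slightly.

```python
import statistics

base = [1, 2, 3, 7, 9]       # hypothetical data
with_outlier = base + [500]  # same data plus one extreme value

mean_before = statistics.mean(base)
mean_after = statistics.mean(with_outlier)

median_before = statistics.median(base)
median_after = statistics.median(with_outlier)

print(mean_before, mean_after)      # 4.4 87
print(median_before, median_after)  # 3 5.0
```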

Measures of variability
Coefficient of variation is the standard deviation of a sample expressed as a percentage of the sample mean (aka normalized)
◦ $COV = \frac{s}{\bar{Y}} \times 100\%$
◦ Useful measure for comparing variability between two differently-scaled datasets
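The sketch below illustrates why the coefficient of variation helps with differently-scaled data. The heights are hypothetical, and `coefficient_of_variation` is a helper name chosen for this example. The same measurements expressed in centimeters and in meters have very different standard deviations but the same coefficient of variation.

```python
import statistics

def coefficient_of_variation(sample):
    """Sample standard deviation as a percentage of the sample mean."""
    return statistics.stdev(sample) / statistics.mean(sample) * 100

heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]  # hypothetical measurements
heights_m = [h / 100 for h in heights_cm]         # same data, different scale

sd_cm = statistics.stdev(heights_cm)
sd_m = statistics.stdev(heights_m)
cv_cm = coefficient_of_variation(heights_cm)
cv_m = coefficient_of_variation(heights_m)

print(sd_cm, sd_m)  # standard deviations differ by a factor of 100
print(cv_cm, cv_m)  # coefficients of variation agree
```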

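Sample vs population notation
Measurement | Sample estimate | Population parameter
Mean | $\bar{Y} = \frac{1}{n} \sum_{i=1}^{n} Y_i$ | $\mu = \frac{1}{n} \sum_{i=1}^{n} x_i$
Standard deviation | $s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (Y_i - \bar{Y})^2}$ | $\sigma = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2}$
Variance | $s^2$ | $\sigma^2$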
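The practical difference between the two standard deviation formulas is the divisor. As a sketch with hypothetical data, Python's `statistics.stdev` divides by n - 1 (sample estimate), while `statistics.pstdev` divides by n (population parameter, appropriate only when the data are the whole population).

```python
import statistics

data = [1, 2, 3, 7, 9]  # hypothetical values
n = len(data)

s = statistics.stdev(data)       # sample estimate: divides by n - 1
sigma = statistics.pstdev(data)  # population formula: divides by n

print(s, sigma)

# Both are built from the same sum of squared deviations from the mean.
print(s ** 2 * (n - 1), sigma ** 2 * n)
```

For the same data the sample estimate is always at least as large as the population formula's value, and the gap shrinks as n grows.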

Visualizing data
Different types of plots are used to represent different types of data
Continuous data
◦ Histogram
◦ Density plot
◦ Boxplot
◦ Violin plot
Discrete data
◦ Bar plot
Comparing two continuous variables
◦ Scatterplot
Trend over time
◦ Line plot

Histogram
[Figure: histogram; x-axis: Value, y-axis: Count]

Using histograms to describe distributions
[Figure panels: Uniform, Bell-shaped, Asymmetric (skewed), Bimodal]

Density plots smoothen histograms
[Figure: a histogram (x-axis: Value, y-axis: Count) next to density plots of the same data (y-axis: Density)]

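Boxplot
Graphical representation of a five-number summary
“Whiskers” extend to the most extreme data within 1.5 × IQR below Q1 and above Q3; points beyond the whiskers are drawn as outliers
[Figure: boxplot with whiskers, Q1, median, Q3, IQR, and outliers labeled; y-axis: Value]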
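The whisker rule can be sketched numerically. The assumptions here are loud ones: the data are invented, the quartiles come from the default method of `statistics.quantiles`, and the 1.5 × IQR fences follow the common boxplot convention of measuring from Q1 and Q3. Plotting libraries may compute quartiles slightly differently.

```python
import statistics

# Hypothetical sorted data with one unusually low value.
values = [-4.1, -1.2, -0.8, -0.3, 0.0, 0.2, 0.5, 0.9, 1.3, 1.8]

q1, median, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1

# Fences: 1.5 IQR below the first quartile and above the third quartile.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme observations that are still inside the fences.
inside = [v for v in values if lower_fence <= v <= upper_fence]
whisker_low, whisker_high = min(inside), max(inside)

# Anything beyond the fences would be drawn as an individual outlier point.
outliers = [v for v in values if v < lower_fence or v > upper_fence]

print(whisker_low, whisker_high, outliers)
```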

Boxplots: The plot thickens*
[Figure: a bimodal and a unimodal distribution, each shown as a histogram (Value vs. Count) and as a boxplot (Distributions vs. Value)]
*Pun intended.

What can we say about this distribution based on its boxplot?
◦ Symmetry? Asymmetric
◦ Skewness? Right-skewed
◦ Modality? Unclear
[Figure: boxplot; y-axis: Value]

Violin plot: Density meets boxplot
[Figure: violin plots, density plots, and boxplots of samples from N(5, 4), N(2, 1), and N(4, 0.09)]

Barplot
[Figure: bar plot of counts by flower color (orange, pink, red, white); x-axis: Flowers in garden, y-axis: Count]

Cautionary tale in barplots
http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002128

Scatterplot
[Figure: two scatterplots; x-axis: Variable 1 (explanatory/independent variable), y-axis: Variable 2 (response/dependent variable)]

Time series data
[Figure: two plots of the same yearly series, one with Year (1990-2003) on the x-axis and Value on the y-axis, the other with the axes swapped]

BREAK
