Essential Data Preparation, Descriptive Statistics and Visualizations with Examples for Slides by Michael Hahsler

Purpose
1) "Get used to the data."
2) Clean the data (e.g., find missing values, outliers, mistakes).
3) "Make sure the data makes sense."
4) Find simple relationships between variables.
5) Prepare data for predictive/prescriptive modeling.

RapidMiner
● Install RapidMiner Studio and obtain an educational license (see course website).
● The dataset for the examples can be obtained from http://michael.hahsler.net/SMU/EMIS3309/data/census.csv
● RapidMiner processes for this slide set are available here (save and import process in RapidMiner):
– http://michael.hahsler.net/SMU/EMIS3309/data/rapidminer/Basic_Statistics_and_Visualizations.rmp
– http://michael.hahsler.net/SMU/EMIS3309/data/rapidminer/Cleaning_and_preprocessing.rmp

Importing Data
For the examples we use a dataset with census data at the ZIP-code level (data and processes can be found on the class website).
● Columns are features/attributes; rows are observations.
● Quantitative (ratio) attributes: RM type integer or real.
● Categorical attributes: RM type polynominal. These should not be integer!

Single Variable - Quantitative
● Descriptive Stats: 5-number summary (min, 1st quartile, median, 3rd quartile, max), usually reported together with the mean. RapidMiner gives you this for Population per Zipcode.
● Visualization: Histogram to show the distribution.
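Outside RapidMiner, the same summary can be sketched with Python's standard library. This is only an illustration: the population values are made up (they are not taken from census.csv), the function names are hypothetical, and the quartile method ("inclusive") is an assumption that may differ slightly from what RapidMiner reports.

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a list of numbers."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)

def histogram_counts(values, bins):
    """Count values in `bins` equal-width bins spanning [min, max]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1  # avoid a zero width if all values are equal
    counts = [0] * bins
    for v in values:
        # The maximum value is placed in the last bin.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return counts

# Hypothetical populations per ZIP code (illustrative only).
population = [0, 120, 450, 800, 1500, 3200, 9800, 15000, 27000]

summary = five_number_summary(population)
mean = statistics.mean(population)
counts = histogram_counts(population, bins=3)

print("min, Q1, median, Q3, max:", summary)
print("mean:", mean)
print("histogram counts:", counts)
```

Raising `bins` gives a finer histogram, which is what makes spikes at particular values visible.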

Single Variable - Categorical
● Descriptive Stats: Count table
● Visualization: Bar chart for counts; pie chart (not ideal for more than a few groups)
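A count table is simple enough to sketch in a few lines of Python. The category values below are hypothetical and only illustrate the idea; the text bars stand in for a real bar chart.

```python
from collections import Counter

# Hypothetical state codes for a handful of ZIP-code records (illustrative only).
states = ["TX", "CA", "TX", "NY", "CA", "TX"]

count_table = Counter(states)

# Print the count table with a crude text bar chart, most frequent group first.
for category, count in count_table.most_common():
    print(f"{category:<4}{count:>3}  {'#' * count}")
```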

Data Cleaning
RapidMiner: Operators in Cleansing
– Missing values: Is this the result of reading the data? Are missing values correctly read in (or are there values like 99, 'N/A' or '.' as text)? Do we have to impute the missing values?
– Outliers and strange values: Identify in histograms and scatter plots. Examples: many zeros, weird visual pattern visible. Might be the result of data collection. Needs investigation!
– Duplicates: Are these a data problem?
– Dates: Make sure that these are read in correctly!
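The checks for missing-value codes and duplicates can be sketched as follows. The raw data, the column names, and the set of missing-value codes are all assumptions made for illustration; in a real file, whether a value such as 99 is a code or a genuine measurement has to be checked against the data's documentation.

```python
import csv
import io

# Hypothetical raw data; column names and values are illustrative only.
RAW = """zip,median_income
75205,85000
75206,N/A
75207,.
75208,99
75205,85000
"""

# Assumption: these codes mark missing values in this particular file.
MISSING_CODES = {"", "N/A", ".", "99"}

# ZIP codes are kept as text, since they are categorical rather than numeric.
rows = list(csv.DictReader(io.StringIO(RAW)))

# Replace missing-value codes with None so they are not mistaken for data.
for row in rows:
    if row["median_income"] in MISSING_CODES:
        row["median_income"] = None

n_missing = sum(row["median_income"] is None for row in rows)

# Flag exact duplicate rows (same values in every column).
seen, duplicates = set(), []
for row in rows:
    key = tuple(row.items())
    if key in seen:
        duplicates.append(row)
    seen.add(key)

print("missing values:", n_missing)
print("duplicate rows:", duplicates)
```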

Data Cleaning Set a higher number of bins. What do the spikes at 0 and 200,000 for median family income mean? What should we do?

Data Transformation
RapidMiner: Operators in Blending and Cleansing
Data often needs to be transformed to be useful in a model.
● Features
– Feature Selection (select attributes)
– Feature Generation (generate aggregation, etc.)
– Normalization (e.g., z-score)
– Discretization (binning)
– Manipulate values (map, replace)
● Observations
– Sampling/Filtering examples
– Grouping and aggregation
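Two of the feature transformations, z-score normalization and discretization, can be sketched in plain Python. The income values, bin edges, and labels are hypothetical, and the use of the sample standard deviation is an assumption; RapidMiner's operators may make different choices.

```python
import statistics

def z_scores(values):
    """Standardize values: subtract the mean, divide by the standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (an assumption)
    return [(v - mean) / sd for v in values]

def discretize(values, edges, labels):
    """Assign each value the label of the first bin whose upper edge it does not exceed."""
    binned = []
    for v in values:
        for edge, label in zip(edges, labels):
            if v <= edge:
                binned.append(label)
                break
        else:
            binned.append(labels[-1])  # value lies above the last edge
    return binned

# Hypothetical median incomes (illustrative only).
incomes = [20000, 40000, 60000, 80000, 100000]

z = z_scores(incomes)
groups = discretize(incomes, edges=[40000, 80000], labels=["low", "medium", "high"])

print([round(v, 2) for v in z])
print(groups)
```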

Two Variables - Quantitative
● Descriptive Stats: Correlation. RapidMiner: Use the Correlation Matrix node. Example: population and # of housing units per zipcode have a (Pearson) correlation coefficient of 0.975.
● Visualization: Scatterplot
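For reference, the Pearson correlation coefficient can be computed directly from its definition. The sketch below uses made-up numbers, not the census data, so it does not reproduce the 0.975 quoted on the slide; the function name is hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical values per ZIP code (illustrative only).
population = [100, 200, 300, 400, 500]
housing_units = [45, 80, 130, 150, 210]

r = pearson(population, housing_units)
print(f"Pearson r = {r:.3f}")
```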

Two Variables - Categorical
● Descriptive Stats: Cross-tabulation (i.e., contingency table). RapidMiner: Use Aggregate and Pivot nodes.
● Visualization: Grouped bar charts; mosaic plot (not very popular)
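A cross-tabulation counts how often each combination of two categories occurs. A minimal sketch, using hypothetical records and category names:

```python
from collections import Counter

# Hypothetical (region, area type) pairs, one per observation (illustrative only).
records = [
    ("South", "urban"), ("South", "rural"), ("West", "urban"),
    ("South", "urban"), ("West", "urban"), ("West", "rural"),
]

cell_counts = Counter(records)
row_labels = sorted({region for region, _ in records})
col_labels = sorted({area for _, area in records})

# Contingency table as nested dicts: table[row][column] -> count.
table = {
    r: {c: cell_counts[(r, c)] for c in col_labels}
    for r in row_labels
}

# Print the table with one row per region.
print(f"{'':<8}" + "".join(f"{c:>8}" for c in col_labels))
for r in row_labels:
    print(f"{r:<8}" + "".join(f"{table[r][c]:>8}" for c in col_labels))
```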

Two Variables - Mixed
● Descriptive Stats: Compare the 5-number statistics grouped by the categorical variable. RapidMiner: Use the Aggregate node.
● Visualization: Bar chart of an individual statistic to compare groups; box plot
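Group-wise statistics amount to splitting the quantitative variable by the categorical one and summarizing each group. The records below are hypothetical, and only a few statistics are computed to keep the sketch short.

```python
import statistics
from collections import defaultdict

# Hypothetical (state, median income) pairs (illustrative only).
records = [
    ("TX", 52000), ("TX", 61000), ("TX", 48000),
    ("CA", 75000), ("CA", 82000), ("CA", 69000),
]

# Split the quantitative values by group.
groups = defaultdict(list)
for state, income in records:
    groups[state].append(income)

# Summarize each group separately.
group_stats = {
    state: {
        "min": min(values),
        "median": statistics.median(values),
        "mean": statistics.mean(values),
        "max": max(values),
    }
    for state, values in groups.items()
}

for state, stats in group_stats.items():
    print(state, stats)
```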

Multiple Variables
Are usually broken down into pairwise comparisons.
● Correlation matrix
● Scatterplot matrix
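The pairwise breakdown can be sketched as a loop over all pairs of columns, computing one correlation per pair. The column names and values are hypothetical, and `pearson` is a helper defined here for illustration.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical quantitative columns (illustrative only).
columns = {
    "population": [100, 200, 300, 400, 500],
    "housing_units": [45, 80, 130, 150, 210],
    "median_age": [41, 35, 38, 30, 33],
}

names = list(columns)
# Start with 1.0 everywhere; the diagonal keeps this value.
matrix = {a: {b: 1.0 for b in names} for a in names}

# Fill both halves of the symmetric matrix, one pair at a time.
for a, b in combinations(names, 2):
    matrix[a][b] = matrix[b][a] = pearson(columns[a], columns[b])

for a in names:
    print(f"{a:<14}" + "".join(f"{matrix[a][b]:>8.2f}" for b in names))
```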

Multiple Variables (cont.)
Comparing multiple quantitative variables (or comparing a single quantitative variable between groups defined by another categorical variable).
● Tables with group-wise statistics
● Boxplot (called Quartile in RapidMiner)

Basic Descriptive Statistics and Data Visualization Cheat Sheet
Single Variable - Explore the distribution
● Categorical variable. Statistics: counts. Visualization: bar chart.
● Quantitative variable. Statistics: 5-number summary. Visualization: histogram.
Two Variables - Explore the relationship
● Categorical variables. Statistics: contingency table (cross tabulation). Visualization: grouped bar chart.
● Quantitative variables. Statistics: correlation. Visualization: scatter plot.
● Mixed variables. Statistics: group-wise statistics (e.g., average). Visualization: box plot, bar chart of statistics.
3+ Variables
● Break it down into pairwise statistics or plots, e.g., correlation matrix, scatter plot matrix, box plot.
