CSE217 INTRODUCTION TO DATA SCIENCE
LECTURE 2: EXPLORATORY DATA ANALYSIS
Spring 2019
Marion Neumann

RECAP: WHAT IS DATA SCIENCE?
…solving problems with data…
[Diagram: scientific, social, or business problem → data problem → data solution. Steps: collect & understand data; clean & format data; use data to create solution, via data analysis and/or machine learning.]

WHERE DOES DATA COME FROM?
• Internal Sources
  • business-centric data in organizational databases recording day-to-day operations
  • scientific or experimental data
• Existing External Sources → data is available for free or a fee
  • public government databases, stock market data, Yelp reviews
  • usually (somewhat) pre-processed
• Collect your own data → beyond the scope of this course
• Online Data → typically raw data
  • from APIs (e.g. Google Map API, Facebook API, Twitter API)
  • web scraping: extracting data, using software, scripts or by hand, from what is displayed on a page or what is contained in the HTML file
Caution: not all data that is accessible is good to be used!
• Are you violating their terms of service?
• Privacy concerns for website and their clients?
• Do they have an API or fee that you are bypassing?
• Are they willing to share this data?

VARIABLES AND DATA TYPES
• Types of Variables
  • continuous or numeric
  • discrete
  • categorical
  • binary (categorical with 2 categories, e.g. yes/no)
• Data Types
  • integer (discrete, categorical, binary)
  • Boolean (binary)
  • floating point (continuous)
  • string (categorical, formatted text, free-form text)
  • compound data types: lists, dictionaries, arrays
  • we prefer numeric data types
Example: https://www.zillow.com

DATA(SET) REPRESENTATION
• Tables (csv, xlsx etc.)
  • two-dimensional representation
  • rows represent data records
  • each column represents one type of measurement
• Structured Data (json, xml etc.)
  • complex and multi-tiered dictionary
• Semi-structured Data (.txt)
  • flat text representation with known structure
  • data can be easily parsed
• Unstructured Data (.txt)
  • prose text
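
To make the difference between a table and structured data concrete, here is a minimal sketch using Python's standard `csv` and `json` modules. The records and field names are hypothetical and only serve as illustration.

```python
import csv
import io
import json

# Hypothetical table data (CSV): one row per record, one column per measurement.
csv_text = "name,age,city\nAda,23,St. Louis\nBen,19,Chicago\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Hypothetical structured data (JSON): a nested, multi-tiered dictionary.
json_text = '{"name": "Ada", "courses": [{"id": "CSE217", "labs": [1, 2]}]}'
record = json.loads(json_text)

print(rows[0]["age"])                # CSV values are read as strings
print(record["courses"][0]["labs"])  # nested access into the dictionary
```

Note that the CSV reader returns every value as a string, so numeric columns still need to be converted before analysis.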

DATA IS (ALWAYS) MESSY
• Common issues with data:
  • missing values: how do we fill in?
  • wrong values: how can we detect and correct?
  • messy format/representation
• Example: number of produce deliveries over a weekend
Common causes of messiness:
• variables/features are stored in both rows and columns
• multiple features are stored in one column
• multiple types of experimental units stored in same table
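
As a rough illustration of the first cause of messiness, the sketch below reshapes a table whose "day" variable is spread across columns into one row per observation, then counts the missing values. The delivery numbers are made up; the table from the slide is not reproduced here.

```python
# Hypothetical "messy" table: the day variable is spread across columns.
wide = [
    {"produce": "apples", "Saturday": 3, "Sunday": None},
    {"produce": "pears", "Saturday": 1, "Sunday": 4},
]

# Tidy form: one row per (produce, day) observation.
tidy = [
    {"produce": row["produce"], "day": day, "deliveries": row[day]}
    for row in wide
    for day in ("Saturday", "Sunday")
]

# Profile the missing values before deciding how to fill them in.
missing = [r for r in tidy if r["deliveries"] is None]
print(len(tidy), "rows,", len(missing), "missing")
```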

DATA (PRE-)PROCESSING
Goal: bring data into a format we can use for analysis (and/or machine learning)
→ use a format that is good for Python :) (e.g. 2d arrays)
→ recall from last lecture: data points vs features/variables
• Data Parsing and Formatting (data wrangling)
• Data Profiling → assess data amount and quality
• Data Cleaning
• Data Engineering (more later in this course…)
  • detect outliers
  • feature engineering
  • data augmentation

DATA ≠ DATA
• Two kinds of data: population vs. sample
  • A population is the entire set of objects or events under study. Population can be hypothetical ("all students") or all students in this class.
  • A sample is a (representative) subset of the objects or events under study.
    → needed because it's impossible or intractable to obtain or use population data.
• What are problems with sample data?
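
One problem with sample data is that statistics computed on a sample generally differ from those of the population. The sketch below illustrates this with a simulated population; the ages, the population size, and the sample size are all assumptions chosen for illustration.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: ages of 1,000 students.
population = [random.randint(17, 40) for _ in range(1000)]

# A simple random sample of 30 students, drawn without replacement.
sample = random.sample(population, 30)

print("population mean:", statistics.mean(population))
print("sample mean:    ", statistics.mean(sample))
```

Rerunning with a different seed or a smaller sample size shows how much the sample mean can move around.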

EXPLORATORY DATA ANALYSIS (EDA)
Different ways of exploring data:
• explore each individual variable in the dataset
  • summary statistics
  • spread
  • distribution
• assess interactions between variables (or between individual variables and the target)
  • correlation, analysis of variance (ANOVA)
• explore data across many dimensions (more later in this course…)
  • clustering
  • dimensionality reduction (e.g. principal component analysis (PCA), etc.)

SUMMARY STATISTICS
• (sample) mean
• (sample) median
• Example: Ages: 17, 19, 21, 22, 23, 23, 23, 38
  • What is the median age?
  • What is the mean/average age?
• mean vs median
  • which one is easier/more efficient to compute?
Caution: the mean is sensitive to outliers!
Caution: consider practicality (efficiency) of implementation!
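
The age example can be checked with Python's standard `statistics` module. This is a minimal sketch; dropping the value 38 to show the effect of an outlier is an illustration added here, not part of the slide.

```python
import statistics

ages = [17, 19, 21, 22, 23, 23, 23, 38]

mean_age = statistics.mean(ages)      # 186 / 8 = 23.25
median_age = statistics.median(ages)  # average of the middle pair (22, 23) = 22.5

# Remove the outlier (38) to see how each statistic reacts.
without_outlier = ages[:-1]
mean_without = statistics.mean(without_outlier)      # 148 / 7, about 21.14
median_without = statistics.median(without_outlier)  # 22

print(mean_age, median_age)
print(round(mean_without, 2), median_without)
```

The mean shifts by about two years when the single outlier is removed, while the median moves by only half a year.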

SUMMARY STATISTICS
• mode = value that occurs most often
  • useful for categorical variables → visualize with a bar plot
DSFS Ch3
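
A short sketch of computing the mode, once for the ages from the previous slide and once for a categorical variable. The list of colors is hypothetical; the counts it produces are the numbers a bar plot would display.

```python
import statistics
from collections import Counter

ages = [17, 19, 21, 22, 23, 23, 23, 38]
age_mode = statistics.mode(ages)  # 23 occurs three times

# Hypothetical categorical variable.
colors = ["red", "blue", "red", "green", "red", "blue"]
counts = Counter(colors)                   # category -> number of occurrences
color_mode, color_count = counts.most_common(1)[0]

print(age_mode)
print(color_mode, color_count)
```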

MEASURES OF SPREAD
• range = max value – min value
• variance
  • Caution: does not have the same unit as x_i
• standard deviation
Why is measuring the spread important?
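
The measures of spread can be sketched on the age example as follows. The formulas on the original slide are not preserved, so whether the course divides by n or by n - 1 is an assumption; both variants are shown.

```python
import statistics

ages = [17, 19, 21, 22, 23, 23, 23, 38]

value_range = max(ages) - min(ages)            # 38 - 17 = 21

sample_variance = statistics.variance(ages)    # divides by n - 1
population_variance = statistics.pvariance(ages)  # divides by n

# The standard deviation is the square root of the variance,
# so it is back in the same unit as the data (years).
sample_std = statistics.stdev(ages)

print(value_range)
print(round(sample_variance, 2), round(population_variance, 2))
print(round(sample_std, 2))
```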

DATA VISUALIZATION
• Can summary statistics and measures of spread tell us everything?
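
One way to approach this question: the two made-up datasets below have the same mean and the same (population) standard deviation, yet they are distributed very differently. The numbers are hypothetical and are not the data shown on the original slide.

```python
import statistics

# Hypothetical dataset A: values spread out around the centre.
data_a = [2, 4, 4, 4, 5, 5, 7, 9]
# Hypothetical dataset B: two tight clusters and nothing in between.
data_b = [3, 3, 3, 3, 7, 7, 7, 7]

for name, data in [("A", data_a), ("B", data_b)]:
    print(name, statistics.mean(data), statistics.pstdev(data))
```

Both lines report a mean of 5 and a standard deviation of 2, so only a plot (or a look at the raw values) reveals the difference.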

TYPES OF VISUALIZATION
• distribution → how is a variable distributed over a range of possible values
• relationship → how do the values of multiple variables in the dataset relate
• comparison → how do trends in multiple variables or datasets compare
• composition → how does the dataset break down into subgroups

VISUALIZE DISTRIBUTION
• histogram
Caution: Trends in histograms are sensitive to the number of bins.
PDSH p245
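
The sensitivity to the number of bins can be seen without plotting. The helper `bin_counts` below is a hypothetical function written for this sketch; it computes the bar heights of an equal-width histogram for the age example from the summary statistics slide.

```python
def bin_counts(values, num_bins):
    """Count how many values fall into each of num_bins equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for v in values:
        # The maximum value belongs to the last bin, not to a new one.
        index = min(int((v - lo) / width), num_bins - 1)
        counts[index] += 1
    return counts

ages = [17, 19, 21, 22, 23, 23, 23, 38]

coarse = bin_counts(ages, 3)  # [7, 0, 1]
fine = bin_counts(ages, 7)    # [2, 2, 3, 0, 0, 0, 1]
print(coarse)
print(fine)
```

With 3 bins almost all ages collapse into one bar; with 7 bins the cluster around 23 and the gap before 38 become visible. In matplotlib the same choice is made through the `bins` argument of `plt.hist`.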

VISUALIZE RELATIONSHIP
• scatter plot
  • distribution of two variables
  • relationship between two variables
PDSH p233, DSFS Ch3
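
A scatter plot is often paired with a numeric measure of the relationship. The sketch below computes the Pearson correlation coefficient by hand; the function name `pearson_r` and the study-hours data are assumptions made for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical data: hours studied vs. exam score.
hours = [1, 2, 3, 4, 5]
score = [52, 60, 61, 70, 78]

r = pearson_r(hours, score)
print(round(r, 3))  # close to +1: strong positive linear relationship
```

A value near +1 or -1 indicates a strong linear relationship; a value near 0 does not rule out a nonlinear one, which is why the scatter plot is still worth drawing.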

VISUALIZE COMPARISONS
• multiple histograms
  • visualize how different variables compare (or how a variable differs over specific groups)
→ we can also use box plots to compare different variables

VISUALIZE COMPOSITION/COMPARISON
• box plots
  • compare different variables → cf. Lab1
  • compare a quantitative variable across groups
→ highlights the range, quartiles, median and outliers
This plot illustrates composition, since it looks at classes/categories of one variable. (Lab1)
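
The quantities a box plot highlights can be computed directly. This sketch reuses the age example from the summary statistics slide and assumes the common 1.5 × IQR rule for flagging outliers; the slides do not state which rule the course uses.

```python
import statistics

ages = [17, 19, 21, 22, 23, 23, 23, 38]

# Quartiles: three cut points that split the sorted data into four parts.
q1, q2, q3 = statistics.quantiles(ages, n=4)
iqr = q3 - q1  # interquartile range: the height of the box

# Assumed convention: points beyond 1.5 * IQR from the box are outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [a for a in ages if a < lower_fence or a > upper_fence]

print("median:", q2)
print("box:", q1, "to", q3)
print("outliers:", outliers)
```

Different tools interpolate quartiles slightly differently, so the box edges may not match another library's output exactly, but the age 38 is flagged as an outlier either way.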

VISUALIZE COMPOSITION
• pie chart
• stacked area graph
Visualize trend over time!

ACTIVITY 2
• TASK 1: What do the following plots produced in Lab1 visualize?
Caution: Not all visualizations are good visualizations.
• TASK 2: Which of the following visualizations are good/proper visualizations, and which do you think are problematic (and why)?

MORE DIMENSIONS
• How about the relationship between 3 variables?
→ 3D is not always better

CATEGORICAL VARIABLES
• use color coding for categorical variables
[Scatter plot: axes sepal_length and petal_length]
Data visualization can help figure out what we need to predict class labels!

SUMMARY & READING
• EDA process
  • (pre-)process data
  • summarize data
  • present/visualize distribution and relationships
• EDA goals
  • develop/find hypothesis/question(s) to be investigated
  • use data to answer the question(s)
• DSFS
  • Ch3: Visualizing Data (matplotlib, bar/line charts, scatter plots)
• PDSH
  • Ch4: Visualization with Matplotlib
    • plotting with matplotlib (p217-221)
    • scatter plots (p233-237)
    • histograms (p245-247)
