
Data Exploration & Visualization, MAT 6480W / STT 6705V, Guy Wolf



  1. Geometric Data Analysis: Data Exploration & Visualization
     MAT 6480W / STT 6705V. Guy Wolf (guy.wolf@umontreal.ca), Université de Montréal, Fall 2019.
     MAT 6480W (Guy Wolf) Data Exploration UdeM - Fall 2019 1 / 46

  2. Outline
     Section 1, Tabular data: Observations/Data-Points vs. Features/Attributes; Qualitative vs. Quantitative attributes; Qualitative: Nominal vs. Ordinal; Quantitative: Interval vs. Ratio.
     Section 2, Summary statistics: Frequency, mode, & percentiles; Mean & median; Range & variance; Covariance & correlation; Data quality.
     Section 3, Visualizations: Box plots; Histograms; Star plots.

  3. Outline (cont.)
     Section 3, Visualizations (cont.): Parallel coordinate plots; Scatter plots; Quiver plots.
     Section 4, Transactional data: Term matrix; Text documents.
     Section 5, Structured signals (e.g., audio and EEG): Fourier & wavelets; Spectrogram & scalogram.
     Section 6, Multidimensional signals (e.g., images and videos): Visualization with contour plots.
     Section 7, Nonparametric (affinity-/distance-based) representations: Graph data; Visualization with matrix plots.

  4. What is data? (diagram)

  5. What is data? Experimental vs. observational data
     Experimental data: Data collected from strictly controlled/designed experiments, with efforts made to ensure statistical validity. Examples: medical clinical trials, election polls.
     Observational data: Data collected from "real-world" settings without control over the captured underlying phenomena. It is easier to collect and obtain, but results and conclusions from such data may be biased or inconclusive.
     Most data in "data science" is observational data.

  6. Tabular Data

  7. Tabular data
     Organizing data in a table of observations-by-features is considered the most convenient and standard format for data analysis.
     Example: Consider the following procedure:
     Step 1. From each machine, collect 3 temperature measurements (MOBO, CPU, GPU), 4 software parameters (CPU, RAM, HDD, # processes), and 2 power consumption values (MOBO, GPU).
     Step 2. Attach unique identifiers of the machine, OS, and hardware manufacturer.
     Step 3. Every second, store a record with these values from every machine in the system.
     We end up with hundreds of thousands of records, each containing 12 fields.
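
As a rough illustration of the record described in this example, one row of such a table could be modelled as below. This is only a sketch: every field name is hypothetical (the slide lists only the groups of values), the reading of the four software parameters as usage figures is an assumption, and the sample values are made up for illustration.

```python
from dataclasses import dataclass, fields


@dataclass
class MachineRecord:
    """One observation (row); each attribute below is one feature (column)."""
    # 3 unique identifiers (hypothetical names)
    machine_id: str
    os: str
    manufacturer: str
    # 3 temperature measurements (unit assumed to be degrees Celsius)
    temp_mobo: float
    temp_cpu: float
    temp_gpu: float
    # 4 software parameters (assumed to be usage figures)
    cpu_usage: float
    ram_usage: float
    hdd_usage: float
    num_processes: int
    # 2 power consumption values (unit not stated in the slides)
    power_mobo: float
    power_gpu: float


# Made-up values, for illustration only.
record = MachineRecord("m-001", "LNX", "vendor-a",
                       45.0, 51.0, 60.0,
                       65.0, 40.0, 72.0, 23,
                       30.0, 120.0)

n_fields = len(fields(MachineRecord))
print(n_fields)  # number of columns in the table
```

Storing one such record per machine per second is what makes the table grow to hundreds of thousands of rows while the number of columns stays fixed.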

  8. Tabular data: Observations/Data-Points vs. Features/Attributes
     Columns are features/attributes/properties/fields; rows are observations/objects/data-points/samples/records.

     Timestamp        OS    Temp    ...   CPU   # proc
     ...              ...   ...     ...   ...   ...
     9/1/16 1:00AM    LNX   45 °C   ...   65%   23
     ...              ...   ...     ...   ...   ...

  9. Tabular data: Types of features/attributes
     It is important to recognize the types of values each feature/attribute takes in order to understand which operations make sense for it.
     Examples: Can we compute an average eye color? How do we compute the difference between phone numbers? Can we say today is "twice as hot/cold" as yesterday?
     This is similar to problems like 6 apples / 4 people = 1.5 apples per person, but 10 people / 4 car seats = 3 cars.
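
The apples and car-seats comparison can be made concrete with a short sketch. Reading the "3 cars" result as a round-up (cars are indivisible) is an interpretation of the slide, not something it states explicitly.

```python
import math

# Apples can be shared in fractions, so plain division is meaningful.
apples_per_person = 6 / 4

# Cars cannot be split: 10 people at 4 seats per car need the ceiling.
cars_needed = math.ceil(10 / 4)

print(apples_per_person, cars_needed)
```

The same arithmetic operator gives a sensible answer in one case and a misleading one in the other, which is why the type of the quantity has to be considered before choosing the operation.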

  10. Tabular data: Qualitative vs. Quantitative attributes
      Attribute values can be split into two types:
      Qualitative attributes: Attributes that take values from a (finite) set of categories are called categorical or qualitative attributes. In some sense, they describe an object/observation, rather than measure its properties.
      Quantitative attributes: Attributes that represent quantities are called numerical or quantitative attributes. They provide concrete quantifiable measurements of an object/observation.

  11. Tabular data: Qualitative: Nominal vs. Ordinal
      Qualitative attributes can be split further into two types:
      Nominal attributes (examples: zip codes, eye color, operating system, gender): Values of such attributes just specify names, without any particular order or relation between them (except for = and ≠). Binary attributes are nominal attributes with only two values (Yes/No or 0/1). They can be symmetric or asymmetric, based on whether or not their values are equally informative.
      Ordinal attributes (examples: ratings, grades, street/avenue numbers): Values of such attributes have some order, even though they don't specify an exact quantity.

  12. Tabular data: Quantitative: Interval vs. Ratio
      Quantitative attributes can also be split into two types:
      Interval attributes (examples: calendar dates, azimuth direction, Fahrenheit temperatures): Such attributes represent quantities with meaningful differences (or fixed intervals) between their values, but no multiplicative relations.
      Ratio attributes (examples: mass, length, distance, currency, age, electrical current): Such attributes represent quantities that have meaningful ratios between their values. Unlike interval attributes, ratio ones usually have an "absolute zero".
      We can also split quantities into discrete and continuous ones. All qualitative attributes are considered discrete.
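
One way to see the interval/ratio distinction is to check what survives a change of units. The sketch below is illustrative only: the temperatures and masses are made-up values, and the helper names are hypothetical.

```python
import math


def fahrenheit_to_celsius(f):
    """Affine conversion: rescales and shifts the zero point."""
    return (f - 32.0) * 5.0 / 9.0


def pounds_to_kg(lb):
    """Purely multiplicative conversion: the zero point is preserved."""
    return lb * 0.45359237


# Interval attribute: the ratio of two temperatures depends on the unit.
ratio_f = 80.0 / 40.0
ratio_c = fahrenheit_to_celsius(80.0) / fahrenheit_to_celsius(40.0)

# Ratios of *differences* are unit-independent for interval attributes.
diff_ratio_f = (80.0 - 60.0) / (60.0 - 40.0)
diff_ratio_c = ((fahrenheit_to_celsius(80.0) - fahrenheit_to_celsius(60.0))
                / (fahrenheit_to_celsius(60.0) - fahrenheit_to_celsius(40.0)))

# Ratio attribute: the ratio of two masses is the same in any unit.
mass_ratio_lb = 10.0 / 5.0
mass_ratio_kg = pounds_to_kg(10.0) / pounds_to_kg(5.0)

print(ratio_f, ratio_c)              # differ, so "twice as hot" is not meaningful
print(diff_ratio_f, diff_ratio_c)    # agree (up to rounding)
print(mass_ratio_lb, mass_ratio_kg)  # agree (up to rounding)
```

The temperature ratio changes with the unit because the Fahrenheit and Celsius zeros are arbitrary, while the mass ratio does not because zero mass means the same thing in every unit.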

  13. Tabular data: Summary of attribute types
      The types of attributes can be regarded via the operations that can be applied to them:
      Comparison (= and ≠): every type.
      Ordering (> and <): every type except nominal.
      Differences (−) and addition (+): only quantitative.
      Division (/) and multiplication (×, ·): only ratio.
      Other operations (e.g., mean, median, correlation) may also be inapplicable for some types while applicable to others.
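
The hierarchy of operations listed in this summary can be sketched as a small lookup. The names `AttrType` and `allowed_operations` are hypothetical, introduced only for this illustration.

```python
from enum import IntEnum


class AttrType(IntEnum):
    """Attribute types, ordered so that each level adds operations."""
    NOMINAL = 0
    ORDINAL = 1
    INTERVAL = 2
    RATIO = 3


def allowed_operations(attr_type):
    """Return the operations that are meaningful for the given type."""
    ops = ["=", "!="]                      # comparison: every type
    if attr_type >= AttrType.ORDINAL:
        ops += ["<", ">"]                  # ordering: all except nominal
    if attr_type >= AttrType.INTERVAL:
        ops += ["+", "-"]                  # differences/addition: quantitative
    if attr_type == AttrType.RATIO:
        ops += ["*", "/"]                  # multiplication/division: ratio only
    return ops


for t in AttrType:
    print(t.name, allowed_operations(t))
```

Encoding the types as an ordered enumeration reflects that each type supports everything the previous one does, plus one more pair of operations.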

  14. Tabular data: Technical formats
      Tabular data can be stored, collected, or given in several standard formats, such as:
      Comma-separated values file (CSV).
      Flat file or delimited text file (e.g., space or tab delimited).
      XML or other log files.
      Proprietary formats (e.g., FCS for biological data or MAT files for Matlab data).
      Database tables.
      There are several techniques and standard designs to collect and store big data in databases. Data warehouse, ETL (extract-transform-load), and OLAP (Online Analytical Processing) are some related terms encountered frequently in the IT industry.
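
For the CSV case, a minimal sketch of loading such a table with Python's standard `csv` module might look as follows. The column names and the in-memory sample text are assumptions made for the example; a real file would be opened with `open(path, newline="")` instead of `io.StringIO`.

```python
import csv
import io

# Hypothetical CSV content: a header row followed by one record.
csv_text = (
    "timestamp,os,temp_c,cpu_pct,num_proc\n"
    "9/1/16 1:00AM,LNX,45,65,23\n"
)

rows = []
for raw in csv.DictReader(io.StringIO(csv_text)):
    # csv yields every field as a string; convert the quantitative ones.
    rows.append({
        "timestamp": raw["timestamp"],
        "os": raw["os"],                    # nominal: keep as text
        "temp_c": float(raw["temp_c"]),
        "cpu_pct": float(raw["cpu_pct"]),
        "num_proc": int(raw["num_proc"]),
    })

print(rows[0])
```

The explicit conversions matter because the file format itself carries no type information: deciding which columns are qualitative and which are quantitative is left to the analyst.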

  15. Tabular data: Data warehouse: star and snowflake schemas
      Star schema (figure)

  16. Tabular data: Data warehouse: star and snowflake schemas
      Snowflake schema (figure)

  17. Summary statistics
      The raw representation of the data is often not convenient for initial exploration and understanding of the data. How do we get general insights into the data and its attributes as a whole?
      Summary statistics: Properties that summarize global information, such as central tendency, spread, and variations of observations and features.
      These statistics provide an important first step in data analysis, and most of them are not difficult to compute in linear time with respect to the size of the data.
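
To illustrate the linear-time remark, the sketch below computes several summary statistics in a single pass over one attribute. It uses Welford's running-update method for the variance, which is a choice made for this example rather than something prescribed by the slides; the function name `summarize` and the sample values are likewise hypothetical.

```python
import math


def summarize(values):
    """Single pass over the data: count, mean, sample variance, min, max."""
    n = 0
    mean = 0.0
    m2 = 0.0            # running sum of squared deviations from the mean
    lo = math.inf
    hi = -math.inf
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)   # Welford update
        lo = min(lo, x)
        hi = max(hi, x)
    if n == 0:
        raise ValueError("cannot summarize an empty sequence")
    variance = m2 / (n - 1) if n > 1 else 0.0   # sample (n - 1) variance
    return {"count": n, "mean": mean, "variance": variance,
            "min": lo, "max": hi, "range": hi - lo}


stats = summarize([2, 4, 4, 4, 5, 5, 7, 9])   # made-up values
print(stats)
```

Each observation is touched once and only a constant amount of state is kept, so the cost grows linearly with the number of records.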



  20. Summary statistics: Frequency, mode, & percentiles
      Frequency: The portion (e.g., percentage) of the observations with each specific value of a categorical or discrete attribute.
      Mode: The most frequent value of an attribute in the data.
      Percentiles: The p-th percentile (with 0 ≤ p ≤ 100) of an attribute is a value P_p such that p% of the observed values of this attribute are less than P_p. We typically take P_p as one of the observed values of the attribute.
      Alternatives: quartile Q_i (i = 1, 2, 3), quantile, etc.
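
These three statistics can be sketched with the standard library alone. The percentile function below uses the nearest-rank convention (the smallest observed value with at least p% of the observations at or below it), which is one common way to pick P_p among the observed values and is an assumption here; the slides do not fix a convention. All function names and sample values are hypothetical.

```python
import math
from collections import Counter


def frequencies(values):
    """Fraction of observations taking each distinct value."""
    n = len(values)
    return {v: c / n for v, c in Counter(values).items()}


def mode(values):
    """Most frequent value (ties broken by first occurrence)."""
    return Counter(values).most_common(1)[0][0]


def percentile_nearest_rank(values, p):
    """p-th percentile, always returned as one of the observed values."""
    if not values:
        raise ValueError("no observations")
    if not 0 <= p <= 100:
        raise ValueError("p must lie in [0, 100]")
    ordered = sorted(values)
    # Multiply before dividing to limit floating-point rounding in the rank.
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


os_column = ["LNX", "WIN", "LNX", "MAC"]     # made-up nominal attribute
temps = [15, 20, 35, 40, 50]                 # made-up quantitative attribute

os_freq = frequencies(os_column)
os_mode = mode(os_column)
median_temp = percentile_nearest_rank(temps, 50)
q1_temp = percentile_nearest_rank(temps, 25)
print(os_freq, os_mode, median_temp, q1_temp)
```

Frequency and mode apply to any attribute type, whereas percentiles require at least an ordering, so they make sense for ordinal and quantitative attributes but not for nominal ones.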
