Geometric Data Analysis
Introduction to Data Science
MAT 6480W / STT 6705V
Guy Wolf (guy.wolf@umontreal.ca)
Université de Montréal, Fall 2019
MAT 6480W (Guy Wolf) Introduction UdeM - Fall 2019 1 / 19
Outline

1. What is Data Science?
   - From data to information
   - Predictive vs. descriptive information
   - Supervised vs. unsupervised learning
2. Data Analysis Tasks
   - Classification & regression
   - Clustering & anomaly detection
   - Association rules & sequential patterns
   - Visualization & dimensionality reduction
3. Data Analysis Process
4. Software for Data Analysis
Optional textbooks on data mining
What is data science?

Data Mining: Non-trivial extraction of useful, new, hidden, and/or implicit information from data.

Machine Learning: Field of study that gives computers the ability to learn without being explicitly programmed.

Deep Learning: A set of algorithms that attempt to model high-level abstractions in data by using multiple processing layers, composed of multiple linear and non-linear transformations.

Big Data: Extremely large data sets that may be analyzed computationally to reveal patterns, trends, and associations, especially relating to human behavior and interactions.

Related terms: knowledge discovery in databases (KDD), pattern recognition, data warehousing, OLAP, ETL, IT, etc.
What is data science? From data to information

[Figure: diagram of arrows leading from collected data, represented as a subset of $\mathbb{R}^{O(100+)}$, toward extracted information.]
What is data science? From data to information

Examples of data mining / analysis tasks:
- Recommend movies on Netflix or books on Amazon
- Object recognition in images and automatic image tagging
- Community detection in social networks (e.g., Facebook)
- Automatic medical diagnosis and treatment recommendation

Examples of data processing tasks that do not require data mining:
- Signature-based anti-virus
- Retrieving details from a contact list
- Text-based search in a document or on the web
- Quicksort, balanced trees, heaps, etc.
What is data science? Predictive vs. descriptive methods

Predictive methods: Predict unknown information from known data.
- How much would my house sell for, based on sales stats?
- Will Bob like Ghostbusters, based on his Netflix history?

Descriptive methods: Infer or extract interpretable patterns to describe data.
- What consumer profiles should my ads target?
- If Jim’s card is charged $300 in a Disney store today, is the charge reasonable or fraudulent?
What is data science? Supervised vs. unsupervised learning

Machine learning data analysis tasks are roughly divided into:
- Supervised learning: Inferring information from labeled training data.
- Unsupervised learning: Finding hidden patterns in unlabeled data.
- Semi-supervised learning: Combine information from labeled and unlabeled data to model and deduce information.
Data analysis tasks: Classification

Classification: Classify “items” into a finite set of classes, or “categories”.

Training phase (labeled data ⇒ classification model):
\[
\{(x_1, \ell_1), \ldots, (x_n, \ell_n)\} \subset X \times L, \quad |L| < \infty
\quad \Longrightarrow \quad
F : X \to L, \quad F(x_i) = \ell_i
\]

Testing phase (new data ⇒ classification result):
\[
y_1, y_2, \ldots \in X \;\mapsto\; \text{classification model}
\quad \Longrightarrow \quad
F(y_1), \ldots, F(y_n) \in L
\]
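The training and testing phases can be illustrated with a minimal sketch. It assumes a 1-nearest-neighbor rule as the classification model, which is only one of many possible choices and is not prescribed by the slides; the function names and the toy data are hypothetical.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def train_nearest_neighbor(labeled_data):
    """Training phase: build F: X -> L from pairs (x_i, l_i).

    Assumption: a 1-nearest-neighbor rule stands in for the model.
    With distinct training points it satisfies F(x_i) = l_i.
    """
    def F(x):
        _, label = min(labeled_data, key=lambda pair: euclidean(pair[0], x))
        return label
    return F

# Hypothetical labeled data: points in the plane with a finite label set L.
training = [
    ((0.0, 0.0), "blue"),
    ((0.2, 0.1), "blue"),
    ((1.0, 1.0), "red"),
    ((0.9, 1.2), "red"),
]
F = train_nearest_neighbor(training)

# Testing phase: apply the model to new, unlabeled points y_1, y_2.
new_data = [(0.1, 0.0), (1.1, 0.9)]
predictions = [F(y) for y in new_data]
print(predictions)
```

The model here memorizes the training set; most practical classifiers instead fit a compact set of parameters, but the train-then-apply structure is the same.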
Data analysis tasks: Classification - examples

Example (MNIST digit classification)

Example (CalTech 101 image classification): Anchor, Joshua-Tree, Beaver, Lotus, Water-Lily
Data analysis tasks: Regression

Regression: Compute (or infer) the value of a (piecewise) continuous function from a finite number of sampled “items” & values.

This task is similar to classification, but here the model $F$ can have an infinite range (e.g., $\mathbb{R}$ or $[0, 1]$).

Examples
- Market pricing of a house/apartment/car based on its features.
- Trend line & model fitting from collected experimental data.
- Weather predictions, such as temperature and probability of rain/snow.
- Confidence rating in diagnostics (or binary classifier).
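As a rough illustration of the trend-line example, the sketch below fits a straight line by ordinary least squares. The choice of a linear model, the function name `fit_line`, and the area/price numbers are all assumptions made for the example.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and co-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical samples: apartment area (m^2) and sale price (thousands).
areas = [50.0, 80.0, 100.0, 120.0]
prices = [150.0, 240.0, 300.0, 360.0]

slope, intercept = fit_line(areas, prices)

# The fitted model F has a continuous range: any area maps to a real price.
predicted_price = slope * 90.0 + intercept
print(slope, intercept, predicted_price)
```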
Data analysis tasks: Clustering

Clustering: Group together similar “items” while separating ones that are different from each other.

The quality of obtained clusters stems from their interpretability. Variations include a known or unknown number of clusters, as well as multiscale hierarchical clustering structures.

Examples
- Clustering stocks to diversify stock market investment.
- Community detection in social networks by clustering profiles.
- Clustering genes and cells to uncover activities, reactions, and interactions.
- Network activity profiling by clustering packets/sessions.
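One way to make the grouping idea concrete is a short k-means (Lloyd's algorithm) sketch for the case where the number of clusters is known. The slides do not name a specific algorithm, so k-means, the fixed initial centroids, and the toy points are assumptions chosen for illustration.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def kmeans(points, centroids, iterations=10):
    """Lloyd's algorithm: alternate assignment and centroid update."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            idx = min(range(len(centroids)), key=lambda i: sq_dist(p, centroids[i]))
            clusters[idx].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical data: two visibly separated groups in the plane.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 9.0), (8.5, 9.5)]
centroids, clusters = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(centroids)
print(clusters)
```

Fixing the initial centroids keeps the run deterministic; in practice they are usually chosen at random, and the result can depend on that choice.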
Data analysis tasks: Anomaly detection

Anomaly/outlier detection: Detect significant deviations from normal behavior expressed by inferred data patterns.

The notion of “normal behavior” can be defined in several ways, such as clustering or model fitting.

Examples
- Fraud detection in credit cards
- Intrusion detection in cybersecurity
- Detecting bot traffic in online advertising
- Malfunction detection in process monitoring
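A minimal sketch of the model-fitting route: "normal behavior" is summarized by the mean and standard deviation of past observations, and a new value is flagged when it lies more than a chosen number of standard deviations away. The threshold of 3, the function names, and the purchase amounts are assumptions for illustration only.

```python
import statistics

def fit_normal_model(values):
    """Model 'normal behavior' as the mean and sample std of past values."""
    return statistics.mean(values), statistics.stdev(values)

def is_anomaly(value, model, threshold=3.0):
    """Flag a value whose z-score exceeds the threshold in absolute value."""
    mean, std = model
    return abs(value - mean) / std > threshold

# Hypothetical history of card charges considered normal.
history = [42.0, 38.5, 45.0, 40.0, 39.5, 44.0, 41.0]
model = fit_normal_model(history)

# New charges to check against the fitted model.
new_charges = [43.0, 300.0]
flags = [is_anomaly(v, model) for v in new_charges]
print(flags)
```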
Data analysis tasks: Association rules

Association rule discovery: Produce dependency rules that model input co-occurrences of “items” to predict, given a partial “transaction”, the remaining “items” in it.

Training phase (observed transactions ⇒ association rules):
\[
T_1, \ldots, T_n \subseteq X
\quad \Longrightarrow \quad
F : 2^X \to 2^X, \quad T \subseteq T_i \;\mapsto\; F(T) \approx T_i \setminus T
\]

Testing phase (partial transactions ⇒ predicted information):
\[
S_1, S_2, \ldots \subseteq X \;\mapsto\; \text{association rules}
\quad \Longrightarrow \quad
\forall i, \; S_i \;\mapsto\; F(S_i) \subseteq X \setminus S_i
\]
Examples
- Active advertisements & recommendations (e.g., “Users who liked/bought this product also liked/bought that product”)
- Support decision making on shelf organization in stores & supermarkets
- Name completions in emails, social networks, etc.

Unlike classification, the actual testing phase is often less important than the discovered rules in this case.
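The idea of predicting the rest of a transaction from part of it can be sketched with single-item rules of the form "a implies b", kept when their confidence (the fraction of transactions containing a that also contain b) reaches a threshold. Restricting rules to item pairs, the 0.6 threshold, the function names, and the shopping baskets are all assumptions for the example.

```python
from collections import Counter

def mine_pair_rules(transactions, min_confidence=0.6):
    """Learn rules a -> b from co-occurrence counts of item pairs."""
    item_counts = Counter()
    pair_counts = Counter()
    for transaction in transactions:
        items = set(transaction)
        item_counts.update(items)
        for a in items:
            for b in items:
                if a != b:
                    pair_counts[(a, b)] += 1
    rules = {}
    for (a, b), count in pair_counts.items():
        # Confidence of a -> b: share of transactions with a that contain b.
        if count / item_counts[a] >= min_confidence:
            rules.setdefault(a, set()).add(b)
    return rules

def predict(partial, rules):
    """Predict missing items for a partial transaction S, excluding S itself."""
    predicted = set()
    for item in partial:
        predicted |= rules.get(item, set())
    return predicted - set(partial)

# Hypothetical observed transactions T_1, ..., T_n.
transactions = [
    {"bread", "butter"},
    {"bread", "butter", "jam"},
    {"bread", "milk"},
    {"milk", "cereal"},
    {"bread", "butter", "milk"},
]
rules = mine_pair_rules(transactions)
suggestion = predict({"bread"}, rules)
print(rules)
print(suggestion)
```

In line with the remark above, the learned `rules` dictionary is often the more useful output, since it can be inspected directly.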
Data analysis tasks: Sequential patterns

Sequential pattern discovery: Given a set of ordered event sequences, produce rules to predict unknown/missing/future events from prior and/or subsequent events.

Similar in some sense to association rule discovery, but with an order or timeline aspect to each transaction.

Examples
- String mining:
  - Natural language processing
  - Gene sequencing in DNA and RNA
- Frequent item purchase sequences
- Predicting outcomes of medical treatment
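A very simple instance of predicting a future event from a prior one is to count which event most often follows each event in the observed sequences. This first-order transition count is an assumed simplification, not a method named in the slides, and the purchase sequences and function names are hypothetical.

```python
from collections import Counter, defaultdict

def learn_transitions(sequences):
    """Count how often each event is immediately followed by each other event."""
    transitions = defaultdict(Counter)
    for sequence in sequences:
        for current, following in zip(sequence, sequence[1:]):
            transitions[current][following] += 1
    return transitions

def predict_next(event, transitions):
    """Return the most frequent follower of an event, or None if unseen."""
    followers = transitions.get(event)
    if not followers:
        return None
    return followers.most_common(1)[0][0]

# Hypothetical ordered purchase sequences.
sequences = [
    ["phone", "case", "charger"],
    ["phone", "case"],
    ["laptop", "mouse"],
    ["phone", "charger"],
]
transitions = learn_transitions(sequences)
next_after_phone = predict_next("phone", transitions)
next_after_tablet = predict_next("tablet", transitions)
print(next_after_phone, next_after_tablet)
```

Unlike the association-rule setting, swapping the order of events within a sequence changes the counts, which is the timeline aspect mentioned above.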