Unsupervised Machine Learning and Data Mining DS 5230 / DS 4420 - Fall 2018 Lecture 1 Jan-Willem van de Meent
What is Data Mining?
Intersection of Disciplines Data Mining sits at the intersection of: Statistics, Machine Learning, Optimization, Visualization, Applications, Algorithms, Databases, and Distributed Computing. ( slide adapted from Han et al., Data Mining: Concepts and Techniques )
Databases Perspective Databases → Data Cleaning / Data Integration → Data Warehouse → Data / Feature Selection → Task-relevant Data → Data Mining → Pattern Evaluation → Knowledge ( slide adapted from Han et al., Data Mining: Concepts and Techniques )
Machine Learning Perspective The Data Science pipeline: Data Collection (sensors, scrapers) → Cleaning / Preprocessing (normalization, scaling, feature selection) → Machine Learning (association rules, classification, clustering, dimension reduction, outlier detection) → Posthoc Analysis (evaluation, visualization, interpretation) → Deployment (API integration, pipelining) ( slide adapted from Nate Derbinsky )
3 Aspects of Data Mining Data Types: Sets • Matrices / Tables • Graphs • Time series • Sequences • Text • Images. Tasks: Exploratory Analysis • Market Basket Analysis • Recommender Systems • Community Detection • Link Analysis. Methods: Association Rules • Dimensionality Reduction • Regression • Classification • Clustering • Topic Models • Bandits.
Machine Learning Methods Supervised Learning Given labeled examples, learn to make predictions for unlabeled examples. Example: Image classification. Unsupervised Learning (This Course) Given unlabeled examples, learn to identify structure. Example: Community detection in social networks. Reinforcement Learning Learn to take actions that maximize future reward. Example: Targeting advertisements.
Regression Goal: Predict a Continuous Label Boston Housing Data (source: UCI ML datasets) https://archive.ics.uci.edu/ml/datasets/Housing
Regression Target Variable MEDV : Median value of owner-occupied homes in $1000's
Regression Features Real-valued CRIM : per capita crime rate by town
Regression Features Discrete / Categorical CHAS : Charles River variable (= 1 if tract bounds river; 0 otherwise)
Regression Features Hand-Engineered DIS : weighted distances to five Boston employment centers
Regression Goal: Use past labels ( red ) to learn trends that generalize to future data points ( green ) Time-series Data source: https://am241.wordpress.com/tag/time-series/
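Regression in its simplest form fits a line to past observations by minimizing squared error. A minimal sketch with synthetic data (not the Boston housing set; the slope 3.0 and intercept 2.0 are made up for illustration):

```python
import numpy as np

# Toy data: y = 3*x + 2 plus Gaussian noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 3.0 * x + 2.0 + rng.normal(scale=0.5, size=x.shape)

# Design matrix with a bias column; solve min ||Xw - y||^2 by least squares.
X = np.column_stack([x, np.ones_like(x)])
w, b = np.linalg.lstsq(X, y, rcond=None)[0]
```

The recovered `w` and `b` should be close to the true 3.0 and 2.0; predictions for future `x` values then generalize to the extent the linear trend holds.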
Classification Goal: Predict a discrete label. Example: MNIST digits. Input: 28 x 28 images → 256 hidden units → 10-dimensional one-hot label (a 1 in the position of the predicted digit, e.g. [0 0 0 0 0 0 0 0 1 0] for the digit 8).
Classification Example: Iris Data. Classes: Iris setosa, Iris versicolor, Iris virginica. Features: sepal and petal measurements. https://en.wikipedia.org/wiki/Iris_flower_data_set
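One of the simplest classifiers is nearest-centroid: label a point by the closest class mean. A sketch on toy 2-D data (hypothetical stand-ins for sepal/petal measurements, not the real Iris values):

```python
import numpy as np

# Two toy classes in 2D.
rng = np.random.default_rng(1)
class0 = rng.normal(loc=[1.0, 1.0], scale=0.3, size=(20, 2))
class1 = rng.normal(loc=[3.0, 3.0], scale=0.3, size=(20, 2))

# Nearest-centroid classifier: one mean per class.
centroids = np.array([class0.mean(axis=0), class1.mean(axis=0)])

def predict(point):
    # Return the index of the closest class mean.
    return int(np.argmin(np.linalg.norm(centroids - point, axis=1)))

label = predict(np.array([2.9, 3.1]))  # lands near class 1's mean
```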
Unsupervised Learning Goal: Can we make predictions in absence of labels? Methods in this Course: • Frequent Itemsets and Association rule mining • Dimensionality Reduction • Clustering • Topic Modeling • Community Detection • Link Analysis • Recommender Systems
Association Rule Mining Baskets of items: TID 1: Bread, Coke, Milk • TID 2: Beer, Bread • TID 3: Beer, Coke, Diaper, Milk • TID 4: Beer, Bread, Diaper, Milk • TID 5: Coke, Diaper, Milk. Association Rules: {Milk} → {Coke} and {Diaper, Milk} → {Beer}
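Rules like these are scored by support (how often the itemset occurs) and confidence (how often the rule fires when its left-hand side occurs). Computing both for the two rules above, on exactly the five baskets shown:

```python
# The five baskets from the slide.
baskets = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]

def support(itemset):
    # Fraction of baskets containing the whole itemset.
    return sum(itemset <= b for b in baskets) / len(baskets)

def confidence(lhs, rhs):
    # P(rhs in basket | lhs in basket).
    return support(lhs | rhs) / support(lhs)

conf_milk_coke = confidence({"Milk"}, {"Coke"})          # 3 of 4 Milk baskets
conf_dm_beer = confidence({"Diaper", "Milk"}, {"Beer"})  # 2 of 3 {Diaper, Milk} baskets
```

Milk appears in four baskets, three of which also contain Coke, so {Milk} → {Coke} has confidence 0.75; similarly {Diaper, Milk} → {Beer} has confidence 2/3.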
Dimensionality Reduction Goal: Map high dimensional data onto lower-dimensional data in a manner that preserves distances/similarities Original Data (4 dims) Projection with PCA (2 dims)
Dimensionality Reduction Goal: Map high dimensional data onto lower-dimensional data in a manner that preserves distances/similarities Input: MNIST images. PCA (linear) vs. t-SNE (non-linear)
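PCA itself is a few lines of linear algebra: center the data and project onto the top right singular vectors. A sketch on synthetic 4-dimensional data (made up here; not the MNIST or Iris data):

```python
import numpy as np

# Synthetic 4-dim data with correlated dimensions.
rng = np.random.default_rng(2)
data = rng.normal(size=(100, 4))
data[:, 1] = 2 * data[:, 0] + 0.1 * data[:, 1]  # dim 1 tracks dim 0

# PCA: center, take the SVD, project onto the top-2 principal directions.
centered = data - data.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
projected = centered @ Vt[:2].T  # (100, 2) low-dimensional embedding
```

The first projected coordinate captures the most variance by construction, which is why a 2-dim PCA plot often reveals structure hidden in the original dimensions.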
Clustering Goal: Learn categories of examples (i.e. classification without labels) Iris Data (after PCA) Inferred Clusters
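A classic clustering method is k-means (Lloyd's algorithm): alternate between assigning points to the nearest center and recomputing centers as cluster means. A minimal sketch on two toy blobs (not the Iris data; initialization here deliberately picks one point from each region to keep the demo deterministic):

```python
import numpy as np

# Two well-separated toy blobs.
rng = np.random.default_rng(3)
points = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.2, size=(30, 2)),
    rng.normal(loc=[5.0, 5.0], scale=0.2, size=(30, 2)),
])

# Initialize with one point from each blob (a simplification for this demo).
centers = np.array([points[0], points[-1]])
for _ in range(10):
    # Assign each point to its nearest center, then recompute the means.
    dists = np.linalg.norm(points[:, None] - centers[None, :], axis=2)
    labels = dists.argmin(axis=1)
    centers = np.array([points[labels == k].mean(axis=0) for k in range(2)])
```

Real k-means implementations add random restarts or k-means++ initialization, since a bad start can land in a poor local optimum.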
Hidden Markov Models Goal: Learn categories of time points (i.e. clustering of points within time series) Sequence of States Time Series
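The basic HMM computation is the forward algorithm: sum the probability of an observation sequence over all hidden state paths, one time step at a time. A sketch for a hypothetical 2-state model (all matrices made up for illustration):

```python
import numpy as np

pi = np.array([0.5, 0.5])        # initial state distribution
A = np.array([[0.9, 0.1],        # "sticky" state transition probabilities
              [0.1, 0.9]])
B = np.array([[0.8, 0.2],        # emission probabilities P(obs | state)
              [0.2, 0.8]])

obs = [0, 0, 1, 1]               # observed symbol sequence

# Forward recursion: alpha[s] = P(obs so far, current state = s).
alpha = pi * B[:, obs[0]]
for o in obs[1:]:
    alpha = (alpha @ A) * B[:, o]
seq_prob = alpha.sum()           # P(observation sequence)
```

The recursion costs O(T·K²) for T time steps and K states, versus O(K^T) for summing over paths explicitly; the same machinery underlies the Viterbi decoding used to infer the state sequence.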
Topic Models Goal: Learn topics (categories of words) and quantify topic frequency for each document. Example topics (word: probability): gene 0.04, dna 0.02, genetic 0.01, … • life 0.02, evolve 0.01, organism 0.01, … • brain 0.04, neuron 0.02, nerve 0.01, … • data 0.02, number 0.02, computer 0.01, … Each document gets topic proportions and per-word topic assignments.
Community Detection Goal: Identify groups of connected nodes (i.e. clustering on graphs)
Community Detection Goal: Identify groups of connected nodes (i.e. clustering on graphs) Nodes: Football Teams, Edges: Matches, Communities: Conferences
Link Analysis Goal: Predict which website is the most authoritative. • Pages with more inbound links are more important • Inbound links from important pages carry more weight (adapted from: Mining of Massive Datasets, http://www.mmds.org)
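Both intuitions are captured by PageRank: a page's score is the sum of its in-neighbors' scores divided by their out-degrees, plus a small "teleport" term. A power-iteration sketch on a hypothetical four-page graph:

```python
# Hypothetical link graph: page -> pages it links to.
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}
pages = list(links)
damping = 0.85
rank = {p: 1.0 / len(pages) for p in pages}

# Power iteration: repeatedly redistribute rank along out-links.
for _ in range(50):
    new = {p: (1 - damping) / len(pages) for p in pages}
    for p, outs in links.items():
        for q in outs:
            new[q] += damping * rank[p] / len(outs)
    rank = new
```

Page C, with inbound links from A, B, and D, ends up with the highest rank; D, with no inbound links, gets only the teleport mass.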
Reinforcement Learning Goal: Take actions that maximize future reward . Example: Google DeepMind plays Atari Action: Joystick direction / Buttons. Reward: Score.
Reinforcement Learning Goal: Take action that maximizes future reward . Example: Netflix Website Design Action: Which movies to show. Reward: User Retention.
Recommender Systems Goal: Predict user preferences for unseen items. Methods: Supervised learning (predict ratings), Reinforcement learning (rating is reward), Unsupervised learning (e.g. community detection on users / items)
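One common supervised approach is item-item collaborative filtering: predict a user's missing rating as a similarity-weighted average of their ratings for other items. A sketch on a hypothetical 4-user, 4-item rating matrix (0 marks an unrated item):

```python
import numpy as np

# Hypothetical ratings: rows are users, columns are items, 0 = unrated.
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Predict user 0's rating for item 2 from the items user 0 has rated,
# weighted by cosine similarity between item columns.
target = 2
sims = np.array([cosine(ratings[:, target], ratings[:, j]) for j in range(4)])
rated = [j for j in range(4) if j != target and ratings[0, j] > 0]
pred = sum(sims[j] * ratings[0, j] for j in rated) / sum(sims[j] for j in rated)
```

Item 3 is the most similar to item 2, and user 0 rated item 3 low, so the prediction comes out low; production systems typically use matrix factorization instead, but the weighted-average idea is the same.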
Theme: Optimization of Objectives Common theme in Machine Learning : Using data-driven algorithms to make predictions that are optimal according to some objective . Supervised Learning: Minimize regression or classification loss Unsupervised Learning: Maximize expected probability of data Reinforcement Learning: Maximize expected reward
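Almost all of these objectives are optimized the same way: follow the gradient. A minimal illustration on a one-dimensional squared loss L(w) = (w − 3)², where the target value 3 is arbitrary:

```python
# Gradient descent on L(w) = (w - 3)^2.
w = 0.0
lr = 0.1
for _ in range(100):
    grad = 2 * (w - 3.0)  # dL/dw
    w -= lr * grad        # step downhill
```

Each step shrinks the error by a constant factor (here 1 − 2·lr = 0.8), so `w` converges to the minimizer 3.0; the same loop, with vector-valued gradients, drives most of the methods in this course.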
Syllabus https://course.ccs.neu.edu/ds5230f18/
Homework Problems • 4 problem sets, 40% of grade • Python encouraged, but can use any language (within reason – TA must be able to run your code) • Discussion is encouraged, but submissions must be completed individually (absolutely no sharing of code ) • Submission via zip file on Blackboard by 11:59pm on day of deadline (no late submissions) • Please follow submission guidelines.
Project Goals • Select a dataset / prediction problem • Perform exploratory analysis and pre-processing • Apply one or more algorithms • Critically evaluate results • Submit a report and present project
Project Deadlines • 21 Sep : Form teams of 2-4 people • 24 Oct : Submit abstract (1 paragraph) • 5 Nov : Milestone 1 (exploratory analysis) • 19 Nov : Milestone 2 (statistical analysis) • 28 Nov : Project Presentations • 7 Dec : Project Reports Due
Grading • Homework: 40% • Midterm: 15% • Final: 15% • Project: 30%
Participation 1. Read the materials 2. Attend the Lectures 3. Ask questions 4. Come to office hours 5. Help Others Class participation is used to adjust grade upwards (at the discretion of the instructor)
Textbooks Data Mining: Leskovec, Rajaraman, Ullman (Mining of Massive Datasets; PDF freely available) • Aggarwal (available on campus network). Machine Learning: Bishop • Murphy. Statistics: Hastie, Tibshirani, Friedman (PDF freely available). Recommend you buy one.