

  1. Unsupervised Machine Learning and Data Mining. DS 5230 / DS 4420, Fall 2018. Lecture 1. Jan-Willem van de Meent

  2. What is Data Mining?

  3. Intersection of Disciplines. Data Mining lies at the intersection of: Machine Learning • Statistics • Optimization • Applications • Visualization • Algorithms • Databases • Distributed Computing. (slide adapted from Han et al., Data Mining: Concepts and Techniques)

  4. Databases Perspective. Pipeline: Databases → Data Cleaning / Data Integration → Data Warehouse → Data / Feature Selection → Task-relevant Data → Data Mining → Pattern Evaluation → Knowledge. (slide adapted from Han et al., Data Mining: Concepts and Techniques)

  5. Machine Learning Perspective. Data Science pipeline: Data Collection (sensors, scrapers, API integration) → Data Cleaning / Preprocessing (normalization, feature selection) → Machine Learning (association rules, classification, clustering, dimension reduction, outlier detection) → Posthoc Analysis (evaluation, visualization, interpretation) → Deployment (pipelining, scaling). (slide adapted from Nate Derbinsky)

  6. 3 Aspects of Data Mining: Data Types, Tasks, and Methods. Data Types: Sets • Matrices / Tables • Graphs • Time series • Sequences • Text • Images. Tasks / Methods: Exploratory Analysis • Association Rules / Market Basket Analysis • Dimensionality Reduction • Regression • Classification • Recommender Systems • Clustering • Community Detection • Topic Models • Link Analysis • Bandits.

  7. Machine Learning Methods. Supervised Learning: Given labeled examples, learn to make predictions for unlabeled examples. Example: Image classification. Unsupervised Learning (This Course): Given unlabeled examples, learn to identify structure. Example: Community detection in social networks. Reinforcement Learning: Learn to take actions that maximize future reward. Example: Targeting advertisements.

  8. Regression Goal: Predict a Continuous Label Boston Housing Data (source: UCI ML datasets) https://archive.ics.uci.edu/ml/datasets/Housing

  9. Regression Target Variable MEDV : Median value of owner-occupied homes in $1000's

  10. Regression Features Real-valued CRIM : per capita crime rate by town

  11. Regression Features: Discrete / Categorical. CHAS: Charles River variable (= 1 if tract bounds river; 0 otherwise)

  12. Regression Features: Hand-Engineered. DIS: weighted distances to five Boston employment centers

  13. Regression Goal: Use past labels (red) to learn trends that generalize to future data points (green). Time-series data source: https://am241.wordpress.com/tag/time-series/
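A least-squares fit like the one in these regression slides can be sketched in a few lines of numpy. The data below is synthetic (a stand-in for the Boston housing features, generated as y = 3x + 5 plus noise), so the recovered coefficients are illustrative only:

```python
import numpy as np

# Synthetic stand-in for a housing-style regression problem:
# one real-valued feature and a continuous target.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 3.0 * x + 5.0 + rng.normal(0.0, 0.5, size=100)

# Ordinary least squares: fit slope and intercept in closed form
# by appending a constant column for the intercept.
X = np.column_stack([x, np.ones_like(x)])
(slope, intercept), *_ = np.linalg.lstsq(X, y, rcond=None)
```

With 100 points and modest noise, the fitted slope and intercept land close to the generating values of 3 and 5.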

  14. Classification Goal: Predict a discrete label. Figure: handwritten digit images (28 x 28 pixels) fed through 256 hidden units to produce a 10-dimensional one-hot label vector.
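The one-hot label vectors on this slide are simple to construct; a minimal helper, assuming the usual zero-indexed class convention:

```python
import numpy as np

def one_hot(label, num_classes=10):
    # Vector of zeros with a single 1 at index `label`.
    v = np.zeros(num_classes, dtype=int)
    v[label] = 1
    return v
```

For example, `one_hot(7)` places the 1 at index 7 of a length-10 vector.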

  15. Classification Example: Iris Data. Classes: Iris setosa, Iris versicolor, Iris virginica. Features: sepal and petal measurements. https://en.wikipedia.org/wiki/Iris_flower_data_set
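A minimal classifier in the spirit of the iris example is nearest-centroid: label each point by the closest class mean. The two Gaussian blobs below are synthetic stand-ins for the flower measurements, not the iris data itself:

```python
import numpy as np

rng = np.random.default_rng(1)
# Two synthetic 2-D classes (stand-ins for e.g. petal length/width).
class0 = rng.normal([1.0, 1.0], 0.3, size=(50, 2))
class1 = rng.normal([4.0, 4.0], 0.3, size=(50, 2))
centroids = np.stack([class0.mean(axis=0), class1.mean(axis=0)])

def predict(points):
    # Label = index of the nearest class centroid.
    d = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return d.argmin(axis=1)
```

Points near (1, 1) get label 0 and points near (4, 4) get label 1.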

  16. Unsupervised Learning Goal: Can we make predictions in the absence of labels? Methods in this Course: • Frequent Itemsets and Association Rule Mining • Dimensionality Reduction • Clustering • Topic Modeling • Community Detection • Link Analysis • Recommender Systems

  17. Association Rule Mining. Baskets of items: TID 1: {Bread, Coke, Milk}; TID 2: {Beer, Bread}; TID 3: {Beer, Coke, Diaper, Milk}; TID 4: {Beer, Bread, Diaper, Milk}; TID 5: {Coke, Diaper, Milk}. Association Rules: {Milk} --> {Coke}; {Diaper, Milk} --> {Beer}.
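Support and confidence for the two rules on this slide can be checked directly from the five baskets:

```python
# The five example baskets, keyed by transaction ID.
baskets = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Beer", "Bread"},
    3: {"Beer", "Coke", "Diaper", "Milk"},
    4: {"Beer", "Bread", "Diaper", "Milk"},
    5: {"Coke", "Diaper", "Milk"},
}

def support(itemset):
    # Fraction of baskets containing every item in `itemset`.
    return sum(itemset <= b for b in baskets.values()) / len(baskets)

def confidence(lhs, rhs):
    # Conf(lhs -> rhs) = support(lhs | rhs) / support(lhs).
    return support(lhs | rhs) / support(lhs)
```

Here {Milk} --> {Coke} has confidence 3/4 (Milk appears in four baskets, Milk and Coke together in three), and {Diaper, Milk} --> {Beer} has confidence 2/3.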

  18. Dimensionality Reduction Goal: Map high dimensional data onto lower-dimensional data in a manner that preserves distances/similarities Original Data (4 dims) Projection with PCA (2 dims)

  19. Dimensionality Reduction Goal: Map high dimensional data onto lower-dimensional data in a manner that preserves distances/similarities. Figure: MNIST images projected with PCA (linear) vs. t-SNE (non-linear).
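The PCA projection on these slides can be sketched via the singular value decomposition. The 4-dimensional data below is synthetic, built so that most of the variance lies in the first two axes, mimicking the slide's 4-dim to 2-dim projection:

```python
import numpy as np

def pca_project(X, k=2):
    # Center the data, then project onto the top-k right singular
    # vectors (the directions of maximal variance).
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T

rng = np.random.default_rng(2)
# Synthetic 4-dim data: large variance on axes 0 and 1, tiny on 2 and 3.
X = rng.normal(size=(200, 4)) * np.array([5.0, 3.0, 0.1, 0.1])
Z = pca_project(X, k=2)
```

The projected columns come out ordered by variance, with the high-variance directions preserved and the near-noise axes dropped.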

  20. Clustering Goal: Learn categories of examples (i.e. classification without labels). Figure: Iris data (after PCA) and inferred clusters.
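Clustering without labels can be sketched with Lloyd's k-means algorithm. For simplicity this version takes initial centroids from the caller (real implementations typically use k-means++ style initialisation), and the two blobs are synthetic:

```python
import numpy as np

def kmeans(X, centroids, iters=20):
    centroids = np.asarray(centroids, dtype=float)
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Update step: move each centroid to the mean of its points.
        centroids = np.stack([X[labels == j].mean(axis=0)
                              for j in range(len(centroids))])
    return labels, centroids

rng = np.random.default_rng(3)
# Two well-separated synthetic blobs of 50 points each.
X = np.vstack([rng.normal([0.0, 0.0], 0.5, size=(50, 2)),
               rng.normal([10.0, 10.0], 0.5, size=(50, 2))])
labels, cents = kmeans(X, centroids=[[1.0, 1.0], [9.0, 9.0]])
```

With initial centroids roughly near each blob, the algorithm recovers the two groups exactly.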

  21. Hidden Markov Models Goal: Learn categories of time points (i.e. clustering of points within a time series). Figure: a time series and its inferred sequence of states.

  22. Topic Models Goal: Learn topics (categories of words) and quantify topic proportions and assignments for each document. Example topics: gene 0.04, dna 0.02, genetic 0.01, ...; life 0.02, evolve 0.01, organism 0.01, ...; brain 0.04, neuron 0.02, nerve 0.01, ...; data 0.02, number 0.02, computer 0.01, ...

  23. Community Detection Goal: Identify groups of connected nodes (i.e. clustering on graphs)

  24. Community Detection Goal: Identify groups of connected nodes (i.e. clustering on graphs) Nodes: Football Teams, Edges: Matches, Communities: Conferences

  25. Link Analysis Goal: Predict which website is the most authoritative. Figure contrasts pages with few/no inbound links against pages with many, and links from unimportant pages against links from important ones. • Pages with more inbound links are more important • Inbound links from important pages carry more weight (adapted from: Mining of Massive Datasets, http://www.mmds.org)
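The two bullets above are exactly what PageRank formalises. A power-iteration sketch on a hypothetical 3-page web (the link structure is invented for illustration):

```python
import numpy as np

# Hypothetical tiny web: page 0 links to 1 and 2,
# page 1 links to 2, page 2 links back to 0.
links = {0: [1, 2], 1: [2], 2: [0]}
n = len(links)
beta = 0.85  # damping / teleport parameter

# Column-stochastic matrix: M[j, i] = 1/outdegree(i) if i links to j.
M = np.zeros((n, n))
for i, outs in links.items():
    for j in outs:
        M[j, i] = 1.0 / len(outs)

# Power iteration: repeatedly apply the PageRank update.
r = np.full(n, 1.0 / n)
for _ in range(100):
    r = beta * (M @ r) + (1.0 - beta) / n
```

The ranks sum to one, and page 2, which collects inbound links from both other pages, ends up with the highest score.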

  26. Reinforcement Learning Goal: Take actions that maximize future reward. Example: Google (DeepMind) plays Atari. Action: Joystick direction / buttons. Reward: Score.

  27. Reinforcement Learning Goal: Take actions that maximize future reward. Example: Netflix website design. Action: Which movies to show. Reward: User retention.

  28. Recommender Systems Goal: Predict user preferences for unseen items. Methods: Supervised learning (predict ratings), Reinforcement learning (rating is reward), Unsupervised learning (e.g. community detection on users / items)
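One minimal flavour of the methods listed above is user-based collaborative filtering: predict a missing rating as a similarity-weighted average over other users. The rating matrix below is invented for illustration:

```python
import numpy as np

# Rows = users, columns = items; 0 means "not yet rated".
R = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 5.0, 1.0, 0.0],
    [1.0, 0.0, 5.0, 4.0],
])

def cosine(u, v):
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

def predict(user, item):
    # Weighted average of other users' ratings for `item`,
    # weighted by cosine similarity of rating vectors.
    others = [i for i in range(len(R)) if i != user and R[i, item] > 0]
    w = np.array([cosine(R[user], R[i]) for i in others])
    ratings = R[others, item]
    return (w @ ratings) / w.sum()
```

User 0's tastes resemble user 1's, who rated item 2 low, so the prediction for user 0 on item 2 comes out low as well.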

  29. Theme: Optimization of Objectives. Common theme in Machine Learning: Using data-driven algorithms to make predictions that are optimal according to some objective. Supervised Learning: Minimize regression or classification loss. Unsupervised Learning: Maximize expected probability of data. Reinforcement Learning: Maximize expected reward.
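All three settings reduce to optimising an objective, most often by gradient descent. A one-parameter sketch on the toy loss L(w) = (w - 3)^2, whose minimiser is w = 3:

```python
# Toy objective L(w) = (w - 3)^2 with gradient dL/dw = 2(w - 3).
def grad(w):
    return 2.0 * (w - 3.0)

w = 0.0    # initial parameter
lr = 0.1   # learning rate
for _ in range(200):
    w -= lr * grad(w)   # step against the gradient
```

Each step shrinks the distance to the minimiser by a constant factor, so w converges to 3.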

  30. Syllabus https://course.ccs.neu.edu/ds5230f18/

  31. Homework Problems • 4 problem sets, 40% of grade • Python encouraged, but can use any language (within reason: TA must be able to run your code) • Discussion is encouraged, but submissions must be completed individually (absolutely no sharing of code) • Submission via zip file on Blackboard by 11:59pm on day of deadline (no late submissions) • Please follow submission guidelines.

  32. Project Goals • Select a dataset / prediction problem • Perform exploratory analysis and pre-processing • Apply one or more algorithms • Critically evaluate results • Submit a report and present project

  33. Project Deadlines • 21 Sep : Form teams of 2-4 people • 24 Oct : Submit abstract (1 paragraph) • 5 Nov : Milestone 1 (exploratory analysis) • 19 Nov : Milestone 2 (statistical analysis) • 28 Nov : Project Presentations • 7 Dec : Project Reports Due

  34. Grading • Homework: 40% • Midterm: 15% • Final: 15% • Project: 30%

  35. Participation 1. Read the materials 2. Attend the Lectures 3. Ask questions 4. Come to office hours 5. Help Others Class participation is used to adjust grade upwards (at the discretion of the instructor)

  36. Textbooks. Data Mining: Leskovec, Rajaraman, Ullman (PDF freely available); Aggarwal (available on campus network). Machine Learning: Bishop; Murphy. Statistics: Hastie, Tibshirani, Friedman (PDF freely available). Recommend you buy one.
