Unsupervised Machine Learning and Data Mining DS 5230 / DS 4420 - Fall 2018 Lecture 1 Jan-Willem van de Meent
What is Data Mining?
Intersection of Disciplines Data Mining sits at the intersection of: Statistics, Machine Learning, Optimization, Visualization, Applications, Algorithms, Databases, and Distributed Computing. ( slide adapted from Han et al., Data Mining: Concepts and Techniques )
Databases Perspective Databases → Data Cleaning / Data Integration → Data Warehouse → Data / Feature Selection → Task-relevant Data → Data Mining → Pattern Evaluation → Knowledge ( slide adapted from Han et al., Data Mining: Concepts and Techniques )
Machine Learning Perspective The Data Science pipeline: Data Collection (sensors, scrapers) → Cleaning / Preprocessing (normalization, scaling, feature selection) → Machine Learning (association rules, classification, clustering, dimension reduction, outlier detection) → Posthoc Analysis (evaluation, visualization, interpretation) → Deployment (API integration, pipelining) ( slide adapted from Nate Derbinsky )
3 Aspects of Data Mining Data Types: Sets • Matrices / Tables • Graphs • Time series • Sequences • Text • Images. Tasks: Exploratory Analysis • Market Basket Analysis • Recommender Systems • Community Detection • Link Analysis. Methods: Association Rules • Dimensionality Reduction • Regression • Classification • Clustering • Topic Models • Bandits.
Machine Learning Methods Supervised Learning Given labeled examples, learn to make predictions for unlabeled examples. Example: Image classification. Unsupervised Learning (This Course) Given unlabeled examples, learn to identify structure. Example: Community detection in social networks. Reinforcement Learning Learn to take actions that maximize future reward. Example: Targeting advertisements.
Regression Goal: Predict a Continuous Label Boston Housing Data (source: UCI ML datasets) https://archive.ics.uci.edu/ml/datasets/Housing
Regression Target Variable MEDV : Median value of owner-occupied homes in $1000's
Regression Features Real-valued CRIM : per capita crime rate by town
Regression Features Discrete / Categorical CHAS : Charles River variable (= 1 if tract bounds river; 0 otherwise)
Regression Features Hand-Engineered DIS : weighted distances to five Boston employment centers
Regression Goal: Use past labels ( red ) to learn trends that generalize to future data points ( green ) Time-series Data source: https://am241.wordpress.com/tag/time-series/
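Regression in its simplest form fits a line to past observations by minimizing squared error. A minimal sketch with synthetic data (not the Boston housing set; the slope 3.0 and intercept 2.0 are made up for illustration):

```python
import numpy as np

# Toy data: y = 3*x + 2 plus Gaussian noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 3.0 * x + 2.0 + rng.normal(scale=0.5, size=x.shape)

# Design matrix with a bias column; solve min ||Xw - y||^2 by least squares.
X = np.column_stack([x, np.ones_like(x)])
w, b = np.linalg.lstsq(X, y, rcond=None)[0]
```

The recovered `w` and `b` should be close to the true 3.0 and 2.0; predictions for future `x` values then generalize to the extent the linear trend holds.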
Classification Goal: Predict a discrete label. Example: MNIST digits. Input: 28 x 28 images → 256 hidden units → 10-dimensional one-hot label (a 1 in the position of the predicted digit, e.g. [0 0 0 0 0 0 0 0 1 0] for the digit 8).
Classification Example: Iris Data. Classes: Iris setosa, Iris versicolor, Iris virginica. Features: sepal and petal measurements. https://en.wikipedia.org/wiki/Iris_flower_data_set
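One of the simplest classifiers is nearest-centroid: label a point by the closest class mean. A sketch on toy 2-D data (hypothetical stand-ins for sepal/petal measurements, not the real Iris values):

```python
import numpy as np

# Two toy classes in 2D.
rng = np.random.default_rng(1)
class0 = rng.normal(loc=[1.0, 1.0], scale=0.3, size=(20, 2))
class1 = rng.normal(loc=[3.0, 3.0], scale=0.3, size=(20, 2))

# Nearest-centroid classifier: one mean per class.
centroids = np.array([class0.mean(axis=0), class1.mean(axis=0)])

def predict(point):
    # Return the index of the closest class mean.
    return int(np.argmin(np.linalg.norm(centroids - point, axis=1)))

label = predict(np.array([2.9, 3.1]))  # lands near class 1's mean
```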
Unsupervised Learning Goal: Can we make predictions in absence of labels? Methods in this Course: • Frequent Itemsets and Association rule mining • Dimensionality Reduction • Clustering • Topic Modeling • Community Detection • Link Analysis • Recommender Systems
Association Rule Mining Baskets of items: TID 1: Bread, Coke, Milk • TID 2: Beer, Bread • TID 3: Beer, Coke, Diaper, Milk • TID 4: Beer, Bread, Diaper, Milk • TID 5: Coke, Diaper, Milk. Association Rules: {Milk} → {Coke} and {Diaper, Milk} → {Beer}
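Rules like these are scored by support (how often the itemset occurs) and confidence (how often the rule fires when its left-hand side occurs). Computing both for the two rules above, on exactly the five baskets shown:

```python
# The five baskets from the slide.
baskets = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]

def support(itemset):
    # Fraction of baskets containing the whole itemset.
    return sum(itemset <= b for b in baskets) / len(baskets)

def confidence(lhs, rhs):
    # P(rhs in basket | lhs in basket).
    return support(lhs | rhs) / support(lhs)

conf_milk_coke = confidence({"Milk"}, {"Coke"})          # 3 of 4 Milk baskets
conf_dm_beer = confidence({"Diaper", "Milk"}, {"Beer"})  # 2 of 3 {Diaper, Milk} baskets
```

Milk appears in four baskets, three of which also contain Coke, so {Milk} → {Coke} has confidence 0.75; similarly {Diaper, Milk} → {Beer} has confidence 2/3.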
Dimensionality Reduction Goal: Map high dimensional data onto lower-dimensional data in a manner that preserves distances/similarities Original Data (4 dims) Projection with PCA (2 dims)
Dimensionality Reduction Goal: Map high dimensional data onto lower-dimensional data in a manner that preserves distances/similarities Input: MNIST images. PCA (linear) vs. t-SNE (non-linear)
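PCA itself is a few lines of linear algebra: center the data and project onto the top right singular vectors. A sketch on synthetic 4-dimensional data (made up here; not the MNIST or Iris data):

```python
import numpy as np

# Synthetic 4-dim data with correlated dimensions.
rng = np.random.default_rng(2)
data = rng.normal(size=(100, 4))
data[:, 1] = 2 * data[:, 0] + 0.1 * data[:, 1]  # dim 1 tracks dim 0

# PCA: center, take the SVD, project onto the top-2 principal directions.
centered = data - data.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
projected = centered @ Vt[:2].T  # (100, 2) low-dimensional embedding
```

The first projected coordinate captures the most variance by construction, which is why a 2-dim PCA plot often reveals structure hidden in the original dimensions.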
Clustering Goal: Learn categories of examples (i.e. classification without labels) Iris Data (after PCA) Inferred Clusters
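A classic clustering method is k-means (Lloyd's algorithm): alternate between assigning points to the nearest center and recomputing centers as cluster means. A minimal sketch on two toy blobs (not the Iris data; initialization here deliberately picks one point from each region to keep the demo deterministic):

```python
import numpy as np

# Two well-separated toy blobs.
rng = np.random.default_rng(3)
points = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.2, size=(30, 2)),
    rng.normal(loc=[5.0, 5.0], scale=0.2, size=(30, 2)),
])

# Initialize with one point from each blob (a simplification for this demo).
centers = np.array([points[0], points[-1]])
for _ in range(10):
    # Assign each point to its nearest center, then recompute the means.
    dists = np.linalg.norm(points[:, None] - centers[None, :], axis=2)
    labels = dists.argmin(axis=1)
    centers = np.array([points[labels == k].mean(axis=0) for k in range(2)])
```

Real k-means implementations add random restarts or k-means++ initialization, since a bad start can land in a poor local optimum.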
Hidden Markov Models Goal: Learn categories of time points (i.e. clustering of points within time series) Sequence of States Time Series
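The basic HMM computation is the forward algorithm: sum the probability of an observation sequence over all hidden state paths, one time step at a time. A sketch for a hypothetical 2-state model (all matrices made up for illustration):

```python
import numpy as np

pi = np.array([0.5, 0.5])        # initial state distribution
A = np.array([[0.9, 0.1],        # "sticky" state transition probabilities
              [0.1, 0.9]])
B = np.array([[0.8, 0.2],        # emission probabilities P(obs | state)
              [0.2, 0.8]])

obs = [0, 0, 1, 1]               # observed symbol sequence

# Forward recursion: alpha[s] = P(obs so far, current state = s).
alpha = pi * B[:, obs[0]]
for o in obs[1:]:
    alpha = (alpha @ A) * B[:, o]
seq_prob = alpha.sum()           # P(observation sequence)
```

The recursion costs O(T·K²) for T time steps and K states, versus O(K^T) for summing over paths explicitly; the same machinery underlies the Viterbi decoding used to infer the state sequence.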
Topic Models Goal: Learn topics (categories of words) and quantify topic frequency for each document. Example topics (word: probability): gene 0.04, dna 0.02, genetic 0.01, … • life 0.02, evolve 0.01, organism 0.01, … • brain 0.04, neuron 0.02, nerve 0.01, … • data 0.02, number 0.02, computer 0.01, … Each document gets topic proportions and per-word topic assignments.
Community Detection Goal: Identify groups of connected nodes (i.e. clustering on graphs)
Community Detection Goal: Identify groups of connected nodes (i.e. clustering on graphs) Nodes: Football Teams, Edges: Matches, Communities: Conferences
Link Analysis Goal: Predict which website is the most authoritative. • Pages with more inbound links are more important • Inbound links from important pages carry more weight (adapted from: Mining of Massive Datasets, http://www.mmds.org)
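Both intuitions are captured by PageRank: a page's score is the sum of its in-neighbors' scores divided by their out-degrees, plus a small "teleport" term. A power-iteration sketch on a hypothetical four-page graph:

```python
# Hypothetical link graph: page -> pages it links to.
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}
pages = list(links)
damping = 0.85
rank = {p: 1.0 / len(pages) for p in pages}

# Power iteration: repeatedly redistribute rank along out-links.
for _ in range(50):
    new = {p: (1 - damping) / len(pages) for p in pages}
    for p, outs in links.items():
        for q in outs:
            new[q] += damping * rank[p] / len(outs)
    rank = new
```

Page C, with inbound links from A, B, and D, ends up with the highest rank; D, with no inbound links, gets only the teleport mass.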
Reinforcement Learning Goal: Take actions that maximize future reward . Example: Google DeepMind plays Atari Action: Joystick direction / Buttons. Reward: Score.
Reinforcement Learning Goal: Take action that maximizes future reward . Example: Netflix Website Design Action: Which movies to show. Reward: User Retention.
Recommender Systems Goal: Predict user preferences for unseen items. Methods: Supervised learning (predict ratings), Reinforcement learning (rating is reward), Unsupervised learning (e.g. community detection on users / items)
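One common supervised approach is item-item collaborative filtering: predict a user's missing rating as a similarity-weighted average of their ratings for other items. A sketch on a hypothetical 4-user, 4-item rating matrix (0 marks an unrated item):

```python
import numpy as np

# Hypothetical ratings: rows are users, columns are items, 0 = unrated.
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Predict user 0's rating for item 2 from the items user 0 has rated,
# weighted by cosine similarity between item columns.
target = 2
sims = np.array([cosine(ratings[:, target], ratings[:, j]) for j in range(4)])
rated = [j for j in range(4) if j != target and ratings[0, j] > 0]
pred = sum(sims[j] * ratings[0, j] for j in rated) / sum(sims[j] for j in rated)
```

Item 3 is the most similar to item 2, and user 0 rated item 3 low, so the prediction comes out low; production systems typically use matrix factorization instead, but the weighted-average idea is the same.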
Theme: Optimization of Objectives Common theme in Machine Learning : Using data-driven algorithms to make predictions that are optimal according to some objective . Supervised Learning: Minimize regression or classification loss Unsupervised Learning: Maximize expected probability of data Reinforcement Learning: Maximize expected reward
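Almost all of these objectives are optimized the same way: follow the gradient. A minimal illustration on a one-dimensional squared loss L(w) = (w − 3)², where the target value 3 is arbitrary:

```python
# Gradient descent on L(w) = (w - 3)^2.
w = 0.0
lr = 0.1
for _ in range(100):
    grad = 2 * (w - 3.0)  # dL/dw
    w -= lr * grad        # step downhill
```

Each step shrinks the error by a constant factor (here 1 − 2·lr = 0.8), so `w` converges to the minimizer 3.0; the same loop, with vector-valued gradients, drives most of the methods in this course.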
Syllabus https://course.ccs.neu.edu/ds5230f18/
Homework Problems • 4 problem sets, 40% of grade • Python encouraged, but can use any language (within reason – TA must be able to run your code) • Discussion is encouraged, but submissions must be completed individually (absolutely no sharing of code ) • Submission via zip file on Blackboard by 11:59pm on day of deadline (no late submissions) • Please follow submission guidelines.
Project Goals • Select a dataset / prediction problem • Perform exploratory analysis and pre-processing • Apply one or more algorithms • Critically evaluate results • Submit a report and present project
Project Deadlines • 21 Sep : Form teams of 2-4 people • 24 Oct : Submit abstract (1 paragraph) • 5 Nov : Milestone 1 (exploratory analysis) • 19 Nov : Milestone 2 (statistical analysis) • 28 Nov : Project Presentations • 7 Dec : Project Reports Due
Grading • Homework: 40% • Midterm: 15% • Final: 15% • Project: 30%
Participation 1. Read the materials 2. Attend the Lectures 3. Ask questions 4. Come to office hours 5. Help Others Class participation is used to adjust grade upwards (at the discretion of the instructor)
Textbooks Data Mining: Leskovec, Rajaraman, Ullman (Mining of Massive Datasets; PDF freely available) • Aggarwal (available on campus network). Machine Learning: Bishop • Murphy. Statistics: Hastie, Tibshirani, Friedman (PDF freely available). Recommend you buy one.