CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu
TAs: Bahman Bahmani, Juthika Dabholkar, Pierre Kreitmann, Lu Li, Aditya Ramesh. Office hours: Jure: Tuesdays 9-10am, Gates 418. See course website for TA office hours. 1/8/2012 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu
Course website: http://cs246.stanford.edu. Lecture slides (posted at least 6h before the lecture); announcements, homeworks, solutions; readings! Readings: the book Mining of Massive Datasets by Anand Rajaraman and Jeffrey D. Ullman. Free online: http://i.stanford.edu/~ullman/mmds.html
4 longer homeworks: 40%. Theoretical and programming questions. All homeworks (even if empty) must be handed in. Assignments take time. Start early! How to submit? Paper: box outside the class and in the Gates east wing. We will grade on paper! You should also submit an electronic copy: 1 PDF/ZIP file (writeups, experimental results, code). Submission website: http://cs246.stanford.edu/submit/. SCPD: only submit an electronic copy and send us email. 7 late days for the quarter; max 5 late days per assignment.
Short weekly quizzes: 20%. Short e-quizzes on Gradiance (see course website!). The first quiz is already online. You have 7 days to complete it. No late days! Final exam: 40%. March 19 at 8:30am. It’s going to be fun and hard work.
Homework schedule:
Date   Out   In
1/11   HW1
1/25   HW2   HW1
2/8    HW3   HW2
2/22   HW4   HW3
3/7          HW4
No class: 1/16 (Martin Luther King Jr. Day), 2/20 (Presidents' Day)
Recitation sessions: Review of probability and statistics. Installing and working with Hadoop. We prepared a virtual machine with Hadoop preinstalled. HW0 helps you write your first Hadoop program. See course website! We will announce the dates later. Sessions will be recorded.
Algorithms (CS161): dynamic programming, basic data structures. Basic probability (CS109 or Stat116): moments, typical distributions, MLE, … Programming (CS107 or CS145): your choice, but C++/Java will be very useful. We provide some background, but the class will be fast paced.
CS345a: Data mining got split into 2 courses. CS246: Mining massive datasets: methods/algorithms-oriented course; homeworks (theory & programming); no class project. CS341: Project in mining massive datasets: project-oriented class; lectures/readings related to the project; unlimited access to Amazon EC2 cluster; we intend to keep the class small; taking CS246 is basically a prerequisite.
For questions/clarifications use Piazza! If you don’t have an @stanford.edu email address, email us and we will register you. To communicate with the course staff use cs246-win1112-staff@lists.stanford.edu. We will post announcements to cs246-win1112-all@lists.stanford.edu. If you are not registered or are auditing, send us email and we will subscribe you! You are welcome to sit in and audit the class: send us email saying that you will be auditing.
Much of the course will be devoted to ways to do data mining on the Web. Mining to discover things about the Web, e.g., PageRank, finding spam sites. Mining data from the Web itself, e.g., analysis of click streams, similar products at Amazon, making recommendations.
Much of the course will be devoted to large-scale computing for data mining. Challenges: How to distribute computation? Distributed/parallel programming is hard. Map-reduce addresses all of the above: Google’s computational/data manipulation model; an elegant way to work with big data.
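To make the map-reduce model concrete, here is a minimal single-process sketch of word counting in the map / group-by-key / reduce style. It only illustrates the programming model; it does not use Hadoop or any real Map-reduce API, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group-by-key: collect all values emitted for the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single count."""
    return key, sum(values)


def word_count(documents):
    """Run the three phases in one process (hypothetical driver)."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data is big", "data mining"]))
```

In a real system the map and reduce calls would run on different machines and the grouping step would move data over the network; the programmer still writes only the two small functions.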
High-dimensional data: Locality Sensitive Hashing; Dimensionality reduction; Clustering. The data is a graph: Link Analysis (PageRank, Hubs & Authorities). Machine Learning: k-NN, Perceptron, SVM, Decision Trees. Data is infinite: Mining data streams.
Applications: Association Rules; Recommender systems; Advertising on the Web; Web spam detection.
Discovery of patterns and models that are: Valid: hold on new data with some certainty. Useful: should be possible to act on the item. Unexpected: non-obvious to the system. Understandable: humans should be able to interpret the pattern. Subsidiary issues: Data cleansing: detection of bogus data. Visualization: something better than MBs of output. Warehousing of data (for retrieval).
Predictive Methods: use some variables to predict unknown or future values of other variables. Descriptive Methods: find human-interpretable patterns that describe the data.
Scalability; Dimensionality; Complex and Heterogeneous Data; Data Quality; Data Ownership and Distribution; Privacy Preservation; Streaming Data.
Overlaps with: Databases: large-scale (non-main-memory) data. Machine learning: complex methods, small data. Statistics: models. Different cultures: To a DB person, data mining is an extreme form of analytic processing: queries that examine large amounts of data. Result is the query answer. To a statistician, data mining is the inference of models. Result is the parameters of the model. [Figure: Data Mining at the overlap of Statistics/AI, Machine Learning/Pattern Recognition, and Database systems]
A big data-mining risk is that you will “discover” patterns that are meaningless. Bonferroni’s principle: (roughly) if you look in more places for interesting patterns than your amount of data will support, you are bound to find crap.
Joseph Rhine was a parapsychologist in the 1950s who hypothesized that some people had Extra-Sensory Perception. He devised an experiment where subjects were asked to guess 10 hidden cards, each red or blue. He discovered that almost 1 in 1000 had ESP: they were able to get all 10 right!
He told these people they had ESP and called them in for another test of the same type. Alas, he discovered that almost all of them had lost their ESP. What did he conclude? He concluded that you shouldn’t tell people they have ESP; it causes them to lose it.
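The arithmetic behind the Rhine story can be checked directly. The sketch below assumes each red/blue guess is an independent fair coin flip, which is an assumption added here and not something the slides state; the function names are hypothetical.

```python
from fractions import Fraction


def p_all_correct(n_cards=10):
    """Chance of guessing n red/blue cards right by luck alone.

    Assumes every guess is an independent 50/50 coin flip.
    """
    return Fraction(1, 2) ** n_cards


def expected_lucky(n_subjects, n_cards=10):
    """Expected number of subjects who get every card right by chance."""
    return n_subjects * p_all_correct(n_cards)


if __name__ == "__main__":
    p = p_all_correct()
    print("P(all 10 right by chance):", p)  # 1/1024
    print("Expected 'ESP' subjects per 1000:", float(expected_lucky(1000)))
    # On a retest, each lucky subject again succeeds with probability p.
    print("Expected to pass twice per 1000:", float(expected_lucky(1000) * p))
```

Under that assumption pure chance already yields about 1 perfect score per 1,000 subjects, and almost nobody repeats it, which is what Rhine observed.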
[Figure: single-node architecture with CPU, Memory, and Disk; diagram labels: “Machine Learning, Statistics” and “Classical” Data Mining]
20+ billion web pages x 20KB = 400+ TB. 1 computer reads 30-35 MB/sec from disk: ~4 months to read the web. ~1,000 hard drives to store the web. Takes even more to do something useful with the data! Standard architecture is emerging: cluster of commodity Linux nodes; Gigabit ethernet interconnect.
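The back-of-the-envelope numbers above can be reproduced in a few lines. This sketch uses decimal units (1 KB = 10^3 bytes, 1 TB = 10^12 bytes), which is an assumption; the drive capacity is not given on the slide and is only inferred from the "~1,000 hard drives" figure.

```python
SECONDS_PER_DAY = 86_400

pages = 20 * 10**9            # 20 billion web pages
bytes_per_page = 20 * 10**3   # 20 KB per page (decimal units assumed)
total_bytes = pages * bytes_per_page


def days_to_read(n_bytes, mb_per_sec):
    """Days one machine needs to read n_bytes sequentially from disk."""
    return n_bytes / (mb_per_sec * 10**6) / SECONDS_PER_DAY


# Capacity per drive implied by spreading the data over 1,000 drives.
implied_drive_gb = total_bytes / 1000 / 10**9

if __name__ == "__main__":
    print(f"Total size: {total_bytes / 10**12:.0f} TB")
    for rate in (30, 35):
        print(f"At {rate} MB/s: {days_to_read(total_bytes, rate):.0f} days")
    print(f"Implied drive size: {implied_drive_gb:.0f} GB")
```

At 35 MB/sec the read takes a bit over four months, and closer to five at 30 MB/sec, consistent with the "~4 months" estimate.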
[Figure: cluster architecture; racks of nodes, each node with CPU, Mem, and Disk, one switch per rack, and a backbone switch between racks] 1 Gbps between any pair of nodes in a rack; 2-10 Gbps backbone between racks. Each rack contains 16-64 nodes. In Aug 2006 Google had ~450,000 machines.
Large-scale computing for data mining problems on commodity hardware. Challenges: How do you distribute computation? How can we make it easy to write distributed programs? Machines fail: one server may stay up 3 years (1,000 days); if you have 1,000 servers, expect to lose 1 per day. In Aug 2006 Google had ~450,000 machines.
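The failure estimate follows from a simple model. The sketch below assumes each server fails independently with probability 1/1,000 per day (one failure per 1,000 days of uptime); that model, the extrapolation to 450,000 machines, and the function names are assumptions added for illustration.

```python
MEAN_UPTIME_DAYS = 1_000  # "one server may stay up 3 years (1,000 days)"


def expected_failures_per_day(n_servers):
    """Expected daily failures if each server fails once per 1,000 days."""
    return n_servers / MEAN_UPTIME_DAYS


def p_at_least_one_failure(n_servers):
    """Chance that at least one server fails on a given day.

    Assumes independent failures with daily probability 1/MEAN_UPTIME_DAYS.
    """
    p_survive = 1 - 1 / MEAN_UPTIME_DAYS
    return 1 - p_survive ** n_servers


if __name__ == "__main__":
    for n in (1_000, 450_000):
        print(n, "servers:", expected_failures_per_day(n), "failures/day")
    print("P(at least one failure today, 1,000 servers):",
          round(p_at_least_one_failure(1_000), 2))
```

Under the same model a fleet of 450,000 machines would see hundreds of failures every day, so failure handling has to be part of the programming model and not an afterthought.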