

  1. CS246: Mining Massive Datasets Jure Leskovec, Stanford University

  2. TAs: Bahman Bahmani, Juthika Dabholkar, Pierre Kreitmann, Lu Li, Aditya Ramesh. Office hours: Jure: Tuesdays 9-10am, Gates 418. See course website for TA office hours. 1/8/2012 Jure Leskovec, Stanford CS246: Mining Massive Datasets, 2

  3. Course website: lecture slides (posted at least 6h before the lecture); announcements, homeworks, solutions; readings! Readings: the book Mining of Massive Datasets by Anand Rajaraman and Jeffrey D. Ullman, available free online.

  4. 4 longer homeworks: 40%. Theoretical and programming questions. All homeworks (even if empty) must be handed in. Assignments take time. Start early! How to submit? Paper: box outside the class and in the Gates east wing. We will grade on paper! You should also submit an electronic copy via the submission website: 1 PDF/ZIP file (writeups, experimental results, code). SCPD: only submit the electronic copy and send us email. 7 late days for the quarter; max 5 late days per assignment.

  5. Short weekly quizzes: 20%. Short e-quizzes on Gradiance (see course website!). The first quiz is already online. You have 7 days to complete it. No late days! Final exam: 40%. March 19 at 8:30am. It's going to be fun and hard work.

  6. Homework schedule (date: out / in): 1/11: HW1 out. 1/25: HW2 out, HW1 in. 2/8: HW3 out, HW2 in. 2/22: HW4 out, HW3 in. 3/7: HW4 in. No class: 1/16 (Martin Luther King Jr. Day), 2/20 (Presidents' Day).

  7. Recitation sessions: review of probability and statistics; installing and working with Hadoop. We prepared a virtual machine with Hadoop preinstalled. HW0 helps you write your first Hadoop program. See course website! We will announce the dates later. Sessions will be recorded.

  8. Algorithms (CS161): dynamic programming, basic data structures. Basic probability (CS109 or Stat116): moments, typical distributions, MLE, … Programming (CS107 or CS145): your choice, but C++/Java will be very useful. We provide some background, but the class will be fast paced.

  9. CS345a: Data mining got split into 2 courses. CS246: Mining massive datasets: methods/algorithms oriented course; homeworks (theory & programming); no class project. CS341: Project in mining massive datasets: project oriented class; lectures/readings related to the project; unlimited access to Amazon EC2 cluster; we intend to keep the class small; taking CS246 is basically a prerequisite.

  10. For questions/clarifications use Piazza! If you don't have email address, email us and we will register you. To communicate with the course staff, use the staff email address. We will post announcements to the class mailing list. If you are not registered or are auditing, send us email and we will subscribe you! You are welcome to sit in and audit the class. Send us email saying that you will be auditing.

  11. Much of the course will be devoted to ways of data mining on the Web. Mining to discover things about the Web: e.g., PageRank, finding spam sites. Mining data from the Web itself: e.g., analysis of click streams, similar products at Amazon, making recommendations.

  12. Much of the course will be devoted to large-scale computing for data mining. Challenges: How to distribute computation? Distributed/parallel programming is hard. Map-reduce addresses all of the above: Google's computational/data manipulation model; an elegant way to work with big data.
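
To make the map-reduce model mentioned above concrete, here is a minimal single-process sketch in Python. Word counting is an illustrative task chosen here, not one taken from the slides, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical. A real framework such as Hadoop would run the map and reduce steps on many machines and perform the grouping itself.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group emitted values by key (done by the framework in a real system)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence over a list of documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


print(word_count(["big data is big", "data mining"]))
```

The programmer writes only the map and reduce functions; distribution and grouping are left to the framework, which is why the model is attractive for large data.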

  13. High-dimensional data: Locality Sensitive Hashing, dimensionality reduction, clustering. The data is a graph: link analysis (PageRank, Hubs & Authorities). Machine learning: k-NN, Perceptron, SVM, Decision Trees. Data is infinite: mining data streams.

  14. Applications: association rules, recommender systems, advertising on the Web, Web spam detection.

  15. Discovery of patterns and models that are: Valid: hold on new data with some certainty. Useful: should be possible to act on the item. Unexpected: non-obvious to the system. Understandable: humans should be able to interpret the pattern. Subsidiary issues: Data cleansing: detection of bogus data. Visualization: something better than MBs of output. Warehousing of data (for retrieval).

  16. Predictive methods: use some variables to predict unknown or future values of other variables. Descriptive methods: find human-interpretable patterns that describe the data.

  17. Scalability; dimensionality; complex and heterogeneous data; data quality; data ownership and distribution; privacy preservation; streaming data.

  18. Overlaps with: Databases: large-scale (non-main-memory) data. Machine learning: complex methods, small data. Statistics: models. Different cultures: To a DB person, data mining is an extreme form of analytic processing: queries that examine large amounts of data. The result is the query answer. To a statistician, data mining is the inference of models. The result is the parameters of the model. [Diagram labels: Statistics/AI; Machine Learning/Pattern Recognition; Database systems; Data Mining]

  19. A big data-mining risk is that you will "discover" patterns that are meaningless. Bonferroni's principle: (roughly) if you look in more places for interesting patterns than your amount of data will support, you are bound to find crap.

  20. Joseph Rhine was a parapsychologist in the 1950s who hypothesized that some people had Extra-Sensory Perception. He devised an experiment where subjects were asked to guess 10 hidden cards, each red or blue. He discovered that almost 1 in 1000 had ESP: they were able to get all 10 right!

  21. He told these people they had ESP and called them in for another test of the same type. Alas, he discovered that almost all of them had lost their ESP. What did he conclude? He concluded that you shouldn't tell people they have ESP; it causes them to lose it.
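
The "almost 1 in 1000" figure in the Rhine story is what pure guessing predicts, which is the point of Bonferroni's principle. The short Python sketch below works through the arithmetic, assuming each card is an independent 50/50 guess; the function names and the subject count of 1,024 are illustrative assumptions, not values from the slides.

```python
def p_all_correct(num_cards=10, num_colors=2):
    """Probability of guessing every card correctly by pure chance."""
    return (1 / num_colors) ** num_cards


def expected_lucky(num_subjects, num_cards=10):
    """Expected number of subjects who get every card right by chance."""
    return num_subjects * p_all_correct(num_cards)


p = p_all_correct()
print(f"Chance of 10 correct guesses: 1 in {round(1 / p)}")
# Hypothetical pool of 1,024 subjects: about one "psychic" is expected.
print(f"Expected lucky subjects among 1024: {expected_lucky(1024)}")
# A lucky guesser has the same small chance of repeating on a retest.
print(f"Chance a lucky guesser repeats: {p:.4%}")
```

Under these assumptions the chance is 1 in 1,024, so the "psychics" are the expected number of lucky guessers, and almost none of them should pass a second test.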

  22. [Diagram: CPU, Memory, Disk; labels: Machine Learning, Statistics; "Classical" Data Mining]

  23. 20+ billion web pages x 20KB = 400+ TB. 1 computer reads 30-35 MB/sec from disk: ~4 months to read the web; ~1,000 hard drives to store the web. It takes even more to do something useful with the data! A standard architecture is emerging: cluster of commodity Linux nodes; gigabit ethernet interconnect.
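
The back-of-the-envelope numbers above can be checked with a few lines of Python. This is a rough sketch that assumes decimal units (1 KB = 1,000 bytes, 1 MB = 10^6 bytes) and a single machine reading sequentially; the per-drive capacity is derived from the slide's figures rather than stated in them.

```python
PAGES = 20 * 10**9       # 20 billion web pages
PAGE_BYTES = 20 * 10**3  # 20 KB per page (decimal units assumed)
TOTAL_BYTES = PAGES * PAGE_BYTES


def days_to_read(rate_mb_per_sec):
    """Days one machine needs to read the whole web at a given disk rate."""
    seconds = TOTAL_BYTES / (rate_mb_per_sec * 10**6)
    return seconds / 86_400


print(f"Total size: {TOTAL_BYTES / 10**12:.0f} TB")
for rate in (30, 35):
    days = days_to_read(rate)
    print(f"At {rate} MB/s: {days:.0f} days (~{days / 30:.1f} months)")
# Implied capacity per drive if ~1,000 drives hold the web (derived value).
print(f"Implied drive size: {TOTAL_BYTES / 1000 / 10**9:.0f} GB")
```

With these assumptions the read time comes out at roughly 4 to 5 months, consistent with the slide's rounded "~4 months".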

  24. [Diagram: cluster architecture. Each rack contains 16-64 nodes, each with CPU, Mem, Disk. 1 Gbps between any pair of nodes in a rack (via a switch); 2-10 Gbps backbone between racks (switch to switch).] In Aug 2006 Google had ~450,000 machines.

  25. Large-scale computing for data mining problems on commodity hardware. Challenges: How do you distribute computation? How can we make it easy to write distributed programs? Machines fail: one server may stay up 3 years (1,000 days). If you have 1,000 servers, expect to lose 1/day. In Aug 2006 Google had ~450,000 machines.
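
The failure estimate above is a simple expectation: if each server fails on average once every 1,000 days, a fleet of N servers sees about N/1,000 failures per day. The Python sketch below assumes independent failures at a constant rate; the function name is hypothetical.

```python
def expected_failures_per_day(num_servers, mean_uptime_days=1000):
    """Expected daily failures, assuming independent, constant-rate failures."""
    return num_servers / mean_uptime_days


for fleet in (1, 1_000, 450_000):
    rate = expected_failures_per_day(fleet)
    print(f"{fleet:>7} servers -> {rate:g} expected failures/day")
```

At the ~450,000 machines cited for Google in 2006, this simple model gives on the order of 450 failures per day, which is why failure handling has to be built into the computing model.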

