
CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu

Data mining overlaps with machine learning, statistics, artificial intelligence, databases, and visualization, but with more stress on:
- scalability in the number of features and instances
- algorithms and architectures
- automation for handling large data

[Figure: Data Mining at the intersection of Statistics/AI, Machine Learning/Pattern Recognition, and Database systems]

3/9/2011 Jure Leskovec, Stanford CS246: Mining Massive Datasets

 MapReduce  Association Rules  Finding Similar Items  Locality Sensitive Hashing  Dim. Reduction (SVD, CUR))  Clustering  Recommender systems  PageRank and TrustRank  Machine Learning: kNN, SVM, Decision Trees  Mining data streams  Advertising on the Web 3/9/2011 Jure Leskovec, Stanford C246: Mining Massive Datasets 3


- Map (provided by the programmer): reads the input and produces a set of key-value pairs
- Group by key: collects all pairs with the same key
- Reduce (provided by the programmer): collects all values belonging to the key and outputs the result
- The data is read sequentially; only sequential reads are needed

Big document:
"The crew of the space shuttle Endeavor recently returned to Earth as ambassadors, harbingers of a new era of space exploration. Scientists at NASA are saying that the recent assembly of the Dextre bot is the first step in a long-term space-based man/machine partnership. "The work we're doing now -- the robotics we're doing -- is what we're going to need to do to build any work station or habitat structure on the moon or Mars," said Allard Beutel."

Map output (key, value): (the, 1) (crew, 1) (of, 1) (the, 1) (space, 1) (shuttle, 1) (Endeavor, 1) (recently, 1) …
Group by key (key, value): (crew, 1) (crew, 1); (space, 1); (the, 1) (the, 1) (the, 1); (shuttle, 1); (recently, 1); …
Reduce output (key, value): (crew, 2) (space, 1) (the, 3) (shuttle, 1) (recently, 1) …
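The three stages above can be sketched in a few lines of single-machine Python. This is only an illustration of the data flow, not a distributed implementation; the function names (`map_phase`, `group_by_key`, `reduce_phase`, `word_count`) and the short input string are assumptions made for the example.

```python
from collections import defaultdict


def map_phase(document):
    """Map: read the input sequentially and emit one (word, 1) pair per token."""
    for word in document.split():
        yield (word, 1)


def group_by_key(pairs):
    """Group by key: collect all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: combine all values belonging to a key into one output value."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(document):
    """Chain the three stages: map, group by key, reduce."""
    return reduce_phase(group_by_key(map_phase(document)))


# Hypothetical miniature input, not the document from the slide.
counts = word_count("the crew of the space shuttle the crew")
```

In a real MapReduce system the grouping step is done by the framework across machines; only the map and reduce functions are written by the programmer.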

- High-dimensional data: Locality Sensitive Hashing, Dimensionality reduction, Clustering
- The data is a graph: Link Analysis (PageRank, TrustRank, Hubs & Authorities)
- Machine Learning: kNN, Perceptron, SVM, Decision Trees
- Data is infinite: Mining data streams, Advertising on the Web
- Applications: Association Rules, Recommender systems

- Many problems can be expressed as finding "similar" sets:
  - Find near-neighbors in high-dimensional space
- Distance metrics:
  - Points in ℜ^n: L1 (Manhattan) distance, L2 (Euclidean) distance
  - Vectors: Cosine similarity
  - Sets of items: Jaccard similarity, Hamming distance
- Problem:
  - Find near-duplicate documents
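For reference, the measures listed above can be written down directly. This is a minimal sketch using the textbook definitions; the function names are chosen for the example, and the convention that two empty sets have Jaccard similarity 1.0 is an assumption.

```python
import math


def l1_distance(x, y):
    """L1 (Manhattan) distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))


def l2_distance(x, y):
    """L2 (Euclidean) distance: square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def cosine_similarity(x, y):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def jaccard_similarity(a, b):
    """Size of the intersection divided by the size of the union of two sets."""
    union = a | b
    if not union:
        return 1.0  # assumed convention for two empty sets
    return len(a & b) / len(union)


def hamming_distance(x, y):
    """Number of positions at which two equal-length sequences differ."""
    return sum(1 for a, b in zip(x, y) if a != b)
```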

Document → Shingling → Minhashing → Locality-sensitive Hashing → Candidate pairs
- Shingling output: the set of strings of length k that appear in the document
- Signatures: short integer vectors that represent the sets, and reflect their similarity
- Candidate pairs: those pairs of signatures that we need to test for similarity

1. Shingling: convert docs to sets
2. Minhashing: convert large sets to short signatures, while preserving similarity
3. Locality-sensitive hashing: focus on pairs of signatures likely to be similar

- Shingling: convert docs to sets of items
  - Shingle: a sequence of k tokens that appears in the doc
  - Example: k = 2; D_1 = abcab, 2-shingles: S(D_1) = {ab, bc, ca}
  - Represent a doc by the set of hashes of its shingles
- MinHashing: convert large sets to short signatures, while preserving similarity
  - Similarity-preserving hash function h() such that: $\Pr[h_\pi(S(D_1)) = h_\pi(S(D_2))] = \mathrm{Sim}(S(D_1), S(D_2))$
  - For Jaccard similarity, use a permutation of the rows and take the index of the first 1 in each column
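The shingling step can be sketched as follows, treating each character as a token as in the abcab example. The helper names and the use of `zlib.crc32` as the shingle hash are assumptions made for illustration; any stable hash function would serve.

```python
import zlib


def shingles(doc, k):
    """Return the set of all length-k substrings (k-shingles) of doc."""
    return {doc[i:i + k] for i in range(len(doc) - k + 1)}


def hashed_shingles(doc, k):
    """Represent a doc by the set of hashes of its shingles.

    zlib.crc32 is a stand-in hash chosen for this sketch.
    """
    return {zlib.crc32(s.encode("utf-8")) for s in shingles(doc, k)}


example = shingles("abcab", 2)  # the 2-shingles of D_1 = abcab
```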

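Minhashing example: three row permutations, the input matrix (columns 1-4), and the resulting signature matrix M.

Permutations | Input matrix
1  4  3      | 1 0 1 0
3  2  4      | 1 0 0 1
7  1  7      | 0 1 0 1
6  3  6      | 0 1 0 1
2  6  1      | 0 1 0 1
5  7  2      | 1 0 1 0
4  5  5      | 1 0 1 0

Signature matrix M (one row per permutation; the rows come from the third, second, and first permutation respectively):
2 1 2 1
2 1 4 1
1 2 1 2

Similarities:
          1-3    2-4    1-2    3-4
Col/Col   0.75   0.75   0      0
Sig/Sig   0.67   1.00   0      0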
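The example can be checked mechanically. The sketch below uses the input matrix and the three row permutations as read off the slide (the ordering of the permutations is a reconstruction from the flattened text) and computes each signature entry as the smallest permuted rank among the rows where the column has a 1. Function and variable names are assumptions for the example.

```python
# Rows of the input matrix; columns are the four sets C1..C4.
INPUT_MATRIX = [
    [1, 0, 1, 0],
    [1, 0, 0, 1],
    [0, 1, 0, 1],
    [0, 1, 0, 1],
    [0, 1, 0, 1],
    [1, 0, 1, 0],
    [1, 0, 1, 0],
]

# Each permutation gives the rank (1..7) assigned to every row.
PERMUTATIONS = [
    [3, 4, 7, 6, 1, 2, 5],
    [4, 2, 1, 3, 6, 7, 5],
    [1, 3, 7, 6, 2, 5, 4],
]


def minhash_signature(matrix, permutations):
    """One signature row per permutation: min rank over rows holding a 1.

    Assumes every column contains at least one 1.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    return [
        [min(perm[r] for r in range(n_rows) if matrix[r][c]) for c in range(n_cols)]
        for perm in permutations
    ]


def column_similarity(matrix, i, j):
    """Jaccard similarity of two columns of the 0/1 input matrix."""
    both = sum(1 for row in matrix if row[i] and row[j])
    either = sum(1 for row in matrix if row[i] or row[j])
    return both / either


def signature_similarity(signature, i, j):
    """Fraction of signature rows in which two columns agree."""
    return sum(1 for row in signature if row[i] == row[j]) / len(signature)


signature = minhash_signature(INPUT_MATRIX, PERMUTATIONS)
```

With only three permutations the signature similarity is a coarse estimate of the column similarity (0.67 against 0.75 for columns 1 and 3); more permutations tighten the estimate.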

- Hash columns of the signature matrix M: similar columns are likely to hash to the same bucket
- Columns x and y are a candidate pair if M(i, x) = M(i, y) for at least a fraction s of the values of i
- Divide matrix M into b bands of r rows
- If Sim(C_1, C_2) = s, the probability that at least one band is identical is $1 - (1 - s^r)^b$
- Given s, tune r and b to get almost all pairs with similar signatures, but eliminate most pairs that do not have similar signatures

[Figure: matrix M divided into b bands of r rows, each band hashed to buckets]

Probability of sharing a bucket for b = 20, r = 5:

s      1 - (1 - s^r)^b
0.2    0.006
0.3    0.047
0.4    0.186
0.5    0.470
0.6    0.802
0.7    0.975
0.8    0.9996
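The probability formula and the banding step can both be sketched briefly. `candidate_probability` evaluates 1 - (1 - s^r)^b; `lsh_candidates` is a simplified banding routine in which the tuple of r values in a band acts as the bucket key (an assumption for the sketch: a real implementation would hash each band into a fixed number of buckets). All names and the toy signatures are made up for the example.

```python
from collections import defaultdict
from itertools import combinations


def candidate_probability(s, r, b):
    """Probability that two columns with similarity s agree in at least one band."""
    return 1 - (1 - s ** r) ** b


def lsh_candidates(signatures, b, r):
    """Return pairs of column names that fall into the same bucket in some band.

    signatures maps a column name to a signature of length b * r.
    """
    candidates = set()
    for band in range(b):
        buckets = defaultdict(list)
        for name, sig in signatures.items():
            # The band's r values are used directly as the bucket key.
            buckets[tuple(sig[band * r:(band + 1) * r])].append(name)
        for names in buckets.values():
            candidates.update(combinations(sorted(names), 2))
    return candidates


# Hypothetical toy signatures with b = 2 bands of r = 2 rows.
toy = {"A": [1, 2, 3, 4], "B": [1, 2, 9, 9], "C": [7, 8, 9, 9]}
pairs = lsh_candidates(toy, b=2, r=2)
```

Increasing r makes each band harder to match (fewer false positives), while increasing b gives more chances to match (fewer false negatives).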

[Figure: the m × n matrix A is approximated by the product of U, Σ and V^T]

$A_{[m \times n]} \approx U \, \Sigma \, V^T$

A = U Σ V^T example (rows are users, columns are movies):

$$
\underbrace{\begin{bmatrix}
1 & 1 & 1 & 0 & 0 \\
2 & 2 & 2 & 0 & 0 \\
1 & 1 & 1 & 0 & 0 \\
5 & 5 & 5 & 0 & 0 \\
0 & 0 & 0 & 2 & 2 \\
0 & 0 & 0 & 3 & 3 \\
0 & 0 & 0 & 1 & 1
\end{bmatrix}}_{A}
=
\underbrace{\begin{bmatrix}
0.18 & 0 \\
0.36 & 0 \\
0.18 & 0 \\
0.90 & 0 \\
0 & 0.53 \\
0 & 0.80 \\
0 & 0.27
\end{bmatrix}}_{U}
\times
\underbrace{\begin{bmatrix}
9.64 & 0 \\
0 & 5.29
\end{bmatrix}}_{\Sigma}
\times
\underbrace{\begin{bmatrix}
0.58 & 0.58 & 0.58 & 0 & 0 \\
0 & 0 & 0 & 0.71 & 0.71
\end{bmatrix}}_{V^T}
$$

- Columns of A, left to right: Matrix, Alien, Serenity, Casablanca, Amelie
- Rows 1-4 are SciFi users; rows 5-7 are Romance users
- U is the user-to-concept similarity matrix (first column: SciFi-concept, second column: Romance-concept)
- The diagonal of Σ gives the 'strength' of each concept (9.64 is the strength of the SciFi-concept)
- V is the movie-to-concept similarity matrix (first row of V^T: SciFi-concept)
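As a sanity check on the example, the three factors can be multiplied back together. The sketch below uses the two-decimal values printed on the slides, so the product only matches the ratings matrix after rounding; the names `U`, `SIGMA`, `VT` and `reconstruct` are chosen for the example.

```python
# Factor values as printed on the slide (rounded to two decimals).
U = [
    [0.18, 0], [0.36, 0], [0.18, 0], [0.90, 0],
    [0, 0.53], [0, 0.80], [0, 0.27],
]
SIGMA = [9.64, 5.29]  # diagonal of the singular-value matrix
VT = [
    [0.58, 0.58, 0.58, 0, 0],
    [0, 0, 0, 0.71, 0.71],
]


def reconstruct(u, sigma, vt):
    """Compute U * diag(sigma) * V^T entry by entry."""
    return [
        [sum(u_row[k] * sigma[k] * vt[k][j] for k in range(len(sigma)))
         for j in range(len(vt[0]))]
        for u_row in u
    ]


A_approx = reconstruct(U, SIGMA, VT)
# Rounding absorbs the error introduced by the two-decimal factors.
A_rounded = [[round(x) for x in row] for row in A_approx]
```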

- How to do dimensionality reduction:
  - Set small singular values to zero
- How to query?
  - Map the query vector into "concept space"
  - How? Compute q∙V
- Even though d and q do not share a movie, they are still similar

Example (columns: Matrix, Alien, Serenity, Casablanca, Amelie), with concept-space coordinates as shown on the slide (SciFi-concept first):
q = [5 0 0 0 0] → [0.58 0]
d = [0 4 5 0 0] → [1.16 0]
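The mapping q∙V can be sketched with the movie-to-concept values from the SVD example. The column order (Matrix, Alien, Serenity, Casablanca, Amelie) is an assumption, and the sketch multiplies the rating vectors exactly as printed, so it does not try to reproduce the coordinates shown on the slide, which may come from differently scaled vectors. What it does show is the slide's point: q and d share no movie, yet they point in the same direction in concept space.

```python
import math

# Movie-to-concept matrix V (rows: movies, columns: SciFi, Romance concepts).
V = [
    [0.58, 0], [0.58, 0], [0.58, 0],
    [0, 0.71], [0, 0.71],
]


def to_concept_space(ratings, v):
    """Map a ratings vector into concept space by computing ratings . V."""
    return [
        sum(ratings[i] * v[i][k] for i in range(len(ratings)))
        for k in range(len(v[0]))
    ]


def cosine_similarity(x, y):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.hypot(*x) * math.hypot(*y))


q = [5, 0, 0, 0, 0]  # rated only the first movie
d = [0, 4, 5, 0, 0]  # rated only the second and third movies

raw_similarity = cosine_similarity(q, d)
concept_similarity = cosine_similarity(to_concept_space(q, V), to_concept_space(d, V))
```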

- Hierarchical:
  - Agglomerative (bottom up):
    - Initially, each point is a cluster
    - Repeatedly combine the two "nearest" clusters into one
    - Represent a cluster by its centroid or clustroid
- Point Assignment:
  - Maintain a set of clusters
  - Points belong to the "nearest" cluster
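A minimal sketch of the agglomerative variant, assuming Euclidean points, centroid-to-centroid distance as the meaning of "nearest", and a target number of clusters as the stopping rule (the slide does not fix any of these). It recomputes centroids on every pass, which is simple but not efficient.

```python
import math
from itertools import combinations


def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def agglomerative(points, target_clusters):
    """Merge the two clusters with the nearest centroids until target_clusters remain."""
    clusters = [[tuple(p)] for p in points]  # initially, each point is a cluster
    while len(clusters) > target_clusters:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: math.dist(centroid(clusters[ij[0]]), centroid(clusters[ij[1]])),
        )
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters


# Hypothetical toy data: two well-separated groups.
result = agglomerative([(0, 0), (0, 1), (10, 10), (10, 11)], target_clusters=2)
```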

- k-means: initialize cluster centroids
- Iterate:
  - For each point, place it in the cluster whose current centroid it is nearest
  - Update the cluster centroids based on memberships

[Figure: clusters after the first round, with reassigned points marked]
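The iteration can be sketched as below. The initial centroids are passed in by the caller because the slide does not say how to choose them; the `max_iters` cap and the rule of keeping an empty cluster's old centroid are assumptions made for the sketch.

```python
import math


def kmeans(points, centroids, max_iters=100):
    """Alternate assignment and centroid update until the centroids stop moving."""
    centroids = [tuple(c) for c in centroids]
    clusters = [[] for _ in centroids]
    for _ in range(max_iters):
        # Assignment step: each point goes to the cluster with the nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: recompute each centroid from its current members.
        new_centroids = [
            tuple(sum(coords) / len(members) for coords in zip(*members))
            if members else centroids[i]  # keep the old centroid for an empty cluster
            for i, members in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters


# Hypothetical toy data with two obvious groups.
final_centroids, final_clusters = kmeans(
    [(0, 0), (0, 2), (10, 0), (10, 2)], centroids=[(0, 0), (10, 0)]
)
```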

- LSH:
  - Find somewhat similar pairs of items while avoiding O(N^2) comparisons
- Clustering:
  - Assign points into a prespecified number of clusters
  - Each point belongs to a single cluster
  - Summarize the cluster by a centroid (e.g., topic vector)
- SVD (dimensionality reduction):
  - Want to explore correlations in the data
  - Some dimensions may be irrelevant
  - Useful for visualization, removing noise from the data, detecting anomalies


- Rank nodes using link structure
- PageRank:
  - Link voting:
    - If page P with importance x has n out-links, each link gets x/n votes
    - Page R's importance is the sum of the votes on its in-links
  - Complications: spider traps, dead-ends
  - At each step, the random surfer has two options:
    - With probability β, follow a link at random
    - With probability 1-β, jump to some page uniformly at random
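The random-surfer model can be sketched as a power iteration over a small adjacency list. The value β = 0.8, the fixed iteration count, and the treatment of dead-ends (their rank is spread uniformly over all pages) are assumptions for the example rather than anything stated on the slide; every link target is assumed to appear as a key of `out_links`.

```python
def pagerank(out_links, beta=0.8, iterations=50):
    """Power iteration for PageRank with uniform random teleports."""
    nodes = list(out_links)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # With probability 1 - beta the surfer jumps to a uniformly random page.
        new_rank = {node: (1 - beta) / n for node in nodes}
        for node, targets in out_links.items():
            if targets:
                # With probability beta the surfer follows one out-link at random,
                # so each out-link carries an equal share of the page's rank.
                share = beta * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dead-end (assumed handling): spread its rank over all pages.
                for target in nodes:
                    new_rank[target] += beta * rank[node] / n
        rank = new_rank
    return rank


# Hypothetical three-page graph.
ranks = pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]})
```

The teleport term is what keeps spider traps from absorbing all of the rank.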

- TrustRank: topic-specific PageRank with a teleport set of "trusted" pages
- Spam mass of page p:
  - Fraction of the PageRank score r(p) coming from spam pages: $|r(p) - r^{+}(p)| / r(p)$
- SimRank: measure similarity between items
  - Uses a k-partite graph with k types of nodes
  - Example: picture nodes and tag nodes
  - Perform a random walk with restarts from node N, i.e., teleport set = {N}
  - The resulting probability distribution measures similarity to N
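Both TrustRank and the random walk with restarts come down to PageRank with a restricted teleport set, which can be sketched as follows. The names, β = 0.8, and the choice to return all rank not passed along links (the 1-β share plus anything lost at dead-ends) to the teleport set are assumptions for the example. `spam_mass` assumes that r⁺(p) is the score computed with the trusted pages as the teleport set.

```python
def personalized_pagerank(out_links, teleport_set, beta=0.8, iterations=50):
    """PageRank in which the surfer only teleports to pages in teleport_set."""
    nodes = list(out_links)
    teleport = list(teleport_set)
    rank = {node: (1.0 / len(teleport) if node in teleport_set else 0.0) for node in nodes}
    for _ in range(iterations):
        new_rank = {node: 0.0 for node in nodes}
        for node, targets in out_links.items():
            for target in targets:
                new_rank[target] += beta * rank[node] / len(targets)
        # Rank not passed along links is re-inserted at the teleport set.
        leaked = 1.0 - sum(new_rank.values())
        for node in teleport:
            new_rank[node] += leaked / len(teleport)
        rank = new_rank
    return rank


def spam_mass(r, r_plus, page):
    """Fraction of a page's score not explained by trusted pages: |r - r+| / r."""
    return abs(r[page] - r_plus[page]) / r[page]


# Hypothetical picture/tag graph; p3 and t3 are disconnected from p1.
graph = {
    "p1": ["t1"], "p2": ["t1", "t2"], "p3": ["t3"],
    "t1": ["p1", "p2"], "t2": ["p2"], "t3": ["p3"],
}
similarity_to_p1 = personalized_pagerank(graph, {"p1"})
```

Nodes that cannot be reached from the restart node end up with probability zero, which matches the reading of the result as similarity to N.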
