

  1. CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu

  2.  Data mining overlaps with machine learning, statistics, artificial intelligence, databases, and visualization, but puts more stress on:
       scalability in the number of features and instances
       algorithms and architectures
       automation for handling large data
      (Neighboring fields in the slide's Venn diagram: Statistics / Machine Learning / AI / Pattern Recognition / Database systems) 3/9/2011 Jure Leskovec, Stanford C246: Mining Massive Datasets

  3.  Course topics:
       MapReduce
       Association Rules
       Finding Similar Items
       Locality Sensitive Hashing
       Dimensionality Reduction (SVD, CUR)
       Clustering
       Recommender systems
       PageRank and TrustRank
       Machine Learning: kNN, SVM, Decision Trees
       Mining data streams
       Advertising on the Web

  4. (image-only slide, no text content)

  5.  MAP (provided by the programmer): reads input and produces a set of (key, value) pairs
       Group by key: collect all pairs with the same key
       Reduce (provided by the programmer): collect all values belonging to the key and output
       Only sequential reads of the data are needed
      Word count example on a big document ("The crew of the space shuttle Endeavor recently returned to Earth as ambassadors, harbingers of a new era of space exploration. …"):
      MAP emits (the, 1), (crew, 1), (of, 1), (the, 1), (space, 1), (shuttle, 1), (Endeavor, 1), (recently, 1), …
      Group by key collects (crew, 1) (crew, 1), (the, 1) (the, 1) (the, 1), …
      Reduce outputs (crew, 2), (the, 3), (space, 1), (shuttle, 1), (recently, 1), …
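The MAP, group-by-key, and REDUCE steps described on this slide can be sketched in plain Python. This is a single-machine toy, not a distributed implementation, and the two sample documents are made up:

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    """MAP (programmer-provided): read input, emit (word, 1) pairs."""
    return [(word.lower(), 1) for word in document.split()]

def group_by_key(pairs):
    """GROUP BY KEY: collect all values belonging to the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """REDUCE (programmer-provided): sum the values for each key."""
    return {key: sum(values) for key, values in groups.items()}

docs = ["the crew of the space shuttle", "the space station"]
pairs = list(chain.from_iterable(map_phase(d) for d in docs))
counts = reduce_phase(group_by_key(pairs))
print(counts["the"], counts["space"])  # 3 2
```

In a real MapReduce system the grouping step is performed by the framework's shuffle; only the map and reduce functions are written by the programmer, as the slide notes.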

  6.  High-dimensional data: Locality Sensitive Hashing, Dimensionality reduction, Clustering
       The data is a graph: Link Analysis (PageRank, TrustRank, Hubs & Authorities)
       Machine Learning: kNN, Perceptron, SVM, Decision Trees
       Data is infinite: Mining data streams, Advertising on the Web
       Applications: Association Rules, Recommender systems

  7.  Many problems can be expressed as finding "similar" sets: find near-neighbors in high-dimensional space
       Distance metrics:
        Points in ℝⁿ: L1 (Manhattan) and L2 distance
        Vectors: Cosine similarity
        Sets of items: Jaccard similarity, Hamming distance
       Problem: find near-duplicate documents
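Jaccard similarity, the set metric used throughout the similar-items pipeline, is simple to state in code. The two shingle sets below are illustrative (the second is made up):

```python
def jaccard(a, b):
    """Jaccard similarity of two sets: |A ∩ B| / |A ∪ B|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

S1 = {"ab", "bc", "ca"}   # the 2-shingles of "abcab" (see slide 9)
S2 = {"ab", "ca", "cd"}   # a made-up second shingle set
print(jaccard(S1, S2))    # 2 shared / 4 in the union = 0.5
```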

  8. Pipeline: Document → Shingling → Signatures → Locality-Sensitive Hashing → Candidate pairs
      1. Shingling: convert docs to sets (the set of strings of length k that appear in the document)
      2. Minhashing: convert large sets to short signatures (short integer vectors that represent the sets and reflect their similarity), while preserving similarity
      3. Locality-sensitive hashing: focus on pairs of signatures likely to be similar; these candidate pairs are the ones we need to test for similarity

  9.  Shingling: convert docs to sets of items
       Shingle: sequence of k tokens that appears in the doc
       Example: k = 2, D1 = abcab; the 2-shingles are S(D1) = {ab, bc, ca}
       Represent a doc by the set of hashes of its shingles
       MinHashing: convert large sets to short signatures, while preserving similarity
       Similarity-preserving hash function h_π(·) such that Pr[h_π(S(D1)) = h_π(S(D2))] = Sim(S(D1), S(D2))
       For Jaccard similarity: pick a random permutation π of the rows (sets are columns); h_π(C) is the index of the first row, in permuted order, in which column C has a 1
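A minimal pure-Python sketch of shingling plus minhashing with explicit random permutations. Real implementations simulate permutations with hash functions rather than shuffling; the function names and the choice of 200 permutations here are illustrative:

```python
import random

def shingles(doc, k=2):
    """Set of k-shingles (length-k substrings) of a document."""
    return {doc[i:i + k] for i in range(len(doc) - k + 1)}

def minhash_signatures(universe, sets, num_hashes=100, seed=0):
    """One signature entry per random permutation of the universe:
    the index of the first permuted element present in the set."""
    rng = random.Random(seed)
    universe = list(universe)
    sigs = [[] for _ in sets]
    for _ in range(num_hashes):
        perm = universe[:]
        rng.shuffle(perm)
        for sig, s in zip(sigs, sets):
            sig.append(next(i for i, x in enumerate(perm) if x in s))
    return sigs

S1 = shingles("abcab")   # {'ab', 'bc', 'ca'}
S2 = shingles("abcd")    # {'ab', 'bc', 'cd'}
sig1, sig2 = minhash_signatures(S1 | S2, [S1, S2], num_hashes=200)
est = sum(a == b for a, b in zip(sig1, sig2)) / len(sig1)
# est approximates Jaccard(S1, S2) = 2/4 = 0.5
```

The fraction of agreeing signature positions estimates the Jaccard similarity, which is exactly the property Pr[h_π(S1) = h_π(S2)] = Sim(S1, S2) stated on the slide.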

  10. Example: input matrix (rows = shingles, columns = documents C1, C2, C3, C4) and the signature matrix M obtained from three row permutations:

      Input matrix:        Signature matrix M:
      1 0 1 0              2 1 2 1
      1 0 0 1              2 1 4 1
      0 1 0 1              1 2 1 2
      0 1 0 1
      0 1 0 1
      1 0 1 0
      1 0 1 0

      Similarities:  1-3   2-4   1-2   3-4
      Col/Col        0.75  0.75  0     0
      Sig/Sig        0.67  1.00  0     0

  11.  Hash columns of signature matrix M: similar columns are likely to hash to the same bucket
        Columns x and y are a candidate pair if M(i, x) = M(i, y) for at least a fraction s of the values of i
        Divide matrix M into b bands of r rows each; hash each band of each column to a bucket
        If Sim(C1, C2) = s, the probability that at least one band is identical (so the pair becomes a candidate) is 1 − (1 − s^r)^b
        Given s, tune r and b to get almost all pairs with similar signatures, but eliminate most pairs that do not have similar signatures
       Example (b = 20, r = 5): probability of sharing a bucket as a function of s:
        s = .2 → .006
        s = .3 → .047
        s = .4 → .186
        s = .5 → .470
        s = .6 → .802
        s = .7 → .975
        s = .8 → .9996
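The candidate-pair probability 1 − (1 − s^r)^b is easy to tabulate directly; this short sketch reproduces the slide's table for b = 20 bands of r = 5 rows:

```python
def candidate_prob(s, r, b):
    """Probability that two columns with signature similarity s
    become a candidate pair: at least one of b bands of r rows agrees."""
    return 1 - (1 - s**r)**b

# Reproduce the slide's table for b = 20, r = 5.
for s in [0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]:
    print(f"s = {s:.1f}  P = {candidate_prob(s, 5, 20):.4f}")
```

The steep rise of this curve around s ≈ 0.5 is the point of banding: it acts as a (tunable) threshold separating probably-similar pairs from probably-dissimilar ones.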

  12. SVD: an m × n matrix A is factored as A ≈ U Σ Vᵀ, where U is m × r, Σ is an r × r diagonal matrix of singular values, and Vᵀ is r × n.

  13.  A = U Σ Vᵀ example (rows of A = users; columns = the movies Matrix, Alien, Serenity, Casablanca, Amelie); U is the user-to-concept similarity matrix, with a SciFi-concept and a Romance-concept:

      A (ratings)     U (user-to-concept)    Σ              Vᵀ
      1 1 1 0 0       0.18  0
      2 2 2 0 0       0.36  0                9.64  0        0.58  0.58  0.58  0     0
      1 1 1 0 0   =   0.18  0           ×    0     5.29  ×  0     0     0     0.71  0.71
      5 5 5 0 0       0.90  0
      0 0 0 2 2       0     0.53
      0 0 0 3 3       0     0.80
      0 0 0 1 1       0     0.27
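The factorization on this slide can be checked numerically with a plain-Python matrix product (values copied from the slide; small rounding error is expected since U, Σ, and Vᵀ are given to two decimals):

```python
def matmul(A, B):
    """Plain-Python matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

# Values from the slide (two concepts kept).
U = [[0.18, 0], [0.36, 0], [0.18, 0], [0.90, 0],
     [0, 0.53], [0, 0.80], [0, 0.27]]
Sigma = [[9.64, 0], [0, 5.29]]
Vt = [[0.58, 0.58, 0.58, 0, 0],
      [0, 0, 0, 0.71, 0.71]]

A_approx = matmul(matmul(U, Sigma), Vt)
# e.g. A_approx[0][0] is close to the original rating 1,
# and A_approx[4][3] is close to 2
```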

  14.  Same factorization: the singular values in Σ give the 'strength' of each concept: 9.64 for the SciFi-concept and 5.29 for the Romance-concept.

  15.  Same factorization: Vᵀ is the movie-to-concept similarity matrix: the SciFi-concept row is (0.58, 0.58, 0.58, 0, 0) and the Romance-concept row is (0, 0, 0, 0.71, 0.71).

  16.  How to do dimensionality reduction: set small singular values to zero
       How to query: map the query vector q into "concept space" by computing q ∙ V
       Example (movies: Matrix, Alien, Serenity, Casablanca, Amelie): d = (0, 4, 5, 0, 0) rated Alien and Serenity, while q rated only Matrix; even though d and q do not share a movie, both map onto the SciFi-concept axis, so they are still similar
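A sketch of the query mapping q ∙ V, using the V values from the preceding slides. The concrete query vector q (a user who gave Matrix a 5) is an assumed example; the slide's own q values are garbled in this transcript:

```python
def to_concept_space(q, V):
    """Map a query row-vector q into concept space: q . V."""
    return [sum(qi * row[j] for qi, row in zip(q, V))
            for j in range(len(V[0]))]

# V (movie-to-concept) from the slides; movie order assumed to be
# Matrix, Alien, Serenity, Casablanca, Amelie.
V = [[0.58, 0.0],
     [0.58, 0.0],
     [0.58, 0.0],
     [0.0, 0.71],
     [0.0, 0.71]]

q = [5, 0, 0, 0, 0]   # assumed: user who rated only "Matrix"
d = [0, 4, 5, 0, 0]   # user d from the slide: rated Alien and Serenity
q_c = to_concept_space(q, V)
d_c = to_concept_space(d, V)
# both concept-space vectors lie on the SciFi axis,
# even though q and d share no movie
```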

  17.  Hierarchical clustering:
        Agglomerative (bottom up): initially, each point is a cluster; repeatedly combine the two "nearest" clusters into one; represent a cluster by its centroid or clustroid
       Point assignment:
        Maintain a set of clusters; points belong to the "nearest" cluster

  18.  k-means: initialize cluster centroids, then iterate:
        For each point, place it in the cluster whose current centroid is nearest
        Update the cluster centroids based on the memberships
      (Figure: numbered points being reassigned between two centroids, marked ×, and the clusters after the first round)
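The two-step k-means iteration above can be sketched in a few lines of Python; the points and initial centroids below are made-up toy data:

```python
import math

def kmeans(points, centroids, iters=10):
    """Assign each point to its nearest centroid, then recompute each
    centroid as the mean of its members; repeat."""
    clusters = [[] for _ in centroids]
    for _ in range(iters):
        clusters = [[] for _ in centroids]
        for p in points:                      # assignment step
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        centroids = [tuple(sum(c) / len(c) for c in zip(*members))
                     if members else centroids[i]   # keep empty clusters put
                     for i, members in enumerate(clusters)]
    return centroids, clusters

points = [(1, 1), (1.5, 2), (8, 8), (9, 9), (8.5, 9.5)]
centroids, clusters = kmeans(points, [(0, 0), (10, 10)])
# the two tight groups end up in separate clusters
```

A fixed iteration count stands in for a proper convergence test (stop when assignments no longer change).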

  19.  LSH: find somewhat similar pairs of items while avoiding O(N²) comparisons
       Clustering: assign points to a prespecified number of clusters; each point belongs to a single cluster; summarize the cluster by a centroid (e.g., topic vector)
       SVD (dimensionality reduction): explore correlations in the data; some dimensions may be irrelevant; useful for visualization, removing noise from the data, and detecting anomalies

  20.  High-dimensional data: Locality Sensitive Hashing, Dimensionality reduction, Clustering
       The data is a graph: Link Analysis (PageRank, TrustRank, Hubs & Authorities)
       Machine Learning: kNN, Perceptron, SVM, Decision Trees
       Data is infinite: Mining data streams, Advertising on the Web
       Applications: Association Rules, Recommender systems

  21.  Rank nodes using the link structure
       PageRank:
        Link voting: a page P with importance x and n out-links gives x/n votes to each of its links
        Page R's importance is the sum of the votes on its in-links
       Complications: spider traps and dead ends; the fix is a random surfer who at each step has two options:
        With probability β, follow a link at random
        With probability 1 − β, jump to some page uniformly at random
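The random-surfer formulation leads directly to power iteration; below is a minimal sketch on a made-up three-page graph, with β = 0.85 as a common choice (dead ends are handled by teleporting uniformly, one standard treatment):

```python
def pagerank(links, beta=0.85, iters=50):
    """Power iteration with random teleport: each page splits its
    importance evenly among its out-links; with prob. 1-beta the
    surfer jumps to a uniformly random page."""
    pages = list(links)
    n = len(pages)
    r = {p: 1 / n for p in pages}
    for _ in range(iters):
        new = {p: (1 - beta) / n for p in pages}
        for p, outs in links.items():
            if outs:                      # spread votes along out-links
                for q in outs:
                    new[q] += beta * r[p] / len(outs)
            else:                         # dead end: teleport uniformly
                for q in pages:
                    new[q] += beta * r[p] / n
        r = new
    return r

r = pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]})
# scores sum to 1; "c", with two in-links, collects the most votes
```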

  22.  TrustRank: topic-specific PageRank with a teleport set of "trusted" pages
       Spam mass of page p: the fraction of its PageRank score r(p) that comes from spam pages, (r(p) − r⁺(p)) / r(p), where r⁺(p) is the rank p receives when teleporting only to trusted pages
       SimRank: measure similarity between items in a k-partite graph with k types of nodes (example: picture nodes and tag nodes)
        Perform a random walk with restarts from node N, i.e., teleport set = {N}
        The resulting probability distribution measures similarity to N
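A random walk with restarts is just PageRank whose teleport set is a single node; below is a sketch on a made-up bipartite picture/tag graph (node names hypothetical, edges stored in both directions, and every node assumed to have at least one out-edge):

```python
def random_walk_with_restarts(links, start, beta=0.8, iters=50):
    """Power iteration where the teleport set is the single node `start`;
    the stationary distribution measures similarity to `start`."""
    nodes = list(links)
    r = {p: (1.0 if p == start else 0.0) for p in nodes}
    for _ in range(iters):
        new = {p: ((1 - beta) if p == start else 0.0) for p in nodes}
        for p, outs in links.items():
            for q in outs:
                new[q] += beta * r[p] / len(outs)
        r = new
    return r

# Made-up picture/tag graph: pictures link to their tags and vice versa.
graph = {"pic1": ["sky", "sea"], "pic2": ["sea"],
         "sky": ["pic1"], "sea": ["pic1", "pic2"]}
sim = random_walk_with_restarts(graph, "pic1")
# pic2 shares the "sea" tag with pic1, so sim["pic2"] > 0
```

Nodes reachable through shared tags pick up probability mass, which is exactly the similarity-to-N interpretation on the slide.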
