

  1. CS246: Mining Massive Datasets. Jure Leskovec, Stanford University. http://cs246.stanford.edu

  2. High-dimension == many features. Find concepts/topics/genres:
     - Documents. Features: thousands of words, millions of word pairs
     - Surveys. Netflix: 480k users x 177k movies

  3. Compress / reduce dimensionality:
     - 10^6 rows; 10^3 columns; no updates
     - Random access to any cell(s); small error is OK

  4. Assumption: data lies on or near a low d-dimensional subspace. The axes of this subspace are an effective representation of the data.

  5. Why reduce dimensionality?
     - Discover hidden correlations/topics (words that occur commonly together)
     - Remove redundant and noisy features (not all words are useful)
     - Interpretation and visualization
     - Easier storage and processing of the data

  6. A [n x m] = U [n x r] Σ [r x r] (V [m x r])^T
     - A: input data matrix; an n x m matrix (e.g., n documents, m terms)
     - U: left singular vectors; an n x r matrix (n documents, r concepts)
     - Σ: singular values; an r x r diagonal matrix ('strength' of each concept), where r is the rank of the matrix
     - V: right singular vectors; an m x r matrix (m terms, r concepts)
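     A minimal sketch of this decomposition in NumPy (my illustration, not from the slides; the random matrix is hypothetical and only demonstrates the shapes):

```python
import numpy as np

n, m = 7, 5                        # e.g., n users/documents, m movies/terms
A = np.random.rand(n, m)

# full_matrices=False returns the "economy" SVD with r = min(n, m) factors.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

print(U.shape, s.shape, Vt.shape)  # (7, 5) (5,) (5, 5)
assert np.allclose(A, U @ np.diag(s) @ Vt)   # A = U Sigma V^T
assert np.all(s[:-1] >= s[1:])               # singular values come sorted
```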

  7. [Diagram: the n x m matrix A is approximated by the product of U (n x r), Σ (r x r), and V^T (r x m).]

  8. A ≈ σ_1 u_1 v_1^T + σ_2 u_2 v_2^T + ..., where each σ_i is a scalar and u_i, v_i are vectors (the i-th left and right singular vectors).

  9. It is always possible to decompose a real matrix A into A = U Σ V^T, where:
     - U, Σ, V: unique
     - U, V: column-orthonormal: U^T U = I; V^T V = I (I: identity matrix); i.e., the columns are orthogonal unit vectors
     - Σ: diagonal; its entries (the singular values) are positive and sorted in decreasing order (σ_1 ≥ σ_2 ≥ σ_3 ≥ ...)
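     These properties are easy to verify numerically; a small sketch (the test matrix is random, my own example, not from the slides):

```python
import numpy as np

A = np.random.rand(6, 4)
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Column-orthonormality: U^T U = I and V^T V = I.
assert np.allclose(U.T @ U, np.eye(U.shape[1]))
assert np.allclose(Vt @ Vt.T, np.eye(Vt.shape[0]))  # rows of V^T are columns of V

# Singular values are non-negative and sorted in decreasing order.
assert np.all(s >= 0) and np.all(s[:-1] >= s[1:])
```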

  10. A = U Σ V^T - example. Rows of A are users, columns are the movies Matrix, Alien, Serenity, Casablanca, Amelie; the first four users are SciFi fans, the last three Romance fans:

      A (users x movies)     U                Σ             V^T
      1 1 1 0 0              0.18 0           9.64 0        0.58 0.58 0.58 0    0
      2 2 2 0 0              0.36 0           0    5.29     0    0    0    0.71 0.71
      1 1 1 0 0              0.18 0
      5 5 5 0 0              0.90 0
      0 0 0 2 2              0    0.53
      0 0 0 3 3              0    0.80
      0 0 0 1 1              0    0.27

      A = U x Σ x V^T
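      The same numbers fall out of NumPy; a sketch reusing the example matrix (note that the signs of a matched u_i/v_i pair may both flip):

```python
import numpy as np

# Columns: Matrix, Alien, Serenity, Casablanca, Amelie.
A = np.array([[1, 1, 1, 0, 0],
              [2, 2, 2, 0, 0],
              [1, 1, 1, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 0, 0, 2, 2],
              [0, 0, 0, 3, 3],
              [0, 0, 0, 1, 1]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)

print(np.round(s, 2))         # ~ [9.64 5.29 0. 0. 0.] -- two concepts
print(np.round(U[:, :2], 2))  # user-to-concept columns (up to sign)
print(np.round(Vt[:2], 2))    # movie-to-concept rows (up to sign)
```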

  11. A = U Σ V^T - example: the two concepts are a SciFi-concept (first column of U, first row of V^T) and a Romance-concept (second column of U, second row of V^T).

  12. A = U Σ V^T - example: U is the user-to-concept similarity matrix.

  13. A = U Σ V^T - example: the diagonal of Σ gives the 'strength' of each concept; 9.64 is the strength of the SciFi-concept.

  14. A = U Σ V^T - example: V is the movie-to-concept similarity matrix; its first column is the SciFi-concept.


  16. 'Movies', 'users' and 'concepts':
     - U: user-to-concept similarity matrix
     - V: movie-to-concept similarity matrix
     - Σ: its diagonal elements give the 'strength' of each concept

  17. [Plot: users as points in the plane of (Movie 1 rating, Movie 2 rating), with the first singular vector v_1 drawn as the projection axis.] SVD gives the best axis to project on: 'best' = minimum sum of squares of projection errors, i.e., minimum reconstruction error.
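      A sketch of this claim on toy 2-D data (my own hypothetical ratings, not the slides' figure): the first right singular vector beats any other direction on summed squared projection error.

```python
import numpy as np

rng = np.random.default_rng(0)
# 100 correlated 2-D "rating" points (hypothetical data).
X = rng.normal(size=(100, 2)) @ np.array([[3.0, 1.0], [1.0, 0.5]])

_, _, Vt = np.linalg.svd(X, full_matrices=False)
v1 = Vt[0]                                   # first right singular vector

def sse(v):
    """Sum of squared errors when X is projected onto direction v."""
    v = v / np.linalg.norm(v)
    return np.sum((X - np.outer(X @ v, v)) ** 2)

best_random = min(sse(rng.normal(size=2)) for _ in range(1000))
print(sse(v1) <= best_random + 1e-9)         # True: v1 is the optimal axis
```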

  18. A = U Σ V^T - example: v_1, the first row of V^T (0.58, 0.58, 0.58, 0, 0), is the projection axis from the previous slide.

  19. A = U Σ V^T - example: the first singular value (9.64) measures the variance ('spread') of the data along the v_1 axis.

  20. A = U Σ V^T - example: U Σ gives the coordinates of the points on the projection axes.
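      Equivalently, projecting the rows of A onto the concept axes gives U Σ, since A V = U Σ V^T V = U Σ; a quick check on the example matrix:

```python
import numpy as np

A = np.array([[1, 1, 1, 0, 0],
              [2, 2, 2, 0, 0],
              [1, 1, 1, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 0, 0, 2, 2],
              [0, 0, 0, 3, 3],
              [0, 0, 0, 1, 1]], dtype=float)
U, s, Vt = np.linalg.svd(A, full_matrices=False)

coords = A @ Vt.T                    # project each user onto the concept axes
assert np.allclose(coords, U @ np.diag(s))
```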

  21. More details. Q: How exactly is dimensionality reduction done?

  22. More details. Q: How exactly is dimensionality reduction done? A: Set the smallest singular values to zero.

  23. Zeroing the smaller singular value in the example (5.29 → 0): A ≈ U diag(9.64, 0) V^T.


  25. Dropping the zeroed concept altogether leaves a rank-1 product: A ≈ u_1 σ_1 v_1^T, with u_1 = (0.18, 0.36, 0.18, 0.90, 0, 0, 0)^T, σ_1 = 9.64, v_1^T = (0.58, 0.58, 0.58, 0, 0).

  26. More details. Setting the smallest singular values to zero gives B ≈ A:

      A =  1 1 1 0 0      B =  1 1 1 0 0
           2 2 2 0 0           2 2 2 0 0
           1 1 1 0 0           1 1 1 0 0
           5 5 5 0 0           5 5 5 0 0
           0 0 0 2 2           0 0 0 0 0
           0 0 0 3 3           0 0 0 0 0
           0 0 0 1 1           0 0 0 0 0
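      A sketch of this truncation in NumPy (reusing the example matrix; the rank-1 result reproduces B above):

```python
import numpy as np

A = np.array([[1, 1, 1, 0, 0],
              [2, 2, 2, 0, 0],
              [1, 1, 1, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 0, 0, 2, 2],
              [0, 0, 0, 3, 3],
              [0, 0, 0, 1, 1]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 1
s_trunc = s.copy()
s_trunc[k:] = 0.0                    # zero out the smallest singular values
B = U @ np.diag(s_trunc) @ Vt

print(np.round(B, 2))                # SciFi block survives; Romance block -> 0
```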

  27. Theorem: Let A = U Σ V^T with σ_1 ≥ σ_2 ≥ ... and rank(A) = n. Let B = U S V^T, where S is the n x n diagonal matrix with s_i = σ_i for i = 1...k and s_i = 0 otherwise. Then B is a best rank-k approximation to A: B solves min_{B: rank(B)=k} ||A - B||_F. Why? Because U and V are column-orthonormal, multiplying by them does not change the Frobenius norm, so

      min_{B: rank(B)=k} ||A - B||_F² = min_S ||Σ - S||_F² = min Σ_{i=1}^n (σ_i - s_i)²
      = min [ Σ_{i=1}^k (σ_i - s_i)² + Σ_{i=k+1}^n σ_i² ] = Σ_{i=k+1}^n σ_i²,

      attained by choosing s_i = σ_i for the k largest singular values.
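      A numerical spot-check of the theorem (random matrix, my own sketch): the Frobenius error of the rank-k truncation equals the square root of the sum of the dropped σ_i².

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(8, 6))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 3
B = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]        # best rank-k approximation

# ||A - B||_F = sqrt(sum of the dropped singular values squared).
assert np.isclose(np.linalg.norm(A - B, 'fro'), np.sqrt(np.sum(s[k:] ** 2)))
```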

  28. Equivalent: 'spectral decomposition' of the matrix: A = σ_1 u_1 v_1^T + σ_2 u_2 v_2^T, where u_1, u_2 are the columns of U and v_1, v_2 the columns of V from the example above.
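      The same identity as a sketch: summing the rank-1 outer products σ_i u_i v_i^T rebuilds A exactly.

```python
import numpy as np

A = np.array([[1, 1, 1, 0, 0],
              [2, 2, 2, 0, 0],
              [1, 1, 1, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 0, 0, 2, 2],
              [0, 0, 0, 3, 3],
              [0, 0, 0, 1, 1]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Sum of rank-1 terms sigma_i * u_i v_i^T over all singular triplets.
A_rebuilt = sum(s[i] * np.outer(U[:, i], Vt[i]) for i in range(len(s)))
assert np.allclose(A, A_rebuilt)
```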
