CS345a: Data Mining. Jure Leskovec and Anand Rajaraman, Stanford University
Homework 2 is out: due Monday the 15th at midnight! Submit PDFs.
Talk: Yehuda Koren, winner of the Netflix challenge! Wed at 12:30 in Terman 453; http://rain.stanford.edu
2/9/2010, Jure Leskovec & Anand Rajaraman, Stanford CS345a: Data Mining
Text: LSI finds 'concepts'
Compress / reduce dimensionality: 10^6 rows; 10^3 columns; no updates; random access to any cell(s); small error is OK.
A [n x m] = U [n x r] Σ [r x r] (V [m x r])^T
A: n x m matrix (e.g., n documents, m terms)
U: n x r matrix (n documents, r concepts)
Σ: r x r diagonal matrix (strength of each 'concept'; r: rank of the matrix)
V: m x r matrix (m terms, r concepts)
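A minimal numpy sketch of these shapes (the matrix contents are illustrative; numpy returns V^T directly, and the reduced "economy" form uses r = min(n, m)):

```python
import numpy as np

# A hypothetical 7x5 document-term matrix (n = 7 documents, m = 5 terms)
A = np.random.default_rng(0).integers(0, 5, size=(7, 5)).astype(float)

# Reduced ("economy") SVD: A = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(A, full_matrices=False)

print(U.shape, s.shape, Vt.shape)           # (7, 5) (5,) (5, 5)
print(np.allclose(A, U @ np.diag(s) @ Vt))  # True
```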
[Diagram: the n x m matrix A factored as U (n x r) times Σ (r x r) times V^T (r x m)]
[Diagram: A written as a sum of rank-1 matrices, A = σ1 u1 v1^T + σ2 u2 v2^T + ...]
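The picture above says A is a weighted sum of rank-1 matrices σ_i u_i v_i^T. A quick numpy check (the matrix values are illustrative):

```python
import numpy as np

# Any small matrix works; this one is 5x2
A = np.array([[1., 1., 1., 0., 0.],
              [0., 0., 0., 2., 2.]]).T
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Rebuild A as sigma_1 * u1 v1^T + sigma_2 * u2 v2^T
rank1_sum = sum(s[i] * np.outer(U[:, i], Vt[i, :]) for i in range(len(s)))
print(np.allclose(A, rank1_sum))  # True
```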
THEOREM [Press+92]: it is always possible to decompose a matrix A into A = U Σ V^T, where:
U, Σ, V: unique
U, V: column-orthonormal: U^T U = I; V^T V = I (I: identity matrix; columns are orthogonal unit vectors)
Σ: diagonal; its entries (singular values) are positive, and sorted in decreasing order
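These theorem properties can be checked numerically (the random test matrix is an assumption for illustration):

```python
import numpy as np

A = np.random.default_rng(1).normal(size=(6, 4))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Column-orthonormal: U^T U = I and V^T V = I
print(np.allclose(U.T @ U, np.eye(4)))             # True
print(np.allclose(Vt @ Vt.T, np.eye(4)))           # True
# Singular values are non-negative, sorted in decreasing order
print(np.all(s >= 0) and np.all(np.diff(s) <= 0))  # True
```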
A = U Σ V^T, example
(terms: data, inf., retrieval, brain, lung; rows 1-4 are CS documents, rows 5-7 are MD documents)

A =
1 1 1 0 0
2 2 2 0 0
1 1 1 0 0
5 5 5 0 0
0 0 0 2 2
0 0 0 3 3
0 0 0 1 1

U =
0.18 0
0.36 0
0.18 0
0.90 0
0    0.53
0    0.80
0    0.27

Σ =
9.64 0
0    5.29

V^T =
0.58 0.58 0.58 0    0
0    0    0    0.71 0.71
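The decomposition above can be reproduced with numpy (the signs of a singular-vector pair may flip, so absolute values are compared):

```python
import numpy as np

A = np.array([[1, 1, 1, 0, 0],
              [2, 2, 2, 0, 0],
              [1, 1, 1, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 0, 0, 2, 2],
              [0, 0, 0, 3, 3],
              [0, 0, 0, 1, 1]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)

print(np.round(s[:2], 2))            # [9.64 5.29]
print(np.round(np.abs(U[:, 0]), 2))  # 0.18, 0.36, 0.18, 0.9, then zeros
print(np.round(np.abs(Vt[0]), 2))    # 0.58, 0.58, 0.58, then zeros
```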
Same example: the two columns of the factors correspond to concepts; the first is the 'CS' concept, the second the 'MD' concept.
Same example: U is the document-to-concept similarity matrix.
Same example: the diagonal of Σ gives the 'strength' of each concept; 9.64 is the strength of the CS concept.
Same example: V is the term-to-concept similarity matrix; its first column is the CS concept.
'Documents', 'terms' and 'concepts':
U: document-to-concept similarity matrix
V: term-to-concept similarity matrix
Σ: its diagonal elements give the 'strength' of each concept
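One way this interpretation is used in LSI (a sketch; the query vector here is a hypothetical example, not from the slides): map a query from term space into concept space by multiplying with V.

```python
import numpy as np

A = np.array([[1, 1, 1, 0, 0],
              [2, 2, 2, 0, 0],
              [1, 1, 1, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 0, 0, 2, 2],
              [0, 0, 0, 3, 3],
              [0, 0, 0, 1, 1]], dtype=float)
U, s, Vt = np.linalg.svd(A, full_matrices=False)
V = Vt.T[:, :2]  # keep the two meaningful concepts

# Hypothetical query mentioning "data" and "retrieval"
# (term order: data, inf., retrieval, brain, lung)
q = np.array([1., 0., 1., 0., 0.])
concept_coords = q @ V

# The query loads on the first (CS) concept, not the second (MD) one
print(abs(concept_coords[0]) > abs(concept_coords[1]))  # True
```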
SVD gives the best axis to project on: the first singular vector v1 is the best axis, where 'best' means it minimizes the sum of squared projection errors (minimum RMS error).
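A small check of this 'best axis' claim on the running example (the comparison direction is an arbitrary assumption): projecting the rows of A onto v1 loses no more than projecting onto any other unit vector, e.g. a random one.

```python
import numpy as np

A = np.array([[1, 1, 1, 0, 0],
              [2, 2, 2, 0, 0],
              [1, 1, 1, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 0, 0, 2, 2],
              [0, 0, 0, 3, 3],
              [0, 0, 0, 1, 1]], dtype=float)
U, s, Vt = np.linalg.svd(A, full_matrices=False)
v1 = Vt[0]

def proj_error(w):
    """Sum of squared distances from each row of A to its projection on w."""
    w = w / np.linalg.norm(w)
    return np.sum((A - np.outer(A @ w, w)) ** 2)

w_rand = np.random.default_rng(0).normal(size=5)
print(proj_error(v1) <= proj_error(w_rand))  # True
```

The residual error on v1 equals σ2², the energy of the discarded second concept.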
Same example: the first row of V^T is v1, the best axis to project the points on.
Same example: σ1 = 9.64 measures the variance ('spread') of the points along the v1 axis.
Same example: U gives the coordinates of the points on the projection axes.
More details
Q: how exactly is dimensionality reduction done?
A: set the smallest singular values to zero.
Setting the smaller singular value (5.29) to zero, A ≈ U Σ' V^T with

Σ' =
9.64 0
0    0

(U and V^T as before).
Equivalently, drop the zeroed concept entirely:

A ≈
0.18
0.36
0.18
0.90   x   9.64   x   0.58 0.58 0.58 0 0
0
0
0
The resulting approximation:

A =             B =
1 1 1 0 0       1 1 1 0 0
2 2 2 0 0       2 2 2 0 0
1 1 1 0 0       1 1 1 0 0
5 5 5 0 0   ≈   5 5 5 0 0
0 0 0 2 2       0 0 0 0 0
0 0 0 3 3       0 0 0 0 0
0 0 0 1 1       0 0 0 0 0
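This approximation is exactly the rank-1 truncation of the SVD; a numpy check:

```python
import numpy as np

A = np.array([[1, 1, 1, 0, 0],
              [2, 2, 2, 0, 0],
              [1, 1, 1, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 0, 0, 2, 2],
              [0, 0, 0, 3, 3],
              [0, 0, 0, 1, 1]], dtype=float)
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Zero out all but the largest singular value -> rank-1 approximation
B = s[0] * np.outer(U[:, 0], Vt[0])

# The CS rows are reproduced almost exactly; the MD rows collapse to zero
print(np.round(B).astype(int))
```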