

  1. CS345a: Data Mining. Jure Leskovec and Anand Rajaraman, Stanford University

  2. Homework 2 is out: due Monday the 15th at midnight! Submit PDFs. Talk (http://rain.stanford.edu): Yehuda Koren, winner of the Netflix challenge! Wednesday at 12:30 in Terman 453.

  3. Text - LSI: find 'concepts'

  4. Compress / reduce dimensionality: 10^6 rows; 10^3 columns; no updates; random access to any cell(s); small error is OK.


  7. A [n x m] = U [n x r] Σ [r x r] (V [m x r])^T, where: A is an n x m matrix (e.g., n documents, m terms); U is an n x r matrix (n documents, r concepts); Σ is an r x r diagonal matrix giving the 'strength' of each concept (r is the rank of the matrix); V is an m x r matrix (m terms, r concepts).
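A minimal sketch of this factorization in code, assuming NumPy and an arbitrary random matrix (the sizes below are made up, not from the slides). Note that np.linalg.svd returns V already transposed, and with full_matrices=False the shared dimension is min(n, m) rather than the exact rank r:

    import numpy as np

    n, m = 7, 5                                 # hypothetical: 7 documents, 5 terms
    A = np.random.rand(n, m)                    # placeholder document-term matrix

    # Reduced ("economy") SVD: U is n x k, s holds k singular values, Vt is k x m,
    # where k = min(n, m); the true rank r is at most k.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)

    print(U.shape, s.shape, Vt.shape)           # (7, 5) (5,) (5, 5)
    print(np.allclose(A, U @ np.diag(s) @ Vt))  # True: A = U Sigma V^T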

  8. [Diagram] The n x m matrix A drawn as the product of U (n x r), Σ (r x r) and V^T (r x m).

  9. [Diagram] Equivalently, A is a sum of rank-1 pieces: A = σ1 u1 v1^T + σ2 u2 v2^T + ...
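A short sketch of this spectral form, again assuming NumPy and an arbitrary matrix: summing the rank-1 outer products σi ui vi^T reproduces A.

    import numpy as np

    A = np.random.rand(7, 5)                     # any matrix works here
    U, s, Vt = np.linalg.svd(A, full_matrices=False)

    # Rebuild A term by term as a sum of rank-1 outer products sigma_i * u_i * v_i^T.
    A_rebuilt = sum(s[i] * np.outer(U[:, i], Vt[i, :]) for i in range(len(s)))
    print(np.allclose(A, A_rebuilt))             # True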

  10. Theorem [Press+92]: it is always possible to decompose a matrix A as A = U Σ V^T, where: U, Σ, V are unique; U and V are column-orthonormal, i.e., U^T U = I and V^T V = I (I: identity matrix), so their columns are orthogonal unit vectors; and Σ is diagonal, with positive entries (the singular values) sorted in decreasing order.
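A quick numerical check of these properties (a sketch assuming NumPy; Vt holds V transposed, so its rows are the columns of V):

    import numpy as np

    A = np.random.rand(7, 5)
    U, s, Vt = np.linalg.svd(A, full_matrices=False)

    print(np.allclose(U.T @ U, np.eye(U.shape[1])))     # U^T U = I
    print(np.allclose(Vt @ Vt.T, np.eye(Vt.shape[0])))  # V^T V = I
    print(np.all(s >= 0), np.all(np.diff(s) <= 0))      # nonnegative, sorted decreasing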

  11. A = U Σ V^T, example. Rows of A are documents (the first four are CS papers, the last three are MD papers); columns are the terms data, inf., retrieval, brain, lung:

        A = [ 1 1 1 0 0
              2 2 2 0 0
              1 1 1 0 0
              5 5 5 0 0
              0 0 0 2 2
              0 0 0 3 3
              0 0 0 1 1 ]

        U = [ 0.18 0
              0.36 0
              0.18 0
              0.90 0
              0    0.53
              0    0.80
              0    0.27 ]

        Σ = [ 9.64 0
              0    5.29 ]

        V^T = [ 0.58 0.58 0.58 0    0
                0    0    0    0.71 0.71 ]
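For this concrete example, a short NumPy sketch reproduces the numbers above (up to possible sign flips, since the columns of U and V are only determined up to a common sign; the remaining singular values come out as numerical zeros because A has rank 2):

    import numpy as np

    # Document-term matrix from the slide: rows = documents (4 CS, 3 MD),
    # columns = terms (data, inf., retrieval, brain, lung).
    A = np.array([[1, 1, 1, 0, 0],
                  [2, 2, 2, 0, 0],
                  [1, 1, 1, 0, 0],
                  [5, 5, 5, 0, 0],
                  [0, 0, 0, 2, 2],
                  [0, 0, 0, 3, 3],
                  [0, 0, 0, 1, 1]], dtype=float)

    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    print(np.round(s, 2))         # [9.64 5.29 0.   0.   0.  ]
    print(np.round(U[:, :2], 2))  # ~[0.18 0.36 0.18 0.90 0 0 0] and [0 0 0 0 0.53 0.80 0.27]
    print(np.round(Vt[:2], 2))    # ~[0.58 0.58 0.58 0 0] and [0 0 0 0.71 0.71]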

  12. Same example: the first column of U (and of V) corresponds to a 'CS-concept', the second to an 'MD-concept'.

  13. Same example: U is the document-to-concept similarity matrix.

  14. Same example: the singular value 9.64 is the 'strength' of the CS-concept.

  15. Same example: V is the term-to-concept similarity matrix; its first column is the CS-concept.


  17. 'Documents', 'terms' and 'concepts': U is the document-to-concept similarity matrix; V is the term-to-concept similarity matrix; the diagonal elements of Σ give the 'strength' of each concept.
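One way these similarity matrices get used in LSI, sketched under the assumption that a query is just another term vector (the query below and the choice of cosine similarity are illustrative, not from the slides): map the query into concept space with V, map the documents there with U Σ, and compare them.

    import numpy as np

    A = np.array([[1, 1, 1, 0, 0], [2, 2, 2, 0, 0], [1, 1, 1, 0, 0], [5, 5, 5, 0, 0],
                  [0, 0, 0, 2, 2], [0, 0, 0, 3, 3], [0, 0, 0, 1, 1]], dtype=float)
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    k = 2                                    # keep the two real concepts

    q = np.array([0, 0, 1, 0, 0], float)     # hypothetical query: only the term "retrieval"
    q_concept = q @ Vt[:k].T                 # query coordinates in concept space (q V)
    doc_concept = U[:, :k] * s[:k]           # document coordinates in concept space (U Sigma)

    # Cosine similarity between the query and every document, computed in concept space.
    sims = doc_concept @ q_concept / (
        np.linalg.norm(doc_concept, axis=1) * np.linalg.norm(q_concept))
    print(np.round(sims, 2))                 # CS documents score ~1, MD documents ~0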

  18. SVD gives the best axis to project on: the first singular vector v1 ('best' = minimum sum of squared projection errors, i.e., minimum RMS error).
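A hedged illustration of that claim, assuming NumPy and the example matrix from slide 11 (the random comparison direction is arbitrary, not from the slides): projecting the rows of A onto v1 loses less, in squared error, than projecting onto some other unit vector.

    import numpy as np

    A = np.array([[1, 1, 1, 0, 0], [2, 2, 2, 0, 0], [1, 1, 1, 0, 0], [5, 5, 5, 0, 0],
                  [0, 0, 0, 2, 2], [0, 0, 0, 3, 3], [0, 0, 0, 1, 1]], dtype=float)
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    v1 = Vt[0]                                     # the claimed best projection axis

    def projection_error(A, direction):
        # Sum of squared errors left after projecting every row of A onto `direction`.
        proj = np.outer(A @ direction, direction)
        return np.sum((A - proj) ** 2)

    rng = np.random.default_rng(0)
    other = rng.normal(size=5)
    other /= np.linalg.norm(other)                 # arbitrary unit vector for comparison

    print(projection_error(A, v1))                 # smallest possible: sum of sigma_i^2 for i >= 2 (~27.98)
    print(projection_error(A, other))              # larger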

  19. Same example: v1, the first row of V^T, is that best projection axis.

  20. Same example: the corresponding singular value reflects the variance ('spread') of the data along the v1 axis.

  21. Same example: U Σ gives the coordinates of the points (documents) along the projection axes.
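A quick sketch of that identity with NumPy and the same example matrix: since V is column-orthonormal, A V = U Σ, so either expression gives each document's coordinates in concept space.

    import numpy as np

    A = np.array([[1, 1, 1, 0, 0], [2, 2, 2, 0, 0], [1, 1, 1, 0, 0], [5, 5, 5, 0, 0],
                  [0, 0, 0, 2, 2], [0, 0, 0, 3, 3], [0, 0, 0, 1, 1]], dtype=float)
    U, s, Vt = np.linalg.svd(A, full_matrices=False)

    coords_via_V = A @ Vt.T                  # project the documents onto the concept axes
    coords_via_U = U * s                     # U Sigma (s broadcasts across the columns of U)

    print(np.allclose(coords_via_V, coords_via_U))  # True
    print(np.round(coords_via_U[:, :2], 2))         # coordinates on the two real concepts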

  22. More details. Q: how exactly is dimensionality reduction done?

  23. More details. Q: how exactly is dimensionality reduction done? A: set the smallest singular values to zero.
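A minimal sketch of that rule, assuming NumPy: keep the k largest singular values and zero out the rest (equivalently, keep only the first k columns of U and V), then rebuild the approximation.

    import numpy as np

    def truncated_svd_approx(A, k):
        # Rank-k approximation: drop all but the k largest singular values.
        U, s, Vt = np.linalg.svd(A, full_matrices=False)
        return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

    A = np.array([[1, 1, 1, 0, 0], [2, 2, 2, 0, 0], [1, 1, 1, 0, 0], [5, 5, 5, 0, 0],
                  [0, 0, 0, 2, 2], [0, 0, 0, 3, 3], [0, 0, 0, 1, 1]], dtype=float)

    A1 = truncated_svd_approx(A, k=1)
    print(np.round(A1, 2))                   # the rank-1 approximation shown on slide 27
    print(np.linalg.norm(A - A1))            # Frobenius error = the dropped singular value, ~5.29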

  24. Same example with the smallest singular value set to zero, so Σ becomes [9.64 0; 0 0].


  26. Keeping only the nonzero singular value leaves a rank-1 factorization: A ≈ u1 σ1 v1^T, with u1 = [0.18, 0.36, 0.18, 0.90, 0, 0, 0]^T, σ1 = 9.64, and v1^T = [0.58, 0.58, 0.58, 0, 0].

  27. The resulting rank-1 approximation: the CS rows of A are reproduced exactly, while the MD rows collapse to zero:

        A = [ 1 1 1 0 0          A ≈ [ 1 1 1 0 0
              2 2 2 0 0                2 2 2 0 0
              1 1 1 0 0                1 1 1 0 0
              5 5 5 0 0                5 5 5 0 0
              0 0 0 2 2                0 0 0 0 0
              0 0 0 3 3                0 0 0 0 0
              0 0 0 1 1 ]              0 0 0 0 0 ]
