

  1. CS345a: Data Mining. Jure Leskovec and Anand Rajaraman, Stanford University.

  2. HW3 is out. Poster session is on the last day of classes: Thu March 11 at 4:15. Reports are due March 14. Final is March 18 at 12:15; open book, open notes, no laptop.

  3. Which is the best linear separator? Data: examples (x_1, y_1), …, (x_n, y_n). Example i: x_i = (x_i^(1), …, x_i^(d)), with label y_i ∈ {−1, +1}. Inner product: w · x = Σ_j w^(j) x^(j).

  4. Confidence with respect to the separating plane w · x = 0: for each datapoint i, γ_i = (w · x_i) y_i.
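A small worked example of this confidence (the toy points, labels, and weight vector below are invented for illustration, not taken from the slides):

    import numpy as np

    # Hypothetical toy data: four 2-D points with labels in {-1, +1}.
    X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -0.5], [-2.0, -1.5]])
    y = np.array([1, 1, -1, -1])
    w = np.array([1.0, 1.0])      # a candidate separator w . x = 0

    confidence = y * (X @ w)      # gamma_i = (w . x_i) y_i, one value per point
    print(confidence)             # all positive => every point lies on its correct side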

  5. Maximize the margin: good according to intuition, theory, and practice. max_{γ, w} γ  s.t. ∀i, y_i (w · x_i) ≥ γ.

  6. Canonical hyperplanes: projecting x_i onto the plane w · x = 0 gives its distance from the plane, |w · x_i| / ||w||.

  7. Maximizing the margin: max_{γ, w} γ  s.t. ∀i, y_i (w · x_i) ≥ γ. Equivalent: min_w ½||w||²  s.t. ∀i, y_i (w · x_i) ≥ 1. This is the SVM with "hard" constraints.
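As a concrete illustration of the hard-constraint program (a minimal sketch assuming the third-party cvxpy solver, which the slides do not mention; there is no bias term, matching the w · x = 0 hyperplanes above):

    import numpy as np
    import cvxpy as cp

    def hard_margin_svm(X, y):
        """Solve  min ||w||^2  s.t.  y_i (w . x_i) >= 1  for all i."""
        n, d = X.shape
        w = cp.Variable(d)
        constraints = [cp.multiply(y, X @ w) >= 1]
        cp.Problem(cp.Minimize(cp.sum_squares(w)), constraints).solve()
        return w.value            # None if the data are not linearly separable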

  8. If the data are not separable, introduce a penalty: min_w ½||w||² + C · (number of mistakes)  s.t. ∀i, y_i (w · x_i) ≥ 1. Choose C based on cross-validation. How should mistakes be penalized?

  9. Introduce slack variables ξ_i: min_{w, ξ_i ≥ 0} ½||w||² + C Σ_{i=1}^{n} ξ_i  s.t. ∀i, y_i (w · x_i) ≥ 1 − ξ_i. This is the hinge loss: for each datapoint, if the margin is greater than 1, don't care; if the margin is less than 1, pay a linear penalty.

  10. SVM in the "natural" form: argmin_w f(w), where f(w) = ½ w · w + C Σ_{i=1}^{n} max{0, 1 − y_i (x_i · w)}.
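A minimal sketch of evaluating this objective in NumPy (function and variable names are illustrative, not from the slides):

    import numpy as np

    def svm_objective(w, X, y, C):
        """f(w) = 1/2 w.w + C * sum_i max(0, 1 - y_i (x_i . w)), the 'natural' form above."""
        margins = y * (X @ w)
        hinge = np.maximum(0.0, 1.0 - margins)
        return 0.5 * (w @ w) + C * hinge.sum()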

  11. Two ways to minimize this. (1) Use a quadratic solver: minimize the quadratic function min_{w, ξ_i ≥ 0} ½ w · w + C Σ_{i=1}^{n} ξ_i subject to the linear constraints ∀i, y_i (x_i · w) ≥ 1 − ξ_i. (2) Stochastic gradient descent: minimize f(w) = ½ w · w + C Σ_{i=1}^{n} max{0, 1 − y_i (x_i · w)} directly, with updates w_{t+1} = w_t − η_t ∂L(w, x, y)/∂w, where L is the per-example loss.
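Below is a minimal sketch of the stochastic-gradient approach for this objective, taking one example at a time; the constant learning rate and the way the regularizer is spread across the per-example updates are illustrative choices, not prescribed by the slide:

    import numpy as np

    def svm_sgd(X, y, C, eta=0.01, epochs=10, seed=0):
        """SGD on f(w) = 1/2 w.w + C sum_i max(0, 1 - y_i (x_i . w))."""
        rng = np.random.default_rng(seed)
        m, d = X.shape
        w = np.zeros(d)
        for _ in range(epochs):
            for i in rng.permutation(m):
                grad = w / m                      # this example's share of the 1/2 w.w term
                if y[i] * (X[i] @ w) < 1:         # hinge term is active for this example
                    grad -= C * y[i] * X[i]
                w -= eta * grad                   # w_{t+1} = w_t - eta * subgradient
        return w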

  12. Example by Leon Bottou: Reuters RCV1 document corpus, with m = 781k training examples, 23k test examples, and d = 50k features. Training time:

  13. (Figure: training-time results.)

  14. What if we subsample the dataset? Compare SGD on the full dataset vs. conjugate gradient on n training examples.

  15. Need to choose the learning rate η: w_{t+1} = w_t − η_t ∇L(w_t). Leon suggests: select a small subsample, try various rates η, pick the one that most reduces the loss, and use that η for the next 100k iterations on the full dataset.
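A sketch of this rate-selection heuristic, reusing the svm_sgd and svm_objective sketches above; the candidate rates and subsample size are arbitrary illustrative values:

    import numpy as np

    def pick_learning_rate(X, y, C, candidate_etas=(1.0, 0.1, 0.01, 0.001),
                           subsample_size=1000, seed=0):
        """Try each rate for one pass over a small random subsample and keep
        the rate that most reduces the objective."""
        rng = np.random.default_rng(seed)
        idx = rng.choice(len(X), size=min(subsample_size, len(X)), replace=False)
        Xs, ys = X[idx], y[idx]
        best_eta, best_loss = None, np.inf
        for eta in candidate_etas:
            w = svm_sgd(Xs, ys, C, eta=eta, epochs=1)
            loss = svm_objective(w, Xs, ys, C)
            if loss < best_loss:
                best_eta, best_loss = eta, loss
        return best_eta                           # then use this eta on the full dataset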

  16. Stopping criteria: how many iterations of SGD? Early stopping with cross-validation: create a validation set, monitor the cost function on it, and stop when the loss stops decreasing. Early stopping a priori: extract two disjoint subsamples A and B of the training data, determine the number of epochs k by training on A and stopping by validating on B, then train for k epochs on the full dataset.
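A sketch of the "a priori" variant, again reusing svm_sgd and svm_objective from above; retraining from scratch for each candidate k keeps the sketch simple, whereas in practice one would keep training incrementally:

    import numpy as np

    def choose_num_epochs(X, y, C, eta, max_epochs=50, seed=0):
        """Split off two disjoint subsamples A and B, train on A for k epochs,
        and stop increasing k when the objective on B stops decreasing."""
        rng = np.random.default_rng(seed)
        idx = rng.permutation(len(X))
        a, b = idx[: len(X) // 2], idx[len(X) // 2 :]
        best_k, best_loss = 1, np.inf
        for k in range(1, max_epochs + 1):
            w = svm_sgd(X[a], y[a], C, eta=eta, epochs=k)
            loss = svm_objective(w, X[b], y[b], C)
            if loss >= best_loss:                 # validation loss stopped improving
                break
            best_k, best_loss = k, loss
        return best_k                             # then train for best_k epochs on all data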

  17. Kernel function: K(x_i, x_j) = Φ(x_i) · Φ(x_j). Does the SVM kernel trick still work with SGD? Yes, but not without a price: represent w with its kernel expansion w = Σ_i α_i Φ(x_i). The gradient dL(w)/dw is then a combination of the Φ(x_j), so the update at epoch t is carried out on the coefficients instead: α_{t+1} = (1 − η) α_t + (the current example's contribution).
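In other words, w is never formed explicitly: predictions and updates only touch the coefficients α and kernel evaluations. A minimal sketch of the decision value under the kernel expansion (the RBF kernel and all names here are illustrative choices, not from the slides):

    import numpy as np

    def rbf_kernel(a, b, gamma=0.5):
        """K(a, b) = exp(-gamma * ||a - b||^2)."""
        return np.exp(-gamma * np.sum((a - b) ** 2))

    def kernel_decision(x, X_train, alpha, kernel=rbf_kernel):
        """w . Phi(x) with w = sum_i alpha_i Phi(x_i), i.e. sum_i alpha_i K(x_i, x)."""
        return sum(a_i * kernel(x_i, x) for a_i, x_i in zip(alpha, X_train))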

  18. [Shalev-Shwartz et al., ICML '07] We had before: min_w ½ w · w + C Σ_{i=1}^{n} max{0, 1 − y_i (x_i · w)}. We can replace C with λ and minimize (λ/2) ||w||² + (1/n) Σ_{i=1}^{n} max{0, 1 − y_i (x_i · w)} instead.

  19. [Shalev-Shwartz et al., ICML '07] Pegasos works on a subset A_t of the training set S at each step: with |A_t| = 1 it is stochastic gradient descent; with |A_t| = |S| it is the subgradient method. Each update is a subgradient step followed by a projection.

  20. [Shalev-Shwartz et al., ICML '07] Choose |A_t| = 1 and a linear kernel over R^n. Theorem [Shalev-Shwartz et al. '07]: the run-time required for Pegasos to find an ε-accurate solution with probability > 1 − δ depends on the number of features n, does not depend on the number of examples m, and depends on the "difficulty" of the problem (λ and ε).
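A minimal sketch of the single-example (|A_t| = 1) Pegasos update with a linear kernel, using λ in place of C as on slide 18; the step size η_t = 1/(λt) and the optional projection onto the ball of radius 1/√λ follow the Pegasos paper, and the remaining details are illustrative:

    import numpy as np

    def pegasos(X, y, lam=0.01, T=100_000, seed=0):
        """Single-example Pegasos for the linear SVM."""
        rng = np.random.default_rng(seed)
        m, d = X.shape
        w = np.zeros(d)
        for t in range(1, T + 1):
            i = rng.integers(m)
            eta = 1.0 / (lam * t)                 # eta_t = 1 / (lambda * t)
            if y[i] * (X[i] @ w) < 1:             # margin violated: hinge subgradient active
                w = (1 - eta * lam) * w + eta * y[i] * X[i]
            else:
                w = (1 - eta * lam) * w
            norm = np.linalg.norm(w)
            if norm > 0:                          # optional projection step
                w *= min(1.0, 1.0 / (np.sqrt(lam) * norm))
        return w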

  21. SVM and structured output prediction. Setting: assume the data are i.i.d. draws from a joint distribution P(X, Y); given a training sample (x_1, y_1), …, (x_n, y_n); goal: find a function from the input space X to the output space Y, where the outputs are complex objects.

  22. Examples: natural language parsing. Given a sequence of words x, predict the parse tree y. Dependencies come from structural constraints, since y has to be a tree. (Slide example: x = "The dog chased the cat", y = the parse tree S → NP VP with NP → Det N and VP → V NP.)

  23. Approach: view it as a multi-class classification task in which every complex output is one class. Problems: exponentially many classes! How to predict efficiently? How to learn efficiently? Potentially huge model! Is there a manageable number of features? (Slide illustration: candidate parse trees y_1, y_2, …, y_k for the same sentence x.)

  24. A feature vector Ψ(x, y) describes the match between x and y. Learn a single weight vector w and rank outputs by w · Ψ(x, y); this leads to a hard-margin optimization problem.
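Conceptually, prediction with the learned w picks the highest-scoring output, h(x) = argmax_y w · Ψ(x, y). A sketch over an explicitly enumerated (small) candidate set; real structured predictors replace the enumeration with a problem-specific argmax, and psi here is a hypothetical joint feature map supplied by the caller:

    import numpy as np

    def predict_structured(x, candidates, w, psi):
        """h(x) = argmax over candidate outputs y of  w . psi(x, y)."""
        scores = [w @ psi(x, y) for y in candidates]
        return candidates[int(np.argmax(scores))]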

  25. [Yue et al., SIGIR '07] Ranking: given a query x, predict a ranking y. There are dependencies between results (e.g., avoid redundant hits), and the loss function is defined over whole rankings (e.g., AvgPrec). Slide example, query x = "SVM": 1. Kernel-Machines, 2. SVM-Light, 3. Learning with Kernels, 4. SV Meppen Fan Club, 5. Service Master & Co., 6. School of Volunteer Management, 7. SV Mattersburg Online, …
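The slide's example loss, average precision, can be computed directly from a predicted ranking and the set of relevant results; a small sketch (names and the usage example are illustrative):

    def average_precision(ranking, relevant):
        """Mean of precision@k over the positions k where a relevant item appears."""
        hits, precisions = 0, []
        for k, item in enumerate(ranking, start=1):
            if item in relevant:
                hits += 1
                precisions.append(hits / k)
        return sum(precisions) / len(relevant) if relevant else 0.0

    # average_precision(["SVM-Light", "SV Meppen Fan Club", "Kernel-Machines"],
    #                   {"SVM-Light", "Kernel-Machines"})  ->  (1/1 + 2/3) / 2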
