CS345a: Data Mining
Jure Leskovec and Anand Rajaraman, Stanford University

HW3 is out.
Poster session is on the last day of classes: Thu March 11 at 4:15. Reports are due March 14.
Final is March 18 at 12:15. Open book, open notes, no laptop.

Which is the best linear separator?
Data: examples (x_1, y_1), ..., (x_n, y_n)
Example i: x_i = (x_i^(1), ..., x_i^(d)), with label y_i ∈ {-1, +1}
Inner product: w · x = Σ_j w^(j) x^(j)

Confidence
The separating hyperplane is w · x = 0.
Confidence of the prediction on example i: γ_i = (w · x_i) y_i, defined for all datapoints.

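As a small illustration, here is a minimal sketch of computing this confidence in NumPy; the weight vector and the two toy examples are my own illustrative values, not from the lecture.

```python
import numpy as np

# Minimal sketch: the "confidence" of separator w on example (x_i, y_i)
# is gamma_i = y_i * (w . x_i); it is positive iff the point is classified correctly.
w = np.array([2.0, -1.0])                      # hypothetical weight vector
X = np.array([[1.0, 0.5], [0.2, 1.5]])         # two toy examples
y = np.array([+1, -1])

confidence = y * (X @ w)                       # gamma_i for each example
print(confidence)                              # both positive -> both on the correct side
```
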
Maximize the margin
Maximizing the margin γ is good according to intuition, theory, and practice:
max_{γ, w} γ   s.t. ∀i: y_i (x_i · w) ≥ γ

Canonical hyperplanes
Projection of x_i onto the plane w · x = 0: the distance of x_i from the plane is (w · x_i) / ||w||.

Maximizing the margin:
max_{γ, w} γ   s.t. ∀i: y_i (x_i · w) ≥ γ
Equivalent formulation:
min_w ||w||²   s.t. ∀i: y_i (x_i · w) ≥ 1
This is the SVM with "hard" constraints.

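Below is a minimal sketch of solving this hard-margin problem directly as a quadratic program, assuming the cvxpy package and a small linearly separable toy dataset; both are my choices for illustration, not the solver used in the lecture.

```python
import numpy as np
import cvxpy as cp

# Minimal sketch of the hard-margin SVM as a quadratic program on toy data.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -1.0], [-3.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(2)
problem = cp.Problem(cp.Minimize(cp.sum_squares(w)),        # min ||w||^2
                     [cp.multiply(y, X @ w) >= 1])          # y_i (x_i . w) >= 1 for all i
problem.solve()
print(w.value)    # maximum-margin separator through the origin
```
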
If the data is not separable, introduce a penalty:
min_w (1/2) w · w + C · (number of mistakes)   s.t. ∀i: y_i (x_i · w) ≥ 1
Choose C based on cross-validation.
How should mistakes be penalized?

Introduce slack variables ξ_i:
min_{w, ξ_i ≥ 0} (1/2) w · w + C Σ_{i=1}^n ξ_i   s.t. ∀i: y_i (x_i · w) ≥ 1 − ξ_i
This is the hinge loss. For each datapoint: if the margin is > 1, don't care; if the margin is < 1, pay a linear penalty.

SVM in the "natural" form:
argmin_w f(w), where
f(w) = (1/2) w · w + C Σ_{i=1}^n max{0, 1 − y_i (x_i · w)}

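As a concrete check of this objective, here is a minimal sketch of evaluating f(w) in NumPy; the helper name svm_objective and the toy values are my own, not from the lecture.

```python
import numpy as np

# Minimal sketch: evaluate f(w) = 0.5 * w.w + C * sum_i max(0, 1 - y_i (x_i . w)).
def svm_objective(w, X, y, C):
    hinge = np.maximum(0.0, 1.0 - y * (X @ w))   # per-example hinge loss
    return 0.5 * (w @ w) + C * hinge.sum()

w = np.array([1.0, -0.5])
X = np.array([[2.0, 0.0], [0.5, 0.5], [-1.0, 1.0]])
y = np.array([+1, +1, -1])
print(svm_objective(w, X, y, C=1.0))   # 0.625 + 0.75 = 1.375
```
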
Use a quadratic solver:
min_{w, ξ_i ≥ 0} (1/2) w · w + C Σ_{i=1}^n ξ_i   s.t. ∀i: y_i (x_i · w) ≥ 1 − ξ_i
This minimizes a quadratic function subject to linear constraints.
Or use stochastic gradient descent:
Minimize f(w) = (1/2) w · w + C Σ_{i=1}^n max{0, 1 − y_i (x_i · w)}
Update: w_{t+1} = w_t − η_t ∇_w f(w_t), using the gradient of the per-example loss L(w, x, y).

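Here is a minimal sketch of the stochastic gradient approach; the learning-rate schedule, the default C, and the n-scaling of the per-example term (which makes it an unbiased estimate of the full sum) are my own illustrative choices, not prescribed by the lecture.

```python
import numpy as np

# Minimal sketch of stochastic subgradient descent on
#   f(w) = 0.5 * w.w + C * sum_i max(0, 1 - y_i (x_i . w)).
def svm_sgd(X, y, C=1.0, epochs=10, eta0=0.1, w_init=None):
    n, d = X.shape
    w = np.zeros(d) if w_init is None else w_init.copy()
    t = 0
    for _ in range(epochs):
        for i in np.random.permutation(n):
            t += 1
            eta = eta0 / (1.0 + eta0 * t)          # decreasing step size
            if y[i] * (X[i] @ w) < 1:              # hinge loss is active for example i
                grad = w - C * n * y[i] * X[i]     # n-scaled term: unbiased estimate of the sum
            else:
                grad = w
            w -= eta * grad
    return w

# Example usage (hypothetical data): w = svm_sgd(X_train, y_train, C=0.1, epochs=20)
```
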
Example by Leon Bottou:
Reuters RCV1 document corpus
m = 781k training examples, 23k test examples
d = 50k features
Training time: [figure]

What if we subsample the dataset?
SGD on the full dataset vs. conjugate gradient on n training examples.

We need to choose the learning rate η in
w_{t+1} = w_t − η_t ∇L(w_t)
Leon suggests:
Select a small subsample.
Try various learning rates on it.
Pick the rate that most reduces the loss.
Use that rate for the next 100k iterations on the full dataset.

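Below is a minimal sketch of that heuristic. It reuses the hypothetical svm_sgd and svm_objective helpers from the earlier sketches; the candidate rates and subsample size are illustrative.

```python
import numpy as np

# Minimal sketch: try several learning rates on a small random subsample and
# keep the one that most reduces the objective.
def pick_learning_rate(X, y, candidate_rates=(1.0, 0.3, 0.1, 0.03, 0.01),
                       subsample_size=1000, C=1.0):
    idx = np.random.choice(len(X), size=min(subsample_size, len(X)), replace=False)
    Xs, ys = X[idx], y[idx]
    best_rate, best_obj = None, np.inf
    for eta0 in candidate_rates:
        w = svm_sgd(Xs, ys, C=C, epochs=1, eta0=eta0)   # one cheap pass per rate
        obj = svm_objective(w, Xs, ys, C)
        if obj < best_obj:
            best_rate, best_obj = eta0, obj
    return best_rate                                     # then use it on the full dataset
```
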
Stopping criteria: how many iterations of SGD?
Early stopping with cross-validation:
Create a validation set.
Monitor the cost function on the validation set.
Stop when the loss stops decreasing.
Early stopping a priori:
Extract two disjoint subsamples A and B of the training data.
Determine the number of epochs k by training on A and validating on B.
Train for k epochs on the full dataset.

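Here is a minimal sketch of the first variant (early stopping against a validation set), again using the hypothetical svm_sgd and svm_objective helpers from the earlier sketches; the 90/10 split and the patience value are my own illustrative choices.

```python
import numpy as np

# Minimal sketch of early stopping: train one epoch at a time and stop when
# the validation loss stops decreasing.
def train_with_early_stopping(X, y, C=1.0, eta0=0.1, max_epochs=100, patience=3):
    n, d = X.shape
    split = int(0.9 * n)                       # hold out 10% as a validation set
    X_tr, y_tr, X_val, y_val = X[:split], y[:split], X[split:], y[split:]

    w = np.zeros(d)
    best_loss, best_w, bad_epochs = np.inf, w.copy(), 0
    for epoch in range(max_epochs):
        w = svm_sgd(X_tr, y_tr, C=C, epochs=1, eta0=eta0, w_init=w)
        val_loss = svm_objective(w, X_val, y_val, C)
        if val_loss < best_loss:
            best_loss, best_w, bad_epochs = val_loss, w.copy(), 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:         # loss stopped decreasing: stop early
                break
    return best_w
```
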
Kernel function: K(x_i, x_j) = φ(x_i) · φ(x_j)
Does the SVM kernel trick still work with SGD? Yes, but not without a price:
Represent w by its kernel expansion: w = Σ_i α_i φ(x_i).
Usually the gradient of the loss on example j is a multiple of φ(x_j): ∂L(w)/∂w = −β φ(x_j).
So updating w at epoch t amounts to combining a shrink of all coefficients with a change to the single coefficient α_j: α_{t+1} = (1 − η_t) α_t + (a term affecting only α_j).

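The following is a minimal sketch of one way such a kernel-expansion update can be implemented; the λ-regularized per-example hinge objective, the RBF kernel, and all parameter values are my own illustrative choices and not necessarily the variant discussed in the lecture.

```python
import numpy as np

# Minimal sketch of SGD with a kernel expansion w = sum_i alpha_i phi(x_i),
# for a per-example objective (lambda/2)||w||^2 + hinge loss.
def rbf_kernel(a, b, gamma=0.5):
    return np.exp(-gamma * np.sum((a - b) ** 2))

def kernel_svm_sgd(X, y, lam=0.01, eta=0.1, epochs=5, kernel=rbf_kernel):
    n = len(X)
    alpha = np.zeros(n)                       # coefficients of the kernel expansion
    for _ in range(epochs):
        for j in np.random.permutation(n):
            # f(x_j) = w . phi(x_j) = sum_i alpha_i K(x_i, x_j)
            f_j = sum(alpha[i] * kernel(X[i], X[j]) for i in range(n) if alpha[i] != 0.0)
            alpha *= (1.0 - eta * lam)        # shrink every coefficient (regularizer step)
            if y[j] * f_j < 1:                # hinge loss active: only alpha_j changes,
                alpha[j] += eta * y[j]        # since the loss gradient is a multiple of phi(x_j)
    return alpha
```
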
[Shalev-Shwartz et al., ICML '07]
We had before:
min_w (1/2) w · w + C Σ_{i=1}^n max{0, 1 − y_i (x_i · w)}
We can replace C with a regularization parameter λ:
min_w (λ/2) ||w||² + (1/m) Σ_{i=1}^m max{0, 1 − y_i (x_i · w)}

[Shalev-Shwartz et al., ICML '07]
|A_t| = 1: stochastic gradient.
|A_t| = S: subgradient method.
Each iteration combines a subgradient step with a projection step.

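Below is a minimal sketch of the |A_t| = 1 case with a linear kernel, following the step size η_t = 1/(λt) and the projection onto the ball of radius 1/√λ described in the Pegasos paper; the default λ and iteration count are illustrative.

```python
import numpy as np

# Minimal sketch of Pegasos with |A_t| = 1 and a linear kernel.
def pegasos(X, y, lam=0.01, iterations=100000):
    n, d = X.shape
    w = np.zeros(d)
    for t in range(1, iterations + 1):
        i = np.random.randint(n)               # pick one example at random (|A_t| = 1)
        eta = 1.0 / (lam * t)
        if y[i] * (X[i] @ w) < 1:              # hinge loss active: full subgradient step
            w = (1 - eta * lam) * w + eta * y[i] * X[i]
        else:                                  # only the regularizer contributes
            w = (1 - eta * lam) * w
        radius = 1.0 / np.sqrt(lam)
        norm = np.linalg.norm(w)
        if norm > radius:                      # projection onto the ball of radius 1/sqrt(lam)
            w *= radius / norm
    return w
```
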
[Shalev-Shwartz et al., ICML '07]
Choosing |A_t| = 1 and a linear kernel over R^n:
Theorem [Shalev-Shwartz et al. '07]: the run-time required for Pegasos to find an ε-accurate solution with probability > 1 − δ
depends on the number of features n,
does not depend on the number of examples m,
depends on the "difficulty" of the problem (ε and λ).

SVM and structured output prediction
Setting:
Assume: the data is i.i.d. from some (unknown) distribution over inputs and outputs.
Given: a training sample of (x, y) pairs.
Goal: find a function from the input space X to the output space Y, where the outputs y are complex objects.

Examples: natural language parsing
Given a sequence of words x, predict the parse tree y.
Dependencies come from structural constraints, since y has to be a tree.
Example: for x = "The dog chased the cat", y is the parse tree (S (NP Det N) (VP V (NP Det N))).

Approach: view it as a multi-class classification task, where every complex output y is one class.
Problems:
Exponentially many classes!
How to predict efficiently? How to learn efficiently?
Potentially huge model! Manageable number of features?
(The slide shows several candidate parse trees y_1, y_2, ..., y_k for x = "The dog chased the cat".)

The feature vector Ψ(x, y) describes how well the output y matches the input x.
Learn a single weight vector w and rank candidate outputs by w · Ψ(x, y).
Hard-margin optimization problem:
min_w (1/2) ||w||²   s.t. ∀i, ∀y ≠ y_i: w · Ψ(x_i, y_i) ≥ w · Ψ(x_i, y) + 1

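As a tiny illustration of the ranking step, here is a minimal prediction sketch; the names psi (the joint feature map) and candidates (an enumerator of possible outputs) are hypothetical stand-ins. In practice the argmax is computed with a problem-specific algorithm rather than by enumeration, since the number of outputs is exponential.

```python
import numpy as np

# Minimal sketch of structured prediction: score every candidate output y
# by w . Psi(x, y) and return the highest-scoring one.
def predict(x, w, candidates, psi):
    return max(candidates(x), key=lambda y: w @ psi(x, y))
```
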
[Yue et al., SIGIR '07]
Ranking: given a query x, predict a ranking y.
Dependencies between results (e.g., avoid redundant hits).
Loss function over rankings (e.g., Average Precision).
Example: for x = "SVM", y =
1. Kernel-Machines
2. SVM-Light
3. Learning with Kernels
4. SV Meppen Fan Club
5. Service Master & Co.
6. School of Volunteer Management
7. SV Mattersburg Online
...

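For reference, here is a minimal sketch of the Average Precision measure mentioned above; the relevance vector in the example is made up for illustration.

```python
import numpy as np

# Minimal sketch of Average Precision for a predicted ranking.
# `relevant` is a 0/1 vector marking which ranked documents are relevant.
def average_precision(relevant):
    relevant = np.asarray(relevant, dtype=float)
    hits = np.cumsum(relevant)                          # relevant docs seen so far
    precisions = hits / np.arange(1, len(relevant) + 1) # precision at each rank
    denom = relevant.sum()
    return float((precisions * relevant).sum() / denom) if denom else 0.0

# Example: the first three results are relevant, the rest are not.
print(average_precision([1, 1, 1, 0, 0, 0, 0]))   # 1.0
```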