  1. CS345a: Data Mining. Jure Leskovec and Anand Rajaraman, Stanford University.

  2. Would like to do prediction: learn a function y = f(x), where y can be:
     - Real: regression
     - Categorical: classification
     - More complex: ranking, structured prediction, etc.
     The data is labeled: we have many pairs (x, y).

  3. We will talk about the following methods:
     - k-Nearest Neighbor (instance-based learning)
     - Perceptron algorithm
     - Support Vector Machines
     - Decision trees (lecture on Thursday by Sugato Basu from Google)
     How do we efficiently train (build a model)?

  4. Instance-based learning. Example: nearest neighbor.
     - Keep the whole training dataset: pairs (x, y).
     - A query example x' comes in.
     - Find the closest training example(s) x*.
     - Predict y*.

  5. To make things work we need 4 things:
     - Distance metric: Euclidean.
     - How many neighbors to look at? One.
     - Weighting function (optional): unused.
     - How to fit with the local points? Just predict the same output as the nearest neighbor.
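A minimal sketch of this 1-nearest-neighbor recipe in Python (Euclidean distance, a single neighbor, no weighting). The function names and the toy data are illustrative, not from the slides:

    import math

    def euclidean(a, b):
        # Distance metric: plain Euclidean distance between two feature vectors.
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

    def nn_predict(train, query):
        # train: the whole training set, kept as a list of (x, y) pairs.
        # Predict the label of the single closest training example.
        x_star, y_star = min(train, key=lambda xy: euclidean(xy[0], query))
        return y_star

    # Tiny usage example with made-up points.
    train = [((0.0, 0.0), -1), ((1.0, 1.0), +1), ((0.9, 0.8), +1)]
    print(nn_predict(train, (0.2, 0.1)))   # -> -1 (nearest point is (0, 0))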

  6. Suppose x_1, ..., x_m are two-dimensional: x_1 = (x_11, x_12), x_2 = (x_21, x_22), ...
     One can draw the nearest-neighbor regions induced by the distance metric, e.g.
         d(x_i, x_j) = (x_i1 - x_j1)^2 + (x_i2 - x_j2)^2      versus
         d(x_i, x_j) = (x_i1 - x_j1)^2 + (3 x_i2 - 3 x_j2)^2
     (scaling a coordinate changes which points are nearest, and hence the shape of the regions).
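To see the effect concretely, here is a small example (the points are made up) showing that scaling the second coordinate by 3 changes which candidate is nearer, and therefore reshapes the nearest-neighbor regions:

    def dist_sq(a, b, s=1.0):
        # Squared distance as on the slide; s scales the second coordinate
        # (s = 3 corresponds to the (3 x_i2 - 3 x_j2)^2 variant).
        return (a[0] - b[0]) ** 2 + (s * (a[1] - b[1])) ** 2

    p, q1, q2 = (0.0, 0.0), (1.0, 0.2), (0.5, 0.6)
    print(dist_sq(p, q1), dist_sq(p, q2))        # ~1.04 vs ~0.61: q2 is nearer
    print(dist_sq(p, q1, 3), dist_sq(p, q2, 3))  # ~1.36 vs ~3.49: now q1 is nearer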

  7. k-nearest neighbors:
     - Distance metric: Euclidean.
     - How many neighbors to look at? k (the figure on the slide uses k = 9).
     - Weighting function (optional): unused.
     - How to fit with the local points? Just predict the average output among the k nearest neighbors.
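A short sketch of this k-NN predictor, assuming the training set is again a plain list of (x, y) pairs; knn_predict and the default k = 9 are illustrative choices:

    import math

    def knn_predict(train, query, k=9):
        # Sort the stored (x, y) pairs by Euclidean distance to the query
        # and average the outputs of the k closest ones.
        def dist(x):
            return math.sqrt(sum((xi - qi) ** 2 for xi, qi in zip(x, query)))
        neighbors = sorted(train, key=lambda xy: dist(xy[0]))[:k]
        return sum(y for _, y in neighbors) / len(neighbors)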

  8. Kernel-weighted regression (all neighbors):
     - Distance metric: Euclidean.
     - How many neighbors to look at? All of them.
     - Weighting function: w_i = exp(-d(x_i, q)^2 / K_w), so points near the query q are weighted more strongly; K_w is the kernel width.
     - How to fit with the local points? Predict the weighted average Σ_i w_i y_i / Σ_i w_i.
     [Figure: fitted curves for kernel widths K_w = 10, 20, 80.]
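A sketch of the kernel-weighted average described above; the function name and the default kernel width are assumptions for illustration:

    import math

    def kernel_regression(train, query, kw=20.0):
        # Weight every stored example by w_i = exp(-d(x_i, q)^2 / K_w) and
        # return the weighted average sum(w_i * y_i) / sum(w_i).
        def d2(x):
            return sum((xi - qi) ** 2 for xi, qi in zip(x, query))
        weighted = [(math.exp(-d2(x) / kw), y) for x, y in train]
        return sum(w * y for w, y in weighted) / sum(w for w, _ in weighted)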

  9. Given: a set P of n points in R^d. Goal: given a query point q,
     - NN: find the nearest neighbor p of q in P.
     - Range search: find one/all points in P within distance r from q.
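Both queries have an obvious brute-force baseline, the linear scan from the next slide; a minimal sketch (function names are illustrative):

    import math

    def linear_scan_nn(points, q):
        # Nearest neighbor: check every point of P.
        return min(points, key=lambda p: math.dist(p, q))

    def linear_scan_range(points, q, r):
        # Range search: all points of P within distance r of q.
        return [p for p in points if math.dist(p, q) <= r]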

  10. Main memory:
      - Linear scan
      - Tree-based: quadtree, kd-tree
      - Hashing: Locality-Sensitive Hashing
      Secondary storage:
      - R-trees

  11. Quadtree: the simplest spatial structure on Earth!
      - Split the space into 2^d equal subsquares.
      - Repeat until done: only one pixel left, only one point left, or only a few points left.
      - Variants: split only one dimension at a time, which leads to kd-trees (in a moment).
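A compact 2-D construction sketch, assuming square cells stored as plain dicts; the node format, leaf_size, and the tiny-cell guard are choices of this sketch, not part of the slides:

    def build_quadtree(points, box, leaf_size=1):
        # box = (xmin, ymin, xmax, ymax): split the square into 2^d = 4 equal
        # subsquares and recurse until only a few points are left in a node.
        xmin, ymin, xmax, ymax = box
        if len(points) <= leaf_size or (xmax - xmin) < 1e-9:
            return {"box": box, "points": points}          # leaf node
        xmid, ymid = (xmin + xmax) / 2.0, (ymin + ymax) / 2.0
        boxes = {(0, 0): (xmin, ymin, xmid, ymid), (1, 0): (xmid, ymin, xmax, ymid),
                 (0, 1): (xmin, ymid, xmid, ymax), (1, 1): (xmid, ymid, xmax, ymax)}
        quadrants = {k: [] for k in boxes}
        for x, y in points:
            quadrants[(int(x >= xmid), int(y >= ymid))].append((x, y))
        children = [build_quadtree(pts, boxes[k], leaf_size)
                    for k, pts in quadrants.items() if pts]
        return {"box": box, "children": children}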

  12. Range search with a quadtree:
      - Put the root node on the stack.
      - Repeat:
        - Pop the next node T from the stack.
        - For each child C of T:
          - If C is a leaf, examine the point(s) in C.
          - If C intersects the ball of radius r around q, add C to the stack.
      Nearest neighbor:
      - Start a range search with r = ∞.
      - Whenever a point is found, update r.
      - Only investigate nodes with respect to the current r.
      Notes: great in 2 or 3 dimensions; space issues in higher dimensions, since each node has 2^d children.
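A sketch of this stack-based range search, reusing the dict node format of the quadtree sketch earlier (leaves carry "points", internal nodes carry "children"); the ball/box test clamps the query to the box:

    import math

    def range_search(root, q, r):
        def box_intersects_ball(box, q, r):
            xmin, ymin, xmax, ymax = box
            cx = min(max(q[0], xmin), xmax)     # closest point of the box to q
            cy = min(max(q[1], ymin), ymax)
            return math.dist((cx, cy), q) <= r

        found, stack = [], [root]               # put the root node on the stack
        while stack:
            node = stack.pop()                  # pop the next node
            if "points" in node:                # leaf: examine its point(s)
                found.extend(p for p in node["points"] if math.dist(p, q) <= r)
            else:                               # push children whose box meets the ball
                stack.extend(c for c in node["children"]
                             if box_intersects_ball(c["box"], q, r))
        return found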

  13. kd-trees. Main ideas [Bentley '75]:
      - Only one-dimensional splits.
      - Choose the split "carefully" (many variations).
      - Queries: as for quadtrees.
      Advantages:
      - No (or less) empty space.
      - Only linear space.
      Query time at most: min[d·n, exponential(d)].
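A minimal construction sketch with one-dimensional splits; cycling through the axes and splitting at the median is just one of the "many variations" of choosing the split, picked here for concreteness:

    def build_kdtree(points, depth=0):
        # Leaf: at most one point left.
        if len(points) <= 1:
            return {"points": points}
        axis = depth % len(points[0])           # cycle through the d coordinates
        pts = sorted(points, key=lambda p: p[axis])
        mid = len(pts) // 2                     # split at the median along this axis
        return {"axis": axis,
                "split": pts[mid][axis],
                "left": build_kdtree(pts[:mid], depth + 1),
                "right": build_kdtree(pts[mid:], depth + 1)}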

  14. R-trees: a "bottom-up" approach [Guttman '84]:
      - Start with a set of points/rectangles.
      - Partition the set into groups of small cardinality.
      - For each group, find the minimum bounding rectangle (MBR) containing the objects of that group.
      - Repeat.
      Advantages:
      - Supports near(est)-neighbor search (similar to before).
      - Works for points and rectangles.
      - Avoids empty space.
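A toy illustration of the MBR step in this bottom-up construction; the grouping used here (sort by x and cut into chunks of fan_out) is deliberately naive; real R-tree packing heuristics choose the groups more carefully:

    def mbr(rects):
        # Minimum bounding rectangle of a group of rectangles,
        # each given as (xmin, ymin, xmax, ymax).
        return (min(r[0] for r in rects), min(r[1] for r in rects),
                max(r[2] for r in rects), max(r[3] for r in rects))

    def group_into_mbrs(rects, fan_out=4):
        # One bottom-up level: partition into groups of small cardinality
        # and replace each group by (its MBR, its members).
        ordered = sorted(rects, key=lambda r: r[0])
        groups = [ordered[i:i + fan_out] for i in range(0, len(ordered), fan_out)]
        return [(mbr(g), g) for g in groups]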

  15. R-trees with fan-out 4: group nearby rectangles into parent MBRs.
      [Figure: data rectangles A through J.]

  16. R-trees with fan-out 4: every parent node completely covers its "children".
      [Figure: parent MBRs P1 through P4, each completely covering its child rectangles among A through J.]

  17. R-trees with fan-out 4: every parent node completely covers its "children".
      [Figure: the same grouping drawn as a tree, with root entries P1, P2, P3, P4 above the leaf-level rectangles A through J.]

  18. Example of a range search query.
      [Figure: a range query drawn over the R-tree of MBRs P1 through P4.]

  19. Example of a range search query (continued).
      [Figure: the same range query, continued on the R-tree.]

  20. Example: spam filtering.
      - Instance space X: feature vector of word occurrences (binary or TF-IDF), d features (d ~ 100,000).
      - Class Y: Spam (+1), Ham (-1).
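A tiny illustration of the binary word-occurrence representation; the four-word vocabulary is a made-up stand-in for the d ~ 100,000 features on the slide:

    VOCAB = ["viagra", "nigeria", "meeting", "stanford"]   # hypothetical vocabulary

    def binary_features(text):
        # Binary feature vector: 1 if the vocabulary word occurs in the message.
        words = set(text.lower().split())
        return [1 if w in words else 0 for w in VOCAB]

    print(binary_features("Cheap viagra from Nigeria"))    # [1, 1, 0, 0]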

  21. Perceptron. Very loose motivation: the neuron.
      - Inputs are the feature values x_i.
      - Each feature has a weight w_i.
      - Activation is the sum: f(x) = Σ_i w_i x_i = w · x.
      - If f(x) is positive, predict +1; if negative, predict -1.
      [Figure: inputs x_1 ... x_4 with weights w_1 ... w_4 feeding a unit that tests w · x > 0, and a 2-D example with features "nigeria" and "viagra" where the hyperplane w · x = 0 separates Spam (+1) from Ham (-1).]
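A direct rendering of this prediction rule (the function name is illustrative):

    def perceptron_predict(w, x):
        # Activation f(x) = w . x; predict +1 (spam) if positive, -1 (ham) otherwise.
        activation = sum(wi * xi for wi, xi in zip(w, x))
        return 1 if activation > 0 else -1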

  22. If there are more than 2 classes:
      - Keep a weight vector w_c for each class.
      - Calculate the activation for each class: f(x, c) = Σ_i w_{c,i} x_i = w_c · x.
      - Highest activation wins: c* = arg max_c f(x, c).
      [Figure: three weight vectors w_1, w_2, w_3; each region of the plane is labeled by whichever activation w_c · x is biggest there.]
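A sketch of the multi-class rule, assuming the per-class weight vectors are kept in a dict keyed by class label:

    def multiclass_predict(weights, x):
        # weights: {class label: weight vector w_c}.
        # Compute f(x, c) = w_c . x for every class and pick the largest.
        def activation(w):
            return sum(wi * xi for wi, xi in zip(w, x))
        return max(weights, key=lambda c: activation(weights[c]))

    # Hypothetical usage: three classes over two features.
    # multiclass_predict({"a": [1, 0], "b": [0, 1], "c": [-1, -1]}, [0.2, 0.9]) -> "b"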

  23. Define a model: Perceptron, y = sign(w · x).
      Define a loss function: L(w) = -Σ_i y_i (w · x_i).
      Minimize the loss: compute the gradient L'(w) and optimize
          w_{t+1} = w_t - η_t L'(w_t) = w_t - η_t Σ_i dL(y_i w · x_i)/dw
      (batch gradient descent, with learning rate η_t).
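A hedged sketch of one batch gradient step on this loss, with learning rate eta as a free parameter; following common practice, the sum is restricted to misclassified examples (those with y_i (w · x_i) ≤ 0), a detail the slide's formula leaves implicit:

    def batch_gradient_step(w, data, eta=0.1):
        # Gradient of L(w) = -sum_i y_i (w . x_i) over the misclassified set
        # is -sum_i y_i x_i, so the update adds eta * sum_i y_i x_i to w.
        grad = [0.0] * len(w)
        for x, y in data:
            if y * sum(wi * xi for wi, xi in zip(w, x)) <= 0:   # misclassified
                for i, xi in enumerate(x):
                    grad[i] -= y * xi
        return [wi - eta * gi for wi, gi in zip(w, grad)]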

  24. Stochastic gradient descent:
      - Examples are drawn from a finite training set.
      - Pick a random example x_j and update: w_{t+1} = w_t - η_t · dL(w · x_j, y_j)/dw.

      Cost comparison [Bottou-LeCun '04]:

      Method          Cost per iteration   Time to reach optimization error < ρ   Time to reach accuracy ε
      GD              O(m·d)               O(m·d·κ·log(1/ρ))                      O(κ·d²/ε · log²(1/ε))
      2nd-order GD    O(d·(d+m))           O(m·d·log log(1/ρ))                    O(d²/ε · log(1/ε) · log log(1/ε))
      Stochastic GD   O(d)                 O(κ·d/ρ)                               O(κ·d/ε)

      m ... number of examples, d ... number of features, κ ... condition number.
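A matching sketch of one stochastic step: pick a single random example and update on it alone, at O(d) cost per step. Updating only on mistakes recovers the classic perceptron rule; that restriction is an interpretation, not stated verbatim on the slide:

    import random

    def sgd_step(w, data, eta=0.1):
        x, y = random.choice(data)                          # one random training example
        if y * sum(wi * xi for wi, xi in zip(w, x)) <= 0:   # update only on a mistake
            w = [wi + eta * y * xi for wi, xi in zip(w, x)]
        return w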
