CS345a: Data Mining
Jure Leskovec and Anand Rajaraman
Stanford University
Would like to do prediction: learn a function y = f(x), where y can be:
- Real: Regression
- Categorical: Classification
- More complex: Ranking, Structured prediction, etc.
Data is labeled: we have many pairs (x, y).
We will talk about the following methods:
- k-Nearest Neighbor (instance-based learning)
- Perceptron algorithm
- Support Vector Machines
- Decision trees (lecture on Thursday by Sugato Basu from Google)
How do we efficiently train (build a model)?
Instance-based learning. Example: nearest neighbor.
- Keep the whole training dataset: pairs (x, y)
- A query example x' comes in
- Find the closest training example(s) x*
- Predict its output y*
To make this work we need four things:
- Distance metric: Euclidean
- How many neighbors to look at? One
- Weighting function (optional): unused
- How to fit with the local points? Just predict the same output as the nearest neighbor
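A minimal sketch of 1-nearest-neighbor prediction under these four choices (Euclidean distance, a single neighbor, no weighting, copy the neighbor's output); the training set is assumed to be a plain list of (feature-vector, label) pairs:

```python
import math

def euclidean(a, b):
    # Distance metric: Euclidean
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def predict_1nn(train, query):
    # train: list of (x, y) pairs kept in memory (instance-based learning)
    # Find the single closest example and return its output unchanged
    x_star, y_star = min(train, key=lambda xy: euclidean(xy[0], query))
    return y_star

# Example usage
train = [((0.0, 0.0), -1), ((1.0, 1.0), +1), ((2.0, 2.0), +1)]
print(predict_1nn(train, (0.9, 0.8)))   # -> +1, since (1.0, 1.0) is closest
```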
Suppose x_1, ..., x_m are two-dimensional: x_1 = (x_11, x_12), x_2 = (x_21, x_22), ...
One can draw the nearest-neighbor regions; their shape depends on the distance metric, e.g.:
- d(x_i, x_j) = (x_i1 - x_j1)^2 + (3 x_i2 - 3 x_j2)^2
- d(x_i, x_j) = (x_i1 - x_j1)^2 + (x_i2 - x_j2)^2
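To see how the metric changes the regions, the sketch below compares the nearest neighbor of the same query under the plain Euclidean distance and under the slide's version that stretches the second coordinate by 3 (the sample points and query are made up for illustration):

```python
def d_plain(a, b):
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

def d_scaled(a, b):
    # second dimension stretched by a factor of 3
    return (a[0] - b[0]) ** 2 + (3 * a[1] - 3 * b[1]) ** 2

points = [(0.3, 0.6), (0.7, 0.1)]
q = (0.0, 0.0)
print(min(points, key=lambda p: d_plain(p, q)))   # -> (0.3, 0.6) under the plain metric
print(min(points, key=lambda p: d_scaled(p, q)))  # -> (0.7, 0.1) once the y-axis is stretched
```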
k-nearest neighbors:
- Distance metric: Euclidean
- How many neighbors to look at? k
- Weighting function (optional): unused
- How to fit with the local points? Just predict the average output among the k nearest neighbors (the figure uses k = 9)
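A sketch of the k-nearest-neighbor version for a real-valued output, where the fit is the plain average over the k closest examples (k = 9 matches the slide's figure, but any k works; for classification one would take a majority vote instead):

```python
import math

def knn_predict(train, query, k=9):
    # Sort training examples by Euclidean distance to the query
    by_dist = sorted(train, key=lambda xy: math.dist(xy[0], query))
    neighbors = by_dist[:k]                                   # the k closest examples
    return sum(y for _, y in neighbors) / len(neighbors)      # unweighted average of their outputs
```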
Kernel regression:
- Distance metric: Euclidean
- How many neighbors to look at? All of them
- Weighting function: w_i = exp(-d(x_i, q)^2 / K_w), so points near the query q are weighted more strongly; K_w is the kernel width
- How to fit with the local points? Predict the weighted average: sum_i w_i y_i / sum_i w_i
(Figure: fits for kernel widths K_w = 10, 20, 80.)
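A sketch of the kernel-weighted fit: every training point contributes, weighted by w_i = exp(-d(x_i, q)^2 / K_w), and the prediction is the weighted average of the outputs (the default K_w here is arbitrary; the slide's plots use 10, 20, 80):

```python
import math

def kernel_regression(train, query, Kw=20.0):
    # Weight every training point by a Gaussian kernel of its distance to the query
    weights = [math.exp(-math.dist(x, query) ** 2 / Kw) for x, _ in train]
    num = sum(w * y for w, (_, y) in zip(weights, train))
    den = sum(weights)
    return num / den   # weighted average; small Kw -> very local fit, large Kw -> smooth fit
```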
Given: a set P of n points in R^d.
Goal: given a query point q,
- NN: find the nearest neighbor p of q in P
- Range search: find one/all points in P within distance r from q
Algorithms for nearest-neighbor search:
- Main memory:
  - Linear scan
  - Tree-based: quadtree, kd-tree
  - Hashing: Locality-Sensitive Hashing
- Secondary storage:
  - R-trees
Quadtree: the simplest spatial structure on Earth!
- Split the space into 2^d equal subsquares
- Repeat until done:
  - only one pixel left
  - only one point left
  - only a few points left
- Variants: split only one dimension at a time -> kd-trees (in a moment)
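A minimal sketch of building a quadtree over 2-d points, splitting each box into 2^d = 4 equal subsquares until only a few points are left in a node; the Node layout and the leaf_size threshold are my own choices, not from the slides:

```python
class Node:
    def __init__(self, xmin, ymin, xmax, ymax, points, leaf_size=1):
        # Assumes distinct points lying inside [xmin, xmax) x [ymin, ymax)
        self.box = (xmin, ymin, xmax, ymax)
        self.points = points
        self.children = []
        if len(points) > leaf_size:                    # recurse until only a few points are left
            xm, ym = (xmin + xmax) / 2, (ymin + ymax) / 2
            quads = [(xmin, ymin, xm, ym), (xm, ymin, xmax, ym),
                     (xmin, ym, xm, ymax), (xm, ym, xmax, ymax)]
            for qx0, qy0, qx1, qy1 in quads:
                sub = [p for p in points if qx0 <= p[0] < qx1 and qy0 <= p[1] < qy1]
                if sub:
                    self.children.append(Node(qx0, qy0, qx1, qy1, sub, leaf_size))

# Example usage
pts = [(0.1, 0.2), (0.8, 0.3), (0.4, 0.9), (0.6, 0.7)]
root = Node(0.0, 0.0, 1.0, 1.0, pts)
```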
Range search on a quadtree:
- Put the root node on the stack
- Repeat: pop the next node T from the stack; for each child C of T:
  - if C is a leaf, examine the point(s) in C
  - if C intersects the ball of radius r around q, add C to the stack
Nearest neighbor:
- Start a range search with r = infinity
- Whenever a point is found, update r
- Only investigate nodes with respect to the current r
Quadtrees are great in 2 or 3 dimensions, but there are space issues: each internal node has 2^d children, so space grows quickly with the dimension d.
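A sketch of the stack-based range search and the nearest-neighbor search, built on the Node class and the root from the quadtree sketch above; box-vs-ball intersection is tested via the clamped distance from q to the box:

```python
import math

def box_intersects_ball(box, q, r):
    # Distance from q to the nearest point of the axis-aligned box
    x0, y0, x1, y1 = box
    dx = max(x0 - q[0], 0.0, q[0] - x1)
    dy = max(y0 - q[1], 0.0, q[1] - y1)
    return math.hypot(dx, dy) <= r

def range_search(root, q, r):
    found, stack = [], [root]
    while stack:
        node = stack.pop()                       # pop the next node T from the stack
        if not node.children:                    # leaf: examine its point(s)
            found.extend(p for p in node.points if math.dist(p, q) <= r)
        else:
            for child in node.children:          # only descend into children touching the ball
                if box_intersects_ball(child.box, q, r):
                    stack.append(child)
    return found

def nearest_neighbor(root, q):
    best, r = None, math.inf                     # start with r = infinity
    stack = [root]
    while stack:
        node = stack.pop()
        if not node.children:
            for p in node.points:
                d = math.dist(p, q)
                if d < r:                        # whenever a point is found, shrink r
                    best, r = p, d
        else:
            for child in node.children:
                if box_intersects_ball(child.box, q, r):   # prune w.r.t. the current r
                    stack.append(child)
    return best

print(nearest_neighbor(root, (0.5, 0.5)))        # -> (0.6, 0.7)
```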
kd-trees. Main ideas [Bentley '75]:
- Only one-dimensional splits
- Choose the split "carefully" (many variations)
- Queries: as for quadtrees
Advantages: no (or fewer) empty regions, only linear space.
Query time is at most min[ d n, exponential(d) ].
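A sketch of one common "careful" split choice: cycle through the dimensions and split at the median point along the current axis (just one of the many variations the slide alludes to):

```python
def build_kdtree(points, depth=0):
    # points: list of d-dimensional tuples; returns a nested dict, leaves hold a single point
    if len(points) <= 1:
        return {"point": points[0]} if points else None
    d = len(points[0])
    axis = depth % d                              # one-dimensional split, cycling through the axes
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2                        # split at the median along this axis
    return {
        "axis": axis,
        "value": points[mid][axis],
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid:], depth + 1),
    }
```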
R-trees: a "bottom-up" approach [Guttman '84]:
- Start with a set of points/rectangles
- Partition the set into groups of small cardinality
- For each group, find the minimum bounding rectangle (MBR) containing the objects in that group
- Repeat on the MBRs
Advantages:
- Supports near(est)-neighbor search (similar to before)
- Works for points and rectangles
- Avoids empty spaces
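A sketch of one bottom-up level: partition the rectangles into small groups (here naively, by sorting on the x-coordinate of the lower-left corner; real R-tree packing uses smarter grouping heuristics) and compute each group's MBR:

```python
def mbr(rects):
    # Rectangles are (xmin, ymin, xmax, ymax); the MBR is the smallest rectangle covering them all
    return (min(r[0] for r in rects), min(r[1] for r in rects),
            max(r[2] for r in rects), max(r[3] for r in rects))

def group_into_parents(rects, fanout=4):
    # One level of bottom-up construction: groups of at most `fanout` children, each covered by its MBR
    rects = sorted(rects, key=lambda r: r[0])
    groups = [rects[i:i + fanout] for i in range(0, len(rects), fanout)]
    return [(mbr(g), g) for g in groups]          # repeat on the parent MBRs to build higher levels
```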
R-trees with fan-out 4: group nearby rectangles into parent MBRs; every parent node completely covers its 'children'.
(Figures: leaf rectangles A-J are grouped under parent MBRs P1-P4; the root node holds the entries [P1 P2 P3 P4], each pointing to its group of leaves.)
Example of a range search query: starting from the root, only the parent MBRs that intersect the query region need to be opened; their children are then checked against the query.
Example: spam filtering.
- Instance space X: feature vector of word occurrences (binary or TF-IDF), with d features (d ~ 100,000)
- Class Y: Spam (+1), Ham (-1)
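For concreteness, a sketch of turning an e-mail into a binary word-occurrence vector over a fixed vocabulary (the toy vocabulary and message are made up; a real filter would have d ~ 100,000 features and could use TF-IDF weights instead of 0/1):

```python
vocab = ["viagra", "nigeria", "meeting", "lunch"]       # toy vocabulary; index = feature id

def featurize(text):
    words = set(text.lower().split())
    return [1 if w in words else 0 for w in vocab]      # binary word-occurrence features

x = featurize("Cheap viagra from nigeria !!!")           # -> [1, 1, 0, 0]
y = +1                                                   # label: spam = +1, ham = -1
```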
Perceptron: very loose motivation is the neuron.
- Inputs are the feature values x_i (e.g. occurrences of words such as 'viagra', 'nigeria')
- Each feature has a weight w_i
- Activation is the sum: f(x) = sum_i w_i x_i = w . x
- If f(x) is positive: predict +1 (Spam = +1); if negative: predict -1 (Ham = -1)
The decision boundary is the hyperplane w . x = 0.
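A sketch of the perceptron's prediction rule: compute the activation f(x) = w . x and threshold it at zero (the example weights are made up, with spam-indicative words weighted positively):

```python
def predict(w, x):
    activation = sum(wi * xi for wi, xi in zip(w, x))   # f(x) = w . x
    return +1 if activation > 0 else -1                 # positive -> spam (+1), otherwise ham (-1)

w = [2.0, 1.5, -1.0, -0.5]          # one weight per vocabulary word, e.g. 'viagra' weighted up
print(predict(w, [1, 1, 0, 0]))     # -> +1 (spam)
```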
If there are more than 2 classes:
- Keep a weight vector w_c for each class c
- Calculate the activation for each class: f(x, c) = sum_i w_c,i x_i = w_c . x
- The highest activation wins: c* = arg max_c f(x, c)
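A sketch of the multiclass rule: one weight vector per class, compute each class's activation w_c . x, and predict the class whose activation is biggest (the class names and weights are illustrative):

```python
def predict_multiclass(weights, x):
    # weights: dict mapping class label -> weight vector w_c
    def activation(w):
        return sum(wi * xi for wi, xi in zip(w, x))      # f(x, c) = w_c . x
    return max(weights, key=lambda c: activation(weights[c]))   # arg max_c f(x, c)

weights = {"spam": [2.0, 1.5, -1.0, -0.5],
           "ham":  [-1.0, -0.5, 1.0, 1.5],
           "work": [-0.5, 0.0, 2.0, 0.0]}
print(predict_multiclass(weights, [0, 0, 1, 1]))         # -> 'ham' (activation 2.5 beats 2.0)
```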
Learning the weights:
- Define a model. Perceptron: y = sign(w . x)
- Define a loss function: L(w) = - Σ_i y_i (w . x_i)
- Minimize the loss: compute the gradient L'(w) and optimize
  w_{t+1} = w_t - η_t L'(w_t) = w_t - η_t Σ_i dL(y_i w . x_i)/dw
  (Batch gradient descent)
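A sketch of batch gradient descent for the perceptron. As an assumption on my part (the slide writes the loss as a plain sum), only currently misclassified examples (y_i w . x_i <= 0) contribute to the gradient, which is the usual perceptron criterion; the whole training set is scanned before each weight update:

```python
def batch_gradient_descent(X, Y, eta=0.1, epochs=100):
    d = len(X[0])
    w = [0.0] * d
    for _ in range(epochs):
        grad = [0.0] * d
        for x, y in zip(X, Y):                    # sum the gradient over the whole training set
            if y * sum(wi * xi for wi, xi in zip(w, x)) <= 0:   # misclassified example
                for i in range(d):
                    grad[i] -= y * x[i]           # dL/dw for L = -sum_i y_i (w . x_i)
        w = [wi - eta * gi for wi, gi in zip(w, grad)]          # w_{t+1} = w_t - eta * L'(w_t)
    return w
```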
Stochastic gradient descent:
- Examples are drawn from a finite training set
- Pick a random example x_j and update
  w_{t+1} = w_t - η_t dL(w . x_j, y_j)/dw

Comparison [Bottou-LeCun '04]:
| Algorithm | Cost per iteration | Time to reach error < ρ | Time to reach accuracy ε |
| GD | O(m d) | O(m d κ log(1/ρ)) | O(d^2 κ / ε · log^2(1/ε)) |
| 2nd-order GD | O(d(d+m)) | O(m d log log(1/ρ)) | O(d^2 / ε · log(1/ε) · log log(1/ε)) |
| Stochastic GD | O(d) | O(d κ^2 / ρ) | O(d κ^2 / ε) |
m ... number of examples, d ... number of features, κ ... condition number
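A sketch of the stochastic version: each step picks a single random training example and applies the gradient update to it alone, so the cost per iteration is O(d) rather than O(m d); as in the batch sketch above, the update fires only on misclassified examples, which is an assumption, not something the slide states:

```python
import random

def stochastic_gradient_descent(X, Y, eta=0.1, steps=10_000):
    d = len(X[0])
    w = [0.0] * d
    for _ in range(steps):
        j = random.randrange(len(X))              # pick a random example x_j
        x, y = X[j], Y[j]
        if y * sum(wi * xi for wi, xi in zip(w, x)) <= 0:       # update only if misclassified
            w = [wi + eta * y * xi for wi, xi in zip(w, x)]     # w_{t+1} = w_t - eta * dL(w . x_j, y_j)/dw
    return w
```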