  1. CS345a: Data Mining. Jure Leskovec and Anand Rajaraman, Stanford University.

  2. Would like to do prediction: learn a function y = f(x), where y can be:
     - Real: regression
     - Categorical: classification
     - More complex: ranking, structured prediction, etc.
     The data is labeled: we have many pairs (x, y).

  3. We will talk about the following methods:
     - k-Nearest Neighbor (instance-based learning)
     - Perceptron algorithm
     - Support Vector Machines
     - Decision trees (lecture on Thursday by Sugato Basu from Google)
     How do we efficiently train (build a model)?

  4. Instance-based learning. Example: nearest neighbor.
     - Keep the whole training dataset: pairs (x, y).
     - A query example x' comes in.
     - Find the closest training example(s) x*.
     - Predict y*.

  5. To make things work we need 4 things:
     - Distance metric: Euclidean.
     - How many neighbors to look at? One.
     - Weighting function (optional): unused.
     - How to fit with the local points? Just predict the same output as the nearest neighbor.
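A minimal sketch of this 1-nearest-neighbor recipe in Python (Euclidean distance, a single neighbor, no weighting). The function names and the toy data are illustrative, not from the slides:

    import math

    def euclidean(a, b):
        # Distance metric: plain Euclidean distance between two feature vectors.
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

    def nn_predict(train, query):
        # train: the whole training set, kept as a list of (x, y) pairs.
        # Predict the label of the single closest training example.
        x_star, y_star = min(train, key=lambda xy: euclidean(xy[0], query))
        return y_star

    # Tiny usage example with made-up points.
    train = [((0.0, 0.0), -1), ((1.0, 1.0), +1), ((0.9, 0.8), +1)]
    print(nn_predict(train, (0.2, 0.1)))   # -> -1 (nearest point is (0, 0))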

  6. Suppose x_1, ..., x_m are two-dimensional: x_1 = (x_11, x_12), x_2 = (x_21, x_22), ...
     One can draw the nearest-neighbor regions induced by the distance metric, e.g.
         d(x_i, x_j) = (x_i1 - x_j1)^2 + (x_i2 - x_j2)^2      versus
         d(x_i, x_j) = (x_i1 - x_j1)^2 + (3 x_i2 - 3 x_j2)^2
     (scaling a coordinate changes which points are nearest, and hence the shape of the regions).
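To see the effect concretely, here is a small example (the points are made up) showing that scaling the second coordinate by 3 changes which candidate is nearer, and therefore reshapes the nearest-neighbor regions:

    def dist_sq(a, b, s=1.0):
        # Squared distance as on the slide; s scales the second coordinate
        # (s = 3 corresponds to the (3 x_i2 - 3 x_j2)^2 variant).
        return (a[0] - b[0]) ** 2 + (s * (a[1] - b[1])) ** 2

    p, q1, q2 = (0.0, 0.0), (1.0, 0.2), (0.5, 0.6)
    print(dist_sq(p, q1), dist_sq(p, q2))        # ~1.04 vs ~0.61: q2 is nearer
    print(dist_sq(p, q1, 3), dist_sq(p, q2, 3))  # ~1.36 vs ~3.49: now q1 is nearer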

  7. k-nearest neighbors:
     - Distance metric: Euclidean.
     - How many neighbors to look at? k (the figure on the slide uses k = 9).
     - Weighting function (optional): unused.
     - How to fit with the local points? Just predict the average output among the k nearest neighbors.
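A short sketch of this k-NN predictor, assuming the training set is again a plain list of (x, y) pairs; knn_predict and the default k = 9 are illustrative choices:

    import math

    def knn_predict(train, query, k=9):
        # Sort the stored (x, y) pairs by Euclidean distance to the query
        # and average the outputs of the k closest ones.
        def dist(x):
            return math.sqrt(sum((xi - qi) ** 2 for xi, qi in zip(x, query)))
        neighbors = sorted(train, key=lambda xy: dist(xy[0]))[:k]
        return sum(y for _, y in neighbors) / len(neighbors)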

  8. Kernel-weighted regression (all neighbors):
     - Distance metric: Euclidean.
     - How many neighbors to look at? All of them.
     - Weighting function: w_i = exp(-d(x_i, q)^2 / K_w), so points near the query q are weighted more strongly; K_w is the kernel width.
     - How to fit with the local points? Predict the weighted average Σ_i w_i y_i / Σ_i w_i.
     [Figure: fitted curves for kernel widths K_w = 10, 20, 80.]
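A sketch of the kernel-weighted average described above; the function name and the default kernel width are assumptions for illustration:

    import math

    def kernel_regression(train, query, kw=20.0):
        # Weight every stored example by w_i = exp(-d(x_i, q)^2 / K_w) and
        # return the weighted average sum(w_i * y_i) / sum(w_i).
        def d2(x):
            return sum((xi - qi) ** 2 for xi, qi in zip(x, query))
        weighted = [(math.exp(-d2(x) / kw), y) for x, y in train]
        return sum(w * y for w, y in weighted) / sum(w for w, _ in weighted)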

  9. Given: a set P of n points in R^d. Goal: given a query point q,
     - NN: find the nearest neighbor p of q in P.
     - Range search: find one/all points in P within distance r from q.
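Both queries have an obvious brute-force baseline, the linear scan from the next slide; a minimal sketch (function names are illustrative):

    import math

    def linear_scan_nn(points, q):
        # Nearest neighbor: check every point of P.
        return min(points, key=lambda p: math.dist(p, q))

    def linear_scan_range(points, q, r):
        # Range search: all points of P within distance r of q.
        return [p for p in points if math.dist(p, q) <= r]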

  10. Main memory:
      - Linear scan
      - Tree-based: quadtree, kd-tree
      - Hashing: Locality-Sensitive Hashing
      Secondary storage:
      - R-trees

  11. Quadtree: the simplest spatial structure on Earth!
      - Split the space into 2^d equal subsquares.
      - Repeat until done: only one pixel left, only one point left, or only a few points left.
      - Variants: split only one dimension at a time, which leads to kd-trees (in a moment).
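A compact 2-D construction sketch, assuming square cells stored as plain dicts; the node format, leaf_size, and the tiny-cell guard are choices of this sketch, not part of the slides:

    def build_quadtree(points, box, leaf_size=1):
        # box = (xmin, ymin, xmax, ymax): split the square into 2^d = 4 equal
        # subsquares and recurse until only a few points are left in a node.
        xmin, ymin, xmax, ymax = box
        if len(points) <= leaf_size or (xmax - xmin) < 1e-9:
            return {"box": box, "points": points}          # leaf node
        xmid, ymid = (xmin + xmax) / 2.0, (ymin + ymax) / 2.0
        boxes = {(0, 0): (xmin, ymin, xmid, ymid), (1, 0): (xmid, ymin, xmax, ymid),
                 (0, 1): (xmin, ymid, xmid, ymax), (1, 1): (xmid, ymid, xmax, ymax)}
        quadrants = {k: [] for k in boxes}
        for x, y in points:
            quadrants[(int(x >= xmid), int(y >= ymid))].append((x, y))
        children = [build_quadtree(pts, boxes[k], leaf_size)
                    for k, pts in quadrants.items() if pts]
        return {"box": box, "children": children}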

  12. Range search with a quadtree:
      - Put the root node on the stack.
      - Repeat:
        - Pop the next node T from the stack.
        - For each child C of T:
          - If C is a leaf, examine the point(s) in C.
          - If C intersects the ball of radius r around q, add C to the stack.
      Nearest neighbor:
      - Start a range search with r = ∞.
      - Whenever a point is found, update r.
      - Only investigate nodes with respect to the current r.
      Notes: great in 2 or 3 dimensions; space issues in higher dimensions, since each node has 2^d children.
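A sketch of this stack-based range search, reusing the dict node format of the quadtree sketch earlier (leaves carry "points", internal nodes carry "children"); the ball/box test clamps the query to the box:

    import math

    def range_search(root, q, r):
        def box_intersects_ball(box, q, r):
            xmin, ymin, xmax, ymax = box
            cx = min(max(q[0], xmin), xmax)     # closest point of the box to q
            cy = min(max(q[1], ymin), ymax)
            return math.dist((cx, cy), q) <= r

        found, stack = [], [root]               # put the root node on the stack
        while stack:
            node = stack.pop()                  # pop the next node
            if "points" in node:                # leaf: examine its point(s)
                found.extend(p for p in node["points"] if math.dist(p, q) <= r)
            else:                               # push children whose box meets the ball
                stack.extend(c for c in node["children"]
                             if box_intersects_ball(c["box"], q, r))
        return found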

  13. kd-trees. Main ideas [Bentley '75]:
      - Only one-dimensional splits.
      - Choose the split "carefully" (many variations).
      - Queries: as for quadtrees.
      Advantages:
      - No (or less) empty space.
      - Only linear space.
      Query time at most: min[d·n, exponential(d)].
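A minimal construction sketch with one-dimensional splits; cycling through the axes and splitting at the median is just one of the "many variations" of choosing the split, picked here for concreteness:

    def build_kdtree(points, depth=0):
        # Leaf: at most one point left.
        if len(points) <= 1:
            return {"points": points}
        axis = depth % len(points[0])           # cycle through the d coordinates
        pts = sorted(points, key=lambda p: p[axis])
        mid = len(pts) // 2                     # split at the median along this axis
        return {"axis": axis,
                "split": pts[mid][axis],
                "left": build_kdtree(pts[:mid], depth + 1),
                "right": build_kdtree(pts[mid:], depth + 1)}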

  14. R-trees: a "bottom-up" approach [Guttman '84]:
      - Start with a set of points/rectangles.
      - Partition the set into groups of small cardinality.
      - For each group, find the minimum bounding rectangle (MBR) containing the objects of that group.
      - Repeat.
      Advantages:
      - Supports near(est)-neighbor search (similar to before).
      - Works for points and rectangles.
      - Avoids empty space.
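A toy illustration of the MBR step in this bottom-up construction; the grouping used here (sort by x and cut into chunks of fan_out) is deliberately naive; real R-tree packing heuristics choose the groups more carefully:

    def mbr(rects):
        # Minimum bounding rectangle of a group of rectangles,
        # each given as (xmin, ymin, xmax, ymax).
        return (min(r[0] for r in rects), min(r[1] for r in rects),
                max(r[2] for r in rects), max(r[3] for r in rects))

    def group_into_mbrs(rects, fan_out=4):
        # One bottom-up level: partition into groups of small cardinality
        # and replace each group by (its MBR, its members).
        ordered = sorted(rects, key=lambda r: r[0])
        groups = [ordered[i:i + fan_out] for i in range(0, len(ordered), fan_out)]
        return [(mbr(g), g) for g in groups]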

  15. R-trees with fan-out 4: group nearby rectangles into parent MBRs.
      [Figure: data rectangles A through J.]

  16. R-trees with fan-out 4: every parent node completely covers its "children".
      [Figure: parent MBRs P1 through P4, each completely covering its child rectangles among A through J.]

  17. R-trees with fan-out 4: every parent node completely covers its "children".
      [Figure: the same grouping drawn as a tree, with root entries P1, P2, P3, P4 above the leaf-level rectangles A through J.]

  18. Example of a range search query.
      [Figure: a range query drawn over the R-tree of MBRs P1 through P4.]

  19. Example of a range search query (continued).
      [Figure: the same range query, continued on the R-tree.]

  20. Example: spam filtering.
      - Instance space X: feature vector of word occurrences (binary or TF-IDF), d features (d ~ 100,000).
      - Class Y: Spam (+1), Ham (-1).
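A tiny illustration of the binary word-occurrence representation; the four-word vocabulary is a made-up stand-in for the d ~ 100,000 features on the slide:

    VOCAB = ["viagra", "nigeria", "meeting", "stanford"]   # hypothetical vocabulary

    def binary_features(text):
        # Binary feature vector: 1 if the vocabulary word occurs in the message.
        words = set(text.lower().split())
        return [1 if w in words else 0 for w in VOCAB]

    print(binary_features("Cheap viagra from Nigeria"))    # [1, 1, 0, 0]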

  21. Perceptron. Very loose motivation: the neuron.
      - Inputs are the feature values x_i.
      - Each feature has a weight w_i.
      - Activation is the sum: f(x) = Σ_i w_i x_i = w · x.
      - If f(x) is positive, predict +1; if negative, predict -1.
      [Figure: inputs x_1 ... x_4 with weights w_1 ... w_4 feeding a unit that tests w · x > 0, and a 2-D example with features "nigeria" and "viagra" where the hyperplane w · x = 0 separates Spam (+1) from Ham (-1).]
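A direct rendering of this prediction rule (the function name is illustrative):

    def perceptron_predict(w, x):
        # Activation f(x) = w . x; predict +1 (spam) if positive, -1 (ham) otherwise.
        activation = sum(wi * xi for wi, xi in zip(w, x))
        return 1 if activation > 0 else -1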

  22. If there are more than 2 classes:
      - Keep a weight vector w_c for each class.
      - Calculate the activation for each class: f(x, c) = Σ_i w_{c,i} x_i = w_c · x.
      - Highest activation wins: c* = arg max_c f(x, c).
      [Figure: three weight vectors w_1, w_2, w_3; each region of the plane is labeled by whichever activation w_c · x is biggest there.]
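A sketch of the multi-class rule, assuming the per-class weight vectors are kept in a dict keyed by class label:

    def multiclass_predict(weights, x):
        # weights: {class label: weight vector w_c}.
        # Compute f(x, c) = w_c . x for every class and pick the largest.
        def activation(w):
            return sum(wi * xi for wi, xi in zip(w, x))
        return max(weights, key=lambda c: activation(weights[c]))

    # Hypothetical usage: three classes over two features.
    # multiclass_predict({"a": [1, 0], "b": [0, 1], "c": [-1, -1]}, [0.2, 0.9]) -> "b"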

  23. Define a model: Perceptron, y = sign(w · x).
      Define a loss function: L(w) = -Σ_i y_i (w · x_i).
      Minimize the loss: compute the gradient L'(w) and optimize
          w_{t+1} = w_t - η_t L'(w_t) = w_t - η_t Σ_i dL(y_i w · x_i)/dw
      (batch gradient descent, with learning rate η_t).
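A hedged sketch of one batch gradient step on this loss, with learning rate eta as a free parameter; following common practice, the sum is restricted to misclassified examples (those with y_i (w · x_i) ≤ 0), a detail the slide's formula leaves implicit:

    def batch_gradient_step(w, data, eta=0.1):
        # Gradient of L(w) = -sum_i y_i (w . x_i) over the misclassified set
        # is -sum_i y_i x_i, so the update adds eta * sum_i y_i x_i to w.
        grad = [0.0] * len(w)
        for x, y in data:
            if y * sum(wi * xi for wi, xi in zip(w, x)) <= 0:   # misclassified
                for i, xi in enumerate(x):
                    grad[i] -= y * xi
        return [wi - eta * gi for wi, gi in zip(w, grad)]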

  24. Stochastic gradient descent:
      - Examples are drawn from a finite training set.
      - Pick a random example x_j and update: w_{t+1} = w_t - η_t · dL(w · x_j, y_j)/dw.

      Cost comparison [Bottou-LeCun '04]:

      Method          Cost per iteration   Time to reach optimization error < ρ   Time to reach accuracy ε
      GD              O(m·d)               O(m·d·κ·log(1/ρ))                      O(κ·d²/ε · log²(1/ε))
      2nd-order GD    O(d·(d+m))           O(m·d·log log(1/ρ))                    O(d²/ε · log(1/ε) · log log(1/ε))
      Stochastic GD   O(d)                 O(κ·d/ρ)                               O(κ·d/ε)

      m ... number of examples, d ... number of features, κ ... condition number.
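A matching sketch of one stochastic step: pick a single random example and update on it alone, at O(d) cost per step. Updating only on mistakes recovers the classic perceptron rule; that restriction is an interpretation, not stated verbatim on the slide:

    import random

    def sgd_step(w, data, eta=0.1):
        x, y = random.choice(data)                          # one random training example
        if y * sum(wi * xi for wi, xi in zip(w, x)) <= 0:   # update only on a mistake
            w = [wi + eta * y * xi for wi, xi in zip(w, x)]
        return w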
