
slide-1
SLIDE 1

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

http://cs246.stanford.edu

Note to other teachers and users of these slides: We would be delighted if you found our material useful for giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. If you make use of a significant portion of these slides in your own lecture, please include this message, or a link to our web site: http://www.mmds.org

slide-2
SLIDE 2

¡ Would like to do prediction:

estimate a function f(x) so that y = f(x)

¡ Where y can be:

§ Real number: Regression
§ Categorical: Classification
§ Complex object:

§ Ranking of items, Parse tree, etc.

¡ Data is labeled:

§ Have many pairs {(x, y)}

§ x … vector of binary, categorical, real valued features
§ y … class: {+1, -1}, or a real number

Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 2 2/27/20

slide-3
SLIDE 3

¡ Task: Given data (X,Y) build a model f to predict Y’ based on X’

¡ Strategy: Estimate y = f(x) on training data (X, Y).

Hope that the same f(x) also works to predict unknown Y’

§ The “hope” is called generalization

§ Overfitting: If f(x) predicts Y well but is unable to predict Y’

§ We want to build a model that generalizes well to unseen data

Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 3

[Figure: training data (X, Y) and test data (X’, Y’)]

2/27/20

slide-4
SLIDE 4

¡ 1) Training data is drawn independently at random according to unknown probability distribution P(x, y)

¡ 2) The learning algorithm analyzes the examples and produces a classifier f

¡ Given new data (x, y) drawn from P, the classifier is given x and predicts ŷ = f(x)

¡ The loss L(ŷ, y) is then measured

¡ Goal of the learning algorithm: Find f that minimizes expected loss E_P[L]

Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 4


2/27/20

slide-5
SLIDE 5

5

[Diagram: the distribution P(x, y) generates the training set T; the learning algorithm produces f; on test data x the prediction ŷ = f(x) is scored against y by the loss function L(ŷ, y)]

Why is it hard? We estimate f on training data but want f to work well on unseen future (i.e., test) data

Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 2/27/20

slide-6
SLIDE 6

¡ Goal: Minimize the expected loss

min_f E_P[L]

¡ But we don’t have access to P -- we only know the training sample D:

min_f E_D[L]

¡ So, we minimize the average loss on the training data:

min_f J(f) = min_f (1/N) ∑_{i=1..N} L(f(x_i), y_i)

Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 6

Problem: Just memorizing the training data gives us a perfect model (with zero loss)

2/27/20

slide-7
SLIDE 7

¡ Given:

§ A set of N training examples

§ {(x_1, y_1), (x_2, y_2), … , (x_N, y_N)}

§ A loss function L

¡ Choose the model: f_w(x) = w ⋅ x + b

¡ Find:

§ The weight vector w that minimizes the expected loss on the training data:

J(f) = (1/N) ∑_{i=1..N} L(w ⋅ x_i + b, y_i)

Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 7 2/27/20

slide-8
SLIDE 8

¡ Problem: Step-wise Constant Loss function

Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 8

[Plot: the 0/1 loss as a function of f_w(x) is a step-wise constant function]

Derivative is either 0 or ∞

2/27/20

slide-9
SLIDE 9

Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 9

¡ Approximating the expected loss by a smooth function

§ Replace the original objective function by a surrogate loss function. E.g., hinge loss:

Ĵ(w) = (1/N) ∑_{i=1..N} max(0, 1 − y_i f(x_i))

[Plot: hinge loss as a function of y·f(x), shown for y = 1]

2/27/20
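Below is a minimal NumPy sketch (my own illustration, not from the slides) of the hinge-loss surrogate above; the function name hinge_loss and the toy data are made up for the example.

import numpy as np

def hinge_loss(w, b, X, y):
    """Average hinge loss max(0, 1 - y_i * f(x_i)) for the linear model f(x) = w.x + b."""
    margins = y * (X @ w + b)                 # y_i * f(x_i) for every example
    return np.mean(np.maximum(0.0, 1.0 - margins))

# Toy data: three 2-d examples with labels in {-1, +1}
X = np.array([[2.0, 1.0], [-1.0, -2.0], [0.5, -0.5]])
y = np.array([+1.0, -1.0, +1.0])
w = np.array([1.0, 0.5])
print(hinge_loss(w, b=0.0, X=X, y=y))         # 0.25 on this toy set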

slide-10
SLIDE 10
slide-11
SLIDE 11

¡ Want to separate “+” from “-” using a line

2/27/20 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 11

Data:

¡ Training examples:

§ (x1, y1) … (xn, yn)

¡ Each example i:

§ xi = ( xi(1),… , xi(d) )

§ xi(j) is real valued

§ yi ∈ { -1, +1 }

¡ Inner product:

w ⋅ x = ∑_{j=1..d} w(j) ⋅ x(j)

  • Which is best linear separator (defined by w,b)?
slide-12
SLIDE 12

[Figure: positively and negatively labeled points with a separating line; points A, B, C lie at different distances from the line]

¡ Distance from the separating hyperplane corresponds to the “confidence” of prediction

¡ Example:

§ We are more sure about the class of A and B than of C

2/27/20 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 12

slide-13
SLIDE 13

¡ Margin γ: Distance of closest example from the decision line/hyperplane

2/27/20 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 13

The reason we define margin this way is due to theoretical convenience and existence of generalization error bounds that depend on the value of margin.

slide-14
SLIDE 14

¡ Remember: The Dot product

A ⋅ B = ‖A‖ ⋅ ‖B‖ ⋅ cos θ

2/27/20 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 14

‖A‖ = √( ∑_{j=1..d} A(j)² )

[Figure: the projection of A onto B has length ‖A‖ cos θ]

slide-15
SLIDE 15

¡ Dot product

A ⋅ B = ‖A‖ ‖B‖ cos θ

¡ What is w ⋅ x1, w ⋅ x2?

¡ So, γ roughly corresponds to the margin

§ Bottom line: Bigger γ, bigger the separation

2/27/20 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 15

[Figure: separating line w ⋅ x + b = 0 with two positive points x1 and x2; their projections onto w (i.e., w ⋅ x1 and w ⋅ x2) indicate how far each lies from the line; ‖w‖ = √( ∑_j w(j)² )]

slide-16
SLIDE 16

Distance from a point to a line

¡ Let:

§ Line L: w ⋅ x + b = w(1) x(1) + w(2) x(2) + b = 0
§ w = (w(1), w(2))
§ Point A = (xA(1), xA(2))
§ Point M on the line = (xM(1), xM(2))

d(A, L) = |AH| = |(A − M) ∙ w|
= |(xA(1) – xM(1)) w(1) + (xA(2) – xM(2)) w(2)|
= |xA(1) w(1) + xA(2) w(2) + b|
= |w ∙ A + b|

Remember xM(1) w(1) + xM(2) w(2) = −b since M belongs to line L

Note we assume ‖w‖ = 1

[Figure: point A and its projection H onto line L; M is a point on L; d(A, L) = |AH|]

2/27/20 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 16
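As a small sketch of the computation above (my code, not the slides'): in general one divides by ‖w‖, and under the slide's assumption ‖w‖ = 1 this reduces to |w ∙ A + b|.

import numpy as np

def distance_to_hyperplane(A, w, b):
    """Distance from point A to the hyperplane w.x + b = 0 (equals |w.A + b| when ||w|| = 1)."""
    return abs(np.dot(w, A) + b) / np.linalg.norm(w)

A = np.array([3.0, 2.0])
w = np.array([0.6, 0.8])                      # unit-length w, so the division changes nothing
print(distance_to_hyperplane(A, w, b=-1.0))   # |0.6*3 + 0.8*2 - 1| = 2.4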

slide-17
SLIDE 17

¡ Prediction = sign(w ⋅ x + b)

¡ “Confidence” = (w ⋅ x + b) y

¡ For i-th datapoint:

γ_i = (w ⋅ x_i + b) y_i

¡ Want to solve:

max_{w,b} min_i γ_i

¡ Can rewrite as

2/27/20 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 17

max_{w,b} γ
s.t. ∀i, y_i (w ⋅ x_i + b) ≥ γ

[Figure: separating line w ⋅ x + b = 0 with margin γ on either side]

slide-18
SLIDE 18

¡ Maximize the margin:

§ Good according to intuition, theory (c.f. “VC dimension”) and practice
§ γ is margin … distance from the separating hyperplane

2/27/20 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 18

[Figure: maximizing the margin γ around the separating line w ⋅ x + b = 0]

max_{w,b} γ
s.t. ∀i, y_i (w ⋅ x_i + b) ≥ γ

slide-19
SLIDE 19
slide-20
SLIDE 20

¡ Separating hyperplane is defined by the support vectors

§ Points on the +/− planes from the solution
§ If you knew these points, you could ignore the rest
§ Generally, d+1 support vectors (for d dim. data)

2/27/20 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 20

slide-21
SLIDE 21

¡ Problem:

§ Let (w ⋅ x + b) y = γ, then (2w ⋅ x + 2b) y = 2γ

§ Scaling w increases margin!

¡ Solution:

§ Work with normalized w: γ = (w/‖w‖ ⋅ x + b) y

§ Let’s also require support vectors x_j to be on the planes defined by: w ⋅ x_j + b = ±1

2/27/20 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 21

[Figure: planes w ⋅ x + b = −1, 0, +1 with support vectors x1, x2 on the ±1 planes; ‖w‖ = √( ∑_j w(j)² )]

slide-22
SLIDE 22

¡ Want to maximize margin!

¡ What is the relation between x1 and x2?

§ x1 = x2 + 2γ (w/‖w‖)

§ We also know:

§ w ⋅ x1 + b = +1
§ w ⋅ x2 + b = −1

¡ So:

§ w ⋅ x1 + b = +1
§ w ⋅ (x2 + 2γ w/‖w‖) + b = +1
§ w ⋅ x2 + b + 2γ (w ⋅ w)/‖w‖ = +1

2/27/20 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 22

⟹ γ = 1/‖w‖     (Note: w ⋅ w = ‖w‖²)

[Figure: planes w ⋅ x + b = −1, 0, +1 separated by a gap of 2γ]

slide-23
SLIDE 23

2/27/20 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 23

¡ We started with:

max_{w,b} γ
s.t. ∀i, y_i (w ⋅ x_i + b) ≥ γ

But w can be arbitrarily large!

¡ We normalized and...

arg max γ = arg max 1/‖w‖ = arg min ‖w‖ = arg min ½‖w‖²

¡ Then:

min_w ½‖w‖²
s.t. ∀i, y_i (w ⋅ x_i + b) ≥ 1

This is called SVM with “hard” constraints

[Figure: planes w ⋅ x + b = −1, 0, +1 with support vectors x1, x2 and margin 2γ]

slide-24
SLIDE 24

¡ If data is not separable introduce penalty:

§ Minimize ‖w‖² plus the number of training mistakes
§ Set C using cross validation

¡ How to penalize mistakes?

§ All mistakes are not equally bad!

2/27/20 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 24

min_w ½‖w‖² + C ⋅ (# of training mistakes)
s.t. ∀i, y_i (w ⋅ x_i + b) ≥ 1

[Figure: non-separable data around the line w ⋅ x + b = 0]

slide-25
SLIDE 25

¡ Introduce slack variables ξ_i

¡ If point x_i is on the wrong side of the margin then get penalty ξ_i

2/27/20 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 25

min_{w,b,ξ_i≥0} ½‖w‖² + C ∑_{i=1..n} ξ_i
s.t. ∀i, y_i (w ⋅ x_i + b) ≥ 1 − ξ_i

For each data point: If margin ≥ 1, don’t care. If margin < 1, pay linear penalty ξ_i

[Figure: points x_i and x_j on the wrong side of the margin around w ⋅ x + b = 0, each incurring a slack penalty]
slide-26
SLIDE 26

¡ What is the role of slack penalty C:

§ C=∞: Only want w, b that separate the data
§ C=0: Can set ξ_i to anything, then w=0 (basically ignores the data)

2/27/20 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 26

min_w ½‖w‖² + C ⋅ (# of training mistakes)
s.t. ∀i, y_i (w ⋅ x_i + b) ≥ 1

[Figure: decision boundaries obtained with big C, “good” C, and small C]

slide-27
SLIDE 27

¡ SVM in the “natural” form:

arg min_{w,b} ½ w ⋅ w + C ∑_{i=1..n} max{0, 1 − y_i (w ⋅ x_i + b)}

Here ½ w ⋅ w is the margin term, C is the regularization parameter, and the sum is the empirical loss L (how well we fit training data)

Equivalently, as the constrained problem:

min_{w,b} ½‖w‖² + C ∑_{i=1..n} ξ_i
s.t. ∀i, y_i (w ⋅ x_i + b) ≥ 1 − ξ_i, ξ_i ≥ 0

¡ SVM uses “Hinge Loss”: max{0, 1 − z}, where z = y_i (w ⋅ x_i + b)

[Plot: 0/1 loss vs. hinge-loss penalty as a function of z]

2/27/20 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 27
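A minimal NumPy sketch (mine, not from the slides) of the “natural form” objective above; svm_objective and the toy data are assumed names and values.

import numpy as np

def svm_objective(w, b, X, y, C):
    """J(w, b) = 1/2 w.w + C * sum_i max{0, 1 - y_i (w.x_i + b)}."""
    hinge = np.maximum(0.0, 1.0 - y * (X @ w + b))   # per-example hinge loss
    return 0.5 * np.dot(w, w) + C * np.sum(hinge)

X = np.array([[1.0, 2.0], [2.0, -1.0], [-1.0, -1.0]])
y = np.array([+1.0, +1.0, -1.0])
print(svm_objective(np.array([0.5, 0.5]), b=0.0, X=X, y=y, C=1.0))   # 0.75 here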

slide-28
SLIDE 28
slide-29
SLIDE 29

¡ Want to estimate w and b!

§ Standard way: Use a solver!

§ Solver: software for finding solutions to “common” optimization problems

¡ Use a quadratic solver:

§ Minimize quadratic function
§ Subject to linear constraints

¡ Problem: Solvers are inefficient for big data!

2/27/20 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 29

min_{w,b} ½ w ⋅ w + C ∑_{i=1..n} ξ_i
s.t. ∀i, y_i (x_i ⋅ w + b) ≥ 1 − ξ_i, ξ_i ≥ 0

slide-30
SLIDE 30

¡ Want to minimize J(w,b):

¡ Compute the gradient ∇J(j) w.r.t. w(j)

2/27/20 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 30

J(w, b) = ½ ∑_{j=1..d} (w(j))² + C ∑_{i=1..n} max{0, 1 − y_i (∑_{j=1..d} w(j) x_i(j) + b)}

where the second term is the empirical loss L(x_i, y_i)

∇J(j) = ∂J(w, b)/∂w(j) = w(j) + C ∑_{i=1..n} ∂L(x_i, y_i)/∂w(j)

∂L(x_i, y_i)/∂w(j) = 0             if y_i (w ⋅ x_i + b) ≥ 1
∂L(x_i, y_i)/∂w(j) = −y_i x_i(j)   else
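A sketch of the gradient formulas above in NumPy (my code; the function name is an assumption). Examples with y_i (w ⋅ x_i + b) ≥ 1 contribute nothing; the rest contribute −y_i x_i(j).

import numpy as np

def svm_gradient(w, b, X, y, C):
    """Return (dJ/dw, dJ/db) for J(w,b) = 1/2 ||w||^2 + C * sum_i max{0, 1 - y_i(w.x_i + b)}."""
    margins = y * (X @ w + b)
    viol = margins < 1.0                       # examples whose hinge term is active
    grad_w = w - C * (X[viol].T @ y[viol])     # w(j) + C * sum_i dL(x_i, y_i)/dw(j)
    grad_b = -C * np.sum(y[viol])              # b is not regularized
    return grad_w, grad_b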

slide-31
SLIDE 31

¡ Gradient descent:

Iterate until convergence:
• For j = 1 … d
  • Evaluate: ∇J(j)
  • Update: w’(j) ← w(j) − η ∇J(j)
• w ← w’

¡ Problem:

§ Computing ∇J(j) takes O(n) time!

§ n … size of the training dataset

2/27/20 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 31

∇J(j) = ∂J(w, b)/∂w(j) = w(j) + C ∑_{i=1..n} ∂L(x_i, y_i)/∂w(j)

η … learning rate parameter
C … regularization parameter
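A full-batch gradient-descent sketch (my code, not the slides'; hyperparameter values are arbitrary): every update recomputes the gradient over all n examples, which is exactly the O(n) cost noted above.

import numpy as np

def train_svm_batch_gd(X, y, C=1.0, eta=0.01, iters=500):
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(iters):                        # each iteration touches all n examples
        viol = y * (X @ w + b) < 1.0
        grad_w = w - C * (X[viol].T @ y[viol])
        grad_b = -C * np.sum(y[viol])
        w, b = w - eta * grad_w, b - eta * grad_b
    return w, b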

slide-32
SLIDE 32

¡ Stochastic Gradient Descent

§ Instead of evaluating gradient over all examples evaluate it for each individual training example

¡ Stochastic gradient descent:

2/27/20 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 32

We just had (full gradient):

∇J(j) = w(j) + C ∑_{i=1..n} ∂L(x_i, y_i)/∂w(j)

With SGD, evaluate the gradient at a single example:

∇J(j)(x_i) = w(j) + C ⋅ ∂L(x_i, y_i)/∂w(j)

Iterate until convergence:
• For i = 1 … n
  • For j = 1 … d
    • Compute: ∇J(j)(x_i)
    • Update: w(j) ← w(j) − η ∇J(j)(x_i)

Notice: no summation over i anymore
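A per-example SGD sketch of the loop above (my code; hyperparameter values are arbitrary): each update costs O(d) instead of O(n).

import numpy as np

def train_svm_sgd(X, y, C=1.0, eta=0.01, epochs=10, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        for i in rng.permutation(n):              # visit examples in random order
            if y[i] * (X[i] @ w + b) < 1.0:       # hinge term active for this example
                w -= eta * (w - C * y[i] * X[i])
                b -= eta * (-C * y[i])
            else:                                 # only the regularizer contributes
                w -= eta * w
    return w, b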
slide-33
SLIDE 33

¡ Batch Gradient Descent

§ Calculates the error for each example in the training dataset, but updates the model only after all examples have been evaluated (i.e., at the end of a training epoch)
§ PROS: fewer updates, more stable error gradient
§ CONS: usually requires the whole dataset in memory, slower than SGD

¡ Mini-Batch Gradient Descent

§ Like BGD, but using smaller batches of training data: a balance between the robustness of BGD and the efficiency of SGD

2/27/20 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 33
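A mini-batch sketch (my code; the batch size B and other hyperparameters are assumptions): each update uses the subgradient summed over a random batch of B examples.

import numpy as np

def train_svm_minibatch(X, y, C=1.0, eta=0.01, B=32, steps=1000, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(steps):
        idx = rng.choice(n, size=min(B, n), replace=False)   # sample a mini-batch
        Xb, yb = X[idx], y[idx]
        viol = yb * (Xb @ w + b) < 1.0
        w -= eta * (w - C * (Xb[viol].T @ yb[viol]))
        b -= eta * (-C * np.sum(yb[viol]))
    return w, b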

slide-34
SLIDE 34
slide-35
SLIDE 35

¡ Dataset:

§ Reuters RCV1 document corpus

§ Predict a category of a document

§ One vs. the rest classification

§ n = 781,000 training examples (documents)
§ 23,000 test examples
§ d = 50,000 features

§ One feature per word
§ Remove stop-words
§ Remove low-frequency words

2/27/20 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 35

slide-36
SLIDE 36

¡ Questions:

§ (1) Is SGD successful at minimizing J(w,b)?
§ (2) How quickly does SGD find the min of J(w,b)?
§ (3) What is the error on a test set?

2/27/20 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 36

[Table: training time, value of J(w,b), and test error for Standard SVM, “Fast SVM”, and SGD-SVM]

(1) SGD-SVM is successful at minimizing the value of J(w,b)
(2) SGD-SVM is super fast
(3) SGD-SVM test set error is comparable

slide-37
SLIDE 37

2/27/20 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 37

Optimization quality: | J(w,b) – J (wopt,bopt) |

[Plot: optimization quality vs. training time for Conventional SVM and SGD-SVM]

For optimizing J(w,b) within reasonable quality, SGD-SVM is super fast

slide-38
SLIDE 38

2/27/20 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 38

slide-39
SLIDE 39

¡ Need to choose learning rate η and t₀

¡ Tricks:

§ Choose t₀ so that the expected initial updates are comparable with the expected size of the weights
§ Choose η:

§ Select a small subsample
§ Try various rates η (e.g., 10, 1, 0.1, 0.01, …)
§ Pick the one that most reduces the cost
§ Use η for the next 100k iterations on the full dataset

2/27/20 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 39

w_{t+1} ← w_t − η_t ⋅ (w_t + C ∂L(x_i, y_i)/∂w),   with η_t = η / (t + t₀)
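A sketch of the rate-selection trick above (my code; the candidate rates, subsample size, and t₀ are assumptions): run a few SGD steps with the decaying step η/(t + t₀) on a small subsample and keep the rate that most reduces the cost.

import numpy as np

def pick_learning_rate(X, y, C=1.0, t0=10, sub=1000, steps=200, seed=0,
                       candidates=(10.0, 1.0, 0.1, 0.01)):
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=min(sub, len(X)), replace=False)
    Xs, ys = X[idx], y[idx]                      # small subsample for the trial runs

    def cost(w, b):
        return 0.5 * w @ w + C * np.sum(np.maximum(0, 1 - ys * (Xs @ w + b)))

    def run(eta):
        w, b = np.zeros(X.shape[1]), 0.0
        for t in range(steps):
            i = rng.integers(len(Xs))
            step = eta / (t + t0)                # decaying step size from the slide
            if ys[i] * (Xs[i] @ w + b) < 1:
                w -= step * (w - C * ys[i] * Xs[i])
                b += step * C * ys[i]
            else:
                w -= step * w
        return cost(w, b)

    return min(candidates, key=run)              # rate that most reduces the cost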

slide-40
SLIDE 40

¡ Idea 1:

One against all: Learn 3 classifiers

§ + vs. {o, -}
§ - vs. {o, +}
§ o vs. {+, -}

Obtain: w+ b+, w- b-, wo bo

¡ How to classify?

¡ Return class c = arg max_c (w_c ⋅ x + b_c)

2/27/20 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 40
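A one-against-all prediction sketch (my code; the class labels and weights are made up): return the class c maximizing w_c ⋅ x + b_c, as on the slide.

import numpy as np

def predict_one_vs_rest(x, classifiers):
    """classifiers: dict mapping class label c -> (w_c, b_c)."""
    return max(classifiers, key=lambda c: classifiers[c][0] @ x + classifiers[c][1])

clfs = {"+": (np.array([1.0, 0.0]), 0.0),
        "-": (np.array([-1.0, 0.0]), 0.0),
        "o": (np.array([0.0, 1.0]), -0.5)}
print(predict_one_vs_rest(np.array([2.0, 0.3]), clfs))   # "+" has the largest score here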

slide-41
SLIDE 41

¡ Idea 2: Learn 3 sets of weights simultaneously!

§ For each class c estimate w_c, b_c

§ Want the correct class y_i to have the highest margin:

w_{y_i} ⋅ x_i + b_{y_i} ≥ 1 + w_c ⋅ x_i + b_c     ∀c ≠ y_i, ∀i

2/27/20 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 41


slide-42
SLIDE 42

¡ Optimization problem:

§ To obtain parameters w_c, b_c (for each class c) we can use similar techniques as for the 2-class SVM

¡ SVM is widely perceived as a very powerful learning algorithm

2/27/20 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 42

min_{w,b} ½ ∑_c ‖w_c‖² + C ∑_{i=1..n} ξ_i
s.t. w_{y_i} ⋅ x_i + b_{y_i} ≥ w_c ⋅ x_i + b_c + 1 − ξ_i,   ∀c ≠ y_i, ∀i
     ξ_i ≥ 0, ∀i

slide-43
SLIDE 43
slide-44
SLIDE 44

¡ The Unreasonable Effectiveness of Data

§ In 2017, Google revisited a 15-year-old experiment on the effect of data and model size in ML, focusing on the latest Deep Learning models in computer vision

¡ Findings:

§ Performance increases logarithmically based on volume of training data
§ Complexity of modern ML models (i.e., deep neural nets) allows for even further performance gains

¡ Large datasets + large ML models => amazing results!!

2/27/20 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 44

“Revisiting Unreasonable Effectiveness of Data in Deep Learning Era”: https://arxiv.org/abs/1707.02968

slide-45
SLIDE 45

¡ Last lecture: Decision Trees (and PLANET) as a prime example of Data Parallelism in ML

¡ Today’s lecture: Multiclass SVMs, Statistical models, Neural Networks, etc. can leverage both Data Parallelism and Model Parallelism

§ State-of-the-art models can have more than 100 million parameters!

2/27/20 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 45

slide-46
SLIDE 46

2/27/20 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 46

[Figure: a model partitioned across machines in stages — M2 and M4 must wait for the 1st stage to complete!]

slide-47
SLIDE 47

[Diagram legend: Model; Machine (Model Partition); Core; Training Data]

¡ Unsupervised or Supervised Objective

¡ Minibatch Stochastic Gradient Descent (SGD)

¡ Model parameters sharded by partition

¡ 10s, 100s, or 1000s of cores per model

2/27/20 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 47

slide-48
SLIDE 48

[Diagram: model workers compute updates ∆p on their data and send them to the Parameter Server, which applies p’ = p + ∆p, then p’’ = p’ + ∆p’]

¡ Parameter Server: Key/Value store

¡ Keys index the model parameters (e.g., weights)

¡ Values are the parameters of the ML model (e.g., a neural network)

¡ Systems challenges:

§ Bandwidth limits § Synchronization § Fault tolerance

2/27/20 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 48
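A conceptual key/value parameter-server sketch (my toy code, not Google's system): keys index parameter shards, values hold the parameters, and workers push deltas that the server folds in as p’ = p + ∆p.

import numpy as np

class ParameterServer:
    def __init__(self, shapes):
        # one entry per parameter shard, e.g. {"w": (50000,), "b": ()}
        self.params = {key: np.zeros(shape) for key, shape in shapes.items()}

    def pull(self, key):
        return self.params[key].copy()            # worker fetches the current value

    def push(self, key, delta):
        self.params[key] += delta                 # apply p' = p + delta_p

server = ParameterServer({"w": (3,), "b": ()})
server.push("w", np.array([0.1, -0.2, 0.0]))      # a worker's update
print(server.pull("w"))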

slide-49
SLIDE 49

[Diagram: model workers on data shards send updates ∆p to the Parameter Server, which applies p’ = p + ∆p]

Asynchronous Distributed Stochastic Gradient Descent

Why do parallel updates work?

2/27/20 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 49

slide-50
SLIDE 50

¡ Key idea: don’t synchronize, just overwrite parameters opportunistically from multiple workers (i.e., servers)

§ Same implementation as SGD, just without locking!

¡ In theory, Async SGD converges, but at a slower rate than the serial version.

¡ In practice, when gradient updates are sparse (i.e., high-dimensional data), same convergence!

¡ Recht et al. “HOGWILD!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent”, 2011

2/27/20 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 50

RR is a super optimized version of online Gradient Descent
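A toy lock-free (HOGWILD!-style) sketch (my code, far from the optimized original): several threads update one shared weight vector without any synchronization; when gradient updates are sparse, such collisions rarely hurt convergence.

import threading
import numpy as np

def async_sgd(X, y, C=1.0, eta=0.01, epochs=5, n_workers=4, seed=0):
    n, d = X.shape
    w = np.zeros(d)                               # shared parameters, no lock

    def worker(worker_id):
        rng = np.random.default_rng(seed + worker_id)
        for _ in range(epochs * n // n_workers):
            i = rng.integers(n)
            if y[i] * (X[i] @ w) < 1.0:           # bias term omitted for brevity
                w[:] = w - eta * (w - C * y[i] * X[i])   # unsynchronized write
            else:
                w[:] = w - eta * w

    threads = [threading.Thread(target=worker, args=(k,)) for k in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return w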

slide-51
SLIDE 51

2/27/20 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 51

[Figure: HOGWILD! pseudocode (compared with SGD) — P is the number of partitions/processors; component-wise gradient updates rely on sparsity]

slide-52
SLIDE 52

Asynchronous Distributed Stochastic Gradient Descent

[Diagram: multiple model replicas (model workers) on data shards communicating with a Parameter Server]

From an engineering standpoint, this is much better than a single model with the same number of total machines:

¡ Synchronization boundaries involve fewer machines
¡ Better robustness to individual slow machines
¡ Makes forward progress even during evictions/restarts

2/27/20 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 52

¡ Google, Large Scale Distributed Deep Networks [2012]

¡ All ingredients together:

§ Model and Data parallelism
§ Async SGD

¡ Dawn of modern Deep Learning