1. What is Machine Learning?

• Many different forms of “Machine Learning”
• Machine learning widely used in many contexts
   We focus on the problem of prediction
• Want to make a prediction based on observations
   Vector X of m observed variables: <X1, X2, …, Xm>
     o X1, X2, …, Xm are called “input features/variables”
     o Also called “independent variables,” but this can be misleading!
       • X1, X2, …, Xm need not be (and usually are not) independent
   Based on observed X, want to predict unseen variable Y
     o Y called “output feature/variable” (or the “dependent variable”)
   Seek to “learn” a function g(X) to predict Y: Ŷ = g(X)
     o When Y is discrete, prediction of Y is called “classification”
     o When Y is continuous, prediction of Y is called “regression”

A (Very Short) List of Applications

 Stock price prediction
   o Using economic indicators, predict if stock will go up/down
 Computational biology and medical diagnosis
   o Predicting gene expression based on DNA
   o Determine likelihood of cancer using clinical/demographic data
 Predict people likely to purchase product or click on ad
   o “Based on past purchases, you might want to buy…”
 Credit card fraud and telephone fraud detection
   o Based on past purchases/phone calls, is a new one fraudulent?
     • Saves companies billions(!) of dollars annually
 Spam E-mail detection (gmail, hotmail, many others)

Spam, Spam… Go Away!

• This is spam: [example spam e-mail shown on slide]
• The constant battle with spam

What is Bayes Doing in My Mail Server?

• Let’s get Bayesian on your spam:

  Content analysis details: (49.5 hits, 7.0 required)
  0.9  RCVD_IN_PBL      RBL: Received via a relay in Spamhaus PBL
                        [93.40.189.29 listed in zen.spamhaus.org]
  1.5  URIBL_WS_SURBL   Contains an URL listed in the WS SURBL blocklist
                        [URIs: recragas.cn]
  5.0  URIBL_JP_SURBL   Contains an URL listed in the JP SURBL blocklist
                        [URIs: recragas.cn]
  5.0  URIBL_OB_SURBL   Contains an URL listed in the OB SURBL blocklist
                        [URIs: recragas.cn]
  5.0  URIBL_SC_SURBL   Contains an URL listed in the SC SURBL blocklist
                        [URIs: recragas.cn]
  2.0  URIBL_BLACK      Contains an URL listed in the URIBL blacklist
                        [URIs: recragas.cn]
  8.0  BAYES_99         BODY: Bayesian spam probability is 99 to 100%
                        [score: 1.0000]

• Who was crazy enough to think of that?
   “And machine-learning algorithms developed to merge and rank large sets of Google search results allow us to combine hundreds of factors to classify spam.”
  Source: http://www.google.com/mail/help/fightspam/spamexplained.html
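As the header “(49.5 hits, 7.0 required)” suggests, each triggered rule contributes a score and the message is flagged once the total reaches the required threshold. Below is a minimal Python sketch of that bookkeeping; the rule names and weights are copied from the report above (only the rules shown there are included, so the subtotal is smaller than the report’s 49.5), and the function name and structure are illustrative only, not the actual filter’s code.

  # Illustrative sketch: sum the scores of triggered rules and flag the message
  # as spam when the total reaches the required threshold.
  TRIGGERED_RULES = {
      "RCVD_IN_PBL": 0.9,      # received via a relay in Spamhaus PBL
      "URIBL_WS_SURBL": 1.5,   # URL listed in the WS SURBL blocklist
      "URIBL_JP_SURBL": 5.0,
      "URIBL_OB_SURBL": 5.0,
      "URIBL_SC_SURBL": 5.0,
      "URIBL_BLACK": 2.0,
      "BAYES_99": 8.0,         # Bayesian spam probability is 99 to 100%
  }

  REQUIRED_SCORE = 7.0         # "7.0 required" in the report above

  def is_spam(rule_scores, required=REQUIRED_SCORE):
      """Return (spam_flag, total_score) for the triggered rules."""
      total = sum(rule_scores.values())
      return total >= required, total

  spam, total = is_spam(TRIGGERED_RULES)
  print(f"total score = {total:.1f}, spam = {spam}")
  # total score = 27.4, spam = True  (the full report lists further rules, totaling 49.5)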
Training a Learning Machine

• We consider the statistical learning paradigm here
   We are given a set of N “training” instances
     o Each training instance is a pair: (<x1, x2, …, xm>, y)
     o Training instances are previously observed data
     o Gives the output value y associated with each observed vector of input values <x1, x2, …, xm>
   Training data: set of N pre-classified data instances
     o N training pairs: (<x>(1), y(1)), (<x>(2), y(2)), …, (<x>(N), y(N))
       • Use superscripts to denote the i-th training instance
   Learning algorithm: method for determining g(X)
     o Given a new input observation of X = <X1, X2, …, Xm>
     o Use g(X) to compute a corresponding output (prediction) Ŷ
     o When the prediction is discrete, we call g(X) a “classifier” and call the output the predicted “class” of the input

The Machine Learning Process

  [Diagram: Training data X → Learning algorithm → g(X) (Classifier) → Output (Class)]

• Learning: use training data to specify g(X)
   Generally, first select a parametric form for g(X)
   Then, estimate the parameters of the model g(X) using the training data
   For regression, usually want g(X) that minimizes E[(Y - g(X))^2]
     o Mean squared error (MSE) “loss” function. (Others exist.)
   For classification, generally the best choice is g(X) = arg max_y P(Y = y | X)
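As a concrete illustration of the MSE loss above, this short Python sketch (the variable and function names are mine, not from the slides) computes the empirical version of E[(Y - g(X))^2] for a candidate predictor g over the N training pairs; learning amounts to choosing the g that makes this number small.

  # Empirical MSE of a candidate predictor g over training pairs (x_i, y_i).
  def empirical_mse(g, training_pairs):
      return sum((y - g(x)) ** 2 for x, y in training_pairs) / len(training_pairs)

  # Hypothetical tiny training set and two candidate linear models:
  pairs = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]
  print(empirical_mse(lambda x: 2.0 * x, pairs))        # 0.02
  print(empirical_mse(lambda x: 1.0 * x + 1.0, pairs))  # about 1.89 (a worse fit)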

2. Don’t Make Me Get Non-Linear! A Grounding Example: Linear Regression

• Predict real value Y based on observing variable X
   Assume the model is linear: Ŷ = g(X) = aX + b
   Training data
     o Each vector X has one observed variable: <X1> (just call it X)
     o Y is a continuous output variable
     o Given N training pairs: (<x>(1), y(1)), (<x>(2), y(2)), …, (<x>(N), y(N))
       • Use superscripts to denote the i-th training instance
   Determine a and b minimizing E[(Y - g(X))^2]
     o First, write out the objective function:
       E[(Y - g(X))^2] = E[(Y - (aX + b))^2] = E[(Y - aX - b)^2]

• Minimize the objective function E[(Y - aX - b)^2]
   Compute derivatives w.r.t. a and b:
     ∂/∂a E[(Y - aX - b)^2] = E[-2X(Y - aX - b)] = -2E[XY] + 2aE[X^2] + 2bE[X]
     ∂/∂b E[(Y - aX - b)^2] = E[-2(Y - aX - b)] = -2E[Y] + 2aE[X] + 2b
   Set the derivatives to 0 and solve the simultaneous equations:
     a = (E[XY] - E[X]E[Y]) / (E[X^2] - (E[X])^2) = Cov(X, Y) / Var(X) = ρ(X, Y) σ_y / σ_x
     b = E[Y] - aE[X] = μ_y - ρ(X, Y) (σ_y / σ_x) μ_x
   Substitution yields:
     Ŷ = g(X) = ρ(X, Y) (σ_y / σ_x) (X - μ_x) + μ_y
   Estimate parameters based on observed training data: plug the sample estimates of ρ(X, Y), σ_x, σ_y, μ_x, μ_y (computed from the N training pairs) into the formula above to get the fitted predictor g(X)
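Below is a minimal Python sketch of the closed-form fit just derived, assuming nothing beyond those formulas: estimate a = Cov(X, Y) / Var(X) and b = E[Y] - aE[X] from the sample means and (co)variances of the training pairs. The helper name and the toy data are illustrative only.

  # Fit the least-squares line y = a*x + b from training pairs (x_i, y_i).
  def fit_linear(pairs):
      n = len(pairs)
      mean_x = sum(x for x, _ in pairs) / n
      mean_y = sum(y for _, y in pairs) / n
      cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in pairs) / n
      var_x = sum((x - mean_x) ** 2 for x, _ in pairs) / n
      a = cov_xy / var_x           # slope: Cov(X, Y) / Var(X)
      b = mean_y - a * mean_x      # intercept: E[Y] - a E[X]
      return a, b

  # Hypothetical training data: y is roughly 2x + 1 with a little noise.
  pairs = [(1.0, 3.1), (2.0, 4.9), (3.0, 7.2), (4.0, 8.8)]
  a, b = fit_linear(pairs)
  print(f"g(x) = {a:.2f} x + {b:.2f}")   # g(x) = 1.94 x + 1.15, close to the underlying 2x + 1
  print(a * 5.0 + b)                     # prediction for a new observation x = 5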
A Simple Classification Example

• Predict Y based on observing variable X
   X has a discrete value from {1, 2, 3, 4}
     o X denotes temperature range today: <50, 50–60, 60–70, >70
   Y has a discrete value from {rain, sun}
     o Y denotes general weather outlook tomorrow
   Given training data, estimate the joint PMF: p̂_{X,Y}(x, y)
   Note Bayes’ rule:  P(Y | X) = p_{X,Y}(x, y) / p_X(x) = p(x | y) p(y) / p_X(x)
   For new X, predict  Ŷ = g(X) = arg max_y P(Y = y | X)
     o Note p_X(x) is not affected by the choice of y, yielding:
       Ŷ = g(X) = arg max_y P̂(Y | X) = arg max_y P̂(X, Y) = arg max_y P̂(X | Y) P̂(Y)

Estimating the Joint PMF

• Given training data, compute the joint PMF: p_{X,Y}(x, y)
   MLE: count the number of times each pair (x, y) appears
   MAP using a Laplace prior: add 1 to all the MLE counts
   Normalize to get a true distribution (sums to 1)
• Observed 50 data points:

             X = 1   X = 2   X = 3   X = 4
  Y = rain     5       3       2       0
  Y = sun      3       7      10      20

   p̂_MLE = (count in cell) / (total # data points)
   p̂_Laplace = (count in cell + 1) / (total # data points + total # cells)

  MLE estimate:
             X = 1   X = 2   X = 3   X = 4   p_Y(y)
  rain       0.10    0.06    0.04    0.00    0.20
  sun        0.06    0.14    0.20    0.40    0.80
  p_X(x)     0.16    0.20    0.24    0.40    1.00

  Laplace (MAP) estimate:
             X = 1   X = 2   X = 3   X = 4   p_Y(y)
  rain       0.103   0.069   0.052   0.017   0.241
  sun        0.069   0.138   0.190   0.362   0.759
  p_X(x)     0.172   0.207   0.242   0.379   1.00

  (A short code sketch reproducing these tables and the prediction below appears at the end of this section.)

Classify New Observation

• Say today’s temperature is 75, so X = 4
   Recall the X temperature ranges: <50, 50–60, 60–70, >70
• Prediction for Y (weather outlook tomorrow), using either table above:
   Ŷ = arg max_y P̂(X, Y) = arg max_y P̂(X | Y) P̂(Y)
• What if we asked what is the probability of rain tomorrow?
   MLE: absolutely, positively no chance of rain!
   Laplace estimate: very small (~2%) chance (“never say never”)

Classification with Multiple Observables

• Say we have m input values X = <X1, X2, …, Xm>
   Note that the variables X1, X2, …, Xm can be dependent!
   In theory, could predict Y as before, using
     Ŷ = arg max_y P̂(X, Y) = arg max_y P̂(X | Y) P̂(Y)
   Why won’t this necessarily work?
     o Need to estimate P(X1, X2, …, Xm | Y)
     o Fine if m is small, but what if m = 10 or 100 or 10,000?
     o Note: the size of the PMF table is exponential in m (e.g., O(2^m) cells for m binary variables)
     o Need a ridiculous amount of data for good probability estimates!
     o Likely to have many 0’s in the table (bad times)
   Need to consider a simpler model
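For a rough sense of the numbers behind “exponential in m”, here is a tiny Python calculation (the feature counts are arbitrary illustrations): with m binary input features, the table for P(X1, …, Xm | Y) needs one probability per feature combination, per class.

  for m in (10, 30, 100):
      cells_per_class = 2 ** m
      print(f"m = {m:3d}: {cells_per_class:.3e} cells per class")
  # m =  10: 1.024e+03 cells per class
  # m =  30: 1.074e+09 cells per class  (already far more cells than training data)
  # m = 100: 1.268e+30 cells per class  (hopeless to estimate directly)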

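The following Python sketch ties the classification example together: it builds the MLE and Laplace (MAP) estimates of the joint PMF from the 50 observed counts above, then classifies the new observation X = 4 by arg max over P̂(X, Y). Names and structure are illustrative, not from the slides.

  # counts[(x, y)] from the table of 50 observed data points
  counts = {
      (1, "rain"): 5, (2, "rain"): 3, (3, "rain"): 2,  (4, "rain"): 0,
      (1, "sun"):  3, (2, "sun"):  7, (3, "sun"):  10, (4, "sun"):  20,
  }
  total = sum(counts.values())    # 50 data points
  num_cells = len(counts)         # 8 cells in the joint table

  # MLE: normalize the raw counts.  Laplace/MAP: add 1 to every count first.
  p_mle = {cell: c / total for cell, c in counts.items()}
  p_laplace = {cell: (c + 1) / (total + num_cells) for cell, c in counts.items()}

  def predict(x, pmf):
      """Return arg max over y of the estimated joint P(X = x, Y = y)."""
      return max(("rain", "sun"), key=lambda y: pmf[(x, y)])

  x_new = 4                                   # today's temperature is 75, so X = 4
  print(predict(x_new, p_mle))                # sun
  print(predict(x_new, p_laplace))            # sun
  print(p_mle[(4, "rain")])                   # 0.0   -> MLE: no chance of rain at all
  print(round(p_laplace[(4, "rain")], 3))     # 0.017 -> Laplace: small but nonzero (~2%)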