10-701 Ensemble of Trees: Bagging and Random Forest
Bagging • Bagging, or bootstrap aggregation, is a technique for reducing the variance of an estimated prediction function. • For classification, a committee of trees is grown and each tree casts a vote for the predicted class.
Bootstrap The basic idea: randomly draw datasets with replacement from the training data, each sample the same size as the original training set.
Bagging tree classifier (figure walkthrough: training data with N examples and M features)
• Create bootstrap samples from the training data.
• Construct a decision tree from each bootstrap sample.
• To classify a new example, run it through every tree and take the majority vote.
Bagging • Given training data $Z = \{(x_1, y_1), (x_2, y_2), \dots, (x_N, y_N)\}$, draw bootstrap samples $Z^{*b}$, $b = 1, \dots, B$. • If $\hat f^{*b}(x)$ is the prediction at input $x$ when bootstrap sample $b$ is used for training, the bagged estimate is $\hat f_{bag}(x) = \frac{1}{B}\sum_{b=1}^{B} \hat f^{*b}(x)$ (for classification, take the majority vote of the $B$ trees).
Bagging • Alternatively, treat the voting proportions as class probabilities. (Figure from Hastie et al.)
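To make the procedure concrete, here is a minimal Python sketch of bagged decision trees, assuming scikit-learn's DecisionTreeClassifier, a numpy feature matrix X, integer class labels y, and B bootstrap rounds; it illustrates the steps above and is not code from the course.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagged_trees(X, y, B=100, seed=0):
    """Train B decision trees, each on a bootstrap sample of (X, y)."""
    rng = np.random.default_rng(seed)
    N = len(y)
    trees = []
    for _ in range(B):
        idx = rng.integers(0, N, size=N)   # draw N examples with replacement
        trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return trees

def predict_majority(trees, X):
    """Each tree casts a vote; return the majority class per example
    (assumes integer class labels)."""
    votes = np.stack([t.predict(X) for t in trees]).astype(int)   # shape (B, n_samples)
    return np.array([np.bincount(col).argmax() for col in votes.T])
```

Dividing the per-class vote counts by B instead of taking the argmax gives the voting proportions mentioned above.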
Random forest classifier • Random forest: an extension of bagging which, at each split, considers only a random subset of the features rather than all of them.
Random Forest Classifier (figure walkthrough: training data with N examples and M features)
• Create bootstrap samples from the training data.
• Construct a decision tree from each bootstrap sample.
• At each node, choose the split feature from only a random subset of m < M features.
• To classify a new example, run it through every tree and take the majority vote.
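In practice the whole pipeline is available in scikit-learn; a sketch, where X_train, y_train and X_test are assumed placeholders and max_features plays the role of m < M:

```python
from sklearn.ensemble import RandomForestClassifier

# B = 100 trees; at each split consider only a random subset of the features
forest = RandomForestClassifier(n_estimators=100,
                                max_features="sqrt",   # m ~ sqrt(M), a common heuristic
                                bootstrap=True,
                                random_state=0)
forest.fit(X_train, y_train)                 # X_train, y_train: assumed training data
y_pred = forest.predict(X_test)              # majority vote over the trees
class_probs = forest.predict_proba(X_test)   # voting proportions as probabilities
```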
Random forest for biology (figure: example decision tree built over genomic/proteomic features such as TAP, Y2H, HMS-PCI, GeneExpress, ProteinExpress, SynExpress, GeneOccur, Domain, GOProcess, and GOLocalization, with Y/N predictions at the leaves).
10-701 Machine Learning: Regression
Where we are
• Density estimator: inputs → probability √
• Classifier: inputs → predicted category √
• Regressor: inputs → predicted real number (today)
Choosing a restaurant
• In everyday life we need to make decisions by taking into account lots of factors.
• The question is what weight we put on each of these factors (how important they are relative to the others).
• Assume we would like to build a recommender system for ranking potential restaurants based on an individual's preferences.
• If we have many observations, we may be able to recover the weights:

  Reviews (out of 5 stars)   $    Distance   Cuisine   Score (out of 10)
  4                          30   21         7         8.5
  2                          15   12         8         7.8
  5                          27   53         9         6.7
  3                          20   5          6         5.4
Linear regression
• Given an input x we would like to compute an output y.
• For example:
  - Predict height from age
  - Predict Google's price from Yahoo's price
  - Predict distance from wall using sensor readings
• Note that now y can be continuous. (Figure: scatter plot of y vs. x.)
Linear regression
• Given an input x we would like to compute an output y.
• In linear regression we assume that y and x are related by the equation y = wx + ε, where w is a parameter and ε represents measurement or other noise. (Figure: the observed values scattered around the line we are trying to predict.)
Linear regression y = wx + ε
• Our goal is to estimate w from training data of <x_i, y_i> pairs.
• One way to find such a relationship is to minimize the least squares error: $\arg\min_w \sum_i (y_i - w x_i)^2$
• Several other approaches can be used as well.
• So why least squares?
  - It minimizes the squared distance between the measurements and the predicted line.
  - It has a nice probabilistic interpretation: if the noise is Gaussian with mean 0, then least squares is also the maximum likelihood estimate of w.
  - It is easy to compute.
Solving linear regression using least squares minimization
• You should be familiar with this by now…
• We just take the derivative w.r.t. w and set it to 0:
  $\frac{\partial}{\partial w} \sum_i (y_i - w x_i)^2 = -2 \sum_i x_i (y_i - w x_i)$
  $-2 \sum_i x_i (y_i - w x_i) = 0 \;\Rightarrow\; \sum_i x_i y_i = w \sum_i x_i^2$
  $w = \frac{\sum_i x_i y_i}{\sum_i x_i^2}$
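A minimal numpy sketch of this closed-form solution for the no-intercept model, assuming 1-D arrays x and y:

```python
import numpy as np

def fit_no_intercept(x, y):
    """Least-squares slope for a line through the origin: w = sum(x*y) / sum(x**2)."""
    return np.sum(x * y) / np.sum(x ** 2)
```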
Regression example • Generated: w=2 • Recovered: w=2.03 • Noise: std=1
Regression example • Generated: w=2 • Recovered: w=2.05 • Noise: std=2
Regression example • Generated: w=2 • Recovered: w=2.08 • Noise: std=4
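A sketch of the kind of experiment behind these three slides, assuming data generated as y = 2x plus zero-mean Gaussian noise; the exact recovered values will differ from the slides depending on the random draw.

```python
import numpy as np

rng = np.random.default_rng(1)
true_w = 2.0

for noise_std in (1, 2, 4):
    x = rng.uniform(-3, 3, size=100)
    y = true_w * x + rng.normal(0, noise_std, size=100)
    w_hat = np.sum(x * y) / np.sum(x ** 2)        # closed-form least squares
    print(f"noise std={noise_std}: recovered w={w_hat:.2f}")
```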
Bias term
• So far we assumed that the line passes through the origin.
• What if the line does not?
• No problem, simply change the model to y = w_0 + w_1 x + ε.
• Can use least squares to determine w_0, w_1:
  $w_1 = \frac{\sum_i x_i (y_i - w_0)}{\sum_i x_i^2}$,  $w_0 = \frac{\sum_i (y_i - w_1 x_i)}{n}$
• (Just a second, we will soon give a simpler solution.)
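One version of that simpler solution: absorb the bias by appending a constant-1 feature, so w_0 becomes just another weight in a single least-squares problem. A numpy sketch, assuming 1-D arrays x and y:

```python
import numpy as np

def fit_with_intercept(x, y):
    """Append a column of ones so the bias w0 is learned like any other weight."""
    X = np.column_stack([np.ones_like(x), x])     # rows are [1, x_i]
    w, *_ = np.linalg.lstsq(X, y, rcond=None)     # solves min ||Xw - y||^2
    return w                                      # w[0] = w0 (bias), w[1] = w1 (slope)
```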
Multivariate regression
• What if we have several inputs? For example, stock prices for Yahoo, Microsoft and Ebay for the Google prediction task.
• This becomes a multivariate linear regression problem.
• Again, it's easy to model: y = w_0 + w_1 x_1 + … + w_k x_k + ε
  (e.g., Google's stock price as a weighted combination of Yahoo's and Microsoft's stock prices)
• However, not all functions can be approximated using the input values directly.
$y = 10 + 3x_1^2 - 2x_2^2 + \varepsilon$
In some cases we would like to use polynomial or other terms based on the input data. Are these still linear regression problems? Yes! As long as the equation is linear in the coefficients, it is still a linear regression problem.
Non-linear basis functions
• So far we only used the observed values themselves.
• However, linear regression can be applied in the same way to functions of these values.
• As long as these functions can be directly computed from the observed values, the model is still linear in the parameters and the problem remains a linear regression problem:
  $y = w_0 + w_1 x_1^2 + \dots + w_k x_k^2 + \varepsilon$
Non-linear basis functions
• What type of functions can we use?
• A few common examples:
  - Polynomial: $\phi_j(x) = x^j$ for $j = 0, \dots, n$
  - Gaussian: $\phi_j(x) = \frac{(x - \mu_j)^2}{2\sigma_j^2}$
  - Sigmoid: $\phi_j(x) = \frac{1}{1 + \exp(-s_j x)}$
• Any function of the input values can be used; the solution for the parameters of the regression remains the same.
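A sketch of turning such basis functions into a design matrix; the degree and slopes below are illustrative assumptions, and the weights are then found by least squares exactly as before.

```python
import numpy as np

def polynomial_basis(x, degree):
    """Design-matrix columns phi_j(x) = x**j for j = 0..degree (j = 0 is the bias column)."""
    return np.column_stack([x ** j for j in range(degree + 1)])

def sigmoid_basis(x, slopes):
    """Design-matrix columns phi_j(x) = 1 / (1 + exp(-s_j * x)), one per slope s_j."""
    return np.column_stack([1.0 / (1.0 + np.exp(-s * x)) for s in slopes])

# Whatever the basis, the regression weights are still found by least squares on Phi:
# w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
```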
General linear regression problem
• Using our new notation, basis function linear regression can be written as $y = \sum_{j=0}^{n} w_j \phi_j(x)$
• where $\phi_j(x)$ can be either $x_j$ for multivariate regression or one of the non-linear bases we defined.
• Once again we can use least squares to find the optimal solution.
LMS for the general linear regression problem
• Model: $y = \sum_{j=0}^{k} w_j \phi_j(x)$
• Our goal is to minimize the loss function $J(\mathbf{w}) = \sum_i \bigl(y_i - \sum_j w_j \phi_j(x_i)\bigr)^2$, where $\mathbf{w}$ is a vector of dimension k+1, $\phi(x_i)$ is a vector of dimension k+1, and $y_i$ is a scalar.
• Moving to vector notation: $J(\mathbf{w}) = \sum_i \bigl(y_i - \mathbf{w}^T \phi(x_i)\bigr)^2$
• Taking the derivative w.r.t. $\mathbf{w}$: $\frac{\partial J}{\partial \mathbf{w}} = -2 \sum_i \bigl(y_i - \mathbf{w}^T \phi(x_i)\bigr) \phi(x_i)^T$
• Equating to 0: $\sum_i y_i \, \phi(x_i)^T = \mathbf{w}^T \sum_i \phi(x_i) \phi(x_i)^T$
• Define the design matrix
  $\Phi = \begin{pmatrix} \phi_0(x^1) & \phi_1(x^1) & \cdots & \phi_k(x^1) \\ \phi_0(x^2) & \phi_1(x^2) & \cdots & \phi_k(x^2) \\ \vdots & & \ddots & \vdots \\ \phi_0(x^n) & \phi_1(x^n) & \cdots & \phi_k(x^n) \end{pmatrix}$
• Then solving for $\mathbf{w}$ we get $\mathbf{w} = (\Phi^T \Phi)^{-1} \Phi^T \mathbf{y}$
LMS for the general linear regression problem
• $J(\mathbf{w}) = \sum_i \bigl(y_i - \mathbf{w}^T \phi(x_i)\bigr)^2$
• Solving for $\mathbf{w}$ gives $\mathbf{w} = (\Phi^T \Phi)^{-1} \Phi^T \mathbf{y}$, where $\mathbf{y}$ is an n-entry vector, $\mathbf{w}$ is a (k+1)-entry vector, and $\Phi$ is an n × (k+1) matrix.
• This solution is also known as the 'pseudo-inverse' solution.
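A numpy sketch of this pseudo-inverse solution, assuming a design matrix Phi of shape (n, k+1) and a length-n target vector y; numerically, np.linalg.lstsq or np.linalg.pinv is usually preferred over explicitly inverting Φ^T Φ.

```python
import numpy as np

def least_squares_weights(Phi, y):
    """w = (Phi^T Phi)^{-1} Phi^T y  -- the pseudo-inverse solution via the normal equations."""
    return np.linalg.solve(Phi.T @ Phi, Phi.T @ y)

# Numerically safer equivalents:
# w = np.linalg.pinv(Phi) @ y
# w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
```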
Example: Polynomial regression
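The slide's figure is not reproduced here; as a stand-in, a short end-to-end sketch of polynomial regression using the pieces above, with an assumed sine-shaped toy dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 30)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, size=x.shape)   # noisy toy data

degree = 3
Phi = np.column_stack([x ** j for j in range(degree + 1)])     # phi_j(x) = x^j
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)                    # least-squares weights
y_hat = Phi @ w                                                # fitted curve at the training points
```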
A probabilistic interpretation
• Our least squares minimization solution can also be motivated by a probabilistic interpretation of the regression problem: $y = \mathbf{w}^T \phi(x) + \varepsilon$, with zero-mean Gaussian noise $\varepsilon$.
• The MLE for $\mathbf{w}$ in this model is the same as the solution we derived for the least squares criterion: $\mathbf{w} = (\Phi^T \Phi)^{-1} \Phi^T \mathbf{y}$
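A short worked derivation of that equivalence, assuming i.i.d. Gaussian noise ε ~ N(0, σ²) as stated on the earlier least-squares slide:

```latex
% Likelihood of the data under y_i = w^T phi(x_i) + eps_i,  eps_i ~ N(0, sigma^2)
p(y_1,\dots,y_n \mid \mathbf{w})
  = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}}
    \exp\!\left(-\frac{\bigl(y_i - \mathbf{w}^T\phi(x_i)\bigr)^2}{2\sigma^2}\right)

% Log-likelihood: the constant and the 1/(2 sigma^2) factor do not affect the argmax over w
\log p(\mathbf{y}\mid\mathbf{w})
  = \text{const} - \frac{1}{2\sigma^2}\sum_{i=1}^{n}\bigl(y_i - \mathbf{w}^T\phi(x_i)\bigr)^2

% So maximizing the likelihood is minimizing the least-squares loss, whose minimizer is the pseudo-inverse solution
\hat{\mathbf{w}}_{\text{MLE}}
  = \arg\min_{\mathbf{w}} \sum_{i=1}^{n}\bigl(y_i - \mathbf{w}^T\phi(x_i)\bigr)^2
  = (\Phi^T\Phi)^{-1}\Phi^T\mathbf{y}
```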