About this class

k-Nearest-Neighbors
Bagging
Boosting

Nearest-Neighbor Methods

Store all training examples. Given a new test example, find the k training points that are closest to it in feature space (distance: Euclidean? Mahalanobis?). Return the majority classification among those k points.

Curse of dimensionality: irrelevant features can dominate the distance computation and hence the classification.

Training is trivial, but how efficiently can we find the k nearest points? Use intelligent data structures like kd-trees. Worst-case behavior is bad for nearest-neighbor search (O(l)), but average-case behavior is much better (although distribution dependent). There is an initial fixed cost to building the tree.

Big caveat: search cost seems to scale badly with the number of dimensions in the feature space!
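A minimal sketch of the k-NN prediction rule described above, using brute-force Euclidean distances in plain NumPy (the function name, k, and array shapes are illustrative, not from the notes):

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k=5):
    """Classify x_query by majority vote among its k nearest neighbours
    in X_train (Euclidean distance); X_train is (n, d), y_train is (n,)."""
    # Brute-force search: compute all n distances, O(n * d) per query.
    dists = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(dists)[:k]                  # indices of the k closest points
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]                 # majority label

# A kd-tree (e.g., scipy.spatial.cKDTree) can replace the brute-force search
# and is much faster on average in low dimensions, but its advantage shrinks
# as the dimensionality of the feature space grows.
```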
Bagging

Bootstrap Aggregating (Breiman, 1994).

Key idea: Build t independent replicates of the training set L by sampling with replacement. Train a classifier on each of them. Predict the majority vote of all these classifiers; in the case of a regression problem, predict the average.

Very simple but effective algorithm! For decision trees: significant improvement in accuracy, but a loss in comprehensibility.

Works well for unstable algorithms. Intuition: unstable algorithms can change their predictions substantially based on small changes in the training set, which is essentially what each replicate training set is doing. When you average over multiple sets of training data, you get a more stable predictor.
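A sketch of this procedure, assuming a generic `train(X, y)` routine that returns a model whose `.predict()` gives a single label for a single example (all names here are illustrative):

```python
import numpy as np
from collections import Counter

def bagging_fit(X, y, train, t=50, seed=0):
    """Train t classifiers, each on a bootstrap replicate of (X, y)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    models = []
    for _ in range(t):
        idx = rng.integers(0, n, size=n)   # sample n indices with replacement
        models.append(train(X[idx], y[idx]))
    return models

def bagging_predict(models, x):
    """Majority vote over the ensemble for classification;
    for regression, return the mean of the predictions instead."""
    votes = [m.predict(x) for m in models]
    return Counter(votes).most_common(1)[0][0]
```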
Let f_A be the aggregated predictor. Then f_A(x) attempts to approximate E_L[f(x)].

How different are the training sets? The probability that a given example is not in a given bootstrap replicate is (1 - 1/n)^n -> 1/e ≈ 0.368 as n -> ∞.

Empirically, 50 replicates give all the benefit of bagging, often a 20% to 40% reduction in error rate.

Each trained model has higher initial variance, since it is effectively trained on a smaller training set. Bagging stable classifiers can somewhat degrade performance. What would happen with linear regression or Naive Bayes?

Boosting

Basic question: Can we take an algorithm that learns weak hypotheses that perform somewhat better than chance and make it into a strong learner?

Answer: yes (Freund and Schapire, various papers).

We'll again build an ensemble classifier, but, unlike bagging, members of the ensemble will have different weights.

Bagging reduces variance (albeit more slowly than 1/n, because the training set is replicated), while boosting reduces bias by making the hypothesis space more flexible.
AdaBoost Algorithm

Given:

Training examples $(x_1, y_1), \ldots, (x_m, y_m)$

A weak learning algorithm, guaranteed to make error $\epsilon \le \frac{1}{2} - \gamma$

Maintain a weight distribution D over training examples. Initialize $D(i) = 1/m$.

Now repeat for a number of rounds T:

1. Train the weak learner using distribution D. This gives a weak hypothesis $h_t : X \to \{\pm 1\}$. $h_t$ has error
$$\epsilon_t = \Pr_{i \sim D_t}[h_t(x_i) \ne y_i]$$

2. Set
$$\alpha_t \leftarrow \frac{1}{2} \log\left(\frac{1 - \epsilon_t}{\epsilon_t}\right)$$

3. Update:
$$D(i) \leftarrow \frac{D(i)\, e^{-\alpha_t}}{Z} \quad \text{if } h_t(x_i) = y_i$$
$$D(i) \leftarrow \frac{D(i)\, e^{\alpha_t}}{Z} \quad \text{if } h_t(x_i) \ne y_i$$
where Z is a normalization factor.

Return the final hypothesis:
$$H(x) = \mathrm{sgn}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$$

Caveat: We need a weak learner that can learn even on hard weight distributions!
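A compact sketch of these updates with decision stumps as the weak learner (NumPy only; the stump search and all names are illustrative choices, not part of the notes):

```python
import numpy as np

def train_stump(X, y, D):
    """Weak learner: the single-feature threshold rule with the lowest
    weighted error under distribution D. Labels are +1 / -1."""
    n, d = X.shape
    best = (None, None, 1, np.inf)             # (feature, threshold, sign, error)
    for j in range(d):
        for thr in np.unique(X[:, j]):
            for sign in (+1, -1):
                pred = sign * np.where(X[:, j] <= thr, 1, -1)
                err = np.sum(D[pred != y])
                if err < best[3]:
                    best = (j, thr, sign, err)
    return best

def adaboost(X, y, T=50):
    """AdaBoost with decision stumps; returns the weighted ensemble."""
    n = len(y)
    D = np.full(n, 1.0 / n)                     # initialize D(i) = 1/m
    ensemble = []
    for _ in range(T):
        j, thr, sign, eps = train_stump(X, y, D)
        eps = max(eps, 1e-12)                   # guard against a perfect stump
        alpha = 0.5 * np.log((1 - eps) / eps)   # alpha_t
        pred = sign * np.where(X[:, j] <= thr, 1, -1)
        D *= np.exp(-alpha * y * pred)          # up-weight mistakes, down-weight hits
        D /= D.sum()                            # normalization factor Z
        ensemble.append((alpha, j, thr, sign))
    return ensemble

def predict(ensemble, X):
    """H(x) = sgn( sum_t alpha_t h_t(x) )."""
    score = sum(a * s * np.where(X[:, j] <= thr, 1, -1)
                for a, j, thr, s in ensemble)
    return np.sign(score)
```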
Training Error

First let's bound the weight distribution:
$$D_{T+1}(i) = \frac{D_T(i) \exp(-\alpha_T y_i h_T(x_i))}{Z_T}$$

Unrolling this recursion from the uniform initialization $D_1(i) = 1/n$,
$$D_{T+1}(i) = \frac{1}{n} \prod_{t=1}^{T} \frac{\exp(-\alpha_t y_i h_t(x_i))}{Z_t} = \frac{1}{n} \cdot \frac{\exp\left(-y_i \sum_{t=1}^{T} \alpha_t h_t(x_i)\right)}{\prod_{t=1}^{T} Z_t}$$

Next, the normalization factor:
$$Z_t = \sum_{i : h_t(x_i) = y_i} D_t(i)\, e^{-\alpha_t} + \sum_{i : h_t(x_i) \ne y_i} D_t(i)\, e^{\alpha_t} = e^{-\alpha_t}(1 - \epsilon_t) + e^{\alpha_t}\epsilon_t = 2\sqrt{\epsilon_t(1 - \epsilon_t)} = \sqrt{1 - 4\gamma_t^2}$$

Now for the training error:
$$\epsilon = \frac{1}{n} \sum_i I\left[y_i \sum_t \alpha_t h_t(x_i) \le 0\right] \le \frac{1}{n} \sum_i \exp\left(-y_i \sum_t \alpha_t h_t(x_i)\right)$$
(because $e^{-z} \ge 1$ if $z \le 0$).

Substituting the unrolled expression from above,
$$\epsilon \le \sum_i D_{T+1}(i) \prod_t Z_t = \prod_t Z_t$$
since $D_{T+1}$ sums to 1.

Finally, using $1 - x \le e^{-x}$,
$$\epsilon \le \prod_t \sqrt{1 - 4\gamma_t^2} \le \exp\left(-2 \sum_t \gamma_t^2\right)$$

So, proof that we can boost weak learners that meet the requisite conditions into strong learners!
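As a quick worked illustration of the bound (with illustrative numbers): suppose every round achieves edge $\gamma_t \ge 0.1$. Then
$$\epsilon \le \exp(-2 T (0.1)^2) = \exp(-0.02\,T),$$
which drops below $1/n$ for $n = 1000$ once $T \ge \ln(1000)/0.02 \approx 346$ rounds; since the training error is an integer multiple of $1/n$, it is then exactly zero.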
Generalization and Empirical Properties

Fairly robust to overfitting. In fact, often test error keeps decreasing even after training error has converged.

Works well with a range of hypotheses, including decision trees, stumps, and Naive Bayes.

Relation to SVMs? Can think of boosting as maximizing a different margin, and of using multiple weak learners to go to a high-dimensional space, instead of using a kernel like SVMs do. Computationally, boosting is easier (LP as opposed to QP).
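One way to watch this behavior on data is to track train and test error after every boosting round; a sketch assuming scikit-learn is available (the synthetic dataset and parameters are arbitrary):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary classification problem (purely illustrative).
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

# The default base estimator is a depth-1 decision stump.
clf = AdaBoostClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)

# staged_predict yields the ensemble's predictions after each round, so we can
# watch training error flatten while test error may keep improving.
train_err = [np.mean(p != y_tr) for p in clf.staged_predict(X_tr)]
test_err = [np.mean(p != y_te) for p in clf.staged_predict(X_te)]
print(train_err[-1], test_err[-1])
```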