A New Boosting Algorithm Using Input-Dependent Regularizer

Rong Jin (rong+@cs.cmu.edu), Yan Liu (yanliu@cs.cmu.edu), Luo Si (lsi@cs.cmu.edu), Jaime Carbonell (jgc@cs.cmu.edu), Alexander G. Hauptmann (alex+@cs.cmu.edu)
School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213-8213, USA

Proceedings of the Twentieth International Conference on Machine Learning (ICML-2003), Washington DC, 2003.

Abstract

AdaBoost has proved to be an effective method to improve the performance of base classifiers, both theoretically and empirically. However, previous studies have shown that AdaBoost might suffer from the overfitting problem, especially for noisy data. In addition, most current work on boosting assumes that the combination weights are fixed constants and therefore does not take particular input patterns into consideration. In this paper, we present a new boosting algorithm, “WeightBoost”, which tries to solve these two problems by introducing an input-dependent regularization factor into the combination weight. Similarly to AdaBoost, we derive a learning procedure for WeightBoost, which is guaranteed to minimize training errors. Empirical studies on eight different UCI data sets and one text categorization data set show that WeightBoost almost always achieves a considerably better classification accuracy than AdaBoost. Furthermore, experiments on data with artificially controlled noise indicate that WeightBoost is more robust to noise than AdaBoost.

1. Introduction

As a generally effective algorithm to create a “strong” classifier out of a weak classifier, boosting has gained popularity recently. Boosting works by repeatedly running the weak classifier on various training examples sampled from the original training pool, and combining the base classifiers produced by the weak learner into a single composite classifier. AdaBoost has been theoretically proved and empirically shown to be an effective method to improve the classification accuracy. Through its particular weight-updating procedure, AdaBoost is able to focus on those data points that have been misclassified in previous training iterations and therefore minimizes the training errors.

Since AdaBoost is a greedy algorithm and intentionally focuses on the minimization of training errors, there have been many studies on the issue of overfitting for AdaBoost (Quinlan, 1996; Grove & Schuurmans, 1998; Ratsch et al., 1998). The general conclusion from early studies appears to be that in practice AdaBoost seldom overfits the training data; namely, even though the AdaBoost algorithm greedily minimizes the training errors via gradient descent, the testing error usually goes down accordingly. Recent studies have implied that this phenomenon might be related to the fact that the greedy search procedure used in AdaBoost is able to implicitly maximize the classification margin (Onoda et al., 1998; Friedman et al., 1998). However, other studies (Opitz & Maclin, 1999; Jiang, 2000; Ratsch et al., 2000; Grove & Schuurmans, 1998; Dietterich, 2000) have shown that AdaBoost might have the problem of overfitting, particularly when the data are noisy.

In fact, noise in the data can be introduced by two factors: either mislabelled data or the limitation of the hypothesis space of the base classifier. When the noise level is high, there can be data patterns that are difficult for the classifiers to capture. The boosting algorithm is then forced to focus on those noisy data patterns and thereby distorts the optimal decision boundary. As a result, the decision boundary will only be suitable for those difficult data patterns and not necessarily general enough for other data. The overfitting problem can also be analyzed from the viewpoint of the generalized error bound (Ratsch et al., 2000). As discussed in (Ratsch et al., 2000), the margin maximized by AdaBoost is actually a “hard margin”, namely the smallest margin of those noisy data patterns. As a consequence, the margin of the other data points may decrease significantly when we maximize the “hard margin”, which forces the generalized error bound (Schapire, 1999) to increase.
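For concreteness, the weight-updating procedure referred to above can be sketched as follows. This is a generic binary AdaBoost loop with labels in {-1, +1} and an abstract base learner, written purely for illustration (it is not code from the paper); the exponential re-weighting step is what makes the weights of persistently misclassified, possibly noisy, examples grow so quickly.

    import numpy as np

    def adaboost(X, y, fit_base_learner, n_rounds=50):
        # Generic binary AdaBoost sketch: labels y are in {-1, +1}, and
        # fit_base_learner(X, y, w) returns a callable h with h(X) in {-1, +1}.
        n = len(y)
        w = np.full(n, 1.0 / n)              # distribution over training examples
        alphas, learners = [], []
        for _ in range(n_rounds):
            h = fit_base_learner(X, y, w)
            pred = h(X)
            err = w[pred != y].sum()         # weighted training error
            if err == 0.0 or err >= 0.5:     # perfect, or no better than chance
                break
            alpha = 0.5 * np.log((1.0 - err) / err)
            # Exponential re-weighting: misclassified (often noisy) examples
            # receive exponentially larger weight in the next round.
            w *= np.exp(-alpha * y * pred)
            w /= w.sum()
            alphas.append(alpha)
            learners.append(h)
        def H(X_new):                        # final fixed-weight linear combination
            return np.sign(sum(a * h(X_new) for a, h in zip(alphas, learners)))
        return H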

In order to deal with the overfitting problem in AdaBoost, several strategies have been proposed, such as smoothing (Schapire & Singer, 1998), Gentle Boost (J. Friedman & Tibshirani, 1998), BrownBoost (Freund, 2001), Weight Decay (Ratsch et al., 1998) and regularized AdaBoost (Ratsch et al., 2000). The main ideas of these methods fall into two groups: one is changing the cost function, for example by introducing regularization factors into it; the other is introducing a soft margin. The overfitting problem of AdaBoost may be related to its exponential cost function, which makes the weights of the noisy data grow exponentially and leads to the overemphasis of those data patterns. The solution to this issue can be either to introduce a different cost function, such as the logistic regression function in (Friedman et al., 1998), to regularize the exponential cost function with a penalty term, such as the weight decay method used in (Ratsch et al., 1998), or to introduce a different weighting function, such as BrownBoost (Freund, 2001). A more general solution is to replace the “hard margin” in AdaBoost with a “soft margin”. Similar to the strategy used in the support vector machine (SVM) algorithm (Cortes & Vapnik, 1995), a boosting algorithm with a soft margin is able to allow a larger margin at the expense of some misclassification errors. This idea leads to works such as the regularized boosting algorithms using linear programming and quadratic programming (Ratsch et al., 1999).

However, there is another problem with AdaBoost that has been overlooked in previous studies. The AdaBoost algorithm employs a linear combination strategy, that is, it combines different base classifiers with a set of constants. As illustrated in previous studies, one advantage of the ensemble approach over a single model is that an ensemble allows each sub-model to cover a different aspect of the data set; by combining them, we are able to explain the whole data set thoroughly. Therefore, in order to take full advantage of each sub-model, a good combination strategy should be able to examine the input pattern and invoke only the sub-models that are appropriate for that pattern. For example, in the Hierarchical Mixture of Experts model (Jordan & Jacobs, 1994), the sub-models are organized into a tree structure, with leaf nodes acting as experts and internal nodes as “gates”. The internal nodes examine the patterns of the input data and route them to the most appropriate leaf nodes for classification. Therefore, for linear combination models, it seems more desirable to have the weighting factors depend on the input pattern, namely, taking larger values if the associated sub-model is appropriate for the input pattern and smaller values otherwise. Unfortunately, previous work on additive models for boosting almost always assumes the combination weights to be fixed constants, and therefore is not able to take the input patterns into account.

In order to solve the two problems pointed out above, i.e. overfitting and constant combination weights for all input examples, we propose a new boosting algorithm that combines the base classifiers using a set of input-dependent weighting factors. This new approach alleviates the second problem because, through those input-dependent weighting factors, we are able to force each sub-model to focus on what it is good at. Meanwhile, we can show that those input-dependent factors also alleviate the overfitting problem substantially. The reason is summarized as follows: in the common practice of boosting, a set of “weak” classifiers are combined with fixed constants; therefore, for noisy data patterns, the error accumulates through the sum and the weights of the distribution grow exponentially. In our work, we intentionally set the weighting factors of the “weak” classifiers to be inverse to the previously accumulated weights, so we are able to significantly discount the weights of noisy data and alleviate the problem of overfitting.
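Schematically, the combination described in the previous paragraph replaces AdaBoost's fixed-weight sum with H(x) = sum_t alpha_t * mu_t(x) * h_t(x), where the factor mu_t(x) shrinks as the earlier iterations accumulate a large prediction on x. The exact regularizer used by WeightBoost is derived in Section 3; the sketch below uses a hypothetical choice mu_t(x) = exp(-beta * |H_{t-1}(x)|) purely for illustration.

    import numpy as np

    def combine_with_input_dependent_weights(alphas, learners, X, beta=0.5):
        # Hypothetical illustration of an input-dependent combination: each
        # round's vote alpha_t * h_t(x) is scaled by a factor
        # mu_t(x) = exp(-beta * |H_{t-1}(x)|), so examples on which earlier
        # rounds already accumulate a large prediction receive a discounted
        # contribution from later rounds.  The paper's actual regularizer is
        # derived in Section 3; this choice is only a stand-in.
        H = np.zeros(len(X))                 # accumulated regularized prediction
        for alpha, h in zip(alphas, learners):
            mu = np.exp(-beta * np.abs(H))   # input-dependent weighting factor
            H += alpha * mu * h(X)
        return np.sign(H)

Any concrete choice of the weighting factor has to be paired with a matching learning procedure; Section 3 derives one that, like AdaBoost, is guaranteed to reduce the training error.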
The rest of the paper is arranged as follows: in Section 2 we discuss related work on other boosting algorithms. The full description of our algorithm is presented in Section 3. The empirical study of our new algorithm versus the AdaBoost algorithm is described in Section 4. Finally, we draw conclusions and discuss future work.

2. Related Work

Since the AdaBoost algorithm is a greedy algorithm and intentionally focuses on minimizing the training errors, there have been many studies on the issue of overfitting for the AdaBoost algorithm. However, most of the modified algorithms are unsuccessful, either due to high computational cost or a lack of strong empirical evidence of improvement. One of the most successful algorithms is the Weight Decay method. Its basic idea can be described as follows: in analogy to weight decay in neural networks, Ratsch et al. define $\zeta_i^t = \left(\sum_{t'=1}^{t} \alpha_{t'} h_{t'}(x_i)\right)^2$, where the inner sum in $\zeta_i^t$ is the cumulative weight of the pattern in the previous iterations. Similar to Support Vector Machines (Vapnik, 1995), they add $\zeta_i^t$ into
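As a concrete reading of that definition, $\zeta_i^t$ is simply the squared cumulative (unnormalized) prediction accumulated on example $x_i$ over the first $t$ rounds. A minimal sketch of how it could be computed from the staged base classifiers (illustrative only, not code from the paper or from Ratsch et al.):

    import numpy as np

    def cumulative_weights(alphas, learners, X):
        # zeta_i^t = (sum_{t'=1}^{t} alpha_{t'} * h_{t'}(x_i))^2 for every example,
        # computed incrementally; row t-1 of the returned array holds zeta^t.
        F = np.zeros(len(X))                 # running sum of alpha_t * h_t(x_i)
        zetas = []
        for alpha, h in zip(alphas, learners):
            F += alpha * h(X)
            zetas.append(F ** 2)             # squared cumulative prediction
        return np.array(zetas)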
