Optimal and Adaptive Algorithms for Online Boosting

Alina Beygelzimer^1, Satyen Kale^1, Haipeng Luo^2
^1 Yahoo! Labs, NYC
^2 Computer Science Department, Princeton University

December 11, 2015
Boosting: An Example

Idea: combine weak "rules of thumb" to form a highly accurate predictor.

Example: email spam detection.

Given: a set of training examples.
  ◮ ("Attn: Beneficiary Contractor Foreign Money Transfer ...", spam)
  ◮ ("Let's meet to discuss QPR –Edo", not spam)

Obtain a classifier by asking a "weak learning algorithm":
  ◮ e.g. contains the word "money" ⇒ spam.

Reweight the examples so that "difficult" ones get more attention:
  ◮ e.g. spam that doesn't contain "money".

Obtain another classifier:
  ◮ e.g. empty "to address" ⇒ spam.

......

At the end, predict by taking a (weighted) majority vote.
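The reweight-and-vote loop above can be made concrete in a few lines of code. Below is a minimal batch AdaBoost-style sketch (not code from the talk); the `weak_learn` interface and its behavior are illustrative assumptions.

```python
import numpy as np

def boost(X, y, weak_learn, num_rounds):
    """Batch boosting sketch: X has shape (T, d), y has entries in {-1, +1}.

    weak_learn(X, y, w) is assumed to return a classifier h (a callable
    mapping X to {-1, +1} labels) whose weighted error is below 1/2.
    """
    T = len(y)
    w = np.ones(T) / T                      # start with uniform weights
    hypotheses, alphas = [], []
    for _ in range(num_rounds):
        h = weak_learn(X, y, w)             # ask for a weak "rule of thumb"
        err = w[h(X) != y].sum()            # weighted error, assumed in (0, 1/2)
        alpha = 0.5 * np.log((1 - err) / err)   # vote weight for this rule
        w *= np.exp(-alpha * y * h(X))      # "difficult" examples gain weight
        w /= w.sum()
        hypotheses.append(h)
        alphas.append(alpha)
    # Final predictor: a weighted majority vote over all weak rules.
    return lambda X_new: np.sign(sum(a * h(X_new) for a, h in zip(alphas, hypotheses)))
```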
Online Boosting: Motivation

Boosting is well studied in the batch setting, but becomes infeasible when the amount of data is huge.

Online learning has proven extremely useful:
  ◮ one pass over the data, making predictions on the fly.
  ◮ works even in an adversarial environment (e.g. spam detection).

A natural question: how do we extend boosting to the online setting?
Related Work

Several algorithms exist (Oza and Russell, 2001; Grabner and Bischof, 2006; Liu and Yu, 2007; Grabner et al., 2008):
  ◮ mimic their offline counterparts.
  ◮ achieve great success in many real-world applications.
  ◮ but have no theoretical guarantees.

Chen et al. (2012): the first online boosting algorithms with theoretical guarantees:
  ◮ an online analogue of the weak learning assumption.
  ◮ a connection between online boosting and smooth batch boosting.
Batch Boosting

Given a batch of T examples (x_t, y_t) ∈ X × {−1, +1}, for t = 1, ..., T.
Learner A predicts A(x_t) ∈ {−1, +1} for example x_t.

Weak learner A (with edge γ):
    $\sum_{t=1}^{T} \mathbf{1}\{A(x_t) \neq y_t\} \leq \left(\tfrac{1}{2} - \gamma\right) T$

        ⇓  Boosting (Schapire, 1990; Freund, 1995)

Strong learner A′ (with any target error rate ǫ):
    $\sum_{t=1}^{T} \mathbf{1}\{A'(x_t) \neq y_t\} \leq \epsilon T$
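For intuition on the rates appearing later, recall the standard batch AdaBoost training-error bound (a classical fact, not a result of this talk): if every weak learner has edge at least γ, then after N rounds

```latex
\frac{1}{T}\sum_{t=1}^{T} \mathbf{1}\{A'(x_t) \neq y_t\}
  \;\le\; \exp(-2\gamma^2 N)
  \;\le\; \epsilon
  \quad\text{once}\quad
  N \;\ge\; \frac{1}{2\gamma^2}\ln\frac{1}{\epsilon},
```

which is the N = O((1/γ²) ln(1/ǫ)) dependence that the optimal online algorithm below matches.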
Online Boosting

Examples (x_t, y_t) ∈ X × {−1, +1} arrive online, for t = 1, ..., T.
Learner A observes x_t and predicts A(x_t) ∈ {−1, +1} before seeing y_t.

Weak online learner A (with edge γ and excess loss S):
    $\sum_{t=1}^{T} \mathbf{1}\{A(x_t) \neq y_t\} \leq \left(\tfrac{1}{2} - \gamma\right) T + S$

        ⇓  Online Boosting (our result)

Strong online learner A′ (with any target error rate ǫ and excess loss S′):
    $\sum_{t=1}^{T} \mathbf{1}\{A'(x_t) \neq y_t\} \leq \epsilon T + S'$

This talk: S = √T/γ (corresponds to √T regret).
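A minimal sketch of the online protocol behind these definitions (the interface names here are mine, not the paper's): the learner must commit to its prediction before the label is revealed, and the excess loss S absorbs whatever the learner loses before its edge kicks in.

```python
from typing import Iterable, Protocol, Tuple

class OnlineLearner(Protocol):
    def predict(self, x) -> int: ...          # returns -1 or +1
    def update(self, x, y: int) -> None: ...  # called after y is revealed

def count_mistakes(learner: OnlineLearner, stream: Iterable[Tuple[object, int]]) -> int:
    """Run the online protocol and count mistakes.

    A weak online learner keeps this below (1/2 - gamma) * T + S;
    a strong one keeps it below epsilon * T + S'.
    """
    mistakes = 0
    for x, y in stream:
        y_hat = learner.predict(x)   # commit before seeing the label
        mistakes += int(y_hat != y)
        learner.update(x, y)         # then learn from (x, y)
    return mistakes
```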
Main Results

Parameters of interest:
  ◮ N = number of weak learners (of edge γ) needed to achieve error rate ǫ.
  ◮ T_ǫ = minimal number of examples such that the error rate is ǫ.

  Algorithm           | N                  | T_ǫ           | Optimal? | Adaptive?
  --------------------|--------------------|---------------|----------|----------
  Online BBM          | O((1/γ²) ln(1/ǫ))  | Õ(1/(ǫγ²))    | √        | ×
  AdaBoost.OL         | Õ(1/(ǫγ²))         | O(1/(ǫ²γ⁴))   | ×        | √
  Chen et al. (2012)  | O(1/(ǫγ²))         | Õ(1/(ǫγ²))    | ×        | ×
Structure of Online Boosting

[Schematic] In each round (shown for t = 1): the booster receives x_1 and forwards it to the weak learners WL^1, ..., WL^N; each WL^i makes a prediction ŷ_1^i. The booster combines these into its final prediction ŷ_1, then observes the true label y_1. Finally, each WL^i is given (x_1, y_1) as a training example with probability p_1^i.
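The schematic translates into a short loop. The sketch below is a generic booster with this structure; `vote_weight` and `sample_prob` are placeholders of my own, since Online BBM and AdaBoost.OL differ precisely in how the vote weights and the update probabilities p_t^i are chosen.

```python
import random

class OnlineBooster:
    def __init__(self, weak_learners):
        self.wls = weak_learners          # N weak online learners WL^1..WL^N

    def predict(self, x):
        # Weighted majority vote over the weak predictions.
        preds = [wl.predict(x) for wl in self.wls]
        total = sum(self.vote_weight(i) * p for i, p in enumerate(preds))
        return 1 if total >= 0 else -1

    def update(self, x, y):
        # Pass (x, y) to each weak learner with some probability --
        # the online analogue of reweighting "difficult" examples.
        for i, wl in enumerate(self.wls):
            if random.random() <= self.sample_prob(i, x, y):
                wl.update(x, y)

    def vote_weight(self, i):
        return 1.0                        # placeholder: unweighted vote

    def sample_prob(self, i, x, y):
        return 1.0                        # placeholder: always update
```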