Boosting Algorithm with Sequence-loss Cost Function for Structured Prediction
Tomasz Kajdanowicz, Przemysław Kazienko, Jan Kraszewski
Wroclaw University of Technology, Poland
Outline
1. Introduction to Structured Prediction
2. Problem Description
3. The Concept of AdaBoostSeq
4. Experiments
Structured prediction
Single value prediction:
• a function f maps an input to a simple output (binary classification, multiclass classification, or regression)
• Example: predicting whether the next day will or will not be rainy, on the basis of historical weather data
Structured prediction:
• prediction problems with more complex, structured outputs
• Example: predicting the weather for the next few days
Structured prediction
• Structured prediction is a cost-sensitive prediction problem where the output has a structure of elements decomposing into variable-length vectors [Daume]
• Vector notation is treated as a useful encoding not only for sequence labeling problems
(illustration: a binary output vector, e.g. 0 1 0 1 1 1)
• Input = original input + partially produced output (an extended notion of the feature input space)
Structured prediction algorithms
• Most algorithms are based on well-known binary classification, adapted in a specific way [Nguyen et al.]
• Structured perceptron [Collins]
– minimal requirements on the shape of the output space
– easy to implement
– poor generalization
• Max-margin Markov Nets [Taskar et al.]
– very useful
– very slow to train
– limited to the Hamming loss function
Structured prediction algorithms
• Conditional Random Fields [Lafferty et al.]
– extension of logistic regression to structured outputs
– probabilistic outputs
– good generalization
– relatively slow
• Support Vector Machine for Interdependent and Structured Outputs (SVMstruct) [Tsochantaridis et al.]
– supports more loss functions
Ensembles
• Combined may be better
– the goal is to select the right components for building a good hybrid system
– Lotfi Zadeh is reputed to have said:
A good combined system is like: British Police, German Mechanics, French Cuisine, Swiss Banking, Italian Love.
A bad combined system is like: British Cuisine, German Police, French Mechanics, Italian Banking, Swiss Love.
Problem Description
• Prediction of sequential values: for a single case, a vector of input attributes produces a sequence of output values
Problem Statement
• Binary sequence classification problem
$$f : X \to Y$$
where X is the vector input and Y is a variable-length output vector $(y^1, y^2, \dots, y^T)$ with $y_i^\mu \in \{-1, 1\}$
• i = 1, 2, ..., N (number of observations); μ = 1, 2, ..., T (position in a sequence of length T)
Problem Statement
• Goal: T classifiers combined, one per sequence position
– each is an optimally designed linear combination of K base classifiers of the form
$$F(x) = \sum_{k=1}^{K} \alpha_k \Phi(x; \Theta_k)$$
where Φ(x; Θ_k) is the k-th base classifier, Θ_k are the parameters of the k-th classifier, and α_k is the weight associated with the k-th classifier
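A minimal sketch of this combination rule in Python; the function and variable names are illustrative (not from the paper), and the base classifiers are assumed to be callables returning labels in {−1, +1}:

```python
import numpy as np

def ensemble_predict(x, classifiers, alphas):
    """Weighted vote F(x) = sign(sum_k alpha_k * Phi(x; Theta_k)).

    classifiers: list of callables, each returning a label in {-1, +1};
    alphas: list of the corresponding classifier weights alpha_k.
    """
    score = sum(alpha * clf(x) for alpha, clf in zip(alphas, classifiers))
    return np.sign(score)
```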
General Idea of AdaBoostSeq
(illustration: a data matrix of N cases; attribute columns 1–8 form the input, and the sequence of output values forms the target)
AdaBoostSeq
• A novel algorithm for sequence prediction
• Optimization for each sequence item:
$$\arg\min_{\alpha_k, \Theta_k;\; k=1,\dots,K} \sum_{i=1}^{N} \exp\big(-y_i F(x_i)\big)$$
• Direct optimization of this equation is highly complex, so a stage-wise suboptimal method is performed
AdaBoostSeq
• By definition of the m-th partial sum:
$$F_m(x) = \sum_{k=1}^{m} \alpha_k \Phi(x; \Theta_k), \quad m = 1, 2, \dots, K$$
• The recurrence follows directly:
$$F_m(x) = F_{m-1}(x) + \alpha_m \Phi(x; \Theta_m)$$
• Stage-wise optimization: at the m-th step, F_{m−1}(x) is known from the previous step, and the new target is:
$$(\alpha_m, \Theta_m) = \arg\min_{\alpha, \Theta} J(\alpha, \Theta)$$
AdaBoostSeq
$$J(\alpha, \Theta) = \sum_{i=1}^{N} \exp\Big(-y_i F_{m-1}(x_i) - \alpha\big[\, y_i \Phi(x_i; \Theta) + (1-\xi)\, R_\mu(x_i) \,\big]\Big)$$
where R_μ is the impact function denoting the influence of the quality of the preceding sequence labels' prediction:
$$R_\mu(x_i) = \frac{1}{\mu - 1} \sum_{\mu'=1}^{\mu-1} R^{\mu'}(x_i), \qquad R^{\mu'}(x_i) = \begin{cases} 1 & \text{if } y_i^{\mu'} \sum_{j=1}^{K} \alpha_j \Phi(x_i; \Theta_j) > 0 \\ -1 & \text{otherwise} \end{cases}$$
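A small sketch of the impact function under the reconstruction above; the ±1 convention for R^μ' and the neutral value for μ = 1 are assumptions, and the names are ours:

```python
import numpy as np

def impact(prev_labels, prev_preds):
    """R_mu(x_i): mean of +/-1 agreement scores over the mu-1 preceding
    sequence positions of one case (+1 where the earlier label was
    predicted correctly, -1 where it was not).

    prev_labels, prev_preds: arrays of shape (mu-1,) with values in {-1, +1}.
    """
    if len(prev_labels) == 0:
        return 0.0  # mu = 1: no preceding labels; neutral impact (assumption)
    return float(np.mean(np.asarray(prev_labels) * np.asarray(prev_preds)))
```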
AdaBoostSeq
• For a given α:
$$\Theta_m = \arg\min_{\Theta} \sum_{i=1}^{N} w_i^{(m)} \exp\big(-y_i\, \alpha\, \Phi(x_i; \Theta)\big)$$
$$w_i^{(m)} = \exp\big(-y_i F_{m-1}(x_i) - (1-\xi)\, R_\mu(x_i)\big)$$
• Because w_i^{(m)} depends neither on α nor on Φ(x_i; Θ), it can be treated as a weight of x_i
• By the binary nature of the base classifier:
$$\Theta_m = \arg\min_{\Theta} P_m, \qquad P_m = \sum_{i=1}^{N} w_i^{(m)}\, I\big(1 - y_i \Phi(x_i; \Theta)\big), \qquad I(x) = \begin{cases} 0 & \text{if } x \le 0 \\ 1 & \text{if } x > 0 \end{cases}$$
where P_m is the weighted empirical error
AdaBoostSeq
• Computing the base classifier at step m:
$$P_m = \sum_{i:\; y_i \Phi(x_i; \Theta_m) < 0} w_i^{(m)}, \qquad 1 - P_m = \sum_{i:\; y_i \Phi(x_i; \Theta_m) > 0} w_i^{(m)}$$
AdaBoostSeq
• Putting the equations together:
$$\alpha_m = \arg\min_{\alpha} \Big[ \exp(-\alpha)(1 - P_m) + \exp(\alpha)\, P_m \Big]$$
• Setting the derivative with respect to α to zero gives:
$$\alpha_m = \frac{1}{2} \ln \frac{1 - P_m}{P_m}$$
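A one-line sketch of this closed-form weight; the numerical guard for the degenerate cases P_m ∈ {0, 1} is our addition, not part of the slides:

```python
import numpy as np

def classifier_weight(P_m, eps=1e-12):
    """alpha_m = 0.5 * ln((1 - P_m) / P_m) for weighted error P_m."""
    P_m = np.clip(P_m, eps, 1.0 - eps)  # guard against log(0) / division by zero
    return 0.5 * np.log((1.0 - P_m) / P_m)
```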
AdaBoostSeq
• Weight of the i-th case:
$$w_i^{(m+1)} = \frac{w_i^{(m)} \exp\big(-y_i \Phi(x_i; \Theta_m)\, \alpha_m - (1-\xi)\, R_\mu(x_i)\, \alpha_m\big)}{Z_m}$$
• Z_m is the normalizing constant:
$$Z_m = \sum_{i=1}^{N} w_i^{(m)} \exp\big(-y_i \Phi(x_i; \Theta_m)\, \alpha_m - (1-\xi)\, R_\mu(x_i)\, \alpha_m\big)$$
Algorithm AdaBoostSeq
• For each sequence position (μ = 1 to T):
– Initialization: w_i^{(1)} = 1/N, i = 1, 2, ..., N; m = 1
– While the termination criterion is not met:
• obtain the optimal Θ_m and Φ(·; Θ_m) (minimizing P_m)
• compute the resulting weighted error P_m
• α_m = ½ ln((1 − P_m)/P_m)
• Z_m = 0
• For i = 1 to N:
– w_i^{(m+1)} = w_i^{(m)} exp(−y_i α_m Φ(x_i; Θ_m) − (1 − ξ) α_m R_μ(x_i))
– Z_m = Z_m + w_i^{(m+1)}
• For i = 1 to N:
– w_i^{(m+1)} = w_i^{(m+1)} / Z_m
• K = m; m = m + 1
– f_μ(·) = sign(Σ_{k=1}^{K} α_k Φ(·; Θ_k))
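The inner loop above, sketched in Python for a single sequence position μ. This is a minimal reading of the pseudocode, not the authors' implementation: scikit-learn decision stumps stand in for the generic base classifier Φ, the termination criterion is simplified to a fixed number of rounds, and all names are ours.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier  # depth-1 tree = decision stump

def adaboost_seq_position(X, y_mu, R_mu, xi=0.6, K=50):
    """Boosting for one sequence position mu.

    X: (N, d) inputs; y_mu: (N,) labels in {-1, +1} for position mu;
    R_mu: (N,) impact values computed from the preceding positions;
    xi in (0, 1] (xi = 1 recovers standard AdaBoost).
    Returns the fitted stumps and their weights alpha.
    """
    N = X.shape[0]
    w = np.full(N, 1.0 / N)                      # w_i^(1) = 1/N
    stumps, alphas = [], []
    for m in range(K):                           # simplified termination: K rounds
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y_mu, sample_weight=w)      # approximately minimizes P_m
        pred = stump.predict(X)
        P_m = w[pred != y_mu].sum()              # weighted empirical error
        if P_m <= 0.0 or P_m >= 0.5:             # stop if perfect or no better than chance
            break
        alpha = 0.5 * np.log((1.0 - P_m) / P_m)
        # weight update with the sequence-loss term (1 - xi) * alpha_m * R_mu
        w = w * np.exp(-y_mu * alpha * pred - (1.0 - xi) * alpha * R_mu)
        w /= w.sum()                             # Z_m normalization
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas
```

The outer loop would run this routine for μ = 1, ..., T, recomputing R_μ for each case from the already-fitted classifiers of the preceding positions before moving to the next position.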
Profile of AdaBoostSeq
• A new algorithm for sequence prediction
• For each sequence item:
– AdaBoostSeq also considers the prediction errors made on all previous items in the sequence within the boosting algorithm
– the more errors on previous sequence items, the stronger the focus on bad cases at the current item
• Self-adaptive
Experiments
• 4019 cases in the dataset
• 20 input features
• Sequence length = 10
• Decision stump as the base classifier
• 10-fold cross-validation
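The evaluation measure in the following slides is the sequence mean absolute error. A sketch of our reading of it; the exact definition is an assumption, since the slides do not spell it out:

```python
import numpy as np

def sequence_mae(Y_true, Y_pred):
    """Mean absolute error averaged over all cases and all T sequence
    positions; for labels in {-1, +1} this equals twice the error rate.

    Y_true, Y_pred: arrays of shape (N, T).
    """
    return float(np.abs(np.asarray(Y_true) - np.asarray(Y_pred)).mean())
```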
AdaBoost vs. AdaBoostSeq (with ξ)
(bar chart: sequence mean absolute error for ξ = 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1; observed values 0.0762, 0.0810, 0.0828, 0.0872, 0.0922, 0.0945, 0.1011)
• ξ = 0.6 is the best (error 0.0762)
• For ξ = 1 the method reduces to standard AdaBoost, which is the worst (error 0.1011)
Summary of the Experiments
(line chart: mean absolute error per sequence item, items 1 to 10, for ξ = 0.4, 0.6, 0.8, 1)
• For item 2 and later, the error is reduced dramatically (up to 6 times!), since the method respects errors on previous items
• ξ influences the error
• For ξ = 0.6 the error decreases by 24% over the whole sequence compared to the standard approach (ξ = 1)
Conclusions and Future Work
• AdaBoostSeq: a new algorithm for sequence prediction based on AdaBoost
• When predicting the following items in a sequence, the errors from the previous items are utilized
• Much more accurate than AdaBoost applied to sequence items independently
• Parameterized by ξ, which controls how much previous errors are respected
• Recent application: prediction for debt valuation
• Future work: new cost functions (on an HMM canvas)