

  1. Boosting Algorithm with Sequence-loss Cost Function for Structured Prediction Tomasz Kajdanowicz , Przemysław Kazienko, Jan Kraszewski Wroclaw University of Technology, Poland

  2. Outline
1. Introduction to Structured Prediction
2. Problem Description
3. The Concept of AdaBoostSeq
4. Experiments

  3. Structured prediction
• Single value prediction: a function $f$ maps an input to a simple output (binary classification, multiclass classification or regression). Example: predicting whether the next day will or will not be rainy on the basis of historical weather data.
• Structured prediction: prediction problems with more complex outputs. Example: predicting the weather for the next few days.

  4. Structured prediction
• Structured prediction is a cost-sensitive prediction problem whose output has a structure of elements decomposing into variable-length vectors [Daume], e.g. 0 1 0 1 1 1.
• Vector notation is treated as a useful encoding not only for sequence labeling problems.
• Input = original input + partially produced output (an extended notion of the feature input space).

  5. Structured prediction algorithms
• Most algorithms are based on well-known binary classification, adapted in a specific way [Nguyen et al.]
• Structured perceptron [Collins]: minimal requirements on the shape of the output space; easy to implement; poor generalization.
• Max-margin Markov Nets [Taskar et al.]: very useful, but very slow and limited to the Hamming loss function.

  6. Structured prediction algorithms
• Conditional Random Fields [Lafferty et al.]: an extension of logistic regression to structured outputs; probabilistic outputs; good generalization; relatively slow.
• Support Vector Machine for Interdependent and Structured Outputs (SVMSTRUCT) [Tsochantaridis et al.]: supports more loss functions.

  7. Ensembles
• Combined may be better: the goal is to select the right components for building a good hybrid system.
• Lotfi Zadeh is reputed to have said:
  A good combined system is like British Police, German Mechanics, French Cuisine, Swiss Banking and Italian Love.
  A bad combined system is like British Cuisine, German Police, French Mechanics, Italian Banking and Swiss Love.

  8. Problem Description
• Prediction of sequential values: for a single case, a set of input attributes maps to a whole sequence of output values.

  9. Problem Statement
• Binary sequence classification problem: $f: X \to Y$, where $X$ is the vector input and $Y$ is a variable-length vector $(y^1, y^2, \dots, y^T)$ with $y_i^\mu \in \{-1, 1\}$, where $i = 1, 2, \dots, N$ indexes the observations and $\mu = 1, 2, \dots, T$ the position in the sequence.
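To make the setup concrete, here is a minimal sketch of the data layout; the array names and the use of random placeholder data are illustrative only, while the sizes are taken from the experiments slide later in the deck:

```python
import numpy as np

# Hypothetical data layout for the problem statement above:
# N observations, each with an input vector and a length-T label sequence.
N, n_features, T = 4019, 20, 10
X = np.random.randn(N, n_features)           # vector input X
Y = np.random.choice([-1, 1], size=(N, T))   # label sequences, y_i^mu in {-1, +1}
```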

  10. Problem Statement
• Goal: $T$ classifiers, each an optimally designed linear combination of $K$ base classifiers of the form
$$F(x) = \sum_{k=1}^{K} \alpha_k\, \Phi(x; \Theta_k)$$
where $\Phi(x; \Theta_k)$ is the $k$-th base classifier, $\Theta_k$ are the parameters of the $k$-th classifier, and $\alpha_k$ is the weight associated with the $k$-th classifier.
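Evaluating such a linear combination might look as follows; `ensemble_score` and `ensemble_predict` are hypothetical helper names for this sketch, not code from the paper:

```python
import numpy as np

def ensemble_score(x, alphas, classifiers):
    # F(x) = sum_k alpha_k * Phi(x; Theta_k); `classifiers` holds the
    # trained base classifiers Phi_k as callables returning -1 or +1.
    return sum(a * phi(x) for a, phi in zip(alphas, classifiers))

def ensemble_predict(x, alphas, classifiers):
    # The predicted label is the sign of the weighted vote.
    return np.sign(ensemble_score(x, alphas, classifiers))
```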

  11. General Idea of AdaBoostSeq
[Figure: a data table with cases 1 to N as rows; attribute columns 1 to 8 form the input part, and the remaining columns hold the target sequence.]

  12. AdaBoostSeq
• A novel algorithm for sequence prediction.
• Optimization for each sequence item:
$$\arg\min_{\alpha_k, \Theta_k;\ k = 1, \dots, K} \sum_{i=1}^{N} \exp\left(-y_i F(x_i)\right)$$
• This joint optimization is highly complex, so a stage-wise suboptimal method is performed instead.
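A minimal sketch of the cost being minimized, assuming the ensemble scores $F(x_i)$ for one sequence position are already computed (the function name is illustrative):

```python
import numpy as np

def exponential_sequence_loss(y, F_scores):
    # sum_i exp(-y_i * F(x_i)) for one sequence position: a smooth
    # surrogate of the error rate that boosting minimizes stage-wise.
    return np.exp(-y * F_scores).sum()
```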

  13. AdaBoostSeq
• By definition, the $m$-th partial sum is
$$F_m(x) = \sum_{k=1}^{m} \alpha_k\, \Phi(x; \Theta_k), \qquad m = 1, 2, \dots, K$$
• The recurrence is obvious:
$$F_m(x) = F_{m-1}(x) + \alpha_m\, \Phi(x; \Theta_m)$$
• Stage-wise optimization: at step $m$, $F_{m-1}(x)$ is known from the previous step, and the new target is
$$(\alpha_m, \Theta_m) = \arg\min_{\alpha, \Theta} J(\alpha, \Theta)$$

  14. AdaBoostSeq
$$J(\alpha, \Theta) = \sum_{i=1}^{N} \exp\left(-y_i F_{m-1}(x_i) - (1-\xi)\, y_i R_\mu(x_i)\right) \exp\left(-y_i\, \alpha\, \Phi(x_i; \Theta)\right)$$
where $R_\mu$ is the impact function denoting the influence of the quality of the preceding sequence labels' prediction:
$$R_\mu(x_i) = \sum_{\lambda=1}^{\mu-1} R^\lambda(x_i), \qquad R^\lambda(x_i) = 1 - \frac{y_i^\lambda F^\lambda(x_i)}{\sum_{j=1}^{K} \alpha_j}$$
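A sketch of the impact function under the reconstruction above; it assumes the alpha-normalized ensemble scores of the earlier sequence positions are already available as columns of `F_norm` (a hypothetical array name):

```python
import numpy as np

def impact_R(mu, Y, F_norm):
    # R_mu(x_i) = sum_{lambda=1}^{mu-1} [1 - y_i^lambda * F^lambda(x_i) / sum_j alpha_j].
    # mu is the 1-based sequence position; F_norm holds, per position,
    # the ensemble score already divided by the sum of its alphas.
    prev = slice(0, mu - 1)          # 0-based columns for positions 1 .. mu-1
    return (1.0 - Y[:, prev] * F_norm[:, prev]).sum(axis=1)
```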

  15. AdaBoostSeq
• For a given $\alpha$:
$$\Theta^{(m)} = \arg\min_{\Theta} \sum_{i=1}^{N} w_i^{(m)} \exp\left(-y_i\, \alpha\, \Phi(x_i; \Theta)\right),$$
$$w_i^{(m)} = \exp\left(-y_i F_{m-1}(x_i) - (1-\xi)\, y_i R_\mu(x_i)\right)$$
• Because $w_i^{(m)}$ depends neither on $\alpha$ nor on $\Phi(x_i; \Theta)$, it can be treated as a weight of $x_i$.
• Binary nature of the base classifier:
$$\Theta_m = \arg\min_{\Theta} P_m, \qquad P_m = \sum_{i=1}^{N} w_i^{(m)}\, I\left(y_i\, \Phi(x_i; \Theta)\right)$$
where $I(x) = 0$ if $x > 0$ and $I(x) = 1$ if $x \leq 0$; $P_m$ is the weighted empirical error.

  16. AdaBoostSeq
• Computing the base classifier at step $m$:
$$P_m = \sum_{i:\ y_i \Phi(x_i; \Theta_m) < 0} w_i^{(m)}, \qquad 1 - P_m = \sum_{i:\ y_i \Phi(x_i; \Theta_m) > 0} w_i^{(m)}$$

  17. AdaBoostSeq
• Putting the equations together:
$$\alpha_m = \arg\min_{\alpha}\left[\exp(-\alpha)\,(1 - P_m) + \exp(\alpha)\, P_m\right]$$
• Setting the derivative with respect to $\alpha$ to zero yields:
$$\alpha_m = \frac{1}{2} \ln \frac{1 - P_m}{P_m}$$
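A quick worked check of this rule, plain Python just to illustrate the formula:

```python
import numpy as np

alpha = lambda P: 0.5 * np.log((1.0 - P) / P)
print(alpha(0.2))   # ~0.693: a fairly accurate classifier gets a large weight
print(alpha(0.5))   # 0.0:    random guessing contributes nothing to the vote
```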

  18. AdaBoostSeq
• Weight of the $i$-th case:
$$w_i^{(m+1)} = \frac{w_i^{(m)} \exp\left(-\xi\, y_i\, \alpha_m \Phi(x_i; \Theta_m) - (1-\xi)\, \alpha_m R_\mu(x_i)\right)}{Z_m}$$
• $Z_m$ is the normalizing constant:
$$Z_m = \sum_{i=1}^{N} w_i^{(m)} \exp\left(-\xi\, y_i\, \alpha_m \Phi(x_i; \Theta_m) - (1-\xi)\, \alpha_m R_\mu(x_i)\right)$$
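The same update as a short sketch, mirroring the reconstructed formula above; `phi_out` holds the base classifier outputs $\Phi(x_i; \Theta_m)$ and `R` the impact values for all $N$ cases (hypothetical argument names):

```python
import numpy as np

def update_weights(w, y, phi_out, alpha_m, R, xi):
    # w_i^(m+1) = w_i^(m) * exp(-xi * y_i * alpha_m * Phi(x_i; Theta_m)
    #                            - (1 - xi) * alpha_m * R_mu(x_i)) / Z_m
    w_new = w * np.exp(-xi * y * alpha_m * phi_out - (1.0 - xi) * alpha_m * R)
    return w_new / w_new.sum()   # dividing by Z_m renormalizes the weights
```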

  19. Algorithm AdaBoostSeq
For each sequence position ($\mu$ = 1 to $T$):
  1. Initialization: $w_i^{(1)} = 1/N$, $i = 1, 2, \dots, N$; $m = 1$
  2. While the termination criterion is not met:
     • obtain the optimal $\Theta_m$ and $\Phi(\cdot\,; \Theta_m)$ (minimizing $P_m$)
     • compute the weighted error $P_m$
     • $\alpha_m = \frac{1}{2}\ln\left((1 - P_m)/P_m\right)$
     • $Z_m = 0$
     • for $i$ = 1 to $N$:
        $w_i^{(m+1)} = w_i^{(m)} \exp\left(-\xi\, y_i\, \alpha_m \Phi(x_i; \Theta_m) - (1-\xi)\, \alpha_m R_\mu(x_i)\right)$
        $Z_m = Z_m + w_i^{(m+1)}$
     • for $i$ = 1 to $N$:
        $w_i^{(m+1)} = w_i^{(m+1)} / Z_m$
     • $K = m$; $m = m + 1$
  3. $f_\mu(\cdot) = \mathrm{sign}\left(\sum_{k=1}^{K} \alpha_k \Phi(\cdot\,; \Theta_k)\right)$
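A compact Python sketch of the whole loop, following the pseudocode as reconstructed above; this is not the authors' implementation. It assumes a fixed round count as the termination criterion and a hypothetical `fit_base(X, y, w)` callback returning a trained base classifier as a callable:

```python
import numpy as np

def adaboost_seq(X, Y, fit_base, xi=0.6, n_rounds=50):
    # X: (N, d) inputs; Y: (N, T) labels in {-1, +1}.
    N, T = Y.shape
    F_norm = np.zeros((N, T))            # alpha-normalized scores, feeds R_mu
    models = []
    for mu in range(T):                  # for each sequence position
        y = Y[:, mu]
        w = np.full(N, 1.0 / N)          # w_i^(1) = 1/N
        R = (1.0 - Y[:, :mu] * F_norm[:, :mu]).sum(axis=1)  # impact of previous items
        alphas, classifiers = [], []
        for m in range(n_rounds):        # fixed round count as termination criterion
            phi = fit_base(X, y, w)
            out = phi(X)                 # Phi(x_i; Theta_m) in {-1, +1}
            P_m = w[y * out < 0].sum()   # weighted empirical error
            if P_m <= 0.0 or P_m >= 0.5:
                break
            alpha_m = 0.5 * np.log((1.0 - P_m) / P_m)
            w = w * np.exp(-xi * y * alpha_m * out - (1.0 - xi) * alpha_m * R)
            w /= w.sum()                 # divide by Z_m
            alphas.append(alpha_m)
            classifiers.append(phi)
        if alphas:                       # store normalized margins for later R terms
            score = sum(a * phi(X) for a, phi in zip(alphas, classifiers))
            F_norm[:, mu] = score / sum(alphas)
        models.append((alphas, classifiers))
    return models                        # per position: (alphas, classifiers)
```

A trained model then labels position $\mu$ of a new case with $\mathrm{sign}(\sum_k \alpha_k \Phi_k(x))$, exactly as in the last line of the pseudocode.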

  20. Profile of AdaBoostSeq
• A new algorithm for sequence prediction.
• For each sequence item, AdaBoostSeq also considers the prediction errors on all previous items of the sequence within the boosting algorithm: the more errors on the previous sequence items, the stronger the focus on the badly predicted cases at the current item.
• Self-adaptive.

  21. Experiments
• 4019 cases in the dataset
• 20 input features
• Sequence length = 10
• Decision stump as the base classifier
• 10-fold cross-validation
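For completeness, a minimal weighted decision stump of the kind named above could serve as the `fit_base` callback in the earlier sketch; this is an illustrative implementation, not the one used in the experiments:

```python
import numpy as np

def fit_base(X, y, w):
    # Exhaustive weighted decision stump: threshold one feature and pick the
    # (feature, threshold, polarity) triple with the lowest weighted error.
    best, best_err = None, np.inf
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for pol in (1, -1):
                pred = pol * np.where(X[:, j] > thr, 1, -1)
                err = w[pred != y].sum()
                if err < best_err:
                    best_err, best = err, (j, thr, pol)
    j, thr, pol = best
    return lambda X: pol * np.where(X[:, j] > thr, 1, -1)
```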

  22. AdaBoost vs. AdaBoostSeq (with $\xi$)
[Bar chart: sequence mean absolute error for $\xi$ = 0.4, 0.5, 0.6, 0.7, 0.8, 0.9 and 1; value labels visible in the chart: 0.0762, 0.0810, 0.0828, 0.0872, 0.0922, 0.0945, 0.1011. $\xi$ = 0.6 gives the best (lowest) error; for $\xi$ = 1 the method reduces to standard AdaBoost and performs worst.]

  23. Summary of the Experiments
• For sequence items 2 and later, the error is reduced dramatically (6 times!), since the method respects errors on the previous items.
• $\xi$ influences the error.
• For $\xi$ = 0.6 the error decreases by 24% over the whole sequence compared to the standard approach ($\xi$ = 1).
[Line chart: mean absolute error per sequence item (1 to 10) for $\xi$ = 0.4, 0.6, 0.8 and 1.]

  24. Conclusions and Future Work
• AdaBoostSeq: a new algorithm for sequence prediction based on AdaBoost.
• When predicting subsequent items of a sequence, the errors from the previous items are utilized.
• Much more accurate than AdaBoost applied to the sequence items independently.
• Parameterized by $\xi$, which controls how much the previous errors are respected.
• Recent application: prediction for debt valuation.
• Future work: new cost functions (built on the HMM framework).
