Learning Multiple Tasks with Boosted Decision Trees
Jean Baptiste Faddoul, Boris Chidlovskii, Fabien Torre, Rémi Gilleron
CAp'12, May 2012


  1. Multitask Learning

  Multitask Learning (MTL) considers learning multiple "related" tasks jointly, in order to improve their predictive performance.

  2. Related Tasks?

  3. Table of Contents

  1. Introduction
  2. Learning MT-DTs
  3. MT-Adaboost
  4. Experiments
  5. Conclusion

  Label Correspondence Assumption

  Learning multiple tasks becomes easier under the label correspondence assumption, where either:
  - the tasks share the same label sets, or
  - the tasks share the same training data points, each of which carries a label for every task.

  4. Global Relatedness Assumption

  Related tasks may exhibit different degrees, and even different signs, of relatedness in different regions of the learning space.

  Prior Art

  - [Quadrianto et al., 2010] formulate MTL as the maximization of mutual information among the label sets. Their approach assumes a global relatedness pattern between the tasks.
  - In previous work [Faddoul et al., 2010] we proposed MT-Adaboost, an adaptation of Adaboost to MTL; the weak classifier used there is the MT-Stump.

  5. Prior Art (Cont'd)

  - MT-Adaboost with MT-Stumps requires no label correspondence or global relatedness assumption.
  - However, the sequential design of the multi-task stump and its greedy learning algorithm can fail to capture task relatedness.

  Contribution

  We propose a novel multi-task learning technique which addresses the limitations of the previous approaches:
  - We adapt decision tree learning to the multi-task setting.
  - We derive an information-theoretic criterion and prove its superiority over the baseline information gain.
  - We integrate the proposed classifier into the boosting framework.

  A Bit of Notation

  - N classification tasks T_1, ..., T_N over the instance space X, with label sets Y_1, ..., Y_N.
  - A distribution D over X × {1, ..., N}.
  - A training set S = { <x_i, y_i, j> | x_i ∈ X, y_i ∈ Y_j, j ∈ {1, ..., N}, 1 ≤ i ≤ m }.
  - The output is a hypothesis h : X → Y_1 × ... × Y_N which minimizes error(h) = Pr_{<x,y,j> ~ D}[ h_j(x) ≠ y ], where h_j(x) is the j-th component of h(x).
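  To make the notation concrete, here is a minimal, purely illustrative Python sketch of the multi-task sample S and the empirical counterpart of error(h). The names (Example, mt_error) are ours, not the paper's.

from dataclasses import dataclass
from typing import Any, Callable, Dict, List

@dataclass
class Example:
    """One training triple <x_i, y_i, j>: instance, label, task index."""
    x: List[float]   # a point of the instance space X
    y: Any           # a label from Y_j
    task: int        # j in {1, ..., N}

def mt_error(h: Callable[[List[float]], Dict[int, Any]], sample: List[Example]) -> float:
    """Empirical multi-task error: fraction of triples <x, y, j> with h_j(x) != y."""
    wrong = sum(1 for e in sample if h(e.x)[e.task] != e.y)
    return wrong / len(sample)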

  6. Multi-Task Decision Tree (MT-DT)

  Information Gain

  - Decision tree learning is based on the Information Gain (IG) criterion.
  - IG(Y; X), the gain on a random variable Y obtained after observing the value of X, is the Kullback-Leibler divergence D_KL( p(Y|X) || p(Y|I) ).
  - It is the reduction of Y's entropy obtained by observing the value of X:

      IG(Y; X) = H(Y) - H(Y | X).

  - IG defines a preferred sequence of attributes to test in order to narrow down the state of Y as quickly as possible.
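  As a reference point, the following sketch computes the empirical entropy and information gain of a discrete attribute. This is the standard textbook computation, not code from the paper, and the function names are ours.

import math
from collections import Counter
from typing import Any, List

def entropy(labels: List[Any]) -> float:
    """Empirical Shannon entropy H(Y), in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels: List[Any], attr_values: List[Any]) -> float:
    """IG(Y; X) = H(Y) - H(Y | X) for a discrete attribute X."""
    n = len(labels)
    groups = {}
    for y, v in zip(labels, attr_values):
        groups.setdefault(v, []).append(y)
    conditional = sum(len(ys) / n * entropy(ys) for ys in groups.values())
    return entropy(labels) - conditional

  For instance, information_gain(['a', 'b', 'a', 'b'], [0, 0, 1, 1]) is 0 (the attribute tells us nothing about the label), while information_gain(['a', 'a', 'b', 'b'], [0, 0, 1, 1]) is 1 bit.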

  7. Information Gain for MTL

  As a baseline, we can pool all the tasks into a single multi-class task. The IG in this case is

      IG_J = IG( ⊕_{j=1}^N Y_j ; X ),

  where ⊕ denotes the pooling of all the tasks' labels. Another baseline is the sum of the individual IGs:

      IG_U = Σ_{j=1}^N IG( Y_j ; X ).

  (Both baselines are sketched in code after this slide's notes.)

  Information Gain for MTL (Cont'd)

  - Evaluations show that IG_U fails to do better than IG_J.
  - We prove that IG_J is equivalent to a weighted sum of the individual tasks' information gains.
  - We then derive IG_M, a criterion superior to IG_J.
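  One hypothetical encoding of the two baselines, assuming the illustrative information_gain helper sketched above is in scope. Pooling is done here by tagging each label with its task index, which is one reasonable reading of the ⊕ operation; it is a sketch, not the paper's implementation.

from typing import Any, List, Tuple

Task = Tuple[List[Any], List[Any]]   # (labels from Y_j, values of the attribute X)

def ig_joint(tasks: List[Task]) -> float:
    """IG_J: pool all tasks into one multi-class problem; labels are
    disambiguated as (task index, label) pairs before pooling."""
    pooled_labels, pooled_attr = [], []
    for j, (labels, attr_values) in enumerate(tasks):
        pooled_labels += [(j, y) for y in labels]
        pooled_attr += attr_values
    return information_gain(pooled_labels, pooled_attr)

def ig_union(tasks: List[Task]) -> float:
    """IG_U: unweighted sum of the per-task information gains."""
    return sum(information_gain(labels, attr) for labels, attr in tasks)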

  8. Information Gain for MTL (Cont'd)

  Theorem. For N tasks with class sets Y_1, ..., Y_N, let p_j denote the fraction of task j in the full dataset,

      p_j = |S_j| / Σ_{k=1}^N |S_k|,   j = 1, ..., N,   so that Σ_{j=1}^N p_j = 1.

  Then

      IG( ⊕_{j=1}^N Y_j ; X ) = Σ_{j=1}^N p_j IG( Y_j ; X ) ≤ max( IG(Y_1; X), ..., IG(Y_N; X) ).

  To prove the theorem we use the generalized grouping property of the entropy:

  Lemma. For q_{kj} ≥ 0 such that Σ_{k=1}^n Σ_{j=1}^m q_{kj} = 1, and p_k = Σ_{j=1}^m q_{kj} for all k = 1, ..., n, the following holds:

      H( q_{11}, ..., q_{1m}, q_{21}, ..., q_{2m}, ..., q_{n1}, ..., q_{nm} )
        = H( p_1, ..., p_n ) + Σ_{k=1}^n p_k H( q_{k1}/p_k, ..., q_{km}/p_k ),   with p_k > 0 for all k.
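  The grouping property is easy to check numerically. The throwaway script below verifies it on a random n × m joint distribution; it is an illustration only, not part of the proof.

import math
import random

def H(probs):
    """Shannon entropy of a probability vector (zero entries contribute nothing)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A random joint distribution q[k][j] with n groups (rows) and m outcomes per group.
n, m = 3, 4
q = [[random.random() for _ in range(m)] for _ in range(n)]
total = sum(v for row in q for v in row)
q = [[v / total for v in row] for row in q]

p = [sum(row) for row in q]                      # p_k = sum_j q_kj
lhs = H([v for row in q for v in row])           # H(q_11, ..., q_nm)
rhs = H(p) + sum(p[k] * H([v / p[k] for v in q[k]]) for k in range(n))
assert abs(lhs - rhs) < 1e-9                     # the grouping property holds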

  9. Information Gain for MTL (Cont'd)

  First, we use the Lemma to develop the entropy term H( ⊕_{j=1}^N Y_j ) of the information gain:

      H( ⊕_{j=1}^N Y_j ) = H( p_1, ..., p_N ) + Σ_{j=1}^N p_j H( Y_j ),

  where Σ_{j=1}^N p_j = 1.

  Second, we develop the conditional entropy term. We assume here that the task proportions are independent of the observation, i.e., H( p_1, ..., p_N | x ) = H( p_1, ..., p_N ):

      H( ⊕_{j=1}^N Y_j | X ) = Σ_x p(x) H( ⊕_{j=1}^N Y_j | X = x )
                             = Σ_x p(x) [ H( p_1, ..., p_N ) + Σ_{j=1}^N p_j H( Y_j | X = x ) ]
                             = H( p_1, ..., p_N ) + Σ_{j=1}^N p_j Σ_x p(x) H( Y_j | X = x )
                             = H( p_1, ..., p_N ) + Σ_{j=1}^N p_j H( Y_j | X ).

  10. Information Gain for MTL (Cont'd)

  Now we combine the entropy and the conditional entropy terms to evaluate the joint information gain IG( ⊕_{j=1}^N Y_j ; X ). We obtain

      IG( ⊕_{j=1}^N Y_j ; X ) = H( ⊕_{j=1}^N Y_j ) - H( ⊕_{j=1}^N Y_j | X )
                              = Σ_{j=1}^N p_j IG( Y_j ; X )
                              ≤ Σ_{j=1}^N p_j max( IG(Y_1; X), ..., IG(Y_N; X) )
                              = max( IG(Y_1; X), ..., IG(Y_N; X) ).

  This completes the proof of the theorem. (A small numerical check of the identity is included after this slide's notes.)

  IGs Comparison

  Figure: Information gain for randomly generated datasets.
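  A toy check of the theorem on made-up data, assuming the information_gain and ig_joint sketches above are in scope. The two tasks share the same attribute profile, so the task proportions are independent of X, as the proof assumes.

task1 = (['a', 'b', 'a', 'b'], [0, 0, 1, 1])   # (labels, attribute values), IG = 0
task2 = (['+', '+', '-', '-'], [0, 0, 1, 1])   # IG = 1 bit; same attribute profile
tasks = [task1, task2]

total = sum(len(labels) for labels, _ in tasks)
p = [len(labels) / total for labels, _ in tasks]            # p_j = |S_j| / sum_k |S_k|

joint = ig_joint(tasks)                                     # IG_J
weighted = sum(pj * information_gain(labels, attr)
               for pj, (labels, attr) in zip(p, tasks))     # sum_j p_j IG(Y_j; X)
best = max(information_gain(labels, attr) for labels, attr in tasks)

assert abs(joint - weighted) < 1e-9    # IG_J equals the weighted sum (0.5 bit here)
assert joint <= best + 1e-9            # and never exceeds the best single-task IG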

  11. Adaboost

  - Adaboost is a meta-algorithm that comes from the PAC framework [Valiant, 1984].
  - A weak learner is only slightly better than random guessing (error ε < 0.5).
  - A strong learner can drive its error ε as low as desired.
  - Adaboost transforms any weak learner into a strong learner!

  Outline of the algorithm (a minimal single-task sketch is given after this slide's notes):
  1. Initialize the example weights D_1.
  2. At each round t = 1, ..., T:
     - call the weak learner on the distribution D_t to learn a hypothesis h_t;
     - compute ε_t, the error of h_t on the training set;
     - increase the weights of the incorrectly classified examples and decrease the weights of the correctly classified ones.
  3. The final classifier is a weighted vote of the weak classifiers' outputs.

  What is Good about the Adaboost Framework?

  - Not prone to overfitting.
  - Can be used with many different classifiers.
  - Inherits the features of its weak classifier (semi-supervised, relational, statistical, etc.).
  - Simple to implement.

  These properties motivate our choice of Adaboost as the framework for our multi-task algorithm.

  Requirements

  We have to:
  - define multi-task hypotheses and their learning algorithm: in our case, the MT-DT;
  - modify AdaBoost for multi-task learning.
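  The following is a minimal single-task AdaBoost.M1-style loop, written only to make the outline above concrete; the weak learner is abstracted as a callback and all identifiers are ours, not the paper's.

import math
from typing import Any, Callable, List, Sequence, Tuple

Hypothesis = Callable[[Any], Any]
WeakLearner = Callable[[Sequence[Tuple[Any, Any]], Sequence[float]], Hypothesis]

def adaboost_m1(sample: Sequence[Tuple[Any, Any]], weak_learner: WeakLearner,
                rounds: int) -> Callable[[Any], Any]:
    """Single-task AdaBoost.M1-style loop over (x, y) pairs."""
    m = len(sample)
    d = [1.0 / m] * m                               # D_1: uniform example weights
    ensemble: List[Tuple[float, Hypothesis]] = []   # pairs (ln 1/beta_t, h_t)
    for _ in range(rounds):
        h = weak_learner(sample, d)
        eps = sum(w for w, (x, y) in zip(d, sample) if h(x) != y)
        if eps >= 0.5 or eps == 0.0:                # weak-learning condition violated
            break
        beta = eps / (1.0 - eps)
        d = [w * (beta if h(x) == y else 1.0)       # shrink correct, keep incorrect
             for w, (x, y) in zip(d, sample)]
        z = sum(d)
        d = [w / z for w in d]                      # renormalize to a distribution
        ensemble.append((math.log(1.0 / beta), h))

    def classify(x: Any) -> Any:
        votes: dict = {}
        for alpha, h in ensemble:
            votes[h(x)] = votes.get(h(x), 0.0) + alpha   # weighted vote
        return max(votes, key=votes.get)
    return classify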

  12. MT-Adaboost

  - We adapt Adaboost.M1, introduced in [Schapire and Singer, 1999]; any other boosting algorithm could be used.
  - In the multi-task case, the distribution is defined over (example, task) pairs, i.e., over X × {1, ..., N}.
  - The output of the algorithm is a function which takes an example as input and gives one label per task as output:

      H_j(x) = argmax_{y ∈ Y_j} Σ_{t : h_t^j(x) = y} ln(1/β_t),   1 ≤ j ≤ N.

  MT-Adaboost

  Require: S = ∪_{j=1}^N { e_i = <x_i, y_i, j> | x_i ∈ X, y_i ∈ Y_j }
  Initialize the distribution: D_1 = init(S).
  for t = 1 to T do
      h_t = WL(S, D_t)    { train the weak learner and get a hypothesis, an MT-DT }
      Compute the error of h_t:  ε_t = Σ_{j=1}^N Σ_{i : h_t^j(x_i) ≠ y_i} D_t(e_i)
      if ε_t > 1/2 then set T = t - 1 and abort the loop end if
      β_t = ε_t / (1 - ε_t)
      Update the distribution:
          if h_t^j(x_i) = y_i then D_{t+1}(e_i) = D_t(e_i) · β_t / Z_t
          else D_{t+1}(e_i) = D_t(e_i) / Z_t end if
      { Z_t is a normalization constant chosen so that D_{t+1} is a distribution }
  end for
  return the classifier H defined by  H_j(x) = argmax_{y ∈ Y_j} Σ_{t : h_t^j(x) = y} ln(1/β_t),   1 ≤ j ≤ N.
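  A hedged Python transcription of the pseudocode above. The MT-DT weak learner itself is abstracted as a callback, and all identifiers are ours; this is a sketch of the loop, not the authors' implementation.

import math
from typing import Any, Callable, Dict, List, Sequence, Tuple

# A multi-task hypothesis maps an instance x to {task index: predicted label}.
MTHypothesis = Callable[[Any], Dict[int, Any]]
# The weak learner (an MT-DT learner in the paper) receives the weighted sample.
MTWeakLearner = Callable[[Sequence[Tuple[Any, Any, int]], Sequence[float]], MTHypothesis]

def mt_adaboost(sample: Sequence[Tuple[Any, Any, int]],      # triples <x_i, y_i, j>
                label_sets: Dict[int, List[Any]],            # Y_j for each task j
                weak_learner: MTWeakLearner,
                rounds: int) -> Callable[[Any], Dict[int, Any]]:
    m = len(sample)
    d = [1.0 / m] * m                                # D_1 over (example, task) pairs
    ensemble: List[Tuple[float, MTHypothesis]] = []
    for _ in range(rounds):
        h = weak_learner(sample, d)
        eps = sum(w for w, (x, y, j) in zip(d, sample) if h(x)[j] != y)
        if eps > 0.5 or eps == 0.0:                  # abort as in the pseudocode
            break
        beta = eps / (1.0 - eps)
        d = [w * (beta if h(x)[j] == y else 1.0)     # correct pairs shrink by beta_t
             for w, (x, y, j) in zip(d, sample)]
        z = sum(d)                                   # Z_t normalization
        d = [w / z for w in d]
        ensemble.append((math.log(1.0 / beta), h))

    def H(x: Any) -> Dict[int, Any]:
        """H_j(x): argmax over y in Y_j of the total ln(1/beta_t) of the rounds
        in which h_t predicted y on task j."""
        out = {}
        for j, labels in label_sets.items():
            votes = {y: 0.0 for y in labels}
            for alpha, h in ensemble:
                pred = h(x)[j]
                if pred in votes:
                    votes[pred] += alpha
            out[j] = max(votes, key=votes.get)
        return out
    return H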
