On-line Hierarchical Multi-label Text Classification

Jesse Read

Supervised by Bernhard (and Eibe and Geoff)
Multi-label Classification

Multi-class ("single-label") classification:
  e.g. class set C = {Sports, Environment, Science, Politics}.
  For a text document d, select a single class c ∈ C.

Multi-label classification:
  e.g. label set L = {Sports, Environment, Science, Politics}.
  For a text document d, select a label subset S ⊆ L, e.g.:

  Doc.  Labels (S ⊆ L)
  1     {Sports, Politics}
  2     {Science, Politics}
  3     {Sports}
  4     {Environment, Science}

...how do we do multi-label classification?
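To make the representation concrete, here is a minimal Python sketch (not from the slides) that encodes each document's label subset S ⊆ L as a binary indicator vector over L, the usual starting point for the transformation methods on the next slides.

```python
# Toy dataset from this slide, with label subsets encoded as 0/1
# indicator vectors over L. All variable names here are illustrative.
LABELS = ["Sports", "Environment", "Science", "Politics"]

docs = {
    1: {"Sports", "Politics"},
    2: {"Science", "Politics"},
    3: {"Sports"},
    4: {"Environment", "Science"},
}

def to_indicator(label_set):
    """Encode a label subset S ⊆ L as a 0/1 vector over L."""
    return [1 if lbl in label_set else 0 for lbl in LABELS]

for doc_id, S in docs.items():
    print(doc_id, to_indicator(S))
# 1 [1, 0, 0, 1]
# 2 [0, 0, 1, 1]
# 3 [1, 0, 0, 0]
# 4 [0, 1, 1, 0]
```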
Problem Transformation Methods (PT)

Transforming a multi-label problem into a multi-class problem without losing information:

1. Label Combination Method (LC)
2. Binary Classifiers Method (BC)
3. Ranking Threshold Method (RT)

Our toy multi-label problem, with label set L = {Sports, Environment, Science, Politics}:

  Doc.  Labels (S ⊆ L)
  1     {Sports, Politics}
  2     {Science, Politics}
  3     {Sports}
  4     {Environment, Science}
1. Label Combination Method (LC)

Train:
  Doc.  Class
  1     Sports+Politics
  2     Science+Politics
  3     Sports
  4     Science+Environment

Test:
  Doc.  Class
  X     ?

• May generate many classes for few documents
• Possibly inflexible for time-ordered data
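A minimal sketch of the LC transformation, assuming a scikit-learn-style multi-class learner; the helper names and toy texts are invented for illustration and are not from the slides. Note that LC can only predict label subsets it saw during training, which is one source of its inflexibility on time-ordered data.

```python
# Label Combination (LC): each distinct label subset becomes a single
# atomic class, so one ordinary multi-class learner handles the problem.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

def lc_transform(S):
    """Encode a label subset as one combined class, e.g. 'Politics+Sports'."""
    return "+".join(sorted(S))

def lc_inverse(c):
    """Decode a combined class back into a label subset."""
    return set(c.split("+"))

def lc_train(base_learner, X, label_sets):
    y = [lc_transform(S) for S in label_sets]  # one class per subset seen
    base_learner.fit(X, y)
    return base_learner

def lc_predict(model, X_test):
    # Can only ever return subsets that occurred in the training data.
    return [lc_inverse(c) for c in model.predict(X_test)]

# Illustrative usage on the toy problem (the texts are made up):
texts = ["match election", "atom vote", "goal referee", "forest atom"]
label_sets = [{"Sports", "Politics"}, {"Science", "Politics"},
              {"Sports"}, {"Environment", "Science"}]
X = CountVectorizer().fit_transform(texts)
model = lc_train(MultinomialNB(), X, label_sets)
print(lc_predict(model, X))  # recovers the four training subsets
```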
2. Binary Classifiers Method (BC)

Train one binary classifier per label:

  Doc.  B_Sports  B_Environment  B_Science  B_Politics
  1     1         0              0          1
  2     0         0              1          1
  3     1         0              0          0
  4     0         1              1          0

Test:
  Doc.  B_Sports  B_Environment  B_Science  B_Politics
  X     ?         ?              ?          ?

• Slow: needs |L| classifiers
• Assumes that all labels are independent
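A minimal sketch of BC (elsewhere called binary relevance), again assuming scikit-learn-style base learners; `copy.deepcopy` is just one way of obtaining a fresh, independent classifier per label.

```python
# Binary Classifiers (BC): one independent yes/no classifier per label.
import copy

def bc_train(base_learner, X, label_sets, labels):
    models = {}
    for lbl in labels:
        y = [1 if lbl in S else 0 for S in label_sets]  # per-label targets
        m = copy.deepcopy(base_learner)                 # fresh classifier
        m.fit(X, y)
        models[lbl] = m
    return models

def bc_predict(models, X_test):
    """Union of positive per-label decisions; labels treated as independent."""
    preds = [set() for _ in range(X_test.shape[0])]  # X_test: feature matrix
    for lbl, m in models.items():
        for i, yi in enumerate(m.predict(X_test)):
            if yi == 1:
                preds[i].add(lbl)
    return preds
```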
3. Ranking Threshold Method (RT)

Train on (document, single label) pairs:

  Doc.  Class
  1     Sports
  1     Politics
  2     Science
  2     Politics
  3     Sports
  4     Science
  4     Environment

Test:
  Doc.  Certainty distribution
  X     (Y_w, Y_x, Y_y, Y_z) = (?, ?, ?, ?)

• Difficulty in selecting a threshold
• Assumes that all labels are independent
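A minimal sketch of RT, assuming a scikit-learn-style learner that exposes `predict_proba` and `classes_` (e.g. a vectorizer-plus-NB pipeline that accepts raw documents). The threshold t = 0.2 is an arbitrary illustration; choosing it well is precisely the difficulty noted above.

```python
# Ranking Threshold (RT): train one multi-class model on duplicated
# (document, single-label) pairs, then keep every label whose predicted
# certainty clears a threshold t.

def rt_train(base_learner, X, label_sets):
    X_dup, y = [], []
    for x, S in zip(X, label_sets):   # X: a list of documents/vectors
        for lbl in S:                 # one copy of the document per label
            X_dup.append(x)
            y.append(lbl)
    base_learner.fit(X_dup, y)
    return base_learner

def rt_predict(model, X_test, t=0.2):
    preds = []
    for dist in model.predict_proba(X_test):   # certainty distribution
        preds.append({lbl for lbl, p in zip(model.classes_, dist) if p >= t})
    return preds
```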
Algorithm Adaptation Methods

We have seen the three main problem transformation methods. There are also algorithm adaptation methods, for example:

• Modifying the entropy of J48
• Multiple actions for association rules
• AdaBoost.MH, AdaBoost.MR
• Modifications to SMO, kNN, ...

Most algorithm adaptation methods use a problem transformation method internally, e.g. association rules use LC, AdaBoost.MH uses an "AdaBoost Transformation" (AT), and AdaBoost.MR uses RT.

...what about hierarchy?
Hierarchical Classification

A hierarchical classifier includes some method for recognising relationships between labels. For text data, we use a tree-structured topic hierarchy, known as a taxonomy.

There are two approaches to hierarchical classification:

• Global hierarchical (a.k.a. the "big bang" approach)
• Local hierarchical (a.k.a. the "top-down" approach)
Global Hierarchical

  root
  ├── Americas: US, Canada
  ├── Mid.East: Iraq, Iran
  ├── Sci/Tech
  └── Sports: Soccer, Rugby

+ Improvements in accuracy
− Difficult to maintain; can become very computationally complex

Examples:
• Stacking (e.g. on BC)
• EM (e.g. on LC)
• Boosting (e.g. with AT)
• Association Rules
• Predictive Clustering Trees (multi-label tree learners)
Local Hierarchical

  root
  ├── Americas: US, Canada
  ├── Mid.East: Iraq, Iran
  ├── Sci/Tech
  └── Sports: Soccer, Rugby

+ Divides up the problem: easy to maintain; intuitive
− Error propagation; accuracy similar to flat PT

Examples:
• Pachinko Machine, e.g. Fuzzy Relational Thesauri (FRT)
• Probabilistic
• Hybrid: ECOC, error recovery (can return to higher nodes)
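A minimal sketch of the top-down routing described here, over the taxonomy above; `node_clf` is a hypothetical dict of already-trained per-node classifiers, and there is no error recovery, so a mistake at the root propagates to the leaf (the hybrid variants above address exactly this).

```python
# Local (top-down) hierarchical classification: one classifier per
# internal node routes a document towards a leaf.

TAXONOMY = {
    "root":     ["Americas", "Mid.East", "Sci/Tech", "Sports"],
    "Americas": ["US", "Canada"],
    "Mid.East": ["Iraq", "Iran"],
    "Sports":   ["Soccer", "Rugby"],
}

def top_down_classify(doc, node_clf, node="root"):
    """Descend the taxonomy, asking the local classifier at each node."""
    path = []
    while node in TAXONOMY:                        # stop at a leaf
        node = node_clf[node].predict([doc])[0]    # choose one child
        path.append(node)
    return path   # e.g. ["Sports", "Rugby"]
```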
Multi-label Datasets

  Key    |D|     |L|   UC(D,L)  LC(D,L)  Hier.  Seq.  Text
  YEAST  2,417   14    198      4.24     N      N     N
  MEDC   978     45    94       1.25     N      N     Y
  20NG   19,300  20    55       1.03     Y      Y     Y
  ENRN   1,702   53    753      3.38     Y      Y     Y
  MARX   3,617   101   208      1.13     Y      Y     Y
  REUT   6,000   103   811      1.46     Y      N     Y

where:
  |D| = number of documents
  |L| = number of possible labels
  UC(D,L) = |{ S ⊆ L : ∃ d ∈ D, L(d) = S }|  (number of unique label combinations)
  LC(D,L) = (1/|D|) Σ_{i=1}^{|D|} |S_i|, over labelled documents (d_i, S_i) with S_i ⊆ L  (average label-set size)
  Hier. = hierarchical structure defined within dataset
  Seq.  = time-ordered data
  Text  = text dataset
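Both statistics follow directly from the definitions above; a small sketch, computed here for the toy dataset from earlier.

```python
# UC: number of distinct label combinations occurring in the data.
# LC: average number of labels per document ("label cardinality").

def uc(label_sets):
    return len({frozenset(S) for S in label_sets})

def lc(label_sets):
    return sum(len(S) for S in label_sets) / len(label_sets)

toy = [{"Sports", "Politics"}, {"Science", "Politics"},
       {"Sports"}, {"Environment", "Science"}]
print(uc(toy))  # 4 unique combinations
print(lc(toy))  # 1.75 labels per document on average
```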
Multi-label Evaluation

• Percentage of correctly classified instances? Too harsh.
• Percentage of correctly classified labels? Too easy.

Let C be a multi-label classifier, S_i ⊆ L, and Y_i = C(x_i) the labels predicted by C for document x_i:

  Accuracy(C, D) = (1/|D|) Σ_{i=1}^{|D|} |S_i ∩ Y_i| / |S_i ∪ Y_i|    (1)

Hierarchical evaluation:
• Should we give partial credit?
• If so, how?
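A minimal sketch of Equation (1). Treating two empty sets as a perfect match is my assumption for the undefined 0/0 case; the slide does not specify it.

```python
def accuracy(true_sets, pred_sets):
    """Mean Jaccard overlap |S ∩ Y| / |S ∪ Y| of true and predicted labels."""
    total = 0.0
    for S, Y in zip(true_sets, pred_sets):
        if S or Y:
            total += len(S & Y) / len(S | Y)
        else:
            total += 1.0   # assumption: two empty sets count as agreement
    return total / len(true_sets)

print(accuracy([{"Sports", "Politics"}], [{"Sports"}]))  # 0.5
```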
Algorithms

Multi-class algorithms commonly used in prior multi-label work:

  Key   Type      Description
  NB    Bayes     Naïve Bayes
  BAG   Meta      Bagging (with J48)
  SMO   Function  Support Vector Machines
  J48   Tree      J48 (C4.5 decision trees)
  IBk   kNN       k Nearest Neighbor
  NN    Neural    Neural Networks

Pilot experiments showed that:
• the default NN is too slow
• IBk does not perform well with sparse data
Experiments — Tables

Flat vs. Global Hierarchical vs. Local Hierarchical

1. Problem Transformation

              LC                              BC                              RT
        NB      BAG     SMO     J48     NB      BAG     SMO     J48     NB      BAG     SMO     J48
  MEDI  68.05   71.77*  71.10*  72.13*  55.82   75.58*  73.59*  65.83   67.81   64.20   65.72   60.22
  20NG  57.47*  57.58*  57.35*  52.74   32.33   -       47.67   41.09   56.05*  47.19   54.61*  50.55
  ENRN  32.72*  25.42   -       22.96   21.82   31.35*  30.56*  26.26   15.16   30.25*  24.09   27.82
  MARX  48.15*  48.93*  43.26   44.79   32.60   31.69   38.64   33.95   48.44*  36.07   40.46   38.71
  REUT  43.76   51.47   -       41.68   18.21   44.09   56.23*  43.83   37.13   45.90   58.65*  45.31

2. Global Hierarchical

              LC-EM                           BC-Stack(RT-NB)                 AT
        NB      BAG     SMO     J48     NB      BAG     SMO     J48     BAG     J48
  MEDI  67.45   74.71*  70.75   72.31   56.09   70.76   73.65*  65.85   67.06   67.82
  20NG  57.48*  57.58*  57.45*  53.39   29.80   -       49.06   40.88   -       -
  ENRN  34.60*  25.46   -       23.31   20.66   31.79   27.01   25.35   -       -
  MARX  48.18   50.64*  43.29   44.82   39.09   32.08   38.87   34.25   -       -
  REUT  43.77   51.49*  -       41.69   19.78   43.83   57.32*  43.68   -       -

3. Local Hierarchical

              LC                              BC                              RT
        NB      BAG     SMO     J48     NB      BAG     SMO     J48     NB      BAG     SMO     J48
  20NG  56.49   58.31*  58.83*  53.48   43.68   -       52.44   42.03   54.87   40.58   53.37   49.26
  ENRN  25.96   29.38   27.73   25.23   15.30   34.99*  -       26.26   4.67    25.51   23.59   27.63
  MARX  48.49   54.57*  42.40   46.84   41.69   38.67   40.34   38.65   46.44   33.59   38.32   41.23
Experiments — 20NG — Accuracy

[Plot: accuracy vs. % labeled (training) examples on 20NG, comparing LH.LC-SMO, LH.BC-SMO, LH.RT-NB, GH.LC_EM-NB, GH.BC_STACK-SMO, and GH.AT_J48]
Experiments — 20NG — Build Time

[Plot: build time vs. % labeled (training) examples on 20NG for the same six methods]
Experiments — ENRN — Accuracy

[Plot: accuracy vs. % labeled (training) examples on ENRN for the same six methods]
Experiments — ENRN — Build Time

[Plot: build time vs. % labeled (training) examples on ENRN for the same six methods]
Experiments — MARX — Accuracy

[Plot: accuracy vs. % labeled (training) examples on MARX for the same six methods]
Experiments — MARX — Build Time

[Plot: build time vs. % labeled (training) examples on MARX for the same six methods]
Conclusions

Problem transformation methods:
• No problem transformation method is best on all datasets
• BC and RT might do better with a better-selected |S|
• Complexity is determined by |D|, |L|, LC(D,L), and UC(D,L)

Multi-class algorithms:
• J48 is not that great
• BC does not work well with Naïve Bayes, RT does, and LC works equally well with either

Hierarchical:
• Global PT extensions improve on flat PT
• Local hierarchical classifiers carry more build overhead in practice, but are in theory more flexible