
Reducing Noise in Labels and Features for a Real World Dataset: Application of NLP Corpus Annotation Methods


  1. Reducing Noise in Labels and Features for a Real World Dataset: Application of NLP Corpus Annotation Methods
     Rebecca J. Passonneau, Cynthia Rudin, Axinia Radeva, and Zhi An Liu
     Center for Computational Learning Systems (CCLS), Columbia University

  2. Motivation: Secondary Electrical Grid
     A dense network of structures and cables provides power to NYC buildings
     Structures at 2nd Ave & 83rd Street, Manhattan
     • Manholes
     • Service boxes
     Serious event: manhole fire in the Village, April 2008
     March 6, 2009 CICLING Reducing Noise in Labels & Features

  3. Emergency Control System (ECS) Ticket
      1 MR. ROBERT TOBIA (718)555-5124 - SMOKING. COVER OFF. - RMKS:
      2 01/06/03 08:40 MDETHUILOT DISPATCHED BY 55988
      3 01/06/03 09:30 MDETHUILOT ARRIVED BY 55988
      4 01/06/02 09:55 THUILOT REPORTS NO SMOKE ON ARRIVAL. THERE IS
      5 A SHUNT ON LOCATION - SHUNT & SERVICE NOT EFFECTED.
      . . .
      8 REQUESTING FLUSH/ORDERED (#2836).
      9 ******* NO PARKING : TUES. & FRIDAY, 11:30AM - 1PM ****** RV
     10 01/06/03 10:45 THUILOT REPORTS BUILDING 260 W.139 ST.
     11 COMPLAINED OF LIGHT PROBLEMS. FOUND 1-PHASE DOWN - BRIDGED
     12 @ 10:30 (2-PHASE SERVICE) CONSUMER IS CONTENT.
      . . .
     18 01/06/03 18:45 FERNANDEZ REPORTS THAT IN SB-521117 F/O254
     19 W139 ST. HE CUT OUT A 3W2W COPPERED JT & REPLACED IT W/
     20 A 4W NEO CRAB....BY USING 1 LEG OFF THE 7W FROM THE HE
     21 WAS ABLE TO PUSH THE MISSING PHASE BACK TO 260, BRIDGE
     22 REMOVED....@ THIS TIME FERNANDEZ REPORTS THERE ARE MORE
     23 B/O'S & 2 MORE JTS TO C/O, WILL F/U W/ MORE INFO....TCP

  4. Outline
     • NLP/IE versus real world problem and data
     • Ranking problem: structure vulnerability
     • ECS ticket classification problem
       – Relation to labels on structures
       – Relation to feature representation of structures
     • Annotation task: can humans classify tickets?
     • Results
       – Overall noise reduction
       – Improvements to top of list
     • Importance of knowledge transfer paths for ML

  5. Typical Impasse
     • A real world “database” has free text fields that could provide new relations in a relational database (rdb)
       03/06/09 SMH S/W/C BROAD & MAIN FITZSIMMONS REPORTS THE TBL HOLE IS SB-00001 FOUND ON ....SMOKING LIGHTY
     • Institutional owner gives db to NLP group for data mining
       – abysmal gap in domain knowledge

  6. NIST 2007 ACE (Automatic Content Extraction)
     Results in max/avg value score (roughly, accuracy)
     • Entity mentions (5 sites participating, 7 major entity types, e.g., geopolitical, facility, org):
       – Broadcast news: 65.9/52.7
       – Newswire: 58.1/44.0
       – Telephone: 49.2/35.5
       – Usenet: 44.0/31.4
     • Events (1 site, 8 major event types, e.g., business, meeting, conflict):
       – Broadcast news: 12.9
       – Newswire: 15.9
       – Telephone: 6.6
       – Usenet: 11.3

  7. CCLS/Consolidated Edison Collaboration
     • Idea(lization):
       – Help reduce serious events in the secondary electrical grid
       – Use 10 years of Emergency Control System (ECS) trouble ticket data (plus other data sources)
         • A succession of automated/free-text entries in one ticket
         • A procedure for assigning a “trouble type” to each ticket
       – Rank vulnerability of structures to “serious events”
     • Reality:
       – Data dump of very noisy data
       – No operational definition of “serious”

  8. Related Work
     • Devaney & Ram, 2005: case-based reasoning
       – 10,000 maintenance logs, machine X
       – Unsupervised text clustering, OWL/RDF domain model
     • Liddy et al., 2006: sublanguage analysis
       – ECS trouble tickets 1995-2005: 70K train, 7K test, 100 eval
       – Reclassification of MSE (misc) trouble type tickets into two trouble types, SMH and WL
     • Oza et al., in press: similar gap in domain knowledge for a complex domain
       – 800,000 reports from aeronautics db
       – SVM and non-negative matrix factorization on BOW document representation for topical classification (similar to LSA)

  9. Scope of Structure Ranking Problem
     • Number of structures in Manhattan: 51,912
     • ECS tickets for Manhattan
       – Relevant Trouble Types (N=21): 61,730
       – Number of structures in tickets: 27,235 (44%)
     • Number of “serious” events per year depends on the definition
       – Fires and explosions (MHX, MHF, MHO): ~150 (0.6% of structures in ECS)
       – Other events, e.g., smoking manholes: ~470 (1.8% of structures in ECS)

  10. Learning Approach to Structure Ranking
     • Formulated as a supervised bipartite ranking problem
       – A real-valued score is assigned to each structure
       – Goal is to rank positively-labeled examples above negatively-labeled examples
     • Learning algorithm
       – Maximizes a weighted version of the AUC
       – Here we used SVM-perf (Joachims, T., 2005)
       – We have also used P-Norm Push (Rudin, C., 2008; a generalization of RankBoost)
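To make the bipartite ranking objective concrete, here is a minimal Python sketch of the plain (unweighted) AUC: the fraction of (positive, negative) pairs that the scores order correctly, with ties counted as one half. It does not reproduce the weighted AUC variant, SVM-perf, or P-Norm Push named on the slide; the function name and the toy scores are hypothetical.

```python
def pairwise_auc(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0   # positive ranked above negative
            elif p == n:
                correct += 0.5   # tie: half credit
    return correct / (len(pos) * len(neg))


# Hypothetical scores for five structures; label 1 = had a serious event.
toy_scores = [0.9, 0.4, 0.7, 0.2, 0.1]
toy_labels = [1, 1, 0, 0, 0]
toy_auc = pairwise_auc(toy_scores, toy_labels)  # 5 of 6 pairs are ordered correctly
```

A perfect ranking gives 1.0 and a random one about 0.5, so the AUC percentages reported later in the slides can be read on that scale.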

  11. Event Classification: Labels and Features
     Depends on defining “serious event”
     • Label structures: Did s_i have a serious event in Y_j?
     • Identify a small number of explanatory features
       – Four ECS-based features affect the top of the list
         • Did s_i have a serious event recently (> (Y_j - 3) & < Y_j)?
         • How many recent tickets mention s_i?
         • Did s_i have a serious event in the past (> 1996 & < Y_j)?
         • How many past tickets mention s_i?
       – One cable density feature affects the rest of the list
     • Train on 2005, test on 2006, evaluate on 2007
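The label and the four ECS-based features can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' code: the `Ticket` record layout is hypothetical (the real ECS schema is not shown), each ticket is assumed to mention exactly one structure, and the year windows are implemented literally as written on the slide, so the "past" window overlaps the "recent" one.

```python
from collections import namedtuple

# Hypothetical record layout; one structure per ticket is a simplification.
Ticket = namedtuple("Ticket", ["structure_id", "year", "is_serious"])


def structure_features(tickets, structure_id, target_year, first_year=1996):
    """Four ECS-based features for one structure s_i and one target year Y_j."""
    mine = [t for t in tickets if t.structure_id == structure_id]
    # Recent window as written on the slide: Y_j - 3 < year < Y_j.
    recent = [t for t in mine if target_year - 3 < t.year < target_year]
    # Past window as written on the slide: 1996 < year < Y_j.
    past = [t for t in mine if first_year < t.year < target_year]
    return {
        "recent_serious": any(t.is_serious for t in recent),
        "recent_ticket_count": len(recent),
        "past_serious": any(t.is_serious for t in past),
        "past_ticket_count": len(past),
    }


def structure_label(tickets, structure_id, target_year):
    """Label: did the structure have a serious event in the target year?"""
    return any(
        t.structure_id == structure_id and t.year == target_year and t.is_serious
        for t in tickets
    )


# Toy data with made-up structure ids.
toy_tickets = [
    Ticket("SB-1", 1999, True),
    Ticket("SB-1", 2004, False),
    Ticket("SB-1", 2005, True),
    Ticket("SB-2", 2004, True),
]
features_2005 = structure_features(toy_tickets, "SB-1", 2005)
label_2005 = structure_label(toy_tickets, "SB-1", 2005)
```

Features use only tickets from before the target year, while the label uses the target year itself, which matches the train/test/evaluate split by year.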

  12. Baseline Event Classification
     • Length constraint: at least 3 free-text lines
     • Not all tickets correspond to distinct events (referred tickets; no work performed; non-secondary)
     • ECS Ticket Trouble Types (N=21)
       – MHX/MHF/MHO: good indicator event is serious
       – SMH: moderate indicator event is serious
       – ACB: good indicator event is not serious
       – 16 other trouble types: generally not serious
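A trouble-type baseline of this kind could look roughly like the Python sketch below. The slides do not spell out the full rule, so this is an assumed simplification: tickets below the length constraint are excluded, the four manhole-event types are treated as serious, and everything else is treated as a possible precursor. The function name, class names, and the handling of SMH are assumptions.

```python
# Assumption: SMH is grouped with MHX/MHF/MHO as "serious" in this sketch.
SERIOUS_TYPES = {"MHX", "MHF", "MHO", "SMH"}


def baseline_class(trouble_type, free_text_lines, min_free_text_lines=3):
    """Assign a coarse event class from the trouble type alone."""
    if free_text_lines < min_free_text_lines:
        return "exclude"      # too little free text to count as an event
    if trouble_type in SERIOUS_TYPES:
        return "serious"
    return "precursor"        # ACB and the remaining trouble types


examples = [
    baseline_class("MHF", 12),
    baseline_class("ACB", 5),
    baseline_class("SMH", 2),
]
```

Because the class depends only on the trouble type and ticket length, any mismatch between trouble type and what actually happened passes straight into the structure labels and features.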

  13. ECS Tickets
     • Enormous length variation: 1-522 lines
     • Varying proportion of free text lines: 0-69%
     • Fragmentary and telegraphic language
     • Specialized terminology (sublanguage)
       – CRAB, C&R, TROUBLE HOLE, FLUSH
     • Intra-word line breaks: AFFECTE/ D
     • Misspellings inflate vocabulary size
       – Before normalization: ~57K unigram types
       – After normalization: ~22K unigram types

  14. Human Annotation Task
     To acquire an extensional definition of “serious”
     • Data: 171 ECS tickets; text only, no access to trouble type, etc.
     • Annotators: 2 domain experts
     • Task: sort tickets into one of three classes
       1. Serious event
       2. Potential precursor event
       3. Exclude as irrelevant (e.g., not secondary; not an event)

  15. Experts versus Baseline
     • Kappa agreement coefficient results
       – Ranges from 1 (perfect agreement) to 0 (random) to -1 (perfect disagreement)
       – Experts with baseline (3-way kappa): 0.25
       – Experts with each other: 0.49
     • Trouble type does not correspond to expert judgment
     • Experts have moderate agreement (subjective)
     • Difficult prediction problem
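For readers unfamiliar with kappa, the Python sketch below computes Cohen's kappa for two annotators: observed agreement corrected for the agreement expected by chance from each annotator's label frequencies. The slides do not say which kappa variant was used (in particular for the 3-way comparison), so this two-rater form is an assumption, and the toy labels are made up.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: (observed - expected) / (1 - expected)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    # Chance agreement: product of the two annotators' marginal frequencies.
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1.0 - expected)


# Made-up labels for six tickets using the three annotation classes.
expert_1 = ["serious", "serious", "precursor", "precursor", "exclude", "exclude"]
expert_2 = ["serious", "precursor", "precursor", "precursor", "exclude", "serious"]
toy_kappa = cohen_kappa(expert_1, expert_2)  # observed 4/6, expected 1/3
```

In this toy case the annotators agree on four of six tickets, but a third of that agreement is expected by chance, which is why kappa comes out lower than the raw agreement rate.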

  16. Expert vs. Baseline, Annotated Tickets

     Ticket      Non-Event           Precursor           Serious             Expert
     Type        Baseline  Experts   Baseline  Experts   Baseline  Experts   Disagree
     ACB                0        0         21       16          0        3          2
     MHX/F/O            0        2          1        0          9        7          0
     SMH                0        3          0        7         27       15          2
     Other              8       17        106       58          0        4         35
     Totals             8       22        128       81         36       29         39

  17. Expert vs. Baseline, All Tickets

     Ticket      Precursor           Serious
     Type        Baseline  Rules     Baseline  Rules
     ACB            6,171   5,364         192    162
     MHX/F/O            0      25       1,785  1,481
     SMH                0   1,105       4,906  3,397
     Other         25,776  16,978          81     75
     Totals        31,947  23,472       6,964  5,115

     • Baseline Precursor + Serious = 38,911
     • Rules Precursor + Serious = 28,587

  18. Results: AUC scores

                   TRAIN    TEST
     Baseline      67.63    65.01
     Rule-based    68.29    67.55

     • Best improvement on Test Set (2006)
     • Obscures what changed:
       – Many large demotions of structures that are not so vulnerable (e.g., 759/52K to 3105/52K)
       – Side-effect: small promotions of vulnerable structures (e.g., 69/52K to 45/52K)

  19. Changes to Top of Ranked List
     • Jaccard coefficient finds the “similarity” of two sets:
       J(A, B) = |A ∩ B| / |A ∪ B|   (range is 0 to 1, with J=1 when A=B)
     • For N=5 to 1000, compare the top N structures of the ranked list from the baseline classification of events versus the rule-based classification

       N         5     10    15    20    100   500   1000
       Jaccard   0.25  0.33  0.50  0.60  0.72  0.75  0.81

     • E.g., at N=500 (J=0.75), about one in seven structures in the top 500 of the Rules ranking is not in the top 500 of the Baseline ranking (two sets of 500 with J=0.75 share about 429 members)
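The top-N comparison can be sketched in Python as below. The function names and toy rankings are hypothetical. The last helper inverts the coefficient under the assumption that both top-N sets contain exactly N structures: if they share k members, then J = k / (2N - k), so k = 2NJ / (1 + J).

```python
def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two collections."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 1.0  # convention chosen here: two empty sets count as identical
    return len(a & b) / len(union)


def top_n_jaccard(ranking_a, ranking_b, n):
    """Jaccard coefficient of the top-n items of two ranked lists."""
    return jaccard(ranking_a[:n], ranking_b[:n])


def shared_count(n, j):
    """Shared members implied by Jaccard j between two sets of size n each."""
    return 2 * n * j / (1 + j)


# Made-up rankings of structure ids, best first.
baseline_rank = ["s1", "s2", "s3", "s4", "s5", "s6"]
rules_rank = ["s2", "s7", "s1", "s8", "s3", "s4"]
top3 = top_n_jaccard(baseline_rank, rules_rank, 3)  # 2 shared of 4 distinct
implied_500 = shared_count(500, 0.75)               # 3000 / 7
```

Under that equal-size assumption, J = 0.75 at N = 500 corresponds to roughly 429 shared structures, and J = 0.25 at N = 5 corresponds to exactly 2.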
