Reducing Noise in Labels and Features for a Real World Dataset: Application of NLP Corpus Annotation Methods
Rebecca J. Passonneau, Cynthia Rudin, Axinia Radeva, and Zhi An Liu
Center for Computational Learning Systems (CCLS), Columbia University
Motivation: Secondary Electrical Grid
A dense network of structures and cables provides power to NYC buildings
Structures at 2nd Ave & 83rd Street, Manhattan
• Manholes
• Service boxes
Serious event: manhole fire in the Village, April 2008
March 6, 2009 CICLING Reducing Noise in Labels & Features 2
Emergency Control System (ECS) Ticket
1 MR. ROBERT TOBIA (718)555-5124 - SMOKING. COVER OFF. - RMKS:
2 01/06/03 08:40 MDETHUILOT DISPATCHED BY 55988
3 01/06/03 09:30 MDETHUILOT ARRIVED BY 55988
4 01/06/02 09:55 THUILOT REPORTS NO SMOKE ON ARRIVAL. THERE IS
5 A SHUNT ON LOCATION - SHUNT & SERVICE NOT EFFECTED.
. . .
8 REQUESTING FLUSH/ORDERED (#2836).
9 ******* NO PARKING : TUES. & FRIDAY, 11:30AM - 1PM ****** RV
10 01/06/03 10:45 THUILOT REPORTS BUILDING 260 W.139 ST.
11 COMPLAINED OF LIGHT PROBLEMS. FOUND 1-PHASE DOWN - BRIDGED
12 @ 10:30 ( 2-PHASE SERVICE ) CONSUMER IS CONTENT.
. . .
18 01/06/03 18:45 FERNANDEZ REPORTS THAT IN SB-521117 F/O254
19 W139 ST. HE CUT OUT A 3W2W COPPERED JT & REPLACED IT W/
20 A 4W NEO CRAB....BY USING 1 LEG OFF THE 7W FROM THE HE
21 WAS ABLE TO PUSH THE MISSING PHASE BACK TO 260, BRIDGE
22 REMOVED....@ THIS TIME FERNANDEZ REPORTS THERE ARE MORE
23 B/O'S & 2 MORE JTS TO C/O, WILL F/U W/ MORE INFO....TCP
Outline
• NLP/IE versus real world problem and data
• Ranking problem: structure vulnerability
• ECS ticket classification problem
  – Relation to labels on structures
  – Relation to feature representation of structures
• Annotation task: can humans classify tickets?
• Results
  – Overall noise reduction
  – Improvements to top of list
• Importance of knowledge transfer paths for ML
Typical Impasse
• A real world “database” has free text fields that could provide new relations in an RDB
  03/06/09 SMH S/W/C BROAD & MAIN FITZSIMMONS REPORTS THE TBL HOLE IS SB-00001 FOUND ON ....SMOKING LIGHTY
• Institutional owner gives DB to NLP group for data mining
  – Abysmal gap in domain knowledge
NIST 2007 ACE (Automatic Content Extraction)
Results in max/avg value score (roughly, accuracy)
• Entity mentions (5 sites participating; 7 major entity types, e.g., geopolitical, facility, org):
  – Broadcast news: 65.9/52.7
  – Newswire: 58.1/44.0
  – Telephone: 49.2/35.5
  – Usenet: 44.0/31.4
• Events (1 site; 8 major event types, e.g., business, meeting, conflict):
  – Broadcast news: 12.9
  – Newswire: 15.9
  – Telephone: 6.6
  – Usenet: 11.3
CCLS/Consolidated Edison Collaboration
• Idea(lization):
  – Help reduce serious events in the secondary electrical grid
  – Use 10 years of Emergency Control System (ECS) trouble ticket data (plus other data sources)
    • A succession of automated/free-text entries in one ticket
    • A procedure for assigning a “trouble type” to each ticket
  – Rank vulnerability of structures to “serious events”
• Reality:
  – Data dump of very noisy data
  – No operational definition of “serious”
Related Work
• Devaney & Ram, 2005: case-based reasoning
  – 10,000 maintenance logs, machine X
  – Unsupervised text clustering, OWL/RDF domain model
• Liddy et al., 2006: sublanguage analysis
  – ECS trouble tickets 1995-2005: 70K train, 7K test, 100 eval
  – Reclassification of MSE (misc) trouble type tickets into two trouble types, SMH and WL
• Oza et al., in press: similar gap in domain knowledge for a complex domain
  – 800,000 reports from aeronautics DB
  – SVM and non-negative matrix factorization on BOW document representation for topical classification (similar to LSA)
Scope of Structure Ranking Problem
• Number of structures in Manhattan: 51,912
• ECS tickets for Manhattan
  – Relevant Trouble Types (N=21): 61,730
  – Number of structures in tickets: 27,235 (44%)
• Number of “serious” events per year depends on the definition
  – Fires and explosions (MHX, MHF, MHO): ~150 (0.6% of structures in ECS)
  – Other events, e.g., smoking manholes: ~470 (1.8% of structures in ECS)
Learning Approach to Structure Ranking
• Formulated as a supervised bipartite ranking problem
  – A real-valued score is assigned to each structure
  – Goal is to rank positively-labeled examples above negatively-labeled examples
• Learning algorithm
  – Maximizes a weighted version of the AUC
  – Here we used SVM-perf (Joachims, T., 2005)
  – We have also used P-Norm Push (Rudin, C., 2008; a generalization of RankBoost)
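The bipartite ranking goal can be illustrated with the plain (unweighted) AUC: the fraction of positive/negative pairs in which the positive example gets the higher score. This is only a sketch. It is not the weighted AUC variant, SVM-perf, or P-Norm Push, and the function name and toy scores are hypothetical.

```python
from typing import Sequence


def pairwise_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))


# Hypothetical toy data: four structures, two labeled positive.
toy_scores = [0.9, 0.8, 0.7, 0.1]
toy_labels = [1, 0, 1, 0]
toy_auc = pairwise_auc(toy_scores, toy_labels)  # 3 of 4 pairs are ordered correctly
```

The quadratic loop is fine for illustration; at the scale of ~52K structures a sort-based computation would be the usual choice.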
Event Classification: Labels and Features
Depends on defining “serious event”
• Label structures: Did s_i have a serious event in Y_j?
• Identify small number of explanatory features
  – Four ECS-based features affect the top of the list
    • Did s_i have a serious event recently (> (Y_j - 3) & < Y_j)?
    • How many recent tickets mention s_i?
    • Did s_i have a serious event in the past (> 1996 & < Y_j)?
    • How many past tickets mention s_i?
  – One cable density feature affects the rest of the list
• Train on 2005, test on 2006, evaluate on 2007
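The four ECS-based features might be computed along the following lines. This is a minimal sketch: the `Ticket` record and its field names are hypothetical, and the year windows are taken literally from the slide (strict inequalities, so the "past" window also covers the "recent" years).

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class Ticket:
    """Hypothetical, simplified view of an ECS ticket."""
    structure_id: str
    year: int
    serious: bool


def ecs_features(tickets: Iterable[Ticket], structure_id: str, year: int) -> Dict[str, int]:
    """Four ticket-based features for one structure s_i and one target year Y_j."""
    mine = [t for t in tickets if t.structure_id == structure_id]
    recent = [t for t in mine if year - 3 < t.year < year]  # > (Y_j - 3) & < Y_j
    past = [t for t in mine if 1996 < t.year < year]        # > 1996 & < Y_j
    return {
        "recent_serious": int(any(t.serious for t in recent)),
        "recent_tickets": len(recent),
        "past_serious": int(any(t.serious for t in past)),
        "past_tickets": len(past),
    }


# Hypothetical toy tickets for two service boxes.
toy_tickets = [
    Ticket("SB-1", 1999, False),
    Ticket("SB-1", 2004, True),
    Ticket("SB-1", 2005, True),
    Ticket("SB-2", 2004, True),
]
toy_features = ecs_features(toy_tickets, "SB-1", 2005)
```

Tickets from the target year itself are left out of the features, since that year is what the label describes.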
Baseline Event Classification
• Length constraint: At least 3 free-text lines
• Not all tickets correspond to distinct events (referred tickets; no work performed; non-secondary)
• ECS Ticket Trouble Types (N=21)
  – MHX/MHF/MHO: good indicator event is serious
  – SMH: moderate indicator event is serious
  – ACB: good indicator event is not serious
  – 16 other trouble types: generally not serious
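A trouble-type baseline of this kind could look roughly like the sketch below. The slides do not give the exact rule, so the function name, the label strings, and the choice to treat SMH as serious by default are all assumptions made for illustration.

```python
SERIOUS_TYPES = {"MHX", "MHF", "MHO"}  # good indicators of a serious event
MODERATE_TYPES = {"SMH"}               # moderate indicator of a serious event
MIN_FREE_TEXT_LINES = 3                # length constraint from the slide


def baseline_label(trouble_type: str, free_text_lines: int,
                   smh_is_serious: bool = True) -> str:
    """Hypothetical baseline: label a ticket from its trouble type alone."""
    if free_text_lines < MIN_FREE_TEXT_LINES:
        return "excluded"
    if trouble_type in SERIOUS_TYPES:
        return "serious"
    if trouble_type in MODERATE_TYPES and smh_is_serious:
        # Assumption: the moderate indicator is counted as serious.
        return "serious"
    # ACB and the 16 other trouble types: generally not serious.
    return "not serious"
```

The `smh_is_serious` switch makes the one genuinely uncertain decision explicit, so it can be flipped without touching the rest of the rule.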
ECS Tickets
• Enormous length variation: 1-522 lines
• Varying proportion of free text lines: 0-69%
• Fragmentary and telegraphic language
• Specialized terminology (sublanguage)
  – CRAB, C&R, TROUBLE HOLE, FLUSH
• Intra-word line breaks: AFFECTE/ D
• Misspellings inflate vocabulary size
  – Before normalization: ~57K unigram types
  – After normalization: ~22K unigram types
Human Annotation Task
To acquire an extensional definition of “serious”
• Data: 171 ECS tickets; text only, no access to trouble type, etc.
• Annotators: 2 domain experts
• Task: sort tickets into one of three classes
  1. Serious event
  2. Potential precursor event
  3. Exclude as irrelevant (e.g., not secondary; not an event)
Experts versus Baseline
• Kappa agreement coefficient results
  – Ranges from 1 (perfect agreement) to 0 (random) to -1 (perfect disagreement)
  – Experts with baseline (3-way kappa): 0.25
  – Experts with each other: 0.49
• Trouble type does not correspond to expert judgment
• Experts have moderate agreement: subjective
• Difficult prediction problem
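For two annotators, a chance-corrected agreement score of this kind can be computed as Cohen's kappa, sketched below. The slides do not say which kappa variant was used (the 3-way figure would need a multi-rater variant), so this only illustrates the two-rater case; the function name and toy labels are hypothetical.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohen_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa: (observed - expected agreement) / (1 - expected agreement)."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    counts_a, counts_b = Counter(a), Counter(b)
    # Chance agreement from each annotator's own label distribution.
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1.0 - expected)


# Hypothetical toy labels: S = serious, P = precursor.
expert_1 = ["S", "S", "P", "P"]
expert_2 = ["S", "P", "P", "P"]
toy_kappa = cohen_kappa(expert_1, expert_2)  # observed 0.75, expected 0.5
```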
Expert vs. Baseline, Annotated Tickets
Category:   | Non-Event          | Precursor          | Serious            | Expert
Ticket Type | Baseline | Experts | Baseline | Experts | Baseline | Experts | Disagree
ACB         |        0 |       0 |       21 |      16 |        0 |       3 |        2
MHX/F/O     |        0 |       2 |        1 |       0 |        9 |       7 |        0
SMH         |        0 |       3 |        0 |       7 |       27 |      15 |        2
Other       |        8 |      17 |      106 |      58 |        0 |       4 |       35
Totals      |        8 |      22 |      128 |      81 |       36 |      29 |       39
Expert vs. Baseline, All Tickets
Category:   | Precursor         | Serious
Ticket Type | Baseline | Rules  | Baseline | Rules
ACB         |    6,171 |  5,364 |      192 |   162
MHX/F/O     |        0 |     25 |    1,785 | 1,481
SMH         |        0 |  1,105 |    4,906 | 3,397
Other       |   25,776 | 16,978 |       81 |    75
Totals      |   31,947 | 23,472 |    6,964 | 5,115
• Baseline Precursor + Serious = 38,911
• Rules Precursor + Serious = 28,587
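The size of the reduction implied by the two totals can be checked with a few lines of arithmetic. The numbers are the table totals from this slide; the variable names are illustrative only.

```python
# Totals row of the table: Precursor and Serious counts per classification.
baseline = {"precursor": 31_947, "serious": 6_964}
rules = {"precursor": 23_472, "serious": 5_115}

baseline_total = sum(baseline.values())  # 38,911
rules_total = sum(rules.values())        # 28,587

# Share of baseline events that the rule-based classification drops.
reduction = (baseline_total - rules_total) / baseline_total
print(f"{baseline_total:,} -> {rules_total:,} ({reduction:.1%} fewer events)")
```

Under these totals the rule-based classification keeps roughly three quarters of the events that the baseline counted.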
Results: AUC scores
           | TRAIN | TEST
Baseline   | 67.63 | 65.01
Rule-based | 68.29 | 67.55
• Best improvement on Test Set (2006)
• Obscures what changed:
  – Many large demotions of structures that are not so vulnerable (e.g., 759/52K to 3105/52K)
  – Side-effect: small promotions of vulnerable structures (e.g., 69/52K to 45/52K)
Changes to Top of Ranked List
• Jaccard coefficient finds the “similarity” of two sets: J(A, B) = |A ∩ B| / |A ∪ B| (range is 0 to 1, with J = 1 when A = B)
• For N = 5 to 1000, compare the top N structures of the ranked list from the baseline classification of events versus the rule-based classification
N       | 5    | 10   | 15   | 20   | 100  | 500  | 1000
Jaccard | 0.25 | 0.33 | 0.50 | 0.60 | 0.72 | 0.75 | 0.81
• E.g., at N = 500 (J = 0.75) the two top-500 lists share about 2NJ/(1 + J) ≈ 429 structures, so about one in seven structures in the top 500 of the Rules ranking is not in the top 500 of the Baseline ranking
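The top-N comparison can be sketched as follows. The function names and the toy rankings are hypothetical, and returning 1.0 for two empty sets is a convention chosen here to match "J = 1 when A = B".

```python
from typing import Hashable, Sequence, Set


def jaccard(a: Set[Hashable], b: Set[Hashable]) -> float:
    """|A ∩ B| / |A ∪ B|, with two empty sets treated as identical."""
    union = a | b
    if not union:
        return 1.0
    return len(a & b) / len(union)


def top_n_jaccard(ranking_a: Sequence[Hashable],
                  ranking_b: Sequence[Hashable], n: int) -> float:
    """Jaccard similarity of the top-n items of two ranked lists."""
    return jaccard(set(ranking_a[:n]), set(ranking_b[:n]))


# Hypothetical toy rankings of structure IDs, most vulnerable first.
baseline_ranking = ["s1", "s2", "s3", "s4"]
rules_ranking = ["s2", "s1", "s5", "s3"]
toy_j3 = top_n_jaccard(baseline_ranking, rules_ranking, 3)  # 2 shared of 4 distinct
```

Because the measure compares sets, reordering within the top N does not change it; only membership does.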