Reducing Noise in Labels and Features for a Real World Dataset

Reducing Noise in Labels and Features for a Real World Dataset: Application of NLP Corpus Annotation Methods
Rebecca J. Passonneau, Cynthia Rudin, Axinia Radeva, and Zhi An Liu
Center for Computational Learning Systems (CCLS), Columbia University

Motivation: Secondary Electrical Grid
A dense network of structures and cables provides power to NYC buildings
Structures at 2nd Ave & 83rd Street, Manhattan
• Manholes
• Service boxes
Serious event: manhole fire in the Village, April 2008
March 6, 2009 CICLING Reducing Noise in Labels & Features

Emergency Control System (ECS) Ticket
 1 MR. ROBERT TOBIA (718)555-5124 - SMOKING. COVER OFF. - RMKS:
 2 01/06/03 08:40 MDETHUILOT DISPATCHED BY 55988
 3 01/06/03 09:30 MDETHUILOT ARRIVED BY 55988
 4 01/06/02 09:55 THUILOT REPORTS NO SMOKE ON ARRIVAL. THERE IS
 5 A SHUNT ON LOCATION - SHUNT & SERVICE NOT EFFECTED.
 . . .
 8 REQUESTING FLUSH/ORDERED (#2836).
 9 ******* NO PARKING : TUES. & FRIDAY, 11:30AM - 1PM ****** RV
10 01/06/03 10:45 THUILOT REPORTS BUILDING 260 W.139 ST.
11 COMPLAINED OF LIGHT PROBLEMS. FOUND 1-PHASE DOWN - BRIDGED
12 @ 10:30 ( 2-PHASE SERVICE ) CONSUMER IS CONTENT.
 . . .
18 01/06/03 18:45 FERNANDEZ REPORTS THAT IN SB-521117 F/O254
19 W139 ST. HE CUT OUT A 3W2W COPPERED JT & REPLACED IT W/
20 A 4W NEO CRAB....BY USING 1 LEG OFF THE 7W FROM THE HE
21 WAS ABLE TO PUSH THE MISSING PHASE BACK TO 260, BRIDGE
22 REMOVED....@ THIS TIME FERNANDEZ REPORTS THERE ARE MORE
23 B/O'S & 2 MORE JTS TO C/O, WILL F/U W/ MORE INFO....TCP

Outline
• NLP/IE versus real world problem and data
• Ranking problem: structure vulnerability
• ECS ticket classification problem
  – Relation to labels on structures
  – Relation to feature representation of structures
• Annotation task: can humans classify tickets?
• Results
  – Overall noise reduction
  – Improvements to top of list
• Importance of knowledge transfer paths for ML

Typical Impasse
• A real world “database” has free text fields that could provide new relations in a relational database (RDB)
  03/06/09 SMH S/W/C BROAD & MAIN FITZSIMMONS REPORTS THE TBL HOLE IS SB-00001 FOUND ON ....SMOKING LIGHTY
• Institutional owner gives the database to an NLP group for data mining
  – Abysmal gap in domain knowledge

NIST 2007 ACE (Automatic Content Extraction)
Results in max/avg value score (roughly, accuracy)
• Entity mentions (5 sites participating; 7 major entity types, e.g., geopolitical, facility, org):
  – Broadcast news: 65.9/52.7
  – Newswire: 58.1/44.0
  – Telephone: 49.2/35.5
  – Usenet: 44.0/31.4
• Events (1 site; 8 major event types, e.g., business, meeting, conflict):
  – Broadcast news: 12.9
  – Newswire: 15.9
  – Telephone: 6.6
  – Usenet: 11.3

CCLS/Consolidated Edison Collaboration
• Idea(lization):
  – Help reduce serious events in the secondary electrical grid
  – Use 10 years of Emergency Control System (ECS) trouble ticket data (plus other data sources)
    • A succession of automated/free-text entries in one ticket
    • A procedure for assigning a “trouble type” to each ticket
  – Rank vulnerability of structures to “serious events”
• Reality:
  – Data dump of very noisy data
  – No operational definition of “serious”

Related Work
• Devaney & Ram, 2005: case-based reasoning
  – 10,000 maintenance logs, machine X
  – Unsupervised text clustering, OWL/RDF domain model
• Liddy et al., 2006: sublanguage analysis
  – ECS trouble tickets 1995-2005: 70K train, 7K test, 100 eval
  – Reclassification of MSE (misc) trouble type tickets into two trouble types, SMH and WL
• Oza et al., in press: similar gap in domain knowledge for a complex domain
  – 800,000 reports from aeronautics db
  – SVM and non-negative matrix factorization on BOW document representation for topical classification (similar to LSA)

Scope of Structure Ranking Problem
• Number of structures in Manhattan: 51,912
• ECS tickets for Manhattan
  – Relevant Trouble Types (N=21): 61,730
  – Number of structures in tickets: 27,235 (44%)
• Number of “serious” events per year depends on the definition
  – Fires and explosions (MHX, MHF, MHO): ~150 (0.6% of structures in ECS)
  – Other events, e.g., smoking manholes: ~470 (1.8% of structures in ECS)

Learning Approach to Structure Ranking
• Formulated as a supervised bipartite ranking problem
  – A real-valued score is assigned to each structure
  – Goal is to rank positively labeled examples above negatively labeled examples
• Learning algorithm
  – Maximizes a weighted version of the AUC
  – Here we used SVM-perf (Joachims, T., 2005)
  – We have also used P-Norm Push (Rudin, C., 2008; a generalization of RankBoost)
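
The bipartite ranking goal above can be made concrete with the plain (unweighted) AUC: the fraction of positive/negative pairs in which the positive example gets the higher score. This is only a minimal sketch; the weighted variant the slide refers to, and the SVM-perf and P-Norm Push learners, are not reproduced here. The function name and the toy scores are assumptions for illustration.

```python
def pairwise_auc(scores, labels):
    """Unweighted AUC: share of (positive, negative) pairs ranked correctly.

    scores: real-valued score per structure (higher = more vulnerable).
    labels: 1 for positively labeled structures, 0 for negatively labeled ones.
    Ties between a positive and a negative count as half a correct pair.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    correct = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                correct += 1.0
            elif p == q:
                correct += 0.5
    return correct / (len(pos) * len(neg))


# Toy example (made-up scores): one of the four pairs is misordered.
toy_auc = pairwise_auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
```

A perfect ranking gives 1.0 and a random one about 0.5, so the AUC scores reported in the results (in the 65 to 68 range, as percentages) sit between the two.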

Event Classification: Labels and Features
Depends on defining “serious event”
• Label structures: Did s_i have a serious event in Y_j?
• Identify small number of explanatory features
  – Four ECS-based features affect the top of the list
    • Did s_i have a serious event recently (> (Y_j - 3) & < Y_j)?
    • How many recent tickets mention s_i?
    • Did s_i have a serious event in the past (> 1996 & < Y_j)?
    • How many past tickets mention s_i?
  – One cable density feature affects the rest of the list
• Train on 2005, test on 2006, evaluate on 2007
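
A minimal sketch of how the four ECS-based features could be computed for one structure and one target year. The `Ticket` record, its field names, and the function name are hypothetical; the year windows are taken literally from the slide, so the "past" window as written also covers the "recent" one (whether that overlap was intended is not stated). The cable density feature is omitted.

```python
from typing import NamedTuple


class Ticket(NamedTuple):
    # Hypothetical record: one classified ECS ticket tied to one structure.
    structure_id: str
    year: int
    serious: bool


def ecs_features(tickets, structure_id, target_year, recent_span=3, first_year=1996):
    """Return the four ECS-based features for structure s_i and year Y_j."""
    mine = [t for t in tickets if t.structure_id == structure_id]
    # Recent window: Y_j - 3 < year < Y_j
    recent = [t for t in mine if target_year - recent_span < t.year < target_year]
    # Past window: 1996 < year < Y_j (literal reading; overlaps the recent window)
    past = [t for t in mine if first_year < t.year < target_year]
    return {
        "recent_serious": any(t.serious for t in recent),
        "recent_ticket_count": len(recent),
        "past_serious": any(t.serious for t in past),
        "past_ticket_count": len(past),
    }


# Made-up tickets for illustration only.
example_tickets = [
    Ticket("SB-1", 2004, True),
    Ticket("SB-1", 2000, False),
    Ticket("SB-1", 1996, True),   # outside the past window (not > 1996)
    Ticket("SB-2", 2004, True),   # different structure
    Ticket("SB-1", 2005, True),   # the target year itself is excluded
]
example_features = ecs_features(example_tickets, "SB-1", 2005)
```

Tickets from the target year are excluded from the features because that year supplies the label.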

Baseline Event Classification
• Length constraint: at least 3 free-text lines
• Not all tickets correspond to distinct events (referred tickets; no work performed; non-secondary)
• ECS Ticket Trouble Types (N=21)
  – MHX/MHF/MHO: good indicator event is serious
  – SMH: moderate indicator event is serious
  – ACB: good indicator event is not serious
  – 16 other trouble types: generally not serious
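
One way to read the baseline is as a lookup on trouble type plus the length constraint. The sketch below is an assumption, not the authors' procedure: it treats MHX/MHF/MHO and SMH as serious, every other trouble type as a potential precursor, and tickets with fewer than 3 free-text lines as excluded. The slide does not give the exact rules (for instance, how referred or non-secondary tickets are detected), and the names here are hypothetical.

```python
# Assumed mapping: trouble types the slide calls good or moderate
# indicators of a serious event.
SERIOUS_TYPES = {"MHX", "MHF", "MHO", "SMH"}
MIN_FREE_TEXT_LINES = 3  # length constraint from the slide


def baseline_class(trouble_type, free_text_lines):
    """Classify a ticket from its trouble type alone (hypothetical baseline)."""
    if free_text_lines < MIN_FREE_TEXT_LINES:
        return "exclude"
    if trouble_type in SERIOUS_TYPES:
        return "serious"
    # ACB and the 16 other trouble types: generally not serious.
    return "precursor"
```

For example, `baseline_class("SMH", 10)` yields `"serious"` under these assumptions, while `baseline_class("ACB", 10)` yields `"precursor"`.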

ECS Tickets
• Enormous length variation: 1-522 lines
• Varying proportion of free text lines: 0-69%
• Fragmentary and telegraphic language
• Specialized terminology (sublanguage)
  – CRAB, C&R, TROUBLE HOLE, FLUSH
• Intra-word line breaks: AFFECTE/ D
• Misspellings inflate vocabulary size
  – Before normalization: ~57K unigram types
  – After normalization: ~22K unigram types
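
To illustrate why normalization shrinks the vocabulary, here is a toy sketch that rejoins intra-word line breaks and maps known misspellings before counting unigram types. The slides do not describe the actual normalization procedure; the joining heuristic, the small vocabulary, and the spelling map below are all assumptions made for the example.

```python
def join_broken_words(lines, vocab):
    """Rejoin a word split across a line break (e.g. AFFECTE / D).

    Heuristic (assumed): merge the last token of one line with the first
    token of the next when the merged form is a known word and at least
    one of the two pieces is not.
    """
    tokens = []
    for line in lines:
        words = line.split()
        if tokens and words:
            joined = tokens[-1] + words[0]
            if joined in vocab and (tokens[-1] not in vocab or words[0] not in vocab):
                tokens[-1] = joined
                words = words[1:]
        tokens.extend(words)
    return tokens


def normalize(tokens, spelling_map):
    """Replace known misspellings with their canonical form."""
    return [spelling_map.get(t, t) for t in tokens]


# Made-up ticket fragment, vocabulary, and spelling map.
lines = ["SHUNT & SERVICE NOT AFFECTE", "D SMOKING LIGHTY", "SMOKING LIGHTLY"]
vocab = {"SHUNT", "SERVICE", "NOT", "AFFECTED", "SMOKING", "LIGHTLY"}
spelling_map = {"LIGHTY": "LIGHTLY"}

raw_types = set(" ".join(lines).split())
clean_types = set(normalize(join_broken_words(lines, vocab), spelling_map))
```

In this toy fragment the type count drops from 9 to 7, the same kind of reduction (at a much smaller scale) as the ~57K to ~22K figures above.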

Human Annotation Task
To acquire an extensional definition of “serious”
• Data: 171 ECS tickets; text only, no access to trouble type etc.
• Annotators: 2 domain experts
• Task: sort tickets into one of three classes
  1. Serious event
  2. Potential precursor event
  3. Exclude as irrelevant (e.g., not secondary; not an event)

Experts versus Baseline
• Kappa agreement coefficient results
  – Ranges from 1 (perfect agreement) to 0 (random) to -1 (perfect disagreement)
  – Experts with baseline (3-way kappa): 0.25
  – Experts with each other: 0.49
• Trouble type does not correspond to expert judgment
• Experts have moderate agreement: subjective
• Difficult prediction problem
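
For reference, a minimal sketch of a two-annotator kappa, assuming Cohen's kappa (the slides do not say which kappa variant was used, and the 3-way comparison with the baseline would need a multi-rater coefficient that is not shown here). The function name and the labels in the example are made up.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: (observed - expected agreement) / (1 - expected agreement)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement from each annotator's own label distribution.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Made-up labels: S = serious, P = precursor.
example_kappa = cohen_kappa(["S", "S", "P", "P"], ["S", "P", "P", "P"])
```

In the toy example the annotators agree on 3 of 4 tickets, but half of that agreement is expected by chance, which is why kappa is lower than raw agreement.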

Expert vs. Baseline, Annotated Tickets
Ticket Type   Non-Event           Precursor           Serious             Expert
              Baseline  Experts   Baseline  Experts   Baseline  Experts   Disagree
ACB                  0        0         21       16          0        3          2
MHX/F/O              0        2          1        0          9        7          0
SMH                  0        3          0        7         27       15          2
Other                8       17        106       58          0        4         35
Totals               8       22        128       81         36       29         39

Expert vs. Baseline, All Tickets
Ticket Type   Precursor           Serious
              Baseline    Rules   Baseline    Rules
ACB              6,171    5,364        192      162
MHX/F/O              0       25      1,785    1,481
SMH                  0    1,105      4,906    3,397
Other           25,776   16,978         81       75
Totals          31,947   23,472      6,964    5,115
• Baseline Precursor + Serious = 38,911
• Rules Precursor + Serious = 28,587

Results: AUC scores
              TRAIN    TEST
Baseline      67.63   65.01
Rule-based    68.29   67.55
• Best improvement on Test Set (2006)
• Obscures what changed:
  – Many large demotions of structures that are not so vulnerable (e.g., 759/52K to 3105/52K)
  – Side-effect: small promotions of vulnerable structures (e.g., 69/52K to 45/52K)

Changes to Top of Ranked List
• Jaccard coefficient finds the “similarity” of two sets: J(A, B) = |A ∩ B| / |A ∪ B| (range is 0 to 1, with J=1 when A=B)
• For N=5 to 1000, compare the top N structures of the ranked list from the baseline classification of events versus the rule-based classification
N         5     10    15    20    100   500   1000
Jaccard   0.25  0.33  0.50  0.60  0.72  0.75  0.81
• E.g., for two top-N lists sharing k structures, J = k / (2N - k); at N = 500, J = 0.75 gives k of about 429, so roughly 70 structures (about one in seven) in the top 500 of the Rules ranking are not in the top 500 of the Baseline ranking
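
The top-N comparison can be sketched as follows: take the first N structure IDs from each ranked list and compute the Jaccard coefficient of the two sets. The function names and the toy rankings are assumptions for illustration; they are not the actual structure rankings.

```python
def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two collections."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 1.0  # convention chosen here: two empty sets are identical
    return len(a & b) / len(union)


def top_n_jaccard(ranking_a, ranking_b, n):
    """Compare the top-n entries of two ranked lists of structure IDs."""
    return jaccard(ranking_a[:n], ranking_b[:n])


# Made-up rankings of five hypothetical structures.
baseline_ranking = ["s1", "s2", "s3", "s4"]
rules_ranking = ["s2", "s1", "s5", "s3"]
top3 = top_n_jaccard(baseline_ranking, rules_ranking, 3)
```

Order within the top N does not matter for this measure; only membership does, which is why the top-2 sets in the toy example are identical even though their order is swapped.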
