  1. Natural Language Processing Info 159/259 
 Lecture 5: Truth and ethics (Sept 6, 2018) David Bamman, UC Berkeley

  2. Hwæt! Wé Gárdena in géardagum, þéodcyninga þrym gefrúnon, hú ðá æþelingas ellen fremedon. Oft Scyld Scéfing sceaþena
 Natural Language Processing Info 159/259
 Lecture 5: Truth and ethics (Sept 6, 2018) David Bamman, UC Berkeley

  3. [Figure: a window-based network over the tokens "I hated it I really hated it" (x1 … x7), with one hidden unit per window of three tokens: h1 = f(I, hated, it), h2 = f(it, I, really), h3 = f(really, hated, it)]
 h1 = σ(x1 W1 + x2 W2 + x3 W3)
 h2 = σ(x3 W1 + x4 W2 + x5 W3)
 h3 = σ(x5 W1 + x6 W2 + x7 W3)

  4. Convolutional networks
 [Figure: the same inputs x1 … x7 passed through a convolution layer and then max pooling; the shared window weights define one filter.]
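 To make the two figures above concrete, here is a minimal numpy sketch of one filter applied over windows of three tokens, followed by max pooling, in the h_i = σ(x_i W1 + x_{i+1} W2 + x_{i+2} W3) form from slide 3 (the stride of two follows the slide's windows). The input values and weights are made up for illustration; in a real model each x_i would be a word embedding vector rather than a scalar.

 import numpy as np

 def sigmoid(z):
     return 1.0 / (1.0 + np.exp(-z))

 # Toy scalar inputs x1..x7 and one filter's window weights [W1, W2, W3];
 # all values are illustrative, not taken from the slide.
 x = np.array([0.5, -1.0, 0.3, 0.5, 2.0, -1.0, 0.3])
 W = np.array([1.4, -0.7, 9.2])

 # Convolution with window size 3 and stride 2, as in the figure:
 # h1 = sigmoid(x1*W1 + x2*W2 + x3*W3), h2 = sigmoid(x3*W1 + x4*W2 + x5*W3), ...
 h = np.array([sigmoid(x[i:i+3] @ W) for i in range(0, len(x) - 2, 2)])

 # Max pooling collapses the filter's outputs into a single feature for the document.
 feature = h.max()
 print(h, feature)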

  5. Zhang and Wallace 2016, “A Sensitivity Analysis of (and Practitioners’ Guide to) Convolutional Neural Networks for Sentence Classification”

  6. Modern NLP is driven by annotated data • Penn Treebank (1993; 1995; 1999): morphosyntactic annotations of WSJ • OntoNotes (2007–2013): syntax, predicate-argument structure, word sense, coreference • FrameNet (1998–): frame-semantic lexica/annotations • MPQA (2005): opinion/sentiment • SQuAD (2016): annotated questions + spans of answers in Wikipedia

  7. Modern NLP is driven by annotated data • In most cases, the data we have is the product of human judgments. • What’s the correct part of speech tag? • Syntactic structure? • Sentiment?

  8. Ambiguity “One morning I shot 
 an elephant in my pajamas” Animal Crackers

  9. Dogmatism Fast and Horvitz (2016), “Identifying Dogmatism in Social Media: Signals and Models”

  10. Sarcasm https://www.nytimes.com/2016/08/12/opinion/an-even-stranger-donald-trump.html?ref=opinion

  11. Fake News http://www.fakenewschallenge.org

  12. Annotation pipeline Pustejovsky and Stubbs (2012), 
 Natural Language Annotation for Machine Learning

  13. Annotation pipeline Pustejovsky and Stubbs (2012), 
 Natural Language Annotation for Machine Learning

  14. Annotation Guidelines • Our goal: given the constraints of our problem, how can we formalize our description of the annotation process to encourage multiple annotators to provide the same judgment?

  15. Annotation guidelines • What is the goal of the project? • What is each tag called and how is it used? (Be specific: provide examples, and discuss gray areas.) • What parts of the text do you want annotated, and what should be left alone? • How will the annotation be created? (For example, explain which tags or documents to annotate first, how to use the annotation tools, etc.) Pustejovsky and Stubbs (2012), Natural Language Annotation for Machine Learning

  16. Practicalities • Annotation takes time, concentration (can’t do it 8 hours a day) • Annotators get better as they annotate (earlier annotations not as good as later ones)

  17. Why not do it yourself? • Expensive/time-consuming • Multiple people provide a measure of consistency: is the task well enough defined? • Low agreement can mean the annotators need more training, the guidelines are not well enough defined, or the task itself is ill-posed

  18. Adjudication • Adjudication is the process of deciding on a single annotation for a piece of text, using information about the independent annotations. • Can be as time-consuming as (or more so than) the primary annotation. • Does not need to be identical to either primary annotation (both annotators can be wrong by chance)

  19. Interannotator agreement

                             annotator A
                             puppy    fried chicken
 annotator B  puppy            6           3
              fried chicken    2           5

 observed agreement = (6 + 5)/16 = 11/16 = 68.75%
 https://twitter.com/teenybiscuit/status/705232709220769792/photo/1

  20. Cohen’s kappa • If classes are imbalanced, we can get high interannotator agreement simply by chance

                             annotator A
                             puppy    fried chicken
 annotator B  puppy            7           4
              fried chicken    8          81

  21. Cohen’s kappa • If classes are imbalanced, we can get high interannotator agreement simply by chance

                             annotator A
                             puppy    fried chicken
 annotator B  puppy            7           4
              fried chicken    8          81

 κ = (p_o − p_e) / (1 − p_e) = (0.88 − p_e) / (1 − p_e)

  22. Cohen’s kappa • Expected probability of agreement is how often we would expect two annotators to agree, assuming independent annotations:

 p_e = P(A = puppy, B = puppy) + P(A = chicken, B = chicken)
     = P(A = puppy) P(B = puppy) + P(A = chicken) P(B = chicken)

  23. Cohen’s kappa

                             annotator A
                             puppy    fried chicken
 annotator B  puppy            7           4
              fried chicken    8          81

 P(A = puppy)   = 15/100 = 0.15      P(B = puppy)   = 11/100 = 0.11
 P(A = chicken) = 85/100 = 0.85      P(B = chicken) = 89/100 = 0.89

 p_e = P(A = puppy) P(B = puppy) + P(A = chicken) P(B = chicken)
     = 0.15 × 0.11 + 0.85 × 0.89
     = 0.773

  24. Cohen’s kappa • If classes are imbalanced, we can get high interannotator agreement simply by chance

                             annotator A
                             puppy    fried chicken
 annotator B  puppy            7           4
              fried chicken    8          81

 κ = (p_o − p_e) / (1 − p_e) = (0.88 − 0.773) / (1 − 0.773) = 0.471

  25. Cohen’s kappa • “Good” values are subject to interpretation, but a rule of thumb:

 0.80-1.00   Very good agreement
 0.60-0.80   Good agreement
 0.40-0.60   Moderate agreement
 0.20-0.40   Fair agreement
 < 0.20      Poor agreement

  26.
                             annotator A
                             puppy    fried chicken
 annotator B  puppy            0           0
              fried chicken    0         100

  27.
                             annotator A
                             puppy    fried chicken
 annotator B  puppy           50           0
              fried chicken    0          50

  28.
                             annotator A
                             puppy    fried chicken
 annotator B  puppy            0          50
              fried chicken   50           0
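 As a check on the arithmetic in slides 20–24, here is a small Python sketch of Cohen’s kappa computed from a confusion matrix of counts; the helper name is mine, not from the slides. It reproduces κ ≈ 0.471 for the puppy/fried-chicken table and can also be applied to the three matrices on slides 26–28.

 import numpy as np

 def cohens_kappa(confusion):
     # Cohen's kappa from a square confusion matrix of counts
     # (rows = annotator B's labels, columns = annotator A's labels).
     confusion = np.asarray(confusion, dtype=float)
     total = confusion.sum()
     p_o = np.trace(confusion) / total        # observed agreement
     p_a = confusion.sum(axis=0) / total      # annotator A's label distribution (column marginals)
     p_b = confusion.sum(axis=1) / total      # annotator B's label distribution (row marginals)
     p_e = (p_a * p_b).sum()                  # expected agreement by chance
     return (p_o - p_e) / (1 - p_e)

 # Table from slides 20-24 (rows: B = puppy, fried chicken; columns: A = puppy, fried chicken)
 print(cohens_kappa([[7, 4],
                     [8, 81]]))   # ~0.471, even though raw agreement is 0.88

 # Matrices from slides 26-28:
 # [[0, 0], [0, 100]] -> 0/0, undefined (both annotators always say "fried chicken")
 # [[50, 0], [0, 50]] -> 1.0 (perfect agreement on balanced classes)
 # [[0, 50], [50, 0]] -> -1.0 (complete disagreement)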

  29. Interannotator agreement • Cohen’s kappa can be used for any number of classes. • Still requires two annotators who evaluate the same items. • Fleiss’ kappa generalizes to multiple annotators, each of whom may evaluate different items (e.g., crowdsourcing)

  30. Fleiss’ kappa • Same fundamental idea of measuring the observed agreement compared to the agreement we would expect by chance:

 κ = (P_o − P_e) / (1 − P_e)

 • With N > 2 annotators, we calculate agreement among pairs of annotators

  31. Fleiss’ kappa

 n_ij = number of annotators who assign category j to item i

 P_i = 1/(n(n−1)) · Σ_{j=1}^{K} n_ij (n_ij − 1)

 For item i with n annotations, how many annotators agree, among all n(n−1) possible pairs

  32. Fleiss’ kappa

 P_i = 1/(n(n−1)) · Σ_{j=1}^{K} n_ij (n_ij − 1)
 For item i with n annotations, how many annotators agree, among all n(n−1) possible pairs

 Example: four annotators label one item — A: +, B: +, C: +, D: −
 n_ij for this item: 3 annotators choose “+”, 1 annotator chooses “−”
 Agreeing pairs of annotators: A-B, B-A, A-C, C-A, B-C, C-B
 P_i = 1/(4(3)) · (3(2) + 1(0)) = 6/12 = 0.5

  33. Fleiss’ kappa

 P_o = (1/N) Σ_{i=1}^{N} P_i         Average agreement among all items

 p_j = (1/(Nn)) Σ_{i=1}^{N} n_ij     Probability of category j

 P_e = Σ_{j=1}^{K} p_j²              Expected agreement by chance: the joint probability that two raters pick the same label is the product of their independent probabilities of picking that label
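 A sketch of the full computation in Python, assuming the usual item-by-category count matrix n_ij with the same number of annotators per item; the function name and the multi-item example matrix are mine, while the single-item check matches slide 32.

 import numpy as np

 def fleiss_kappa(n_ij):
     # Fleiss' kappa from an (items x categories) matrix of counts,
     # where n_ij[i, j] = number of annotators assigning category j to item i.
     n_ij = np.asarray(n_ij, dtype=float)
     N = n_ij.shape[0]                   # number of items
     n = n_ij[0].sum()                   # annotators per item (assumed constant here)

     # P_i: for each item, the share of the n(n-1) ordered annotator pairs that agree
     P_i = (n_ij * (n_ij - 1)).sum(axis=1) / (n * (n - 1))
     P_o = P_i.mean()                    # average agreement over all items

     p_j = n_ij.sum(axis=0) / (N * n)    # probability of each category
     P_e = (p_j ** 2).sum()              # expected agreement by chance

     return (P_o - P_e) / (1 - P_e)

 # Single item from slide 32 (A, B, C say "+", D says "-"): P_i = (3*2 + 1*0)/(4*3) = 0.5.
 # Kappa itself is computed over many items, e.g. four items labeled by four annotators:
 items = [[3, 1],
          [4, 0],
          [2, 2],
          [0, 4]]
 print(fleiss_kappa(items))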

  34. Annotator bias correction
 • Dawid and Skene (1979), “Maximum Likelihood Estimation of Observer Error-Rates Using the EM Algorithm,” Journal of the Royal Statistical Society, 28(1):20–28
 • Wiebe et al. (1999), “Development and Use of a Gold-Standard Data Set for Subjectivity Classifications,” ACL (for sentiment)
 • Carpenter (2010), “Multilevel Bayesian Models of Categorical Data Annotation”
 • Snow, O’Connor, Jurafsky, and Ng (2008), “Cheap and Fast - But is it Good? Evaluating Non-Expert Annotations for Natural Language Tasks,” EMNLP
 • Sheng et al. (2008), “Get Another Label? Improving Data Quality and Data Mining Using Multiple, Noisy Labelers,” KDD
 • Raykar et al. (2009), “Supervised Learning from Multiple Experts: Whom to Trust When Everyone Lies a Bit,” ICML
 • Hovy et al. (2013), “Learning Whom to Trust with MACE,” NAACL

  35. Annotator bias correction

 P(label | truth) confusion matrix for a single annotator (David):

                          annotator label
 truth        positive   negative   mixed   unknown
 positive       0.95       0          0.03    0.02
 negative       0          0.80       0.10    0.10
 mixed          0.20       0.05       0.50    0.25
 unknown        0.15       0.10       0.10    0.70

  36. Annotator bias correction (Dawid and Skene 1979)
 • Basic idea: the true label is unobserved; what we observe are noisy judgments by annotators, each filtered through that annotator’s confusion matrix P(label | truth).
 [Figure: a distribution over the unobserved true label, an annotator confusion matrix P(label | truth), and the resulting distribution over observed labels]
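 A compressed Python sketch of the Dawid and Skene (1979) EM idea just described — latent true labels, one confusion matrix per annotator — written as a simplified illustration under the assumption that every annotator labels every item; it is not their exact implementation, and the toy data at the end is made up.

 import numpy as np

 def dawid_skene(labels, n_classes, n_iter=50):
     # labels: (items x annotators) integer array of observed labels.
     # Returns a posterior over each item's true label and one confusion
     # matrix P(label | truth) per annotator.
     n_items, n_annot = labels.shape

     # Initialize the posterior over true labels from (soft) majority votes.
     post = np.zeros((n_items, n_classes))
     for i in range(n_items):
         post[i] = np.bincount(labels[i], minlength=n_classes)
     post /= post.sum(axis=1, keepdims=True)

     for _ in range(n_iter):
         # M-step: class prior and each annotator's confusion matrix,
         # estimated from the current soft assignments of true labels.
         prior = post.mean(axis=0) + 1e-6
         conf = np.full((n_annot, n_classes, n_classes), 1e-6)
         for a in range(n_annot):
             for i in range(n_items):
                 conf[a, :, labels[i, a]] += post[i]
         conf /= conf.sum(axis=2, keepdims=True)

         # E-step: recompute the posterior over each item's true label.
         log_post = np.log(prior) + np.zeros((n_items, n_classes))
         for a in range(n_annot):
             log_post += np.log(conf[a, :, labels[:, a]])
         post = np.exp(log_post - log_post.max(axis=1, keepdims=True))
         post /= post.sum(axis=1, keepdims=True)

     return post, conf

 # Toy usage: 5 items, 3 annotators, binary labels (values made up for illustration)
 labels = np.array([[0, 0, 1],
                    [1, 1, 1],
                    [0, 0, 0],
                    [1, 0, 1],
                    [0, 1, 0]])
 posterior, confusion = dawid_skene(labels, n_classes=2)
 print(posterior.argmax(axis=1))   # inferred "true" labels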

  37. Evaluation • A critical part of developing new algorithms and methods and demonstrating that they work

  38. Classification
 A mapping h from input data x (drawn from instance space 𝒳) to a label (or labels) y from some enumerable output space 𝒴
 𝒳 = set of all documents
 𝒴 = {english, mandarin, greek, …}
 x = a single document
 y = ancient greek

  39. [Figure: the instance space 𝒳, partitioned into train, dev, and test sets]

  40. Experiment design

              training           development        testing
 size         80%                10%                10%
 purpose      training models    model selection    evaluation; never look at it until the very end
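 A minimal sketch of this split, assuming a shuffled list of labeled documents; the 80/10/10 proportions follow the slide, while the function name and example variable are mine.

 import random

 def train_dev_test_split(data, seed=0):
     # 80% / 10% / 10% split, as on the slide; the test split is held out
     # and only looked at once, at the very end.
     data = list(data)
     random.Random(seed).shuffle(data)
     n = len(data)
     n_train, n_dev = int(0.8 * n), int(0.1 * n)
     return data[:n_train], data[n_train:n_train + n_dev], data[n_train + n_dev:]

 # Example: documents = [("doc text", "label"), ...]
 # train, dev, test = train_dev_test_split(documents)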
