  1. Natural Language Processing Info 159/259 
 Lecture 5: Truth and ethics (Sept 6, 2018) David Bamman, UC Berkeley

  2. Hwæt! Wé Gárdena in géardagum, þéodcyninga þrym gefrúnon, hú ðá æþelingas ellen fremedon. Oft Scyld Scéfing sceaþena
 Natural Language Processing Info 159/259
 Lecture 5: Truth and ethics (Sept 6, 2018) David Bamman, UC Berkeley

  3. [Figure: a window-based network over the tokens "I hated it I really hated it" (x1 … x7), with one hidden unit per window of three tokens: h1 = f(I, hated, it), h2 = f(it, I, really), h3 = f(really, hated, it)]
 h1 = σ(x1 W1 + x2 W2 + x3 W3)
 h2 = σ(x3 W1 + x4 W2 + x5 W3)
 h3 = σ(x5 W1 + x6 W2 + x7 W3)

  4. Convolutional networks
 [Figure: the same inputs x1 … x7 passed through a convolution layer and then max pooling; the shared window weights define one filter.]
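 To make the two figures above concrete, here is a minimal numpy sketch of one filter applied over windows of three tokens, followed by max pooling, in the h_i = σ(x_i W1 + x_{i+1} W2 + x_{i+2} W3) form from slide 3 (the stride of two follows the slide's windows). The input values and weights are made up for illustration; in a real model each x_i would be a word embedding vector rather than a scalar.

 import numpy as np

 def sigmoid(z):
     return 1.0 / (1.0 + np.exp(-z))

 # Toy scalar inputs x1..x7 and one filter's window weights [W1, W2, W3];
 # all values are illustrative, not taken from the slide.
 x = np.array([0.5, -1.0, 0.3, 0.5, 2.0, -1.0, 0.3])
 W = np.array([1.4, -0.7, 9.2])

 # Convolution with window size 3 and stride 2, as in the figure:
 # h1 = sigmoid(x1*W1 + x2*W2 + x3*W3), h2 = sigmoid(x3*W1 + x4*W2 + x5*W3), ...
 h = np.array([sigmoid(x[i:i+3] @ W) for i in range(0, len(x) - 2, 2)])

 # Max pooling collapses the filter's outputs into a single feature for the document.
 feature = h.max()
 print(h, feature)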

  5. Zhang and Wallace 2016, “A Sensitivity Analysis of (and Practitioners’ Guide to) Convolutional Neural Networks for Sentence Classification”

  6. Modern NLP is driven by annotated data • Penn Treebank (1993; 1995; 1999): morphosyntactic annotations of WSJ • OntoNotes (2007–2013): syntax, predicate-argument structure, word sense, coreference • FrameNet (1998–): frame-semantic lexica/annotations • MPQA (2005): opinion/sentiment • SQuAD (2016): annotated questions + spans of answers in Wikipedia

  7. Modern NLP is driven by annotated data • In most cases, the data we have is the product of human judgments. • What’s the correct part of speech tag? • Syntactic structure? • Sentiment?

  8. Ambiguity “One morning I shot 
 an elephant in my pajamas” Animal Crackers

  9. Dogmatism Fast and Horvitz (2016), “Identifying Dogmatism in Social Media: Signals and Models”

  10. Sarcasm https://www.nytimes.com/2016/08/12/opinion/an-even-stranger-donald-trump.html?ref=opinion

  11. Fake News http://www.fakenewschallenge.org

  12. Annotation pipeline Pustejovsky and Stubbs (2012), 
 Natural Language Annotation for Machine Learning

  13. Annotation pipeline Pustejovsky and Stubbs (2012), 
 Natural Language Annotation for Machine Learning

  14. Annotation Guidelines • Our goal: given the constraints of our problem, how can we formalize our description of the annotation process to encourage multiple annotators to provide the same judgment?

  15. Annotation guidelines • What is the goal of the project? • What is each tag called and how is it used? (Be specific: provide examples, and discuss gray areas.) • What parts of the text do you want annotated, and what should be left alone? • How will the annotation be created? (For example, explain which tags or documents to annotate first, how to use the annotation tools, etc.) Pustejovsky and Stubbs (2012), Natural Language Annotation for Machine Learning

  16. Practicalities • Annotation takes time, concentration (can’t do it 8 hours a day) • Annotators get better as they annotate (earlier annotations not as good as later ones)

  17. Why not do it yourself? • Expensive/time-consuming • Multiple people provide a measure of consistency: is the task well enough defined? • Low agreement can mean the annotators need more training, the guidelines are not well enough defined, or the task itself is ill-posed

  18. Adjudication • Adjudication is the process of deciding on a single annotation for a piece of text, using information about the independent annotations. • Can be as time-consuming as (or more so than) the primary annotation. • Does not need to be identical to either primary annotation (both annotators can be wrong by chance)

  19. Interannotator agreement

                             annotator A
                             puppy    fried chicken
 annotator B  puppy            6           3
              fried chicken    2           5

 observed agreement = (6 + 5)/16 = 11/16 = 68.75%
 https://twitter.com/teenybiscuit/status/705232709220769792/photo/1

  20. Cohen’s kappa • If classes are imbalanced, we can get high interannotator agreement simply by chance

                             annotator A
                             puppy    fried chicken
 annotator B  puppy            7           4
              fried chicken    8          81

  21. Cohen’s kappa • If classes are imbalanced, we can get high interannotator agreement simply by chance

                             annotator A
                             puppy    fried chicken
 annotator B  puppy            7           4
              fried chicken    8          81

 κ = (p_o − p_e) / (1 − p_e) = (0.88 − p_e) / (1 − p_e)

  22. Cohen’s kappa • Expected probability of agreement is how often we would expect two annotators to agree, assuming independent annotations:

 p_e = P(A = puppy, B = puppy) + P(A = chicken, B = chicken)
     = P(A = puppy) P(B = puppy) + P(A = chicken) P(B = chicken)

  23. Cohen’s kappa

                             annotator A
                             puppy    fried chicken
 annotator B  puppy            7           4
              fried chicken    8          81

 P(A = puppy)   = 15/100 = 0.15      P(B = puppy)   = 11/100 = 0.11
 P(A = chicken) = 85/100 = 0.85      P(B = chicken) = 89/100 = 0.89

 p_e = P(A = puppy) P(B = puppy) + P(A = chicken) P(B = chicken)
     = 0.15 × 0.11 + 0.85 × 0.89
     = 0.773

  24. Cohen’s kappa • If classes are imbalanced, we can get high interannotator agreement simply by chance

                             annotator A
                             puppy    fried chicken
 annotator B  puppy            7           4
              fried chicken    8          81

 κ = (p_o − p_e) / (1 − p_e) = (0.88 − 0.773) / (1 − 0.773) = 0.471

  25. Cohen’s kappa • “Good” values are subject to interpretation, but a rule of thumb:

 0.80-1.00   Very good agreement
 0.60-0.80   Good agreement
 0.40-0.60   Moderate agreement
 0.20-0.40   Fair agreement
 < 0.20      Poor agreement

  26.
                             annotator A
                             puppy    fried chicken
 annotator B  puppy            0           0
              fried chicken    0         100

  27.
                             annotator A
                             puppy    fried chicken
 annotator B  puppy           50           0
              fried chicken    0          50

  28.
                             annotator A
                             puppy    fried chicken
 annotator B  puppy            0          50
              fried chicken   50           0
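 As a check on the arithmetic in slides 20–24, here is a small Python sketch of Cohen’s kappa computed from a confusion matrix of counts; the helper name is mine, not from the slides. It reproduces κ ≈ 0.471 for the puppy/fried-chicken table and can also be applied to the three matrices on slides 26–28.

 import numpy as np

 def cohens_kappa(confusion):
     # Cohen's kappa from a square confusion matrix of counts
     # (rows = annotator B's labels, columns = annotator A's labels).
     confusion = np.asarray(confusion, dtype=float)
     total = confusion.sum()
     p_o = np.trace(confusion) / total        # observed agreement
     p_a = confusion.sum(axis=0) / total      # annotator A's label distribution (column marginals)
     p_b = confusion.sum(axis=1) / total      # annotator B's label distribution (row marginals)
     p_e = (p_a * p_b).sum()                  # expected agreement by chance
     return (p_o - p_e) / (1 - p_e)

 # Table from slides 20-24 (rows: B = puppy, fried chicken; columns: A = puppy, fried chicken)
 print(cohens_kappa([[7, 4],
                     [8, 81]]))   # ~0.471, even though raw agreement is 0.88

 # Matrices from slides 26-28:
 # [[0, 0], [0, 100]] -> 0/0, undefined (both annotators always say "fried chicken")
 # [[50, 0], [0, 50]] -> 1.0 (perfect agreement on balanced classes)
 # [[0, 50], [50, 0]] -> -1.0 (complete disagreement)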

  29. Interannotator agreement • Cohen’s kappa can be used for any number of classes. • Still requires two annotators who evaluate the same items. • Fleiss’ kappa generalizes to multiple annotators, each of whom may evaluate different items (e.g., crowdsourcing)

  30. Fleiss’ kappa • Same fundamental idea of measuring the observed agreement compared to the agreement we would expect by chance:

 κ = (P_o − P_e) / (1 − P_e)

 • With N > 2 annotators, we calculate agreement among pairs of annotators

  31. Fleiss’ kappa

 n_ij = number of annotators who assign category j to item i

 P_i = 1/(n(n−1)) · Σ_{j=1}^{K} n_ij (n_ij − 1)

 For item i with n annotations, how many annotators agree, among all n(n−1) possible pairs

  32. Fleiss’ kappa

 P_i = 1/(n(n−1)) · Σ_{j=1}^{K} n_ij (n_ij − 1)
 For item i with n annotations, how many annotators agree, among all n(n−1) possible pairs

 Example: four annotators label one item — A: +, B: +, C: +, D: −
 n_ij for this item: 3 annotators choose “+”, 1 annotator chooses “−”
 Agreeing pairs of annotators: A-B, B-A, A-C, C-A, B-C, C-B
 P_i = 1/(4(3)) · (3(2) + 1(0)) = 6/12 = 0.5

  33. Fleiss’ kappa

 P_o = (1/N) Σ_{i=1}^{N} P_i         Average agreement among all items

 p_j = (1/(Nn)) Σ_{i=1}^{N} n_ij     Probability of category j

 P_e = Σ_{j=1}^{K} p_j²              Expected agreement by chance: the joint probability that two raters pick the same label is the product of their independent probabilities of picking that label
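 A sketch of the full computation in Python, assuming the usual item-by-category count matrix n_ij with the same number of annotators per item; the function name and the multi-item example matrix are mine, while the single-item check matches slide 32.

 import numpy as np

 def fleiss_kappa(n_ij):
     # Fleiss' kappa from an (items x categories) matrix of counts,
     # where n_ij[i, j] = number of annotators assigning category j to item i.
     n_ij = np.asarray(n_ij, dtype=float)
     N = n_ij.shape[0]                   # number of items
     n = n_ij[0].sum()                   # annotators per item (assumed constant here)

     # P_i: for each item, the share of the n(n-1) ordered annotator pairs that agree
     P_i = (n_ij * (n_ij - 1)).sum(axis=1) / (n * (n - 1))
     P_o = P_i.mean()                    # average agreement over all items

     p_j = n_ij.sum(axis=0) / (N * n)    # probability of each category
     P_e = (p_j ** 2).sum()              # expected agreement by chance

     return (P_o - P_e) / (1 - P_e)

 # Single item from slide 32 (A, B, C say "+", D says "-"): P_i = (3*2 + 1*0)/(4*3) = 0.5.
 # Kappa itself is computed over many items, e.g. four items labeled by four annotators:
 items = [[3, 1],
          [4, 0],
          [2, 2],
          [0, 4]]
 print(fleiss_kappa(items))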

  34. Annotator bias correction
 • Dawid and Skene (1979), “Maximum Likelihood Estimation of Observer Error-Rates Using the EM Algorithm,” Journal of the Royal Statistical Society, 28(1):20–28
 • Wiebe et al. (1999), “Development and Use of a Gold-Standard Data Set for Subjectivity Classifications,” ACL (for sentiment)
 • Carpenter (2010), “Multilevel Bayesian Models of Categorical Data Annotation”
 • Snow, O’Connor, Jurafsky, and Ng (2008), “Cheap and Fast - But is it Good? Evaluating Non-Expert Annotations for Natural Language Tasks,” EMNLP
 • Sheng et al. (2008), “Get Another Label? Improving Data Quality and Data Mining Using Multiple, Noisy Labelers,” KDD
 • Raykar et al. (2009), “Supervised Learning from Multiple Experts: Whom to Trust When Everyone Lies a Bit,” ICML
 • Hovy et al. (2013), “Learning Whom to Trust with MACE,” NAACL

  35. Annotator bias correction

 P(label | truth) confusion matrix for a single annotator (David):

                          annotator label
 truth        positive   negative   mixed   unknown
 positive       0.95       0          0.03    0.02
 negative       0          0.80       0.10    0.10
 mixed          0.20       0.05       0.50    0.25
 unknown        0.15       0.10       0.10    0.70

  36. Annotator bias correction (Dawid and Skene 1979)
 • Basic idea: the true label is unobserved; what we observe are noisy judgments by annotators, each filtered through that annotator’s confusion matrix P(label | truth).
 [Figure: a distribution over the unobserved true label, an annotator confusion matrix P(label | truth), and the resulting distribution over observed labels]
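 A compressed Python sketch of the Dawid and Skene (1979) EM idea just described — latent true labels, one confusion matrix per annotator — written as a simplified illustration under the assumption that every annotator labels every item; it is not their exact implementation, and the toy data at the end is made up.

 import numpy as np

 def dawid_skene(labels, n_classes, n_iter=50):
     # labels: (items x annotators) integer array of observed labels.
     # Returns a posterior over each item's true label and one confusion
     # matrix P(label | truth) per annotator.
     n_items, n_annot = labels.shape

     # Initialize the posterior over true labels from (soft) majority votes.
     post = np.zeros((n_items, n_classes))
     for i in range(n_items):
         post[i] = np.bincount(labels[i], minlength=n_classes)
     post /= post.sum(axis=1, keepdims=True)

     for _ in range(n_iter):
         # M-step: class prior and each annotator's confusion matrix,
         # estimated from the current soft assignments of true labels.
         prior = post.mean(axis=0) + 1e-6
         conf = np.full((n_annot, n_classes, n_classes), 1e-6)
         for a in range(n_annot):
             for i in range(n_items):
                 conf[a, :, labels[i, a]] += post[i]
         conf /= conf.sum(axis=2, keepdims=True)

         # E-step: recompute the posterior over each item's true label.
         log_post = np.log(prior) + np.zeros((n_items, n_classes))
         for a in range(n_annot):
             log_post += np.log(conf[a, :, labels[:, a]])
         post = np.exp(log_post - log_post.max(axis=1, keepdims=True))
         post /= post.sum(axis=1, keepdims=True)

     return post, conf

 # Toy usage: 5 items, 3 annotators, binary labels (values made up for illustration)
 labels = np.array([[0, 0, 1],
                    [1, 1, 1],
                    [0, 0, 0],
                    [1, 0, 1],
                    [0, 1, 0]])
 posterior, confusion = dawid_skene(labels, n_classes=2)
 print(posterior.argmax(axis=1))   # inferred "true" labels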

  37. Evaluation • A critical part of developing new algorithms and methods and demonstrating that they work

  38. Classification
 A mapping h from input data x (drawn from instance space 𝒳) to a label (or labels) y from some enumerable output space 𝒴
 𝒳 = set of all documents
 𝒴 = {english, mandarin, greek, …}
 x = a single document
 y = ancient greek

  39. [Figure: the instance space 𝒳, partitioned into train, dev, and test sets]

  40. Experiment design

              training           development        testing
 size         80%                10%                10%
 purpose      training models    model selection    evaluation; never look at it until the very end
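 A minimal sketch of this split, assuming a shuffled list of labeled documents; the 80/10/10 proportions follow the slide, while the function name and example variable are mine.

 import random

 def train_dev_test_split(data, seed=0):
     # 80% / 10% / 10% split, as on the slide; the test split is held out
     # and only looked at once, at the very end.
     data = list(data)
     random.Random(seed).shuffle(data)
     n = len(data)
     n_train, n_dev = int(0.8 * n), int(0.1 * n)
     return data[:n_train], data[n_train:n_train + n_dev], data[n_train + n_dev:]

 # Example: documents = [("doc text", "label"), ...]
 # train, dev, test = train_dev_test_split(documents)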
