Injecting Linguistics into NLP by Annotation Eduard Hovy Information Sciences Institute University of Southern California
Lesson 1: Banko and Brill, HLT-01
• Confusion set disambiguation task: {you're | your}, {to | too | two}, {its | it's}
• 5 algorithms: ngram table, winnow, perceptron, transformation-based learning, decision trees
• Training: 10^6 to 10^9 words
• Lessons:
  – All methods improved to almost the same point
  – A simple method can end up above a complex one
  – Don't waste your time on algorithms and optimization
• Takeaway: you don't have to be smart, you just need enough training data
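The "simple method" in this study can be sketched as a plain ngram table: count which confusion-set member occurred with each surrounding-word context during training, then at test time pick the member most often seen in that context. A minimal sketch, assuming a tokenized corpus and a one-word context window on each side (the corpus and window are illustrative, not the paper's exact setup):

```python
from collections import Counter, defaultdict

CONFUSION_SET = {"its", "it's"}

def train(tokens):
    """Count (left word, right word) contexts for each confusion-set member."""
    table = defaultdict(Counter)
    for i, tok in enumerate(tokens):
        if tok in CONFUSION_SET and 0 < i < len(tokens) - 1:
            context = (tokens[i - 1], tokens[i + 1])
            table[context][tok] += 1
    return table

def disambiguate(table, left, right, default="its"):
    """Pick the member most often seen in this context; back off to a default."""
    counts = table.get((left, right))
    return counts.most_common(1)[0][0] if counts else default

corpus = "the dog wagged its tail because it's happy and its bowl was full".split()
table = train(corpus)
print(disambiguate(table, "wagged", "tail"))  # "its"
```

With 10^9 training words the table grows huge, which is exactly the point of the lesson: the data, not the learner, does the work.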
Lesson 2: Och, ACL-02
• Best MT system in the world (Arabic-English, by BLEU and NIST, 2002–2005): Och's work
• Method: learn ngram correspondence patterns (alignment templates) using MaxEnt (log-linear translation model), trained to maximize BLEU score
  [figure: alignment diagram linking source and target word sequences]
• Approximately: EBMT + Viterbi search
• Lesson: the more you store, the better your MT
• Takeaway: you don't have to be smart, you just need enough storage
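The log-linear model mentioned here scores a candidate translation e of source f as a weighted sum of feature functions, sum_m lambda_m * h_m(e, f), and decoding picks the highest-scoring candidate. A toy sketch of that scoring step; the two features and their weights are invented for illustration, not Och's actual feature set:

```python
def score(candidate, source, features, weights):
    """Log-linear model: weighted sum of feature function values."""
    return sum(w * h(candidate, source) for h, w in zip(features, weights))

# Invented toy features: a length-ratio penalty and a word-overlap reward.
def length_penalty(e, f):
    return -abs(len(e.split()) - len(f.split()))

def overlap(e, f):
    return len(set(e.split()) & set(f.split()))

features, weights = [length_penalty, overlap], [0.5, 1.0]
candidates = ["the house is small", "the house small", "small house"]
source = "das Haus ist klein"
best = max(candidates, key=lambda e: score(e, source, features, weights))
print(best)  # "the house is small"
```

In the real system the feature weights are not hand-set but tuned to maximize BLEU on held-out data.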
Lesson 3: Chiang et al., HLT-2009
• 11,001 New Features for Statistical MT. David Chiang, Kevin Knight, Wei Wang. 2009. Proc. NAACL HLT. Best paper award
• Learn MT rules: NP-C(x0:NPB PP(IN(of) x1:NPB)) <–> x1 de x0
• Several hundred count features of various kinds: reward rules seen more often; punish rules that partly overlap; punish rules that insert is, the, etc. into English …
• 10,000 word context features: for each triple (f, e, f+1), a feature that counts the number of times that f is aligned to e and f+1 occurs to the right of f; similarly for triples (f, e, f−1) with f−1 occurring to the left of f. Words restricted to the 100 most frequent in the training data
• Takeaway: you don't have to know anything, you just need enough features
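The context features can be sketched as counts over aligned word pairs: for each alignment link (f, e), fire a feature keyed on e together with the source word to the right (or left) of f, restricted to a frequent-word vocabulary. A minimal sketch under assumed data structures (token lists plus a list of alignment index pairs, which the paper's implementation does not literally use):

```python
from collections import Counter

def context_features(src, tgt, alignment, vocab):
    """Count (e, side, neighbor-of-f) features, restricted to a
    frequent-word vocabulary as the paper describes."""
    feats = Counter()
    for i, j in alignment:                # f = src[i] aligned to e = tgt[j]
        for offset, side in ((1, "R"), (-1, "L")):
            k = i + offset
            if 0 <= k < len(src) and src[k] in vocab:
                feats[(tgt[j], side, src[k])] += 1
    return feats

src = ["das", "Haus", "ist", "klein"]
tgt = ["the", "house", "is", "small"]
alignment = [(0, 0), (1, 1), (2, 2), (3, 3)]
vocab = {"das", "ist", "klein"}           # stand-in for the 100 most frequent words
feats = context_features(src, tgt, alignment, vocab)
print(feats[("house", "R", "ist")])       # 1
```

Restricting neighbors to the 100 most frequent source words is what keeps the feature count near 10,000 rather than vocabulary-squared.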
Lesson 4: Fleischman and Hovy, ACL-03
• Text mining: classify locations and people from free text into fine-grained classes
  – Simple appositive IE patterns
  – 2+ million examples, collapsed into 1 million instances (avg: 2 mentions/instance, 40+ for George W. Bush)
• Test: QA on "who is X?":
  – 100 questions from AskJeeves
  – System 1: table of instances
  – System 2: ISI's TextMap QA system
  [figure: Performance on a Question Answering Task — % correct / partial / incorrect, state-of-the-art system vs. extraction system]
  – The table system scored 25% better
  – Over half of the questions that TextMap got wrong could have benefited from information in the concept-instance pairs
  – This method took 10 seconds; TextMap took ~9 hours
• Takeaway: you don't have to reason, you just need to collect the knowledge beforehand
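The approach can be sketched in two steps: harvest concept-instance pairs with simple appositive patterns, then answer "who is X?" by table lookup. A minimal regex sketch; the single pattern and the sentences are illustrative (the paper uses a set of such patterns over parsed text, and this naive pattern only matches names followed by a lowercase appositive):

```python
import re
from collections import defaultdict

# Illustrative appositive pattern: "<Name>, (the|a|an) <description>,"
APPOSITIVE = re.compile(
    r"([A-Z][a-z]+(?: [A-Z][a-z]+)+), (?:the |an |a )?([a-zA-Z ]+?),"
)

def harvest(sentences):
    """Build a concept-instance table from appositive matches."""
    table = defaultdict(set)
    for s in sentences:
        for name, desc in APPOSITIVE.findall(s):
            table[name].add(desc.strip())
    return table

def who_is(table, name):
    """'QA as table lookup': no reasoning, just retrieval."""
    return table.get(name, set())

sentences = [
    "George Bush, the president of the USA, spoke in Texas yesterday.",
    "George Bush, a former governor of Texas, visited the school.",
]
table = harvest(sentences)
print(sorted(who_is(table, "George Bush")))
```

Lookup is constant-time per question, which is why the table system answered in seconds while the full QA pipeline took hours.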
Four lessons
• You don't have to be smart, you just need enough training data — the web has all you need
• You don't have to be smart, you just need enough memory — memory gets cheaper
• You don't have to be smart, you just need enough features — computers get faster
• You don't have to be smart, you just need to collect the knowledge beforehand
…we are moving to a new world:
• Conclusion: NLP as table lookup
So you may be happy with this, but I am not … I want to understand what's going on in language and thought
• We have no theory of language, or even of language processing, in NLP
• Our general approach is:
  – Goal: transform notation 1 into notation 2 (maybe adding tags…)
  – Learn how to do this automatically
  – Design an algorithm to beat the other guy
• How can one inject understanding?
• Generally, to reduce the size of a transformation table / statistical model, you introduce a generalization step:
  – POS tags, syntactic trees, modality labels…
• If you're smart, the theory behind the generalization actually 'explains' or 'captures' the phenomenon
  – Classes of the phenomenon + rules linking them
• 'Good' NLP can test the adequacy of a theory by determining the table reduction factor
• How can you introduce the generalization info?
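The "table reduction factor" idea can be made concrete: replace word-level table entries with entries over generalization classes (e.g. POS tags) and compare table sizes. A toy sketch with an assumed word-to-POS mapping; the pairs stand in for entries of a transformation table:

```python
def reduction_factor(pairs, generalize):
    """Ratio of distinct word-level entries to distinct generalized entries."""
    word_table = set(pairs)
    general_table = {tuple(generalize(w) for w in p) for p in pairs}
    return len(word_table) / len(general_table)

# Assumed toy lexicon mapping words to POS classes.
POS = {"the": "DET", "a": "DET", "cat": "N", "dog": "N", "runs": "V", "sleeps": "V"}

# Word-level transformation entries (e.g. bigram patterns seen in training).
pairs = [("the", "cat"), ("the", "dog"), ("a", "cat"), ("a", "dog")]
print(reduction_factor(pairs, POS.get))  # 4 word entries -> 1 (DET, N) entry: 4.0
```

A good theory yields classes that collapse many entries without merging ones that behave differently, so the factor is one crude measure of a theory's adequacy.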
Annotation!
1. Preparation (which corpus? interface design issues)
   – Choose the corpus
   – Build the interfaces
2. Instantiating the theory (how to remain true to the theory?)
   – Create the annotation choices
   – Test-run them for stability
3. Annotation (how many annotators? which procedure?)
   – Annotate
   – Reconcile among annotators
4. Validation (which measures?)
   – Measure inter-annotator agreement
   – Possibly adjust the theory instantiation
5. Delivery
   – Wrap the result
→ 'annotation science'
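The validation step's "measure inter-annotator agreement" is typically done with a chance-corrected statistic such as Cohen's kappa for two annotators. A minimal sketch over two assumed label sequences:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    ca, cb = Counter(labels_a), Counter(labels_b)
    expected = sum(ca[k] * cb[k] for k in ca) / (n * n)  # chance agreement
    return (observed - expected) / (1 - expected)

a = ["pos", "pos", "neg", "neg", "pos", "neg"]
b = ["pos", "neg", "neg", "neg", "pos", "pos"]
print(round(cohens_kappa(a, b), 3))  # 0.333
```

Raw agreement here is 4/6, but kappa discounts the 0.5 agreement expected by chance with these balanced label distributions, which is why it is the standard check before trusting an annotated corpus.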
The new NLP world
• Fundamental methodological assumptions of NLP:
  – Old-style NLP: the process is deterministic; manually written rules will exactly generate the desired product
  – Statistical NLP: the process is (somewhat) nondeterministic; probabilities predict the likelihood of products
  – Underlying assumption: as long as annotator consistency can be achieved, there is systematicity, and systems will learn to find it
• Theory creation (and testing!) through corpus annotation
  – But we (still) have to manually identify generalizations (= equivalence classes of individual instances of phenomena) to obtain expressive generality/power
  – This is the 'theory'
  – (and we need to understand how to do annotation properly)
Who are the people with the 'theory'? Not us!
• Our 'theory' of sentiment
• Our 'theory' of entailment
• Our 'theory' of MT
• Our 'theory' of IR
• Our 'theory' of QA
• …
A fruitful cycle
[cycle diagram: linguists, psycholinguists, cognitive linguists… (analysis, theorizing, annotation) → annotated corpus → machine learning of transformations (current NLP researchers) → storage in large tables, automated optimization (NLP companies) → problems: low performance, evaluation → back to analysis and theorizing; method creation throughout]
• Each one influences the others
• Different people like different work
Toward a theory of NLP?
• Basic tenets:
  1. NLP is notation transformation
  2. There exists a natural and optimal set of transformation steps, each involving a dedicated and distinct representation
     • Problem: the syntax-semantics and semantics-pragmatics interfaces
  3. Each representation is based on a suitable (family of) theories in linguistics, philosophy, rhetoric, social interaction studies, etc.
     • Problem: which theory/ies? Why?
  4. Except for a few circumscribed phenomena (morphology, number expressions, etc.), the phenomena being represented are too complex and interrelated for human-built rules to handle them well
     • Puzzle: but they can (usually) be annotated in corpora: why?
  5. A set of machine learning algorithms and a set of features can be used to learn the transformations from suitably annotated corpora
     • Problem: which algorithms and features? Why?
• Observation: we (almost) completely lack the theoretical framework to describe and measure the informational content and complexity of the representation levels we use — a challenge for the future
The face of NLP tomorrow
Three (and a Half) Trends — The Near Future of NLP:
1. Machine learning transformations
2. Analysis and corpus construction
3. Table construction and use
4. Evaluation frameworks
Who are you?
Thank you!