CS 188: Artificial Intelligence, Spring 2006
Lecture 27: NLP, 4/27/2006
Dan Klein – UC Berkeley

What is NLP?
- Fundamental goal: deep understanding of broad language
- Not just string processing or keyword matching!
- End systems that we want to build:
  - Ambitious: speech recognition, machine translation, information extraction, dialog interfaces, question answering…
  - Modest: spelling correction, text categorization…

Why is Language Hard?
- Ambiguity:
  - EYE DROPS OFF SHELF
  - MINERS REFUSE TO WORK AFTER DEATH
  - KILLER SENTENCED TO DIE FOR SECOND TIME IN 10 YEARS
  - LACK OF BRAINS HINDERS RESEARCH

The Big Open Problems
- Machine translation
- Information extraction
- Solid speech recognition
- Deep content understanding

Machine Translation
- Translation systems encode:
  - Something about fluent language
  - Something about how two languages correspond
- SOTA: for easy language pairs, better than nothing, but more an understanding aid than a replacement for human translators

Information Extraction
- Information Extraction (IE): unstructured text to database entries
- Example text: "New York Times Co. named Russell T. Lewis, 45, president and general manager of its flagship New York Times newspaper, responsible for all business-side activities. He was executive vice president and deputy general manager. He succeeds Lance R. Primis, who in September was named president and chief operating officer of the parent."
- Extracted records:

  Person           | Company                  | Post                           | State
  Russell T. Lewis | New York Times newspaper | president and general manager  | start
  Russell T. Lewis | New York Times newspaper | executive vice president       | end
  Lance R. Primis  | New York Times Co.       | president and CEO              | start

- SOTA: perhaps 70% accuracy for multi-sentence templates, 90%+ for single easy fields
Question Answering
- Question answering: more than search
- Ask general comprehension questions of a document collection
- Can be really easy: "What's the capital of Wyoming?"
- Can be harder: "How many US states' capitals are also their largest cities?"
- Can be open ended: "What are the main issues in the global warming debate?"
- SOTA: can do factoids, even when the text isn't a perfect match

Models of Language
- Two main ways of modeling language
- Language modeling: putting a distribution P(s) over sentences s
  - Useful for modeling fluency in a noisy channel setting, like machine translation or ASR
  - Typically simple models, trained on lots of data
- Language analysis: determining the structure and/or meaning behind a sentence
  - Useful for deeper processing like information extraction or question answering
  - Starting to be used for MT

The Speech Recognition Problem
- We want to predict a sentence given an acoustic sequence:
    s* = argmax_s P(s | A)
- The noisy channel approach:
  - Build a generative model of production (encoding):
      P(A, s) = P(s) P(A | s)
  - To decode, we use Bayes' rule to write:
      s* = argmax_s P(s | A)
         = argmax_s P(s) P(A | s) / P(A)
         = argmax_s P(s) P(A | s)
  - Now, we have to find a sentence maximizing this product

N-Gram Language Models
- No loss of generality to break sentence probability down with the chain rule:
    P(w_1 w_2 … w_n) = ∏_i P(w_i | w_1 w_2 … w_{i-1})
- Too many histories!
- N-gram solution: assume each word depends only on a short linear history:
    P(w_1 w_2 … w_n) = ∏_i P(w_i | w_{i-k} … w_{i-1})

Unigram Models
- Simplest case: unigrams:
    P(w_1 w_2 … w_n) = ∏_i P(w_i)
- Generative process: pick a word, pick another word, …
- As a graphical model: independent nodes w_1, w_2, …, w_{n-1}, STOP
- To make this a proper distribution over sentences, we have to generate a special STOP symbol last. (Why?)
- Examples:
  - [fifth, an, of, futures, the, an, incorporated, a, a, the, inflation, most, dollars, quarter, in, is, mass.]
  - [thrift, did, eighty, said, hard, 'm, july, bullish]
  - [that, or, limited, the]
  - []
  - [after, any, on, consistently, hospital, lake, of, of, other, and, factors, raised, analyst, too, allowed, mexico, never, consider, fall, bungled, davison, that, obtain, price, lines, the, to, sass, the, the, further, board, a, details, machinists, the, companies, which, rivals, an, because, longer, oakes, percent, a, they, three, edward, it, currier, an, within, in, three, wrote, is, you, s., longer, institute, dentistry, pay, however, said, possible, to, rooms, hiding, eggs, approximate, financial, canada, the, so, workers, advancers, half, between, nasdaq]

Bigram Models
- Big problem with unigrams: P(the the the the) >> P(I like ice cream)
- Condition on the last word:
    P(w_1 w_2 … w_n) = ∏_i P(w_i | w_{i-1})
- As a graphical model: a chain START → w_1 → w_2 → … → w_{n-1} → STOP
- Any better?
- Examples:
  - [texaco, rose, one, in, this, issue, is, pursuing, growth, in, a, boiler, house, said, mr., gurria, mexico, 's, motion, control, proposal, without, permission, from, five, hundred, fifty, five, yen]
  - [outside, new, car, parking, lot, of, the, agreement, reached]
  - [although, common, shares, rose, forty, six, point, four, hundred, dollars, from, thirty, seconds, at, the, greatest, play, disingenuous, to, be, reset, annually, the, buy, out, of, american, brands, vying, for, mr., womack, currently, sharedata, incorporated, believe, chemical, prices, undoubtedly, will, be, as, much, is, scheduled, to, conscientious, teaching]
  - [this, would, be, a, record, november]
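(Not on the original slides.) To make the bigram definition concrete, here is a minimal Python sketch of maximum-likelihood bigram estimation and of sampling along the START → w_1 → … → STOP chain shown above. The START/STOP markers, function names, and toy corpus are illustrative assumptions; real language models are trained on millions of words.

import random
from collections import defaultdict

def train_bigram_model(sentences):
    """Estimate P(w_i | w_{i-1}) by relative frequency (maximum likelihood)."""
    counts = defaultdict(lambda: defaultdict(int))
    for sentence in sentences:
        tokens = ["<START>"] + sentence + ["<STOP>"]
        for prev, word in zip(tokens, tokens[1:]):
            counts[prev][word] += 1
    model = {}
    for prev, nexts in counts.items():
        total = sum(nexts.values())
        model[prev] = {w: c / total for w, c in nexts.items()}
    return model

def sample_sentence(model, max_len=20):
    """Generate a sentence by walking the chain START -> w_1 -> ... -> STOP."""
    word, out = "<START>", []
    for _ in range(max_len):
        nexts = model[word]
        word = random.choices(list(nexts), weights=list(nexts.values()))[0]
        if word == "<STOP>":
            break
        out.append(word)
    return out

# Toy corpus for illustration only.
corpus = [["critics", "write", "reviews"],
          ["critics", "write", "new", "reviews"],
          ["new", "critics", "write", "reviews", "with", "computers"]]
model = train_bigram_model(corpus)
print(sample_sentence(model))

Because each word is chosen given only the previous word, samples are locally fluent but can wander globally, which is exactly the behavior the bigram example sentences above illustrate.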
Sparsity
- Problems with n-gram models:
  - New words appear all the time:
    - Synaptitute
    - 132,701.03
    - fuzzificational
  - New bigrams: even more often
  - Trigrams or more – still worse!
- [Plot: fraction of test n-grams seen in training vs. number of training words (0–1,000,000), with curves for unigrams, bigrams, and rules]
- Zipf's Law
  - Types (words) vs. tokens (word occurrences)
  - Broadly: most word types are rare
  - Specifically: rank word types by token frequency; frequency is inversely proportional to rank
  - Not special to language: randomly generated character strings have this property

Smoothing
- We often want to make estimates from sparse statistics:
    P(w | denied the): 3 allegations, 2 reports, 1 claims, 1 request (7 total)
- Smoothing flattens spiky distributions so they generalize better:
    P(w | denied the): 2.5 allegations, 1.5 reports, 0.5 claims, 0.5 request, 2 other (7 total)
- [Bar charts over candidate words: allegations, reports, claims, request, attack, man, outcome, …]
- Very important all over NLP, but easy to do badly!

Phrase Structure Parsing
- Phrase structure parsing organizes syntax into constituents or brackets
- In general, this involves nested trees
- Linguists can, and do, argue about details
- Lots of ambiguity
- Not the only kind of syntax…
- [Parse tree of "new art critics write reviews with computers" with constituent labels S, NP, VP, PP, N']

PP Attachment
- [Tree diagrams contrasting attachment sites for the PP in the example sentence]

Attachment is a Simplification
- I cleaned the dishes from dinner
- I cleaned the dishes with detergent
- I cleaned the dishes in the sink

Syntactic Ambiguities I
- Prepositional phrases: They cooked the beans in the pot on the stove with handles.
- Particle vs. preposition: A good pharmacist dispenses with accuracy. The puppy tore up the staircase.
- Complement structures: The tourists objected to the guide that they couldn't hear. She knows you like the back of her hand.
- Gerund vs. participial adjective: Visiting relatives can be boring. Changing schedules frequently confused passengers.
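(Not on the original slides.) The lecture does not commit to a particular smoothing method here; as one simple illustration, below is a minimal add-k (Laplace-style) smoothing sketch in Python, using the observed counts for the history "denied the" from the slide. The vocabulary size and the value of k are made-up illustration numbers, and the resulting estimates are not meant to reproduce the flattened counts shown above.

from collections import Counter

def smoothed_next_word_prob(counts, vocab_size, k=0.5):
    """Add-k smoothed estimate of P(w | history) from observed next-word counts.
    k > 0 reserves probability mass for unseen words, flattening the spiky
    maximum-likelihood distribution."""
    total = sum(counts.values())
    def prob(word):
        return (counts[word] + k) / (total + k * vocab_size)
    return prob

# Observed counts after the history "denied the" (from the slide).
counts = Counter({"allegations": 3, "reports": 2, "claims": 1, "request": 1})
p = smoothed_next_word_prob(counts, vocab_size=20000, k=0.5)
print(p("allegations"))   # shrunk below the MLE value 3/7
print(p("outcome"))       # small but nonzero, unlike the MLE value 0

Add-k is the easiest smoothing method to state, and also one of the easiest to "do badly": with a large vocabulary it can move far too much mass onto unseen words, which is why better methods (interpolation, backoff) are used in practice.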
Syntactic Ambiguities II
- Modifier scope within NPs: impractical design requirements; plastic cup holder
- Multiple gap constructions: The chicken is ready to eat. The contractors are rich enough to sue.
- Coordination scope: Small rats and mice can squeeze into holes or cracks in the wall.

Human Processing
- Garden pathing
- Ambiguity maintenance

Context-Free Grammars
- A context-free grammar is a tuple <N, T, S, R>
- N: the set of non-terminals
  - Phrasal categories: S, NP, VP, ADJP, etc.
  - Parts-of-speech (pre-terminals): NN, JJ, DT, VB
- T: the set of terminals (the words)
- S: the start symbol
  - Often written as ROOT or TOP
  - Not usually the sentence non-terminal S
- R: the set of rules
  - Of the form X → Y_1 Y_2 … Y_k, with X, Y_i ∈ N
  - Examples: S → NP VP, VP → VP CC VP
  - Also called rewrites, productions, or local trees

Example CFG
- Can just write the grammar (rules with non-terminal LHSs) and lexicon (rules with pre-terminal LHSs)
- Grammar: ROOT → S, S → NP VP, NP → NN, NP → NNS, NP → JJ NP, NP → NP NNS, NP → NP PP, VP → VBP, VP → VBP NP, VP → VP PP, PP → IN NP
- Lexicon: JJ → new, NN → art, NNS → critics, NNS → reviews, NNS → computers, VBP → write, IN → with

Top-Down Generation from CFGs
- A CFG generates a language
- Fix an order: apply rules to the leftmost non-terminal
- Example derivation: ROOT ⇒ S ⇒ NP VP ⇒ NNS VP ⇒ critics VP ⇒ critics VBP NP ⇒ critics write NP ⇒ critics write NNS ⇒ critics write reviews
- Gives a derivation of a tree using rules of the grammar

Corpora
- A corpus is a collection of text
  - Often annotated in some way
  - Sometimes just lots of text
  - Balanced vs. uniform corpora
- Examples:
  - Newswire collections: 500M+ words
  - Brown corpus: 1M words of tagged "balanced" text
  - Penn Treebank: 1M words of parsed WSJ
  - Canadian Hansards: 10M+ words of aligned French / English sentences
  - The Web: billions of words of who knows what
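(Not on the original slides.) Below is a minimal Python sketch of top-down, leftmost generation from the Example CFG above. Choosing among a non-terminal's rules uniformly at random is an assumption for illustration, and the lexicon is stored as rules whose right-hand sides are single words; nothing here is code from the lecture.

import random

# Grammar and lexicon from the "Example CFG" slide:
# non-terminal -> list of possible right-hand sides.
RULES = {
    "ROOT": [["S"]],
    "S":    [["NP", "VP"]],
    "NP":   [["NN"], ["NNS"], ["JJ", "NP"], ["NP", "NNS"], ["NP", "PP"]],
    "VP":   [["VBP"], ["VBP", "NP"], ["VP", "PP"]],
    "PP":   [["IN", "NP"]],
    "JJ":   [["new"]],
    "NN":   [["art"]],
    "NNS":  [["critics"], ["reviews"], ["computers"]],
    "VBP":  [["write"]],
    "IN":   [["with"]],
}

def generate(start=("ROOT",)):
    """Top-down generation: repeatedly rewrite the leftmost non-terminal
    until only terminals (words) remain."""
    symbols = list(start)
    while True:
        # Find the leftmost symbol that still has a rewrite rule.
        i = next((j for j, s in enumerate(symbols) if s in RULES), None)
        if i is None:
            return symbols  # all terminals: a sentence of the language
        rhs = random.choice(RULES[symbols[i]])  # pick one of its rules at random
        symbols[i:i + 1] = rhs                  # rewrite in place

print(" ".join(generate()))  # e.g. "critics write reviews"

Each run traces a derivation like the one on the slide (ROOT ⇒ S ⇒ NP VP ⇒ …); because the recursive NP and VP rules are only some of the available choices, random generation tends to terminate quickly, but different runs produce different sentences of the language.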