Questions that linguistics should answer
• What kinds of things do people say?
• What do these things say/ask/request about the world?
  Example: In addition to this, she insisted that women were regarded as a different existence from men unfairly.
• Text corpora give us data with which to answer these questions
• What words, rules, statistical facts do we find?
• Can we build programs that learn effectively from this data, and can then do NLP tasks?

Corpora
• A corpus is a body of naturally occurring text, normally one organized or selected in some way
• Greek: one corpus, two corpora
• A balanced corpus tries to be representative across a language or other domain
• Balance is something of a chimaera: what is balanced? Who spends what percent of their time reading the sports pages?
• Corpora are an externalization of linguistic knowledge

The Brown corpus
• Famous early corpus, made by W. Nelson Francis and Henry Kučera at Brown University in the 1960s. A balanced corpus of written American English in 1960 (except poetry!).
• 1 million words, which seemed huge at the time. Sorting the words to produce a word list took 17 hours of (dedicated) processing time, because the computer (an IBM 7070) had the equivalent of only about 40 kilobytes of memory, and so the sort algorithm had to store the data being sorted on tape drives.
• Its significance has increased over time, but so has awareness of its limitations.
• Tagged for part of speech in the 1970s:
  The/AT General/JJ-TL Assembly/NN-TL ,/, which/WDT adjourns/VBZ today/NR ,/, has/HVZ performed/VBN

Recent corpora
• British National Corpus: 100 million words, tagged for part of speech. Balanced.
• Newswire (NYT or WSJ are most commonly used): something like 600 million words is fairly easily available.
• Legal reports; UN or EU proceedings (parallel multilingual corpora: the same text in multiple languages)
• The Web (in the billions of words, but you need to filter for distinctness)
• Penn Treebank: 2 million words (1 million WSJ, 1 million speech) of parsed sentences (as phrase structure trees)

Sparsity
• How often does an everyday word like kick occur in a million words of text?
  – kick: about 10 [depends vastly on genre, of course]
  – wrist: about 5
• Normally we want to know about something bigger than a single word, like how often you kick a ball, or how often the conative alternation he kicked at the balloon occurs.
• How often can we expect that to occur in 1 million words? Almost never. (A small counting sketch follows below.)
• "There's no data like more data" [if of the right domain]

Large and strange, sparse, discrete distributions
• Both features and assigned classes regularly involve multinomial distributions over huge numbers of values (often in the tens of thousands).
• The distributions are very uneven and have fat tails.
• Enormous problems with data sparseness: much work on smoothing distributions/backoff (shrinkage), etc.
• We normally have inadequate (labeled) data to estimate probabilities.
• Unknown/unseen things are usually a central problem.
• Generally, though, we are dealing with discrete distributions.
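As an illustration of the sparsity point, here is a minimal Python sketch that counts a single word and a two-word sequence in raw text. The filename corpus.txt is a placeholder for any roughly million-word plain-text corpus, and the counts in the comments are the ballpark figures from the slides, not guaranteed outputs.

```python
# Sketch: how rare "everyday" words and word combinations are in raw text.
# "corpus.txt" is a hypothetical plain-text corpus of about a million words.
from collections import Counter

with open("corpus.txt", encoding="utf-8") as f:
    tokens = f.read().lower().split()           # crude whitespace tokenization

unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))

print("total tokens:", len(tokens))
print("kick:", unigrams["kick"])                # roughly 10 per million words (genre-dependent)
print("wrist:", unigrams["wrist"])              # roughly 5 per million words
print("kicked at:", bigrams[("kicked", "at")])  # longer patterns: usually 0 -- the sparsity problem
```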

Probabilistic language modeling
• Assigns probability P(t) to a word sequence t = w_1 w_2 ... w_n
• Chain rule and joint/conditional probabilities for text t:
  P(t) = P(w_1 ... w_n) = P(w_1) P(w_2 | w_1) ... P(w_n | w_1, ..., w_{n-1}) = ∏_{i=1}^{n} P(w_i | w_1 ... w_{i-1})
• Conditional probabilities are estimated by relative frequency:
  P(w_k | w_1 ... w_{k-1}) = P(w_1 ... w_k) / P(w_1 ... w_{k-1}) ≈ C(w_1 ... w_k) / C(w_1 ... w_{k-1})
• The chain rule leads to a history-based model: we predict following things from past things
• We cluster histories into equivalence classes to reduce the number of parameters to estimate

n-gram models: the classic example of a statistical model of language
• Each word is predicted according to a conditional distribution based on a limited context
• Conditional Probability Table (CPT): P(X | both), e.g.
  – P(of | both) = 0.066
  – P(to | both) = 0.041
  – P(in | both) = 0.038
• From the 1940s onward (or even the 1910s – Markov 1913)
• a.k.a. Markov (chain) models

Markov models = n-gram models
• Deterministic FSMs with probabilities
  [Figure: FSM with start state s and arcs labeled word:probability, e.g. in:0.01, broccoli:0.002, for:0.05, eats:0.01, fish:0.1, at:0.03, chicken:0.15, for:0.1]
• Simplest linear graphical model
  [Figure: chain W_1 → W_2 → W_3 → W_4, with transition probabilities a_ij on the arrows]
• Words are random variables, arrows are direct dependencies between them (CPTs)
• These simple engineering models have just been amazingly successful.
• (And: robust, have frequency information, ...)

n-gram models
• In both ??
• No long-distance dependencies
• "The future is independent of the past given the present"
• No notion of structure or syntactic dependency
• But lexical

n-gram models
• Core language model for the engineering task of better predicting the next word:
  – Speech recognition
  – OCR
  – Context-sensitive spelling correction
• It is only recently that they have been improved on for these tasks (Chelba and Jelinek 1998; Charniak 2001).
• But linguistically, they are appallingly simple and naive

n-th order Markov models
• First-order Markov assumption = bigram:
  P(w_k | w_1 ... w_{k-1}) ≈ P(w_k | w_{k-1}) = P(w_{k-1} w_k) / P(w_{k-1})
• Similarly for the n-th order Markov assumption
• Most commonly, trigram (2nd order):
  P(w_k | w_1 ... w_{k-1}) ≈ P(w_k | w_{k-2} w_{k-1}) = P(w_{k-2} w_{k-1} w_k) / P(w_{k-2} w_{k-1})
  (a code sketch of the bigram case follows below)
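A minimal sketch of the bigram case above, assuming nothing beyond the relative-frequency estimate on the slides. The toy training string, the </s> end-of-sentence marker, and the names p_bigram and p_sentence are illustrative choices, not code from the lecture.

```python
# A toy bigram model estimated by relative frequency (MLE), matching
#   P(w_k | w_{k-1}) = C(w_{k-1} w_k) / C(w_{k-1}).
# The training string and the "</s>" end-of-sentence marker are illustrative.
from collections import Counter

train = "I want to eat Chinese food </s> I want to eat lunch </s>".lower().split()

unigram_counts = Counter(train)
bigram_counts = Counter(zip(train, train[1:]))

def p_bigram(w, prev):
    """CPT entry P(w | prev), estimated by relative frequency."""
    if unigram_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, w)] / unigram_counts[prev]

def p_sentence(words):
    """Chain rule with the first-order Markov assumption:
    P(w_1 ... w_n) ~= product over i of P(w_i | w_{i-1})
    (the initial P(w_1) factor is omitted for brevity)."""
    prob = 1.0
    for prev, w in zip(words, words[1:]):
        prob *= p_bigram(w, prev)
    return prob

print(p_bigram("to", "want"))                     # C(want to) / C(want) = 2/2 = 1.0
print(p_sentence("i want to eat lunch".split()))  # 1.0 * 1.0 * 1.0 * 0.5 = 0.5
```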

Why mightn't n-gram models work?
• Relationships (say between subject and verb) can be arbitrarily distant and convoluted, as linguists love to point out:
  The man that I was watching without pausing to look at what was happening down the street, and quite oblivious to the situation that was about to befall him confidently strode into the center of the road.

Why do they work?
• That kind of thing doesn't happen much
• Collins (1997): 74% of dependencies (in the Penn Treebank – WSJ) are with an adjacent word (95% with one ≤ 5 words away), once one treats simple NPs as units
• Below, 4/6 = 66% based on words:
  The post office will hold out discounts

Why is that?
Sapir (1921: 14): 'When I say, for instance, "I had a good breakfast this morning," it is clear that I am not in the throes of laborious thought, that what I have to transmit is hardly more than a pleasurable memory symbolically rendered in the grooves of habitual expression. ... It is somewhat as though a dynamo capable of generating enough power to run an elevator were operated almost exclusively to feed an electric doorbell.'

Evaluation of language models
• Best evaluation of a probability model is task-based
• As a substitute for evaluating one component, standardly use corpus per-word cross entropy:
  H(X, p) = -(1/n) Σ_{i=1}^{n} log_2 P(w_i | w_1, ..., w_{i-1})
• Or perplexity (measure of uncertainty of predictions):
  PP(X, p) = 2^{H(X, p)} = ( ∏_{i=1}^{n} P(w_i | w_1, ..., w_{i-1}) )^{-1/n}
• Needs to be assessed on independent, unseen test data (see the sketch after these slides)

Selected bigram counts (Berkeley Restaurant Project – J&M)
(rows: first word w_1; columns: second word w_2)

              I      want   to     eat    Chinese  food   lunch
  I           8      1087   0      13     0        0      0
  want        3      0      786    0      6        8      6
  to          3      0      10     860    3        0      12
  eat         0      0      2      0      19       2      52
  Chinese     2      0      0      0      0        120    1
  food        19     0      17     0      0        0      0
  lunch       4      0      0      0      0        1      0

Relative frequency = Maximum Likelihood Estimate
• P(w_2 | w_1) = C(w_1, w_2) / C(w_1)
  (or similarly for higher order or joint probabilities)
• Makes training data as probable as possible
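To make the evaluation formulas concrete, here is a small Python sketch of per-word cross entropy and perplexity under a bigram model. It assumes a conditional estimator like the hypothetical p_bigram from the earlier sketch; smoothing is ignored, so any unseen bigram gets probability 0 and the cross entropy blows up, which is exactly the sparsity problem discussed above.

```python
# Sketch of the evaluation formulas:
#   H(X, p)  = -(1/n) * sum_i log2 P(w_i | history)   (per-word cross entropy)
#   PP(X, p) = 2 ** H(X, p)                            (perplexity)
# `prob(w, prev)` is any bigram conditional estimator, e.g. the p_bigram
# function sketched earlier; there is no smoothing here.
import math

def cross_entropy(words, prob):
    n = len(words) - 1                      # number of predicted words
    total = 0.0
    for prev, w in zip(words, words[1:]):
        total += math.log2(prob(w, prev))   # log2 P(w_i | w_{i-1})
    return -total / n

def perplexity(words, prob):
    return 2 ** cross_entropy(words, prob)

# Hypothetical usage, on held-out text whose bigrams were all seen in training:
# test = "i want to eat chinese food".split()
# print(cross_entropy(test, p_bigram), perplexity(test, p_bigram))
```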
