Language and Stats 11-(7/6)61 Introduction Objectives Logistics Statistical Language Modeling (SLM); Computational Linguistics (CL) Bhiksha Raj 11-761 1
Language and Statistics • Iozmne pqmnzg habfbngyeydh shahmw • Language or not? 11-761 3
Language and Statistics • Iozmne pqmnzg habfbngyeydh shahmw • Language or not? • pair none fair happy happy happy but but but brave brave brave the the the the deserves • Language or not? • happy happy happy pair none but the brave none but the brave none but the brave deserves the fair • Language or not? 11-761 4
Language and Statistics • Composed of mutually agreed upon units • pair happy none fair deserves but the • In a mutually agreed upon arrangement • happy happy happy pair none but the brave none but the brave none but the brave deserves the fair 11-761 6
The linguistic point of view • Language is the outcome of a complex process of lexical semiosis to communicate information • Requiring conceptualization, planning, formation and delivery • Based on a set of implicitly agreed upon units and rules of combination • Phonological, morphological and syntactic rules • Adequately conveying semantics requires following rules • Deep complex theories dating back to Plato.. • Key point: absolutely not random! • Random gobbledygook doesn’t convey any useful meaning 11-761 7
“Mutually agreed upon”? • When a fox is in the bottle where the tweetle beetles battle with their paddles in a puddle on a noodle-eating poodle, THIS is what they call…a tweetle beetle noodle poodle bottled paddled muddled duddled fuddled wuddled fox in socks, sir 11-761 9
“Mutually agreed upon”? • When a fox is in the bottle where the tweetle beetles battle with their paddles in a puddle on a noodle-eating poodle, THIS is what they call…a tweetle beetle noodle poodle bottled paddled muddled duddled fuddled wuddled fox in socks, sir • ’Twas brillig, and the slithy toves Did gyre and gimble in the wabe: All mimsy were the borogoves, And the mome raths outgrabe. 11-761 10
Rules? • It’s like déjà vu all over again. • We made too many wrong mistakes. • I never said most of the things I said. • The future ain’t what it used to be. • Bill Dickey is learning me his experience. • Who? • “How do you like them apples?” • “Soylent Green is people!” • “It’s a fool looks for logic in the chambers of the human heart.” • Slim said, “You hadda, George. I swear you hadda. Come on with me.” He led George into the entrance of the trail and up toward the highway. Curley and Carlson looked after them. And Carlson said, “Now what the hell ya suppose is eatin’ them two guys?” 11-761 11
Language and Statistics • Why are these understandable? • Bill Dickey is learning me his experience • It’s a fool looks for logic • What is eating them two guys • They are built on common usages • Statistically plausible • Statistical approach: The “acceptability” of a sequence of words is related to how frequently it is used • Or how statistically plausible it is 11-761 14
Statistical approach • Based entirely on frequency of occurrence • “Acceptable” word sequences will occur • “Unacceptable” ones won’t • Actually – predicted frequency of occurrence • Not just counting from whatever we have already observed • Will require predicting probability of word sequences we have never encountered • Some sequences we have never seen are nevertheless much more likely to be expressed in a valid sentence than other sequences we have never seen 11-761 15
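The idea of a predicted frequency of occurrence can be sketched with a very small bigram model. This is only an illustration under stated assumptions: the toy corpus, the function names `train_bigram` and `log_prob`, and the choice of add-one smoothing are all hypothetical and are not taken from the slides. The point is that two word sequences that never appear in the training text can still receive very different scores.

```python
import math
from collections import Counter

# Toy training corpus (an assumption for illustration only).
CORPUS = [
    "none but the brave deserves the fair",
    "the brave pair deserves the fair",
    "happy happy happy pair",
    "the happy pair",
]

BOS, EOS = "<s>", "</s>"  # sentence boundary markers


def train_bigram(sentences):
    """Count bigrams and their left contexts over whitespace tokens."""
    context_counts = Counter()
    bigram_counts = Counter()
    vocab = set()
    for sentence in sentences:
        tokens = [BOS] + sentence.split() + [EOS]
        vocab.update(tokens[1:])  # every token that can be predicted
        for prev, word in zip(tokens, tokens[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, word)] += 1
    return context_counts, bigram_counts, vocab


def log_prob(sentence, model):
    """Add-one smoothed bigram log-probability of a sentence."""
    context_counts, bigram_counts, vocab = model
    tokens = [BOS] + sentence.split() + [EOS]
    v = len(vocab) + 1  # one extra slot stands in for unknown words
    total = 0.0
    for prev, word in zip(tokens, tokens[1:]):
        numerator = bigram_counts[(prev, word)] + 1
        denominator = context_counts[prev] + v
        total += math.log(numerator / denominator)
    return total


model = train_bigram(CORPUS)
for text in ("none but the happy pair deserves the fair",
             "fair the deserves pair happy the but none"):
    print(f"{log_prob(text, model):8.3f}  {text}")
```

Neither test sequence occurs in the toy corpus, and both contain exactly the same words. The first is built from word pairs the model has seen, so it scores higher than the shuffled version. Add-one smoothing is used here only because it is the simplest way to give unseen events nonzero probability.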
The problem with the statistical approach char O, o[]; main(l) {for(;~l;O||puts(o)) O=(O[o]=~(l=getchar())?4<(4^l>>5)?l:46:0)?-~O & printf("%02x ",l)*5:!O;} • Will a statistical model know with certainty if the above is valid code? 11-761 16
The linguist’s objection • The statistical approach treats language as a random process • Language is not random • A blind statistical approach ignores agency: • Human (or animal) language treated no differently from other patterned sequences of symbolic units • Is the sequence of sounds your car produces really language? • Language has an agent • Is generally the outcome of a deliberate act of communication • With an entire sequence of conceptualization, composition and communication • Agents intend to communicate • The rules of language affect what unseen word sequences are likely 11-761 17
In this course • We take the perspective that the statistical framework is more appropriate • Can never explicitly catalogue all the rules of language • Particularly when they change all the time • Not utilizing the prescriptive theory of linguistics • Frequency/plausibility of usage is representative of the rules of the language • Statistical characterization of language • Related to descriptive theory of linguistics • But the framework may be informed by linguistics or linguistic intuition • Required in particular to predict occurrences/behaviors of previously unseen patterns 11-761 18
The fiction we maintain • Language comes from a probabilistic source.. • Which randomly produces the text we see • We will focus on written language • We will concede agency • The source is trying to convey a message, not just to produce text • But we will often ignore it(!) 11-761 19
The fiction we maintain • To generate a text, the source randomly chooses a “hidden” message ℎ • The concept to be conveyed • It also randomly produces a “surface form” to convey the message ℎ • The accessible form • Words, sentences, paragraphs, documents.. • We only get to observe the surface form • This is what we must work with • To try to decipher inner message ℎ • Or just to learn all about valid surface forms • Course objectives: Learn all about statistical mechanisms to achieve the above.. 11-761 20
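One way to write this generative story in symbols, as a sketch: assume $s$ denotes the observed surface form (the symbol $s$ and the hat notation are introduced here and do not appear in the slides). The source draws a message $h$ from $P(h)$ and then a surface form from $P(s \mid h)$. Deciphering the message is then an application of Bayes' rule:

```latex
\begin{align}
P(h, s) &= P(h)\, P(s \mid h) \\
P(h \mid s) &= \frac{P(h)\, P(s \mid h)}{P(s)},
  \qquad P(s) = \sum_{h'} P(h')\, P(s \mid h') \\
\hat{h} &= \arg\max_{h} P(h \mid s) = \arg\max_{h} P(h)\, P(s \mid h)
\end{align}
```

The last step drops $P(s)$ because it does not depend on $h$. Learning about valid surface forms alone corresponds to modeling the marginal $P(s)$ directly, without recovering $h$.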
Course Goals • Teaching statistical foundation and techniques for language technologies • Plugging gaping holes in LTI/CS grad student education in probability, statistics and information theory. • “This course is about how to convert linguistic intuition and understanding of language into statistical models.” • “About how to develop statistically sound methodology, but informed by what we know of the domain of language.” 11-761 21
Course philosophy • Socratic Method • Based on discussion and engagement • Participation strongly encouraged (please state your name) • Highly interactive • Highly adaptable • based on how fast we move • Lots of Probability, Statistics, Information theory • not in the abstract, but rather as the need arises • Lectures emphasize intuition, not rigor or detail • background reading will have rigor & detail • Will be done partially using slides, and partially on the board 11-761 22
Course Prerequisites & Mechanics • You need to be able to program, from scratch. • Largest program is O(100) lines • You need to be comfortable with probabilities • Can you derive Bayes’ rule in your sleep? • 11-661 (masters level): no final project • Hand in assignments via Blackboard • Vigorous enforcement of collaboration & disclosure policy 11-761 23