
Language and Computers: Classifying Documents



  1. Language and Computers: Classifying Documents
     Based on Dickinson, Brew, & Meurers (2013)
     Outline: Introduction; Language Identification; Machine Learning; Supervised Learning; Unsupervised Learning; Features & Evidence; Measuring success; Document classifiers; Authorship Attribution; Author Identification; Stylometry; Lexical Markers; Lexical Markers: Function Words; Plagiarism Detection; What is plagiarism?; Plagiarism Detection; References

  2. Document classification
     Document classification = sort documents into user-defined classes
     ◮ e.g., email sent to the New York Times could be classified into letters to the editor, new subscription requests, complaints about undelivered papers, job inquiries, proposals to buy ad pages, and others
     Consider the case of sentiment analysis:
     ◮ automate the detection of positive and negative statements in documents
     ◮ would allow one to track opinions about policies, products, & positions

  3. Sentiment Analysis: Example #1
     For the movie Pearl Harbor:
     Ridiculous movie. Worst movie I’ve seen in my entire life [Koen D. on metacritic]

  4. Sentiment Analysis: Example #2
     One of my favorite movies. It’s a bit on the lengthy side, sure. But its made up of a really great cast which, for me, just brings it all together. [Erica H., again on metacritic]

  5. Sentiment Analysis: Example #3
     The Japanese sneak attack on Pearl Harbor that brought the United States into World War II has inspired a splendid movie, full of vivid performances and unforgettable scenes, a movie that uses the coming of war as a backdrop for individual stories of love, ambition, heroism and betrayal. The name of that movie is “From Here to Eternity.” (First lines of A. O. Scott’s review of “Pearl Harbor”, New York Times, May 25, 2001)

  6. Sentiment Analysis: Example #4
     The film is not as painful as a blow to the head, but it will cost you up to $10, and it takes three hours. The first hour and forty-five minutes establishes one of the most banal love triangles ever put to film. Childhood friends Rafe McCawley and Danny Walker (Ben Affleck and Josh Hartnett) both find themselves in love with the same woman, Evelyn Johnson (Kate Beckinsale). [Heather Feher, from www.filmstew.com]

  7. Some document classification tasks
     ◮ Sentiment analysis: what is the attitude of the text?
     ◮ Authorship attribution: who wrote a text?
       ◮ Author Identification (who penned The Federalist Papers?)
       ◮ Forensic Evidence (who wrote the note?)
       ◮ Plagiarism Detection (who did the work?)
     ◮ Spam filtering: is this email junk or not?
     ◮ Language identification: which language is this document in?

  8. Language identification
     Let’s consider this relatively simple task first . . .
     ◮ Sometimes we can tell the language by
       ◮ which characters are used,
         ◮ e.g., Liebe Grüße uses ü and ß → German
       ◮ which character encoding is being used
         ◮ e.g., ISO 8859-8 is used to encode Hebrew characters → text is written in Hebrew
     ◮ But how can you tell if you are reading English vs. Japanese transliterated into the Roman alphabet? Or Swedish vs. Norwegian?
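The character cue on this slide can be sketched in a few lines of Python. This is only an illustration: the cue table `CHARACTER_CUES` and the function name `guess_by_characters` are hypothetical, and the cues are weak evidence at best (ü, for example, also occurs in other languages such as Turkish).

```python
# Hypothetical cue table (an assumption for illustration, not from the slides):
# characters whose presence hints at a particular language.
CHARACTER_CUES = {
    "German": set("äöüß"),
    "Hebrew": {chr(code) for code in range(0x05D0, 0x05EB)},  # Hebrew letters alef..tav
}

def guess_by_characters(text):
    """Return the languages whose cue characters occur somewhere in the text."""
    chars = set(text.lower())
    return sorted(lang for lang, cues in CHARACTER_CUES.items() if chars & cues)

print(guess_by_characters("Liebe Grüße"))  # ['German']
```

Text written only in unaccented Roman letters matches no cue at all, which is exactly the case the slide's closing question raises.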

  9. Language identification: N-grams
     ◮ One simple technique for identifying languages is to use n-grams = stretch of n tokens (i.e., letters or words):
       ◮ Go through texts for which we know which language they are written in and store the n-grams of letters found, for a certain n.
       ◮ e.g., extracting the trigrams (3-grams) for the last sentence we’d get: "Go ", "o t", " th", "thr", "hro", "rou", . . .
     ◮ This provides us with an indication of what sequences of letters are possible in a given language (and how frequently they occur).
       ◮ e.g., thr is not a likely Japanese string.
     ◮ How do we make this more concrete?
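The trigram extraction described on this slide can be sketched as a sliding window over the characters of a text. The function name `char_ngrams` is an assumption for illustration; spaces are treated as ordinary characters, as in the slide's example.

```python
def char_ngrams(text, n=3):
    """Return every stretch of n consecutive characters, spaces included."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(char_ngrams("Go through texts")[:6])
# ['Go ', 'o t', ' th', 'thr', 'hro', 'rou']
```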

  10. Language identification: Frequency distributions
      ◮ Store a frequency distribution of trigrams, i.e., how many times each n-gram appears for a given language.

        n-gram   English   Japanese
        aba           12         54
        ace           95         10
        act           45          1
        arc            8          0
        . . .

      ◮ Now, apply the frequency distribution to a new text and use it to help calculate the probability of the text being a particular language.
        ◮ Compare each n-gram to see if it is more likely to be English or Japanese.
        ◮ See which language won the most comparisons.
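A minimal sketch of the comparison step, using only the four rows shown in the table. The names `COUNTS` and `vote_language` are hypothetical. The sketch compares raw counts, which assumes the two training collections are of similar size; trigrams missing from the table and tied trigrams cast no vote.

```python
# Trigram counts copied from the table on the slide; a real table would have many more rows.
COUNTS = {
    "aba": {"English": 12, "Japanese": 54},
    "ace": {"English": 95, "Japanese": 10},
    "act": {"English": 45, "Japanese": 1},
    "arc": {"English": 8, "Japanese": 0},
}

def vote_language(text, counts=COUNTS, n=3):
    """Let each known n-gram vote for the language in which it is more frequent."""
    wins = {}
    for i in range(len(text) - n + 1):
        row = counts.get(text[i:i + n])
        if row is None:
            continue  # n-gram not in the table: no evidence either way
        best = max(row.values())
        winners = [lang for lang, count in row.items() if count == best]
        if len(winners) == 1:  # a tie between languages casts no vote
            wins[winners[0]] = wins.get(winners[0], 0) + 1
    # The language that won the most comparisons, or None if nothing matched.
    return max(wins, key=wins.get) if wins else None

print(vote_language("actor face"))  # English ("act" and "ace" both favor English)
print(vote_language("abaca"))       # Japanese ("aba" favors Japanese)
```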

  11. Machine Learning
      Document classification is an example of a computer science activity called machine learning, which is itself part of the subfield of artificial intelligence.
      ◮ We have access to a training set of examples, from which we will learn
        ◮ e.g., articles from the on-line version of last month’s New York Times
      ◮ Long-term goal: use what we have learned in order to build a robust system that can process future examples of the same kind as in the training set
        ◮ e.g., articles that are going to appear in next month’s New York Times
      ◮ As an approximation, we use a separate test set of examples to stand in for the unavailable future ones
        ◮ e.g., this month’s New York Times articles
      ◮ Since the test set is separate from the training set, the system will not have seen them.
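The training set / test set idea can be illustrated with a toy sketch. Everything here is hypothetical: the tiny data sets, the word-counting "learner", and the names `train`, `classify`, and `accuracy` are assumptions for illustration, not the method used in the slides. The point is only that the system is scored on examples it never saw during training.

```python
from collections import Counter, defaultdict

def train(train_set):
    """Count how often each word occurs under each label in the training set."""
    word_counts = defaultdict(Counter)
    for text, label in train_set:
        for word in text.lower().split():
            word_counts[word][label] += 1
    return word_counts

def classify(word_counts, text):
    """Pick the label whose training words overlap most with the text."""
    votes = Counter()
    for word in text.lower().split():
        votes.update(word_counts.get(word, {}))
    return votes.most_common(1)[0][0] if votes else None

def accuracy(word_counts, test_set):
    """Fraction of held-out examples that receive the correct label."""
    correct = sum(classify(word_counts, text) == label for text, label in test_set)
    return correct / len(test_set)

# Hypothetical toy data: the test set is kept separate from the training set.
train_set = [("great cast", "positive"), ("worst movie", "negative")]
test_set = [("a great film", "positive"), ("the worst plot", "negative"), ("dull", "negative")]

model = train(train_set)
print(accuracy(model, test_set))
```

In this toy run the word "dull" never appeared in training, so that example gets no label and two of the three test examples are classified correctly.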
