Introduction to Artificial Intelligence: Natural Language Processing


  1. Introduction to Artificial Intelligence: Natural Language Processing
     Janyl Jumadinova, November 14, 2016
     Credit: NLP Stanford

  2. Question Answering: IBM’s Watson

  3. Information Extraction

  4. Sentiment Extraction (Source: Washington Post)

  5. Machine Translation

  6. Language Technology

  8. Ambiguity makes NLP hard
     ◮ Teacher Strikes Idle Kids
     ◮ Red Tape Holds Up New Bridges
     ◮ Juvenile Court to Try Shooting Defendant
     ◮ Local High School Dropouts Cut in Half

  9. Other NLP Difficulties

  11. Progress
      ◮ What tools do we need?
      ◮ Knowledge about language
      ◮ Knowledge about the world
      ◮ A way to combine knowledge sources
      ◮ How we generally do this:
      ◮ Probabilistic models built from language data
      ◮ P(“maison” → “house”) → high
      ◮ P(“L’avocat général” → “the general avocado”) → low
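As a rough illustration of "probabilistic models built from language data", the sketch below estimates translation probabilities by relative frequency. The observed phrase pairs are invented toy data, and `translation_prob` is a hypothetical helper, not a method taken from the slides.

```python
from collections import Counter

# Invented toy "parallel data": (French phrase, English phrase) pairs.
observed = [
    ("maison", "house"),
    ("maison", "house"),
    ("maison", "house"),
    ("maison", "home"),
    ("l'avocat général", "the attorney general"),
    ("l'avocat général", "the attorney general"),
]

pair_counts = Counter(observed)
source_counts = Counter(src for src, _ in observed)

def translation_prob(src, tgt):
    """Relative-frequency estimate of P(src -> tgt) from the toy data."""
    if source_counts[src] == 0:
        return 0.0
    return pair_counts[(src, tgt)] / source_counts[src]

p_house = translation_prob("maison", "house")
p_avocado = translation_prob("l'avocat général", "the general avocado")
print(p_house, p_avocado)
```

A translation that is seen often gets a high estimate, while one that never occurs in the data gets zero. Real systems smooth these estimates so that unseen pairs are unlikely rather than impossible.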

  13. Basic Text Processing: Regular Expressions
      ◮ A formal language for specifying text strings.
      ◮ How can we search for any of these?
        woodchuck, woodchucks, Woodchuck, Woodchucks
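One possible answer to the woodchuck question, sketched with Python's `re` module. The sample sentence is invented for illustration.

```python
import re

# [wW] accepts either case for the first letter; s? makes the plural "s" optional.
WOODCHUCK = re.compile(r"[wW]oodchucks?")

text = "Woodchucks dig burrows. A woodchuck is a rodent. Two woodchucks met a Woodchuck."
found = WOODCHUCK.findall(text)
print(found)
```

A single pattern covers all four spellings, which is the point of treating regular expressions as a language for describing sets of strings.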

  14. Regular Expressions: Disjunctions

  15. Regular Expressions: Negation in Disjunction
      ◮ Negations: [^Ss]
      ◮ Caret means negation only when first in []
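A small sketch of the two roles of the caret inside brackets, using Python's `re` module. The test strings are made up for the example.

```python
import re

# Caret first inside []: negation. Match any single character that is not S or s.
not_s = re.findall(r"[^Ss]", "Sass")

# Caret not first inside []: a literal caret. Match either "e" or "^".
e_or_caret = re.findall(r"[e^]", "see a^b")

print(not_s, e_or_caret)
```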

  16. Regular Expressions: More Disjunction
      ◮ Woodchuck is another name for groundhog!
      ◮ The pipe | for disjunction
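The pipe can be sketched in Python's `re` module as follows. The sentences are invented, and the second pattern simply combines the pipe with bracket disjunction.

```python
import re

# The pipe separates whole alternatives.
either = re.compile(r"groundhog|woodchuck")
found = either.findall("The groundhog, also called a woodchuck, saw its shadow.")

# Brackets and the pipe can be combined to allow a capitalized first letter.
either_case = re.compile(r"[gG]roundhog|[wW]oodchuck")
found_case = either_case.findall("Woodchuck or Groundhog?")

print(found, found_case)
```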

  17. Regular Expressions: ? * + .
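The slide body is not preserved in this transcript, so the patterns below are illustrative choices rather than the slide's own examples. The sketch shows what each of the four operators does, with `matches` as a hypothetical helper built on `re.fullmatch`.

```python
import re

def matches(pattern, s):
    """True if the whole string s matches the pattern."""
    return re.fullmatch(pattern, s) is not None

# ?  the previous character is optional
print(matches(r"colou?r", "color"), matches(r"colou?r", "colour"))

# *  zero or more of the previous character
print(matches(r"oo*h!", "oh!"), matches(r"oo*h!", "oooh!"))

# +  one or more of the previous character
print(matches(r"o+h!", "ooh!"), matches(r"o+h!", "h!"))

# .  any single character
print(matches(r"beg.n", "begin"), matches(r"beg.n", "begn"))
```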

  18. Regular Expressions: Example
      Find all instances of the word “the” in a text
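A sketch of how this exercise usually unfolds, in Python's `re` module. The sentence is invented, and the use of `\b` word boundaries is one option among several for excluding matches inside longer words.

```python
import re

text = "The other day the cat sat there."

# Attempt 1: misses the capitalized "The" and also fires inside "other" and "there".
naive = re.findall(r"the", text)

# Attempt 2: catches "The" but still fires inside longer words.
any_case = re.findall(r"[tT]he", text)

# Attempt 3: word boundaries keep only the standalone word.
whole_word = re.findall(r"\b[tT]he\b", text)

print(len(naive), len(any_case), whole_word)
```

The first two attempts show the two kinds of error a pattern can make: missing strings it should match (false negatives) and matching strings it should not (false positives).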

  19. Basic Text Processing: Word tokenization
      Every NLP task needs to do text normalization:
      1. Segmenting/tokenizing words in running text
      2. Normalizing word formats
      3. Segmenting sentences in running text

  20. How Many Words?

  21. Simple Tokenization in UNIX
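The UNIX commands from this slide are not preserved in the transcript. As a stand-in, here is a Python sketch of the same general idea: split the text on anything that is not a letter, then count how often each word occurs. The sample text and the `tokenize` helper are assumptions made for the example.

```python
import re
from collections import Counter

def tokenize(text):
    """Treat every maximal run of ASCII letters as one token."""
    return re.findall(r"[A-Za-z]+", text)

text = "The cat sat on the mat. The mat was flat."
tokens = tokenize(text.lower())
counts = Counter(tokens)

num_tokens = len(tokens)  # running words, counting repeats
num_types = len(counts)   # distinct words

print(num_tokens, num_types)
print(counts.most_common(2))
```

The difference between `num_tokens` and `num_types` is one way to make the "How many words?" question precise: the answer depends on whether repeated words are counted once or every time.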

  22. Basic Text Processing: Normalization

  24. Issues in Tokenization
      ◮ Finland’s capital → Finland? Finlands? Finland’s?
      ◮ what’re, I’m, isn’t → what are, I am, is not
      ◮ Hewlett-Packard → Hewlett Packard?
      ◮ state-of-the-art → state of the art?
      ◮ Lowercase → lower-case? lowercase? lower case?
      ◮ San Francisco → one token or two?
      ◮ Language issues: French, German, Japanese, Chinese, ...
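Each of the issues above forces a design decision. The sketch below makes one set of choices (expand contractions, drop possessive 's, split on hyphens, lowercase everything) purely for illustration. The `CONTRACTIONS` table and the `normalize` function are hypothetical, and other choices are equally defensible.

```python
import re

# Hypothetical, deliberately tiny lookup table.
CONTRACTIONS = {"what're": "what are", "i'm": "i am", "isn't": "is not"}

def normalize(text):
    """Lowercase, expand contractions, drop possessive 's, split hyphens."""
    text = text.lower().replace("’", "'")
    for short, full in CONTRACTIONS.items():
        text = re.sub(r"\b" + re.escape(short) + r"\b", full, text)
    text = re.sub(r"'s\b", "", text)   # Finland's -> finland
    text = text.replace("-", " ")      # state-of-the-art -> state of the art
    return re.findall(r"[a-z]+", text)

tokens = normalize("I'm sure Finland’s capital isn't state-of-the-art.")
print(tokens)
```

This sketch does nothing about multiword names such as San Francisco, and its letter-based token pattern would not carry over to languages such as Chinese or Japanese, where words are not separated by spaces.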

  25. Basic Text Processing: Stemming

  26. Stemming
      ◮ Reduce terms to their stems in information retrieval
      ◮ Stemming is crude chopping of affixes; it is language dependent
      ◮ Example: automate(s), automatic, automation all reduced to automat.
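To make "crude chopping of affixes" concrete, here is a toy suffix stripper. It is not Porter's algorithm: the suffix list is a hypothetical one picked so that the slide's example works, and the minimum stem length of three characters is an arbitrary assumption.

```python
# Hypothetical suffix list, checked in order (longer suffixes first).
SUFFIXES = ("ion", "es", "ic", "e", "s")

def crude_stem(word):
    """Chop the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["automate", "automates", "automatic", "automation"]:
    print(w, "->", crude_stem(w))

# The same rules also mangle words they were never meant for.
print("nation", "->", crude_stem("nation"))
```

The last line shows why the slide calls stemming crude: the rules know nothing about meaning, so "nation" loses its ending just as "automation" does.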

  27. Porter’s Algorithm
      The most common English stemmer.

  30. Sentence Segmentation
      ◮ !, ? are relatively unambiguous
      ◮ Period “.” is quite ambiguous
        - Sentence boundary
        - Abbreviations like Inc. or Dr.
        - Numbers like .02 or 4.3
      ◮ Build a binary classifier
        - Classifiers: hand-written rules, regular expressions, or machine learning

  31. Determining if a word is end-of-sentence: a Decision Tree
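The tree diagram from this slide is not preserved in the transcript. The sketch below shows what a small hand-written decision tree for the end-of-sentence question might look like. The branch order, the `ABBREVIATIONS` set, and both function names are assumptions made for illustration.

```python
# Hypothetical abbreviation list; a real system would need a much longer one.
ABBREVIATIONS = {"inc.", "dr.", "mr.", "mrs.", "etc."}

def is_end_of_sentence(token):
    """Binary decision for one whitespace-delimited token."""
    if token.endswith(("!", "?")):
        return True       # relatively unambiguous
    if not token.endswith("."):
        return False      # also covers 4.3 and .02, where the period is inside the token
    if token.lower() in ABBREVIATIONS:
        return False      # the period belongs to the abbreviation
    return True

def split_sentences(text):
    """Group tokens into sentences using the decision above."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if is_end_of_sentence(token):
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences

text = "Dr. Smith paid 4.3 million to Acme Inc. last year. Was it worth it? Yes!"
sentences = split_sentences(text)
for s in sentences:
    print(s)
```

A known weakness of these rules is that an abbreviation which really does end a sentence is never treated as a boundary. Cases like that are one reason to replace hand-written rules with a classifier trained on labeled data.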
