
Natural Language Processing with Python, CS372: Spring 2015



  1. Natural Language Processing with Python. CS372: Spring 2015. Lecture 12: Categorizing and Tagging Words. Jong C. Park, Department of Computer Science, Korea Advanced Institute of Science and Technology

  2. CATEGORIZING AND TAGGING WORDS: Using a Tagger; Tagged Corpora; Mapping Words to Properties Using Python Dictionaries; Automatic Tagging; N-Gram Tagging; Transformation-based Tagging; How to Determine the Category of a Word. 2015-04-09 CS372: NLP with Python 2

  3. Introduction: Questions • What are lexical categories, and how are they used in natural language processing? • What is a good Python data structure for storing words and their categories? • How can we automatically tag each word of a text with its word class?

  4. Mapping Words to Properties Using Python Dictionaries (the dictionary data type): Indexing Lists Versus Dictionaries; Dictionaries in Python; Defining Dictionaries; Default Dictionaries; Incrementally Updating a Dictionary; Complex Keys and Values; Inverting a Dictionary.

  5. Indexing Lists Versus Dictionaries: List • A text is treated in Python as a list of words. • We can look up a particular item by giving its index, e.g., text1[100]. Figure 5-2. List lookup.

  6. Indexing Lists Versus Dictionaries: With frequency distributions, we specify a word and get back a number, e.g., fdist['monstrous']. Figure 5-3. Dictionary lookup. Other names for a dictionary are map, hashmap, hash, and associative array.
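
The contrast between the two kinds of lookup can be sketched with the standard library alone. The toy sentence below is hypothetical data standing in for text1, and collections.Counter stands in for a frequency distribution; neither comes from the slides.

```python
from collections import Counter

# Toy text standing in for text1 (hypothetical data, not the course corpus).
text = "the monstrous whale and the monstrous sea".split()

# List lookup: an integer position maps to a word.
word_at_1 = text[1]

# Dictionary-style lookup: a word maps to a number (its frequency).
fdist = Counter(text)
count = fdist["monstrous"]

print(word_at_1, count)
```

In both cases the square-bracket syntax is the same; only the type of the key differs.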

  7. Indexing Lists Versus Dictionaries: In Figure 5-3, we mapped from names to numbers, rather than from numbers to items as with a list. Table 5-4. Linguistic objects as mappings from keys to values. The mapping is from a “word” to some structured object.

  8. Dictionaries in Python: Python provides a dictionary data type that can be used for mapping between arbitrary types. In the slide's example, pos is defined as an empty dictionary.

  9. Dictionaries in Python: We can employ the keys to retrieve values. Question: • How do we work out the legal keys for a dictionary, given that for lists and strings we can use len() to work out which integers are legal indexes? If the dictionary is not big, we can simply inspect its contents by evaluating the variable pos.
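
A minimal sketch of defining pos, filling it, and reading a value back. The words and tags are illustrative assumptions; the code from the original slide did not survive in this transcript.

```python
# Start with an empty dictionary.
pos = {}

# Add key-value pairs: word -> part-of-speech tag (illustrative entries).
pos["colorless"] = "ADJ"
pos["ideas"] = "N"
pos["sleep"] = "V"
pos["furiously"] = "ADV"

# Use a key to retrieve its value.
tag = pos["ideas"]

# For a small dictionary, printing it is enough to see the legal keys.
print(pos)
```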

  10. Dictionaries in Python: To just find the keys, we can either convert the dictionary to a list or use the dictionary in a context where a list is expected, such as the parameter of sorted() or a for loop.
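
The three ways of getting at the keys might look like this; the dictionary contents are again illustrative assumptions.

```python
pos = {"colorless": "ADJ", "ideas": "N", "sleep": "V", "furiously": "ADV"}

# 1. Convert the dictionary to a list: this yields its keys.
keys_as_list = list(pos)

# 2. Pass the dictionary to sorted(): this yields the keys in sorted order.
sorted_keys = sorted(pos)

# 3. Iterate over the dictionary in a for loop: each item is a key.
words_ending_in_s = []
for word in pos:
    if word.endswith("s"):
        words_ending_in_s.append(word)
```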

  11. Dictionaries in Python: The dictionary methods keys(), values(), and items() allow us to access the keys, values, and key-value pairs as separate lists.
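
A short sketch of the three methods. In Python 3 they return view objects rather than lists, so the sketch wraps them in list() where an actual list is wanted; the dictionary contents are illustrative assumptions.

```python
pos = {"colorless": "ADJ", "ideas": "N", "sleep": "V", "furiously": "ADV"}

keys = list(pos.keys())      # the words
values = list(pos.values())  # the tags
pairs = list(pos.items())    # (word, tag) tuples

# items() is convenient for looping over keys and values together.
for word, tag in sorted(pos.items()):
    print(word, tag)
```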

  12. Dictionaries in Python: When we look something up in a dictionary, we get only one value for each key. However, there is a way of storing multiple values in an entry: we may use a list value, e.g., pos['sleep'] = ['N', 'V']. Cf. the CMU Pronouncing Dictionary, which stores multiple pronunciations for a single word.

  13. Defining Dictionaries: We can use the same key-value pair format to create a dictionary. Dictionary keys must be immutable types, such as strings and tuples.
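
The following sketch shows two equivalent ways of defining a dictionary and what happens when a mutable list is used as a key. The entries and the bigram count are hypothetical.

```python
# Two equivalent ways to define the same dictionary.
pos = {"colorless": "ADJ", "ideas": "N"}
pos_alt = dict(colorless="ADJ", ideas="N")

# A tuple is immutable, so it is a legal key.
bigram_counts = {("the", "whale"): 3}

# A list is mutable, so using it as a key raises a TypeError.
try:
    bad = {["the", "whale"]: 3}
except TypeError as err:
    error_message = str(err)
```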

  14. Default Dictionaries: If we try to access a key that is not in a dictionary, we get an error (a KeyError). Since Python 2.5, a special kind of dictionary called a defaultdict has been available in the collections module. Its default value is produced by a factory such as int, float, str, list, dict, or tuple. When we access a non-existent entry, it is automatically added to the dictionary.
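
A minimal sketch of the difference between a plain dictionary and a defaultdict. The words, the counts, and the choice of "N" as a fallback tag are assumptions for illustration.

```python
from collections import defaultdict

# A plain dictionary raises KeyError for a missing key.
plain = {"ideas": "N"}
try:
    plain["sleep"]
    missing_in_plain = False
except KeyError:
    missing_in_plain = True

# defaultdict(int): missing keys default to 0 and are added on access.
frequency = defaultdict(int)
frequency["colorless"] = 4
zero = frequency["ideas"]

# defaultdict(list): missing keys default to an empty list.
pos = defaultdict(list)
pos["sleep"] = ["N", "V"]
empty = pos["ideas"]

# Any zero-argument callable works as the factory, e.g. a lambda.
tags = defaultdict(lambda: "N")
tags["colorless"] = "ADJ"
guess = tags["blog"]
```

Note that merely reading frequency["ideas"] inserts the key, which is exactly the behaviour the slide describes.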

  15. Default Dictionaries: We can use default dictionaries to deal with hapaxes and low-frequency words. We can replace low-frequency words with a special “out of vocabulary” token.
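
One way to sketch the replacement of rare words, under stated assumptions: the toy sentence, the frequency threshold of 2, and the token name "UNK" are all choices made for this example, not taken from the slides.

```python
from collections import Counter, defaultdict

# Toy text (hypothetical data).
tokens = "the cat sat on the mat and the dog sat on the log".split()
counts = Counter(tokens)

# Every word maps to "UNK" unless we explicitly keep it.
mapping = defaultdict(lambda: "UNK")
for word, n in counts.items():
    if n >= 2:  # assumed threshold: keep words seen at least twice
        mapping[word] = word

# Hapaxes fall through to the default value.
replaced = [mapping[w] for w in tokens]
print(replaced)
```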

  16. Incrementally Updating a Dictionary: Example 5-3. Incrementally updating a dictionary, and sorting by value.

  17. Incrementally Updating a Dictionary

  18. Incrementally Updating a Dictionary: itemgetter(n) returns a function that can be called on some other sequence object to obtain the nth element.
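
A brief sketch of operator.itemgetter, first on a single pair and then as a sort key. The tags and counts are made-up values.

```python
from operator import itemgetter

# itemgetter(1) builds a function that returns element 1 of its argument.
get_count = itemgetter(1)
pair = ("NN", 13)
count = get_count(pair)

# As a sort key, it orders (tag, count) pairs by their count.
pairs = [("NN", 3), ("VB", 7), ("AT", 5)]
by_count = sorted(pairs, key=itemgetter(1), reverse=True)
```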

  19. Incrementally Updating a Dictionary: Useful programming idiom: • We initialize a defaultdict and then use a for loop to update its values.
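
The idiom can be sketched as follows: count tags in a small tagged sample, then sort the counts by value. The tagged words are hypothetical and only loosely resemble Brown-style tags.

```python
from collections import defaultdict
from operator import itemgetter

# Hypothetical tagged sample: (word, tag) pairs.
tagged = [
    ("the", "AT"), ("jury", "NN"), ("said", "VBD"),
    ("the", "AT"), ("election", "NN"), ("the", "AT"),
]

# Initialize a defaultdict, then update it inside a for loop.
counts = defaultdict(int)
for word, tag in tagged:
    counts[tag] += 1

# Sort the (tag, count) pairs by count, highest first.
ranked = sorted(counts.items(), key=itemgetter(1), reverse=True)
top_tags = [tag for tag, n in ranked]
```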

  20. Incrementally Updating a Dictionary: The following example uses the same pattern to create an anagram dictionary. NLTK provides a convenient way of accumulating words through nltk.Index().
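
A standard-library sketch of an anagram dictionary, assuming a small hypothetical word list. Each word is indexed under its letters in sorted order, so anagrams share a key.

```python
from collections import defaultdict

# Hypothetical word list.
words = ["enlist", "listen", "silent", "tinsel", "stone", "notes", "python"]

# Key: the word's letters in sorted order; value: all words with those letters.
anagrams = defaultdict(list)
for word in words:
    key = "".join(sorted(word))
    anagrams[key].append(word)

print(anagrams["eilnst"])
```

The slide notes that nltk.Index() offers a shortcut for this kind of accumulation; it is left out here so that the sketch runs without NLTK installed.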

  21. Complex Keys and Values: Default dictionaries can have complex keys and values.
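
As an illustration of complex keys and values, the sketch below uses a tuple (previous tag, current word) as the key and a nested defaultdict of tag counts as the value. The tagged sample is hypothetical, and this particular key design is an assumption chosen for the example.

```python
from collections import defaultdict

# Hypothetical tagged sample in which "right" gets two different tags.
tagged = [
    ("the", "AT"), ("right", "NN"), ("to", "TO"), ("vote", "VB"),
    ("the", "AT"), ("right", "JJ"), ("answer", "NN"),
]

# Key: (previous tag, current word). Value: a dictionary of tag -> count.
pos = defaultdict(lambda: defaultdict(int))
for (w1, t1), (w2, t2) in zip(tagged, tagged[1:]):
    pos[(t1, w2)][t2] += 1

# Tags observed for "right" when the previous tag was AT.
right_after_at = dict(pos[("AT", "right")])
```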

  22. Inverting a Dictionary: Dictionaries support efficient lookup. • However, finding a key given a value is slower and more cumbersome. • If we expect to do this kind of “reverse lookup” often, it helps to construct a dictionary that maps values to keys.

  23. Inverting a Dictionary: Examples of reverse lookup
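
A sketch of both approaches to reverse lookup, with illustrative entries. Because two words share the tag "N", the inverted dictionary maps each value to a list of keys rather than to a single key.

```python
from collections import defaultdict

pos = {"colorless": "ADJ", "ideas": "N", "sleep": "V",
       "furiously": "ADV", "cats": "N"}

# Slow reverse lookup: scan every entry for the value we want.
nouns_by_scan = [word for word, tag in pos.items() if tag == "N"]

# Build the inverted dictionary once: tag -> list of words.
inverted = defaultdict(list)
for word, tag in pos.items():
    inverted[tag].append(word)

# Reverse lookup is now an ordinary dictionary access.
nouns = inverted["N"]
```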

  24. Inverting a Dictionary: Table 5-5. Python’s dictionary methods.
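
The table itself did not survive in this transcript. The sketch below is not a reconstruction of it; it only exercises a few standard dictionary operations (membership test, update(), building a dictionary from pairs) with illustrative entries.

```python
d = {}
d["sleep"] = "V"

# Membership test on keys.
has_sleep = "sleep" in d
has_ideas = "ideas" in d

# Merge another dictionary into this one.
d.update({"ideas": "N", "colorless": "ADJ"})

# Build a dictionary from a list of (key, value) pairs.
from_pairs = dict([("furiously", "ADV"), ("green", "ADJ")])

all_keys = sorted(d)
```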

  25. Automatic Tagging: The Default Tagger; The Regular Expression Tagger; The Lookup Tagger; Evaluation. Setup:
      >>> from nltk.corpus import brown
      >>> brown_tagged_sents = brown.tagged_sents(categories='news')
      >>> brown_sents = brown.sents(categories='news')

  26. The Default Tagger: The simplest possible tagger assigns the same tag to each token. • It establishes an important baseline. • In order to get the best result, we tag each word with the most likely tag. Pros?
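
A standard-library sketch of the idea: find the most frequent tag in some tagged data and assign it to every token. The tagged sentences and the helper name default_tag are hypothetical; NLTK's own nltk.DefaultTagger class plays this role in the course material but is not used here.

```python
from collections import Counter

# Hypothetical tagged sentences.
tagged_sents = [
    [("the", "AT"), ("jury", "NN"), ("said", "VBD")],
    [("the", "AT"), ("election", "NN"), ("produced", "VBD"), ("evidence", "NN")],
]

# Find the single most likely tag across all tokens.
tag_counts = Counter(tag for sent in tagged_sents for _, tag in sent)
most_likely_tag = tag_counts.most_common(1)[0][0]

def default_tag(tokens, tag=most_likely_tag):
    """Assign the same tag to every token (hypothetical helper)."""
    return [(token, tag) for token in tokens]

result = default_tag(["I", "like", "blogs"])
```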

  27. The Regular Expression Tagger: Assign tags to tokens on the basis of matching patterns.
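
A sketch of pattern-based tagging using only the re module. The suffix patterns and tags are illustrative assumptions, the helper regexp_tag is hypothetical, and the first matching pattern wins. NLTK offers nltk.RegexpTagger for this purpose, which is not used here.

```python
import re

# Ordered (pattern, tag) pairs; earlier patterns take priority.
patterns = [
    (r".*ing$", "VBG"),                # gerunds
    (r".*ed$", "VBD"),                 # simple past
    (r".*es$", "VBZ"),                 # 3rd person singular present
    (r".*ould$", "MD"),                # modals
    (r".*'s$", "NN$"),                 # possessive nouns
    (r".*s$", "NNS"),                  # plural nouns
    (r"^-?[0-9]+(\.[0-9]+)?$", "CD"),  # cardinal numbers
    (r".*", "NN"),                     # fallback: noun
]
compiled = [(re.compile(pattern), tag) for pattern, tag in patterns]

def regexp_tag(tokens):
    """Tag each token with the tag of the first pattern that matches it."""
    tagged = []
    for token in tokens:
        for regex, tag in compiled:
            if regex.match(token):
                tagged.append((token, tag))
                break
    return tagged

result = regexp_tag(["running", "walked", "goes", "should", "cats", "42", "dog"])
```

The final catch-all pattern means every token receives some tag, so this tagger subsumes the default tagger.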

  28. The Lookup Tagger: Find the hundred most frequent words and store their most likely tag. • Use it as the model for a “lookup tagger”.
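
A toy version of the lookup tagger, assuming a hypothetical tagged sample and a model size of 2 instead of 100. The helper lookup_tag and its backoff parameter are names invented for this sketch; in NLTK the same idea is usually expressed with nltk.UnigramTagger and a model dictionary.

```python
from collections import Counter, defaultdict

# Hypothetical tagged sample.
tagged_words = [
    ("the", "AT"), ("dog", "NN"), ("can", "MD"), ("run", "VB"),
    ("the", "AT"), ("can", "NN"), ("the", "AT"), ("can", "MD"), ("the", "AT"),
]

# Word frequencies, and per-word tag frequencies.
word_freq = Counter(word for word, _ in tagged_words)
tag_freq = defaultdict(Counter)
for word, tag in tagged_words:
    tag_freq[word][tag] += 1

# Model: the N most frequent words, each mapped to its most likely tag.
N = 2  # the slide uses 100; 2 suits this toy sample
model = {word: tag_freq[word].most_common(1)[0][0]
         for word, _ in word_freq.most_common(N)}

def lookup_tag(tokens, backoff=None):
    """Tag known words from the model; others get the backoff tag."""
    return [(token, model.get(token, backoff)) for token in tokens]

no_backoff = lookup_tag(["the", "can", "barks"])
with_backoff = lookup_tag(["the", "can", "barks"], backoff="NN")
```

Words outside the model receive None unless a backoff tag is supplied, which is where a default tagger would slot in.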

  29. The Lookup Tagger: Example 5-4. Lookup tagger performance with varying model size.

  30. The Lookup Tagger

  31. Evaluation: We evaluate the performance of a tagger relative to the tags a human expert would assign. • Since we usually don’t have access to an expert and impartial human judge, we make do instead with gold standard test data. • The tagger is regarded as being correct if the tag it guesses for a given word is the same as the gold standard tag.
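
The correctness criterion above amounts to token-level accuracy, which can be sketched as follows. The function name accuracy and the sample data are assumptions made for this example.

```python
def accuracy(gold_sents, predicted_sents):
    """Fraction of tokens whose predicted tag equals the gold standard tag."""
    total = 0
    correct = 0
    for gold_sent, pred_sent in zip(gold_sents, predicted_sents):
        for (_, gold_tag), (_, pred_tag) in zip(gold_sent, pred_sent):
            total += 1
            if gold_tag == pred_tag:
                correct += 1
    return correct / total if total else 0.0

# Hypothetical gold standard and the output of a tag-everything-as-NN baseline.
gold = [[("the", "AT"), ("dog", "NN"), ("barked", "VBD")]]
predicted = [[("the", "NN"), ("dog", "NN"), ("barked", "NN")]]

score = accuracy(gold, predicted)
```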

  32. Summary: Mapping Words to Properties Using Python Dictionaries • Indexing Lists Versus Dictionaries • Dictionaries in Python • Defining Dictionaries • Default Dictionaries • Incrementally Updating a Dictionary • Complex Keys and Values • Inverting a Dictionary

  33. Summary: Automatic Tagging • The Default Tagger • The Regular Expression Tagger • The Lookup Tagger • Evaluation

  34. Project: First Presentation: 30 April, 2015 (in class). Prepare a 5-minute presentation for your term project (approximately 7 slides). • The project must have a clear I/O. • Explain the measure of the quality of the output. • Give a measure of how good your system is (together with a prediction of your system’s performance against this measure).
