
Information Extraction, Pedro Szekely



  1. Information Extraction. Pedro Szekely, Information Sciences Institute, USC Viterbi School of Engineering

  2. Agenda: Information extraction classification; Text extraction techniques; Storing extractions in knowledge graphs; myDIG demo; Summary

  3. Document Features: Text paragraphs without formatting; Grammatical sentences plus some formatting & links; Non-grammatical snippets, rich formatting & links; Tables; Charts. Example text: "Astro Teller is the CEO and co-founder of BodyMedia. Astro holds a Ph.D. in Artificial Intelligence from Carnegie Mellon University, where he was inducted as a national Hertz fellow. His M.S. in symbolic and heuristic computation and B.S. in computer science are from Stanford University. His work in science, literature and business has appeared in international media from the New York Times to CNN to NPR."

  4. Scope: Genre specific (e.g., forums); Web site specific; Wide, non-specific. Kejriwal, Szekely

  5. Pattern Complexity (e.g., word patterns)
     Closed set, U.S. states: "He was born in Alabama …"; "The big Wyoming sky…"
     Regular set, U.S. phone numbers: "Phone: (413) 545-1323"; "The CALD main office can be reached at 412-268-1299"
     Complex pattern, U.S. postal addresses: "University of Arkansas, P.O. Box 140, Hope, AR 71802"; "Headquarters: 1128 Main Street, 4th Floor, Cincinnati, Ohio 45210"
     Ambiguous patterns, needing context and many sources of evidence, person names: "…was among the six houses sold by Hope Feldman that year."; "Pawel Opalinski, Software Engineer at WhizBang Labs."
     Example text: “YOU don't wanna miss out on ME :) Perfect lil booty Green eyes Long curly black hair Im a Irish, Armenian and Filipino mixed princess :) ❤ Kim ❤ 7 ○ 7~7two7~7four77 ❤ HH 80 roses ❤ Hour 120 roses ❤ 15 mins 60 roses”
     Courtesy of Andrew McCallum

  6. Small amount of relevant content; irrelevant content very similar to relevant content

  7. Practical Considerations. How good (precision/recall) is necessary? High precision when showing extractions to users; high recall when used for ranking results. How long does it take to construct? Minutes, hours, days, months. What expertise do I need? None (domain expertise), patience (annotation), simple scripting, machine learning guru. What tools can I use? Many …
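
The precision/recall question on this slide can be made concrete with a small sketch. The function name `precision_recall` and the sample values are hypothetical; only the standard definitions of precision and recall are assumed.

```python
def precision_recall(extracted, gold):
    """Return (precision, recall) of a set of extractions against a gold set."""
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)
    # Precision: fraction of extracted values that are correct.
    precision = true_positives / len(extracted) if extracted else 0.0
    # Recall: fraction of gold values that were extracted.
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical data: 3 of 4 extractions are correct, 3 of 6 gold values found.
gold = {"Alabama", "Wyoming", "Ohio", "Arkansas", "Texas", "Utah"}
extracted = {"Alabama", "Wyoming", "Ohio", "Hope"}
p, r = precision_recall(extracted, gold)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.75 recall=0.50
```

An extractor like this one would suit a user-facing display better than a ranking component, since it is more precise than it is complete.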

  8. Information Extraction Process: Segmentation, Data Extraction

  9. Information Extraction Process: Segmentation, Data Extraction

  10. Information Extraction Process: Segmentation, Data Extraction. Example: Name: Legacy Ventures Intl, Inc.; Stock: LGYV; Date: 2017-07-14; Market Cap: 391,030

  11. Segmentation: Semi-structured extraction; Table extraction; Main content identification; Custom regular expressions

  12. Segmentation: Semi-structured extraction; Table extraction; Main content identification; Custom regular expressions. Output: text segments

  13. Text Extraction Techniques: Glossary; Regular expressions; Natural language rules; Named entity recognition; Sequence labeling (Conditional Random Fields)

  14. Glossary Extraction

  15. Glossary Extraction: simple list of words or phrases to extract. Challenges: ambiguity (Charlotte is a name of a person and a city); colloquial expressions (“Asia Broadband, Inc.” vs “Asia Broadband”). Research: improving precision of glossary extractions using context; creating/extending glossaries automatically
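
A minimal sketch of glossary extraction, assuming the glossary is a plain Python list and matching is case-insensitive on whole words or phrases. The function names and the sample sentence are hypothetical, not part of the slides.

```python
import re


def build_glossary_pattern(glossary):
    """Compile one regex that matches any glossary entry as a whole phrase."""
    # Longest entries first so "Asia Broadband, Inc." wins over "Asia Broadband".
    entries = sorted(glossary, key=len, reverse=True)
    alternatives = "|".join(re.escape(entry) for entry in entries)
    # Lookarounds keep matches from starting or ending inside a longer word.
    return re.compile(r"(?<!\w)(?:" + alternatives + r")(?!\w)", re.IGNORECASE)


def extract_glossary(text, glossary):
    """Return (start, end, matched_text) for every glossary hit in text."""
    pattern = build_glossary_pattern(glossary)
    return [(m.start(), m.end(), m.group(0)) for m in pattern.finditer(text)]


glossary = ["Charlotte", "Asia Broadband, Inc.", "Asia Broadband"]
text = "Charlotte works at Asia Broadband, Inc. in Charlotte."
hits = extract_glossary(text, glossary)
print([matched for _, _, matched in hits])
```

Both "Charlotte" hits are returned even though one is a person and the other a city, which is the ambiguity challenge the slide names: a glossary alone cannot tell them apart without context.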

  16. Regex Extraction

  17. Extraction Using Regular Expressions. Too difficult for non-programmers. Regex for North American phone numbers: ^(?:(?:\+?1\s*(?:[.-]\s*)?)?(?:\(\s*([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9])\s*\)|([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9]))\s*(?:[.-]\s*)?)?([2-9]1[02-9]|[2-9][02-9]1|[2-9][02-9]{2})\s*(?:[.-]\s*)?([0-9]{4})(?:\s*(?:#|x\.?|ext\.?|extension)\s*(\d+))?$ Brittle and difficult to adapt to unusual domains: unusual nomenclature and short-hands, obfuscation
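
As a rough illustration, the phone-number regex from this slide can be used from Python as follows. The wrapper `parse_phone` is a hypothetical helper; the pattern is anchored with `^` and `$`, so it validates a candidate string rather than searching inside longer text.

```python
import re

# The slide's regex, split across adjacent raw strings for readability.
PHONE_RE = re.compile(
    r"^(?:(?:\+?1\s*(?:[.-]\s*)?)?"
    r"(?:\(\s*([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9])\s*\)"
    r"|([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9]))\s*(?:[.-]\s*)?)?"
    r"([2-9]1[02-9]|[2-9][02-9]1|[2-9][02-9]{2})\s*(?:[.-]\s*)?"
    r"([0-9]{4})"
    r"(?:\s*(?:#|x\.?|ext\.?|extension)\s*(\d+))?$"
)


def parse_phone(text):
    """Return (area, exchange, line, extension) or None if text is not a match."""
    m = PHONE_RE.match(text.strip())
    if m is None:
        return None
    # The area code lands in group 1 (parenthesized) or group 2 (bare).
    area = m.group(1) or m.group(2)
    return area, m.group(3), m.group(4), m.group(5)


print(parse_phone("(413) 545-1323"))
print(parse_phone("412-268-1299"))
print(parse_phone("7 ○ 7~7two7~7four77"))  # obfuscated number: no match
```

The last call shows the brittleness the slide mentions: an obfuscated number that a human reads easily is rejected outright.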

  18. NLP Rule-Based Extraction

  19. NLP Rule-Based Extraction: Tokenization, Pattern Matching

  20. Tokenization. Examples: My name is Pedro; 310-822-1511 (kept as one token, or split as 310 - 822 - 1511); Candy is here
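
A minimal tokenizer sketch, assuming tokens are runs of word characters and that every other non-space symbol (punctuation or emoji) becomes its own token. `TOKEN_RE` and `tokenize` are hypothetical names, and this is not the tokenizer used in the demo.

```python
import re

# One token per run of word characters, or per single non-space symbol,
# so punctuation marks and emojis each become their own token.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    """Split text on white-space, punctuation and emojis."""
    return TOKEN_RE.findall(text)


print(tokenize("My name is Pedro"))        # ['My', 'name', 'is', 'Pedro']
print("310-822-1511".split())              # white-space only: ['310-822-1511']
print(tokenize("310-822-1511"))            # ['310', '-', '822', '-', '1511']
print(tokenize("❤ Kim ❤ Hour 120 roses"))
```

Whether a phone number stays as one token or is split into five changes which patterns can match it later, so the tokenization choice and the pattern language have to be designed together.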

  21. Token Properties. Surface properties: literal, type, shape, capitalization, length, prefix, suffix, minimum, maximum. Language properties: part of speech tag, lemma, dependency
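
The surface properties listed on this slide can be computed without any NLP library. The sketch below is one possible reading: the shape encoding (X, x, d), the category names and the default affix length are assumptions, and the language properties (part of speech, lemma, dependency) are left out because they need a trained model.

```python
def token_shape(token):
    """Map uppercase letters to X, lowercase to x, digits to d; keep the rest."""
    out = []
    for ch in token:
        if ch.isdigit():
            out.append("d")
        elif ch.isalpha():
            out.append("X" if ch.isupper() else "x")
        else:
            out.append(ch)
    return "".join(out)


def token_type(token):
    """Classify a token as numeric, alphabetic, alphanumeric or other."""
    if token.isdigit():
        return "numeric"
    if token.isalpha():
        return "alphabetic"
    if token.isalnum():
        return "alphanumeric"
    return "other"


def capitalization(token):
    """Describe letter case; 'none' when the token has no letters."""
    if not any(ch.isalpha() for ch in token):
        return "none"
    if token.isupper():
        return "upper"
    if token.islower():
        return "lower"
    if token.istitle():
        return "title"
    return "mixed"


def surface_properties(token, affix_len=2):
    """Collect the surface properties of a single token in a dict."""
    return {
        "literal": token,
        "type": token_type(token),
        "shape": token_shape(token),
        "capitalization": capitalization(token),
        "length": len(token),
        "prefix": token[:affix_len],
        "suffix": token[-affix_len:],
    }


print(surface_properties("Pedro"))
print(surface_properties("1511"))
```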

  22. Token Types

  23. Patterns. Pattern := Token-Spec | [Token-Spec] (optional) | Token-Spec + (one or more) | Token-Spec Pattern
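
The pattern grammar on this slide can be sketched as a small greedy matcher with backtracking. In this hypothetical encoding a pattern is a list of `(predicate, quantifier)` pairs, where the quantifier is `"1"` (required), `"?"` (optional) or `"+"` (one or more); none of these names come from the slides.

```python
def match_at(tokens, start, pattern):
    """Return the end index of a greedy match of pattern at start, or None."""
    if not pattern:
        return start
    (predicate, quantifier), rest = pattern[0], pattern[1:]
    # Count how many consecutive tokens satisfy this token spec.
    limit = 1 if quantifier in ("1", "?") else len(tokens) - start
    max_run = 0
    while (max_run < limit and start + max_run < len(tokens)
           and predicate(tokens[start + max_run])):
        max_run += 1
    min_run = 0 if quantifier == "?" else 1
    # Try the longest run first, then backtrack to shorter ones.
    for run in range(max_run, min_run - 1, -1):
        end = match_at(tokens, start + run, rest)
        if end is not None:
            return end
    return None


def find_matches(tokens, pattern):
    """Return non-overlapping (start, end) token spans matching pattern."""
    matches, i = [], 0
    while i < len(tokens):
        end = match_at(tokens, i, pattern)
        if end is not None and end > i:
            matches.append((i, end))
            i = end
        else:
            i += 1
    return matches


def literal(word):
    """Token spec: the token equals word, ignoring case."""
    return lambda token: token.lower() == word


def capitalized(token):
    """Token spec: the token starts with an uppercase letter."""
    return token[:1].isupper()


# "my name is" followed by one or more capitalized tokens.
pattern = [(literal("my"), "1"), (literal("name"), "1"),
           (literal("is"), "1"), (capitalized, "+")]
tokens = "Hello , my name is Pedro Szekely .".split()
print([tokens[s:e] for s, e in find_matches(tokens, pattern)])
```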

  24. Positive/Negative Patterns. Positive: generate candidates. Negative: remove candidates (output overlaps positive candidates)

  25. Positive/Negative Patterns. Positive (general): generate candidates. Negative (specific): remove candidates (output overlaps positive candidates)
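
One way to read the positive/negative scheme: a general positive pattern proposes candidates, and any candidate that overlaps the output of a more specific negative pattern is dropped. The sketch below uses character-level regexes as stand-ins for token patterns; both regexes and the sample sentence are illustrative assumptions.

```python
import re

# General positive pattern: two capitalized words (person-name candidates).
POSITIVE = re.compile(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b")
# Specific negative pattern: a capitalized word followed by a street type.
NEGATIVE = re.compile(r"\b[A-Z][a-z]+ (?:Street|Avenue|Road)\b")


def spans(pattern, text):
    """Return the (start, end) character span of every match."""
    return [m.span() for m in pattern.finditer(text)]


def overlaps(a, b):
    """True when half-open spans a and b share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def extract(text):
    """Keep positive candidates that no negative match overlaps."""
    candidates = spans(POSITIVE, text)
    blocked = spans(NEGATIVE, text)
    kept = [c for c in candidates if not any(overlaps(c, n) for n in blocked)]
    return [text[start:end] for start, end in kept]


text = "Pawel Opalinski works at 1128 Main Street."
print(extract(text))  # ['Pawel Opalinski']
```

Keeping the positive pattern general and pushing exceptions into negative patterns means each new false positive is fixed by adding one specific rule rather than rewriting the general one.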

  26. DIG Demo

  27. https://spacy.io/docs/usage/rule-based-matching

  28. Advantages/Disadvantages. Advantages: easy to define; high precision; recall increases with number of rules. Disadvantages: text must follow strict patterns

  29. NLP Rule-Based Extraction. Tokenization for unusual domains: tokenize on white-space, punctuation and emojis. Token properties: literal, part of speech tag, lemma, in/out of dictionary; dependency parsing relationships (advanced); type (alphanumeric, alphabetic, numeric); shape (pattern of digits and characters), capitalization, prefix and suffix; number of characters, range (numbers). Pattern: sequence of required/optional tokens; positive and negative patterns

  30. Named-Entity Recognizers

  31. Named Entity Recognizers. Machine learning models: people, places, organizations and a few others. spaCy: complete NLP toolkit, Python (Cython), MIT license; code: https://github.com/explosion/spaCy; demo: http://textanalysisonline.com/spacy-named-entity-recognition-ner. Stanford NER: part of Stanford’s NLP software library, Java, GNU license; code: https://nlp.stanford.edu/software/CRF-NER.shtml; demo: http://nlp.stanford.edu:8080/ner/process

  32. https://spacy.io/docs/usage/entity-recognition

  33. https://demos.explosion.ai/displacy-ent

  34. Advantages/Disadvantages. Advantages: easy to use; tolerant of some noise; easy to train. Disadvantages: performance degrades rapidly for new genres, language models; requires hundreds to thousands of training examples

  35. Conditional Random Fields

  36. Discriminative Vs. Generative ● Generative model: a model that generates the observed data randomly ● Naïve Bayes: once the class label is known, all the features are independent ● Discriminative: directly estimates the posterior probability; aims at modeling the “discrimination” between different outputs ● MaxEnt classifier: linear combination of feature functions in the exponent ● Both generative models and discriminative models describe distributions over (y, x), but they work in different directions. slide by Daniel Khashabi
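
For reference, the two model families named on this slide are usually written as follows. These are the standard textbook forms, not formulas recovered from the slide; the symbols $K$ (number of features), $\lambda_j$ (weights) and $Z(\mathbf{x})$ (normalizer) are notation assumed here.

```latex
% Naive Bayes (generative): models the joint distribution; features are
% conditionally independent once the class label y is known.
p(y, \mathbf{x}) = p(y) \prod_{k=1}^{K} p(x_k \mid y)

% MaxEnt classifier (discriminative): models the posterior directly, with a
% linear combination of feature functions in the exponent.
p(y \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})}
    \exp\Big( \sum_{j} \lambda_j f_j(\mathbf{x}, y) \Big),
\qquad
Z(\mathbf{x}) = \sum_{y'} \exp\Big( \sum_{j} \lambda_j f_j(\mathbf{x}, y') \Big)
```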

  37. Discriminative Vs. Generative [figure; legend: observable, unobservable] slide by Daniel Khashabi

  38. Chain CRFs ● Each potential function will operate on pairs of adjacent label variables ● Feature functions ● Parameters to be estimated [figure; legend: unobservable, observable] slide by Daniel Khashabi

  39. Chain CRF ● We can change it so that each state depends on more observations ● Or inputs at previous steps ● Or all inputs [figure; legend: unobservable, observable] slide by Daniel Khashabi

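  40. Modeling Problems With CRF

      | i | X1 (word) | X2 (capitalized) | X3 (POS Tag)    | Y (entity)  |
      |---|-----------|------------------|-----------------|-------------|
      | 1 | My        | 1                | Possessive Pron | Other       |
      | 2 | name      | 0                | Noun            | Other       |
      | 3 | is        | 0                | Verb            | Other       |
      | 4 | Pedro     | 1                | Proper Noun     | Person-Name |
      | 5 | Szekely   | 1                | Proper Noun     | Person-Name |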

  41. Modeling Problems With CRF (same table as slide 40). Other common features: lemma, prefix, suffix, length
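
A minimal sketch of how the per-token features in this table might be assembled before training a sequence labeler. The function `token_features` and the affix length are hypothetical; the part-of-speech tags are copied from the table rather than computed, and lemma is omitted because it needs a lemmatizer.

```python
def token_features(word, pos_tag, affix_len=3):
    """Build the feature dict for one token (one row of the table)."""
    return {
        "word": word.lower(),                      # X1
        "capitalized": int(word[:1].isupper()),    # X2
        "pos": pos_tag,                            # X3, supplied by a tagger
        "prefix": word[:affix_len].lower(),
        "suffix": word[-affix_len:].lower(),
        "length": len(word),
    }


# The sentence and labels from the table.
sentence = [("My", "Possessive Pron"), ("name", "Noun"), ("is", "Verb"),
            ("Pedro", "Proper Noun"), ("Szekely", "Proper Noun")]
labels = ["Other", "Other", "Other", "Person-Name", "Person-Name"]

X = [token_features(word, tag) for word, tag in sentence]
for features, label in zip(X, labels):
    print(label, features)
```

A CRF toolkit would take one such list of feature dicts and one label list per sentence as a training example.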

  42. Modeling Problems With CRF (same table as slide 40). Feature functions: $f_j(x, y_{i-1}, y_i, i)$
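
With feature functions of this form, a linear-chain CRF is commonly written as below. This is the standard formulation, stated here as an assumption about what the slides intend; $\lambda_j$ are the parameters to be estimated, $n$ is the sentence length and $Z(\mathbf{x})$ sums over all label sequences. The example feature is hypothetical.

```latex
p(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})}
    \exp\Big( \sum_{i=1}^{n} \sum_{j} \lambda_j \, f_j(\mathbf{x}, y_{i-1}, y_i, i) \Big),
\qquad
Z(\mathbf{x}) = \sum_{\mathbf{y}'} \exp\Big( \sum_{i=1}^{n} \sum_{j}
    \lambda_j \, f_j(\mathbf{x}, y'_{i-1}, y'_i, i) \Big)

% Example feature function for the table: fires when the current word is
% capitalized and is labeled Person-Name.
f_1(\mathbf{x}, y_{i-1}, y_i, i) =
\begin{cases}
1 & \text{if } x_i \text{ is capitalized and } y_i = \text{Person-Name} \\
0 & \text{otherwise}
\end{cases}
```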

  43. Advantages/Disadvantages. Advantages: expressive; tolerant of noise; stood test of time; software packages available. Disadvantages: requires feature engineering; requires thousands of training examples

  44. Open Information Extraction

  45. http://openie.allenai.org/

  46. Practical IE Technologies Semi- Glossary Regex NLP Rules CRF NER Table Structured O(1000) O(10) assemble hours hours minutes annotati zero annotati Effort glossary ons ons high, low- minimal program low minimal zero minimal Expertise medium mer medium medium- medium- (ambiguit high high high high Precision high high y) medium low medium (formatti f(# high medium medium high Recall f(# rules) 50 Kejriwal, Szekely ng) regex)

  47. How to represent KGs?

  48. KG Definition: a directed, labeled, multi-relational graph representing facts/assertions as triples. (h, r, t): head entity, relation, tail entity. (s, p, o): subject, predicate, object
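
A minimal sketch of a knowledge graph stored as (h, r, t) triples. The class `KnowledgeGraph` and the relation names (`ceo_of`, `co_founder_of`, `phd_from`) are hypothetical; the facts are taken from the example text on slide 3.

```python
from collections import defaultdict


class KnowledgeGraph:
    """Directed, labeled multi-relational graph stored as (h, r, t) triples."""

    def __init__(self):
        self.triples = set()
        self.outgoing = defaultdict(set)  # head -> {(relation, tail)}

    def add(self, head, relation, tail):
        """Assert one fact; repeated assertions are stored once."""
        self.triples.add((head, relation, tail))
        self.outgoing[head].add((relation, tail))

    def tails(self, head, relation):
        """Return every tail entity linked from head by relation."""
        return {t for r, t in self.outgoing.get(head, ()) if r == relation}


kg = KnowledgeGraph()
# Two differently labeled edges between the same pair of entities are
# allowed, which is what makes the graph multi-relational.
kg.add("Astro Teller", "ceo_of", "BodyMedia")
kg.add("Astro Teller", "co_founder_of", "BodyMedia")
kg.add("Astro Teller", "phd_from", "Carnegie Mellon University")

print(kg.tails("Astro Teller", "ceo_of"))
print(len(kg.triples))
```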
