Information Extraction
Pedro Szekely
Information Sciences Institute, USC Viterbi School of Engineering

Agenda
● Information extraction classification
● Text extraction techniques
● Storing extractions in knowledge graphs
● myDIG demo
● Summary

Document Features
● Text paragraphs without formatting
  Astro Teller is the CEO and co-founder of BodyMedia. Astro holds a Ph.D. in Artificial Intelligence from Carnegie Mellon University, where he was inducted as a national Hertz fellow. His M.S. in symbolic and heuristic computation and B.S. in computer science are from Stanford University. His work in science, literature and business has appeared in international media from the New York Times to CNN to NPR.
● Grammatical sentences plus some formatting & links
● Non-grammatical snippets, rich formatting & links
● Tables
● Charts

Scope
● Genre specific (e.g., forums)
● Web site specific
● Wide, non-specific
Kejriwal, Szekely

Pattern Complexity
E.g., word patterns
● Closed set: U.S. states
  He was born in Alabama …
  The big Wyoming sky…
● Regular set: U.S. phone numbers
  Phone: (413) 545-1323
  The CALD main office can be reached at 412-268-1299
● Complex pattern: U.S. postal addresses
  University of Arkansas P.O. Box 140 Hope, AR 71802
  Headquarters: 1128 Main Street, 4th Floor Cincinnati, Ohio 45210
● Ambiguous patterns, needing context and many sources of evidence: Person names
  …was among the six houses sold by Hope Feldman that year.
  Pawel Opalinski, Software Engineer at WhizBang Labs.
● “YOU don't wanna miss out on ME :) Perfect lil booty Green eyes Long curly black hair Im a Irish, Armenian and Filipino mixed princess :) ❤ Kim ❤ 7 ○ 7~7two7~7four77 ❤ HH 80 roses ❤ Hour 120 roses ❤ 15 mins 60 roses”
Courtesy of Andrew McCallum

● small amount of relevant content
● irrelevant content very similar to relevant content

Practical Considerations
● How good (precision/recall) is necessary?
  High precision when showing extractions to users
  High recall when used for ranking results
● How long does it take to construct?
  Minutes, hours, days, months
● What expertise do I need?
  None (domain expertise), patience (annotation), simple scripting, machine learning guru
● What tools can I use?
  Many …
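The precision/recall question can be made concrete with a small sketch. The gold and extracted sets below are hypothetical toy data, and `precision_recall` is an illustrative helper, not part of any tool named in these slides.

```python
def precision_recall(extracted, gold):
    """Return (precision, recall) for two collections of extractions."""
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)
    # Precision: share of extractions that are correct.
    precision = true_positives / len(extracted) if extracted else 0.0
    # Recall: share of correct answers that were extracted.
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical toy data: U.S. states mentioned in a document vs. what was extracted.
gold = {"Alabama", "Wyoming", "Ohio", "Arkansas"}
extracted = {"Alabama", "Wyoming", "Hope"}
precision, recall = precision_recall(extracted, gold)
```

Here two of the three extractions are correct and two of the four gold items are found, so an extractor like this would suit neither a strict user-facing display nor a recall-oriented ranker without further work.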

Information Extraction Process
Segmentation → Data Extraction
Example extraction:
  Name: Legacy Ventures Intl, Inc.
  Stock: LGYV
  Date: 2017-07-14
  Market Cap: 391,030

Segmentation
● Semi-structured extraction
● Table extraction
● Main content identification
● Custom regular expressions
Result: text segments

Text Extraction Techniques
● Glossary
● Regular expressions
● Natural language rules
● Named entity recognition
● Sequence labeling (Conditional Random Fields)

Glossary Extraction

Glossary Extraction
Simple list of words or phrases to extract
Challenges
● Ambiguity: Charlotte is the name of a person and of a city
● Colloquial expressions: “Asia Broadband, Inc.” vs “Asia Broadband”
Research
● Improving precision of glossary extractions using context
● Creating/extending glossaries automatically
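A minimal sketch of glossary extraction, assuming a plain list of phrases and case-insensitive whole-word matching. The function name `glossary_extract` and the sample sentence are illustrative, not taken from the slides.

```python
import re


def glossary_extract(text, glossary):
    """Return (start, end, phrase) for every glossary phrase found in text."""
    if not glossary:
        return []
    # Longest phrases first so "Asia Broadband, Inc." wins over "Asia Broadband".
    phrases = sorted(glossary, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(?:" + "|".join(re.escape(p) for p in phrases) + r")(?!\w)",
        re.IGNORECASE,
    )
    return [(m.start(), m.end(), m.group(0)) for m in pattern.finditer(text)]


glossary = ["Charlotte", "Asia Broadband", "Asia Broadband, Inc."]
text = "Charlotte moved to Charlotte to work for Asia Broadband, Inc."
matches = glossary_extract(text, glossary)
found = [phrase for _, _, phrase in matches]
```

Both occurrences of "Charlotte" are returned even though one is a person and one is a city. That is the ambiguity challenge: the glossary alone cannot tell them apart without context.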

Regex Extraction

Extraction Using Regular Expressions
● Too difficult for non-programmers
  regex for North American phone numbers:
  ^(?:(?:\+?1\s*(?:[.-]\s*)?)?(?:\(\s*([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9])\s*\)|([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9]))\s*(?:[.-]\s*)?)?([2-9]1[02-9]|[2-9][02-9]1|[2-9][02-9]{2})\s*(?:[.-]\s*)?([0-9]{4})(?:\s*(?:#|x\.?|ext\.?|extension)\s*(\d+))?$
● Brittle and difficult to adapt to unusual domains
  unusual nomenclature and short-hands
  obfuscation
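The sketch below applies the slide's phone-number pattern in Python. It assumes the pattern is meant to have literal parentheses around the optional area code and no internal spaces; `parse_phone` is an illustrative helper name. The inputs are the phone examples from the Pattern Complexity slide.

```python
import re

# North American phone number pattern, split across lines for readability.
NANP_PHONE = re.compile(
    r"^(?:(?:\+?1\s*(?:[.-]\s*)?)?"                                 # optional country code
    r"(?:\(\s*([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9])\s*\)"   # (area code)
    r"|([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9]))"              # or bare area code
    r"\s*(?:[.-]\s*)?)?"
    r"([2-9]1[02-9]|[2-9][02-9]1|[2-9][02-9]{2})"                   # exchange
    r"\s*(?:[.-]\s*)?"
    r"([0-9]{4})"                                                   # line number
    r"(?:\s*(?:#|x\.?|ext\.?|extension)\s*(\d+))?$"                 # optional extension
)


def parse_phone(text):
    """Return (area, exchange, line, extension) or None if text is not a phone number."""
    m = NANP_PHONE.match(text.strip())
    if m is None:
        return None
    paren_area, bare_area, exchange, line, extension = m.groups()
    return (paren_area or bare_area, exchange, line, extension)


clean = parse_phone("(413) 545-1323")
dashed = parse_phone("412-268-1299")
obfuscated = parse_phone("7 ○ 7~7two7~7four77")
```

The two well-formed numbers parse, while the obfuscated number from the ad text returns `None`. This is the brittleness the slide describes: the pattern is anchored to the whole string and has no notion of spelled-out digits or decorative separators.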

NLP Rule-Based Extraction

NLP Rule-Based Extraction
Tokenization → Pattern Matching

Tokenization
● My name is Pedro → My name is Pedro
● 310-822-1511 → 310-822-1511 / 310 - 822 - 1511
● Candy is here → Candy is here / Candy is here
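A minimal tokenizer sketch, assuming tokens are runs of word characters and every other non-space character (punctuation, emoji) becomes its own token. The regular expression and the name `tokenize` are illustrative choices, not the tokenizer used in the demo.

```python
import re

# A token is a run of word characters, or one non-space, non-word character.
TOKEN = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    """Split on white-space and emit each punctuation mark or symbol separately."""
    return TOKEN.findall(text)


words = tokenize("My name is Pedro")
phone = tokenize("310-822-1511")
hearts = tokenize("\u2764 Kim \u2764")  # U+2764 is the heart symbol
```

Splitting the phone number into `310 - 822 - 1511` lets later patterns refer to the digit groups and separators as individual tokens.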

Token Properties
● Surface properties: literal, type, shape, capitalization, length, prefix, suffix, minimum, maximum
● Language properties: part of speech tag, lemma, dependency
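The surface properties can be computed directly from the token string. The sketch below is one possible reading: the shape encoding (X, x, d), the capitalization labels and the two-character affix length are assumptions, and language properties such as part of speech would come from an NLP library instead.

```python
def token_shape(token):
    """Map uppercase letters to 'X', lowercase to 'x', digits to 'd'; keep the rest."""
    out = []
    for ch in token:
        if ch.isdigit():
            out.append("d")
        elif ch.isalpha():
            out.append("X" if ch.isupper() else "x")
        else:
            out.append(ch)
    return "".join(out)


def token_type(token):
    """Classify a token as numeric, alphabetic, alphanumeric or other."""
    if token.isdigit():
        return "numeric"
    if token.isalpha():
        return "alphabetic"
    if token.isalnum():
        return "alphanumeric"
    return "other"


def capitalization(token):
    """Describe the casing of a token; 'none' means it has no letters."""
    if token.isupper():
        return "upper"
    if token.islower():
        return "lower"
    if token.istitle():
        return "title"
    return "mixed" if any(ch.isalpha() for ch in token) else "none"


def surface_properties(token, affix_len=2):
    """Collect the surface properties of one token in a dictionary."""
    return {
        "literal": token,
        "type": token_type(token),
        "shape": token_shape(token),
        "capitalization": capitalization(token),
        "length": len(token),
        "prefix": token[:affix_len],
        "suffix": token[-affix_len:],
    }


pedro = surface_properties("Pedro")
```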

Token Types

Patterns
Pattern := Token-Spec
         | [Token-Spec]          (optional)
         | Token-Spec +          (one or more)
         | Token-Spec Pattern
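The pattern grammar can be sketched as a small backtracking matcher. In this sketch a token spec is any Python predicate over a token, and a pattern is a list of `(spec, quantifier)` pairs where the quantifier is `"1"` (required), `"?"` (optional) or `"+"` (one or more). This encoding and all names are assumptions made for illustration.

```python
def match_at(tokens, pos, pattern):
    """Return the end index of a match of pattern starting at pos, or None."""
    if not pattern:
        return pos
    (spec, quant), rest = pattern[0], pattern[1:]
    ok = pos < len(tokens) and spec(tokens[pos])
    if quant == "?":
        # Optional: try consuming the token first, then fall back to skipping it.
        if ok:
            end = match_at(tokens, pos + 1, rest)
            if end is not None:
                return end
        return match_at(tokens, pos, rest)
    if not ok:
        return None
    if quant == "+":
        # One or more: greedily try to consume another token with the same spec.
        end = match_at(tokens, pos + 1, pattern)
        if end is not None:
            return end
    return match_at(tokens, pos + 1, rest)


def find_matches(tokens, pattern):
    """Return non-overlapping (start, end) token spans that match pattern."""
    spans, pos = [], 0
    while pos < len(tokens):
        end = match_at(tokens, pos, pattern)
        if end is not None and end > pos:
            spans.append((pos, end))
            pos = end
        else:
            pos += 1
    return spans


def literal(word):
    """Token spec: case-insensitive literal match."""
    return lambda token: token.lower() == word


def is_capitalized(token):
    """Token spec: first character is uppercase."""
    return token[:1].isupper()


# [my] name is Capitalized+
name_pattern = [
    (literal("my"), "?"),
    (literal("name"), "1"),
    (literal("is"), "1"),
    (is_capitalized, "+"),
]
tokens = ["My", "name", "is", "Pedro", "Szekely", "."]
spans = find_matches(tokens, name_pattern)
```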

Positive/Negative Patterns
● Positive (general): generate candidates
● Negative (specific): remove positive candidates that a negative match overlaps
● Output: the remaining candidates
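One way to combine positive and negative patterns is to collect spans from both and drop every positive candidate that overlaps a negative span. The sketch uses character-level regular expressions for brevity; the fax rule and the fax number are hypothetical.

```python
import re


def find_spans(pattern, text):
    """Return the (start, end) character span of every match of pattern."""
    return [m.span() for m in re.finditer(pattern, text)]


def overlaps(a, b):
    """True if half-open spans a and b share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def extract(text, positive, negative):
    """Apply general positive patterns, then filter with specific negative ones."""
    candidates = [s for p in positive for s in find_spans(p, text)]
    blocked = [s for n in negative for s in find_spans(n, text)]
    kept = [c for c in candidates if not any(overlaps(c, b) for b in blocked)]
    return [text[start:end] for start, end in sorted(kept)]


positive = [r"\b\d{3}-\d{3}-\d{4}\b"]           # general: anything phone-shaped
negative = [r"(?i)fax:?\s*\d{3}-\d{3}-\d{4}"]   # specific: numbers introduced as fax
text = "Call 412-268-1299 or fax 412-268-0000."
phones = extract(text, positive, negative)
```

Keeping the positive rule general and adding narrow negative rules tends to be easier to maintain than one positive rule that tries to encode every exception.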

DIG Demo

https://spacy.io/docs/usage/rule-based-matching

Advantages/Disadvantages
Advantages
● Easy to define
● High precision
● Recall increases with number of rules
Disadvantages
● Text must follow strict patterns

NLP Rule-Based Extraction
Tokenization for unusual domains
● tokenize on white-space, punctuation and emojis
Token properties
● literal, part of speech tag, lemma, in/out of dictionary
● dependency parsing relationships (advanced)
● type (alphanumeric, alphabetic, numeric)
● shape (pattern of digits and characters), capitalization, prefix and suffix
● number of characters, range (numbers)
Pattern
● Sequence of required/optional tokens
● positive and negative patterns

Named-Entity Recognizers

Named Entity Recognizers
Machine learning models
● people, places, organizations and a few others
spaCy
● complete NLP toolkit, Python (Cython), MIT license
● code: https://github.com/explosion/spaCy
● demo: http://textanalysisonline.com/spacy-named-entity-recognition-ner
Stanford NER
● part of Stanford’s NLP software library, Java, GNU license
● code: https://nlp.stanford.edu/software/CRF-NER.shtml
● demo: http://nlp.stanford.edu:8080/ner/process

https://spacy.io/docs/usage/entity-recognition

https://demos.explosion.ai/displacy-ent

Advantages/Disadvantages
Advantages
● Easy to use
● Tolerant of some noise
● Easy to train
Disadvantages
● Performance degrades rapidly for new genres, language models
● Requires hundreds to thousands of training examples

Conditional Random Fields

Discriminative Vs. Generative
● Generative model: a model that generates the observed data randomly
● Naïve Bayes: once the class label is known, all the features are independent
● Discriminative: directly estimates the posterior probability; aims at modeling the “discrimination” between different outputs
● MaxEnt classifier: linear combination of feature functions in the exponent
Both generative models and discriminative models describe distributions over (y, x), but they work in different directions.
slide by Daniel Khashabi
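The two directions can be written out in their usual textbook forms. The symbols below (features x_k, weights \lambda_j, feature functions f_j, normalizer Z(x)) are standard notation assumed here, not copied from the slide.

```latex
% Generative (Naive Bayes): model the joint distribution;
% features are independent once the class label y is known.
p(y, x) = p(y) \prod_{k=1}^{K} p(x_k \mid y)

% Discriminative (MaxEnt): model the posterior directly,
% with a linear combination of feature functions in the exponent.
p(y \mid x) = \frac{1}{Z(x)} \exp\Big( \sum_{j} \lambda_j f_j(x, y) \Big),
\qquad
Z(x) = \sum_{y'} \exp\Big( \sum_{j} \lambda_j f_j(x, y') \Big)
```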

Discriminative Vs. Generative
[Figure: graphical models; legend: observable, unobservable]
slide by Daniel Khashabi

Chain CRFs
● Each potential function will operate on pairs of adjacent label variables
● Feature functions
● Parameters to be estimated
[Figure legend: unobservable, observable]
slide by Daniel Khashabi
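For reference, the usual linear-chain CRF can be written with feature functions of the form f_j(x, y_{i-1}, y_i, i). The weights \lambda_j are the parameters to be estimated. This is the standard textbook form, given here as an assumption about what the slide's figure showed.

```latex
% Each factor looks at a pair of adjacent labels (y_{i-1}, y_i) and the input x.
p(y \mid x) = \frac{1}{Z(x)} \prod_{i=1}^{n}
  \exp\Big( \sum_{j} \lambda_j \, f_j(x, y_{i-1}, y_i, i) \Big)

% The normalizer sums over all label sequences y'.
Z(x) = \sum_{y'} \prod_{i=1}^{n}
  \exp\Big( \sum_{j} \lambda_j \, f_j(x, y'_{i-1}, y'_i, i) \Big)
```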

Chain CRF
● We can change it so that each state depends on more observations
● Or inputs at previous steps
● Or all inputs
[Figure legend: unobservable, observable]
slide by Daniel Khashabi

Modeling Problems With CRF
i   X1 (word)   X2 (capitalized)   X3 (POS Tag)       Y (entity)
1   My          1                  Possessive Pron    Other
2   name        0                  Noun               Other
3   is          0                  Verb               Other
4   Pedro       1                  Proper Noun        Person-Name
5   Szekely     1                  Proper Noun        Person-Name
Other common features: lemma, prefix, suffix, length
feature functions f_j(x, y_{i-1}, y_i, i)
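The table and the feature-function signature can be sketched in code. The rows are the slide's example; the two feature functions, their weights and the `"START"` label for the first position are hypothetical choices made for illustration. A real CRF learns the weights from annotated data and normalizes the score over all label sequences.

```python
# Rows of the slide's table: (word, capitalized, POS tag, entity label).
ROWS = [
    ("My", 1, "Possessive Pron", "Other"),
    ("name", 0, "Noun", "Other"),
    ("is", 0, "Verb", "Other"),
    ("Pedro", 1, "Proper Noun", "Person-Name"),
    ("Szekely", 1, "Proper Noun", "Person-Name"),
]
x = [row[:3] for row in ROWS]   # observations X1..X3 per position
y = [row[3] for row in ROWS]    # labels Y per position


def f_cap_person(x, y_prev, y_cur, i):
    """Fires when a capitalized token is labeled Person-Name."""
    return 1 if x[i][1] == 1 and y_cur == "Person-Name" else 0


def f_person_person(x, y_prev, y_cur, i):
    """Fires on a Person-Name to Person-Name transition."""
    return 1 if y_prev == "Person-Name" and y_cur == "Person-Name" else 0


FEATURES = [f_cap_person, f_person_person]
WEIGHTS = [1.5, 0.5]  # hypothetical values; training would estimate these


def score(x, y, features=FEATURES, weights=WEIGHTS):
    """Unnormalized log-score: sum over positions i and features j of w_j * f_j."""
    total = 0.0
    for i in range(len(x)):
        y_prev = y[i - 1] if i > 0 else "START"
        for w, f in zip(weights, features):
            total += w * f(x, y_prev, y[i], i)
    return total


gold_score = score(x, y)
all_other_score = score(x, ["Other"] * len(x))
```

With these weights the labeling from the table scores higher than labeling every token as Other, which is the behavior training aims to produce.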

Advantages/Disadvantages
Advantages
● Expressive
● Tolerant of noise
● Stood test of time
● Software packages available
Disadvantages
● Requires feature engineering
● Requires thousands of training examples

Open Information Extraction

http://openie.allenai.org/

Practical IE Technologies
Technology        Effort                Expertise
Glossary          assemble glossary     minimal
Regex             hours                 high, programmer
NLP Rules         hours                 low
Semi-Structured   minutes               minimal
CRF               O(1000) annotations   low-medium
NER               zero                  zero
Table             O(10) annotations     minimal
Precision: medium (ambiguity) for Glossary; across the other six technologies, four are rated high and two medium-high
Recall: f(# regex) for Regex; f(# rules) for NLP Rules; other values: medium, low, medium (formatting), high, medium, medium, high

how to represent KGs?

KG Definition
● a directed, labeled multi-relational graph representing facts/assertions as triples
● (h, r, t): head entity, relation, tail entity
● (s, p, o): subject, predicate, object
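A minimal sketch of the triple representation, storing the extraction from the Information Extraction Process slide as (head, relation, tail) facts. The class name `TripleStore` and the relation names (`stock`, `date`, `market_cap`) are hypothetical; a real system would typically use a graph database or an RDF store.

```python
from collections import defaultdict


class TripleStore:
    """Directed, labeled multi-relational graph stored as (h, r, t) triples."""

    def __init__(self):
        self.triples = set()
        self.by_head = defaultdict(set)  # head -> {(relation, tail), ...}

    def add(self, head, relation, tail):
        """Assert one fact; adding the same triple twice has no effect."""
        self.triples.add((head, relation, tail))
        self.by_head[head].add((relation, tail))

    def tails(self, head, relation):
        """Return all tail entities reachable from head via relation, sorted."""
        edges = self.by_head.get(head, ())
        return sorted(t for r, t in edges if r == relation)


kg = TripleStore()
company = "Legacy Ventures Intl, Inc."
kg.add(company, "stock", "LGYV")
kg.add(company, "date", "2017-07-14")
kg.add(company, "market_cap", "391,030")
kg.add(company, "stock", "LGYV")  # duplicate assertion, stored once

ticker = kg.tails(company, "stock")
```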
