Creating Probabilistic Databases from Information Extraction Models
Rahul Gupta, Sunita Sarawagi
Presented by Guozhang Wang
DB Lunch, April 13th, 2009
Several slides are from the authors

Outline
• Problem background and challenges
• Proposed solutions
  ◦ Segmentation-per-row model
  ◦ One-row model
  ◦ Multi-row model
• Experiments and conclusion

Extracting and Managing Structured Web Data
• Information extraction (using CRFs, etc.):
  ◦ Text segmentation (McCallum, UMass)
  ◦ Table extraction (Cafarella, UW)
  ◦ Preference collection (Wortman, UPenn)
• Uncertainty management:
  ◦ RDBMS
  ◦ Probabilistic RDBMS

Challenges in Presenting Data
Input: 52-A Goregaon West Mumbai 400 062
House_no | Area          | City        | Pincode | Probability
52       | Goregaon West | Mumbai      | 400 062 | 0.1
52-A     | Goregaon      | West Mumbai | 400 062 | 0.2
52-A     | Goregaon West | Mumbai      | 400 062 | 0.5
52       | Goregaon      | West Mumbai | 400 062 | 0.2
• Segmentation-per-row model
• Storage efficiency vs. query accuracy
  ◦ Top-1 vs. all segmentations for each string

Confidence = Probability of Correctness
[Figure: fraction correct (y-axis, 0 to 1) vs. probability of top segmentation (x-axis, 0 to 0.9)]

Trade-off Between Accuracy and Efficiency I
• Query accuracy
[Figure: square error (y-axis) vs. number of columns in projection query (x-axis, 1 to 4), comparing "Only best extraction" with "All extractions with probabilities"]

Trade-off Between Accuracy and Efficiency II
• Storage efficiency
[Figure: frequency (y-axis) vs. number of segmentations required to cover 0.9 probability (x-axis bins: 1, 2, 3, 4-10, 11-20, 21-30, 31-50, 51-200, >200)]

Goal of This Paper
• Design data models that achieve a good trade-off between storage efficiency and query accuracy
  ◦ To achieve query accuracy:
    ▪ Approximate the extracted segmentation distribution as closely as possible
    ▪ Similarity metric: KL divergence, $KL(P \| Q) = \sum_s P(s) \log \frac{P(s)}{Q(s)}$
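To make the similarity metric concrete, here is a minimal sketch of the KL divergence between two discrete distributions over segmentations. The outcome names `s1`..`s4` and the two candidate approximations (`top1`, `uniform`) are illustrative assumptions, not taken from the paper; only the four probabilities come from the address example.

```python
import math


def kl_divergence(p, q):
    """KL(P || Q) = sum_s P(s) * log(P(s) / Q(s)), summed over s with P(s) > 0."""
    total = 0.0
    for s, ps in p.items():
        if ps == 0.0:
            continue  # terms with P(s) = 0 contribute nothing
        qs = q.get(s, 0.0)
        if qs == 0.0:
            return math.inf  # Q rules out an outcome that P considers possible
        total += ps * math.log(ps / qs)
    return total


# Probabilities of the four segmentations in the address example.
p = {"s1": 0.1, "s2": 0.2, "s3": 0.5, "s4": 0.2}

# Hypothetical approximations: keep only the best segmentation, or spread mass evenly.
top1 = {"s3": 1.0}
uniform = {s: 0.25 for s in p}

print(kl_divergence(p, p))
print(kl_divergence(p, top1))
print(kl_divergence(p, uniform))
```

Under this metric, storing only the top-1 segmentation is infinitely far from P because it assigns zero probability to segmentations that P allows.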

Proposed Data Models
• Segmentation-per-row model (exact)
• One-row model (column independence)
• Multi-row model (mixture of the two)

Segmentation-per-row Model
HNO  | AREA        | CITY        | PINCODE | PROB
52   | Bandra West | Bombay      | 400 062 | 0.1
52-A | Bandra      | West Bombay | 400 062 | 0.2
52-A | Bandra West | Bombay      | 400 062 | 0.5
52   | Bandra      | West Bombay | 400 062 | 0.2
• Exact but impractical: there can be too many segmentations!

One-row Model
HNO        | AREA              | CITY              | PINCODE
52 (0.3)   | Bandra West (0.6) | Bombay (0.6)      | 400 062 (1.0)
52-A (0.7) | Bandra (0.4)      | West Bombay (0.4) |
• Each column has an independent multinomial distribution $Q_y(t,u)$
  ◦ E.g. P(52-A, Bandra West, Bombay, 400 062) = 0.7 x 0.6 x 0.6 x 1.0 = 0.252
• Simple model, but the computed confidences are only approximate and can be wrong: the segmentation-per-row table gives this segmentation probability 0.5, not 0.252
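The one-row numbers can be reproduced by summing the four segmentation probabilities from the segmentation-per-row slide column by column. The sketch below assumes that table as input; the function names `one_row_model` and `one_row_prob` are hypothetical helpers introduced for illustration.

```python
import math
from collections import defaultdict

COLUMNS = ["HNO", "AREA", "CITY", "PINCODE"]

# The four segmentations of the address example with their probabilities.
SEGMENTATIONS = [
    (("52", "Bandra West", "Bombay", "400 062"), 0.1),
    (("52-A", "Bandra", "West Bombay", "400 062"), 0.2),
    (("52-A", "Bandra West", "Bombay", "400 062"), 0.5),
    (("52", "Bandra", "West Bombay", "400 062"), 0.2),
]


def one_row_model(segmentations):
    """Per-column marginals: one independent multinomial per column."""
    marginals = [defaultdict(float) for _ in COLUMNS]
    for values, prob in segmentations:
        for col, value in enumerate(values):
            marginals[col][value] += prob
    return [dict(m) for m in marginals]


def one_row_prob(model, values):
    """Probability of a full segmentation under column independence."""
    return math.prod(model[col].get(v, 0.0) for col, v in enumerate(values))


model = one_row_model(SEGMENTATIONS)
query = ("52-A", "Bandra West", "Bombay", "400 062")
print(dict(zip(COLUMNS, model)))
print(one_row_prob(model, query))  # compare with the exact value 0.5
```

The gap between the product of marginals and the exact probability is the cost of ignoring the correlation between the AREA and CITY columns.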

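Populating the One-row Model
• $\min_Q KL(P \| Q) = \min_Q KL(P \| \prod_y Q_y) = \min_Q \sum_y KL(P_y \| Q_y)$ (the last two objectives differ only by a term that does not depend on Q, so they have the same minimizer)
• Has a closed-form solution: $Q_y(t,u) = P(t,u,y)$, where $P(t,u,y)$ is the marginal distribution
• The marginal $P(t,u,y)$ can be computed using the forward-backward message-passing algorithm (next slide)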

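Forward-Backward Algorithm
[Figure: segmentation lattice over the candidate segments 52, 52-A, Bandra, Bandra West, West Bombay, Bombay, 400 062, annotated with forward (α), backward (β), and marginal quantities]
• $P(t,u,y) = c \, \beta_u(y) \sum_{y'} \alpha_{t-1}(y') \, \mathrm{Score}(t,u,y,y')$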
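The marginal formula can be sketched as a small segment-level forward-backward pass. Everything concrete here is an assumption made for illustration: the two labels, the toy `score` potential, the maximum segment length, and the reading of the constant c as 1/Z (the inverse of the total score of all segmentations). It is not the extraction model used in the paper.

```python
LABELS = ["A", "B"]
START = "START"  # pseudo-label preceding the first segment


def score(t, u, y, y_prev):
    """Hypothetical positive potential for a segment covering tokens t..u
    (1-indexed, inclusive) with label y, following a segment labeled y_prev."""
    length = u - t + 1
    base = 2.0 if y == "A" else 1.0
    repeat_penalty = 0.5 if y == y_prev else 1.0
    return base * repeat_penalty / length


def segment_marginals(n, max_len):
    """Return ({(t, u, y): P(t, u, y)}, Z) for a sequence of n tokens."""
    # alpha[u][y]: total score of segmentations of tokens 1..u whose last
    # segment ends at u with label y.
    alpha = [dict() for _ in range(n + 1)]
    alpha[0] = {START: 1.0}
    for u in range(1, n + 1):
        for y in LABELS:
            total = 0.0
            for t in range(max(1, u - max_len + 1), u + 1):
                for y_prev, a in alpha[t - 1].items():
                    total += a * score(t, u, y, y_prev)
            alpha[u][y] = total

    # beta[u][y]: total score of segmentations of tokens u+1..n, given that
    # the previous segment ended at u with label y.
    beta = [dict() for _ in range(n + 1)]
    beta[n] = {y: 1.0 for y in LABELS}
    for u in range(n - 1, -1, -1):
        for y in ([START] if u == 0 else LABELS):
            total = 0.0
            for u2 in range(u + 1, min(n, u + max_len) + 1):
                for y2 in LABELS:
                    total += score(u + 1, u2, y2, y) * beta[u2][y2]
            beta[u][y] = total

    z = sum(alpha[n].values())  # normalizer, so c = 1 / z
    marginals = {}
    for t in range(1, n + 1):
        for u in range(t, min(n, t + max_len - 1) + 1):
            for y in LABELS:
                inner = sum(a * score(t, u, y, y_prev)
                            for y_prev, a in alpha[t - 1].items())
                marginals[(t, u, y)] = beta[u][y] * inner / z
    return marginals, z


marginals, z = segment_marginals(n=4, max_len=3)
for key in sorted(marginals):
    print(key, round(marginals[key], 4))
```

Because every token is covered by exactly one segment in any segmentation, the marginals of all segments containing a given token sum to 1, which is a convenient sanity check.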

Multi-row Model
HNO                    | AREA              | CITY              | PINCODE       | Prob
52 (0.167), 52-A (0.833) | Bandra West (1.0) | Bombay (1.0)      | 400 062 (1.0) | 0.6
52 (0.5), 52-A (0.5)     | Bandra (1.0)      | West Bombay (1.0) | 400 062 (1.0) | 0.4
• Rows with the same ID are mutually exclusive, with row probability $\pi_k$
• Columns in the same row are independent
  ◦ E.g. P(52-A, Bandra West, Bombay, 400 062) = 0.833 x 1.0 x 1.0 x 1.0 x 0.6 + 0.5 x 0.0 x 0.0 x 1.0 x 0.4 = 0.50
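The worked example on this slide is a mixture of per-row products. A minimal sketch of that evaluation follows, with the table hard-coded from the slide; the `MULTI_ROW` layout and the `multi_row_prob` helper are hypothetical names chosen for illustration.

```python
import math

# Each row: (row probability pi_k, one multinomial per column), from the slide.
MULTI_ROW = [
    (0.6, [{"52": 0.167, "52-A": 0.833}, {"Bandra West": 1.0},
           {"Bombay": 1.0}, {"400 062": 1.0}]),
    (0.4, [{"52": 0.5, "52-A": 0.5}, {"Bandra": 1.0},
           {"West Bombay": 1.0}, {"400 062": 1.0}]),
]


def multi_row_prob(rows, values):
    """P(values) = sum_k pi_k * prod_y Q^k_y(values[y]); unseen values get 0."""
    return sum(
        pi * math.prod(cols[c].get(v, 0.0) for c, v in enumerate(values))
        for pi, cols in rows
    )


query = ("52-A", "Bandra West", "Bombay", "400 062")
print(round(multi_row_prob(MULTI_ROW, query), 2))
```

With two rows, the model recovers the probability of this segmentation (0.5 after rounding), where the one-row model gave 0.252.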

Populating the Multi-row Model (fixed number of rows k)
• $\arg\min_Q KL(P \| Q) = \arg\max_{\pi, Q} \sum_s P(s) \log \sum_k \pi_k Q^k(s)$
• We cannot obtain the optimal parameter values in closed form because of the summation inside the log
• However, this reduces to the well-known mixture-model parameter estimation problem, which can be solved with the EM algorithm
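The equivalence between minimizing the KL divergence and maximizing an expected log-likelihood follows directly from the KL definition. As a short derivation, writing the multi-row model as $Q(s) = \sum_k \pi_k Q^k(s)$:

```latex
\begin{aligned}
KL(P \,\|\, Q)
  &= \sum_s P(s) \log \frac{P(s)}{Q(s)} \\
  &= \underbrace{\sum_s P(s) \log P(s)}_{\text{independent of } \pi,\, Q}
     \;-\; \sum_s P(s) \log \sum_k \pi_k Q^k(s)
\end{aligned}
```

Since the first term does not involve the model parameters, minimizing the divergence is the same as maximizing the second sum, and the inner sum over k is what prevents a closed-form solution.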

Enumeration-based EM Approach
• Initially guess the parameter values $\pi_k$ and $Q^k_y(t,u)$
• E step: soft-assign each segmentation $s_d$ to each row (mixture component) k
• M step: update the parameters to their maximum-likelihood values under this soft assignment
• Note that the E step needs to enumerate all segmentations $s_d$
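The E and M steps can be sketched on the four enumerated segmentations of the address example. This is an illustration under stated assumptions rather than the authors' implementation: each segmentation is weighted by its probability P(s_d), the first iteration starts from a random soft assignment instead of guessed parameters, and `em_multi_row` is a hypothetical name.

```python
import math
import random

SEGS = [
    ("52", "Bandra West", "Bombay", "400 062"),
    ("52-A", "Bandra", "West Bombay", "400 062"),
    ("52-A", "Bandra West", "Bombay", "400 062"),
    ("52", "Bandra", "West Bombay", "400 062"),
]
PROBS = [0.1, 0.2, 0.5, 0.2]


def mixture_prob(seg, pis, qs):
    """Q(s) = sum_k pi_k * prod_y Q^k_y(s[y])."""
    return sum(pi * math.prod(q[c].get(v, 0.0) for c, v in enumerate(seg))
               for pi, q in zip(pis, qs))


def em_multi_row(segs, probs, k, iters=30, seed=0):
    rng = random.Random(seed)
    n_cols = len(segs[0])
    # Start from a random soft assignment of segmentations to rows.
    resp = []
    for _ in segs:
        w = [rng.random() + 1e-3 for _ in range(k)]
        resp.append([x / sum(w) for x in w])
    history = []
    for _ in range(iters):
        # M step: weighted maximum-likelihood estimates of pi_k and Q^k_y.
        pis = [sum(p * r[j] for p, r in zip(probs, resp)) for j in range(k)]
        qs = []
        for j in range(k):
            denom = pis[j] or 1.0  # guard against an empty component
            cols = []
            for c in range(n_cols):
                dist = {}
                for seg, p, r in zip(segs, probs, resp):
                    dist[seg[c]] = dist.get(seg[c], 0.0) + p * r[j] / denom
                cols.append(dist)
            qs.append(cols)
        # Objective: sum_s P(s) log Q(s), which EM never decreases.
        history.append(sum(p * math.log(mixture_prob(s, pis, qs))
                           for s, p in zip(segs, probs)))
        # E step: responsibility of each row for each segmentation.
        resp = []
        for s in segs:
            joint = [pi * math.prod(q[c][v] for c, v in enumerate(s))
                     for pi, q in zip(pis, qs)]
            total = sum(joint)
            resp.append([x / total for x in joint])
    return pis, qs, history


pis, qs, history = em_multi_row(SEGS, PROBS, k=2)
print([round(pi, 3) for pi in pis])
print(round(history[0], 4), round(history[-1], 4))
```

The loop over `segs` in the E step is exactly the enumeration the slide warns about: it is feasible for four segmentations but not when a string has a very large number of them.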

Enumeration-less Approach
• Observation:
  ◦ We need to enumerate segmentations in the E step because we use soft assignment
• Idea:
  ◦ Use hard assignment instead, so that each $s_d$ belongs to exactly one component
• We use a decision tree to make the hard assignment (using information gain to split nodes)
• The optimization problem then has a closed-form solution
• A merge mechanism removes the disjointness limitation
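The closed form under hard assignment can be illustrated on the address example. The sketch below is a simplification with loudly stated assumptions: it enumerates the four segmentations for clarity (the point of the approach is to avoid that), and it picks the split "AREA is Bandra West" by hand instead of growing a decision tree with information gain. `populate_from_partition` is a hypothetical helper.

```python
from collections import defaultdict

SEGMENTATIONS = [
    (("52", "Bandra West", "Bombay", "400 062"), 0.1),
    (("52-A", "Bandra", "West Bombay", "400 062"), 0.2),
    (("52-A", "Bandra West", "Bombay", "400 062"), 0.5),
    (("52", "Bandra", "West Bombay", "400 062"), 0.2),
]


def populate_from_partition(segmentations, assign):
    """Closed-form parameters when each segmentation belongs to one row:
    pi_k is the probability mass of the partition, and each column
    distribution is the marginal within that partition."""
    groups = defaultdict(list)
    for values, prob in segmentations:
        groups[assign(values)].append((values, prob))
    rows = []
    for key in sorted(groups):
        members = groups[key]
        pi = sum(prob for _, prob in members)
        n_cols = len(members[0][0])
        cols = [defaultdict(float) for _ in range(n_cols)]
        for values, prob in members:
            for c, v in enumerate(values):
                cols[c][v] += prob / pi
        rows.append((pi, [dict(c) for c in cols]))
    return rows


# Hand-picked split (an assumption): row 0 holds segmentations whose AREA is
# "Bandra West", row 1 holds the rest.
rows = populate_from_partition(
    SEGMENTATIONS, lambda values: 0 if values[1] == "Bandra West" else 1
)
for pi, cols in rows:
    print(round(pi, 3), [{v: round(p, 3) for v, p in c.items()} for c in cols])
```

With this split, the resulting row probabilities and column distributions match the multi-row table in the slides (0.6 and 0.4, with 52 at 0.167 and 52-A at 0.833 in the first row).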

Experiment I
• Comparing the multi-row model with the segmentation-per-row (SPR) model

Experiment II
• Comparing the multi-row model with the one-row model

Lessons Learned?
• Column independence might not be suitable in some cases (8% vs. 25%)
• The multi-row model captures the correlations between columns well
• (But) how would this probabilistic model be implemented?
  ◦ A single row in the multi-row model takes more space
• Are accuracy and space efficiency equally important in this application scenario?

Questions?
