Zero-shot Entity Extraction from Web Pages
ACL, June 23, 2014
Panupong Pasupat and Percy Liang
Focus: Entity Extraction
What are the longest hiking trails near Baltimore?
hiking trails near Baltimore → Data Source → Avalon Super Loop, Patapsco Valley State Park, Gunpowder Falls State Park, Union Mills Hike, Greenbury Point, ...
Applications: question answering / semantic parsing / taxonomy construction / ontology expansion / knowledge base population / ...
Semi-Structured Data on the Web
Challenge: Long Tail of Categories
• person, location, organization
• airport, battleship, acid, pitcher, settlement, headgear, metaphor, haircut, poker hand, biome, enzyme, superstition
• tutorials at ACL 2014, dishes at Pu Pu Hot Pot, Stanford computer science professors
We want to generalize to unseen categories.
Relevant Approaches [Wang and Cohen, 2009; Google Sets; Sarmento et al., 2007; ...]
Bootstrapping from Seed Examples:
seeds (Avalon Super Loop, Hilton Area) + web pages → System → answers (Avalon Super Loop, Hilton Area, Wildlands Loop, ...)
• Use seed examples to specify the entity category
• ... but we might not have seeds (e.g. in question answering)
Our Work
query (hiking trails near Baltimore) + web page → System → answers (Avalon Super Loop, Hilton Area, Wildlands Loop, ...)
Use a natural language query to specify the entity category
Outline
1. Setup
   • Problem Setup
   • Dataset
2. Approach
3. Results
Problem Setup
Input:
• query x: hiking trails near Baltimore
• web page w
Output:
• list of entities y: [Avalon Super Loop, Patapsco Valley State Park, ...]
Dataset
We created the OpenWeb dataset with diverse queries and web pages. Example queries:
• airlines of italy
• natural causes of global warming
• lsu football coaches
• bf3 submachine guns
• badminton tournaments
• foods high in dha
• technical colleges in south carolina
• songs on glee season 5
• singers who use auto tune
• san francisco radio stations
Query Generation [Berant et al., 2013]
Breadth-first search on Google Suggest
[Figure: pipeline starting from the prefix "list of" and passing through Google Suggest and Template Extraction; example strings shown: "list of movies", "list of Indian ...", "list of Indian movies", ...]
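The breadth-first search named on this slide can be sketched as follows. This is only an illustration under stated assumptions: `suggest` is a hypothetical callable standing in for a suggestion service (no real Google API is called), `FAKE_SUGGEST` is a made-up lookup table built from the example strings on the slide, and the Template Extraction step is not modeled.

```python
from collections import deque

def bfs_queries(seed, suggest, limit=100):
    """Breadth-first expansion of a seed prefix through a suggestion function."""
    seen = {seed}
    queue = deque([seed])
    order = []
    while queue and len(order) < limit:
        prefix = queue.popleft()
        order.append(prefix)
        # Enqueue every completion that has not been visited yet.
        for completion in suggest(prefix):
            if completion not in seen:
                seen.add(completion)
                queue.append(completion)
    return order

# Hypothetical stand-in for a suggestion service (assumption, not a real API).
FAKE_SUGGEST = {
    "list of": ["list of movies", "list of Indian movies"],
    "list of movies": ["list of Indian movies"],
}

queries = bfs_queries("list of", lambda p: FAKE_SUGGEST.get(p, []))
print(queries)
```

The `seen` set keeps a completion reachable from several prefixes from being expanded twice, which matters when suggestions overlap heavily.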
Dataset Annotation
Annotate the first, second, and last entities matching the query using Amazon Mechanical Turk.
Query: airlines of italy
Annotation:
• First: Air Dolomiti
• Second: Air Europe
• Last: Wind Jet
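One way such a partial annotation could be used is to test whether a candidate entity list agrees with it. The sketch below assumes exact string matching on the first, second, and last elements; the function name `matches_annotation` and the matching rule are illustrative assumptions, not something stated on the slide.

```python
def matches_annotation(entities, first, second, last):
    """True if a candidate list agrees with the first/second/last annotation."""
    return (
        len(entities) >= 2
        and entities[0] == first
        and entities[1] == second
        and entities[-1] == last
    )

# Annotation from the slide for the query "airlines of italy".
annotation = ("Air Dolomiti", "Air Europe", "Wind Jet")

# Two hypothetical candidate lists: one consistent, one a navigation menu.
consistent = ["Air Dolomiti", "Air Europe", "Wind Jet"]
menu = ["Home", "About", "Contact"]

print(matches_annotation(consistent, *annotation))
print(matches_annotation(menu, *annotation))
```

Annotating only three positions keeps the labeling cost low while still constraining where the list starts and ends.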
Dataset Statistics
• 2773 examples
• 2269 unique queries
• 894 unique headwords ← long tail!
• 1483 unique web domains ← long tail! (≠ wrapper induction)
Outline
1. Setup
2. Approach
   • Extraction Predicate
   • Framework
   • Modeling
   • Features
3. Results
Extraction Predicate
How can we choose what to extract from a web page w?
[Figure: DOM tree of w (html → head, body; body → table, h1, table; tables → tr rows; rows → td and th cells)]
number of possible entity lists ≈ 2^(number of nodes)
Extraction Predicate [Sahuguet and Azavant, 1999; Liu et al., 2000; Crescenzi et al., 2001]
Idea: Entities usually share the same tag and tree level
[Figure: DOM tree of w (html → head, body; body → table, h1, table; tables → tr rows; rows → td and th cells)]
z = /html[1]/body[1]/table[2]/tr/td[1]
• Captures structures such as table columns, list entries, headers of the same level, ...
• Each web page has ≈ 8500 extraction predicates z
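To make the predicate concrete, here is a minimal sketch of executing z = /html[1]/body[1]/table[2]/tr/td[1] on a page. The page in `PAGE` is a hypothetical, well-formed stand-in invented for this example, and the sketch uses the limited XPath subset of Python's `xml.etree.ElementTree`; real web pages would need an HTML parser, which is not shown.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for a web page w (assumption: well-formed markup).
PAGE = """
<html>
  <head><title>Trails</title></head>
  <body>
    <table><tr><td>Home</td><td>Contact</td></tr></table>
    <h1>Hiking trails near Baltimore</h1>
    <table>
      <tr><th>Trail</th><th>Link</th></tr>
      <tr><td>Avalon Super Loop</td><td>map</td></tr>
      <tr><td>Patapsco Valley State Park</td><td>map</td></tr>
      <tr><td>Gunpowder Falls State Park</td><td>map</td></tr>
    </table>
  </body>
</html>
"""

def execute(predicate, page_xml):
    """Return the text of every node selected by an absolute path predicate."""
    root = ET.fromstring(page_xml.strip())
    # ElementTree paths are relative to the root element, so drop "/html[1]".
    steps = predicate.strip("/").split("/")[1:]
    return [(node.text or "").strip() for node in root.findall("/".join(steps))]

entities = execute("/html[1]/body[1]/table[2]/tr/td[1]", PAGE)
print(entities)
```

Because the `tr` step carries no index, the predicate selects the first `td` of every row in the second table, which is what lets a single path describe a whole table column.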
Framework
• Input: query x (hiking trails near Baltimore) and web page w (html → head, body → ...)
• Generation: produces the candidate set Z (|Z| ≈ 8500)
• Model: selects an extraction predicate z = /html[1]/body[1]/table[2]/tr/td[1]
• Execution: applies z to obtain y = [Avalon Super Loop, Patapsco Valley State Park, ...]
A graphical model with latent extraction predicate z
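The slide does not say how the Generation step builds Z. One plausible scheme, shown purely as an assumption, is to take the indexed path of every node and drop the sibling index on some of the trailing steps. The names `indexed_paths`, `generate_predicates`, and the `max_free` parameter are hypothetical.

```python
import itertools
import xml.etree.ElementTree as ET

def indexed_paths(root):
    """Yield a list of (tag, index) steps for every node in the tree."""
    def walk(node, prefix):
        yield prefix
        counts = {}
        for child in node:
            # Index counts position among siblings with the same tag.
            counts[child.tag] = counts.get(child.tag, 0) + 1
            yield from walk(child, prefix + [(child.tag, counts[child.tag])])
    yield from walk(root, [(root.tag, 1)])

def generate_predicates(root, max_free=2):
    """Candidate set: node paths with indices dropped on the last steps."""
    candidates = set()
    for steps in indexed_paths(root):
        k = min(max_free, len(steps))
        fixed, tail = steps[:len(steps) - k], steps[len(steps) - k:]
        # Try every keep/drop choice for the indices of the last k steps.
        for keep in itertools.product([True, False], repeat=k):
            parts = [f"{tag}[{i}]" for tag, i in fixed]
            parts += [f"{tag}[{i}]" if kp else tag
                      for (tag, i), kp in zip(tail, keep)]
            candidates.add("/" + "/".join(parts))
    return candidates

root = ET.fromstring(
    "<html><body><table>"
    "<tr><td>a</td><td>b</td></tr><tr><td>c</td><td>d</td></tr>"
    "</table></body></html>"
)
Z = generate_predicates(root)
print(len(Z), sorted(Z)[:3])
```

Limiting the generalization to the last few steps keeps the candidate set far smaller than the 2^(number of nodes) possible entity lists.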
Modeling
Let x be a query and w be a web page. Define a log-linear distribution over the extraction predicates z ∈ Z:
p_θ(z | x, w) ∝ exp{θ⊤φ(x, w, z)}
• θ is a parameter vector
• φ(x, w, z) is a feature vector
• Find θ that maximizes the log-likelihood of the training data using AdaGrad [Duchi et al., 2010]
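The log-linear model and an AdaGrad update can be sketched in a few lines. This is a simplified illustration, assuming a single gold predicate per example (the slides do not specify how the annotation defines the training target) and sparse feature dictionaries; the feature names and candidate names below are hypothetical.

```python
import math

def predicate_probs(theta, feats):
    """p_theta(z | x, w) for each candidate z; feats[z] is a dict phi(x, w, z)."""
    scores = {z: sum(theta.get(f, 0.0) * v for f, v in phi.items())
              for z, phi in feats.items()}
    m = max(scores.values())  # subtract the max for numerical stability
    exps = {z: math.exp(s - m) for z, s in scores.items()}
    total = sum(exps.values())
    return {z: e / total for z, e in exps.items()}

def adagrad_step(theta, sq_grads, feats, gold, lr=0.1, eps=1e-8):
    """One ascent step on log p_theta(gold | x, w) for a single example."""
    probs = predicate_probs(theta, feats)
    # Gradient of the log-likelihood: phi(gold) - E_p[phi].
    grad = dict(feats[gold])
    for z, p in probs.items():
        for f, v in feats[z].items():
            grad[f] = grad.get(f, 0.0) - p * v
    # AdaGrad: scale each coordinate by its accumulated squared gradients.
    for f, g in grad.items():
        sq_grads[f] = sq_grads.get(f, 0.0) + g * g
        theta[f] = theta.get(f, 0.0) + lr * g / (math.sqrt(sq_grads[f]) + eps)

# Hypothetical candidates and features (assumptions for illustration only).
feats = {
    "z_good": {"pos_identical": 1.0, "identity_diverse": 1.0},
    "z_menu": {"identity_diverse": 1.0},
    "z_repeat": {"pos_identical": 1.0},
}
theta, sq_grads = {}, {}
before = predicate_probs(theta, feats)
for _ in range(20):
    adagrad_step(theta, sq_grads, feats, gold="z_good")
after = predicate_probs(theta, feats)
print(before["z_good"], after["z_good"])
```

With θ = 0 the distribution is uniform over the candidates; each step moves probability mass toward the gold predicate.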
Features
p_θ(z | x, w) ∝ exp{θ⊤φ(x, w, z)}
Structural Features: context
[Figure: two candidates compared (>) by their structural context]
Features
p_θ(z | x, w) ∝ exp{θ⊤φ(x, w, z)}
Denotation Features: content
Query: hiking trails near Baltimore
• First candidate list: Avalon Super Loop, Patapsco Valley State Park, Gunpowder Falls State Park, Rachel Carson Conservation Park, Union Mills Hike, ...
• Second candidate list: Home, About Baltimore Tour, Pricing, Contact, Online Support, ...
The first list is preferred (>) over the second.
Defining Features on Lists
• List 1 (good): George Washington, John Adams, Thomas Jefferson, James Madison, ... (39 more) ..., Barack Obama
  POS tags: NNP NNP for each element shown
• List 2 (bad): John Adams, John Adams, John Adams, ... (100 more) ..., John Adams
  POS tags: NNP NNP for each element shown
• List 3 (bad): Blog, Photos and Video, Briefing Room, In the White House, Mobile Apps, Contact Us
  POS tags: NN; NNS CC NNP; NN NN; IN DT NNP NNP; NNP NNPS; NN PRP

Abstraction   List 1 (good)   List 2 (bad)   List 3 (bad)
identity      diverse         identical      diverse
POS           identical       identical      diverse
Defining Features on Lists
1. Abstraction: Map list elements into abstract tokens
   • Avalon Super Loop → 3
   • Patapsco Valley State Park → 4
   • Gunpowder Falls State Park → 4
   • Union Mills Hike → 3
   • Greenbury Point → 2
2. Aggregation: Define features using the histogram of the abstract tokens (here over the values 2, 3, 4)
   • Entropy, Majority, MajorityRatio, Single, Mean, Variance
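The two steps can be sketched as below, using word count as the abstraction, as in the example on this slide. The slide names the aggregate features but does not define them, so the definitions here are assumptions: entropy of the normalized histogram, the most frequent token and its share, whether all tokens are identical, and mean and variance for numeric tokens.

```python
import math
from collections import Counter

def aggregate(tokens):
    """Histogram-based features of a list of abstract tokens (assumed definitions)."""
    hist = Counter(tokens)
    n = len(tokens)
    majority, majority_count = hist.most_common(1)[0]
    feats = {
        "Entropy": -sum(c / n * math.log(c / n) for c in hist.values()),
        "Majority": majority,
        "MajorityRatio": majority_count / n,
        "Single": len(hist) == 1,
    }
    # Mean and Variance only make sense for numeric abstract tokens.
    if all(isinstance(t, (int, float)) for t in tokens):
        mean = sum(tokens) / n
        feats["Mean"] = mean
        feats["Variance"] = sum((t - mean) ** 2 for t in tokens) / n
    return feats

trails = [
    "Avalon Super Loop",
    "Patapsco Valley State Park",
    "Gunpowder Falls State Park",
    "Union Mills Hike",
    "Greenbury Point",
]
# 1. Abstraction: map each element to its number of words.
word_counts = [len(entity.split()) for entity in trails]
# 2. Aggregation: features of the histogram of abstract tokens.
features = aggregate(word_counts)
print(word_counts, features)

# Identity abstraction on a list that repeats one string.
repeated = aggregate(["John Adams"] * 5)
print(repeated["Single"])
```

Swapping the abstraction function (identity, POS tag sequence, word count) while reusing the same aggregation gives one family of features per abstraction.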