Learning to rank adaptively for scalable information extraction
Pablo Barrio, Columbia University
Gonçalo Simões, INESC-ID and IST, University of Lisbon
Helena Galhardas, INESC-ID and IST, University of Lisbon
Luis Gravano, Columbia University
2 Information Extraction (IE)
• Natural-language text embeds "structured" data
• Information extraction systems extract this data
  Example: from "… A tornado swept the coast of Florida on Wednesday…", an information extraction system for the Natural Disaster-Location relation extracts the tuple <tornado, Florida>
• Much richer querying and analysis possible
3 IE is Challenging and Time Consuming
• Operates over large sets of features: bag of words, n-grams, grammar productions, dependency paths
  Example 2-grams from "… A tornado swept the coast of Florida on Wednesday…": "tornado swept," "swept the," "coast of," …
  The feature set may grow as large as the number of unique words and sequences of N words
• Requires complex text analysis: dependency parsing, entity recognition, syntactic parsing, shallow parsing, part-of-speech tagging, semantic role labeling
  [Figure: dependency parse of "A tornado swept the coast of Florida on Wednesday," with determiners, nominal subject, direct object, and prepositional modifiers, and with "tornado" and "Florida" tagged as Natural Disaster and Location]
• May take several seconds per document (e.g., with a subsequence kernel extractor for Natural Disaster-Location), which is problematic over large document collections
4 Reducing Processing Time: Opportunities
Documents are "useful" if they produce output for a given IE task
• Useful documents are a small, topic-specific fraction of the collection: only 2% of the documents in a New York Times archive, mostly environment-related, are useful for Natural Disaster-Location with a state-of-the-art IE system → extraction should focus on these documents and ignore the rest
• Useful documents share distinctive words and phrases (e.g., "earthquake," "storm," "Richter," "volcano eruption" for Natural Disaster-Location) → we can learn to differentiate the useful documents for an IE task from the rest
• The IE process "labels" documents as useful or not for free, generating an ever-expanding training set for learning to identify useful documents
5 Existing Approaches: QXtract and FactCrawl
• QXtract and FactCrawl learn from a small document sample and exhibit far-from-perfect recall
• FactCrawl ranks documents using learned queries and does not adapt to newly processed documents
[Eugene Agichtein and Luis Gravano, "Querying text databases for efficient information extraction," ICDE '03]
[Christoph Boden et al., "FactCrawl: A fact retrieval framework for full-text indices," WebDB '11]
6 Our Approach: Key Aspects
• Document ranking needs to be robust and efficient → learning-to-rank approach for document ranking: score each document d_i in the collection with a learned function f(d_i) = s_i and process documents in order s_1 ≥ s_2 ≥ s_3 ≥ … ≥ s_n
  Features: words and phrases
  Learning: online, with in-training feature selection
• Results of the extraction process form an ever-expanding training set → adaptive approach that updates the document ranking continuously as new training instances and new words (e.g., "lava," "fissure") appear
7 Ranking Documents Adaptively for IE
• Learning: the model learns that "tornado," "earthquake," or "aftermath" are markers of useful documents
• Document processing and update detection: processed documents yield new tuples such as <tornado, hawaii> and <volcano, chile> (from "… Still recovering from an earthquake, Chile is threatened by the eruption of Copahue volcano …"), while "… 'Aftermath' narrates the story of a man that goes missing…" yields nothing; useful documents about volcanoes had not yet been observed prominently in the IE process
  New information can potentially help improve the ranking, so update!
• Online relearning and ranking adaptation: the model learns that "volcano" and "eruption" are now markers of useful documents; relearning is performed online
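The rank / process / detect / relearn loop this slide illustrates can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: `ToyRanker`, `adaptive_extraction`, and the fixed `update_every` trigger are hypothetical simplifications standing in for the learned models and the update detectors described on later slides.

```python
class ToyRanker:
    """Toy keyword scorer standing in for the learned ranking model:
    it scores a document by how many known marker words it contains,
    and (online) adopts the words of every useful document as markers."""
    def __init__(self, markers):
        self.markers = set(markers)

    def score(self, doc):
        return sum(w in self.markers for w in doc["text"].split())

    def observe(self, doc, useful):
        if useful:
            self.markers.update(doc["text"].split())


def adaptive_extraction(collection, model, extractor, update_every):
    """Process documents in ranked order; periodically break out of the
    loop to re-rank the remaining documents with the updated model."""
    processed, results = set(), []
    while len(processed) < len(collection):
        remaining = sorted(
            (d for d in collection if d["id"] not in processed),
            key=model.score, reverse=True)
        for i, doc in enumerate(remaining, 1):
            processed.add(doc["id"])
            tuples = extractor(doc)           # the IE system "labels" the doc
            results.extend(tuples)
            model.observe(doc, useful=bool(tuples))
            if i % update_every == 0:         # simplistic update trigger
                break                         # relearn + re-rank
    return results
```

In the actual system the update trigger would be a detector such as Top-K or Mod-C rather than a fixed document count.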
8 Ranking Documents Adaptively for IE: Our Alternatives
• Efficient learning-to-rank techniques for information extraction: BAgg-IE, RSVM-IE
• Update detection techniques for document ranking adaptation: Top-K, Mod-C
9 Efficient Learning to Rank for IE: BAgg-IE
• Based on bootstrap aggregation (bagging)
• Bootstrapping: random sampling without replacement, for more stable classification
• Training: binary SVM classifiers over the words and phrases that make a document useful (e.g., relevant features: "tornado," "swept," "earthquake," "aftermath")
• Scoring: normalized score from each classifier
• Aggregation: sum of the classifiers' scores
• All models are trained using online learning and in-training feature selection
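A hedged sketch of the bagging idea on this slide: train several classifiers on random subsamples drawn without replacement, then score a document as the sum of each classifier's normalized score. To stay self-contained, a tiny perceptron stands in for the binary SVMs the system actually uses; `train_perceptron`, `train_bagg`, and `bagg_score` are illustrative names, not the paper's API.

```python
import random

def train_perceptron(samples, epochs=10):
    """Tiny online linear classifier (a stand-in for a binary SVM).
    Each sample is (feature_list, label) with label in {1, -1}."""
    w = {}
    for _ in range(epochs):
        for feats, label in samples:
            s = sum(w.get(f, 0.0) for f in feats)
            pred = 1 if s > 0 else -1
            if pred != label:                 # mistake-driven update
                for f in feats:
                    w[f] = w.get(f, 0.0) + label
    return w

def train_bagg(data, n_models=3, frac=0.75, seed=0):
    """Train an ensemble on random subsamples *without* replacement,
    as the slide specifies."""
    rng = random.Random(seed)
    k = max(1, int(frac * len(data)))
    return [train_perceptron(rng.sample(data, k)) for _ in range(n_models)]

def bagg_score(models, feats):
    """BAgg-IE-style aggregation: sum of per-model normalized scores."""
    total = 0.0
    for w in models:
        s = sum(w.get(f, 0.0) for f in feats)
        norm = sum(abs(v) for v in w.values()) or 1.0
        total += s / norm                     # normalized score per classifier
    return total
```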
10 Efficient Learning to Rank for IE: RSVM-IE
• Based on RankSVM: learns an SVM classifier on pairwise differences of documents; a training instance (d_i, d_n) is labeled 1 iff d_i is "better" than d_n
• Training: RankSVM over labeled document pairs, so that words and phrases that make a document useful rank it higher than others (e.g., relevant features: "storm," "swept")
• Scoring: classifier score
• The model is trained using online learning and in-training feature selection
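The pairwise idea can be sketched as below: the model is trained on the feature *differences* between a "better" and a "worse" document, so that better documents end up scoring higher. This is an illustrative simplification: a perceptron-style update replaces the actual SVM optimization, and the function names are hypothetical.

```python
def ranksvm_train(pairs, epochs=20):
    """Learn a linear model from (better, worse) document pairs, each
    document given as a {feature: value} dict.  A correctly ordered
    pair has a positive margin on the difference vector."""
    w = {}
    for _ in range(epochs):
        for better, worse in pairs:
            diff = {f: better.get(f, 0) - worse.get(f, 0)
                    for f in set(better) | set(worse)}
            margin = sum(w.get(f, 0.0) * v for f, v in diff.items())
            if margin <= 0:                   # pair mis-ordered: update
                for f, v in diff.items():
                    w[f] = w.get(f, 0.0) + v
    return w

def ranksvm_score(w, feats):
    """Score a document; higher means ranked earlier."""
    return sum(w.get(f, 0.0) * v for f, v in feats.items())
```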
11 Ranking Documents Adaptively for IE: Our Alternatives
• Efficient learning-to-rank techniques for information extraction: BAgg-IE, RSVM-IE
• Update detection techniques for document ranking adaptation: Top-K, Mod-C
12 Update Detection for Document Ranking Adaptation: Top-K
• Uses only the most important (top-K) features of the binary SVM classifier; feature weights indicate importance
• During document ranking: find the top-K features of the current model (e.g., storm: 3.2, richter: 2.8, …, people: 0.1)
• During document processing: find the top-K features of the updated model (e.g., richter: 3, eruption: 2.9, …, people: 0.06) and detect feature changes with the generalized Spearman's footrule; if the change exceeds a threshold τ, update the model and adapt the ranking
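One plausible reading of this slide's comparison step, sketched below: compare the two top-K feature lists with a generalized Spearman's footrule, assigning rank K+1 to features missing from a list. That missing-rank convention is a common one for top-K lists; the paper's exact variant may differ.

```python
def topk(weights, k):
    """Top-k features by absolute weight, most important first."""
    return [f for f, _ in sorted(weights.items(),
                                 key=lambda kv: -abs(kv[1]))[:k]]

def footrule_topk(old, new, k):
    """Generalized Spearman's footrule between the top-k feature lists
    of two models: sum of rank displacements, with features absent from
    one list placed at rank k + 1."""
    pos_old = {f: i for i, f in enumerate(topk(old, k), 1)}
    pos_new = {f: i for i, f in enumerate(topk(new, k), 1)}
    feats = set(pos_old) | set(pos_new)
    return sum(abs(pos_old.get(f, k + 1) - pos_new.get(f, k + 1))
               for f in feats)
```

An update would then be triggered when this distance exceeds the threshold τ.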
13 Update Detection for Document Ranking Adaptation: Mod-C
• Uses all features
• During document ranking: the current model's feature weights are obtained "for free" (e.g., tornado: 2.1, swept: 1.9, …, pending: 0.1)
• During document processing: recover the updated model's feature weights (e.g., eruption: 2.3, tornado: 1.6, …, morning: 0.08) and detect feature changes with the cosine similarity between the two weight vectors; if the change exceeds a threshold τ, update the model and adapt the ranking
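A minimal sketch of the comparison step, assuming the change is measured as 1 minus the cosine similarity of the two weight vectors over the union of their features; the slide only specifies that cosine similarity is compared against a threshold τ, so this exact formulation is an assumption.

```python
import math

def cosine_change(old, new):
    """Change between two model weight vectors ({feature: weight}),
    as 1 - cosine similarity: 0.0 for identical directions, 1.0 for
    orthogonal (e.g., fully disjoint) feature sets."""
    feats = set(old) | set(new)
    dot = sum(old.get(f, 0.0) * new.get(f, 0.0) for f in feats)
    n_old = math.sqrt(sum(v * v for v in old.values()))
    n_new = math.sqrt(sum(v * v for v in new.values()))
    if n_old == 0 or n_new == 0:
        return 1.0                     # an empty model counts as maximal change
    return 1.0 - dot / (n_old * n_new)
```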
14 Experimental Settings
• Dataset: New York Times archive, 1.8 million articles from 1987–2007
• Information extraction systems: simple extraction systems (HMMs, text patterns) and complex extraction systems (CRFs, SVM kernels)
• Relations, both dense and sparse:
  Disease-Outbreaks: "The Haiti cholera outbreak between 2010 and 2013 was the worst epidemic of cholera in recent history." → <cholera, between 2010 and 2013>
  Person-Organization: "Google co-founders Larry Page and Sergey Brin recently sat down with billionaire venture capitalist Vinod Khosla for a lengthy interview." → <Larry Page, Google>, <Sergey Brin, Google>
  Man Made Disaster-Location: "A fire destroyed a Cargill Meat Solutions beef processing plant in Booneville." → <fire, Booneville>
  Person-Career: ""This is not a victimless crime," said Jim Kendall, president of the Washington Association of Internet Service Providers." → <Jim Kendall, President>
  Other relations: Person-Charge, Election-Winner, Natural Disaster-Location
15 Does Learning Ranking Models Help?
Relation: Person-Charge; metric: F-measure
Baseline: ranking based on a small set of queries; our approach: ranking model learned on full document contents
• Learning ranking models leads to better document ranking
• RSVM-IE performs best at early stages; BAgg-IE obtains high gains later on
• The objective function of the learning model shapes the document ranking
Additional experiments in the paper: analogous conclusions over all relations
16 Does Update Detection Help?
Relation: Election-Winner; ranking strategy: RSVM-IE
Update detection baselines: Wind-F (updates after processing every 20,000 documents, 2% of the collection) and Feat-S (update method based on a Gaussian kernel [Glazer, ICPR '12])
• Feat-S is unable to evaluate new features, which are crucial during adaptation
• Top-K and Mod-C improve the efficiency of the extraction process
• Mod-C leads to the best executions and is the more efficient approach, as it trains fewer models
Additional experiments in the paper: analogous conclusions over all relations
17 Putting Learning to Rank and Update Detection Together: Recall Analysis
Relation: Disease-Outbreak; update detection method: Mod-C; compared against our adaptive implementation of the state of the art
• Our techniques bring significant improvements for sparse relations
• RSVM-IE performs best, as it prioritizes useful documents better, favoring adaptation
Additional experiments in the paper: analogous conclusions over all relations
18 Putting Learning to Rank and Update Detection Together: Extraction Time
Relation: Person-Organization Affiliation; compared against A-FactCrawl, our adaptive implementation of the state of the art
• The cost of adapting in A-FactCrawl hurts the efficiency of the extraction process
• Our techniques improve the efficiency of the process even for inexpensive IE systems
Additional experiments in the paper for our techniques:
• Analogous conclusions also for expensive IE systems and sparse relations
• Our techniques scale linearly in the size of the collection