Extracting World and Linguistic Knowledge from Wikipedia
Simone Paolo Ponzetto, Michael Strube
University of Heidelberg, EML Research gGmbH

Outline
• Introduction
• Deriving world knowledge from Wikipedia
• Leveraging linguistic knowledge
• Applications
• Outlook and future work
• Conclusions

Encyclopedic knowledge & NLP

The crisis at General Motors threatens to drag down Adam Opel, a storied German brand that GM bought 80 years ago, on the eve of the Great Depression. Many in the industry say Opel has a future only if it can get a temporary helping hand from the German government. But whether Chancellor Angela Merkel will make available the public financing needed to help release Opel from the clutches of General Motors now depends on a reluctant government, an influential automotive union that wants politicians to save jobs, and employees who yearn to re-establish Opel as an independent German company.
(source: Herald Tribune Europe, March 6, 2009)

• What about a widely used resource like WordNet?
• And Cyc?
• Let’s check Wikipedia on that topic!

Two main problems
1. Where do we get this knowledge from?
2. How can we use it effectively within NLP applications to advance the state of the art?

Domain and world knowledge

Project-specific domain knowledge bases:
+ very high quality
– small domain
– reusability
– high cost

[Figure: excerpt of a project-specific knowledge base for computer hardware, with concepts such as Computer-System, Notebook, Hard-Disk-Drive, Central-Unit and CPU, linked by relations such as developed-by, has-hd-drive, has-central-unit and has-cpu]

Domain and world knowledge: WordNet
+ pretty high quality
+ good coverage of everyday language
+ many languages
– very high cost
– sense proliferation
– coverage in domains arbitrary

[Figure: top of the WordNet noun hierarchy: entity; physical entity, abstract entity; thing, object, causal agent, substance, process, abstraction; with leaves such as change, freshener, horror, jimdandy, stinker, whacker]

Domain and world knowledge: Cyc
+ pretty high quality
+ good coverage of everyday language
+ common sense knowledge
– very high cost
– coverage in domains arbitrary
– English only

Domain and world knowledge: Ontology Learning from Text
+ low cost
+ potentially domain independent
– mostly only small domains
– low quality

[Figure: ontology fragment with the nodes car company, US car company, German car company, General Motors and Opel, linked by isa and belongs to relations]

Domain and world knowledge: manual vs. automatic

Manual approach
+ knowledge is manually input by human experts ➠ it produces high-quality information
– limited number of human experts ➠ expensive, and scales poorly to cover all domains

Automatic approach
+ it requires minimal supervision on large amounts of data ➠ low cost and scalable
– overall quality lower than that of humans ➠ unconstrained output, not necessarily ‘ontologized’

Domain and world knowledge: And Wikipedia?
• “one of the most fascinating developments of the Digital Age”
• “incredible example of open-source intellectual collaboration”
• “faith-based encyclopedia”
• “a joke at best”

“... an expert-led investigation carried out by Nature – the first to use peer review to compare Wikipedia and Britannica’s coverage of science ... revealed numerous errors in both encyclopedias, but among 42 entries tested, the difference in accuracy was not particularly great: the average science entry in Wikipedia contained around four inaccuracies; Britannica about three.” (Nature, 15 Dec. 2005)

Domain and world knowledge: And Wikipedia?
+ low cost
+ very good coverage, domain independent
+ very many languages
+ up to date
– ??
We evaluate quality empirically!

Where to get this knowledge from?
We are after a “steak and lobster” combination ...
• manual approaches achieve high quality for a limited coverage
• automatic ones achieve large coverage for a lower quality
➠ use manually annotated semi-structured input
➠ develop lightweight methods to generate large-coverage, high-quality structured output

Wikipedia
Wikipedia is ...
• a free, on-line encyclopedia
• based on a model of communal content creation
• available in more than 266 different languages (April 2009)
• user interface provided by a Web-based Wiki software application, e.g. MediaWiki, running on top of a LAMP architecture
• edited as plain text by means of a markup language (wiki markup), in order to provide structured annotations

Why Wikipedia
Wikipedia is ...
1. domain independent ➠ it has a large coverage
2. up-to-date ➠ to process current information
3. multilingual ➠ to process information in many languages

Wikipedia category network
• since May 2004 Wikipedia provides a collaboratively generated category network

Semantic relatedness with Wikipedia
WikiRelate! (Strube & Ponzetto, 2006):
1. Wikipedia pages represent categorized concepts
2. all Wikipedia categories form a semantic network
3. relations between concepts are given along the network
➠ use the category network as a semantic network ...
➠ ... to compute semantic relatedness
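
The idea of reading relatedness off the category network can be sketched roughly as follows. This is a minimal illustration, not the WikiRelate! implementation: the toy network, the page-to-category assignments and the inverse path-length score are all assumptions made for the example, standing in for whatever measures the original system applies.

```python
from collections import deque

# Hypothetical toy category network: each category maps to its parent categories.
CATEGORY_PARENTS = {
    "GERMAN CAR COMPANIES": ["CAR COMPANIES"],
    "US CAR COMPANIES": ["CAR COMPANIES"],
    "CAR COMPANIES": ["COMPANIES"],
    "COMPANIES": [],
}

# Hypothetical page-to-category assignments.
PAGE_CATEGORIES = {
    "Opel": ["GERMAN CAR COMPANIES"],
    "General Motors": ["US CAR COMPANIES"],
}


def build_undirected(parents):
    """Turn the child -> parents map into an undirected adjacency map."""
    graph = {}
    for child, parent_list in parents.items():
        graph.setdefault(child, set())
        for parent in parent_list:
            graph.setdefault(parent, set())
            graph[child].add(parent)
            graph[parent].add(child)
    return graph


def shortest_path_length(graph, sources, targets):
    """Breadth-first search from all sources; edges on the shortest path, or None."""
    targets = set(targets)
    seen = set(sources)
    queue = deque((s, 0) for s in sources)
    while queue:
        node, dist = queue.popleft()
        if node in targets:
            return dist
        for neighbour in graph.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None


def relatedness(page_a, page_b):
    """Illustrative score: inverse of (1 + shortest category path length)."""
    graph = build_undirected(CATEGORY_PARENTS)
    dist = shortest_path_length(
        graph, PAGE_CATEGORIES[page_a], PAGE_CATEGORIES[page_b]
    )
    return 0.0 if dist is None else 1.0 / (1 + dist)
```

In this toy network the categories of Opel and General Motors are two edges apart (via CAR COMPANIES), so the illustrative score is 1/3, while a page compared with itself scores 1.0.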

Comparison of different approaches
WikiRelate!, ESA and WLM leverage different features of Wikipedia
• WikiRelate! uses categories (∼3 categories/article)
• ESA uses articles (∼2,800,000) and words (∼400 words/article)
• WLM uses hyperlinks (∼34 hyperlinks/article)

Deriving a taxonomy from Wikipedia

Deriving a taxonomy
• induce semantically-typed relations
• the category network is merely a thematic categorization of the topics of articles
• task ➠ label the relations between categories as isa and notisa
• goal ➠ transform a thematic categorization into a fully-fledged taxonomy

Deriving a taxonomy
• methods:
  • syntactic matching
  • connectivity in the network
  • lexico-syntactic patterns
• results:
  • we start with 337,522 categories and 743,140 links
  • we generate 335,128 isa relations
➠ large-scale, multi-domain taxonomy

Category network cleanup (1)
• removal of meta-categories used for encyclopedia management, e.g. categories under WIKIPEDIA ADMINISTRATION
• we remove all nodes whose labels contain any of the following strings: MEDIAWIKI, TEMPLATE, USER, PORTAL, CATEGORIES, ARTICLES, PAGES
• this leaves
  • 240,760 categories
  • 515,423 links
  still to be processed
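
The string-based part of the cleanup step could look roughly like the sketch below. It assumes case-insensitive substring matching and a network given as a set of labels plus (child, parent) pairs; both are assumptions for illustration, and the removal of everything under WIKIPEDIA ADMINISTRATION is not modelled here.

```python
# Strings listed on the slide; labels containing any of them are treated as meta-categories.
META_STRINGS = (
    "MEDIAWIKI", "TEMPLATE", "USER", "PORTAL",
    "CATEGORIES", "ARTICLES", "PAGES",
)


def is_meta_category(label):
    """True if the label contains one of the meta strings (assumed case-insensitive)."""
    upper = label.upper()
    return any(s in upper for s in META_STRINGS)


def clean_network(categories, links):
    """Drop meta-category nodes and every link that touches a dropped node."""
    kept = {c for c in categories if not is_meta_category(c)}
    kept_links = [(child, parent) for child, parent in links
                  if child in kept and parent in kept]
    return kept, kept_links


# Hypothetical toy input, not taken from a Wikipedia dump.
toy_categories = {
    "SCIENTISTS", "COMPUTER SCIENTISTS",
    "WIKIPEDIA TEMPLATES", "ARTICLES NEEDING CLEANUP",
}
toy_links = [
    ("COMPUTER SCIENTISTS", "SCIENTISTS"),
    ("WIKIPEDIA TEMPLATES", "SCIENTISTS"),
]
kept_categories, kept_links = clean_network(toy_categories, toy_links)
```

On the toy input only SCIENTISTS and COMPUTER SCIENTISTS survive, together with the single link between them.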

Refinement link identification (2)
[Figure: ALBUMS BY ARTIST is-refined-by MILES DAVIS ALBUMS; CUISINE BY NATIONALITY is-refined-by FRENCH CUISINE]
• patterns such as Y X and X BY Z
• their purpose is to better structure and simplify the categorization network
• we assume this represents is-refined-by relations
• this labels 126,920 category links notisa and leaves 388,503 relations to be analyzed

Syntax-based methods (3)
[Figure: BRITISH COMPUTER SCIENTISTS isa COMPUTER SCIENTISTS isa SCIENTISTS (same lexical head)]
• head matching labels pairs of categories sharing the same lexical head word (or lemma)
• we identify lexical heads using the Stanford parser and lemmata using morpha
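
Steps (2) and (3) can be illustrated together with a small sketch. It is a deliberately crude stand-in: the head is taken to be the last token of the label and lemmatization is plain plural stripping, whereas the slides use the Stanford parser and morpha. All function names are hypothetical.

```python
def is_refinement_link(sub_label, super_label):
    """True if super_label looks like 'X BY Z' and sub_label looks like 'Y X'."""
    if " BY " not in super_label:
        return False
    x = super_label.split(" BY ", 1)[0]
    return sub_label != x and sub_label.endswith(" " + x)


def naive_head_lemma(label):
    """Assumption: the head is the last token; the lemma is a crude singular form."""
    head = label.split()[-1].lower()
    if head.endswith("ies"):
        return head[:-3] + "y"
    if head.endswith("s") and not head.endswith("ss"):
        return head[:-1]
    return head


def label_link(sub_label, super_label):
    """Label a (subcategory, supercategory) link as notisa, isa, or unknown."""
    if is_refinement_link(sub_label, super_label):
        return "notisa"  # is-refined-by link, as in step (2)
    if naive_head_lemma(sub_label) == naive_head_lemma(super_label):
        return "isa"     # same lexical head, as in step (3)
    return "unknown"     # left for later methods
```

With these definitions, `label_link("MILES DAVIS ALBUMS", "ALBUMS BY ARTIST")` gives `"notisa"` and `label_link("BRITISH COMPUTER SCIENTISTS", "COMPUTER SCIENTISTS")` gives `"isa"`. The last-token heuristic fails on labels with post-modifiers (the head of ALBUMS BY ARTIST is ALBUMS, not ARTIST), which is one reason a parser is the more suitable tool for real data.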