Towards a Meaning-Aware Math Information Retrieval
Petr Sojka et al.
Masaryk University, Faculty of Informatics, Brno, Czech Republic
Wikipedia Subtask Pre-Conference Meeting, NTCIR-11 2014, Tokyo, Japan
December 8th, 2014, 2PM
Illustrations by Jiří Franek.
Talk Topics and Take-home Message
[Figure: mind map of the talk topics. Recoverable labels: Motivation; Searching: MIaS; MIaS at NTCIR; Math-2 Task; Wikipedia Subtask; Entailment; Structured Searching; Formulae; Meaning search; Evaluation; Future of Math Information Retrieval; Summary. Names attached to branches: (Martin), (Michal), (Partha), (???).]
Petr Sojka, Wikipedia Subtask, NTCIR-11 2014, Tokyo, Japan, Dec 8th, 2PM: Towards a Meaning-Aware Information Retrieval
Outline
1 Motivation
2 Searching: MIaS
3 MIaS at NTCIR
4 MIaS at NTCIR Wikipedia Task
5 Entailment
6 Summary and Future Work
Dependency on Information Retrieval: Information Society Now!
Search: A Gate to Knowledge
Meaning (semantics) is usually structured compositionally.
Structures: math formulae, syntactic or sentence dependency trees, compositional named entity terms, knowledge base terms.
Querying and searching similar structures is more and more important: striving to find the right meaning.
Example query: "L^{p(x)}", $L^{p(x)}$ (+ without quotes or figures :-)).
Nature 454, 263 (2008) | doi:10.1038/454263b
Starting small but adding up: a free maths archive

A small group of researchers is meeting in Birmingham, UK, later this month to plan a free digital library of mathematics.

All the mathematical literature ever published runs to more than 50 million pages, with around 75,000 articles added each year. Over the past decade there have been several attempts to make this prodigious body of work accessible in a single digital archive, but so far none has succeeded.

A group of mathematicians intends to change this. They have started small, with a handful of digitization projects in Poland, Russia, Serbia and the Czech Republic. In a few years they hope to unite these repositories with their western European counterparts in an archive to be hosted by the European Union, according to the organizer, Petr Sojka, an informatics scientist at Masaryk University in Brno in the Czech Republic. Eventually this pan-European archive could be expanded globally, he says.

To make such an archive easier to search, researchers have found ways to guess the subject of a paper on the basis of the frequency of symbols in it. But there will be many more-practical challenges, such as finding the funds to scan millions of old papers and striking deals with publishers who hold rights to them.

It may already be too late to build a single free mathematical archive, according to John Ewing, head of the American Mathematical Society, which maintains a list of more than 1,500 journals whose archives have already been digitized. “A few years ago, this model had the potential to change the mathematics journal literature in profound ways,” he says. But most publishers have rushed to scan their own archives in order to lock them up and sell them to libraries.

“While the effort to digitize the smaller collections is admirable, and it’s certainly worthwhile, it’s unlikely to effect a larger change,” says Ewing. ■
Jascha Hoffman

The workshop series Towards a Digital Mathematics Library was founded to tackle numerous challenges identified during the DML-CZ project.
Math-aware Search and Indexing
• Conventional searching approaches are not applicable to math.
• Usage of existing mathematical search engines (MathDex, EgoMath, LaTeXSearch, LeActiveMath, MathWebSearch) is problematic.
• New Math Indexer and Searcher (MIaS) developed at MU.
Math Indexer and Searcher (MIaS): Features
• Inspired mostly by MathDex and EgoMath.
• Presentation and now also Content MathML.
• Allows similarity (not only exact match) between query and matched term; distributional representation of formulae.
• Commutativity.
• Unification of variables and constants.
• Subformulae matching.
• Level of similarity calculation for expressions.
• Mixed mathematical-textual queries.
• Based on the state-of-the-art full-text Apache Lucene core.
Math Indexer and Searcher: Overall Design
Math Indexer and Searcher: Math Workflow Design
Formula Processing Weighting Example
Math Formulae Indexing Processing
[Figure: processing workflow shared by indexing and searching. Math processing stages: ordering, canonicalization, tokenization, weighting, variables unification, constants unification.]
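The ordering, tokenization and weighting stages named on this slide can be illustrated with a minimal Python sketch. It is not the MIaS implementation: the tuple-based tree, the `COMMUTATIVE` set, the `render` serialization and the `damping` coefficient of 0.7 are all assumptions chosen for illustration.

```python
from typing import List, Tuple, Union

# A formula is a leaf symbol (str) or a tuple (operator, child, child, ...).
Node = Union[str, Tuple]

# Assumption: operators whose operands may be freely reordered.
COMMUTATIVE = {"+", "*"}


def render(node: Node) -> str:
    """Serialize a tree into a flat string that can serve as an index token."""
    if isinstance(node, str):
        return node
    op, *children = node
    return "(" + op + " " + " ".join(render(c) for c in children) + ")"


def order(node: Node) -> Node:
    """Sort operands of commutative operators so a+b and b+a serialize identically."""
    if isinstance(node, str):
        return node
    op, *children = node
    children = [order(c) for c in children]
    if op in COMMUTATIVE:
        children.sort(key=render)
    return (op, *children)


def tokenize(node: Node, weight: float = 1.0,
             damping: float = 0.7) -> List[Tuple[str, float]]:
    """Emit the formula and every subformula; weight shrinks per nesting level.

    The damping value is a hypothetical placeholder, not a MIaS constant.
    """
    tokens = [(render(node), weight)]
    if not isinstance(node, str):
        for child in node[1:]:
            tokens.extend(tokenize(child, weight * damping, damping))
    return tokens


if __name__ == "__main__":
    formula = order(("+", "y", "x"))
    for token, token_weight in tokenize(formula):
        print(token, token_weight)
```

Indexing every subformula is what makes subformulae matching possible, and sorting commutative operands before serialization lets `x + y` and `y + x` share one token.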
Example
[Figure: worked example of the processing workflow. Two formulae in x and y that differ only in a numeric constant (3 versus 2), one entering through indexing and one through searching, pass through ordering, canonicalization, tokenization, weighting, variables unification (variables become id1, id2) and constants unification (numbers become const). The unified forms coincide at the searching step: "Match!". The formula layout is not recoverable.]
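The two unification steps in this example can be sketched in Python as follows. This is a simplified illustration, not the MIaS code: the flat token lists, the `id1`/`id2`/`const` naming scheme and the two sample formulae are assumptions modelled loosely on the slide.

```python
import re
from typing import Dict, List

NUMBER = re.compile(r"\d+(\.\d+)?")


def unify_variables(tokens: List[str]) -> List[str]:
    """Rename variables to id1, id2, ... in order of first appearance."""
    names: Dict[str, str] = {}
    unified = []
    for token in tokens:
        # Assumption: a purely alphabetic token is a variable name.
        if token.isalpha() and token != "const":
            names.setdefault(token, f"id{len(names) + 1}")
            unified.append(names[token])
        else:
            unified.append(token)
    return unified


def unify_constants(tokens: List[str]) -> List[str]:
    """Replace every numeric literal by the placeholder 'const'."""
    return ["const" if NUMBER.fullmatch(t) else t for t in tokens]


def unify(tokens: List[str]) -> List[str]:
    """Apply variables unification, then constants unification."""
    return unify_constants(unify_variables(tokens))


if __name__ == "__main__":
    # Hypothetical indexed formula x^(y + y^3) and query a^(b + b^2).
    indexed = ["x", "^", "(", "y", "+", "y", "^", "3", ")"]
    query = ["a", "^", "(", "b", "+", "b", "^", "2", ")"]
    print(unify(indexed))
    print("Match!" if unify(indexed) == unify(query) else "No match")
```

Because variables are renamed by position of first appearance, two formulae match after unification only if they reuse variables in the same pattern, so `x + x` and `x + y` still differ.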
Implementation
• Java.
• Solr + Lucene.
• Scalable: indexing 10^10+ formulae without problems.
• The mathematical part implements Lucene's Tokenizer interface, so it can be integrated into any Solr/Lucene-based system such as DSpace, Elasticsearch…
Formulae Search Demonstration Comments
Demo web interface: all up and ready on the EuDML system.
• MathML/TeX input (LaTeXML for conversion to MathML).
• Canonicalization of the query: our own MathCanEval canonicalizer (developed by students as part of the Dean's program at FI MU).
• Matched document snippet generation.
• MathJax for nicer math rendering and better portability.
• Snuggle TeX for on-the-fly as-you-type rendering.
MIaS4NTCIR: data indexing statistics

Table: Formulae count statistics
Documents | Original formulae | Indexed formulae
8,301,545 | 59,647,566 | 3,021,865,236

Table: Index statistics
Indexing times [min] (CPU, Wall Clock): 1,940.0 and 3,413.55 (column assignment not recoverable)
Index size [GiB]: 68
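As a rough reading of the formulae count statistics (8,301,545 documents, 59,647,566 original formulae, 3,021,865,236 indexed formulae), the short Python calculation below derives two ratios. The variable names are illustrative, and the interpretation that the growth comes from indexing subformulae and unified variants is an assumption rather than something stated on the slide.

```python
# Counts taken from the formulae count statistics table.
documents = 8_301_545
original_formulae = 59_647_566
indexed_formulae = 3_021_865_236

# Average number of input formulae per indexed document.
formulae_per_document = original_formulae / documents

# How many indexed expressions each input formula produces on average.
expansion_factor = indexed_formulae / original_formulae

print(f"formulae per document: {formulae_per_document:.2f}")
print(f"indexed expressions per original formula: {expansion_factor:.2f}")
```

Each input formula therefore yields roughly fifty indexed expressions on average.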