CICM 2016, Doctoral program Augmenting Mathematical Formulae for More Effective Querying & Presentation Moritz Schubotz 31/07/2016 www.formulasearchengine.com 1
Motivation: 26th of February 2011, 18.03.-26.03.2011, 28th of March
Example 1: $\frac{1}{\bar{x}} \le \overline{\frac{1}{x}}$ (NTCIR-11 Math-2, WMC-D1)
Query: $\frac 1 \mean ?x \le \mean \frac 1 ?x$
<apply>
  <leq/>
  <apply>
    <divide/>
    <cn type="integer">1</cn>
    <apply>
      <mean/>
      <qvar>x</qvar></apply></apply>
  <apply>
    <mean/>
    <apply>
      <divide/>
      <cn type="integer">1</cn>
      <qvar>x</qvar></apply></apply>…
1. Different forms, e.g. $\bar{x} \cdot \overline{\frac{1}{x}} \ge 1$
2. Different notations, e.g. $\int \dots\, x\, \mathrm{d}x = \langle x \rangle$
3. Exact matches are rare
4. Ambiguity in syntax, e.g. $E\Psi = \hat{H}\Psi$
5. No TeX function \mean
Result 1: $\varphi(\mathbb{E}[X]) \le \mathbb{E}[\varphi(X)]$
Original query: $\frac 1 \mean ?x \le \mean \frac 1 ?x$
<apply>
  <leq/>
  <apply>
    <divide/>
    <cn type="integer">1</cn>
    <apply>
      <mean/>
      <qvar>x</qvar></apply></apply>
  <apply>
    <mean/>
    <apply>
      <divide/>
      <cn type="integer">1</cn>
      <qvar>x</qvar></apply></apply>…
Generalized query: $?f[type=function] \mean ?x \le \mean ?f[type=function] ?x$
<apply>
  <leq/>
  <apply>
    <qvar type="function">f</qvar>
    <apply>
      <mean/>
      <qvar>x</qvar></apply></apply>
  <apply>
    <mean/>
    <apply>
      <qvar type="function">f</qvar>
      <qvar>x</qvar></apply></apply>…
$\frac 1$ -> $?f[type=function]$: not trivial
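The matching behaviour above can be illustrated with a minimal sketch. It assumes expressions are encoded as nested Python tuples with the operator first, which is a simplification of the Content MathML shown on the slide. The names `QVar` and `match` are hypothetical and are not part of any real search engine API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class QVar:
    """Query variable such as ?x or ?f[type=function] (hypothetical encoding)."""
    name: str
    type: Optional[str] = None


def match(pattern, expr, bindings=None):
    """Return the variable bindings if pattern matches expr, otherwise None."""
    bindings = dict(bindings or {})
    if isinstance(pattern, QVar):
        # Assumption: a function-typed variable may only bind an operator name.
        if pattern.type == "function" and not isinstance(expr, str):
            return None
        if pattern.name in bindings:
            return bindings if bindings[pattern.name] == expr else None
        bindings[pattern.name] = expr
        return bindings
    if isinstance(pattern, tuple):
        if not isinstance(expr, tuple) or len(pattern) != len(expr):
            return None
        for sub_pattern, sub_expr in zip(pattern, expr):
            bindings = match(sub_pattern, sub_expr, bindings)
            if bindings is None:
                return None
        return bindings
    return bindings if pattern == expr else None


x = QVar("x")
f = QVar("f", type="function")

# \frac 1 \mean ?x \le \mean \frac 1 ?x
original = ("leq", ("divide", 1, ("mean", x)), ("mean", ("divide", 1, x)))
# ?f \mean ?x \le \mean ?f ?x
generalized = ("leq", (f, ("mean", x)), ("mean", (f, x)))

# phi(E[X]) <= E[phi(X)], with E encoded as "mean"
jensen = ("leq", ("phi", ("mean", "X")), ("mean", ("phi", "X")))
# 1 / mean(x) <= mean(1 / x)
reciprocal = ("leq", ("divide", 1, ("mean", "x")), ("mean", ("divide", 1, "x")))

print(match(original, jensen))
print(match(generalized, jensen))
print(match(generalized, reciprocal))
```

In this encoding the generalized query matches the Jensen-type result but no longer matches the original reciprocal formula, because `divide` takes two arguments while `?f` is applied to one. That is one way to see why replacing `\frac 1` by a function variable is not trivial.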
Solution 1: inexact matches
• Refined query: $\superconceptOf[ orderby = editdistance ]{ \frac 1 \mean ?x \le \mean \frac 1 ?x }$
• Computational complexity
• Restriction of the search space
• Check the most likely solutions first
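As a rough illustration of ordering inexact matches by edit distance, the sketch below ranks candidate formulae by the Levenshtein distance between whitespace-separated TeX tokens. This is an assumption made for brevity: a real system would more likely compare expression trees. The helper names and the `max_distance` cutoff (a stand-in for restricting the search space) are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        curr = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def rank_by_edit_distance(query, candidates, max_distance=None):
    """Sort candidates by token-level distance to the query, closest first."""
    scored = [(edit_distance(query.split(), c.split()), c) for c in candidates]
    if max_distance is not None:
        # Hypothetical cutoff that prunes unlikely candidates.
        scored = [(d, c) for d, c in scored if d <= max_distance]
    return sorted(scored)


query = r"\frac 1 \mean x \le \mean \frac 1 x"
candidates = [
    r"\mean x \ge 0",
    r"\phi \mean X \le \mean \phi X",
    r"\frac 1 \mean x \le \mean \frac 1 x",
]

for distance, formula in rank_by_edit_distance(query, candidates):
    print(distance, formula)
```

Sorting puts the most likely solutions first, and the cutoff bounds how many candidates need to be inspected.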
But there are diverse information needs
1. Definition look-up
2. Explanation look-up
3. Proof look-up
4. Application look-up
5. Computation assistance
6. General formula search
... and the data looks like this
Levels of Abstraction: Semantic, Content, Presentation
Overview
Image → (image processing) → Presentation → (structure detection) → Content → (entity linkage) → Semantic
Integrated Queries
• Metadata
• Math: Presentation, Content, Semantic
• Text: Keywords, Relations
Completed Research
Querying
- Making Math Searchable in Wikipedia (CICM 2012)
- Evaluation of Similarity-Measure Factors for Formulae (NTCIR 2015)
- Wikipedia Subtask at NTCIR 11 (SIGIR 2015)
- Exploring the single-brain barrier (NTCIR 2016)
Processing
- Mathematical Language Processing (CICM 2014)
- Digital Repository of Mathematical Formulae (CICM 2014, coauthor)
- Growing the DRMF with generic LaTeX sources
- Mathoid: Accessible Math Rendering for Wikipedia (CICM 2014)
- Semantification of Identifiers in Mathematics for Better Math Information Retrieval (SIGIR 2016)
Scalability
- Applying Stratosphere for Big Data Analytics (coauthor, BTW 2013)
- Querying large Collections of Mathematical Publications (NTCIR 2013, with Marcus Leich)
Mathoid: Robust, Scalable, Fast and Accessible Math Rendering for Wikipedia
Let $(\Omega, A, \mu)$ be a probability space, $X$ an integrable real-valued random variable and $\varphi$ a convex function. Then: $\varphi(\mathbb{E}[X]) \le \mathbb{E}[\varphi(X)]$.
• convex function (Q319913, NDL ID 00573442)
• subclass of function
• ja: 凸関数
Exploring the single-brain barrier
• “one-brain barrier” [1]
  – Metaphor: the knowledge needed to conduct math research has to be co-located in one brain
• Goals of our contribution to NTCIR-12:
  – Create a point of reference with respect to this barrier for a trained mathematician
  – Compare the performance of a human to MIR systems and analyse characteristic strengths and weaknesses
  – Derive insights to improve MIR systems
  – Combine the relevant results of the human and the MIR systems to create a gold standard
Exploring the single-brain barrier
• Strengths of MIR systems:
  – Definition look-up queries
  – Application look-up
• Weaknesses of MIR systems:
  – Low precision
  – No unified query language to specify the query type
• A gold standard dataset can help to develop a math-aware search engine for Wikipedia
Semantification of Identifiers in Mathematics for Better Math Information Retrieval
• First step towards enabling computers to understand mathematicians' notation
• Focus on identifiers
• Extract identifier semantics by combining math and text
• Use computers to find relevant mathematics
• Computers must understand the semantics of math to provide the needed information
Math Augmentation Approach
(1) Extract formulae
(2) Extract identifiers
(3) Find identifiers
(4) Find definiens candidates
(5) Score all identifier-definiens pairs
(6) Generate feature vectors
(7) Cluster feature vectors
(8) Map clusters to subject hierarchy
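Steps (1) to (5) can be sketched on a toy sentence as follows. This is only an illustration under strong assumptions: formulae are taken to sit in `<math>` tags, identifiers are single Latin letters or a few Greek TeX commands, definiens candidates are any non-stopword words, and the score is the inverse token distance. None of these choices is taken from the slides, and all function names are hypothetical. Steps (6) to (8) are not covered.

```python
import re
from collections import defaultdict

MATH_TAG = re.compile(r"<math>(.*?)</math>", re.DOTALL)
# Assumption: an identifier is a Greek TeX command or an isolated Latin letter.
IDENTIFIER = re.compile(
    r"\\(?:alpha|beta|gamma|lambda|mu|sigma|phi|psi)\b"
    r"|(?<![\\A-Za-z])[A-Za-z](?![A-Za-z])"
)
STOPWORDS = {"the", "is", "of", "and", "a", "an"}


def extract_formulae(text):
    """Step 1: return the TeX source of every formula in the text."""
    return MATH_TAG.findall(text)


def extract_identifiers(formula):
    """Step 2: return the set of identifiers used in one formula."""
    return set(IDENTIFIER.findall(formula))


def tokenize(sentence):
    """Step 3: interleave word tokens and identifier tokens in reading order."""
    tokens, pos = [], 0
    for m in MATH_TAG.finditer(sentence):
        words = re.findall(r"[A-Za-z]+", sentence[pos:m.start()])
        tokens += [("word", w.lower()) for w in words]
        tokens += [("id", i) for i in sorted(extract_identifiers(m.group(1)))]
        pos = m.end()
    tokens += [("word", w.lower()) for w in re.findall(r"[A-Za-z]+", sentence[pos:])]
    return tokens


def score_pairs(sentence):
    """Steps 4 and 5: score each identifier-definiens pair by token distance."""
    tokens = tokenize(sentence)
    scores = defaultdict(float)
    for i, (kind_i, ident) in enumerate(tokens):
        if kind_i != "id":
            continue
        for j, (kind_j, word) in enumerate(tokens):
            if kind_j == "word" and word not in STOPWORDS:
                key = (ident, word)
                scores[key] = max(scores[key], 1.0 / abs(i - j))
    return dict(scores)


def best_definiens(sentence):
    """Pick the highest-scoring definiens candidate for every identifier."""
    best = {}
    for (ident, word), score in score_pairs(sentence).items():
        if ident not in best or score > best[ident][1]:
            best[ident] = (word, score)
    return {ident: word for ident, (word, _) in best.items()}


SENTENCE = ("The energy <math>E</math> is the product of the mass "
            "<math>m</math> and the squared speed <math>c</math>.")
print(best_definiens(SENTENCE))
```

A realistic implementation would restrict candidates to nouns and noun phrases and combine several features in the score. The distance-only score here is just the simplest choice that keeps the example self-contained.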
Wikipedia Subtask at NTCIR 11
• NTCIR 11 Wikipedia dataset*
• 30k Wikipedia articles
• 280k formulae
• 100 queries
*) Moritz Schubotz, Abdou Youssef, Volker Markl, and Howard S. Cohl. 2015. Challenges of Mathematical Information Retrieval in the NTCIR-11 Math Wikipedia Task. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '15). ACM, New York, NY, USA, 951-954.
Wikipedia Subtask at NTCIR 11
• CICM 2012 (2 participants)
• NTCIR 2013 (pilot) (6 participants)
• NTCIR 2014
  – arXiv (8 participants)
  – Wikipedia (7 participants)
Wikipedia Task results
Figure: precision against recall; higher is better on both axes.
Gold standard available from http://mlp.formulasearchengine.com
Gold standard details
310 identifiers
Chart values: 8, 97, 174, 27, 4
Chart legend: Wikidata item, Two Wikidata items, Wikidata item + NP, Individual NP, Multiple NP
Results: Identifiers
• 294/310 correctly extracted (94.8%)
• 57 false positives (fp)
• Problems (fn = false negatives)
  – Incorrect markup (8 fn, 33 fp)
  – Symbols (9 fp)
  – Sub-super script (3 fp, 2 fn)
  – Special notation (10 fp, 2 fn)
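The counts above translate into precision and recall as follows. This short worked example assumes that the 294 correctly extracted identifiers are the true positives and that the remaining 310 - 294 = 16 gold-standard identifiers count as false negatives. The F1 value is derived here and is not stated on the slide.

```python
def precision_recall_f1(tp, fp, fn):
    """Standard precision, recall and F1 from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Counts from the slide: 294 of 310 identifiers extracted, 57 false positives.
tp, fp = 294, 57
fn = 310 - tp  # assumption: every missed gold-standard identifier is a false negative

precision, recall, f1 = precision_recall_f1(tp, fp, fn)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
# prints: precision=0.838 recall=0.948 f1=0.890
```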
Results: (4) Find definiens candidates
Results: Definitions
Chart values: 83, 88, 19, 120
Chart legend: Exact match, Partial match, not found, not in document
Distribution of identifier counts
Results: Definitions with Namespace support
Chart values: 60, 103, 147
Chart legend: Exact match, Partial match, not found