
CICM 2016, Doctoral program: Augmenting Mathematical Formulae for More Effective Querying & Presentation



  1. CICM 2016, Doctoral program: Augmenting Mathematical Formulae for More Effective Querying & Presentation. Moritz Schubotz, 31/07/2016, www.formulasearchengine.com

  2. Motivation: 26th of February 2011; 18.03.-26.03.2011; 28th of March

  3. Example 1 (NTCIR-11 Math-2 WMC-D1): 1/⟨x⟩ ≤ ⟨1/x⟩
     TeX query: $\frac 1 \mean ?x \le \mean \frac 1 ?x$
     Content MathML query:
       <apply>
         <leq/>
         <apply>
           <divide/>
           <cn type="integer">1</cn>
           <apply>
             <mean/>
             <qvar>x</qvar></apply></apply>
         <apply>
           <mean/>
           <apply>
             <divide/>
             <cn type="integer">1</cn>
             <qvar>x</qvar></apply></apply>…
     1. Different forms, e.g. ⟨1/x⟩ ≥ 1/⟨x⟩
     2. Different notations, e.g. ∫ x f_X(x) dx = ⟨x⟩
     3. Exact matches are rare
     4. Ambiguity in syntax, e.g. EΨ = ĤΨ
     5. No TeX function \mean
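
The wildcard semantics of `<qvar>` in the query above can be illustrated with a minimal sketch. It is not the search engine's implementation: the function names `canon`, `match` and `find_bindings` are hypothetical, attributes such as `type="integer"` are ignored, and the query string is a closed-off variant of the slide's MathML (the slide's trailing "…" is not reproduced). The assumption is that a `qvar` binds to any subtree and that repeated occurrences of the same `qvar` must bind to equal subtrees.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Optional


def canon(node: ET.Element) -> tuple:
    """Whitespace-insensitive, comparable form of a subtree (attributes ignored)."""
    return (node.tag, (node.text or "").strip(), tuple(canon(c) for c in node))


def match(pattern: ET.Element, target: ET.Element, bindings: Dict[str, ET.Element]) -> bool:
    """Match target against pattern; <qvar> nodes act as named wildcards."""
    if pattern.tag == "qvar":
        name = (pattern.text or "").strip()
        bound = bindings.get(name)
        if bound is None:
            bindings[name] = target  # first occurrence: bind the whole subtree
            return True
        return canon(bound) == canon(target)  # later occurrences must agree
    if pattern.tag != target.tag:
        return False
    if (pattern.text or "").strip() != (target.text or "").strip():
        return False
    if len(pattern) != len(target):
        return False
    return all(match(p, t, bindings) for p, t in zip(pattern, target))


def find_bindings(query_xml: str, formula_xml: str) -> Optional[Dict[str, ET.Element]]:
    """Return the qvar bindings if the formula matches the query, else None."""
    bindings: Dict[str, ET.Element] = {}
    ok = match(ET.fromstring(query_xml), ET.fromstring(formula_xml), bindings)
    return bindings if ok else None


# Closed-off version of the slide's query (assumption, see lead-in).
QUERY = (
    "<apply><leq/>"
    "<apply><divide/><cn>1</cn><apply><mean/><qvar>x</qvar></apply></apply>"
    "<apply><mean/><apply><divide/><cn>1</cn><qvar>x</qvar></apply></apply>"
    "</apply>"
)
# A formula that instantiates ?x with the identifier y.
FORMULA = QUERY.replace("<qvar>x</qvar>", "<ci>y</ci>")

if __name__ == "__main__":
    result = find_bindings(QUERY, FORMULA)
    print({name: node.text for name, node in (result or {}).items()})
```

Such an exact structural match only succeeds when the formula has the same tree shape as the query, which is why the slide lists different forms and notations as obstacles.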

  5. Result 1: φ(𝔼[X]) ≤ 𝔼[φ(X)]
     Query: $\frac 1 \mean ?x \le \mean \frac 1 ?x$
     Generalised query: $?f[type=function] \mean ?x \le \mean ?f[type=function] ?x$
     Expression trees: \le(\frac(1, \mean(?x)), \mean(\frac(1, ?x))) and \le(?f(\mean(?x)), \mean(?f(?x)))

  6. Result 1: φ(𝔼[X]) ≤ 𝔼[φ(X)]
     Query: $\frac 1 \mean ?x \le \mean \frac 1 ?x$
     Generalised query: $?f[type=function] \mean ?x \le \mean ?f[type=function] ?x$
     Replacement: $\frac 1$ -> $?f[type=function]$ (not trivial)
     Query (Content MathML):
       <apply>
         <leq/>
         <apply>
           <divide/>
           <cn type="integer">1</cn>
           <apply>
             <mean/>
             <qvar>x</qvar></apply></apply>
         <apply>
           <mean/>
           <apply>
             <divide/>
             <cn type="integer">1</cn>
             <qvar>x</qvar></apply></apply>…
     Generalised query (Content MathML):
       <apply>
         <leq/>
         <apply>
           <qvar type="function">f</qvar>
           <apply>
             <mean/>
             <qvar>x</qvar></apply></apply>
         <apply>
           <mean/>
           <apply>
             <qvar type="function">f</qvar>
             <qvar>x</qvar></apply></apply>…
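
As a worked step connecting the two queries: the result formula is Jensen's inequality, as quoted on the Mathoid slide. Under the assumptions (not stated on the slide) that x is a positive random variable and that ⟨·⟩ denotes the expectation, the original query is the special case of the convex function φ(t) = 1/t:

```latex
\varphi(\langle x \rangle) \le \langle \varphi(x) \rangle
\quad\text{with}\quad \varphi(t) = \frac{1}{t},\ t > 0
\quad\Longrightarrow\quad
\frac{1}{\langle x \rangle} \le \left\langle \frac{1}{x} \right\rangle
```

This is why replacing `\frac 1` with a function-typed query variable retrieves the general statement.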

  7. Solution 1: inexact matches
     • Refined query: $\superconceptOf[ orderby = editdistance ]{ \frac 1 \mean ?x \le \mean \frac 1 ?x }$
     • Computational complexity
     • Restriction of the search space
     • Check the most likely solutions first
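
The idea of ordering inexact matches by edit distance can be sketched as follows. This is only an illustration under stated assumptions: it uses a token-level Levenshtein distance on whitespace-separated TeX tokens, whereas the slide does not say whether the distance is computed on tokens or on expression trees, and the names `edit_distance` and `rank_by_edit_distance` are hypothetical.

```python
from typing import List, Sequence


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences (unit costs)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,             # delete x
                cur[j - 1] + 1,          # insert y
                prev[j - 1] + (x != y),  # substitute x by y (free if equal)
            ))
        prev = cur
    return prev[-1]


def rank_by_edit_distance(query: str, candidates: List[str]) -> List[str]:
    """Sort candidate formulae so that the closest ones are checked first."""
    q = query.split()
    return sorted(candidates, key=lambda c: edit_distance(q, c.split()))


if __name__ == "__main__":
    query = r"\frac 1 \mean ?x \le \mean \frac 1 ?x"
    candidates = [
        r"a + b",
        r"?f \mean ?x \le \mean ?f ?x",
        r"\frac 1 \mean ?x \le \mean \frac 1 ?x",
    ]
    for formula in rank_by_edit_distance(query, candidates):
        print(edit_distance(query.split(), formula.split()), formula)
```

The dynamic program is quadratic in the formula length per candidate, which is one reason a restricted search space matters.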

  8. But there are diverse information needs
     1. Definition look-up
     2. Explanation look-up
     3. Proof look-up
     4. Application look-up
     5. Computation assistance
     6. General formula search

  9. ... and the data looks like this

  10. Levels of Abstraction: Semantic, Content, Presentation

  11. Overview (diagram). Stages: Image, Image processing, Presentation, Structure detection, Content, Entity linkage, Semantic. Integrated Queries over Metadata, Math (Presentation, Content, Semantic) and Text (Keywords, Relations)

  12. Completed Research (Querying, Processing, Scalability)
      - Making Math Searchable in Wikipedia (CICM 2012)
      - Mathematical Language Processing (CICM 2014)
      - Applying Stratosphere for Big Data Analytics (coauthor, BTW 2013)
      - Querying large Collections of Mathematical Publications (NTCIR 2013 with Marcus Leich)
      - Digital Repository of Mathematical Formulae (CICM 2014 coauthor)
      - Evaluation of Similarity-Measure Factors for Formulae (NTCIR 2015)
      - Growing the DRMF with generic LaTeX sources
      - Wikipedia Subtask at NTCIR 11 (SIGIR 2015)
      - Mathoid: Accessible Math Rendering for Wikipedia (CICM 2014)
      - Exploring the single-brain barrier (NTCIR 2016)
      - Semantification of Identifiers in Mathematics for Better Math Information Retrieval (SIGIR 2016)

  13. Mathoid: Robust, Scalable, Fast and Accessible Math Rendering for Wikipedia
      Let be a probability space, X an integrable real-valued random variable and φ a convex function. Then: φ(𝔼[X]) ≤ 𝔼[φ(X)].
      • convex function (Q319913, NDL ID 00573442)
      • subclass of function
      • ja: 凸関数

  14. Exploring the single-brain barrier
      • "One-brain barrier" [1]
        – Metaphor: the knowledge relevant to conducting math research needs to be co-located in one brain
      • Goals of our contribution to NTCIR-12:
        – Create a point of reference w.r.t. this barrier for a trained mathematician
        – Compare the performance of a human to MIR systems and analyse characteristic strengths and weaknesses
        – Derive insights to improve MIR systems
        – Combine the relevant results of the human and the MIR systems to create a gold standard

  15. Exploring the single-brain barrier

  16. Exploring the single-brain barrier

  17. Exploring the single-brain barrier
      • Strengths of MIR systems:
        – Definition look-up queries
        – Application look-up
      • Weaknesses of MIR systems:
        – Low precision
        – No unified query language to specify the query type
      • The gold standard dataset can help to develop a math-aware search engine for Wikipedia

  18. Semantification of Identifiers in Mathematics for Better Math Information Retrieval
      • First step to enable computers to understand mathematicians' notation
      • Focus on identifiers
      • Extract identifier semantics by combining math and
      • Use computers to find relevant mathematics
      • Computers must understand semantics in math to provide the needed information

  19. Math Augmentation Approach

  20. (1) Extract formulae

  21. (2) Extract identifiers

  22. (3) Find identifiers

  23. (4) Find definiens candidates

  24. (5) Score all identifier-definiens pairs

  25. (6) Generate feature vectors

  26. (7) Cluster feature vectors

  27. (8) Map clusters to subject hierarchy
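
Steps (1) to (5) above can be sketched on a toy sentence. Everything in this sketch is an assumption made for illustration: formulae are taken to be `$...$` spans, identifiers are single Latin letters, the definiens candidates are passed in as a given word list instead of being detected as noun phrases, and the score is simply 1/(1 + token distance). It does not reproduce the scoring used in the presented work, and steps (6) to (8) are not covered.

```python
import re
from typing import Dict, Iterable, List, Tuple


def extract_formulae(text: str) -> List[str]:
    """Step (1): collect the contents of $...$ spans."""
    return re.findall(r"\$(.+?)\$", text)


def extract_identifiers(formula: str) -> List[str]:
    """Step (2): drop TeX commands, then keep single Latin letters."""
    without_commands = re.sub(r"\\[A-Za-z]+", " ", formula)
    return sorted(set(re.findall(r"[A-Za-z]", without_commands)))


def score_pairs(text: str, identifiers: Iterable[str],
                candidates: Iterable[str]) -> Dict[Tuple[str, str], float]:
    """Steps (3)-(5): locate identifiers and candidates, score by token distance."""
    identifiers, candidates = set(identifiers), set(candidates)
    tokens = text.replace("$", " ").split()
    scores: Dict[Tuple[str, str], float] = {}
    for i, token in enumerate(tokens):
        if token not in identifiers:
            continue
        for j, other in enumerate(tokens):
            if other in candidates:
                score = 1.0 / (1 + abs(i - j))  # toy score: closer is better
                key = (token, other)
                scores[key] = max(scores.get(key, 0.0), score)
    return scores


def best_definiens(scores: Dict[Tuple[str, str], float]) -> Dict[str, str]:
    """Pick the highest-scoring definiens candidate per identifier."""
    best: Dict[str, Tuple[str, float]] = {}
    for (identifier, candidate), score in scores.items():
        if identifier not in best or score > best[identifier][1]:
            best[identifier] = (candidate, score)
    return {identifier: pair[0] for identifier, pair in best.items()}


if __name__ == "__main__":
    text = "The energy $E$ and the mass $m$ satisfy $E = m c^2$"
    identifiers = {i for f in extract_formulae(text) for i in extract_identifiers(f)}
    scores = score_pairs(text, identifiers, candidates=["energy", "mass"])
    print(best_definiens(scores))
```

A distance-only score is fragile (a definiens that follows its identifier, as in "$m$ is the mass", can lose to a closer unrelated noun), which hints at why the pipeline continues with feature vectors and clustering.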

  28. Wikipedia Subtask at NTCIR 11
      • NTCIR 11 Wikipedia dataset*
      • 30k Wikipedia articles
      • 280k formulae
      • 100 queries
      *) Moritz Schubotz, Abdou Youssef, Volker Markl, and Howard S. Cohl. 2015. Challenges of Mathematical Information Retrieval in the NTCIR-11 Math Wikipedia Task. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '15). ACM, New York, NY, USA, 951-954.

  29. Wikipedia Subtask at NTCIR 11 (Part III / IV: Completed Research)
      • CICM 2012 (2 participants)
      • NTCIR 2013 (pilot) (6 participants)
      • NTCIR 2014: arXiv (8 participants), Wikipedia (7 participants)

  30. Wikipedia Task results (plot labels: higher "precision", higher recall)

  31. Gold standard available from http://mlp.formulasearchengine.com

  32. Gold standard details: 310 identifiers. Chart values: 8, 97, 174, 27, 4. Categories: Wikidata item; two Wikidata items; Wikidata item + NP; individual NP; multiple NP

  33. Results: Identifiers
      • 294/310 correctly extracted (94.8%)
      • 57 false positives (fp)
      • Problems (fn = false negatives):
        – Incorrect markup (8 fn, 33 fp), e.g. η = (Q1 − Q2)/Q1
        – Symbols (9 fp), e.g. d/dx
        – Sub-/superscripts (3 fp, 2 fn), e.g. σ_y²
        – Special notation (10 fp, 2 fn), e.g. u × v = ε_ijk u_j v_k e_i
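
The figures on this slide can be turned into the usual retrieval measures with a short calculation. The reading is an assumption: 294 of 310 is treated as recall against the gold standard, and the 57 false positives are combined with the 294 correct extractions to obtain precision. The function name `precision_recall_f1` is hypothetical.

```python
from typing import Tuple


def precision_recall_f1(correct: int, gold_total: int,
                        false_positives: int) -> Tuple[float, float, float]:
    """Precision, recall and F1 from counts of correct and spurious extractions."""
    recall = correct / gold_total
    precision = correct / (correct + false_positives)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Counts taken from the slide: 294 correct, 310 gold identifiers, 57 fp.
    precision, recall, f1 = precision_recall_f1(294, 310, 57)
    print(f"precision={precision:.3f} recall={recall:.3f} F1={f1:.3f}")
```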

  34. Results: (4) Find definiens candidates

  35. Results: Definitions. Chart values: 83, 88, 19, 120. Categories: exact match; partial match; not found; not in document

  36. Distribution of identifier counts

  37. Results: Definitions with Namespace support. Chart values: 60, 103, 147. Categories: exact match; partial match; not found
