Flexible Similarity Search of Semantic Vectors Using Fulltext Search Engines
Michal Růžička, Vít Novotný, Petr Sojka; Jan Pomikálek, Radim Řehůřek
Masaryk University, Faculty of Informatics, Brno, Czech Republic: mruzicka@mail.muni.cz, witiko@mail.muni.cz, sojka@fi.muni.cz
RaRe Technologies: honza@rare-technologies.com, radim@rare-technologies.com
https://mir.fi.muni.cz/ https://rare-technologies.com/
Illustrations by Jiří Franek.
Outline
1 Semantic Indexing and Searching
2 String Encoding of Semantic Vectors
3 Results
ISWC 2017 workshop HSSUES, Vienna, Austria, October 21, 2017
Semantic Indexing
Figure: the indexing pipeline.
1 Input Document: document as a file (e-mail, …).
2 DataReader (e.g. pdf2text): document as plain text.
3 Tokenizer (e.g. … tokenizer): document as a token list.
4 Segmenter (e.g. paragraph / logical part [table, formula] segmenter): document as a segment list.
5 Segment2Vec, using a SemanticModeler (e.g. TF-IDF, LSI, deep learning, doc2vec) built from the segments in all documents: document as a list of points representing segments.
6 Index of Vectors.
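The SemanticModeler and Segment2Vec stages can be illustrated with a minimal TF-IDF sketch. It assumes the segments have already been tokenized, and it uses a smoothed IDF, log(N/df) + 1, chosen only for illustration. The function names `build_tfidf_model` and `segment_to_vector` are hypothetical and do not come from the slides.

```python
import math
from collections import Counter


def build_tfidf_model(segments):
    """Hypothetical SemanticModeler: learn IDF weights from all segments.

    segments: list of token lists, one per segment, across all documents.
    Returns (vocabulary, idf), where vocabulary fixes the vector dimensions.
    """
    doc_freq = Counter()
    for tokens in segments:
        doc_freq.update(set(tokens))  # count each term once per segment
    vocabulary = sorted(doc_freq)
    n = len(segments)
    # Smoothed IDF (an assumption for this sketch): log(N / df) + 1.
    idf = {term: math.log(n / doc_freq[term]) + 1.0 for term in vocabulary}
    return vocabulary, idf


def segment_to_vector(tokens, vocabulary, idf):
    """Hypothetical Segment2Vec: map one segment to a TF-IDF vector."""
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero for empty segments
    return [counts[term] / total * idf[term] for term in vocabulary]


segments = [["semantic", "search"], ["fulltext", "search"]]
vocabulary, idf = build_tfidf_model(segments)
vectors = [segment_to_vector(s, vocabulary, idf) for s in segments]
```

Each segment becomes one point in a space whose dimensions are the vocabulary terms; LSI or doc2vec would replace these two functions while leaving the rest of the pipeline unchanged.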
Semantic Searching with Nuggets
Figure: the search pipeline.
1 Query Document: the query document as a file.
2 Indexing Pipeline: the query as semantic vectors (Query Nuggets).
3 Similarity Search: the semantic vectors of the Query Nuggets are compared with those of the Document Nuggets, giving Candidate Nuggets.
4 Ranker: Results as Sorted Nuggets.
5 Results as Sorted Documents.
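The last step of the search pipeline, from sorted nuggets to sorted documents, might look like the sketch below. The slides do not say how nugget scores are combined per document. Scoring each document by its best nugget is an assumption made here for illustration, and `rank_documents` is a hypothetical name.

```python
def rank_documents(sorted_nuggets):
    """Collapse scored nuggets into a document ranking.

    sorted_nuggets: list of (doc_id, nugget_id, score) tuples.
    Assumption: a document's score is the score of its best nugget.
    Returns a list of (doc_id, score) pairs, best first.
    """
    best = {}
    for doc_id, _nugget_id, score in sorted_nuggets:
        if doc_id not in best or score > best[doc_id]:
            best[doc_id] = score
    return sorted(best.items(), key=lambda item: item[1], reverse=True)


nuggets = [("doc1", "n1", 0.9), ("doc2", "n2", 0.8), ("doc1", "n3", 0.4)]
ranking = rank_documents(nuggets)
```

Other aggregations, such as summing or averaging nugget scores, would fit the same interface.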
Re-Ranking Techniques
1 Fast: find candidate nuggets via Elasticsearch.
2 Slow but precise: re-rank candidate nuggets with an exact similarity metric.
• Cosine similarity.
• Euclidean similarity.
• …
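The second, precise phase can be sketched as follows, assuming the fast phase has already returned candidate nuggets together with their full vectors. The names `cosine_similarity` and `rerank` and the `top_k` parameter are hypothetical. The sketch uses cosine similarity; a Euclidean measure could be swapped in the same way.

```python
import math


def cosine_similarity(a, b):
    """Exact cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def rerank(query_vector, candidates, top_k=10):
    """Re-rank (nugget_id, vector) candidates by exact similarity, best first."""
    scored = [(nid, cosine_similarity(query_vector, vec)) for nid, vec in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


candidates = [("a", [1.0, 0.0]), ("b", [0.0, 1.0]), ("c", [1.0, 1.0])]
reranked = rerank([1.0, 0.0], candidates)
```

Because the exact metric is computed only for the candidates, its cost depends on the candidate count rather than on the size of the whole index.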
String Encoding of Semantic Vectors
• Encoding of semantic vectors to strings (feature tokens):
• Semantic vector of three dimensions: [0.12, −0.13, 0.065]
• Rounding to two decimal places, string encoded: ['0P2i0d12', '1P2ineg0d13', '2P2i0d07']
• Feature tokens:
• 0P2i0d12
• 1P2ineg0d13
• 2P2i0d07
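The token format can be read off the examples: the dimension index, `P` followed by the precision, `i` followed by the integer part (prefixed with `neg` for negative values), and `d` followed by the decimal digits. The sketch below reproduces that reading. It is inferred from the three tokens on the slide and is not the authors' implementation; `encode_vector` is a hypothetical name, and rounding half up is an assumption.

```python
from decimal import Decimal, ROUND_HALF_UP


def encode_vector(vector, precision=2):
    """Encode each component as a feature token such as '1P2ineg0d13'."""
    quantum = Decimal(1).scaleb(-precision)  # Decimal('0.01') for precision 2
    tokens = []
    for dim, value in enumerate(vector):
        # Go through str() so that 0.065 is rounded as the decimal 0.065.
        rounded = Decimal(str(value)).quantize(quantum, rounding=ROUND_HALF_UP)
        sign = "neg" if rounded < 0 else ""
        integer_part, _, decimal_part = str(abs(rounded)).partition(".")
        tokens.append(f"{dim}P{precision}i{sign}{integer_part}d{decimal_part}")
    return tokens


tokens = encode_vector([0.12, -0.13, 0.065])
```

Two vectors share a token only when they agree, after rounding, in the same dimension, so a fulltext engine can match such tokens like ordinary words.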