Attentive Neural Architecture for Ad-hoc Structured Document Retrieval

Saeid Balaneshin¹, Alexander Kotov¹, Fedor Nikolaev¹,²

¹ Textual Data Analytics Lab, Department of Computer Science, Wayne State University
² Kazan Federal University
Ad-hoc Structured (Multi-field) Document Retrieval

IR research traditionally views documents as holistic and homogeneous units of text. The task of retrieving structured (multi-field) documents arises in many information access scenarios:
◮ Entity retrieval from knowledge graph(s)
◮ Web document retrieval
◮ Product search in e-Commerce
Entity Retrieval from Knowledge Graph(s)

[Figure: an example knowledge graph entity whose text is grouped into five fields: Names, Attributes, Categories, Similar Entity Names, Related Entity Names]
Product Search

[Figure: an example e-Commerce product page with three fields: Title, Description, Attributes]
Web Search

[Figure: an example Web document with six fields: Title, Texts in Large Font, Contents, Incoming Hyper-links, Document Meta-data, Alternative Texts for Images]
Document vs. Structured Document Retrieval

Document Retrieval:
- relevance is quantified by aggregating heuristics calculated at the document or collection level (# of occurrences and proximity of query terms, IDF, document length)

Structured Document Retrieval:
- requires strategies for aggregating heuristics calculated at the level of document fields into the matching score of an entire document
- effective for retrieving documents with lexically similar, but semantically diverse fields
Importance of Document Fields

Aggregation of field-level statistics of query terms in structured document retrieval is informed by the relative importance of document fields, which depends on:
◮ properties or semantics of document fields: e.g., a query term matched in a section of a Web page that is in larger font should have a different importance than a query term matched in other sections
◮ query intent: e.g., in the query "attractive outdoor light with security features", "attractive" refers to the product description, "outdoor light" to the product name and "security features" to the product attributes
Mixture of Language Models (MLM) [Ogilvie and Callan, SIGIR'03]

Document D with F fields is ranked w.r.t. query Q according to:

$$P(Q \mid D) \stackrel{rank}{=} \prod_{q_i \in Q} P(q_i \mid \theta_D)^{n(q_i, Q)}$$

where

$$P(q_i \mid \theta_D) = \sum_{j=1}^{F} w_j P(q_i \mid \theta_j)$$
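As a minimal illustration (not the authors' code), the MLM score is a field-weighted mixture of per-field language models; the field weights and the probability floor standing in for smoothing are assumptions here:

```python
import numpy as np

def mlm_score(query_terms, field_term_probs, field_weights):
    """MLM: log P(Q|D) = sum_i log sum_j w_j * P(q_i | theta_j).

    field_term_probs: one dict per field mapping term -> smoothed P(q_i | theta_j);
    field_weights: mixture weights w_j summing to 1. The n(q_i, Q) exponent is
    handled by iterating over duplicate query terms. Returns the log-score.
    """
    log_score = 0.0
    for term in query_terms:
        p = sum(w * probs.get(term, 1e-9)  # 1e-9: crude floor in place of real smoothing
                for w, probs in zip(field_weights, field_term_probs))
        log_score += np.log(p)
    return log_score

# toy usage: two fields (names, attributes) with weights (0.6, 0.4)
names = {"turin": 0.5, "taurinum": 0.5}
attrs = {"capital": 0.1, "italy": 0.2}
print(mlm_score(["capital", "italy"], [names, attrs], [0.6, 0.4]))
```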
Fielded Sequential Dependence Model (FSDM) [Zhiltsov et al., SIGIR'15]

Extends SDM to the case of structured document retrieval (i.e., accounts for both unigram and sequential bigram concepts in a query as well as document structure).

Document D with F fields is ranked w.r.t. query Q according to:

$$P(D \mid Q) \stackrel{rank}{=} \lambda_T \sum_{q_i \in Q} \tilde{f}_T(q_i, D) + \lambda_O \sum_{q_i \in Q} \tilde{f}_O(q_i, q_{i+1}, D) + \lambda_U \sum_{q_i \in Q} \tilde{f}_U(q_i, q_{i+1}, D)$$

Potential function for query unigram $q_i$:

$$\tilde{f}_T(q_i, D) = \log \sum_{j=1}^{F} w_j P(q_i \mid \theta_j)$$
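A sketch of the unigram potential and an analogous ordered-bigram potential built from per-field bigram language models; the per-field probabilities, weights, and the omission of the unordered-window potential f_U are simplifying assumptions:

```python
import numpy as np

def f_T(term, field_probs, w):
    """Unigram potential: log of the field-weighted mixture of P(q_i | theta_j)."""
    return np.log(sum(wj * pj.get(term, 1e-9) for wj, pj in zip(w, field_probs)))

def f_O(bigram, field_bigram_probs, w):
    """Ordered-bigram potential, built the same way from per-field bigram LMs."""
    return np.log(sum(wj * pj.get(bigram, 1e-9) for wj, pj in zip(w, field_bigram_probs)))

def fsdm_score(terms, lambdas, field_probs, field_bigram_probs, w_T, w_O):
    lam_T, lam_O = lambdas  # unordered-window potential f_U omitted for brevity
    score = sum(lam_T * f_T(t, field_probs, w_T) for t in terms)
    score += sum(lam_O * f_O((a, b), field_bigram_probs, w_O)
                 for a, b in zip(terms, terms[1:]))
    return score

# toy usage with two fields and made-up probabilities
names_uni, attrs_uni = {"turin": 0.5}, {"capital": 0.1, "italy": 0.2}
attrs_bi = {("capital", "italy"): 0.05}
print(fsdm_score(["capital", "italy"], (0.8, 0.1),
                 [names_uni, attrs_uni], [{}, attrs_bi], [0.5, 0.5], [0.5, 0.5]))
```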
Challenges of Structured Document Retrieval

Methods for structured document retrieval (SDR) face three major challenges:
◮ identifying the key concepts (words or phrases) in keyword queries
◮ semantic matching of the key query concepts in different fields of structured documents
◮ aggregating the scores of the matched query phrases into the overall score of a structured document

Key limitation: all previously proposed SDR methods are based on direct matching of concepts in queries and document fields → lexical gap
Proposed Neural Architecture

Attention-based Neural architecture for ad-hoc Structured document Retrieval (ANSR):
◮ Input: embeddings of words in a query and document fields
◮ Pooling layers: create compressed interaction matrices of the same dimensions between unigram- and bigram-based query phrases and document fields
◮ Matching score aggregation layers: combine the matching scores of query phrases in different document fields into the overall document relevance score, taking into account the relative importance of query phrases and document fields
◮ Document field attention layers: calculate the relative importance of document fields
◮ Query phrase attention layers: calculate the relative importance of query phrases
Pooling Layers (1)

Step 1: create distributed representations of a query and each document field

Query: automobile capital and the Detroit of Italy
Document: http://dbpedia.org/page/Turin

[Figure: word embeddings of the query terms and of the terms in each field of the DBpedia entity Turin, e.g. names: "Taurinum", "Turin"; attributes: "Turin is an important business and cultural center in northern Italy, capital city of the Piedmont region located mainly on the left bank of the Po River ... it is also dubbed la capitale Sabauda (Savoyard capital)"; related entity names: "Milan", "Genoa", "Duchy of Milan", ...]
Pooling Layers (2)

Step 2: create a document field interaction matrix for each query phrase

[Figure: the distributed representations of the query and document fields are compressed into interaction matrices for unigram-based query phrases ("automobile", "capital", "Italy") and bigram-based query phrases ("automobile capital"); each matrix holds the strongest embedding similarities between a query phrase and the terms of each document field, e.g. 0.28 and 0.30 for "automobile capital" against the attributes field]
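A minimal numpy sketch of Steps 1-2, assuming cosine similarity between embeddings and max-pooling of the k strongest similarities to obtain fixed-size interaction matrices (the similarity function, pooling scheme, and dimensions are assumptions, not the paper's exact specification):

```python
import numpy as np

def cosine_matrix(A, B):
    """Pairwise cosine similarities between row vectors of A (n_a, d) and B (n_b, d)."""
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    return A @ B.T

def pooled_interactions(phrase_emb, field_embs, k=2):
    """For one query phrase, compress its interaction with every field into a
    fixed-size row of the k strongest similarities (max-pooling)."""
    return np.array([np.sort(cosine_matrix(phrase_emb, f).ravel())[::-1][:k]
                     for f in field_embs])

# toy usage: a 2-word phrase against two fields with random 50-d embeddings
rng = np.random.default_rng(0)
phrase = rng.normal(size=(2, 50))
fields = [rng.normal(size=(8, 50)), rng.normal(size=(30, 50))]
print(pooled_interactions(phrase, fields, k=2).shape)  # (2 fields, 2 pooled scores)
```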
Document Field Attention Layers

Goal: compute the importance weights of document fields for aggregating the matching scores of query phrases

Document: http://dbpedia.org/page/Turin

[Figure: field representations of the Turin entity are passed through a softmax to produce importance weights, e.g. 0.21 for attributes and 0.18 for related entity names]
Query Phrase Attention Layers

Goal: compute the importance weights of query phrases for aggregating the matching scores of query phrases of the same type

Query: automobile capital and the Detroit of Italy

[Figure: query phrase representations are passed through a softmax to produce importance weights, e.g. 0.24 for "automobile capital" and 0.19 for "Italy"]
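Both attention layers (over document fields and over query phrases) can be sketched as a learned linear scoring of each candidate representation followed by a softmax; this single-layer parametrization with a weight vector w and bias b is an assumption, not the paper's exact network:

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def attention_weights(reprs, w, b=0.0):
    """Importance weights for a set of representations.

    reprs: (n, d) matrix whose rows represent document fields (field attention)
    or query phrases (query phrase attention); w: (d,) learned scoring vector.
    """
    return softmax(reprs @ w + b)

# toy usage: 5 document fields, each summarized by a 50-d vector
rng = np.random.default_rng(1)
fields = rng.normal(size=(5, 50))
w = rng.normal(size=50)
print(attention_weights(fields, w))  # nonnegative weights summing to 1
```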
Matching Score Aggregation Layers

[Figure: for each query phrase (e.g. the bigram "automobile capital" and the unigram "Italy"), the matching scores in all document fields are first aggregated into a per-phrase score using the field importance weights; the scores of query phrases of the same type are then aggregated using the phrase importance weights, and the resulting unigram- and bigram-based scores are combined into the final document score]
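Putting the pieces together, the two aggregation steps reduce the phrase-by-field matching scores of one phrase type to a single score; this nesting (fields first, then phrases of the same type) follows the slide, while the toy inputs are placeholders:

```python
import numpy as np

def aggregate(match_scores, field_weights, phrase_weights):
    """match_scores: (n_phrases, n_fields) matching scores of one phrase type
    (e.g., unigram-based phrases); returns the aggregated score for that type."""
    per_phrase = match_scores @ field_weights  # aggregate over document fields
    return phrase_weights @ per_phrase         # aggregate over query phrases

# toy usage: 2 query phrases x 5 document fields
scores = np.array([[0.28, 0.30, 0.19, 0.22, 0.10],
                   [0.35, 0.39, 0.34, 0.36, 0.12]])
f_w = np.array([0.21, 0.18, 0.25, 0.16, 0.20])  # field attention weights
p_w = np.array([0.55, 0.45])                    # phrase attention weights
print(aggregate(scores, f_w, p_w))
```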
Training

ANSR is trained to minimize a contrastive max-margin loss over a collection $\mathcal{T}$ of triplets $\langle q, d_n, d_r \rangle$ consisting of a relevant document $d_r$ and a non-relevant document $d_n$ for query $q$:

$$\min_{\mathcal{W}} \sum_{\langle q, d_n, d_r \rangle \in \mathcal{T}} \max\big(0,\ \zeta - s(q, d_r) + s(q, d_n)\big) + \frac{\gamma}{2} \|\mathcal{W}\|_2^2$$
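A minimal numpy version of this loss for one batch of triplets, with the margin ζ and L2 coefficient γ as hyperparameters (the scoring function s is whatever ANSR's forward pass produces; the default values are assumptions):

```python
import numpy as np

def contrastive_max_margin_loss(s_rel, s_nonrel, weights, zeta=1.0, gamma=1e-4):
    """Hinge loss over triplets <q, d_n, d_r> plus L2 regularization.

    s_rel, s_nonrel: arrays of scores s(q, d_r) and s(q, d_n), one per triplet;
    weights: flattened model parameters W (for the regularizer).
    """
    hinge = np.maximum(0.0, zeta - s_rel + s_nonrel).sum()
    return hinge + 0.5 * gamma * np.sum(weights ** 2)

# toy usage: the first triplet is ranked correctly with margin, the second is not
print(contrastive_max_margin_loss(np.array([2.0, 0.3]),
                                  np.array([0.5, 0.4]),
                                  np.ones(10)))
```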
Experiments

Language modeling and probabilistic baselines:
◮ PRMS (Probabilistic Retrieval Model for Semistructured Data) [Kim, Xue and Croft, ECIR'09]
◮ MLM (Mixture of Language Models) [Ogilvie and Callan, SIGIR'03]
◮ BM25F [Robertson, Zaragoza and Taylor, CIKM'04]
◮ FSDM (Fielded Sequential Dependence Model) [Zhiltsov, Kotov and Nikolaev, SIGIR'15]

Neural baselines:
◮ DRMM (Deep Relevance Matching Model) [Guo, Fan, Ai and Croft, CIKM'16]
◮ DESM (Dual Embedding Space Model) [Nalisnick, Mitra, Craswell and Caruana, WWW'16]
◮ NRM-F (Neural Ranking Model with Multiple Document Fields) [Zamani, Mitra, Song, Craswell and Tiwary, WSDM'18]