Two term-layers: An alternative topology for representing term relationships in the Bayesian Network Retrieval Model

Luis M. de Campos, Juan M. Fernández-Luna & Juan F. Huete

Departamento de Ciencias de la Computación e Inteligencia Artificial, Universidad de Granada (Spain).
Departamento de Informática, Universidad de Jaén (Spain).
Layout

1. Introduction
2. Preliminaries
3. The Bayesian Network Retrieval Model
4. An alternative representation for term relationships: a topology with two term layers
5. Experiments and results
6. Concluding remarks
Introduction (I)

We present a modification of the Bayesian Network Retrieval Model (BNRM) which aims to improve its efficiency.

This model is composed of two subnetworks:
- The document subnetwork: stores the documents from the collection.
- The term subnetwork: stores the terms occurring in the documents and their relationships.

Capturing term-to-term relationships within a collection yields a more accurate representation of the collection, improving the effectiveness of the IR system.
Introduction (II)

In the original model, term relationships are represented by means of an automatically constructed polytree.

The topology proposed in this paper contains a term subnetwork in which:
- The collection terms are duplicated and placed in a second layer.
- Arcs are established from terms in one layer to terms in the second.

This bipartite graph allows efficient propagation by means of a probability function evaluation.
Preliminaries (I) - IR

The representation of documents and queries in an IR system is usually based on term-weight vectors. The most common weighting schemes try to highlight the importance of each term, either within a given document it belongs to, or within the entire collection:
- Term frequency (within-document frequency), tf_ij: the number of times that term T_i appears in document D_j.
- Inverse document frequency of term T_i in the collection: idf_i = log(N / n_i), where N is the number of documents and n_i is the number of documents that contain the term.

The combination of both weights, tf·idf, is the most widely used weighting scheme.
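As an illustration only (not part of the original slides), a minimal Python sketch of tf·idf weighting over a toy, hypothetical tokenized collection:

```python
import math
from collections import Counter

# Toy tokenized collection (hypothetical example, not from the paper).
docs = [
    ["bayesian", "network", "retrieval", "model"],
    ["term", "network", "polytree"],
    ["retrieval", "term", "weighting"],
]

N = len(docs)
# n_i: number of documents containing each term.
doc_freq = Counter(term for d in docs for term in set(d))

def tf_idf(doc):
    """Return a term -> tf * idf weight vector for one document."""
    tf = Counter(doc)
    return {t: tf[t] * math.log(N / doc_freq[t]) for t in tf}

for j, d in enumerate(docs):
    print(j, tf_idf(d))
```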
Preliminaries (II) - IR

Evaluation:
- Recall (R): the proportion of relevant documents that are retrieved.
- Precision (P): the proportion of retrieved documents that are relevant, for a given query.

By computing the precision at a number of recall values we obtain a recall-precision plot. The average precision over all the recall values considered may be used as a single measure.
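For concreteness, a small Python sketch (an assumed example, not taken from the slides) computing interpolated precision at fixed recall points and their average for one ranked query result:

```python
def precision_at_recall_points(ranking, relevant, recall_points=None):
    """Interpolated precision at fixed recall levels for one query.

    ranking:  list of document ids ordered by decreasing score.
    relevant: set of relevant document ids for the query.
    """
    if recall_points is None:
        recall_points = [i / 10 for i in range(11)]  # 0.0, 0.1, ..., 1.0
    hits, prec_by_recall = 0, []
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            prec_by_recall.append((hits / len(relevant), hits / rank))
    # Interpolated precision: best precision achieved at any recall >= r.
    precisions = [max((p for r, p in prec_by_recall if r >= rp), default=0.0)
                  for rp in recall_points]
    return precisions, sum(precisions) / len(precisions)

precisions, avg_p = precision_at_recall_points(
    ranking=["d3", "d1", "d7", "d2"], relevant={"d1", "d2", "d5"})
print(precisions, avg_p)
```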
Preliminaries (III) - BN

A Bayesian network is a Directed Acyclic Graph (DAG) where:
- Nodes represent the variables of the problem.
- Arcs represent dependence relationships among the variables.

The knowledge is represented in two ways:
- Qualitatively, showing the (in)dependencies between the variables.
- Quantitatively, by means of a set of conditional probability distributions measuring the strength of the relationships, p(x_i | pa(x_i)), where pa(x_i) denotes a configuration of the parents of X_i.
Preliminaries (V) - BN

The joint distribution can be recovered from the network by means of the factorization

p(x_1, ..., x_n) = ∏_i p(x_i | pa(x_i)).
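To make the factorization concrete, a minimal Python sketch over a hypothetical three-node chain network (not from the slides) that recovers a joint probability as the product of the node-wise conditionals:

```python
# Hypothetical BN: A -> B -> C, binary variables, CPTs given explicitly.
p_a = {True: 0.3, False: 0.7}
p_b_given_a = {True: {True: 0.8, False: 0.2}, False: {True: 0.1, False: 0.9}}
p_c_given_b = {True: {True: 0.5, False: 0.5}, False: {True: 0.2, False: 0.8}}

def joint(a, b, c):
    """p(a, b, c) = p(a) * p(b | a) * p(c | b)."""
    return p_a[a] * p_b_given_a[a][b] * p_c_given_b[b][c]

# All configurations sum to 1, as expected for a proper factorization.
total = sum(joint(a, b, c) for a in (True, False)
            for b in (True, False) for c in (True, False))
print(joint(True, True, False), total)
```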
The Bayesian Net. Ret. Model (I)

Two sets of variables: the terms T = {T_1, ..., T_M} and the documents D = {D_1, ..., D_N}.

The topology of the network is determined by the following guidelines:
- There is a link joining each term node T_i and each document node D_j whenever T_i belongs to D_j.
- There are no links joining any pair of document nodes.
- Any document D_j is conditionally independent of any other document when we know for sure the (ir)relevance values of all the terms indexing D_j.
The Bayesian Net. Ret. Model (II)

These three assumptions determine the network structure in part: the links joining term and document nodes have to be directed from terms to documents; moreover, the parent set of a document node D_j is the set of term nodes that belong to it, i.e., Pa(D_j) = {T_i | T_i ∈ D_j}.

Inclusion of dependences between terms: application of an automatic learning algorithm that takes the set of documents as input and generates a polytree of terms as output.

Reasons for using a polytree: the existence of efficient learning algorithms, as well as exact and also efficient inference algorithms.
The Bayesian Net. Ret. Model (III)

Graphically, the retrieval model is represented by the following graph:
The Bayesian Net. Ret. Model (IV)

Probability distributions:

- Term nodes without parents:

  p(t_i⁺) = 1/M  and  p(t_i⁻) = 1 − 1/M,   (1)

  where t_i⁺ / t_i⁻ denote that term T_i is relevant / not relevant and M is the number of terms in the collection.

- Term nodes with parents: the conditional probabilities p(t_i | pa(T_i)) are estimated from the term co-occurrence information in the collection, as provided by the polytree learning algorithm.
The Bayesian Net. Ret. Model (V)

Document nodes: due to efficiency problems (a document node may have a very large number of parents, making its conditional probability table unmanageable), the model uses a probability function that returns the required probability when called:

p(d_j⁺ | pa(D_j)) = Σ_{T_i ∈ R(pa(D_j))} w_ij,   (3)

where R(pa(D_j)) is the set of term nodes in pa(D_j) instantiated as relevant, w_ij ≥ 0, Σ_{T_i ∈ D_j} w_ij ≤ 1, and

w_ij = α · tf_ij · idf_i²,   (4)

α being a normalizing constant (to assure that Σ_{T_i ∈ D_j} w_ij ≤ 1).
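As an illustrative sketch only (the tf and idf values and the document layout are assumed, not taken from the paper), this probability function for a document node given a configuration of its parents could be coded as:

```python
# Hypothetical per-document index: term -> (tf in the document, idf in the collection).
doc_terms = {"bayesian": (3, 1.2), "network": (2, 0.9), "retrieval": (1, 1.7)}

# Weights w_ij proportional to tf * idf^2, normalized so they sum to at most 1.
alpha = 1.0 / sum(tf * idf ** 2 for tf, idf in doc_terms.values())
w = {t: alpha * tf * idf ** 2 for t, (tf, idf) in doc_terms.items()}

def p_doc_given_config(relevant_terms):
    """p(d_j+ | pa(D_j)): sum of the weights of the parents instantiated as relevant."""
    return sum(w[t] for t in doc_terms if t in relevant_terms)

print(p_doc_given_config({"bayesian", "retrieval"}))
```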
The Bayesian Net. Ret. Model (VI)

Given a query Q submitted to our system:
1. Place the evidences in the term subnetwork: each term T_i ∈ Q is instantiated as relevant (t_i⁺).
2. Run the inference process, obtaining p(d_j⁺ | Q) for every document D_j.
3. Sort the documents in decreasing order of their posterior probability to carry out the evaluation process.

Taking into account the topology of the model, general purpose inference algorithms cannot be applied due to efficiency considerations. A new, specific inference method has been developed: propagation + evaluation.
The Bayesian Net. Ret. Model (VII)

Propagation: an exact propagation in the term subnetwork, giving as results the posterior probabilities p(t_i⁺ | Q) for every term node.

Evaluation: an evaluation of the probability function used to estimate the conditional probabilities in the document nodes, using the information obtained in the previous propagation, thus computing the probability of relevance of each document:

p(d_j⁺ | Q) = Σ_{T_i ∈ Pa(D_j)} w_ij · p(t_i⁺ | Q).   (5)
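A minimal sketch of this evaluation step in Python (the weights and the propagated term posteriors below are made-up values, and the weighting scheme follows the assumed tf·idf² reconstruction above):

```python
# Posterior relevance of each term after propagating the query, p(t_i+ | Q) (assumed values).
term_posterior = {"bayesian": 0.9, "network": 0.6, "retrieval": 0.3, "polytree": 0.05}

# Per-document weights w_ij (assumed, already normalized as in equation (4)).
doc_weights = {
    "d1": {"bayesian": 0.5, "network": 0.3, "retrieval": 0.2},
    "d2": {"network": 0.4, "polytree": 0.6},
}

def score(doc):
    """p(d_j+ | Q): sum over the document's terms of w_ij * p(t_i+ | Q)."""
    return sum(w * term_posterior[t] for t, w in doc_weights[doc].items())

ranking = sorted(doc_weights, key=score, reverse=True)
print([(d, round(score(d), 3)) for d in ranking])
```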
An alternative: 2 term layers (I)

If the graph contains a lot of terms and arcs, the propagation process can become too slow. We therefore look for an alternative topology that fulfills two requirements:
- Accuracy of the term relationships represented in the graph.
- An efficient propagation scheme in the underlying graph to compute the posterior probabilities of each term node.
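A speculative sketch of such a propagation scheme, under the assumption (not confirmed by these slides) that the posterior of each second-layer term copy is obtained by the same kind of weighted-sum evaluation used for document nodes, with assumed layer-to-layer weights:

```python
# Hypothetical bipartite term network: first-layer terms T_k point to second-layer copies T'_i.
# The weights w_ki below are assumed values, analogous in spirit to the document-node weights.
layer2_parents = {
    "network'": {"network": 0.7, "bayesian": 0.3},
    "retrieval'": {"retrieval": 0.8, "model": 0.2},
}

# p(t_k+ | Q) for first-layer terms after instantiating the query (assumed values).
layer1_posterior = {"network": 1.0, "bayesian": 0.4, "retrieval": 0.2, "model": 0.1}

def layer2_posterior(term_copy):
    """Posterior of a second-layer term as a weighted sum over its first-layer parents."""
    return sum(w * layer1_posterior[t] for t, w in layer2_parents[term_copy].items())

print({t: round(layer2_posterior(t), 3) for t in layer2_parents})
```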