

  1. Query by Analogical Examples: Relational Search Using Web Search Engine Indices by: Lohit Jain (11390) Advisor: Prof Amitabha Mukerjee

  2. Motivation Relational search is an effective way for users to obtain information in an unfamiliar field. For example, if an Apple user searches for Microsoft products, similar Apple products are important clues for the search. Even if the user does not know keywords for specific Microsoft products, relational search returns a product name given simply an example of an Apple product. More specifically, given a tuple of three terms, such as (Apple, iPod, Microsoft), the term Zune can be extracted from the Web search results, where Apple is to iPod what Microsoft is to Zune.

  3. Recent Works WEB INFORMATION EXTRACTION: ● Brin [7] extracted many author–title relationships from the Web by a bootstrap method. ○ Given examples of author–title pairs, the method finds phrases of the form "prefix, author, middle, title, suffix" in Web documents and extracts author–title pairs by matching these phrases, on the assumption that the same syntactic pattern expresses the same relation. ● Snowball [1], a system using a method similar to Brin's, also extracts specific relationships from the Web. ○ It weights each pattern and improves extraction performance. ● The concept of relational search has already been described by Cafarella et al. [8]. They crawled 90 million Web pages and constructed an extraction graph, a textual representation of an entity-relationship graph extracted automatically from the pages. SIMILARITY OF RELATION: ● Turney et al. [22] proposed methods that measure the similarity of relations. They aimed to solve verbal analogy questions: given a word pair (A, B), select the most plausible pair (C, D) from a set of five choices. ○ In this method, a vector is built whose elements are the frequencies of documents containing prepared lexical patterns such as "X of Y", "Y to X", and "X for Y". The similarity is the cosine similarity between the vectors for pairs (A, B) and (C, D). Turney [19, 21] improved this method by expanding it with latent relational analysis and achieved 56.4% precision.
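The Turney-style measure described above reduces to cosine similarity between pattern-frequency vectors. A minimal sketch in Python: the pattern set ("X of Y", "Y to X", "X for Y") is from the slide, but the document-frequency counts below are hypothetical stand-ins for real search-engine hit counts.

```python
import math

# Hypothetical document frequencies for lexical patterns instantiated with
# word pairs; a real system would obtain these from a search engine.
PATTERN_FREQ = {
    ("X of Y", "Apple", "iPod"): 120, ("Y to X", "Apple", "iPod"): 30,
    ("X for Y", "Apple", "iPod"): 45,
    ("X of Y", "Microsoft", "Zune"): 80, ("Y to X", "Microsoft", "Zune"): 25,
    ("X for Y", "Microsoft", "Zune"): 35,
}
PATTERNS = ("X of Y", "Y to X", "X for Y")

def relation_vector(x, y):
    """One element per lexical pattern: document frequency for the pair (x, y)."""
    return [PATTERN_FREQ.get((p, x, y), 0) for p in PATTERNS]

def cosine(u, v):
    """Cosine similarity between two frequency vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

sim = cosine(relation_vector("Apple", "iPod"),
             relation_vector("Microsoft", "Zune"))
```

With these toy counts the two relations come out highly similar, which is the behavior the analogy solver relies on.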

  4. Recent Works ● Bollegala et al. [4] also tackled the verbal analogy problem. They retrieved lexical patterns between terms X and Y from the Web, trained a two-class support vector machine (SVM) to learn the contribution of each lexical pattern to the relational similarity between the pair, and sped up the relational-similarity calculation. ○ They also proposed a method that measures relational similarity by clustering the lexical patterns between two words X and Y and computing the similarity with a metric-learning approach. RETRIEVING TERMS IN SPECIFIC RELATION: ● Church et al. [9] measured the relatedness of two words with mutual information. ● Turney [18] and Baroni et al. [2] proposed methods that estimate the degree of synonymy of two words from the number of Web documents returned by search engines, using word co-occurrences and mutual information. ● Bollegala et al. [3] computed semantic similarity from lexical patterns automatically extracted from the text snippets of Web search results, integrating the different similarity scores with SVMs into a robust semantic similarity measure. ● Oyama et al. [16] retrieved from the Web pairs of words in which one word describes the other, which can also be read as a part-of relationship. ● Hokama et al. [12] extracted mnemonic names of people from the Web.
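The mutual-information-based relatedness measures mentioned above (Church et al.; Turney) can be illustrated with pointwise mutual information estimated from document hit counts. A minimal sketch; the counts in the example are hypothetical.

```python
import math

def pmi(hits_xy, hits_x, hits_y, total_docs):
    """Pointwise mutual information of words x and y from document counts:
    log2( P(x, y) / (P(x) * P(y)) ), probabilities estimated as hits / total."""
    p_xy = hits_xy / total_docs
    p_x = hits_x / total_docs
    p_y = hits_y / total_docs
    return math.log2(p_xy / (p_x * p_y)) if hits_xy else float("-inf")

# Hypothetical hit counts: the two words co-occur far more often than
# independence would predict, so PMI is strongly positive.
score = pmi(hits_xy=900, hits_x=5000, hits_y=2000, total_docs=1_000_000)
```

Under independence PMI is 0; strongly related words score well above 0, which is what the synonym-level methods exploit.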

  5. Methodology For input terms a, b, and c, two phases are conducted to extract a target term d, where Relation(a, b) is the relation closest to Relation(c, d). ● Finding a Relation Extractor: Given a = Apple, b = iPod, and input c = Microsoft, a query using the term Microsoft alone cannot identify the expected term d = Zune. Additional terms such as "music player", or lexico-syntactic patterns such as "is a music player sold by", must be supplied to identify d. ● Extracting and Ranking Terms Based on Relational Similarity: A query combining the input c = Microsoft with the terms indicated by E(a, b) is sent to a Web search engine, and candidates D for the expected term d = Zune are collected from the search results. Finally, each candidate d_i in D is ranked by the relational similarity between Relation(Apple, iPod) and Relation(Microsoft, d_i).
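The two-phase pipeline can be sketched as a thin driver. Both phases are stubbed with toy stand-ins here; the real phases issue Web queries and χ² tests, as the following slides describe.

```python
def relational_search(a, b, c, find_extractor, extract_and_rank):
    """Phase 1: find a relation extractor E(a, b).
    Phase 2: query with c plus the extractor terms, then rank the
    candidate terms d by relational similarity."""
    extractor_terms = find_extractor(a, b)
    return extract_and_rank(c, extractor_terms)

# Toy stand-ins for the two phases (hypothetical; real implementations
# would call a Web search engine):
ranked = relational_search(
    "Apple", "iPod", "Microsoft",
    find_extractor=lambda a, b: {"music player"},
    extract_and_rank=lambda c, terms: ["Zune"],
)
```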

  6. Relation Extractor Finding a relation extractor E(a, b) for input terms a and b consists of three steps: ● Step 1 Gathering text contents that contain terms a and b using a conventional Web search engine. ● Step 2 Finding terms that frequently appear only in documents including both a and b. ● Step 3 Choosing a set of terms to be used in the relation extractor.

  7. Relation Extractor ● Search for Web documents that include a but not b, and b but not a, denoted by Doc(a ∧ ~b) and Doc(~a ∧ b), respectively. Web documents that include both a and b (Doc(a ∧ b)) are also retrieved. ● For each term t supposed to be a noun, a χ² test is conducted of the hypothesis that t occurs with the same probability in Doc(~a ∧ b) as in Doc(a ∧ b). The same test is conducted for Doc(a ∧ ~b). ● If both of these hypotheses for term t are rejected at significance level α, and t occurs more frequently in Doc(a ∧ b) than in both Doc(~a ∧ b) and Doc(a ∧ ~b), then t is taken as an element of the term set T(a, b).
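The test above can be sketched as a 2×2 contingency-table χ² statistic (term present/absent × document set), compared against the critical value for significance level α. A sketch assuming per-set term and document counts are available; the critical values are the standard χ² quantiles with one degree of freedom.

```python
def chi2_2x2(k1, n1, k2, n2):
    """Chi-square statistic testing whether a term occurs with the same
    probability in two document sets (k hits out of n documents each)."""
    a, b = k1, n1 - k1
    c, d = k2, n2 - k2
    n = a + b + c + d
    den = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / den if den else 0.0

CHI2_CRIT = {0.05: 3.841, 0.01: 6.635}  # chi-square quantiles, 1 dof

def in_term_set(k_ab, n_ab, k_anb, n_anb, k_nab, n_nab, alpha=0.05):
    """t enters T(a, b) iff both null hypotheses are rejected at level alpha
    and t is relatively more frequent in Doc(a ^ b) than in Doc(a ^ ~b)
    and Doc(~a ^ b)."""
    crit = CHI2_CRIT[alpha]
    higher = (k_ab / n_ab > k_anb / n_anb) and (k_ab / n_ab > k_nab / n_nab)
    return (higher
            and chi2_2x2(k_ab, n_ab, k_anb, n_anb) > crit
            and chi2_2x2(k_ab, n_ab, k_nab, n_nab) > crit)
```

For example, a term appearing in 80 of 100 documents of Doc(a ∧ b) but only 20 of 100 in each single-term set is accepted, while a term equally frequent everywhere is not.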

  8. Extracting and Ranking Terms Based on Relational Similarity Ranking candidate terms for d on the basis of relational similarity to input terms a, b, and c consists of the following four steps: ● Step 1 Gathering text contents that contain c and each term t in T(a, b) using a conventional Web search engine. ● Step 2 Finding candidates for term d that frequently appear only in documents including both c and t. ● Step 3 Scoring the candidates based on χ² tests for each t. ● Step 4 Aggregating the scores for each candidate: ○ Rank(q, d_i) = −ln Score(d_i)

  9. Extracting and Ranking Terms Based on Relational Similarity ● For each term t in the term set T(a, b), Web documents are searched for that include c but not t, and t but not c, denoted by Doc(c ∧ ~t) and Doc(~c ∧ t), respectively. Documents that include both c and t (Doc(c ∧ t)) are also sought. Then, from each search result, every noun d in the titles and summaries is extracted as a candidate for the expected term. ● For each candidate d supposed to be a noun in Doc(c ∧ t), χ² tests against Doc(c ∧ ~t) and Doc(~c ∧ t) are conducted; the resulting probabilities under the null hypothesis are assigned to P_c(d) and P_t(d), respectively. ● If both of these hypotheses for d are rejected at significance level β, and d occurs more frequently in Doc(c ∧ t) than in both Doc(~c ∧ t) and Doc(c ∧ ~t), then P_{c,t}(d) = P_c(d) · P_t(d); otherwise P_{c,t}(d) = 1. ● The score Score(d) is the product of P_{c,t}(d) over all terms t in the term set T(a, b).
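The scoring and aggregation above can be sketched as follows, taking the per-term p-values as given (in the full method they come from the χ² tests against Doc(c ∧ ~t) and Doc(~c ∧ t)); the candidate names and p-values in the example are hypothetical.

```python
import math

def rank_candidates(stats, beta=0.05):
    """stats: {d: [(p_c, p_t, higher), ...]}, one tuple per extractor term t,
    where p_c, p_t are the chi-square p-values and `higher` says d is most
    frequent in Doc(c ^ t). P_{c,t}(d) = p_c * p_t when both nulls are
    rejected at level beta and `higher` holds; otherwise 1. Score(d) is the
    product over t, and Rank(q, d) = -ln Score(d) (higher is better)."""
    out = []
    for d, per_term in stats.items():
        score = 1.0
        for p_c, p_t, higher in per_term:
            if higher and p_c < beta and p_t < beta:
                score *= p_c * p_t  # P_{c,t}(d)
        out.append((d, -math.log(score)))
    return sorted(out, key=lambda kv: kv[1], reverse=True)

# Hypothetical p-values: "Zune" is strongly associated with both extractor
# terms; "Xbox" fails the significance test for each.
ranked = rank_candidates({
    "Zune": [(0.001, 0.002, True), (0.010, 0.020, True)],
    "Xbox": [(0.200, 0.300, True), (0.040, 0.900, True)],
})
```

Smaller p-values shrink Score(d), so −ln Score(d) grows: the most significantly associated candidate is ranked first.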

  10.–13. Relation Extractor (worked-example figures)

  14. Extracting and Ranking Terms (worked-example figure)

  15. Evaluation The terms returned by each method were evaluated by the MRR, the percentage of tasks answered correctly within the top k, and the average number of Web accesses. MRR, the metric used to evaluate the ranked search results, is the mean over all tasks of the reciprocal of the rank at which the first relevant result appears. Results were calculated for the top 5, 10, and 20. Each combination of the parameters was tested, represented in the format TC(α, β). Although the significance level α affects both the precision of the results and the number of Web accesses, a preliminary trial indicated that β is a minor factor compared with α; β was therefore fixed and results were compared while varying α.
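MRR as described here can be computed as follows: a minimal sketch in which each entry is the rank of the first relevant result for one task (None when nothing relevant was returned).

```python
def mrr(first_relevant_ranks):
    """Mean reciprocal rank over tasks; a task with no relevant
    result contributes 0 to the mean."""
    total = sum(0.0 if r is None else 1.0 / r for r in first_relevant_ranks)
    return total / len(first_relevant_ranks)

# Example: first relevant result at rank 1, 2, never, and 4 across four tasks.
value = mrr([1, 2, None, 4])  # (1 + 0.5 + 0 + 0.25) / 4 = 0.4375
```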

  16. References ● [1] E. Agichtein and L. Gravano. Snowball: Extracting Relations from Large Plain-Text Collections. In Proc. of DL 2000, pages 85–94, 2000. ● [3] D. Bollegala, Y. Matsuo, and M. Ishizuka. Measuring Semantic Similarity between Words Using Web Search Engines. In Proc. of WWW 2007, pages 757–766, 2007. ● [4] D. Bollegala, Y. Matsuo, and M. Ishizuka. WWW sits the SAT: Measuring Relational Similarity on the Web. In Proc. of ECAI 2008, pages 333–337, 2008. ● [5] D. Bollegala, Y. Matsuo, and M. Ishizuka. Measuring the Similarity between Implicit Semantic Relations from the Web. In Proc. of WWW 2009, pages 651–660, 2009. ● [6] D. Bollegala, Y. Matsuo, and M. Ishizuka. Measuring the Similarity between Implicit Semantic Relations using Web Search Engines. In Proc. of WSDM 2009, pages 104–113, 2009. ● [7] S. Brin. Extracting Patterns and Relations from the World Wide Web. In Proc. of WebDB 1998, pages 172–183, 1998. ● [22] P. D. Turney and M. L. Littman. Corpus-based Learning of Analogies and Semantic Relations. Machine Learning, 60(1–3):251–278, 2005.
