integrating structured data and text


  1. Integrating Structured Data and Text

  2. A Tagged Document

     <DOC>
     <DOCNO> WSJ870323-0180 </DOCNO>
     <HL> Italy’s Commercial Vehicle Sales </HL>
     <DD> 03/23/87 </DD>
     <DATELINE> TURIN, Italy </DATELINE>
     <TEXT>
     Commercial-vehicle sales in Italy rose 11.4% in February from a year earlier, to 8,848 units, according to provisional figures from the Italian Association of Auto Makers. Sales for the Association are expected to rise an additional 2% in July.
     </TEXT>
     </DOC>

  3. Another Tagged Document

     <DOC>
     <DOCNO> WSJ870323-0181 </DOCNO>
     <HL> Ford Discontinues Taurus SHO Five-Speed Vehicle </HL>
     <DD> 01/21/95 </DD>
     <DATELINE> Georgia, Atlanta </DATELINE>
     <TEXT>
     Ford Motor Company announced that beginning in 1996, the Taurus SHO will no longer include a five-speed vehicle.
     </TEXT>
     </DOC>

  4. Representing Unstructured Text as Relations

     DOC (doc_id, doc_name, date, dateline)
     DOC_TERM (doc_id, term, tf)
     DOC_TERM_PROX (doc_id, term, offset)
     IDF (term, idf)
     STOP_TERM (term)
     QUERY (term, tf)

  5. DOC:

     | doc_id | doc_name | date | dateline |
     |---|---|---|---|
     | 1 | WSJ870323-0180 | 3/23/87 | Turin, Italy |
     | 2 | WSJ870323-0181 | 1/21/95 | Georgia, Atlanta |

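  6. DOC_TERM, DOC_TERM_PROX, and IDF

     DOC_TERM:

     | doc_id | term | tf |
     |---|---|---|
     | 1 | commercial | 1 |
     | 1 | vehicle | 1 |
     | 1 | sales | 2 |
     | 1 | italy | 1 |
     | 1 | rose | 1 |
     | 1 | 11.4% | 1 |
     | 1 | february | 1 |
     | 1 | year | 1 |
     | 1 | earlier | 1 |
     | 1 | 8,848 | 1 |
     | 1 | according | 1 |
     | 1 | provisional | 1 |
     | 1 | figures | 1 |
     | 1 | italian | 1 |
     | 1 | association | 2 |
     | 1 | auto | 1 |
     | 1 | makers | 1 |
     | 1 | expected | 1 |
     | 1 | additional | 1 |
     | 1 | 2% | 1 |
     | 1 | July | 1 |
     | ... | ... | ... |
     | 2 | vehicle | 1 |
     | ... | ... | ... |

     DOC_TERM_PROX:

     | doc_id | term | offset |
     |---|---|---|
     | 1 | commercial | 1 |
     | 1 | vehicle | 2 |
     | 1 | sales | 3 |
     | 1 | italy | 4 |
     | 1 | rose | 5 |
     | 1 | 11.4% | 6 |
     | 1 | february | 7 |
     | 1 | year | 8 |
     | 1 | earlier | 9 |
     | 1 | 8,848 | 10 |
     | 1 | according | 11 |
     | 1 | provisional | 12 |
     | 1 | figures | 13 |
     | 1 | italian | 14 |
     | 1 | association | 15 |
     | 1 | auto | 16 |
     | 1 | makers | 17 |
     | 1 | sales | 18 |
     | 1 | association | 19 |
     | ... | ... | ... |
     | 1 | July | 24 |
     | ... | ... | ... |
     | 2 | vehicle | 14 |
     | ... | ... | ... |

     IDF:

     | term | idf |
     |---|---|
     | 11.4% | 2.9595 |
     | 2% | 1.5911 |
     | 8848 | 4.3936 |
     | according | 0.7782 |
     | additional | 1.0792 |
     | association | 1.0792 |
     | auto | 1.2788 |
     | commercial | 1.0000 |
     | earlier | 0.6021 |
     | expected | 0.6990 |
     | february | 1.3222 |
     | figures | 1.2553 |
     | italian | 1.8451 |
     | italy | 1.6721 |
     | July | 1.0414 |
     | makers | 1.3010 |
     | provisional | 2.5172 |
     | rose | 0.7782 |
     | sales | 0.6990 |
     | vehicle | 1.7709 |
     | year | 0.0000 |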

  7. QUERY and STOP_TERM

     QUERY:

     | term | tf |
     |---|---|
     | vehicle | 1 |
     | sales | 1 |

     STOP_TERM:

     | term |
     |---|
     | a |
     | about |
     | after |
     | all |
     | also |
     | an |
     | and |
     | ... |
     | the |
     | ... |

  8. Assumptions

     • DOC_TERM does not contain duplicate terms for the same document.
     • QUERY does not contain duplicate terms.

     The text preprocessor that creates the relations can remove duplicates that it finds in documents or queries.
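
A minimal Python sketch of such a preprocessor is shown here. The tokenizer (lowercase runs of letters only), the short stop list, and the function name `build_relations` are assumptions made for illustration; the slides do not describe how the preprocessor is built. Offsets are counted over the terms that remain after stop-term removal, which is how the offsets in the example tables appear to be assigned.

```python
import re
from collections import Counter

# Hypothetical stop list; the slides show only part of STOP_TERM.
STOP_TERMS = {"a", "about", "after", "all", "also", "an", "and", "in", "the"}

def build_relations(doc_id, text):
    """Return (doc_term_rows, doc_term_prox_rows) for one document."""
    # Simplified tokenizer (an assumption): lowercase runs of letters only.
    tokens = re.findall(r"[a-z]+", text.lower())
    kept = [t for t in tokens if t not in STOP_TERMS]

    # DOC_TERM_PROX: one row per occurrence, offsets starting at 1.
    prox = [(doc_id, term, offset) for offset, term in enumerate(kept, start=1)]

    # DOC_TERM: one row per distinct term, so there are no duplicates per document.
    doc_term = [(doc_id, term, tf) for term, tf in Counter(kept).items()]
    return doc_term, prox

doc_term, prox = build_relations(
    1, "Commercial vehicle sales in Italy rose and the sales rose"
)
print(doc_term)
print(prox)
```

Collapsing repeated terms into a single row with a `tf` count is what lets the later queries rely on `COUNT` and on `tf` directly.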

  9. Boolean OR

     SELECT d.doc_id
     FROM DOC_TERM d
     WHERE d.term = input_term1
        OR d.term = input_term2
        OR d.term = input_term3
        ...
        OR d.term = input_termN

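  10. Boolean AND

      SELECT a.doc_id
      FROM DOC_TERM a, DOC_TERM b, ..., DOC_TERM n
      WHERE a.term = input_term1
        AND b.term = input_term2
        AND c.term = input_term3
        AND d.term = input_term4
        ...
        AND n.term = input_termN
        AND a.doc_id = b.doc_id
        AND a.doc_id = c.doc_id
        ...
        AND a.doc_id = n.doc_id

      The aliases must be joined on doc_id; otherwise the query pairs terms taken from different documents. This form needs one self-join per query term.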

  11. Boolean AND

      • A better implementation of Boolean AND.

      SELECT d.doc_id
      FROM DOC_TERM d, QUERY q
      WHERE d.term = q.term
      GROUP BY d.doc_id
      HAVING COUNT(DISTINCT d.term) = (SELECT COUNT(*) FROM QUERY)

      Returns document 1 for our example.
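
To try the join-based Boolean AND on the running example, here is a small sketch using Python's standard `sqlite3` module. The choice of SQLite, the column types, and loading only a few DOC_TERM rows are assumptions made for illustration; the slides do not name a database system.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE DOC_TERM (doc_id INTEGER, term TEXT, tf INTEGER);
    -- QUERY is quoted because it is also an SQLite keyword.
    CREATE TABLE "QUERY" (term TEXT, tf INTEGER);
    -- Only a subset of the example rows is loaded here.
    INSERT INTO DOC_TERM VALUES
        (1, 'vehicle', 1), (1, 'sales', 2), (1, 'italy', 1),
        (2, 'vehicle', 1);
    INSERT INTO "QUERY" VALUES ('vehicle', 1), ('sales', 1);
""")

BOOLEAN_AND = """
    SELECT d.doc_id
    FROM DOC_TERM d, "QUERY" q
    WHERE d.term = q.term
    GROUP BY d.doc_id
    HAVING COUNT(DISTINCT d.term) = (SELECT COUNT(*) FROM "QUERY")
"""

and_result = [row[0] for row in conn.execute(BOOLEAN_AND)]
print(and_result)  # [1]
```

Because the query terms live in a relation, the SQL text stays the same no matter how many terms the query contains.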

  12. Boolean OR

      • A better implementation of Boolean OR.

      SELECT d.doc_id
      FROM DOC_TERM d, QUERY q
      WHERE d.term = q.term
      GROUP BY d.doc_id

  13. Threshold AND

      • Retrieve documents that contain at least k of the terms specified in the query.

      SELECT d.doc_id
      FROM DOC_TERM d, QUERY q
      WHERE d.term = q.term
      GROUP BY d.doc_id
      HAVING COUNT(DISTINCT d.term) >= k
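
The threshold query can be sketched with k passed as a bound parameter. As before, SQLite, the helper name `threshold_and`, and the reduced sample data are assumptions for illustration. With k = 1 the query behaves like Boolean OR, and with k equal to the number of query terms it behaves like Boolean AND.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE DOC_TERM (doc_id INTEGER, term TEXT, tf INTEGER);
    -- QUERY is quoted because it is also an SQLite keyword.
    CREATE TABLE "QUERY" (term TEXT, tf INTEGER);
    INSERT INTO DOC_TERM VALUES
        (1, 'vehicle', 1), (1, 'sales', 2), (1, 'italy', 1),
        (2, 'vehicle', 1);
    INSERT INTO "QUERY" VALUES ('vehicle', 1), ('sales', 1);
""")

def threshold_and(conn, k):
    """Return ids of documents that contain at least k of the query terms."""
    sql = """
        SELECT d.doc_id
        FROM DOC_TERM d, "QUERY" q
        WHERE d.term = q.term
        GROUP BY d.doc_id
        HAVING COUNT(DISTINCT d.term) >= ?
        ORDER BY d.doc_id
    """
    return [row[0] for row in conn.execute(sql, (k,))]

print(threshold_and(conn, 1))  # [1, 2]
print(threshold_and(conn, 2))  # [1]
```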

  14. Proximity Searches

      • Request for all documents that contain n terms within a term window of size width.
      • Use a relation DOC_TERM_PROX with attributes doc_id, term, offset.
      • The query insists that not only must all terms found in QUERY be contained in a given document, but at least one occurrence of each term in QUERY must fall within a term window of size width.

  15. Proximity Search using SQL

      SELECT a.doc_id
      FROM DOC_TERM_PROX a, DOC_TERM_PROX b
      WHERE a.term IN (SELECT q.term FROM QUERY q)
        AND b.term IN (SELECT q.term FROM QUERY q)
        AND a.doc_id = b.doc_id
        AND (b.offset - a.offset) BETWEEN 0 AND (width - 1)
      GROUP BY a.doc_id, a.term, a.offset
      HAVING COUNT(DISTINCT b.term) = (SELECT COUNT(*) FROM QUERY)

      Example. Find documents that have the terms vehicle and sales in a window of size 4.

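  16. After the first three conditions in the WHERE clause:

      | a.doc_id | a.term | a.offset | b.doc_id | b.term | b.offset |
      |---|---|---|---|---|---|
      | 1 | vehicle | 2 | 1 | vehicle | 2 |
      | 1 | vehicle | 2 | 1 | sales | 3 |
      | 1 | vehicle | 2 | 1 | sales | 18 |
      | 1 | sales | 3 | 1 | vehicle | 2 |
      | 1 | sales | 3 | 1 | sales | 3 |
      | 1 | sales | 3 | 1 | sales | 18 |
      | 1 | sales | 18 | 1 | vehicle | 2 |
      | 1 | sales | 18 | 1 | sales | 3 |
      | 1 | sales | 18 | 1 | sales | 18 |
      | 2 | vehicle | 14 | 2 | vehicle | 14 |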

  17. The following tuples satisfy all four conditions in the WHERE clause:

      | a.doc_id | a.term | a.offset | b.doc_id | b.term | b.offset |
      |---|---|---|---|---|---|
      | 1 | vehicle | 2 | 1 | vehicle | 2 |
      | 1 | vehicle | 2 | 1 | sales | 3 |
      | 1 | sales | 3 | 1 | sales | 3 |
      | 1 | sales | 18 | 1 | sales | 18 |
      | 2 | vehicle | 14 | 2 | vehicle | 14 |

  18. GROUP BY partitions the result set based on doc_id, term, and offset.

      | a.doc_id | a.term | a.offset | b.doc_id | b.term | b.offset |
      |---|---|---|---|---|---|
      | 1 | vehicle | 2 | 1 | vehicle | 2 |
      | 1 | vehicle | 2 | 1 | sales | 3 |
      | 1 | sales | 3 | 1 | sales | 3 |
      | 1 | sales | 18 | 1 | sales | 18 |
      | 2 | vehicle | 14 | 2 | vehicle | 14 |

      HAVING makes sure that all query terms are found in a group. DISTINCT ensures that a term occurring more than once within the term window is counted only once. Only the group (1, vehicle, 2) contains both vehicle and sales.

      Answer: Document 1.
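
The proximity query can be checked against the example offsets with a short `sqlite3` sketch. SQLite, the helper name `proximity_search`, and loading only a handful of DOC_TERM_PROX rows are assumptions for illustration; the window width is passed as a bound parameter.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- "offset" and "QUERY" are quoted because both are SQLite keywords.
    CREATE TABLE DOC_TERM_PROX (doc_id INTEGER, term TEXT, "offset" INTEGER);
    CREATE TABLE "QUERY" (term TEXT, tf INTEGER);
    INSERT INTO DOC_TERM_PROX VALUES
        (1, 'commercial', 1), (1, 'vehicle', 2), (1, 'sales', 3),
        (1, 'sales', 18), (2, 'vehicle', 14);
    INSERT INTO "QUERY" VALUES ('vehicle', 1), ('sales', 1);
""")

def proximity_search(conn, width):
    """Return ids of documents with all query terms inside one window."""
    sql = """
        SELECT a.doc_id
        FROM DOC_TERM_PROX a, DOC_TERM_PROX b
        WHERE a.term IN (SELECT q.term FROM "QUERY" q)
          AND b.term IN (SELECT q.term FROM "QUERY" q)
          AND a.doc_id = b.doc_id
          AND (b."offset" - a."offset") BETWEEN 0 AND (? - 1)
        GROUP BY a.doc_id, a.term, a."offset"
        HAVING COUNT(DISTINCT b.term) = (SELECT COUNT(*) FROM "QUERY")
    """
    return [row[0] for row in conn.execute(sql, (width,))]

print(proximity_search(conn, 4))  # [1]
print(proximity_search(conn, 1))  # []
```

Each group represents a window that starts at one occurrence of a query term, so a document could appear more than once in the result if several windows qualify; adding `DISTINCT` to the `SELECT` would collapse those repeats.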

  19. Computing Relevance Using Unchanged SQL

      • If a large number of documents match the query, we want to retrieve the documents ranked by their relevance to the query.
      • How do we compute the relevance?
        – Model each document and the query as a vector of terms. Then rank the documents by the similarity between the query vector and each document vector.

  20. Term weights

      • $d$ = total number of documents
      • $n$ = number of distinct terms in the document collection
      • $\mathit{tf}_{ij}$ = number of occurrences of term $t_j$ in document $D_i$
      • $f_j$ = number of documents which contain $t_j$
      • $\mathit{idf}_j = \log_{10}(d / f_j)$

      The weighting factor ($w$) for a term in a document is defined as a combination of term frequency ($\mathit{tf}$) and inverse document frequency ($\mathit{idf}$). The value of the $j$th entry in the vector corresponding to document $i$ is:

      $D_{ij} = \mathit{tf}_{ij} \times \mathit{idf}_j, \quad 1 \le j \le n$

      Similarly, we compute the vector for the query, where $\mathit{tf}_j$ is the number of occurrences of $t_j$ in the query:

      $Q_j = \mathit{tf}_j \times \mathit{idf}_j, \quad 1 \le j \le n$
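
The weight formula can be sketched in a few lines of Python. The function names `idf` and `weight` and the collection sizes in the usage lines are hypothetical; the slides do not state the value of d behind the example IDF table.

```python
import math

def idf(d, f_j):
    """Inverse document frequency: log10(d / f_j)."""
    return math.log10(d / f_j)

def weight(tf, d, f_j):
    """Vector entry for a term: term frequency times idf."""
    return tf * idf(d, f_j)

# Hypothetical collection of 1000 documents, term found in 10 of them.
print(idf(1000, 10))
# A term that occurs in every document gets idf 0, like "year" in the IDF table.
print(idf(1000, 1000))
print(weight(2, 1000, 10))
```

A term that appears in every document carries no weight, so it cannot influence the ranking regardless of its term frequency.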

  21. Similarity Coefficients

      A similarity coefficient between a query $Q$ and a document $D_i$ is just the dot product of the corresponding vectors, which is equivalent to the following sum:

      $\mathit{sim}(Q, D_i) = \sum_{j=1}^{n} Q_j \times D_{ij}$

  22. SQL for Relevance Ranking

      SELECT d.doc_id, SUM(q.tf * i.idf * d.tf * i.idf)
      FROM QUERY q, DOC_TERM d, IDF i
      WHERE q.term = i.term
        AND d.term = i.term
      GROUP BY d.doc_id
      ORDER BY 2 DESC

  23. Example for Relevance Ranking

      Since vehicle has an idf of 1.7709 and sales has an idf of 0.6990, the similarity coefficients for the two documents are computed as follows. The first product is the contribution of vehicle and the second is the contribution of sales:

      $D_1 = (1.7709)(1)(1.7709)(1) + (0.6990)(1)(0.6990)(2)$
      $D_2 = (1.7709)(1)(1.7709)(1) + (0.6990)(1)(0.6990)(0)$

      RESULT:

      | doc_id | similarity coefficient |
      |---|---|
      | 1 | 4.113 |
      | 2 | 3.136 |
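
The ranking query can be run on the example values with Python's `sqlite3` module. SQLite and the column types are assumptions for illustration, and only the rows that matter for the query terms vehicle and sales are loaded.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE DOC_TERM (doc_id INTEGER, term TEXT, tf INTEGER);
    CREATE TABLE IDF (term TEXT, idf REAL);
    -- QUERY is quoted because it is also an SQLite keyword.
    CREATE TABLE "QUERY" (term TEXT, tf INTEGER);
    INSERT INTO DOC_TERM VALUES (1, 'vehicle', 1), (1, 'sales', 2), (2, 'vehicle', 1);
    INSERT INTO IDF VALUES ('vehicle', 1.7709), ('sales', 0.6990);
    INSERT INTO "QUERY" VALUES ('vehicle', 1), ('sales', 1);
""")

RANKING = """
    SELECT d.doc_id, SUM(q.tf * i.idf * d.tf * i.idf)
    FROM "QUERY" q, DOC_TERM d, IDF i
    WHERE q.term = i.term
      AND d.term = i.term
    GROUP BY d.doc_id
    ORDER BY 2 DESC
"""

rows = conn.execute(RANKING).fetchall()
for doc_id, score in rows:
    print(doc_id, round(score, 3))
# prints:
# 1 4.113
# 2 3.136
```

Document 2 has no DOC_TERM row for sales, so the join simply contributes nothing for that term, which matches the zero factor in the hand calculation.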
