Improving Temporal Language Models for Determining Time of Non-Timestamped Documents



  1. Improving Temporal Language Models for Determining Time of Non-Timestamped Documents
Nattiya Kanhabua and Kjetil Nørvåg
Dept. of Computer Science, Norwegian University of Science and Technology, Trondheim, Norway
ECDL 2008 Conference, Århus, Denmark

  2. Agenda
- Motivation and Challenge
- Preliminaries
- Our Approaches
- Evaluation
- Conclusion

  3. Motivation
Research question: "How to improve search results in long-term archives of digital documents?"
Answer: Extend keyword search with temporal information: temporal text-containment search [Nørvåg'04].
Temporal information:
- Timestamp, e.g. the created or updated date.
- In local archives, the timestamp can be found in document metadata, which is trustworthy.
Q: Is the document timestamp in a WWW archive also trustworthy?
A: Not always. Some problems:
1. A lack of metadata preservation.
2. A time gap between crawling and indexing.
3. Relocation of web documents.

  4. Challenge
"For a given document with an uncertain timestamp, can the contents be used to determine the timestamp with a sufficiently high confidence?"
Illustration: "I found a bible-like document, but I have no idea when it was created." "You should ask the Guru!" "Let me see... This document probably originated in 850 A.D., with 95% confidence."

  5. Preliminaries
Temporal language models, "a model for dating documents", presented in [de Jong et al. '04]:
- Based on the statistics of word usage over time.
- Compare a non-timestamped document against a reference corpus.
- The reference time partition with the greatest overlap in term usage gives the tentative timestamp.
Example temporal language model:
  Partition  Word
  1999       tsunami
  1999       Japan
  1999       tidal wave
  2004       tsunami
  2004       Thailand
  2004       earthquake
A non-timestamped document containing "tsunami" and "Thailand" scores "1999": 1 and "2004": 1 + 1 = 2, so 2004 is the most likely timestamp.
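The overlap scoring in the example above can be sketched as follows. This is a minimal illustration using plain overlap counts; the actual model in [de Jong et al. '04] uses smoothed word probabilities rather than raw counts.

```python
from collections import defaultdict

# Toy temporal language model from the slide: partition -> vocabulary
model = {
    "1999": {"tsunami", "Japan", "tidal wave"},
    "2004": {"tsunami", "Thailand", "earthquake"},
}

def score_partitions(doc_words, model):
    """Score each time partition by how many document words it contains."""
    scores = defaultdict(int)
    for partition, vocab in model.items():
        for w in doc_words:
            if w in vocab:
                scores[partition] += 1
    return dict(scores)

doc = ["tsunami", "Thailand"]
scores = score_partitions(doc, model)
best = max(scores, key=scores.get)  # the most likely timestamp
```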

  6. Proposed Approaches
Three ways of improving temporal language models:
1) Data preprocessing
2) Word interpolation
3) Similarity score

  7. Data Preprocessing
A direct comparison between the words extracted from a document and the temporal language models limits accuracy. Semantic-based preprocessing steps:
- Part-of-speech tagging: only the most interesting classes of words are selected, e.g. nouns, verbs, and adjectives.
- Collocation extraction: co-occurrence of different words can alter the meaning, e.g. "United States".
- Word sense disambiguation: identify the correct sense of a word by analyzing its context in the sentence, e.g. "bank".
- Concept extraction: comparing two language models at the concept level avoids the problem of low-frequency words.
- Word filtering: only the top-ranked N words according to TF-IDF scores are selected as index terms.

  8. Word Interpolation
When a word has zero probability for a time partition because of the limited size of the corpus, it could still have a non-zero frequency in that period in documents outside the corpus. A word is categorized into one of two classes depending on its behavior over time: recurring or non-recurring.
- Recurring: related to periodic events, e.g. "Summer Olympics", "World Cup", "French Open".
- Non-recurring: words that are not recurring, e.g. "terrorism", "tsunami".
Recurring words are identified by looking at the overlap of the word's distribution at the (flexible) endpoints of possible periods: every year or every 4 years.
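One way to sketch the recurring/non-recurring test described above: check whether a word's occurrences concentrate in years that are a fixed period apart. The threshold `min_share` and the modulo heuristic are assumptions for illustration; the paper checks overlap of the word's distribution at yearly or 4-yearly endpoints.

```python
def is_recurring(year_freq, period=4, min_share=0.8):
    """Heuristic: a word is 'recurring' if most of its occurrences fall
    into years that are `period` years apart (e.g. Olympic years).
    `min_share` is an assumed threshold, not taken from the paper."""
    total = sum(year_freq.values())
    if total == 0:
        return False
    best = 0
    for offset in range(period):
        # Mass of the word's frequency on this 4-year cycle
        share = sum(f for y, f in year_freq.items() if y % period == offset)
        best = max(best, share)
    return best / total >= min_share

# Toy frequency profiles resembling the slide's examples
olympics = {1996: 5000, 2000: 5500, 2004: 6000, 1997: 100}
terrorism = {1996: 900, 1997: 950, 1998: 1000, 1999: 1100}
```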

  9. Word Interpolation (cont'd)
How to interpolate a word depends on which category it belongs to: recurring or non-recurring.
[Figures: frequency per year, 1996-2008, of "Olympic games" (recurring) and "Terrorism" (non-recurring), (a) before and (b) after interpolation.]

  10. Similarity Score
A term weighting that takes temporality into account: temporal entropy, based on the term selection method presented in [Lochbaum, Streeter '89].
Temporal entropy:
- A measure of the temporal information a word conveys.
- Captures the importance of a term in the whole document collection, whereas TF-IDF weights a term in a particular document.
- Tells how good a term is at separating one partition from the others.
TE(w_i) = 1 + (1 / log N_p) * sum_p P(p|w_i) log P(p|w_i), where N_p is the total number of partitions in the corpus and P(p|w_i) is the probability that partition p contains the term w_i.
A term occurring in few partitions has higher temporal entropy than one appearing in many partitions; the higher a term's temporal entropy, the better it represents a partition.
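The temporal entropy described above can be computed directly from a term's per-partition frequencies. This is a sketch following the [Lochbaum, Streeter '89]-style entropy normalization; the exact normalization constant is reconstructed from that style of term entropy, not quoted from the slides.

```python
import math

def temporal_entropy(partition_tf):
    """Temporal entropy of a term from its frequency in each partition:
    TE(w) = 1 + (1 / log Np) * sum_p P(p|w) * log P(p|w),
    where P(p|w) = tf(w, p) / sum_k tf(w, k)."""
    Np = len(partition_tf)
    total = sum(partition_tf)
    te = 0.0
    for tf in partition_tf:
        if tf > 0:
            p = tf / total
            te += p * math.log(p)
    return 1 + te / math.log(Np)

# A term concentrated in a single partition gets the maximal value 1.0;
# a term spread evenly over all partitions gets 0.0.
```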

  11. Similarity Score (cont'd)
By analyzing search statistics [Google Zeitgeist], we can increase the probability for a particular time partition.
- An inverse partition frequency, ipf = log(N/n), where N is the total number of partitions and n the number of partitions containing the term.
- f(R) converts a rank into a weight: a higher-ranked query is more important.
- P(w_i) is the probability that w_i occurs: P(w_i) = 1.0 for a gaining query, P(w_i) = 0.5 for a declining query.
The GZ score is linearly combined with the original similarity score of [de Jong et al. '04].
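The ingredients of the GZ score listed above can be combined as in the sketch below. The exact form of the rank weight f(R) and the way the three factors are combined are assumptions (1/rank and a plain product here); only the 1.0/0.5 trend probabilities and ipf = log(N/n) are stated on the slide.

```python
import math

def gz_weight(query_rank, trend, num_partitions, partitions_with_query):
    """Sketch of a Google Zeitgeist weight for one query term:
    rank weight f(R) (assumed 1/rank), trend probability P(w),
    and inverse partition frequency ipf = log(N/n)."""
    f_r = 1.0 / query_rank
    p_w = 1.0 if trend == "gaining" else 0.5
    ipf = math.log(num_partitions / partitions_with_query)
    return f_r * p_w * ipf
```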

  12. Experimental Setting
Build temporal language models from a reference corpus:
  Partition  Word        Probability
  1999       tsunami     0.015
  1999       Japan       0.003
  1999       tidal wave  0.009
  2004       tsunami     0.091
  2004       Thailand    0.012
  2004       earthquake  0.080
- Reference corpus: documents with known dates, collected from the Internet Archive; news history web pages, e.g. ABC News, CNN, New York Post, etc.
- Temporal language models: a list of words and their probabilities in each time partition, intended to capture word usage within a certain time period.
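Building such a word-probability table from dated documents can be sketched as below: count words per partition and normalize by the partition's total word count (a minimal sketch; the function name and toy data are illustrative).

```python
from collections import Counter, defaultdict

def build_temporal_lm(dated_docs):
    """Build a temporal language model from (partition, words) pairs:
    for each time partition, the probability of a word is its frequency
    normalized by the partition's total word count."""
    counts = defaultdict(Counter)
    for partition, words in dated_docs:
        counts[partition].update(words)
    model = {}
    for partition, ctr in counts.items():
        total = sum(ctr.values())
        model[partition] = {w: c / total for w, c in ctr.items()}
    return model

docs = [("2004", ["tsunami", "tsunami", "Thailand", "earthquake"])]
lm = build_temporal_lm(docs)
```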

  13. Experiments
Constraints on a training set:
1. It must cover the domain of the documents to be dated.
2. It must cover the time period of the documents to be dated.
Setup: from a reference corpus of 15 sources, 10 news sources from various domains form the training set; 1000 documents randomly selected from the remaining 5 sources (different from the training sources) form the testing set.
Metrics: Precision = the fraction of documents correctly dated. Recall = the fraction of correctly dated documents processed.

  14. Experiment (cont'd)
- Experiment A, semantic-based preprocessing: various combinations of semantics: 1) POS - WSD - CON - FILT, 2) POS - COLL - WSD - FILT, 3) POS - COLL - WSD - CON - FILT.
- Experiment B, temporal entropy and Google Zeitgeist: combination of TE and GZ with or without semantic-based preprocessing.
- Experiment C, dating task and confidence: as in other classification tasks, the system should be able to tell how much confidence it has in assigning a timestamp. Confidence is measured by the distance between the scores of the 1st- and 2nd-ranked partitions.
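The confidence measure of experiment C can be sketched as follows. The slide only says confidence is the distance between the 1st- and 2nd-ranked partition scores; normalizing by the top score so the value falls in [0, 1] is an assumption here.

```python
def dating_confidence(partition_scores):
    """Confidence of a dating decision: the gap between the scores of
    the 1st- and 2nd-ranked partitions, normalized by the top score
    (the normalization is an assumption, not from the slides)."""
    ranked = sorted(partition_scores.values(), reverse=True)
    if len(ranked) < 2 or ranked[0] == 0:
        return 1.0  # a single candidate (or no signal) is treated as certain
    return (ranked[0] - ranked[1]) / ranked[0]
```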

  15. Results
[Figures (a), (b): precision (%) vs. granularity (1-week, 1-month, 3-month, 6-month, 12-month) for the baseline and the variants A.1-A.3, TE, GZ, S-TE, S-GZ.]
(a) Semantic-based preprocessing: increases precision in almost all granularities except 1-week; at a small granularity it is hard to achieve high accuracy.
(b) Temporal entropy, Google Zeitgeist: by applying semantic-based preprocessing first, TE and GZ obtain a high improvement. The preprocessing generates collocations and concepts, which are weighted highly by TE and GZ (most search statistics are noun phrases).

  16. Results (cont'd)
The higher the confidence, the more reliable the results.
[Figure (c): precision and recall (%) vs. confidence level, 0.0-1.0.]
