slide-1
SLIDE 1

A Time Machine for Text Search

Klaus Berberich, Srikanta Bedathur, Thomas Neumann, Gerhard Weikum

Max-Planck Institute for Informatics, Saarbrücken, Germany

slide-2
SLIDE 2

Klaus Berberich – A Time Machine for Text Search

Motivation

Historical information needs, e.g., contemporary (~2001) articles about the movie “Harry Potter and the Sorcerer’s Stone”

Relevant pages have disappeared but are preserved by Web archives (e.g., archive.org)

Search over Web archives is limited and ignores the time axis

slide-7
SLIDE 7

Motivation

Time-Travel Text Search extends keyword querying by a time-point of interest t

Example: “harry potter” @ 2001/11/14

Other temporally versioned text collections
  Wikis
  Repositories (e.g., controlled by CVS, Subversion)
  Your Desktop

slide-8
SLIDE 8

Outline

Motivation
Collection, Query, and Relevance Model
Time-Travel Inverted File Index
  Reducing Index Size
  Tuning Index Performance
Experimental Evaluation
Conclusions

slide-9
SLIDE 9

Collection Model

Document d is a sequence of time-stamped versions
Version is a vector of searchable terms
Document deletion results in tombstone version ⊥
Discrete time, timestamps are non-negative
State of the document collection as of time t is denoted D^t
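
The collection model can be made concrete with a small sketch. It assumes integer timestamps and represents the tombstone ⊥ as `None`; the names `Version`, `version_at`, and `snapshot` are illustrative and do not come from the slides.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Version:
    timestamp: int                    # non-negative, discrete time
    terms: Optional[Dict[str, int]]   # term -> frequency; None marks the tombstone


def version_at(versions: List[Version], t: int) -> Optional[Version]:
    """Return the version of one document that is current at time t, if any."""
    current = None
    for v in sorted(versions, key=lambda v: v.timestamp):
        if v.timestamp > t:
            break
        current = v
    # No version yet, or the latest one is a tombstone: the document does not exist at t.
    if current is None or current.terms is None:
        return None
    return current


def snapshot(collection: Dict[str, List[Version]], t: int) -> Dict[str, Version]:
    """State of the collection as of time t: one live version per existing document."""
    state = {}
    for doc_id, versions in collection.items():
        v = version_at(versions, t)
        if v is not None:
            state[doc_id] = v
    return state


# Hypothetical data: d1 lives in [1, 5), d8 lives in [7, 9).
collection = {
    "d1": [Version(1, {"harry": 3}), Version(2, {"harry": 2}), Version(5, None)],
    "d8": [Version(7, {"harry": 4}), Version(9, None)],
}
```

Calling `snapshot(collection, 8)` returns only `d8`, because `d1` was deleted at time 5.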

slide-10
SLIDE 10

Query Model

Time-travel query q^t consists of
  keyword part q (i.e., a set of query terms)
  time-point of interest t

Time-travel query q^t is evaluated over D^t so that only versions that existed at time t are considered
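
As a rough illustration of the two query parts, the following sketch splits a query string written in the `keywords @ yyyy/mm/dd` notation used in the motivating example. The function name `parse_query` and the `TimeTravelQuery` type are assumptions made for this example, not part of the described system.

```python
from datetime import date, datetime
from typing import FrozenSet, NamedTuple


class TimeTravelQuery(NamedTuple):
    terms: FrozenSet[str]   # keyword part q
    time_point: date        # time-point of interest t


def parse_query(text: str) -> TimeTravelQuery:
    """Split '<keywords> @ <yyyy/mm/dd>' into a term set and a date."""
    keyword_part, _, time_part = text.rpartition("@")
    if not keyword_part.strip():
        raise ValueError("expected '<keywords> @ <yyyy/mm/dd>'")
    # Treat straight and curly double quotes as whitespace.
    cleaned = keyword_part.translate(str.maketrans('"\u201c\u201d', "   "))
    terms = frozenset(cleaned.lower().split())
    time_point = datetime.strptime(time_part.strip(), "%Y/%m/%d").date()
    return TimeTravelQuery(terms, time_point)


query = parse_query('"harry potter" @ 2001/11/14')
```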

slide-11
SLIDE 11

Relevance Model

We adapt Okapi BM25 as a relevance model
  Term-frequency score (TF)
  Inverse document-frequency score (IDF)
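
The formulas on this slide did not survive extraction. The sketch below shows the standard Okapi BM25 components with all statistics (document length, average length, collection size, document frequency) taken as of time t. The parameter defaults `k1 = 1.2` and `b = 0.75` are customary values and an assumption here; the exact variant used in the talk may differ.

```python
import math
from typing import Dict, Iterable


def w_tf(tf: float, doc_len: float, avg_doc_len: float,
         k1: float = 1.2, b: float = 0.75) -> float:
    """BM25 term-frequency score of a term in one document version."""
    norm = k1 * ((1.0 - b) + b * doc_len / avg_doc_len)
    return (k1 + 1.0) * tf / (norm + tf)


def w_idf(num_docs: int, doc_freq: int) -> float:
    """BM25 inverse document-frequency score from statistics valid at time t."""
    return math.log((num_docs - doc_freq + 0.5) / (doc_freq + 0.5))


def score(query_terms: Iterable[str],
          tf_scores: Dict[str, float],
          idf_scores: Dict[str, float]) -> float:
    """Relevance of one version: sum over query terms of TF score times IDF score."""
    return sum(tf_scores.get(term, 0.0) * idf_scores.get(term, 0.0)
               for term in query_terms)
```

Because both factors depend on t, the TF part can be stored per version while the IDF part has to be looked up for the query's time-point.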

slide-17
SLIDE 17

Time-Travel Inverted File Index

Idea: Transparently extend “IR’s workhorse” so that the existing wealth of extensions remains applicable

We extend postings by a validity time-interval

[Figure: two structures, each accessed through a B+-Tree. IDF: entries for “harry” and “potter”. TF: index list for “harry” with postings ( d1, 11.2, [t1, t2) ), ( d1, 10.6, [t2, t5) ), ( d8, 10.9, [t7, t9) )]

slide-28
SLIDE 28

Time-Travel Inverted File Index

Time-travel query q^t can be processed by scanning index lists while ignoring non-relevant postings

Example: “harry”@t8

[Figure: the IDF lookup via the B+-Tree yields widf(“harry”, t8) = 3.08; the TF index list for “harry” is then scanned: ( d1, 11.2, [t1, t2) ), ( d1, 10.6, [t2, t5) ), ( d8, 10.9, [t7, t9) )]
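
The scan in this example can be sketched as follows. The postings and the IDF value 3.08 are taken from the slide; mapping t1..t10 to the integers 1..10, the `Posting` class, and the `scan` function are assumptions made for illustration, and only the single-term case is shown.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Posting:
    doc: str
    tf_score: float
    begin: int   # validity interval is [begin, end)
    end: int


def scan(index_list: List[Posting], t: int, idf: float) -> List[Tuple[str, float]]:
    """Scan one index list, keep postings valid at t, and rank by TF score times IDF."""
    results = []
    for p in index_list:
        if p.begin <= t < p.end:   # ignore postings whose interval excludes t
            results.append((p.doc, p.tf_score * idf))
    return sorted(results, key=lambda r: r[1], reverse=True)


harry = [
    Posting("d1", 11.2, 1, 2),
    Posting("d1", 10.6, 2, 5),
    Posting("d8", 10.9, 7, 9),
]
hits = scan(harry, t=8, idf=3.08)
```

All three postings are read, but only the one for d8 is valid at t8 and contributes to the result.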

slide-30
SLIDE 30

Reducing Index Size

Shortcoming: Since we create one posting per version per term, the resulting index is very large

[Figure: TF index list for “harry”, accessed through a B+-Tree, with postings ( d1, 11.2, [t1, t2) ), ( d1, 10.6, [t2, t5) ), ( d8, 10.9, [t7, t9) ), annotated “HUGE!!!”]

(Wikipedia Revision History: ~8.6B postings)

slide-31
SLIDE 31

Reducing Index Size

Observation: Changes between document versions are often minor (e.g., corrected typos) and have no noticeable effect on the ranked result (e.g., 500 x “harry” vs. 510 x “harry”)

Idea: Coalesce sequences of temporally adjacent postings having similar scores

[Figure: score over time, non-coalesced vs. coalesced postings]

slide-32
SLIDE 32

Reducing Index Size

Problem Statement: Given an input sequence I, find a minimal-length output sequence O with approximation errors bounded by a threshold ε

Approximate Temporal Coalescing (ATC) finds an optimal output sequence using a greedy linear-time algorithm

Guarantee: |p’ - p_i| / |p_i| ≤ ε for every input posting p_i that is represented by the coalesced posting p’

[Figure: scores of postings p1, p2, p3 and their coalesced approximation p’]
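
One way such a greedy pass could look is sketched below. It is an illustration under stated assumptions rather than the authors' implementation: a run of temporally adjacent postings of the same document is extended as long as some score exists that stays within the relative-error threshold of every posting in the run, and the midpoint of that feasible range is stored. The names `Posting`, `coalesce`, and `eps` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Posting:
    doc: str
    score: float
    begin: int   # validity interval is [begin, end)
    end: int


def coalesce(postings: List[Posting], eps: float) -> List[Posting]:
    """Greedy single pass; input must be ordered by document and time."""
    out: List[Posting] = []
    run: Optional[Tuple[str, int, int, float, float]] = None  # doc, begin, end, lo, hi
    for p in postings:
        # Scores within [lo, hi] approximate p with relative error at most eps.
        lo = p.score - eps * abs(p.score)
        hi = p.score + eps * abs(p.score)
        if run is not None:
            doc, begin, end, run_lo, run_hi = run
            new_lo, new_hi = max(run_lo, lo), min(run_hi, hi)
            if doc == p.doc and end == p.begin and new_lo <= new_hi:
                run = (doc, begin, p.end, new_lo, new_hi)   # extend the current run
                continue
            out.append(Posting(doc, (run_lo + run_hi) / 2, begin, end))
        run = (p.doc, p.begin, p.end, lo, hi)               # start a new run
    if run is not None:
        doc, begin, end, run_lo, run_hi = run
        out.append(Posting(doc, (run_lo + run_hi) / 2, begin, end))
    return out


harry = [
    Posting("d1", 11.2, 1, 2),
    Posting("d1", 10.6, 2, 5),
    Posting("d8", 10.9, 7, 9),
]
coalesced = coalesce(harry, eps=0.10)
```

With a threshold of 0.10 the two d1 postings merge into a single posting valid over [1, 5); with 0.01 their feasible ranges no longer intersect and all three postings are kept.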

slide-34
SLIDE 34

Tuning Index Performance

Shortcoming: During query processing many postings are superfluously read

[Figure: postings 1 to 10 of documents d1, d2, d3 drawn as validity intervals over the time axis t1 to t10, with the query time-point t marked]

We read 10 postings, but only {1, 5, 8} are needed

slide-36
SLIDE 36

Tuning Index Performance

Idea: Materialize smaller sublists containing only postings that overlap with a smaller time-interval

[Figure: postings 1 to 10 of documents d1, d2, d3 drawn as validity intervals over the time axis t1 to t10, with the query time-point t marked]

By materializing a sublist for [t1, t2) we can achieve optimal performance for the query

slide-38
SLIDE 38

Tuning Index Performance

Idea: Materialize smaller sublists containing only postings that overlap with a smaller time-interval

[Figure: postings 1 to 10 of documents d1, d2, d3 drawn as validity intervals over the time axis t1 to t10, with the query time-point t marked]

By materializing a sublist for each elementary time interval we achieve optimal performance
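
A minimal sketch of materializing one sublist per elementary time interval follows. The figure's actual posting intervals are not recoverable from the slide text, so the four postings below are hypothetical, as are the function names.

```python
from typing import Dict, List, Tuple

Posting = Tuple[str, int, int]   # (id, begin, end), valid in [begin, end)
Interval = Tuple[int, int]       # [lo, hi)


def elementary_intervals(postings: List[Posting]) -> List[Interval]:
    """Intervals between consecutive posting boundaries."""
    bounds = sorted({b for _, b, _ in postings} | {e for _, _, e in postings})
    return list(zip(bounds, bounds[1:]))


def materialize(postings: List[Posting],
                intervals: List[Interval]) -> Dict[Interval, List[Posting]]:
    """For each interval, keep exactly the postings whose validity overlaps it."""
    return {
        (lo, hi): [p for p in postings if p[1] < hi and p[2] > lo]
        for lo, hi in intervals
    }


def lookup(sublists: Dict[Interval, List[Posting]], t: int) -> List[Posting]:
    """Return the sublist to scan for time-point t (empty if t is not covered)."""
    for (lo, hi), sublist in sublists.items():
        if lo <= t < hi:
            return sublist
    return []


# Hypothetical index list with four postings.
postings = [("p1", 1, 3), ("p2", 3, 6), ("p3", 2, 8), ("p4", 6, 10)]
sublists = materialize(postings, elementary_intervals(postings))
```

In this toy example a query at t = 4 scans two postings instead of four, while the materialized sublists together hold eight postings instead of four, which illustrates the space cost of the performance-optimal extreme.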

slide-39
SLIDE 39

Tuning Index Performance

So far, we have seen two extreme solutions
  space-optimal: keep only a single list (SOPT)
  performance-optimal: keep one list per elementary time-interval (POPT)

We propose two systematic techniques to trade off space and performance
  performance-guarantee: consumes minimal space while retaining a performance guarantee (PG)
  space-bound: achieves best performance while not exceeding a space limit (SB)

slide-40
SLIDE 40

Tuning Index Performance

Performance Guarantee (PG)
  consumes minimal space
  guarantees that for any t at most γ · n_t postings are read, where n_t is the number of postings that exist at time t

Optimal solution computable by means of induction in O(T^2) time and O(T^2) space (where T is the number of distinct timestamps in the list)

slide-41
SLIDE 41

Tuning Index Performance

Space Bound (SB)
  achieves minimal expected processing cost (i.e., expected length of the list that is scanned)
  consumes at most κ · n space, where n is the length of the original list

Optimal solution computable using dynamic programming in O(n^4) time and O(n^3) space

Approximate solution computable in O(T^2) time and O(T) space using simulated annealing
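
The quantities that PG and SB trade against each other can be illustrated by evaluating a candidate partition of the time axis. The sketch below does not compute the optimal PG or SB solution; it only measures space, expected processing cost, and whether a given read-overhead factor holds for every time-point. Uniformly distributed query time-points, the toy postings, and all names are assumptions.

```python
from typing import List, Tuple

Posting = Tuple[int, int]   # validity interval [begin, end)


def sublist_size(postings: List[Posting], lo: int, hi: int) -> int:
    """Number of postings whose validity interval overlaps [lo, hi)."""
    return sum(1 for b, e in postings if b < hi and e > lo)


def evaluate(postings: List[Posting], boundaries: List[int],
             factor: float) -> Tuple[int, float, bool]:
    """Return (space, expected processing cost, guarantee holds) for a partition.

    boundaries lists the sublist borders in increasing order; the guarantee
    holds if every time-point t reads at most factor * n_t postings.
    """
    space = 0
    cost_sum = 0
    time_points = 0
    guarantee_holds = True
    for lo, hi in zip(boundaries, boundaries[1:]):
        size = sublist_size(postings, lo, hi)
        space += size
        for t in range(lo, hi):
            n_t = sum(1 for b, e in postings if b <= t < e)   # postings alive at t
            cost_sum += size
            time_points += 1
            if size > factor * n_t:
                guarantee_holds = False
    return space, cost_sum / time_points, guarantee_holds


# Hypothetical index list covering time-points 1..9.
postings = [(1, 3), (3, 6), (2, 8), (6, 10)]
single_list = evaluate(postings, [1, 10], factor=2.0)             # one list only
per_interval = evaluate(postings, [1, 2, 3, 6, 8, 10], factor=1.0)  # one list per elementary interval
in_between = evaluate(postings, [1, 3, 8, 10], factor=2.0)        # a compromise
```

On this toy list the single list uses 4 postings of space but violates a factor of 2, the per-interval layout needs 8 postings, and the three-sublist compromise needs 6 while still satisfying the factor of 2.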

slide-44
SLIDE 44

Experimental Evaluation – Setup

Implementation: Java, Oracle 10g

Datasets:
  WIKI: Revision history of English Wikipedia (2001-2005); 892K documents / 13,976K versions / 0.7 TBytes
  UKGOV: Weekly crawls of 11 .gov.uk sites (2004-2005); 502K documents / 8,687K versions / 0.4 TBytes

Queries:
  300 keyword queries from AOL query log that most frequently produced a result click on en.wikipedia.org / .gov.uk
  Each keyword query is assigned one time point per month in the collection’s lifespan (18K / 7.2K time-travel queries in total)
  Example queries (WIKI): ten commandments, abraham lincoln, da vinci code, harlem renaissance…
  Example queries (UKGOV): 1901 uk census, british royal family, migrant worker statistics, witness intimidation…

slide-47
SLIDE 47

Approximate Temporal Coalescing

Indexes computed for different values of threshold ε

[Figure: index size relative to the size of the original index (axis ticks 0.01, 0.1, 1) plotted against ε (axis ticks 0.1 to 0.5) for WIKI and UKGOV; annotated values: 4.39% and 2.38%]

slide-48
SLIDE 48

Approximate Temporal Coalescing

Impact on top-k query results assessed using
  Relative Recall @ k in [0,1]
  Kendall’s τ @ k in [-1,1]

Computed per dataset for
  all time-travel queries (18K / 7.2K)
  k varying as 10, 25, 50, 100
  ε varying as 0.01, 0.05, 0.10, 0.25, 0.50

We report mean, 5%-percentile, and 95%-percentile
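
The two measures can be sketched as follows for a pair of rankings, one from the original index and one from the coalesced index. The slide does not say how Kendall’s τ is defined when the two top-k lists contain different documents; restricting it to the documents that appear in both lists is an assumption of this sketch, as are the function names.

```python
from itertools import combinations
from typing import List


def relative_recall(reference: List[str], approximate: List[str], k: int) -> float:
    """Fraction of the reference top-k that also appears in the approximate top-k."""
    ref_top = reference[:k]
    if not ref_top:
        return 1.0
    approx_top = set(approximate[:k])
    return sum(1 for d in ref_top if d in approx_top) / len(ref_top)


def kendall_tau(reference: List[str], approximate: List[str], k: int) -> float:
    """Kendall's tau over the documents that occur in both top-k lists."""
    approx_rank = {d: i for i, d in enumerate(approximate[:k])}
    common = [d for d in reference[:k] if d in approx_rank]   # in reference order
    pairs = list(combinations(common, 2))
    if not pairs:
        return 1.0
    concordant = sum(1 for a, b in pairs if approx_rank[a] < approx_rank[b])
    discordant = len(pairs) - concordant
    return (concordant - discordant) / len(pairs)


# Hypothetical rankings.
original = ["a", "b", "c", "d"]
coalesced = ["a", "c", "b", "e"]
recall = relative_recall(original, coalesced, k=4)
tau = kendall_tau(original, coalesced, k=4)
```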

slide-49
SLIDE 49

Approximate Temporal Coalescing

[Figure: Relative Recall @ 100 and Kendall’s τ @ 100 for WIKI and UKGOV (axis ticks 0.2 to 1) plotted against ε = 0.01, 0.05, 0.10, 0.25, 0.50]

slide-50
SLIDE 50

Sublist Materialization

Threshold for ATC fixed as ε = 0.10

For terms in query workloads (422/522) we apply
  SOPT and POPT
  PG for γ varying between 1.10 and 3.00
  SB for κ varying between 1.10 and 3.00

We report
  Space, i.e., total number of postings in materialized sublists
  Expected Processing Cost (EPC), i.e., expected length of scanned list for random term and time

slide-51
SLIDE 51

Performance Guarantee

                WIKI                UKGOV
                Space     EPC       Space     EPC
POPT            14,428%   100%      11,406%   100%
SOPT            100%      963%      100%      147%
PG, γ = 1.10    1,004%    106%      616%      103%
PG, γ = 1.50    295%      132%      233%      117%
PG, γ = 2.00    195%      160%      163%      125%
PG, γ = 3.00    145%      207%      132%      133%

EPC = Expected Processing Cost

slide-56
SLIDE 56

Space Bound

                WIKI                UKGOV
                Space     EPC       Space     EPC
POPT            14,428%   100%      11,406%   100%
SOPT            100%      963%      100%      147%
SB, κ = 3.00    288%      139%      273%      107%
SB, κ = 2.00    194%      171%      180%      119%
SB, κ = 1.50    146%      214%      131%      131%
SB, κ = 1.10    109%      406%      104%      145%

EPC = Expected Processing Cost

slide-62
SLIDE 62

Conclusions

Time-Travel Text Search
  an interesting & important research problem!

Our Time Machine
  building on inverted file index
  significant reduction of index size
  tunable index performance

Experimental Evaluation
  over two large-scale real-world datasets
  demonstrating efficiency & effectiveness

slide-63
SLIDE 63

Demo at VLDB ’07

September 23-28, Vienna, Austria

[Figure: demo architecture with a Web-based GUI, the FLUXCAPACITOR Server containing the Query Processor, three databases (Time-Travel Text Index, Metadata & Snippets, IDF Score Time-Series), and the Temporally Versioned Text Collection]

slide-64
SLIDE 64

Thank you! Questions?

slide-65
SLIDE 65

Experimental Evaluation – PG & SB

[Figure: two plots of Expected Processing Cost against Space (axis ticks from 10000 to 1e+10), one for UKGOV and one for WIKI, each comparing PG, SB, SOPT, and POPT]