SeFS: Unleashing the Power of Full-text Search on File Systems

USENIX FAST ’07 (WiP)
Stergios V. Anastasiadis (joint work with G. Margaritis)
U. Ioannina, Greece

Motivation
• Full-text search in modern systems is often used for
  – Email
  – Application help files
  – Log files
  – Any file that contains text
  – ...
• Maybe full-text search should
  – Receive the attention it deserves from system designers
  – Be made available as a general system service to developers

02/14/2007 (c) S. V. Anastasiadis

File System Features
• File size
  – Most files are small BUT
  – Most bytes are in large files
• File lifetime
  – Is highly variable across different systems
  – Varies from minutes to years
  – Has median age = tens of days
• User expectations
  – Perceive the file system as a reliable “storage medium”
  – Anticipate changes to be made visible almost immediately

Attempt #1: Information Retrieval
• Upside
  – Online support of Boolean queries and dynamic updates
  – Mature technology (first ACM-SIGIR in 1978)
• Downside
  – Technology initially developed for article archives
  – “Dynamic update” mainly means addition of new articles
  – Indexing structures biased by decade-old studies to serve the above assumptions

Index Maintenance in IR
• Inverted files
  – Map terms to term positions in documents (posting lists)
• Decades ago
  – Updated infrequently to include new articles
  – Contiguously stored on disk to minimize query time
• Recently
  – Updated dynamically to include new articles BUT
  – Treat document changes as insertions/deletions
  – Use complex relocation techniques to preserve contiguity
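To make the inverted-file idea concrete, here is a minimal in-memory sketch. It is not the SeFS design: the class name `InvertedIndex`, its methods, and the regular-expression tokenizer are all assumptions made for illustration. It maps each term to the positions where it occurs in each document, and handles a document change as a deletion followed by an insertion, as the slide describes.

```python
import re
from collections import defaultdict


class InvertedIndex:
    """Toy inverted file (hypothetical): term -> {doc_id: [term positions]}."""

    def __init__(self):
        self.postings = defaultdict(dict)  # term -> posting list per document
        self.doc_terms = {}                # doc_id -> set of terms, used for deletion

    def add(self, doc_id, text):
        # Tokenize naively into lowercase words and record each position.
        terms = re.findall(r"\w+", text.lower())
        for pos, term in enumerate(terms):
            self.postings[term].setdefault(doc_id, []).append(pos)
        self.doc_terms[doc_id] = set(terms)

    def delete(self, doc_id):
        # Remove the document from every posting list that mentions it.
        for term in self.doc_terms.pop(doc_id, set()):
            del self.postings[term][doc_id]
            if not self.postings[term]:
                del self.postings[term]

    def update(self, doc_id, text):
        # A document change is treated as a deletion followed by an insertion.
        self.delete(doc_id)
        self.add(doc_id, text)

    def search_and(self, *terms):
        # Boolean AND query: documents that contain every given term.
        sets = [set(self.postings.get(t.lower(), {})) for t in terms]
        return set.intersection(*sets) if sets else set()


index = InvertedIndex()
index.add("a.txt", "full text search on file systems")
index.add("b.txt", "file systems store files")
print(index.search_and("file", "systems"))
```

Even in this toy form, `update` rewrites every posting list the document touches, which hints at why frequent file changes are costly for index structures tuned for append-only article archives.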

Question
• Why not allocate posting lists on fixed-size blocks?
  – Avoid data relocation during inserts/appends
  – Amortize disk seeks over large block sizes
  – Simplify system structure without major performance penalty
• Several I/O demanding systems based on blocks
  – Database systems
  – The Google File System (chunks of 64MB)
  – Video streaming storage
  – …
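The block-based alternative raised in this question can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: `BlockStore`, `BlockedPostingLists`, and the tiny `BLOCK_CAPACITY` are hypothetical names and values, and the "disk" is simulated with a Python list. Each posting list is a chain of fixed-size blocks; an append either fills the last block or allocates a new one, so existing postings are never relocated.

```python
BLOCK_CAPACITY = 4  # postings per block; hypothetical and tiny for illustration


class BlockStore:
    """Simulated disk (hypothetical): a growing array of fixed-size blocks."""

    def __init__(self, capacity=BLOCK_CAPACITY):
        self.capacity = capacity
        self.blocks = []  # block number -> list of postings

    def allocate(self):
        # Hand out the next free block and return its number.
        self.blocks.append([])
        return len(self.blocks) - 1


class BlockedPostingLists:
    """Posting lists stored as chains of fixed-size blocks."""

    def __init__(self, store):
        self.store = store
        self.chains = {}  # term -> ordered list of block numbers

    def append(self, term, posting):
        chain = self.chains.setdefault(term, [])
        # Allocate a fresh block only when the list is empty or its last block is full.
        if not chain or len(self.store.blocks[chain[-1]]) == self.store.capacity:
            chain.append(self.store.allocate())
        self.store.blocks[chain[-1]].append(posting)

    def read(self, term):
        # One block access per chain entry; larger blocks mean fewer accesses.
        postings = []
        for block_no in self.chains.get(term, []):
            postings.extend(self.store.blocks[block_no])
        return postings


store = BlockStore(capacity=2)
lists = BlockedPostingLists(store)
for term, posting in [("file", 1), ("file", 2), ("text", 7), ("file", 3)]:
    lists.append(term, posting)
print(lists.chains["file"], lists.read("file"))
```

The trade-off is that a list may end up spread over non-adjacent blocks, so reading it costs one seek per block; choosing a large block size is what amortizes that cost.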

Attempt #2: Web Search
• Upside
  – Technology can handle large data sets
  – Search results quite close to user expectations
• Downside
  – The web is perceived as unreliable; infrequent updates ok
  – Distributed nature makes stats gathering difficult
  – Dedicated hardware devoted to indexing
• Bottom line
  – Despite commonalities, file systems differ from the web
  – Exploit strengths without adopting weaknesses

Attempt #3: Relational Databases
• First approach
  – Store all system metadata on a relational database system
    E.g. SRB/SDSC, SCFS/MIT, Amino/Stony Brook
  – Ok for ftp-like services
  – BUT maybe too heavyweight for fine-grain accesses
• Why?
  – File systems custom-developed/optimized for handling their metadata

Relational Databases (cont’d)
• Second approach
  – Keep system metadata on custom file-system structures
  – BUT maintain user metadata in a database
  – Maybe ok but still insufficient for full-text search
• Why?
  – Full-text search involves more than a few attribute/value pairs per file
  – Inverted files are the most efficient structure for large text collections

Conclusion
• File systems
  – More flexible in their functionality than article repositories
  – More reliable and amenable to stats gathering than the web
  – More efficient in fine-granularity operations than RDBs
• Full-text search on file systems
  – Useful for different applications and system services
  – Should be designed from scratch, free from inherent drawbacks of solutions from other environments
