Scalable Full-Text Search for Petascale File Systems
Andrew W. Leung • Ethan L. Miller
University of California, Santa Cruz
3rd Petascale Data Storage Workshop (PDSW ’08)
November 17th, 2008

Need scalable file management
• Today’s file systems contain
  • Petabytes of data, billions of files, and thousands of users
• File systems have focused on scaling
  • I/O and metadata throughput, latency, fault-tolerance, cost
• Limited work on scaling organization and retrieval
  • File system organization largely unchanged for 30 years
• File organization and retrieval have not kept pace with file systems

Problems with current approach
• Files are organized into a single hierarchy
  • Possibly billions of files and directories
• Slow and inaccurate
  • Users must carefully organize and name files and directories
    - Tedious and time-consuming
  • Users must manually navigate huge hierarchies
    - Wastes time and is inaccurate
• Files only have a single classification
• Does not scale to petascale file systems

Scalable file retrieval with search
• File system search has been researched for decades
  • Focused on full-text (aka keyword) search
• Organizing and retrieving files with search
  • Files have many automatic classifications
    - Organization becomes much simpler
  • Files can be retrieved with any feature/keywords
    - No more slow namespace navigation
    - Reduces the chances of lost data

Petascale search challenges
• Cost
  • Very expensive - often requires dedicated hardware
• Performance
  • Tough to scale - often trades off search performance against update performance
  • File system search should efficiently do both
• Ranking
  • Limited file ranking algorithms
• Security
  • Can significantly degrade search performance

A specialized petascale search design
• Exploits file system properties
  • Can be integrated within the file system
• Leverages namespace locality with hierarchical partitioning [Leung09]
• Namespace influences
  • File access patterns [Leung08, Vogel99]
  • File properties [Agrawal07, Leung09]
  • Who accesses them [Agrawal07, Leung08]

Index partitioning
[Figure: namespace tree with directories /, home, usr, proj, john, jim, distmeta, reliability, include, thesis, scidac, src, experiments; Keyword 1's posting list segments on the hard disk]
• Traditional file system search uses an inverted index
  • Consists of a dictionary that points to posting lists
• Our approach partitions the index based on the namespace
  • Posting lists are broken into segments
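As a rough illustration of the idea on this slide, the sketch below keeps one posting list segment per keyword per namespace partition. The talk does not say how partitions are chosen, so the fixed-depth rule in `partition_of`, the `PartitionedIndex` class, and the sample paths are all assumptions made for the example, not the authors' design.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def partition_of(path: str, depth: int = 2) -> str:
    """Map a file path to a partition: the sub-tree rooted `depth` levels down.

    The fixed-depth rule is an assumption for illustration only.
    """
    parts = PurePosixPath(path).parts  # ('/', 'home', 'john', 'thesis', 'ch1.txt')
    return "/" + "/".join(parts[1:1 + depth])


class PartitionedIndex:
    """Inverted index whose posting lists are split into per-partition segments."""

    def __init__(self, depth: int = 2) -> None:
        self.depth = depth
        # keyword -> partition root -> sorted list of file paths (one segment)
        self.segments = defaultdict(dict)

    def add(self, path: str, text: str) -> None:
        """Index a file: only the segments of its own partition are touched."""
        part = partition_of(path, self.depth)
        for word in set(text.lower().split()):
            segment = self.segments[word].setdefault(part, [])
            if path not in segment:
                segment.append(path)
                segment.sort()

    def search(self, word: str) -> list[str]:
        """Return every file containing `word`, reading all of its segments."""
        hits = []
        for segment in self.segments.get(word.lower(), {}).values():
            hits.extend(segment)
        return sorted(hits)


index = PartitionedIndex()
index.add("/home/john/thesis/ch1.txt", "petascale search index")
index.add("/home/jim/notes.txt", "search ranking")
index.add("/proj/scidac/src/main.c", "index update")
print(index.search("search"))
```

Here the keyword "index" ends up with two small segments, one for `/home/john` and one for `/proj/scidac`, instead of a single global posting list.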

Benefits of our design
• Flexible, fine-grained index control
  • Search and update can be controlled at sub-tree granularity
  • Critical for an index with billions of files
• Reducing the search space
  • Eliminate partitions that do not match search criteria
  • Allows users to control scope and performance of queries
• Efficient index updates
  • Smaller posting lists are easier to update and keep sequential on-disk
  • Better resource utilization
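The search-space reduction can be sketched as follows, assuming each keyword's posting list is already split into segments keyed by partition root. The `segments` layout, the sample data, and the `scoped_search` helper are hypothetical and only show how a query limited to a sub-tree can skip whole partitions.

```python
# keyword -> {partition root -> posting list segment (file paths)}
# The layout and the sample data are assumptions for illustration.
segments = {
    "search": {
        "/home/john": ["/home/john/thesis/ch1.txt"],
        "/home/jim": ["/home/jim/notes.txt"],
        "/proj/scidac": ["/proj/scidac/src/query.c"],
    },
}


def in_subtree(path: str, root: str) -> bool:
    """True if `path` equals `root` or lies beneath it."""
    root = root.rstrip("/")
    return path == root or path.startswith(root + "/")


def scoped_search(word: str, scope: str) -> tuple[list[str], int]:
    """Search `word` under `scope`; also report how many segments were read."""
    hits, segments_read = [], 0
    for part, segment in segments.get(word, {}).items():
        # A partition can hold matches only if it lies inside the scope,
        # or the scope lies inside the partition.
        if not (in_subtree(part, scope) or in_subtree(scope, part)):
            continue  # pruned: this segment is never read
        segments_read += 1
        hits.extend(p for p in segment if in_subtree(p, scope))
    return sorted(hits), segments_read


print(scoped_search("search", "/home"))
print(scoped_search("search", "/home/john/thesis"))
```

A query scoped to `/home` reads two of the three segments, and one scoped to `/home/john/thesis` reads only one, which is the sense in which users control both the scope and the cost of a query.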

The indirect index
[Figure: the indirect index. A dictionary (Keyword 1 to Keyword 4) points to posting lists, which point to the posting list segments for Partition 1, Partition 2, ...]
• An inverted index that points to partition locations
  • Stores the dictionary
  • Posting lists store partition segment locations
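One way to picture the indirect index is a dictionary whose posting lists hold segment locations instead of file IDs. In the sketch below, the `IndirectIndex` class, the `(partition, offset, length)` location tuple, the JSON encoding, and the in-memory `BytesIO` standing in for a disk are all assumptions for illustration; the slides do not specify an on-disk format.

```python
import io
import json


class IndirectIndex:
    """Dictionary whose posting lists store where each partition's segment lives."""

    def __init__(self) -> None:
        self.disk = io.BytesIO()  # stand-in for on-disk segment storage
        # keyword -> list of (partition root, byte offset, byte length)
        self.dictionary = {}

    def write_segment(self, word: str, partition: str, paths: list[str]) -> None:
        """Append one posting list segment and record its location."""
        data = json.dumps(sorted(paths)).encode("utf-8")
        offset = self.disk.seek(0, io.SEEK_END)  # append at the end
        self.disk.write(data)
        self.dictionary.setdefault(word, []).append((partition, offset, len(data)))

    def partitions_for(self, word: str) -> list[str]:
        """Partitions that contain `word`, answered from the dictionary alone."""
        return [partition for partition, _, _ in self.dictionary.get(word, [])]

    def read_segment(self, word: str, partition: str) -> list[str]:
        """Fetch a single segment by following its stored location."""
        for part, offset, length in self.dictionary.get(word, []):
            if part == partition:
                self.disk.seek(offset)
                return json.loads(self.disk.read(length))
        return []


idx = IndirectIndex()
idx.write_segment("search", "/home/john", ["/home/john/thesis/ch1.txt"])
idx.write_segment("search", "/proj/scidac", ["/proj/scidac/src/query.c"])
print(idx.partitions_for("search"))
print(idx.read_segment("search", "/proj/scidac"))
```

The point of the indirection is that `partitions_for` never touches a segment, so a query can decide which partitions are worth reading before doing any segment I/O.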

Other possible extensions
• Security
  • Eliminate restricted sub-trees from search space
  • No extra space required and reduces permission checks
• Ranking
  • Utilize namespace locality to improve search result ranking
  • Employ different ranking algorithms for different sub-trees
• Cost efficiency
  • Exploit Zipf-like sub-tree query patterns
  • Compress or migrate rarely searched sub-tree segments to lower-tier storage
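The security extension might look like the sketch below, under the strong assumption that access can be decided once per partition. The `partition_acl` table, the user names, and `secure_search` are hypothetical; a real file system would consult its own permission model rather than a separate table.

```python
# partition root -> users allowed to search it
# (hypothetical partition-level ACL, for illustration only)
partition_acl = {
    "/home/john": {"john"},
    "/home/jim": {"jim"},
    "/proj/scidac": {"john", "jim"},
}

# keyword -> {partition root -> posting list segment (file paths)}
segments = {
    "search": {
        "/home/john": ["/home/john/thesis/ch1.txt", "/home/john/thesis/ch2.txt"],
        "/home/jim": ["/home/jim/notes.txt"],
        "/proj/scidac": ["/proj/scidac/src/query.c"],
    },
}


def secure_search(word: str, user: str) -> tuple[list[str], int]:
    """Return the files `user` may see, plus the number of permission checks."""
    hits, checks = [], 0
    for part, segment in segments.get(word, {}).items():
        checks += 1  # one check per partition instead of one per matching file
        if user not in partition_acl.get(part, set()):
            continue  # restricted sub-tree: dropped from the search space
        hits.extend(segment)
    return sorted(hits), checks


print(secure_search("search", "jim"))
```

For user `jim` this performs three partition-level checks and never reads the `/home/john` segment, whereas filtering results file by file would check all four matching files.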

Current and future work
• We are currently working on...
  • Collecting and analyzing keyword data sets
    • Crawl real-world large-scale file systems
    • No current file system search keyword collections exist
  • Completing the index and algorithm designs
  • Implementation and evaluation within the Ceph petascale file system
    • Allows realistic integration and benchmarking

Thank you!
• Thanks to:
  • Minglong Shao, Timothy Bisson, Shankar Pasupathy and NetApp’s ATG
  • SSRC faculty and students
• Come see us at the poster session!
  • Spyglass: Fast, Scalable Metadata Search for Large-Scale Storage Systems
• Questions?
