Berkeley-DB for Text/Multimedia Retrieval
Chun Jin
Language Technologies Institute, School of Computer Science, Carnegie Mellon University

Motivation
• Recent advances in text/multimedia retrieval: good algorithms
• Scalability issue
• Continuous data growth
• Adding new search features
• Try: separating the scalability problem from the retrieval algorithms?

Our Goal
• Providing a library for application/system building based on Berkeley DB.
• System prototyping.

Text Retrieval
• Vector Space Model (VSM) for text:
  • D = {t1, t2, …, tm}
  • Q = {t1, t2, …, tm}
  • Sim(D, Q) = cos(D, Q)
• To scale: inverted index
  t1 -> … (DID, TF, POS1, POS2, …) …
  t2 -> …
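The inverted index and the cosine similarity on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the library's implementation: the function names `build_inverted_index` and `cosine_scores`, the whitespace tokenizer, and the use of raw term frequencies as vector weights are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict

def build_inverted_index(docs):
    """Map each term to postings of (doc_id, tf, positions), as on the slide."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        positions = defaultdict(list)
        for pos, term in enumerate(docs[doc_id].lower().split()):
            positions[term].append(pos)
        for term, pos_list in positions.items():
            index[term].append((doc_id, len(pos_list), pos_list))
    return index

def cosine_scores(index, docs, query):
    """Score only documents that share a term with the query (raw-TF cosine)."""
    q_tf = Counter(query.lower().split())
    dot = defaultdict(float)
    for term, q_weight in q_tf.items():
        for doc_id, tf, _positions in index.get(term, []):
            dot[doc_id] += q_weight * tf
    q_norm = math.sqrt(sum(w * w for w in q_tf.values()))
    scores = {}
    for doc_id, product in dot.items():
        # Assumption: document norms are recomputed here for brevity;
        # a real index would store them at indexing time.
        d_tf = Counter(docs[doc_id].lower().split())
        d_norm = math.sqrt(sum(w * w for w in d_tf.values()))
        scores[doc_id] = product / (d_norm * q_norm)
    return scores

docs = {1: "berkeley db index", 2: "image retrieval index index"}
index = build_inverted_index(docs)
scores = cosine_scores(index, docs, "index")
```

The point of the index is that scoring touches only the postings of the query terms, rather than every document vector.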

Image Retrieval
• Feature space:
  • M = {f1, f2, …, fm}
  • Q = {f1, f2, …, fm}
  • Distance(M, Q) = ||M - Q||
• To scale: quantization, then index
  f_1 -> …
  f_l -> … (MID, f_lm) …, where |f_lQ - f_lm| < δ
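One way to read "quantization, then index" is sketched below: each feature value is quantized into a bucket, the bucket becomes the index key, and a query inspects only the buckets that can contain values with |f_lQ - f_lm| < δ. The bucket width `step`, the function names, and the choice to require a match on every feature are assumptions made for this example, not details given on the slide.

```python
import math
from collections import defaultdict

def build_feature_index(media, step):
    """Index (feature number l, quantized bucket) -> [(media_id, f_lm)]."""
    index = defaultdict(list)
    for mid, features in media.items():
        for l, value in enumerate(features):
            index[(l, math.floor(value / step))].append((mid, value))
    return index

def candidates(index, query, step, delta):
    """Media ids whose every feature satisfies |f_lQ - f_lm| < delta."""
    result = None
    for l, q_value in enumerate(query):
        lo = math.floor((q_value - delta) / step)
        hi = math.floor((q_value + delta) / step)
        matched = set()
        for bucket in range(lo, hi + 1):
            for mid, value in index.get((l, bucket), []):
                if abs(q_value - value) < delta:
                    matched.add(mid)
        # Assumption: features are combined with AND semantics.
        result = matched if result is None else result & matched
    return result or set()

def distance(m, q):
    """Distance(M, Q) = ||M - Q||, computed only for the surviving candidates."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(m, q)))

media = {"a": [0.10, 0.90], "b": [0.15, 0.20], "c": [0.80, 0.85]}
feature_index = build_feature_index(media, step=0.1)
hits = candidates(feature_index, [0.12, 0.88], step=0.1, delta=0.1)
```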

Retrieval Algorithm?
1. Get feature entries
2. Compute feature-level similarities
3. Compute document-query similarities
Q = {TF_t1, TF_t2, …, TF_tm}
  t1 -> … (DID, TF) …
  t2 -> …

Retrieval Algorithm
1. Get feature entries: Berkeley DB
   • BTree/Hash indexing
   • Storage/buffer management
2. Compute feature-level similarities
3. Compute document-query similarities: Join
   • AND/OR (Inner/Outer)
   • Join methods
   • Callback to compute Step 2
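The three steps might fit together roughly as follows. This is a sketch under stated assumptions: a plain Python dict stands in for the Berkeley DB lookup, and `retrieve`, `feature_sim`, and `mode` are hypothetical names. Step 2 is passed in as a callback, and the AND/OR switch mirrors the inner/outer join distinction.

```python
def retrieve(store, query_tf, feature_sim, mode="OR"):
    """Three-step retrieval over a term -> [(doc_id, tf)] store.

    store:       dict standing in for the Berkeley DB key lookup (assumption).
    feature_sim: callback(query_tf_value, posting_tf) -> float, i.e. step 2.
    mode:        "AND" keeps documents matching every query term (inner join);
                 "OR" keeps documents matching any query term (outer join).
    """
    scores = {}
    matched_terms = {}
    for term, q_tf in query_tf.items():
        postings = store.get(term, [])                      # step 1
        for doc_id, tf in postings:
            sim = feature_sim(q_tf, tf)                     # step 2 (callback)
            scores[doc_id] = scores.get(doc_id, 0.0) + sim  # step 3
            matched_terms[doc_id] = matched_terms.get(doc_id, 0) + 1
    if mode == "AND":
        scores = {d: s for d, s in scores.items()
                  if matched_terms[d] == len(query_tf)}
    # Highest score first; ties broken by document id.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

store = {"berkeley": [(1, 2), (3, 1)], "index": [(1, 1), (2, 4)]}
query = {"berkeley": 1, "index": 1}
product = lambda q_tf, tf: float(q_tf * tf)
any_term = retrieve(store, query, product, mode="OR")
all_terms = retrieve(store, query, product, mode="AND")
```

Because the similarity is a callback, the join logic does not need to know how a posting's payload is interpreted.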

System Architecture
[Architecture diagram. Components: Data, Query, Preprocessor, Search Engine, Similarity Calculator, Inverted Indexer, Results. Berkeley DB: Key Indexer, Storage Manager, Inverted Index.]

Development Layers
• Retrieval Application: Feature Extraction, Similarity Measures
• Retrieval API: Join Methods, Iterator, Inverted Index Formatter
• Berkeley DB API: Key Indexer Lib (BTree/Hash, etc.), Storage Management

Berkeley DB vs. General DBMS
Columns: Task, BDB, DBMS
• Indexing techniques: √ √ √
• Storage management: √ √ √
• Operation: Join: √ √
• Developer’s API: √ √
• SQL: X √
• Transaction management: X √
• Recovery management: X √

Merge Join
    List MergeJoin(List left, List right, Feature Qrfeature)
        while (not left.end() and not right.end())
            lpair = left.current; rpair = right.current;
            if lpair.key = rpair.key
                FeatureSim v = Qrfeature.Sim(rpair.data);    // feature sim
                lpair.data = DocSim(lpair.data, v);          // doc sim
                left = left.next(); right = right.next();
            else if lpair.key < rpair.key
                left = left.next();
            else
                right = right.next();
        return left;
Basic algorithm adapted from Wikipedia
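A runnable Python rendering of the merge join is sketched below. It departs from the slide's pseudocode in one labeled way: instead of updating `left` in place and returning it, it collects matched pairs into a new list, which gives inner-join (AND) semantics. The parameter names `feature_sim` and `doc_sim` are stand-ins for the slide's `Qrfeature.Sim` and `DocSim`.

```python
def merge_join(left, right, feature_sim, doc_sim):
    """Sort-merge inner join of two posting lists sorted by doc id.

    left:        [(doc_id, partial_score)] accumulated so far.
    right:       [(doc_id, data)] postings for the next query feature.
    feature_sim: callback(data) -> feature-level similarity.
    doc_sim:     callback(partial_score, feature_value) -> updated score.
    """
    result = []
    i = j = 0
    while i < len(left) and j < len(right):
        l_key, l_data = left[i]
        r_key, r_data = right[j]
        if l_key == r_key:
            v = feature_sim(r_data)                      # feature sim
            result.append((l_key, doc_sim(l_data, v)))   # doc sim
            i += 1
            j += 1
        elif l_key < r_key:
            i += 1    # left doc has no match in right: skip it
        else:
            j += 1    # right doc has no match in left: skip it
    return result

left = [(1, 5), (2, 2), (4, 1)]
right = [(2, 3), (3, 1), (4, 2)]
joined = merge_join(left, right,
                    feature_sim=lambda tf: tf,
                    doc_sim=lambda score, v: score + v)
```

Both inputs must be sorted by document id; each list is then scanned once, so the join is linear in the combined length.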

Information Encapsulation
• BDB: index key and entry boundary
• Iterator: sub-entry boundary
• Join: docID key and the rest of data
• Similarity function: data structure
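The division of knowledge among the four layers can be illustrated with a small sketch. The byte layout here (each sub-entry is a little-endian `(doc_id, tf)` pair of unsigned 32-bit integers) is purely hypothetical, as are the function names; a dict of bytes stands in for Berkeley DB.

```python
import struct

# Hypothetical sub-entry layout: (doc_id, tf) as two little-endian uint32s.
SUB_ENTRY = struct.Struct("<II")

def pack_entry(postings):
    """Serialize postings into one opaque entry value for the store."""
    return b"".join(SUB_ENTRY.pack(doc_id, tf) for doc_id, tf in postings)

def iter_sub_entries(entry):
    """Iterator layer: knows only where each sub-entry starts and ends."""
    for offset in range(0, len(entry), SUB_ENTRY.size):
        yield entry[offset:offset + SUB_ENTRY.size]

def split_key(sub_entry):
    """Join layer: knows only the docID key and 'the rest of the data'."""
    (doc_id,) = struct.unpack_from("<I", sub_entry)
    return doc_id, sub_entry[4:]

def tf_similarity(payload):
    """Similarity layer: the only code that interprets the payload."""
    (tf,) = struct.unpack("<I", payload)
    return float(tf)

# Store layer: knows only the index key and the entry boundary.
store = {b"t1": pack_entry([(7, 2), (9, 5)])}
decoded = [(doc_id, tf_similarity(rest))
           for doc_id, rest in map(split_key, iter_sub_entries(store[b"t1"]))]
```

Changing the payload (for example, adding term positions) would then touch only the packing code and the similarity function.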

Flexibility
• Feature similarity:
  • Term positions for proximity search
  • Weighted link information
  • Meta data adjustment
• Document-Query similarity:
  • Cosine
  • Euclidean
  • Probabilistic
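As one example of a pluggable feature similarity, the sketch below uses term positions for proximity search: it finds the smallest gap between occurrences of two terms in a document and turns it into a score. The scoring formula (reciprocal of the gap) and the function names are assumptions chosen for illustration.

```python
def min_distance(positions_a, positions_b):
    """Smallest gap between any position of term A and any of term B.

    Both lists must be sorted in ascending order. Returns None if either
    list is empty.
    """
    i = j = 0
    best = None
    while i < len(positions_a) and j < len(positions_b):
        gap = abs(positions_a[i] - positions_b[j])
        if best is None or gap < best:
            best = gap
        # Advance the pointer that sits at the smaller position.
        if positions_a[i] < positions_b[j]:
            i += 1
        else:
            j += 1
    return best

def proximity_sim(positions_a, positions_b):
    """Hypothetical score: closer occurrences give a higher similarity."""
    gap = min_distance(positions_a, positions_b)
    return 0.0 if gap is None else 1.0 / max(gap, 1)

gap = min_distance([1, 10, 20], [13, 30])
score = proximity_sim([1, 10, 20], [13, 30])
```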

Ongoing Work
• Inverted Index Structure Design
• Implementing join methods and iterators
• Inverted Indexer
• Similarity functions and feature extraction

Related Work
• Commercial Systems:
  • Google
  • Endeca
  • Oracle Text DB/Multimedia DB
  • IBM Net Search Extender
  • Thunderstone Texis
  • YouTube
• Research:
  • CMU
  • Stanford [Su & Widom IDEAS05]

Conclusion
• Problem:
  • Scalability issue on text/multimedia retrieval.
• Idea:
  • Separating the problem from retrieval algorithms.
  • Layered architecture.
• Goal:
  • Providing a library for application/system building.
  • Prototyping.

Acknowledgements
• Thanks to Minglong Shao (CMU) and Zhu Liu (AT&T) for helpful discussions.
• Thanks to Jaime Carbonell (CMU) for his continuous support and encouragement.
