Fast Bag-Of-Words Candidate Selection in Content-Based Instance Retrieval Systems Michał Siedlaczek 1 Qi Wang 1 Yen-Yu Chen 2 Torsten Suel 1 1 Department of Computer Science and Engineering Tandon School of Engineering New York University 2 Blippar Inc. December 12, 2018
Introduction
Problem Statement ◮ Given a database of different types of images ◮ Point phone camera at an object ◮ Recognize it by finding its instance in the database ◮ Implemented as part of an Augmented Reality application ◮ General search in a broad domain
Content-Based Instance Retrieval ◮ Given a picture, return its matching instance from the database ◮ Bag-of-words retrieval 1. Extract descriptors that are robust against rotation, scaling, etc. ◮ Convolutional Neural Networks (CNN) [Zheng 2017] ◮ Scale-Invariant Feature Transform (SIFT) [Lowe 1999] 2. Translate the feature set into visual words (see the sketch below) 3. Use standard text search techniques to find candidates 4. Rerank using a complex scoring method
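A minimal sketch of step 2 (visual-word quantization), assuming a codebook of centroids has already been trained offline and descriptors have already been extracted. All names and the nearest-centroid strategy are illustrative assumptions, not the system's actual quantizer.

```cpp
// Sketch: quantizing local descriptors into visual words by nearest-centroid
// assignment against a pre-trained codebook (illustrative only).
#include <cstddef>
#include <limits>
#include <vector>

using Descriptor = std::vector<float>;  // e.g., a 128-dim SIFT descriptor

// Squared Euclidean distance between a descriptor and a codebook centroid.
float squared_distance(const Descriptor& a, const Descriptor& b) {
    float d = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) {
        float diff = a[i] - b[i];
        d += diff * diff;
    }
    return d;
}

// Map each descriptor to the ID of its closest centroid (its "visual word").
// The resulting bag of visual-word IDs is then used like a text query.
std::vector<int> quantize(const std::vector<Descriptor>& descriptors,
                          const std::vector<Descriptor>& codebook) {
    std::vector<int> visual_words;
    visual_words.reserve(descriptors.size());
    for (const auto& d : descriptors) {
        int best = 0;
        float best_dist = std::numeric_limits<float>::max();
        for (std::size_t c = 0; c < codebook.size(); ++c) {
            float dist = squared_distance(d, codebook[c]);
            if (dist < best_dist) { best_dist = dist; best = static_cast<int>(c); }
        }
        visual_words.push_back(best);
    }
    return visual_words;
}
```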
Inverted Index
Document Retrieval 1. Lists for query terms used to find matching documents 2. Matching documents scored to find top N candidates 3. Candidates re-ranked by a complex ranker (e.g., DNN or ML model) [Liu 2009, Wang 2010] 4. Top k < N results returned to user
Document Retrieval Our work: ◮ Queries are pictures ◮ SIFT-generated descriptors translated to visual-word queries ◮ Partial scores stored in index and added up at query time
Scored Inverted Index
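A minimal sketch of the scored inverted index described on the previous slide: each posting carries a precomputed partial score, so query processing only sums partial scores per document. Type names and the hash-map layout are illustrative assumptions.

```cpp
// Sketch: a scored inverted index. Each posting carries a precomputed partial
// score (impact), so query processing only adds partial scores per document
// instead of evaluating a ranking function at query time.
#include <cstdint>
#include <unordered_map>
#include <vector>

struct Posting {
    std::uint32_t doc_id;  // document (database image) identifier
    float score;           // precomputed partial score of this term in this document
};

// One posting list per visual word, typically sorted by doc_id.
using PostingList = std::vector<Posting>;

// Maps a visual-word (term) ID to its posting list.
using ScoredInvertedIndex = std::unordered_map<std::uint32_t, PostingList>;
```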
Text Retrieval Algorithms Exhaustive query processing ◮ Term at a time (TAAT) ◮ Document at a time (DAAT) ◮ Score at a time (SAAT)
Term at a Time
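As a complement to the slide's illustration, a minimal term-at-a-time sketch over a scored index. The dense accumulator array and top-k selection shown here are illustrative assumptions.

```cpp
// Sketch: term-at-a-time (TAAT) processing. Posting lists are processed one
// query term at a time; partial scores are summed into an accumulator array
// indexed by document ID, and the k highest accumulators are returned.
#include <algorithm>
#include <cstdint>
#include <vector>

struct Posting { std::uint32_t doc_id; float score; };
using PostingList = std::vector<Posting>;

std::vector<std::uint32_t> taat_top_k(const std::vector<PostingList>& query_lists,
                                      std::uint32_t num_docs, std::size_t k) {
    std::vector<float> acc(num_docs, 0.0f);          // one accumulator per document
    for (const PostingList& list : query_lists) {    // one list (term) at a time
        for (const Posting& p : list) {
            acc[p.doc_id] += p.score;                // add the precomputed partial score
        }
    }
    // Select the k documents with the highest accumulated scores.
    std::vector<std::uint32_t> docs(num_docs);
    for (std::uint32_t d = 0; d < num_docs; ++d) docs[d] = d;
    std::size_t cut = std::min<std::size_t>(k, docs.size());
    std::partial_sort(docs.begin(), docs.begin() + cut, docs.end(),
                      [&acc](std::uint32_t a, std::uint32_t b) { return acc[a] > acc[b]; });
    docs.resize(cut);
    return docs;
}
```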
Document at a Time
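Likewise, a minimal document-at-a-time sketch. With hundreds of query terms, picking the list with the smallest current document ID (here via a min-heap over list cursors) is the per-posting overhead measured later; the heap-based merge is an illustrative assumption.

```cpp
// Sketch: document-at-a-time (DAAT) processing. All query posting lists are
// traversed in parallel; at each step the cursor with the smallest current
// document ID is advanced, and all lists positioned on that document are
// consumed to produce its full score.
#include <cstdint>
#include <queue>
#include <utility>
#include <vector>

struct Posting { std::uint32_t doc_id; float score; };
using PostingList = std::vector<Posting>;

struct Cursor {
    const PostingList* list;
    std::size_t pos;
    std::uint32_t current_doc() const { return (*list)[pos].doc_id; }
};

struct CursorGreater {
    bool operator()(const Cursor& a, const Cursor& b) const {
        return a.current_doc() > b.current_doc();
    }
};

// Returns (doc_id, score) pairs for every matching document, fully scored.
std::vector<std::pair<std::uint32_t, float>> daat_scores(
        const std::vector<PostingList>& query_lists) {
    std::priority_queue<Cursor, std::vector<Cursor>, CursorGreater> heap;
    for (const PostingList& list : query_lists) {
        if (!list.empty()) heap.push(Cursor{&list, 0});
    }
    std::vector<std::pair<std::uint32_t, float>> results;
    while (!heap.empty()) {
        std::uint32_t doc = heap.top().current_doc();
        float score = 0.0f;
        // Consume every list currently positioned on `doc`, accumulating its score.
        while (!heap.empty() && heap.top().current_doc() == doc) {
            Cursor c = heap.top();
            heap.pop();
            score += (*c.list)[c.pos].score;
            if (++c.pos < c.list->size()) heap.push(c);
        }
        results.emplace_back(doc, score);
    }
    return results;
}
```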
Score at a Time
Safe Dynamic Pruning Non-exhaustive processing ◮ Threshold Algorithm [Fagin 2001] ◮ Well-known algorithm used in databases ◮ MaxScore [Turtle 1995] ◮ Partitions terms/lists into essential and non-essential ◮ WAND [Broder 2003] (and variations) ◮ Finds a pivot: a document to which all lists can be skipped without missing any top-k document (see the sketch below)
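A minimal sketch of WAND's pivot-finding step, assuming each list stores a precomputed max-score upper bound and that cursors are re-sorted by current document ID before each pivot search. This is an illustration of the idea, not the exact implementation evaluated in the talk.

```cpp
// Sketch: WAND pivot selection. Cursors are sorted by their current doc ID;
// per-list upper bounds (max scores) are summed in that order until they
// exceed the current top-k threshold. The doc ID of the list that crosses the
// threshold is the pivot: every earlier cursor may be skipped forward to it
// without missing any document that could still enter the top k.
#include <algorithm>
#include <cstdint>
#include <optional>
#include <vector>

struct WandCursor {
    std::uint32_t current_doc;  // current doc ID of this posting list cursor
    float max_score;            // precomputed upper bound for this list
};

std::optional<std::uint32_t> find_pivot(std::vector<WandCursor>& cursors,
                                        float threshold) {
    std::sort(cursors.begin(), cursors.end(),
              [](const WandCursor& a, const WandCursor& b) {
                  return a.current_doc < b.current_doc;
              });
    float bound = 0.0f;
    for (const WandCursor& c : cursors) {
        bound += c.max_score;
        if (bound > threshold) return c.current_doc;  // pivot found
    }
    return std::nullopt;  // no remaining document can beat the threshold
}
```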
Data Analysis
Data Analysis Objective Gain a better understanding of how the quantitative properties of a bag-of-visual-words corpus and index impact query efficiency. Data Set Comparison ◮ BoVW ◮ subset of Blippar's production BoVW collection ◮ sampled production queries ◮ ClueWeb09-B ◮ standard IR text corpus ◮ TREC 06-09 Web Query Track topics
Data Analysis 1: Query Lengths Average query length: BoVW 272 terms; ClueWeb09-B 2.7 terms Significance ◮ Large overhead of selecting the next posting list during processing in BoVW ◮ DAAT methods slow down significantly
Data Analysis 2: Posting List Lengths [Figure: histograms of posting list lengths for ClueWeb09-B (log-scale axis) and BoVW; dashed lines mark the means (674.1 and 172.72)]
Data Analysis 3: Posting List Max Scores [Figure: histograms of posting list max scores for ClueWeb09-B and BoVW; dashed lines mark the means (14.5 and 142.72)]
Data Analysis 4: Length/Max-Score Correlation ◮ ClueWeb09-B ◮ strong negative correlation (-0.66) ◮ Inverse Document Frequency: common words are penalized by scoring functions ◮ BoVW ◮ almost no correlation (0.06) Significance Potentially less advantage for dynamic pruning methods such as MaxScore.
Data Analysis 5: Query Term Footprint Query Term Footprint The fraction of the query terms actually contained in the average top-k result. ClueWeb09-B ◮ 60%–95% depending on the queries BoVW ◮ 1.1% for production queries ◮ Conjunctive query processing is infeasible ◮ Negative impact on MaxScore-style algorithms: few non-essential lists to skip
Data Analysis 6: Index Size ClueWeb09-B ◮ 50 million documents ◮ billions of documents in real-world collections BoVW ◮ 2.6 million documents ◮ about an order of magnitude more in production ◮ far fewer documents than most large text collections
Data Analysis 7: Accumulator Sparsity ClueWeb09-B ◮ ~15% of documents have non-zero scores BoVW ◮ ~8% of documents have non-zero scores ◮ potential to improve score accumulation and aggregation in TAAT processing
DAAT v TAAT
DAAT v TAAT Results on BoVW [Figure: average query latency (ms) on BoVW for TAAT and DAAT] ◮ ~75% of DAAT instructions are spent selecting the next posting list
DAAT v TAAT: Query Lengths [Figure: query latency (ms) for DAAT and TAAT across query length ranges from 1-10 to 191-200 terms]
TAAT Optimizations
TAAT Optimizations: Aggregation (A) ◮ Keep the maximum accumulator value of each block while traversing ◮ Before aggregating a block, check whether its max exceeds the current top-k threshold (see the sketch below)
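A minimal sketch of how this aggregation optimization might look, assuming a fixed block size and a min-heap of the current top-k scores; block size, field names, and the heap-based selection are illustrative assumptions.

```cpp
// Sketch of the aggregation optimization (A): accumulators are split into
// fixed-size blocks and each block's running maximum is maintained during
// traversal; at aggregation time a block is scanned only if its maximum can
// still beat the current k-th best score.
#include <algorithm>
#include <cstdint>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

constexpr std::size_t kBlockSize = 4096;  // illustrative block size

struct BlockedAccumulators {
    std::vector<float> acc;        // one accumulator per document
    std::vector<float> block_max;  // running max per block, updated on writes

    explicit BlockedAccumulators(std::size_t num_docs)
        : acc(num_docs, 0.0f),
          block_max((num_docs + kBlockSize - 1) / kBlockSize, 0.0f) {}

    void add(std::uint32_t doc, float partial_score) {
        acc[doc] += partial_score;
        float& m = block_max[doc / kBlockSize];
        m = std::max(m, acc[doc]);
    }

    // Collect the top-k (doc, score) pairs, skipping blocks that cannot contribute.
    std::vector<std::pair<std::uint32_t, float>> top_k(std::size_t k) const {
        // Min-heap of (score, doc) holding the current best k entries.
        std::priority_queue<std::pair<float, std::uint32_t>,
                            std::vector<std::pair<float, std::uint32_t>>,
                            std::greater<>> heap;
        for (std::size_t b = 0; b < block_max.size(); ++b) {
            if (heap.size() == k && block_max[b] <= heap.top().first) continue;  // skip block
            std::size_t begin = b * kBlockSize;
            std::size_t end = std::min(begin + kBlockSize, acc.size());
            for (std::size_t d = begin; d < end; ++d) {
                if (heap.size() < k) {
                    heap.emplace(acc[d], static_cast<std::uint32_t>(d));
                } else if (acc[d] > heap.top().first) {
                    heap.pop();
                    heap.emplace(acc[d], static_cast<std::uint32_t>(d));
                }
            }
        }
        std::vector<std::pair<std::uint32_t, float>> result;
        while (!heap.empty()) {
            result.emplace_back(heap.top().second, heap.top().first);
            heap.pop();
        }
        std::reverse(result.begin(), result.end());  // highest score first
        return result;
    }
};
```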
TAAT Optimizations: Prefetch (P) ◮ ~50% of accumulator accesses miss the L1 cache ◮ We hint the CPU to prefetch accumulators ahead of time ◮ Additionally, we hint that an accumulator can be evicted right after the write instruction (see the sketch below)
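A minimal sketch of the prefetch idea, assuming GCC/Clang builtins: the accumulator of a posting a few positions ahead is prefetched for writing, and the locality hint 0 indicates the line has little temporal locality and may be evicted soon. The look-ahead distance and structure are illustrative assumptions, not the exact production code.

```cpp
// Sketch of the prefetch optimization (P) using __builtin_prefetch.
#include <cstddef>
#include <cstdint>
#include <vector>

struct Posting { std::uint32_t doc_id; float score; };

constexpr std::size_t kPrefetchDistance = 8;  // illustrative look-ahead

void accumulate_with_prefetch(const std::vector<Posting>& list,
                              std::vector<float>& acc) {
    for (std::size_t i = 0; i < list.size(); ++i) {
        if (i + kPrefetchDistance < list.size()) {
            // rw = 1: prefetch for a write; locality = 0: may be evicted soon after use.
            __builtin_prefetch(&acc[list[i + kPrefetchDistance].doc_id], 1, 0);
        }
        acc[list[i].doc_id] += list[i].score;  // the actual accumulator update
    }
}
```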
TAAT Optimizations: Accumulator Initialization (I) ◮ A cyclic query counter q of size m; each accumulator stores the counter value q_a of the query that last wrote it ◮ During traversal, if q_a < q, the accumulator is overwritten and q_a ← q ◮ Otherwise, we add to the accumulator ◮ When q wraps around to 0, we erase all accumulators before traversal (see the sketch below)
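A minimal sketch of this initialization scheme, assuming a small 8-bit counter and a struct-of-(score, counter) layout; the counter width, field layout, and class names are illustrative assumptions.

```cpp
// Sketch of the accumulator-initialization optimization (I). Each accumulator
// stores the counter value q_a of the last query that wrote it, so stale
// entries are simply overwritten instead of clearing the whole array for every
// query; a full clear happens only once every m queries, when the cyclic
// counter wraps to 0.
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::uint8_t kCounterSize = 8;  // m: full clear every m queries

struct Accumulator {
    float score = 0.0f;
    std::uint8_t q_a = 0;  // counter value of the query that last wrote this entry
};

class AccumulatorArray {
public:
    explicit AccumulatorArray(std::size_t num_docs) : acc_(num_docs) {}

    // Called once at the start of each query.
    void begin_query() {
        q_ = static_cast<std::uint8_t>((q_ + 1) % kCounterSize);
        if (q_ == 0) {
            // Counter wrapped around: stale entries can no longer be detected,
            // so erase everything before this query's traversal.
            std::fill(acc_.begin(), acc_.end(), Accumulator{});
        }
    }

    void add(std::uint32_t doc, float partial_score) {
        Accumulator& a = acc_[doc];
        if (a.q_a < q_) {        // stale value from an earlier query: overwrite
            a.score = partial_score;
            a.q_a = q_;
        } else {                 // already written during this query: accumulate
            a.score += partial_score;
        }
    }

    float score(std::uint32_t doc) const { return acc_[doc].score; }

private:
    std::vector<Accumulator> acc_;
    std::uint8_t q_ = 0;  // the cyclic query counter q
};
```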
TAAT Optimizations [Figure: query latency (ms) for plain TAAT and for the cumulative optimizations A, A+P, and A+P+I]
Early Termination
Safe Early Termination ◮ We analyzed the mechanics behind safe early-termination techniques: ◮ Threshold Algorithm ◮ WAND ◮ MaxScore ◮ Our data shows that these techniques are ineffective in this setting
Safe Early Termination Threshold Algorithm On average, the stopping condition is met only after processing 98% of the postings. MaxScore Given the real final threshold, 97% of the terms (98% of the postings) are essential on average. WAND Almost 80% of the postings have to be visited on average, and over 70% have to be evaluated.
Unsafe Score at a Time [Figure: N-S as a function of the percentage of processed postings]
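For context on the plot above, a minimal sketch of unsafe score-at-a-time processing with a posting budget. It assumes postings from all query lists are visited in decreasing order of partial score and that traversal simply stops after a fixed fraction of postings; the query-time sort (instead of pre-built impact-ordered lists) and all names are illustrative assumptions.

```cpp
// Sketch: unsafe score-at-a-time (SAAT) processing. Postings are visited in
// decreasing order of their precomputed partial score, and traversal stops
// after a fixed fraction of postings. Results are approximate (unsafe), but
// quality degrades gracefully as the budget grows.
#include <algorithm>
#include <cstdint>
#include <unordered_map>
#include <vector>

struct Posting { std::uint32_t doc_id; float score; };

std::unordered_map<std::uint32_t, float> saat_with_budget(
        const std::vector<std::vector<Posting>>& query_lists,
        double budget_fraction) {  // e.g., 0.2 = process 20% of the postings
    // Merge all postings of the query and sort by decreasing partial score.
    std::vector<Posting> postings;
    for (const auto& list : query_lists)
        postings.insert(postings.end(), list.begin(), list.end());
    std::sort(postings.begin(), postings.end(),
              [](const Posting& a, const Posting& b) { return a.score > b.score; });

    std::size_t budget = static_cast<std::size_t>(
        budget_fraction * static_cast<double>(postings.size()));
    std::unordered_map<std::uint32_t, float> acc;
    for (std::size_t i = 0; i < budget && i < postings.size(); ++i) {
        acc[postings[i].doc_id] += postings[i].score;  // highest-impact postings first
    }
    return acc;
}
```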
Conclusions ◮ CBIR bag-of-words collections and queries differ substantially from textual ones ◮ This impacts the efficiency of known retrieval algorithms ◮ TAAT outperforms DAAT due to the long queries ◮ TAAT can be further optimized to neutralize its drawbacks ◮ The tested early-termination techniques fail in this type of scenario
Q&A
References
[Broder 2003] Broder, Carmel, Herscovici, Soffer, Zien. Efficient query evaluation using a two-level retrieval process.
[Fagin 2001] Fagin, Lotem, Naor. Optimal aggregation algorithms for middleware.
[Lowe 1999] Lowe. Object recognition from local scale-invariant features.
[Turtle 1995] Turtle, Flood. Query evaluation: strategies and optimizations.
[Zheng 2017] Zheng, Yang, Tian. SIFT meets CNN: A decade survey of instance retrieval.