Joint Visual-Text Modeling for Multimedia Retrieval (JHU CLSP)

Joint Visual-Text Modeling for Multimedia Retrieval. JHU CLSP Workshop 2004 Final Presentation, August 17, 2004. Team: undergraduate students Desislava Petkova (Mt. Holyoke) and Matthew Krause (Georgetown); Graduate Students


  1. Introduction to Relevance Models. Originally introduced for text retrieval and cross-lingual retrieval (Lavrenko and Croft, 2001; Lavrenko, Choquette and Croft, 2002). A formal approach to query expansion, and a nice way of introducing context in images without having to do so explicitly. This is done by computing the joint probability of images and words.

  2. Cross-Media Relevance Models (CMRM). Two parallel vocabularies, words and visterms, analogous to cross-lingual relevance models. Estimate the joint probabilities of words and visterms from training images: $P(c, d_v) = \sum_{J \in T} P(J)\, P(c \mid J) \prod_{i=1}^{|d_v|} P(v_i \mid J)$. (J. Jeon, V. Lavrenko and R. Manmatha, Automatic Image Annotation and Retrieval Using Cross-Media Relevance Models, in Proc. SIGIR'03.)
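
The joint probability above maps almost directly onto code. The sketch below is a minimal illustration, not the team's implementation: it assumes a hypothetical list of training images given as (annotation words, visterms) pairs, and uses the collection-smoothed estimates of P(w|J) and P(v|J) described later on slide 28; names such as `cmrm_joint`, `alpha` and `beta` are illustrative.

```python
from collections import Counter

def cmrm_joint(word, test_visterms, training_images, alpha=0.1, beta=0.1):
    """Sketch of P(w, d_v) = sum_J P(J) P(w|J) prod_i P(v_i|J)."""
    # Collection-level counts used for smoothing the per-image estimates.
    all_words, all_visterms = Counter(), Counter()
    for words, visterms in training_images:
        all_words.update(words)
        all_visterms.update(visterms)
    n_words = sum(all_words.values())
    n_visterms = sum(all_visterms.values())

    prior = 1.0 / len(training_images)            # uniform P(J)
    joint = 0.0
    for words, visterms in training_images:
        w_counts, v_counts = Counter(words), Counter(visterms)
        # Smoothed P(w|J): interpolate image counts with collection counts.
        p = prior * ((1 - alpha) * w_counts[word] / len(words)
                     + alpha * all_words[word] / n_words)
        for v in test_visterms:
            p *= ((1 - beta) * v_counts[v] / len(visterms)
                  + beta * all_visterms[v] / n_visterms)
        joint += p
    return joint
```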

  3. Continuous Relevance Models (CRM). A continuous version of the Cross-Media Relevance Model. Estimate P(v|J) using a kernel density estimate: $P(v \mid J) = \frac{1}{n} \sum_{i=1}^{n} K\!\left(\frac{v - v_{J_i}}{\beta}\right)$, where K is a Gaussian kernel and β is the bandwidth.
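
A minimal sketch of this kernel density estimate follows, assuming the visterms of image J are available as an n-by-d array of region feature vectors. The isotropic Gaussian normalization constant is one common choice; the slide only specifies a Gaussian kernel with bandwidth β.

```python
import numpy as np

def p_v_given_J(v, J_visterms, bandwidth=1.0):
    """Kernel density sketch of P(v|J): (1/n) sum_i K((v - v_Ji) / beta)."""
    v = np.asarray(v, dtype=float)
    J_visterms = np.asarray(J_visterms, dtype=float)
    n, d = J_visterms.shape
    diffs = (v - J_visterms) / bandwidth            # (v - v_Ji) / beta
    # Isotropic Gaussian kernel evaluated at each of the n training visterms.
    norm = (2 * np.pi) ** (d / 2) * bandwidth ** d
    kernels = np.exp(-0.5 * np.sum(diffs ** 2, axis=1)) / norm
    return kernels.mean()                           # average over the n kernels
```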

  4. Continuous Relevance Model. A generative model: concept words w_j are generated by an i.i.d. sample from a multinomial, and visterms v_i are generated by a multivariate (Gaussian) density.

  5. Normalized Continuous Relevance Models. Normalized CRM: pad annotations to a fixed length, then use the CRM. Similar to using a Bernoulli model (rather than a multinomial) for words. Accounts for length (similar to document length in text retrieval). (S. L. Feng, V. Lavrenko and R. Manmatha, Multiple Bernoulli Models for Image and Video Annotation, in CVPR'04; V. Lavrenko, S. L. Feng and R. Manmatha, Statistical Models for Automatic Video Annotation and Retrieval, in ICASSP'04.)

  6. Annotation Performance. On the Corel data set, mean average precision: CMRM 0.14, CRM 0.23, Normalized-CRM 0.26. Normalized-CRM works best.

  7. Annotation Examples (Corel set). [Example images with automatic annotations; the annotation words include: sky, train, railroad, tails, locomotive; cat, tiger, bengal, water, tree, forest; snow, fox, arctic, water; tree, plane, zebra, water, herd; mountain, plane, water, jet; birds, leaf, nest, water, sky.]

  8. Results: Relevance Model on the TREC Video Set. Model: normalized continuous relevance model. Features: color and texture; comparison experiments show that adding an edge feature gives only a very slight improvement. Annotation was evaluated on the development dataset: mean average precision 0.158.

  9. Annotation Performance on TREC

  10. Proposal: Using Dynamic Information for Video Retrieval. Presented by Shaolei Feng, University of Massachusetts, Amherst

  11. Motivation. Current models are based on single frames from each shot, but video is dynamic and has motion information. Use dynamic (motion) information for better image representations (segmentations) and to model events/actions.

  12. Why Dynamic Information? Model actions/events: many TRECVID 2003 queries require motion information, e.g. find shots of an airplane taking off, or find shots of a person diving into water. Motion is an important cue for retrieving actions/events, but using the optical flow over the entire image doesn't help; use motion features from objects instead. Better image representations: it is much easier to segment moving objects from the background than to segment static images.

  13. Problems with Still Images. Current approach: retrieve videos using static frames. Feature representations: visterms from keyframes; rectangular partition or static segmentation, poorly correlated with objects; features are color, texture, and edges. Problem: visterms are not well correlated with concepts.

  14. Better Visterms, Better Results. The model performs well on related tasks, e.g. retrieval of handwritten manuscripts: visterms are word images, features are computed over word images, and annotations are ASCII words (e.g. "you are to be particularly careful"). Segmentation of words is easier, and the visterms are better correlated with concepts. So can we extend the analogy to this domain?

  15. Segmentation Comparison. (a) Segmentation using only still-image information; (b) segmentation using only motion information. (Pictures from Patrick Bouthemy's website, INRIA.)

  16. Represent Shots, Not Keyframes. Shot boundary detection: use standard techniques. Segment moving objects, e.g. by finding outliers from the dominant (camera) motion. Visual features for the object and the background. Motion features for the object, e.g. trajectory information. Motion features for the background: camera pan, zoom, etc.
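
One possible way to realize "finding outliers from the dominant (camera) motion", sketched below under simplifying assumptions: dense optical flow is computed between two frames, the camera motion is approximated by a global median translation, and pixels whose flow disagrees with it are flagged. The proposal itself leaves the exact dominant-motion model open; this is only an illustration.

```python
import cv2
import numpy as np

def moving_object_mask(prev_frame, next_frame, threshold=3.0):
    """Flag pixels whose optical flow deviates from the dominant (camera) motion."""
    prev_gray = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
    next_gray = cv2.cvtColor(next_frame, cv2.COLOR_BGR2GRAY)
    flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    # Approximate the dominant motion by the per-component median flow vector.
    dominant = np.median(flow.reshape(-1, 2), axis=0)
    residual = np.linalg.norm(flow - dominant, axis=2)
    return residual > threshold     # True where motion disagrees with the camera
```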

  17. Models. One approach: modify the relevance model to include motion information, and probabilistically annotate shots in the test set. The image-only joint, $P(c, d_v) = \sum_{J \in T} P(J)\, P(c \mid J) \prod_{i=1}^{|d_v|} P(v_i \mid J)$, becomes $P(c, (d_v, d_m)) = \sum_{S \in T} P(S)\, P(c \mid S) \prod_{i} P(v_i \mid S)\, P(m_i \mid S)$, where T is the training set and S ranges over shots in the training set. Other models, e.g. HMMs, are also possible.
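
The motion-extended joint can be sketched in a few lines. This is only an illustration of the formula, not the proposed system: `shots` is a hypothetical list of training shots, and `p_concept(c, S)`, `p_visterm(v, S)`, `p_motion(m, S)` are caller-supplied estimators (e.g. the smoothed ML or kernel-density estimates of the previous slides).

```python
def joint_with_motion(concept, visterms, motion_feats, shots,
                      p_concept, p_visterm, p_motion):
    """Sketch of P(c, (d_v, d_m)) = sum_S P(S) P(c|S) prod_i P(v_i|S) P(m_i|S)."""
    prior = 1.0 / len(shots)                        # uniform P(S)
    total = 0.0
    for S in shots:
        p = prior * p_concept(concept, S)
        for v, m in zip(visterms, motion_feats):    # one visterm + motion feature per region
            p *= p_visterm(v, S) * p_motion(m, S)
        total += p
    return total
```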

  18. Estimation of P(v_i|S) and P(m_i|S). If the visterms are discrete, use smoothed maximum-likelihood estimates; if continuous, use kernel density estimates. Take advantage of repeated instances of the same object within a shot.

  19. Plan. Modify the models to include dynamic information. Train on the TRECVID03 development dataset and test on the TRECVID03 test dataset: annotate the test set, retrieve using the TRECVID 2003 queries, and evaluate retrieval performance using mean average precision.

  20. Score Normalization Experiments. Presented by Desislava Petkova

  21. Motivation for Score Normalization. Score probabilities are small, but there seems to be discriminating power; try to use likelihood ratios.

  22. Bayes Optimal Decision Rule. For a word w and score s, the likelihood ratio is $r(s) = \frac{P(w \mid s)}{P(\bar{w} \mid s)} = \frac{P(w)\, P(s \mid w)}{P(\bar{w})\, P(s \mid \bar{w})} = \frac{p(w)\, \mathrm{pdf}_w(s)}{p(\bar{w})\, \mathrm{pdf}_{\bar{w}}(s)}$, which gives the posterior $P(w \mid s) = \frac{r(s)}{1 + r(s)}$.

  23. Estimating Class-Conditional PDFs. For each word: divide the training images into positive and negative examples; create a model to describe the score distribution of each set (Gamma, Beta, Normal, or Lognormal); then revise the word probabilities.
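
A compact sketch of this rescoring idea, under the assumption that raw annotation scores for positive and negative training images are available as arrays: fit a class-conditional density to each set and convert a new score into a posterior via the likelihood ratio of the previous slide. Gamma densities are used here for concreteness; the slides also consider Beta, Normal and Lognormal fits.

```python
import numpy as np
from scipy import stats

def likelihood_ratio_rescoring(pos_scores, neg_scores, p_word, new_scores):
    """Fit score pdfs for word-present / word-absent images, then return P(w|s)."""
    pos_params = stats.gamma.fit(pos_scores)    # class-conditional pdf, word present
    neg_params = stats.gamma.fit(neg_scores)    # class-conditional pdf, word absent
    pdf_pos = stats.gamma.pdf(new_scores, *pos_params)
    pdf_neg = stats.gamma.pdf(new_scores, *neg_params)
    # Likelihood ratio r(s) and posterior P(w|s) = r(s) / (1 + r(s)).
    r = (p_word * pdf_pos) / ((1.0 - p_word) * pdf_neg + 1e-12)
    return r / (1.0 + r)
```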

  24. Annotation Performance. Score normalization did not improve annotation performance on Corel or TREC.

  25. Proposal: Using Clustering to Improve Concept Annotation. Desislava Petkova, Mount Holyoke College, 17 August 2004

  26. Automatically Annotating Images. Corel: 5000 images (4500 training, 500 testing); word vocabulary of 374 words; annotations of 1-5 words per image; image vocabulary of 500 visterms.

  27. Relevance Models for Annotation. A generative language modeling approach: for a test image I = {v_1, ..., v_m}, compute the joint distribution of each word w in the vocabulary with the visterms of I, comparing I with training images J annotated with w: $P(w, I) = \sum_{J \in T} P(J)\, P(w, I \mid J) = \sum_{J \in T} P(J)\, P(w \mid J) \prod_{i=1}^{m} P(v_i \mid J)$.

  28. Estimating P(w|J) and P(v|J). Use maximum-likelihood estimates, smoothed with the entire training set T: $P(w \mid J) = a\,\frac{c(w, J)}{|J|} + (1 - a)\,\frac{c(w, T)}{|T|}$ and $P(v \mid J) = b\,\frac{c(v, J)}{|J|} + (1 - b)\,\frac{c(v, T)}{|T|}$.

  29. Motivation. Estimating the relevance model of a single image is a noisy process: the P(v|J) visterm distributions are sparse, and the P(w|J) human annotations are incomplete. Use clustering to get better estimates.

  30. Potential Benefits of Clustering. Example annotations of images in one cluster: {cat, grass, tiger, water}, {cat, grass, tiger}, {grass, tiger, water}, {water}, {cat}, {cat, grass, tiger, tree}. Words in red (on the original slide) are missing in the annotation.

  31. Relevance Models with Clustering. Cluster the training images using K-means, using both visterms and annotations. Compute the joint distribution of visterms and words in each cluster, and use clusters instead of individual images: $P(w, I) = \sum_{C} P(C)\, P(w \mid C) \prod_{i=1}^{m} P(v_i \mid C)$.
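
A minimal sketch of the clustering step, assuming the training annotations and visterms are available as dense count matrices (one row per image); the feature construction and smoothing constants here are illustrative, not the proposal's actual choices.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_relevance_model(train_word_vecs, train_visterm_vecs, n_clusters=100):
    """Cluster images on concatenated word + visterm counts; build per-cluster models."""
    features = np.hstack([train_word_vecs, train_visterm_vecs])
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(features)

    priors, word_dists, visterm_dists = [], [], []
    for c in range(n_clusters):
        members = labels == c
        w = train_word_vecs[members].sum(axis=0) + 1e-9    # aggregated word counts
        v = train_visterm_vecs[members].sum(axis=0) + 1e-9  # aggregated visterm counts
        priors.append(members.mean())                       # P(C)
        word_dists.append(w / w.sum())                       # P(w|C)
        visterm_dists.append(v / v.sum())                    # P(v|C)
    return np.array(priors), np.array(word_dists), np.array(visterm_dists)
```

The cluster-level joint P(w, I) is then computed exactly as on the slide, with the sum ranging over clusters rather than individual training images.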

  32. Preliminary Results on Annotation Performance (mAP). Standard relevance model (4500 training examples): 0.14. Relevance model with clusters (100 training examples): 0.128.

  33. Cluster-Based Smoothing. Smooth the maximum-likelihood estimates for the training images based on the clusters they belong to: $P(w \mid J) = a_1\,\frac{c(w, J)}{|J|} + a_2\,\frac{c(w, C_J)}{|C_J|} + (1 - a_1 - a_2)\,\frac{c(w, T)}{|T|}$ and $P(v \mid J) = b_1\,\frac{c(v, J)}{|J|} + b_2\,\frac{c(v, C_J)}{|C_J|} + (1 - b_1 - b_2)\,\frac{c(v, T)}{|T|}$, where $C_J$ is the cluster containing J.
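
For reference, the word-side interpolation can be written as a few lines; this sketch assumes count dictionaries for the image, its cluster, and the whole training set, and the weights a1, a2 are illustrative (the proposal tunes them on a validation set). The visterm estimate is identical in form with weights b1, b2.

```python
def smoothed_p_w_given_J(w, image_counts, cluster_counts, collection_counts,
                         a1=0.6, a2=0.3):
    """Three-way interpolation: image counts, then cluster counts, then collection counts."""
    image_len = sum(image_counts.values())
    cluster_len = sum(cluster_counts.values())
    collection_len = sum(collection_counts.values())
    return (a1 * image_counts.get(w, 0) / image_len
            + a2 * cluster_counts.get(w, 0) / cluster_len
            + (1 - a1 - a2) * collection_counts.get(w, 0) / collection_len)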

  34. Experiments. Optimize the smoothing parameters by dividing the training set into 4000 training images and 500 validation images. Find the best set of clusters, consider query-dependent clusters, and investigate soft clustering.

  35. Evaluation Plan. Retrieval performance: average precision and recall for one-word queries, compared with the standard relevance model.

  36. Hidden Markov Models for Image Annotations. Pavel Ircing, Sanjeev Khudanpur

  37. Presentation Outline. Words ↔ Visterms: translation (MT) models (Paola); relevance models (Shaolei, Desislava); graphical models (Pavel, Brock); text classification models (Matt); integration & summary (Dietrich).

  38. Model Setup. Training HMMs: a separate HMM for each training image, with states given by the manual annotations (e.g. water, ground, grass, tiger). Image blocks are "generated" by annotation words; the alignment between image blocks and annotation words is a hidden variable, and the models are trained using the EM algorithm (HTK toolkit). The test HMM has |W| states, with two scenarios: (a) p(w'|w) uniform, (b) p(w'|w) from a co-occurrence LM. The posterior probability from the forward-backward pass is used for p(w|Image).
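
The test-time step can be sketched as a plain forward-backward pass; this is an illustration rather than the HTK-based system. It assumes integer visterm ids for the raster-scanned image blocks and pre-trained transition, emission, and initial distributions, and it averages the per-block state posteriors as one plausible way of obtaining p(w|Image) (the slide does not specify the aggregation).

```python
import numpy as np

def word_posteriors(visterm_seq, trans, emit, init):
    """Forward-backward over block visterms; returns a distribution over the |W| word states.

    trans: |W| x |W| transition matrix, emit: |W| x |V| emission matrix, init: |W| prior.
    (No numerical scaling here; adequate only for short block sequences.)
    """
    T, W = len(visterm_seq), len(init)
    alpha = np.zeros((T, W))
    beta = np.ones((T, W))

    alpha[0] = init * emit[:, visterm_seq[0]]
    for t in range(1, T):                                    # forward pass
        alpha[t] = (alpha[t - 1] @ trans) * emit[:, visterm_seq[t]]
    for t in range(T - 2, -1, -1):                           # backward pass
        beta[t] = trans @ (emit[:, visterm_seq[t + 1]] * beta[t + 1])

    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)                # per-block state posteriors
    return gamma.mean(axis=0)                                # averaged into p(w|Image)
```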

  39. Challenges in HMM Training. Inadequate annotations: there is no notion of order in the annotation words, leading to difficulties with automatic alignment between words and image regions. There is no linear order in the image blocks (a raster scan is assumed), so additional spatial dependence between block labels is missed; this is partially addressed via a more complex DBN (see later).

  40. Inadequacy of the Annotations. Corel database: annotators often mark only the interesting objects (e.g. beach, palm, people, tree). TRECVID database: annotation concepts capture mostly the semantics of the image and are not very suitable for describing visual properties (e.g. car, man-made object, transportation, vehicle, outdoors, non-studio setting, nature-non-vegetation, snow).

  41. Alignment Problems. There is no notion of order in the annotation words, so there are difficulties with automatic alignment between words and image regions.

  42. Gradual Training. Identify a set of "background" words (sky, grass, water, ...). In the initial stages of HMM training, allow only the "background" states to have their individual emission probability distributions, while all other objects share a single "foreground" distribution. Run several EM iterations, then gradually untie the "foreground" distribution and run more EM iterations.
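
A compact way to picture the tying step, under assumed data structures: `emissions` is a hypothetical dict mapping each word state to its emission parameter vector, and the background word set is taken from the slide. Untying later simply gives each foreground state its own copy of the shared parameters to re-estimate.

```python
import numpy as np

BACKGROUND_WORDS = {"sky", "grass", "water"}

def tie_foreground(emissions):
    """Share one averaged 'foreground' emission distribution across non-background states."""
    foreground = [w for w in emissions if w not in BACKGROUND_WORDS]
    shared = np.mean([emissions[w] for w in foreground], axis=0)
    return {w: (np.array(emissions[w]) if w in BACKGROUND_WORDS else shared.copy())
            for w in emissions}
```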

  43. Gradual Training Results. Improved alignment of the training images; annotation performance on test images did not change significantly.

  44. Other Training Scenarios. Models were forced to visit every state during training: huge models, with a marginal difference in performance. Special states were introduced to account for unlabelled background and unlabelled foreground, with different strategies for parameter tying.

  45. Annotation Performance on Corel. Discrete features: mAP 0.120 without LM, 0.150 with LM. Continuous features (1 Gaussian per state): mAP 0.140 without LM, 0.157 with LM. Continuous features are better than discrete, and the co-occurrence language model also gives a moderate improvement.

  46. Annotation Performance on TRECVID. Continuous features only, no language model: 1 Gaussian per state, mAP 0.094 (with LM: X); 12 Gaussians per state, mAP 0.145 (with LM: X).

  47. Annotation Performance on TREC

  48. Summary: HMM-Based Annotation. Very encouraging preliminary results: the effort started this summer, was validated on Corel, and yielded competitive annotation results on TREC. Initial findings: proper normalization of the features is crucial for system performance (bug found and fixed on Friday!); simple HMMs seem to work best; a more complex training topology didn't really help, and more complex parameter tying was only marginally helpful. Glaring gaps: need a good way to incorporate a language model.

  49. Graphical Models for Image Annotation + Joint Segmentation and Labeling for Content-Based Image Retrieval. Brock Pytlik, Johns Hopkins University, bep@cs.jhu.edu

  50. Outline. Graphical models for image annotation: hidden Markov models (preliminary results) and two-dimensional HMMs (work in progress). Joint image segmentation and labeling: tree-structure models of image segmentation (proposed research).

  51. Graphical Model Notation. [Trellis diagram: hidden concept states C1, C2, C3 with transition probabilities p(c|c') and emission probabilities p(o|c) over observed image blocks O1, O2, O3; each state ranges over candidate labels such as water, ground, grass, tiger.]

  52. Graphical Model Notation. [Same trellis; some candidate labels have been pruned.]

  53. Graphical Model Notation. [Same trellis; further candidate labels have been pruned.]

  54. Graphical Model Notation. [Same trellis; each state is left with essentially a single label, e.g. water, tiger, grass.]

  55. Graphical Model Notation Simplified. An HMM for a 24-block image.

  56. Graphical Model Notation Simplified. An HMM for a 24-block image.

  57. Modeling Spatial Structure. An HMM for a 24-block image.

  58. Modeling Spatial Structure. An HMM for a 24-block image; transition probabilities represent the spatial extent of objects.

  59. Modeling Spatial Structure. A two-dimensional model for a 24-block image; transition probabilities represent the spatial extent of objects.

  60. Modeling Spatial Structure. A two-dimensional model for a 24-block image; transition probabilities represent the spatial extent of objects. Training time: 1-D HMM, 0.5 sec per image (37.5 min per iteration); 2-D HMM, 110 sec per image (8250 min = 137.5 hr per iteration).

  61. Bag-of-Annotations Training. Unlike ASR, annotation words are unordered. Introduce an indicator $M_t$ with $p(M_t = 1 \mid c_t) = 1$ if $c_t \in \{\text{tiger}, \text{grass}, \text{sky}\}$ and $0$ otherwise; observing $M_t = 1$ acts as a constraint on $C_t$ (here the annotation is tiger, sky, grass).

  62. Bag-of-Annotations Training (II). Forcing annotation words to contribute: only permit paths that visit every annotation word, tracked with indicators $M_t^{(1)} = M_{t-1}^{(1)} \vee (C_t = \text{tiger})$, $M_t^{(2)} = M_{t-1}^{(2)} \vee (C_t = \text{grass})$, $M_t^{(3)} = M_{t-1}^{(3)} \vee (C_t = \text{sky})$.
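
The M_t bookkeeping amounts to a running bitmask over the annotation words; a minimal sketch (the function name and calling convention are illustrative):

```python
def all_words_visited(state_path, annotation_words):
    """Track which annotation words a state path visits; accept only all-ones masks."""
    annotation_words = list(annotation_words)
    mask = [False] * len(annotation_words)              # M_0 = (0, ..., 0)
    for c_t in state_path:
        for k, w in enumerate(annotation_words):
            mask[k] = mask[k] or (c_t == w)             # M_t^(k) = M_{t-1}^(k) OR (C_t == w_k)
    return all(mask)

# Example: this path visits tiger, grass and sky, so it would be permitted.
assert all_words_visited(["tiger", "grass", "sky", "grass"], ["tiger", "grass", "sky"])
```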

  63. Inference on Test Images. Forward decoding: $p(c \mid d_v) = \frac{p(c, d_v)}{p(d_v)}$.

  64. Inference on Test Images. Forward decoding: $p(c \mid d_v) = \frac{p(c, d_v)}{p(d_v)}$, with numerator $p(c, d_v) = \sum_{S \ni c} \left[\prod_{i=1}^{N} p(v_i \mid s_i)\right] p(S)$.

  65. Inference on Test Images. Forward decoding: $p(c \mid d_v) = \frac{p(c, d_v)}{p(d_v)} = \frac{\sum_{S \ni c} \left[\prod_{i=1}^{N} p(v_i \mid s_i)\right] p(S)}{\sum_{S} \left[\prod_{i=1}^{N} p(v_i \mid s_i)\right] p(S)}$.
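
For a small number of blocks, the ratio above can be computed by direct enumeration over state sequences, which makes the formula concrete; a practical system would use the forward algorithm instead of this exponential sum. In the sketch below, `p_emit(v, s)` and `p_seq(S)` are caller-supplied (for example, `p_seq` could be a Markov-chain prior over state sequences).

```python
from itertools import product

def p_concept_given_image(concept, visterms, states, p_emit, p_seq):
    """Brute-force version of p(c|d_v): sequences containing c over all sequences."""
    numerator = denominator = 0.0
    for S in product(states, repeat=len(visterms)):      # all state sequences of length N
        p = p_seq(S)
        for v, s in zip(visterms, S):
            p *= p_emit(v, s)
        denominator += p
        if concept in S:                                  # sequences that visit concept c
            numerator += p
    return numerator / denominator
```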
