Contextual Media Retrieval Using Natural Language Queries

IMPRS-CS PhD Application Talk (Master's Thesis)
Sreyasi Nag Chowdhury
Supervisors: Dr. Mario Fritz, Dr. Andreas Bulling
Adviser: M.Sc. Mateusz Malinowski


  1. Question–Answering: a dynamic world, with media as the answer. "What is there in front of MPI-INF?" is an ambiguous query; it is subjective, with multiple correct answers from our Q&A model. Sreyasi Nag Chowdhury | Contextual Media Retrieval Using Natural Language Queries | 23-02-2015

  3. Dynamic-Egocentric Extension
     World(x)
     Static World(x_s):
       cafe('mensa', 49.2560, 7.0454).
       building('mpi_inf', 49.2578, 7.0460).
     Dynamic World(x_d):
       User Metadata(x_du):
         person(49.2578, 7.0460, 'n').
         day(20150220).
       Collective Memory(x_dm):
         image('img_20141111_165828', 20141111, 49.2566, 7.0442, 'november').
         video('vid_20141121_120149', 20141121, 49.2569, 7.0456, 'november').

  10. Qualitative Results (figures)

  17. Outline • Motivation and Overview • Contextual Media Retrieval System • Results and Conclusion

  18. Evaluation: Agreement and Disagreement between users (model tested on 500 test queries)
     • Total agreement: 26.67%
     • Majority agreement: ~40%
     • Disagreement: ~25% (future scope for personalization)

  23. Evaluation: Study of human reference frame resolution (future scope for using other knowledge bases)

  25. Summary. We have:
     • Instantiated a "Collective Memory" of media content
     • Developed a novel architecture for media retrieval with natural language voice queries in a dynamic setting: Xplore-M-Ego
     • Integrated 'egocentrism' into media retrieval
     Thank You

  27. References
     • Photo Tourism: Exploring Photo Collections in 3D. N. Snavely, S. M. Seitz, R. Szeliski
     • Video Collections in Panoramic Contexts. J. Tompkin, F. Pece, R. Shah, S. Izadi, J. Kautz, C. Theobalt
     • Videoscapes: Exploring Sparse, Unstructured Video Collections. J. Tompkin, K. I. Kim, J. Kautz, C. Theobalt
     • PhotoScope: Visualizing Spatiotemporal Coverage of Photos for Construction Management. F. Wu, M. Tory

  28. References
     • Learning Dependency-Based Compositional Semantics. P. Liang, M. I. Jordan, D. Klein
     • A Multi-World Approach to Question Answering about Real-World Scenes Based on Uncertain Input. M. Malinowski, M. Fritz
     • Image Retrieval with Structured Object Queries Using Latent Ranking SVM. T. Lan, W. Yang, Y. Wang, G. Mori
     • Interpretation of Spatial Language in a Map Navigation Task. M. Levit, D. Roy

  29. Extra Material

  30. Contribution
     • Instantiation of a "Collective Memory" of media files
     • Extension of question-answering to a dynamic setting
     • Extension of spatio-temporal exploration of media to a dynamic setting
     • Incorporation of 'egocentrism' into media retrieval
     • Use of natural language voice queries for media retrieval

  31. System Overview. Modules of Xplore-M-Ego:
     • Google Glass: user interface
     • Pre-processing: modification of the query; mapping of a dynamic environment to a static environment
     • Semantic Parser + Denotation: semantic parsing and prediction of the answer
     • Collective Memory: store of media files

  32. Related Work • Spatio-temporal Media Retrieval
     Paper | Author(s) | Overview
     Photo tourism: exploring photo collections in 3D | N. Snavely, S. M. Seitz, R. Szeliski | Exploration of popular world sites by browsing through images
     Video collections in panoramic contexts | J. Tompkin, F. Pece, R. Shah, S. Izadi, J. Kautz, C. Theobalt | Spatio-temporal exploration of videos embedded in a panoramic context

  33. Related Work • Natural Language Question-Answering
     Paper | Author(s) | Overview
     Learning dependency-based compositional semantics | P. Liang, M. I. Jordan, D. Klein | Training of a semantic parser with question-answer pairs; single static-world approach
     A multi-world approach to question answering about real-world scenes based on uncertain input | M. Malinowski, M. Fritz | Question-answering task based on real-world indoor images; static multi-world approach

  34. Related Work • Media Retrieval with Natural Language Queries
     Paper | Author(s) | Overview
     Towards surveillance video search by natural language query | S. Tellex, D. Roy | Retrieval of video frames from surveillance videos with the spatial relations "across" and "along"
     Image retrieval with structured object queries using latent ranking SVM | T. Lan, W. Yang, Y. Wang, G. Mori | Retrieval of images based on scene contents, using short structured phrases as queries

  35. Data Collection 1. Map information: OpenStreetMap. Contains: type of the entity, GPS coordinates, name, address.

  36. Data Collection 2. Collection of media files: the Collective Memory. (Media files were captured with smartphones.)

  37. Data Collection 3. Training and Test Data
     Synthetically-generated data:
     ("What is there in front of MPI-INF?", answer(A, (frontOf(A, 'mpi_inf'))))
     ("What is there behind MPI-INF?", answer(A, (behind(A, 'mpi_inf'))))
     ("What is there on the right of MPI-INF?", answer(A, (rightOf(A, 'mpi_inf'))))
     ("What is there on the left of MPI-INF?", answer(A, (leftOf(A, 'mpi_inf'))))
     Real-world data:
     ("What is there on the left of MPI-INF?", 'img_20141102_123406')
     ("What is on the left of MPI-INF?", 'img_20141113_160930')
     ("What is to the left of MPI-INF?", 'img_20141109_134914')
     ("What is on the left side of MPI-INF?", 'img_20141115_100705')
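The synthetic pairs above follow fixed question templates over map entities. A minimal sketch of how such pairs could be generated; the template list, function name, and entity tuples are illustrative assumptions, not the thesis code:

```python
# Sketch: generate synthetic (question, logical-form) training pairs
# from map entities, following the template pattern shown above.
# TEMPLATES and synthesize_pairs are assumed names for illustration.

TEMPLATES = [
    ("What is there in front of {name}?", "frontOf"),
    ("What is there behind {name}?", "behind"),
    ("What is there on the right of {name}?", "rightOf"),
    ("What is there on the left of {name}?", "leftOf"),
]

def synthesize_pairs(entities):
    """Return one (question, logical_form) pair per entity and relation."""
    pairs = []
    for display_name, entity_id in entities:
        for question, predicate in TEMPLATES:
            q = question.format(name=display_name)
            lf = f"answer(A, ({predicate}(A, '{entity_id}')))"
            pairs.append((q, lf))
    return pairs

for q, lf in synthesize_pairs([("MPI-INF", "mpi_inf")]):
    print(q, "->", lf)
```

With one entity and four relation templates, this yields the four synthetic pairs shown on the slide.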

  38. Data Collection (figure)

  39. Semantic Parser. Dependency-based Compositional Semantics (DCS) by Percy Liang • A DCS tree defines relations between predicates • Denotations are solutions satisfying the relations • city, major, loc, CA are predicates

  40. Semantic Parser
     Example questions:
     • "What is the highest point in Florida?"
     • "Which state has the shortest river?"
     • "What is the capital of Maine?"
     • "What are the populations of states through which the Mississippi river runs?"
     • "Name all the lakes of US."
     World(w):
     state('california','ca','sacramento',23.67e+6,158.0e+3,31,'los angeles','san diego','san francisco','san jose').
     city('alabama','al','birmingham',284413).
     river('arkansas',2333,['colorado','kansas','oklahoma','arkansas']).
     mountain('alaska','ak','mckinley',6194).
     road('86',['massachusetts','connecticut']).
     country('usa',307890000,9826675).

  41. Semantic Parser: Learning in DCS (slide courtesy: Percy Liang)

  42. Semantic Parser • Induction of logical forms • Logical forms (DCS trees) are induced as latent variables according to a probability distribution parametrized by θ • The answer y is evaluated with respect to the world w

  43. Semantic Parser • Induction of logical forms. Requirements: a set of rules/predicates:
     city(cityid(City,St)) :- state(State,St,_,_,_,_,City,_,_,_).
     loc(cityid(City,St),stateid(State)) :- state(State,St,_,_,_,_,City,_,_,_).
     river(riverid(R)) :- river(R,_,_).
     loc(cityid(City,St),stateid(State)) :- city(State,St,City,_).
     traverse(riverid(R),stateid(S)) :- river(R,_,States), member(S,States).
     area(stateid(X),squared_mile(Area)) :- state(X,_,_,_,Area,_,_,_,_,_).
     population(countryid(X),Pop) :- country(X,Pop,_).
     major(X) :- city(X), population(X,moreThan(150000)).

  44. Semantic Parser • Induction of logical forms. Requirements: a set of lexical triggers (L), of the forms <(function word; predicate)> and <([POS tags]; [predicates])>:
     (most, size). (total, sum). (called, nameObj).
     (WRB; loc). (JJ; major).
     ([NN;NNS]; [city,state,country,lake,mountain,river,place])
     ([NN;NNS]; [person,capital,population])
     ([NN;NNS;JJ]; [len,negLen,size,negSize,elevation,negElevation,density,negDensity,area,negArea])
     Augmented lexicon (L+): (long, len). (large, size). (small, negSize). (high, elevation).

  45. Media Retrieval from Denotations
     Example questions:
     • "What is there on the right of MPI-INF?"
     • "What is there in front of postbank?"
     • "What is there on the left of Mensa?"
     • "What is there near Science Park?"
     • "What happened here one day ago?"
     • "What does this place look like in December?"
     World(w):
     image('img_20141111_165828',20141111,49.2566,7.0442,'november').
     video('vid_20141121_120149',20141121,49.2569,7.0456,'november').
     cafe('mensa',49.2560,7.0454).
     building('mpi_inf',49.2578,7.0460).
     bank('postbank',49.2556,7.0449).
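Denotations are resolved against the world facts above by matching GPS coordinates. A minimal sketch of one such relation ("near"), assuming a flat Euclidean approximation over a small campus area; the function name and radius are assumptions, not the thesis implementation:

```python
# Sketch: resolve a parsed relation like near(A, 'mpi_inf') against
# the world facts by GPS proximity. Euclidean distance in degrees is
# an assumed simplification; the default radius is illustrative.
import math

PLACES = {                      # entity -> (lat, lon), from the world facts
    "mensa": (49.2560, 7.0454),
    "mpi_inf": (49.2578, 7.0460),
    "postbank": (49.2556, 7.0449),
}
MEDIA = [                       # (file, lat, lon), from the world facts
    ("img_20141111_165828", 49.2566, 7.0442),
    ("vid_20141121_120149", 49.2569, 7.0456),
]

def near(entity, radius=0.002):
    """Return media files whose GPS position lies within `radius`
    degrees of the named entity."""
    lat, lon = PLACES[entity]
    return [f for f, mlat, mlon in MEDIA
            if math.hypot(mlat - lat, mlon - lon) <= radius]

print(near("mpi_inf"))   # ['vid_20141121_120149']
```

Directional relations such as frontOf or leftOf additionally require a viewing direction, which is exactly what the dynamic-egocentric extension supplies from user metadata.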

  46. Dynamic-Egocentric Extension. Lexical triggers:
     Basic lexicon L (prediction accuracy: 17.9%):
     ([WP,WDT], [image,video]).
     (NN, [atm,building,cafe,highway,parking,research_institution,restaurant,shop,sport,tourism,university]).
     (JJS, [nearest]). ([NN,NNS,VB], [view]). (VBD, [view]).
     Augmented lexicon L+ (prediction accuracy: 47%):
     (front, frontOf). (behind, behind). (right, rightOf). (left, leftOf).

  47. Dynamic-Egocentric Extension (figures)

  49. POS Tags from Penn Treebank • WRB: Wh-adverb • WP: Wh-pronoun • WDT: Wh-determiner • NN: Noun, singular or mass • NNS: Noun, plural • JJ: Adjective • JJS: Adjective, superlative • VB: Verb, base form • VBD: Verb, past tense

  50. Reason behind hard-coding spatial relations • What is there left/VBN of MPI? • What is there on the left/NN of MPI? • What is there in front/NN of MPI? • What is there behind/IN MPI? • What is there right/RB of MPI? • What is there on the right/NN of MPI?

  51. Predicates used in Xplore-M-Ego (figure)

  52. Results and Evaluation • Synthetically generated question-answer pairs used for training and testing • Maximum prediction accuracy: 47%

  53. Results and Evaluation. Performance measures:
     • q_m = number of queries with media retrievals
     • q_r = number of queries with relevant retrievals among q_m
     • q_t = number of queries with textual retrievals and no retrievals
     • average precision = q_r / q_m
     • average recall = q_r / (q_m + q_t)
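The two measures above are simple ratios over query counts. A minimal sketch, with illustrative numbers (the example counts are assumptions, not the thesis results):

```python
# Sketch of the performance measures above: q_m, q_r, q_t are counts
# over the test queries. The example counts below are illustrative.

def average_precision(q_r, q_m):
    """Relevant retrievals over queries that returned media."""
    return q_r / q_m

def average_recall(q_r, q_m, q_t):
    """Relevant retrievals over all queries, including those that
    returned only text or nothing."""
    return q_r / (q_m + q_t)

# Illustrative: 40 relevant among 150 media-returning queries,
# plus 100 queries with textual or no retrievals.
print(average_precision(40, 150))    # ~0.267
print(average_recall(40, 150, 100))  # 0.16
```

Recall is always at most precision here, since the denominator only grows with the queries that retrieved no media.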

  54. Results and Evaluation • "Human-in-the-loop" training of the model o Five different models were trained o Training accuracies ranged from 42.6% to 48.8% o The best model based on training accuracy was used for further evaluations

  55. Results and Evaluation • "Human-in-the-loop" training of the model • A method of training the semantic parser by human users through relevance feedback • "Correct"/"Wrong" decisions are made solely based on the predicted answers • The models are trained with real questions from human users

  56. Results and Evaluation • "Human-in-the-loop" training of the model o Automatic training of the semantic parser with the real data was not possible because: • GPS coordinates of media files showing a particular entity do not match those of the map data • Humans are inconsistent with regard to reference frames • Question-answer pairs did not follow any pattern • Denotations (often more than one answer) never matched the true answers, so an EM-like algorithm failed to learn

  57. Results and Evaluation. Human evaluation of the model trained with real-world data • RealModel: model trained with real-world data • Relevance feedback collected from five users • Overall percentage of relevant retrievals = 26.67%

  58. Results and Evaluation • Recall of SynthModel = 15.88% • Recall of RealModel = 26.67%

  59. Evaluation. Human evaluation of temporal and contextual Q&A • Five hypothetical locations and viewing directions provided to users • Relevance feedback collected for retrievals following a canonical reference frame and a user-centric reference frame

  60. Evaluation. Human evaluation of temporal and contextual Q&A • Canonical vs. user-centric reference frame: "front of", "right of", "left of", "behind". With the user heading East: Original: "What is there in front of MPI-INF?" Altered: "What is there on the right of MPI-INF?"
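The "heading East" example above amounts to rotating an egocentric relation into the canonical, north-facing map frame. A minimal sketch under the assumption that headings are quantized to the four compass directions; the function and table names are illustrative:

```python
# Sketch: remap a spatial relation from the user's egocentric frame
# to the canonical (north-facing) frame, given the user's compass
# heading. Each 90-degree turn shifts the relation one step clockwise.
# Relation names follow the lexicon shown earlier in the deck.

RELATIONS = ["frontOf", "rightOf", "behind", "leftOf"]  # clockwise order
HEADINGS = {"north": 0, "east": 1, "south": 2, "west": 3}

def to_canonical(relation, heading):
    """Rotate an egocentric relation by the user's heading."""
    shift = HEADINGS[heading]
    idx = RELATIONS.index(relation)
    return RELATIONS[(idx + shift) % len(RELATIONS)]

# User heading east: egocentric "in front of" is canonical "right of",
# matching the Original/Altered query pair on the slide.
print(to_canonical("frontOf", "east"))   # rightOf
```

A user facing north needs no remapping (shift 0), which is why the canonical frame can be treated as the north-facing special case.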

  61. Discussion. Problem with matching GPS coordinates: for the query "What is in front of MPI-INF?", the retrieved media shows iCoffee, while the ground-truth media shows the front of MPI-INF.

  62. Discussion
     Challenges:
     • Converting a dynamic world to a static world
     • Integrating 'egocentrism'
     • Handling temporal queries
     • Collection of data
     • Increasing the coverage of the static database
     Limitations:
     • Spatial and temporal references not identified
     • Words tagged with incorrect POS tags
     • Arguments not identified from sentences
     • Scalability
     • Reference resolution is not handled
