Learning semantic relationships for better action retrieval in images


  1. Learning semantic relationships for better action retrieval in images Inkyu An

  2. Content: 1. Background, 2. Motivation, 3. Related Work, 4. Approach, 5. Result

  3. Background | Semantic? What comes to mind when you see the picture below? "There are many parked vehicles on either side of the road."

  4. Background | Semantic labeling. http://rodrigob.github.io/are_we_there_yet/build/semantic_labeling_datasets_results.html#4d5352432d3231

  5. Background | Semantic labeling. More complex: a wide variety of classes.

  6. Background | Semantic labeling. More complex: a wide variety of classes. Example classes (dog breeds): Collie, Retriever, Great Dane, Labrador Retriever, Pomeranian, Vizsla, Samoyed, Bull Terrier, Poodle, Yorkshire Terrier.

  7. Background | More and more complex. "She is stretching her right leg while listening to music."

  8. Motivation | Action retrieval in images. Query: "Person interacting with panda". Image search result: ???

  9. Motivation | Action retrieval in images. Query: "Person interacting with panda". Result of prior work: the image search returns false positives.

  10. Motivation | Action retrieval in images. Query: "Person interacting with panda". Result images: "Person holding animals", "Person interacting with panda", "Person feeding panda", "Person feeding calf". Relations shown between the query and the results: Implied-by, Mutually exclusive, Type-of.

  11. Motivation | Action retrieval in images. Three kinds of relations: 1. Implied-by 2. Type-of 3. Mutually exclusive. HEX-graph: Large-scale object classification using label relation graphs [ECCV 2014]

  12. Motivation | Action retrieval in images. "Person interacting with panda" is represented by a weight vector $\mathbf{w}_A$. Skip-grams: Distributed Representations of Words and Phrases and their Compositionality [NIPS 2013]

  13. Motivation | Action retrieval in images. The authors needed a relationship score for each pair of sentences. Neural Tensor Network: Reasoning With Neural Tensor Networks for Knowledge Base Completion [NIPS 2013]

  14. Related Work | 1. HEX-graph - Three kinds of relations 2. Skip-grams - Weight vectors of actions (sentences) 3. Neural Tensor Network - Relationship scores for pairs of actions

  15. Related Work | HEX-graph _ Motivation. A classifier over the labels Siberian Husky, Poodle, Bulldog, Bengal cat, Russian Blue, Dog, Cat.

  16. Related Work | HEX-graph _ Motivation. A classifier over the labels Siberian Husky, Puppy, Dog, Cat. The exclusion and subsumption relations between these labels form a HEX-graph.

  17. Related Work | HEX-graph _ Problem Definition. A HEX-graph consists of nodes $V$ (here Dog, Cat, Puppy, Husky), hierarchy edges $E_h$ (subsumption) and exclusion edges $E_e$ (exclusion). Relations: Dog, Puppy: subsumption. Dog, Cat: exclusion. Husky, Puppy: overlap.
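
The three relations on this slide can be read off a small graph structure. The sketch below is only an illustration, not the HEX-graph paper's implementation: the edge sets are a hypothetical encoding of the slide's toy example (it assumes Dog subsumes both Puppy and Husky), and the names `HIERARCHY`, `EXCLUSION`, `ancestors` and `relation` are made up for this example.

```python
# Hypothetical toy HEX-graph for the slide's example.
# Hierarchy edges are (parent, child); exclusion edges are unordered pairs.
HIERARCHY = {("Dog", "Puppy"), ("Dog", "Husky")}
EXCLUSION = {frozenset(("Dog", "Cat"))}


def ancestors(label):
    """Return the label itself plus every label that subsumes it."""
    seen = {label}
    frontier = [label]
    while frontier:
        node = frontier.pop()
        for parent, child in HIERARCHY:
            if child == node and parent not in seen:
                seen.add(parent)
                frontier.append(parent)
    return seen


def relation(a, b):
    """Classify a label pair as subsumption, exclusion, or overlap."""
    if a in ancestors(b) or b in ancestors(a):
        return "subsumption"
    # Exclusion is inherited: if any ancestors exclude each other, so do a and b.
    for x in ancestors(a):
        for y in ancestors(b):
            if frozenset((x, y)) in EXCLUSION:
                return "exclusion"
    # Neither subsumed nor excluded: both labels may hold at once.
    return "overlap"


for pair in [("Dog", "Puppy"), ("Dog", "Cat"), ("Husky", "Puppy")]:
    print(pair, relation(*pair))
```

Under these assumptions the three pairs come out as subsumption, exclusion and overlap, matching the relations listed on the slide.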

  18. Related Work | skip-grams - The training objective is to learn word vector representations that are good at predicting the nearby words. Training maximizes the average log probability of the nearby words given each word of the input sentence.
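
To make "input word" and "nearby words" concrete, here is a minimal sketch that enumerates the (input word, nearby word) pairs a skip-gram model would be trained on. In the cited skip-gram paper the average log probability is $\frac{1}{T}\sum_{t=1}^{T}\sum_{-c \le j \le c,\, j \ne 0} \log p(w_{t+j} \mid w_t)$; the code only builds the pairs that this sum runs over. The function name `skipgram_pairs` and the example sentence are hypothetical.

```python
def skipgram_pairs(tokens, window=2):
    """Return (input word, nearby word) pairs within the context window."""
    pairs = []
    for t, center in enumerate(tokens):
        lo = max(0, t - window)
        hi = min(len(tokens), t + window + 1)
        for j in range(lo, hi):
            if j != t:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


print(skipgram_pairs(["person", "riding", "bike"], window=1))
```

A larger `window` yields more pairs per word, at the price of including less closely related context words.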

  19. Related Work | Neural Tensor Networks (NTN) - The model returns a high score if two entities are in a given relationship and a low one otherwise.
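
For intuition, the score in the cited NTN paper has the form $u^T \tanh(e_1^T W^{[1:k]} e_2 + V [e_1; e_2] + b)$ for one relation. The sketch below evaluates that expression in plain Python. It is an illustration under stated assumptions, not the paper's code: the function name `ntn_score`, the nested-list parameter layout and the toy parameters are all hypothetical.

```python
import math


def ntn_score(e1, e2, W, V, b, u):
    """Score one relation: u . tanh(e1^T W[k] e2 + V[k] . [e1; e2] + b[k])."""
    concat = list(e1) + list(e2)
    score = 0.0
    for k in range(len(u)):
        # Bilinear tensor slice k relates the two entity vectors directly.
        bilinear = sum(e1[i] * W[k][i][j] * e2[j]
                       for i in range(len(e1)) for j in range(len(e2)))
        # Standard linear layer on the concatenated entity vectors.
        linear = sum(V[k][i] * concat[i] for i in range(len(concat)))
        score += u[k] * math.tanh(bilinear + linear + b[k])
    return score


# Hypothetical parameters: one slice equal to the identity, no linear part.
W = [[[1.0, 0.0], [0.0, 1.0]]]
V = [[0.0, 0.0, 0.0, 0.0]]
b = [0.0]
u = [1.0]
print(ntn_score([1.0, 0.0], [1.0, 0.0], W, V, b, u))   # aligned vectors
print(ntn_score([1.0, 0.0], [-1.0, 0.0], W, V, b, u))  # opposed vectors
```

With these toy parameters the aligned pair scores higher than the opposed pair, which is the "high score if related, low otherwise" behaviour the slide describes.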

  20. Approach | Problem setup. A set of actions $\mathcal{A}$, where each action (e.g. "Person riding bike") comes with related images. Example actions: Person riding bike; Person riding horse; Person preparing food; Chef cooking pasta; Person walking with a horse. Two SVO structures: 1. <subject, verb, object> 2. <subject, verb, prepositional object>

  21. Approach | Problem setup _ three kinds of relations. 1. Implied-by: "Person preparing food" / "Chef cooking pasta". 2. Type-of: "Person doing football" / "Man playing soccer". 3. Mutually exclusive: "Person riding horse" / "Man riding camel".

  22. Approach | Full model. $C = C_{ac} + \alpha_r C_{rec} + \alpha_n C_{nlp} + \alpha_c C_{cons} + \lambda \lVert W \rVert_2^2$, where $C_{ac}$ is the basic action retrieval model [Image + Action], $C_{rec}$ the visual objective [Image + Action], $C_{nlp}$ the language prior [only Action], $C_{cons}$ the consistency objective [only Action], and $W$ the weights in the model.

  23. Approach | Full model. $C = C_{ac} + \alpha_r C_{rec} + \alpha_n C_{nlp} + \alpha_c C_{cons} + \lambda \lVert W \rVert_2^2$, where $C_{ac}$ is the basic action retrieval model [Image + Action], $C_{rec}$ the visual objective [Image + Action], $C_{nlp}$ the language prior [only Action], $C_{cons}$ the consistency objective [only Action], and $W$ the weights in the model. The weights $W = \{W_{im}, \{w_A\}_{A \in \mathcal{A}}, W_{rel}\}$ are $\ell_2$-regularized with a regularization coefficient $\lambda$.
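
As a rough illustration of how the terms of the full model combine, the sketch below adds four precomputed cost values (basic action retrieval, visual objective, language prior, consistency objective) with their trade-off weights and an L2 penalty on a flat list of weights. The function name, the argument names and the numbers in the usage line are hypothetical; only the form of the sum is taken from the slide.

```python
def full_objective(c_ac, c_rec, c_nlp, c_cons, weights,
                   alpha_r, alpha_n, alpha_c, lam):
    """Combine the four cost terms with an L2 penalty on the weights."""
    l2_penalty = sum(w * w for w in weights)  # squared L2 norm of the weights
    return (c_ac
            + alpha_r * c_rec
            + alpha_n * c_nlp
            + alpha_c * c_cons
            + lam * l2_penalty)


# Hypothetical cost values and coefficients, for illustration only.
print(full_objective(1.0, 2.0, 3.0, 4.0, [1.0, 2.0],
                     alpha_r=0.5, alpha_n=0.5, alpha_c=0.5, lam=0.1))
```

Setting a coefficient to zero switches the corresponding term off, which is one way to compare the basic retrieval model against the variants with relationship terms.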

  24. Approach | Full model. $C = C_{ac} + \alpha_r C_{rec} + \alpha_n C_{nlp} + \alpha_c C_{cons} + \lambda \lVert W \rVert_2^2$, where $C_{ac}$ is the basic action retrieval model [Image + Action], $C_{rec}$ the visual objective [Image + Action], $C_{nlp}$ the language prior [only Action], $C_{cons}$ the consistency objective [only Action], and $W$ the weights in the model.

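  25. Approach | Basic action retrieval model. The action $A$ (e.g. "Person riding bike") is embedded with skip-grams as $w_A$. A positive image $I^+$ and a negative image $I^-$ each pass through a CNN followed by a linear layer ($W_{im}$, $b_{im}$), giving $f_{I^+}$ and $f_{I^-}$. Image feature: $f_I = W_{im}\,\mathrm{CNN}(I) + b_{im}$. Action prediction loss for an action $A$: $C_{ac} = \sum_{I^+ \in \mathcal{T}_A} \sum_{I^- \in \bar{\mathcal{T}}_A} \max(0,\, 1 + w_A^T (f_{I^-} - f_{I^+}))$, where $\mathcal{T}_A$ is a set of positive images of $A$ and $\bar{\mathcal{T}}_A$ is a set of negative images of $A$.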
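
The action prediction loss is a margin-based ranking loss: every positive image of an action should score at least 1 higher than every negative image, where the score is the dot product of the action's weight vector with the image feature. The sketch below computes that hinge cost for one action on already-extracted features. The function names and the two-dimensional toy features are hypothetical, and the CNN and linear layer are assumed to have been applied beforehand.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def action_retrieval_cost(w_a, pos_feats, neg_feats, margin=1.0):
    """Sum of max(0, margin + w_a.(f_neg - f_pos)) over all pos/neg pairs."""
    cost = 0.0
    for f_pos in pos_feats:
        for f_neg in neg_feats:
            cost += max(0.0, margin + dot(w_a, f_neg) - dot(w_a, f_pos))
    return cost


w_a = [1.0, 0.0]  # hypothetical action weight vector
# Positive scored well above the negative: no cost.
print(action_retrieval_cost(w_a, [[2.0, 0.0]], [[0.0, 0.0]]))
# Positive only 0.5 above the negative: the margin is violated.
print(action_retrieval_cost(w_a, [[0.5, 0.0]], [[0.0, 0.0]]))
```

The cost is zero only when every positive image beats every negative image by the full margin, so minimizing it pushes positives up the ranked list for that action.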

  26. Approach | Full model. $C = C_{ac} + \alpha_r C_{rec} + \alpha_n C_{nlp} + \alpha_c C_{cons} + \lambda \lVert W \rVert_2^2$, where $C_{ac}$ is the basic action retrieval model [Image + Action], $C_{rec}$ the visual objective [Image + Action], $C_{nlp}$ the language prior [only Action], $C_{cons}$ the consistency objective [only Action], and $W$ the weights in the model.

  27. Approach | Relationship prediction. Goal: denote the relationship between actions $A$ and $B$ by a vector $r_{AB} = [r^i_{AB}, r^t_{AB}, r^m_{AB}] \in [0,1]^3$ (implied-by, type-of and mutually exclusive). Action $A$ ("Person riding bike") and action $B$ ("Person riding camel") are embedded with skip-grams as $w_A$ and $w_B$, then passed through a Neural Tensor Network ($W_{rel}^{1:3}$) and a softmax: $r_{AB} = \mathrm{softmax}(\beta\,[\,w_A \otimes W_{rel}^{1:3} \otimes w_B + b_{rel}\,])$.
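
One way to read the relationship prediction step: each of the three tensor slices gives one bilinear score between the two action vectors, a bias is added, the result is scaled, and a softmax turns the three values into a vector in $[0,1]^3$. The sketch below follows that reading. It is an assumption-laden illustration: the function names, the nested-list layout of the tensor, the placement of the scale `beta` around the bias, and the toy parameters are not taken from the slides.

```python
import math


def bilinear(w_a, W_k, w_b):
    """Compute w_a^T W_k w_b for one tensor slice W_k."""
    return sum(w_a[i] * W_k[i][j] * w_b[j]
               for i in range(len(w_a)) for j in range(len(w_b)))


def predict_relationship(w_a, w_b, W_rel, b_rel, beta=1.0):
    """Return [implied-by, type-of, mutually exclusive] scores summing to 1."""
    logits = [beta * (bilinear(w_a, W_rel[k], w_b) + b_rel[k])
              for k in range(3)]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical parameters: only the third slice responds to these vectors.
zero = [[0.0, 0.0], [0.0, 0.0]]
W_rel = [zero, zero, [[5.0, 0.0], [0.0, 0.0]]]
print(predict_relationship([1.0, 0.0], [1.0, 0.0], W_rel, [0.0, 0.0, 0.0]))
```

A larger `beta` sharpens the softmax, pushing the prediction toward a single relation.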

  28. Approach | Language prior for relationship - NLP prior. 1. Implied-by: "Person preparing food" / "Chef cooking pasta". 2. Type-of: "Man eating fish" / "Person feeding a fish" (wrong). 3. Mutually exclusive: "Person riding horse" / "Man riding camel".

  29. Approach | Language prior for relationship. The loss function of the language-based relationship, $C_{nlp}$: $C_{nlp} = \sum_{A} \sum_{B \in \mathcal{R}_A} \lVert r_{AB} - \tilde{r}_{AB} \rVert$, where $\tilde{r}_{AB}$ is the NLP prior and $r_{AB}$ is the relationship prediction. - NLP priors are not always accurate - The authors treated the NLP prior as a noisy prior

  30. Approach | Full model. $C = C_{ac} + \alpha_r C_{rec} + \alpha_n C_{nlp} + \alpha_c C_{cons} + \lambda \lVert W \rVert_2^2$, where $C_{ac}$ is the basic action retrieval model [Image + Action], $C_{rec}$ the visual objective [Image + Action], $C_{nlp}$ the language prior [only Action], $C_{cons}$ the consistency objective [only Action], and $W$ the weights in the model.

  31. Approach | Action retrieval with relationship - Visual objective. Here $\mathcal{T}_A$, $\mathcal{T}_B$ are sets of positive images of $A$, $B$ and $\bar{\mathcal{T}}_A$, $\bar{\mathcal{T}}_B$ are sets of negative images of $A$, $B$. A is implied-by B: rank the positive images of B higher than the negatives of A: $C^i_{AB} = \sum_{I_b \in \mathcal{T}_B} \sum_{I^- \in \bar{\mathcal{T}}_A} \max(0,\, 1 + w_A^T (f_{I^-} - f_{I_b}))$. A is type-of B: rank the positive images of A higher than the negatives of B: $C^t_{AB} = \sum_{I_a \in \mathcal{T}_A} \sum_{I^- \in \bar{\mathcal{T}}_B} \max(0,\, 1 + w_B^T (f_{I^-} - f_{I_a}))$. A is mutually exclusive of B: rank the positive images of A higher than the positives of B: $C^m_{AB} = \sum_{I_a \in \mathcal{T}_A} \sum_{I_b \in \mathcal{T}_B} \max(0,\, 1 + w_A^T (f_{I_b} - f_{I_a}))$.
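
All three relation costs on this slide share one pattern: a hinge ranking cost that asks one set of images to score higher than another under a chosen action weight vector. The sketch below factors that pattern into a helper and expresses the three cases through it. The function names and toy features are hypothetical, and the image features are assumed to be precomputed.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def rank_cost(w, higher, lower):
    """Hinge cost asking w.f to be larger for `higher` than for `lower`."""
    return sum(max(0.0, 1.0 + dot(w, f_lo) - dot(w, f_hi))
               for f_hi in higher for f_lo in lower)


def implied_by_cost(w_a, pos_b, neg_a):
    """A implied-by B: positives of B above negatives of A, scored by w_a."""
    return rank_cost(w_a, pos_b, neg_a)


def type_of_cost(w_b, pos_a, neg_b):
    """A type-of B: positives of A above negatives of B, scored by w_b."""
    return rank_cost(w_b, pos_a, neg_b)


def mutually_exclusive_cost(w_a, pos_a, pos_b):
    """A exclusive of B: positives of A above positives of B, scored by w_a."""
    return rank_cost(w_a, pos_a, pos_b)


w_a = [1.0, 0.0]  # hypothetical weight vector for action A
print(implied_by_cost(w_a, [[2.0, 0.0]], [[0.0, 0.0]]))
print(mutually_exclusive_cost(w_a, [[0.5, 0.0]], [[0.25, 0.0]]))
```

The only differences between the three relations are which image sets are compared and whose weight vector does the scoring.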

  32. Approach | Action retrieval with relationship - Visual objective. Objective: $C_{rec} = \sum_{A \in \mathcal{A}} \sum_{B \in \mathcal{R}_A} \left( r^i_{AB} \cdot C^i_{AB} + r^t_{AB} \cdot C^t_{AB} + r^m_{AB} \cdot C^m_{AB} \right)$, with relationship prediction $r_{AB} = \{r^i_{AB}, r^t_{AB}, r^m_{AB}\}$. The objective sums the costs ($C^i_{AB}$, $C^t_{AB}$, $C^m_{AB}$) of each relation; a cost counts when the corresponding relationship prediction ($r^i_{AB}$, $r^t_{AB}$, $r^m_{AB}$) is 1.
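
A short worked example of the weighting in this objective: for each action pair, the three relation costs are multiplied by the predicted relationship entries and summed, so a cost contributes fully when its prediction is 1 and not at all when it is 0. The function name and the numbers below are hypothetical.

```python
def relationship_objective(pairs):
    """Sum r_i*C_i + r_t*C_t + r_m*C_m over all (prediction, costs) pairs."""
    total = 0.0
    for r_ab, costs_ab in pairs:
        total += sum(r * c for r, c in zip(r_ab, costs_ab))
    return total


# Each entry: ([r_implied, r_type, r_mutex], [C_implied, C_type, C_mutex]).
pairs = [
    ([1.0, 0.0, 0.0], [0.5, 2.0, 3.0]),   # only the implied-by cost counts
    ([0.0, 0.0, 1.0], [4.0, 5.0, 0.25]),  # only the exclusive cost counts
]
print(relationship_objective(pairs))
```

With soft predictions between 0 and 1, the same sum blends the three costs in proportion to how strongly each relation is predicted.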

  33. Approach | Full model. $C = C_{ac} + \alpha_r C_{rec} + \alpha_n C_{nlp} + \alpha_c C_{cons} + \lambda \lVert W \rVert_2^2$, where $C_{ac}$ is the basic action retrieval model [Image + Action], $C_{rec}$ the visual objective [Image + Action], $C_{nlp}$ the language prior [only Action], $C_{cons}$ the consistency objective [only Action], and $W$ the weights in the model.
