Sharing is Caring in the Land of The Long Tail


  1. Sharing is Caring in the Land of The Long Tail Samy Bengio

  2. Real life setting “Real problems rarely come packaged as 1M images uniformly belonging to a set of 1000 classes…”

  3. The long tail • A well-known phenomenon where a small number of generic objects/entities/words appear very often and most others appear more rarely. • Also known as Zipf's law, a power law, or a Pareto distribution. • The web is littered with this kind of distribution: • the frequency of each unique query on search engines, • the occurrences of each unique word in text documents, • etc.

  4. Example of a long tail [Figure: frequency of each word in Wikipedia, plotted on a log scale from 10 to 1e+09, ranked from very frequent words such as “the” down to rare ones such as “anyways”, “trickiest”, and “h-plane”.]
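To make the shape of such a distribution concrete, here is a minimal sketch (not from the talk) that ranks the words of a local text file by frequency; the file name corpus.txt is a placeholder for any large corpus.

```python
from collections import Counter

# Hypothetical local corpus file standing in for Wikipedia.
words = open("corpus.txt", encoding="utf-8").read().lower().split()
ranked = Counter(words).most_common()

# On real text, log(frequency) falls roughly linearly with log(rank):
# the signature of a Zipf / power-law distribution.
for rank, (word, freq) in enumerate(ranked[:20], start=1):
    print(f"{rank:>4}  {freq:>10}  {word}")
```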

  5. Representation sharing • How do we design a classifier or a ranker when the data follows a long-tail distribution? • If we train one model per class, poor classes are hard to train well. • How come we humans are able to recognize objects we have seen only once, or even never? • Most likely answer: representation sharing: all class models share/learn a joint representation. • Poor classes can then benefit from knowledge learned from semantically similar but richer classes. • Extreme case: the zero-shot setting!

  6. Outline In this talk, I will cover the following ideas: • Wsabie: a joint embedding space of images and labels • The many facets of text embeddings • The zero-shot setting through embeddings • Incorporating Knowledge Graph constraints • Using a language model I will NOT cover the following important issues: • Prediction-time issues for extreme classification • Memory issues

  7. Wsabie Learn to embed images and labels jointly, optimizing for the top-ranked items. [Figure: labels such as Obama, Eiffel Tower, Shark, Dolphin, Lion mapped together with images into a 100-dimensional embedding space.] Wsabie: J. Weston et al., ECML 2010, IJCAI 2011

  8. Wsabie: summary sim(i, x) = ⟨W_i, V_x⟩, where W embeds label i (a one-hot code) and V embeds image x (real-valued features). Triplet loss: sim(x, dolphin) > sim(x, obama) + 1 for an image x of a dolphin. Trained by stochastic gradient descent and smart sampling of negative examples.
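As a rough illustration of this training scheme, here is a toy sketch (not the authors' code) of one stochastic gradient step on the margin-based triplet loss with uniform negative sampling; all dimensions are made up, and the WARP rank weighting of the actual Wsabie objective is omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
d_img, d_emb, n_labels = 512, 100, 10_000   # toy sizes, not the paper's

V = rng.normal(scale=0.01, size=(d_emb, d_img))     # image feature projection
W = rng.normal(scale=0.01, size=(n_labels, d_emb))  # label embeddings

def triplet_step(x, pos, lr=0.01, margin=1.0, max_tries=100):
    """One SGD step: sample negative labels until one violates the margin,
    then push the true label above it."""
    x_emb = V @ x
    for _ in range(max_tries):
        neg = int(rng.integers(n_labels))
        if neg == pos:
            continue
        loss = margin - W[pos] @ x_emb + W[neg] @ x_emb
        if loss > 0:
            grad_V = np.outer(W[neg] - W[pos], x)  # d(loss)/dV
            W[pos] += lr * x_emb                   # descend on W[pos]
            W[neg] -= lr * x_emb                   # descend on W[neg]
            V -= lr * grad_V
            return loss
    return 0.0  # no violating negative found: zero loss for this example

x = rng.normal(size=d_img)  # stand-in for an image feature vector
print(triplet_step(x, pos=42))
```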

  9. Wsabie: experiments - results

                              ImageNet 2010          Web
      Method                  prec@1   prec@10       prec@1   prec@10
      approx kNN              1.55%    0.41%         0.30%    0.34%
      One-vs-Rest             2.27%    1.02%         0.52%    0.29%
      Wsabie                  4.03%    1.48%         1.03%    0.44%
      Ensemble of 10 Wsabies  10.03%   3.02%         -        -

      ImageNet 2010: 16,000 labels and 4M images. Web: 109,000 labels and 16M images.

  10. Wsabie: embeddings. Nearest neighbors in label space:
      barack obama: barak obama, obama, barack, barrack obama, bow wow
      david beckham: beckham, david beckam, alessandro del piero, del piero
      santa: santa claus, papa noel, pere noel, santa clause, joyeux noel
      dolphin: delphin, dauphin, whale, delfin, delfini, baleine, blue whale
      cows: cattle, shire, dairy cows, kuh, horse, cow, shire horse, kone
      rose: rosen, hibiscus, rose flower, rosa, roze, pink rose, red rose
      eiffel tower: eiffel, tour eiffel, la tour eiffel, big ben, paris, blue mosque
      ipod: i pod, ipod nano, apple ipod, ipod apple, new ipod
      f18: f 18, eurofighter, f14, fighter jet, tomcat, mig 21, f 16

  11. Wsabie: annotations. Top predicted labels for four test images:
      (dolphin) delfini, orca, dolphin, mar, delfin, dauphin, whale, cancun, killer whale, sea world
      (shark) blue whale, whale shark, great white shark, underwater, white shark, shark, manta ray, dolphin, requin, blue shark, diving
      (obama) barrack obama, barak obama, barack hussein obama, barack obama, james marsden, jay z, obama, nelly, falco, barack
      (eiffel tower) eiffel, paris by night, la tour eiffel, tour eiffel, eiffel tower, las vegas strip, eifel, tokyo tower, eifel tower

  12. “Why not an embedding of text only?”

  13. Skip-Gram (Word2Vec) Learn dense embedding vectors from an unannotated text corpus, e.g. Wikipedia. [Figure: the embedding of “tiger shark”, learned from contexts such as “an exceptionally large male tiger shark can grow up to”, shown in the same space as other vocabulary items such as wing chair, Obama, tuna, and bull shark.] http://code.google.com/p/word2vec (Tomas Mikolov, Kai Chen, Greg Corrado, Jeff Dean, ICLR 2013)
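For reference, a skip-gram model of this kind can be trained in a few lines with the gensim library; this is a minimal sketch assuming gensim ≥ 4 is installed, and the two-sentence corpus is only a stand-in for Wikipedia.

```python
from gensim.models import Word2Vec

# Tiny stand-in corpus; the talk trains on all of Wikipedia.
sentences = [
    "an exceptionally large male tiger shark can grow up to".split(),
    "the bull shark is found in warm coastal waters".split(),
]

# sg=1 selects the skip-gram objective from the slide.
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)

print(model.wv.most_similar("shark", topn=3))
```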

  14. Skip-Gram on Wikipedia [Figure: t-SNE visualization of 155K ImageNet labels embedded by a skip-gram model trained on Wikipedia. Shark terms cluster together (tiger shark, bull shark, blacktip shark, oceanic whitetip shark, sandbar shark, dusky shark, blue shark, requiem shark, great white shark, lemon shark), as do car terms (car, cars, muscle car, sports car, compact car, autocar, automobile, pickup truck, racing car, passenger car, dealership); broader regions cover reptiles, birds, insects, food, musical instruments, clothing, dogs, aquatic life, animals, and transportation.]

  15. Embeddings are powerful E(Rome) - E(Italy) + E(Germany) ≈ E(Berlin) E(hotter) - E(hot) + E(big) ≈ E(bigger) [Figure: the Italy→Rome and Germany→Berlin vectors are roughly parallel, as are hot→hotter and big→bigger.]
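Such analogies reduce to vector arithmetic plus a nearest-neighbor search; a small sketch, assuming E is a dict mapping words to numpy vectors:

```python
import numpy as np

def analogy(E, a, b, c, topn=1):
    """'a is to b as c is to ?': words whose embeddings are closest to
    E[b] - E[a] + E[c] under cosine similarity."""
    target = E[b] - E[a] + E[c]
    target = target / np.linalg.norm(target)
    scored = []
    for w, v in E.items():
        if w in (a, b, c):
            continue  # exclude the query words themselves
        scored.append((float(v @ target) / np.linalg.norm(v), w))
    return sorted(scored, reverse=True)[:topn]

# e.g. analogy(E, "Italy", "Rome", "Germany") should rank "Berlin" first
# when E holds good embeddings.
```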

  16. Let’s go back to images!

  17. Deep convolutional models for images [Figure: a deep network from the input through layers 1 to 7.] But what about the long tail of classes? What about using our semantic embeddings for that?

  18. ConSE: Convex Combination of Semantic Embeddings [Norouzi et al., ICLR 2014] [Figure: for an image x, a trained classifier outputs p(Lion | x), p(Apple | x), p(Orange | x), p(Tiger | x), p(Bear | x).]

  19. ConSE: Convex Combination of Semantic Embeddings (from Skip-Gram, for instance). Let s(y) be the embedding position of label y. Then f(x) = Σ_i p(y_i | x) s(y_i), e.g. f(x) = p(Lion | x) s(Lion) + p(Apple | x) s(Apple) + p(Orange | x) s(Orange) + p(Tiger | x) s(Tiger) + p(Bear | x) s(Bear). Do a nearest-neighbor search around f(x) to find the corresponding label.

  20. ConSE(T): Convex Combination of Semantic Embeddings In practice, average only the top few labels: top(T) = { i | p(y_i | x) is among the top T probabilities }, and f(x) = (1/Z) Σ_{i ∈ top(T)} p(y_i | x) s(y_i), where Z normalizes so the weights sum to 1.
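A compact sketch of this predictor (the names and inputs are assumptions, not the paper's code):

```python
import numpy as np

def conse_predict(probs, train_labels, S, candidates, T=10):
    """probs: classifier probabilities over the known training labels;
    train_labels: their names; S: {label: embedding vector}; candidates:
    labels to choose among (can be unseen at training time)."""
    top = np.argsort(probs)[::-1][:T]
    Z = probs[top].sum()
    # Convex combination of the top-T label embeddings.
    f = sum(probs[i] * S[train_labels[i]] for i in top) / Z
    f = f / np.linalg.norm(f)
    # Nearest-neighbor search in embedding space (cosine similarity).
    return max(candidates,
               key=lambda y: float(S[y] @ f) / np.linalg.norm(S[y]))
```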

  21. ConSE(T): experiments on ImageNet • Model trained with 1.2M ILSVRC 2012 images from 1,000 classes. • Evaluated on images from the same classes. • Results are measured as hit@k. [Figure: the ImageNet hierarchy, showing the training classes and the classes 2 hops and 3 hops away from them.]
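For clarity, hit@k on a single example is simply:

```python
import numpy as np

def hit_at_k(scores, true_label, k):
    """1 if the true label is among the k highest-scoring labels, else 0;
    averaging this over a test set gives hit@k."""
    return int(true_label in np.argsort(scores)[::-1][:k])
```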

  22. ConSE(T): experiments

  23. Knowledge Graph

  24. Multiclass classifiers [Figure: a GoogLeNet model topped by either a softmax output layer or independent logistic outputs.]

  25. Object labels have rich relations • Exclusion: an object cannot be both a dog and a cat. • Hierarchical: a corgi or a puppy is also a dog. • Overlap: a corgi may or may not be a puppy.

  26. Visual Model + Knowledge Graph [Deng et al., ECCV 2014] [Figure: visual-model scores (Dog 0.9, Corgi 0.8, Puppy 0.9, Cat 0.1) are combined with a knowledge graph through joint inference over a Hierarchy and Exclusion (HEX) graph, with hierarchical edges from dog to corgi and puppy and an exclusion edge between dog and cat.]

  27. HEX classification model Input scores x ∈ R^n, binary label vector y ∈ {0,1}^n.

      Pr(y | x) = (1/Z(x)) ∏_i φ_i(x_i, y_i) ∏_{i,j} ψ_{i,j}(y_i, y_j)

      Unary potentials (same as logistic regression):
      φ_i(x_i, y_i) = sigmoid(x_i) if y_i = 1, and 1 − sigmoid(x_i) if y_i = 0.

      Pairwise potentials set illegal configurations to zero:
      ψ_{i,j}(y_i, y_j) = 0 if (y_i, y_j) violates a constraint, and 1 otherwise.

      So all illegal configurations have probability zero.
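To make the factorization concrete, here is a brute-force sketch over a three-label toy graph. Note this enumerates all 2^n configurations, which is only viable for tiny n; the paper performs efficient exact inference on the HEX graph structure instead.

```python
import numpy as np
from itertools import product

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hex_distribution(x, legal):
    """Pr(y | x) over binary label vectors, given input scores x and a
    predicate legal(y) encoding the hierarchy/exclusion constraints."""
    n = len(x)
    configs, weights = [], []
    for y in product([0, 1], repeat=n):
        if not legal(y):
            continue  # pairwise potential is zero: probability zero
        # Unary potentials: the same per-label form as logistic regression.
        w = 1.0
        for xi, yi in zip(x, y):
            w *= sigmoid(xi) if yi else 1.0 - sigmoid(xi)
        configs.append(y)
        weights.append(w)
    Z = sum(weights)  # the partition function Z(x)
    return {y: w / Z for y, w in zip(configs, weights)}

# Toy labels: (dog, cat, corgi). Hierarchy edge: corgi implies dog.
# Exclusion edge: dog and cat cannot both hold.
def legal(y):
    dog, cat, corgi = y
    if corgi and not dog:
        return False
    if dog and cat:
        return False
    return True

for y, p in hex_distribution(np.array([2.0, -1.0, 1.5]), legal).items():
    print(y, round(p, 3))
```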

  28. Exp: Learning with weak labels • ILSVRC 2012: “relabel” or “weaken” a portion of the fine-grained leaf labels (e.g. corgi, husky) to basic-level labels (e.g. dog). • Evaluate on fine-grained recognition. [Figure: training uses the original ILSVRC 2012 hierarchy (animal → dog → corgi, husky) with “weakened” labels; testing uses the leaf labels.]

  29. Exp: Learning with weak labels • ILSVRC 2012: “relabel” or “weaken” a portion of the fine-grained leaf labels to basic-level labels. • Evaluate on fine-grained recognition. • Consistently outperforms baselines. [Results table: top-1 accuracy (top-5 accuracy).]

  30. What about textual descriptions? • We have considered the long tail of objects. • What about more complex descriptions, involving multiple words, or captions? • We can use language models to help.

  31. Neural Image Caption Generator [Vinyals et al., CVPR 2015] [Figure: a deep CNN (vision) feeds a generating RNN (language), producing captions such as “Two pizzas sitting on top of a stove top oven.” / “A pizza sitting on top of a pan on top of a stove.” and “A group of people shopping at an outdoor market. There are many vegetables at the fruit stand.”]

  32. NIC: objective • Let I be an image (pixels). • Let S = (S_0, ..., S_N) be the corresponding sentence (a sequence of words). • Log-likelihood of producing the right sentence given the image:

      log p(S | I) = Σ_{t=0}^{N} log p(S_t | I, S_0, ..., S_{t−1})

      • We maximize this likelihood over all training pairs:

      θ* = argmax_θ Σ_{(I,S)} log p(S | I; θ)
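In code, the inner sum is just the log-probability the decoder assigns to each gold word in turn; a sketch with assumed inputs from a hypothetical trained model:

```python
import numpy as np

def caption_log_likelihood(step_probs, word_ids):
    """log p(S | I) = sum_t log p(S_t | I, S_0..S_{t-1}).
    step_probs[t] is the decoder's distribution over the vocabulary after
    seeing the image and the first t words; word_ids are the gold word ids,
    ending in <end>. Training maximizes this over all (image, sentence) pairs."""
    t = np.arange(len(word_ids))
    return float(np.log(step_probs[t, word_ids]).sum())
```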

  33. NIC: model [Figure: a convolutional neural net encodes the image; a recurrent neural net then generates the caption, emitting P(word 1), P(word 2), ..., P(<end>), with each word embedded and fed back as the next input.]

  34. Examples [Figure: generated captions grouped by human rating.]
      Describes without errors: “A person riding a motorcycle on a dirt road.” / “A refrigerator filled with lots of food and drinks.” / “A herd of elephants walking across a dry grass field.”
      Describes with minor errors: “A skateboarder does a trick on a ramp.” / “A group of young people playing a game of frisbee.” / “A close up of a cat laying on a couch.”
      Somewhat related to the image: “A dog is jumping to catch a frisbee.” / “Two hockey players are fighting over the puck.” / “A yellow school bus parked in a parking lot.”
      Unrelated to the image: “Two dogs play in the grass.” / “A little girl in a pink hat is blowing bubbles.” / “A red motorcycle parked on the side of the road.”

  35. It doesn’t always work… Human: A blue and black dress ... No! I see white and gold! Our model: A close up of a vase with flowers.

  36. Scheduled Sampling [NIPS 2015]

  37. Scheduled Sampling
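The slides present this in figures; in essence (following the NIPS 2015 paper, not code from the talk), at each decoding step during training one flips a coin to decide whether the previous input is the gold word or the model's own prediction, with the teacher-forcing probability decaying over training. A minimal sketch:

```python
import math
import random

def previous_word(gold, predicted, step, k=100.0):
    """Scheduled sampling: with probability eps feed the gold previous word
    (teacher forcing), otherwise feed the model's own prediction. eps decays
    via an inverse-sigmoid schedule; k is a tunable constant, not a value
    from the talk."""
    eps = k / (k + math.exp(step / k))  # ~1 early in training, -> 0 later
    return gold if random.random() < eps else predicted
```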
