Visually Grounded Neural Syntax Acquisition (ACL 2019 presentation slides)

  1. Visually Grounded Neural Syntax Acquisition. Haoyue Shi, Jiayuan Mao, Kevin Gimpel, Karen Livescu. July 29th, 2019, @ACL

  2. When we were children… A cat is on the lawn.

  3. When we were children… A cat is on the lawn. A cat sleeps outside.

  4. When we were children… A cat is on the lawn. A cat sleeps outside. A cat, as a whole, means something concrete.

  5. When we were children… A cat is on the lawn. A cat is staring at you. A cat plays with a ball. A cat sleeps outside. A cat is on the ground. There is a cat sleeping on the ground. A cat, as a whole, means something concrete.

  6. When we were children… A cat is on the lawn. A cat is staring at you. A cat plays with a ball. A cat sleeps outside. A cat is on the ground. There is a cat sleeping on the ground. A cat, as a whole, means something concrete. A cat, as a whole, functions as a single unit in sentences.

  7. When we were children… A cat is on the lawn. A cat is staring at you. A cat plays with a ball. A cat was chasing a mouse. A dog was chasing a cat. A cat was chased by a dog. … A cat sleeps outside. A cat is on the ground. There is a cat sleeping on the ground. A cat, as a whole, functions as a single unit in sentences.

  8. Problem Definition • Given a large set of parallel image-text data (e.g., MSCOCO), can we generate linguistically plausible structure for the text? Figure credit: Ding et al. (2018)

  9. Problem Definition • Given a large set of parallel image-text data (e.g., MSCOCO), can we generate linguistically plausible structure for the text? A cat is on the lawn

  10. Problem Definition • Given a large set of parallel image-text data (e.g., MSCOCO), can we generate linguistically plausible structure for the text? A cat is on the lawn

  11. Problem Definition • Given a large set of parallel image-text data (e.g., MSCOCO), can we generate linguistically plausible structure for the text? A cat is on the lawn

  12. Visually Grounded Neural Syntax Learner • Concrete spans are more likely to be constituents. Caption: “A cat is on the lawn” → Text Encoder → Parser → Constituency Parse Tree. Image → Image Encoder (ResNet-101; He et al., 2015). Both feed a Joint Embedding Space containing the constituent embeddings c1: a cat, c2: the lawn, c3: on the lawn, … The estimated concreteness of these spans is used as scores for the parser.
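To make the diagram concrete, here is a minimal PyTorch sketch of the two encoders feeding the joint embedding space: ResNet-101 image features (He et al., 2015) and span embeddings are each projected into a shared space and L2-normalized, so cosine similarity becomes a dot product. The class name, dimensions, and simple linear projections are illustrative assumptions, not the authors' exact architecture.

```python
import torch.nn as nn
import torch.nn.functional as F
import torchvision.models as models

class JointEmbedder(nn.Module):
    """Illustrative image/text encoders for the joint embedding space (not the paper's exact setup)."""
    def __init__(self, word_dim=300, joint_dim=512):
        super().__init__()
        resnet = models.resnet101(weights=None)                  # ResNet-101 backbone (He et al., 2015)
        self.cnn = nn.Sequential(*list(resnet.children())[:-1])  # drop the classification head
        self.img_proj = nn.Linear(2048, joint_dim)               # project pooled image features
        self.txt_proj = nn.Linear(word_dim, joint_dim)           # project span/caption embeddings

    def embed_image(self, images):                 # images: (B, 3, H, W)
        feats = self.cnn(images).flatten(1)        # (B, 2048) pooled ResNet features
        return F.normalize(self.img_proj(feats), dim=-1)

    def embed_span(self, span_vecs):               # span_vecs: (B, word_dim)
        return F.normalize(self.txt_proj(span_vecs), dim=-1)
```

Captions and, later, candidate constituents would be embedded with embed_span and compared against embed_image outputs using the triplet loss introduced on the following slides.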

  13. Visually Grounded Neural Syntax Learner • Concrete spans are more likely to be constituents. Caption: “A cat is on the lawn” → Parser → Constituency Parse Tree. Candidate constituents: c1: a cat, c2: the lawn, c3: on the lawn, …

  14. Greedy Bottom-Up Parser. Start with one constituent per token: a cat is on the lawn

  15. Greedy Bottom-Up Parser. Compute a score for the adjacent pair (a, cat): FFN(w_a, w_cat) = 4.5. Pair scores so far over “a cat is on the lawn”: 4.5

  16. Greedy Bottom-Up Parser. Compute a score for (cat, is): FFN(w_cat, w_is) = 0.5. Pair scores so far: 4.5, 0.5

  17. Greedy Bottom-Up Parser. Compute a score for (is, on): FFN(w_is, w_on) = 1. Pair scores so far: 4.5, 0.5, 1

  18. Greedy Bottom-Up Parser. Compute a score for (on, the): FFN(w_on, w_the) = 1. Pair scores so far: 4.5, 0.5, 1, 1

  19. Greedy Bottom-Up Parser. Compute a score for (the, lawn): FFN(w_the, w_lawn) = 3. Pair scores so far: 4.5, 0.5, 1, 1, 3

  20. Greedy Bottom-Up Parser. Normalize the pair scores to a probability distribution: 0.45, 0.05, 0.1, 0.1, 0.3 over the adjacent pairs of “a cat is on the lawn”

  21. Greedy Bottom-Up Parser. Sample a pair to combine (training); greedily combine the highest-probability pair (inference). Distribution: 0.45, 0.05, 0.1, 0.1, 0.3 over the adjacent pairs of “a cat is on the lawn”
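A minimal sketch of the scoring step on slides 14-21, assuming (as the FFN notation suggests) that each pair of adjacent constituents is scored by a small feed-forward network over the concatenated pair embeddings, and that normalization divides by the sum of the scores, which reproduces the numbers on slide 20 (4.5, 0.5, 1, 1, 3 → 0.45, 0.05, 0.1, 0.1, 0.3). PairScorer and choose_pair are illustrative names, not the authors' code.

```python
import torch
import torch.nn as nn

class PairScorer(nn.Module):
    """Scores each pair of adjacent constituents with a small feed-forward net (illustrative)."""
    def __init__(self, dim):
        super().__init__()
        self.ffn = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, spans):                               # spans: (n, dim) constituent embeddings
        pairs = torch.cat([spans[:-1], spans[1:]], dim=-1)  # concatenate each adjacent pair
        return self.ffn(pairs).squeeze(-1)                  # (n - 1,) one score per pair

def choose_pair(scores, training=True):
    """Normalize pair scores to a distribution and pick the pair to combine:
    sample during training, take the most probable pair at inference."""
    pos = scores.clamp(min=0)            # the slide's example assumes non-negative scores
    probs = pos / pos.sum()              # sum-normalization, matching the numbers on slide 20
    if training:
        return int(torch.multinomial(probs, 1))
    return int(torch.argmax(probs))
```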

  22. Greedy Bottom-Up Parser. Combine the chosen pair (a, cat). The textual representation of the new span is the normalized sum of its children: w_(a cat) = (w_a + w_cat) / ‖w_a + w_cat‖_2

  23. Greedy Bottom-Up Parser. After combining: (a cat) is on the lawn, with w_(a cat) = (w_a + w_cat) / ‖w_a + w_cat‖_2

  24. Greedy Bottom-Up Parser. Recompute the probabilities over the adjacent pairs of (a cat) is on the lawn: 0.25, 0.15, 0.15, 0.45

  25. Greedy Bottom-Up Parser. Combine the chosen pair (the, lawn): (a cat) is on (the lawn)

  26. Greedy Bottom-Up Parser. Repeat until a single constituent remains. Finished! ((a cat) (is (on (the lawn))))
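Putting slides 14-26 together, a sketch of the full bottom-up loop at inference time: repeatedly score adjacent pairs, merge the best one, and represent the merged span by the L2-normalized sum of its children. greedy_parse is an illustrative name and assumes a scorer like the PairScorer sketched above; with scores like the ones on these slides it returns the bracketing of slide 26.

```python
import torch
import torch.nn.functional as F

def greedy_parse(word_vecs, tokens, scorer):
    """word_vecs: list of per-token embeddings; tokens: list of words; scorer: maps
    an (n, dim) stack of constituents to (n - 1,) adjacent-pair scores."""
    spans = list(word_vecs)                  # current constituent embeddings
    trees = list(tokens)                     # bracketed string for each constituent
    while len(spans) > 1:
        scores = scorer(torch.stack(spans))
        k = int(torch.argmax(scores))        # greedy choice (sampled during training instead)
        merged = F.normalize(spans[k] + spans[k + 1], dim=-1)  # normalized sum of children
        trees[k:k + 2] = [f"({trees[k]} {trees[k + 1]})"]
        spans[k:k + 2] = [merged]
    return trees[0]                          # e.g. "((a cat) (is (on (the lawn))))"
```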

  27. Visually Grounded Neural Syntax Learner • Concrete spans are more likely to be constituents. Caption: “A cat is on the lawn” → Parser → Constituency Parse Tree. Candidate constituents: c1: a cat, c2: the lawn, c3: on the lawn, …

  28. Visually Grounded Neural Syntax Learner • Concrete spans are more likely to be constituents. Caption: “A cat is on the lawn” → Parser → Constituency Parse Tree, now with the paired Image added. Candidate constituents: c1: a cat, c2: the lawn, c3: on the lawn, …

  29. Visually Grounded Neural Syntax Learner • Concrete spans are more likely to be constituents. Caption: “A cat is on the lawn” → Text Encoder → Parser → Constituency Parse Tree. Image → Image Encoder (ResNet-101; He et al., 2015). Both feed a Joint Embedding Space containing the constituent embeddings c1: a cat, c2: the lawn, c3: on the lawn, …

  30. The Joint Embedding Space. Hinge-based triplet loss between images and captions for visual-semantic embeddings (VSE; Kiros et al., 2015): ℒ(i, c) = Σ_{(i′, c′) ≠ (i, c)} ( [sim(i′, c) − sim(i, c) + ε]_+ + [sim(i, c′) − sim(i, c) + ε]_+ ), where [x]_+ = max(x, 0) and sim(·, ·) = cos(·, ·).

  31. The Joint Embedding Space. Under the triplet loss above, the matched caption “A cat is on the lawn.” and its image form a positive pair (✓) and should have high similarity.

  32. The Joint Embedding Space. Under the triplet loss above, the matched pair (✓) should score higher than the same caption “A cat is on the lawn.” paired with a mismatched image (✗).

  33. The Joint Embedding Space. Under the triplet loss above, the matched pair (✓, “A cat is on the lawn.”) should score higher than both a mismatched image for the same caption (✗) and a mismatched caption for the same image (✗, “A dog is on the lawn.”).
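The loss on slides 30-33, written out for a single matched (image, caption) pair against sampled negatives. The embeddings are assumed to be L2-normalized so that a dot product equals the cosine similarity used above; the margin value and the name vse_triplet_loss are illustrative.

```python
import torch

def vse_triplet_loss(img, cap, neg_imgs, neg_caps, margin=0.2):
    """img, cap: (d,) embeddings of a matched image/caption pair (✓);
    neg_imgs, neg_caps: (k, d) embeddings of mismatched images / captions (✗)."""
    pos = img @ cap                                                  # sim(i, c)
    loss_img = torch.clamp(neg_imgs @ cap - pos + margin, min=0.0)   # [sim(i', c) - sim(i, c) + ε]_+
    loss_cap = torch.clamp(neg_caps @ img - pos + margin, min=0.0)   # [sim(i, c') - sim(i, c) + ε]_+
    return loss_img.sum() + loss_cap.sum()
```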

  34. Concreteness Estimation in the Joint Embedding Space. The same hinge-based triplet loss is computed between images and constituents (rather than full captions): ℒ(i, c) = Σ_{(i′, c′) ≠ (i, c)} ( [sim(i′, c) − sim(i, c) + ε]_+ + [sim(i, c′) − sim(i, c) + ε]_+ ), with [x]_+ = max(x, 0) and sim(·, ·) = cos(·, ·).

  35. Concreteness Estimation in the Joint Embedding Space. A concrete constituent such as “a cat” matches the image well (✓).

  36. Concreteness Estimation in the Joint Embedding Space. A concrete constituent such as “a cat” matches the image well (✓), while a span such as “on the” has no clear visual counterpart (?).

  37. Concreteness Estimation in the Joint Embedding Space. Abstractness is the local hinge loss between a constituent and the image: abstract(c; i) = ℒ(i, c). Concrete spans such as “a cat” (✓) incur a small loss; ungroundable spans such as “on the” (?) incur a large one.
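Slide 37's definition, abstract(c; i) = ℒ(i, c), applied to every candidate span of a caption. This sketch reuses vse_triplet_loss from above; span_abstractness is an illustrative name. Per the overview slide, the concreteness estimated this way is what the parser consumes as scores.

```python
import torch

def span_abstractness(span_vecs, img, neg_imgs, neg_caps, margin=0.2):
    """span_vecs: list of (d,) embeddings for candidate spans, e.g. "a cat", "on the".
    Lower hinge loss = more visually concrete span; reuses vse_triplet_loss above."""
    return torch.stack([vse_triplet_loss(img, c, neg_imgs, neg_caps, margin)
                        for c in span_vecs])
```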
