Visually Grounded Neural Syntax Acquisition * * Haoyue Shi Jiayuan Mao Kevin Gimpel Karen Livescu July 29th, 2019 @ACL
When we were children… A cat is on the lawn.
When we were children… A cat is on the lawn. A cat sleeps outside.
When we were children… A cat is on the lawn. A cat, as a whole, means something concrete. A cat sleeps outside.
When we were children… A cat is on the lawn. A cat is staring at you. A cat plays with a ball. A cat, as a whole, means something concrete. A cat sleeps outside. A cat is on the ground. There is a cat sleeping on the ground.
When we were children… A cat is on the lawn. A cat is staring at you. A cat plays with a ball. A cat, as a whole, means something concrete. A cat sleeps outside. A cat is on the ground. There is a cat sleeping on the ground. A cat, as a whole, functions as a single unit in sentences.
When we were children… A cat is on the lawn. A cat is staring at you. A cat plays with a ball. A cat was chasing a mouse. A dog was chasing a cat. A cat was chased by a dog. … A cat sleeps outside. A cat is on the ground. There is a cat sleeping on the ground. A cat, as a whole, functions as a single unit in sentences.
Problem Definition • Given a large set of parallel image-text data (e.g., MSCOCO), can we generate linguistically plausible structure for the text? Figure credit: Ding et al. (2018)
Problem Definition • Given a large set of parallel image-text data (e.g., MSCOCO), can we generate linguistically plausible structure for the text? A cat is on the lawn
Visually Grounded Neural Syntax Learner • Concrete spans are more likely to be constituents. [Diagram: Caption “A cat is on the lawn” → Parser → Constituency Parse Tree with constituents $c_1$: a cat, $c_2$: the lawn, $c_3$: on the lawn, … → Text Encoder; Image → Image Encoder: ResNet-101 (He et al., 2015); both encoders map into the Joint Embedding Space; Estimated Concreteness as Scores]
Visually Grounded Neural Syntax Learner • Concrete spans are more likely to be constituents. [Diagram: Caption “A cat is on the lawn” → Parser → Constituency Parse Tree with constituents $c_1$: a cat, $c_2$: the lawn, $c_3$: on the lawn, …]
Greedy Bottom-Up Parser a cat is on the lawn
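Greedy Bottom-Up Parser Compute score $\mathrm{FFN}(\mathbf{v}_{\text{a}}, \mathbf{v}_{\text{cat}}) = 4.5$ Pair scores: 4.5 a cat is on the lawn
Greedy Bottom-Up Parser Compute score $\mathrm{FFN}(\mathbf{v}_{\text{cat}}, \mathbf{v}_{\text{is}}) = 0.5$ Pair scores: 4.5 0.5 a cat is on the lawn
Greedy Bottom-Up Parser Compute score $\mathrm{FFN}(\mathbf{v}_{\text{is}}, \mathbf{v}_{\text{on}}) = 1$ Pair scores: 4.5 0.5 1 a cat is on the lawn
Greedy Bottom-Up Parser Compute score $\mathrm{FFN}(\mathbf{v}_{\text{on}}, \mathbf{v}_{\text{the}}) = 1$ Pair scores: 4.5 0.5 1 1 a cat is on the lawn
Greedy Bottom-Up Parser Compute score $\mathrm{FFN}(\mathbf{v}_{\text{the}}, \mathbf{v}_{\text{lawn}}) = 3$ Pair scores: 4.5 0.5 1 1 3 a cat is on the lawn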
Greedy Bottom-Up Parser Normalized to a probability distribution 0.45 0.05 0.1 0.1 0.3 a cat is on the lawn
Greedy Bottom-Up Parser 0.45 0.05 0.1 0.1 0.3 Sample a pair to combine (training) Greedily combine (inference) a cat is on the lawn
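The normalize-then-select step can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the slide's numbers (4.5, 0.5, 1, 1, 3 becoming 0.45, 0.05, 0.1, 0.1, 0.3) are consistent with dividing each score by the sum, so that is what `normalize` does; a softmax would be another common choice. The names `normalize` and `choose_pair` are hypothetical.

```python
import random

def normalize(scores):
    """Divide non-negative scores by their sum so they form a distribution."""
    total = sum(scores)
    return [s / total for s in scores]

def choose_pair(probs, training, rng=random):
    """Return the index k of the adjacent pair (k, k + 1) to combine."""
    if training:
        # Training: sample a pair in proportion to its probability.
        return rng.choices(range(len(probs)), weights=probs, k=1)[0]
    # Inference: greedily take the most probable pair.
    return max(range(len(probs)), key=probs.__getitem__)

words = ["a", "cat", "is", "on", "the", "lawn"]
scores = [4.5, 0.5, 1.0, 1.0, 3.0]  # adjacent-pair scores from the slides
probs = normalize(scores)
best = choose_pair(probs, training=False)
sampled = choose_pair(probs, training=True, rng=random.Random(0))
```

At inference time `best` points at the pair "a cat"; during training any pair can be drawn, with "a cat" the most likely.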
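Greedy Bottom-Up Parser Textual representation: Normalized sum of children (a cat) $\mathbf{v}_{\text{a cat}} = \frac{\mathbf{v}_{\text{a}} + \mathbf{v}_{\text{cat}}}{\lVert \mathbf{v}_{\text{a}} + \mathbf{v}_{\text{cat}} \rVert_2}$ 0.45 0.05 0.1 0.1 0.3 a cat is on the lawn
Greedy Bottom-Up Parser Textual representation: Normalized sum of children (a cat) is on the lawn $\mathbf{v}_{\text{a cat}} = \frac{\mathbf{v}_{\text{a}} + \mathbf{v}_{\text{cat}}}{\lVert \mathbf{v}_{\text{a}} + \mathbf{v}_{\text{cat}} \rVert_2}$ 0.45 0.05 0.1 0.1 0.3 a cat is on the lawn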
Greedy Bottom-Up Parser Compute probability 0.25 0.15 0.15 0.45 (a cat) is on the lawn 0.45 0.05 0.1 0.1 0.3 a cat is on the lawn
Greedy Bottom-Up Parser Combine (a cat) is on (the lawn) 0.25 0.15 0.15 0.45 (a cat) is on the lawn 0.45 0.05 0.1 0.1 0.3 a cat is on the lawn
Greedy Bottom-Up Parser Finished! ((a cat) (is (on (the lawn)))) … (a cat) is on (the lawn) 0.25 0.15 0.15 0.45 (a cat) is on the lawn 0.45 0.05 0.1 0.1 0.3 a cat is on the lawn
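Putting the steps together, the inference-time loop might look like the sketch below. It is an illustration under stated assumptions: `toy_score` is a hypothetical dot-product stand-in for the learned scoring function on the slides, and the two-dimensional word vectors are made up. Only the control flow (score adjacent pairs, merge the best one, represent the parent as the normalized sum of its children, repeat) follows the slides.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def combine(left, right):
    """Parent representation: normalized sum of the two children."""
    return l2_normalize([a + b for a, b in zip(left, right)])

def greedy_parse(words, vectors, score_fn):
    """Merge the highest-scoring adjacent pair until a single tree remains."""
    trees = list(words)
    reps = [list(v) for v in vectors]
    while len(trees) > 1:
        scores = [score_fn(reps[k], reps[k + 1]) for k in range(len(reps) - 1)]
        k = max(range(len(scores)), key=scores.__getitem__)
        # Replace the pair by its parent, in both the tree and vector lists.
        trees[k:k + 2] = [(trees[k], trees[k + 1])]
        reps[k:k + 2] = [combine(reps[k], reps[k + 1])]
    return trees[0]

def toy_score(left, right):
    # Hypothetical stand-in for the learned scorer: a plain dot product.
    return sum(a * b for a, b in zip(left, right))

# Made-up vectors, chosen so that "a" and "cat" score highest together.
words = ["a", "cat", "sleeps"]
vectors = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]
tree = greedy_parse(words, vectors, toy_score)
parent = combine(vectors[0], vectors[1])
```

Swapping the `max` for a sampling step gives the training-time variant described on the slides.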
Visually Grounded Neural Syntax Learner • Concrete spans are more likely to be constituents. [Diagram: Caption “A cat is on the lawn” → Parser → Constituency Parse Tree with constituents $c_1$: a cat, $c_2$: the lawn, $c_3$: on the lawn, …]
Visually Grounded Neural Syntax Learner • Concrete spans are more likely to be constituents. [Diagram: Caption “A cat is on the lawn” → Parser → Constituency Parse Tree with constituents $c_1$: a cat, $c_2$: the lawn, $c_3$: on the lawn, …; Image]
Visually Grounded Neural Syntax Learner • Concrete spans are more likely to be constituents. [Diagram: Caption “A cat is on the lawn” → Parser → Constituency Parse Tree with constituents $c_1$: a cat, $c_2$: the lawn, $c_3$: on the lawn, … → Text Encoder; Image → Image Encoder: ResNet-101 (He et al., 2015); both encoders map into the Joint Embedding Space]
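The Joint Embedding Space Hinge-based triplet loss between images and captions for visual semantic embeddings (VSE; Kiros et al., 2015): $\mathcal{L}(i, c) = \sum_{(i', c') \neq (i, c)} \left[ \mathrm{sim}(i', c) - \mathrm{sim}(i, c) + \delta \right]_+ + \left[ \mathrm{sim}(i, c') - \mathrm{sim}(i, c) + \delta \right]_+$, where $[\cdot]_+ = \max(\cdot, 0)$ and $\mathrm{sim}(\cdot, \cdot) = \cos(\cdot, \cdot)$
The Joint Embedding Space Hinge-based triplet loss between images and captions for visual semantic embeddings (VSE; Kiros et al., 2015): $\mathcal{L}(i, c) = \sum_{(i', c') \neq (i, c)} \left[ \mathrm{sim}(i', c) - \mathrm{sim}(i, c) + \delta \right]_+ + \left[ \mathrm{sim}(i, c') - \mathrm{sim}(i, c) + \delta \right]_+$, where $[\cdot]_+ = \max(\cdot, 0)$ and $\mathrm{sim}(\cdot, \cdot) = \cos(\cdot, \cdot)$ Example: √ (image, “A cat is on the lawn.”)
The Joint Embedding Space Hinge-based triplet loss between images and captions for visual semantic embeddings (VSE; Kiros et al., 2015): $\mathcal{L}(i, c) = \sum_{(i', c') \neq (i, c)} \left[ \mathrm{sim}(i', c) - \mathrm{sim}(i, c) + \delta \right]_+ + \left[ \mathrm{sim}(i, c') - \mathrm{sim}(i, c) + \delta \right]_+$, where $[\cdot]_+ = \max(\cdot, 0)$ and $\mathrm{sim}(\cdot, \cdot) = \cos(\cdot, \cdot)$ Example: √ (image, “A cat is on the lawn.”); × (different image, “A cat is on the lawn.”)
The Joint Embedding Space Hinge-based triplet loss between images and captions for visual semantic embeddings (VSE; Kiros et al., 2015): $\mathcal{L}(i, c) = \sum_{(i', c') \neq (i, c)} \left[ \mathrm{sim}(i', c) - \mathrm{sim}(i, c) + \delta \right]_+ + \left[ \mathrm{sim}(i, c') - \mathrm{sim}(i, c) + \delta \right]_+$, where $[\cdot]_+ = \max(\cdot, 0)$ and $\mathrm{sim}(\cdot, \cdot) = \cos(\cdot, \cdot)$ Example: √ (image, “A cat is on the lawn.”); × (different image, “A cat is on the lawn.”); × (image, “A dog is on the lawn.”)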
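The hinge-based triplet loss can be written out directly from its definition. The sketch below assumes image and caption embeddings are plain lists of floats and that the mismatched pairs are passed in explicitly; the function names and the default margin of 0.2 are illustrative assumptions, not values taken from the slides.

```python
import math

def cos_sim(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def hinge(x):
    """[x]_+ = max(x, 0)."""
    return max(x, 0.0)

def triplet_loss(image, caption, negatives, margin=0.2):
    """Sum both hinge terms over mismatched (image', caption') pairs.

    `margin` plays the role of the margin in the loss; 0.2 is an assumed value.
    """
    pos = cos_sim(image, caption)
    loss = 0.0
    for neg_image, neg_caption in negatives:
        # A different image should not match this caption better.
        loss += hinge(cos_sim(neg_image, caption) - pos + margin)
        # A different caption should not match this image better.
        loss += hinge(cos_sim(image, neg_caption) - pos + margin)
    return loss

# Made-up embeddings: a well-aligned pair and a badly aligned one.
aligned = triplet_loss([1.0, 0.0], [1.0, 0.0], [([0.0, 1.0], [0.0, 1.0])])
misaligned = triplet_loss([1.0, 0.0], [0.0, 1.0], [([0.0, 1.0], [1.0, 0.0])])
```

The aligned pair incurs no loss because both mismatches fall below the positive similarity by more than the margin, while the misaligned pair is penalized on both terms.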
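Concreteness Estimation in the Joint Embedding Space Hinge-based triplet loss between images and constituents (in place of whole captions) for visual semantic embeddings: $\mathcal{L}(i, c) = \sum_{(i', c') \neq (i, c)} \left[ \mathrm{sim}(i', c) - \mathrm{sim}(i, c) + \delta \right]_+ + \left[ \mathrm{sim}(i, c') - \mathrm{sim}(i, c) + \delta \right]_+$, where $[\cdot]_+ = \max(\cdot, 0)$ and $\mathrm{sim}(\cdot, \cdot) = \cos(\cdot, \cdot)$
Concreteness Estimation in the Joint Embedding Space Hinge-based triplet loss between images and constituents (in place of whole captions) for visual semantic embeddings: $\mathcal{L}(i, c) = \sum_{(i', c') \neq (i, c)} \left[ \mathrm{sim}(i', c) - \mathrm{sim}(i, c) + \delta \right]_+ + \left[ \mathrm{sim}(i, c') - \mathrm{sim}(i, c) + \delta \right]_+$, where $[\cdot]_+ = \max(\cdot, 0)$ and $\mathrm{sim}(\cdot, \cdot) = \cos(\cdot, \cdot)$ Example: √ “a cat”
Concreteness Estimation in the Joint Embedding Space Hinge-based triplet loss between images and constituents (in place of whole captions) for visual semantic embeddings: $\mathcal{L}(i, c) = \sum_{(i', c') \neq (i, c)} \left[ \mathrm{sim}(i', c) - \mathrm{sim}(i, c) + \delta \right]_+ + \left[ \mathrm{sim}(i, c') - \mathrm{sim}(i, c) + \delta \right]_+$, where $[\cdot]_+ = \max(\cdot, 0)$ and $\mathrm{sim}(\cdot, \cdot) = \cos(\cdot, \cdot)$ Example: √ “a cat”; ? “on the”
Concreteness Estimation in the Joint Embedding Space Hinge-based triplet loss between images and constituents (in place of whole captions) for visual semantic embeddings: $\mathcal{L}(i, c) = \sum_{(i', c') \neq (i, c)} \left[ \mathrm{sim}(i', c) - \mathrm{sim}(i, c) + \delta \right]_+ + \left[ \mathrm{sim}(i, c') - \mathrm{sim}(i, c) + \delta \right]_+$, where $[\cdot]_+ = \max(\cdot, 0)$ and $\mathrm{sim}(\cdot, \cdot) = \cos(\cdot, \cdot)$ Example: √ “a cat”; ? “on the” Abstractness: local hinge loss between constituents and images. $\mathrm{abstract}(c; i) = \mathcal{L}(i, c)$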
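To see how abstractness separates a span like "a cat" from one like "on the", here is a small self-contained sketch that applies the same hinge loss to constituent embeddings. All vectors are made up for illustration, and the function name `abstractness` and the margin of 0.2 are assumptions: "a cat" is placed close to its image, while "on the" is placed equally close to every image.

```python
import math

def cos_sim(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def abstractness(constituent, image, negatives, margin=0.2):
    """Local hinge loss between a constituent and its image (margin assumed)."""
    pos = cos_sim(image, constituent)
    total = 0.0
    for neg_image, neg_constituent in negatives:
        total += max(cos_sim(neg_image, constituent) - pos + margin, 0.0)
        total += max(cos_sim(image, neg_constituent) - pos + margin, 0.0)
    return total

image = [1.0, 0.0]                        # hypothetical embedding of the cat image
negatives = [([0.0, 1.0], [0.0, 1.0])]    # one unrelated image and constituent
a_cat = [1.0, 0.0]                        # aligned with the image
on_the = [1.0, 1.0]                       # equally similar to both images

score_a_cat = abstractness(a_cat, image, negatives)
score_on_the = abstractness(on_the, image, negatives)
```

With these toy vectors, "on the" receives the higher abstractness because an unrelated image matches it just as well as the paired one, which is the behavior the "?" on the slide hints at.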