Language modeling: a generative story of text

p(the cat chased the) = p(the) · p(cat | the) · p(chased | the cat) · p(the | the cat chased)

Text generation with an RNN over a lexicon / vocabulary:

  type w   embedding e(w)       spelling σ(w)
  ①        [0.2, ···, 0.0]      the
  ②        [0.4, ···, 0.5]      cat
  ③        [−0.1, ···, 0.2]     chased
  ④        [0.3, ···, 0.1]      UNK

[Figure: a word-level RNN, one cell per word, generating "the cat caged the"; the out-of-vocabulary word "caged" comes out as UNK]
...but what is the word?

Pure character-level model as the solution?
[Figure: a character-level RNN, one cell per character, spelling out "the cat chased the"]
Ugh, spelling "the" again... can't we memorize it?
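The chain-rule factorization above can be sketched in a few lines of Python. This is only an illustration: the function name `sequence_log2_prob` and the uniform toy model `uniform_cond_prob` are assumptions made for this example, not part of the slides; a real model would supply RNN-based conditionals.

```python
import math

def sequence_log2_prob(tokens, cond_prob):
    """Chain rule: log2 p(w_1..w_n) = sum_i log2 p(w_i | w_1..w_{i-1})."""
    total = 0.0
    for i, word in enumerate(tokens):
        history = tuple(tokens[:i])
        total += math.log2(cond_prob(word, history))
    return total

# Hypothetical toy model: uniform over a 4-type vocabulary, ignoring history.
VOCAB = ("the", "cat", "chased", "UNK")

def uniform_cond_prob(word, history):
    return 1.0 / len(VOCAB)

# Four tokens at 2 bits each.
log_p = sequence_log2_prob(["the", "cat", "chased", "the"], uniform_cond_prob)
```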
Our model: Spell once, summon anywhere

Known words only have to be spelled out once, and can then be summoned anywhere.

Spell once: each type in the lexicon has an embedding, from which a character-level RNN generates its spelling.

  e(①) → RNN cells → t h e          = σ(①)
  e(②) → RNN cells → c a t          = σ(②)
  e(③) → RNN cells → c h a s e d    = σ(③)

Summon anywhere: a word-level RNN with hidden states h_1, h_2, h_3, h_4 chooses the type of each token. The chosen type's embedding is looked up to feed the next RNN step, and its spelling is looked up to produce the text.

  token      w_1      w_2      w_3      w_4
  type       ①        ②        ③        ①
  spelling   σ(w_1)   σ(w_2)   σ(w_3)   σ(w_4)
  text       the      cat      chased   the
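As a rough sketch of the data structure this implies, the lexicon can be seen as a table from type ids to (embedding, spelling) pairs, with tokens stored only as type ids. The names `LexiconEntry`, `LEXICON`, `summon`, and `embed` are hypothetical, and the two-dimensional embeddings are just the first and last entries shown on the earlier slide, not real parameters.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LexiconEntry:
    embedding: tuple  # e(w): what the word-level RNN sees
    spelling: str     # sigma(w): generated once, at the type level

# Toy lexicon mirroring the slide example (embeddings truncated for illustration).
LEXICON = {
    1: LexiconEntry((0.2, 0.0), "the"),
    2: LexiconEntry((0.4, 0.5), "cat"),
    3: LexiconEntry((-0.1, 0.2), "chased"),
}

def summon(type_ids):
    """Look up spellings: turn a sequence of type ids into surface text."""
    return " ".join(LEXICON[t].spelling for t in type_ids)

def embed(type_ids):
    """Look up embeddings: the inputs for the word-level RNN."""
    return [LEXICON[t].embedding for t in type_ids]

text = summon([1, 2, 3, 1])      # "the" is spelled once, used twice
inputs = embed([1, 2, 3, 1])
```

Note that `embed` never touches a spelling: at the token level, only embeddings are used.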
Our model: Spell once, summon anywhere – the open-vocabulary case

Known words only have to be spelled out once, and can then be summoned anywhere.
Unknown words are spelled out "on-demand" using the same character-level model.

  token      w_1      w_2      w_3         w_4
  type       ①        ②        UNK         ①
  spelling   σ(w_1)   σ(w_2)   c a g e d   σ(w_4)
  text       the      cat      caged       the

The word-level RNN (hidden states h_1, ..., h_4) predicts UNK for w_3. The character-level RNN then spells the word one character at a time (c, a, g, e, d), and e(UNK) serves as its embedding at the word level. The lexicon itself is unchanged: σ(①) = the, σ(②) = cat, σ(③) = chased.
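The on-demand spelling step can be sketched as follows. This is a simplified illustration under stated assumptions: `spell_on_demand`, `render`, the end-of-word marker `EOW`, and the deterministic stand-in `toy_char_dist` are all hypothetical names for this example; in the actual model the next-character distribution would come from the character-level RNN.

```python
import random

EOW = "</w>"   # hypothetical end-of-word symbol
UNK = "UNK"

def spell_on_demand(next_char_dist, hidden_state, rng, max_len=20):
    """Sample a spelling one character at a time until EOW is drawn."""
    chars = []
    while len(chars) < max_len:
        dist = next_char_dist(hidden_state, "".join(chars))
        symbols = list(dist)
        weights = [dist[s] for s in symbols]
        c = rng.choices(symbols, weights=weights, k=1)[0]
        if c == EOW:
            break
        chars.append(c)
    return "".join(chars)

def render(word_choices, hidden_states, spellings, next_char_dist, rng):
    """Known types are looked up; UNK tokens are spelled from the hidden state."""
    out = []
    for w, h in zip(word_choices, hidden_states):
        if w == UNK:
            out.append(spell_on_demand(next_char_dist, h, rng))
        else:
            out.append(spellings[w])
    return " ".join(out)

def toy_char_dist(hidden_state, prefix):
    """Stand-in for the character-level RNN: always spells 'caged'."""
    target = "caged"
    nxt = target[len(prefix)] if len(prefix) < len(target) else EOW
    return {nxt: 1.0}

SPELLINGS = {1: "the", 2: "cat", 3: "chased"}
text = render([1, 2, UNK, 1], ["h1", "h2", "h3", "h4"],
              SPELLINGS, toy_char_dist, random.Random(0))
```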
Samples from the model

Sampled text from our model:

  Following the death of Edward McCartney in 1060, the new definition was transferred to the WDIC of Fullett.

  (annotation on the sample: novel word with contextually appropriate spelling)

Known spelling → novel spelling sampled from its embedding:

  grounded  → stipped
  differ    → coronate
  Clive     → Dickey
  Southport → Strigger
  Carl      → Wuly
  Chants    → Tranquels
  valuables → migrations

So why is this a good way of modeling language?
Linguistic notions: duality of patterning

  The meaningful elements in any language – "words" in everyday parlance [...] – [...] are represented by [a] small stock of distinguishable sounds which are in themselves wholly meaningless.
  – Hockett, 1960

  (on the slide, "sounds" is annotated with "characters")

"Meaningless" character composition should be separate from "meaningful" word composition!

We should need a word's spelling only to define it – not to later use it.
Duality of patterning ↦ conditional independence!

So? Why does this linguistics blurb matter?

• Irregular words have uncommon spellings (children) ...yet we use them like regular words!
• Function words have uncommon spellings (the, of) ...yet we use them all the time without feeling weird!

Recall: We should need a word's spelling only to define it – not to later use it.

i.e. character-level models do it wrong! ...and they're slow as hell...

usage ⊥ spelling | embedding
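One way to make this conditional independence concrete is to write the joint distribution as a factorization for the closed-vocabulary case. The following LaTeX is a sketch: the symbols $\mathcal{V}$ (the set of word types) and $n$ (the number of tokens) are notation assumed here for illustration, while $e$ and $\sigma$ are the embeddings and spellings from the slides.

```latex
p(e, \sigma, w_1, \dots, w_n)
  = \underbrace{\prod_{v \in \mathcal{V}} p\bigl(e(v)\bigr)\,
      p\bigl(\sigma(v) \mid e(v)\bigr)}_{\text{type level: spell once}}
    \cdot
    \underbrace{\prod_{i=1}^{n} p\bigl(w_i \mid w_1, \dots, w_{i-1},\, e\bigr)}_{\text{token level: summon anywhere}}
```

Spellings appear only in the type-level product, so the token-level (usage) terms depend on a word only through its embedding.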
The arbitrariness of the sign ↦ allowing for idiosyncrasy

How should a word's embedding and its spelling be connected?

  The connection between the signifier [spelling] and the signified [meaning] is arbitrary.
  – de Saussure, 1916, translated

Meaning is not fully predictable from spellings.
Example: neither silly nor folly is an adverb, even though they both end in -ly!
"Construction" models like e(caged) := CNN(c a g e d) ignore this!

⇒ Allow any pairing a priori, but use spellings as prior / regularization!
   Outliers (children, the, ...) may have idiosyncratic embeddings!

Regularize embeddings, don't construct them.
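The contrast between regularizing and constructing can be sketched as an objective in which embeddings are free parameters and spellings contribute only a per-type log-probability term. The function `regularized_log_joint` and the two stub scorers below are hypothetical placeholders for illustration; in the model described here these terms would come from the word-level and character-level RNNs.

```python
def regularized_log_joint(embeddings, spellings, usage_log_lik, spelling_log_prob):
    """Embeddings are free parameters; spellings act as a soft prior on them.

    usage_log_lik(embeddings)      -> log-likelihood of the token sequence
    spelling_log_prob(s, e)        -> log p(spelling s | embedding e), per type
    """
    prior = sum(spelling_log_prob(spellings[w], embeddings[w]) for w in embeddings)
    return usage_log_lik(embeddings) + prior

# Stub scorers (assumptions for this example only, not learned models).
def stub_usage(embeddings):
    return -1.0

def stub_spelling(spelling, embedding):
    return -float(len(spelling))

EMB = {"the": (0.2, 0.0), "children": (0.3, 0.1)}
SPELL = {"the": "the", "children": "children"}
objective = regularized_log_joint(EMB, SPELL, stub_usage, stub_spelling)
```

Because the spelling term is only added to the objective, an outlier such as "children" can keep an embedding that its spelling would not predict, at the cost of a lower spelling score.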
Recap: how does our model implement these ideas?

Embeddings and spellings are connected on the type level, ensuring conditional independence of usage and spelling while assigning positive probability to any pairing!

[Figure: the open-vocabulary example again. Lexicon: σ(①) = the, σ(②) = cat, σ(③) = chased, each spelled by the character-level RNN from e(①), e(②), e(③). Tokens: w_1 = ①, w_2 = ②, w_3 = UNK, w_4 = ①, giving "the cat caged the". The unknown word's spelling is sampled as s ∼ p_spell(· | h_3), and e(UNK) is used as its embedding.]
How do we evaluate open-vocabulary language models?

1. Report likelihood p(held-out text) as bits per character instead of perplexity (↓ lower is better)
2. No UNKing allowed!*
   → we must predict every character of the text, regardless of vocabulary size

⇒ A tunable "vocabulary size" hyperparameter decides what is temporary-UNK.
______________
* Yes, we call some words "UNK" temporarily, but we still generate them fully!
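Bits per character can be computed from the total log-probability the model assigns to the held-out text. The sketch below assumes that every character of the string, including spaces, counts toward the denominator; the function names `nats_to_bits` and `bits_per_character` are chosen for this example.

```python
import math

def nats_to_bits(log_prob_nats):
    """Convert a natural-log probability to a base-2 log probability."""
    return log_prob_nats / math.log(2)

def bits_per_character(total_log2_prob, text):
    """bpc = -log2 p(text) / number of characters in text."""
    if not text:
        raise ValueError("text must be non-empty")
    return -total_log2_prob / len(text)

held_out = "the cat caged the"   # 17 characters, spaces included
# Hypothetical total: the model assigns log2 p(held_out) = -34 bits.
bpc = bits_per_character(-34.0, held_out)
```

Since the denominator is the character count of the raw text, the number is comparable across models with different vocabulary sizes, as long as each model assigns probability to the full text.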
Results

WikiText-2 (Merity et al., 2017): 2.5 million tokenized words from the English Wikipedia
Bits per character (↓ lower is better)

                                                  |        dev data           |  test
  model                                           | novel   rare    frequent  |  all
                                                  | words   words   words     |
  ------------------------------------------------+---------------------------+-------
  pure character-level RNN (t h e  c a t ...)     | 3.89    2.08    1.38      | 1.775
  HCLM + cache, previous SOTA                     | –       –       –         | 1.500
    (Kawakami et al., 2017)                       |                           |
  BPE-level RNN (the ca@@ t cha@@ sed)            | 4.01    1.70    1.08      | 1.468
  our full model: Spell Once, Summon Anywhere     | 4.00    1.64    1.10      | 1.455