Subwords, Seriously? Ken Church KennethChurch@baidu.com CLSW-2020
Tokenization
• Modern Deep Nets: BERT and ERNIE
• Two modes:
  • Known words (V): directional → directional
  • Unknown words (OOVs): unidirectional → un ##idi ##re ##ction ##al
    • Subwords, byte pair encoding (BPE)
    • No word formation rules that derive new words from other words
• Proposal: add a third case between known and unknown: almost known (AK)
  • Many OOVs are near known words (K): unidirectional → uni − directional
  • Near: OOV → pre K | OOV → K suf (a sketch follows this slide)
    • where K is a known word
    • and pre and suf are on a short list of prefixes and suffixes
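The back-off above can be made concrete with a short sketch. This is a minimal illustration under assumed inputs, not the implementation behind the talk's results: the vocabulary, the prefix/suffix lists, and the function name almost_known_split are hypothetical placeholders.

```python
# A minimal sketch of the proposed "almost known" (AK) back-off, assuming we
# already have a vocabulary of known words and short, hand-picked affix lists.
# The names (PREFIXES, SUFFIXES, almost_known_split) are illustrative.

PREFIXES = {"uni", "un", "re", "pre", "non", "anti", "auto"}
SUFFIXES = {"s", "ed", "ing", "er", "al", "ly", "ness"}

def almost_known_split(oov, vocab):
    """Try to rewrite an OOV as  pre + K  or  K + suf,  where K is a known word.

    Returns a list of one or two tokens, or None if the OOV is not
    "almost known" (in which case the caller falls back to BPE subwords).
    """
    for pre in sorted(PREFIXES, key=len, reverse=True):
        stem = oov[len(pre):]
        if oov.startswith(pre) and stem in vocab:
            return [pre + "-", stem]
    for suf in sorted(SUFFIXES, key=len, reverse=True):
        stem = oov[:-len(suf)]
        if oov.endswith(suf) and stem in vocab:
            return [stem, "-" + suf]
    return None

# Example: with "directional" and "respond" in the vocabulary,
# "unidirectional" -> ["uni-", "directional"] and "responder" -> ["respond", "-er"],
# instead of five and two BPE pieces respectively.
vocab = {"directional", "respond", "mixed", "cultures"}
print(almost_known_split("unidirectional", vocab))  # ['uni-', 'directional']
print(almost_known_split("responder", vocab))       # ['respond', '-er']
```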
Example from PubMed (medical abstracts)

ERNIE/BERT (baseline): 48 tokens
un ##idi ##re ##ction ##al mixed l ##ym ##ph ##oc ##yte cultures ( ml ##c ) were set up using bo ##vine peripheral blood l ##ym ##ph ##ocytes ( p ##bl ) as respond ##er cells and auto ##log ##ous cell lines transformed in vitro by t .

Proposed (fewer tokens): 35 tokens
UNI- DIRECTIONAL MIXED lymphocyte CULTURES ( M- LC ) WERE SET UP USING BO- VINE PERIPHERAL BLOOD lymphocytes ( pbl ) AS RESPOND -ER CELLS AND autologous CELL LINES TRANSFORMED IN VITRO BY T .
Observation: Many OOVs are near known words
• Example:
  • s = unidirectional (almost known)
  • w = directional (known)
  • w ∈ dict (unlike s)
• When s is near w, there are opportunities to infer the sound and meaning of s from w
• Claim: these inferences are safer than backing off to subwords (spelling)
• Many applications (a g2p sketch follows this slide):
  • Sound: g2p (grapheme to phoneme) for TTS (text to speech)
  • Meaning: translation
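As a concrete, hypothetical illustration of the sound application: once an OOV s is recognized as pre + w with w in a pronunciation lexicon, the pronunciation of s can be composed from the prefix's pronunciation and the lexicon entry for w, rather than guessed from the spelling. The tiny tables below (PRON_LEXICON, PREFIX_PRON) and the function name are placeholders, not from the talk.

```python
# A minimal sketch of inferring the sound of an almost known word
# s = pre + w from its known stem w, instead of guessing from spelling.
# PRON_LEXICON and PREFIX_PRON are tiny illustrative tables; a real system
# would use a full pronunciation dictionary (e.g. CMUdict).

PRON_LEXICON = {
    "directional": "D ER0 EH1 K SH AH0 N AH0 L",
}
PREFIX_PRON = {
    "uni": "Y UW1 N IH0",
}

def g2p_backoff(oov):
    """Pronounce an OOV of the form  pre + w  by concatenating the prefix
    pronunciation with the lexicon entry for the known stem w."""
    for pre, pre_pron in PREFIX_PRON.items():
        stem = oov[len(pre):]
        if oov.startswith(pre) and stem in PRON_LEXICON:
            return pre_pron + " " + PRON_LEXICON[stem]
    return None  # fall back to a trained g2p model on the raw spelling

print(g2p_backoff("unidirectional"))
# 'Y UW1 N IH0 D ER0 EH1 K SH AH0 N AH0 L'
```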
Spoiler Alert: Morphology and Semantics as Vector Rotations

WordNet semantics
• Some semantic relations: synonymy, antonymy, is-a
• Collect seeds for training:
  • is-a: <car, vehicle>, …
  • synonym: <good, honest>, <good, proficient>, …
  • antonym: <good, bad>, <good, evil>, …
• Learn rotations (a sketch follows this slide):
  • vec(car) R_isa ≈ vec(vehicle)
  • vec(good) R_syn ≈ vec(honest)
  • vec(good) R_ant ≈ vec(bad)
• Thus, x R y ⟹ vec(x) R ≈ vec(y)
  • words ⟹ vectors
  • functions on words (predicates, relations) ⟹ rotations
• What is the meaning of not?
  • ¬x ⟹ vec(x) R_not
  • by analogy with: vec(un + x) = vec(x) R_un

Morphology
• Some morphological relations:
  • unidirectional → uni + directional
  • unzipped → un + zipped
  • dogs → dog + s
  • barking → bark + ing
• Collect seeds for training:
  • <uni + x, x>
  • <un + x, x>
  • <x + s, x>
  • <x + ing, x>
• Learn rotations R:
  • vec(uni + x) R_uni ≈ vec(x)
  • vec(un + x) R_un ≈ vec(x)
  • vec(x + s) R_s ≈ vec(x)
  • vec(x + ing) R_ing ≈ vec(x)
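One way to realize "learn rotations" is sketched below, assuming pretrained word vectors and a list of seed pairs. The talk does not specify the solver, so this uses orthogonal Procrustes (or plain least squares) as a plausible stand-in; the function name and the usage example are illustrative.

```python
# A minimal numpy sketch of learning one relation's "rotation" R from seed
# pairs <x, y> such that vec(x) @ R ≈ vec(y).  The embedding lookup `vec`
# and the seed list are placeholders.
import numpy as np

def learn_rotation(seed_pairs, vec, orthogonal=True):
    """seed_pairs: list of (x, y) word pairs, e.g. [("uni" + w, w), ...]
    vec: dict-like mapping a word to a d-dimensional embedding (np.ndarray)."""
    X = np.stack([vec[x] for x, _ in seed_pairs])   # shape (n, d)
    Y = np.stack([vec[y] for _, y in seed_pairs])   # shape (n, d)
    if orthogonal:
        # Orthogonal Procrustes: R = U V^T from the SVD of X^T Y,
        # so R is a true rotation/reflection of the embedding space.
        U, _, Vt = np.linalg.svd(X.T @ Y)
        return U @ Vt
    # Otherwise an unconstrained least-squares linear map.
    return np.linalg.lstsq(X, Y, rcond=None)[0]

# Usage (hypothetical): R_uni = learn_rotation([("unidirectional", "directional"),
#                                               ("unilateral", "lateral"), ...], vec)
# Then vec["unicycle"] @ R_uni should land near vec["cycle"], and the same
# recipe applies to the is-a, synonym, and antonym seed pairs above.
```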
Motivations: Black Boxes vs. Gray Boxes

Modern Deep Nets (black boxes): un ##idi ##re ##ction ##al
• Desiderata:
  • End-to-end performance: system test ≫ unit test
  • Intermediate representations considered harmful
  • Small vocabularies (V): space & time grow with V
  • Generalization to other tasks, domains, languages, etc.
    • Morphology tends to be language specific
  • Optimization ≫ annotation ≫ creating lists by hand
    • Linguistic resources considered harmful
• Non-desiderata:
  • Linguistic generalizations
  • https://en.wikipedia.org/wiki/Frederick_Jelinek
    • "Every time I fire a linguist, the performance of the speech recognizer goes up."
• BPE (byte pair encoding):
  • Generous definition: an optimization to find a small vocabulary of tokens with broad coverage (a toy sketch follows this slide)
  • Not-so-generous definition: BPE ≈ spelling
• Spelling ≫ sound & meaning: spelling is observable

Traditional Linguistics (gray boxes): UNI- DIRECTIONAL
• Intermediate representations: unit test ≫ system test
  • Capture relevant linguistic generalizations
• Example of an intermediate representation: morphology
• Relevant linguistic generalizations: Sound (S) and Meaning (M)
  • unidirectional ~ directional
  • Capture generalizations associated with the stem, directional:
    • S(unidirectional) ~ S(directional)
    • M(unidirectional) ~ M(directional)
  • Capture generalizations associated with the affix, uni:
    • S(unidirectional) ~ S(uni) (vowel)
    • M(unidirectional) ~ M(uni) (one)
• Sound & meaning ≫ spelling: deep representations are more insightful than superficial observations
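For reference, the "generous definition" of BPE above corresponds to the textbook merge loop (Sennrich et al., 2016). The toy sketch below learns merges on a tiny word list; it is not the actual BERT/ERNIE tokenizer, and the word list and merge budget are arbitrary.

```python
# A toy sketch of the BPE merge loop: repeatedly merge the most frequent
# adjacent pair of symbols until the merge budget is reached.
from collections import Counter

def learn_bpe(corpus_words, num_merges):
    # Each word starts as a sequence of characters.
    words = [list(w) for w in corpus_words]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for w in words:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Replace every occurrence of the best pair with the merged symbol.
        for w in words:
            i = 0
            while i < len(w) - 1:
                if (w[i], w[i + 1]) == best:
                    w[i:i + 2] = [w[i] + w[i + 1]]
                else:
                    i += 1
    return merges

print(learn_bpe(["unidirectional", "directional", "direction"], 5))
```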
A Pendulum Swung Too Far (Church, 2011)
A Pendulum Swung Too Far (Church, 2011)
• 1950s: Empiricism (Shannon, Skinner, Firth, Harris)
• 1970s: Rationalism (Chomsky and Minsky)
• 1990s: Empiricism (IBM Speech Group, AT&T Bell Labs)
• 2010s: A return to Rationalism?
[Figure: alternating generations of grandparents and grandchildren]
• Fads come, and fads go:
  • 2010s: Deep Nets
  • 2030s: DARPA AI Next: "We don't need more cat detectors"
Jurafsky: Interspeech-2016, NAACL-2009 (https://www.superlectures.com/interspeech2016/)
• Jurafsky uses the history of ketchup (and ice cream) to shed light on currently popular methods in speech and language.
• He traces the etymology of "ketchup" from an Asian fish sauce; advances in (sailing) technology made it possible to replace anchovies with less expensive tomatoes and sugar from the West.
• The ice cream story combines fruit syrups (sharbat) from Persia with gunpowder from China and advances in refrigeration technology.
• Big tent, better together: Humanities + Engineering + Stats
The Speech Invasion
• At speech meetings (Interspeech-2016, as opposed to NAACL-2009),
• Jurafsky credits speech researchers for transferring currently popular techniques from speech to language.
What happened in 1988? (https://www.superlectures.com/interspeech2016/)
What happened in 1975? The same thing that happened to language in 1988 (and to hedge funds in the 1990s, and to politics in 2016)?
• Jurafsky's story is nice and simple, but history is "complicated."
• IMHO, speech did unto language what had been done unto them.
• https://www.superlectures.com/interspeech2016/
Robert Mercer: ACL Lifetime Achievement Award, 2014
• End-to-end vs. representation
• http://techtalks.tv/talks/closing-session/60532/
A Unified (Dystopian) Perspective: The World Would Be Better Off Without People
• More on firing linguists…
• Self-driving cars:
  • The most dangerous thing about a car is the driver.
  • Let's get rid of drivers.
• Hedge funds:
  • The weak spot in an investment fund is the fund manager.
  • Let's get rid of fund managers.
• Speech, machine translation, CL, deep nets:
  • The most dangerous thing is the researchers.
  • Let's get rid of researchers (and especially the linguists).
• Politics:
  • Government would work better without politicians.
  • See the discussion of Brexit and the 2016 US election in https://en.wikipedia.org/wiki/Robert_Mercer
• In these difficult times, it would be good if we were more tolerant of one another, and willing to love one another through thick and thin.