C2NLU: An Overview
Heike Adel
CIS, LMU Munich
Dagstuhl, January 23, 2017
Contents
[Figure: the character sequence "W e l c o m e _ t o _ m y _ t a l k" fed to an NLU system, which outputs "greeting"]
◮ Motivation
  ◮ Why do we want character-based models?
◮ Previous work
  ◮ Which character-based models/research exist?
◮ Conclusion
  ◮ Which challenges/open questions need to be considered?
Traditional NLP/NLU
Typical NLP/NLU processing pipeline [Gillick et al. 2015]:
[Figure: document → tokenization (language-specific) → token sequence → segmentation into sentences → sentences → syntactic analysis (per sentence) → POS tags, syntactic dependencies → semantic analysis → NE tags, semantic roles, ... → NLU]
◮ Pipeline of different modules: prone to error propagation
◮ We usually cannot recover from errors (e.g., in tokenization)
Idea: C2NLU
◮ Most extreme view: a character-based end-to-end model that gets rid of the traditional pipeline entirely
◮ Is this reasonable?
[Figure: the same processing pipeline as on the previous slide, replaced entirely by the end-to-end model]
C2NLU: Direct models of data
◮ Traditional machine learning: based on feature engineering
  ◮ Tokens = manually designed features for NLU models
◮ In contrast, deep learning:
  ◮ Models can directly access the data, e.g., pixels in vision, acoustic signals in speech recognition
  ◮ Models learn their own representation ("features") of the data
⇒ Character-based models: raw-data approach for text
C2NLU: Tokenization-free models
◮ Tokenization is difficult
  ◮ English: some difficult cases [Yahoo!, San Francisco-Los Angeles flights, #starwars]
  ◮ Chinese: tokens are not separated by spaces
  ◮ German: compounds [Donaudampfschifffahrtsgesellschaftskapitänsmütze → the hat of the captain of the association for shipping with steam powered vessels on the Danube]
  ◮ Turkish: agglutinative language [Bayramlaşamadıklarımızdandır → He is among those with whom we haven’t been able to exchange Season’s greetings]
◮ Problem: it is difficult/inefficient to correct tokenization decisions later
C2NLU: More robust against noise
◮ Robust against small perturbations of the input
◮ Examples: letter insertions, deletions, substitutions, transpositions [commputer, compter, compuzer, comptuer]
◮ Examples: space insertions, space deletions [guacamole → gua camole, ran fast → ranfast]
C2NLU: Robust morphological processing
◮ If we model the sequence of characters of a token, we can in principle learn all morphological regularities:
  ◮ Inflections
  ◮ Derivations
  ◮ Wide range of morphological processes (vowel harmony, agglutination, reduplication, ...)
◮ Modeling words would, e.g., ignore that many words share a common root, prefix or suffix
⇒ C2NLU: promising framework for incorporating linguistic knowledge about morphology into statistical models
C2NLU: Orthographic productivity
◮ Character sequences are not arbitrary, but their predictability is limited
◮ Example: morphology
◮ Properties of names are predictable from character patterns [Delonda → female name, osinopril → medication]
◮ Modifications of existing words [staycation, dramedy, Obamacare]
◮ Non-morphological orthographic productivity [cooooool, Watergate, Dieselgate]
◮ Sound symbolism, phonesthemes [gl → glitter, gleam, glow]
◮ Onomatopoeia [oink, tick tock]
C2NLU: Out-of-vocabulary (OOV)
◮ No OOVs in character input
◮ OOV generation is possible without the use of special mechanisms
◮ Possible application: names/transliterations in end-to-end machine translation
◮ Open question: How can character-based systems accurately generate OOVs?
Early work: Application-specific character-based features
◮ The history of character-based features for ML models is long
◮ Information retrieval with character n-grams [McNamee et al. 2004, Chen et al. 1997, Damashek 1995, Cavnar 1994, de Heer 1974]
◮ Grapheme-to-phoneme conversion [Bisani et al. 2008, Kaplan et al. 1994, Sejnowski et al. 1987]
◮ Char_align: bilingual character-level alignments [Church 1993]
◮ Prefix and suffix features for tagging rare words [Müller et al. 2013, Ratnaparkhi 1996]
◮ Transliteration [Sajjad et al. 2016, Li et al. 2004, Knight et al. 1998]
◮ Diacritics restoration [Mihalcea et al. 2002]
◮ POS induction (unsupervised, multilingual) [Clark 2003]
◮ Characters and character n-grams as features for NER [Klein et al. 2003]
◮ Language identification [Alex 2005]
Early work (2): Language modeling and machine translation
◮ Character-based language modeling (non-neural)
  ◮ "How well can the next letter of a text be predicted when the preceding N letters are known?" [Shannon 1951]
  ◮ Morpheme-level features for language models; application: speech recognition [Shaik et al. 2013, Kirchhoff et al. 2006, Vergyri et al. 2004, Ircing et al. 2001]
  ◮ Language-independent character n-gram language models for authorship attribution [Peng et al. 2003]
  ◮ Hybrid word/subword n-gram language models for OOV words in speech recognition [Parada et al. 2011, Shaik et al. 2011, Kombrink et al. 2010, Hirsimäki et al. 2006]
  ◮ Characters and character n-grams as input to Restricted Boltzmann Machine-based language models; application: machine translation [Sperr et al. 2013]
◮ Character-based machine translation (non-neural)
  ◮ Machine translation based on characters/character n-grams [Tiedemann et al. 2013, Vilar et al. 2007, Lepage et al. 2005]
Categorization of previous work
◮ Three clusters [Schütze 2017]
  ◮ Tokenization-based models
  ◮ Bag-of-n-gram models
  ◮ End-to-end models
◮ But: mixtures are also possible,
  ◮ e.g., tokenization-based bag-of-n-gram models
  ◮ e.g., bag-of-n-gram or tokenization-based models trained end-to-end
Tokenization-based models
◮ Character-level models based on tokenization (tokenization: necessary pre-processing step)
◮ Model input: tokenized text or individual tokens
◮ Example: word representations based on characters (e.g., for rare words or OOVs)
Tokenization-based models: Examples
Example: word representations based on characters
◮ (1) Average of character embeddings
[Figure: the character embeddings of "t a b l e" are combined (Σ) into a single word vector]
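A minimal sketch of variant (1), assuming PyTorch; the class name, toy vocabulary, and dimensions are illustrative and not taken from the talk. The word vector is simply the mean of the word's character embeddings.

```python
import torch
import torch.nn as nn

class CharAverageEmbedder(nn.Module):
    def __init__(self, char_vocab_size: int, char_dim: int = 25):
        super().__init__()
        self.char_emb = nn.Embedding(char_vocab_size, char_dim)

    def forward(self, char_ids: torch.Tensor) -> torch.Tensor:
        # char_ids: (word_length,) indices of the word's characters
        return self.char_emb(char_ids).mean(dim=0)  # (char_dim,)

# Toy usage: "table" with a hypothetical character-to-index map
char2id = {c: i for i, c in enumerate("abcdefghijklmnopqrstuvwxyz")}
embedder = CharAverageEmbedder(char_vocab_size=len(char2id))
word_vec = embedder(torch.tensor([char2id[c] for c in "table"]))
print(word_vec.shape)  # torch.Size([25])
```

Averaging ignores character order, which is the main limitation the RNN and CNN variants on the next slides address.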
Tokenization-based models: Examples (2)
Example: word representations based on characters
◮ (2) Bidirectional RNN/LSTM over character embeddings
[Figure: a bidirectional RNN/LSTM reads the character embeddings of "t a b l e"; its final states form the word representation]
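A minimal sketch of variant (2), again assuming PyTorch with illustrative names and sizes: the final forward and backward LSTM states over the character sequence are concatenated into the word representation.

```python
import torch
import torch.nn as nn

class CharBiLSTMEmbedder(nn.Module):
    def __init__(self, char_vocab_size: int, char_dim: int = 25, hidden: int = 50):
        super().__init__()
        self.char_emb = nn.Embedding(char_vocab_size, char_dim)
        self.lstm = nn.LSTM(char_dim, hidden, bidirectional=True, batch_first=True)

    def forward(self, char_ids: torch.Tensor) -> torch.Tensor:
        # char_ids: (1, word_length)
        _, (h_n, _) = self.lstm(self.char_emb(char_ids))
        # h_n: (2, 1, hidden) -> concatenate the final forward and backward states
        return torch.cat([h_n[0, 0], h_n[1, 0]], dim=-1)  # (2 * hidden,)

char2id = {c: i for i, c in enumerate("abcdefghijklmnopqrstuvwxyz")}
embedder = CharBiLSTMEmbedder(len(char2id))
vec = embedder(torch.tensor([[char2id[c] for c in "table"]]))
print(vec.shape)  # torch.Size([100])
```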
Tokenization-based models: Examples (3)
Example: word representations based on characters
◮ (3) CNN over character embeddings
[Figure: a convolution over the character embeddings of "t a b l e" followed by max pooling yields the word representation]
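A minimal sketch of variant (3), same assumptions: a 1D convolution over the character embeddings followed by max-over-time pooling.

```python
import torch
import torch.nn as nn

class CharCNNEmbedder(nn.Module):
    def __init__(self, char_vocab_size: int, char_dim: int = 25,
                 num_filters: int = 50, kernel_size: int = 3):
        super().__init__()
        self.char_emb = nn.Embedding(char_vocab_size, char_dim)
        self.conv = nn.Conv1d(char_dim, num_filters, kernel_size, padding=kernel_size - 1)

    def forward(self, char_ids: torch.Tensor) -> torch.Tensor:
        # char_ids: (1, word_length)
        x = self.char_emb(char_ids).transpose(1, 2)    # (1, char_dim, word_length)
        x = torch.relu(self.conv(x))                   # (1, num_filters, conv_length)
        return x.max(dim=2).values.squeeze(0)          # max-over-time pooling -> (num_filters,)

char2id = {c: i for i, c in enumerate("abcdefghijklmnopqrstuvwxyz")}
embedder = CharCNNEmbedder(len(char2id))
print(embedder(torch.tensor([[char2id[c] for c in "table"]])).shape)  # torch.Size([50])
```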
Tokenization-based models: Examples (4)
How to integrate such character-based embeddings into a larger system?
⇒ Example: Hierarchical RNNs [Ling et al. 2016, Luong et al. 2016, Plank et al. 2016, Vylomova et al. 2016, Wang et al. 2016, Yang et al. 2016, Ballesteros et al. 2015]
[Figure: a word-level RNN tags "the red table is in the kitchen" as DET JJ NN V IN DET NN; each token is represented by its word embedding concatenated with a character-based embedding (illustrated for "t a b l e")]
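A minimal sketch of such a hierarchical tagger, assuming PyTorch; the class and parameter names are illustrative and not taken from any of the cited papers. A character BiLSTM produces a per-token vector, which is concatenated with the word embedding and fed to a word-level BiLSTM that emits one tag per token.

```python
import torch
import torch.nn as nn

class HierarchicalTagger(nn.Module):
    """Char-level BiLSTM builds a per-token vector; it is concatenated with a
    word embedding and fed to a word-level BiLSTM that predicts one tag per token."""
    def __init__(self, word_vocab, char_vocab, num_tags,
                 word_dim=100, char_dim=25, char_hidden=25, word_hidden=100):
        super().__init__()
        self.word_emb = nn.Embedding(word_vocab, word_dim)
        self.char_emb = nn.Embedding(char_vocab, char_dim)
        self.char_lstm = nn.LSTM(char_dim, char_hidden, bidirectional=True, batch_first=True)
        self.word_lstm = nn.LSTM(word_dim + 2 * char_hidden, word_hidden,
                                 bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * word_hidden, num_tags)

    def forward(self, word_ids, char_ids_per_word):
        # word_ids: (sentence_length,); char_ids_per_word: list of (1, word_length) tensors
        char_vecs = []
        for chars in char_ids_per_word:
            _, (h_n, _) = self.char_lstm(self.char_emb(chars))
            char_vecs.append(torch.cat([h_n[0, 0], h_n[1, 0]], dim=-1))
        tokens = torch.cat([self.word_emb(word_ids), torch.stack(char_vecs)], dim=-1)
        hidden, _ = self.word_lstm(tokens.unsqueeze(0))   # (1, sentence_length, 2*word_hidden)
        return self.out(hidden).squeeze(0)                # (sentence_length, num_tags)

# Toy usage: a 3-token "sentence" with per-word character index tensors
tagger = HierarchicalTagger(word_vocab=1000, char_vocab=50, num_tags=12)
word_ids = torch.tensor([5, 17, 3])
chars = [torch.randint(0, 50, (1, n)) for n in (3, 5, 7)]
print(tagger(word_ids, chars).shape)  # torch.Size([3, 12])
```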
Tokenization-based models: Examples (5)
Hierarchical CNN + FF network [dos Santos et al. 2014a,b, 2015]
[Figure: for "the red table is in the kitchen", each token is represented by its word embedding concatenated with a character-based CNN embedding (max pooling over "t a b l e"); a feed-forward network (NN) scores the concatenation of a token window]
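A minimal sketch of the window-based feed-forward scorer on top of such token vectors, assuming PyTorch; it illustrates the idea rather than reimplementing the dos Santos et al. models, and the per-token vectors (word embedding + char-CNN embedding, as above) are assumed to be precomputed.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WindowFFTagger(nn.Module):
    """Scores each token from a fixed window of token vectors
    (word embedding + char-CNN embedding) with a feed-forward network."""
    def __init__(self, token_dim: int, num_tags: int, window: int = 5, hidden: int = 300):
        super().__init__()
        self.window = window
        self.ff = nn.Sequential(
            nn.Linear(window * token_dim, hidden),
            nn.Tanh(),
            nn.Linear(hidden, num_tags),
        )

    def forward(self, token_vecs: torch.Tensor) -> torch.Tensor:
        # token_vecs: (sentence_length, token_dim), precomputed word + char-CNN concatenation
        pad = self.window // 2
        padded = F.pad(token_vecs, (0, 0, pad, pad))   # pad the sentence with zero vectors
        windows = padded.unfold(0, self.window, 1)     # (sentence_length, token_dim, window)
        windows = windows.transpose(1, 2).reshape(len(token_vecs), -1)
        return self.ff(windows)                        # (sentence_length, num_tags)

# Toy usage with random token vectors for a 7-token sentence
tagger = WindowFFTagger(token_dim=150, num_tags=10)
print(tagger(torch.randn(7, 150)).shape)  # torch.Size([7, 10])
```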
Tokenization-based models: Examples (6)
Hierarchical CNN + RNN [Chiu et al. 2016, Costa-Jussà et al. 2016, Jaech et al. 2016, Kim et al. 2016, V et al. 2016, Vylomova et al. 2016]
[Figure: a word-level RNN tags "the red table is in the kitchen" as DET JJ NN V IN DET NN; each token is represented by its word embedding concatenated with a character-based CNN embedding (max pooling over "t a b l e")]
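A minimal sketch of the CNN+RNN hierarchy, same assumptions as before: the per-token character encoder is a CNN with max-over-time pooling, and a word-level BiLSTM does the tagging.

```python
import torch
import torch.nn as nn

class CharCNNWordLSTMTagger(nn.Module):
    """Same hierarchy as the previous sketch, but the per-token character
    encoder is a CNN with max-over-time pooling instead of a BiLSTM."""
    def __init__(self, word_vocab, char_vocab, num_tags,
                 word_dim=100, char_dim=25, num_filters=30, word_hidden=100):
        super().__init__()
        self.word_emb = nn.Embedding(word_vocab, word_dim)
        self.char_emb = nn.Embedding(char_vocab, char_dim)
        self.char_conv = nn.Conv1d(char_dim, num_filters, kernel_size=3, padding=2)
        self.word_lstm = nn.LSTM(word_dim + num_filters, word_hidden,
                                 bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * word_hidden, num_tags)

    def forward(self, word_ids, char_ids_per_word):
        char_vecs = []
        for chars in char_ids_per_word:                  # each: (1, word_length)
            x = self.char_emb(chars).transpose(1, 2)     # (1, char_dim, word_length)
            char_vecs.append(torch.relu(self.char_conv(x)).max(dim=2).values.squeeze(0))
        tokens = torch.cat([self.word_emb(word_ids), torch.stack(char_vecs)], dim=-1)
        hidden, _ = self.word_lstm(tokens.unsqueeze(0))
        return self.out(hidden).squeeze(0)               # (sentence_length, num_tags)
```

Compared to the character BiLSTM, the CNN encoder captures local character n-gram patterns and is cheaper to compute.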
Tokenization-based models: Examples (7)
Character-Aware Neural Language Models [Kim et al. 2016]
[Figure: model architecture — concatenation of character embeddings → convolution layer with multiple filters of different widths → max-over-time pooling layer → highway network [Srivastava et al. 2015] → long short-term memory network → softmax output to obtain a distribution over the next word → cross-entropy loss between the next word and the prediction]
◮ Combining the character model output with word embeddings did not help in this study
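A minimal sketch of this architecture, assuming PyTorch; filter widths, dimensions, and class names are illustrative, and the training loop (cross-entropy against the next word) is omitted. It follows the figure: char CNN with several filter widths → max-over-time pooling → highway layer → word-level LSTM → softmax over the next word.

```python
import torch
import torch.nn as nn

class Highway(nn.Module):
    # Highway layer [Srivastava et al. 2015]: gated mix of a transform and the input.
    def __init__(self, dim):
        super().__init__()
        self.transform = nn.Linear(dim, dim)
        self.gate = nn.Linear(dim, dim)

    def forward(self, x):
        t = torch.sigmoid(self.gate(x))
        return t * torch.relu(self.transform(x)) + (1 - t) * x

class CharAwareLM(nn.Module):
    """Char CNN (several filter widths) -> max-over-time pooling -> highway ->
    word-level LSTM -> logits over the next word."""
    def __init__(self, char_vocab, word_vocab, char_dim=15,
                 filter_widths=(1, 2, 3, 4, 5), filters_per_width=25, lstm_hidden=300):
        super().__init__()
        self.char_emb = nn.Embedding(char_vocab, char_dim)
        self.convs = nn.ModuleList(
            [nn.Conv1d(char_dim, filters_per_width, w, padding=w - 1) for w in filter_widths])
        token_dim = filters_per_width * len(filter_widths)
        self.highway = Highway(token_dim)
        self.lstm = nn.LSTM(token_dim, lstm_hidden, batch_first=True)
        self.out = nn.Linear(lstm_hidden, word_vocab)

    def forward(self, char_ids):
        # char_ids: (sentence_length, max_word_length) character indices per token
        x = self.char_emb(char_ids).transpose(1, 2)       # (len, char_dim, max_word_length)
        pooled = [torch.relu(c(x)).max(dim=2).values for c in self.convs]
        tokens = self.highway(torch.cat(pooled, dim=-1))  # (len, token_dim)
        hidden, _ = self.lstm(tokens.unsqueeze(0))
        return self.out(hidden).squeeze(0)                # (len, word_vocab) next-word logits
```

For actual training, the logits at position t would be compared against the word at position t+1 with a cross-entropy loss.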