C2NLU: An Overview
Heike Adel
CIS, LMU Munich
Dagstuhl, January 23, 2017
Contents
[Figure: the character sequence "W e l c o m e _ t o _ m y _ t a l k" fed to an NLU system, which outputs "greeting"]
◮ Motivation
  ◮ Why do we want character-based models?
◮ Previous work
  ◮ Which character-based models/research exist?
◮ Conclusion
  ◮ Which challenges/open questions need to be considered?
Traditional NLP/NLU
Typical NLP/NLU processing pipeline [Gillick et al. 2015]:
[Figure: document → tokenization (language-specific) → token sequence → segmentation into sentences → sentences → syntactic analysis (per sentence) → POS tags, syntactic dependencies → semantic analysis → NE tags, semantic roles, ... → NLU]
◮ Pipeline of different modules: prone to error propagation
◮ We usually cannot recover from errors (e.g., in tokenization)
Idea: C2NLU
◮ Most extreme view: a character-based end-to-end model that gets rid of the traditional pipeline entirely
◮ Is this reasonable?
[Figure: the same processing pipeline as on the previous slide, replaced entirely by the end-to-end model]
C2NLU: Direct models of data
◮ Traditional machine learning: based on feature engineering
  ◮ Tokens = manually designed features for NLU models
◮ In contrast, deep learning:
  ◮ Models can directly access the data, e.g., pixels in vision, acoustic signals in speech recognition
  ◮ Models learn their own representation ("features") of the data
⇒ Character-based models: raw-data approach for text
C2NLU: Tokenization-free models
◮ Tokenization is difficult
  ◮ English: some difficult cases [Yahoo!, San Francisco-Los Angeles flights, #starwars]
  ◮ Chinese: tokens are not separated by spaces
  ◮ German: compounds [Donaudampfschifffahrtsgesellschaftskapitänsmütze → the hat of the captain of the association for shipping with steam powered vessels on the Danube]
  ◮ Turkish: agglutinative language [Bayramlaşamadıklarımızdandır → He is among those with whom we haven’t been able to exchange Season’s greetings]
◮ Problem: it is difficult/inefficient to correct tokenization decisions later
C2NLU: More robust against noise
◮ Robust against small perturbations of the input
◮ Examples: letter insertions, deletions, substitutions, transpositions [commputer, compter, compuzer, comptuer]
◮ Examples: space insertions, space deletions [guacamole → gua camole, ran fast → ranfast]
C2NLU: Robust morphological processing
◮ If we model the sequence of characters of a token, we can in principle learn all morphological regularities:
  ◮ Inflections
  ◮ Derivations
  ◮ Wide range of morphological processes (vowel harmony, agglutination, reduplication, ...)
◮ Modeling words would, e.g., ignore that many words share a common root, prefix or suffix
⇒ C2NLU: promising framework for incorporating linguistic knowledge about morphology into statistical models
C2NLU: Orthographic productivity
◮ Character sequences are not arbitrary, but their predictability is limited
◮ Example: morphology
◮ Properties of names are predictable from character patterns [Delonda → female name, osinopril → medication]
◮ Modifications of existing words [staycation, dramedy, Obamacare]
◮ Non-morphological orthographic productivity [cooooool, Watergate, Dieselgate]
◮ Sound symbolism, phonesthemes [gl → glitter, gleam, glow]
◮ Onomatopoeia [oink, tick tock]
C2NLU: Out-of-vocabulary (OOV)
◮ No OOVs in character input
◮ OOV generation is possible without the use of special mechanisms
◮ Possible application: names/transliterations in end-to-end machine translation
◮ Open question: How can character-based systems accurately generate OOVs?
Early work: Application-specific character-based features
◮ The history of character-based features for ML models is long
◮ Information retrieval with character n-grams [McNamee et al. 2004, Chen et al. 1997, Damashek 1995, Cavnar 1994, de Heer 1974]
◮ Grapheme-to-phoneme conversion [Bisani et al. 2008, Kaplan et al. 1994, Sejnowski et al. 1987]
◮ Char_align: bilingual character-level alignments [Church 1993]
◮ Prefix and suffix features for tagging rare words [Müller et al. 2013, Ratnaparkhi 1996]
◮ Transliteration [Sajjad et al. 2016, Li et al. 2004, Knight et al. 1998]
◮ Diacritics restoration [Mihalcea et al. 2002]
◮ POS induction (unsupervised, multilingual) [Clark 2003]
◮ Characters and character n-grams as features for NER [Klein et al. 2003]
◮ Language identification [Alex 2005]
Early work (2): Language modeling and machine translation
◮ Character-based language modeling (non-neural)
  ◮ "How well can the next letter of a text be predicted when the preceding N letters are known?" [Shannon 1951]
  ◮ Morpheme-level features for language models; application: speech recognition [Shaik et al. 2013, Kirchhoff et al. 2006, Vergyri et al. 2004, Ircing et al. 2001]
  ◮ Language-independent character n-gram language models for authorship attribution [Peng et al. 2003]
  ◮ Hybrid word/subword n-gram language models for OOV words in speech recognition [Parada et al. 2011, Shaik et al. 2011, Kombrink et al. 2010, Hirsimäki et al. 2006]
  ◮ Characters and character n-grams as input to Restricted Boltzmann Machine-based language models; application: machine translation [Sperr et al. 2013]
◮ Character-based machine translation (non-neural)
  ◮ Machine translation based on characters/character n-grams [Tiedemann et al. 2013, Vilar et al. 2007, Lepage et al. 2005]
Categorization of previous work
◮ Three clusters [Schütze 2017]
  ◮ Tokenization-based models
  ◮ Bag-of-n-gram models
  ◮ End-to-end models
◮ But: mixtures are also possible,
  ◮ e.g., tokenization-based bag-of-n-gram models
  ◮ e.g., bag-of-n-gram or tokenization-based models trained end-to-end
Tokenization-based models
◮ Character-level models based on tokenization (tokenization: necessary pre-processing step)
◮ Model input: tokenized text or individual tokens
◮ Example: word representations based on characters (e.g., for rare words or OOVs)
Tokenization-based models: Examples
Example: word representations based on characters
◮ (1) Average of character embeddings
[Figure: the character embeddings of "t a b l e" are combined (Σ) into a single word vector]
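A minimal sketch of variant (1), assuming PyTorch; the class name, toy vocabulary, and dimensions are illustrative and not taken from the talk. The word vector is simply the mean of the word's character embeddings.

```python
import torch
import torch.nn as nn

class CharAverageEmbedder(nn.Module):
    def __init__(self, char_vocab_size: int, char_dim: int = 25):
        super().__init__()
        self.char_emb = nn.Embedding(char_vocab_size, char_dim)

    def forward(self, char_ids: torch.Tensor) -> torch.Tensor:
        # char_ids: (word_length,) indices of the word's characters
        return self.char_emb(char_ids).mean(dim=0)  # (char_dim,)

# Toy usage: "table" with a hypothetical character-to-index map
char2id = {c: i for i, c in enumerate("abcdefghijklmnopqrstuvwxyz")}
embedder = CharAverageEmbedder(char_vocab_size=len(char2id))
word_vec = embedder(torch.tensor([char2id[c] for c in "table"]))
print(word_vec.shape)  # torch.Size([25])
```

Averaging ignores character order, which is the main limitation the RNN and CNN variants on the next slides address.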
Tokenization-based models: Examples (2)
Example: word representations based on characters
◮ (2) Bidirectional RNN/LSTM over character embeddings
[Figure: a bidirectional RNN/LSTM reads the character embeddings of "t a b l e"; its final states form the word representation]
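A minimal sketch of variant (2), again assuming PyTorch with illustrative names and sizes: the final forward and backward LSTM states over the character sequence are concatenated into the word representation.

```python
import torch
import torch.nn as nn

class CharBiLSTMEmbedder(nn.Module):
    def __init__(self, char_vocab_size: int, char_dim: int = 25, hidden: int = 50):
        super().__init__()
        self.char_emb = nn.Embedding(char_vocab_size, char_dim)
        self.lstm = nn.LSTM(char_dim, hidden, bidirectional=True, batch_first=True)

    def forward(self, char_ids: torch.Tensor) -> torch.Tensor:
        # char_ids: (1, word_length)
        _, (h_n, _) = self.lstm(self.char_emb(char_ids))
        # h_n: (2, 1, hidden) -> concatenate the final forward and backward states
        return torch.cat([h_n[0, 0], h_n[1, 0]], dim=-1)  # (2 * hidden,)

char2id = {c: i for i, c in enumerate("abcdefghijklmnopqrstuvwxyz")}
embedder = CharBiLSTMEmbedder(len(char2id))
vec = embedder(torch.tensor([[char2id[c] for c in "table"]]))
print(vec.shape)  # torch.Size([100])
```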
Tokenization-based models: Examples (3)
Example: word representations based on characters
◮ (3) CNN over character embeddings
[Figure: a convolution over the character embeddings of "t a b l e" followed by max pooling yields the word representation]
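A minimal sketch of variant (3), same assumptions: a 1D convolution over the character embeddings followed by max-over-time pooling.

```python
import torch
import torch.nn as nn

class CharCNNEmbedder(nn.Module):
    def __init__(self, char_vocab_size: int, char_dim: int = 25,
                 num_filters: int = 50, kernel_size: int = 3):
        super().__init__()
        self.char_emb = nn.Embedding(char_vocab_size, char_dim)
        self.conv = nn.Conv1d(char_dim, num_filters, kernel_size, padding=kernel_size - 1)

    def forward(self, char_ids: torch.Tensor) -> torch.Tensor:
        # char_ids: (1, word_length)
        x = self.char_emb(char_ids).transpose(1, 2)    # (1, char_dim, word_length)
        x = torch.relu(self.conv(x))                   # (1, num_filters, conv_length)
        return x.max(dim=2).values.squeeze(0)          # max-over-time pooling -> (num_filters,)

char2id = {c: i for i, c in enumerate("abcdefghijklmnopqrstuvwxyz")}
embedder = CharCNNEmbedder(len(char2id))
print(embedder(torch.tensor([[char2id[c] for c in "table"]])).shape)  # torch.Size([50])
```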
Tokenization-based models: Examples (4)
How to integrate such character-based embeddings into a larger system?
⇒ Example: Hierarchical RNNs [Ling et al. 2016, Luong et al. 2016, Plank et al. 2016, Vylomova et al. 2016, Wang et al. 2016, Yang et al. 2016, Ballesteros et al. 2015]
[Figure: a word-level RNN tags "the red table is in the kitchen" as DET JJ NN V IN DET NN; each token is represented by its word embedding concatenated with a character-based embedding (illustrated for "t a b l e")]
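A minimal sketch of such a hierarchical tagger, assuming PyTorch; the class and parameter names are illustrative and not taken from any of the cited papers. A character BiLSTM produces a per-token vector, which is concatenated with the word embedding and fed to a word-level BiLSTM that emits one tag per token.

```python
import torch
import torch.nn as nn

class HierarchicalTagger(nn.Module):
    """Char-level BiLSTM builds a per-token vector; it is concatenated with a
    word embedding and fed to a word-level BiLSTM that predicts one tag per token."""
    def __init__(self, word_vocab, char_vocab, num_tags,
                 word_dim=100, char_dim=25, char_hidden=25, word_hidden=100):
        super().__init__()
        self.word_emb = nn.Embedding(word_vocab, word_dim)
        self.char_emb = nn.Embedding(char_vocab, char_dim)
        self.char_lstm = nn.LSTM(char_dim, char_hidden, bidirectional=True, batch_first=True)
        self.word_lstm = nn.LSTM(word_dim + 2 * char_hidden, word_hidden,
                                 bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * word_hidden, num_tags)

    def forward(self, word_ids, char_ids_per_word):
        # word_ids: (sentence_length,); char_ids_per_word: list of (1, word_length) tensors
        char_vecs = []
        for chars in char_ids_per_word:
            _, (h_n, _) = self.char_lstm(self.char_emb(chars))
            char_vecs.append(torch.cat([h_n[0, 0], h_n[1, 0]], dim=-1))
        tokens = torch.cat([self.word_emb(word_ids), torch.stack(char_vecs)], dim=-1)
        hidden, _ = self.word_lstm(tokens.unsqueeze(0))   # (1, sentence_length, 2*word_hidden)
        return self.out(hidden).squeeze(0)                # (sentence_length, num_tags)

# Toy usage: a 3-token "sentence" with per-word character index tensors
tagger = HierarchicalTagger(word_vocab=1000, char_vocab=50, num_tags=12)
word_ids = torch.tensor([5, 17, 3])
chars = [torch.randint(0, 50, (1, n)) for n in (3, 5, 7)]
print(tagger(word_ids, chars).shape)  # torch.Size([3, 12])
```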
Tokenization-based models: Examples (5)
Hierarchical CNN + FF network [dos Santos et al. 2014a,b, 2015]
[Figure: for "the red table is in the kitchen", each token is represented by its word embedding concatenated with a character-based CNN embedding (max pooling over "t a b l e"); a feed-forward network (NN) scores the concatenation of a token window]
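A minimal sketch of the window-based feed-forward scorer on top of such token vectors, assuming PyTorch; it illustrates the idea rather than reimplementing the dos Santos et al. models, and the per-token vectors (word embedding + char-CNN embedding, as above) are assumed to be precomputed.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WindowFFTagger(nn.Module):
    """Scores each token from a fixed window of token vectors
    (word embedding + char-CNN embedding) with a feed-forward network."""
    def __init__(self, token_dim: int, num_tags: int, window: int = 5, hidden: int = 300):
        super().__init__()
        self.window = window
        self.ff = nn.Sequential(
            nn.Linear(window * token_dim, hidden),
            nn.Tanh(),
            nn.Linear(hidden, num_tags),
        )

    def forward(self, token_vecs: torch.Tensor) -> torch.Tensor:
        # token_vecs: (sentence_length, token_dim), precomputed word + char-CNN concatenation
        pad = self.window // 2
        padded = F.pad(token_vecs, (0, 0, pad, pad))   # pad the sentence with zero vectors
        windows = padded.unfold(0, self.window, 1)     # (sentence_length, token_dim, window)
        windows = windows.transpose(1, 2).reshape(len(token_vecs), -1)
        return self.ff(windows)                        # (sentence_length, num_tags)

# Toy usage with random token vectors for a 7-token sentence
tagger = WindowFFTagger(token_dim=150, num_tags=10)
print(tagger(torch.randn(7, 150)).shape)  # torch.Size([7, 10])
```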
Tokenization-based models: Examples (6)
Hierarchical CNN + RNN [Chiu et al. 2016, Costa-Jussà et al. 2016, Jaech et al. 2016, Kim et al. 2016, V et al. 2016, Vylomova et al. 2016]
[Figure: a word-level RNN tags "the red table is in the kitchen" as DET JJ NN V IN DET NN; each token is represented by its word embedding concatenated with a character-based CNN embedding (max pooling over "t a b l e")]
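A minimal sketch of the CNN+RNN hierarchy, same assumptions as before: the per-token character encoder is a CNN with max-over-time pooling, and a word-level BiLSTM does the tagging.

```python
import torch
import torch.nn as nn

class CharCNNWordLSTMTagger(nn.Module):
    """Same hierarchy as the previous sketch, but the per-token character
    encoder is a CNN with max-over-time pooling instead of a BiLSTM."""
    def __init__(self, word_vocab, char_vocab, num_tags,
                 word_dim=100, char_dim=25, num_filters=30, word_hidden=100):
        super().__init__()
        self.word_emb = nn.Embedding(word_vocab, word_dim)
        self.char_emb = nn.Embedding(char_vocab, char_dim)
        self.char_conv = nn.Conv1d(char_dim, num_filters, kernel_size=3, padding=2)
        self.word_lstm = nn.LSTM(word_dim + num_filters, word_hidden,
                                 bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * word_hidden, num_tags)

    def forward(self, word_ids, char_ids_per_word):
        char_vecs = []
        for chars in char_ids_per_word:                  # each: (1, word_length)
            x = self.char_emb(chars).transpose(1, 2)     # (1, char_dim, word_length)
            char_vecs.append(torch.relu(self.char_conv(x)).max(dim=2).values.squeeze(0))
        tokens = torch.cat([self.word_emb(word_ids), torch.stack(char_vecs)], dim=-1)
        hidden, _ = self.word_lstm(tokens.unsqueeze(0))
        return self.out(hidden).squeeze(0)               # (sentence_length, num_tags)
```

Compared to the character BiLSTM, the CNN encoder captures local character n-gram patterns and is cheaper to compute.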
Tokenization-based models: Examples (7)
Character-Aware Neural Language Models [Kim et al. 2016]
[Figure: model architecture — concatenation of character embeddings → convolution layer with multiple filters of different widths → max-over-time pooling layer → highway network [Srivastava et al. 2015] → long short-term memory network → softmax output to obtain a distribution over the next word → cross-entropy loss between the next word and the prediction]
◮ Combining the character model output with word embeddings did not help in this study
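A minimal sketch of this architecture, assuming PyTorch; filter widths, dimensions, and class names are illustrative, and the training loop (cross-entropy against the next word) is omitted. It follows the figure: char CNN with several filter widths → max-over-time pooling → highway layer → word-level LSTM → softmax over the next word.

```python
import torch
import torch.nn as nn

class Highway(nn.Module):
    # Highway layer [Srivastava et al. 2015]: gated mix of a transform and the input.
    def __init__(self, dim):
        super().__init__()
        self.transform = nn.Linear(dim, dim)
        self.gate = nn.Linear(dim, dim)

    def forward(self, x):
        t = torch.sigmoid(self.gate(x))
        return t * torch.relu(self.transform(x)) + (1 - t) * x

class CharAwareLM(nn.Module):
    """Char CNN (several filter widths) -> max-over-time pooling -> highway ->
    word-level LSTM -> logits over the next word."""
    def __init__(self, char_vocab, word_vocab, char_dim=15,
                 filter_widths=(1, 2, 3, 4, 5), filters_per_width=25, lstm_hidden=300):
        super().__init__()
        self.char_emb = nn.Embedding(char_vocab, char_dim)
        self.convs = nn.ModuleList(
            [nn.Conv1d(char_dim, filters_per_width, w, padding=w - 1) for w in filter_widths])
        token_dim = filters_per_width * len(filter_widths)
        self.highway = Highway(token_dim)
        self.lstm = nn.LSTM(token_dim, lstm_hidden, batch_first=True)
        self.out = nn.Linear(lstm_hidden, word_vocab)

    def forward(self, char_ids):
        # char_ids: (sentence_length, max_word_length) character indices per token
        x = self.char_emb(char_ids).transpose(1, 2)       # (len, char_dim, max_word_length)
        pooled = [torch.relu(c(x)).max(dim=2).values for c in self.convs]
        tokens = self.highway(torch.cat(pooled, dim=-1))  # (len, token_dim)
        hidden, _ = self.lstm(tokens.unsqueeze(0))
        return self.out(hidden).squeeze(0)                # (len, word_vocab) next-word logits
```

For actual training, the logits at position t would be compared against the word at position t+1 with a cross-entropy loss.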