Dagstuhl C2NLU: Working Groups Mo/Tu
Hinrich Schütze
January 23, 2017

1 Working Group MORPH: Morphology

This WG is concerned with morphology, one of the core areas of computational linguistics and theoretical linguistics, especially once we’ve overcome English-centric myopia. A lot (everything?) changes when morphology is modeled on the character level, in End2End systems and in the framework of deep learning.

• character-level models for morphological analysis
• character-level models for morphological generation (see the sketch after this list)
• in character-level models: What happens to prefixes, suffixes, stems, roots?
• subword units: morphologically motivated vs non-morphological
  – properties, strengths, weaknesses etc.
• morphological induction / paradigm completion / discovery of morphological rules: supervised, semisupervised, unsupervised
• Do certain types of morphology lend themselves better to character-level models?
• inflectional vs. derivational morphology
• non-concatenative morphologies
• segmentation
• language modeling
  – how to incorporate morphology: input, output, at which level?
• insights into human morphology from analyzing neural models?
• use character-level representations as a research methodology for morphology: e.g., compositional (“dearly”) vs noncompositional (“early”) forms
• efficiency (character-based worse than word-based?)
• inspection, interpretation, analysis, beyond black-box models
• evaluation
• applications
• come up with 1-5 new research directions
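As a concrete starting point for the generation bullet above, here is a minimal character-level encoder-decoder sketch for morphological reinflection (lemma characters plus a tag symbol in, inflected form out), written in PyTorch. The vocabulary, dimensions, the single tag symbol and the toy training pair are illustrative assumptions, not a reference implementation from the seminar.

# Minimal character-level encoder-decoder for morphological generation
# (lemma characters + morphological tag -> inflected form), sketched in PyTorch.
# Vocabulary, dimensions, and the toy training pair are illustrative assumptions.
import torch
import torch.nn as nn

PAD, BOS, EOS = 0, 1, 2
symbols = list("abcdefghijklmnopqrstuvwxyz") + ["<V;PST>"]  # tag treated as one input symbol
itos = ["<pad>", "<bos>", "<eos>"] + symbols
stoi = {s: i for i, s in enumerate(itos)}

def encode(seq):
    return torch.tensor([[stoi[s] for s in seq] + [EOS]])

class CharSeq2Seq(nn.Module):
    def __init__(self, vocab, dim=64):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim, padding_idx=PAD)
        self.enc = nn.GRU(dim, dim, batch_first=True)
        self.dec = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, vocab)

    def forward(self, src, tgt_in):
        _, h = self.enc(self.emb(src))               # encode lemma + tag characters
        dec_out, _ = self.dec(self.emb(tgt_in), h)   # teacher-forced decoding
        return self.out(dec_out)                     # per-step character logits

# Toy example: "walk" + past-tense tag -> "walked".
src = encode(list("walk") + ["<V;PST>"])
tgt = encode(list("walked"))
tgt_in = torch.cat([torch.tensor([[BOS]]), tgt[:, :-1]], dim=1)

model = CharSeq2Seq(len(itos))
loss = nn.CrossEntropyLoss(ignore_index=PAD)(
    model(src, tgt_in).reshape(-1, len(itos)), tgt.reshape(-1)
)
loss.backward()  # gradients for the single toy pair

Treating the morphological tag as just another input symbol is one common design choice; several of the WG questions above (what happens to prefixes, stems, roots?) amount to asking what such a model learns internally.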

2 Working Group MT: Machine Translation

Machine translation is perhaps the biggest success of deep learning in NLP. This WG will be concerned with research questions and challenges for character-level MT.

• character NMT
• linear NMT
• dealing with OOVs and cross-token dependencies (e.g., hierarchy)
• localization
• beyond LSTM dependencies
• transliteration
• character-level alignment
• multilingual NMT
• multi-task NMT for multiple modalities
• document-level NMT
• What are the units: characters, BPEs, subwords, words, phrases? (see the sketch after this list)
• Are there still units?
• What happens with syntax?
• efficiency (character-based worse than word-based?)
• inspection, interpretation, analysis, beyond black-box models
• evaluation
• applications (e.g., specific (low/high) resource settings / text types / language pairs?)
• come up with 1-5 new research directions
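To make the "characters vs. BPEs vs. subwords" question concrete, here is a toy sketch of byte-pair-encoding merge learning in plain Python. The tiny corpus and the number of merges are illustrative assumptions; real NMT systems learn tens of thousands of merges on the training corpus.

# Toy byte-pair-encoding (BPE) merge learning, to make the "characters vs.
# subword units" question concrete. Corpus and number of merges are
# illustrative assumptions, not a recommendation.
from collections import Counter

def learn_bpe(words, num_merges):
    # Each word starts as a sequence of characters plus an end-of-word marker.
    vocab = Counter(tuple(w) + ("</w>",) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for syms, freq in vocab.items():
            for a, b in zip(syms, syms[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # most frequent adjacent symbol pair
        merges.append(best)
        merged = {}
        for syms, freq in vocab.items():
            out, i = [], 0
            while i < len(syms):
                if i + 1 < len(syms) and (syms[i], syms[i + 1]) == best:
                    out.append(syms[i] + syms[i + 1])  # apply the merge
                    i += 2
                else:
                    out.append(syms[i])
                    i += 1
            merged[tuple(out)] = freq
        vocab = merged
    return merges, vocab

merges, vocab = learn_bpe(["low", "lower", "lowest", "newer", "wider"], 10)
print(merges)  # merge operations in the order they were learned
print(vocab)   # words re-segmented into the learned subword units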

3 Working Group RepLearn: Character-Level Representation Learning

“Unsupervised representation learning techniques capitalize on unlabeled data . . . The goal . . . is to learn a representation that reveals intrinsic low-dimensional structure in data, disentangles underlying factors of variation by incorporating universal AI priors such as smoothness and sparsity, and is useful across multiple tasks and domains.” (Raman Arora)

Embeddings and representation learning in general have been critical to the success of deep learning. Can we learn embeddings / representations without feature engineering (e.g., tokenization) and, if so, how?

• OOV representations
• beyond word embeddings
• RNN/GRU/LSTM-based embeddings
• CNN-based embeddings
• multilingual embeddings, universal embeddings
• noise
• noncanonical language
• characters vs bytes vs radicals vs bits
• learning algorithms, segmentation
• linking it back up to traditional linguistic units (e.g., words)
• how is ambiguity represented?
• numbers, named entities, multiwords and other nontypical units
• form-function regularities: which form regularities (e.g., “add s at the end”) correspond to function regularities?
• cross-token modeling
• char2vec (FastText?) (see the sketch after this list)
• non-morphological character-level productivity
• typoglycemia
• efficiency (character-based worse than word-based?)
• inspection, interpretation, analysis, beyond black-box models
• evaluation
• applications
• come up with 1-5 new research directions
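One concrete reading of the char2vec / FastText bullet: compose a word vector from its character n-gram vectors, which also yields representations for OOV words. The sketch below uses randomly initialized, hashed n-gram vectors purely for illustration; the n-gram range and dimensions are assumptions, and in a trained model these vectors would be learned, e.g. with a skip-gram objective.

# FastText-style composition sketch: a word vector as the sum of its character
# n-gram vectors, which is one way to get representations for OOV words.
# The n-gram range and the random vectors are illustrative assumptions.
import numpy as np

DIM, N_BUCKETS = 50, 2 ** 16
rng = np.random.default_rng(0)
ngram_table = rng.normal(scale=0.1, size=(N_BUCKETS, DIM))  # random here; learned in a real model

def char_ngrams(word, n_min=3, n_max=6):
    w = f"<{word}>"  # boundary markers around the word
    return [w[i:i + n] for n in range(n_min, n_max + 1)
            for i in range(len(w) - n + 1)]

def word_vector(word):
    idx = [hash(g) % N_BUCKETS for g in char_ngrams(word)]  # hashed n-gram lookup
    return ngram_table[idx].sum(axis=0)

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Even an unseen word gets a vector, and shared n-grams ("walk") tend to make
# related forms land closer together than unrelated words.
print(cos(word_vector("walked"), word_vector("walking")))
print(cos(word_vector("walked"), word_vector("zebra")))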

4 Working Group End2End: End2End Architectures

This WG will be concerned with the challenges that character-level models pose for machine learning. How can dependencies over long distances be learned? How can such models be made efficient in training and application? In an approach without feature-engineered preprocessing, how can domain knowledge and priors be incorporated into machine learning architectures?

• CNNs vs RNNs: tradeoff speed/accuracy, parallel/sequential
• hierarchical, multi-speed, multi-scale architectures (see the sketch after this list)
  – fixed small depth (2?) vs unbounded hierarchy (paragraph, document, book)
• context: attention, memory, convolution etc.
• which point in input to focus on
• interface between character-level and higher-level (traditional?) processing layers (syntax, semantics)
• multimodal / crossmodal End2End architectures
• End2End learning of long-distance relationships: corresponding phrases in sentence pairs (or document pairs)
• generation of OOVs
• End2End segmentation learning (i.e., learn the right way to segment for an application)
• how to put in domain / linguistic knowledge?
• Bayesian models
• in our big machine learning toolbox T, what are interesting t ∈ T to explore in combinations of the form “character-level + t”
• (add hot deep learning architecture of the day here)
• efficiency (character-based worse than word-based?)
• inspection, interpretation, analysis, beyond black-box models
• evaluation
• applications
• come up with 1-5 new research directions
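The hierarchical / multi-scale bullet can be made concrete with a fixed two-level sketch: a character CNN builds one vector per segment ("word"), and a GRU contextualizes those vectors across the sentence. All sizes, the max-pooling choice and the fixed two-level depth are illustrative assumptions for discussion, not a proposed architecture.

# Sketch of a two-level character architecture in PyTorch: a character CNN
# builds segment ("word") vectors, a higher-level GRU models the segment
# sequence. Sizes and the fixed depth are illustrative assumptions.
import torch
import torch.nn as nn

class CharCNNWordEncoder(nn.Module):
    def __init__(self, n_chars, char_dim=16, word_dim=64, kernel=3):
        super().__init__()
        self.emb = nn.Embedding(n_chars, char_dim, padding_idx=0)
        self.conv = nn.Conv1d(char_dim, word_dim, kernel_size=kernel, padding=1)

    def forward(self, chars):                      # chars: (words, max_chars)
        x = self.emb(chars).transpose(1, 2)        # -> (words, char_dim, max_chars)
        h = torch.relu(self.conv(x))               # convolve over character positions
        return h.max(dim=2).values                 # max-pool to one vector per word

class HierarchicalModel(nn.Module):
    def __init__(self, n_chars, n_labels, word_dim=64):
        super().__init__()
        self.word_enc = CharCNNWordEncoder(n_chars, word_dim=word_dim)
        self.sent_rnn = nn.GRU(word_dim, word_dim, batch_first=True)
        self.clf = nn.Linear(word_dim, n_labels)

    def forward(self, chars):                      # one sentence: (words, max_chars)
        words = self.word_enc(chars).unsqueeze(0)  # -> (1, words, word_dim)
        out, _ = self.sent_rnn(words)              # contextualize across the sentence
        return self.clf(out)                       # e.g. per-word tag logits

# Toy input: 3 "words" of up to 5 characters each, as integer character ids.
chars = torch.randint(1, 30, (3, 5))
logits = HierarchicalModel(n_chars=30, n_labels=5)(chars)
print(logits.shape)  # torch.Size([1, 3, 5])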
