  1. INF5820: Language technological applications
     Course summary
     Andrey Kutuzov, Lilja Øvrelid, Stephan Oepen, Taraka Rama & Erik Velldal
     University of Oslo, 20 November 2018

  2. Today
     ◮ Exam preparations
     ◮ Collectively summing up
     ◮ Results of obligatory assignment(s)
     ◮ Current trends, beyond INF5820:
       ◮ Cutting edge in word embedding pre-training
       ◮ Transfer and multitask learning
       ◮ Adversarial learning
       ◮ Transformers
       ◮ And more...

  3. Exam
     ◮ When: Monday November 26, 09:00 AM (4 hours).
     ◮ Where: Store fysiske lesesal, Fysikkbygningen.
     ◮ How:
       ◮ No aids (no textbooks, etc.)
       ◮ Pen and paper (not Inspera)
       ◮ Not a programming exam
       ◮ Focus on conceptual understanding
       ◮ Could still involve equations, but no complicated calculations by hand
       ◮ Details of use cases we’ve considered (in lectures or assignments) are also relevant

  4. Neural Network Methods for NLP
     (Slide image: The Great Wave off Kanagawa by Katsushika Hokusai)

  5. What has changed?
     ◮ We’re still within the realm of supervised machine learning. But:
     ◮ A shift from linear models with discrete representations of manually specified features,
     ◮ to non-linear models with distributed and learned representations (this contrast is sketched in code below).
     ◮ We’ll consider two main themes running through the semester: architectures and representations.
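To make the contrast concrete, here is a minimal sketch in PyTorch (not taken from the course materials): a linear classifier over a sparse bag-of-words vector next to a small non-linear model over learned embeddings. The vocabulary size, dimensionalities and class count are made-up placeholders.

# Minimal sketch (PyTorch): the shift from a linear model over sparse,
# manually engineered features to a non-linear model over learned,
# distributed representations. All sizes below are illustrative.
import torch
import torch.nn as nn

VOCAB, DIM, CLASSES = 10_000, 100, 3

# Old style: a linear classifier over a high-dimensional, sparse
# bag-of-words vector (one weight per discrete feature).
linear_model = nn.Linear(VOCAB, CLASSES)

# New style: look up low-dimensional dense embeddings, combine them,
# and pass the result through a non-linear hidden layer (an MLP).
class MLPOverEmbeddings(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, DIM)
        self.hidden = nn.Linear(DIM, 64)
        self.out = nn.Linear(64, CLASSES)

    def forward(self, token_ids):                    # token_ids: (batch, seq_len)
        averaged = self.emb(token_ids).mean(dim=1)   # average the word embeddings
        return self.out(torch.relu(self.hidden(averaged)))

bow = torch.zeros(1, VOCAB)                # sparse one-hot-style input
ids = torch.randint(0, VOCAB, (1, 12))     # dense index-based input
print(linear_model(bow).shape, MLPOverEmbeddings()(ids).shape)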

  6. Architectures and model design
     ◮ Linear classifiers, feed-forward networks (MLPs and CNNs) and RNNs.
     ◮ Various instantiations of 1d CNNs:
       ◮ Multi-channel, stacked / hierarchical, graph CNNs
       ◮ Other choices: pooling strategy, window sizes, number of filters, stride...
     ◮ Variations beyond simple RNNs:
       ◮ (Bi)LSTM + GRU (gating), attention and stacking.
       ◮ Variations of how RNNs can be used: acceptors, transducers, conditioned generation (encoder-decoder / seq.-to-seq.)
       ◮ Various ways of performing sequence labeling with RNNs (one such setup is sketched after this list)
     Various aspects of modeling common to all the neural architectures:
     ◮ Dimensionalities, regularization, initialization, handling OOVs, activation functions, batches, loss functions, learning rate, optimizer, ...
     ◮ Embedding pre-training and text pre-processing
     ◮ Backpropagation, vanishing / exploding gradients
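As a concrete illustration of the sequence-labeling point above, here is a minimal PyTorch sketch of a stacked BiLSTM tagger that predicts one tag per token. It is an assumed setup for illustration only (sizes, random data and hyper-parameters are placeholders), not the course's reference code.

# Minimal sketch (PyTorch): sequence labeling with a stacked BiLSTM,
# predicting one tag per token. All sizes are illustrative assumptions.
import torch
import torch.nn as nn

VOCAB, EMB_DIM, HIDDEN, TAGS = 5_000, 100, 128, 17

class BiLSTMTagger(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, EMB_DIM)
        # Two stacked layers, bidirectional: each per-token output
        # concatenates the forward and backward hidden states.
        self.rnn = nn.LSTM(EMB_DIM, HIDDEN, num_layers=2,
                           bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * HIDDEN, TAGS)

    def forward(self, token_ids):                   # (batch, seq_len)
        states, _ = self.rnn(self.emb(token_ids))   # (batch, seq_len, 2*HIDDEN)
        return self.out(states)                     # one tag score vector per token

tagger = BiLSTMTagger()
scores = tagger(torch.randint(0, VOCAB, (2, 10)))
loss = nn.CrossEntropyLoss()(scores.view(-1, TAGS),
                             torch.randint(0, TAGS, (2 * 10,)))
loss.backward()                                     # backpropagation through time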

  7. Representations
     ◮ An important part of the neural ‘revolution’ in NLP: the input representations provided to the learner.
     ◮ Traditional feature vectors: high-dimensional, sparse, categorical and discrete. Based on manually specified feature templates.
     ◮ Word embeddings: low-dimensional, dense, continuous and distributed. Often learned automatically, e.g. as a language model (loading such pre-trained vectors is sketched below).
     ◮ Main benefits of using embeddings rather than one-hot encodings:
       ◮ Information sharing between features, which counteracts data sparseness.
       ◮ Can be computed from unlabelled data.
     ◮ We’ve also considered various tasks for intrinsic evaluation of distributional word vectors.
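To make the pre-trained embedding options concrete, here is a minimal PyTorch sketch of plugging an embedding matrix into a model either frozen (static) or fine-tuned. The matrix is random only to keep the example self-contained; in practice it would hold e.g. word2vec or fastText vectors aligned with the model's vocabulary.

# Minimal sketch (PyTorch): using pre-trained word embeddings, either
# frozen (static) or fine-tuned during training. The matrix is a random
# placeholder standing in for real pre-trained vectors.
import torch
import torch.nn as nn

VOCAB, DIM = 5_000, 100
pretrained = torch.randn(VOCAB, DIM)       # placeholder for real vectors

frozen_emb = nn.Embedding.from_pretrained(pretrained, freeze=True)
tuned_emb = nn.Embedding.from_pretrained(pretrained, freeze=False)

ids = torch.tensor([[4, 42, 7]])           # dense indices, not one-hot vectors
print(frozen_emb(ids).shape)               # (1, 3, 100): low-dimensional, dense
print(frozen_emb.weight.requires_grad,     # False: static during training
      tuned_emb.weight.requires_grad)      # True: updated with the task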

  8. Representation learning
     ◮ With neural network models, our main interest is not always in the final classification outcome itself.
     ◮ Rather, we might be interested in the learned internal representations.
     ◮ Examples?
       ◮ Embeddings in neural models:
         ◮ Pre-trained or learned from scratch (with one-hot input)
         ◮ Static (frozen) or dynamic.
       ◮ The pooling layer of a CNN or the final hidden state of an RNN provides a fixed-length representation of an arbitrary-length sequence (both options are sketched below).
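A minimal PyTorch sketch of the two fixed-length options mentioned in the last bullet, under purely illustrative dimensionalities: the final hidden state of an LSTM encoder, and max-pooling over time on top of a 1d convolution.

# Minimal sketch (PyTorch): two ways of turning a variable-length token
# sequence into a fixed-length vector reusable as a learned representation.
import torch
import torch.nn as nn

EMB_DIM, HIDDEN, FILTERS = 100, 128, 64
tokens = torch.randn(1, 15, EMB_DIM)       # (batch, seq_len, emb_dim)

# (a) Final hidden state of an LSTM encoder.
lstm = nn.LSTM(EMB_DIM, HIDDEN, batch_first=True)
_, (h_n, _) = lstm(tokens)
sentence_vec_rnn = h_n[-1]                 # (1, HIDDEN), independent of seq_len

# (b) Max-pooling over time on top of a 1d convolution.
conv = nn.Conv1d(EMB_DIM, FILTERS, kernel_size=3)
feature_maps = conv(tokens.transpose(1, 2))          # (1, FILTERS, seq_len-2)
sentence_vec_cnn = feature_maps.max(dim=2).values    # (1, FILTERS)

print(sentence_vec_rnn.shape, sentence_vec_cnn.shape)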

  9. Specialized NN architectures
     ◮ The focus of manual engineering has shifted from features to architecture decisions and hyper-parameters.
     ◮ The elimination of feature engineering is only partially true:
       ◮ We need specialized NN architectures that extract higher-level features:
       ◮ CNNs and RNNs.
     ◮ Pitch: layers and architectures are like Lego bricks – mix and match.
     ◮ Examples of things you could be asked to reflect on:
       ◮ When would you use each architecture?
       ◮ What are some of the ways we’ve combined the various bricks?
       ◮ When choosing to apply a non-hierarchical CNN, what assumptions are you implicitly making about the nature of your task or data?
       ◮ Why could it make sense to run a CNN over the word-by-word vector outputs of an RNN (e.g. a BiLSTM)? (One such combination is sketched below.)
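As one possible answer to the last question, here is a minimal PyTorch sketch of running a 1d CNN over the per-token outputs of a BiLSTM: the BiLSTM contextualizes each token, and the convolution then extracts local n-gram-like patterns over those contextualized vectors. The sizes are illustrative assumptions, and this is only one way of combining the bricks.

# Minimal sketch (PyTorch): a 1d CNN over the per-token outputs of a BiLSTM,
# followed by max-pooling over time and a linear classifier.
import torch
import torch.nn as nn

EMB_DIM, HIDDEN, FILTERS, CLASSES = 100, 128, 64, 2

class BiLSTMThenCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.rnn = nn.LSTM(EMB_DIM, HIDDEN, bidirectional=True,
                           batch_first=True)
        self.conv = nn.Conv1d(2 * HIDDEN, FILTERS, kernel_size=3)
        self.out = nn.Linear(FILTERS, CLASSES)

    def forward(self, embedded):                    # (batch, seq_len, EMB_DIM)
        states, _ = self.rnn(embedded)              # (batch, seq_len, 2*HIDDEN)
        maps = torch.relu(self.conv(states.transpose(1, 2)))
        pooled = maps.max(dim=2).values             # fixed-length, max over time
        return self.out(pooled)

model = BiLSTMThenCNN()
print(model(torch.randn(4, 20, EMB_DIM)).shape)     # (4, CLASSES)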

  10. INF5820: Experiment Design
      Methodology
      ◮ Small, elite group of I:ST finishers;
      ◮ engage everyone from start to finish;
      ◮ an Olympic twist: friendly competition;
      ◮ acquire practical skills and intuitions.
      Main Results
      ◮ We are very happy with the results from the experiment (so far);
      ◮ we commonly apply two key metrics in internal evaluation:
        ◮ retention rate: 9 / 9;
