Libraries and Tools: 🤗 Transformers, AllenNLP
LING575 Analyzing Neural Language Models
Shane Steinert-Threlkeld
February 6, 2020
Outline
- Very helpful tools
  - 🤗 Transformers
  - AllenNLP
- Walk-through of a classifier and a tagger
- Second half: tips/tricks for experiment running and paper writing
🤗 Transformers
https://huggingface.co/transformers
Where to get LMs to analyze?
- RNNs: see week 3 slides
  - Jozefowicz et al., "Exploring the limits..."
  - Gulordava et al., "Colorless green ideas..."
  - ELMo via AllenNLP (about which more later)
  - Effectively a unique API for each model
- All (essentially) Transformer-based models: HuggingFace!
Overview of the Library
- Access to many variants of many very large LMs (BERT, RoBERTa, XLNet, ALBERT, T5, language-specific models, ...) with a fairly consistent API
  - Build tokenizer + model from a string name or a config
  - Then use just like any PyTorch nn.Module
- Emphasis on ease of use
  - E.g. low barrier to entry to using the models, including for analysis
- Interoperable with PyTorch or TensorFlow 2.0
Example: Tokenization
See http://juditacs.github.io/2019/02/19/bert-tokenization-stats.html (h/t Naomi Shapiro)
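The tokenizer code on this slide is not preserved in the text export; a minimal hedged sketch of what BERT tokenization looks like with the library (the example sentence is an illustration, not from the slide):

  from transformers import BertTokenizer

  tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

  # WordPiece splits out-of-vocabulary words into subword pieces marked with '##'
  tokens = tokenizer.tokenize("Transformers are very helpful tools.")
  print(tokens)

  # encode() maps to vocabulary ids and adds the [CLS] and [SEP] special tokens
  ids = tokenizer.encode("Transformers are very helpful tools.", add_special_tokens=True)
  print(ids)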
Example: Forward Pass
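As with the tokenization slide, the code itself is not in the export; a hedged sketch of a basic forward pass (model choice and variable names are assumptions):

  import torch
  from transformers import BertModel, BertTokenizer

  tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
  model = BertModel.from_pretrained("bert-base-uncased")
  model.eval()

  # A batch of one sentence; encode() adds [CLS] and [SEP]
  input_ids = torch.tensor([tokenizer.encode("Colorless green ideas sleep furiously.",
                                             add_special_tokens=True)])
  with torch.no_grad():
      outputs = model(input_ids)   # a tuple of Tensors (see next slide)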
Outputs from the forward pass
- Outputs are always tuples of Tensors
- BERT, by default, gives two things:
  - Top-layer embeddings for each token.
    Shape: (batch_size, max_length, embedding_dimension)
  - Pooled representation: embedding of the "[CLS]" token, passed through one tanh layer.
    Shape: (batch_size, embedding_dimension)
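Continuing the sketch above (variable names are assumptions), unpacking and checking those two outputs:

  last_hidden_state, pooled_output = outputs
  print(last_hidden_state.shape)   # (batch_size, max_length, embedding_dimension); 768-dim for bert-base
  print(pooled_output.shape)       # (batch_size, embedding_dimension)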
Getting more out of a model

  from transformers import BertConfig, BertModel

  # Ask the model to also return all hidden states and attention maps
  config = BertConfig.from_pretrained(
      "bert-base-uncased",
      output_attentions=True,
      output_hidden_states=True)
  model = BertModel.from_pretrained("bert-base-uncased", config=config)

- Now the output is a 4-tuple, additionally containing:
  - Hidden states: a tuple of tensors, the embedding output plus one per layer. Length: # layers + 1
    Shape of each: (batch_size, max_length, embedding_dimension)
  - Attention heads: a tuple of tensors, one per layer. Length: # layers
    Shape of each: (batch_size, num_heads, max_length, max_length)
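A hedged continuation of the earlier sketch (variable names are assumptions): with this config, the forward pass exposes the extra elements, which can be indexed per layer:

  outputs = model(input_ids)
  last_hidden_state, pooled_output, hidden_states, attentions = outputs

  # hidden_states[0] is the embedding output; hidden_states[i] is the output of layer i
  layer_8_states = hidden_states[8]   # (batch_size, max_length, embedding_dimension)
  # attentions[i] holds the attention maps of layer i + 1
  layer_8_attn = attentions[7]        # (batch_size, num_heads, max_length, max_length)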
What the library does well
- Very easy tokenization
- Forward pass of models
- Exposing as many internals as possible
  - All layers, attention heads, etc.
- As unified an interface as possible
  - But: different models have different properties, controlled by Configs
  - Read the docs carefully!
What the library does not do
- Anything related to training
  - Padding
  - Batching
  - Optimizing probe models, etc.
- Use PyTorch (or TF) for that
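For instance, a minimal sketch, assuming the tokenizer and model from the earlier slides plus hypothetical sentences, labels, and num_labels variables, of hand-padding a batch and optimizing a linear probe on frozen BERT features in PyTorch:

  import torch
  from torch import nn

  # Pad a batch of variable-length id sequences to a common length
  batch = [tokenizer.encode(s, add_special_tokens=True) for s in sentences]
  max_len = max(len(ids) for ids in batch)
  pad_id = tokenizer.pad_token_id
  input_ids = torch.tensor([ids + [pad_id] * (max_len - len(ids)) for ids in batch])
  attention_mask = (input_ids != pad_id).long()

  # A linear probe over the frozen [CLS] representation
  probe = nn.Linear(model.config.hidden_size, num_labels)
  optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)

  with torch.no_grad():                                   # BERT itself stays frozen
      hidden, _ = model(input_ids, attention_mask=attention_mask)[:2]
  logits = probe(hidden[:, 0])                            # [CLS] position
  loss = nn.functional.cross_entropy(logits, labels)
  optimizer.zero_grad()
  loss.backward()
  optimizer.step()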
AllenNLP
https://allennlp.org/
Overview of AllenNLP
- Built on top of PyTorch
- Flexible data API
- Abstractions for common use cases in NLP
  - E.g. take a sequence of representations and give me a single one
- Modular: because of that, you can swap different options in and out, for good experiments
- Declarative model-building / training via config files
- See:
  - https://github.com/allenai/writing-code-for-nlp-research-emnlp2018
  - https://allennlp.org/tutorials
  - https://github.com/jbarrow/allennlp_tutorial
Some Advantages
- Focus on modeling / experimenting, not writing boilerplate, e.g. the training loop:

    for each epoch:
        for each batch:
            get model outputs on batch
            compute loss
            compute gradients
            update parameters

- Not that complicated, but you also need:
  - Early stopping
  - Check-pointing (saving best model(s))
  - Generating and padding the batches
  - Logging results
  - ...
- With AllenNLP, all of this comes from a single command:

    allennlp train myexperiment.jsonnet
Example Abstractions
- TextFieldEmbedder
- Seq2SeqEncoder
- Seq2VecEncoder
- Attention
- ...
- Allows for easy swapping of different choices at every level in your model (a sketch follows below)
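As a hedged illustration (not code from the slides) of the Seq2VecEncoder abstraction, taking a sequence of representations and giving back a single one, where any registered encoder is a drop-in replacement:

  import torch
  from allennlp.modules.seq2vec_encoders import BagOfEmbeddingsEncoder, CnnEncoder

  encoder = BagOfEmbeddingsEncoder(embedding_dim=768)
  # encoder = CnnEncoder(embedding_dim=768, num_filters=100)   # drop-in alternative

  sequence = torch.randn(2, 10, 768)   # (batch_size, seq_len, dim)
  mask = torch.ones(2, 10)             # 1 where a real token is present
  single = encoder(sequence, mask)     # (batch_size, encoder.get_output_dim())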
Overall Structure (Classification)
- DatasetReader
- Model
- Iterator
- Trainer
Basic Components: Dataset Reader
- Datasets are collections of Instances, which are collections of Fields
  - For text classification, e.g.: one TextField, one LabelField
  - Many more: https://allenai.github.io/allennlp-docs/api/data/fields/field/
- DatasetReaders... read datasets. Two primary methods:
  - _read(file): reads data from disk and yields Instances, by calling:
  - text_to_instance (variable signature)
    - Processes the "raw" data from disk into final form
    - Produces one Instance at a time
DatasetReader: Stanford Sentiment Treebank
- One line from train.txt:

  ( 3 (2 (2 The) (2 Rock)) (4 (3 (2 is) (4 (2 destined) (2 (2 (2 (2 (2 to) (2 (2 be) (2 (2 the) (2 (2 21st) (2 (2 (2 Century) (2 's)) (2 (3 new) (2 (2 ``) (2 Conan)))))))) (2 '')) (2 and)) (3 (2 that) (3 (2 he) (3 (2 's) (3 (2 going) (3 (2 to) (4 (3 (2 make) (3 (3 (2 a) (3 splash)) (2 (2 even) (3 greater)))) (2 (2 than) (2 (2 (2 (2 (1 (2 Arnold) (2 Schwarzenegger)) (2 ,)) (2 (2 Jean-Claud) (2 (2 Van) (2 Damme)))) (2 or)) (2 (2 Steven) (2 Segal))))))))))))) (2 .)))

- Core of _read:
- Core of text_to_instance:
  (code shown on the slide; see the sketch below)
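The _read and text_to_instance code from this slide is not in the text export; a hedged sketch of what such a reader's core might look like (AllenNLP 0.9-era API; the label/token extraction here is a simplification of the tree format, not the repository's actual code):

  from allennlp.data import Instance
  from allennlp.data.dataset_readers import DatasetReader
  from allennlp.data.fields import LabelField, TextField
  from allennlp.data.token_indexers import SingleIdTokenIndexer
  from allennlp.data.tokenizers import Token

  @DatasetReader.register("sst_reader")
  class SSTReader(DatasetReader):
      def __init__(self, token_indexers=None, lazy=False):
          super().__init__(lazy)
          self._token_indexers = token_indexers or {"tokens": SingleIdTokenIndexer()}

      def _read(self, file_path):
          with open(file_path) as f:
              for line in f:                        # one parse tree per line
                  yield self.text_to_instance(line.strip())

      def text_to_instance(self, line):
          label = line.strip("( ")[0]               # root node's sentiment label, e.g. "3"
          # leaves look like "(2 The)": keep the tokens, drop node labels and parens
          words = [tok.rstrip(")") for tok in line.split() if tok.endswith(")")]
          fields = {"tokens": TextField([Token(w) for w in words], self._token_indexers),
                    "label": LabelField(label)}
          return Instance(fields)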
Model
- Fine-tune or not?
Model
- NB: frozen embeddings can be pre-computed for efficiency
- (A hedged sketch of such a model follows below.)
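The model code on these two slides is likewise not in the text export; a hedged sketch of a classifier in this style (AllenNLP 0.9-era API; the class name, the fine_tune flag, and freezing via requires_grad are assumptions, not the slides' code):

  import torch
  from allennlp.data import Vocabulary
  from allennlp.models import Model
  from allennlp.modules import Seq2VecEncoder, TextFieldEmbedder
  from allennlp.nn.util import get_text_field_mask

  @Model.register("sst_classifier")
  class SSTClassifier(Model):
      def __init__(self, vocab: Vocabulary, embedder: TextFieldEmbedder,
                   encoder: Seq2VecEncoder, fine_tune: bool = False):
          super().__init__(vocab)
          self.embedder = embedder
          if not fine_tune:
              # freeze BERT; frozen embeddings could also be pre-computed for efficiency
              for param in self.embedder.parameters():
                  param.requires_grad = False
          self.encoder = encoder
          self.classifier = torch.nn.Linear(encoder.get_output_dim(),
                                            vocab.get_vocab_size("labels"))

      def forward(self, tokens, label=None):
          mask = get_text_field_mask(tokens)
          encoded = self.encoder(self.embedder(tokens), mask)
          logits = self.classifier(encoded)
          output = {"logits": logits}
          if label is not None:
              output["loss"] = torch.nn.functional.cross_entropy(logits, label)
          return output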
Where was BERT?
- In the PretrainedTransformerEmbedder
  - AllenNLP has wrappers around HuggingFace
- But note: to extract more from a model, you'll probably need to write your own class, using the existing ones as inspiration
Config file (classifying_experiment.jsonnet)
- The dataset_reader "type" in the config matches the name given to @DatasetReader.register("sst_reader")
- The remaining keys in that block are arguments to SSTReader!
Config file (classifying_experiment.jsonnet)

  allennlp train classifying_experiment.jsonnet \
      --serialization-dir test \
      --include-package classifying
TensorBoard

  tensorboard --logdir /serialization_dir/log

- Use SSH port forwarding to view server-side results locally
Tagging
- The repository also has an example of training a semantic tagger
  - Like POS tagging, but with a richer set of "semantic" tags
- Issue: the data comes with its own tokenization:
  - BERT: ['the', 'ya', '##zuka', 'are', 'the', 'japanese', 'mafia', '.']
- Need to get word-level representations out of BERT's subword representations
Tagging: Modeling
- My example: keep track of which spans of BERT tokens the original words correspond to
  - Some complication in the DatasetReader because of this
  - And then combine those representations with an arbitrary Seq2VecEncoder
- Since then (a few months ago), they've added a PretrainedTransformerMismatchedEmbedder that has essentially the same functionality
  - (Spans are pooled by summing, not by an arbitrary Seq2VecEncoder)
- Might be safest to use that (and the corresponding PretrainedTransformerMismatchedIndexer)
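A hedged sketch of the span-pooling idea itself (plain PyTorch, not the repository's code): given, for each original word, the span of BERT subword positions it occupies, pool each span into a single word-level vector:

  import torch

  def pool_subwords(hidden, spans, pool="sum"):
      """hidden: (seq_len, dim) subword representations for one sentence.
      spans: list of (start, end) subword indices per original word, end exclusive.
      Returns (num_words, dim) word-level representations."""
      word_reps = []
      for start, end in spans:
          piece = hidden[start:end]                                   # (span_len, dim)
          word_reps.append(piece.sum(0) if pool == "sum" else piece.mean(0))
      return torch.stack(word_reps)

  # e.g. a word split into two WordPieces at positions 2 and 3 gets the span (2, 4)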
On These Libraries
- If you're using transformer-based LMs, I strongly recommend HuggingFace
- But it's possible that learning AllenNLP's abstractions may cost you more time than it saves in the short term
- As always, try to use the best tool for the job at hand
Other tools for experiment management
- Disclaimer: I've never used them!
  - They might be overkill in the short term
- Guild (entirely local): https://guild.ai/
- CodaLab: https://codalab.org/
- Weights and Biases: https://www.wandb.com/
- Neptune: https://neptune.ai/
Using GPUs on Patas
Setting up local environment
- Two GPU nodes (getting a third one soon):
  - 2x Tesla P40
  - 8x Tesla M10
- For info on setting up your local environment to use these nodes in a fairly painless way:
  - https://www.shane.st/teaching/575/win20/patas-gpu.pdf
  - Pay attention to the cudatoolkit version!!
Condor job file for patas

  executable = run_exp_gpu.sh
  getenv = True
  error = exp.error
  log = exp.log
  notification = always
  transfer_executable = false
  request_memory = 8*1024
  request_GPUs = 1
  +Research = True
  Queue
Example executable

  #!/bin/sh
  conda activate my-project
  allennlp train tagging_experiment.jsonnet --serialization-dir test \
      --include-package tagging \
      --overrides "{'trainer': {'cuda_device': 1}}"