Machine Learning 2 (DS 4420, Spring 2020): Transformers. Byron C. Wallace.



  1. Machine Learning 2 DS 4420 - Spring 2020 Transformers Byron C. Wallace Material in this lecture derived from materials created by Jay Alammar (http://jalammar.github.io/illustrated-transformer/)

  2–6. Some housekeeping • First, let’s talk midterm… • Mean: 70 (scores ranged from the 30s to the high 90s) • I miscalibrated Q2 (average: 56%) ★ I gave back 5 points to everyone (mean now 75) ★ We are releasing an optional bonus assignment that covers the same content as Q2; you can use it to recover up to half (12.5 points) of the credit on that question. This will be released tonight; the due date is flexible.

  7. HW 4 • HW 4 will be released soon; due 3/24 (Tuesday)

  8. Projects! • THURSDAY 3/13 Project proposal is due! • TUESDAY 3/17 Project pitches in class!

  9. A remote possibility • There is an (increasingly) non-zero chance that Northeastern will move to holding all classes remotely in the coming days/weeks • In that case: remote/recorded lectures; remote, on-demand office hours; project presentations (and pitches) will also have to be remote or recorded (we will figure it out!) • Keep an eye on Piazza for more updates

  10–11. Today • We will introduce transformer networks, a type of neural network that has come to dominate NLP • To get there, we will first briefly review RNNs

  12. RNNs • Review [on board]

  13. Transformers • Hey, maybe we can get rid of recurrence!

  14. Attention mechanisms

  15–21. [Figure, built up across several slides: attention over a BiLSTM encoder. Word embeddings for the example "This movie … so terrible" feed a BiLSTM that produces hidden states h_1, …, h_T; attention weights α_1, …, α_T combine these into a context vector c = Σ_{i=1}^{T} α_i h_i, which the output layer maps to the prediction ŷ.]
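The figure above maps directly onto a few lines of code. Below is a minimal sketch in PyTorch (module and parameter names are illustrative, not the course notebook's): an embedding layer feeds a BiLSTM, a small learned scorer produces the weights α_i via a softmax over positions, and the context vector c = Σ_i α_i h_i is passed to the output layer.

```python
# Minimal sketch of attention pooling over BiLSTM states (illustrative names).
import torch
import torch.nn as nn

class AttentionClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim=100, hidden_dim=64, n_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.bilstm = nn.LSTM(embed_dim, hidden_dim,
                              batch_first=True, bidirectional=True)
        # scores each hidden state h_i; a softmax over positions gives alpha_i
        self.attn_score = nn.Linear(2 * hidden_dim, 1)
        self.out = nn.Linear(2 * hidden_dim, n_classes)

    def forward(self, token_ids):                         # (batch, T)
        h, _ = self.bilstm(self.embed(token_ids))         # (batch, T, 2*hidden)
        alpha = torch.softmax(self.attn_score(h), dim=1)  # (batch, T, 1)
        c = (alpha * h).sum(dim=1)                        # c = sum_i alpha_i h_i
        return self.out(c)                                # logits for y_hat
```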

  22. Transformer block source: http://jalammar.github.io/illustrated-transformer/

  23. First, embed source: http://jalammar.github.io/illustrated-transformer/

  24. Then transform source: http://jalammar.github.io/illustrated-transformer/

  25. What is “self-attention”? source: http://jalammar.github.io/illustrated-transformer/

  26. source: http://jalammar.github.io/illustrated-transformer/

  27. source: http://jalammar.github.io/illustrated-transformer/

  28. This one weird trick source: http://jalammar.github.io/illustrated-transformer/

  29. In matrices (the projection matrices are learned) source: http://jalammar.github.io/illustrated-transformer/

  30. In matrices source: http://jalammar.github.io/illustrated-transformer/
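Before the notebook TODOs, here is a minimal NumPy sketch of single-head self-attention "in matrices", following the illustrated-transformer post; function and variable names are illustrative. Each token embedding is projected into a query, key, and value via learned matrices W_Q, W_K, W_V; scaled dot products between queries and keys give the attention weights, which mix the values into new token representations.

```python
# Minimal single-head self-attention in matrix form (illustrative, NumPy).
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    """X: (T, d_model) token embeddings; W_*: learned projection matrices."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V     # queries, keys, values
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)         # (T, T) pairwise compatibility
    A = softmax(scores, axis=-1)            # each row: weights over all tokens
    return A @ V                            # (T, d_v) contextualized outputs

# toy usage: 4 tokens, d_model = 8, d_k = d_v = 4
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
W_Q, W_K, W_V = (rng.normal(size=(8, 4)) for _ in range(3))
Z = self_attention(X, W_Q, W_K, W_V)        # shape (4, 4)
```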

  31. Let’s implement… [notebook TODOs 1 & 2]

  32. OK, but what is it used for?

  33. Translation source: http://jalammar.github.io/illustrated-transformer/

  34. Translation source: http://jalammar.github.io/illustrated-transformer/

  35. Language modeling https://talktotransformer.com/

  36. BERT

  37. BERT. Paper: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova. Google AI Language. {jacobdevlin,mingweichang,kentonl,kristout}@google.com

  38. Pre-train (self-supervise) then fine-tune : A winning combo
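As a hedged sketch of what "pre-train (self-supervise) then fine-tune" looks like in practice, here is the recipe using the Hugging Face transformers library; the library and the checkpoint name are assumptions on my part (they are not part of the course notebooks), and the example input is just toy data.

```python
# Sketch only: fine-tuning a pre-trained BERT encoder with a fresh task head,
# assuming the Hugging Face `transformers` library (not used in the course notebook).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)   # pre-trained encoder + new classifier head

# one toy labeled example for the downstream task
batch = tokenizer(["what a terrible movie"], return_tensors="pt",
                  padding=True, truncation=True)
labels = torch.tensor([0])

# fine-tune: ordinary supervised updates; gradients flow through the whole encoder
outputs = model(**batch, labels=labels)
outputs.loss.backward()
```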

  39. This is a thing now. A Primer in BERTology: What we know about how BERT works. Anna Rogers, Olga Kovaleva, Anna Rumshisky. Department of Computer Science, University of Massachusetts Lowell, Lowell, MA 01854. {arogers, okovalev, arum}@cs.uml.edu

  40. [Figure from the BERT paper: the same architecture is first pre-trained on unlabeled sentence pairs A/B with the masked LM (Mask LM) and next sentence prediction (NSP) objectives, then fine-tuned on downstream tasks such as MNLI, NER, and SQuAD (question–paragraph pairs with a start/end span head).] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova. Google AI Language.

  41. Self-Supervise an Encoder (from BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, Devlin et al., Google AI Language)

  42. Self-Supervise an Encoder The cat is very cute

  43. Self-Supervise an Encoder: "The cat is very cute" → input X: "The [MASK] is very cute", target y: "cat"
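A minimal sketch of how a masked-LM training pair like the one above can be constructed (illustrative only: real BERT masks about 15% of word-piece tokens and sometimes keeps or randomly replaces the chosen token rather than always inserting [MASK]):

```python
# Sketch: turn a token sequence into a (masked input, target token) pair.
import random

def make_mlm_example(tokens, mask_token="[MASK]"):
    i = random.randrange(len(tokens))   # pick one position to hide
    x = list(tokens)
    y = x[i]                            # target: the original token
    x[i] = mask_token
    return x, y

x, y = make_mlm_example(["The", "cat", "is", "very", "cute"])
# e.g. x = ["The", "[MASK]", "is", "very", "cute"], y = "cat"
```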

  44. Let’s implement … [notebook TODO 3]

  45–48. BERT details we did not consider • BERT actually uses word-pieces rather than entire words • Also uses “positional” embeddings in the inputs to give a sense of “location” in the sequence • Multiple self-attention “heads” • Deeper (12+ layers) • Residual connections + layer norms (help prevent explosions/NaNs); see the sketch below
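For that last bullet, a minimal PyTorch sketch (illustrative names and sizes) of the residual-plus-layer-norm wrapper applied around each sub-layer of a transformer block:

```python
# Sketch: "add & norm" wrapper around a transformer sub-layer (illustrative).
import torch.nn as nn

class ResidualLayerNorm(nn.Module):
    def __init__(self, d_model, sublayer):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        # residual connection followed by layer normalization
        return self.norm(x + self.sublayer(x))

# e.g. wrapping the position-wise feed-forward part of a block
ffn = nn.Sequential(nn.Linear(512, 2048), nn.ReLU(), nn.Linear(2048, 512))
ffn_with_residual = ResidualLayerNorm(512, ffn)
```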

  49. For a more detailed implementation … • See Sasha Rush’s excellent “annotated transformer”: http://nlp.seas.harvard.edu/2018/04/03/attention.html
