Convolutional Networks for Text


  1. CS11-747 Neural Networks for NLP Convolutional Networks for Text Pengfei Liu Site https://phontron.com/class/nn4nlp2020/ With some slides by Graham Neubig

  2. Outline 1. Feature Combinations 2. CNNs and Key Concepts 3. Case Study on Sentiment Classification 4. CNN Variants and Applications 5. Structured CNNs 6. Summary

  3. An Example Prediction Problem: Sentiment Classification. Given a sentence such as "I hate this movie" or "I love this movie", predict a label from: very good, good, neutral, bad, very bad.

  4. An Example Prediction Problem: Sentiment Classification. "I hate this movie" should receive a negative label (bad / very bad); "I love this movie" a positive one (good / very good).

  5. An Example Prediction Problem: Sentiment Classification. How does our machine do this task?

  6. Continuous Bag of Words (CBOW). One of the simplest methods: look up a vector for each word of "I hate this movie", mapping discrete symbols to continuous vectors.

  7. Continuous Bag of Words (CBOW). Look up a vector for each word, average (sum) all the vectors, then multiply by a weight matrix W and add a bias to obtain scores.
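
To make the CBOW computation concrete, here is a minimal PyTorch sketch; the vocabulary size, embedding size, and five-way label set are illustrative assumptions, not values from the slides:

    import torch
    import torch.nn as nn

    class CBOW(nn.Module):
        def __init__(self, vocab_size=10000, emb_size=64, num_labels=5):
            super().__init__()
            self.lookup = nn.Embedding(vocab_size, emb_size)  # discrete symbol -> continuous vector
            self.W = nn.Linear(emb_size, num_labels)          # weight matrix W plus bias

        def forward(self, word_ids):          # word_ids: (sentence_length,)
            vectors = self.lookup(word_ids)   # one vector per word ("I", "hate", "this", "movie")
            h = vectors.mean(dim=0)           # average (or sum) all vectors
            return self.W(h)                  # scores over the labels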

  8. Deep CBOW. Sum the word vectors of "I hate this movie" as before, then apply more linear transformations followed by activation functions (a multilayer perceptron, MLP): h2 = tanh(W2 * tanh(W1*h + b1) + b2), followed by W*h2 + bias to obtain scores.
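
The Deep CBOW of slide 8 just inserts tanh MLP layers between the summed vector and the output layer; again a sketch with assumed sizes:

    import torch
    import torch.nn as nn

    class DeepCBOW(nn.Module):
        def __init__(self, vocab_size=10000, emb_size=64, hidden=64, num_labels=5):
            super().__init__()
            self.lookup = nn.Embedding(vocab_size, emb_size)
            self.mlp = nn.Sequential(                   # tanh(W2 * tanh(W1*h + b1) + b2)
                nn.Linear(emb_size, hidden), nn.Tanh(),
                nn.Linear(hidden, hidden), nn.Tanh(),
            )
            self.out = nn.Linear(hidden, num_labels)    # W, bias -> scores

        def forward(self, word_ids):
            h = self.lookup(word_ids).sum(dim=0)        # sum the word vectors
            return self.out(self.mlp(h))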

  9. What’s the Use of the “Deep”? • Multiple MLP layers allow us to easily learn feature combinations (a node in the second layer might be “feature 1 AND feature 5 are active”) • e.g. they can capture things such as “not” AND “hate” occurring together • BUT! They cannot handle “not hate” as a phrase, since the bag of words discards word order

  10. Handling Combinations

  11. Bag of n-grams. An n-gram is a contiguous sequence of words in "I hate this movie"; concatenate the word vectors of each n-gram, score each n-gram, sum over all n-grams, add a bias, and apply a softmax to obtain probs.

  12. Why Bag of n-grams? • Allows us to capture combination features in a simple way, e.g. “don’t love”, “not the best” • A decent baseline, and works pretty well
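
A rough sketch of the bag-of-n-grams scorer from slides 11-12, assuming bigrams and a single linear layer over the concatenated word vectors (the slides do not fix these details):

    import torch
    import torch.nn as nn

    class BagOfNgrams(nn.Module):
        """Score each n-gram from its concatenated word vectors, then sum over n-grams."""
        def __init__(self, vocab_size=10000, emb_size=64, n=2, num_labels=5):
            super().__init__()
            self.n = n
            self.lookup = nn.Embedding(vocab_size, emb_size)
            self.score = nn.Linear(n * emb_size, num_labels)

        def forward(self, word_ids):                     # word_ids: (sentence_length,)
            vecs = self.lookup(word_ids)                 # (length, emb_size)
            scores = 0
            for i in range(len(word_ids) - self.n + 1):
                ngram = vecs[i:i + self.n].reshape(-1)   # concatenate the word vectors
                scores = scores + self.score(ngram)      # sum( ... ) over all n-grams
            return torch.softmax(scores, dim=-1)         # scores -> probs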

  13. What Problems w/ Bag of n-grams? • Same as before: parameter explosion • No sharing between similar words/n-grams • Lose the global sequence order

  14. What Problems w/ Bag of n-grams? • Same as before: parameter explosion • No sharing between similar words/n-grams • Lose the global sequence order Other solutions?

  15. Neural Sequence Models

  16. Neural Sequence Models. Most NLP tasks can be cast as sequence representation learning problems.

  17. Neural Sequence Models char : i-m-p-o-s-s-i-b-l-e word : I-love-this-movie

  18. Neural Sequence Models CBOW Bag of n-grams CNNs RNNs Transformer GraphNNs

  19. Neural Sequence Models CBOW Bag of n-grams CNNs RNNs Transformer GraphNNs

  20. Convolutional Neural Networks

  21. Definition of Convolution. Convolution is a mathematical operation, with a continuous and a discrete form.

  22. Definition of Convolution. Convolution is a mathematical operation, with a continuous and a discrete form.
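
The formulas themselves appear only as images on these slides; the standard textbook definitions are:

    % continuous convolution
    (f * g)(t) = \int_{-\infty}^{\infty} f(\tau)\, g(t - \tau)\, d\tau

    % discrete convolution
    (f * g)[n] = \sum_{m=-\infty}^{\infty} f[m]\, g[n - m]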

  23. Intuitive Understanding Input : feature vector Filter : learnable param. Output : hidden vector

  24. Priors Entailed by CNNs

  25. Priors Entailed by CNNs. Locality bias: different words can interact with their neighbors.

  26. Priors Entailed by CNNs. Locality bias: different words can interact with their neighbors.

  27. Priors Entailed by CNNs. Parameter sharing: the parameters of the composition function are shared across positions.

  28. Basics of CNNs

  29. Concept: 2d Convolution • Deals with a 2-dimensional signal, i.e., an image

  30. Concept: 2d Convolution

  31. Concept: 2d Convolution
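
Slides 30-31 show the 2-d case only as figures; a tiny worked example in PyTorch (toy values, not from the slides):

    import torch
    import torch.nn.functional as F

    img  = torch.arange(16.0).view(1, 1, 4, 4)   # a toy 4x4 "image": (batch, channels, H, W)
    filt = torch.ones(1, 1, 2, 2)                # a 2x2 filter that sums each window
    out  = F.conv2d(img, filt)                   # slide the filter over the image
    print(out.shape)                             # torch.Size([1, 1, 3, 3]): a 3x3 feature map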

  32. Concept: Stride. Stride: the number of units by which the filter shifts over the input matrix.

  33. Concept: Stride. Stride: the number of units by which the filter shifts over the input matrix.

  34. Concept: Stride. Stride: the number of units by which the filter shifts over the input matrix.

  35. Concept: Padding. Padding: how to deal with the units at the boundary of the input vector.

  36. Concept: Padding. Padding: how to deal with the units at the boundary of the input vector.
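
A quick way to see what stride and padding do to the output length, using torch.nn.Conv1d on an arbitrary length-7 input:

    import torch
    import torch.nn as nn

    x = torch.randn(1, 1, 7)                                       # one channel, length-7 input
    print(nn.Conv1d(1, 1, kernel_size=3)(x).shape[-1])             # stride 1, no padding: 7-3+1 = 5
    print(nn.Conv1d(1, 1, kernel_size=3, stride=2)(x).shape[-1])   # stride 2: floor((7-3)/2)+1 = 3
    print(nn.Conv1d(1, 1, kernel_size=3, padding=1)(x).shape[-1])  # zero-pad the boundaries: 7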

  37. Three Types of Convolutions. Narrow: input length m=7, filter width n=3, output length m-n+1=5.

  38. Three Types of Convolutions. Narrow: m=7, n=3, output length m-n+1=5. Equal: m=7, n=3, output length m=7.

  39. Three Types of Convolutions. Narrow: m=7, n=3, output length m-n+1=5.

  40. Three Types of Convolutions. Narrow: m=7, n=3, output length m-n+1=5. Equal: m=7, n=3, output length m=7.

  41. Three Types of Convolutions. Narrow: m=7, n=3, output length m-n+1=5. Equal: m=7, n=3, output length m=7. Wide: m=7, n=3, output length m+n-1=9.
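
The three regimes correspond to different padding choices; a sketch for m=7 and n=3 (PyTorch, stride 1):

    import torch
    import torch.nn as nn

    m, n = 7, 3                                   # input length m, filter width n
    x = torch.randn(1, 1, m)

    narrow = nn.Conv1d(1, 1, n, padding=0)        # output length m - n + 1 = 5
    equal  = nn.Conv1d(1, 1, n, padding=n // 2)   # output length m = 7 (n odd)
    wide   = nn.Conv1d(1, 1, n, padding=n - 1)    # output length m + n - 1 = 9

    for name, conv in [("narrow", narrow), ("equal", equal), ("wide", wide)]:
        print(name, conv(x).shape[-1])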

  42. Concept: Multiple Filters Motivation: each filter represents a unique feature of the convolution window.

  43. Concept: Pooling • Pooling is an aggregation operation, aiming to select informative features

  44. Concept: Pooling • Pooling is an aggregation operation, aiming to select informative features • Max pooling: “Did you see this feature anywhere in the range?” (most common) • Average pooling: “How prevalent is this feature over the entire range” • k-Max pooling: “Did you see this feature up to k times?” • Dynamic pooling: “Did you see this feature in the beginning? In the middle? In the end?”

  45. Concept: Pooling Max pooling:

  46. Concept: Pooling Max pooling: Mean pooling:

  47. Concept: Pooling Max pooling: Mean pooling: K-max pooling

  48. Concept: Pooling Max pooling: Mean pooling: K-max pooling Dynamic pooling:
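
The pooling variants of slides 44-48 in a few lines of PyTorch, over an assumed matrix of 10 positions x 64 features:

    import torch

    h = torch.randn(10, 64)                  # 10 positions, 64 features per position

    max_pooled  = h.max(dim=0).values        # "did you see this feature anywhere?"
    mean_pooled = h.mean(dim=0)              # "how prevalent is this feature overall?"
    kmax_pooled = h.topk(k=3, dim=0).values  # k-max: the 3 strongest activations per feature

    # dynamic pooling: pool separately over the beginning / middle / end, then concatenate
    chunks = torch.chunk(h, 3, dim=0)
    dynamic_pooled = torch.cat([c.max(dim=0).values for c in chunks])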

  49. Case Study: Convolutional Networks for Text Classification (Kim 2014)

  50. CNNs for Text Classification (Kim 2014) • Task: sentiment classification • Input: a sentence • Output: a class label (positive/negative)

  51. CNNs for Text Classification (Kim 2014) • Task: sentiment classification • Input: a sentence • Output: a class label (positive/negative) • Model: • Embedding layer • Multi-channel CNN layer • Pooling layer / Output layer

  52. Overview of the Architecture: Input → look-up (Dict) → CNN (filters) → Pooling → Output

  53. Embedding Layer • Build a look-up table (pre-trained? fine-tuned?) • Maps discrete symbols to distributed vectors

  54. Conv. Layer

  55. Conv. Layer • Stride size?

  56. Conv. Layer • Stride size? • 1

  57. Conv. Layer • Wide, equal, narrow?

  58. Conv. Layer • Wide, equal, narrow? • narrow

  59. Conv. Layer • How many filters?

  60. Conv. Layer • How many filters? • 4

  61. Pooling Layer • Max-pooling • Concatenate

  62. Output Layer • MLP layer • Dropout • Softmax
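
Putting slides 50-62 together, a minimal sketch of a Kim (2014)-style classifier; the sizes below are illustrative (the lecture's toy figure uses 4 filters, the paper uses many filters of several widths):

    import torch
    import torch.nn as nn

    class KimCNN(nn.Module):
        def __init__(self, vocab_size=10000, emb_size=64, num_filters=4,
                     widths=(3, 4, 5), num_labels=2, dropout=0.5):
            super().__init__()
            self.lookup = nn.Embedding(vocab_size, emb_size)   # embedding layer (pre-trained or fine-tuned)
            self.convs = nn.ModuleList(                        # narrow convolutions, stride 1
                [nn.Conv1d(emb_size, num_filters, w) for w in widths])
            self.drop = nn.Dropout(dropout)
            self.out = nn.Linear(num_filters * len(widths), num_labels)

        def forward(self, word_ids):                   # word_ids: (batch, length)
            e = self.lookup(word_ids).transpose(1, 2)  # (batch, emb_size, length)
            pooled = [torch.relu(conv(e)).max(dim=2).values for conv in self.convs]
            feats = torch.cat(pooled, dim=1)           # max-pool each filter map, then concatenate
            return self.out(self.drop(feats))          # output layer; softmax is applied in the loss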

  63. CNN Variants

  64. Priors Entailed by CNNs • Locality bias • Parameter sharing

  65. Priors Entailed by CNNs • Locality bias: how to handle long-term dependencies? • Parameter sharing: how to handle different types of compositionality?

  66. Priors Entailed by CNNs: each prior brings both an advantage and a limitation.

  67. CNN Variants • Locality bias → long-term dependencies: increase the receptive fields (dilated convolution) • Parameter sharing → complicated interactions: dynamic filters

  68. Dilated Convolution (e.g. Kalchbrenner et al. 2016) • Captures long-term dependencies with fewer layers • Figure: a dilated CNN over the characters "i _ h a t e _ t h i s _ f i l m", whose outputs can predict the sentence class (classification), the next char (language modeling), or word classes (tagging)
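
nn.Conv1d exposes dilation directly; a sketch of how stacking dilated layers grows the receptive field exponentially (dimensions assumed):

    import torch
    import torch.nn as nn

    dim, length = 64, 16
    x = torch.randn(1, dim, length)          # e.g. character embeddings for "i _ h a t e ..."

    # doubling the dilation at each layer grows the receptive field exponentially
    layers = nn.Sequential(*[
        nn.Conv1d(dim, dim, kernel_size=3, dilation=d, padding=d)  # padding=d keeps the length fixed
        for d in (1, 2, 4, 8)
    ])
    print(layers(x).shape)   # (1, 64, 16); the receptive field now spans 31 positions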

  69. Dynamic Filter CNN (e.g. De Brabandere et al. 2016) • In a standard CNN the filter parameters are static, failing to capture rich interaction patterns • Instead, filters are generated dynamically, conditioned on the input
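
A rough sketch of the idea, adapted from the image-domain dynamic filter network to 1-d text; the mean-pooled "summary" used to generate the filters is an illustrative assumption, not the paper's exact architecture:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class DynamicFilterConv1d(nn.Module):
        def __init__(self, dim=64, width=3):
            super().__init__()
            self.dim, self.width = dim, width
            # a small generator predicts the filter weights from a summary of the input
            self.gen = nn.Linear(dim, dim * dim * width)

        def forward(self, x):                          # x: (batch, dim, length)
            outputs = []
            for xi in x:                               # filters differ per example, so loop over the batch
                w = self.gen(xi.mean(dim=-1))          # input summary -> flat filter parameters
                w = w.view(self.dim, self.dim, self.width)
                outputs.append(F.conv1d(xi.unsqueeze(0), w, padding=self.width // 2))
            return torch.cat(outputs, dim=0)           # each example is convolved with its own filters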

  70. Common Applications

  71. CNN Applications • Word-level CNNs • Basic unit: word • Learn the representation of a sentence • Phrasal patterns • Char-level CNNs • Basic unit: character • Learn the representation of a word • Extract morphological patterns
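
For the char-level case (used again in the tagging model of slide 75), a sketch of a CNN that builds a word representation from its characters; the alphabet size and dimensions are assumptions:

    import torch
    import torch.nn as nn

    class CharCNNWordEncoder(nn.Module):
        """Embed characters, convolve over them, and max-pool into one word vector."""
        def __init__(self, num_chars=100, char_emb=16, num_filters=32, width=3):
            super().__init__()
            self.lookup = nn.Embedding(num_chars, char_emb)
            self.conv = nn.Conv1d(char_emb, num_filters, width, padding=width // 2)

        def forward(self, char_ids):                    # char_ids: (word_length,), e.g. "impossible"
            e = self.lookup(char_ids).t().unsqueeze(0)  # (1, char_emb, word_length)
            h = torch.relu(self.conv(e))                # filters pick up morphological patterns
            return h.max(dim=2).values.squeeze(0)       # (num_filters,) word representation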

  72. CNN Applications • Word-level CNN • Sentence representation

  73. NLP (Almost) from Scratch (Collobert et al. 2011) • One of the most important papers in NLP • The approach was proposed as early as 2008

  74. CNN Applications • Word-level CNN • Sentence representation • Char-level CNN • Text Classification

  75. CNN-RNN-CRF for Tagging (Ma et al. 2016) • A classic framework and de-facto standard for tagging • A char-CNN is used to learn word representations (extracting morphological information) • The char-level and word-level representations are complementary

  76. Structured Convolution

  77. Why Structured Convolution? The man ate the egg.

  78. Why Structured Convolution? The man ate the egg. (vanilla CNN over the sentence)

  79. Why Structured Convolution? The man ate the egg. (vanilla CNN) • Some convolutional operations are not necessary • e.g. noun-verb pairs are very informative, but are not captured by normal CNNs

  80. Why Structured Convolution? The man ate the egg. • Some convolutional operations are not necessary • e.g. noun-verb pairs are very informative, but are not captured by normal CNNs • Language has structure, and we would like to use it to localize features

  81. Why Structured Convolution? The man ate the egg. • Some convolutional operations are not necessary • e.g. noun-verb pairs are very informative, but are not captured by normal CNNs • Language has structure, and we would like to use it to localize features • The "structure" provides a stronger prior!
