
IN5550: Neural Methods in Natural Language Processing
Lecture 2: Supervised Machine Learning: from Linear Models to Neural Networks
Andrey Kutuzov, Vinit Ravishankar, Jeremy Barnes, Lilja Øvrelid, Stephan Oepen, & Erik Velldal
University of Oslo


1. Linear classifiers: a simple linear function

f(x; W, b) = x·W + b    (1)

◮ Function input:
  ◮ feature vector x ∈ ℝ^{d_in};
  ◮ each training instance is represented with d_in features;
  ◮ for example, some properties of the documents.
◮ Function parameters θ:
  ◮ matrix W ∈ ℝ^{d_in × d_out};
  ◮ d_out is the dimensionality of the desired prediction (the number of classes);
  ◮ bias vector b ∈ ℝ^{d_out};
  ◮ the bias 'shifts' the function output in some direction.
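
To make equation (1) concrete, here is a minimal NumPy sketch; the dimensions (d_in = 4, d_out = 3) and all the values are invented for illustration:

```python
import numpy as np

# One training instance with d_in = 4 features.
x = np.array([1.0, 0.0, 2.0, 1.0])   # x in R^4

# Parameters theta = (W, b): W maps d_in features to d_out = 3 output classes.
W = np.random.randn(4, 3)            # W in R^{4 x 3}
b = np.zeros(3)                      # b in R^3

# Equation (1): f(x; W, b) = x·W + b
scores = x @ W + b                   # one score per class, shape (3,)
print(scores)
```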

2. Linear classifiers: training a linear classifier

f(x; W, b) = x·W + b,   θ = (W, b)

◮ Training is finding the optimal θ.
◮ 'Optimal' means 'producing predictions ŷ closest to the gold labels y on our n training instances'.
◮ Ideally, ŷ = y.
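
As a rough illustration of 'closest to the gold labels', one simple way to measure how good a candidate θ is on the training set is the fraction of instances where ŷ = y (the labels below are invented):

```python
import numpy as np

y_gold = np.array([1, -1, 1, 1, -1])    # gold labels for n = 5 training instances
y_hat  = np.array([1, -1, -1, 1, -1])   # predictions made with some candidate theta

# Accuracy: how often the prediction matches the gold label.
accuracy = np.mean(y_hat == y_gold)
print(accuracy)                          # 0.8 -> 4 out of 5 instances correct
```

(In practice, training optimizes a differentiable loss rather than raw accuracy.)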

3. Linear classifiers: a two-feature example

Here, training instances are represented with 2 features each (x = [x_0, x_1]) and labeled with 2 class labels (y = {black, red}):

◮ The parameters of f(x; W, b) = x·W + b define the line (or hyperplane) separating the instances.
◮ This decision boundary is actually our learned classifier.
◮ NB: the dataset on the plot is linearly separable.
◮ Question: lines with 3 different values of b are shown. Which is the best?
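
A small sketch of how such a 2D decision boundary classifies points (the weights, bias, and points are invented): the sign of x·w + b tells us which side of the line an instance falls on.

```python
import numpy as np

# Hypothetical line in the 2D feature space: w[0]*x0 + w[1]*x1 + b = 0
w = np.array([1.5, -2.0])
b = 0.5

points = np.array([[0.0, 1.0],     # some 2D instances (x0, x1)
                   [2.0, 0.5],
                   [-1.0, -1.0]])

scores = points @ w + b                          # signed score per point
labels = np.where(scores >= 0, 'red', 'black')   # which side of the boundary
print(list(zip(scores.tolist(), labels.tolist())))
```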

4. Linear classifiers: how can we represent our data (X)?

◮ Imagine you have a review of a film and you want to know whether the reviewer likes the film or hates it.
◮ What are the simplest features that would help you decide this?
◮ Words like good, bad, great, terrible, etc.
◮ Maybe actors' names (Meryl Streep, Steven Seagal).
◮ The simplest way to represent these words as features is a bag-of-words representation.

5. Linear classifiers: bag of words

◮ Each word from a pre-defined vocabulary D can be a separate feature:
  ◮ how many times the word a appears in document i,
  ◮ or a binary flag {1, 0} for whether a appeared in i at all or not.
◮ This scheme is called 'bag of words' (BoW).
◮ For example, if we have 1000 words in the vocabulary:
  ◮ x_i ∈ ℝ^{1000}
  ◮ x_i = [20, 16, 0, 10, 0, ..., 3]
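
A minimal sketch of building such count-based (or binary) BoW vectors by hand; the vocabulary and the document are invented, and libraries such as scikit-learn's CountVectorizer implement the same idea:

```python
from collections import Counter

# Hypothetical pre-defined vocabulary D.
vocab = ['good', 'bad', 'great', 'terrible', 'film', 'plot']

def bow_vector(document, binary=False):
    """Count-based (or binary) bag-of-words vector over the fixed vocabulary."""
    counts = Counter(document.lower().split())
    vec = [counts.get(word, 0) for word in vocab]
    return [min(c, 1) for c in vec] if binary else vec

doc = "Great film , great plot , not bad"
print(bow_vector(doc))                # [0, 1, 2, 0, 1, 1]
print(bow_vector(doc, binary=True))   # [0, 1, 1, 0, 1, 1]
```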

6. Linear classifiers: BoW as a sum of one-hot vectors

◮ The bag-of-words feature vector of x can be interpreted as the sum of one-hot vectors (o), one per token in the text:
◮ D extracted from the example text on the slide contains 10 words (lowercased): {'-', 'by', 'in', 'most', 'norway', 'road', 'the', 'tourists', 'troll', 'visited'}.
◮ o_0 = [0, 0, 0, 0, 0, 0, 1, 0, 0, 0]
◮ o_1 = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]
◮ etc.
◮ x_i = [1, 1, 1, 1, 1, 2, 2, 1, 1, 1] ('the' and 'road' are mentioned 2 times)
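
The same construction in NumPy. The example sentence below is a guess that merely reproduces the counts from the slide ('the' and 'road' twice, everything else once), since the original text is shown only as an image:

```python
import numpy as np

# Vocabulary D from the slide (lowercased, sorted).
vocab = ['-', 'by', 'in', 'most', 'norway', 'road', 'the',
         'tourists', 'troll', 'visited']
word2idx = {w: i for i, w in enumerate(vocab)}

# Hypothetical tokenized sentence consistent with the counts above.
tokens = ['the', 'troll', 'road', 'in', 'norway', '-', 'visited',
          'by', 'most', 'tourists', 'the', 'road']

# One one-hot vector o per token ...
one_hots = np.eye(len(vocab), dtype=int)[[word2idx[t] for t in tokens]]

# ... and their sum is the BoW vector of the whole text.
x_i = one_hots.sum(axis=0)
print(x_i)    # [1 1 1 1 1 2 2 1 1 1]
```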

7. Linear classifiers: can we interpret the different parts of a learned model as representations of the data?

◮ Each of the n instances (documents) is represented by a vector of features (x ∈ ℝ^{d_in}).
◮ Inversely, each feature can be represented by a vector of the instances (documents) it appears in (feature ∈ ℝ^n).
◮ Together these learned representations form the W matrix, part of θ.
◮ Thus, it contains data both about the instances and about their features (more about this later).
◮ Feature engineering is deciding which features of the instances we will use during training.
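
For instance, once a linear model is trained, each row of W can be read as the learned representation of one vocabulary feature across the output classes. A hypothetical sketch (all weights invented):

```python
import numpy as np

vocab = ['good', 'bad', 'great', 'terrible', 'film', 'plot']
classes = ['negative', 'positive', 'neutral']

# Pretend this W came out of training (values invented for illustration).
W = np.array([[-1.2,  2.0,  0.1],   # 'good'
              [ 1.8, -1.5,  0.0],   # 'bad'
              [-1.0,  2.3,  0.2],   # 'great'
              [ 2.1, -1.9, -0.1],   # 'terrible'
              [ 0.1,  0.2,  0.3],   # 'film'
              [ 0.0,  0.1,  0.2]])  # 'plot'

# The row for 'terrible' is that feature's learned representation:
# it pushes the 'negative' score up and the 'positive' score down.
print(dict(zip(classes, W[vocab.index('terrible')])))
```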

8. Linear classifiers

[Figure, shown over four build-up slides: the words 'great', 'best', 'terrible', 'worst', 'Seagal', 'the', 'road' shown in relation to the classes negative, positive, and neutral.]

9. Linear classifiers: Overview of Linear Models

10. Linear classifiers: output of binary classification

f(x; W, b) = x·W + b

Binary decision (d_out = 1):
◮ 'Is this message spam or not?'
◮ W is a vector, b is a scalar.
◮ The prediction ŷ is also a scalar: either 1 ('yes') or −1 ('no').
◮ NB: the model can output any number, but we convert all negatives to −1 and all positives to 1 (the sign function).

θ = (W ∈ ℝ^{d_in}, b ∈ ℝ^1)
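
A minimal sketch of one binary decision (the weights are invented, but chosen so that the score comes out to about 1.5, matching the worked example on the next slide):

```python
import numpy as np

x = np.array([0, 1, 0, 0, 1, 1, 1])                   # binary BoW features of one message
w = np.array([0.3, 0.4, -0.2, 0.1, 0.5, -0.3, 0.4])   # W is a vector here
b = 0.5                                               # b is a scalar

score = x @ w + b                   # raw model output: any real number (here ~1.5)
y_hat = 1 if score >= 0 else -1     # sign: 1 = 'spam', -1 = 'not spam'
print(score, y_hat)
```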

11. Linear classifiers

[Figure: worked binary example. A binary input vector is combined with a weight vector and a bias; the resulting score is 1.5, and sign(1.5) = 1.]

12. Linear classifiers: output of multi-class classification

f(x; W, b) = x·W + b

Multi-class decision (d_out = k):
◮ 'Which of k candidates authored this text?'
◮ W is a matrix, b is a vector of k components.
◮ The prediction ŷ is also a one-hot vector of k components.
◮ The component corresponding to the correct author has the value 1, the others are zeros, for example: ŷ = [0, 0, 1, 0] (for k = 4).

θ = (W ∈ ℝ^{d_in × d_out}, b ∈ ℝ^{d_out})
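
A sketch of the multi-class case with k = 4 classes (all values invented): argmax over the k scores picks the predicted class, which can then be written as a one-hot vector.

```python
import numpy as np

k, d_in = 4, 7
x = np.array([0, 1, 0, 0, 1, 1, 1])   # features of one document
W = np.random.randn(d_in, k)          # one weight column per candidate author
b = np.zeros(k)

scores = x @ W + b                    # k scores, one per class
pred = int(np.argmax(scores))         # index of the highest-scoring class

y_hat = np.zeros(k, dtype=int)        # one-hot prediction, e.g. [0, 0, 1, 0]
y_hat[pred] = 1
print(scores, y_hat)
```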

13. Linear classifiers

[Figure: worked multi-class example. A binary input vector is multiplied by a weight matrix, and argmax over the resulting class scores selects the highest-scoring class.]

14. Linear classifiers: log-linear classification

If we care about how confident the classifier is about each decision:

◮ Map the predictions to the range [0, 1]...
◮ ...by a squashing function, for example, the sigmoid:

ŷ = σ(f(x)) = 1 / (1 + e^(−f(x)))    (2)

◮ The result is the probability of the prediction!

[Plot of the sigmoid function σ(x).]
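
A small sketch of equation (2); the raw score is invented:

```python
import numpy as np

def sigmoid(z):
    """Squash any real-valued score into the (0, 1) range, equation (2)."""
    return 1.0 / (1.0 + np.exp(-z))

score = 1.5               # raw output of the linear model f(x)
print(sigmoid(score))     # ~0.82: the classifier is fairly confident in 'yes'
```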

15. Linear classifiers: softmax

◮ For multi-class cases, log-linear models produce probabilities for all classes, for example: ŷ = [0.4, 0.1, 0.9, 0.5] (for k = 4).
◮ We choose the one with the highest score:

ŷ = argmax_i ŷ[i] = ŷ[2]    (3)

◮ But often it is more convenient to transform the scores into a probability distribution, using the softmax function:

ŷ = softmax(x·W + b)    (4)

ŷ[i] = e^((x·W + b)[i]) / Σ_j e^((x·W + b)[j])    (5)
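
A minimal sketch of equations (4)-(5), applied to the example scores above:

```python
import numpy as np

def softmax(z):
    """Turn a vector of raw scores into a probability distribution, equation (5)."""
    e = np.exp(z - np.max(z))   # subtracting the max improves numerical stability
    return e / e.sum()

scores = np.array([0.4, 0.1, 0.9, 0.5])   # the k = 4 scores from the slide
probs = softmax(scores)
print(probs, probs.sum())                 # probabilities that sum to 1
print(int(np.argmax(probs)))              # 2: the same class as in equation (3)
```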
