Joint Emotion Analysis via Multi-task Gaussian Processes
Daniel Beck, Trevor Cohn, Lucia Specia
October 28, 2014
Outline
1. Introduction
2. Multi-task Gaussian Process Regression
3. Experiments and Discussion
4. Conclusions and Future Work

1. Introduction
Emotion Analysis
Goal: automatically detect emotions in a text [Strapparava and Mihalcea, 2008].

Headline                                        Fear   Joy   Sadness
Storms kill, knock out power, cancel flights      82     0        60
Panda cub makes her debut                          0    59         0
Why Multi-task?
Learn a model that shows sound and interpretable correlations between emotions.
- Datasets are scarce and small → multi-task models are able to learn from all emotions jointly;
- The annotation scheme is subjective and fine-grained → prone to bias and noise.
Disclaimer: this work is not about features (at the moment...)
Multi-task Learning and Anti-correlations
Most multi-task models used in NLP assume some degree of correlation between tasks:
- Domain Adaptation: assumes the existence of "general", domain-independent knowledge in the data.
- Annotation Noise Modelling: assumes that annotations are noisy deviations from a "ground truth".
For Emotion Analysis, we need a multi-task model that can take possible anti-correlations into account, avoiding negative transfer. Recall the example above: the storm headline scores high on Fear and Sadness but zero on Joy, while the panda headline shows the opposite pattern.
2. Multi-task Gaussian Process Regression
Gaussian Processes
Let $(X, \mathbf{y})$ be the training data and $f(\mathbf{x})$ the latent function that models that data:

$f(\mathbf{x}) \sim \mathcal{GP}(\mu(\mathbf{x}), k(\mathbf{x}, \mathbf{x}'))$

where $\mu(\mathbf{x})$ is the mean function and $k(\mathbf{x}, \mathbf{x}')$ the kernel function.

Posterior over the latent function:

$p(f \mid X, \mathbf{y}) = \frac{p(\mathbf{y} \mid X, f)\, p(f)}{p(\mathbf{y} \mid X)}$

where $p(\mathbf{y} \mid X, f)$ is the likelihood, $p(f)$ the prior and $p(\mathbf{y} \mid X)$ the marginal likelihood.

Predictive distribution for a test input $\mathbf{x}_*$:

$p(y_* \mid \mathbf{x}_*, X, \mathbf{y}) = \int_f p(y_* \mid \mathbf{x}_*, f, X, \mathbf{y})\, p(f \mid X, \mathbf{y})\, df$

where $p(y_* \mid \mathbf{x}_*, f, X, \mathbf{y})$ is the test likelihood.
GP Regression
- Likelihood: in a regression setting we usually assume a Gaussian likelihood, which allows us to obtain a closed-form solution for the test posterior.
- Kernel: many options are available; in this work we use the Radial Basis Function (RBF) kernel, also known as the Squared Exponential, Gaussian or Exponentiated Quadratic kernel:

$k(\mathbf{x}, \mathbf{x}') = \alpha_f^2 \exp\left(-\frac{1}{2} \sum_{i=1}^{F} \frac{(x_i - x_i')^2}{l_i}\right)$

where $F$ is the number of features and $l_i$ is a per-feature lengthscale.
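To make the closed form concrete, below is a minimal numpy sketch of single-task GP regression with this kernel, using the standard predictive equations $\mu_* = \mathbf{k}_*^T (K + \sigma_n^2 I)^{-1} \mathbf{y}$ and $\sigma_*^2 = k(\mathbf{x}_*, \mathbf{x}_*) - \mathbf{k}_*^T (K + \sigma_n^2 I)^{-1} \mathbf{k}_*$ for the latent function. The hyperparameter values and the toy inputs are placeholders, not those used in the experiments.

```python
import numpy as np

def rbf_kernel(X1, X2, alpha_f=1.0, lengthscales=None):
    """RBF kernel with one lengthscale per feature, as on the slide:
    k(x, x') = alpha_f^2 * exp(-0.5 * sum_i (x_i - x'_i)^2 / l_i)."""
    if lengthscales is None:
        lengthscales = np.ones(X1.shape[1])
    X1s = X1 / np.sqrt(lengthscales)               # rescale each feature by l_i
    X2s = X2 / np.sqrt(lengthscales)
    sqdist = (np.sum(X1s ** 2, axis=1)[:, None]
              + np.sum(X2s ** 2, axis=1)[None, :]
              - 2 * X1s @ X2s.T)
    return alpha_f ** 2 * np.exp(-0.5 * sqdist)

def gp_predict(X, y, X_star, noise_var=0.1, **kernel_args):
    """Closed-form GP regression prediction under a Gaussian likelihood."""
    K = rbf_kernel(X, X, **kernel_args) + noise_var * np.eye(len(X))
    K_s = rbf_kernel(X, X_star, **kernel_args)
    L = np.linalg.cholesky(K)                              # K + s^2 I = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))    # (K + s^2 I)^{-1} y
    mean = K_s.T @ alpha                                   # predictive mean
    v = np.linalg.solve(L, K_s)
    var = (rbf_kernel(X_star, X_star, **kernel_args).diagonal()
           - np.sum(v ** 2, axis=0))                       # latent variance
    return mean, var

# Toy usage with random stand-ins for real headline features.
rng = np.random.RandomState(0)
X_train, y_train = rng.randn(20, 3), rng.randn(20)
mean, var = gp_predict(X_train, y_train, rng.randn(5, 3))
```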
The Intrinsic Coregionalisation Model
Coregionalisation models extend GPs to vector-valued outputs [Álvarez et al., 2012]. Here we use the Intrinsic Coregionalisation Model (ICM):

$k((\mathbf{x}, d), (\mathbf{x}', d')) = k_{data}(\mathbf{x}, \mathbf{x}') \times B_{d,d'}$

where $k_{data}$ is a kernel on data points (like RBF, for instance) and $B$ is the coregionalisation matrix, which encodes task covariances.
$B$ can be parameterised and learned by optimising the model's marginal likelihood.
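Putting the two parts together, here is a minimal sketch of the ICM kernel, reusing the rbf_kernel sketched above as the data kernel; the function names are illustrative, not the API of any particular GP library.

```python
import numpy as np

def icm_kernel(X1, d1, X2, d2, B, data_kernel, **kernel_args):
    """ICM kernel: k((x, d), (x', d')) = k_data(x, x') * B[d, d'].
    X1, X2 hold input features; d1, d2 hold integer task (emotion) ids
    aligned row-by-row with X1 and X2."""
    K_data = data_kernel(X1, X2, **kernel_args)   # kernel on data points
    return K_data * B[np.ix_(d1, d2)]             # scale by task covariances
```

When every task is observed for the same inputs and points are ordered task-major, the resulting Gram matrix equals the Kronecker product $B \otimes K_{data}$; the element-wise form above also covers the general case where tasks are observed for different inputs.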
PPCA Model
Following [Bonilla et al., 2008], $B$ is decomposed using PPCA:

$B = U \Lambda U^T + \mathrm{diag}(\boldsymbol{\alpha})$

To ensure numerical stability, we employ the incomplete Cholesky decomposition over $U \Lambda U^T$:

$B = \tilde{L} \tilde{L}^T + \mathrm{diag}(\boldsymbol{\alpha})$
With 6 emotions, $\tilde{L}$ is a $6 \times k$ matrix, where $k$ is the rank of the decomposition, and $\boldsymbol{\alpha}$ has 6 elements:
- rank 1: $\tilde{L} \tilde{L}^T + \mathrm{diag}(\boldsymbol{\alpha}) = B$ uses 12 hyperparameters;
- rank 2: 18 hyperparameters;
- rank 3: 24 hyperparameters.
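As a sketch of how this parameterisation might look in code: the rank and the six tasks match the slide, but the initial values are placeholders; in the model they are hyperparameters optimised jointly with the kernel ones against the marginal likelihood.

```python
import numpy as np

def build_coreg_matrix(L_tilde, alpha):
    """B = L_tilde @ L_tilde.T + diag(alpha): low-rank term plus a
    per-task (emotion) diagonal."""
    return L_tilde @ L_tilde.T + np.diag(alpha)

n_tasks, rank = 6, 1
L_tilde = np.random.randn(n_tasks, rank)   # 6 * rank hyperparameters
alpha = np.ones(n_tasks)                   # 6 hyperparameters
B = build_coreg_matrix(L_tilde, alpha)     # rank 1 -> 12 coregionalisation
                                           # hyperparameters, rank 2 -> 18,
                                           # rank 3 -> 24
```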
3. Experiments and Discussion
Experimental Setup
- Dataset: SemEval-2007 "Affective Text" [Strapparava and Mihalcea, 2007];
- 1,000 news headlines, each annotated with six scores in [0-100], one per emotion.
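For concreteness, here is a sketch of one possible data layout for the multi-task model: every (headline, emotion) pair becomes one training point whose task id is the emotion index. The `featurise` function is a hypothetical stand-in for a real feature extractor (the talk deliberately leaves features aside), and the scores for emotions not shown in the earlier example are zeros used only to illustrate the layout.

```python
import numpy as np

# The six emotions annotated in the "Affective Text" dataset.
EMOTIONS = ["anger", "disgust", "fear", "joy", "sadness", "surprise"]

def featurise(headline):
    """Hypothetical placeholder for a real feature extractor
    (e.g. bag-of-words); returns a small fixed-length vector."""
    tokens = headline.lower().split()
    return np.array([len(tokens),
                     sum(len(t) for t in tokens) / len(tokens),
                     headline.count(",")], dtype=float)

def build_multitask_dataset(headlines, scores):
    """Stack every (headline, emotion) pair: X holds features, d the
    task (emotion) index and y the [0-100] score for that emotion."""
    X, d, y = [], [], []
    for headline, emo_scores in zip(headlines, scores):
        feats = featurise(headline)
        for task_id, score in enumerate(emo_scores):
            X.append(feats)
            d.append(task_id)
            y.append(score)
    return np.array(X), np.array(d), np.array(y, dtype=float)

headlines = ["Storms kill, knock out power, cancel flights",
             "Panda cub makes her debut"]
# Fear/Joy/Sadness values come from the example slide; the remaining
# entries are illustrative zeros, not the real annotations.
scores = [[0, 0, 82, 0, 60, 0],
          [0, 0, 0, 59, 0, 0]]
X, d, y = build_multitask_dataset(headlines, scores)
```

The stacked (X, d, y) arrays can then be fed to the ICM kernel sketched earlier.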