Natural Language Processing with Deep Learning Sentiment Analysis

Natural Language Processing with Deep Learning Sentiment Analysis with Machine Learning Navid Rekab-Saz navid.rekabsaz@jku.at Institute of Computational Perception Agenda Introduction to Machine Learning Sentiment Analysis


  1. Natural Language Processing with Deep Learning
     Sentiment Analysis with Machine Learning
     Navid Rekab-Saz, navid.rekabsaz@jku.at
     Institute of Computational Perception

  2. Agenda
     • Introduction to Machine Learning
     • Sentiment Analysis
     • Feature Extraction
     • Breaking the curse of dimensionality!

  4. Notation
     § $b$ → a value or a scalar
     § $\mathbf{c}$ → an array or a vector
       - the $j$-th element of $\mathbf{c}$ is the scalar $c_j$
     § $\mathbf{D}$ → a set of arrays or a matrix
       - the $j$-th vector of $\mathbf{D}$ is $\mathbf{d}_j$
       - the $k$-th element of the $j$-th vector of $\mathbf{D}$ is the scalar $d_{j,k}$

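  5. Linear Algebra – Recap
     § Transpose
       - $\mathbf{b}$ is in 1 × d dimensions → $\mathbf{b}^{\top}$ is in d × 1 dimensions
       - $\mathbf{B}$ is in e × d dimensions → $\mathbf{B}^{\top}$ is in d × e dimensions
     § Inverse of the square matrix $\mathbf{T}$ is $\mathbf{T}^{-1}$
     § Dot product
       - $\mathbf{b} \cdot \mathbf{c}^{\top} = d$ (a scalar); dimensions: 1 × d · d × 1 = 1
       - $\mathbf{b} \cdot \mathbf{C} = \mathbf{d}$; dimensions: 1 × d · d × e = 1 × e
       - $\mathbf{B} \cdot \mathbf{C} = \mathbf{D}$; dimensions: l × m · m × n = l × n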
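The dimension rules of the recap can be checked with a small sketch. It uses plain Python lists of rows instead of a numerical library; the helper names `transpose`, `matmul` and `shape` and all example values are illustrative assumptions, not part of the lecture.

```python
def transpose(B):
    """Turn an e x d matrix (a list of rows) into a d x e matrix."""
    return [list(column) for column in zip(*B)]


def matmul(B, C):
    """Multiply an l x m matrix B by an m x n matrix C."""
    l, m, n = len(B), len(C), len(C[0])
    # The inner dimensions must agree: every row of B needs m entries.
    assert all(len(row) == m for row in B), "inner dimensions differ"
    return [[sum(B[i][k] * C[k][j] for k in range(m)) for j in range(n)]
            for i in range(l)]


def shape(B):
    """Return (number of rows, number of columns)."""
    return (len(B), len(B[0]))


b = [[1, 2, 3]]                 # row vector, 1 x 3
c = [[4, 5, 6]]                 # row vector, 1 x 3
C = [[1, 0], [0, 1], [2, 2]]    # matrix, 3 x 2

scalar = matmul(b, transpose(c))   # (1 x 3)(3 x 1) gives 1 x 1
row = matmul(b, C)                 # (1 x 3)(3 x 2) gives 1 x 2

print(shape(scalar), shape(row), shape(transpose(C)))
```

Multiplying in the other order, for example `matmul(C, b)`, trips the assertion because the inner dimensions (2 and 1) do not match.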

  6. Statistical Learning
     § Given $O$ observed data points:
         $\mathbf{Y} = [\mathbf{y}^{(1)}, \mathbf{y}^{(2)}, \dots, \mathbf{y}^{(O)}]$
       accompanied by output (label) values:
         $\mathbf{z} = [z^{(1)}, z^{(2)}, \dots, z^{(O)}]$
       where each data point is defined as a vector with $m$ dimensions (features):
         $\mathbf{y}^{(j)} = [y_1^{(j)}, y_2^{(j)}, \dots, y_m^{(j)}]$

  7. Statistical Learning
     § Statistical learning assumes that there exists a TRUE function ($g_{\text{true}}$) that has generated these data:
         $\mathbf{z} = g_{\text{true}}(\mathbf{Y}) + \vartheta$
     § $g_{\text{true}}$
       - The true but unknown function that produces the data
       - A fixed function
     § $\vartheta > 0$
       - Called irreducible error
       - Rooted in the constraints in gathering data, and in measuring and quantifying features

  8. Example
     $g_{\text{true}}$ → blue surface
     $\mathbf{Y}$ → red points with two features: Seniority, Years of Education
     $\mathbf{z}$ → Income
     $\vartheta$ → the differences between the data points and the surface

  9. Machine Learning Model
     § A machine learning (ML) model tries to estimate $g_{\text{true}}$ by defining a function $g$:
         $\hat{\mathbf{z}} = g(\mathbf{Y})$
       such that $\hat{\mathbf{z}}$ (predicted outputs) is close to $\mathbf{z}$ (real outputs).
     § The difference between the values of $\hat{\mathbf{z}}$ and $\mathbf{z}$ is the reducible error
       - Can be reduced by better models, i.e. better estimations of $g_{\text{true}}$

  10. Generalization
      § The aim of machine learning is to create a model using observed experiences (training data) that generalizes to the problem domain, namely performs well on unobserved instances (test data)

  11. Learning the model – Splitting dataset
      § Data points are split into:
        - Training set: for training the model
        - Validation set: for tuning the model’s hyper-parameters
        - Test set: for evaluating the model’s performance
      § Common train – validation – test splitting sizes
        - 60%, 20%, 20%
        - 70%, 15%, 15%
        - 80%, 10%, 10%
      [Figure: observed data points → training set and test set → training set, validation set and test set]
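A minimal sketch of such a split, assuming the data points can simply be shuffled and cut by position. The function name `split_dataset`, its parameters and the fixed seed are illustrative choices, not part of the lecture.

```python
import random


def split_dataset(points, train=0.6, validation=0.2, seed=0):
    """Shuffle the data points and cut them into train / validation / test."""
    shuffled = list(points)
    random.Random(seed).shuffle(shuffled)   # fixed seed for repeatability
    n_train = round(len(shuffled) * train)
    n_validation = round(len(shuffled) * validation)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_validation],
            shuffled[n_train + n_validation:])   # the remainder is the test set


train_set, validation_set, test_set = split_dataset(range(100))
print(len(train_set), len(validation_set), len(test_set))
```

Passing `train=0.8, validation=0.1` gives the 80%, 10%, 10% variant; the test set always receives whatever remains.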

  12. Learning the model
      Features / Variables ($Y$): sex, age, Pstatus, romantic
      Labels / Output Variable ($Z$): Walc
      sex  age  Pstatus  romantic  Walc
      F    18   A        no        1
      F    17   T        no        1
      F    15   T        no        3
      F    15   T        yes       1
      F    16   T        no        2
      M    16   T        no        2
      M    16   T        no        1
      F    17   A        no        1
      M    15   A        no        1
      M    15   T        no        1
      F    15   T        no        2
      F    15   T        no        1
      M    15   T        no        3
      M    15   T        no        2
      M    15   A        yes       1
      F    16   T        no        2
      F    16   T        no        2
      F    16   T        no        1
      M    17   T        no        4
      Pstatus: parent's cohabitation status ('T' - living together, 'A' - apart)
      Romantic: with a romantic relationship
      Walc: weekend alcohol consumption (from 1 - very low to 5 - very high)
      Dataset: http://archive.ics.uci.edu/ml/datasets/STUDENT+ALCOHOL+CONSUMPTION#

  13. Learning the model
      Train Set: the first 13 rows of the table (F 18 A no 1 through M 15 T no 3)
      Test Set: the last 6 rows of the table (M 15 T no 2 through M 17 T no 4)

  14. Learning the model
      The Test Set rows get an additional column of not-yet-known predictions (shown as "?") next to their real Walc labels $z$.

  15. Learning the model
      Train Set → Train → ML Model

  16. Learning the model
      Train Set → Train → ML Model → Predict: the model outputs a prediction $\hat{z}$ for each Test Set row.
      sex  age  Pstatus  romantic  Walc ($z$)  predicted ($\hat{z}$)
      M    15   T        no        2           1
      M    15   A        yes       1           1
      F    16   T        no        2           2
      F    16   T        no        2           2
      F    16   T        no        1           3
      M    17   T        no        4           4

  17. Learning the model
      Evaluation – Generalization error: the predictions $\hat{z}$ are compared with the real labels $z$ of the Test Set.

  18. Tuning hyper-parameters – Model selection
      § Decide on several sets of the model’s hyper-parameters to explore
      § Train a separate model for each set using the training set
      § Among the trained models, select the best-performing one based on its evaluation result on the validation set
      § Take the selected model and evaluate it on the test set → final model performance
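The selection procedure can be sketched with a k-nearest-neighbour regressor, where the number of neighbours k is the hyper-parameter. The model choice, the toy data, the candidate values and all function names are assumptions made for illustration; the lecture does not prescribe them.

```python
def knn_predict(train, query, k):
    """Average the labels of the k training points nearest to `query`."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - query))[:k]
    return sum(label for _, label in nearest) / len(nearest)


def mean_squared_error(data, train, k):
    """MSE of the k-NN model (built from `train`) over the pairs in `data`."""
    return sum((label - knn_predict(train, feature, k)) ** 2
               for feature, label in data) / len(data)


# Toy (feature, label) pairs, invented for illustration.
points = [(float(i), float(i % 7)) for i in range(40)]
train = [p for i, p in enumerate(points) if i % 5 < 3]         # 60%
validation = [p for i, p in enumerate(points) if i % 5 == 3]   # 20%
test = [p for i, p in enumerate(points) if i % 5 == 4]         # 20%

candidate_ks = [1, 3, 5]   # the hyper-parameter settings to explore

# "Training" a k-NN model only stores the training set, so each candidate
# is scored directly on the validation set.
validation_errors = {k: mean_squared_error(validation, train, k)
                     for k in candidate_ks}
best_k = min(validation_errors, key=validation_errors.get)

# Only the selected model is evaluated on the test set.
final_test_error = mean_squared_error(test, train, best_k)
print(best_k, final_test_error)
```

The test set is touched exactly once, after the choice of `best_k` is fixed, so the reported error is not biased by the selection step.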

  19. ML models
      § Parametric models
        - The model is defined as a function (or a family of functions) consisting of a set of parameters
        - Functions such as linear regression, logistic regression, naïve Bayes, and neural networks
        - The problem of finding the ML model is reduced to finding the optimum values for the parameters
      § Non-parametric models
        - There is no assumption about the form of the function
        - The model is directly learned from data
        - ML models such as SVM, k-NN, smoothing spline, Gaussian processes
      Term of the day!
      Inductive bias: all assumptions we consider in defining and creating an ML model. Our prior knowledge about what $g_{\text{true}}$ should be.

  20. A sample ML model: Linear Regression
      § $g$ is defined as a Linear Regression function:
          $z = g(\mathbf{y}; \mathbf{x}) = x_0 + x_1 y_1 + x_2 y_2 + \dots + x_m y_m$
        where $\mathbf{x} = [x_0, x_1, \dots, x_m]$ is the set of model parameters
      § In the “income” example:
          $\text{income} = g(\mathbf{y}; \mathbf{x}) = x_0 + x_1 \times \text{education} + x_2 \times \text{seniority}$
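As a sketch, the linear regression function is a weighted sum of the features plus an intercept. The code keeps the slide's naming (features in `y`, parameters in `x`); the parameter values and the example person are hypothetical and are not fitted to any data.

```python
def predict(y, x):
    """Linear regression: x[0] + x[1]*y[0] + ... + x[m]*y[m-1].

    `y` holds the m features of one data point; `x` holds the m + 1
    parameters, with the intercept first.
    """
    assert len(x) == len(y) + 1, "one parameter per feature plus the intercept"
    return x[0] + sum(weight * feature for weight, feature in zip(x[1:], y))


# Hypothetical parameters: intercept, weight of education, weight of seniority.
x = [10.0, 2.0, 0.5]

# A hypothetical person with 16 years of education and a seniority of 20.
income = predict([16.0, 20.0], x)
print(income)   # 10 + 2*16 + 0.5*20 = 52.0
```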

  21. A trained Linear Regression model

  22. Loss Function
      § Optimization of parameters is done by first defining a loss function
      § A loss function measures the discrepancies between the predicted outputs $\hat{\mathbf{z}}$ and the real ones $\mathbf{z}$
      § E.g. Mean Square Error (MSE) – a common regression loss function:
          $\mathcal{L}(z^{(j)}, \hat{z}^{(j)}; \mathbf{x}) = \frac{1}{O} \sum_{j=1}^{O} \left(z^{(j)} - \hat{z}^{(j)}\right)^2$
      Loss functions for classification: Next lectures
      Good to know! What is Mean Absolute Error and how is it different from MSE?
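A small sketch that computes MSE next to Mean Absolute Error (MAE), which averages the absolute differences instead of the squared ones. The label and prediction values are illustrative, and the function names are not taken from the lecture.

```python
def mean_squared_error(z, z_hat):
    """Average of the squared differences between labels and predictions."""
    return sum((real - predicted) ** 2
               for real, predicted in zip(z, z_hat)) / len(z)


def mean_absolute_error(z, z_hat):
    """Average of the absolute differences between labels and predictions."""
    return sum(abs(real - predicted)
               for real, predicted in zip(z, z_hat)) / len(z)


z = [2, 1, 2, 2, 1, 4]       # real outputs (illustrative values)
z_hat = [1, 1, 2, 2, 3, 4]   # predicted outputs (illustrative values)

mse = mean_squared_error(z, z_hat)    # (1 + 0 + 0 + 0 + 4 + 0) / 6
mae = mean_absolute_error(z, z_hat)   # (1 + 0 + 0 + 0 + 2 + 0) / 6
print(mse, mae)
```

The prediction that is off by 2 contributes 4 to the squared sum but only 2 to the absolute sum, so MSE penalizes large errors more heavily than MAE does.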

  23. Optimization
      § Next, training data is used to find an optimum set of parameters $\mathbf{x}^*$ by optimizing the loss function:
          $\mathbf{x}^* = \arg\min_{\mathbf{x}} \mathcal{L}(z^{(j)}, \hat{z}^{(j)}; \mathbf{x})$
        MSE: $\mathbf{x}^* = \arg\min_{\mathbf{x}} \frac{1}{O} \sum_{j=1}^{O} \left(z^{(j)} - g(\mathbf{y}^{(j)}; \mathbf{x})\right)^2$
      § How to optimize:
        - Stochastically, e.g. using Stochastic Gradient Descent (SGD) → next lecture
        - Analytically, e.g. in linear regression → Deep Learning book 5.1.4
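Both routes can be sketched for a linear regression with a single feature, where the analytic MSE minimizer has a simple closed form. The toy data, learning rate, epoch count and function names are assumptions chosen for illustration; the SGD loop is a bare-bones variant and not necessarily the one covered in the next lecture.

```python
import random


def predict(y, x):
    """One-feature linear regression: intercept x[0] plus slope x[1] times y."""
    return x[0] + x[1] * y


def fit_analytically(ys, zs):
    """Closed-form least-squares solution for one feature: returns [x0, x1]."""
    n = len(ys)
    mean_y, mean_z = sum(ys) / n, sum(zs) / n
    covariance = sum((y - mean_y) * (z - mean_z) for y, z in zip(ys, zs))
    variance = sum((y - mean_y) ** 2 for y in ys)
    x1 = covariance / variance
    return [mean_z - x1 * mean_y, x1]


def fit_with_sgd(ys, zs, learning_rate=0.02, epochs=1000, seed=0):
    """Stochastic gradient descent on the squared error, one point at a time."""
    rng = random.Random(seed)
    x = [0.0, 0.0]
    indices = list(range(len(ys)))
    for _ in range(epochs):
        rng.shuffle(indices)   # visit the data points in random order
        for j in indices:
            error = predict(ys[j], x) - zs[j]
            # Gradient of (prediction - label)^2 with respect to x0 and x1.
            x[0] -= learning_rate * 2 * error
            x[1] -= learning_rate * 2 * error * ys[j]
    return x


# Noise-free toy data generated from z = 1 + 2y, invented for illustration.
ys = [0.0, 1.0, 2.0, 3.0, 4.0]
zs = [1.0 + 2.0 * y for y in ys]

x_analytic = fit_analytically(ys, zs)
x_sgd = fit_with_sgd(ys, zs)
print(x_analytic, x_sgd)
```

On this noise-free data both approaches recover an intercept of about 1 and a slope of about 2; with noisy data SGD with a constant learning rate would only hover near the analytic solution.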

  24. ML models… cont.
      Model Capacity:  low                     high
                       less flexible           more flexible
                       fewer parameters        more parameters
                       lower variance          higher variance
                       higher bias             lower bias
                       prone to underfitting   prone to overfitting
      Terms of the day!
      (Statistical) Bias indicates the amount of assumptions taken to define a model. Higher bias means more assumptions and less flexibility, as in linear regression.
      Variance: to what extent the estimated parameters of a model vary when the values of data points change (are resampled).
      Overfitting: when the model fits the training data exactly, namely when it also captures the noise in the data.
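The two ends of the capacity scale can be sketched with two deliberately extreme models: a constant predictor (very low capacity) and a 1-nearest-neighbour predictor that memorizes the training set (very high capacity). Both models, the toy data and the noise level are assumptions made for illustration.

```python
import random

rng = random.Random(0)

# Toy data, invented for illustration: label = feature + Gaussian noise.
points = [(float(i), i + rng.gauss(0.0, 1.0)) for i in range(20)]
train, test = points[0::2], points[1::2]


def constant_model(query):
    """Low capacity: always predicts the mean training label."""
    return sum(label for _, label in train) / len(train)


def one_nearest_neighbour(query):
    """High capacity: returns the label of the closest training point."""
    return min(train, key=lambda pair: abs(pair[0] - query))[1]


def mse(model, data):
    """Mean squared error of `model` over (feature, label) pairs."""
    return sum((label - model(feature)) ** 2
               for feature, label in data) / len(data)


for name, model in [("constant", constant_model),
                    ("1-NN", one_nearest_neighbour)]:
    print(name, "train:", mse(model, train), "test:", mse(model, test))
```

The 1-NN model reaches a training error of exactly zero because it reproduces every training label, noise included, yet its test error stays above zero; the constant model cannot even fit the training data, which is the underfitting end of the scale.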

  25. Learning Curve
      [Figure: error versus capacity for the train set and the test set, with an "underfit" region, an "overfit" region and a "sweet spot!" between them]
      Models: black → $g_{\text{true}}$; orange → linear regression; blue and green → two smoothing spline models

  26. Regularization
      § A regularization method introduces additional information (assumptions) to avoid overfitting by decreasing variance
      § E.g. adding the squared L2 norm of parameters to the loss function:
          $\mathcal{L}(z^{(j)}, \hat{z}^{(j)}; \mathbf{x}) = \frac{1}{O} \sum_{j=1}^{O} \left(z^{(j)} - \hat{z}^{(j)}\right)^2 + \lVert \mathbf{x} \rVert_2^2$
          $\lVert \mathbf{x} \rVert_2^2 = \sum_{j} x_j^2$
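The effect of the penalty can be sketched for the simplest case, a one-parameter model $\hat{z} = x \cdot y$ without intercept. The weighting factor `strength` in front of the penalty is an assumption added for illustration (the slide's formula corresponds to `strength=1`), as are the toy data and function names.

```python
def regularized_loss(x, ys, zs, strength):
    """MSE of the model z_hat = x * y plus strength times the squared parameter."""
    mse = sum((z - x * y) ** 2 for y, z in zip(ys, zs)) / len(ys)
    return mse + strength * x ** 2


def fit(ys, zs, strength):
    """Minimizer of regularized_loss, found by setting its derivative to zero:

    x = sum(y * z) / (sum(y^2) + O * strength), with O the number of points.
    """
    numerator = sum(y * z for y, z in zip(ys, zs))
    denominator = sum(y * y for y in ys) + len(ys) * strength
    return numerator / denominator


# Toy data generated from z = 2y, invented for illustration.
ys = [1.0, 2.0, 3.0]
zs = [2.0, 4.0, 6.0]

x_plain = fit(ys, zs, strength=0.0)       # ordinary least squares
x_penalized = fit(ys, zs, strength=1.0)   # with the squared L2 penalty
print(x_plain, x_penalized)
```

The penalized parameter is pulled toward zero (28/17 instead of 2), which is the extra assumption the regularizer brings in: smaller parameters are preferred unless the data argue strongly otherwise.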
