feature extraction for sentiment analysis on twitter data


  1. Feature extraction for sentiment analysis on Twitter data in the Spanish language. Victor Muñiz, Research Center in Mathematics, Monterrey, Mexico. Victor Muñiz (CIMAT Mty), Sentiment Analysis, June 2015.

  2. Introduction: Sentiment analysis focuses on automatically identifying whether a text expresses a positive, negative or neutral opinion about some topic.

  3. Introduction: Among all virtual opinion platforms, Twitter has become the most popular for sentiment analysis, for several reasons:
     - Availability of information
     - Large amount of data
     - Constant updates
     - Worldwide availability

  4. Introduction: Among all virtual opinion platforms, Twitter has become the most popular for sentiment analysis, for several reasons (continued):
     - Many applications:
       - Opinion-based marketing
       - Online ranking
       - Government and politics
       - Official statistics
       - Among many others

  5. Introduction: One of the most popular techniques for text classification is the Bag of Words (Joachims, 1998), which constructs a term-document matrix based on term frequencies.
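
A minimal sketch of a term-document matrix of raw term frequencies, assuming naive whitespace tokenization. The function name `term_document_matrix` and the two sample documents are illustrative and do not come from the slides.

```python
from collections import Counter

def term_document_matrix(documents):
    """Return (vocabulary, matrix): rows are terms, columns are documents."""
    # Naive tokenization (lowercase + split on whitespace); this is an
    # assumption made for the sketch, not the preprocessing of the slides.
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    counts = [Counter(tokens) for tokens in tokenized]
    # matrix[i][j] = frequency of vocabulary[i] in documents[j]
    matrix = [[c[term] for c in counts] for term in vocabulary]
    return vocabulary, matrix

# Illustrative documents (hypothetical, not from the data set)
docs = ["el servicio es bueno", "el servicio es malo malo"]
vocab, tdm = term_document_matrix(docs)
for term, row in zip(vocab, tdm):
    print(term, row)
```

Each spelling variant of a word gets its own row, which is why noisy text quickly produces a large and sparse matrix.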

  6. Introduction: However, on Twitter data, the application of this (or any) technique is not straightforward. Example tweets:
     - Andas bien loco @Telcel con la zona horaria d tu RED, a cada rato m mueves la Hr.?? #chidotucotorreo @ServicioTelcel http://t.co/QoOX3OCYxt
     - @Profeco @Tiendas_OXXO no cumple con algunos requerimientos como tipos de bebida falsos asi como la falta del precio :(
     - No nos deja pasar el cadenero del oxxo gooey! k pedo 100pre me pasaaaa!!! #Queoso

     Difficulties:
     - Short text
     - Misspellings
     - Abbreviations and non-standard contractions
     - Emoticons, hashtags
     - Unbalanced classes

  7. Introduction: Standard preprocessing techniques are not enough for Twitter data, because words with the same meaning often appear in several variants:
     - pseudo-estudiantes = pseudoestudiantes = seudoestudiantes = seudestudiantes
     - separados = separa2
     - siempre = sienpre = 100pre

     These variants make the term-document matrix sparse, so Bag of Words alone is not enough: we need to incorporate contextual (a priori) information. The challenge is to extract the main features of the tweet, those that give insight into the sentiment (polarity) of the text.

  8. Introduction: There is a lot of work on both feature extraction and classification for tweets, but the vast majority focuses on English text. Some previous work on lexical normalization of Spanish text exists (Mosqueda & Moreda, 2012); however, there are important differences between countries and regions, even within the same language, and these must be taken into account. The objective of our work is to implement a normalization method for Spanish text using kernel-based methods, in order to obtain important features that can be used as input for a classification method.

  9. Preprocessing and normalization

  10. Normalization: Data. We obtained tweets from the API (https://dev.twitter.com/) on some specific topics (e.g., convenience stores, cellphone services) and classified them manually.
      Standard text preprocessing:
      - Convert to lowercase
      - Remove Spanish stopwords according to the list given by Martin Porter's Snowball stemming project (http://snowball.tartarus.org/), to which we add some topic-related words
      - Remove special characters: URLs, @, RT, -, :, among others
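
The three preprocessing steps above can be sketched as follows. This is an illustration under stated assumptions: the stopword set is a tiny hypothetical stand-in for the Snowball list, only the characters named on the slide are stripped, and the sample tweet is an invented variation on the examples shown earlier.

```python
import re

# Hypothetical mini stopword list; the slides use the Spanish list from
# the Snowball project plus topic-specific words.
STOPWORDS = {"el", "la", "de", "con", "a", "tu", "que", "y"}

def preprocess(tweet, stopwords=STOPWORDS):
    """Lowercase, strip URLs / RT / special characters, drop stopwords."""
    text = tweet.lower()
    # URLs go first, because they contain ':' and '-' themselves
    text = re.sub(r"https?://\S+", " ", text)
    # Retweet marker as a standalone word
    text = re.sub(r"\brt\b", " ", text)
    # Characters named on the slide; the full set is an assumption
    text = re.sub(r"[@\-:]", " ", text)
    return [token for token in text.split() if token not in stopwords]

# Illustrative input, not taken from the data set
sample = "RT @Telcel: Andas bien loco con la zona horaria de tu RED http://t.co/QoOX3OCYxt"
tokens = preprocess(sample)
print(tokens)
```

Stripping `:` at this stage would also destroy emoticons such as `:(`, so in a full pipeline the emoticon substitution would have to run before this step; the slides do not state the order.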

  11. Normalization: Standard text preprocessing (continued):
      - Remove repeated characters and excess white space
      - Emoticon substitution according to the list at en.wikipedia.org/wiki/List_of_emoticons. For instance:
        - :-) → emoticon-positivo
        - :) → emoticon-positivo
        - :o) → emoticon-positivo
        - :c) → emoticon-positivo
        - :-D → emoticon-muy-positivo
        - X-D → emoticon-muy-positivo
        - >:[ → emoticon-negativo
        - =( → emoticon-negativo
        - :-[ → emoticon-negativo
        - :-|| → emoticon-muy-negativo
        - >:( → emoticon-muy-negativo
        - :| → emoticon-neutral
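
A possible sketch of these two steps, using a subset of the emoticon list above. The rule for repeated characters is an assumption: the slides do not say how many repetitions are kept, so here runs of three or more identical characters collapse to one, which leaves legitimate Spanish doubles such as "ll" and "rr" untouched. The function names are hypothetical.

```python
import re

# Subset of the emoticon list; the replacement tokens are the ones
# given on the slide.
EMOTICONS = {
    ":-)": "emoticon-positivo",
    ":)": "emoticon-positivo",
    "=(": "emoticon-negativo",
    ":-[": "emoticon-negativo",
    ":-D": "emoticon-muy-positivo",
    ">:(": "emoticon-muy-negativo",
    ":|": "emoticon-neutral",
}

# Longest emoticons first, so the longest alternative is tried first
_EMOTICON_RE = re.compile(
    "|".join(re.escape(e) for e in sorted(EMOTICONS, key=len, reverse=True))
)

def replace_emoticons(text):
    """Replace each known emoticon with its token, padded with spaces."""
    return _EMOTICON_RE.sub(lambda m: " " + EMOTICONS[m.group(0)] + " ", text)

def squeeze(text):
    """Collapse runs of 3+ identical characters and excess white space."""
    text = re.sub(r"(.)\1{2,}", r"\1", text)  # assumed rule, see lead-in
    return re.sub(r"\s+", " ", text).strip()

def normalize_surface(text):
    return squeeze(replace_emoticons(text))

result = normalize_surface("k pedo 100pre me pasaaaa!!!   =(")
print(result)
```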

  12. Normalization: The normalization process consists of:
      1. Detection of non-conventional words
      2. Substitution with similar words (hopefully the correct ones in terms of linguistic meaning)

  13. Normalization: Detection of non-conventional words. We used Aspell (http://aspell.net/) with a Spanish dictionary, to which we added extra terms such as cities and localities in Mexico and other topic-related words. For each word in the preprocessed tweet, we look it up with the Aspell API; if it does not appear, we consider the suggestions given by Aspell. Very often, the top-ranked Aspell suggestion is not the best choice.

  14. Normalization: Detection of non-conventional words. Consider the suggestions for pseudoestudiantes:

           [1] "pseudo"          "estudiantes"     "pseudo-estudiantes"
           [4] "predestinares"   "predestines"     "predestinases"
           [7] "predestinareis"  "predestinase"    "predestinar"
          [10] "predestinas"     "predestinasteis" "predestinaste"
          [13] "predestinis"     "sudestada"       "predestinaras"
          [16] "predestinars"    "predestinaseis"  "sudestadas"
          [19] "predestinis"     "predestinadas"   "predestinados"
          [22] "predestinabas"   "predestinamos"

      We need to choose the appropriate word from the suggestions.

  15. Normalization: Kernel methods and "string kernels". Let $x, z \in X$ (input space). Consider the kernel function
      $$k(x, z) = \langle \phi(x), \phi(z) \rangle,$$
      where $\phi$ is a map $\phi : x \in X \mapsto \phi(x) \in H$ (feature space).
      Kernel trick (Schölkopf and Smola, 2002): Data ($X$) → Kernel ($k(x, x')$) → Gram matrix ($K$) → Algorithm ($A$) → Decision function ($f(x) = \sum_i \alpha_i k(x_i, x)$).
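
To make the data, kernel, Gram matrix and decision function chain concrete, here is a small sketch. It uses a plain linear kernel (where the feature map is the identity) purely for illustration; the coefficients `alphas` are made up, whereas in practice they would come from a learning algorithm.

```python
def dot(u, v):
    """Linear kernel: k(u, v) is the ordinary dot product."""
    return sum(a * b for a, b in zip(u, v))

def gram_matrix(points, kernel):
    """K[i][j] = kernel(points[i], points[j])."""
    return [[kernel(a, b) for b in points] for a in points]

def decision_function(alphas, points, kernel):
    """Return f with f(x) = sum_i alphas[i] * kernel(points[i], x)."""
    def f(x):
        return sum(alpha * kernel(xi, x) for alpha, xi in zip(alphas, points))
    return f

# Hypothetical data and coefficients, for illustration only
X = [(1, 0), (0, 1), (1, 1)]
K = gram_matrix(X, dot)
f = decision_function([1.0, -1.0, 0.0], X, dot)
print(K)
print(f((2, 3)))
```

The algorithm only ever touches the data through `kernel`, so swapping `dot` for a string kernel lets the same code work on words instead of vectors.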

  16. Normalization: String kernels (Lodhi 2002; Shawe-Taylor and Cristianini 2004; Watkins 2000; Herbrich 2002) provide a similarity measure between two documents $x$ and $y$. Let $s$ be a substring. The mapping to the feature space is given by
      $$\phi_s(x) = \sum_{s \in x} \lambda^{L(s_x)},$$
      where $\lambda \in (0, 1)$ is a weight and $L(s_x)$ is the length of the substring $s$ within the document $x$, including any gaps between its characters.
      Example: consider $s$ = car:
      - if $x$ = "cara", then $L(s_x) = 3$ (the span "car" in "cara") and $\phi_s(x) = \lambda^3$;
      - if $x$ = "cuarto", then $L(s_x) = 4$ (the span "cuar" in "cuarto") and $\phi_s(x) = \lambda^4$.
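
The feature value in this example can be checked with a brute-force sketch: enumerate every way the characters of the substring can be matched, in order, inside the word, and add the weight raised to the length of the span each match covers. The function name `phi` and the value 0.5 for the weight are illustrative choices.

```python
from itertools import combinations
import math

def phi(s, x, lam):
    """Sum of lam**span over all occurrences of s as a subsequence of x."""
    if not s:
        return 0.0
    total = 0.0
    # Every increasing tuple of positions in x of length len(s)
    for idx in combinations(range(len(x)), len(s)):
        if all(x[i] == c for i, c in zip(idx, s)):
            span = idx[-1] - idx[0] + 1  # includes the gaps
            total += lam ** span
    return total

lam = 0.5
print(phi("car", "cara", lam))    # one match with span 3
print(phi("car", "cuarto", lam))  # one match with span 4
```

The gap in "cuarto" costs one extra factor of the weight, so non-contiguous matches count, but less than contiguous ones.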

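  19. Normalization: The kernel (dot product) between documents $x$ and $y$ is given by
      $$k_n(x, y) = \sum_{s \in \Sigma^n} \sum_{s \subset x} \sum_{s \subset y} \lambda^{L(s_x) + L(s_y)},$$
      where $\Sigma^n$ is the set of all substrings of size $n$ from a finite alphabet $\Sigma$.
      Example: consider the words cat, car, bat and bar with $|s| = 2$:

      |              | c-a         | c-t         | a-t         | b-a         | b-t         | c-r         | a-r         | b-r         |
      |--------------|-------------|-------------|-------------|-------------|-------------|-------------|-------------|-------------|
      | $\phi$(cat)  | $\lambda^2$ | $\lambda^3$ | $\lambda^2$ | 0           | 0           | 0           | 0           | 0           |
      | $\phi$(car)  | $\lambda^2$ | 0           | 0           | 0           | 0           | $\lambda^3$ | $\lambda^2$ | 0           |
      | $\phi$(bat)  | 0           | 0           | $\lambda^2$ | $\lambda^2$ | $\lambda^3$ | 0           | 0           | 0           |
      | $\phi$(bar)  | 0           | 0           | 0           | $\lambda^2$ | 0           | 0           | $\lambda^2$ | $\lambda^3$ |

      Hence $k(\text{car}, \text{cat}) = \lambda^4$ and $k(\text{car}, \text{car}) = k(\text{cat}, \text{cat}) = 2\lambda^4 + \lambda^6$.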
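
The cat/car example can be reproduced with a brute-force sketch that builds the explicit feature vector of each word and takes the dot product. This enumeration is only practical for short strings such as single words; the names `feature_map` and `string_kernel` and the weight 0.5 are illustrative.

```python
from itertools import combinations
from collections import defaultdict
import math

def feature_map(x, n, lam):
    """Map each length-n subsequence s of x to the sum of lam**span."""
    feats = defaultdict(float)
    for idx in combinations(range(len(x)), n):
        s = "".join(x[i] for i in idx)
        feats[s] += lam ** (idx[-1] - idx[0] + 1)
    return feats

def string_kernel(x, y, n, lam):
    """Dot product of the two feature vectors."""
    fx, fy = feature_map(x, n, lam), feature_map(y, n, lam)
    return sum(value * fy[s] for s, value in fx.items() if s in fy)

lam = 0.5
k_car_cat = string_kernel("car", "cat", 2, lam)  # only "ca" is shared
k_car_car = string_kernel("car", "car", 2, lam)  # "ca", "cr", "ar"
print(k_car_cat, k_car_car)
```

Dividing by `sqrt(k(x, x) * k(y, y))` would give a similarity in [0, 1] that does not depend on word length; whether the slides normalize the kernel this way is not stated here.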
