Advanced Analytics in Business [D0S07a]
Big Data Platforms & Technologies [D0S06a]
Text Mining
Overview
- Featurization
- Traditional approaches
- Word embeddings and representational learning
- A look ahead
- Tooling
Text is everywhere
- Medical records
- Product reviews
- Repair notes
- Facebook posts
- Book recommendations
- Tweets
- Declarations
- Legislation, court decisions
- Emails
- Websites
- …
But it is unstructured
- Or "semi-structured", rather, though still:
  - No direct "feature vector" representation
  - Linguistic structure: language, relationships between words, importance of words, negations, etc.
- Text is dirty: grammatical and spelling errors, abbreviations, homographs
- Text is intended for communication between people
The trick for unstructured data
- It all boils down to featurization: approaches to convert the unstructured data into a structured feature vector ("feature engineering")
- Just as for computer vision, we can split the approaches into a "before deep learning" and "after deep learning" era
The basics
So, how to featurize? Let's start simple
- The goal is to take a collection of documents – each of which is a relatively free-form sequence of words – and turn it into the familiar feature-vector representation
- A collection of documents is a corpus (plural: corpora)
- A document is composed of individual tokens or terms (words, but these can be broader as well, e.g. punctuation tokens, abbreviations, smileys, …)
- Mostly, each document is one instance, but sentences can also form instances
- Which features will be used is still to be determined
Bag of words
- The "keep it simple" approach
- Simply treat every document (instance) as a collection of individual words
- Ignore grammar, word order, sentence structure, and (usually) punctuation
- Treat every word in a document as a potentially important keyword of the document
- Each document is represented by a sequence of ones (the token is present in the document) and zeros (the token is not present), i.e. one feature per word
- Inexpensive to generate, though it leads to an explosion of features; can work in some settings
- Alternatively, frequencies instead of binary features can be used
- Example, with the vocabulary (this, is, an, a, example, second):
  - "This is an example" => 1 1 1 0 1 0
  - "This is a second example" => 1 1 0 1 1 1
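As an illustration, here is a minimal bag-of-words sketch using scikit-learn's CountVectorizer (the library and the tiny corpus are illustrative assumptions, not part of the slides). Note that CountVectorizer's default tokenizer drops one-letter tokens such as "a", so the resulting vocabulary differs slightly from the hand-built example above.

```python
# Minimal bag-of-words sketch with scikit-learn (assumed available).
from sklearn.feature_extraction.text import CountVectorizer

docs = ["This is an example", "This is a second example"]

# binary=True gives 1/0 presence features; omit it to get raw term counts instead
vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # e.g. ['an' 'example' 'is' 'second' 'this']
print(X.toarray())                         # one row per document, one column per word
```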
Normalization, stop-word removal, stemming
- The case should be normalized, e.g. every term is converted to lowercase
- Words should be stemmed (stemming): suffixes are removed and only the root is kept, e.g. noun plurals are transformed into singular forms
  - Porter's stemming algorithm: basically suffix stripping
  - More complex: lemmatization (e.g. better → good, flies/flight → fly); context is required: part-of-speech (PoS) tagging (see later)
- Stop-words should be removed (stop-word removal)
  - A stop-word is a very common word in English (or whatever language is being parsed)
  - Typical words such as the, and, of, and on are removed
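A rough preprocessing sketch, assuming NLTK is available (the stopword, wordnet, and tokenizer resources may require a one-time nltk.download()):

```python
# Preprocessing sketch with NLTK: lowercasing, stop-word removal, stemming, lemmatization.
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

text = "The flies were flying over the better fields"

tokens = [t.lower() for t in nltk.word_tokenize(text)]               # case normalization
tokens = [t for t in tokens if t not in stopwords.words("english")]  # stop-word removal

stemmer = PorterStemmer()
print([stemmer.stem(t) for t in tokens])          # suffix stripping, e.g. 'flies' -> 'fli'

lemmatizer = WordNetLemmatizer()
print([lemmatizer.lemmatize(t) for t in tokens])  # dictionary-based, e.g. 'flies' -> 'fly'
```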
(Normalized) term frequency
- Recall that we can use bag of words with counts (word frequencies): nice, as this differentiates between how many times a word is used ("document-term matrix")
- However: documents have various lengths, and words have different frequencies
- Words should not be too common or too rare: impose both an upper and a lower limit on the number (or fraction) of documents in which a word may occur
- So the raw term frequencies are best normalized in some way, such as by dividing each by the total number of words in the document, or by the frequency of the specific term in the corpus
- Also think about the production context: what do we do when we encounter a previously unseen term?
TF-IDF
- Term frequency (TF) is calculated per term t per document d:
  TF(t, d) = |{w ∈ d : w = t}| / |d|
  (the number of times the term appears in the document, divided by the document length)
- However, frequency in the corpus also plays a role: terms should not be too rare, but also not too common, so we need a measure of sparseness
- Inverse document frequency (IDF) is calculated per term t over the corpus c:
  IDF(t) = 1 + log( |c| / |{d ∈ c : t ∈ d}| )
  (one plus the logarithm of the total number of documents, divided by the number of documents containing the term)
TF-IDF
- TFIDF(t, d) = TF(t, d) × IDF(t)
- Gives you a weight for each term in each document: perfect for our feature matrix
- Rewards terms that occur frequently in the document, but penalizes terms that occur frequently in the whole collection
- A vector of weights per document is obtained
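A direct implementation of the TF and IDF definitions above, as a sketch (libraries such as scikit-learn apply slightly different smoothing and normalization):

```python
# TF-IDF computed exactly as defined on the slides above.
import math

corpus = [
    "this is an example".split(),
    "this is a second example".split(),
]

def tf(term, doc):
    # number of times the term appears, divided by the document length
    return doc.count(term) / len(doc)

def idf(term, corpus):
    # one plus the log of (total number of documents / documents containing the term)
    n_containing = sum(1 for doc in corpus if term in doc)
    return 1 + math.log(len(corpus) / n_containing)

def tfidf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

for doc in corpus:
    print({t: round(tfidf(t, doc, corpus), 3) for t in set(doc)})
```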
TF-IDF: example
- 15 prominent jazz musicians and excerpts of their biographies from Wikipedia
- Nearly 2,000 features after stemming and stop-word removal!
- Consider the sample phrase "Famous jazz saxophonist born in Kansas who played bebop and latin"
Dealing with high dimensionality
- Feature selection will often need to be applied
- Fast and scalable classification or clustering techniques will need to be used, e.g. linear Naive Bayes and Support Vector Machines have proven to work well in this setting
- Clustering techniques based on non-negative matrix factorization can also be used
- Also recall the "hashing trick" from pre-processing: collapse the high number of features into n hashed features (see the sketch below)
- Use dimensionality reduction techniques like t-SNE or UMAP (Uniform Manifold Approximation and Projection, McInnes, 2018, https://github.com/lmcinnes/umap)
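A sketch of the hashing trick, assuming scikit-learn's HashingVectorizer: terms are hashed into a fixed number of buckets, so the feature space stays bounded no matter how large the vocabulary grows.

```python
# Hashing trick sketch: the number of output features is fixed up front.
from sklearn.feature_extraction.text import HashingVectorizer

docs = ["famous jazz saxophonist born in kansas",
        "played bebop and latin jazz"]

vectorizer = HashingVectorizer(n_features=1024, alternate_sign=False)
X = vectorizer.fit_transform(docs)
print(X.shape)  # (2, 1024) regardless of vocabulary size; unseen terms also hash into a bucket
```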
N-gram sequences
- What if word order is important? A next step from the previous techniques is to use sequences of adjacent words as terms
- Adjacent pairs are commonly called bi-grams
- Example: "The quick brown fox jumps" would be transformed into {quick_brown, brown_fox, fox_jumps}
- Can be combined with 1-grams: {quick, brown, fox, jumps}
- But: N-grams greatly increase the size of the feature set
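A short sketch of combining 1-grams and bi-grams, again assuming scikit-learn's CountVectorizer (unlike the example above, no stop-word removal is applied here):

```python
# Unigrams and bigrams extracted together via ngram_range.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["The quick brown fox jumps"]

vectorizer = CountVectorizer(ngram_range=(1, 2))  # 1-grams + 2-grams
vectorizer.fit(docs)
print(vectorizer.get_feature_names_out())
# e.g. ['brown' 'brown fox' 'fox' 'fox jumps' 'jumps' 'quick' 'quick brown' 'the' 'the quick']
```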
Natural language processing (NLP)
- A complete field of research
- Key idea: use machine learning and statistical approaches to learn a language from data
- Better suited to deal with:
  - Contextual information ("This camera sucks" vs. "This vacuum cleaner really sucks")
  - Negations
  - Sarcasm
- Best-known tasks:
  - PoS (part-of-speech) tagging: noun, verb, subject, …
  - Named entity recognition
Part of speech tagging (example figure)
Named entity recognition (example figures)
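A hedged sketch of PoS tagging and named entity recognition using spaCy (the library and its small English model en_core_web_sm are assumptions, not prescribed by the slides):

```python
# PoS tagging and NER with spaCy's small English pipeline.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Charlie Parker was a famous jazz saxophonist born in Kansas City.")

# Part-of-speech tag per token
print([(token.text, token.pos_) for token in doc])

# Named entities with their predicted types (e.g. PERSON, GPE)
print([(ent.text, ent.label_) for ent in doc.ents])
```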
Word embeddings and representational learning
Vector space models
- Represent an item (a document, a word) as a vector of numbers, e.g.:
  banana: 0 1 0 1 0 0 2 0 1 0 1 0
- Such a vector could for instance correspond to the documents in which the word occurs: the non-zero entries above point to Doc2, Doc4, Doc7, Doc9, and Doc11
Vector space models
- The vector can also correspond to the neighboring word context
- Example: for the sentence "yellow banana grows on trees in africa", the non-zero entries of the vector for banana correspond to the context features (yellow, −1), (grows, +1), (on, +2), (trees, +3), and (africa, +5)
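A toy sketch of building such a word-context representation by counting neighboring words within a window (the window size and data are illustrative choices):

```python
# Count (neighboring word, relative offset) pairs around a target word.
from collections import Counter

sentence = "yellow banana grows on trees in africa".split()
window = 5
target = "banana"

context = Counter()
for i, word in enumerate(sentence):
    if word != target:
        continue
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if j != i:
            context[(sentence[j], j - i)] += 1  # (neighboring word, position relative to target)

print(context)
# Counter({('yellow', -1): 1, ('grows', 1): 1, ('on', 2): 1, ('trees', 3): 1, ...})
```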
Word embeddings
- A dense vector of real values
- The vector dimension is typically much smaller than the number of items or the vocabulary size, e.g. a typical dimension for some learning tasks is 128
- You can imagine the vectors as coordinates for items in the embedding space; some distance metric then defines a notion of relatedness between items in this space
Word embeddings
- Man is to woman as king is to ____ ?
- Good is to best as smart is to ____ ?
- China is to Beijing as Russia is to ____ ?
- It turns out the word-context based vector model we just looked at is good for such analogy tasks:
  [king] – [man] + [woman] ≈ [queen]
How to construct word embeddings?
- Matrix factorization based:
  - Non-negative matrix factorization
  - GloVe (word / neighboring-word), see https://nlp.stanford.edu/projects/glove/
- Neural network based: word2vec
  - word2vec was released by Google in 2013, see https://code.google.com/archive/p/word2vec/
  - A neural network-based implementation that learns a vector representation per word
- Background information:
  - https://blog.acolyer.org/2016/04/21/the-amazing-power-of-word-vectors/
  - http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/
  - https://blog.acolyer.org/2016/04/22/glove-global-vectors-for-word-representation/
  - https://iksinc.wordpress.com/tag/continuous-bag-of-words-cbow/
  - https://multithreaded.stitchfix.com/blog/2017/10/18/stop-using-word2vec/
  - https://blog.insightdatascience.com/how-to-solve-90-of-nlp-problems-a-step-by-step-guide-fda605278e4e
word2vec
- word2vec converts each term to a vector representation; it works at the term level, not at the document level (by default)
- Such a vector comes to represent, in some abstract way, the "meaning" of a word
- It is possible to learn word vectors that capture the relationships between words in a surprisingly expressive way: "Man is to woman as uncle is to (?)"
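As a sketch of the analogy test, assuming gensim and one of its bundled pretrained vector sets (the model name below is just one available option):

```python
# Analogy test on pretrained vectors loaded via gensim's downloader module.
import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-100")  # pretrained word vectors

# king - man + woman ~= ?
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# typically [('queen', ...)]
```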
word2vec
- Here, the word vectors have a dimension of 1,000
- A second-step projection with PCA or t-SNE to two dimensions is commonly applied for visualization
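A small sketch of such a projection, assuming scikit-learn; the random matrix below merely stands in for real word vectors:

```python
# Project high-dimensional word vectors to 2-D with PCA and t-SNE for plotting.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

vectors = np.random.rand(200, 1000)  # stand-in for 200 word vectors of dimension 1000

coords_pca = PCA(n_components=2).fit_transform(vectors)
coords_tsne = TSNE(n_components=2).fit_transform(vectors)

print(coords_pca.shape, coords_tsne.shape)  # (200, 2) (200, 2)
```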
word2vec
- The general idea rests on the assumption that a word is correlated with and defined by its context (this was also the basic idea behind N-grams)
- More precisely, we are going to predict a word based on its context (or the context based on the word)
- Two methods of learning:
  - Continuous Bag-of-Words (CBOW): predict the word from its context
  - Continuous Skip-gram: predict the context from the word
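A minimal training sketch, assuming gensim 4.x (the toy sentences are placeholders; sg=0 selects CBOW, sg=1 selects skip-gram):

```python
# Train a tiny word2vec model with gensim; real use needs a much larger corpus.
from gensim.models import Word2Vec

sentences = [
    "yellow banana grows on trees in africa".split(),
    "famous jazz saxophonist born in kansas".split(),
]

model = Word2Vec(sentences, vector_size=128, window=5, min_count=1, sg=1)  # skip-gram
print(model.wv["banana"][:5])                  # first values of the learned vector
print(model.wv.most_similar("banana", topn=2)) # nearest neighbors in the embedding space
```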