Building Blocks Edit Distance Most common approach: Levenshtein Distance Measures the number of operations to transform a token into another token Operations: add, remove, replace (optional)
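A minimal sketch of the Levenshtein distance in Python (plain dynamic programming over two rows; the function name and the unit costs for add/remove/replace are illustrative choices):

    def levenshtein(a: str, b: str) -> int:
        # prev[j] holds the edit distance between the current prefix of a and b[:j]
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, start=1):
            curr = [i]
            for j, cb in enumerate(b, start=1):
                curr.append(min(
                    prev[j] + 1,               # remove ca
                    curr[j - 1] + 1,           # add cb
                    prev[j - 1] + (ca != cb),  # replace (free if characters match)
                ))
            prev = curr
        return prev[-1]

    # levenshtein("kitten", "sitting") == 3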
Building Blocks Machine Learning Components As an alternative to manually-created rule-based components ... use of statistical models ... use of machine learning models Expert knowledge is then often used ... to create an annotated dataset (supervised scenario) ... typically called ground truth (or gold standard)
Building Blocks Machine Learning Components + Pros Often provide a “probabilistic” output For example, a decision with a confidence value ... or a ranked list of candidates − Cons Data hungry Complex machine learning algorithms require large amounts of (clean, unambiguous) data Typically associated with higher computational complexity
Basic Tasks Often used, basic processing components
Basic Tasks Language Detection Task : Detect the language of the input text Often the first step is to identify the language of a text Typical assumption: the language does not change within the text
Basic Tasks Language Detection Approaches (1/2) Use white lists Have a list of typical words for each language Count the frequency of their occurrence Typically these words will be function words e.g., German: {der, die, das, einer, eine, ...}
Basic Tasks Language Detection Approaches (2/2) Learn the character distribution Often on sub-word level, e.g., character 3-grams Based on corpora of specific languages Compare the language-specific distributions with the text Further literature: Jauhiainen, T. S., Lui, M., Zampieri, M., Baldwin, T., & Lindén, K. (2019). Automatic language identification in texts: A survey. Journal of Artificial Intelligence Research, 65, 675-782.
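A rough sketch of the character 3-gram idea (the tiny "corpora", the overlap score, and the two-language setup are illustrative assumptions; real systems learn profiles from large language-specific corpora):

    from collections import Counter

    def profile(text, n=3):
        text = " " + text.lower() + " "
        return Counter(text[i:i + n] for i in range(len(text) - n + 1))

    # Toy language profiles; in practice these come from large corpora per language.
    profiles = {
        "de": profile("der die das einer eine ein und ist nicht"),
        "en": profile("the a an and is not of to in that"),
    }

    def detect(text):
        grams = profile(text)
        # Pick the language whose n-gram profile overlaps most with the text.
        return max(profiles, key=lambda lang: sum(min(c, profiles[lang][g]) for g, c in grams.items()))

    # detect("das ist ein kleiner Test") should favour "de"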
Basic Tasks Sentence Segmentation Task : split a sequence of characters into sentences Also called sentence splitting , sentence detection Typically solved via a white list of sentence boundary characters ... in combination with exception rules
Basic Tasks Word Segmentation Task : split a sequence of characters into words (tokens) Often also called word tokenisation Hard for languages without (white-)space between words Problem: Ambiguous what a token should be e.g., clitic contractions: don’t → <don’t> | <don, t> | <do, n’t> | <do, not> Typically approached using rules (regex)
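A rough sketch of a regex-based tokeniser that implements the <do, n’t> variant of the clitic example; the pattern below is one of many possible rule sets, not a standard:

    import re

    # Split clitic contractions like "don't" into <do, n't>, keep other words whole,
    # and emit punctuation marks as separate tokens.
    TOKEN = re.compile(r"\w+(?=n't)|n't|'\w+|\w+|[^\w\s]")

    def tokenise(text):
        return TOKEN.findall(text)

    # tokenise("I don't know.") -> ['I', 'do', "n't", 'know', '.']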
Basic Tasks Compound-Word Splitting Task : split a compound word into its sub-words e.g., Frühlingserwachen → <Frühling, s, erwachen> (spring awakening; the linking s is called the Fugen-s) Approach 1. Split the word into its syllables 2. Check every combination of consecutive syllables against a dictionary
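A simplified sketch of the dictionary check (splitting at every character position instead of syllable boundaries, with a toy dictionary and an optional Fugen-s; all names and entries are illustrative assumptions):

    DICTIONARY = {"frühling", "erwachen"}  # toy dictionary

    def split_compound(word):
        w = word.lower()
        for i in range(1, len(w)):
            head, tail = w[:i], w[i:]
            for fugue in ("", "s"):  # allow an optional linking element (Fugen-s)
                rest = tail[len(fugue):]
                if head in DICTIONARY and tail.startswith(fugue) and rest in DICTIONARY:
                    return [head] + ([fugue] if fugue else []) + [rest]
        return [word]  # no split found

    # split_compound("Frühlingserwachen") -> ['frühling', 's', 'erwachen']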
Basic Tasks Subword Splitting Task : split single words (tokens) into smaller parts Motivation: Subwords may still carry some semantics ... deal with out-of-vocabulary words Example, using the WordPiece segmenter: <Jim, Hen, ##son, was, a, puppet, ##eer> Yonghui Wu, et al. 2016. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144
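The slide's example can be reproduced with an off-the-shelf WordPiece tokeniser; a sketch assuming the Hugging Face transformers library is installed and the pretrained bert-base-cased vocabulary can be downloaded:

    from transformers import BertTokenizer  # assumes the transformers package is installed

    tokenizer = BertTokenizer.from_pretrained("bert-base-cased")  # fetches the WordPiece vocabulary
    print(tokenizer.tokenize("Jim Henson was a puppeteer"))
    # expected to roughly match the slide: ['Jim', 'Hen', '##son', 'was', 'a', 'puppet', '##eer']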
Basic Tasks Word n-Grams Task : capture the sequence information of consecutive words Approach: Combine pairs of adjacent tokens into a single token Example (for word 2-grams) Das ist ein kleiner Test → { Das_ist, ist_ein, ein_kleiner, kleiner_Test } Many variations exist, for example skip-grams Useful resource: https://books.google.com/ngrams
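A minimal sketch of word 2-gram extraction, joining adjacent tokens with an underscore as in the example above (whitespace tokenisation is a simplification):

    def word_ngrams(tokens, n=2):
        return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    # word_ngrams("Das ist ein kleiner Test".split())
    # -> ['Das_ist', 'ist_ein', 'ein_kleiner', 'kleiner_Test']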
Basic Tasks Character n-Grams Task : capture the sequence information of consecutive characters Approach: Combine runs of adjacent characters into a single token Example (for character 3-grams) Das ist ein kleiner Test → { Das, ist, ein, kle, lei, ein, ine, ner, Tes, est } Often used to find similar (written) words ... to generate candidates for spelling correction
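A matching sketch for character 3-grams, computed per word as in the example above:

    def char_ngrams(tokens, n=3):
        grams = []
        for token in tokens:
            grams += [token[i:i + n] for i in range(len(token) - n + 1)]
        return grams

    # char_ngrams("Das ist ein kleiner Test".split())
    # -> ['Das', 'ist', 'ein', 'kle', 'lei', 'ein', 'ine', 'ner', 'Tes', 'est']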
Basic Tasks Text Normalisation Task : Transform the text into a normalised form In search engines the document and the query are normalised the same way Often transform all characters to lower case Also called case folding Unix command line tool: $ tr A-Z a-z Note: This task is language-dependent
Basic Tasks Text Normalisation Additionally, a wide variety of normalisation strategies For example, removal of diacritics e.g., for German: Größe → groesze Conflate character repetitions (whitespace) Removal of special characters e.g., word_press → wordpress Many domain-dependent approaches (e.g., mathematical formulas)
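A rough sketch of such a normalisation step (case folding, diacritic removal via Unicode decomposition, whitespace conflation); note it maps é → e rather than applying language-specific transliterations like the German oe:

    import re
    import unicodedata

    def normalise(text):
        text = text.lower()                          # case folding
        text = unicodedata.normalize("NFKD", text)   # separate base characters and diacritics
        text = "".join(c for c in text if not unicodedata.combining(c))  # drop diacritics
        text = re.sub(r"\s+", " ", text).strip()     # conflate whitespace repetitions
        return text

    # normalise("Café   au  lait") -> 'cafe au lait'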
Basic Tasks Text Normalisation + Pros Reduces the number of unique tokens Increased performance when using machine learning Cleaner text − Cons May remove too much information for certain tasks e.g., sentiment detection: “TELL ME MORE‼1!”
Basic Tasks Stop word list Manually assembled list of non-content words e.g. the, a, with, to, ... Remove words without semantics
Basic Tasks Stemming Task : reduce words (tokens) to their root form (stem) Typically cutting off the suffix, and optionally replacing it e.g., hopping in snowy conditions → hop in snowi condit Often rule-based, for example the Porter Stemmer List of rewrite rules Problems: under- and overgeneralising
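A sketch using an off-the-shelf Porter stemmer, assuming NLTK is installed; it should reproduce the example above:

    from nltk.stem.porter import PorterStemmer  # assumes nltk is installed

    stemmer = PorterStemmer()
    print([stemmer.stem(w) for w in "hopping in snowy conditions".split()])
    # expected: ['hop', 'in', 'snowi', 'condit']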
Basic Tasks Lemmatisation Task : reduce words (tokens) to their root form (lemma) e.g., going, went, gone → go, go, go Often based on dictionaries Based on corpora For example: https://github.com/WZBSocialScienceCenter/germalemma
Basic Tasks Part-of-Speech (PoS) Tagger Task : add the word group to each token Nouns, verbs, adverbs, determiners, ... Often based on machine learning
Basic Tasks Chunker Task : combine multiple, consecutive words to form phrases Noun phrases, verb phrases, prepositional phrases, ... Typically based on the output of the PoS tagger e.g., <adjective> <noun> → NP
Basic Tasks Syntactic Parsing Transform a sentence into a tree representation ... which reflects the grammatical structure of the sentence Example sentence: The cop saw the man with the binoculars. Taken from: Bergmann, A., Hall, K. C., & Ross, S. M. (2007). Language files: Materials for an introduction to language and linguistics. Ohio State University Press.
Basic Tasks Syntactic Parsing - Example [figures: parse trees for the example sentence]
Basic Tasks Dependency Parsing Transform a sentence into a graph representation ... where each vertex is a word ... and each edge represents a grammatical relationship Example sentence: Afterward, I watched as a butt-ton of good, but misguided people filed out of the theater, and immediately lit up a smoke.
Basic Tasks Dependency Parsing - Example Sentence Tree [figure]
Basic Tasks Dependency Parsing - Example Dependency Output [figure]
Document Representation Feature extraction and engineering for text
Document Representation Bag of Words (BoW) After the processing pipeline has been executed (on a document/sentence) ... the output is collected and put into a bag Thus, each word is treated independently The sequence information is lost Semantic similarity (between two documents/bags) Compare the overlap between the two bags Assumption : Many tokens in common implies similar content For example, the Jaccard distance, Dice coefficient, ...
Document Representation Bag of Words (BoW) Example “The green house is next to the blue building” → { blue , building , green , house , is , next , the , to }
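A minimal sketch of comparing two bags of words with the Jaccard measure mentioned above (lower-casing and whitespace tokenisation are simplifications):

    def bag(text):
        return set(text.lower().split())

    def jaccard_similarity(a, b):
        # |intersection| / |union|; the Jaccard distance is 1 minus this value
        return len(a & b) / len(a | b)

    d1 = bag("The green house is next to the blue building")
    d2 = bag("The blue house is far from the green building")
    print(jaccard_similarity(d1, d2))  # 0.6: large overlap -> similar content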
Document Representation Vector Space Model Each unique word is assigned its own dimension Each document is then represented by a single vector Assumption : The vector represents the semantics of the document In a simple setting If the word is contained in the document ... the value for the corresponding dimension in the vector will be set to a non-zero value The process of assigning a dimension to each word is also called vectorisation
Document Representation Vector Space Model Document-Term Matrix Documents are rows, and terms are columns The resulting matrix is very sparse Typically approx. 2 % non-zero entries
Document Representation Vector Space Model One-Hot Encoding Representation of a set of words (document, sentence, sequence, single word, ...) Words contained in the set are represented by a 1 Can be seen as a single row of the document-term matrix Note: Often also used to encode nominal features as multiple binary features
Document Representation Vector Space Model Weighting strategies ( term weighting ) Simple case: use 1 as “non-zero value” More sophisticated strategies Count how often a token occurs → term frequency Down-weight common tokens → inverse document frequency Take into consideration the length of a document → length normalisation
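A rough sketch of these weighting steps combined into tf-idf with length normalisation (the smoothed idf formula and L2 normalisation mirror common library defaults, but the exact variants differ between implementations):

    import math
    from collections import Counter

    docs = [d.lower().split() for d in [
        "the green house is next to the blue building",
        "the blue building is tall",
    ]]
    N = len(docs)
    df = Counter(term for doc in docs for term in set(doc))   # document frequency

    def tfidf(doc):
        tf = Counter(doc)                                      # term frequency
        w = {t: c * (math.log((1 + N) / (1 + df[t])) + 1) for t, c in tf.items()}
        norm = math.sqrt(sum(v * v for v in w.values()))       # length normalisation (L2)
        return {t: v / norm for t, v in w.items()}

    print(tfidf(docs[0]))  # 'green' (rare) outweighs 'blue' (occurs in both documents)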
Document Representation Latent Semantic Analysis (LSI/LSA) [1] Idea : Apply a thin SVD to the document-term matrix Where the SVD is limited to the k most important singular values Requires as input: Document/term matrix Fixed number of topics Provides: Mapping of documents to a (dense) lower-dimensional representation Probabilistic version: pLSA [2] [1] Landauer, et al. (1997). A Solution to Plato’s Problem: The Latent Semantic Analysis Theory of Acquisition, Induction, and Representation of Knowledge. [2] Hofmann, T. (1999). Probabilistic latent semantic indexing.
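A minimal sketch of the thin/truncated SVD step on a small dense document-term matrix, assuming NumPy; real document-term matrices are sparse and would use a sparse SVD routine:

    import numpy as np

    def lsa(X, k):
        # X: documents x terms matrix, k: number of latent topics to keep
        U, S, Vt = np.linalg.svd(X, full_matrices=False)  # singular values come sorted descending
        return U[:, :k] * S[:k]   # dense k-dimensional document representations

    X = np.array([[1, 1, 0, 0],
                  [1, 0, 1, 0],
                  [0, 0, 1, 1]], dtype=float)
    print(lsa(X, k=2))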
Document Representation Latent Dirichlet Allocation (LDA) Requires as input: Document/term matrix Fixed number of topics Provides: Mapping of documents to topics (as vector of probabilities) Mapping of terms to topics (as vector of probabilities) Can be seen as fuzzy co-clustering Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation.
Document Representation Latent Dirichlet Allocation - Example Figure: Example of LDA built using the TASA corpus Steyvers, M., & Griffiths, T. (2007). Probabilistic Topic Models. Handbook of Latent Semantic Analysis.
Language Models Probabilities of Words
Language Models Introduction Estimate the probabilities of words e.g., occurring in a document Estimate the probability of a span of text e.g., sequence of words
Language Models Unigram Language Model Estimate the probability of a single word P(w_i) Can be estimated from a corpus: P(w_i) ≈ count(w_i) / Σ_{w_j ∈ W} count(w_j)
Language Models n-Gram Language Model Estimate the probabilities of a sequence of words P(w_1, ..., w_m) Can be used to predict the next (unseen) word P(w_1, ..., w_m) = ∏_{i=1}^{m} P(w_i | w_1, ..., w_{i−1}) Estimated via a corpus: P(w_i | w_{i−(n−1)}, ..., w_{i−1}) = count(w_{i−(n−1)}, ..., w_{i−1}, w_i) / count(w_{i−(n−1)}, ..., w_{i−1})
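A minimal sketch of a bigram (n = 2) language model estimated by counting, following the formula above; the toy corpus, the sentence-start symbol <s>, and the absence of smoothing are simplifying assumptions:

    from collections import Counter

    corpus = [["<s>", "das", "ist", "ein", "test"],
              ["<s>", "das", "ist", "ein", "kleiner", "test"]]

    unigram_counts = Counter(w for sent in corpus for w in sent)
    bigram_counts = Counter((s[i], s[i + 1]) for s in corpus for i in range(len(s) - 1))

    def p(word, prev):
        # P(w_i | w_{i-1}) = count(w_{i-1}, w_i) / count(w_{i-1})
        return bigram_counts[(prev, word)] / unigram_counts[prev]

    print(p("ist", "das"))      # 1.0
    print(p("kleiner", "ein"))  # 0.5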
PoS Tagging Assigning word groups to individual words
PoS Tagging What is PoS tagging? Part-of-speech tagging is the process of assigning a part-of-speech or other lexical class marker to each word in a corpus [Jurafsky & Martin] Input: a string of words and a specified tagset Output: a single best match for each word Figure: Assigning words to tags out of a tagset [Jon Atle Gulla]
PoS Tagging POS Examples Book that flight. VB DT NN Does that flight serve dinner? VBZ DT NN VB NN This task is not trivial For example: “book” is ambiguous (noun or verb) Challenge for POS tagging: resolve these ambiguities!
PoS Tagging Tagset The tagset is the vocabulary of possible POS tags. Choosing a tagset Striking a balance between Expressiveness (number of different word classes) “Classifiability” (ability to automatically classify words into the classes)
PoS Tagging Examples for existing tagsets Brown corpus, 87-tag tagset (1979) Penn Treebank, 45-tag tagset, selected from the Brown tagset (1993) C5, 61-tag tagset C7, 146-tag tagset STTS, German tagset (1995/1999) http://www.ims.uni-stuttgart.de/forschung/ressourcen/lexika/TagSets/stts-table.html Today, universal tagsets are common for coarse-grained classification.
PoS Tagging Penn Treebank Over 4.5 million words Presumed to be the first large syntactically annotated corpus Annotated with POS information And with skeletal syntactic structure Two-stage tagging process: 1. Assigning POS tags automatically (stochastic approach, 3-5% error) 2. Correcting tags by human annotators
PoS Tagging How hard is the tagging problem? Figure: The number of word classes in the Brown corpus by degree of ambiguity
PoS Tagging Some approaches for POS tagging Rule based ENGTWOL tagger Transformation based Brill tagger Stochastic (machine learning) HMM tagger
PoS Tagging Rule based POS tagging A two stage process 1. Assign a list of potential parts-of-speech to each word, e.g. BRIDGE → V N 2. Using rules, eliminate parts-of-speech tags from that list until a single tag remains ENGTWOL uses about 1,100 rules to rule out incorrect parts-of-speech
PoS Tagging Rule based POS tagging Input [figure]
PoS Tagging Rule based POS tagging Rules [figure]
PoS Tagging Pros and Cons of Rule-Based Systems + Interpretable model + Make use of expert/domain knowledge − A lot of work to create rules − Number of rules may explode (and contradict each other) ○ Good starting point
PoS Tagging Transformation based POS tagging Brill Tagger, combination of a rule-based tagger with supervised learning [Brill 1995] Rules Initially assign each word a tag (without taking the context into account) Known words → assign the most frequent tag Unknown words → e.g. noun (guesser rules) Apply rules iteratively (taking the surrounding context into account → context rules) e.g. If Trigger, then change the tag from X to Y ; If Trigger, then change the tag to Y Typically 50 guessing rules and 300 context rules Rules have been induced from tagged corpora by means of Transformation-Based Learning (TBL) http://www.ling.gu.se/~lager/mogul/brill-tagger/index.html
PoS Tagging Transformation-Based Learning 1. Generate all rules that correct at least one error 2. For each rule: a. Apply it to a copy of the most recent state of the training set b. Score the result using the objective function (e.g. number of wrong tags) 3. Select the rule with the best score 4. Update the training set by applying the selected rule 5. Stop if the score is smaller than some pre-set threshold T ; otherwise repeat from step 1
PoS Tagging Pros and Cons of Hybrid Systems + Interpretable model + Make use of expert/domain knowledge + Less work to create rules − Additional work to annotate a dataset − Risk of many rules ○ Works in special cases
PoS Tagging Stochastic part-of-speech tagging Based on the probability of a certain tag given a certain context Requires a training corpus No probabilities available for words not in the training corpus → smoothing Simple method : Choose the most frequent tag in the training text for each word Result: 90% accuracy Baseline method Lots of non-trivial methods, e.g. Hidden Markov Models (HMM)
PoS Tagging Generative Stochastic Part-of-Speech Tagging Intuition: Pick the most likely tag for each word Choose the best tag sequence for an entire sentence, seeking to maximize the formula P(word | tag) × P(tag | previous n tags) Let T = t_1, ..., t_n be a sequence of tags Let W = w_1, ..., w_n be a sequence of words Find the PoS tags that generate a sequence of words, i.e., look for the most probable sequence of tags T underlying the observed words W
PoS Tagging Markov models & Markov chains Markov chains can be seen as weighted finite-state machines They have the following Markov properties, where X_i is a state in the Markov chain, and s is a value that the state takes: Limited horizon : P(X_{t+1} = s | X_1, ..., X_t) = P(X_{t+1} = s | X_t) (first order Markov models) ... the value at state t + 1 depends only on the previous state Time invariant : P(X_{t+1} = s | X_t) is always the same, regardless of t ... there are no side effects
PoS Tagging Hidden Markov Models Now we are given a sequence of words (the observation) and want to find the POS tags Each state in the Markov model will be a POS tag (hidden state), but we don’t know the correct state sequence The underlying sequence of events (= the POS tags) can be seen as generating a sequence of words ... thus, we have a Hidden Markov Model
PoS Tagging Hidden Markov Model Needs three matrices as input: A (transition probabilities, POS → POS), B (emission probabilities, POS → word), π (initial probabilities, POS)
PoS Tagging Three Fundamental Problems 1. Probability estimation : How do we efficiently compute probabilities, i.e. P(O | θ), the probability of an observation sequence O given a model θ = (A, B, π), where A ... transition matrix, B ... emission matrix, π ... initial probability vector 2. Best path estimation : How do we choose the best sequence of states X, given our observation O and the model θ, i.e. how do we maximise P(X | O)? 3. Parameter estimation : From a space of models, how do we find the best parameters (A, B, and π) to explain the observation, i.e. how do we (re)estimate θ in order to maximise P(O | θ)?
PoS Tagging Three Fundamental Problems - Algorithmic Approaches 1. Probability estimation Dynamic programming (summing forward probabilities) 2. Best path estimation Viterbi algorithm 3. Parameter estimation Baum-Welch algorithm (Forward-Backward algorithm)
PoS Tagging Simplifying the Probabilities argmax_{t_{1,n}} P(t_{1,n} | w_{1,n}) = argmax_{t_{1,n}} P(w_{1,n} | t_{1,n}) P(t_{1,n}) → refers to the whole sentence ... estimating probabilities for an entire sentence is a bad idea Markov models have the property of limited horizon: one state refers only back to the previous (n, typically 1) steps - it has no memory
PoS Tagging Simplifying the probabilities Independence assumption: words/tags are independent of each other For a bi-gram model: P(t_{1,n}) ≈ P(t_n | t_{n−1}) P(t_{n−1} | t_{n−2}) ... P(t_2 | t_1) = ∏_{i=1}^{n} P(t_i | t_{i−1}) A word’s identity only depends on its tag: P(w_{1,n} | t_{1,n}) ≈ ∏_{i=1}^{n} P(w_i | t_i) The final equation is: t̂_{1,n} = argmax_{t_{1,n}} ∏_{i=1}^{n} P(w_i | t_i) P(t_i | t_{i−1})
PoS Tagging Probability Estimation for Tagging How do we get such probabilities? → With supervised tagging we can simply use Maximum Likelihood Estimation (MLE) and use counts (C) from a reference corpus P(t_i | t_{i−1}) ≈ C(t_{i−1}, t_i) / C(t_{i−1}) P(w_i | t_i) ≈ C(w_i, t_i) / C(t_i) Given these probabilities we can finally assign a probability to a sequence of states (tags) To find the best sequence (of tags) we can apply the Viterbi algorithm
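A compact sketch of Viterbi decoding for such an HMM tagger, assuming A (transitions), B (emissions), and π have already been estimated as above; the toy probabilities below are illustrative assumptions chosen so that the slide's "Book that flight" example comes out as VB DT NN:

    def viterbi(words, tags, pi, A, B):
        # delta[t]: probability of the best tag sequence ending in tag t for the words seen so far
        delta = {t: pi[t] * B[t].get(words[0], 1e-12) for t in tags}  # 1e-12: crude floor for unseen words
        backpointers = []
        for w in words[1:]:
            new_delta, pointer = {}, {}
            for t in tags:
                best_prev = max(tags, key=lambda s: delta[s] * A[s][t])
                new_delta[t] = delta[best_prev] * A[best_prev][t] * B[t].get(w, 1e-12)
                pointer[t] = best_prev
            delta = new_delta
            backpointers.append(pointer)
        path = [max(delta, key=delta.get)]          # best final tag ...
        for pointer in reversed(backpointers):      # ... then follow the back-pointers
            path.insert(0, pointer[path[0]])
        return path

    tags = ["DT", "NN", "VB"]
    pi = {"DT": 0.5, "NN": 0.3, "VB": 0.2}
    A = {"DT": {"DT": 0.01, "NN": 0.89, "VB": 0.10},
         "NN": {"DT": 0.20, "NN": 0.30, "VB": 0.50},
         "VB": {"DT": 0.60, "NN": 0.30, "VB": 0.10}}
    B = {"DT": {"that": 0.9}, "NN": {"flight": 0.5, "book": 0.3}, "VB": {"book": 0.4, "serve": 0.3}}

    print(viterbi(["book", "that", "flight"], tags, pi, A, B))  # ['VB', 'DT', 'NN']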
PoS Tagging Probability Estimation - Justification Given an observation, estimate the underlying probability e.g. recall the PMF of the binomial distribution: p(k) = (n choose k) p^k (1 − p)^(n−k) We want to estimate the best p: argmax_p P(observed data) = argmax_p (n choose k) p^k (1 − p)^(n−k) → take the derivative to find the maximum: 0 = ∂/∂p [(n choose k) p^k (1 − p)^(n−k)] For large n·p one can approximate p by k/n (with standard deviation √(k(n − k)/n³), for independent observations and an unbiased estimate) Note: There are alternative versions on how to estimate the probabilities
PoS Tagging Pros and Cons of Stochastic Systems + Less involvement of expert/domain knowledge (annotate dataset) + Finds an optimal working point − High complexity − Black-box, cannot easily be interpreted − Requires machine learning expert (e.g., check for preconditions) ○ Best results (if preconditions met)
PoS Tagging POS tagging - Stochastic part-of-speech tagging Works for cases where there is evidence in the corpus But what to do if there are rare events which just did not make it into the corpus? Simple non-solution: always assume their probability to be 0 Alternative solution: smoothing
PoS Tagging POS tagging - Stochastic part-of-speech tagging Will the sun rise tomorrow? Laplace’s Rule of Succession We start with the assumption that rise/non-rise are equally probable On day n + 1, we’ve observed that the sun has risen s times before p_Lap(S_{n+1} = 1 | S_1 + ... + S_n = s) = (s + 1) / (n + 2) What is the probability on day 0, 1, ...?
PoS Tagging POS tagging - Stochastic part-of-speech tagging Laplace Smoothing Simply add one: C(t_{i−1}, t_i) / C(t_{i−1}) ⇒ (C(t_{i−1}, t_i) + 1) / (C(t_{i−1}) + V(t_{i−1}, t)) ... where V(t_{i−1}, t) = |{t_i | C(t_{i−1}, t_i) > 0}| (vocabulary size) Can be further generalised by introducing a smoothing parameter λ: (C(t_{i−1}, t_i) + λ) / (C(t_{i−1}) + λ·V(t_{i−1}, t)) Note: Also called Lidstone smoothing, additive smoothing
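A small sketch of add-λ smoothing for the transition probabilities, following the formulas above (toy counts; V is computed as defined on the slide, the number of distinct tags observed after t_{i−1}):

    def smoothed_p(t_prev, t, bigram_c, unigram_c, lam=1.0):
        # (C(t_prev, t) + lambda) / (C(t_prev) + lambda * V(t_prev, .))
        V = sum(1 for (a, _b), c in bigram_c.items() if a == t_prev and c > 0)
        return (bigram_c.get((t_prev, t), 0) + lam) / (unigram_c[t_prev] + lam * V)

    bigram_c = {("DT", "NN"): 80, ("DT", "JJ"): 20}
    unigram_c = {"DT": 100}
    print(smoothed_p("DT", "NN", bigram_c, unigram_c))  # (80 + 1) / (100 + 1 * 2)
    print(smoothed_p("DT", "VB", bigram_c, unigram_c))  # unseen transition: (0 + 1) / (100 + 2) > 0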
PoS Tagging POS tagging - Stochastic part-of-speech tagging Estimate the smoothing parameter λ in (C(t_{i−1}, t_i) + λ) / (C(t_{i−1}) + λ·V(t_{i−1}, t)) ... typically λ is set between 0 and 1 How to choose the correct λ? Separate a small part of the training set (held-out data) ... the development set Apply the maximum likelihood estimate
PoS Tagging State-of-the-Art in POS Tagging Selected Approaches
System name | Short description | All tokens | Unknown words
TnT | Hidden Markov model | 96.46% | 85.86%
MElt | MEMM | 96.96% | 91.29%
GENiA Tagger | Maximum entropy | 97.05% | Not available
Averaged Perceptron | Averaged Perceptron | 97.11% | Not available
Maxent easiest-first | Maximum entropy | 97.15% | Not available
SVMTool | SVM-based | 97.16% | 89.01%
LAPOS | Perceptron based | 97.22% | Not available
Morče/COMPOST | Averaged Perceptron | 97.23% | Not available
Stanford Tagger 2.0 | Maximum entropy | 97.32% | 90.79%
LTAG-spinal | Bidirectional perceptron | 97.33% | Not available
SCCN | Condensed nearest neighbor | 97.50% | Not available
Taken from: http://aclweb.org/aclwiki/index.php?title=POS_Tagging_%28State_of_the_art%29
Information Extraction Extract semantic content from text