Statistical Significance Tests in NLP
Natural Language Processing VU (706.230)
Andi Rexha, 26/03/2020
Agenda
● NLP Tasks
  ○ Presentation of tasks
  ○ Evaluation metrics
  ○ Advantages and drawbacks of the metrics
● Significance Tests
  ○ Types of testing
  ○ Metric to test types
  ○ Decision tree for applying statistical tests
NLP Tasks & Metrics
Text Classification
● Binary classification:
  ○ Given a sentence, classify it as positive or negative (e.g. cats vs. dogs)
  ○ Results are presented via a confusion matrix
● Why do we need a metric?
  ○ To compare different algorithms
  ○ Accuracy = (TP+TN)/(TP+TN+FP+FN)
  ○ What is the problem with accuracy?
Text Classification (2)
● If the data contain 99 dogs and 1 cat:
  ○ Classify everything as a dog
  ○ Achieve 99% accuracy
  ○ But this is not the classification we want
● Precision, Recall and F-measure:
  ○ Precision = TP/(TP+FP)
  ○ Recall = TP/(TP+FN)
  ○ F1 = 2 · Precision · Recall / (Precision + Recall)
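The formulas above can be sketched in code. A minimal helper (our own, not from any library) that computes the metrics from confusion-matrix counts:

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1
```

For the degenerate classifier above (with "cat" as the positive class: TP=0, TN=99, FP=0, FN=1), accuracy is 0.99 but recall and F1 are 0, which exposes the problem.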
Correlation Tasks
● Sentiment analysis:
  ○ Annotators rate a piece of text with a sentiment score from 1 to 5
● Sentence semantic similarity:
  ○ Given two sentences, annotators assign a semantic similarity score from 1 to 10
● How to measure the correlation between the human annotations and the algorithm?
  ○ Correlation: a statistical technique that shows how related two random variables are
Correlation Metrics
● The two most used correlations:
  ○ Pearson correlation: r = cov(X, Y) / (σ_X · σ_Y)
  ○ Spearman correlation:
    ■ The Pearson correlation between the ranked variables
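Both definitions can be written out directly. A minimal sketch using only the standard library (in practice scipy.stats.pearsonr and scipy.stats.spearmanr are the usual choice); the tie handling in the rank function is simplified:

```python
import math

def pearson(x, y):
    """Pearson correlation: covariance of x and y divided by the
    product of their standard deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman correlation: the Pearson correlation between the
    rank-transformed variables (ties ignored for simplicity)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    return pearson(ranks(x), ranks(y))
```

Spearman rewards any monotonic relationship: spearman([1, 2, 3], [1, 10, 100]) is exactly 1.0 even though the relationship is not linear.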
Language model
● Generates the next token of a sequence
● Usually based on counts of word co-occurrences within a window:
  ○ Statistics are collected and the next word is predicted based on this information
  ○ Mainly, it models the probability of the sequence
● We need a metric that assesses how well the language model works:
  ○ How well a probability model predicts a sample!
Language model (2)
● Metric: Perplexity
  ○ We want to model a probability distribution P
    ■ How close is our distribution Q to the real one?
    ■ We can approximate by testing against samples drawn from P
  ○ PP(W) = P(w_1 w_2 … w_N)^(-1/N)
  ○ The larger the perplexity, the worse
  ○ The N-th root is a normalizing factor
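The formula above is easiest to compute in log space. A minimal sketch, assuming we already have the probability the model Q assigned to each token of a held-out sample:

```python
import math

def perplexity(token_probs):
    """Perplexity over a test sequence, given the probability the model
    assigned to each token: the inverse geometric mean of those
    probabilities, PP = (prod q_i)^(-1/N) = exp(-(1/N) * sum(log q_i))."""
    n = len(token_probs)
    return math.exp(-sum(math.log(q) for q in token_probs) / n)
```

A model that is uniform over a 4-word vocabulary assigns probability 0.25 to every token and so has perplexity exactly 4, matching the intuition of perplexity as an effective branching factor.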
Machine Translation
● Translates a piece of text from one language into another:
  ○ Usually sentence/phrase based
● Evaluation metric:
  ○ BLEU: BiLingual Evaluation Understudy
    ■ Accuracy cannot be used, because there is no single "exact" translation
    ■ Candidates are usually compared against very good quality reference translations
    ■ Average of candidate words present in any of the reference translations
    ■ Returns a value in [0, 1], with 1 reflecting a very good translation
BLEU
● Precision on the word level alone doesn't work:
  ○ A degenerate candidate that only repeats a word from the reference reaches precision 1
  ○ Naive solution: number of words occurring in both, divided by the maximum number of words
    ■ What if the translation contains the words in reversed order?
● To overcome this problem, match n-grams:
  ○ Average the precision for each n from 1 to 4
  ○ Penalize if the translation is shorter than one of the references
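The scheme above can be sketched for the single-reference case (the real metric, available e.g. in nltk.translate.bleu_score, supports multiple references and smoothing). Clipping the counts fixes the degenerate repeated-word case, and the original BLEU combines the n-gram precisions with a geometric mean:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, reference, n):
    """Clip each candidate n-gram count by its count in the reference."""
    cand = Counter(ngrams(candidate, n))
    ref = Counter(ngrams(reference, n))
    clipped = sum(min(c, ref[g]) for g, c in cand.items())
    total = sum(cand.values())
    return clipped / total if total else 0.0

def bleu(candidate, reference, max_n=4):
    """Geometric mean of modified n-gram precisions (n = 1..4) times
    a brevity penalty for candidates shorter than the reference."""
    precisions = [modified_precision(candidate, reference, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0
    log_avg = sum(math.log(p) for p in precisions) / max_n
    bp = 1.0 if len(candidate) >= len(reference) else \
        math.exp(1 - len(reference) / len(candidate))
    return bp * math.exp(log_avg)
```

For the candidate "the the the the" against a reference containing "the" once, plain unigram precision would be 1, but the clipped version is 1/4.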
Deep Parsing
● Deep Parsing (Dependency Parsing):
  ○ Parses the sentence into its grammatical structure
  ○ "Head"-"Dependent" form
  ○ It is a directed acyclic graph (mostly implemented as a tree)
  ○ Example:
    ■ "There are different examples that we might use!"
Deep Parsing (2)
● Attachment score:
  ○ The percentage of words that have the correct head
● Two measures for dependency parsing:
  ○ LAS (Labeled Attachment Score):
    ■ A word is considered correct if both its head and its label are correct
  ○ UAS (Unlabeled Attachment Score):
    ■ A word is considered correct if its identified head is correct (without necessarily having the correct label)
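Both scores reduce to counting per-word matches. A minimal sketch (the function name and the (head, label) representation are our own, not a parser API):

```python
def attachment_scores(gold, pred):
    """Compute (UAS, LAS) for one sentence.

    gold, pred: lists of (head_index, label) pairs, one per word.
    UAS counts words whose head matches; LAS additionally requires
    the dependency label to match."""
    assert len(gold) == len(pred)
    n = len(gold)
    uas = sum(g[0] == p[0] for g, p in zip(gold, pred)) / n
    las = sum(g == p for g, p in zip(gold, pred)) / n
    return uas, las
```

By construction LAS can never exceed UAS: every labeled match is also an unlabeled match.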
Summarization
● Given a long text, the goal is to find a shorter textual representation:
  ○ Usually a list of phrases or sentences
● Evaluation:
  ○ ROUGE: Recall-Oriented Understudy for Gisting Evaluation
    ■ Evaluates the summary (or translation) against a list of references
    ■ ROUGE-N: overlap of n-grams between the system and reference summaries
ROUGE
● ROUGE-1: overlap of unigrams (each word) between the system and reference summaries
● ROUGE-2: overlap of bigrams between the system and the references
● Other ROUGE variants:
  ○ ROUGE-L: measures the longest matching sequence of words, using the longest common subsequence (LCS)
  ○ ROUGE-S: measures matching ordered words, allowing for arbitrary gaps
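Being recall-oriented, ROUGE-N divides the n-gram overlap by the number of n-grams in the reference. A minimal single-reference sketch (real ROUGE implementations also report precision and F1, and handle multiple references):

```python
from collections import Counter

def rouge_n_recall(system, reference, n):
    """ROUGE-N as n-gram recall: clipped n-gram overlap divided by the
    number of n-grams in the reference summary."""
    def ngram_counts(tokens):
        return Counter(tuple(tokens[i:i + n])
                       for i in range(len(tokens) - n + 1))
    sys_c, ref_c = ngram_counts(system), ngram_counts(reference)
    overlap = sum(min(c, sys_c[g]) for g, c in ref_c.items())
    total = sum(ref_c.values())
    return overlap / total if total else 0.0
```

For the system summary "the cat" against the reference "the cat sat", ROUGE-1 recall is 2/3: two of the three reference unigrams are covered.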
METEOR
● METEOR (Metric for Evaluation of Translation with Explicit ORdering):
  ○ Uses sentences as basic units
  ○ Based on Precision and Recall, with more emphasis on Recall
  ○ Adds some other features: stemming + synonyms
  ○ Creates an alignment between the two sentences
    ■ How good is the alignment?
METEOR (2)
● Calculating the score:
  ○ Unigram Precision and Recall: P = m / w_t, R = m / w_r
    ■ where m is the number of common (mapped) unigrams
    ■ w_t is the number of unigrams in the candidate translation
    ■ w_r is the number of unigrams in the reference translation
  ○ Precision and Recall are combined as F_mean = 10·P·R / (R + 9·P)
  ○ Longer n-gram matches are used to compute a penalty p:
    ■ p = 0.5 · (c / m)^3
    ■ where c is the minimum number of matching chunks
    ■ and m is the number of mapped unigrams
  ○ Final score: M = F_mean · (1 - p)
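The formulas above can be put together in a simplified sketch that uses exact word matches only (no stemming or synonym matching, greedy left-to-right alignment); the full metric as implemented e.g. in NLTK is more involved:

```python
def meteor_sketch(candidate, reference):
    """Simplified METEOR: P = m/w_t, R = m/w_r, F_mean = 10PR/(R+9P),
    penalty = 0.5 * (c/m)^3, score = F_mean * (1 - penalty)."""
    # Greedy alignment on exact matches, each reference word used once
    used = [False] * len(reference)
    align = []  # pairs of (candidate index, reference index)
    for i, w in enumerate(candidate):
        for j, r in enumerate(reference):
            if not used[j] and w == r:
                used[j] = True
                align.append((i, j))
                break
    m = len(align)
    if m == 0:
        return 0.0
    p, r = m / len(candidate), m / len(reference)
    fmean = 10 * p * r / (r + 9 * p)
    # Chunks: maximal runs of adjacent candidate words mapped to
    # adjacent reference words
    chunks = 1
    for (i1, j1), (i2, j2) in zip(align, align[1:]):
        if i2 != i1 + 1 or j2 != j1 + 1:
            chunks += 1
    penalty = 0.5 * (chunks / m) ** 3
    return fmean * (1 - penalty)
```

An identical candidate and reference align in a single chunk, so the penalty shrinks with sentence length and the score approaches 1.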
Metrics on Papers
● [Chart: evaluation metric usage across ACL + TACL papers]
NLP Tasks & Metrics: End of Module
Statistical Significance Tests
Current Status
● How it is done:
  ○ Find a dependent variable (the evaluation measure)
  ○ Run the state of the art and the current algorithm on a benchmark
  ○ Find out that the current algorithm performs better on the benchmark
● What if we have an improvement (e.g. in accuracy) of 1%?
  ○ Is it statistically significant?
  ○ We should apply a statistical significance test to the results
Significance Tests
● Given two algorithms A and B, a dataset X and a measure M:
  ○ M(Alg, X) is the value of the evaluation measure for algorithm Alg on dataset X
  ○ So the difference between the two algorithms is:
    ■ δ(X) = M(A, X) - M(B, X)
  ○ The statistical hypotheses:
    ■ H0: δ(X) ≤ 0 (no improvement) vs. H1: δ(X) > 0
Significance Tests (2)
● The p-value for rejecting/accepting the null hypothesis:
  ○ The probability, assuming the null hypothesis is true, of observing a result at least as extreme as the one actually observed
● Types of error (as learned in class):
  ○ Type 1 error: the null hypothesis is rejected when it is actually true
  ○ Type 2 error: the null hypothesis is not rejected, although it should be
Significance Tests (3)
● Parametric vs. non-parametric:
  ○ If the distribution is known, a parametric test is better:
    ■ Lower probability of making Type 2 errors
  ○ Non-parametric tests don't make any assumption about the distribution:
    ■ Less powerful
    ■ More sound if the distribution isn't known
Normality Test
● Null hypothesis:
  ○ The sample comes from a normal distribution
● Shapiro-Wilk:
  ○ Compares a variance estimate based on the ordered sample (as in a Q-Q plot) with the sample variance
● Kolmogorov-Smirnov:
  ○ Tests between the shapes of two distributions (not only the normal)
  ○ Based on the maximum absolute difference between two cumulative distribution functions
● Anderson-Darling:
  ○ Tests whether a sample comes from a given distribution; also based on the cumulative distribution function
Parametric Tests
● Parametric tests assume that the data are drawn from a known distribution:
  ○ With defined parameters
  ○ Usually a normal distribution
● Paired Student's t-test:
  ○ Typically applied to Accuracy, UAS and LAS:
    ■ Compute the mean of the predictions per sample
  ○ Based on the Central Limit Theorem:
    ■ The sum of independently drawn variables follows a normal distribution
  ○ Also Pearson's correlation (with n-2 degrees of freedom)
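The paired t-test works on the per-example score differences between the two algorithms. A minimal sketch computing just the t statistic with the standard library (in practice scipy.stats.ttest_rel also returns the p-value):

```python
import math
from statistics import mean, stdev

def paired_t_statistic(scores_a, scores_b):
    """t statistic of the paired Student's t-test.

    scores_a, scores_b: per-example scores of algorithms A and B on the
    same test examples. t = mean(d) / (stdev(d) / sqrt(n)), where d are
    the pairwise differences; compare |t| against the t distribution
    with n-1 degrees of freedom to obtain a p-value."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / math.sqrt(n))
```

Note the pairing: both algorithms must be scored on the same examples, so the differences cancel out example-specific difficulty.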
Non-Parametric Tests
● Two dimensions:
  ○ Statistical power
  ○ Computational complexity
● Two families:
  ○ Sampling-free:
    ■ Doesn't consider the values of the evaluation measure
  ○ Sampling-based:
    ■ Considers the values of the evaluation measure and samples from them
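A common sampling-based test is approximate randomization: repeatedly swap the two algorithms' scores at random and check how often a difference at least as large as the observed one arises by chance. A minimal sketch, with the function name and interface chosen for illustration:

```python
import random

def permutation_test(scores_a, scores_b, trials=10000, seed=0):
    """Approximate randomization test for the mean score difference.

    For each trial, randomly swap A's and B's score on each example and
    recompute the mean difference; the returned p-value is the fraction
    of trials whose difference is at least the observed one."""
    rng = random.Random(seed)
    n = len(scores_a)
    observed = sum(a - b for a, b in zip(scores_a, scores_b)) / n
    hits = 0
    for _ in range(trials):
        diff = sum((a - b) if rng.random() < 0.5 else (b - a)
                   for a, b in zip(scores_a, scores_b)) / n
        if diff >= observed:
            hits += 1
    return hits / trials
```

Because it makes no distributional assumption, this test is a safe default when a normality test fails, at the cost of the extra computation over the trials.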