The Hitchhikers Guide to Testing Statistical Significance in NLP - PowerPoint PPT Presentation

The Hitchhiker’s Guide to Testing Statistical Significance in NLP Rotem Dror , Gili Baumer, Segev Shlomov, and Roi Reichart ACL 2018 https://github.com/rtmdrr/testSignificanceNLP

I want to be…   state of the art Ingredients Directions • – my new algorithm • Apply algorithm on • – current SOTA • Apply algorithm on algorithm • Test if   • Data - • Evaluation measure

This is not enough! • The difference between the performance of algorithm and could be coincidental! • We need to make sure that the probability of making a false claim is very small. • We can do so by… Testing Statistical Significance!

NLP & Hypothesis Testing – Survey ACL 2017 • 180 experimental long papers • 63 checked statistical significance • Only 42 mentioned the name of the statistical test • Only 36 used the correct statistical test - of all papers! OK! Checked significan 180 ce experiment al papers

Simple Guide

Statistical Significance Hypothesis Testing • Let: .

Statistical Significance Hypothesis Testing • The smaller the p-value is, the higher the indication that the null hypothesis, , does not hold. • We reject the null hypothesis if

Statistical Significance Hypothesis Testing • Type I error – rejecting the null hypothesis when it is true • Type II error –not rejecting the null hypothesis when the alternative is true • Significance level – probability of making type I error () • Significance Power – probability of not making type II error

So… Let’s all test for statistical significance! Why not? OK ☹ ☹ ☹ ☹

NLP & Hypothesis Testing - Problems Both algorithms are applied on the same data . What is the distribution of ? Data samples are not independent.

Paired Statistical Tests • Both algorithms are applied on the same data – dependent • Paired sample: sample selected from the first population is related to the corresponding sample from the second population • Solution: apply paired-version of statistical test • Paired t-test, Wilcoxon signed-rank test, paired bootstrap…

Parametric Tests • First case: the distribution of is Normal • Parametric tests make assumptions about the test statistic distribution, particularly - normal distribution. • When the parametric test meets assumptions it has high statistical power • Linear regression analyses • T-tests and analyses of variance on the difference of means • Normal curve Z-tests of the differences of means and proportions

Parametric Tests – Check for Normality • Shapiro-Wilk: tests if a sample comes from a normally distributed population scipy.stats.shapiro([a-b for a, b in zip(res_A, res_B)]) • Anderson-Darling: tests if a sample is drawn from a given distribution scipy.stats.anderson([a-b for a, b in zip(res_A, res_B)], 'norm' ) • Kolmogorov-Smirnov: goodness of fit test. Samples are standardized and compared with a standard normal distribution. scipy.stats.kstest([a-b for a, b in zip(res_A, res_B)], 'norm' )

Non-Parametric Tests • Second case: the distribution of is unknown\not normal • Non parametric tests do not assume anything about the test statistic distribution • Two types – sampling-free and sampling-based tests

Sampling-Free Non-Parametric Tests Binomial\ Not Normal Multinomial McNemar Sign Wilcoxon Cochren’s Q signed-rank

Sampling-Based Non-Parametric Tests • Permutation tests: resamples drawn at random from the original data. Without replacements . • Paired design – consider all possible choices   of signs to attach to each difference. • Bootstrap: resamples drawn at random from the   original data. With replacements . • Paired design – sample with repetitions from   the set of all differences.

NLP Data and I.I.D Assumption • Many NLP datasets have dependent samples • All statistical test assume independency => all tests are invalid, impact hard to quantify • Solution: come up with statistical tests   that allow dependencies

NLP & Hypothesis Testing Both algorithms are applied on the same data . What is the distribution of ? Data samples are not independent.

Simple Guide

  Thank You for Listening   Questions?   https://github.com/rtmdrr/testSignificanceNLP

The Hitchhikers Guide to Testing Statistical Significance in NLP - PowerPoint PPT Presentation

The Hitchhikers Guide to Testing Statistical Significance in NLP Rotem Dror , Gili Baumer, Segev Shlomov, and Roi Reichart ACL 2018 https://github.com/rtmdrr/testSignificanceNLP I want to be state of the art Ingredients Directions

The Hitchhiker's The Hitchhiker's Guide to CLIs in Guide to CLIs in Python Python Vinayak

Dont Panic: The Hitchhikers Guide to Don t Panic: The Hitchhiker s Guide to Missing Import

A Hitchhiker's Guide to Azure Mobile Apps David Giard Microso. Technical Evangelist

Levels of Testing Chapter 12 Beyond unit testing Developer Testing stages Unit testing

Testing Terminology System testing Types of errors Function testing Structure

Property-Based Testing Matt Bachmann @mattbachmann Testing is Important Testing is Important

Software Testing Overview What is software testing? General testing criteria Testing

Software testing Software Testing Introduction Testing levels Automated testing Principles and

1. Test page This page is for testing. This page is for testing. This page is for testing.

Evaluation Map Guide Evaluation Map Guide Evaluation Map Guide Evaluation Map Guide Progress on

Statistical Statistical Statistical Model Statistical Model Model Checking Model Checking

Overview Objective Types of testing ECE 553: TESTING AND Verification testing

Object Oriented Testing Chapter 23 1 OO Testing Class Testing: Equivalent to unit testing

Software Testing Software testing 1 V model Software testing 2 Program testing goals To

YOUR COMPLETE YOUR COMPLETE PRESENTATION GUIDE PRESENTATION GUIDE PRESENTATION GUIDE Im so

Statistical graphics with Statistical graphics with ggplot2 ggplot2 Programming for Statistical

For Financial Professional Use Only. Not For Distribution to the Public. For Financial

Asset manager funds Joseph Gerakos University of Chicago May 20, 2016 Asset manager funds

Pat Patodia Beta Analytic Inc. November 21-23, 2011 2011 Pacific Rim Summit 1 Overview

Creating a Financial Markets Infrastructure Leader of the Future All share acquisition of

Human Interface/ Human Error 18-849b Dependable Embedded Systems Charles P. Shelton February

Director Bimestefer Conversation with County Leadership Medicaid Eligibility Error Rates June

Hypothesis Testing Lecture 18 Biostatistics 602 - Statistical Inference . Summary . .

Programming with Subclasses Announcements for This Lecture Assignments Prelim 2 Thursday,