Statistical Significance Testing In Theory and In Practice

Statistical Significance Testing In Theory and In Practice
Ben Carterette
University of Delaware
http://ir.cis.udel.edu/ICTIR15tutorial

Hypotheses and Experiments
• Hypothesis:
  – Using an SVM for classification will give better accuracy than using Naïve Bayes
  – A “Symbol-Refined Tree Substitution Grammar” will give better parsing results than a simple TSG
  – Expanding a short keyword query with synonyms will improve search engine effectiveness
• Experiment:
  – Build a baseline system
  – Modify it based on your hypothesis
  – Test both systems on one or more datasets

Experimental Results
from Shindo et al., Bayesian Symbol-Refined Tree Substitution Grammars for Syntactic Parsing, ACL 2012

So What?
• “Do these results support my hypothesis?”
• “Are these results meaningful?”
• “Is it possible that my results are just random?”
→ statistical significance testing!

Overview of This Tutorial
Part 1: Testing Statistical Significance
  – May be a review for some of you
Part 2: Fundamentals of Significance Testing
Part 3: Applications, or, Why Bother With Fundamentals?
Part 4: Myths and Misconceptions
Part 5: Significance Testing in IR Research

Using R
• R is a software environment for statistical computing
• Includes built-in implementations of many common tests
  – Also has its own programming language for implementing your own
• Download from http://r-project.org
  – Download TREC-7 evaluation data from http://ir.cis.udel.edu/ICTIR15tutorial/trec7.RData

Background: Experimentation in IR
• The standard experimental setting in IR is called the Cranfield paradigm
• Two components: test collections and effectiveness measures
  – A test collection comprises:
    • A corpus of documents
    • A set of information needs/tasks/topics/queries
    • Relevance judgments
  – Effectiveness measures such as:
    • Precision@10, average precision, nDCG@10, alpha-nDCG@10, etc.

Background: Cranfield
[Figure: systems A, B, C, and D are each run on queries 1 to 5, producing one effectiveness score per query.]
A: 0.3 0.4 0.1 0.5 0.3
B: 0.2 0.3 0.1 0.2 0.3
C: 0.4 0.4 0.3 0.1 0.2
D: 0.1 0.5 0.4 0.3 0.1

Background: Cranfield
           A      B      C      D      mean
query 1    0.3    0.2    0.4    0.1    0.250
query 2    0.4    0.3    0.4    0.5    0.400
query 3    0.1    0.1    0.3    0.4    0.225
query 4    0.5    0.2    0.1    0.3    0.275
query 5    0.3    0.3    0.2    0.1    0.225
mean       0.32   0.22   0.28   0.28
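The averages in the table above can be reproduced with a few lines of code. The tutorial itself uses R; this is a minimal Python sketch, and the names `scores`, `system_means`, and `query_means` are illustrative only.

```python
# Per-query scores for systems A-D, copied from the Cranfield table.
scores = {
    "A": [0.3, 0.4, 0.1, 0.5, 0.3],
    "B": [0.2, 0.3, 0.1, 0.2, 0.3],
    "C": [0.4, 0.4, 0.3, 0.1, 0.2],
    "D": [0.1, 0.5, 0.4, 0.3, 0.1],
}


def system_means(scores):
    """Average each system's score over all queries (bottom row of the table)."""
    return {name: sum(values) / len(values) for name, values in scores.items()}


def query_means(scores):
    """Average each query's score over all systems (last column of the table)."""
    per_query = zip(*scores.values())  # one tuple of system scores per query
    return [sum(row) / len(row) for row in per_query]


for name, value in system_means(scores).items():
    print(name, round(value, 2))
print([round(value, 3) for value in query_means(scores)])
```

The system means are what a Cranfield-style experiment usually reports; the significance tests in Part 1 ask whether a difference between two such means could plausibly be due to chance.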

Part 1
TESTING STATISTICAL SIGNIFICANCE

Commonly-Used Tests
• Non-parametric:
  – Sign test/binomial test
  – Wilcoxon signed rank test
• Parametric:
  – Student’s t-test
  – ANOVA
• Distribution-free:
  – Randomization test
  – Bootstrap test

Sign Test
Query   A     B     B-A    sign(B-A)
1       .25   .35   +.10   +1
2       .43   .84   +.41   +1
3       .39   .15   -.24   -1
4       .75   .75   0      0
5       .43   .68   +.25   +1
6       .15   .85   +.70   +1
7       .20   .80   +.60   +1
8       .52   .50   -.02   -1
9       .49   .58   +.09   +1
10      .50   .75   +.25   +1
7 “successes” in 9 complete trials
What if each +1/-1 was just the result of flipping a fair coin?
What is the probability we would see 7 or more heads if the coin is fair?

Binomial Distribution
What is the probability we would see 7 or more heads if the coin is fair?
  P(7 heads | 9 trials, ½ probability)
+ P(8 heads | 9 trials, ½ probability)
+ P(9 heads | 9 trials, ½ probability)
= 0.09
p-value = 0.09
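The binomial sum above can be checked directly. This is a minimal Python sketch (the tutorial itself uses R); `sign_test_p_value` is a hypothetical helper name, and the differences are the B-A column from the Sign Test slide. It computes the one-sided tail probability, dropping ties as the slide does.

```python
from math import comb

# B-A differences for queries 1-10, from the Sign Test slide.
differences = [0.10, 0.41, -0.24, 0.0, 0.25, 0.70, 0.60, -0.02, 0.09, 0.25]


def sign_test_p_value(differences):
    """One-sided sign test: P(at least this many heads in n fair coin flips)."""
    nonzero = [d for d in differences if d != 0]  # ties are not complete trials
    n = len(nonzero)
    successes = sum(1 for d in nonzero if d > 0)
    # Each outcome of n fair flips has probability 1 / 2**n, so the tail
    # probability is a count of favourable outcomes divided by 2**n.
    tail_count = sum(comb(n, k) for k in range(successes, n + 1))
    return tail_count / 2 ** n


print(round(sign_test_p_value(differences), 2))
```

With 7 successes in 9 trials the tail count is 36 + 9 + 1 = 46 out of 512 outcomes, which rounds to the 0.09 on the slide.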

Wilcoxon Signed-Rank Test
Query   A     B     B-A
1       .25   .35   +.10
2       .43   .84   +.41
3       .39   .15   -.24
4       .75   .75   0
5       .43   .68   +.25
6       .15   .85   +.70
7       .20   .80   +.60
8       .52   .50   -.02
9       .49   .58   +.09
10      .50   .75   +.25

Nonzero differences ranked by absolute value:
Rank   B-A
1      -.02
2      +.09
3      +.10
4      -.24
5.5    +.25
5.5    +.25
7      +.41
8      +.60
9      +.70

W = 2 + 3 + 5.5 + 5.5 + 7 + 8 + 9
W = 40

Wilcoxon Signed-Rank Test
[Figure: density of W under the null hypothesis (x-axis: W, y-axis: Density), with the observed W = 40 marked.]
p-value = 0.02
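One way to arrive at a p-value for W = 40 is to enumerate every possible assignment of signs to the ranks and count how often the sum of positive ranks is at least as large as the observed one. The sketch below is Python rather than the tutorial's R, and it assumes a one-sided exact test with tied values given their average rank; the slides do not state which variant produced their figure. Differences are stored in hundredths so that ties compare exactly, and the helper names are illustrative.

```python
from itertools import product

# B-A differences for queries 1-10, in hundredths (10 means +.10).
differences = [10, 41, -24, 0, 25, 70, 60, -2, 9, 25]


def average_ranks(values):
    """Rank values from smallest to largest; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1  # average of 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def signed_rank_test(differences):
    """Return (W, one-sided exact p-value), where W is the sum of positive ranks."""
    nonzero = [d for d in differences if d != 0]  # zero differences are dropped
    ranks = average_ranks([abs(d) for d in nonzero])
    w_observed = sum(r for r, d in zip(ranks, nonzero) if d > 0)
    # Under the null hypothesis every rank is equally likely to be + or -.
    as_extreme = 0
    for signs in product((0, 1), repeat=len(ranks)):
        if sum(r for r, s in zip(ranks, signs) if s) >= w_observed:
            as_extreme += 1
    return w_observed, as_extreme / 2 ** len(ranks)


w, p = signed_rank_test(differences)
print(w, round(p, 2))
```

Under these assumptions the sketch reproduces W = 40 and a p-value that rounds to 0.02.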

Student’s t-test
Query   A     B     B-A
1       .25   .35   +.10
2       .43   .84   +.41
3       .39   .15   -.24
4       .75   .75   0
5       .43   .68   +.25
6       .15   .85   +.70
7       .20   .80   +.60
8       .52   .50   -.02
9       .49   .58   +.09
10      .50   .75   +.25

$\hat{\mu} = \overline{B - A} = 0.214$
$\hat{\sigma}_{B-A} = 0.291$
$t = \dfrac{\hat{\mu}}{\hat{\sigma}_{B-A} / \sqrt{n}} = 2.33$

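Student’s t-test
$\hat{\mu} = \overline{B - A} = 0.214$
$\hat{\sigma}_{B-A} = 0.291$
$t = \dfrac{\hat{\mu}}{\hat{\sigma}_{B-A} / \sqrt{n}} = 2.33$
p-value = 0.02
[Figure: t distribution under the null hypothesis with the observed t marked.]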
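The three quantities on this slide follow directly from the B-A column. A minimal Python sketch (the tutorial itself uses R; `paired_t_statistic` is a hypothetical name) computes the mean difference, the sample standard deviation with n - 1 in the denominator, and the t statistic. It stops short of the p-value, which requires the t distribution with n - 1 degrees of freedom and is not in Python's standard library.

```python
from math import sqrt
from statistics import mean, stdev

# B-A differences for queries 1-10, from the t-test slide.
differences = [0.10, 0.41, -0.24, 0.0, 0.25, 0.70, 0.60, -0.02, 0.09, 0.25]


def paired_t_statistic(differences):
    """Return (mean difference, sample standard deviation, t statistic)."""
    n = len(differences)
    mu_hat = mean(differences)
    sigma_hat = stdev(differences)  # sample standard deviation (n - 1 denominator)
    t = mu_hat / (sigma_hat / sqrt(n))
    return mu_hat, sigma_hat, t


mu_hat, sigma_hat, t = paired_t_statistic(differences)
print(round(mu_hat, 3), round(sigma_hat, 3), round(t, 2))
```

Unlike the sign test, the zero difference for query 4 stays in the sample here, so n = 10.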

Randomization Test
Query   A     B     B-A
1       .25   .35   +.10
2       .43   .84   +.41
3       .39   .15   -.24
4       .75   .75   0
5       .43   .68   +.25
6       .15   .85   +.70
7       .20   .80   +.60
8       .52   .50   -.02
9       .49   .58   +.09
10      .50   .75   +.25

$\hat{\mu}_0 = \overline{B - A} = 0.214$

[Slide animation: two example permutations are overlaid on the original table. In each, the A and B scores are swapped for some of the queries, which flips the sign of B-A for those queries.]
$\hat{\mu}_1 = -0.008$
$\hat{\mu}_2 = -0.093$

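Randomization Test
$\hat{\mu}_0 = \overline{B - A} = 0.214$
p-value = 0.02
[Figure: distribution of the mean difference over permutations (x-axis: mean, from -0.3 to 0.3), with the observed mean marked.]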
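Swapping the A and B scores for a query is the same as flipping the sign of its difference, so the permutation distribution can be built from the B-A column alone. With only 10 queries there are 2^10 = 1024 sign assignments, which makes it feasible to enumerate all of them instead of sampling at random; that substitution is an assumption of this sketch, not something the slides specify. The code is Python rather than the tutorial's R, works in hundredths to keep the comparison exact, and uses a hypothetical helper name.

```python
from itertools import product

# B-A differences for queries 1-10, in hundredths (10 means +.10).
differences = [10, 41, -24, 0, 25, 70, 60, -2, 9, 25]


def randomization_p_value(differences):
    """One-sided exact p-value: share of sign assignments whose total
    difference is at least as large as the observed total."""
    observed = sum(differences)
    magnitudes = [abs(d) for d in differences]
    as_extreme = 0
    total = 0
    # Each query's A and B labels are either kept (+1) or swapped (-1).
    for signs in product((1, -1), repeat=len(magnitudes)):
        total += 1
        if sum(s * m for s, m in zip(signs, magnitudes)) >= observed:
            as_extreme += 1
    return as_extreme / total


print(round(randomization_p_value(differences), 2))
```

Comparing totals is equivalent to comparing means because every permutation has the same number of queries. Under these assumptions the result rounds to the 0.02 shown on the slide.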

Bootstrap Test
Query   A     B     B-A    s1     s2     s3
1       .25   .35   +.10   -.24   +.25   -.24
2       .43   .84   +.41   +.41   +.10   +.60
3       .39   .15   -.24   -.02   +.25   -.70
4       .75   .75   0      0      +.60   +.25
5       .43   .68   +.25   +.25   +.70   +.70
6       .15   .85   +.70   +.10   -.02   +.41
7       .20   .80   +.60   +.25   +.10   -.02
8       .52   .50   -.02   +.10   +.25   -.24
9       .49   .58   +.09   +.25   0      +.70
10      .50   .75   +.25   +.10   -.02   +.25

Bootstrap Distribution
p-value = 0.005
[Figure: distribution of the bootstrap sample means (x-axis: mean, from -0.1 to 0.5).]
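A bootstrap distribution like the one pictured can be approximated by resampling the B-A column with replacement many times and recording each resample's mean. The Python sketch below (the tutorial itself uses R) makes two assumptions that the slides do not spell out: the p-value is taken as the fraction of bootstrap means at or below zero, and 10,000 resamples are drawn. The function name, seed, and sample count are illustrative.

```python
import random

# B-A differences for queries 1-10, in hundredths (10 means +.10).
differences = [10, 41, -24, 0, 25, 70, 60, -2, 9, 25]


def bootstrap_p_value(differences, num_samples=10000, seed=0):
    """Fraction of bootstrap resamples whose mean difference is <= 0.

    Assumption: this one-sided definition is one common choice and may
    differ from the procedure used to produce the slide's figure.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = len(differences)
    at_or_below_zero = 0
    for _ in range(num_samples):
        # Draw n differences with replacement from the observed ones.
        resample = [rng.choice(differences) for _ in range(n)]
        if sum(resample) <= 0:  # same condition as mean <= 0
            at_or_below_zero += 1
    return at_or_below_zero / num_samples


print(bootstrap_p_value(differences))
```

Because the resampling is random, the estimate varies with the seed and the number of resamples, so it should be read as an approximation rather than an exact match for the 0.005 on the slide.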
