Chi square
LING572 Advanced Statistical Methods for NLP
January 23, 2020
Chi square
● An example: is having a master's degree a good feature for predicting footwear preference?
  ● A: MS (binary)
  ● B: footwear preference
● Bivariate tabular analysis:
  ● Is there a relationship between two random variables A and B in the data?
  ● How strong is the relationship?
  ● What is the direction of the relationship?
Raw frequencies

         Sandal  Sneaker  Leather shoe  Boots  Others
MS          6      17          13         9      5
no-MS      13       5           7        16      9

Feature: has a master's degree / not
Classes: {Sandal, Sneaker, …}
Two distributions

Observed distribution (O):

         Sandal  Sneaker  Leather  Boots  Others  Total
MS          6      17       13       9      5       50
no-MS      13       5        7      16      9       50
Total      19      22       20      25     14      100

Expected distribution (E):

         Sandal  Sneaker  Leather  Boots  Others  Total
MS                                                  50
no-MS                                               50
Total      19      22       20      25     14      100
Two distributions

Observed distribution (O):

         Sandal  Sneaker  Leather  Boots  Others  Total
MS          6      17       13       9      5       50
no-MS      13       5        7      16      9       50
Total      19      22       20      25     14      100

Expected distribution (E):

         Sandal  Sneaker  Leather  Boots  Others  Total
MS         9.5     11       10     12.5     7       50
no-MS      9.5     11       10     12.5     7       50
Total      19      22       20      25     14      100
Chi square
● Expected value = row total * column total / table total = P(row value) * P(column value) * table total
● $\chi^2 = \sum_{ij} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$
● χ² = (6 − 9.5)²/9.5 + (17 − 11)²/11 + … = 14.026
Calculating χ²
● Fill out a contingency table of the observed values ➔ O
● Compute the row totals and column totals
● Calculate the expected value for each cell, assuming no association ➔ E
● Compute chi square: $\chi^2 = \sum_{ij} (O_{ij} - E_{ij})^2 / E_{ij}$ (see the sketch below)
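A minimal sketch of these steps in Python, applied to the footwear example above (assumes NumPy is available; function and variable names are illustrative):

```python
import numpy as np

def chi_square(observed):
    """Compute the chi-square statistic for a contingency table."""
    O = np.asarray(observed, dtype=float)
    row_totals = O.sum(axis=1, keepdims=True)
    col_totals = O.sum(axis=0, keepdims=True)
    total = O.sum()
    # Expected count under independence: row total * column total / table total
    E = row_totals * col_totals / total
    return ((O - E) ** 2 / E).sum()

# Observed table (rows: MS, no-MS; columns: Sandal, Sneaker, Leather, Boots, Others)
O = [[6, 17, 13, 9, 5],
     [13, 5, 7, 16, 9]]
print(chi_square(O))  # ~14.026, matching the value computed above
```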
When r=2 and c=2

For a 2×2 table with $N = a+b+c+d$:

$O = \begin{pmatrix} a & b \\ c & d \end{pmatrix}, \quad E = \frac{1}{N}\begin{pmatrix} (a+b)(a+c) & (a+b)(b+d) \\ (c+d)(a+c) & (c+d)(b+d) \end{pmatrix}$

$\chi^2 = \sum_{ij} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} = \frac{(ad - bc)^2 \, N}{(a+b)(c+d)(a+c)(b+d)}$
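A quick sanity check, as a sketch: the 2×2 shortcut should agree with the general formula (the toy counts below are arbitrary):

```python
import numpy as np

# Toy 2x2 table; values are arbitrary
a, b, c, d = 10, 20, 30, 40
N = a + b + c + d

# Shortcut formula
shortcut = (a * d - b * c) ** 2 * N / ((a + b) * (c + d) * (a + c) * (b + d))

# General formula: sum over cells of (O - E)^2 / E
O = np.array([[a, b], [c, d]], dtype=float)
E = O.sum(axis=1, keepdims=True) * O.sum(axis=0, keepdims=True) / N
general = ((O - E) ** 2 / E).sum()

print(shortcut, general)  # both ~0.794: the two formulas agree
```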
χ² test
Basic idea
● Null hypothesis (the tested hypothesis): no relation exists between the two random variables.
● Calculate the probability of observing a χ² value at least this large, assuming the hypothesis is true.
● If that probability is too small, reject the hypothesis.
Requirements
● The events are assumed to be independent and have the same distribution.
● The outcomes of each event must be mutually exclusive.
● At least 5 observations per cell.
● Collect raw frequencies, not percentages.
Degrees of freedom
● df = (r − 1)(c − 1), where r = # of rows and c = # of columns
● In this example: df = (2 − 1)(5 − 1) = 4
χ² distribution table

df     0.10    0.05    0.025    0.01    0.001
1      2.706   3.841   5.024    6.635   10.828
2      4.605   5.991   7.378    9.210   13.816
3      6.251   7.815   9.348   11.345   16.266
4      7.779   9.488  11.143   13.277   18.467
5      9.236  11.070  12.833   15.086   20.515
6     10.645  12.592  14.449   16.812   22.458
…

df = 4 and 14.026 > 13.277 ➔ p < 0.01 ➔ there is a significant relation
χ² distribution (figure: density curves of the χ² distribution for several df; source)
χ² to p calculators
● http://vassarstats.net/newcs.html
● scipy.stats.chi2_contingency
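For example, running scipy.stats.chi2_contingency on the footwear table reproduces the statistic computed earlier (a sketch; requires SciPy):

```python
from scipy.stats import chi2_contingency

O = [[6, 17, 13, 9, 5],   # MS row
     [13, 5, 7, 16, 9]]   # no-MS row
stat, p, dof, expected = chi2_contingency(O)
print(stat, dof, p)  # ~14.026, df=4, p ~ 0.0072
```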
Steps of the χ² test
● Select a significance level p₀
● Calculate χ²
● Compute the degrees of freedom: df = (r − 1)(c − 1)
● Calculate p given the χ² value (or look up the critical value χ²₀ corresponding to p₀)
● If p < p₀ (or if χ² > χ²₀), reject the null hypothesis
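A sketch of these steps using scipy.stats.chi2 for the lookup, with the values from the running example:

```python
from scipy.stats import chi2

p0 = 0.01            # significance level
stat, df = 14.026, 4
p = chi2.sf(stat, df)            # P(X >= stat) under the null, ~0.0072
critical = chi2.ppf(1 - p0, df)  # critical value for p0, ~13.277
if p < p0:                       # equivalently: stat > critical
    print("reject the null hypothesis")
```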
Summary of the χ² test
● A very common method for determining whether two random variables are independent
● Many good tutorials online, e.g.:
  ● http://en.wikipedia.org/wiki/Chi-square_distribution
  ● https://www.khanacademy.org/math/ap-statistics/chi-square-tests/chi-square-tests-two-way-tables/v/chi-square-test-homogeneity
Applying to text classification
● Exercise: is 'bad' a good feature for predicting sentiment?
● Is sentiment independent of 'bad' or not?
● What are the counts in this table? Number of documents:

            bad=1   bad=0   Total
positive     13      185     198
negative    212       28     240
Total       225      213     438
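Since this table is 2×2, the shortcut formula applies; a sketch of the exercise's computation, using the counts above:

```python
# rows: positive/negative sentiment; columns: bad=1 / bad=0
a, b = 13, 185   # positive documents with / without 'bad'
c, d = 212, 28   # negative documents with / without 'bad'
N = a + b + c + d
stat = (a * d - b * c) ** 2 * N / ((a + b) * (c + d) * (a + c) * (b + d))
print(stat)  # ~290.4 >> 6.635 (df=1, p=0.01): sentiment is not independent of 'bad'
```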
Additional slides
χ² example
● Shared task evaluation: Topic Detection and Tracking (TDT)
● Sub-task: topic tracking
  ● Given a small number of exemplar documents (1-4) that define a topic
  ● Create a model that allows tracking of the topic, i.e., find all subsequent documents on this topic
● Exemplars: 1-4 newswire articles, 300-600 words each
Challenges
● Many news articles look alike
  ● Create a profile (feature representation)
  ● Find terms that are strongly associated with the current topic
● Not all documents are labeled
  ● Only a small subset belong to topics of interest
  ● Must differentiate from other topics AND the 'background'
Approach
● χ² feature selection (a sketch follows this list):
  ● Assume terms have a binary representation
  ● Positive class: term occurrences from the exemplar docs
  ● Negative class: term occurrences from other topics' exemplars and 'earlier' uncategorized docs
  ● Compute χ² for each term
  ● Retain the terms with the highest χ² scores (keep the top N)
  ● Create one feature set per topic to be tracked
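A minimal sketch of this selection step, assuming documents are represented as sets of terms (binary occurrence); all names and toy data here are illustrative, not the original system's code:

```python
def chi_square_2x2(a, b, c, d):
    """2x2 shortcut: a/b = positive docs with/without the term,
    c/d = negative docs with/without the term."""
    N = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return (a * d - b * c) ** 2 * N / denom if denom else 0.0

def select_terms(pos_docs, neg_docs, top_n):
    """Score every term by chi-square and keep the top_n highest-scoring terms."""
    vocab = set().union(*pos_docs, *neg_docs)
    n_pos, n_neg = len(pos_docs), len(neg_docs)
    scores = {}
    for term in vocab:
        a = sum(term in d for d in pos_docs)   # positive docs containing term
        c = sum(term in d for d in neg_docs)   # negative docs containing term
        scores[term] = chi_square_2x2(a, n_pos - a, c, n_neg - c)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Toy usage: topic exemplars vs. background docs
pos = [{"quake", "rescue", "epicenter"}, {"quake", "aftershock"}]
neg = [{"election", "vote"}, {"quake", "budget"}, {"vote", "tax"}]
print(select_terms(pos, neg, 3))
```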
Tracking approach
● Build a vector space model
  ● Feature weighting: tf*idf
  ● Distance measure: cosine similarity (see the sketch below)
● Select documents scoring above a threshold
● Result: improved retrieval
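A small sketch of the scoring step, with toy tf*idf weights (all values and names illustrative):

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse tf*idf vectors (dicts: term -> weight)."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Toy topic profile and incoming document (tf*idf weights are made up)
profile = {"earthquake": 2.3, "rescue": 1.1, "quake": 1.8}
doc = {"earthquake": 1.2, "rescue": 0.4, "weather": 0.9}
threshold = 0.5
if cosine(profile, doc) > threshold:
    print("track: document assigned to topic")
```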