Part III Unstructured Data Data Retrieval: III.1 Unstructured data - PowerPoint PPT Presentation

Inf1-DA 2010–2011 III: 68 / 91

Part III: Unstructured Data

Data Retrieval:
  III.1 Unstructured data and data retrieval

Statistical Analysis of Data:
  III.2 Data scales and summary statistics
  III.3 Hypothesis testing and correlation
  III.4 χ² and collocations

III.4: χ² and collocations. Part III: Unstructured Data

III: 69 / 91  The χ² test

While the correlation coefficient, introduced in the previous lecture, is a useful statistical test for correlation, it is applicable only to numerical data (on both interval and ratio scales).

The χ² (chi-squared) test is a general tool for investigating correlations between categorical data.

We shall illustrate the χ² test with the following example.

Is there any correlation, in a class of students enrolled on a course, between submitting the coursework for the course and attending the course exam?

III: 70 / 91  General approach

The investigation will conform to the usual pattern of a statistical test.

The null hypothesis is that there is no relationship between coursework submission and exam attendance.

The χ² test will allow us to compute the probability p that the data we see might occur were the null hypothesis true.

Once again, if p is significantly low, we reject the null hypothesis, and we conclude that there is a relationship between coursework submission and exam attendance.

To begin, we use the data to compile a contingency table of frequency observations O_ij.

III: 71 / 91  Contingency table

O_ij    sub     ¬sub
att     O_11    O_12
¬att    O_21    O_22

• O_11 is the number of students who submitted coursework and attended the exam.
• O_12 is the number of students who did not submit coursework, but attended the exam.
• O_21 is the number of students who submitted coursework, but did not attend the exam.
• O_22 is the number of students who neither submitted coursework nor attended the exam.

III: 72 / 91  Worked example

O_ij    sub          ¬sub
att     O_11 = 94    O_12 = 20
¬att    O_21 = 2     O_22 = 15

• O_11 = 94 students submitted coursework and attended the exam.
• O_12 = 20 students did not submit coursework, but attended the exam.
• O_21 = 2 students submitted coursework, but did not attend the exam.
• O_22 = 15 students neither submitted coursework nor attended the exam.

III: 73 / 91  Idea of χ² test

The observations O_ij are the actual data frequencies.

We use these to calculate expected frequencies E_ij, i.e., the frequencies we would have expected to see were the null hypothesis true.

The χ² value is calculated by comparing each actual frequency with the corresponding expected frequency.

The larger the discrepancy between these two values, the more improbable it is that the data could have arisen were the null hypothesis true.

Thus a large discrepancy allows us to reject the null hypothesis and conclude that there is likely to be a correlation.

III: 74 / 91  Marginals

To compute the expected frequencies, we first compute the marginals R_1, R_2, B_1, B_2 of the observation table.

O_ij    sub                  ¬sub
att     O_11                 O_12                 R_1 = O_11 + O_12
¬att    O_21                 O_22                 R_2 = O_21 + O_22
        B_1 = O_11 + O_21    B_2 = O_12 + O_22    N

Here N = R_1 + R_2 = B_1 + B_2.
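The marginals can be computed mechanically from the observation table. The following is a minimal Python sketch using the worked-example counts from slide III: 72; the function name `marginals` and the list-of-rows layout are assumptions made for illustration, not part of the course material.

```python
# Observed 2x2 table from the worked example.
# Rows: attended / did not attend. Columns: submitted / did not submit.
observed = [[94, 20],
            [2, 15]]


def marginals(table):
    """Return (row totals, column totals, grand total) of a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    return row_totals, col_totals, sum(row_totals)


(R1, R2), (B1, B2), N = marginals(observed)
print(R1, R2, B1, B2, N)  # 114 17 96 35 131
```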

III: 75 / 91  Marginals explained

The marginals and N are very simple.

• B_1 is the number of students who submitted coursework.
• B_2 is the number of students who did not submit coursework.
• R_1 is the number of students who attended the exam.
• R_2 is the number of students who did not attend the exam.
• N is the total number of students registered for the course.

Given these figures, if there were no relationship between submitting coursework and attending the exam, we would expect the number of students doing both to be B_1 R_1 / N. (A fraction B_1 / N of all students submitted coursework, so we would expect the same fraction of the R_1 exam attendees to have submitted.)

III: 76 / 91  Expected frequencies

The expected frequencies E_ij are now calculated as follows.

E_ij    sub                   ¬sub
att     E_11 = B_1 R_1 / N    E_12 = B_2 R_1 / N    R_1 = E_11 + E_12
¬att    E_21 = B_1 R_2 / N    E_22 = B_2 R_2 / N    R_2 = E_21 + E_22
        B_1 = E_11 + E_21     B_2 = E_12 + E_22     N

Notice that this table has the same marginals as the original.
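As an illustration, the expected-frequency table can be built directly from the marginals. This is a sketch only: `expected_frequencies` is a hypothetical helper name, and the input is the worked-example table from slide III: 72.

```python
def expected_frequencies(observed):
    """Expected counts under the null hypothesis: E_ij = R_i * B_j / N."""
    row_totals = [sum(row) for row in observed]        # R_1, R_2
    col_totals = [sum(col) for col in zip(*observed)]  # B_1, B_2
    n = sum(row_totals)                                # N
    return [[r * b / n for b in col_totals] for r in row_totals]


observed = [[94, 20],
            [2, 15]]
expected = expected_frequencies(observed)
for row in expected:
    print([round(e, 3) for e in row])
# [83.542, 30.458]
# [12.458, 4.542]
```

Summing the rows and columns of `expected` gives back 114, 17, 96 and 35 (up to floating-point rounding), which is the "same marginals" property noted on the slide.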

III: 77 / 91  The χ² value

We can now define the χ² value by:

$$
\chi^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}
       = \frac{(O_{11} - E_{11})^2}{E_{11}} + \frac{(O_{12} - E_{12})^2}{E_{12}} + \frac{(O_{21} - E_{21})^2}{E_{21}} + \frac{(O_{22} - E_{22})^2}{E_{22}}
$$

N.B. For a 2 × 2 table it is always the case that:

$$
(O_{11} - E_{11})^2 = (O_{12} - E_{12})^2 = (O_{21} - E_{21})^2 = (O_{22} - E_{22})^2
$$

This fact is helpful in simplifying χ² calculations.

Mathematical Exercise. Why are these 4 values always equal?
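The definition translates into a few lines of Python. The sketch below uses hypothetical function names and a made-up 2 × 2 table (not course data) to evaluate the sum, and checks the N.B. numerically. It demonstrates the equality on one table; it does not answer the exercise.

```python
def expected_frequencies(observed):
    """E_ij = (row total i) * (column total j) / N."""
    rows = [sum(r) for r in observed]
    cols = [sum(c) for c in zip(*observed)]
    n = sum(rows)
    return [[r * c / n for c in cols] for r in rows]


def chi_squared(observed):
    """Sum of (O_ij - E_ij)^2 / E_ij over all cells of the table."""
    expected = expected_frequencies(observed)
    return sum((o - e) ** 2 / e
               for o_row, e_row in zip(observed, expected)
               for o, e in zip(o_row, e_row))


def squared_deviations(observed):
    """The numerators (O_ij - E_ij)^2, listed cell by cell."""
    expected = expected_frequencies(observed)
    return [(o - e) ** 2
            for o_row, e_row in zip(observed, expected)
            for o, e in zip(o_row, e_row)]


# A made-up table, chosen only so that the arithmetic is easy to follow.
table = [[30, 10],
         [20, 40]]
print(squared_deviations(table))  # four equal values for a 2x2 table
print(chi_squared(table))
```

For this table the expected counts are 20, 20, 30 and 30, so every squared deviation is 100 and the χ² value is 100/20 + 100/20 + 100/30 + 100/30 = 50/3.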

III: 78 / 91  Worked example (continued)

Marginals:

O_ij    sub    ¬sub
att     94     20      114
¬att    2      15      17
        96     35      131

Expected values:

E_ij    sub       ¬sub
att     83.542    30.458    114
¬att    12.458    4.542     17
        96        35        131

III: 79 / 91  Worked example (continued)

$$
\begin{aligned}
\chi^2 &= \frac{10.458^2}{83.542} + \frac{10.458^2}{30.458} + \frac{10.458^2}{12.458} + \frac{10.458^2}{4.542} \\
       &= \frac{109.370}{83.542} + \frac{109.370}{30.458} + \frac{109.370}{12.458} + \frac{109.370}{4.542} \\
       &= 1.309 + 3.591 + 8.779 + 24.080 \\
       &\approx 37.76
\end{aligned}
$$
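The calculation above relies on the shortcut that all four squared deviations are equal in a 2 × 2 table, so χ² is that common value multiplied by the sum of the reciprocals of the expected frequencies. A small Python sketch of the same arithmetic follows; the variable names are illustrative only.

```python
observed = [[94, 20],
            [2, 15]]

rows = [sum(r) for r in observed]        # 114, 17
cols = [sum(c) for c in zip(*observed)]  # 96, 35
n = sum(rows)                            # 131
expected = [[r * c / n for c in cols] for r in rows]

# In a 2x2 table every cell has the same squared deviation,
# so it only needs to be computed once.
d2 = (observed[0][0] - expected[0][0]) ** 2

chi2 = d2 * sum(1 / e for row in expected for e in row)
print(round(d2, 3), round(chi2, 2))  # 109.37 37.76
```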

III: 80 / 91  Critical values for χ² test

For a χ² test based on a 2 × 2 contingency table, the critical values are:

p     0.1      0.05     0.01     0.001
χ²    2.706    3.841    6.635    10.828

Interpretation of table: if the null hypothesis were true then:

• The probability of the χ² value exceeding 2.706 would be p = 0.1.
• The probability of the χ² value exceeding 3.841 would be p = 0.05.
• The probability of the χ² value exceeding 6.635 would be p = 0.01.
• The probability of the χ² value exceeding 10.828 would be p = 0.001.
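The critical values in this table can be checked without a statistics library. A χ² variable with 1 degree of freedom is the square of a standard normal variable, so the probability of exceeding a value x is erfc(√(x/2)). The sketch below uses that identity, which is standard but is not stated in the slides; `p_value_1dof` is a hypothetical helper name.

```python
import math


def p_value_1dof(chi2):
    """P(X > chi2) when X follows a chi-squared distribution with 1 degree
    of freedom, i.e. X = Z^2 for a standard normal Z."""
    return math.erfc(math.sqrt(chi2 / 2))


for critical in (2.706, 3.841, 6.635, 10.828):
    print(critical, round(p_value_1dof(critical), 4))
# 2.706 0.1
# 3.841 0.05
# 6.635 0.01
# 10.828 0.001
```

The same function applies to any χ² value from a 2 × 2 table, so a p value can be computed directly rather than bracketed between table entries.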

III: 81 / 91  Worked example (concluded)

In our worked example, we have χ² = 37.76 > 10.828.

In this case, we can reject the null hypothesis with very high confidence (p < 0.001).

In fact, since χ² = 37.76 ≫ 10.828, we have p ≪ 0.001.

We conclude that our data provides strong support for a correlation between coursework submission and exam attendance.

III: 82 / 91  χ² test: subtle points

In critical value tables for the χ² test, the entries are usually classified by degrees of freedom. For an m × n contingency table, there are (m − 1) × (n − 1) degrees of freedom.

(This can be understood as follows. Given fixed marginals, once (m − 1) × (n − 1) entries in the table are completed, the remaining m + n − 1 entries are completely determined.)

The values in the table on slide III.80 are those for 1 degree of freedom, and are thus the correct values for a 2 × 2 table.

The χ² test for a 2 × 2 table is considered unreliable when N is small (e.g. less than 40) and at least one of the four expected values is less than 5. In such situations, a modification, Yates's correction, is sometimes applied. (The details are beyond the scope of this course.)
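For readers who want to experiment, here is a hedged Python sketch of these two points. The degrees-of-freedom count follows the formula on the slide. The Yates-corrected statistic uses the commonly quoted form, in which each |O_ij − E_ij| is reduced by 0.5 before squaring. The slides do not give this formula, so treat it as an assumption. The function names and the small example table are made up for illustration.

```python
def degrees_of_freedom(m, n):
    """Degrees of freedom of an m x n contingency table."""
    return (m - 1) * (n - 1)


def chi_squared_2x2(observed, yates=False):
    """Chi-squared value of a 2x2 table, optionally with Yates's correction
    (assumed form: subtract 0.5 from each |O - E|, never going below 0)."""
    rows = [sum(r) for r in observed]
    cols = [sum(c) for c in zip(*observed)]
    n = sum(rows)
    total = 0.0
    for i, r in enumerate(rows):
        for j, c in enumerate(cols):
            e = r * c / n
            diff = abs(observed[i][j] - e)
            if yates:
                diff = max(0.0, diff - 0.5)
            total += diff ** 2 / e
    return total


print(degrees_of_freedom(2, 2))  # 1

# A made-up small sample: N = 20 and two expected values of 4.5.
small = [[3, 7],
         [6, 4]]
print(chi_squared_2x2(small), chi_squared_2x2(small, yates=True))
```

The corrected value is never larger than the uncorrected one, so the correction makes the test more conservative on small samples.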

III: 83 / 91  Application 2: finding collocations

Recall from Part II that a collocation is a sequence of words that occurs atypically often in language usage. Examples were: strong tea; run amok; make up; bitter sweet, etc.

Using the χ² test we can use corpus data to investigate whether a given n-gram is a collocation. For simplicity, we focus on bigrams. (N.B. All the examples above are bigrams.)

Given a bigram w_1 w_2, we use a corpus to investigate whether the words w_1 w_2 appear together atypically often.

Again we shall apply the χ² test. So first we need to construct the relevant contingency table.

III: 84 / 91  Contingency table for bigrams

O_ij    w_1                    ¬w_1
w_2     O_11 = f(w_1 w_2)      O_12 = f(¬w_1 w_2)
¬w_2    O_21 = f(w_1 ¬w_2)     O_22 = f(¬w_1 ¬w_2)

• f(w_1 w_2) is the frequency of w_1 w_2 in the corpus.
• f(¬w_1 w_2) is the number of bigram occurrences in the corpus in which the second word is w_2 but the first word is not w_1. (N.B. If the same bigram appears n times in the corpus then this counts as n different occurrences.)
• f(w_1 ¬w_2) is the number of bigram occurrences in the corpus in which the first word is w_1 but the second word is not w_2.
• f(¬w_1 ¬w_2) is the number of bigram occurrences in the corpus in which the first word is not w_1 and the second is not w_2.
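The four counts can be gathered in one pass over a tokenised corpus. The following Python sketch assumes the corpus is already available as a flat list of word tokens and ignores sentence boundaries. The function name `bigram_table` and the toy sentence are invented for illustration and are not taken from the CQP corpus tools.

```python
from collections import Counter


def bigram_table(tokens, w1, w2):
    """Return [[O_11, O_12], [O_21, O_22]] for the bigram (w1, w2).
    Rows: second word is w2 / is not w2. Columns: first word is w1 / is not w1."""
    bigrams = Counter(zip(tokens, tokens[1:]))  # every occurrence is counted
    total = sum(bigrams.values())
    o11 = bigrams[(w1, w2)]
    first_is_w1 = sum(c for (a, _), c in bigrams.items() if a == w1)
    second_is_w2 = sum(c for (_, b), c in bigrams.items() if b == w2)
    o12 = second_is_w2 - o11          # f(not-w1 w2)
    o21 = first_is_w1 - o11           # f(w1 not-w2)
    o22 = total - o11 - o12 - o21     # f(not-w1 not-w2)
    return [[o11, o12], [o21, o22]]


tokens = "a strong desire and a strong tea and a desire".split()
print(bigram_table(tokens, "strong", "desire"))  # [[1, 1], [1, 6]]
```

The toy text has 10 tokens and therefore 9 bigram occurrences, which is the sum of the four cells.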

III: 85 / 91  Worked example 2

Recall from note II.5 that the bigram strong desire occurred 10 times in the CQP Dickens corpus.

We shall investigate whether strong desire is a collocation.

The full contingency table is:

O_ij       strong    ¬strong
desire     10        214
¬desire    655       3407085
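The transcript stops at the contingency table. As a sketch of how the calculation would proceed, the procedure from the first worked example can be applied to these counts. The code below is an illustration under that assumption, not a reproduction of the remaining slides, and the function name is hypothetical.

```python
def chi_squared(observed):
    """Sum of (O_ij - E_ij)^2 / E_ij, with E_ij = R_i * B_j / N."""
    rows = [sum(r) for r in observed]
    cols = [sum(c) for c in zip(*observed)]
    n = sum(rows)
    total = 0.0
    for i, r in enumerate(rows):
        for j, c in enumerate(cols):
            e = r * c / n
            total += (observed[i][j] - e) ** 2 / e
    return total


# Counts for the bigram "strong desire" from the table above.
strong_desire = [[10, 214],
                 [655, 3407085]]

chi2 = chi_squared(strong_desire)
print(chi2)
print(chi2 > 10.828)  # True: far above the p = 0.001 critical value
```

The expected count for the bigram under the null hypothesis is 224 × 665 / 3407964, well below one occurrence, while 10 were observed. Note that such a small expected value is exactly the situation in which the χ² test is considered less reliable.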
