Univariate Categorical Data



  1. Univariate Categorical Data

     MATH 185: Introduction to Computational Statistics
     University of California, San Diego
     Instructor: Ery Arias-Castro
     http://math.ucsd.edu/~eariasca/math185.html

  2. The first 2000 digits of π

     We use the pi2000 data in the package UsingR (call ?pi2000).

         > library(UsingR)
         > str(pi2000)
          num [1:2000] 3 1 4 1 5 9 2 6 5 3 ...

     Q: Though this is not the role of a statistician per se, what kind of questions would we ask of such data?

  3. Counts/Frequencies

     Say we are interested in the number of times each digit appears. We therefore summarize the data as counts in the different categories.

         > table(pi2000)
         pi2000
           0   1   2   3   4   5   6   7   8   9
         181 213 207 189 195 205 200 197 202 211

     Alternatively, we can compute frequencies (counts divided by the sample size).

         > table(pi2000)/length(pi2000)
         pi2000
              0      1      2      3      4      5      6      7      8      9
         0.0905 0.1065 0.1035 0.0945 0.0975 0.1025 0.1000 0.0985 0.1010 0.1055

  4. Barplot

     For categorical data with a few categories, a barplot is often useful.

         > barplot(table(pi2000), col = "#ffffcc")

     [Figure: barplot of the digit counts; x-axis: digits 0 to 9, y-axis: counts from 0 to 200]

  5. Pie Chart

     We can also use a pie chart.

         > pie(table(pi2000))

     [Figure: pie chart of the digit counts, one slice per digit 0 to 9]

  6. Testing for equal proportions

     The Pearson $\chi^2$ goodness-of-fit test: we observe an i.i.d. sample $\xi_1, \dots, \xi_n$ with $P(\xi_i = r_s) = p_s$ for $s = 1, \dots, t$. We want to test

     - $H_0$: $p_s = p_s^0$ for all $s = 1, \dots, t$
     - $H_1$: there is $s \in \{1, \dots, t\}$ such that $p_s \neq p_s^0$

     Writing $X_s$ for the number of observations equal to $r_s$, the Pearson $\chi^2$ goodness-of-fit test rejects when $D$ below is large:

     $$D = \sum_{s=1}^{t} \frac{(X_s - n p_s^0)^2}{n p_s^0}$$

     How large? Under the null, $D$ has approximately the $\chi^2$ distribution with $t - 1$ degrees of freedom.

  7. Testing for equal proportions

     Here, $n = 2000$, $t = 10$ (with $r_s = s - 1$, so that the categories are the digits 0 to 9) and $p_s^0 = 1/10$.

         > chisq.test(table(pi2000))

                 Chi-squared test for given probabilities

         data:  table(pi2000)
         X-squared = 4.42, df = 9, p-value = 0.8817

     The p-value is large (about 0.88), so there is not enough evidence to reject the null.
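As a check on this output, the statistic $D$ can be computed directly from the digit counts reported on the Counts/Frequencies slide. The sketch below hard-codes those counts so that it runs without the UsingR package; the variable names (`counts`, `n_cat`, `expected`, `D`, `p_value`) are illustrative choices, not taken from the slides.

```r
# Digit counts copied from the table(pi2000) output shown on the slides.
counts <- c(181, 213, 207, 189, 195, 205, 200, 197, 202, 211)
names(counts) <- 0:9

n     <- sum(counts)      # sample size (2000 digits)
n_cat <- length(counts)   # number of categories (t = 10 on the slides)

# Null hypothesis: every digit has probability 1/10.
p0       <- rep(1 / n_cat, n_cat)
expected <- n * p0

# Pearson statistic: sum over categories of (observed - expected)^2 / expected.
D <- sum((counts - expected)^2 / expected)

# Upper-tail probability under the chi-squared distribution with t - 1 df.
p_value <- pchisq(D, df = n_cat - 1, lower.tail = FALSE)

print(D)        # should agree with the X-squared value reported by chisq.test
print(p_value)  # should agree with the reported p-value
```

Calling `chisq.test(counts)` on the same vector should give the same statistic, since equal probabilities are the default null in that function.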

  8. Testing for Dependencies

     Many dependency structures are possible. Here is one example: compute the differences of successive digits and group them into $\{-9, \dots, 9\}$.

         > table(diff(pi2000))

          -9  -8  -7  -6  -5  -4  -3  -2  -1   0   1   2   3   4   5   6   7   8   9
          18  33  66  93 103 119 145 170 156 190 181 162 131 114 116  83  46  45  28

     If the sequence behaved like an i.i.d. sample from the uniform distribution on $\{0, \dots, 9\}$, the differences would have the following distribution on $\{-9, \dots, 9\}$ (the printed output is cut off on the slide):

         > p0 = c(1:9, 10, 9:1)/100
          [1] 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.10 0.09 0.08 0.07 0.06
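The triangular shape of `p0` follows from counting: under the null, each of the 100 ordered pairs of digits has probability 1/100, and a difference $d$ is produced by exactly $10 - |d|$ of them. The following sketch enumerates the pairs to confirm this; the names `digits`, `diffs`, and `p0_derived` are illustrative.

```r
# All 100 ordered pairs (a, b) of digits; entry [a, b] holds the difference b - a.
digits <- 0:9
diffs  <- outer(digits, digits, function(a, b) b - a)

# Count how many of the 100 equally likely pairs give each difference.
pair_counts <- table(as.vector(diffs))

# Probability of each difference under the i.i.d. uniform null.
p0_derived <- as.vector(pair_counts) / 100

# The closed form used on the slide.
p0 <- c(1:9, 10, 9:1) / 100

print(names(pair_counts))             # the possible differences, -9 to 9
print(isTRUE(all.equal(p0_derived, p0)))
```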

  9. Testing for Dependencies

     We therefore perform a $\chi^2$ goodness-of-fit test to check whether the observed differences are compatible with this distribution.

         > chisq.test(table(diff(pi2000)), p = p0)

                 Chi-squared test for given probabilities

         data:  table(diff(pi2000))
         X-squared = 19.4219, df = 18, p-value = 0.3663

     Again, there is not enough evidence to reject the null.
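One caveat is that successive differences share a digit, so they are not independent of each other even when the digits are. A hedged way to calibrate the statistic without relying on the $\chi^2$ approximation is to simulate it under the null. The sketch below is an addition, not part of the slides: it draws i.i.d. uniform digits, recomputes the statistic each time, and compares the result with the observed value 19.4219 reported above. The function name `stat_from_digits`, the number of replicates `B`, and the seed are arbitrary choices.

```r
set.seed(1)  # arbitrary seed, for reproducibility only

p0 <- c(1:9, 10, 9:1) / 100   # null distribution of a difference
n  <- 2000                    # sequence length, as for pi2000

# Pearson statistic for the successive differences of a digit sequence x.
stat_from_digits <- function(x) {
  d        <- factor(diff(x), levels = -9:9)   # keep empty categories
  obs      <- as.vector(table(d))
  expected <- (length(x) - 1) * p0
  sum((obs - expected)^2 / expected)
}

# Null distribution of the statistic, estimated from B simulated sequences.
B   <- 2000
sim <- replicate(B, stat_from_digits(sample(0:9, n, replace = TRUE)))

# Monte Carlo p-value for the observed statistic reported on the slide.
observed <- 19.4219
mc_p     <- (1 + sum(sim >= observed)) / (B + 1)
print(mc_p)
```

If the simulated p-value is close to the one from `chisq.test`, the dependence between successive differences matters little here; if it differs noticeably, the simulated value is the safer one to report.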

  10. Testing for Dependencies

     - A more detail-oriented method computes the number of transitions from digit s to digit t.
     - If the sequence behaved like an i.i.d. sample from the uniform distribution on $\{0, \dots, 9\}$, all transitions would be equally likely.
     - However, there are many (100) such transitions.
     - Many other approaches exist under the name of Tests of Randomness, for example tests based on runs.
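A transition-count test along these lines might look as follows. This is a sketch under stated assumptions rather than the method used in the course: it uses simulated digits as a stand-in for `pi2000` so that it runs without UsingR, and it counts non-overlapping pairs (digits 1 and 2, digits 3 and 4, and so on) so that the pairs are independent under the null, which keeps the $\chi^2$ approximation with 99 degrees of freedom applicable. The names `x`, `idx`, `from`, `to`, and `trans` are illustrative.

```r
set.seed(185)  # arbitrary seed, for reproducibility only

# Stand-in data: 2000 i.i.d. uniform digits (replace with pi2000 if UsingR is available).
x <- sample(0:9, 2000, replace = TRUE)

# Non-overlapping pairs: (x[1], x[2]), (x[3], x[4]), ...
idx  <- seq(1, length(x) - 1, by = 2)
from <- factor(x[idx],     levels = 0:9)
to   <- factor(x[idx + 1], levels = 0:9)

# 10 x 10 table of transition counts (rows: first digit, columns: second digit).
trans <- table(from, to)

# Goodness-of-fit test against equal probability 1/100 for each of the 100 cells.
res <- chisq.test(as.vector(trans), p = rep(1 / 100, 100))

print(dim(trans))
print(res$parameter)  # degrees of freedom: 100 - 1
print(res$p.value)
```

With 1000 pairs spread over 100 cells, the expected count per cell is 10, which illustrates the concern raised above: many categories leave few observations per category.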
