Statistical Issues Associated With Multi-way Contingency Tables & Links to Algebraic Geometry Stephen E. Fienberg Cylab, Department of Statistics, & Machine Learning Department Carnegie Mellon University & IMA (Joint work with A. Dobra, A. Rinaldo, & Y. Zhou) 1
Preliminaries • I am an “A” at IMA for Applications of Algebraic Geometry. • This talk: – Continuation from last week’s seminar by Serkan Hosten. • I won’t provide a notational translation table but I will overlap and give links. – Introduction to a number of statistical problems for the analysis of categorical data. 2
Overview Three data examples and two statistical problems: 1. Bounds for cell counts in contingency tables given marginals. 2. Maximum likelihood estimation for log-linear models and large sparse contingency tables. How are they interrelated? Where do algebraic and other geometry tools fit in? Scaling up computations to deal with large sparse tables. 3
Ex. 1: Risk Factors for Coronary Heart Disease • 1841 Czech auto workers: Edwards and Havránek (1985) Biometrika • Selection of 6 binary variables • 2^6 table – “0” cell – population unique, “1” – 2 cells with “2” [Figure: graph on the six variables a Smoke (Y/N), b Mental work, c Phys. work, d Syst. BP, e Lipo ratio, f Anamnesis] 4
Ex. 1: The Data
                          B:    no           yes
F     E      D       C    A:    no   yes     no   yes
neg   < 3    < 140   no         44    40    112    67
                     yes       129   145     12    23
             ≥ 140   no         35    12     80    33
                     yes       109    67      7     9
      ≥ 3    < 140   no         23    32     70    66
                     yes        50    80      7    13
             ≥ 140   no         24    25     73    57
                     yes        51    63      7    16
pos   < 3    < 140   no          5     7     21     9
                     yes         9    17      1     4
             ≥ 140   no          4     3     11     8
                     yes        14    17      5     2
      ≥ 3    < 140   no          7     3     14    14
                     yes         9    16      2     3
             ≥ 140   no          4     0     13    11
                     yes         5    14      4     4
5
R-U Confidentiality Map [Figure: disclosure risk plotted against data utility, with labels Original Data, Maximum Tolerable Risk, Released Data, and No Data] (Duncan, et al. 2004)
Disclosure Limitation for Sparse Count Data • Uniqueness in population table ⇔ cell count of “1”: – Uniqueness allows an intruder to match characteristics in the table with other databases that include the same variables to learn confidential information. • Utility typically tied to usefulness of marginal totals for statistical inference. • Risk concerned with small cell counts. – Assess using bounds for cell counts given marginal totals. 7
Marginals as Data Releases • Simple summaries corresponding to subsets of variables. • Traditional mode of reporting for statistical agencies and others. • Useful in statistical modeling: Role of log-linear models. • National Institute of Statistical Sciences Project and some of my former students have dealt with other models and other types of releases. 8
Ex. 2: Genetics Linkage • Data come from a barley mildew experiment. – Edwards (1992). Comp. Stat. Data Anal. – 37 binary variables (genes) and 81 cases (5% missing data). • Subset of 6 genes that appear closely linked on basis of marginal distributions? • On same chromosome? 9
Ex. 2: The Data 10
Ex. 3: Australian Census Data • 10-dimensional highly sparse contingency table extracted from 1981 Australian population census (based on 10 million people):
Variable   BPL  SEX  AGE  REL  MST  DUR  QAL  INC  FIN  TIS
# Categ.   102    2   11   27    5   62   11   15   16   18
• 892,533,945,600 cells! 11
Collapsed Tables • Collapsed 5-way table with 105,600 cells of which 65% are zero
Variable   BPL  MST  QAL  INC  FIN
# Categ.     8    5   11   15   16
• Collapsed 6-way table with 48,000 cells of which 41% are zero
Variable   BPL  SEX  AGE  REL  MST  QAL
# Categ.     8    2   11    5    5   11
12
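The cell counts quoted for the census tables are products of the category counts. A minimal sketch that recomputes them from the numbers on the slides (the dictionary and function names are illustrative, not from the talk). For the 6-way table the listed counts multiply to 48,400, so the 48,000 on the slide looks like a rounded figure.

```python
from math import prod

# Category counts as listed on the slides.
full = {"BPL": 102, "SEX": 2, "AGE": 11, "REL": 27, "MST": 5,
        "DUR": 62, "QAL": 11, "INC": 15, "FIN": 16, "TIS": 18}
five_way = {"BPL": 8, "MST": 5, "QAL": 11, "INC": 15, "FIN": 16}
six_way = {"BPL": 8, "SEX": 2, "AGE": 11, "REL": 5, "MST": 5, "QAL": 11}


def n_cells(categories):
    """Number of cells in the table = product of the category counts."""
    return prod(categories.values())


print(n_cells(full))      # 892533945600
print(n_cells(five_way))  # 105600
print(n_cells(six_way))   # 48400
```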
Two Faces of Algebraic Statistics & Contingency Tables 1. Representation of statistical models for cell probabilities: Description of parameter space. A. Characterizing joint distributions. B. Log-linear models including those with “graphical representation” via conditional independencies. 2. Statistical inference: Studying and characterizing portions of sample space: A. Minimal sufficient statistics (sufficient data summaries) for models: marginal totals. B. Maximum likelihood estimation. C. Distribution over all possible tables having given marginals (“exact distribution”) and related bounds. 13
It’s All About Geometry • Polyhedral Geometry: virtually all data-related quantities can be described by polyhedra. [Figures: Polytope; Polyhedral Cone] • Algebraic Geometry: a statistical model is specified by a polynomial map. The set of probability distributions is a hyper-surface of points satisfying polynomial equations. [Figure: Algebraic (Toric) Variety] 14
2 × 2 Table: The Model • We are interested in the distribution of the 4 cells in the table, specified by the vector of log probabilities:
p11   p12  | p1+
p21   p22  | p2+
p+1   p+2  | 1
• Model of independence: $p_{ij} = p_{i+} p_{+j}$, that is, $\log(p_{11}, p_{12}, p_{21}, p_{22})^\top = A^\top \theta$ with $\theta = \log(p_{1+}, p_{2+}, p_{+1}, p_{+2})^\top$. • The set of all probability distributions for the model of independence must satisfy one polynomial equation, $p_{11} p_{22} - p_{12} p_{21} = 0$ (Segre variety), and belong to the surface of independence. 15
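2 × 2 Table: The Data • Model of independence: $p_{ij} = p_{i+} p_{+j}$. • Observed counts $n = (n_{11}, n_{12}, n_{21}, n_{22})^\top$; the MSS are the margins $t = An$, with design matrix
$$A = \begin{pmatrix} 1 & 1 & 0 & 0 \\ 0 & 0 & 1 & 1 \\ 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \end{pmatrix}, \qquad t = \begin{pmatrix} t_1 \\ t_2 \\ t_3 \\ t_4 \end{pmatrix} = \begin{pmatrix} n_{1+} \\ n_{2+} \\ n_{+1} \\ n_{+2} \end{pmatrix}.$$
• The set of all tables having margins t are the integer points inside the polytope $\{x \in \mathbb{R}^4_{\ge 0} : Ax = t\}$ and form the fiber. 16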
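To make the fiber concrete, here is a minimal brute-force sketch for the 2 × 2 case. The observed counts are hypothetical, and the function names `margins` and `fiber` are illustrative only; exhaustive enumeration is practical only for very small tables.

```python
from itertools import product

# Design matrix for the 2 x 2 independence model. Columns are the cells
# ordered (11, 12, 21, 22); rows are the margins (1+, 2+, +1, +2).
A = [
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
]


def margins(n):
    """Return t = A n for a table n = (n11, n12, n21, n22)."""
    return tuple(sum(a * x for a, x in zip(row, n)) for row in A)


def fiber(t):
    """List every nonnegative integer table x with A x = t (brute force)."""
    total = t[0] + t[1]  # grand total n = n1+ + n2+
    return [x for x in product(range(total + 1), repeat=4) if margins(x) == t]


n = (3, 1, 2, 4)   # hypothetical observed counts
t = margins(n)     # (4, 6, 5, 5)
tables = fiber(t)
for x in tables:
    print(x)
```

For these margins the fiber has five tables, one for each admissible value of the (1,1) cell.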
Design Matrix A [Diagram labels: Sample Space, MLE, Parameter Space] • Sample space: A identifies the fiber, the set of all tables having the same margins: $\{x \ge 0,\ Ax = t\}$. – Leads to the generalized hypergeometric probability distribution. – [Set of all tables are lattice points in the simplex.] • Parameter space: A specifies the set of polynomial equations that encode the dependence among the variables. – All probability vectors satisfy binomial equations: $p^{u^+} - p^{u^-} = 0$ for all integer $u \in \mathrm{kernel}(A)$. 17
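For the 2 × 2 independence model the kernel of A is spanned by u = (1, −1, −1, 1), and the corresponding binomial is $p_{11}p_{22} - p_{12}p_{21}$. A small sketch checking both facts numerically; the marginal probabilities are made-up values and the helper names are illustrative.

```python
# Design matrix for the 2 x 2 independence model, cells ordered (11, 12, 21, 22).
A = [
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
]

u = (1, -1, -1, 1)  # integer vector expected to lie in kernel(A)


def mat_vec(A, v):
    """Plain matrix-vector product."""
    return tuple(sum(a * x for a, x in zip(row, v)) for row in A)


def binomial(p, u):
    """Evaluate p^{u+} - p^{u-}, splitting u into positive and negative parts."""
    pos, neg = 1.0, 1.0
    for p_i, u_i in zip(p, u):
        if u_i > 0:
            pos *= p_i ** u_i
        elif u_i < 0:
            neg *= p_i ** (-u_i)
    return pos - neg


# Hypothetical marginal probabilities; the joint p follows the independence model.
row_p = (0.3, 0.7)
col_p = (0.6, 0.4)
p = tuple(r * c for r in row_p for c in col_p)  # (p11, p12, p21, p22)

print(mat_vec(A, u))   # all zeros, so u is in the kernel
print(binomial(p, u))  # zero up to floating-point rounding
```

The same vector u is also a move on the sample-space side: adding it to a table changes the cells but leaves all four margins unchanged.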
Maximum Likelihood Estimation • Distribution for n given p: $f(n \mid p) \propto p_{11}^{n_{11}} p_{12}^{n_{12}} p_{21}^{n_{21}} p_{22}^{n_{22}}$. • For model of independence, $p_{ij} = p_{i+} p_{+j}$, minimal sufficient statistics for parameters are: – $t = An = (n_{1+}, n_{2+}, n_{+1}, n_{+2})$. • Maximum likelihood equations: – $\hat{p}_{i+} = n_{i+}/n$, $i = 1, 2$; $\hat{p}_{+j} = n_{+j}/n$, $j = 1, 2$. • Solution (MLEs): $\hat{p}_{ij} = n_{i+} n_{+j} / n^2$. • Rescale by total n to count scale, $n \hat{p}_{ij} = \hat{m}_{ij}$: $\hat{m}_{ij} = n_{i+} n_{+j} / n$. 18
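The closed-form fitted counts under independence can be computed directly from the margins. A minimal sketch with hypothetical counts (the function name `independence_mle` is illustrative):

```python
def independence_mle(n):
    """Fitted counts m_hat[i][j] = n_{i+} * n_{+j} / n for a two-way table."""
    total = sum(map(sum, n))
    row = [sum(r) for r in n]        # n_{i+}
    col = [sum(c) for c in zip(*n)]  # n_{+j}
    return [[r * c / total for c in col] for r in row]


n = [[3, 1],
     [2, 4]]                    # hypothetical 2 x 2 counts
m_hat = independence_mle(n)
print(m_hat)                    # [[2.0, 2.0], [3.0, 3.0]]
```

The fitted table reproduces the observed row and column totals, which is the statement that the MSSs are set equal to their expectations.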
Two-Way Fréchet Bounds • For 2 × 2 tables of counts $\{n_{ij}\}$ given the marginal totals $\{n_{1+}, n_{2+}\}$ and $\{n_{+1}, n_{+2}\}$:
n11   n12  | n1+
n21   n22  | n2+
n+1   n+2  | n
$$\min(n_{i+}, n_{+j}) \ge n_{ij} \ge \max(n_{i+} + n_{+j} - n,\ 0)$$
• Link to independence: $\hat{m}_{ij} = n_{i+} n_{+j} / n$. • Interested in multi-way generalizations involving higher-order, overlapping margins. 19
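The two-way bounds are simple to evaluate cell by cell. A minimal sketch with hypothetical margins (the function name `frechet_bounds` is illustrative):

```python
def frechet_bounds(row_totals, col_totals):
    """Return (lower, upper) Frechet bounds for every cell of a two-way table."""
    n = sum(row_totals)
    if n != sum(col_totals):
        raise ValueError("row and column totals must have the same grand total")
    return [[(max(r + c - n, 0), min(r, c)) for c in col_totals]
            for r in row_totals]


bounds = frechet_bounds([4, 6], [5, 5])  # hypothetical margins, n = 10
print(bounds)  # [[(0, 4), (0, 4)], [(1, 5), (1, 5)]]
```

In the disclosure-limitation setting, a narrow interval around a small count is what signals risk: here the second-row cells are known to be at least 1 from the margins alone.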
Log-linear Models for $2^3$ Tables • In 3-way table of counts, $\{n_{ijk}\}$, we model logarithms of expectations, $E(n_{ijk}) = m_{ijk} > 0$:
$$\log(m_{ijk}) = u + u_{1(i)} + u_{2(j)} + u_{3(k)} + u_{12(ij)} + u_{13(ik)} + u_{23(jk)}$$
• MSSs are margins corresponding to highest order u-terms: $\{n_{ij+}\}$, $\{n_{i+k}\}$, $\{n_{+jk}\}$. – MSSs describe simplicial complex: [12][13][23]. • Alternative ways to write model:
$$m_{ijk} = \alpha_{ij}\,\beta_{ik}\,\gamma_{jk}$$
$$\frac{m_{111}\, m_{221}}{m_{121}\, m_{211}} = \frac{m_{112}\, m_{222}}{m_{122}\, m_{212}}$$
$$m_{111}\, m_{221}\, m_{122}\, m_{212} - m_{121}\, m_{211}\, m_{112}\, m_{222} = 0$$
20
Log-linear Models (cont.) • Maximum likelihood estimates (MLEs) found by setting MSSs equal to their expectations: $\hat{m}_{ij+} = n_{ij+}$ for $i = 1, 2$, $j = 1, 2$; $\hat{m}_{+jk} = n_{+jk}$ for $j = 1, 2$, $k = 1, 2$; $\hat{m}_{i+k} = n_{i+k}$ for $i = 1, 2$, $k = 1, 2$. • Set: $m_{ijk} = n_{ijk} \pm \delta$. • Solve cubic equation for δ: $m_{111}\, m_{221}\, m_{122}\, m_{212} - m_{121}\, m_{211}\, m_{112}\, m_{222} = 0$. • When do we get positive solutions for $\{m_{ijk}\}$? 21
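As a numerical companion to the likelihood equations above, the sketch below fits the [12][13][23] model by iterative proportional fitting (IPF). IPF is a standard alternative to solving the cubic in δ directly and is not necessarily the method used in the talk; the counts are hypothetical, strictly positive, and indexed from zero.

```python
from itertools import product

CELLS = list(product((0, 1), repeat=3))  # cells (i, j, k), zero-based


def margin(m, axes):
    """Sum the table m over every index not listed in `axes`."""
    out = {}
    for cell, value in m.items():
        key = tuple(cell[a] for a in axes)
        out[key] = out.get(key, 0.0) + value
    return out


def ipf(n, n_iter=200):
    """Iterative proportional fitting for the model [12][13][23]."""
    m = {cell: 1.0 for cell in CELLS}          # start from a constant table
    for _ in range(n_iter):
        for axes in ((0, 1), (0, 2), (1, 2)):  # cycle through the MSS margins
            target = margin(n, axes)
            current = margin(m, axes)
            for cell in CELLS:
                key = tuple(cell[a] for a in axes)
                m[cell] *= target[key] / current[key]
    return m


# Hypothetical strictly positive counts, in the order of CELLS.
counts = [10, 4, 6, 8, 5, 9, 7, 11]
n = dict(zip(CELLS, counts))
m_hat = ipf(n)
for cell in CELLS:
    print(cell, round(m_hat[cell], 4))
```

Each scaling step multiplies the table by a function of two indices only, so the fitted values keep the no-three-factor-interaction form while the two-way margins converge to the observed ones.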
Existence of MLEs for 2 × 2 × 2 Table
Layer k = 1:                         Layer k = 2:
0 + δ       n121 − δ  | n1+1         n112 − δ    n122 + δ  | n1+2
n211 − δ    n221 + δ  | n2+1         n212 + δ    0 − δ     | n2+2
n+11        n+21      | n++1         n+12        n+22      | n++2
[12] margin:
n11+   n12+
n21+   n22+
The fitted values for the two zero cells are $0 + \delta$ and $0 - \delta$, and both must be nonnegative, so delta must be zero and the MLE doesn’t exist. 22
Two Other 3-Way Examples With [12][13][23] • $3^3$ table where MLE exists • $4^3$ table where MLE does not exist 23
MLEs for Log-Linear Models for k -Way Tables • Log-linear models and algebraic geometry representations generalize. • Sampling distributions for f ( n | p ) are key! – ML equations then have similar form. • Existence of MLEs linked to pattern of zeros: – Discoverable by defining basis for models and using algebraic and polyhedral geometry. – Examples discovered using Polymake . • General theorem in Haberman (1974) and “constructive” version in Rinaldo (2005). 24
Graphical & Decomposable Log-linear Models • Graphical log-linear models: defined by simultaneous conditional independence relationships: – Absence of edges in graph. • Decomposable models correspond to triangulated graphs. • Graph has 3 cliques: [ADE][ABCE][BF] • “Interesting” decomposable log-linear model for data! [Figure: Ex. 1, Czech autoworkers; graph on a Smoke (Y/N), b Mental work, c Phys. work, d Syst. BP, e Lipo ratio, f Anamnesis] 25
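One way to see that the three cliques give a decomposable model is to order them so that each clique meets the earlier ones in a set contained in a single earlier clique (the running intersection property). The sketch below checks this for a given ordering; the function name and the ordering [ABCE], [ADE], [BF] are choices made here for illustration.

```python
def running_intersection(cliques):
    """Check the running intersection property for an ordered list of cliques.

    Returns the separators S_k = C_k & (C_1 | ... | C_{k-1}) if the property
    holds for this ordering, otherwise None.
    """
    seen = set(cliques[0])
    separators = []
    for k in range(1, len(cliques)):
        sep = set(cliques[k]) & seen
        # The separator must be contained in one single earlier clique.
        if not any(sep <= set(c) for c in cliques[:k]):
            return None
        separators.append(sep)
        seen |= set(cliques[k])
    return separators


seps = running_intersection(["ABCE", "ADE", "BF"])
print(seps)  # separators {A, E} and {B}

# The no-three-factor-interaction model [12][13][23] fails the check.
print(running_intersection(["AB", "AC", "BC"]))  # None
```

For a decomposable model the fitted counts have a closed form, the product of the clique margins divided by the product of the separator margins; with the separators found here that is $\hat{m} = n_{ABCE}\, n_{ADE}\, n_{BF} / (n_{AE}\, n_{B})$.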