Preserving Confidentiality AND Providing Adequate Data for Statistical Modeling

Stephen E. Fienberg
Department of Statistics
Center for Automated Learning and Discovery
Center for Computer and Communications Security
Carnegie Mellon University
Pittsburgh, PA, U.S.A.

Overview
• Background and some fundamental abstractions for disclosure limitation.
  – Statistical users want more than to retrieve a few numbers.
• Results on bounds for table entries.
• Uses of Markov bases for exact distributions and perturbation of tables.
• Links to log-linear models, and related statistical theory and methods.

R-U Confidentiality Map
[Figure: R-U confidentiality map] (Duncan, et al. 2001)

NISS Prototype Query System
• For k-way table of counts.
• Queries: Requests for marginal tables.
• Responses: Yes, release; No; (and perhaps “Simulate” and then release).
• As released margins cumulate we have increased information about table entries.
• Margins need to be consistent ==> possible simulated releases get highly constrained.
Fundamental Abstractions
• Query space, Q, with partial ordering:
  – Elements can be marginal tables, conditionals, k-groupings, regressions, or other data summaries.
  – Released set: R(t), and implied Unreleasable set: U(t).
  – Releasable frontier: maximal elements of R(t).
  – Unreleasable frontier: minimal elements of U(t).
• Risk and Utility defined on subsets of Q.
  – Risk Measure: identifiability of small cell counts.
  – Utility: reconstructing table using log-linear models.
  – Release rules must balance risk and utility:
    • R-U Confidentiality map.
    • General Bayesian decision-theoretic approach.

Confidentiality Concern
• Uniqueness in population table ⇔ cell count of “1”.
• Uniqueness allows intruder to match characteristics in table with other data bases that include the same variables plus others to learn confidential information.
  – Assuming data are reported without error!
• Identity versus attribute disclosure.

Example 1: 2000 Census
• U.S. decennial census “long form”
  – 1 in 6 sample of households nationwide.
  – 53 questions, many with multiple categories.
  – Data measured with substantial error!
  – Data reported after application of data swapping!
• Geography
  – 50 states; 3,000 counties; 4 million “blocks”.
  – Release of detailed geography yields uniqueness in sample and at some level in population.
• American Factfinder releases various 3-way tables at different levels of geography.

Why Marginals?
• Simple summaries corresponding to subsets of variables.
• Traditional mode of reporting for statistical agencies and others.
• Useful in statistical modeling: Role of log-linear models.
• Collapsing categories of categorical variables uses similar DL methods and statistical theory.
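As a minimal sketch of the frontier idea in Fundamental Abstractions, assume the query space is restricted to marginal tables, each identified by its set of variables and ordered by set inclusion (a margin on a subset of variables can be computed from any margin on a superset). The function name `releasable_frontier` and the example released set are hypothetical, chosen for illustration only; the slides do not specify an implementation.

```python
def releasable_frontier(released):
    """Return the maximal elements of a released set of marginal tables.

    Each margin is given as an iterable of variable labels, e.g. "ADE".
    Assumption: margins are partially ordered by inclusion of variable sets.
    """
    margins = {frozenset(m) for m in released}
    # A margin is maximal if no other released margin strictly contains it.
    return {m for m in margins if not any(m < other for other in margins)}


# Hypothetical released set: three margins plus some margins implied by them.
released = ["ADE", "AE", "ABCE", "B", "BF", "A"]
frontier = releasable_frontier(released)
print(sorted("".join(sorted(m)) for m in frontier))
```

Under this ordering the frontier is a compact description of R(t): every other released margin can be derived from some frontier element.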
Example 2: Risk Factors for Coronary Heart Disease
• 1841 Czech auto workers
  – Edwards and Havanek (1985)
• $2^6$ table
• population data
  – “0” cell
  – population unique, “1”
  – 2 cells with “2”

[Figure: graph on the six variables: a = Smoke (Y/N), b = Mental work, c = Phys. work, d = Syst. BP, e = Lipo ratio, f = Anamnesis]

Example 2: The Data

                        B:   no         yes
  F    E     D      C   A:   no   yes   no   yes
  neg  < 3   < 140  no       44   40    112  67
                    yes      129  145   12   23
             ≥ 140  no       35   12    80   33
                    yes      109  67    7    9
       ≥ 3   < 140  no       23   32    70   66
                    yes      50   80    7    13
             ≥ 140  no       24   25    73   57
                    yes      51   63    7    16
  pos  < 3   < 140  no       5    7     21   9
                    yes      9    17    1    4
             ≥ 140  no       4    3     11   8
                    yes      14   17    5    2
       ≥ 3   < 140  no       7    3     14   14
                    yes      9    16    2    3
             ≥ 140  no       4    0     13   11
                    yes      5    14    4    4

Example 3: NLTCS
• National Long Term Care Survey
  – 20-40 demographic/background items.
  – 30-50 items on disability status, ADLs and IADLs, most binary but some polytomous.
  – Linked Medicare files.
  – 5 waves: 1982, 1984, 1989, 1994, 1999.
• We’ve been working with $2^{16}$ table, collapsed across several waves of survey, with n = 21,574.

Erosheva (2002)
Dobra, Erosheva, & Fienberg (2003)
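The small-cell counts stated for Example 2 can be checked directly against the table in Example 2: The Data. The sketch below copies the 64 counts from that slide; the array layout and the variable names (`counts`, `small_cells`) are assumptions made for this illustration.

```python
import numpy as np

# Counts copied from "Example 2: The Data".
# Rows run over (F, E, D, C) with C changing fastest; columns are
# (B=no, A=no), (B=no, A=yes), (B=yes, A=no), (B=yes, A=yes).
counts = np.array([
    [44, 40, 112, 67], [129, 145, 12, 23],
    [35, 12, 80, 33], [109, 67, 7, 9],
    [23, 32, 70, 66], [50, 80, 7, 13],
    [24, 25, 73, 57], [51, 63, 7, 16],
    [5, 7, 21, 9], [9, 17, 1, 4],
    [4, 3, 11, 8], [14, 17, 5, 2],
    [7, 3, 14, 14], [9, 16, 2, 3],
    [4, 0, 13, 11], [5, 14, 4, 4],
])

n_total = int(counts.sum())
# Number of cells holding each of the small values 0, 1 and 2.
small_cells = {v: int((counts == v).sum()) for v in (0, 1, 2)}
print(n_total, small_cells)
```

The total should match the 1841 workers, and the tally should show one empty cell, one population unique, and two cells with a count of 2.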
Two-Way Fréchet Bounds
• For 2 × 2 tables of counts $\{n_{ij}\}$ given the marginal totals $\{n_{1+}, n_{2+}\}$ and $\{n_{+1}, n_{+2}\}$:

$$\begin{array}{cc|c} n_{11} & n_{12} & n_{1+} \\ n_{21} & n_{22} & n_{2+} \\ \hline n_{+1} & n_{+2} & n \end{array}$$

$$\min(n_{i+}, n_{+j}) \ge n_{ij} \ge \max(n_{i+} + n_{+j} - n,\ 0)$$

• Interested in multi-way generalizations involving higher-order, overlapping margins.

Bounds for Multi-Way Tables
• k-way table of non-negative counts, k ≥ 3.
  – Release set of marginal totals, possibly overlapping.
  – Goal: Compute bounds for cell entries.
  – LP and IP approaches are NP-hard.
• Our strategy has been to:
  – Develop efficient methods for several special cases.
  – Exploit linkage to statistical theory where possible.
  – Use general, less efficient methods for residual cases.
• Direct generalizations to tables with non-integer, non-negative entries.

Role of Log-linear Models?
• For 2 × 2 case, lower bound is evocative of MLE for estimated expected value under independence:

$$\hat{m}_{ij} = n_{i+}\, n_{+j} / n.$$

  – Bounds correspond to log-linearized version.
  – Margins are minimal sufficient statistics (MSS).
• In 3-way table of counts, $\{n_{ijk}\}$, we model logs of expectations $\{E(n_{ijk}) = m_{ijk}\}$:

$$\log(m_{ijk}) = u + u_{1(i)} + u_{2(j)} + u_{3(k)} + u_{12(ij)} + u_{13(ik)} + u_{23(jk)}$$

• MSS are margins corresponding to highest order terms: $\{n_{ij+}\}$, $\{n_{i+k}\}$, $\{n_{+jk}\}$.

Graphical & Decomposable Log-linear Models
• Graphical models: defined by simultaneous conditional independence relationships
  – Absence of edges in graph.
• Example 2: Czech autoworkers
  – Graph has 3 cliques: [ADE][ABCE][BF]
• Decomposable models correspond to triangulated graphs.

[Figure: graph for Example 2: a = Smoke (Y/N), b = Mental work, c = Phys. work, d = Syst. BP, e = Lipo ratio, f = Anamnesis]
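The two-way Fréchet bounds and the independence MLE from the slides above can be computed in a few lines. This is a minimal sketch: the function names `frechet_bounds` and `independence_mle` and the example margins are hypothetical, and the code assumes both sets of margins sum to the same total n.

```python
import numpy as np


def frechet_bounds(row_totals, col_totals):
    """Fréchet bounds for every cell of a two-way table with fixed margins."""
    r = np.asarray(row_totals)[:, None]   # n_{i+} as a column vector
    c = np.asarray(col_totals)[None, :]   # n_{+j} as a row vector
    n = r.sum()
    if n != c.sum():
        raise ValueError("row and column totals must have the same sum")
    upper = np.minimum(r, c)              # min(n_{i+}, n_{+j})
    lower = np.maximum(r + c - n, 0)      # max(n_{i+} + n_{+j} - n, 0)
    return lower, upper


def independence_mle(row_totals, col_totals):
    """Estimated expected counts under independence: n_{i+} n_{+j} / n."""
    r = np.asarray(row_totals, dtype=float)[:, None]
    c = np.asarray(col_totals, dtype=float)[None, :]
    return r * c / r.sum()


# Hypothetical 2 x 2 margins, for illustration only.
lower, upper = frechet_bounds([12, 8], [15, 5])
m_hat = independence_mle([12, 8], [15, 5])
print(lower)
print(upper)
print(m_hat)
```

The lower bound adds the two margins and subtracts n, while the independence MLE multiplies the margins and divides by n; that is the sense in which the bound corresponds to a log-linearized version.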
MLEs for Decomposable Log-linear Models
• For decomposable models, expected cell values are explicit function of margins, corresponding to MSSs (cliques in graph):
  – For conditional independence in 3-way table:

$$\log m_{ijk} = u + u_{1(i)} + u_{2(j)} + u_{3(k)} + u_{12(ij)} + u_{13(ik)}$$

$$m_{ijk} = \frac{m_{ij+}\, m_{i+k}}{m_{i++}}$$

• Substitute observed margins for expected in explicit formula to get MLEs.

Multi-way Bounds
• For decomposable log-linear models:

$$\text{Expected Value} = \frac{\prod \text{MSSs}}{\prod \text{Separators}}$$

• Theorem: When released margins correspond to those of a decomposable model:
  – Upper bound: minimum of relevant margins.
  – Lower bound: maximum of zero, or sum of relevant margins minus separators.
  – Bounds are sharp.

Fienberg and Dobra (2000)

Ex. 2: Czech Autoworkers
• Suppose released margins are [ADE][ABCE][BF]:
  – Correspond to decomposable graph.
  – Cell containing population unique has bounds [0, 25].
  – Cells with entry of “2” have bounds: [0, 20] and [0, 38].
  – Lower bounds are all “0”.
• “Safe” to release these margins; low risk of disclosure.

[Figure: graph for Example 2: a = Smoke (Y/N), b = Mental work, c = Phys. work, d = Syst. BP, e = Lipo ratio, f = Anamnesis]

Multi-Way Bounds (cont.)
• Example: Given margins in k-way table that correspond to (k-1)-fold conditional independence given variable 1:

$$\{n_{i_1 i_2 + \cdots +}\},\ \{n_{i_1 + i_3 + \cdots +}\},\ \ldots,\ \{n_{i_1 + \cdots + i_k}\}$$

• Then bounds are

$$\min\{n_{i_1 i_2 + \cdots +},\ n_{i_1 + i_3 + \cdots +},\ \ldots,\ n_{i_1 + \cdots + i_k}\} \;\ge\; n_{i_1 i_2 \cdots i_k}$$

$$n_{i_1 i_2 \cdots i_k} \;\ge\; \max\{n_{i_1 i_2 + \cdots +} + n_{i_1 + i_3 + \cdots +} + \cdots + n_{i_1 + \cdots + i_k} - n_{i_1 + \cdots +}(k-2),\ 0\}$$
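The bounds quoted for the Czech autoworkers can be reproduced from the data in Example 2: The Data by applying the theorem's formulas to the margins [ADE][ABCE][BF]. The sketch below is illustrative rather than the authors' implementation. It assumes separators [AE] and [B], the table layout given in the code comments, and 0/1 level codes (0 = no, < 140, < 3, neg); all variable names are hypothetical.

```python
import numpy as np

# Counts from "Example 2: The Data": rows run over (F, E, D, C) with C
# fastest, columns over (B, A) with A fastest.
counts = np.array([
    [44, 40, 112, 67], [129, 145, 12, 23],
    [35, 12, 80, 33], [109, 67, 7, 9],
    [23, 32, 70, 66], [50, 80, 7, 13],
    [24, 25, 73, 57], [51, 63, 7, 16],
    [5, 7, 21, 9], [9, 17, 1, 4],
    [4, 3, 11, 8], [14, 17, 5, 2],
    [7, 3, 14, 14], [9, 16, 2, 3],
    [4, 0, 13, 11], [5, 14, 4, 4],
])

# Reshape to axes (F, E, D, C, B, A), then reverse to (A, B, C, D, E, F).
t = counts.reshape((2,) * 6).T
A, B, C, D, E, F = range(6)


def margin(table, keep):
    """Sum out every axis not in `keep`, keeping dims for broadcasting."""
    drop = tuple(ax for ax in range(table.ndim) if ax not in keep)
    return table.sum(axis=drop, keepdims=True)


# Released margins (cliques) and assumed separators.
n_ade = margin(t, (A, D, E))
n_abce = margin(t, (A, B, C, E))
n_bf = margin(t, (B, F))
n_ae = margin(t, (A, E))
n_b = margin(t, (B,))

# Upper bound: minimum of the relevant margins.
upper = np.minimum(np.minimum(n_ade, n_abce), n_bf)
# Lower bound: max of zero and (sum of margins minus separators).
lower = np.maximum(n_ade + n_abce + n_bf - n_ae - n_b, 0)

# Cells indexed as (A, B, C, D, E, F).
unique_cell = (0, 1, 1, 0, 0, 1)                    # the count of 1
two_cells = [(0, 1, 1, 0, 1, 1), (1, 1, 1, 1, 0, 1)]  # the counts of 2
for cell in [unique_cell] + two_cells:
    print(cell, int(t[cell]), [int(lower[cell]), int(upper[cell])])
```

For these three cells the binding margin is [ABCE] in each case, which is why the upper bounds stay well above the true counts.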