Preserving Confidentiality AND Providing Adequate Data for Statistical Modeling

  1. Preserving Confidentiality AND Providing Adequate Data for Statistical Modeling
     Stephen E. Fienberg
     Department of Statistics
     Center for Automated Learning and Discovery
     Center for Computer and Communications Security
     Carnegie Mellon University
     Pittsburgh, PA, U.S.A.

     Overview
     • Background and some fundamental abstractions for disclosure limitation.
       – Statistical users want more than to retrieve a few numbers.
     • Results on bounds for table entries.
     • Uses of Markov bases for exact distributions and perturbation of tables.
     • Links to log-linear models, and related statistical theory and methods.

     R-U Confidentiality Map
     [Figure: R-U confidentiality map (Duncan, et al. 2001).]

     NISS Prototype Query System
     • For k-way table of counts.
     • Queries: Requests for marginal tables.
     • Responses: Yes (release); No; and perhaps "Simulate" and then release.
     • As released margins cumulate we have increased information about table entries.
     • Margins need to be consistent ==> possible simulated releases get highly constrained.

  2. Fundamental Abstractions
     • Query space, Q, with partial ordering:
       – Elements can be marginal tables, conditionals, k-groupings, regressions, or other data summaries.
       – Released set: R(t), and implied Unreleasable set: U(t).
       – Releasable frontier: maximal elements of R(t).
       – Unreleasable frontier: minimal elements of U(t).
     • Risk and Utility defined on subsets of Q.
       – Risk Measure: identifiability of small cell counts.
       – Utility: reconstructing table using log-linear models.
       – Release rules must balance risk and utility:
         • R-U Confidentiality map.
         • General Bayesian decision-theoretic approach.

     Confidentiality Concern
     • Uniqueness in population table ⇔ cell count of "1".
     • Uniqueness allows intruder to match characteristics in table with other data bases that include the same variables plus others to learn confidential information.
       – Assuming data are reported without error!
     • Identity versus attribute disclosure.

     Example 1: 2000 Census
     • U.S. decennial census "long form"
       – 1 in 6 sample of households nationwide.
       – 53 questions, many with multiple categories.
       – Data measured with substantial error!
       – Data reported after application of data swapping!
     • Geography
       – 50 states; 3,000 counties; 4 million "blocks".
       – Release of detailed geography yields uniqueness in sample and at some level in population.
     • American Factfinder releases various 3-way tables at different levels of geography.

     Why Marginals?
     • Simple summaries corresponding to subsets of variables.
     • Traditional mode of reporting for statistical agencies and others.
     • Useful in statistical modeling: Role of log-linear models.
     • Collapsing categories of categorical variables uses similar DL methods and statistical theory.

  3. Example 2: Risk Factors for Coronary Heart Disease
     • 1841 Czech auto workers, Edwards and Havránek (1985)
     • $2^6$ table
     • Population data
       – "0" cell
       – population unique, "1"
       – 2 cells with "2"
     [Figure: graph on six vertices: a = Smoke (Y/N), b = Mental work, c = Phys. work, d = Syst. BP, e = Lipo ratio, f = Anamnesis.]

     Example 2: The Data

                           B     no          yes
     F    E    D      C    A   no   yes    no   yes
     neg  < 3  < 140  no       44    40   112    67
                      yes     129   145    12    23
               ≥ 140  no       35    12    80    33
                      yes     109    67     7     9
          ≥ 3  < 140  no       23    32    70    66
                      yes      50    80     7    13
               ≥ 140  no       24    25    73    57
                      yes      51    63     7    16
     pos  < 3  < 140  no        5     7    21     9
                      yes       9    17     1     4
               ≥ 140  no        4     3    11     8
                      yes      14    17     5     2
          ≥ 3  < 140  no        7     3    14    14
                      yes       9    16     2     3
               ≥ 140  no        4     0    13    11
                      yes       5    14     4     4

     Example 3: NLTCS
     • National Long Term Care Survey
       – 20-40 demographic/background items.
       – 30-50 items on disability status, ADLs and IADLs, most binary but some polytomous.
       – Linked Medicare files.
       – 5 waves: 1982, 1984, 1989, 1994, 1999.
     • We've been working with $2^{16}$ table, collapsed across several waves of survey, with n = 21,574.
     Erosheva (2002); Dobra, Erosheva, & Fienberg (2003)
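The risk measure named earlier (identifiability of small cell counts) can be checked directly on the Example 2 data. The sketch below is illustrative only: the counts are the 64 cells from the Example 2 slide, while the names `ROWS` and `small_cell_summary` are hypothetical and not part of the presentation.

```python
from collections import Counter

# Czech autoworkers counts, one row per (F, E, D, C) combination in the
# order of the Example 2 data slide. The four columns are
# (B, A) = (no, no), (no, yes), (yes, no), (yes, yes).
ROWS = [
    [44, 40, 112, 67], [129, 145, 12, 23], [35, 12, 80, 33], [109, 67, 7, 9],
    [23, 32, 70, 66], [50, 80, 7, 13], [24, 25, 73, 57], [51, 63, 7, 16],
    [5, 7, 21, 9], [9, 17, 1, 4], [4, 3, 11, 8], [14, 17, 5, 2],
    [7, 3, 14, 14], [9, 16, 2, 3], [4, 0, 13, 11], [5, 14, 4, 4],
]


def small_cell_summary(rows, threshold=2):
    """Count how many cells hold each value from 0 up to `threshold`."""
    tally = Counter(value for row in rows for value in row)
    return {value: tally.get(value, 0) for value in range(threshold + 1)}


total = sum(sum(row) for row in ROWS)
summary = small_cell_summary(ROWS)
print(total)    # 1841
print(summary)  # {0: 1, 1: 1, 2: 2}
```

The output agrees with the slide: 1841 workers, one "0" cell, one population unique, and two cells with "2".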

  4. Two-Way Fréchet Bounds
     • For 2 × 2 tables of counts $\{n_{ij}\}$ given the marginal totals $\{n_{1+}, n_{2+}\}$ and $\{n_{+1}, n_{+2}\}$:

       $$\begin{array}{cc|c} n_{11} & n_{12} & n_{1+} \\ n_{21} & n_{22} & n_{2+} \\ \hline n_{+1} & n_{+2} & n \end{array}$$

       $$\min(n_{i+}, n_{+j}) \ge n_{ij} \ge \max(n_{i+} + n_{+j} - n,\ 0)$$

     • Interested in multi-way generalizations involving higher-order, overlapping margins.

     Bounds for Multi-Way Tables
     • k-way table of non-negative counts, k ≥ 3.
       – Release set of marginal totals, possibly overlapping.
       – Goal: Compute bounds for cell entries.
       – LP and IP approaches are NP-hard.
     • Our strategy has been to:
       – Develop efficient methods for several special cases.
       – Exploit linkage to statistical theory where possible.
       – Use general, less efficient methods for residual cases.
     • Direct generalizations to tables with non-integer, non-negative entries.

     Role of Log-linear Models?
     • For 2 × 2 case, lower bound is evocative of MLE for estimated expected value under independence:

       $$\hat{m}_{ij} = n_{i+}\, n_{+j} / n.$$

       – Bounds correspond to log-linearized version.
       – Margins are minimal sufficient statistics (MSS).
     • In 3-way table of counts, $\{n_{ijk}\}$, we model logs of expectations $\{E(n_{ijk}) = m_{ijk}\}$:

       $$\log(m_{ijk}) = u + u_{1(i)} + u_{2(j)} + u_{3(k)} + u_{12(ij)} + u_{13(ik)} + u_{23(jk)}$$

     • MSS are margins corresponding to highest order terms: $\{n_{ij+}\}$, $\{n_{i+k}\}$, $\{n_{+jk}\}$.

     Graphical & Decomposable Log-linear Models
     • Graphical models: defined by simultaneous conditional independence relationships
       – Absence of edges in graph.
     • Example 2: Czech autoworkers. Graph has 3 cliques: [ADE][ABCE][BF]
     • Decomposable models correspond to triangulated graphs.
     [Figure: graph for Example 2 with vertices a = Smoke (Y/N), b = Mental work, c = Phys. work, d = Syst. BP, e = Lipo ratio, f = Anamnesis.]
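As a minimal sketch of the two-way Fréchet bounds and of the independence MLE they are compared with, the following computes both from row and column totals alone. The function names and the example margins (7, 3) and (6, 4) are made up for illustration; they do not come from the presentation.

```python
def frechet_bounds(row_totals, col_totals):
    """Return (lower, upper) bounds for every cell of a two-way table.

    Uses max(n_i+ + n_+j - n, 0) <= n_ij <= min(n_i+, n_+j).
    """
    n = sum(row_totals)
    if n != sum(col_totals):
        raise ValueError("row and column totals must have the same sum")
    return [
        [(max(r + c - n, 0), min(r, c)) for c in col_totals]
        for r in row_totals
    ]


def independence_mle(row_totals, col_totals):
    """Estimated expected counts n_i+ * n_+j / n under independence."""
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


# Hypothetical released margins of a 2 x 2 table with n = 10.
rows, cols = [7, 3], [6, 4]
print(frechet_bounds(rows, cols))
# [[(3, 6), (1, 4)], [(0, 3), (0, 3)]]
print(independence_mle(rows, cols))
```

In this example the estimated expected value for cell (1, 1) is 7 × 6 / 10 = 4.2, which falls inside its bounds [3, 6].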

  5. MLEs for Decomposable Log-linear Models
     • For decomposable models, expected cell values are explicit function of margins, corresponding to MSSs (cliques in graph):
       – For conditional independence in 3-way table:

         $$\log m_{ijk} = u + u_{1(i)} + u_{2(j)} + u_{3(k)} + u_{12(ij)} + u_{13(ik)}$$

         $$m_{ijk} = \frac{m_{ij+}\, m_{i+k}}{m_{i++}}$$

     • Substitute observed margins for expected in explicit formula to get MLEs.

     Multi-way Bounds
     • For decomposable log-linear models:

       $$\text{Expected Value} = \frac{\prod \text{MSSs}}{\prod \text{Separators}}$$

     • Theorem: When released margins correspond to those of a decomposable log-linear model:
       – Upper bound: minimum of relevant margins.
       – Lower bound: maximum of zero, or sum of relevant margins minus separators.
       – Bounds are sharp.
     Fienberg and Dobra (2000)

     Ex. 2: Czech Autoworkers
     • Suppose released margins are [ADE][ABCE][BF]:
       – Correspond to decomposable graph.
       – Cell containing population unique has bounds [0, 25].
       – Cells with entry of "2" have bounds: [0, 20] and [0, 38].
       – Lower bounds are all "0".
     • "Safe" to release these margins; low risk of disclosure.
     [Figure: graph for Example 2 with vertices a = Smoke (Y/N), b = Mental work, c = Phys. work, d = Syst. BP, e = Lipo ratio, f = Anamnesis.]

     Multi-Way Bounds (cont.)
     • Example: Given margins in k-way table that correspond to (k-1)-fold conditional independence given variable 1:

       $$\{n_{i_1 i_2 + \cdots +}\},\ \{n_{i_1 + i_3 + \cdots +}\},\ \ldots,\ \{n_{i_1 + \cdots + i_k}\}$$

     • Then bounds are

       $$\min\{n_{i_1 i_2 + \cdots +},\ n_{i_1 + i_3 + \cdots +},\ \ldots,\ n_{i_1 + \cdots + i_k}\} \ \ge\ n_{i_1 i_2 \cdots i_k} \ \ge\ \max\{n_{i_1 i_2 + \cdots +} + n_{i_1 + i_3 + \cdots +} + \cdots + n_{i_1 + \cdots + i_k} - (k-2)\, n_{i_1 + \cdots +},\ 0\}$$
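The theorem's bounds can be sketched for the simplest decomposable case, conditional independence in a 3-way table, where the released margins are $\{n_{ij+}\}$ and $\{n_{i+k}\}$ and the single separator is $\{n_{i++}\}$. The function name and the 2 × 2 × 2 counts below are hypothetical and serve only to exercise the formula.

```python
from itertools import product


def conditional_independence_bounds(n_ij, n_ik):
    """Bounds on n_ijk from released margins n_ij[i][j] and n_ik[i][k].

    Upper bound: min(n_ij+, n_i+k).
    Lower bound: max(n_ij+ + n_i+k - n_i++, 0), where the separator
    n_i++ is recovered by summing either margin over its second index.
    """
    bounds = {}
    for i, (row_j, row_k) in enumerate(zip(n_ij, n_ik)):
        separator = sum(row_j)
        if separator != sum(row_k):
            raise ValueError("released margins disagree on n_i++")
        for (j, a), (k, b) in product(enumerate(row_j), enumerate(row_k)):
            bounds[(i, j, k)] = (max(a + b - separator, 0), min(a, b))
    return bounds


# Hypothetical 2 x 2 x 2 table, indexed as counts[i][j][k].
counts = [[[5, 1], [2, 4]], [[0, 3], [1, 2]]]

# Margins an agency might release: sum over k, and sum over j.
n_ij = [[sum(cell) for cell in layer] for layer in counts]
n_ik = [[sum(col) for col in zip(*layer)] for layer in counts]

bounds = conditional_independence_bounds(n_ij, n_ik)
for (i, j, k), (low, high) in sorted(bounds.items()):
    # Every true count must fall inside its interval.
    assert low <= counts[i][j][k] <= high
    print((i, j, k), low, high)
```

For this table the cell (1, 0, 1) is bounded by [2, 3], so the two margins already pin it down closely, whereas cell (0, 0, 1) is only bounded by [0, 5].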
