TOWARDS A QUALITY ASSESSMENT OF DISCLOSURE-LIMITED STATISTICAL DATA
Lawrence H. Cox, Ph.D.
National Center for Health Statistics
LCOX@CDC.GOV

QUALITY-CONFIDENTIALITY TRADEOFF
To reduce risk of statistical disclosure to an acceptable level, statistical disclosure limitation (SDL) methods
- abbreviate
- eliminate
- modify
original data
Lowering disclosure risk typically forces reduction of data quality in terms of
- accuracy
- completeness
- usability
Over the past 4 decades, SDL methods have been
- studied/developed
- improved/refined/implemented
with considerable success
At the same time, efforts to assess/control/assure quality were virtually absent

This presentation
- examines quality effects of three SDL methods for tabular data
- explores quality-preserving methods
The three methods
- rounding
- complementary cell suppression
- controlled tabular adjustment

HIGHLIGHTS
In view of time limitations, the take-home messages are
Rounding
- rounding keeps the data release intact
- methods for quality-preserving rounding
  • preserving mean, variance
  • preserving distribution
- available to NSOs
- rounding can limit disclosure effectively
Complementary cell suppression
- has very negative effects on data quality, especially as the data release is not intact
- in the absence of a mathematical model, in some cases suppression can be undone
- the security of suppression hinges on a single quantity that often can be estimated
- p-percent rules can be vulnerable
- p/q-ambiguity rules are vulnerable
Controlled tabular adjustment
- keeps the data release intact
- can preserve key values and statistics
- can preserve original distribution
- effectively limits disclosure

ROUNDING
Rounding (base B): replace original data values x = qB + r, 0 ≤ r < B, by integer multiples R(x) = mB of an integer rounding base B
Adjacent rounding (typical): |x − R(x)| < B
Zero-restricted rounding (typical): R(mB) = mB
Controlled rounding preserves additivity
We are concerned with
- effects of base B rounding on statistical properties of original data (data quality)
  • mean
  • variance/TMSE
  • distribution
- effects on disclosure risk: P[x | R(x)]

Principal issues in evaluating an SDL method
(1) Is the method effective for limiting disclosure?
(2) Are its effects on data quality acceptable?
We examined these questions for four rounding rules
- conventional rounding
- modified conventional rounding
- zero-restricted 50/50 rounding
- unbiased rounding
We report only on zero-restricted 50/50 rounding
We evaluate a rounding rule/base
(1) in terms of the posterior probability of an original data value given its rounded value
(2) in terms of the expected increase in total mean squared error and the expected difference between pre- and post-rounding distributions, as measured by a conditional Chi-square statistic

We assume
- r- and q-distributions independent
- r ~ Uniform{0, 1, …, B − 1} (can be relaxed)
Focus on adjacent rounding
- R(x) = qB or (q + 1)B
- R(x) = qB + R(r) with R(r) = 0 or B
Zero-restricted 50/50 rounding
- r = 0: round down
- r ≠ 0: round down or up, each with probability ½
Assumptions imply
- E[x] = B·E[q] + E[r]
- P[r] = P[r|q] = 1/B
- V(x) = B²V(q) + V(r)

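The zero-restricted 50/50 rule can be sketched in a few lines. This is a minimal illustration only, assuming nonnegative integer data; the function name `round_5050` and the `rng` parameter are hypothetical and do not come from the presentation.

```python
import random

def round_5050(x, base, rng=random):
    """Zero-restricted 50/50 rounding of a nonnegative integer x (illustrative sketch)."""
    q, r = divmod(x, base)      # x = q*base + r with 0 <= r < base
    if r == 0:
        return x                # zero-restricted: multiples of base are left unchanged
    # r != 0: round down to q*base or up to (q+1)*base, each with probability 1/2
    return (q + 1) * base if rng.random() < 0.5 else q * base

if __name__ == "__main__":
    rng = random.Random(0)      # seeded generator, only to make the demo repeatable
    print([round_5050(x, 5, rng) for x in (0, 3, 5, 7, 12)])
```

Every result is an adjacent multiple of the base, so |x − R(x)| < B always holds.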
EFFECTS OF ROUNDING ON MEAN, VARIANCE
For zero-restricted 50/50 rounding
- $P[R(r) = 0] = \frac{B+1}{2B}$ and $P[R(r) = B] = \frac{B-1}{2B}$, thus
- $E[R(r)] = \frac{B-1}{2}$ and $V[R(r)] = \frac{B^2-1}{4}$
Expected value of x and R(x)
  Unrounded: $qB + \frac{B-1}{2}$
  50/50: $qB + \frac{B-1}{2}$
Variance of x and R(x)
  Unrounded: $B^2 V[q] + \frac{B^2-1}{12}$
  50/50: $B^2 V[q] + \frac{B^2-1}{4}$

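The moments of r and R(r) can be checked exactly by enumerating the remainder distribution. The sketch below assumes r uniform on {0, …, B − 1} and the zero-restricted 50/50 rule; the function name `rounded_remainder_moments` is hypothetical.

```python
from fractions import Fraction

def rounded_remainder_moments(B):
    """Exact mean and variance of r and of R(r) under zero-restricted 50/50 rounding."""
    p = Fraction(1, B)                       # P[r] = 1/B for r = 0, ..., B-1
    mean_r = sum(p * r for r in range(B))
    var_r = sum(p * r * r for r in range(B)) - mean_r ** 2
    # R(r) = B only when r != 0 and the coin flip says "round up"
    p_up = sum(p * Fraction(1, 2) for _ in range(1, B))
    mean_R = B * p_up                        # R(r) takes the values 0 and B only
    var_R = B * B * p_up - mean_R ** 2
    return mean_r, var_r, mean_R, var_R

if __name__ == "__main__":
    for B in (3, 5, 10):
        print(B, rounded_remainder_moments(B))
```

The mean is unchanged by the rounding, while the remainder variance grows from (B² − 1)/12 to (B² − 1)/4.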
EFFECTS OF ROUNDING ON x-DISTRIBUTION
Use the conditional Chi-square statistic
- $U = \chi^2 = \sum_x U_x$, with $U_x = \frac{[R(x) - x]^2}{x} = \frac{[R(r) - r]^2}{x}$ (if x = 0, then $U_x = 0$)
Degrees of freedom df determined by the tabular structure
- $E[U_x \mid x] = \frac{r^2}{x} P[R(r) = 0 \mid r] + \frac{(B - r)^2}{x} P[R(r) = B \mid r]$, and $E[U] = \sum_x E[U_x \mid x]$
- d = #{x} = the number of x-observations
- e = #{x < B}, viz., zeroes and confidential values
Can derive an upper bound on E[U] in terms of e, d − e, and $E[\frac{1}{q} \mid q \ge 1]$, using
- $E\left[\frac{(R(r) - r)^2}{r} \mid r \ge 1\right] = \frac{B^2}{2(B-1)} \sum_{s=1}^{B-1} \frac{1}{s} - \frac{B}{2}$
- $E[(R(r) - r)^2] = \frac{(B-1)(2B-1)}{6}$
NSO can estimate $E[\frac{1}{q} \mid q \ge 1]$
So, NSO can select B so that the expected conditional Chi-square value is not statistically significant

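For a given set of original values, the expected value of the conditional Chi-square statistic under zero-restricted 50/50 rounding can be computed exactly, since each nonzero remainder is rounded down or up with probability ½. The sketch below assumes nonnegative integer data; the function name `expected_u` is hypothetical, and comparing the result with a critical value (which depends on the degrees of freedom of the tabular structure) is left to the user.

```python
from fractions import Fraction

def expected_u(values, B):
    """Exact E[U] = sum over x of E[U_x | x] under zero-restricted 50/50 rounding."""
    total = Fraction(0)
    for x in values:
        r = x % B
        if r == 0:
            continue            # multiples of B (including x = 0) are unchanged: U_x = 0
        # rounding down gives error r, rounding up gives error B - r, each with probability 1/2
        total += Fraction(r * r + (B - r) ** 2, 2 * x)
    return total

if __name__ == "__main__":
    data = [1, 2, 3, 4, 7, 12, 40, 103]     # made-up values, for illustration only
    for B in (3, 5, 10):
        print(B, float(expected_u(data, B)))
```

Small values dominate E[U] because the squared rounding error is divided by x, which is why the cells with x < B are counted separately.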
EFFECTIVENESS FOR DISCLOSURE LIMITATION
Evaluate effectiveness of rounding for SDL in terms of posterior predictive probabilities P[x | R(x)]
Under 50/50 rounding, for R(x) = 0 (so x = r)
- $P[x = r \mid R(r) = 0] = \frac{2}{B+1}$ for r = 0
- $P[x = r \mid R(r) = 0] = \frac{1}{B+1}$ for each r ≠ 0
Confidentiality analysis
- prior r-probabilities uniform on {0, 1, …, B − 1}
- ideally, posterior probabilities uniform on same set
- or, if x = 0 is not a confidential value, then uniform over its B − 1 nonzero values
- if x = r = 0 is not confidential, under 50/50 rounding posterior probabilities are uniform over the confidential values
Reference: Cox and Kim (2006)

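The posterior probabilities follow from Bayes' rule applied to the uniform prior and the 50/50 rounding rule. The sketch below is an illustrative check, assuming x < B so that x = r; the function name `posterior_given_rounded` is hypothetical.

```python
from fractions import Fraction

def posterior_given_rounded(B):
    """Posterior P[r | R(r) = 0] and P[r | R(r) = B] under zero-restricted 50/50 rounding."""
    prior = Fraction(1, B)                   # uniform prior on r = 0, ..., B-1
    half = Fraction(1, 2)
    # joint probabilities P[r, R(r) = 0]: r = 0 always rounds to 0, r != 0 half the time
    joint_down = {r: prior * (1 if r == 0 else half) for r in range(B)}
    # joint probabilities P[r, R(r) = B]: only r != 0 can round up
    joint_up = {r: prior * half for r in range(1, B)}
    norm_down = sum(joint_down.values())
    norm_up = sum(joint_up.values())
    post_down = {r: v / norm_down for r, v in joint_down.items()}
    post_up = {r: v / norm_up for r, v in joint_up.items()}
    return post_down, post_up

if __name__ == "__main__":
    down, up = posterior_given_rounded(5)
    print(down)   # r = 0 is twice as likely as each nonzero r
    print(up)     # uniform over the nonzero remainders
```

Given R(r) = 0, the nonzero (confidential) remainders are equally likely, and given R(r) = B the posterior is uniform on {1, …, B − 1}.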
COMPLEMENTARY CELL SUPPRESSION
p-PERCENT RULE
For magnitude data, each respondent (contributor) to the value of cell X contributes an individual amount, e.g.,
- monthly sales for a clothing store
- weekly payroll for a factory
- number of patient visits for an emergency room
Cell value of X is x = sum of all contributions $x_i$ to X
  $x = \sum_i x_i$; $x_1 \ge x_2 \ge \cdots \ge x_i \ge \cdots$
The p-percent rule is designed to prevent narrow estimation of any contribution to a cell value by a second contributor or third party. It says:
A tabulation cell X is a disclosure (sensitive) cell if, after subtracting the second largest contribution from the cell value, the remainder is within p-percent of the largest contribution
Express p as a decimal (not a percent); e.g., 20% = 0.20
Sensitivity expressed via $S_p(X) = x_1 - (1/p) \sum_{i \ge 3} x_i > 0$
NB: Protecting largest from second largest protects all

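The p-percent sensitivity measure is simple to evaluate for a single cell. The sketch below assumes nonnegative contributions and p given as a decimal; the function names and the example contributions are hypothetical.

```python
def p_percent_sensitivity(contributions, p):
    """S_p(X) = x_1 - (1/p) * (sum of contributions ranked third and lower)."""
    xs = sorted(contributions, reverse=True)    # x_1 >= x_2 >= ...
    if not xs:
        return 0.0                              # empty cell: nothing to disclose
    return xs[0] - sum(xs[2:]) / p

def is_p_percent_sensitive(contributions, p):
    """A cell is sensitive when S_p(X) > 0."""
    return p_percent_sensitivity(contributions, p) > 0

if __name__ == "__main__":
    # made-up cells: the first is dominated by its two largest contributors
    print(is_p_percent_sensitive([100, 50, 5, 3], 0.20))
    print(is_p_percent_sensitive([100, 50, 30, 20], 0.20))
```

In the first cell the remainder after removing the two largest contributions is 8, which is within 20% of 100, so the second contributor could estimate the largest contribution too closely.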
p/q-AMBIGUITY RULE
In addition to p-percent protection, data releaser assumes intruder can estimate any contribution within q-percent
Express q as decimal: q < 1 and, of course, q >> p
Sensitivity expressed via $S_{p/q}(X) = x_1 - (q/p) \sum_{i \ge 3} x_i > 0$
Thus, p/q-ambiguity rule is stricter than p-percent rule, viz., all p-percent sensitive cells are p/q-sensitive
When q = 1: p/q-ambiguity rule = p-percent rule
Disclosure limitation method must take into account the ability of the intruder to estimate within q-percent

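The effect of the prior-knowledge parameter q can be seen on a small example. The sketch below assumes nonnegative contributions and p, q given as decimals; the function name `pq_sensitivity` and the example cell are hypothetical.

```python
def pq_sensitivity(contributions, p, q=1.0):
    """S_{p/q}(X) = x_1 - (q/p) * (sum of contributions ranked third and lower).

    With q = 1 this reduces to the p-percent sensitivity measure.
    """
    xs = sorted(contributions, reverse=True)    # x_1 >= x_2 >= ...
    if not xs:
        return 0.0
    return xs[0] - (q / p) * sum(xs[2:])

if __name__ == "__main__":
    cell = [100, 50, 15, 10]                    # made-up contributions
    print(pq_sensitivity(cell, p=0.20))         # p-percent rule (q = 1)
    print(pq_sensitivity(cell, p=0.20, q=0.5))  # p/q-ambiguity rule
```

This cell is not sensitive under the 20-percent rule (measure −25) but is sensitive under the p/q rule with q = 0.5 (measure 37.5), illustrating that the p/q rule flags more cells.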
CCS
- suppress from publication all sensitive cells
- the disclosure rule enables releaser to compute for each sensitive cell the minimum uncertainty in estimation required to protect the cell
- that quantity is dependent on the distribution of contributions within the cell and differs from cell to cell and cell value to cell value
- it is called X's protection limit r(X) = r
- select other, nonsensitive cells whose suppression will render the tabulations safe according to the disclosure rule (the complementary suppressions)
- safe means that no interval for x finer than [x − r, x + r] is derivable from released tabulations
- select the complementary suppressions optimally with respect to some information loss criterion, e.g.,
  • total value suppressed
  • total number of suppressions
  • Berg entropy
- very complex mathematically/computationally
- for the p/q-rule, the mathematical suppression must take into account the ability of the intruder to estimate values to within q-percent

Mathematical models for CCS
Tabular structure is represented as Ay = 0
Entries of A = -1, 0, +1
Original data: $a = (a_1, \ldots, a_n)$; Aa = 0
Sensitive cell values: $a_{d(i)}$, i = 1, …, s
Protection: $r_{d(i)}$, $0 < r_{d(i)} < a_{d(i)}$, and $r_k = 0$ otherwise
CCS Models
  $\min \sum_k c_k z_k$
  subject to, for i = 1, …, s; j = 1, 2; k = 1, …, n:
  $A y_{i,j} = 0$
  $a_k (1 - z_k) \le y_{i,1,k} \le a_k - r_k z_k$
  $a_k (1 + z_k) \ge y_{i,2,k} \ge a_k + r_k z_k$
  $z_k = 0, 1$; $z_{d(i)} = 1$
Minimize number of cells suppressed: $c_k = 1$
Minimize total value suppressed: $c_k = a_k$
Minimize Berg entropy: $c_k = \log(1 + a_k)$

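The three objective functions differ only in the cost coefficient c_k attached to each suppressed cell. The sketch below evaluates the objective for a given suppression pattern z; it does not solve the optimization model, and the names `COSTS` and `information_loss` are hypothetical. The natural logarithm is assumed for the Berg entropy cost.

```python
import math

# cost coefficient c_k as a function of the cell value a_k
COSTS = {
    "count": lambda a: 1.0,                 # number of cells suppressed
    "value": lambda a: float(a),            # total value suppressed
    "berg": lambda a: math.log(1 + a),      # Berg entropy (natural log assumed)
}

def information_loss(values, z, criterion):
    """Objective sum_k c_k * z_k for cell values a_k and a 0/1 suppression pattern z_k."""
    cost = COSTS[criterion]
    return sum(cost(a) for a, zk in zip(values, z) if zk == 1)

if __name__ == "__main__":
    a = [10, 0, 5, 200]                     # made-up cell values
    z = [1, 0, 1, 0]                        # cells 1 and 3 suppressed
    for name in COSTS:
        print(name, information_loss(a, z, name))
```

The choice of criterion changes which complementary suppressions an optimizer would prefer: counting cells favors suppressing few large cells, while the value criterion favors many small ones.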
Suppression done “by hand” can be vulnerable
3x3x3 contingency table, all internal entries suppressed
Layer 1 (row sums at right, column sums and total below):
   *  *  * | 11
   *  *  * |  5
   *  *  * |  5
  11  5  5 | 21
Layer 2:
   *  *  * |  5
   *  *  * | 11
   *  *  * |  5
   5 11  5 | 21
Layer 3:
   *  *  * |  5
   *  *  * |  5
   *  *  * | 11
   5  5 11 | 21
Sums over the three layers:
   1 10 10
  10  1 10
  10 10  1
Unique solution:
Layer 1:
  1 5 5
  5 0 0
  5 0 0
Layer 2:
  0 5 0
  5 1 5
  0 5 0
Layer 3:
  0 0 5
  0 0 5
  5 5 1
contains three 1's: DISCLOSURE

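The uniqueness claim can be checked by brute force: enumerate every nonnegative integer 3x3 table with the released margins for the first two layers, derive the third layer from the sums over layers, and keep the combinations whose third layer also matches its margins. The sketch below is an illustrative check, not the method used in the presentation; all names are hypothetical.

```python
from itertools import product

def tables_with_margins(rows, cols):
    """Yield all 3x3 nonnegative integer tables with the given row and column sums."""
    for a, b, d, e in product(range(rows[0] + 1), range(rows[0] + 1),
                              range(rows[1] + 1), range(rows[1] + 1)):
        c = rows[0] - a - b                      # completes row 1
        f = rows[1] - d - e                      # completes row 2
        g, h, i = cols[0] - a - d, cols[1] - b - e, cols[2] - c - f   # row 3 from column sums
        if min(c, f, g, h, i) >= 0 and g + h + i == rows[2]:
            yield ((a, b, c), (d, e, f), (g, h, i))

LAYER_MARGINS = [((11, 5, 5), (11, 5, 5)),       # (row sums, column sums) per layer
                 ((5, 11, 5), (5, 11, 5)),
                 ((5, 5, 11), (5, 5, 11))]
ACROSS = ((1, 10, 10), (10, 1, 10), (10, 10, 1)) # sums over the three layers

def find_solutions():
    """Return every (layer1, layer2, layer3) consistent with all released totals."""
    layer1 = list(tables_with_margins(*LAYER_MARGINS[0]))
    layer2 = list(tables_with_margins(*LAYER_MARGINS[1]))
    layer3 = set(tables_with_margins(*LAYER_MARGINS[2]))
    found = []
    for t1 in layer1:
        for t2 in layer2:
            # the third layer is forced by the sums over layers
            t3 = tuple(tuple(ACROSS[r][c] - t1[r][c] - t2[r][c] for c in range(3))
                       for r in range(3))
            if t3 in layer3:                     # nonnegative and correct margins
                found.append((t1, t2, t3))
    return found

if __name__ == "__main__":
    for solution in find_solutions():
        print(solution)
```

Because only one table is consistent with the released totals, an intruder recovers every suppressed entry, including the three cells equal to 1.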