HIDE: Privacy Preserving Medical Data Publishing James Gardner - PowerPoint PPT Presentation


  1. HIDE: Privacy Preserving Medical Data Publishing James Gardner Department of Mathematics and Computer Science Emory University jgardn3@emory.edu

  2. Motivation • De-identification is critical in any health informatics system, both for research and for data sharing • Data custodians and publishers need an easy-to-use interface and framework • Understanding the data is necessary to de-identify it

  3. HIPAA
     1. Names;
     2. All geographical subdivisions smaller than a State, including street address, city, county, precinct, zip code, and their equivalent geocodes, except for the initial three digits of a zip code, if according to the current publicly available data from the Bureau of the Census: (1) The geographic unit formed by combining all zip codes with the same three initial digits contains more than 20,000 people; and (2) The initial three digits of a zip code for all such geographic units containing 20,000 or fewer people is changed to 000.
     3. All elements of dates (except year) for dates directly related to an individual, including birth date, admission date, discharge date, date of death; and all ages over 89 and all elements of dates (including year) indicative of such age, except that such ages and elements may be aggregated into a single category of age 90 or older;
     4. Phone numbers;
     5. Fax numbers;
     6. Electronic mail addresses;
     7. Social Security numbers;
     8. Medical record numbers;
     9. Health plan beneficiary numbers;
     10. Account numbers;
     11. Certificate/license numbers;
     12. Vehicle identifiers and serial numbers, including license plate numbers;
     13. Device identifiers and serial numbers;
     14. Web Universal Resource Locators (URLs);
     15. Internet Protocol (IP) address numbers;
     16. Biometric identifiers, including finger and voice prints;
     17. Full face photographic images and any comparable images; and
     18. Any other unique identifying number, characteristic, or code (note this does not mean the unique code assigned by the investigator to code the data)
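
Items 2 and 3 of the list above are the only ones that call for generalization rather than outright removal. A minimal sketch of those two rules follows. The function names are hypothetical, and `SMALL_PREFIXES` is a placeholder: the real set of three-digit prefixes covering 20,000 or fewer people has to come from current Census data.

```python
# Hypothetical placeholder: the real list must be derived from Census data.
SMALL_PREFIXES = {"999"}


def generalize_zip(zipcode, small_prefixes):
    """Keep the first three digits, or return "000" for sparsely populated areas."""
    prefix = zipcode[:3]
    return "000" if prefix in small_prefixes else prefix


def generalize_age(age):
    """Aggregate all ages over 89 into a single "90+" category."""
    return "90+" if age > 89 else str(age)


zip_kept = generalize_zip("53710", SMALL_PREFIXES)
zip_masked = generalize_zip("99901", SMALL_PREFIXES)
age_kept = generalize_age(77)
age_masked = generalize_age(93)
```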

  4. PHI Summary • Protected Health Information (PHI) is defined by HIPAA as individually identifiable health information • Direct identifiers include name, SSN, etc. • Indirect identifiers include gender, age, address information, etc.

  5. Research Challenges • Detect PHI in heterogeneous medical data • Apply structured anonymization principles on heterogeneous medical data (micro-privacy) • Release differentially private aggregated statistics (macro-privacy)

  6. HIDE • Health Information DE-identification • Uses techniques from information extraction, data linking, structured anonymization, differential privacy, and data mining

  7. HIDE

  8. Outline • Background and related work: existing de-identification approaches; named entity recognition; privacy preserving data publishing • Proposed work: HIDE framework; identifying and sensitive information extraction; micro-data publishing; macro-data publishing • Software

  9. Alternative Systems • Scrub System - rules and dictionaries are used to detect PHI • Semantic Lexicon System - rules and dictionaries are used to detect PHI • DE-ID - rules and dictionaries, developed at Pittsburgh and approved by IRB • Concept-Match Scrubber - removes every word not in an approved list of non-identifying terms • Carafe - uses a CRF to detect PHI

  10. Limitations of Most Systems • Lack portability • Don't give formal privacy guarantees • Don't utilize the latest work from structured data anonymization • Focus only on removing PHI

  11. Named Entity Recognition • Locate and classify atomic elements in text into predefined categories such as person, organization, location, expressions of time, quantities, etc. • NER systems can be classified into either: • Rule-based • Machine Learning-based

  12. NER Examples • Part-of-speech (POS) Tagging • I/PRP think/VBP it/PRP 's/BES a/DT pretty/RB good/JJ idea/NN ./. • Personal Health Identifier Detection • <age>77</age> year old <gender>female</gender> with history of <disease>B-cell lymphoma</disease> (Marginal zone, <mrn>SH-04-4444</mrn>)

  13. NER Metrics • Precision = TP / (TP + FP) • Recall = TP / (TP + FN) • TP, FP, and FN are the counts of true positives, false positives, and false negatives
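
As a small illustration of the two metrics, the sketch below scores predicted entities against a gold standard. Representing each entity as a `(start, end, label)` tuple and requiring an exact match are assumptions made for the example, not something the slides specify.

```python
def precision_recall(gold, predicted):
    """Return (precision, recall) for exact-match entity spans."""
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)   # predicted and correct
    fp = len(predicted - gold)   # predicted but not in the gold standard
    fn = len(gold - predicted)   # in the gold standard but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical spans: two entities found correctly, one wrong, one missed.
gold = {(0, 1, "age"), (3, 4, "gender"), (7, 11, "disease")}
pred = {(0, 1, "age"), (3, 4, "gender"), (5, 6, "disease")}
p, r = precision_recall(gold, pred)
```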

  14. Rule-based • Rely on hand-coded rules and dictionaries • Dictionaries can be used for terms in a closed class with an exhaustive list, e.g. geographic locations • Regular expressions are used to detect terms that follow certain syntactic patterns, e.g. phone numbers
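
A minimal sketch of the two rule types, using Python's standard `re` module. The phone pattern covers only a few US-style formats, and the tiny `LOCATIONS` dictionary is a hypothetical stand-in for an exhaustive gazetteer; neither is taken from any of the systems named in these slides.

```python
import re

# Syntactic pattern: US-style phone numbers such as 404-555-0100 or (404) 555-0100.
PHONE_RE = re.compile(r"\(?\b\d{3}\)?[-. ]\d{3}[-. ]\d{4}\b")

# Closed-class dictionary (hypothetical, far from exhaustive).
LOCATIONS = {"atlanta", "georgia"}


def detect_phi(text):
    """Return sorted (start, end, label) spans found by the regex and the dictionary."""
    spans = [(m.start(), m.end(), "phone") for m in PHONE_RE.finditer(text)]
    for m in re.finditer(r"[A-Za-z]+", text):
        if m.group().lower() in LOCATIONS:
            spans.append((m.start(), m.end(), "location"))
    return sorted(spans)


text = "Patient from Atlanta, call 404-555-0100."
spans = detect_phi(text)
found = [(text[s:e], label) for s, e, label in spans]
```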

  15. Machine learning-based • Model NER as a sequence labeling task where each token is assigned a label • Train classifiers to label each token • Classifiers use a list of features (or attributes) for training and classification of the sequence • Frequently applied classifiers are HMM, MEMM, SVM, and CRF

  16. Conditional Random Field • A Conditional Random Field (CRF) provides a probabilistic framework for labeling and segmenting sequential data • A CRF defines a conditional probability of a label sequence given an observation sequence
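
For reference, the conditional probability is usually written as follows for a linear-chain CRF. The linear-chain form is an assumption here, since the slides do not say which CRF variant is meant.

```latex
\[
p(\mathbf{y} \mid \mathbf{x})
  = \frac{1}{Z(\mathbf{x})}
    \exp\Bigl( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, \mathbf{x}, t) \Bigr),
\qquad
Z(\mathbf{x})
  = \sum_{\mathbf{y}'} \exp\Bigl( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y'_{t-1}, y'_t, \mathbf{x}, t) \Bigr)
\]
```

Here $\mathbf{x}$ is the observation (token) sequence, $\mathbf{y}$ is the label sequence, the $f_k$ are feature functions, the $\lambda_k$ are weights learned in training, and $Z(\mathbf{x})$ normalizes over all candidate label sequences.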

  17. Comparison • Rule-based: accurate; require experts to modify; not portable • Machine learning-based: accurate; models are modified through training rather than “coding”; portable

  18. Privacy Preserving Data Publishing • Weak privacy (micro): release a modified version of each record according to a given anonymization principle; assumes a given level of adversary background knowledge • Differential privacy (macro): release perturbed statistics that satisfy the differential privacy principle; makes no assumptions about background knowledge

  19. Micro-data publishing • Prevent linking of records in separate databases: k-anonymization • Prevent discovery of sensitive values: l-diversity • Prevent discovery of presence or absence in a database: delta-presence

  20. Micro-data publishing • Table 1: Illustration of Anonymization

      Original Data
      Name    Age   Gender   Zipcode   Diagnosis
      Henry   25    Male     53710     Influenza
      Irene   28    Female   53712     Lymphoma
      Dan     28    Male     53711     Bronchitis
      Erica   26    Female   53712     Influenza

      Anonymized Data
      Name   Age       Gender   Zipcode         Disease
      *      [25-28]   Male     [53710-53711]   Influenza
      *      [25-28]   Female   53712           Lymphoma
      *      [25-28]   Male     [53710-53711]   Bronchitis
      *      [25-28]   Female   53712           Influenza

  21. k-anonymization • Quasi-identifier (QID) set • Sensitive attributes • A table is k-anonymous if every record shares its quasi-identifier values with at least k-1 other records • The probability of linking a victim to a specific record through the QID is at most 1/k
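
The definition can be checked directly by counting how many records share each quasi-identifier combination. The sketch below does this for the four-record example from Table 1; the function name and the dict-based record layout are illustrative assumptions.

```python
from collections import Counter


def is_k_anonymous(records, quasi_identifiers, k):
    """True if every quasi-identifier combination occurs in at least k records."""
    groups = Counter(tuple(r[a] for a in quasi_identifiers) for r in records)
    return all(count >= k for count in groups.values())


QID = ["age", "gender", "zipcode"]

original = [
    {"age": "25", "gender": "Male", "zipcode": "53710", "disease": "Influenza"},
    {"age": "28", "gender": "Female", "zipcode": "53712", "disease": "Lymphoma"},
    {"age": "28", "gender": "Male", "zipcode": "53711", "disease": "Bronchitis"},
    {"age": "26", "gender": "Female", "zipcode": "53712", "disease": "Influenza"},
]

anonymized = [
    {"age": "[25-28]", "gender": "Male", "zipcode": "[53710-53711]", "disease": "Influenza"},
    {"age": "[25-28]", "gender": "Female", "zipcode": "53712", "disease": "Lymphoma"},
    {"age": "[25-28]", "gender": "Male", "zipcode": "[53710-53711]", "disease": "Bronchitis"},
    {"age": "[25-28]", "gender": "Female", "zipcode": "53712", "disease": "Influenza"},
]

original_ok = is_k_anonymous(original, QID, 2)      # every record is unique
anonymized_ok = is_k_anonymous(anonymized, QID, 2)  # two groups of two records
```

With k = 2, each record in the generalized table is indistinguishable from one other record, so the linking probability through the QID is at most 1/2.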

  22. l-diversity • Extension of k-anonymization • Also ensures that each group has at least l distinct sensitive values • Prevents disclosure of sensitive values
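
A minimal sketch of the "at least l distinct sensitive values per group" check, applied to the generalized records from the anonymization example. This is the simplest (distinct) reading of l-diversity; the tuple layout and function name are assumptions for illustration.

```python
from collections import defaultdict


def is_l_diverse(records, l):
    """records: (quasi_identifier_tuple, sensitive_value) pairs."""
    groups = defaultdict(set)
    for qid, sensitive in records:
        groups[qid].add(sensitive)
    # Every quasi-identifier group needs at least l distinct sensitive values.
    return all(len(values) >= l for values in groups.values())


males = ("[25-28]", "Male", "[53710-53711]")
females = ("[25-28]", "Female", "53712")
records = [
    (males, "Influenza"),
    (females, "Lymphoma"),
    (males, "Bronchitis"),
    (females, "Influenza"),
]

two_diverse = is_l_diverse(records, 2)
three_diverse = is_l_diverse(records, 3)
```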

  23. Macro-data publishing • Differential privacy is a strong privacy notion • Requires that a randomized computation yield nearly identical output distributions when performed on nearly identical inputs • Interactive model: limited to a specific number of queries • Non-interactive model: needs query strategies to build noisy data cubes that maximize utility for a random query workload
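
One standard way to meet this requirement for count queries is the Laplace mechanism, sketched below for a histogram. This is a generic textbook construction, not necessarily the strategy HIDE uses. It assumes the bins are fixed in advance, and that adding or removing one record changes exactly one count by 1, so the sensitivity is 1.

```python
import random
from collections import Counter


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential variates."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_histogram(values, bins, epsilon, rng=None):
    """Return noisy counts for a fixed list of bins (sensitivity assumed to be 1)."""
    rng = rng or random.Random()
    counts = Counter(values)
    scale = 1.0 / epsilon  # Laplace scale = sensitivity / epsilon
    return {b: counts.get(b, 0) + laplace_noise(scale, rng) for b in bins}


diagnoses = ["Influenza", "Lymphoma", "Bronchitis", "Influenza"]
bins = ["Bronchitis", "Influenza", "Lymphoma"]
noisy = private_histogram(diagnoses, bins, epsilon=0.5, rng=random.Random(0))
```

Smaller values of `epsilon` give stronger privacy and noisier counts. In the interactive model, each released answer consumes part of the privacy budget, which is why the number of queries is limited.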

  24. Differentially Private Interface • [Figure: diagram with components labeled Original Data, Query Workload, Strategy, Pre-designed Queries, Differentially Private Histogram, Differentially Private Interface, User, Queries, and Diff. Private Answers]

  25. HIDE Framework • Identifying and sensitive information extraction: uses a state-of-the-art CRF model to extract PHI and sensitive information • Data linking: provides a structured, patient-centric view of the data • De-identification and anonymization • Micro-data publication: uses data suppression and generalization to provide a k-anonymized view of the data • Macro-data publication: releases perturbed aggregated statistics from the patient-centric view

  26. HIDE

  27. Identifying and sensitive information extraction • Use a CRF classifier to extract information • Studied the impact of features including regular expressions, affixes, dictionaries, and context • Sampling techniques adjust the classifier for higher precision or recall
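
The four feature families could be computed per token roughly as follows. This is a hedged sketch: the feature names, the number pattern, and the tiny gender dictionary are all hypothetical, and the slides do not list the exact features HIDE uses.

```python
import re

NUMBER_RE = re.compile(r"^\d{1,3}$")   # regular-expression feature (hypothetical pattern)
GENDER_TERMS = {"male", "female"}      # dictionary feature (hypothetical word list)


def token_features(tokens, i):
    """Build a feature dict for tokens[i] from regex, affix, dictionary, and context cues."""
    tok = tokens[i]
    return {
        "lower": tok.lower(),
        "prefix3": tok[:3],                                  # affixes
        "suffix3": tok[-3:],
        "is_short_number": bool(NUMBER_RE.match(tok)),       # regular expression
        "in_gender_dict": tok.lower() in GENDER_TERMS,       # dictionary
        "prev": tokens[i - 1].lower() if i > 0 else "<BOS>", # context
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }


tokens = ["77", "year", "old", "female"]
first = token_features(tokens, 0)
last = token_features(tokens, 3)
```

A sequence of such dicts, one per token, is the kind of input a CRF toolkit would typically be trained on alongside the gold labels.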

  28. Example

      Token      Label
      77         B-age
      year       O
      old        O
      female     B-gender
      with       O
      history    O
      of         O
      B          B-disease
      -          I-disease
      cell       I-disease
      lymphoma   I-disease
      (          O
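
The labels follow the B/I/O convention: `B-` begins an entity, `I-` continues it, and `O` marks tokens outside any entity. A small sketch of turning such a labeled sequence back into entities is shown below. The function name is hypothetical, and tokens are simply joined with spaces.

```python
def bio_to_entities(tokens, labels):
    """Group B-/I- labeled tokens into (entity_type, text) pairs."""
    entities, current = [], None
    for token, label in zip(tokens, labels):
        if label.startswith("B-"):
            if current:
                entities.append(current)
            current = (label[2:], [token])
        elif label.startswith("I-") and current and current[0] == label[2:]:
            current[1].append(token)
        else:
            # "O", or an I- label that does not continue the open entity.
            if current:
                entities.append(current)
            current = None
    if current:
        entities.append(current)
    return [(etype, " ".join(toks)) for etype, toks in entities]


tokens = ["77", "year", "old", "female", "with", "history", "of",
          "B", "-", "cell", "lymphoma", "("]
labels = ["B-age", "O", "O", "B-gender", "O", "O", "O",
          "B-disease", "I-disease", "I-disease", "I-disease", "O"]
entities = bio_to_entities(tokens, labels)
```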
