HIDE: Privacy Preserving Medical Data Publishing James Gardner Department of Mathematics and Computer Science Emory University jgardn3@emory.edu
Motivation • De-identification is critical in any health informatics system • Research • Sharing • Need an easy-to-use interface and framework for data custodians and publishers • Understanding data is necessary to de-identify data
HIPAA 1. Names; 2. All geographical subdivisions smaller than a State, including street address, city, county, precinct, zip code, and their equivalent geocodes, except for the initial three digits of a zip code, if according to the current publicly available data from the Bureau of the Census: (1) The geographic unit formed by combining all zip codes with the same three initial digits contains more than 20,000 people; and (2) The initial three digits of a zip code for all such geographic units containing 20,000 or fewer people is changed to 000. 3. All elements of dates (except year) for dates directly related to an individual, including birth date, admission date, discharge date, date of death; and all ages over 89 and all elements of dates (including year) indicative of such age, except that such ages and elements may be aggregated into a single category of age 90 or older; 4. Phone numbers; 5. Fax numbers; 6. Electronic mail addresses; 7. Social Security numbers; 8. Medical record numbers; 9. Health plan beneficiary numbers; 10. Account numbers; 11. Certificate/license numbers; 12. Vehicle identifiers and serial numbers, including license plate numbers; 13. Device identifiers and serial numbers; 14. Web Universal Resource Locators (URLs); 15. Internet Protocol (IP) address numbers; 16. Biometric identifiers, including finger and voice prints; 17. Full face photographic images and any comparable images; and 18. Any other unique identifying number, characteristic, or code (note this does not mean the unique code assigned by the investigator to code the data)
PHI Summary • Protected Health Information (PHI) is defined by HIPAA as individually identifiable health information • Direct identifiers include name, SSN, etc. • Indirect identifiers include gender, age, address information, etc.
Research Challenges • Detect PHI in heterogeneous medical data • Apply structured anonymization principles on heterogeneous medical data (micro-privacy) • Release differentially private aggregated statistics (macro-privacy)
HIDE • Health Information DE-identification • Uses techniques from • Information Extraction • Data linking • Structured Anonymization • Differential Privacy • Data Mining
HIDE
Outline • Background and related work • Existing de-identification approaches • Named entity recognition • Privacy preserving data publishing • Proposed Work • HIDE framework • Identifying and sensitive information extraction • Micro-data publishing • Macro-data publishing • Software
Alternative Systems • Scrub System - rules and dictionaries are used to detect PHI • Semantic Lexicon System - rules and dictionaries are used to detect PHI • DE-ID - rules and dictionaries, developed at Pittsburgh and approved by IRB • Concept-Match Scrubber - removes every word not in an approved list of non-identifying terms • Carafe - uses a CRF to detect PHI
Limitations of Most Systems • Lack portability • Don ʼ t give formal privacy guarantees • Don ʼ t utilize the latest work from structured data anonymization • Focus only on removing PHI
Named Entity Recognition • Locate and classify atomic elements in text into predefined categories such as person, organization, location, expressions of time, quantities, etc. • NER systems can be classified into either: • Rule-based • Machine Learning-based
NER Examples • Part-of-speech (POS) Tagging • I/PRP think/VBP it/PRP ‘s/BES a/DT pretty/ RB good/JJ idea/NN ./. • Personal Health Identifier Detection • <age>77</age> year old <gender>female</ gender> with history of <disease>B-cell lymphoma</disease> (Marginal zone, <mrn>SH-04-4444</mrn>)
NER Metrics • Precision • TP / (TP + FP) • Recall • TP / (TP + FN)
Rule-based • Rely on hand-coded rules and dictionaries • Dictionaries can be used for terms in a closed class with an exhaustive list, e.g. geographic locations • Regular expressions are used to detect terms that follow certain syntactic patterns, e.g. phone numbers
Machine learning-based • Model the NER as a sequence labeling task where each token is assigned a label • Train classifiers to label each token • Classifiers use a list of features (or attributes) for training and classification of the sequence • Frequently applied classifiers are HMM, MEMM, SVM, and CRF
Conditional Random Field • A Conditional Random Field (CRF) provides a probabilistic framework for labeling and segmenting sequential data • A CRF defines a conditional probability of a label sequence given an observation sequence
Comparison • Rule-based • Accurate • Require experts to modify • Not portable • Machine learning-based • Accurate • Modification of models is done through training rather than “coding” • Portable
Privacy Preserving Data Publishing • Weak privacy (Micro) • release a modified version of each record according to a given anonymization principle • assumes level of background knowledge • Differential privacy (Macro) • release perturbed statistics that satisfy the differential privacy principle • no assumptions of background knowledge
Micro-data publishing • Prevent linking of records in separate databases • k-anonymization • Prevent discovery of sensitive values • l-diversity • Prevent discovery of presence or absence in a database • delta-presence
Micro-data publishing Name Age Gender Zipcode Diagnosis Henry 25 Male 53710 Influenza Table 1: Illustration of Anonymization Name Age Gender Zipcode Diagnosis Irene 28 Female 53712 Lymphoma Henry 25 Male 53710 Influenza Dan 28 Male 53711 Bronchitis Irene 28 Female 53712 Lymphoma Dan 28 Male 53711 Bronchitis Erica 26 Female 53712 Influenza Erica 26 Female 53712 Influenza Original Data Original Data Name Age Gender Zipcode Disease Name Age Gender Zipcode Disease [25 − 28] Male [53710-53711] Influenza ∗ [25 − 28] Male [53710-53711] Influenza ∗ [25 − 28] Female 53712 Lymphoma ∗ [25 − 28] Male [53710-53711] Bronchitis [25 − 28] Female 53712 Lymphoma ∗ ∗ [25 − 28] Female 53712 Influenza ∗ [25 − 28] Male [53710-53711] Bronchitis ∗ Anonymized Data [25 − 28] Female 53712 Influenza ∗ Anonymized Data
k-anonymization • Quasi identifier set • Sensitive attributes • Table is k-anonymous if every record has k-1 other records with the same quasi- identifier set • The probability of linking a victim to a specific record through QID is at most 1/k
l-diversity • Extension of k-anonymization • Also ensures that each group has at least l distinct sensitive values • Prevents disclosure of sensitive values
Macro-data publishing • Differential Privacy is a strong privacy notion • Requires that a randomized computation yields nearly identical output when performed on nearly identical input • Interactive model • limited to a specific number of queries • Non-interactive model • need query strategies to build noisy data cubes that maximize utility for a random query workload
Differentially Private Interface Query Workload Strategy Pre-designed Queries Differentially Queries Diff. Original User Private Private Data Interface Histogram Diff. Private Answers Answers
HIDE Framework • Identifying and Sensitive Information Extraction • uses state-of-the-art CRF model to extract PHI and sensitive information • Data linking • provides structured patient-centric view of the data • De-identification and Anonymization • Micro-data publication - uses data suppression and generalization to provide a k-anonymized view of the data • Macro-data publication - release perturbed aggregated statistics from the patient-centric view
HIDE
Identifying and sensitive information extraction • Use CRF classifier to extract information • Studied impact of features including: • regular expressions • affixes • dictionaries • context • Sampling techniques to adjust classifier for higher precision or recall
Example Token Label Token Label 77 B-age of O year O B B-disease old O - I-disease cell I-disease female B-gender with O lymphoma I-disease history O ( O
Recommend
More recommend