MULTILINGUAL AUTOMATED TEXT ANONYMIZATION Francisco Dias francisco.m.c.dias@tecnico.ulisboa.pt 03/06/2016 Instituto Superior Técnico Slide 1 of 52
- INTRODUCTION
- RELATED WORK
- ARCHITECTURE
- ANONYMIZATION METHODS
- EVALUATION
- INTEGRATING OUR SYSTEM
- CONCLUSION
INTRODUCTION
ANONYMIZATION
- From the Ancient Greek anónumos (transl.: “without name”);
- It suppresses names and sensitive information;
TEXT
- It processes data in the form of text;
- A text contains unstructured data;
AUTOMATED
- It runs without human intervention;
MULTILINGUAL
- It processes texts written in different languages.
MOTIVATION
- Sharing information in text form is important in some areas (clinical and scientific research, decision making, among others);
- Texts may contain private information, protected by law;
- In order to share information in text form, all sensitive information should be removed;
- Manual redaction is a hard and time-consuming task.
An automated anonymization system could help in this task.
CHALLENGE
- To implement a multilingual anonymization system:
  → STRING NLP Chain;
  → Unbabel Translation Pipeline;
- Support 4 languages: English, German, Portuguese, Spanish;
- Evaluate the anonymization system:
  → does it remove all sensitive information?
  → does it replace the same entities with the same label?
  → do the results look natural to a human reader?
RELATED WORK
RELATED WORK
- Most previous work is based on NER (named entity recognition) techniques;
- Previous work was evaluated on the detection of entities in the text;
- i2b2 launched two de-identification challenges in the past: 2006 and 2014.
RELATED WORK
- MITRE, Wellner et al., 2006
- Model-based and pattern-matching techniques;
- Best performance on the i2b2 2006 challenge;
RELATED WORK
- Szarvas et al. System, 2006
- Model-based classifiers in parallel and a voting module;
- Post-processing iteration in order to detect more candidates;
RELATED WORK
- Aramaki et al. System, 2006
- A CRF* classifier detects candidates for sensitive information;
- Label-consistency post-processing;
* CRF: Conditional Random Fields
RELATED WORK
- HIDE, Gardner et al., 2008
- A CRF classifier detects candidates for sensitive information;
- Uses coreferences in order to detect more candidates;
RELATED WORK
- “Nottingham System”, Yang & Garibaldi, 2014
- Model-based (CRF) and pattern-matching techniques;
- It uses coreferences in order to detect more candidates;
- Best performance on the i2b2 2014 challenge;
ARCHITECTURE
ARCHITECTURE
- Pipeline with 5 modules;
- Based on NER techniques;
- Post-processing and coreference modules;
Pipeline: Pre-processing → NER → Second-pass Detection → Coreference Resolution → Anonymization
ARCHITECTURE
- The NER module detects sensitive information contained in the text;
- It is composed of several parallel components;
Components: Main NER Classifier, Pattern-matching, Parallel NER Classifier 1, Parallel NER Classifier 2 → Voting
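The slide does not say how the voting step combines the parallel components. As a minimal sketch, assuming token-level labels and a simple majority rule in which ties go to the main classifier (both assumptions, as are all names below), the combination could look like this:

```python
from collections import Counter


def vote(predictions, main_index=0):
    """Combine per-token labels from several NER components by majority vote.

    predictions: list of label sequences, one per component, all of equal length.
    Ties are resolved in favour of the component at main_index (an assumption,
    not something stated on the slide).
    """
    combined = []
    for labels in zip(*predictions):
        counts = Counter(labels)
        best = max(counts.values())
        winners = [label for label, n in counts.items() if n == best]
        if len(winners) == 1:
            combined.append(winners[0])
        elif labels[main_index] in winners:
            combined.append(labels[main_index])
        else:
            # Deterministic fallback when the main component is not among the winners.
            combined.append(sorted(winners)[0])
    return combined


# Hypothetical outputs of three components for a three-token sentence.
main_ner = ["PERSON", "O", "LOCATION"]
pattern = ["PERSON", "O", "O"]
parallel = ["O", "O", "LOCATION"]
voted = vote([main_ner, pattern, parallel])
```

Here each token keeps the label that at least two of the three components agree on.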
ARCHITECTURE
SECOND-PASS DETECTION
- Post-processing step: corrections over the NER results;
- It applies Short-forms and Label Consistency;
COREFERENCE RESOLUTION
- Groups named entities into mentions;
- All entities grouped in a mention refer to the same extra-linguistic object;
ANONYMIZATION MODULE
- Implements the anonymization methods;
- Returns an anonymized text and a table of solutions.
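The slide names label consistency without describing it. A common reading of the term is that a string labelled as an entity somewhere in a document receives the same label wherever else it occurs unlabelled. The sketch below illustrates that reading only, under the simplifying assumption of single-token entities; the function name and the "O" (outside) label are hypothetical.

```python
def label_consistency(tokens, labels):
    """Give unlabelled tokens the label their surface form received elsewhere.

    tokens and labels are parallel lists; "O" marks a token without a label.
    The first label seen for a surface form wins (an assumption).
    """
    known = {}
    for token, label in zip(tokens, labels):
        if label != "O":
            known.setdefault(token, label)
    return [
        known.get(token, label) if label == "O" else label
        for token, label in zip(tokens, labels)
    ]


tokens = ["Lisbon", "is", "sunny", ".", "I", "love", "Lisbon"]
labels = ["LOCATION", "O", "O", "O", "O", "O", "O"]
corrected = label_consistency(tokens, labels)
```

The second occurrence of "Lisbon", missed by the first pass, picks up the LOCATION label from the first occurrence.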
ANONYMIZATION METHODS
ANONYMIZATION METHODS
- The methods obfuscate original entities in the text using replacement tags or entities;
- We implemented 4 anonymization methods:
  - Suppression: Lisbon → *****
  - Tagging: Lisbon → [LOCATION]
  - Random Substitution: Lisbon → Cairo
  - Generalization: Lisbon → City
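To make the two simplest methods concrete, here is a minimal sketch of suppression and tagging applied to already-detected character spans. The span format, the function name and the returned list of (original, replacement) pairs are illustrative assumptions; the system's actual interfaces are not shown in the slides.

```python
def anonymize(text, entities, method="tagging"):
    """Replace detected entity spans in text.

    entities: non-overlapping (start, end, entity_class) character spans.
    Returns the anonymized text and a list of (original, replacement) pairs.
    """
    pieces, solutions, cursor = [], [], 0
    for start, end, entity_class in sorted(entities):
        original = text[start:end]
        if method == "suppression":
            replacement = "*****"
        elif method == "tagging":
            replacement = f"[{entity_class.upper()}]"
        else:
            raise ValueError(f"unsupported method: {method}")
        pieces.append(text[cursor:start])
        pieces.append(replacement)
        solutions.append((original, replacement))
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces), solutions


text = "Maria flew to Lisbon."
spans = [(0, 5, "person"), (14, 20, "location")]
tagged, table = anonymize(text, spans, method="tagging")
suppressed, _ = anonymize(text, spans, method="suppression")
```

Random substitution and generalization need extra resources (a list of replacement entities and a knowledge base, respectively), so they are left out of this sketch.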
RANDOM SUBSTITUTION
- Random substitution replaces an entity with another random entity of the same class and with the same morphosyntactic features;
- Morphosyntactic features are determined by the headword;
  ● Aeroporto (masc, sing) → Recinto (masc, sing)
  ● Frankfurts (masc, sing, genitive) → Wegs (masc, sing, genitive)
- Random entities are looked up from a default list of entities;
Language  Class     Number    Gender     Case        Term
PT        location  singular  masculine  -           recinto
ES        location  singular  feminine   -           arena
EN        location  singular  -          -           venue
DE        location  singular  neuter     nominative  Wahrzeichen
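The lookup described above can be sketched as a filter over the default list followed by a random choice. The four entries below are the ones shown on the slide; the `Entry` type, the function name and the exact-match rule on every feature are assumptions made for illustration.

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Entry:
    language: str
    entity_class: str
    number: str
    gender: Optional[str]  # None where the slide leaves the cell empty
    case: Optional[str]
    term: str


DEFAULT_LIST = [
    Entry("PT", "location", "singular", "masculine", None, "recinto"),
    Entry("ES", "location", "singular", "feminine", None, "arena"),
    Entry("EN", "location", "singular", None, None, "venue"),
    Entry("DE", "location", "singular", "neuter", "nominative", "Wahrzeichen"),
]


def random_substitute(language, entity_class, number, gender=None, case=None,
                      entries=DEFAULT_LIST, rng=random):
    """Pick a random term whose class and morphosyntactic features all match.

    Returns None when the list has no entry with the requested features.
    """
    wanted = (language, entity_class, number, gender, case)
    matches = [
        e.term for e in entries
        if (e.language, e.entity_class, e.number, e.gender, e.case) == wanted
    ]
    return rng.choice(matches) if matches else None


pt_term = random_substitute("PT", "location", "singular", gender="masculine")
de_term = random_substitute("DE", "location", "singular", "neuter", "nominative")
missing = random_substitute("DE", "location", "singular", "masculine", "genitive")
```

With only one entry per feature combination the choice is deterministic; a fuller list would return any of the matching terms.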
GENERALIZATION
- Generalization is any method of replacing an entity with another that refers to an item of the same type, but in a more general way;
- This method accesses a Knowledge Base in order to retrieve the superclasses of a given entity.
Example: City is the superclass of Berlin, London, Lisbon and Madrid.
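A minimal sketch of the superclass lookup, using a toy in-memory dictionary in place of the Knowledge Base. The city entries come from the slide; the extra "City" → "Location" level, the function names and the `level` parameter are hypothetical additions for illustration.

```python
# Toy stand-in for the Knowledge Base: each key maps to its direct superclass.
TOY_KB = {
    "Berlin": "City",
    "London": "City",
    "Lisbon": "City",
    "Madrid": "City",
    "City": "Location",  # hypothetical extra level, not on the slide
}


def superclasses(entity, kb=TOY_KB):
    """Return the chain of superclasses of entity, nearest first."""
    chain, seen = [], {entity}
    while entity in kb and kb[entity] not in seen:
        entity = kb[entity]
        chain.append(entity)
        seen.add(entity)
    return chain


def generalize(entity, kb=TOY_KB, level=1):
    """Replace entity with its superclass `level` steps up, or None if unknown."""
    if level < 1:
        raise ValueError("level must be at least 1")
    chain = superclasses(entity, kb)
    if not chain:
        return None
    return chain[min(level, len(chain)) - 1]


general = generalize("Lisbon")
more_general = generalize("Lisbon", level=2)
unknown = generalize("Porto")
```

Returning None for unknown entities leaves the caller free to fall back on another method, such as tagging.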
EVALUATION
EVALUATION
1) Detection of sensitive information (named entities)
   → does it remove all sensitive information?
2) Performance of the coreference between entities
   → does it replace the same entities with the same label?
3) Adequacy of the replacements
   → do the results look natural to a human reader?
Evaluation in previous studies on anonymization focused on:
- the detection of entities in a text (we also evaluate points 2 and 3);
- clinical report text style (we target various text styles).
DATASETS
- We target different text domains and languages.
- We use corpora divided into documents from 2 different sources, with different text domains for each language:
  English: CoNLL 2003 + DCEP
  German: CoNLL 2003 + DCEP
  Portuguese: Segundo HAREM + DCEP
  Spanish: CoNLL 2002 + DCEP
- DCEP reports were manually annotated for named entities;
- All datasets were manually annotated for coreference between entities.
DETECTION OF SENSITIVE INFORMATION
- Intrinsic evaluation of the performance of NER: F1-score (also recall);
- 3 classes of entities: Location, Organization and Person;
- 5 configurations:
  - Baseline;
  - Baseline + Pattern-matching;
  - Baseline + Second-pass Detection;
  - Baseline + Parallel NER classifier;
  - Baseline + all previous configurations;
- Statistically different results from the baseline.
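For reference, precision, recall and F1-score can be computed from sets of gold and predicted entities as sketched below. Exact matching on (start, end, class) is an assumption here; the slides do not state the matching criterion, and the example spans are made up.

```python
def precision_recall_f1(gold, predicted):
    """Entity-level scores with exact matching on (start, end, class)."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up example: one entity found correctly, one with the wrong class, one missed.
gold = [(0, 5, "Person"), (14, 20, "Location"), (30, 37, "Organization")]
predicted = [(0, 5, "Person"), (14, 20, "Organization")]
precision, recall, f1 = precision_recall_f1(gold, predicted)
```

Recall is reported separately because, for anonymization, a missed entity (a leak) usually costs more than a spurious one.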
DETECTION OF SENSITIVE INFORMATION
- Baseline performance depends on the text domain;
- Gazetteers significantly* improve recall;
- Second-pass significantly* improves F1-score (on some datasets), but also adds false positives;
- Parallel NER improves performance (same training text domain), but not significantly* when compared with Second-pass on CoNLL;
- All modules together improve F1-score only on DCEP;
* p < 0.01, compared with baseline
COMPARISON WITH I2B2 RESULTS
- 2006: 5th, against 6 other systems
- 2014: 5th, against 10 other systems
COREFERENCE RESOLUTION
- Baseline: no coreference;
- Metrics: B-Cubed Score;
- Results depend on the language and text domain;
- Performance of coreference resolution is satisfactory:
  - Precision close to 1.0;
  - Recall much higher than the baseline.
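The B-Cubed score averages a per-item precision and recall over all items being clustered. The sketch below assumes that the gold and system clusterings cover the same set of items; the identifiers are made up. It also shows why a "no coreference" baseline, which leaves every item in its own cluster, reaches perfect precision but low recall.

```python
def b_cubed(gold_clusters, response_clusters):
    """B-Cubed precision and recall, assuming both clusterings cover the same items."""
    gold_of = {m: frozenset(c) for c in gold_clusters for m in c}
    resp_of = {m: frozenset(c) for c in response_clusters for m in c}
    items = list(gold_of)
    precision = sum(
        len(gold_of[m] & resp_of[m]) / len(resp_of[m]) for m in items
    ) / len(items)
    recall = sum(
        len(gold_of[m] & resp_of[m]) / len(gold_of[m]) for m in items
    ) / len(items)
    return precision, recall


# Made-up example: three items share a referent, a fourth stands alone.
gold = [{"m1", "m2", "m3"}, {"m4"}]
no_coreference = [{"m1"}, {"m2"}, {"m3"}, {"m4"}]  # baseline: all singletons
baseline_p, baseline_r = b_cubed(gold, no_coreference)
perfect_p, perfect_r = b_cubed(gold, gold)
```

In this toy case the baseline scores a precision of 1.0 and a recall of 0.5, so any gain from coreference resolution shows up in recall.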
ANONYMIZATION
- Metrics: availability and relevance of a substitution;
- Effects of anonymization on the coreference of entities;
- The relevance of a substitution within a context was measured using human raters, as the ratio: