MULTILINGUAL AUTOMATED TEXT ANONYMIZATION Francisco Dias francisco.m.c.dias@tecnico.ulisboa.pt 03/06/2016 Instituto Superior Técnico Slide 1 of 52

- INTRODUCTION
- RELATED WORK
- ARCHITECTURE
- ANONYMIZATION METHODS
- EVALUATION
- INTEGRATING OUR SYSTEM
- CONCLUSION

INTRODUCTION
ANONYMIZATION
- From the Ancient Greek anónumos ("without name");
- It suppresses names and sensitive information;
TEXT
- It processes data in the form of text;
- A text contains unstructured data;
AUTOMATED
- It runs without human intervention;
MULTILINGUAL
- It processes texts written in different languages.

MOTIVATION
- Sharing information in text form is important in some areas (clinical and scientific research, decision making, among others);
- Texts may contain private information, protected by law;
- In order to share information in text form, all sensitive information should be removed;
- Manual redaction is a hard and time-consuming task.
An automated anonymization system could help with this task.

CHALLENGE
- To implement a multilingual anonymization system:
  → STRING NLP Chain;
  → Unbabel Translation Pipeline;
- Support 4 languages: English, German, Portuguese, Spanish;
- Evaluate the anonymization system:
  → does it remove all sensitive information?
  → does it replace the same entities by the same label?
  → do the results look natural to a human reader?

RELATED WORK

RELATED WORK
- Most previous work is based on NER (named entity recognition) techniques;
- Previous work was evaluated on the detection of entities in the text;
- i2b2 launched two de-identification challenges in the past: 2006 and 2014.

RELATED WORK
- MITRE, Wellner et al., 2006
  - Model-based and pattern-matching techniques;
  - Best performance in the i2b2 2006 challenge.

RELATED WORK
- Szarvas et al. system, 2006
  - Model-based classifiers in parallel and a voting module;
  - Post-processing iteration in order to detect more candidates.

RELATED WORK
- Aramaki et al. system, 2006
  - A CRF* classifier detects candidate sensitive information;
  - Label-consistency post-processing.
* CRF: Conditional Random Fields

RELATED WORK
- HIDE, Gardner et al., 2008
  - A CRF classifier detects candidate sensitive information;
  - Uses coreferences in order to detect more candidates.

RELATED WORK
- "Nottingham System", Yang & Garibaldi, 2014
  - Model-based (CRF) and pattern-matching techniques;
  - Uses coreferences in order to detect more candidates;
  - Best performance in the i2b2 2014 challenge.

ARCHITECTURE

ARCHITECTURE
- Pipeline with 5 modules;
- Based on NER techniques;
- Post-processing and coreference modules.
Pipeline: Pre-processing → NER → Second-pass Detection → Coreference Resolution → Anonymization

ARCHITECTURE
- The NER module detects sensitive information contained in the text;
- It is composed of several parallel components.
Components: Main NER Classifier, Pattern-matching, Parallel NER Classifier 1, Parallel NER Classifier 2 → Voting
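A minimal sketch of how a voting step over parallel detectors could work. The slides do not state the voting rule, so the majority vote and the tie-break in favour of the main classifier are assumptions; the function name `vote` and the sample labels are hypothetical.

```python
from collections import Counter


def vote(predictions, main_index=0):
    """Combine per-token labels from several detectors by majority vote.

    predictions: list of label sequences (one per detector, equal length).
    Ties go to the detector at main_index; this tie-break is an assumption.
    """
    combined = []
    for labels in zip(*predictions):
        counts = Counter(labels)
        best = max(counts.values())
        winners = [label for label, n in counts.items() if n == best]
        main_label = labels[main_index]
        combined.append(main_label if main_label in winners else winners[0])
    return combined


# Hypothetical outputs for the tokens "Maria", "visited", "Lisbon".
main_ner = ["PER", "O", "LOC"]
pattern_matching = ["PER", "O", "O"]
parallel_ner_1 = ["O", "ORG", "LOC"]
result = vote([main_ner, pattern_matching, parallel_ner_1])
print(result)  # ['PER', 'O', 'LOC']
```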

ARCHITECTURE
SECOND-PASS DETECTION
- Post-processing step: corrections over the NER results;
- It applies Short-forms and Label-Consistency;
COREFERENCE RESOLUTION
- Groups named entities into mentions;
- Each mention refers to the same extra-linguistic object;
ANONYMIZATION MODULE
- Implements the anonymization methods;
- Returns an anonymized text and a table of solutions.
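A rough sketch of what a second pass could look like. The slides only name the two techniques, so the interpretation here is an assumption: a short form is a single token of an already detected entity (e.g. "Dias" for "Francisco Dias"), and label consistency means an unlabeled repeat receives the label of the detected occurrence. The function `second_pass` and the sample sentence are hypothetical.

```python
def second_pass(tokens, labels):
    """Propagate labels from detected entities to unlabeled repeats.

    tokens, labels: parallel lists; "O" marks a token outside any entity.
    Every capitalized token of a detected entity is remembered with its label
    and reused wherever the same token was left unlabeled (an assumption).
    """
    known = {}
    for token, label in zip(tokens, labels):
        if label != "O" and token[:1].isupper():
            known.setdefault(token, label)
    return [
        known.get(token, "O") if label == "O" else label
        for token, label in zip(tokens, labels)
    ]


tokens = ["Francisco", "Dias", "spoke", ".", "Later", "Dias", "left", "."]
labels = ["PER", "PER", "O", "O", "O", "O", "O", "O"]
corrected = second_pass(tokens, labels)
print(corrected)
# ['PER', 'PER', 'O', 'O', 'O', 'PER', 'O', 'O']
```

A propagation rule this simple can also label tokens that are not entities in their new context, so false positives are to be expected.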

ANONYMIZATION METHODS

ANONYMIZATION METHODS
- The methods obfuscate the original entities in the text using replacement tags or entities;
- We implemented 4 anonymization methods:
  - Suppression: Lisbon → *****
  - Tagging: Lisbon → [LOCATION]
  - Random Substitution: Lisbon → Cairo
  - Generalization: Lisbon → City
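The two simplest methods, suppression and tagging, can be sketched as follows. This is an illustration only: the character-span input format and the names `anonymize` and `span_of` are assumptions, not the interface of the system.

```python
def anonymize(text, spans, method="tagging"):
    """Replace each (start, end, label) character span in text.

    "suppression" writes a fixed mask; "tagging" writes the class label.
    Spans are processed right to left so earlier offsets stay valid.
    """
    if method not in ("suppression", "tagging"):
        raise ValueError(f"unsupported method: {method}")
    result = text
    for start, end, label in sorted(spans, reverse=True):
        replacement = "*****" if method == "suppression" else f"[{label}]"
        result = result[:start] + replacement + result[end:]
    return result


def span_of(text, surface, label):
    """Helper for the example: locate the first occurrence of surface."""
    start = text.index(surface)
    return (start, start + len(surface), label)


text = "She moved from Lisbon to Madrid."
spans = [span_of(text, "Lisbon", "LOCATION"), span_of(text, "Madrid", "LOCATION")]
print(anonymize(text, spans, "suppression"))  # She moved from ***** to *****.
print(anonymize(text, spans, "tagging"))      # She moved from [LOCATION] to [LOCATION].
```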

RANDOM SUBSTITUTION
- Random substitution replaces an entity by another random entity of the same class and with the same morphosyntactic features;
- Morphosyntactic features are determined by the headword:
  ● Aeroporto (masc, sing) → Recinto (masc, sing)
  ● Frankfurts (masc, sing, genitive) → Wegs (masc, sing, genitive)
- Random entities are looked up from a default list of entities:

Language  Class     Number    Gender     Case        Term
PT        location  singular  masculine              recinto
ES        location  singular  feminine               arena
EN        location  singular                         venue
DE        location  singular  neuter     nominative  Wahrzeichen
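A minimal sketch of the lookup in the default list, using the four sample rows from the slide. The record layout, the exact-match rule on all features and the name `random_substitute` are assumptions made for illustration.

```python
import random

# Sample rows from the slide; None marks a feature the language does not use.
DEFAULT_ENTITIES = [
    {"lang": "PT", "cls": "location", "number": "singular",
     "gender": "masculine", "case": None, "term": "recinto"},
    {"lang": "ES", "cls": "location", "number": "singular",
     "gender": "feminine", "case": None, "term": "arena"},
    {"lang": "EN", "cls": "location", "number": "singular",
     "gender": None, "case": None, "term": "venue"},
    {"lang": "DE", "cls": "location", "number": "singular",
     "gender": "neuter", "case": "nominative", "term": "Wahrzeichen"},
]


def random_substitute(lang, cls, number, gender=None, case=None, rng=random):
    """Pick a random default term whose class and features all match.

    Returns None when the list has no matching entry.
    """
    wanted = (lang, cls, number, gender, case)
    candidates = [
        entry["term"] for entry in DEFAULT_ENTITIES
        if (entry["lang"], entry["cls"], entry["number"],
            entry["gender"], entry["case"]) == wanted
    ]
    return rng.choice(candidates) if candidates else None


# "Aeroporto" has a masculine singular headword, so it maps to "recinto".
print(random_substitute("PT", "location", "singular", gender="masculine"))
```

With only one matching row per language the choice is deterministic here; a larger list would make the replacement vary between runs.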

GENERALIZATION
- Generalization is any method of replacing an entity by another that refers to an item of the same type but in a more general way;
- This method accesses a Knowledge Base in order to retrieve the superclasses of a given entity.
Example: City is the superclass of Berlin, London, Lisbon and Madrid.
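A toy sketch of the superclass lookup. The knowledge base is replaced here by a small dictionary; the City entries come from the slide, while the "City" → "Location" link and the names `PARENT`, `superclasses` and `generalize` are hypothetical.

```python
# Toy stand-in for the Knowledge Base: item -> direct superclass.
PARENT = {
    "Berlin": "City",
    "London": "City",
    "Lisbon": "City",
    "Madrid": "City",
    "City": "Location",  # hypothetical second level, for illustration
}


def superclasses(entity, kb=PARENT):
    """Return the chain of superclasses, most specific first."""
    chain = []
    seen = {entity}
    current = entity
    while current in kb and kb[current] not in seen:
        current = kb[current]
        seen.add(current)
        chain.append(current)
    return chain


def generalize(entity, level=1, kb=PARENT):
    """Replace entity by its superclass at the given level, if there is one."""
    chain = superclasses(entity, kb)
    return chain[level - 1] if 0 < level <= len(chain) else None


print(generalize("Lisbon"))    # City
print(superclasses("Lisbon"))  # ['City', 'Location']
```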

EVALUATION

EVALUATION
1) Detection of sensitive information (named entities)
   → does it remove all sensitive information?
2) Performance of the coreference between entities
   → does it replace the same entities by the same label?
3) Adequacy of the replacements
   → do the results look natural to a human reader?
Evaluation in previous studies on anonymization focused on:
- detection of entities in a text (we also evaluate points 2 and 3);
- clinical report text style (we target various text styles).

DATASETS
- We target different text domains and languages.
- We use corpora divided into documents from 2 different sources, with different text domains, for each language:
  English: CoNLL 2003 + DCEP
  German: CoNLL 2003 + DCEP
  Portuguese: Segundo HAREM + DCEP
  Spanish: CoNLL 2002 + DCEP
- DCEP reports were manually annotated for named entities;
- All datasets were manually annotated for coreference between entities.

DETECTION OF SENSITIVE INFORMATION
- Intrinsic evaluation of the performance of NER: F1-score (also recall);
- 3 classes of entities: Location, Organization and Person;
- 5 configurations:
  - Baseline;
  - Baseline + Pattern-matching;
  - Baseline + Second-pass Detection;
  - Baseline + Parallel NER classifier;
  - Baseline + all previous configurations;
- Statistically different results from the baseline.
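For reference, precision, recall and F1-score over detected entities can be computed as in the sketch below. Exact matching of (start, end, class) spans is an assumption; the slides do not say which matching criterion was used, and the sample spans are invented.

```python
def precision_recall_f1(gold, predicted):
    """Score predicted entity spans against gold spans (exact match)."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical spans: (start, end, class).
gold = [(0, 6, "PER"), (15, 21, "LOC"), (30, 34, "ORG"), (40, 46, "LOC")]
predicted = [(0, 6, "PER"), (15, 21, "LOC"), (50, 55, "ORG")]
p, r, f1 = precision_recall_f1(gold, predicted)
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.667 0.5 0.571
```

Recall is reported alongside F1-score because a missed entity leaves sensitive information in the text.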

DETECTION OF SENSITIVE INFORMATION
- Baseline performance depends on the text domain;
- Gazetteers significantly* improve recall;
- Second-pass significantly* improves F1-score (on some datasets), but also adds false positives;
- Parallel NER improves the performance (when trained on the same text domain), but not significantly* when compared with Second-pass on CoNLL;
- Combining all modules improves F1-score only on DCEP.
* p < 0.01, compared with the baseline

COMPARING WITH I2B2 RESULTS
- 2006: 5th place, against 6 other systems;
- 2014: 5th place, against 10 other systems.

COREFERENCE RESOLUTION
- Baseline: no coreference;
- Metric: B-Cubed score;
- Results depend on the language and text domain;
- Performance of coreference resolution is satisfactory:
  - Precision close to 1.0;
  - Recall much higher than the baseline.

ANONYMIZATION
- Metrics: availability and relevance of a substitution;
- Effects of anonymization on the coreference of entities;
- The relevance of a substitution within a context was measured using human raters, as the ratio:
