Institut für Maschinelle Sprachverarbeitung
Text Mining on Clinical Data
Robert McHardy
Outline
• Motivation
• Medical Entity Recognition
• Anonymization of Medical Reports
• Knowledge-based Biomedical Word Sense Disambiguation
• Extraction of Potential Adverse Drug Events
• Resources
Universität Stuttgart 5.12.2017 2
Motivation — Different Users
Motivation — Why do we need Text Mining on Clinical Data?
• Doctors need to know if a drug is safe to use or not
  • As fast as possible
• We don't want to suffer from unsafe drugs
• Researchers want to use the data
  • It has to be anonymized
Motivation — PubMed, again!
Unified Medical Language System Metathesaurus (UMLS)
Medical Entity Recognition — Overview
• Abacha and Zweigenbaum: medical entity recognition consists of two parts
  • Detecting phrases that refer to medical entities
  • Assigning semantic categories to the detected entities
Medical Entity Recognition — Overview
Example phrases: Type 1 diabetes, T1D, Diabetes type 1, IDDM, Juvenile diabetes
Medical Entity Recognition — Noun Phrase Chunking
Pharmacodynamic studies, including positron-emission tomography (PET) and computed tomography (CT) […]
• Many tools for NP chunking are available
• Maximum recall is desired
• Open-domain tools like IMS's TreeTagger are suitable
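To make the chunking step concrete, here is a minimal sketch of rule-based noun phrase chunking over POS-tagged tokens. It is not TreeTagger and not the method from the talk: the tag set (Penn Treebank style), the hand-assigned tags, and the function name `chunk_noun_phrases` are all assumptions made for illustration.

```python
# Minimal sketch: group maximal runs of determiner/adjective/noun tokens
# into noun phrases. Tags are Penn Treebank style and assigned by hand here;
# a real pipeline would obtain them from a tagger.
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}
NP_TAGS = NOUN_TAGS | {"DT", "JJ", "JJR", "JJS"}


def chunk_noun_phrases(tagged):
    """Return maximal DT/JJ/NN runs that contain at least one noun."""
    phrases, current = [], []
    # The sentinel pair at the end flushes a run that reaches the last token.
    for token, tag in list(tagged) + [("", "END")]:
        if tag in NP_TAGS:
            current.append((token, tag))
            continue
        if any(t in NOUN_TAGS for _, t in current):
            phrases.append(" ".join(tok for tok, _ in current))
        current = []
    return phrases


tagged_sentence = [
    ("Pharmacodynamic", "JJ"), ("studies", "NNS"), (",", ","),
    ("including", "VBG"), ("positron-emission", "NN"), ("tomography", "NN"),
    ("(", "("), ("PET", "NNP"), (")", ")"), ("and", "CC"),
    ("computed", "JJ"), ("tomography", "NN"),
    ("(", "("), ("CT", "NNP"), (")", ")"),
]
noun_phrases = chunk_noun_phrases(tagged_sentence)
print(noun_phrases)
```

The rule is deliberately permissive, in line with the goal of maximum recall: later filtering steps can discard phrases that turn out not to be medical entities.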
Medical Entity Recognition — MetaMap and the UMLS
• MetaMap is a tool that maps noun phrases in raw text to UMLS concepts
• The mapping is chosen according to a matching score
Medical Entity Recognition — MetaMap and the UMLS
• Three problems with MetaMap
  • Noun phrase chunking performs worse than with specialized NLP tools
  • Medical entity detection often returns verbs and general words that aren't medical entities (MEs)
  • Some ambiguity remains
    • UMLS can provide several concepts for a term
    • and several semantic categories for a concept
Medical Entity Recognition — MetaMap and the UMLS
Pharmacodynamic studies, including positron-emission tomography (PET) and computed tomography (CT) […]
• Cold temperature
• Common cold
• Cold ( term)
• Cold storage ( term)
• Cold storage
• Chronic obstructive lung disease
Medical Entity Recognition — MetaMap+
• Use tools like TreeTagger for the NP chunking
• Filter NPs with a stop-word list
• Search in specialized lists for candidate terms
• Annotate entities with MetaMap
• Filter frequent errors and overly broad semantic types
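The filtering steps above can be sketched as a small pipeline. Everything in this example is hypothetical: the stop-word list, the specialized term list, the set of "too broad" semantic types, and the `TOY_ANNOTATIONS` table that stands in for a real MetaMap call. It also commits to one reading of the stop-word step (strip stop words, drop phrases that become empty), which the slides do not specify.

```python
# Hypothetical resources; none of these come from the talk or from MetaMap.
STOP_WORDS = {"the", "a", "an", "this", "these"}
SPECIALIZED_TERMS = {"pet": "Test", "ct": "Test"}
BROAD_TYPES = {"Qualitative Concept", "Functional Concept"}

# Stand-in for a MetaMap lookup: phrase -> candidate semantic types (made up).
TOY_ANNOTATIONS = {
    "computed tomography": ["Diagnostic Procedure"],
    "studies": ["Functional Concept"],
}


def strip_stop_words(phrase):
    """Remove stop-word tokens from a noun phrase."""
    return " ".join(t for t in phrase.split() if t.lower() not in STOP_WORDS)


def annotate(noun_phrases):
    """Apply stop-word filter, list lookup, annotation, and type filter."""
    results = {}
    for noun_phrase in noun_phrases:
        phrase = strip_stop_words(noun_phrase)
        if not phrase:
            continue  # the phrase consisted only of stop words
        key = phrase.lower()
        if key in SPECIALIZED_TERMS:
            results[phrase] = [SPECIALIZED_TERMS[key]]
            continue
        # Keep only semantic types that are not considered too broad.
        types = [t for t in TOY_ANNOTATIONS.get(key, []) if t not in BROAD_TYPES]
        if types:
            results[phrase] = types
    return results


annotations = annotate(["the computed tomography", "PET", "these studies", "the"])
print(annotations)
```

In this toy run, "these studies" is dropped because its only candidate type is in the broad set, and "the" is dropped by the stop-word filter.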
Medical Entity Recognition — MetaMap+
• Voting mechanism to disambiguate semantic categories
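The slides do not say how the votes are cast or counted. As a rough illustration only, the sketch below assumes a plain majority vote over the candidate categories proposed for one entity; the function name `vote` and the tie-breaking rule (first category seen wins) are assumptions.

```python
from collections import Counter


def vote(candidate_categories):
    """Return the most frequent category, or None if there are no candidates.

    Ties are broken in favor of the category that appears first in the input,
    which is how Counter.most_common orders equal counts.
    """
    if not candidate_categories:
        return None
    category, _count = Counter(candidate_categories).most_common(1)[0]
    return category


winner = vote(["Problem", "Test", "Problem"])
print(winner)
```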
Medical Entity Recognition — Support Vector Machines (SVMs)
• Word level features:
  • words of the NP
  • number of words of the NP
  • window of words around the NP
• Orthographical features:
  • first letter capitalized
  • all letters upper-/lowercase
  • contains abbreviation(s)
• POS tags
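The feature list above could be extracted along the following lines. This is a sketch under assumptions: the function name `np_features`, the window size of 2, and the abbreviation heuristic (an all-uppercase token of at least two characters) are not taken from the talk.

```python
def np_features(tokens, pos_tags, start, end, window=2):
    """Build a feature dict for the noun phrase tokens[start:end]."""
    np_tokens = tokens[start:end]
    return {
        # Word level features
        "words": [t.lower() for t in np_tokens],
        "num_words": len(np_tokens),
        "left_context": tokens[max(0, start - window):start],
        "right_context": tokens[end:end + window],
        # Orthographical features
        "first_letter_capitalized": np_tokens[0][0].isupper(),
        "all_upper": all(t.isupper() for t in np_tokens),
        "all_lower": all(t.islower() for t in np_tokens),
        # Assumed heuristic: an all-uppercase token of length >= 2
        "contains_abbreviation": any(
            len(t) >= 2 and t.isupper() for t in np_tokens
        ),
        # POS tags of the phrase
        "pos_tags": pos_tags[start:end],
    }


tokens = ["Pharmacodynamic", "studies", ",", "including",
          "positron-emission", "tomography"]
pos_tags = ["JJ", "NNS", ",", "VBG", "NN", "NN"]
features = np_features(tokens, pos_tags, 4, 6)
print(features)
```

An SVM needs numeric input, so a dict like this would still have to be converted into a vector (for example by one-hot encoding the symbolic values) before training.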
Medical Entity Recognition — BIO-CRFs
Pharmacodynamic studies, including positron-emission tomography (PET) and computed tomography (CT) […]
• Words are annotated with the tags B, I and O
  • B-x: Beginning of a phrase of class x
  • I-x: Intermediate part of a phrase of class x
  • O: Outside entities
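The tagging scheme can be illustrated with a short conversion from entity spans to BIO tags. The span format (start index, exclusive end index, class) and the choice of the class "Test" for the two imaging procedures are assumptions made for this example.

```python
def to_bio(tokens, spans):
    """Convert (start, end, label) spans into one BIO tag per token.

    `end` is exclusive. Tokens not covered by any span are tagged "O".
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"          # first token of the phrase
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # remaining tokens of the phrase
    return tags


tokens = ["including", "positron-emission", "tomography",
          "and", "computed", "tomography"]
spans = [(1, 3, "Test"), (4, 6, "Test")]   # assumed annotation
bio_tags = to_bio(tokens, spans)
for token, tag in zip(tokens, bio_tags):
    print(f"{token}\t{tag}")
```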
Medical Entity Recognition — BIO-CRFs
• Word level features:
  • The word itself
  • Window of words
  • Lemmas
• Orthographical features:
  • Upper/lowercase
  • contains a digit
  • pre- and suffixes
• POS tags
• (Semantic category of word (provided by MetaMap+))
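For the CRF, features are computed per token rather than per noun phrase. The sketch below shows one possible encoding; the function name `token_features`, the window of one word on each side, the affix length of 3, and the `<BOS>`/`<EOS>` boundary markers are assumptions. Lemmas, POS tags, and the optional semantic categories are passed in precomputed, since producing them requires external tools.

```python
def token_features(tokens, lemmas, pos_tags, i, semantic=None):
    """Build a feature dict for the token at position i."""
    word = tokens[i]
    feats = {
        # Word level features
        "word": word.lower(),
        "lemma": lemmas[i],
        "prev_word": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next_word": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
        # Orthographical features
        "is_upper": word.isupper(),
        "is_lower": word.islower(),
        "has_digit": any(c.isdigit() for c in word),
        "prefix3": word[:3],
        "suffix3": word[-3:],
        # POS tag
        "pos": pos_tags[i],
    }
    # Optional hybrid feature: semantic category from an external annotator.
    if semantic is not None:
        feats["semantic"] = semantic[i]
    return feats


tokens = ["computed", "tomography", "(", "CT", ")"]
lemmas = ["compute", "tomography", "(", "CT", ")"]
pos_tags = ["JJ", "NN", "(", "NNP", ")"]
feats = token_features(tokens, lemmas, pos_tags, 3)
print(feats)
```

Passing a `semantic` list corresponds to the bracketed last bullet, where the category supplied by MetaMap+ is added as an extra feature.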
Medical Entity Recognition — Evaluation
• Corpus contains discharge summaries and progress notes
• De-identified and annotated by hand
• Entities: Problem, Treatment and Test
• Overall 76,665 sentences
Medical Entity Recognition — Evaluation

Setting          Precision   Recall   F-Score
MetaMap              15.52    16.10     15.80
MetaMap+             48.68    56.46     52.28
SVM                  43.65    47.16     45.33
BIO-CRF              70.15    83.31     76.17
BIO-CRF-Hybrid       72.18    83.78     77.55

Universität Stuttgart 20.01.2016 19
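The slide does not define the F-Score. Assuming it is the balanced F1 measure, the harmonic mean of precision and recall, the short check below recomputes it from the reported precision and recall. Under that assumption each recomputed value lies within 0.01 of the reported one, a gap that is plausibly explained by rounding of the inputs.

```python
def f_score(precision, recall):
    """Balanced F1: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F-Score) as given in the evaluation table
results = {
    "MetaMap": (15.52, 16.10, 15.80),
    "MetaMap+": (48.68, 56.46, 52.28),
    "SVM": (43.65, 47.16, 45.33),
    "BIO-CRF": (70.15, 83.31, 76.17),
    "BIO-CRF-Hybrid": (72.18, 83.78, 77.55),
}

max_deviation = 0.0
for name, (precision, recall, reported) in results.items():
    recomputed = f_score(precision, recall)
    max_deviation = max(max_deviation, abs(recomputed - reported))
    print(f"{name}: reported {reported:.2f}, recomputed {recomputed:.4f}")
```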
Anonymization of Medical Reports
Anonymization of Medical Reports — What is anonymization?
• De-Identification
• Completely remove all personal health information