Using NLP approaches on clinical and biomedical textual data
Thierry Hamon
Institut Galilée - Université Paris 13, Villetaneuse, France & LIMSI-CNRS, Orsay, France
hamon@limsi.fr
http://perso.limsi.fr/hamon/
March 2014
ERASMUS Mobility - Mälardalen University (MDH) - Västerås - Sweden
Thierry Hamon (LIMSI & Paris Nord) Using NLP approaches on clinical and biomedical textual data March 2014 1 / 66
Presentation of three applications
    Mining the literature to identify relations between risk factors and their pathologies
    Exploring graph structure to acquire synonym relations from a terminological resource
    Mining patients' Electronic Health Records (discharge summaries)
Risk factors: Mining the literature to identify relations between risk factors and their pathologies
Work with Martin Graña (a), Víctor Raggio (a), Hugo Naya (a) and Natalia Grabar (b)
    (a) Unidad de Bioinformática, Institut Pasteur de Montevideo, Mataojo 2020, Montevideo 11400, Uruguay
    (b) UMR 8163 Savoirs, Textes, Langage (STL), Université Lille3, France
Risk factors: Mining the literature to identify relations between risk factors and their pathologies
Risk factors: a complex notion
    behaviour, environmental condition, disease, genetics... increase people's chance of developing a given disease
    ⇒ Discover risk factors and design prevention strategies
Research from biology, epidemiology, medicine, public health
Despite intensive activity, the knowledge is not complete
    coronary heart disease: only 50% of risks known (Allen, 2000)
Information on risk factors is widespread over the web: websites, bibliographical databases, ...
    ⇒ Difficulties: reliability and access
Risk factors: Previous work
Recent activity in text mining:
    scientific literature: BioCreAtIvE, TREC Genomics
    clinical records: I2B2 NLP challenges (specific task in 2014)
Risk factor studies: data mining
    managing a large number of variables (Ahmad & Bath, 2005)
    groups with similar risks/ICD-9 codes (16)
    claim costs in insurance companies (17)
    KDD challenge 2004 (http://lisp.vse.cz/challenge)
        identify atherosclerosis risk factors
        monitor the evolution of these risks and their impacts
Processing of narratives (18): breast cancer risk factors
    combination of manual and automatic meta-analysis
    findings consistent with known studies
        positive association with alcohol consumption
        negative association with former smoking
Risk factors: Objectives
Massive exploitation of the Medline bibliographical database
    text mining methods applied to full text
Extraction of risk factors and their associations with health conditions
Risk factors: Material
Bibliographical database Medline
    over 18 M citations ⇒ titles, abstracts, MeSH indexing
MeSH thesaurus
    for information storage and retrieval ⇒ MeSH headings
Snomed CT nomenclature
    for organizing and exchanging clinical data
    rich semantic network: terms and relations
    three relations explicit on risk factors and health conditions
        has causative agent: direct cause of the disorder or finding
            bacterial endocarditis has causative agent bacterium
        due to: relates a clinical finding directly to its cause
            acute pancreatitis due to infection
        associated with: clinically relevant association between concepts, without either asserting or excluding a causal or sequential relationship between the two
Risk factors: Bibliographical database Medline
Automated detection of potentially relevant citations
    trigger expressions: risk factors, factor of risk
Annotation of Medline citations with linguistic information
    Ogmios NLP platform (Hamon&al, 2007)
    Segmentation, POS-tagging & lemmatization - Genia Tagger (Tsuruoka&al, 2005)
    Term extraction and recognition - YaTeA (Aubin&Hamon, 2006)
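The citation selection step on this slide can be pictured as a simple keyword filter. The sketch below is only an illustration: the regular expression, the record layout (`pmid`, `title`, `abstract`) and the sample records are assumptions, not the actual Ogmios configuration.

```python
import re

# Trigger expressions named on the slide ("risk factors", "factor of risk").
# The exact regex is an assumption made for this sketch.
TRIGGER = re.compile(r"\b(?:risk\s+factors?|factors?\s+of\s+risk)\b", re.IGNORECASE)

def is_candidate(citation):
    """Return True if the title or abstract contains a trigger expression."""
    text = " ".join([citation.get("title", ""), citation.get("abstract", "")])
    return TRIGGER.search(text) is not None

# Hypothetical records; the identifiers are placeholders, not real PMIDs.
citations = [
    {"pmid": "A", "title": "Smoking as a risk factor for stroke", "abstract": ""},
    {"pmid": "B", "title": "Protein folding dynamics", "abstract": "No trigger here."},
    {"pmid": "C", "title": "Cohort study", "abstract": "Age was a factor of risk."},
]
selected = [c["pmid"] for c in citations if is_candidate(c)]
```

Only the selected citations would then go through the heavier linguistic annotation.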
Risk factors: Information extraction
Corresponding pathologies and health conditions
Semantico-syntactic patterns
    5 patterns for risk factors and pathologies
    12 patterns for handling enumerations
    3 patterns for pathologies
    Example: <NP-RF> as a risk factor for <NP-P>
    where
        as a risk factor for: trigger sequence
        <NP-RF>: noun phrases corresponding to risk factors
        <NP-P>: pathologies
        ? and *: optional and recurrent elements
MeSH descriptors of citations
    Descriptors belonging to the C heading (diseases)
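To make the pattern `<NP-RF> as a risk factor for <NP-P>` concrete, here is a minimal sketch. It assumes that an upstream chunker has already bracketed noun phrases as `[NP ...]`; this markup, the function name `extract_pairs` and the example sentence are all hypothetical, and the real patterns operate on the Ogmios annotations rather than on a string.

```python
import re

# One pattern: a noun phrase, the trigger sequence, then another noun phrase.
# The "[NP ...]" bracket markup is an assumption made for this sketch.
PATTERN = re.compile(
    r"\[NP (?P<rf>[^\]]+)\]\s+as a risk factor for\s+\[NP (?P<p>[^\]]+)\]",
    re.IGNORECASE,
)

def extract_pairs(chunked_sentence):
    """Return (risk factor, pathology) pairs matched by the pattern."""
    return [(m.group("rf"), m.group("p")) for m in PATTERN.finditer(chunked_sentence)]

# Hypothetical, pre-chunked sentence.
example = "[NP Hypertension] as a risk factor for [NP stroke] in [NP elderly patients]"
pairs = extract_pairs(example)
```

The pattern keeps only the noun phrases adjacent to the trigger, so the quality of the extracted pair depends directly on how the noun phrases were delimited upstream.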
Risk factors: Information extraction
Risk factors
Coordination and enumeration:
    Risk factors for survival were age and severity of aortic stenosis ... (PMID 8705769)
    ...a high intake of calcium and phosphorus is a risk factor for the development of metabolic acidosis. (PMID 1435825)
    ...had more than one of the common risk factors for cerebrovascular accidents, including hypertension, advanced age, hyperfibrinogenemia, diabetes mellitus, and past history of cerebrovascular accident. (PMID 1560589)
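The enumeration examples above suggest why dedicated patterns are needed. As a rough sketch (not the patterns actually used in this work), a naive splitter on commas and "and" handles the third example but over-splits the second; the function name `split_enumeration` is an assumption.

```python
import re

def split_enumeration(np_list):
    """Naively split an enumeration on commas and on the coordinator 'and'."""
    parts = re.split(r"\s*,\s*(?:and\s+)?|\s+and\s+", np_list)
    return [p.strip() for p in parts if p.strip()]

# Enumeration taken from the third example (PMID 1560589).
items = split_enumeration(
    "hypertension, advanced age, hyperfibrinogenemia, diabetes mellitus, "
    "and past history of cerebrovascular accident"
)

# Coordination inside a single noun phrase (second example, PMID 1435825):
# the naive splitter breaks it into two pieces.
over_split = split_enumeration("a high intake of calcium and phosphorus")
```

Whether "calcium and phosphorus" should be kept together or distributed over "a high intake of" cannot be decided by string splitting alone, which is why syntactic information is required.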
Risk factors: Evaluation
1 Quality and exhaustiveness of risk factors for a given pathology
2 Associations risk factor/pathology, by comparison between:
    text mining results
    MeSH indexing
3 Comparison between:
    text mining results
    Snomed CT causal and associative relations
Evaluation of precision: ratio of correct extractions among the results
Manual evaluation: no dedicated and comprehensive gold standard is available
Risk factors: Building and preparing the material
Medline material:
    187,544 citations selected: over 42 M word occurrences
    processed through the Ogmios platform
Snomed CT accessed through UMLS (2008AB)
    154,130 pairs pathology/causative agent, pathology/pathology
        92,807 relations has causative agent
        25,309 relations due to
        36,134 relations associated with
        (120 relations provided by several Snomed CT relationships)
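The Snomed CT figures above can be cross-checked with a few lines of arithmetic. The sketch assumes that each of the 120 shared pairs is counted under exactly two relationships; the slide does not state this, but the totals are consistent with it.

```python
# Relation counts as given on the slide.
relation_counts = {
    "has causative agent": 92_807,
    "due to": 25_309,
    "associated with": 36_134,
}
shared_pairs = 120  # pairs provided by several relationships

total_relations = sum(relation_counts.values())
# Assumption: each shared pair is counted twice, so it is subtracted once.
distinct_pairs = total_relations - shared_pairs
```

Under that assumption `distinct_pairs` equals the 154,130 pairs reported.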
Risk factors: Extraction of information on risk factors and pathologies
Application of three kinds of patterns: (1) {risk factor, pathology}, (2) risk factors, (3) pathologies
Definition of relations:
    direct relations with patterns {risk factor, pathology}
    combination of information provided by (2) and (3)
10,445 PMIDs provide information
    313 pairs {risk factor, pathology}
    15,398 pairs by combination of (2) and (3)
    5,873 risk factors (2) not associated with any pathology
MeSH indexing: 5,106 pathologies and health conditions
21,584 triplets {risk factor, pathology text?, pathology MeSH?}
    17,620 (14,895) pairs provided only by information extraction patterns
    5,717 (4,412) pairs contain MeSH descriptors as pathology
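The slide does not detail how the outputs of patterns (2) and (3) are combined. One plausible reading, sketched below, is a cross-product of the risk factors and pathologies found in the same citation; the function `combine`, the data layout and the toy identifiers are assumptions. The last lines check that the three reported counts add up to the 21,584 triplets.

```python
from itertools import product

def combine(risk_factors_by_pmid, pathologies_by_pmid):
    """Pair every risk factor with every pathology found in the same citation."""
    pairs, unpaired = [], []
    for pmid, risk_factors in risk_factors_by_pmid.items():
        pathologies = pathologies_by_pmid.get(pmid, [])
        if pathologies:
            pairs.extend((pmid, rf, p) for rf, p in product(risk_factors, pathologies))
        else:
            # Risk factors with no pathology in the same citation.
            unpaired.extend((pmid, rf) for rf in risk_factors)
    return pairs, unpaired

# Hypothetical toy data; the identifiers are placeholders, not real PMIDs.
rfs = {"pmid-1": ["hypertension", "advanced age"], "pmid-2": ["smoking"]}
paths = {"pmid-1": ["cerebrovascular accident"]}
pairs, unpaired = combine(rfs, paths)

# Counts from the slide: direct pairs + combined pairs + unpaired risk factors.
total_triplets = 313 + 15_398 + 5_873
```

In the toy data, `pmid-1` yields two pairs and `pmid-2` contributes one unpaired risk factor, mirroring the three categories counted on the slide.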
Risk factors: Evaluation
Risk factors for coronary heart disease (CHD)
    CHD, the most common heart disease
    Important cause of premature death all around the world
Evaluation by a medical doctor
    1,102 risk factors extracted: 128 (11.62%) rejected = 88.38% precision
Well-known risk factors found
    hypertension, smoking, diabetes, age, obesity, hypercholesterolemia, hyperlipidemia, family history ...
Detection of synonyms
    {smoking; cigarette smoking; smoking history; importance of total life consumption of cigarettes}
    {hyperhomocysteinemia; hyperhomocysteinaemia; homocysteine; plasma homocysteine}
Error! (?)
    {CHD, work}: Passive smoking at work as a risk factor for coronary heart disease in Chinese women who have never smoked
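The precision figure follows directly from the definition given earlier (ratio of correct extractions among the results). A short check, using only the counts reported on this slide:

```python
extracted = 1_102  # risk factors extracted for CHD
rejected = 128     # extractions rejected by the medical doctor

correct = extracted - rejected
precision = correct / extracted

rejected_pct = round(100 * rejected / extracted, 2)
precision_pct = round(100 * precision, 2)
```

This gives 974 accepted extractions, a rejection rate of 11.62% and a precision of 88.38%, matching the slide.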