Processing the Scope of Negation and Modality Cues in Biomedical Texts
Roser Morante, Walter Daelemans
CNTS-Language Technology Group, University of Antwerp

Framework
• The BIOGRAPH project (www.biograph.be), University of Antwerp:
  - Text Mining: CNTS, Department of Linguistics, Walter Daelemans
  - Data Mining: ADReM, Department of Mathematics and Computer Science, Bart Goethals
  - Genetics: AMG, Department of Molecular Genetics, Jurgen Del-Favero

Framework
• The BIOGRAPH project aims at:
  - Assisting researchers in ranking candidate disease-causing genes by putting forward a new methodology for combined text analysis and data mining from heterogeneous information sources
  - Mining biomedical texts: providing accurate relations that are automatically extracted from text and weighted according to their reliability
• Treatment of negation, modality and quantification

Framework
The BIOGRAPH flow

Gene Prioritization
• Candidate region
  - Gene responsible for a disease (e.g. schizophrenia or Alzheimer) is in a known area of the genome
  - Many genes (> 200) are in this candidate region
• Experimental validation is needed
  - Very expensive in time and cost
• Combine information in literature and in databases!
  - Which genes in the candidate region could be most relevant for the disease and why?
  - Provide a prioritization (ranking problem)

Event Extraction
MEDLINE:7747440
Epstein-Barr virus replicative gene transcription during de novo infection of human thymocytes: simultaneous early expression of BZLF-1 and its repressor RAZ.
Epstein-Barr virus (EBV) is known to infect B cells and epithelial cells. We and others have shown that EBV can also infect a subset of thymocytes. Infection of thymocytes was accompanied by the appearance of linear EBV genome within 8 hr of infection. Circularization of the EBV genome was not detected. This is in contrast to the infection in B cells where the genome can circularize within 24 hr of infection. The appearance of the BamHI ZLF-1 gene product, ZEBRA, by RT-PCR, was observed within 8 hr of infection. The appearance of a novel fusion transcript (RAZ), which comprised regions of the BZLF-1 locus and the adjacent BRLF-1 locus, was detected by RT-PCR. ZEBRA protein was also identified in infected thymocytes by immunoprecipitation. In addition, we demonstrated that the EBNA-1 gene in infected thymocytes was transcribed from the Fp promoter, rather than from the Cp/Wp promoter which is used in latently infected B cells. Transcripts encoding gp350/220, the major coat protein of EBV, were identified, but we did not find any evidence of transcription from the LMP-2A or EBER-1 loci in infected thymocytes. These observations suggest that de novo EBV infection of thymocytes differs from infection of B cells. The main difference is that with thymocytes, no evidence could be found that the virus ever circularizes. Rather, EBV remains in a linear configuration from which replicative genes are transcribed.

Event Extraction
MEDLINE:7747440
... In addition, we demonstrated that the EBNA-1 gene in infected thymocytes was transcribed from the Fp promoter, rather than from the Cp/Wp promoter which is used in latently infected B cells. Transcripts encoding gp350/220, the major coat protein of EBV, were identified, but we did not find any evidence of transcription from the LMP-2A or EBER-1 loci in infected thymocytes. These observations suggest that de novo EBV infection of thymocytes differs from infection of B cells.
<event id="E10" source="7747440" neg="1" spec="1">
  <predicate type="Transcription" begin="1216" end="1229">transcription</predicate>
  <patient type="Theme" begin="1239" end="1245">LMP-2A</patient>
</event>

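The event record above could be consumed roughly as follows. This is a minimal sketch, assuming that neg="1" and spec="1" are flags marking the event as negated and speculated; the helper names read_event and is_factual are hypothetical and not part of the BIOGRAPH system.

```python
import xml.etree.ElementTree as ET

# The event from the slide, reproduced as a string.
EVENT_XML = """<event id="E10" source="7747440" neg="1" spec="1">
  <predicate type="Transcription" begin="1216" end="1229">transcription</predicate>
  <patient type="Theme" begin="1239" end="1245">LMP-2A</patient>
</event>"""


def read_event(xml_text):
    """Turn one <event> element into a plain dictionary."""
    event = ET.fromstring(xml_text)
    predicate = event.find("predicate")
    patient = event.find("patient")
    return {
        "id": event.get("id"),
        "type": predicate.get("type"),
        "trigger": predicate.text.strip(),
        "theme": patient.text.strip(),
        # Assumption: "1" means the flag is set.
        "negated": event.get("neg") == "1",
        "speculated": event.get("spec") == "1",
    }


def is_factual(event):
    """An event is treated as factual only if it is neither negated nor speculated."""
    return not (event["negated"] or event["speculated"])


event = read_event(EVENT_XML)
print(event["type"], event["theme"], "factual:", is_factual(event))
```

Under these assumptions the transcription event for LMP-2A would be kept out of the set of factual relations.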
Contents
• Motivation
• Negation
  - Task description
  - Related work
  - Corpus
  - System description
  - Results
• Modality
  - Related work
  - Results
• Negation vs. modality
• Conclusions
• Further Research

Motivation
• Extracted information that falls in the scope of hedge or negation cues cannot be presented as factual information
• Vincze et al. (2008) report that 17.70% of the sentences in the BioScope corpus contain hedge cues and 13% contain negation cues
• Light et al. (2004) estimate that 11% of sentences in MEDLINE abstracts contain speculative fragments

Finding the scope of negation
• Finding the scope of a negation cue means determining, at the sentence level, which words in the sentence are affected by the negation(s)
  Analysis at the phenotype and genetic level showed that lack of CD5 expression was due neither to segregation of human autosome 11, on which the CD5 gene has been mapped, nor to deletion of the CD5 structural gene.

Related work
• Most of the related work focuses on detecting whether a term is negated or not
  - Rule-based or regular-expression-based systems like NegEx (Chapman et al. 2001) and NegFinder (Mutalik et al. 2001)
  - Machine learning systems like Averbuch et al. (2004)
  - Huang and Lowe (2007) develop a hybrid system that combines regular expression matching with parsing in order to locate negated concepts

Corpus

Corpus
• Medical and biological texts annotated with information about negation and speculation
  PMA treatment, and <xcope id="X1.4.1"><cue type="negation" ref="X1.4.1">not</cue> retinoic acid treatment of the U937 cells</xcope> acts in inducing NF-KB expression in the nuclei.
• Corpora

  Corpora    Clinical   Papers   Abstracts
  #Docs.         1954        9        1273
  #Sent.         6383     2670       11871
  #Words        41985    60935      282243

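Annotation of this kind can be read with a standard XML parser. The following is a minimal sketch, assuming well-formed xcope and cue elements in which a cue's ref attribute points at the id of its scope; the enclosing <sentence> element and the function name scopes are assumptions made for the illustration.

```python
import xml.etree.ElementTree as ET

# One annotated sentence, wrapped in a hypothetical <sentence> root element.
SENTENCE = (
    '<sentence>PMA treatment, and <xcope id="X1.4.1">'
    '<cue type="negation" ref="X1.4.1">not</cue> retinoic acid '
    'treatment of the U937 cells</xcope> acts in inducing NF-KB '
    'expression in the nuclei.</sentence>'
)


def scopes(sentence_xml):
    """Return (cue type, cue text, scope text) for every cue/scope pair."""
    root = ET.fromstring(sentence_xml)
    found = []
    for xcope in root.iter("xcope"):
        # itertext() walks the mixed content, including text inside the cue.
        scope_text = " ".join("".join(xcope.itertext()).split())
        for cue in xcope.iter("cue"):
            if cue.get("ref") == xcope.get("id"):
                found.append((cue.get("type"), cue.text.strip(), scope_text))
    return found


for cue_type, cue_text, scope_text in scopes(SENTENCE):
    print(cue_type, "|", cue_text, "|", scope_text)
```

In this layout the cue is part of its own scope, so the scope text starts with the cue word.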
Experimental Setting
• Abstracts corpus:
  - 10-fold cross-validation experiments
• Clinical and papers corpora: robustness test
  - Training on abstracts
  - Testing on clinical and papers

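The cross-validation setup can be sketched as below. This is only an illustration: whether the folds were built over documents or over sentences is not stated on the slide, so splitting by document is an assumption, and ten_fold_splits is a hypothetical helper.

```python
def ten_fold_splits(doc_ids, n_folds=10):
    """Yield (train_ids, test_ids) pairs; every document is tested exactly once."""
    # Round-robin assignment keeps the fold sizes within one document of each other.
    folds = [doc_ids[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        test = folds[k]
        train = [d for j, fold in enumerate(folds) if j != k for d in fold]
        yield train, test


# 1273 is the number of documents in the abstracts corpus.
abstract_ids = list(range(1273))
splits = list(ten_fold_splits(abstract_ids))
print(len(splits), "folds; first test fold has", len(splits[0][1]), "documents")
```

For the robustness test no split is needed: the model is trained on all abstracts and applied to the clinical and papers corpora as they are.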
System Description
• We model the scope finding task as two consecutive classification tasks:
  - Finding negation cues: a token is classified as being at the beginning of a negation signal, inside it, or outside it
  - Finding the scope: a token is classified as being the first element of a scope sequence, the last, or neither
• Supervised machine learning approach

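The two label sets can be illustrated on one tokenized sentence. This is a minimal sketch; the label names (B/I/O for cues, F/L/O for scopes), the inclusive index convention and both function names are assumptions made for the example.

```python
def cue_labels(n_tokens, cue_spans):
    """Task 1: B = first token of a negation signal, I = inside, O = outside."""
    labels = ["O"] * n_tokens
    for start, end in cue_spans:  # both indices inclusive
        labels[start] = "B"
        for i in range(start + 1, end + 1):
            labels[i] = "I"
    return labels


def scope_labels(n_tokens, scope_span):
    """Task 2: F = first token of the scope, L = last token, O = neither."""
    first, last = scope_span
    if first >= last:
        # A one-token scope would need its own label; not handled in this sketch.
        raise ValueError("scope must cover at least two tokens")
    labels = ["O"] * n_tokens
    labels[first] = "F"
    labels[last] = "L"
    return labels


tokens = ("PMA treatment , and not retinoic acid treatment of the U937 cells "
          "acts in inducing NF-KB expression in the nuclei .").split()
cues = cue_labels(len(tokens), [(4, 4)])      # the cue "not"
scope = scope_labels(len(tokens), (4, 11))    # "not ... cells"
for token, c, s in zip(tokens, cues, scope):
    print(f"{token:12} {c} {s}")
```

The second task only runs on sentences in which the first task has found a cue, which is why the two are described as consecutive.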
System Architecture

Preprocessing

Finding Negation Cues
• We filter out negation cues that are unambiguous in the training corpus (17 out of 30)
• For the rest, a classifier predicts whether a token is the first token of a negation signal, inside it, or outside it
  - Algorithm: IGTREE as implemented in TiMBL (Daelemans et al. 2007)
  - Instances represent all tokens in a sentence
  - Features about the token in focus and its context

Features: negation cue finding
• Of the token
  - Lemma, word, POS and IOB chunk tag
• Of the token context
  - Word, POS and IOB chunk tag of 3 tokens to the right and 3 to the left

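The feature list above amounts to a fixed-width window around the token in focus. The following is a minimal sketch, assuming tokens are given as dictionaries and that positions outside the sentence are padded with "_"; the padding symbol, the feature order and the hand-assigned tags in the example are assumptions.

```python
PAD = {"word": "_", "lemma": "_", "pos": "_", "chunk": "_"}


def cue_features(sentence, i, window=3):
    """Features for token i: 4 for the focus token, 3 per context token."""
    focus = sentence[i]
    feats = [focus["lemma"], focus["word"], focus["pos"], focus["chunk"]]
    offsets = list(range(-window, 0)) + list(range(1, window + 1))
    for offset in offsets:
        j = i + offset
        token = sentence[j] if 0 <= j < len(sentence) else PAD
        feats += [token["word"], token["pos"], token["chunk"]]
    return feats


# Tags below are written by hand for illustration, not produced by a tagger.
sentence = [
    {"word": "Circularization", "lemma": "circularization", "pos": "NN", "chunk": "B-NP"},
    {"word": "was", "lemma": "be", "pos": "VBD", "chunk": "B-VP"},
    {"word": "not", "lemma": "not", "pos": "RB", "chunk": "I-VP"},
    {"word": "detected", "lemma": "detect", "pos": "VBN", "chunk": "I-VP"},
    {"word": ".", "lemma": ".", "pos": ".", "chunk": "O"},
]
features = cue_features(sentence, 2)
print(len(features), features)
```

With a window of 3 on each side every instance has 4 + 6 × 3 = 22 features, regardless of where the token sits in the sentence.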
Ambiguous Negation Cues In Abstracts Corpus

Results
• Baseline: tagging as negation signals the tokens that are negation signals in at least 50% of their occurrences in the training corpus
• Baseline tokens: absence, absent, cannot, could not, fail, failure, impossible, instead of, lack, miss, neither, never, no, none, nor, not, rather than, unable, with the exception of, without

  BASELINE    PREC    RECALL   F1      IAA
  Abstracts   82.00   95.17    88.09   94.46
  Papers      84.01   92.46    88.03   79.42
  Clinical    97.31   97.53    97.42   90.70

  SYSTEM      PREC    RECALL   F1
  Abstracts   84.72   98.75    91.20 (+3.11)
  Papers      87.18   95.72    91.25 (+3.22)
  Clinical    97.33   98.09    97.71 (+0.29)

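The baseline rule and the F1 column can be sketched as follows. This is an illustration under simplifying assumptions: only single-token cues are counted (multiword cues such as "rather than" would need extra handling), tokens are lowercased, and the toy training data is invented for the example.

```python
from collections import Counter


def train_baseline(training_tokens, threshold=0.5):
    """Return the tokens that are negation signals in >= threshold of their occurrences."""
    total, as_cue = Counter(), Counter()
    for token, is_cue in training_tokens:
        key = token.lower()
        total[key] += 1
        if is_cue:
            as_cue[key] += 1
    return {t for t in as_cue if as_cue[t] / total[t] >= threshold}


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Invented toy data: (token, annotated as negation signal?)
toy_training = [
    ("not", True), ("not", True), ("not", False),   # 2 of 3 -> kept
    ("no", True),                                    # 1 of 1 -> kept
    ("rather", True), ("rather", False), ("rather", False),  # 1 of 3 -> dropped
    ("cells", False),
]
baseline_tokens = train_baseline(toy_training)
print(sorted(baseline_tokens))
print(round(f1(84.72, 98.75), 2))
```

Recomputing F1 from the rounded precision and recall in the tables agrees with the reported values up to rounding of the inputs.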
Results: system vs. baseline in abstracts corpus
• The system performs better than the baseline

Results in the three corpora
• The system is portable

Discussion
• Cause of lower recall on papers corpus:

  NOT         % negation signals   % classified correctly
  Abstracts   58.89                98.25
  Papers      53.22                93.68
  Clinical     6.72                91.22

• Errors: "not" is classified as a negation signal
  However, programs for tRNA identification [...] do not necessarily perform well on unknown ones
  The evaluation of this ratio is difficult because not all true interactions are known

Finding Scopes
• Three classifiers predict whether a token is the first token in the scope sequence, the last or neither
  - MBL (Daelemans et al. 2007)
  - SVM light (Joachims 1999)
  - CRF++ (Lafferty et al. 2001)
• A fourth classifier predicts the same taking as input the output of the previous classifiers
  - CRF++
• The features used by the object classifiers and the metalearner are different

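The stacking step can be pictured as turning the three object classifiers' predictions into input columns for the fourth classifier. The following is a minimal sketch: it writes one whitespace-separated row per token with the answer label in the last column, which is the general layout of a CRF++ training file, but the choice of columns is an assumption, since the slide only says that the metalearner uses different features. The function name meta_rows and the example predictions are invented.

```python
def meta_rows(tokens, mbl, svm, crf, gold=None):
    """One row per token: token, three predicted labels, optional gold label."""
    rows = []
    for i, token in enumerate(tokens):
        row = [token, mbl[i], svm[i], crf[i]]
        if gold is not None:
            row.append(gold[i])  # answer column, present only in training data
        rows.append(" ".join(row))
    return rows


# Invented predictions over F (first), L (last), O (other) labels.
tokens = ["did", "not", "find", "any", "evidence"]
mbl = ["O", "F", "O", "O", "L"]
svm = ["O", "F", "O", "L", "O"]
crf = ["O", "F", "O", "O", "L"]
gold = ["O", "F", "O", "O", "L"]

rows = meta_rows(tokens, mbl, svm, crf, gold)
print("\n".join(rows))
```

A metalearner trained on rows like these can learn which object classifier to trust where they disagree, as on the last two tokens of the example.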
Finding Scopes

Finding Scopes
• Previous attempts: lower results
  - Chunk-based classification, instead of word-based
  - BIO classification of tokens (EMNLP’08) instead of FOL (First, Other, Last)
  - Single classifier approach, instead of metalearner

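The difference between the two labelling schemes shows up when a scope is recovered from predicted labels. This is a minimal sketch, assuming one scope per label sequence and inclusive token indices; how the actual system resolves inconsistent predictions is not stated on the slide, so pairing the first F with the first L after it is an assumption.

```python
def span_from_fol(labels):
    """Scope from F/O/L labels: from the first F to the first L after it."""
    if "F" not in labels:
        return None
    first = labels.index("F")
    for i in range(first + 1, len(labels)):
        if labels[i] == "L":
            return first, i
    return None  # an F without a closing L gives no scope in this sketch


def span_from_bio(labels):
    """Scope from B/I/O labels: from the first to the last non-O token."""
    inside = [i for i, label in enumerate(labels) if label in ("B", "I")]
    return (inside[0], inside[-1]) if inside else None


# The same 8-token scope (tokens 4..11 of a 13-token sentence) in both schemes.
fol = ["O"] * 4 + ["F"] + ["O"] * 6 + ["L"] + ["O"]
bio = ["O"] * 4 + ["B"] + ["I"] * 7 + ["O"]

print(span_from_fol(fol), span_from_bio(bio))
print(sum(l != "O" for l in fol), "vs", sum(l != "O" for l in bio), "non-O labels")
```

With FOL only the two boundary tokens carry a positive label, whereas BIO requires every token inside the scope to be labelled correctly.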