Truecasing Clinical Narratives (Full Paper)

Markus Kreuzthaler (1), Stefan Schulz (1, 2)

(1) Institute for Medical Informatics, Statistics and Documentation, Medical University of Graz, Austria
(2) Institute of Medical Biometry and Medical Informatics, University Medical Center Freiburg, Germany

MIE Conference, August 30, 2011, Oslo

Markus Kreuzthaler (IMI) Truecasing MIE 2011 1 / 12
Motivation (1)

Original text:
CHRONISCHE HEPATITIS MIT GERING BIS MITTELGRADIGER AKTIVITAET (HEPATISCHER AKTIVITAETSINDEX 6 VON 18) UND MITTELGRADIGER BIS HOEHERGRADIGER PORTALER UND MITTELGRADIGER INKOMPLETTER UND KOMPLETTER PORTOPORTALER UND PORTOZENTRALER FIBROSE (FIBROSESCORE 4 VON 6)

Corrected text:
Chronische Hepatitis mit gering bis mittelgradiger Aktivität (hepatischer Aktivitätsindex 6 von 18) und mittelgradiger bis höhergradiger portaler und mittelgradiger inkompletter und kompletter portoportaler und portozentraler Fibrose (Fibrosescore 4 von 6).
Motivation (2)

Clinical Center Graz, Pathology Legacy Data Repository: ~1.8 million texts (1984 - 2005) are still of interest for clinical and scientific inquiries.

Legacy data example:
MITTELGRADIGE CHRONISCHE GASTRITS (MAGENMUCOSA VOM CORPUSTYP , UEBERGANGSTYP) MIT MITTELGRADIGER AKTIVITAET, KOMPLETTER UND INKOMPLETTER (TYP III) INTESTINALER METAPLASIE, MITTELGRADIGER ATROPHIE DER TIEFEN DRUESEN. ANTEIL EINES TUBULAEREN MAGENSCHLEIMHAUTADENOMS (INTESTINALER TYP; MITTELGRADIGE DYSPLASIE; WHO: GERINGGRADIGE INTRAEPITHELIALE NEOPLASIE). HP NICHT NACHWEISBAR.

- Upper case only, no diacritics (e.g. "Füße" appears as "FUESSE"), occasional typing errors.
- Acronyms are not easy to identify (e.g. "WHO", "HP").
- German-specific spelling variants (e.g. "colon" vs. "kolon"; "cerebral" vs. "zerebral").
Motivation (3)

"More data usually beats better algorithms." (Anand Rajaraman, blog post, March 2008, http://anand.typepad.com/datawocky/2008/03/more-data-usual.html)

Sophisticated algorithms using little data versus less sophisticated algorithms using big data.

"The good news is that Big Data is here." (T. White, Hadoop: The Definitive Guide, O'Reilly Media, Inc., June 2009)

We will use Big Data for "truecasing" clinical narratives.
Corpus Description (1)

Corpus:
- 3,542 German-language pathology reports.
- 7-bit ASCII text.
- 83,818 tokens.

Very low lexical coverage of 51%: of 7,500 word types in the text corpus, only 3,808 match any word token of a standard medical dictionary (Pschyrembel).
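The coverage figure is a simple ratio of matched word types to all word types. A minimal sketch of that computation follows; the function name `lexical_coverage` and the case-insensitive matching are assumptions for illustration, since the slides do not say how matching against Pschyrembel was done.

```python
def lexical_coverage(corpus_types, dictionary_tokens):
    """Share of corpus word types that also occur in the dictionary.

    Matching is case-insensitive here, which is an assumption; the
    slides only report the resulting counts.
    """
    types = {w.lower() for w in corpus_types}
    if not types:
        return 0.0
    dictionary = {w.lower() for w in dictionary_tokens}
    return len(types & dictionary) / len(types)


# Figures reported on the slide: 3,808 of 7,500 word types matched.
reported = 3808 / 7500
print(f"{reported:.1%}")  # prints 50.8%, reported on the slide as 51%
```

With a toy input such as `lexical_coverage(["GASTRITIS", "FIBROSE", "HP", "PORTOPORTALER"], ["Gastritis", "Fibrose", "Hepatitis"])`, two of four types match, giving 0.5.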
Corpus Description (2)

Gold standard for formative evaluation:
- Sampling: 100 sentences; delimiters (.;:!?); 9.3 tokens per sentence (SD = 7.9, min = 2, max = 38, median = 7).
- Correction: manual spelling and grammar correction according to:
  - the 1996 German orthography reform,
  - medical spelling rules in accordance with German medical publishers,
  - the Pschyrembel Clinical Dictionary.

Reference N-gram corpus: all tokens in the World Wide Web indexed by Google.
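The sampling step can be sketched as follows: split the reports at the listed delimiters, draw a random sample, and summarise sentence lengths in tokens. This is only an illustration under assumptions; the helper names, the whitespace tokenisation and the fixed random seed are not taken from the slides.

```python
import random
import re
import statistics

# Sentence delimiters as listed on the slide.
DELIMITERS = r"[.;:!?]"


def split_sentences(text):
    """Split at the delimiters and drop empty fragments."""
    return [s.strip() for s in re.split(DELIMITERS, text) if s.strip()]


def sample_sentences(sentences, k=100, seed=0):
    """Draw up to k sentences at random (seed fixed for repeatability)."""
    rng = random.Random(seed)
    return rng.sample(sentences, min(k, len(sentences)))


def token_stats(sentences):
    """Mean, median, min and max sentence length in whitespace tokens."""
    lengths = [len(s.split()) for s in sentences]
    return {
        "mean": statistics.mean(lengths),
        "median": statistics.median(lengths),
        "min": min(lengths),
        "max": max(lengths),
    }


text = ("HP NICHT NACHWEISBAR. MITTELGRADIGE DYSPLASIE; "
        "WHO: GERINGGRADIGE INTRAEPITHELIALE NEOPLASIE.")
sentences = split_sentences(text)
print(sentences)
print(token_stats(sentences))
```

Because ";" and ":" count as delimiters, the parenthesised diagnosis is cut into short fragments such as "WHO", which is one plausible reason for the low median sentence length reported above.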
Algorithm (1)

Scraping Google with JDOM, TagSoup and XPath.
Algorithm (2)

Example: "GERINGGRADIGE CHRONISCHE GASTRITIS"

Bigram 1 "GERINGGRADIGE CHRONISCHE"    Frequency
Geringgradige                          7
chronische                             15
geringgradige                          6
geringgradige"                         2

Bigram 2 "CHRONISCHE GASTRITIS"        Frequency
Chronische                             9
Gastritis                              14
chronische                             5
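A minimal sketch of how such per-bigram tables could be produced: slide a two-token window over the upper-case sentence, then count the surface forms found in the text returned for each bigram query. The function names, the whitespace tokenisation and the prefix matching are assumptions; the slides do not describe how hits were counted, and the snippet string below is a hypothetical stand-in for scraped result text.

```python
from collections import Counter


def bigrams(sentence):
    """Overlapping two-token windows used as search queries."""
    tokens = sentence.split()
    return [" ".join(tokens[i:i + 2]) for i in range(len(tokens) - 1)]


def count_variants(query, snippet_text):
    """Count surface forms in snippet_text that match a query token.

    Assumption: a form matches if it starts with a query token when both
    are lower-cased. This keeps noisy forms such as 'geringgradige"',
    but it would also keep inflected forms.
    """
    wanted = {tok.lower() for tok in query.split()}
    counts = Counter()
    for tok in snippet_text.split():
        if any(tok.lower().startswith(w) for w in wanted):
            counts[tok] += 1
    return counts


queries = bigrams("GERINGGRADIGE CHRONISCHE GASTRITIS")
print(queries)

# Hypothetical snippet text, for illustration only.
snippet = 'Geringgradige chronische Gastritis und geringgradige" Entzündung'
print(count_variants(queries[0], snippet))
```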
Algorithm (3)

Merged "GERINGGRADIGE CHRONISCHE GASTRITIS"    Frequency
Chronische                                     9
Gastritis                                      14
Geringgradige                                  7
chronische                                     20
geringgradige                                  6
geringgradige"                                 2

Decision: chronische

Decision according to weighting:

$$ w_i = \frac{\mathrm{frequency}(t_i)}{\mathrm{levenshtein}(t, t_i) + 1} $$

where t is the input token and t_i is the i-th candidate form in the merged table.
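The merge and the weighted decision can be sketched with the frequencies from the example. Two points are assumptions rather than facts from the slides: the Levenshtein distance is computed on lower-cased strings, and the candidate with the highest weight is chosen for every input token. The function names are illustrative.

```python
from collections import Counter

# Frequency tables from the worked example.
BIGRAM_1 = Counter({"Geringgradige": 7, "chronische": 15,
                    "geringgradige": 6, 'geringgradige"': 2})
BIGRAM_2 = Counter({"Chronische": 9, "Gastritis": 14, "chronische": 5})


def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def truecase_token(token, merged):
    """Pick the candidate t_i with maximal w_i = freq(t_i) / (lev(t, t_i) + 1).

    Assumption: the distance ignores case, so that case variants of the
    input token have distance 0.
    """
    def weight(candidate):
        distance = levenshtein(token.lower(), candidate.lower())
        return merged[candidate] / (distance + 1)
    return max(merged, key=weight)


merged = BIGRAM_1 + BIGRAM_2  # Counter addition sums the frequencies
sentence = "GERINGGRADIGE CHRONISCHE GASTRITIS"
result = " ".join(truecase_token(t, merged) for t in sentence.split())
print(result)  # prints: Geringgradige chronische Gastritis
```

For the token CHRONISCHE, "chronische" gets weight 20 / 1 = 20 against 9 / 1 = 9 for "Chronische", which reproduces the decision shown above. The noisy form 'geringgradige"' is penalised by its distance of 1 and ends up with weight 2 / 2 = 1.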
Results of Truecasing and Spelling Variant Correction

Correction phenomenon                          Count   Total   Units
Right case correction of normal tokens         896     909     tokens
Right case correction of acronyms              13      16      tokens
Correction of diacritics ("ä","ö","ü","ß")     73      80      occurrences
"c", "k", "z" variants corrected               4       21      occurrences
Meaning of sentence affected by correction     3       100     sentences
Spelling / grammar error corrected             1       5       sentences
New grammar error after processing             1       100     sentences
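The counts translate into proportions with simple arithmetic. The sketch below only divides the reported counts by the reported totals; the dictionary keys are shorthand labels introduced for this illustration.

```python
# (count, total) pairs as reported in the results table.
RESULTS = {
    "case, normal tokens": (896, 909),
    "case, acronyms": (13, 16),
    "diacritics": (73, 80),
    "c/k/z variants": (4, 21),
}

RATES = {name: count / total for name, (count, total) in RESULTS.items()}

for name, rate in RATES.items():
    count, total = RESULTS[name]
    print(f"{name}: {count}/{total} = {rate:.3f}")
```

Case restoration of normal tokens succeeds in roughly 98.6% of cases and diacritics in about 91%, whereas only about 19% of the "c"/"k"/"z" variants are corrected.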
Outlook

- Problem: the meaning of a sentence can be affected by a correction ("minimaler" → "maximaler").
- Avoiding the Google black box.
- Use of open Web N-gram services:
  - Yahoo! N-Grams, version 2.0: 12,000 news-oriented sites, February 2006 to December 2006, English.
  - Google/LDC Web 1T 5-gram, 10 European Languages, Version 1: Web pages from October 2008 to December 2008.
  - Microsoft Web N-gram Services: N-gram models based on a Web snapshot taken in June 2009, EN-US market.
  - Google Books N-grams: on Amazon S3 in a Hadoop-friendly file format.