Using high-volume unstructured GP notes to predict stroke
Anneloes Louwe, Master’s Thesis Project
Supervision:
• Hine van Os, dept. Neurology & Epidemiology, LUMC
• Suzan Verberne, Text Mining & Information Retrieval, LIACS
Contents
• Study context and objectives
• Preprocessing of primary care consultation notes
  • Cleaning and tokenization
  • Spelling correction
  • Keyphrase detection
• Feature selection
  • Bag-of-words
  • Topic modeling
• Prediction models
20-Nov-18
What is stroke?
• Brain infarctions & brain hemorrhage
• NL: 43,000 strokes per year
• 3rd leading cause of death
(Cardiovasculair Risicomanagement, NHG)
6/12/19
Prevention of stroke is key
• Prevention by general practitioner
• Blood pressure & cholesterol medication
• Lifestyle change
• Simplistic risk chart, only 5 risk factors
• Need for precision prevention (and thus prediction)!
(Cardiovasculair Risicomanagement, NHG)
Aim
• Including free text in a prediction model for stroke
• Identification of novel (women-specific) risk factors
Free text
• Captures patients’ narrative
• Supporting evidence
• Uncertainty
• Non-medical information (e.g. social problems)
• Diagnosis descriptions
• SOAP notes
  S: Subjective
  O: Objective
  A: Assessment
  P: Plan
Data overview
• Pipeline development: ELAN dataset (n = 87,000)
• Proof of concept: NEO dataset (n ≈ 6,000)
  Cases (including myocardial infarctions): 182
  Controls: 5,890
• Main dataset: STIZON dataset (n = 3,000,000)
Preprocessing
• Preparation: ICPC code (re)formatting (e.g. K90.00); grouping SOAP lines
• Cleaning and tokenization: lowercasing and punctuation removal; token removal (stopwords, numbers, short words, medication specifications such as 100mg or 100st, zorgdomein codes)
• Spelling correction: vocabulary from Clinspell, ICPC definitions and CoNLL; single-character edit identification using Symmetric Delete
• Keyphrase detection: Kullback–Leibler divergence
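The cleaning and spelling-correction steps on the Preprocessing slide can be illustrated with a minimal sketch. It is not the thesis pipeline: the stopword list, the dose-token pattern and the tiny vocabulary are hypothetical stand-ins (the slide names Clinspell, ICPC definitions and CoNLL as the real vocabulary sources), and the lookup only gathers candidates by single-character deletes without the final edit-distance check a full Symmetric Delete implementation would add.

```python
import re
from collections import defaultdict

# Hypothetical stopword list; a real pipeline would use a fuller Dutch list.
STOPWORDS = {"de", "het", "een", "en", "van", "bij", "met"}
# Assumed pattern for medication specifications such as "100mg" or "100st".
MED_SPEC = re.compile(r"^\d+(mg|st|ml|g)$")


def clean_tokens(text, min_len=3):
    """Lowercase, strip punctuation, then drop stopwords, numbers,
    short words and medication specifications."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    kept = []
    for tok in text.split():
        if tok in STOPWORDS or tok.isdigit() or len(tok) < min_len:
            continue
        if MED_SPEC.match(tok):
            continue
        kept.append(tok)
    return kept


def deletes(word):
    """All strings obtained by deleting exactly one character."""
    return {word[:i] + word[i + 1:] for i in range(len(word))}


def build_index(vocabulary):
    """Map every vocabulary word and each of its single-character
    deletes back to the vocabulary words that produce it."""
    index = defaultdict(set)
    for word in vocabulary:
        index[word].add(word)
        for d in deletes(word):
            index[d].add(word)
    return index


def correct(token, index):
    """Return sorted vocabulary candidates reachable through one delete
    on either side. Candidates can include a few distance-2 words;
    a full implementation would verify the edit distance afterwards."""
    if token in index and token in index[token]:
        return [token]  # already a known word
    candidates = set(index.get(token, ()))
    for d in deletes(token):
        candidates |= index.get(d, set())
    return sorted(candidates)


index = build_index({"hoofdpijn", "hypertensie", "metoprolol"})
for tok in clean_tokens("Pt heeft hoofdpjn, start metoprolol 100mg"):
    print(tok, correct(tok, index))
```

Precomputing deletes of the vocabulary once means each lookup only needs the deletes of the query token, which is what makes this approach attractive for high-volume notes.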
Cases vs. controls
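The keyphrase detection step on the Preprocessing slide relies on Kullback–Leibler divergence. One common way to use it, sketched here, is to score each term by its pointwise contribution to the divergence between a foreground and a background corpus. Treating cases as foreground and controls as background is an assumption for illustration, as are the add-one smoothing and the toy token lists; the slides do not state how the corpora were chosen.

```python
import math
from collections import Counter


def kl_keyphrase_scores(foreground_docs, background_docs, smoothing=1.0):
    """Score each term by p_fg * log(p_fg / p_bg), its pointwise
    contribution to KL(foreground || background)."""
    fg = Counter(tok for doc in foreground_docs for tok in doc)
    bg = Counter(tok for doc in background_docs for tok in doc)
    vocab = set(fg) | set(bg)
    # Additive smoothing keeps the ratio finite for terms that are
    # missing from one of the two corpora.
    fg_total = sum(fg.values()) + smoothing * len(vocab)
    bg_total = sum(bg.values()) + smoothing * len(vocab)
    scores = {}
    for term in vocab:
        p_fg = (fg[term] + smoothing) / fg_total
        p_bg = (bg[term] + smoothing) / bg_total
        scores[term] = p_fg * math.log(p_fg / p_bg)
    return scores


# Toy tokenized notes (hypothetical, not study data).
cases = [["hoofdpijn", "duizelig"], ["duizelig", "hypertensie"]]
controls = [["hoesten", "hoofdpijn"], ["hoesten", "koorts"]]

scores = kl_keyphrase_scores(cases, controls)
for term, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{term:12s} {score:+.3f}")
```

Terms with a large positive score are relatively more frequent in the foreground corpus; the scores sum to the KL divergence between the two smoothed distributions.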
Feature Selection
• Unified Medical Language System (UMLS): Medical Concept Extraction
• Bag-of-Words
• Topic Modeling
  Latent Dirichlet Allocation (LDA)
  Non-negative Matrix Factorization (NMF)
  Topic Coherence: Word Embedding model (Word2Vec)
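To make the bag-of-words and topic-modeling bullets concrete, the sketch below builds a document-term count matrix and factorizes it with NMF using the classic multiplicative updates for the Frobenius loss. It is a dependency-free illustration on hypothetical toy notes, not the implementation used in the project, which would more plausibly rely on a library; LDA is not shown.

```python
import random


def bag_of_words(docs):
    """Build a document-term count matrix and the sorted vocabulary."""
    vocab = sorted({tok for doc in docs for tok in doc})
    column = {term: j for j, term in enumerate(vocab)}
    matrix = [[0.0] * len(vocab) for _ in docs]
    for i, doc in enumerate(docs):
        for tok in doc:
            matrix[i][column[tok]] += 1.0
    return matrix, vocab


def matmul(a, b):
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(a):
    return [list(row) for row in zip(*a)]


def nmf(v, n_topics, n_iter=200, seed=0, eps=1e-9):
    """Factorize V ~ W H with non-negative W (docs x topics) and
    H (topics x terms) via multiplicative updates."""
    rng = random.Random(seed)
    n_docs, n_terms = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(n_topics)] for _ in range(n_docs)]
    h = [[rng.random() + eps for _ in range(n_terms)] for _ in range(n_topics)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H)
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[k][j] * num[k][j] / (den[k][j] + eps) for j in range(n_terms)]
             for k in range(n_topics)]
        # W <- W * (V H^T) / (W H H^T)
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps) for k in range(n_topics)]
             for i in range(n_docs)]
    return w, h


def frobenius_error(v, w, h):
    """Frobenius norm of the reconstruction residual V - W H."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0]))) ** 0.5


# Toy tokenized notes (hypothetical, not study data).
docs = [["hoofdpijn", "duizelig", "hoofdpijn"], ["duizelig", "hoofdpijn"],
        ["hoesten", "koorts"], ["koorts", "hoesten", "hoesten"]]
v, vocab = bag_of_words(docs)
w, h = nmf(v, n_topics=2)
for k, row in enumerate(h):
    top = sorted(zip(row, vocab), reverse=True)[:2]
    print(f"topic {k}:", [term for _, term in top])
```

Each row of `h` weights the vocabulary for one topic, and each row of `w` gives a note's topic loadings, which could then serve as features for a prediction model.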
Topic Coherence
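The slides say only that topic coherence is based on a word embedding model (Word2Vec). One common embedding-based measure, assumed here for illustration, is the mean pairwise cosine similarity between a topic's top words. The two-dimensional vectors below are made up; in practice they would come from a trained embedding model.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def topic_coherence(top_words, embeddings):
    """Mean pairwise cosine similarity of a topic's top words.
    Words without an embedding are skipped."""
    known = [w for w in top_words if w in embeddings]
    pairs = list(combinations(known, 2))
    if not pairs:
        return 0.0
    return sum(cosine(embeddings[a], embeddings[b]) for a, b in pairs) / len(pairs)


# Hypothetical toy embeddings standing in for Word2Vec vectors.
embeddings = {
    "hoofdpijn": [1.0, 0.1],
    "migraine": [0.9, 0.2],
    "duizelig": [0.8, 0.3],
    "fiets": [0.0, 1.0],
}

print(topic_coherence(["hoofdpijn", "migraine", "duizelig"], embeddings))
print(topic_coherence(["hoofdpijn", "fiets", "migraine"], embeddings))
```

A score like this can be averaged over all topics to compare models, for example LDA against NMF or different numbers of topics.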
Models
• Logistic Regression
• Random Forest
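As a rough illustration of the first model on this slide, the sketch below fits a logistic regression by batch gradient descent on a toy count matrix. The features, labels, learning rate and iteration count are all hypothetical; the project itself would more plausibly use a library implementation, and the random forest is not shown.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_logistic(X, y, lr=0.5, n_iter=500, l2=0.0):
    """Fit weights and bias by batch gradient descent on the mean
    log-loss, with optional L2 penalty on the weights."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(n_iter):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            grad_b += err
            for j in range(d):
                grad_w[j] += err * xi[j]
        w = [wj - lr * (gj / n + l2 * wj) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_proba(X, w, b):
    """Predicted probability of the positive class for each row."""
    return [sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) for xi in X]


# Toy features: counts of two terms per patient (hypothetical data).
X = [[2.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 2.0]]
y = [1, 1, 0, 0]  # 1 = case, 0 = control

w, b = fit_logistic(X, y)
print("weights:", w, "bias:", b)
print("probabilities:", predict_proba(X, w, b))
```

With bag-of-words or topic features as input, the fitted weights indicate which terms or topics push the predicted risk up or down, which is one reason an interpretable model is a natural baseline next to a random forest.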
Models
Next steps
• STIZON dataset
  Experimentation
  Pipeline optimization
• Negation Detection
Thank you!
Vrije Universiteit LUMC Neurologie • Mark Hoogendoorn • Hendrikus J. H. van Os • Ioannis Pantazis • Marieke J. H. Wermer LIACS LUMC PHEG • Matthijs de Leeuw • Mattijs A. Numans • Suzan Verberne • Tobias N. Bonten • Teddy Etoeharnowo • Niels H. Chavannes • Anneloes Louwe • Rolf H. H. Groenwold LUMC Statistiek • Janet Kist • Hein Putter • Michiel Meulenbroek • Erik van Zwet • Frederike Buechner Turku University (Finland) • Sepinoud Azimi