Using high-volume unstructured GP notes to predict stroke



  1. Using high-volume unstructured GP notes to predict stroke
     Anneloes Louwe, Master’s Thesis Project
     Supervision:
     • Hine van Os, dept. Neurology & Epidemiology, LUMC
     • Suzan Verberne, Text Mining & Information Retrieval, LIACS

  2. Contents
     • Study context and objectives
     • Preprocessing of primary care consultation notes
     • Cleaning and tokenization
     • Spelling correction
     • Keyphrase detection
     • Feature selection
     • Bag-of-words
     • Topic modeling
     • Prediction models
     20-Nov-18

  3. What is stroke?
     • Brain infarctions & brain hemorrhage
     • Netherlands: 43,000 strokes per year
     • Third leading cause of death
     Source: Cardiovasculair Risicomanagement, NHG
     6/12/19

  4. Prevention of stroke is key
     • Prevention by general practitioner
     • Blood pressure & cholesterol medication
     • Lifestyle change
     • Simplistic risk chart, only 5 risk factors
     • Need for precision prevention (and thus prediction)!

  5. Aim
     • Including free text in a prediction model for stroke
     • Identification of novel (women-specific) risk factors

  6. Free text
     • Captures patients’ narrative
     • Supporting evidence
     • Uncertainty
     • Non-medical information (e.g. social problems)
     • Diagnosis descriptions
     • SOAP notes
       • S: Subjective
       • O: Objective
       • A: Assessment
       • P: Plan

  7. Data overview
     • Pipeline development: ELAN dataset (n = 87,000)
     • Proof of concept: NEO dataset (n ≈ 6,000)
       • Cases (including myocardial infarctions): 182
       • Controls: 5,890
     • Main dataset: STIZON dataset (n = 3,000,000)

  8. Preprocessing
     Preparation
       • ICPC code (re)formatting (e.g. K90.00)
       • Grouping SOAP lines
     Cleaning and tokenization
       • Lowercasing and punctuation removal
       • Token removal: stopwords, numbers, short words, medication specifications (e.g. 100mg or 100st), ZorgDomein codes
     Spelling correction
       • Vocabulary: Clinspell, ICPC definitions and CoNLL
       • Single-character edit identification using Symmetric Delete
     Keyphrase detection
       • Kullback–Leibler divergence
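The preprocessing steps on this slide can be illustrated with a minimal sketch, under several assumptions. The stopword list, the vocabulary, and all function names below are hypothetical and stand in for the real resources (Clinspell, ICPC definitions, CoNLL). The delete index returns candidate corrections only; a full Symmetric Delete implementation would still verify each candidate with an edit-distance check. The Kullback–Leibler score shown is the pointwise contribution of a term in a foreground corpus relative to a background corpus, which is one common way to rank keyphrases and not necessarily the variant used in the thesis.

```python
import math
import re
from collections import Counter

# Hypothetical stopword list and vocabulary, for illustration only.
STOPWORDS = {"de", "het", "een", "en", "van"}
VOCABULARY = {"hoofdpijn", "bloeddruk", "duizelig"}

# Medication specifications such as 100mg or 100st.
DOSAGE = re.compile(r"^\d+(mg|st)$")


def tokenize(note: str, min_length: int = 3) -> list[str]:
    """Lowercase, strip punctuation, drop stopwords, numbers, short words, dosages."""
    tokens = re.sub(r"[^\w\s]", " ", note.lower()).split()
    return [
        t for t in tokens
        if t not in STOPWORDS
        and not t.isdigit()
        and not DOSAGE.match(t)
        and len(t) >= min_length
    ]


def deletes(word: str) -> set[str]:
    """All strings obtained by deleting exactly one character from word."""
    return {word[:i] + word[i + 1:] for i in range(len(word))}


def build_index(vocabulary: set[str]) -> dict[str, set[str]]:
    """Map each vocabulary word and each of its single deletions back to the word."""
    index: dict[str, set[str]] = {}
    for word in vocabulary:
        for key in deletes(word) | {word}:
            index.setdefault(key, set()).add(word)
    return index


def candidates(token: str, index: dict[str, set[str]]) -> set[str]:
    """Candidate corrections: vocabulary words sharing a key with the token."""
    found: set[str] = set()
    for key in deletes(token) | {token}:
        found |= index.get(key, set())
    return found


def kl_scores(foreground: list[str], background: list[str]) -> dict[str, float]:
    """Pointwise KL contribution p_fg(t) * log(p_fg(t) / p_bg(t)) per foreground term."""
    fg, bg = Counter(foreground), Counter(background)
    fg_total, bg_total = sum(fg.values()), sum(bg.values())
    vocab_size = len(set(fg) | set(bg))
    scores = {}
    for term, count in fg.items():
        p_fg = count / fg_total
        # Add-one smoothing keeps the background probability non-zero.
        p_bg = (bg[term] + 1) / (bg_total + vocab_size)
        scores[term] = p_fg * math.log(p_fg / p_bg)
    return scores


index = build_index(VOCABULARY)
tokens = tokenize("Pt heeft hoofdpjn, bloeddruk 150/90; start 100mg")
```

Because deletions are generated on both the vocabulary side and the token side, a single lookup table covers insertions, deletions, and substitutions of one character without enumerating every possible replacement letter.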

  9. Cases vs. controls

  10. Feature Selection
      • Unified Medical Language System (UMLS): medical concept extraction
      • Bag-of-words
      • Topic modeling
        • Latent Dirichlet Allocation (LDA)
        • Non-negative Matrix Factorization (NMF)
        • Topic coherence: word embedding model (Word2Vec)
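As a rough illustration of the bag-of-words representation, the sketch below builds a document-term count matrix from tokenized notes. The function name `bag_of_words` and the example notes are hypothetical; the slides do not say which library or weighting scheme was used.

```python
from collections import Counter


def bag_of_words(documents: list[list[str]]) -> tuple[list[str], list[list[int]]]:
    """Return a sorted vocabulary and one term-count row per document."""
    vocabulary = sorted({token for doc in documents for token in doc})
    rows = []
    for doc in documents:
        counts = Counter(doc)
        rows.append([counts[term] for term in vocabulary])
    return vocabulary, rows


# Hypothetical tokenized notes, standing in for preprocessed consultation notes.
notes = [
    ["headache", "dizziness", "headache"],
    ["cough", "fever"],
]
vocabulary, matrix = bag_of_words(notes)
```

A count matrix of this shape is also the usual input to topic models such as LDA and NMF, which factorize it into document-topic and topic-term weights.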

  11. Topic Coherence
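One common way to score topic coherence with a word embedding model is the mean pairwise cosine similarity between the top words of a topic. The sketch below assumes that definition, which the slides do not confirm, and uses hypothetical two-dimensional vectors in place of trained Word2Vec embeddings.

```python
import math
from itertools import combinations


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def topic_coherence(top_words: list[str], vectors: dict[str, list[float]]) -> float:
    """Mean pairwise cosine similarity between the embeddings of a topic's top words."""
    pairs = list(combinations(top_words, 2))
    return sum(cosine(vectors[a], vectors[b]) for a, b in pairs) / len(pairs)


# Hypothetical embeddings; real ones would come from a trained Word2Vec model.
vectors = {
    "stroke": [1.0, 0.1],
    "infarction": [0.9, 0.2],
    "cough": [0.0, 1.0],
}
coherent = topic_coherence(["stroke", "infarction"], vectors)
mixed = topic_coherence(["stroke", "cough"], vectors)
```

Under this definition, a topic whose top words lie close together in the embedding space scores higher than a topic that mixes unrelated words, which makes the score usable for comparing LDA and NMF topics or choosing the number of topics.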

  12. Models
      • Logistic Regression
      • Random Forest
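To make the first of these models concrete, here is a minimal logistic regression fitted by batch gradient descent on term counts. It is a sketch under stated assumptions, not the thesis implementation: the toy data, learning rate, and function names are hypothetical, and a real experiment would more likely use an established library and handle the strong class imbalance between cases and controls.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(row: list[float], weights: list[float], bias: float) -> float:
    """Predicted probability of the positive class (case)."""
    return sigmoid(bias + sum(w * x for w, x in zip(weights, row)))


def fit_logistic(X, y, lr: float = 0.1, epochs: int = 1000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n_features = len(X[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for row, label in zip(X, y):
            error = predict_proba(row, weights, bias) - label
            for j, x in enumerate(row):
                grad_w[j] += error * x
            grad_b += error
        weights = [w - lr * g / len(X) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(X)
    return weights, bias


# Hypothetical toy counts for two terms; label 1 = case, 0 = control.
X = [[2, 0], [1, 0], [0, 1], [0, 2]]
y = [1, 1, 0, 0]
weights, bias = fit_logistic(X, y)
```

The fitted weights can be read directly as the direction and strength of each term's association with the outcome, which is one reason logistic regression is a natural baseline next to a random forest.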

  13. Models

  14. Next steps
      • STIZON dataset
        • Experimentation
        • Pipeline optimization
      • Negation detection

  15. Thank you!
      Vrije Universiteit LUMC Neurologie • Mark Hoogendoorn • Hendrikus J. H. van Os • Ioannis Pantazis • Marieke J. H. Wermer LIACS LUMC PHEG • Matthijs de Leeuw • Mattijs A. Numans • Suzan Verberne • Tobias N. Bonten • Teddy Etoeharnowo • Niels H. Chavannes • Anneloes Louwe • Rolf H. H. Groenwold LUMC Statistiek • Janet Kist • Hein Putter • Michiel Meulenbroek • Erik van Zwet • Frederike Buechner Turku University (Finland) • Sepinoud Azimi
