  1. Exploring the application of deep learning techniques on medical text corpora. José Antonio Miñarro-Giménez (a), Oscar Marín-Alonso (a,b) and Matthias Samwald (a). (a) Section for Medical Expert and Knowledge-Based Systems, Center for Medical Statistics, Informatics, and Intelligent Systems, Medical University of Vienna, Austria & Vienna University of Technology, Austria. (b) Dept. of Computer Technology, University of Alicante, Alicante, Spain. MIE 2014, 1st September 2014, Istanbul, Turkey.

  2. Introduction • Problem: it is increasingly difficult to find relevant information in the growing biomedical literature, spread across many journals: Artificial Intelligence in Medicine, Computer Methods and Programs in Biomedicine, IEEE Journal of Biomedical and Health Informatics, International Journal of Medical Informatics, International Journal of Technology Assessment in Health Care, Journal of Biomedical Informatics, Medical & Biological Engineering & Computing, Journal of the American Medical Informatics Association, Medical Decision Making, Methods of Information in Medicine, Statistical Methods in Medical Research, Statistics in Medicine, Briefings in Bioinformatics, BMC Bioinformatics, Medical Image Analysis, Artificial Intelligence, Neuroinformatics, Bioinformatics, among others.

  3. Introduction • Challenge: automatically process the biomedical literature. • Approaches: data mining, information extraction methods, natural language processing, ... • Tools: word2vec (https://code.google.com/p/word2vec/)

  4. Word2vec toolkit • The toolkit learns vector models of words from a text corpus. Training options: • Type of architecture: skip-gram or continuous bag-of-words. • Vector space dimension. • Size of the context window. • Training algorithms: hierarchical softmax and/or negative sampling. • Threshold for downsampling the frequent words. • ...
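
As an illustration, a minimal sketch of how these options map onto a word2vec training call, assuming the gensim re-implementation rather than the original C toolkit (gensim 4.x parameter names; older releases use `size` instead of `vector_size`) and a hypothetical pre-processed corpus file:

```python
# Minimal sketch: training a vector model with the options listed above,
# using the gensim re-implementation of word2vec (gensim 4.x parameter names).
# "processed_corpus.txt" is a hypothetical placeholder for a pre-processed corpus.
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

sentences = LineSentence("processed_corpus.txt")   # one pre-processed sentence per line

model = Word2Vec(
    sentences,
    sg=1,             # architecture: 1 = skip-gram, 0 = continuous bag-of-words
    vector_size=300,  # dimension of the vector space
    window=5,         # size of the context window
    hs=0,             # hierarchical softmax off ...
    negative=10,      # ... negative sampling on (10 noise words)
    sample=1e-5,      # threshold for downsampling frequent words
    min_count=5,      # ignore words that occur fewer than 5 times
    workers=4,        # training threads
)
model.wv.save_word2vec_format("vectors.bin", binary=True)
```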

  5. Word2vec toolkit [diagram: the word2vec toolkit and its two search tools, distance and analogy]

  6. Word2vec Analogy method

  7. Distance vs Analogy
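
To make the two search methods concrete, a brief sketch using gensim's KeyedVectors; the model file name and the query terms (warfarin, heparin, thrombophlebitis) are illustrative assumptions, not results from the slides:

```python
# Sketch of the two search methods on a trained model.
# The model file and the query terms are illustrative assumptions.
from gensim.models import KeyedVectors

wv = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)

# Distance method: nearest neighbours of a single term by cosine similarity.
print(wv.most_similar("warfarin", topn=10))

# Analogy method: given a known pair (warfarin, thrombophlebitis) and a query
# drug (heparin), look for x such that warfarin : thrombophlebitis = heparin : x,
# i.e. the vector arithmetic thrombophlebitis - warfarin + heparin.
print(wv.most_similar(positive=["thrombophlebitis", "heparin"],
                      negative=["warfarin"], topn=10))
```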

  8. Corpus
     Corpora | Word count | Vocabulary size
     Clinically relevant subset of PubMed, full abstracts | 161,428,286 | 204,096
     Conclusion sections from clinically relevant subset of PubMed, "pubmed_key_assertions" | 17,342,158 | 47,703
     Merck Manuals | 12,667,064 | 49,174
     Medscape | 25,854,998 | 63,600
     Clinically relevant subset of Wikipedia, "wikipedia" | 10,945,677 | 65,875
     Combined corpus (including all corpora above), "combined" | 236,835,672 | 261,353

  9. NDF-RT ontology
     Relationship | Description | Example
     may_treat | Provides the association between drugs and the diseases they may treat. | Warfarin -> may_treat -> "Thrombophlebitis"
     may_prevent | Provides the list of diseases that a drug may prevent. | Warfarin -> may_prevent -> "Myocardial Infarction"
     has_PE | Relates drugs to their corresponding physiological effects. | Warfarin -> has_PE -> "Decreased Coagulation Factor Concentration"
     has_MoA | The mechanisms of action of each drug. | Warfarin -> has_MoA -> "Vitamin K Epoxide Reductase Inhibitors"
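
For evaluation, these relationships serve as gold-standard drug-target pairs; one possible representation (the data structure is an assumption, and the entries are just the Warfarin examples from the table):

```python
# NDF-RT relationships as gold-standard drug -> target pairs for evaluation.
# The data structure is an assumption; the entries are the examples above,
# written as single tokens (multiword terms grouped with underscores).
ndf_rt = {
    "may_treat":   [("warfarin", "thrombophlebitis")],
    "may_prevent": [("warfarin", "myocardial_infarction")],
    "has_PE":      [("warfarin", "decreased_coagulation_factor_concentration")],
    "has_MoA":     [("warfarin", "vitamin_k_epoxide_reductase_inhibitors")],
}
```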

  10. Testing system [architecture diagram: a RESTful client queries a RESTful server that provides an analogy service and a distance service (wrapping the word2vec analogy and distance tools) together with a query module and a matching module against the NDF-RT ontology; the word2vec train tool produces the trained corpus used by the services, and results are returned to the client]
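
A minimal sketch of how such a server could expose the two search services over REST; the Flask framework, endpoint names, and model path are assumptions, and the query and matching modules against NDF-RT are omitted:

```python
# Sketch of a RESTful server exposing the distance and analogy services.
# The framework (Flask), endpoint names and model path are assumptions;
# the query and matching modules against NDF-RT are omitted for brevity.
from flask import Flask, jsonify, request
from gensim.models import KeyedVectors

app = Flask(__name__)
wv = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)  # trained corpus

@app.route("/distance")
def distance():
    term = request.args["term"]
    return jsonify([(w, float(s)) for w, s in wv.most_similar(term, topn=20)])

@app.route("/analogy")
def analogy():
    # a is to b as c is to ?
    a, b, c = request.args["a"], request.args["b"], request.args["c"]
    return jsonify([(w, float(s)) for w, s in
                    wv.most_similar(positive=[b, c], negative=[a], topn=20)])

if __name__ == "__main__":
    app.run()
```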

  11. Pre-processing corpus

  12. Pre-processing corpus [pipeline diagram] • Gathering: raw text corpora and the list of terms from the NDF-RT ontology. • Processing corpora: remove punctuation signs, avoid capitalized words, group multiword terms. • Output: processed text.
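
A hedged sketch of these pre-processing steps; the exact rules are not given in the slides, so the term list and the underscore convention for grouping multiword terms are assumptions:

```python
# Sketch of the pre-processing steps: remove punctuation, lowercase,
# and group multiword NDF-RT terms into single tokens (here via underscores).
# The term list and the underscore convention are illustrative assumptions.
import re

ndf_rt_terms = ["myocardial infarction", "vitamin k epoxide reductase inhibitors"]

def preprocess(text, terms=ndf_rt_terms):
    text = text.lower()                              # avoid capitalized words
    text = re.sub(r"[^\w\s]", " ", text)             # remove punctuation signs
    for term in sorted(terms, key=len, reverse=True):
        text = text.replace(term, term.replace(" ", "_"))  # group multiword terms
    return re.sub(r"\s+", " ", text).strip()

print(preprocess("Warfarin may prevent Myocardial Infarction."))
# -> "warfarin may prevent myocardial_infarction"
```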

  13. Statistics 1. The number of resulting word vectors with at least one correct term from the relationships of the NDF-RT ontology (i.e., the hit rate). 2. The evaluation of window size and the type of architecture. 3. The evaluation of vector dimension in the vector model.
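
For the first statistic, a possible hit-rate computation: a drug counts as a hit when at least one correct NDF-RT target appears among the model's top-n suggestions (the gold-standard format, the search function, and the cut-off are assumptions):

```python
# Hit rate: fraction of drugs for which at least one correct NDF-RT target
# appears among the top-n suggestions returned by a search method.
# The gold-standard format, the search function and n are assumptions.
def hit_rate(gold, search, topn=20):
    """gold: {drug: set of correct terms}; search: drug -> ranked list of terms."""
    hits = sum(1 for drug, targets in gold.items()
               if targets & set(search(drug)[:topn]))
    return hits / len(gold)

# Hypothetical usage with the "may_treat" relationship and a trained model `wv`:
# gold = {"warfarin": {"thrombophlebitis"}, ...}
# print(hit_rate(gold, lambda d: [w for w, _ in wv.most_similar(d, topn=20)]))
```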

  14. Results: hit rate
      Corpus | Tool | may_treat | may_prevent | has_PE | has_MoA
      combined | Analogy | 27.37% | 10.59% | 2.49% | 6.91%
      combined | Distance | 3.21% | 6.32% | 0.67% | 6.81%
      PubMed key assertions | Analogy | 15.74% | 5.09% | 0.84% | 1.51%
      PubMed key assertions | Distance | 2.13% | 4.07% | 0.37% | 3.60%
      wikipedia | Analogy | 14.9% | 5.35% | 2.22% | 2.69%
      wikipedia | Distance | 1.3% | 3.34% | 0.32% | 3.34%

  15. Results Window size

  16. Results Vector dimension

  17. Conclusions • Word2vec is very efficient at generating vector models and executing the different search methods. • Pre-processing the corpus content is needed to improve the resulting vector models. • The analogy method retrieves better related terms than the distance search method. • The generated vector models provide the best results when searching for information related to the "may_treat" relationship. • However, a hit rate of only 27% is poor compared to other approaches. • The choice of vector dimension has more impact than other training parameters such as the size of the context window. • The number of indexed terms is a better indicator of corpus quality than the raw number of words.

  18. Future work • Test the word2vec toolkit with even larger medical corpora (> 10 GB). • Investigate the use of contextual knowledge to improve precision and recall of word2vec search methods. – Medical terminologies and ontologies.

  19. QUESTIONS ?
