Learning Links in MeSH Co-occurrence Network Preliminary Results Andrej Kastrin 1 , Thomas C. Rindflesch 2 and Dimitar Hristovski 3 andrej.kastrin@gmail.com dimitar.hristovski@gmail.com 1 Faculty of Information Studies, Novo mesto, Slovenia 2 Lister Hill National Center for Biomedical Communications, National Library of Medicine, Bethesda, MD, USA 3 Institute of Biostatistics and Medical Informatics, Faculty of Medicine, University of Ljubljana, Ljubljana, Slovenia MIE 2014, Istanbul, Turkey
Literature-Based Discovery
• Find implicit relations between entities.
• Propose implicit relations as potential scientific hypotheses.
• Swanson’s XYZ model:
  • Relations XY and YZ are known.
  • The implicit relation XZ is a (putative) new discovery.
[Diagram: X linked to Z through intermediate Y]
2/19
Swanson’s Example
• Blood viscosity was found to co-occur with Raynaud’s disease.
• Fish oil reduces blood viscosity.
• Fish oil was proposed as a new treatment for Raynaud’s disease.
[Diagram: X = fish oil, Y = high blood viscosity, Z = Raynaud’s disease]
3/19
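The XYZ pattern can be sketched in a few lines of Python. This is only an illustration of the idea, not the authors' implementation: the function name `implicit_links` and the two-edge toy input are assumptions made for this example.

```python
from collections import defaultdict


def implicit_links(known_pairs, x):
    """Return {z: [y, ...]} for concepts z that reach x only through some y."""
    neighbors = defaultdict(set)
    for a, b in known_pairs:
        neighbors[a].add(b)
        neighbors[b].add(a)

    direct = set(neighbors[x])
    candidates = defaultdict(list)
    for y in sorted(direct):
        for z in sorted(neighbors[y]):
            # Keep z only if it is not x itself and has no known link to x.
            if z != x and z not in direct:
                candidates[z].append(y)
    return dict(candidates)


# Toy input (hypothetical encoding of the two known relations on this slide).
known = [
    ("Fish oil", "Blood viscosity"),
    ("Blood viscosity", "Raynaud's disease"),
]
result = implicit_links(known, "Fish oil")
print(result)
```

Each returned key is a candidate Z, and its value lists the intermediate Y concepts that connect it to X.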
Literature-Based Discovery as a Link Prediction Problem
• We can model the biomedical literature as a network of biomedical concepts.
• Link prediction is the task of predicting future links between concepts that are not directly connected in the current snapshot of the network.
[Diagram: nodes X, Y, and Z]
4/19
MEDLINE/PubMed www.ncbi.nlm.nih.gov/pubmed 5/19
Medical Subject Headings (MeSH)
• MeSH is the source of nodes for our network.
• MeSH is a comprehensive controlled vocabulary for indexing in the life sciences.
• The 2013 version of MeSH contains 26 853 descriptors.
• Every article in MEDLINE/PubMed is indexed with about 10–15 descriptors.
• Some descriptors are marked with an asterisk (*), indicating the article’s major topic.
6/19
MeSH Terms as Used to Describe a Paper
PMID- 20091016
TI  - Chi-square-based scoring function for...
AB  - OBJECTIVES: Text categorization has been used...
MH  - Access to Information
MH  - Algorithms
MH  - Artificial Intelligence
MH  - Bayes Theorem
MH  - *Chi-Square Distribution
MH  - Data Collection
MH  - Data Interpretation, Statistical
MH  - *Data Mining
MH  - Humans
MH  - *MEDLINE
MH  - Medical Informatics
MH  - *Natural Language Processing
7/19
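One way to turn such a record into co-occurrence pairs is sketched below. It is a minimal illustration under assumptions: the helper name `mesh_headings` is hypothetical, the record is abbreviated to three of the MH lines shown on the slide, and qualifiers after `/` are simply dropped.

```python
from itertools import combinations

# Abbreviated record using three of the MH lines from the slide.
record = """PMID- 20091016
MH  - Algorithms
MH  - *Data Mining
MH  - *MEDLINE"""


def mesh_headings(medline_record):
    """Return descriptor names from MH lines, without the major-topic '*'."""
    headings = []
    for line in medline_record.splitlines():
        tag, sep, value = line.partition("-")
        if sep and tag.strip() == "MH":
            # Drop any qualifier after '/' and the leading asterisk.
            headings.append(value.split("/")[0].strip().lstrip("*"))
    return headings


headings = mesh_headings(record)
# Every unordered pair of descriptors on the same article is one co-occurrence.
pairs = list(combinations(sorted(set(headings)), 2))
print(pairs)
```

Counting these pairs over all articles in a time window would give the weighted edge list of a co-occurrence network.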
Methods
• We have a training network $G[t_1, t_2]$, which contains the interactions among nodes that take place in the time interval $[t_1, t_2]$.
• We have a test network $G[t_3, t_4]$, which contains the interactions among nodes that take place in the time interval $[t_3, t_4]$.
• Learning (prediction) task: provide a list of edges that are present in the test network but absent from the training network.
[Diagram: training network and test network over nodes A–H]
8/19
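The prediction target can be expressed as a set difference. The sketch below is an assumption-laden illustration: the function names are hypothetical, the toy edges are invented, and restricting targets to nodes already present in the training network is a common convention that the slides do not state.

```python
def canonical(edges):
    """Store each undirected edge as a sorted tuple so (a, b) equals (b, a)."""
    return {tuple(sorted(edge)) for edge in edges}


def new_edges(train_edges, test_edges):
    """Edges present in the test network but absent from the training network."""
    train = canonical(train_edges)
    train_nodes = {node for edge in train for node in edge}
    return {
        (u, v)
        for (u, v) in canonical(test_edges) - train
        # Assumption: only pairs of nodes already seen in training are targets.
        if u in train_nodes and v in train_nodes
    }


# Invented toy networks, not the ones drawn on the slide.
train = [("A", "B"), ("B", "C"), ("C", "D")]
test = [("B", "A"), ("A", "C"), ("B", "D"), ("D", "E")]
targets = new_edges(train, test)
print(sorted(targets))
```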
Data Collection
• We constructed two networks:
  • Training network [2003–2007]
  • Test network [2008–2012]
• Networks were post-processed to remove non-informative edges.
• We applied a $\chi^2$ test of independence to each co-occurrence pair; the resulting statistic indicates whether the pair occurs together more often than expected by chance.
9/19
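For a single pair, the test statistic can be computed from a 2×2 contingency table of article counts. The following is a sketch only: the function name, its arguments, and the example counts are assumptions, and the slides do not say which cutoff was used to filter edges.

```python
def chi_square_2x2(n_both, n_u, n_v, n_total):
    """Pearson chi-square statistic (no continuity correction) for a 2x2 table.

    n_both:  articles indexed with both descriptors
    n_u/n_v: articles indexed with descriptor u / descriptor v
    n_total: all articles in the collection
    """
    a = n_both                        # u and v
    b = n_u - n_both                  # u without v
    c = n_v - n_both                  # v without u
    d = n_total - n_u - n_v + n_both  # neither
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0
    return n_total * (a * d - b * c) ** 2 / denominator


def co_occurs_more_than_chance(n_both, n_u, n_v, n_total):
    """True if the observed co-occurrence count exceeds the expected count."""
    return n_both > n_u * n_v / n_total


# Invented counts: 1000 articles, each descriptor on 100, both on 50.
statistic = chi_square_2x2(50, 100, 100, 1000)
print(round(statistic, 2))
```

Because the statistic itself is symmetric, the direction check in `co_occurs_more_than_chance` is needed to separate pairs that co-occur more often than expected from pairs that co-occur less often.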
Similarity Measures for Link Prediction
• For each node pair $(u, v)$ we calculate a similarity score $s(u, v)$.
• Score $s(u, v)$ gives the likelihood of link formation between nodes $u$ and $v$.
• We used two similarity measures:
  • Jaccard coefficient: $s_{uv} = \frac{|\Gamma(u) \cap \Gamma(v)|}{|\Gamma(u) \cup \Gamma(v)|}$, where $\Gamma(u)$ is the set of neighbors of $u$
  • Adamic–Adar coefficient: $s_{uv} = \sum_{z \in \Gamma(u) \cap \Gamma(v)} \frac{1}{\log |\Gamma(z)|}$
10/19
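Both measures are easy to compute from neighbor sets. The sketch below assumes an undirected graph stored as a dict of sets and uses the natural logarithm, since the slides do not state the base. The toy graph is invented so that u and v share 4 of 9 neighbors.

```python
import math
from collections import defaultdict


def build_graph(edges):
    """Adjacency sets for an undirected graph."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def jaccard(graph, u, v):
    union = graph[u] | graph[v]
    return len(graph[u] & graph[v]) / len(union) if union else 0.0


def adamic_adar(graph, u, v):
    # Each common neighbor z is linked to both u and v, so |Γ(z)| >= 2
    # and log|Γ(z)| > 0 whenever u != v.
    return sum(1.0 / math.log(len(graph[z])) for z in graph[u] & graph[v])


# Invented toy graph: 4 common neighbors, 2 exclusive to u, 3 exclusive to v.
edges = [("u", z) for z in ("z1", "z2", "z3", "z4", "a", "b")]
edges += [("v", z) for z in ("z1", "z2", "z3", "z4", "c", "d", "e")]
g = build_graph(edges)

j = jaccard(g, "u", "v")
aa = adamic_adar(g, "u", "v")
print(round(j, 2), round(aa, 2))
```

Adamic–Adar gives more weight to common neighbors with few links of their own, whereas Jaccard treats all common neighbors equally.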
Jaccard Coefficient
$s_{uv} = \frac{|\Gamma(u) \cap \Gamma(v)|}{|\Gamma(u) \cup \Gamma(v)|} = \frac{4}{9} = 0.44$
[Diagram: example network around nodes u and v]
11/19
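Adamic–Adar Coefficient
$s_{uv} = \sum_{z \in \Gamma(u) \cap \Gamma(v)} \frac{1}{\log |\Gamma(z)|} = \frac{1}{\log 7} + \cdots + \frac{1}{\log 4} = 7.60$
[Diagram: example network with nodes u and v and common neighbors z1–z4]
12/19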
Performance Assessment
• A major challenge is the huge number of possible node pairs.
• We use a bootstrap resampling approach:
  • We draw a random sample of 1000 nodes and create the corresponding training and test networks.
  • We compute a link prediction score $s(u, v)$ for each node pair that has no interaction before time $t_3$.
  • We assign the class label “positive” to the node pair if the link occurs in the test network and “negative” otherwise.
  • We repeat this procedure 100 times.
• Using the class labels and similarity scores, we construct an ROC curve.
13/19
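The area under the ROC curve for one resample can be computed directly from the scores and class labels. The sketch below uses the rank-based (Mann–Whitney) formulation rather than tracing the curve; the function names and the toy scores are assumptions, and the slides do not say how the authors computed or averaged their curves.

```python
def auc(scores, labels):
    """Probability that a random positive pair outscores a random negative pair.

    Ties count as half a win. This is quadratic in the number of pairs, so it
    is only meant for small illustrative samples.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative pair")
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


def mean_auc(resamples):
    """Average AUC over (scores, labels) tuples, one per resample."""
    values = [auc(scores, labels) for scores, labels in resamples]
    return sum(values) / len(values)


# Invented scores for five candidate node pairs; 1 = link appears in test.
scores = [0.9, 0.8, 0.4, 0.3, 0.2]
labels = [1, 0, 1, 0, 0]
value = auc(scores, labels)
print(round(value, 3))
```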
Results
Topological Characteristics of the MeSH Networks
Parameter                   Train        Test
Nodes                      24 225      25 570
Edges                   4 897 380   5 615 965
Edges (reduced)         3 328 288   3 810 535
Density                      0.01        0.01
Mean degree                274.78      298.05
Average path length          2.23        2.20
Clustering coefficient       0.27        0.26
Small-worldness index       21.57       20.70
14/19
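As a sanity check, the density and mean degree in the table can be recomputed from the node counts and the reduced edge counts. This sketch assumes undirected networks without self-loops or parallel edges, and that the reported values are based on the reduced edge sets; the slides state neither explicitly, although the numbers appear consistent with it.

```python
def mean_degree(n_nodes, n_edges):
    """Each undirected edge contributes to the degree of two nodes."""
    return 2 * n_edges / n_nodes


def density(n_nodes, n_edges):
    """Fraction of all possible undirected node pairs that are connected."""
    return 2 * n_edges / (n_nodes * (n_nodes - 1))


networks = {
    "train": (24225, 3328288),  # nodes, reduced edges
    "test": (25570, 3810535),
}
for name, (nodes, edges) in networks.items():
    print(name, round(mean_degree(nodes, edges), 2), round(density(nodes, edges), 2))
```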
Similarity Score Distribution
[Figure: density of similarity scores by class (0, 1); x-axis: Jaccard coefficient; y-axis: density]
15/19
Prediction Performance
[Figure: ROC curves (average true positive rate vs. false positive rate), one panel per measure. Jaccard: AUC = 0.78. Adamic–Adar: AUC = 0.82.]
AUC (area under the ROC curve): 0.90–1.00 = excellent, 0.80–0.90 = good, 0.70–0.80 = fair, 0.60–0.70 = poor, 0.50–0.60 = fail
16/19
Example
• Training network: 1991–1995
• Test network: 1996
1|Case-Control Studies|Rats, Inbred Strains|4867
2|Follow-Up Studies|Binding Sites|4512
3|Blotting, Western|Combined Modality Therapy|4271
4|Indicators and Reagents|Age Factors|4138
5|France|Disease Models, Animal|3991
6|Prognosis|Chickens|3955
7|Water|Prognosis|3901
8|Questionnaires|Microscopy, Electron|3895
9|Great Britain|Disease Models, Animal|3833
10|Signal Transduction|Retrospective Studies|3748
...
1135416|Prostatic Neoplasms|I-kappa B Proteins|261
17/19
Example 18/19
Future Work
• Explore the role of node and edge attributes in prediction performance.
• Extend the study to semantic relations instead of co-occurrences.
• Assess prediction performance on a large-scale network.
• Develop network filtering methods.
• Develop a web application for real-time computing.
19/19