Research Problem Definition, Part 1
Advanced Computational Intelligence and Deep Machine Learning for Early Detection and Diagnosis of Diseases
Tarek Khorshed, PhD Student, American University in Cairo
Problem Definition
Disease Diagnosis
Detecting and diagnosing diseases remains a major challenge, with severe negative impacts on global health and the economy. Examples: cancer, H1N1, SARS and tuberculosis.
Early detection of diseases and outbreaks is an important area that has not yet exploited the full potential of computer science.
The increasing complexity of health information has made it difficult, if not impossible, to use traditional monitoring techniques to detect irregular signs of possible disease threats or outbreaks.
The complexity lies in the different ways the data can be interpreted, e.g. by geographical location, climate and seasonal changes.
Challenges in Analysis of Biomedical Data
Analysis of biomedical gene expression data is extremely challenging given the complexity of biological networks and the high dimensionality of the data.
Current clustering techniques rely on preprocessing the data for feature extraction and dimensionality reduction, which affects the accuracy of disease diagnosis [4].
The proposed research targets alternative solutions that are capable of processing high-dimensional data to achieve better accuracy.
Figure: Molecular gene expression data; data preprocessing and feature extraction [4].
[4] R. Xu and D. C. Wunsch, "Clustering Algorithms in Biomedical Research: A Review," IEEE Reviews in Biomedical Engineering, vol. 3, pp. 120-154, 2010.
Overview of Gene Expression Data
Gene Expression Data Representation
Gene data is represented as a real-valued two-dimensional matrix.
Rows represent the expression patterns of genes.
Columns represent the expression profiles of samples.
Figure: Matrix representation of gene data [6].
Challenges in Gene Expression Data
Data Quality
Only a small subset of the gene data might influence the disease being monitored.
Interesting features of the disease are present only in a subset of the data, which further complicates pattern analysis.
The gene expression matrix contains many data anomalies such as noise and missing values.
Preprocessing the gene data is a crucial step before attempting any data analysis for disease diagnosis.
Preprocessing Tasks
Normalizing the data, estimating missing values, and filtering out gene expression data that is not relevant or significant to the biological process being analyzed.
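The three preprocessing tasks listed above can be sketched as follows. This is a minimal illustration in pure Python, not the method used in the presentation: the function names, the row-mean imputation rule, the variance threshold and the toy matrix are all assumptions chosen for brevity.

```python
import math

def impute_row_mean(matrix):
    """Replace missing values (None) with the mean of the observed values in the same gene row."""
    filled = []
    for row in matrix:
        observed = [v for v in row if v is not None]
        mean = sum(observed) / len(observed) if observed else 0.0
        filled.append([mean if v is None else v for v in row])
    return filled

def filter_low_variance(matrix, min_variance):
    """Keep only gene rows whose variance across samples is at least min_variance."""
    kept = []
    for row in matrix:
        mean = sum(row) / len(row)
        variance = sum((v - mean) ** 2 for v in row) / len(row)
        if variance >= min_variance:
            kept.append(row)
    return kept

def standardize_rows(matrix):
    """Scale each gene row to zero mean and unit standard deviation."""
    result = []
    for row in matrix:
        mean = sum(row) / len(row)
        std = math.sqrt(sum((v - mean) ** 2 for v in row) / len(row))
        result.append([(v - mean) / std if std > 0 else 0.0 for v in row])
    return result

# Hypothetical toy matrix: rows are genes, columns are samples, None marks a missing value.
raw = [
    [2.0, 4.0, None, 6.0],  # gene with one missing measurement
    [5.0, 5.0, 5.0, 5.0],   # flat gene that carries no pattern
    [1.0, 3.0, 5.0, 7.0],
]

# Filtering runs before standardization, because afterwards every row has unit variance.
imputed = impute_row_mean(raw)
clean = standardize_rows(filter_low_variance(imputed, min_variance=0.5))
```

Real pipelines often use more elaborate estimators for missing values and platform-specific normalization; the ordering of the steps is the main point of the sketch.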
Challenges in Gene Expression Data
Features >>> Samples
In a typical cancer classification example, the number of features is much larger than the number of samples.
Table legend:
Name: short identifier of the data set
Platform: type of microarray platform
N: number of samples (e.g. tumor samples: hundreds)
n: number of features (e.g. gene probes: thousands)
Source: G. Bontempi, "A Blocking Strategy to Improve Gene Selection for Classification of Gene Expression Data," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 4, no. 2, pp. 293-300, April-June 2007.
Cluster Analysis
Clustering Overview
The objective is to group data objects into a set of disjoint classes called clusters.
Clustering is a form of unsupervised learning because it does not depend on predefined class labels.
Classification of Clustering Techniques
1. Hard Partitional Clustering: attempts to find a K-partition of the data.
2. Hierarchical Clustering: attempts to build a tree structure of nested partitions.
3. Fuzzy Clustering: a data object can belong to a cluster with a degree of membership.
4. Density-Based Clustering: defines core, border and noise points.
Cluster Analysis
1. Hard Partitional Clustering
Attempts to find a K-partition of the data.
K-Means Clustering
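The slide names K-means without spelling out the procedure. The following is a minimal sketch of the standard algorithm (assign each point to its nearest centroid, then recompute centroids) in pure Python. The function names, the random initialization from the data points, and the toy profiles are assumptions made for illustration.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def k_means(points, k, max_iter=100, seed=0):
    """Return (labels, centroids) for a hard K-partition of points."""
    rng = random.Random(seed)
    # Initialize centroids as k distinct data points chosen at random.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Hypothetical toy data: two well-separated groups of two-dimensional profiles.
profiles = [
    [1.0, 1.1], [0.9, 1.0], [1.2, 0.8],
    [5.0, 5.2], [5.1, 4.9], [4.8, 5.0],
]
labels, centroids = k_means(profiles, k=2)
```

Because the result depends on the initial centroids, the `seed` parameter is exposed so that several random starts can be compared.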
Cluster Analysis
1. Hard Partitional Clustering
Attempts to find a K-partition of the data.
Mixture-Based Clustering: EM Algorithm for Mixture of Gaussians
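The slide gives only the name of the method. As a rough sketch of the textbook EM updates for a Gaussian mixture, the example below fits a one-dimensional, two-component mixture in pure Python. Restricting to one dimension, the initial parameter values, the variance floor and the toy data are all assumptions made to keep the example short.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a univariate Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm_1d(data, means, variances, weights, n_iter=50):
    """Run n_iter EM iterations and return (means, variances, weights, responsibilities)."""
    k = len(means)
    means, variances, weights = list(means), list(variances), list(weights)
    resp = []
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each data point.
        resp = []
        for x in data:
            joint = [weights[j] * gaussian_pdf(x, means[j], variances[j]) for j in range(k)]
            total = sum(joint)
            resp.append([v / total for v in joint])
        # M-step: re-estimate the parameters from the responsibility-weighted data.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, 1e-6)  # floor avoids a collapsing component
    return means, variances, weights, resp

# Hypothetical toy data drawn around two levels of expression.
data = [1.0, 1.2, 0.8, 1.1, 5.0, 5.2, 4.8, 5.1]
means, variances, weights, resp = em_gmm_1d(
    data, means=[0.0, 6.0], variances=[1.0, 1.0], weights=[0.5, 0.5]
)
```

Taking the component with the largest responsibility for each point turns the soft assignment into a hard partition.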
Cluster Analysis
2. Hierarchical Clustering
Organizes the data set into a hierarchical structure.
Figure: Dendrogram output for hierarchical clustering.
1. Agglomerative methods: bottom-up approach where each element starts in its own cluster and pairs of clusters are then merged.
2. Divisive methods: top-down approach where all elements start in one cluster, which is then divided recursively.
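The agglomerative (bottom-up) approach can be sketched as below. The slide does not say how the distance between two clusters is measured, so single linkage (distance between the closest pair of members) is an assumption here, as are the function names and the toy data. The recorded merge list is the information a dendrogram would display.

```python
def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def agglomerative(points, target_clusters):
    """Single-linkage agglomerative clustering.

    Returns (clusters, merges): clusters holds lists of point indices and
    merges records each merge as (members_a, members_b, distance).
    """
    clusters = [[i] for i in range(len(points))]  # every element starts alone
    merges = []
    while len(clusters) > target_clusters:
        best = None
        # Find the pair of clusters whose closest members are nearest.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(
                    euclidean(points[i], points[j])
                    for i in clusters[a]
                    for j in clusters[b]
                )
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), d))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # b > a, so index a is unaffected by the deletion
    return clusters, merges

# Hypothetical toy data: two well-separated groups.
points = [
    [1.0, 1.1], [0.9, 1.0], [1.2, 0.8],
    [5.0, 5.2], [5.1, 4.9], [4.8, 5.0],
]
clusters, merges = agglomerative(points, target_clusters=2)
```

This naive version recomputes all pairwise cluster distances on every merge, which is acceptable for a small illustration but not for thousands of genes.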
Gene Expression Data Cluster Analysis
Similarity Measures for Gene Expression Data
Proximity describes how we measure the distance or similarity between a pair of data objects.
1) Distance (dissimilarity)
2) Similarity
Table: Common similarity measures for continuous variables [4].
Similarity Measures for Gene Expression Data
Euclidean Distance
Performs well for many clustering applications, but produces poor results when used with gene data [10].
For gene data we are more interested in the overall similarity of the expression pattern than in the magnitude of each individual attribute.
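A small numeric illustration of this point, using made-up profiles: two genes with the same shape but different magnitude end up far apart under Euclidean distance, while a gene with the opposite shape looks close. Standardizing each profile before measuring the distance is one common remedy and is shown here as an assumption, not as the approach of the presentation.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def standardize(profile):
    """Scale a profile to zero mean and unit standard deviation."""
    mean = sum(profile) / len(profile)
    std = math.sqrt(sum((v - mean) ** 2 for v in profile) / len(profile))
    return [(v - mean) / std for v in profile]

# Hypothetical expression profiles over five samples.
gene_a = [1.0, 2.0, 3.0, 2.0, 1.0]
gene_b = [10.0, 20.0, 30.0, 20.0, 10.0]  # same shape as gene_a, ten times the magnitude
gene_c = [3.0, 2.0, 1.0, 2.0, 3.0]       # opposite shape, similar magnitude

raw_ab = euclidean(gene_a, gene_b)  # large, although the patterns match
raw_ac = euclidean(gene_a, gene_c)  # small, although the patterns are opposite
std_ab = euclidean(standardize(gene_a), standardize(gene_b))  # close to zero

print(raw_ab, raw_ac, std_ab)
```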
Similarity Measures for Gene Expression Data
Pearson’s Correlation Coefficient
Measures the similarity between the shapes of two gene expression patterns.
Commonly used in clustering gene data, where it has produced very good results.
Does not perform well in the presence of outliers.
Can assign a high similarity score to a pair of dissimilar patterns if they share a common peak or valley [10].
Jackknife Correlation
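The shared-peak problem and the jackknife remedy can be sketched as follows. The slide only names jackknife correlation, so the definition used here (the minimum of the full correlation and every leave-one-sample-out correlation) is an assumption based on the usual formulation, and the two profiles are invented for the example.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def jackknife_correlation(x, y):
    """Minimum of the full correlation and all leave-one-sample-out correlations."""
    scores = [pearson(x, y)]
    for i in range(len(x)):
        # Drop sample i from both profiles and recompute the correlation.
        scores.append(pearson(x[:i] + x[i + 1:], y[:i] + y[i + 1:]))
    return min(scores)

# Hypothetical profiles that disagree everywhere except for one shared peak.
gene_a = [0.1, -0.2, 0.0, 9.0, 0.2, -0.1]
gene_b = [-0.1, 0.2, 0.1, 9.0, -0.2, 0.0]

full = pearson(gene_a, gene_b)                  # dominated by the shared peak
robust = jackknife_correlation(gene_a, gene_b)  # exposes the disagreement

print(full, robust)
```

A single outlying sample can no longer produce a high score on its own, at the cost of computing one extra correlation per sample.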
Gene Based Clustering Algorithms
Evaluation of Hard Partitional Clustering with Gene Expression Data
Advantages
Complexity is O(N K d), which is practical for large data sets since the number of clusters K and the number of dimensions d are typically much smaller than N.
Similarity measures can be relatively simple to compute.
A variation called Bisecting K-means can enhance the performance of the algorithm [10].
Disadvantages
The correct number of clusters is not known in advance.
There is no standard method to define the initial set of clusters.
The algorithm must be run several times with random initial partitions, which is computationally expensive.
References
[1] International Agency for Research on Cancer (IARC), World Cancer Report. http://www.iarc.fr
[2] National Cancer Institute (NCI), http://www.cancer.gov/
[3] P. Baldi and G. W. Hatfield, "DNA Microarrays and Gene Expression: From Experiments to Data Analysis and Modelling," Cambridge Univ. Press, 2002.
[4] R. Xu and D. C. Wunsch, "Clustering Algorithms in Biomedical Research: A Review," IEEE Reviews in Biomedical Engineering, vol. 3, pp. 120-154, 2010.
[5] G. E. Hinton and R. R. Salakhutdinov, "Reducing the dimensionality of data with neural networks," Science, vol. 313, no. 5786, pp. 504-507, 2006.
[6] T. Lee and D. Mumford, "Hierarchical Bayesian inference in the visual cortex," J. Opt. Soc. Amer., vol. 20, pt. 7, pp. 1434-1448, 2003.
[7] G. E. Hinton, S. Osindero, and Y. Teh, "A fast learning algorithm for deep belief nets," Neural Comput., vol. 18, pp. 1527-1554, 2006.
[8] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet Classification with Deep Convolutional Neural Networks," Advances in Neural Information Processing Systems 25, MIT Press, Cambridge, MA, 2012.
[9] http://www.darpa.mil/IPTO/solicit/baa/BAA09-40_PIP.pdf
[10] G. Carneiro, J. C. Nascimento, and A. Freitas, "The Segmentation of the Left Ventricle of the Heart From Ultrasound Data Using Deep Learning Architectures," IEEE Transactions on Image Processing, vol. 21, no. 3, pp. 968-982, March 2012.
[11] http://www.darpa.mil/IPTO/solicit/baa/BAA09-40_PIP.pdf