
Part 1 Advanced Computational Intelligence and Deep Machine - PowerPoint PPT Presentation



  1. Research Problem Definition, Part 1: Advanced Computational Intelligence and Deep Machine Learning for Early Detection and Diagnosis of Diseases. Tarek Khorshed, PhD Student, American University in Cairo.

  2. Problem Definition: Disease Diagnosis
     • Detecting and diagnosing diseases is a major challenge, with severe negative impacts on global health and the economy. Examples: cancer, H1N1, SARS and tuberculosis.
     • Early detection of diseases and outbreaks is one of the important areas that has not yet exploited the full potential of computer science.
     • The increasing complexity of health information has made it difficult, if not impossible, to use traditional monitoring techniques to detect irregular signs of possible disease threats or outbreaks.
     • The complexity lies in the different interpretations of the data, e.g. geographical location, weather, climate and seasonal changes.

  3. Challenges in Analysis of Biomedical Data
     • Analysis of biomedical gene expression data is extremely challenging given the complexity of biological networks and the high dimensionality of the data.
     • Current clustering techniques rely on preprocessing the data for feature extraction and dimensionality reduction, which affects the accuracy of disease diagnosis [4].
     • The proposed research targets alternative solutions that are capable of processing high-dimensional data to achieve better accuracy.
     Figure: Data preprocessing and feature extraction of molecular gene expression data [4].
     [4] Rui Xu and D. C. Wunsch, "Clustering Algorithms in Biomedical Research: A Review," IEEE Reviews in Biomedical Engineering, vol. 3, pp. 120-154, 2010.

  4. Overview of Gene Expression Data: Gene Expression Data Representation
     • Gene data is represented in a real-valued 2-dimensional matrix.
     • Rows represent patterns of genes. Columns represent profiles of samples.
     Figure: Matrix representation of gene data [6].
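The row and column convention above can be made concrete with a toy matrix. This is only a sketch: the gene names, sample names and values are invented for illustration and do not come from the slides.

```python
# Toy gene expression matrix: rows are genes, columns are samples.
# All names and numbers are hypothetical illustration data.
genes = ["gene_A", "gene_B", "gene_C"]
samples = ["sample_1", "sample_2", "sample_3", "sample_4"]
matrix = [
    [2.1, 2.4, 0.3, 0.2],  # gene_A
    [0.5, 0.4, 1.9, 2.2],  # gene_B
    [1.0, 1.1, 1.0, 0.9],  # gene_C
]


def gene_pattern(i):
    """Expression pattern of gene i across all samples (row i)."""
    return matrix[i]


def sample_profile(j):
    """Expression profile of sample j across all genes (column j)."""
    return [row[j] for row in matrix]


print(genes[0], gene_pattern(0))
print(samples[1], sample_profile(1))
```

Gene-based clustering would compare rows of this matrix, while sample-based clustering would compare columns.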

  5. Challenges in Gene Expression Data: Data Quality
     • Only a small subset of the gene data might influence the disease being monitored.
     • Interesting features of the disease are present only in a subset of the data, which adds further complexity to pattern analysis techniques.
     • The gene expression matrix contains many data anomalies, such as noise and missing values.
     • Preprocessing the gene data is a crucial step before attempting any data analysis tasks for disease diagnosis.
     Preprocessing Tasks
     • Normalizing the data, estimating missing values, and filtering out gene expression data that are not relevant or significant to the biological process being analyzed.
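The three preprocessing tasks listed above could be sketched as follows. The specific choices here are assumptions made for illustration, not the method of the slides: missing values are marked with `None` and filled with the row mean, genes are filtered by a hypothetical variance threshold `min_variance`, and normalization is a per-gene z-score.

```python
import math


def impute_row_mean(row):
    """Replace missing values (None) with the mean of the observed values.
    Assumes each row has at least one observed value."""
    observed = [v for v in row if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in row]


def variance(row):
    """Population variance of one gene's expression pattern."""
    mean = sum(row) / len(row)
    return sum((v - mean) ** 2 for v in row) / len(row)


def zscore(row):
    """Normalize a gene pattern to zero mean and unit variance."""
    mean = sum(row) / len(row)
    sd = math.sqrt(variance(row))
    return [(v - mean) / sd for v in row]


def preprocess(matrix, min_variance=0.01):
    """Impute, filter near-constant genes, then normalize each gene.
    min_variance must be positive so that zscore never divides by zero."""
    imputed = [impute_row_mean(row) for row in matrix]
    kept = [row for row in imputed if variance(row) >= min_variance]
    return [zscore(row) for row in kept]


# Hypothetical data: one missing value and one constant (uninformative) gene.
raw = [
    [1.0, None, 3.0],
    [5.0, 5.0, 5.0],
    [2.0, 4.0, 6.0],
]
cleaned = preprocess(raw)
print(len(cleaned), "genes kept")
```

The constant gene is dropped before normalization because it carries no pattern information and would otherwise cause a division by zero in the z-score.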

  6. Challenges in Gene Expression Data: Features >>> Samples
     • Typical example in cancer classification: the number of features is much larger than the number of samples.
     Table legend:
     Name: short identifier of the data set
     Platform: type of microarray platform
     N: number of samples (e.g. tumor samples: hundreds)
     n: number of features (e.g. gene probes: thousands)
     Source: G. Bontempi, "A Blocking Strategy to Improve Gene Selection for Classification of Gene Expression Data," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 4, no. 2, pp. 293-300, April-June 2007.

  7. Cluster Analysis: Clustering Overview
     • The objective is to group data objects into a set of disjoint classes called clusters.
     • Clustering is a form of unsupervised learning because it does not depend on predefined class labels.
     Classification of Clustering Techniques
     1. Hard Partitional Clustering: attempts to find a K-partition of the data.
     2. Hierarchical Clustering: attempts to build a tree structure of nested partitions.
     3. Fuzzy Clustering: a data object can belong to a cluster with a degree of membership.
     4. Density-Based Clustering: defines core, border and noise points.

  8. Cluster Analysis: 1. Hard Partitional Clustering
     • Attempts to find a K-partition of the data.
     • K-Means Clustering
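The slide names K-means without showing the procedure. A minimal sketch of the standard assign-then-update loop is given below; the function names, the toy points and the choice to pass initial centroids explicitly are assumptions made for illustration.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, initial_centroids, max_iter=100):
    """Plain K-means: alternate assignment and centroid update until
    the assignment stops changing or max_iter is reached."""
    centroids = [list(c) for c in initial_centroids]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(len(centroids)),
                key=lambda k: squared_distance(p, centroids[k]))
            for p in points
        ]
        if new_labels == labels:
            break  # converged
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # keep the old centroid if a cluster is empty
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical 2-D points forming two obvious groups.
points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
labels, centroids = kmeans(points, [points[0], points[2]])
print(labels, centroids)
```

Each point ends up in exactly one cluster, which is what makes the result a hard K-partition.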

  9. Cluster Analysis: 1. Hard Partitional Clustering
     • Attempts to find a K-partition of the data.
     • Mixture-Based Clustering: EM Algorithm for Mixture of Gaussians
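As an illustration of the EM algorithm for a mixture of Gaussians, the sketch below fits a one-dimensional mixture. The restriction to one dimension, the starting parameters, the variance floor `min_var` and the toy data are all assumptions chosen to keep the example short.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def em_gmm_1d(data, means, variances, weights, n_iter=50, min_var=1e-6):
    """EM for a 1-D Gaussian mixture, starting from the given parameters."""
    means, variances, weights = list(means), list(variances), list(weights)
    n_components = len(means)
    resp = []
    for _ in range(n_iter):
        # E-step: responsibility of each component for each data point.
        resp = []
        for x in data:
            joint = [weights[k] * gaussian_pdf(x, means[k], variances[k])
                     for k in range(n_components)]
            total = sum(joint)
            resp.append([j / total for j in joint])
        # M-step: re-estimate parameters from the weighted data.
        for k in range(n_components):
            nk = sum(r[k] for r in resp)
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk
            variances[k] = max(var, min_var)  # floor avoids a collapsed component
            weights[k] = nk / len(data)
    return means, variances, weights, resp


# Hypothetical data drawn around 0 and around 5.
data = [0.0, 0.2, -0.2, 5.0, 5.2, 4.8]
means, variances, weights, resp = em_gmm_1d(
    data, means=[1.0, 4.0], variances=[1.0, 1.0], weights=[0.5, 0.5])
hard_labels = [r.index(max(r)) for r in resp]
print(means, weights, hard_labels)
```

Taking the most responsible component for each point, as in `hard_labels`, turns the soft mixture assignment into a hard partition.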

  10. Cluster Analysis: 2. Hierarchical Clustering
     • Organizes the data set into a hierarchical structure.
     Figure: Dendrogram output for hierarchical clustering.
     1. Agglomerative methods: bottom-up approach in which each element starts in its own cluster and pairs of clusters are then merged together.
     2. Divisive methods: top-down approach in which all elements start in one cluster and are then divided recursively.
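The agglomerative (bottom-up) approach can be sketched as repeated merging of the two closest clusters. Single linkage and Euclidean distance are assumptions here, since the slide does not name a linkage criterion, and the data points are invented.

```python
def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def single_linkage(c1, c2, points):
    """Cluster distance = distance between the closest pair of members."""
    return min(euclidean(points[i], points[j]) for i in c1 for j in c2)


def agglomerative(points, n_clusters):
    """Start with one cluster per point and merge the closest pair
    until n_clusters remain. Returns the clusters and the merge history."""
    clusters = [[i] for i in range(len(points))]
    merges = []  # (members of first cluster, members of second, distance)
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = single_linkage(clusters[a], clusters[b], points)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)]
        clusters.append(merged)
    return clusters, merges


# Hypothetical 1-D data; the merge history is what a dendrogram would draw.
points = [[0.0], [0.1], [5.0], [5.2], [9.0]]
clusters, merges = agglomerative(points, n_clusters=2)
print(clusters)
for left, right, dist in merges:
    print(left, "+", right, "at distance", round(dist, 3))
```

Cutting the merge history at different heights yields different numbers of clusters, which is why no K has to be fixed before the tree is built.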

  11. Gene Expression Data Cluster Analysis: Similarity Measures for Gene Expression Data
     • Proximity describes how we measure the distance or similarity between a pair of data objects.
     1) Distance (Dissimilarity)
     2) Similarity
     Table: Common similarity measures for continuous variables [4].

  12. Similarity Measures for Gene Expression Data: Euclidean Distance
     • Performs well for many clustering applications, but produces poor results when used with gene data [10].
     • For gene data we are more interested in the overall pattern similarity than in the size of each individual attribute.
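A small numeric example may help show why raw Euclidean distance can mislead on gene patterns. The three expression vectors below are made up: two share the same shape at different magnitudes, and the third is flat.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def center(v):
    """Subtract the mean so only the shape of the pattern remains."""
    mean = sum(v) / len(v)
    return [x - mean for x in v]


# Hypothetical expression patterns over four samples.
gene_a = [1.0, 2.0, 3.0, 2.0]
gene_b = [11.0, 12.0, 13.0, 12.0]  # same shape as gene_a, shifted up by 10
gene_c = [2.0, 2.0, 2.0, 2.0]      # flat pattern, similar magnitude to gene_a

d_ab = euclidean(gene_a, gene_b)
d_ac = euclidean(gene_a, gene_c)
d_ab_centered = euclidean(center(gene_a), center(gene_b))

print("a vs b (same shape):", d_ab)
print("a vs c (different shape):", d_ac)
print("a vs b after centering:", d_ab_centered)
```

Raw Euclidean distance places the flat gene closer to `gene_a` than the gene with the identical pattern, which is the opposite of the pattern similarity we care about.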

  13. Similarity Measures for Gene Expression Data: Pearson's correlation coefficient
     • Measures the similarity between the shapes of two gene expression patterns. It is commonly used in clustering gene data and has produced very good results.
     • Does not perform well in the presence of outliers. It can assign a high similarity score to a pair of dissimilar patterns if they have a common peak or valley [10].
     • Jackknife correlation
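The common-peak problem and the jackknife remedy can be sketched as below. The slide names jackknife correlation without defining it, so this sketch assumes one common definition: the minimum of the Pearson correlations obtained on the full data and with each sample left out in turn. The two patterns are invented.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length vectors.
    Assumes neither vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def jackknife_correlation(x, y):
    """Minimum of the full correlation and all leave-one-out correlations
    (an assumed definition; the slides do not give one)."""
    scores = [pearson(x, y)]
    for i in range(len(x)):
        scores.append(pearson(x[:i] + x[i + 1:], y[:i] + y[i + 1:]))
    return min(scores)


# Hypothetical patterns: opposite shapes except for one shared peak.
gene_x = [1.0, 2.0, 1.0, 2.0, 10.0]
gene_y = [2.0, 1.0, 2.0, 1.0, 10.0]

r = pearson(gene_x, gene_y)
j = jackknife_correlation(gene_x, gene_y)
print("Pearson:", round(r, 3))
print("Jackknife:", round(j, 3))
```

The single shared peak dominates the plain Pearson score, while leaving that sample out exposes the underlying disagreement between the two patterns.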

  14. Gene-Based Clustering Algorithms: Evaluation of Hard Partitional Clustering with Gene Expression Data
     Advantages
     • Complexity is O(N K d). Practical for large data since the number of clusters K and the number of dimensions d are typically much smaller than N.
     • Similarity measures can be relatively simple to compute.
     • A variation called Bisecting K-means can enhance the performance of the algorithm [10].
     Disadvantages
     • The correct number of clusters is not known in advance.
     • There is no standard method to define the initial set of clusters.
     • Requires running the algorithm several times with random initial partitions, which is computationally expensive.
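The last disadvantage, repeated runs from random starting points, could look like the sketch below. Picking the run with the lowest within-cluster sum of squared errors (SSE) is an assumption about how the runs are compared; the function names, the seed and the data are hypothetical.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_once(points, k, rng, max_iter=100):
    """One K-means run from k randomly chosen data points as centroids."""
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:
            break  # assignment is stable
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    sse = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
    return sse, labels, centroids


def kmeans_restarts(points, k, n_restarts=10, seed=0):
    """Run K-means n_restarts times and keep the lowest-SSE result.
    Total cost grows linearly with n_restarts."""
    rng = random.Random(seed)
    runs = [kmeans_once(points, k, rng) for _ in range(n_restarts)]
    return min(runs, key=lambda run: run[0])


# Hypothetical 2-D points forming two groups.
points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
best_sse, best_labels, best_centroids = kmeans_restarts(points, k=2)
print(best_sse, best_labels)
```

Each restart repeats the full O(N K d) per-iteration work, which is where the extra computational expense comes from.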

  15. References
     [1] International Agency for Research on Cancer (IARC), World Cancer Report. http://www.iarc.fr
     [2] National Cancer Institute (NCI), http://www.cancer.gov/
     [3] P. Baldi and G. W. Hatfield, "DNA Microarrays and Gene Expression: From Experiments to Data Analysis and Modelling," Cambridge Univ. Press, 2002.
     [4] Rui Xu and D. C. Wunsch, "Clustering Algorithms in Biomedical Research: A Review," IEEE Reviews in Biomedical Engineering, vol. 3, pp. 120-154, 2010.
     [5] G. E. Hinton and R. R. Salakhutdinov, "Reducing the dimensionality of data with neural networks," Science, vol. 313, no. 5786, pp. 504-507, 2006.
     [6] T. Lee and D. Mumford, "Hierarchical Bayesian inference in the visual cortex," J. Opt. Soc. Amer., vol. 20, pt. 7, pp. 1434-1448, 2003.
     [7] G. E. Hinton, S. Osindero, and Y. Teh, "A fast learning algorithm for deep belief nets," Neural Comput., vol. 18, pp. 1527-1554, 2006.
     [8] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet Classification with Deep Convolutional Neural Networks," Advances in Neural Information Processing Systems 25, MIT Press, Cambridge, MA, 2012.
     [9] http://www.darpa.mil/IPTO/solicit/baa/BAA09-40_PIP.pdf
     [10] G. Carneiro, J. C. Nascimento, and A. Freitas, "The Segmentation of the Left Ventricle of the Heart From Ultrasound Data Using Deep Learning Architectures," IEEE Transactions on Image Processing, vol. 21, no. 3, pp. 968-982, March 2012.
     [11] http://www.darpa.mil/IPTO/solicit/baa/BAA09-40_PIP.pdf
