Applications of Pattern Recognition in Computational Biology
Pattern Recognition Course (2110597), Chulalongkorn University
August 22nd, 2017
Instructor: Sira Sriswasdi (สิระ ศรีสวัสดิ์)
89% accuracy vs 73%
Biology + Computation
Image from https://www.systemsbiology.org/about/what-is-systems-biology/
Data From High-Throughput Technology
The Central Dogma: “Information Processing in Biology”
▪ Genome Sequence (e.g. ACCAGCGGCGAAGCTCGGGGCGGAGGGGTTGA ...)
▪ RNA Expression (figure axes: Genes, Conditions)
▪ Protein Interactions
Image from https://www.khanacademy.org/science/biology/gene-expression-central-dogma/transcription-of-dna-into-rna/a/overview-of-transcription
The Omics Era
▪ DNA sequencing, genetic mapping, recombinant DNA
▪ RNA sequencing, RNA expression, transcriptional regulation
▪ DNA methylation, histone modification
▪ Protein identification and quantification, post-translational modification
▪ Metabolites, hormones, and other signaling molecules
Image from http://www.cubocube.com/files/images/opengenetics/chapter11/image2.png
Application I: Evolutionary Genomics
Genome Sequence As Species Signature
[Figure: genome sequence fragments from several species, annotated with sequence similarities of ~98%, ~90%, ~80%, and ~60%]
Image adapted from http://physrev.physiology.org/content/89/3/921
Evolution Of DNA Sequences
[Figure: tree of DNA sequences along a time axis, from ancestral sequences (inferred) to present sequences (observed)]
Image adapted from http://physrev.physiology.org/content/89/3/921 and https://www.quora.com/Arent-all-sequences-homologous
Inferring Evolutionary History (Phylogenetics)
▪ Reconstructs evolutionary events spanning millions of years
▪ Based on the genome sequences of currently existing species
▪ Assumes a model of DNA sequence evolution, e.g. substitution probabilities P(A→T), P(G→T)
▪ Outputs the most likely tree topology and branch lengths
▪ Challenge: extremely large numbers of parameters, very large search spaces, and many candidate models to compare
Image from http://physrev.physiology.org/content/89/3/921
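As a small illustration of what "assuming a model of evolution" buys us, the sketch below turns the observed fraction of mismatched sites between two aligned sequences into an estimated number of substitutions per site. The slide does not name a specific model; the Jukes-Cantor (JC69) model, which assumes every substitution such as A→T or G→T is equally likely, is used here purely as an assumption for illustration. The function names and the two toy sequences are hypothetical.

```python
import math


def p_distance(seq_a: str, seq_b: str) -> float:
    """Fraction of aligned sites at which the two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    mismatches = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return mismatches / len(seq_a)


def jukes_cantor_distance(seq_a: str, seq_b: str) -> float:
    """Estimated substitutions per site under the JC69 model (an assumed model)."""
    p = p_distance(seq_a, seq_b)
    if p >= 0.75:
        # The correction is undefined once sequences look fully randomized.
        return math.inf
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


# Toy aligned sequences (hypothetical, not taken from the slides).
ancestor = "AAGACTT"
descendant = "AAGGCCT"
observed = p_distance(ancestor, descendant)
corrected = jukes_cantor_distance(ancestor, descendant)
print(f"observed difference: {observed:.3f}, JC69 distance: {corrected:.3f}")
```

The corrected distance is larger than the raw mismatch fraction because the model accounts for sites that may have changed more than once. Full phylogenetic inference goes further and searches over tree topologies and branch lengths, which this sketch does not attempt.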
Growing Amount Of Genomic Data
▪ >17,000 bacterial genomes
▪ >350 fungal genomes
▪ >100 insect genomes
▪ >150 plant genomes
▪ >230 animal and fish genomes
▪ >70 invertebrate genomes
[Figure: number of base pairs over time, Dec 1982 to Jul 2016; y-axis from 0 to 2.5E+11 base pairs]
Plotted with data from https://www.ncbi.nlm.nih.gov/genbank/statistics/
Forecasting And Regulating Evolution
▪ Epidemiology
  • Tracking the spread of disease outbreaks
  • Predicting the next outbreaks and preparing vaccines in advance
▪ Biotechnology
  • Genetic engineering and breeding of new strains with desired characteristics and capabilities
▪ Wildlife Conservation
  • Pairing evolutionary history with climate and environmental changes can reveal the factors that drive animal evolution and extinction
Application II: Population Genetics
Tracing Population Structure Over Time
▪ Genetic Inheritance (image adapted from https://wiki.uiowa.edu/display/2360159/Autosomal+Inheritance)
▪ Migration (image adapted from https://www.theodysseyonline.com/why-people-migrate)
Single Nucleotide Polymorphisms (SNPs) As An Individual’s Genetic Signature
▪ 709,358 SNPs, 774,516 individuals
▪ Example SNP profiles, one row per individual:
  A T T C … G
  C G G A … T
  C G G A … G
  G G T C … A
▪ Identity-By-Descent (IBD) Network: connect individuals that share a significant portion of consecutive SNPs
Image from http://uvmgg.wikia.com/wiki/SNP
Han et al. Nature Communications 8, 14238 (2017)
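The idea of connecting individuals that share a long stretch of consecutive SNPs can be sketched in a few lines. This is a deliberately simplified toy, not the IBD detection pipeline of Han et al.: it compares unphased SNP strings position by position and links two individuals when their longest run of identical consecutive SNPs reaches a threshold. The function names, the `min_run` parameter, and the five-SNP profiles (the slide's example rows with the elided middle dropped) are all hypothetical.

```python
from itertools import combinations


def longest_shared_run(snps_a: str, snps_b: str) -> int:
    """Length of the longest stretch of consecutive positions with identical SNPs."""
    best = run = 0
    for a, b in zip(snps_a, snps_b):
        run = run + 1 if a == b else 0
        best = max(best, run)
    return best


def shared_segment_edges(profiles: dict, min_run: int) -> list:
    """Edges of a toy IBD-like network: pairs sharing a run of at least min_run SNPs."""
    edges = []
    for (name_a, snps_a), (name_b, snps_b) in combinations(profiles.items(), 2):
        if longest_shared_run(snps_a, snps_b) >= min_run:
            edges.append((name_a, name_b))
    return edges


# Hypothetical toy profiles, one string of SNP alleles per individual.
profiles = {
    "ind1": "ATTCG",
    "ind2": "CGGAT",
    "ind3": "CGGAG",
    "ind4": "GGTCA",
}
edges = shared_segment_edges(profiles, min_run=4)
print(edges)
```

With hundreds of thousands of individuals, comparing every pair directly as done here would be far too slow; the all-pairs loop is only meant to make the network definition concrete.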
Roots Of North American Population
Han et al. Nature Communications 8, 14238 (2017)
From Population To Personalized Medicine
▪ Social Sciences
  • Tracking the dynamics of populations
  • Understanding ethnic structures
▪ Medicine
  • Identifying common genetic variations within a population that may be associated with drug targets
  • Identifying disease risk factors
Image from http://poshrx.com/23andme-is-back-on/
Application III: Gene Expression Analysis
Measuring Gene Expression
▪ Sequencing machine (image from https://support.illumina.com/sequencing/sequencing_instruments/hiseq-4000.html)
▪ Output: the amount of RNA product of each gene
Image adapted from https://www.khanacademy.org/science/biology/gene-expression-central-dogma/transcription-of-dna-into-rna/a/overview-of-transcription
RNA Sequencing (Counting) Biases
▪ Due to technological limitations, the entire length of an RNA cannot be sequenced at once
  • Full-length RNA has to be fragmented
  • Bias in fragment length
▪ To increase sensitivity, fragmented RNA has to be amplified
  • Bias in signal amplification
▪ Sequencing is directional
  • Bias in head-to-tail read count
[Figure label: Bias correction]
Images adapted from http://bio.lundberg.gu.se/courses/vt13/rnaseq.html/ and https://mikelove.wordpress.com/2016/09/26/rna-seq-fragment-sequence-bias/
Bias Normalization via Regression
▪ Sequencing involves sampling of RNA transcripts
▪ Estimated expression levels of low-, medium-, and high-expression genes are affected differently by the throughput (sequencing depth) of the RNA sequencing experiment
▪ Normalization by regression corrects these biases
[Figure: Before Normalization vs After Normalization; adapted from Bacher et al. Nature Methods 14, 584-586 (2017)]
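A minimal sketch of the regression idea, under stated assumptions: for each gene, fit a straight line of log expression against log sequencing depth across samples, then subtract the fitted depth effect. This is not the method of Bacher et al.; ordinary least squares via `np.polyfit` is swapped in as the simplest possible stand-in, and the function name, variable names, and synthetic data are hypothetical.

```python
import numpy as np


def normalize_by_regression(log_expr, log_depth):
    """Remove a per-gene linear dependence of log expression on log depth.

    log_expr:  array of shape (n_genes, n_samples)
    log_depth: array of shape (n_samples,)
    """
    log_expr = np.asarray(log_expr, dtype=float)
    log_depth = np.asarray(log_depth, dtype=float)
    # np.polyfit fits each column of a 2-D y independently, so pass
    # samples x genes; row 0 of the result holds the per-gene slopes.
    slopes = np.polyfit(log_depth, log_expr.T, deg=1)[0]
    centered_depth = log_depth - log_depth.mean()
    # Subtract only the depth-dependent part, keeping each gene's mean level.
    return log_expr - slopes[:, None] * centered_depth[None, :]


# Synthetic example: three genes whose depth dependence differs
# (weak, medium, strong), measured in four samples of increasing depth.
log_depth = np.log([1e5, 2e5, 4e5, 8e5])
base_level = np.array([1.0, 3.0, 5.0])
true_slopes = np.array([0.2, 0.6, 1.0])
log_expr = base_level[:, None] + true_slopes[:, None] * log_depth[None, :]

normalized = normalize_by_regression(log_expr, log_depth)
print(np.round(normalized, 3))
```

Because the slope is estimated separately for each gene, genes with different depth dependence are each corrected by their own amount, which a single global scaling factor per sample could not do.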
What Can Gene Expression Tell Us?
[Figure: Time Series Analysis; heat map of Down-Regulated Genes and Up-Regulated Genes with a color scale from Low to High]
Adapted from Rund et al. PNAS 108, E421-430
Klings et al. Physiological Genomics 21, 293-298 (2005)
Structure Behind Gene Expression Profiles
▪ Each cluster represents a group of genes with similar functions
▪ Clustering methods shown: Hierarchical, K-means, Self-Organizing Map
D’haeseleer et al. Nature Biotechnology 23, 1499-1501 (2005)
Identifying Disease Subtypes
▪ Gene expression data from >4,000 colorectal cancer patients
▪ Node = Patient
▪ Edge = Similar gene expression
Guinney et al. Nature Medicine 21, 1350-1356 (2015)
Application Of Gene Expression Analysis
To understand the blueprint of biological systems and diseases
Image from http://pathview.r-forge.r-project.org/
A Break From Biology: Basic Clustering Techniques
An Illustration Of K-Means Clustering
▪ Randomly select K=3 centroids
▪ Assign points to the nearest centroid
▪ Update centroids
▪ Update point assignments
▪ Update centroids
From https://www.naftaliharris.com/blog/visualizing-k-means-clustering/
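The alternating steps illustrated on this slide can be written as a short NumPy sketch. It is a minimal version for illustration rather than a production implementation: the function name, the `n_iter` and `seed` parameters, and the toy points are hypothetical, and initial centroids are drawn from the data points themselves.

```python
import numpy as np


def kmeans(points, k, n_iter=100, seed=0):
    """Plain k-means: alternate point assignment and centroid update."""
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    # Step 1: randomly select k distinct data points as initial centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each point to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels


# Two well-separated toy groups (hypothetical data).
toy_points = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]])
centroids, labels = kmeans(toy_points, k=2)
print(labels)
print(centroids)
```

Which group receives label 0 and which receives label 1 depends on the random initialization, so only the grouping itself is meaningful.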
Characteristics Of K-Means Clustering
▪ The number of clusters, K, is specified in advance.
▪ Euclidean distance
  • Assigning each point x to its nearest centroid m minimizes the sum of squares, $\|x - m\|^2$.
▪ Always converges to a (local) minimum.
  • Poor starting centroid locations can lead to poor local minima.
▪ The model has several implicit assumptions:
  • Data points scatter around their cluster's center.
  • The boundary between adjacent clusters is always halfway between the cluster centroids.
Image from https://en.wikipedia.org/wiki/K-means_clustering
Effect Of Poor Initial Centroid Locations
From https://www.naftaliharris.com/blog/visualizing-k-means-clustering/
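To make the effect concrete, the sketch below runs the same assign-and-update iterations from two hand-picked starting configurations on a tiny data set with three obvious groups. The data, the two initializations, and the function name are hypothetical and chosen only so that the outcome is easy to follow; in practice one would typically rerun from several random starts and keep the result with the lowest sum of squares.

```python
import numpy as np


def lloyd_from(points, init_centroids, n_iter=50):
    """Run k-means iterations from given initial centroids; return centroids and sum of squares."""
    points = np.asarray(points, dtype=float)
    centroids = np.array(init_centroids, dtype=float)  # copy so the input is untouched
    for _ in range(n_iter):
        sq_dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = sq_dists.argmin(axis=1)
        for j in range(len(centroids)):
            if np.any(labels == j):
                centroids[j] = points[labels == j].mean(axis=0)
    sq_dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return centroids, float(sq_dists.min(axis=1).sum())


# Three pairs of points at x = 0, 10, and 20 (hypothetical toy data).
pts = [[0, 0], [0, 1], [10, 0], [10, 1], [20, 0], [20, 1]]

# One starting centroid per group.
_, sse_good = lloyd_from(pts, [[0, 0.5], [10, 0.5], [20, 0.5]])

# Two starting centroids inside the left group, one between the other two groups.
_, sse_poor = lloyd_from(pts, [[0, 0], [0, 1], [15, 0.5]])

print(f"sum of squares, good start: {sse_good}")
print(f"sum of squares, poor start: {sse_poor}")
```

From the poor start, the two left centroids each keep a single point and the third centroid stays between the middle and right groups, so the iterations stop at a stable but much worse solution.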
Distance Functions
▪ Properties of a distance function
  • $d(x, y) \ge 0$ (non-negativity)
  • $d(x, y) = 0 \iff x = y$ (identity)
  • $d(x, y) = d(y, x)$ (symmetry)
  • $d(x, z) \le d(x, y) + d(y, z)$ (triangle inequality)
▪ Examples of distance functions
  • Euclidean distance: $d(x, y) = \sqrt{\sum_i (x_i - y_i)^2}$
  • Squared Euclidean distance: $d(x, y) = \sum_i (x_i - y_i)^2$
  • Manhattan distance: $d(x, y) = \sum_i |x_i - y_i|$
  • Maximum distance: $d(x, y) = \max_i |x_i - y_i|$
▪ p-norm: $\|x\|_p = \left(\sum_i |x_i|^p\right)^{1/p}$
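The distance functions listed on this slide translate directly into NumPy. The sketch below is illustrative only; the function names and the example vectors are hypothetical. It also checks the triangle inequality on three points on a line, which shows that squared Euclidean distance, although widely used in clustering, does not satisfy that property.

```python
import numpy as np


def euclidean(x, y):
    return float(np.sqrt(np.sum((x - y) ** 2)))


def squared_euclidean(x, y):
    return float(np.sum((x - y) ** 2))


def manhattan(x, y):
    return float(np.sum(np.abs(x - y)))


def maximum(x, y):
    return float(np.max(np.abs(x - y)))


def p_norm(x, p):
    return float(np.sum(np.abs(x) ** p) ** (1.0 / p))


def satisfies_triangle(dist, x, y, z):
    """Check d(x, z) <= d(x, y) + d(y, z) for one triple of points."""
    return dist(x, z) <= dist(x, y) + dist(y, z)


# Hypothetical example vectors.
a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])
print(euclidean(a, b), squared_euclidean(a, b), manhattan(a, b), maximum(a, b))

# Euclidean and Manhattan distances are the p-norms of the difference for p = 2 and p = 1.
print(p_norm(b - a, 2), p_norm(b - a, 1))

# Three points on a line: 0, 1, 2.
x, y, z = np.array([0.0]), np.array([1.0]), np.array([2.0])
print(satisfies_triangle(euclidean, x, y, z))          # 2 <= 1 + 1
print(satisfies_triangle(squared_euclidean, x, y, z))  # 4 <= 1 + 1 fails
```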