Machine Learning Applications to Omics Data Kelly Ruggles April 9, 2018
Diversity of Omics in Biomedicine • Genome: long-term information storage • Transcriptome: retrieval of information • Proteome: short-term information storage • Interactome: execution • Metabolome, Lipidome: state • Data types: Proteomics, Phosphoproteomics, Mutation calls, Copy Number, Gene Expression, DNA methylation/Epigenetics, MicroRNA, RPPA, Clinical Data
Understanding Gene Regulation and Epigenetics • ChIP-Seq o Chromatin is immunoprecipitated and the recovered DNA is sequenced o Identifies binding sites of DNA-associated proteins • DNase-Seq/FAIRE-Seq o Identifies DNaseI hypersensitive sites (open chromatin = active genes) • Hi-C/5C o DNA is crosslinked and sequenced o Reveals the spatial organization of chromatin (promoter/enhancer regions) • Bisulfite Sequencing (WGBS, RRBS) o Reads methylation status at the genome level
Assessing Copy Number and Mutation Status by Genome Sequencing Workflow: Sample → Genomic DNA Isolation → Library Preparation → Load on Flow Cell → Next Generation Sequencing → Sequence Alignment • Copy Number Variation (CNV) o Changes in the genome due to duplication or deletion of large regions of DNA o Have been found to act as “drivers” of tumor progression • Single Nucleotide Polymorphisms (SNPs) o Single base-pair sites that vary in a population (figure: a T/C SNP)
Assessing Copy Number and Mutation Status by Genome Sequencing Workflow: Sample → RNA Isolation → Library Preparation → Load on Flow Cell → Next Generation Sequencing → Sequence Alignment • Gene Expression o Normalized expression of genes in all samples o Can be used for differential expression analysis • Alternative Splicing o Splicing of exons, creating new protein isoforms o Alternative splicing changes are frequently found in cancer o Loss of functional domains may also be a disease driver
Protein Identification and Quantitation by Mass Spectrometry Workflow: Sample → Lysis → Digestion → Fractionation → Peptides → Tandem Mass Spectrometry (spectrum axes: intensity for quantity, m/z for identity) • Reverse Phase Protein Array: • Discovery Proteomics: o Used to measure global protein expression (whole cell proteome) o Can enrich for phosphopeptides to measure phosphorylation status
Publicly Available Omics Datasets • The Cancer Genome Atlas (TCGA) o Collaboration between the National Cancer Institute and the National Human Genome Research Institute o Generated comprehensive genomic maps of 33 tumor types o A subset of these tumors was characterized at the proteome level • ENCODE o International collaboration funded by the National Human Genome Research Institute o Goal is to build a comprehensive parts list of functional elements in the human genome
ML Applications in Omics Sequence Element Annotation Libbrecht MW. Nat Rev Genet. 2015 Jun; 16(6): 321–332.
“Learning” Transcription Start Sites (TSSs) • Knowing the exact position of the 5’ TSS of an RNA is crucial for finding the regulatory regions that flank it • Traditionally, one finds where the 5’ cap structure maps onto the RNA o Cap analysis of gene expression (CAGE) o Oligo-capping o Robust analysis of 5’ transcript ends (5’ RATE) • Complexity surrounding the TSSs o Non-coding RNA function o Regulatory regions around the TSS o Effect of repetitive elements Kapranov, 2009
“Learning” Transcription Start Sites (TSSs) • Identify an algorithm • Provide a large collection of TSS sequences and a list of non-TSS sequences • Give novel sequences to the model, which predicts TSS or non-TSS for each sequence • If you can compile a list of sequence elements of a given type, you can probably train a machine learning method to recognize those elements Libbrecht and Noble, 2015
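The train-then-predict recipe on this slide can be sketched in a few lines. This is a minimal illustration only, not the method of any cited paper: the classifier (multinomial naive Bayes over 3-mer counts), the planted TATAAA motif, the sequence length, and every function name are assumptions chosen for the example, and the sequences are synthetic.

```python
import math
import random
from collections import Counter
from itertools import product

K = 3  # k-mer length; an illustrative choice, not taken from the slides
KMERS = ["".join(p) for p in product("ACGT", repeat=K)]

def kmer_counts(seq):
    """Count overlapping k-mers in a DNA sequence."""
    return Counter(seq[i:i + K] for i in range(len(seq) - K + 1))

def train(seqs_by_label):
    """Fit multinomial naive Bayes: a log prior and log k-mer probabilities per class."""
    n_total = sum(len(seqs) for seqs in seqs_by_label.values())
    model = {}
    for label, seqs in seqs_by_label.items():
        counts = Counter()
        for seq in seqs:
            counts.update(kmer_counts(seq))
        denom = sum(counts[k] for k in KMERS) + len(KMERS)  # add-one smoothing
        log_probs = {k: math.log((counts[k] + 1) / denom) for k in KMERS}
        model[label] = (math.log(len(seqs) / n_total), log_probs)
    return model

def predict(model, seq):
    """Return the label with the highest posterior log score for seq."""
    counts = kmer_counts(seq)

    def score(label):
        log_prior, log_probs = model[label]
        return log_prior + sum(n * log_probs[k] for k, n in counts.items() if k in log_probs)

    return max(model, key=score)

# Synthetic data (assumption): "TSS" sequences are random DNA with a TATAAA
# motif planted at a random position; "non-TSS" sequences are purely random.
def random_seq(rng, n=40):
    return "".join(rng.choice("ACGT") for _ in range(n))

def toy_tss(rng, n=40, motif="TATAAA"):
    seq = random_seq(rng, n)
    pos = rng.randrange(n - len(motif) + 1)
    return seq[:pos] + motif + seq[pos + len(motif):]

rng = random.Random(0)
train_set = {
    "TSS": [toy_tss(rng) for _ in range(300)],
    "non-TSS": [random_seq(rng) for _ in range(300)],
}
model = train(train_set)

# Give novel sequences to the model and score its predictions.
test_set = [(toy_tss(rng), "TSS") for _ in range(100)] + \
           [(random_seq(rng), "non-TSS") for _ in range(100)]
accuracy = sum(predict(model, s) == y for s, y in test_set) / len(test_set)
print(f"held-out accuracy: {accuracy:.2f}")
```

Real TSS predictors use far richer features and training sets; the point here is only the workflow: labelled positives and negatives go in, and a model that labels unseen sequences comes out.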
• Enhancers: distal regulatory elements with roles in the regulation of gene expression • They lack common sequence features and are far from their target genes, which makes them difficult to identify • Used ENCODE DNaseI hypersensitivity and ChIP-Seq data and applied a random forest model to predict enhancers • Identified 3 histone modifications (H3K4me1, H3K4me3, H3K27ac) that were the most informative and robust across cell types • Trained on p300 ENCODE data from human embryonic stem cells and predicted in 12 ENCODE cell types Rajagopal 2013
Annotating Genomes • To be useful, genomes must be annotated • Genome annotation: o Identifying the location and function of protein-coding genes o Understanding cis-regulatory sequences o Alternative splicing o Identifying promoters and enhancers (figure: gene structure showing introns and exons)
Annotating Genomes • Can use gene-finding algorithms to predict locations and intron/exon structure of all protein-coding genes on a chromosome Libbrecht and Nobel, 2015
Annotating Genomes Supervised Approach • Labelled DNA sequences with start/end of gene, splice sites • Model learns the properties of genes • DNA sequence patterns • Donor/acceptor splice sites • Length/distribution of UTRs Libbrecht and Nobel, 2015
Annotating Genomes Unsupervised Approach • Given a collection of epigenomic data sets (ENCODE), we want to identify patterns of chromatin accessibility, histone modification, and TF binding • We want to know which labels best provide an overview of the functional activities of the genome • Use unlabeled data and input the desired number of labels • The model will partition the genome and assign a label to each segment • Allows for the identification of novel genomic elements Libbrecht and Noble, 2015
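A toy version of "input the desired number of labels, then partition the genome" might look like the sketch below. It uses plain k-means on per-bin signal vectors, which is a deliberately simpler stand-in and not the approach of any specific published tool. The two signal tracks, the bin layout, and all function names are assumptions made for illustration.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two signal vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def farthest_first(points, k):
    """Deterministic start: take the first bin, then repeatedly add the bin
    farthest from all centers chosen so far."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    return centers

def kmeans_labels(points, k, n_iter=20):
    """Assign each genomic bin one of k labels using k-means."""
    centers = farthest_first(points, k)
    labels = []
    for _ in range(n_iter):
        labels = [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels

def to_segments(labels):
    """Merge runs of adjacent bins with the same label into (start, end, label)."""
    segments, start = [], 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            segments.append((start, i, labels[start]))
            start = i
    return segments

# Synthetic tracks (assumption): each bin has two signals,
# (chromatin accessibility, histone mark), drawn around three toy "states".
rng = random.Random(0)

def noisy(mean):
    return tuple(m + rng.gauss(0.0, 0.2) for m in mean)

bins = ([noisy((0.0, 0.0)) for _ in range(10)] +   # quiet region
        [noisy((5.0, 0.0)) for _ in range(10)] +   # accessible region
        [noisy((0.0, 5.0)) for _ in range(10)])    # histone-marked region

labels = kmeans_labels(bins, k=3)  # k is the user-supplied number of labels
segments = to_segments(labels)
print(segments)
```

The labels that come out are arbitrary integers; giving them biological meaning (promoter, enhancer, quiescent) is a separate interpretation step done after clustering.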
• Unsupervised training on 1% of the human genome using ENCODE data (ChIP-Seq, DNase-seq, FAIRE-seq) • Fixed the number of labels at 25 to keep them interpretable • They used a method (“Segway”) based on dynamic Bayesian networks to segment and cluster the data • Assigned functional categories to groups of segment labels based on features • Identifies protein-coding genes, transcription factor binding, chromatin states, etc. Nature Methods, 2012
ML Applications in Genomics and Proteomics Expression-based input Libbrecht MW. Nat Rev Genet. 2015 Jun; 16(6): 321–332.
Modeling and ‘Omics • Input can also be expression matrices • RNA-seq • DNAse-seq • ChIP-seq • Microarray • Proteomics etc. • Can be used to distinguish between disease phenotypes and/or to identify potentially valuable disease biomarkers Ruggles et al., (2017) MCP
Curse of Dimensionality (‘Large p, small n’) • Often leads to results with poor biological interpretability • Reliability of models decreases with each added dimension • Analysis of single and integrative omics data is prone to high rates of false positives due to chance • Requires corrections for multiple hypothesis testing or dimensionality reduction • Can lose key mechanistic information Alyass, 2015
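One widely used correction for multiple hypothesis testing is the Benjamini-Hochberg procedure; the slide does not name a specific method, so treat the choice here as an assumption. The sketch below computes BH-adjusted p-values and then applies them to 1,000 simulated null features (uniform random p-values, also an assumption) to show how many chance "hits" an uncorrected 0.05 cutoff lets through.

```python
import random

def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Small worked input: four tests.
example = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])

# 'Large p' simulation: 1,000 features with no real effect at all.
rng = random.Random(0)
null_p = [rng.random() for _ in range(1000)]
raw_hits = sum(p < 0.05 for p in null_p)
bh_hits = sum(q < 0.05 for q in benjamini_hochberg(null_p))
print("uncorrected hits:", raw_hits, "| BH-corrected hits:", bh_hits)
```

For the four-test input the adjusted values work out to 0.02, 0.04, 0.04 and 0.02. Adjusted values are never smaller than the raw p-values, so the corrected hit count can only go down.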
Personalized Medicine • Personalized medicine: algorithm that optimizes treatment to maximize efficacy and minimize risk based on genetic make-up • Patient populations show high inter-individual variability in drug response and toxicity • Genetic factors account for 15-30% of drug metabolism differences • Ability to identify gene biomarkers corresponding to a therapeutic effect
Imprecise Medicine • The top 10 grossing drugs in the US help between 1 in 25 and 1 in 4 of the people who take them • Some drugs are harmful to specific ethnic groups because of the bias toward Western participants in clinical trials • Classical clinical trials do not take into account genetic and environmental factors that affect how a person responds to treatment Schork, 2015
Personalized Medicine Continuum • Spans the full spectrum of healthcare: o Identifying those at greatest risk of developing a disease o Identifying prognostic, predictive and drug response markers o Developing new therapies based on biomarkers Bernstam et al., 2013
Use of ‘Omics in Personalized Medicine • Lag in personalized medicine is due, in part, to the gap between our ability to generate omics data and our ability to integrate/interpret it • NGS means we can quickly and cheaply generate data • ‘Omics data can be translated into subject-specific care based on each subject’s disease network • However, our ability to determine molecular mechanisms based on this data is limited Alyass, 2015
Barriers of ‘Omics • To complete this complex data integration, expertise in many disciplines is required: o Biological mechanisms o Medicine o Informatics and statistics • Barriers between these disciplines still exist • 90% of scientists are self-taught in software development and lack best practices such as: o Task automation o Code review o Version control
Workflow: Omics Input → Feature Selection → Model Training → Predictive Model → Diagnosis, Prognosis, Drug Response, Drug Toxicity
• Used RNA-Seq data from The Cancer Genome Atlas (TCGA) o 31 tumor types o 9,096 samples o 75% training, 25% testing • Goal: Identify a set of genes that can distinguish tumor types • Identified 20 genes that could classify >90% of the samples • Used a GA/KNN method o Genetic algorithm (GA) for gene feature selection o K nearest neighbors (KNN) as the classification tool Li et al., 2017
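The classification half of a GA/KNN pipeline can be sketched with a hand-written k-nearest-neighbors vote and a 75%/25% split. This is an illustration under stated assumptions, not a reproduction of Li et al.: the expression values are synthetic, there are only three toy tumor types, the genetic algorithm is not implemented, and `selected_genes` is a hypothetical stand-in for whatever gene subset a feature-selection step would return.

```python
import random
from collections import Counter

def knn_predict(train_X, train_y, x, k=3):
    """Label x by majority vote among its k nearest training samples."""
    order = sorted(range(len(train_X)),
                   key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

def make_toy_expression(rng, n_per_type=40, n_genes=20):
    """Synthetic profiles (assumption): tumor type t overexpresses gene t;
    all other genes are standard normal noise."""
    X, y = [], []
    for tumor_type in range(3):
        for _ in range(n_per_type):
            profile = [rng.gauss(0.0, 1.0) for _ in range(n_genes)]
            profile[tumor_type] += 6.0
            X.append(profile)
            y.append(tumor_type)
    return X, y

rng = random.Random(0)
X, y = make_toy_expression(rng)

# Hypothetical output of feature selection; the GA itself is not implemented here.
selected_genes = [0, 1, 2]
X_sel = [[row[g] for g in selected_genes] for row in X]

# 75% training, 25% testing, as on the slide.
indices = list(range(len(X_sel)))
rng.shuffle(indices)
cut = int(0.75 * len(indices))
train_idx, test_idx = indices[:cut], indices[cut:]
train_X = [X_sel[i] for i in train_idx]
train_y = [y[i] for i in train_idx]

correct = sum(knn_predict(train_X, train_y, X_sel[i]) == y[i] for i in test_idx)
accuracy = correct / len(test_idx)
print(f"test accuracy on {len(test_idx)} held-out samples: {accuracy:.2f}")
```

In a full GA/KNN setup, the genetic algorithm would propose many candidate gene subsets and use KNN accuracy as the fitness score that decides which subsets survive.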