
Biomedical Data I Kelly Ruggles, PhD Methods in Quantitative Biology



  1. Biomedical Data I Kelly Ruggles, PhD Methods in Quantitative Biology

  2. Biomedical Data Types Next Generation Sequencing Mass Spectrometry Clinical Imaging

  3. Biomedical Data Types Molecular Data Next Generation Sequencing Mass Spectrometry Clinical Imaging

  4. Diversity of Omics in Biomedicine • Genome: long term information storage • Transcriptome: retrieval of information • Proteome: short term information storage • Interactome: signaling networks • Metabolome, Lipidome: state • Data types: Proteomics, Phosphoproteomics, Mutation calls, Copy Number, Gene Expression, DNA methylation/Epigenetics, MicroRNA, Metabolomics, Phenotype Data

  5. MultiOmics Ellis et al., Cancer Discovery 2013

  6. Next Generation Sequencing • Based on the dideoxy method but incorporates additional innovations: Construct DNA library using PCR amplification of all DNA fragments in the genome. Each amplified DNA fragment is attached to a solid support. Each cluster contains about 1000 identical copies of a small piece of the genome. Each nucleotide is attached to a removable fluorescent molecule and a chain terminating chemical adduct (instead of a 3’ OH group). (Figure labels: Complementary nucleotide covalently incorporates and a picture is taken; Enzymatically wash away label and 3’-OH blocking group)

  7. Next Generation Sequencing (NGS) • Since 2005 the data output of NGS has more than doubled each year, and the $1000 genome makes genomics-integrated personalized medicine a real possibility (Figure labels: Illumina and NGS; Sanger; NGS)

  8. Paired-end Sequencing • Sequencing both ends of the DNA fragments in the sequence library • Aligning forward and reverse reads as read pairs • Produces double the reads for the same time/effort • Alignment of read pairs is more accurate

  9. Multiplexing • Allows libraries to be pooled and sequenced simultaneously during a single sequencing run • Unique index sequences are added to each DNA fragment during library preparation so they can be identified during data analysis
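
The index-based identification described in the multiplexing slide can be sketched in a few lines of Python. This is a minimal illustration, not how any particular sequencer or vendor tool does it: the sample sheet, the index sequences, the 6-base index length, and the assumption that the index sits at the start of each read with exact matching are all hypothetical choices made for the example.

```python
from collections import defaultdict

# Hypothetical sample sheet: index (barcode) sequence -> library name.
SAMPLE_INDEXES = {
    "ACGTAC": "library_A",
    "TGCATG": "library_B",
}


def demultiplex(reads, sample_indexes, index_length=6):
    """Group pooled reads by the index found at the start of each read.

    Assumes, for illustration only, that the index is the first
    `index_length` bases of the read and must match exactly.
    Reads with an unknown index go to "undetermined".
    """
    by_library = defaultdict(list)
    for read in reads:
        index, insert = read[:index_length], read[index_length:]
        library = sample_indexes.get(index, "undetermined")
        by_library[library].append(insert)
    return dict(by_library)


# Made-up pooled reads from a single run.
pooled_reads = ["ACGTACGGGTTT", "TGCATGAAACCC", "ACGTACTTTAAA", "NNNNNNGGGCCC"]
demuxed = demultiplex(pooled_reads, SAMPLE_INDEXES)
```

Real demultiplexing usually tolerates a small number of index mismatches; exact matching is used here only to keep the sketch short.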

  10. NGS Instrumentation • Illumina HiSeq 2500: Whole genomes, exomes, targeted sequencing; RNA-Seq; ChIP-Seq; Assay for transposase-accessible chromatin with high-throughput seq (ATAC-Seq) • Illumina HiSeq 4000: Whole genomes, exomes, targeted sequencing; RNA-Seq; Chromatin Immunoprecipitation sequencing (ChIP-Seq); High output but not as flexible as HiSeq 2500 • Illumina MiSeq: Metagenomics; 16S Ribosomal sequencing; Bacterial and viral genomes; Higher cost per Mb compared to HiSeq but fastest run times and longest Illumina read lengths • Oxford Nanopore MinION: Long reads sequencing; VERY small (USB device), low cost and extremely long reads but very high error rates

  11. Oxford Nanopore MinION • Nanopore sequences fragments of any length • Streams data in real time https://nanoporetech.com/applications/dna-nanopore-sequencing

  12. NGS Methods • Whole Genome Sequencing • Exome Sequencing (2% of the genome) • De novo sequencing (no reference sequence available) • Short insert paired end (high coverage fills in gaps) • Long-insert mate pair sequencing • RNA-Seq • Methylation Sequencing • ChIP Sequencing • Ribosome profiling

  13. Publicly Available ‘Omics Datasets • IGSR: The International Genome Sample Resource: Started as the “1000 genomes project”; Cataloging human genetic variants across 2,000+ samples globally • The Cancer Genome Atlas: Generated comprehensive genomic maps of 33 tumor types; Includes copy number, RNA-seq, methylation, miRNA, mutation calls, etc. • LINCS: Library of Integrated Cellular Signatures: Catalogs how cells respond to 16,000+ genetic and environmental stressors; Gene expression, proteomics, phosphorylation, etc. • Encyclopedia of DNA Elements: Goal is to build a comprehensive parts list of functional elements in the human genome; Includes ChIP-seq, RNA-seq, Hi-C, 5C, DNase-seq, ATAC-seq, methylation, etc.

  14. Understanding Gene Regulation and Epigenetics • ChIP-Seq: o Chromatin is immunoprecipitated and the recovered DNA is sequenced o Identifies binding sites of DNA-associated proteins • DNase-Seq/FAIRE-Seq: o Identifies DNaseI hypersensitive sites (open chromatin = active genes) • Hi-C/5C: o DNA crosslinked and sequenced o Spatial organization of chromatin (promoter/enhancer regions) • Bisulfite Sequencing (WGBS, RRBS): o Reads methylation status at the genome level

  15. Assessing Copy Number and Mutation Status by Genome Sequencing • Workflow: Tumor Sample → Genomic DNA Isolation → Library Preparation → Load on Flow Cell → Next Generation Sequencing → Sequence Alignment • Copy Number Variation (CNV): o Changes in the genome due to duplication or deletion of large regions of DNA o Have been found to act as “drivers” of tumor progression • Single Nucleotide Polymorphisms (SNPs): o Single base-pair sites that vary in a population (Figure: a SNP, T vs. C)

  16. SNPs and Disease • Mendelian (monogenic) vs. multigenic • For multigenic studies, DNA is collected from a large number of people with the disease and compared to DNA from people who do not have the disease. • Associated SNPs indicate that a nearby allele is likely responsible for the increased risk (based on DNA linkage): Genome Wide Association Study!

  17. Common Disease Common Variant Hypothesis • Predicts that common disease-causing alleles (variants) will be found in all human populations which manifest a given disease.

  18. Genome-Wide Association Studies (GWAS) • Measures and analyzes DNA sequence variations across the genome to identify genetic risk factors for common diseases • Has also been used to identify genetic associations with drug metabolism • SNP arrays (1+ million SNPs) are run on each person • Typically a case-control design • Allele counts for each SNP are evaluated and a chi-squared test is used to identify variants associated with the trait
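
As a rough illustration of the per-SNP test mentioned in the GWAS slide, the sketch below runs a Pearson chi-squared test on a 2x2 table of allele counts (cases vs. controls). The function name and the counts are made up for the example, and no continuity correction or multiple-testing adjustment is applied, both of which a real study would need to consider.

```python
import math


def allelic_chi_squared(case_counts, control_counts):
    """Pearson chi-squared test on a 2x2 table of allele counts.

    case_counts and control_counts are (allele_1, allele_2) tuples.
    Returns (chi2, p_value) for 1 degree of freedom.
    Assumes every expected cell count is non-zero.
    """
    table = [list(case_counts), list(control_counts)]
    row_totals = [sum(row) for row in table]
    col_totals = [table[0][j] + table[1][j] for j in range(2)]
    total = sum(row_totals)

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected

    # With 1 degree of freedom the upper-tail probability of the
    # chi-squared distribution equals erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Made-up allele counts for one SNP: cases carry allele 1 more often.
chi2, p_value = allelic_chi_squared((60, 40), (40, 60))
```

In a genome-wide setting this test is repeated for every SNP on the array, so the significance threshold has to be far stricter than the usual 0.05.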

  19. SNP arrays • An array of 25 bp oligonucleotide sequences is laid across the chip surface • The sample’s DNA is amplified and a marker is attached • DNA is hybridized to the array • The array is scanned to quantify the relative amount bound

  20. GWAS Example: APOE epsilon 4 and Alzheimer’s • APOE is an apolipoprotein • Essential for the normal catabolism of TG-rich lipoproteins • Mediates cholesterol metabolism • Transports cholesterol to neurons • Original study measured 502,627 SNPs in over 1000 AD cases/controls • Found APOE locus (SNP rs4420638 14 kb distal to APOE) as having association with late onset AD Coon K et al. (2007) J Clin Psychiatry 68(4):613-8

  21. VCF File Format • Meta-information lines (prefixed with ##), followed by a header line (prefixed with #) • Columns: 1. Chromosome 2. Position 3. ID (ex: dbSNP) 4. Reference base 5. Alternative allele 6. Quality score 7. Filter (PASS=passed filters) 8. Info (ex: SOMATIC, VALIDATED..)
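
To make the eight columns concrete, here is a small sketch of reading VCF data lines with plain Python. The record type, function name, and example lines are hypothetical and invented for illustration; dedicated VCF libraries exist and handle many details (multiple alternative alleles, sample columns, typed INFO fields) that this sketch ignores.

```python
from typing import NamedTuple, Optional


class VcfRecord(NamedTuple):
    chrom: str
    pos: int
    variant_id: str
    ref: str
    alt: str
    qual: Optional[float]
    filter_status: str
    info: dict


def parse_vcf_lines(lines):
    """Parse the 8 fixed VCF columns, skipping '##' meta and '#' header lines."""
    records = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        chrom, pos, variant_id, ref, alt, qual, filter_status, info = line.split("\t")[:8]
        info_fields = {}
        if info != ".":
            for entry in info.split(";"):
                key, _, value = entry.partition("=")
                # Flags such as SOMATIC carry no value; store them as True.
                info_fields[key] = value if value else True
        records.append(VcfRecord(
            chrom, int(pos), variant_id, ref, alt,
            None if qual == "." else float(qual),
            filter_status, info_fields,
        ))
    return records


# Made-up example lines; positions and values are not real variants.
example_lines = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t12345\t.\tT\tC\t50\tPASS\tSOMATIC;DP=30",
    "chr2\t67890\t.\tG\tA\t.\tLowQual\t.",
]
records = parse_vcf_lines(example_lines)
passing = [r for r in records if r.filter_status == "PASS"]
```

Filtering on the FILTER column, as in the last line, is the simplest way to keep only variants that passed the caller's filters.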

  22. SNP Databases • dbSNP: full collection of all SNPs identified • COSMIC: Database of somatic mutations in human cancer • openSNP: allows you to upload your SNP data (and make it publicly available!). Can be downloaded by researchers. • IGSR: Started as the 1000 genomes project, now contains data from over 3K individuals from around the world • GO exome: SNP database from lung, heart and blood disorder patients

  23. Methods involved in SNP calling • Variant Calling • General Steps: Align to genome reference; Alignment recalibration; Raw variant calling; Quality assignment; Variant filtering • Pipelines: VarScan; Pindel; Somatic Sniper; Radia; Muse; MuTect; Indelocator Ellrott et al., 2018

  24. Assessing Copy Number and Mutation Status by Genome Sequencing • Workflow: Tumor Sample → RNA Isolation → Library Preparation → Load on Flow Cell → Next Generation Sequencing → Sequence Alignment • Gene Expression: o Normalized expression of genes in all samples o Can be used for differential expression analysis • Alternative Splicing: o Splicing of exons, creating new protein isoforms o Alternative splicing changes are frequently found in cancer o Loss of functional domains may also be a disease driver
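
The slide does not say which normalization is meant by "normalized expression". As one simple possibility, the sketch below computes counts per million (CPM), which rescales each gene's read count by the sample's total so that samples sequenced to different depths can be compared. The gene names and counts are made up, and other schemes (for example ones that also account for gene length) are common in practice.

```python
def counts_per_million(counts):
    """Scale raw read counts so that they sum to one million per sample.

    `counts` maps gene name -> raw read count for a single sample.
    """
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample has no reads")
    return {gene: count * 1_000_000 / total for gene, count in counts.items()}


# Made-up raw counts for one sample.
sample_counts = {"GENE_A": 500, "GENE_B": 1500, "GENE_C": 8000}
cpm = counts_per_million(sample_counts)
```

Once every sample is on the same scale, per-gene values can be compared between conditions as a first step toward differential expression analysis.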

  25. Microarrays: Studying the expression of groups of genes • Allows one to measure the expression of thousands of genes at a time • Has been used to compare patterns of gene expression in different tissues, different times, different conditions. • Steps: Isolate mRNA from a tissue sample. Make cDNA by reverse transcription, using fluorescently labeled nucleotides, giving labeled cDNA molecules (single strands). Apply the cDNA mixture to a microarray (DNA fragments representing specific genes, a different gene in each spot). The cDNA hybridizes with any complementary DNA on the microarray. Rinse off excess cDNA; scan microarray for fluorescence. Each fluorescent spot represents a gene expressed in the tissue sample. (Figure: DNA microarray with 2,400 human genes)

  26. RNA-Seq

  27. RNA-Seq • Uses NGS to reveal the presence and quantity of RNA in a sample at a specific point in time • Alternative splicing, post-transcriptional modifications, gene fusions, SNPs/mutations • Gene annotation • Coding and non-coding RNA

  28. Library Construction

  29. RNA-Seq Methods • RNAs are converted into a cDNA fragment library • Sequence adapters (blue) are added to cDNA fragments • Short sequence reads from each cDNA are obtained • Reads are aligned to the reference sequence and classified as exonic reads, junction reads or poly(A) end-reads • Used to generate a base-resolution expression profile for each gene Wang et al, 2009
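
The read classification step can be illustrated with a toy example. The sketch below labels an aligned read as exonic or junction based on its aligned blocks; the coordinate convention (half-open intervals), the exon positions, the "other" label, and the function name are assumptions made for illustration, and poly(A) end-reads are left out to keep it short.

```python
def classify_read(blocks, exons):
    """Classify an aligned read from its aligned blocks.

    `blocks` is a list of (start, end) segments of the read on the
    reference; a spliced read has more than one block.
    `exons` is a list of (start, end) exon intervals.
    All intervals are half-open: start inclusive, end exclusive.
    """
    def within_exon(block):
        return any(start <= block[0] and block[1] <= end for start, end in exons)

    if not all(within_exon(block) for block in blocks):
        return "other"
    return "junction" if len(blocks) > 1 else "exonic"


# Made-up gene model with two exons, and three made-up reads.
exons = [(100, 200), (300, 400)]
reads = {
    "read1": [(120, 170)],              # lies inside the first exon
    "read2": [(180, 200), (300, 330)],  # spliced across the intron
    "read3": [(210, 260)],              # falls in the intron
}
labels = {name: classify_read(blocks, exons) for name, blocks in reads.items()}
```

Counting exonic and junction reads per gene is then the starting point for the expression profile mentioned in the slide.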
