CURRENT CHALLENGES IN GENOMIC DATA VISUALIZATION Cydney Nielsen BC Cancer Agency Genome Sciences Centre Vancouver, Canada
The Data Deluge Sequencing cost per megabase: ~$5,000 in 2001, ~10¢ in 2011
Sequencing Experiments De novo assembly, Re-sequencing, Enrichment [Figure: short sequence reads for each experiment type, shown against panels labeled Genome Assembly, Reference Genome, and Reference Genome]
Drew Sheneman, New Jersey - The Newark Star Ledger
Challenge 1 Large number of samples for comparison “To systematically characterize the genomic changes in hundreds of tumors … and thousands of samples over the next five years” The Cancer Genome Atlas www.cancergenome.nih.gov
Genome Browsers Stacked data tracks along a common genome x-axis [Figure axes: data samples (vertical), genome coordinate (horizontal)]
UCSC Cancer Genomics Heatmaps Glioblastoma Copy Number Abnormality, Agilent 244A array (n=200), tumor vs normal [Figure: heatmap with data samples as rows, annotated by gender, along a genome coordinate axis] Heatmap provides a more condensed view. Recurrent deletion of all or part of chromosome 10, peak at PTEN locus. Zhu et al., Nature Methods, 2009
Challenge 1 Large number of samples for comparison Consider what information is needed e.g. replace with biologically meaningful summary, such as significant change between samples
UCSC Cancer Genomics Heatmaps Glioblastoma Copy Number Abnormality, Agilent 244A array (n=200), tumor vs normal. Example: Summary view (column averages). Recurrent deletion of all or part of chromosome 10, peak at PTEN locus. Zhu et al., Nature Methods, 2009
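The summary view (column averages) can be sketched as a per-column mean over a samples-by-positions matrix. This is a minimal illustration, not the UCSC implementation; the function name `column_averages` and the toy log-ratio values are assumptions made for the example.

```python
from statistics import fmean


def column_averages(matrix):
    """Average each genome-coordinate column across all samples (rows)."""
    if not matrix:
        return []
    width = len(matrix[0])
    if any(len(row) != width for row in matrix):
        raise ValueError("all samples must cover the same number of bins")
    # zip(*matrix) transposes rows (samples) into columns (genomic bins).
    return [fmean(column) for column in zip(*matrix)]


# Toy copy-number log ratios: 3 samples x 4 genomic bins (made-up values).
samples = [
    [0.0, -1.0, 0.5, 0.0],
    [0.0, -1.0, 0.0, 0.0],
    [0.0, -1.0, 1.0, 0.0],
]
summary = column_averages(samples)
print(summary)
```

A bin that is deleted in every sample keeps a strongly negative average, while sample-specific noise is damped, which is why a single summary row can stand in for many sample rows.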
Challenge 2 Large number of data types
Genomic rearrangements in cancer (complex representation) [Figure: SNU-C1 (colorectal), Chr 15. Rearrangement classes: deletion-type, tandem dup-type, tail-to-tail inverted, head-to-head inverted. Tracks: non-inverted orientation, copy number, allelic ratio, inverted orientation. x-axis: genomic location (Mb)] Stephens et al., Cell, 2011
17 mouse genomes (more compact representation) [Figure: circular plot of mouse strains (including CAST/EiJ, WSB/EiJ, PWK/PhJ, SPRET/EiJ) across chromosomes 1-19 and X, with tracks for SNPs, SVs, TEs, and uncallable regions] Still difficult to represent many data types in a general tool. Keane et al., Nature, 2011
Challenge 2 Large number of data types Compact, customized data encoding
ABySS-Explorer Represents sequence: connectivity, strand, length, mapping on reference. Interactively access: sequence coverage, scaffolding. [Figure: (a) reference human genome, (b) inversion event in a human lymphoma genome] Nielsen et al., Best Paper Award at InfoVis 2009
Challenge 3 Genomic features are sparse
Genome Browsers LOCAL VIEW Human chr1, 1 pt corresponds to 480 kb, which is larger than 98% of all human genes! - Martin Krzywinski
Hilbert Curve GLOBAL VIEW [Figure: chromatin states 1-9 along Chromosome 3L, with annotated regions: cluster of small expressed genes, open chromatin domain, PcG domains, heterochromatin-like domain, pericentromeric heterochromatin] Kharchenko et al., Nature, 2011; Anders, Bioinformatics, 2009
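A Hilbert curve folds the one-dimensional genome coordinate into a square so that positions close on the chromosome stay close in the image. The sketch below uses the standard distance-to-coordinate conversion; it is not taken from the cited tools. The names `hilbert_d2xy` and `position_to_cell` and the chromosome length in the usage line are illustrative assumptions.

```python
def hilbert_d2xy(order, d):
    """Map distance d along a Hilbert curve to (x, y) on a 2**order grid."""
    n = 1 << order
    if not 0 <= d < n * n:
        raise ValueError("d must lie in [0, 4**order)")
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:
            if rx == 1:
                # Flip the sub-square before swapping axes.
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def position_to_cell(position, chrom_length, order):
    """Bin a 0-based genomic position into one cell of the curve."""
    cells = 4 ** order
    return hilbert_d2xy(order, position * cells // chrom_length)


# Hypothetical 24.5 Mb chromosome drawn on a 256 x 256 grid (order 8).
cell = position_to_cell(12_000_000, 24_500_000, 8)
print(cell)
```

Because consecutive cells along the curve are always grid neighbours, a contiguous domain on the chromosome appears as a compact patch rather than a thin stripe.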
Challenge 3 Genomic features are sparse Need both overview and detail Functional axis (perhaps not full genome)
Spark – a genomic data exploration tool 1. Focus on regions of interest (e.g. transcriptional start sites) 2. Extract data matrices 3. Cluster matrices 4. Interactive cluster visualization [Data tracks shown: H3K4me3, H3K9Ac, H3K4me1, H3K36me3, H3K27me3, H3K9me3, MeDIP, MRE] Nielsen et al., in preparation
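Steps 1 to 3 of the workflow above can be sketched in a few lines. This is a minimal illustration and not Spark's code: the slide does not name a clustering method, so plain k-means with a naive deterministic initialisation is used as a stand-in, and the names `extract_rows` and `kmeans`, the binning scheme, and the toy signal are all assumptions.

```python
from statistics import fmean


def extract_rows(signal, centers, flank, n_bins):
    """Steps 1-2: one binned row of mean signal per region of interest."""
    window = 2 * flank
    if window % n_bins:
        raise ValueError("window size must be divisible by n_bins")
    step = window // n_bins
    rows = []
    for center in centers:
        start = center - flank
        if start < 0 or start + window > len(signal):
            continue  # skip regions that run off the end of the track
        rows.append([fmean(signal[start + i * step:start + (i + 1) * step])
                     for i in range(n_bins)])
    return rows


def kmeans(rows, k, iterations=20):
    """Step 3: cluster rows; centroids start at the first k rows."""
    centroids = [list(row) for row in rows[:k]]
    labels = [0] * len(rows)
    for _ in range(iterations):
        # Assign each row to the nearest centroid (squared Euclidean).
        labels = [min(range(k),
                      key=lambda j: sum((a - b) ** 2
                                        for a, b in zip(row, centroids[j])))
                  for row in rows]
        # Recompute each centroid as the column-wise mean of its members.
        for j in range(k):
            members = [row for row, lab in zip(rows, labels) if lab == j]
            if members:
                centroids[j] = [fmean(col) for col in zip(*members)]
    return labels


# Toy per-base signal with two peaks (made-up values).
signal = [0.0] * 60
for peak_start in (8, 38):
    signal[peak_start:peak_start + 4] = [5.0] * 4

rows = extract_rows(signal, centers=[10, 25, 40, 52], flank=4, n_bins=4)
labels = kmeans(rows, k=2)
print(rows, labels)
```

Regions with a peak at their centre end up in one cluster and flat regions in the other; step 4 would then display each cluster as a heatmap of its member rows.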
Challenge 4 No longer one genome but many
Single nucleotide variation Ossowski et al., Genome Research, 2008
Single nucleotide variation Integrative Genomics Viewer (IGV) Robinson et al., Nature Biotechnology, 2011
Structural variation Bhutkar et al., Genetics, 2008
Challenge 4 No longer one genome but many Capture variation on a graph
Sequence variation on a graph Users may require more time to learn how to interpret graph representations, but such graphs are likely to scale better and may prove more powerful for analysis. Comeau et al., Mol. Biol. Evol., 2010
Sequence variation on a graph Paten et al., Genome Research, 2011
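As a toy illustration of capturing variation on a graph, the sketch below stores a reference sequence plus single nucleotide variants as a base-level graph, where each variant opens a small "bubble" of alternative nodes. It is not the representation used in the cited papers; the names `build_snv_graph` and `spell_paths`, the (position, base) node format, and the example sequence are assumptions.

```python
from collections import defaultdict


def build_snv_graph(reference, snvs):
    """Return (nodes, edges); nodes are (position, base) pairs."""
    alleles = [[base] for base in reference]
    for pos, alt in snvs:
        if not 0 <= pos < len(reference):
            raise ValueError(f"variant position {pos} is outside the reference")
        if alt not in alleles[pos]:
            alleles[pos].append(alt)
    nodes = [(i, b) for i, column in enumerate(alleles) for b in column]
    # Connect every allele at position i to every allele at position i + 1.
    edges = [((i, a), (i + 1, b))
             for i in range(len(alleles) - 1)
             for a in alleles[i]
             for b in alleles[i + 1]]
    return nodes, edges


def spell_paths(nodes, edges, length):
    """List every sequence spelled by a start-to-end walk through the graph."""
    successors = defaultdict(list)
    for src, dst in edges:
        successors[src].append(dst)

    def walk(node):
        if node[0] == length - 1:
            yield node[1]
            return
        for nxt in successors[node]:
            for suffix in walk(nxt):
                yield node[1] + suffix

    return [seq for start in nodes if start[0] == 0 for seq in walk(start)]


reference = "ACGT"
nodes, edges = build_snv_graph(reference, [(1, "T")])
haplotypes = spell_paths(nodes, edges, len(reference))
print(haplotypes)
```

One extra node and two extra edges encode the variant, whereas a linear view would need a second full-length sequence; this is the sense in which graphs may scale better as the number of genomes grows.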
Challenge 5 [Diagram: Human Judgement and Computational Analysis]
Consed Genome Assembly and Finishing Tool David Gordon and Phil Green Good example of integrated visualization and computational analysis functionality
Challenge 5 Need to integrate computation High interactivity, low memory overhead Avoid storing large data sets locally Popularity of web-based tools Evolving sequencing technologies
Summary 1. Large number of samples for comparison 2. Large number of data types 3. Genomic features are sparse 4. No longer one genome but many 5. Need to integrate computational analysis