Interactive Data Visualization in the Wild: Challenges of Big Data


  1. Interactive Data Visualization in the Wild: Challenges of Big Data in Cancer Genomics. Cydney Nielsen, University of British Columbia, British Columbia Cancer Agency

  2. Outline
     1 Visualization and its role in scientific discovery
     2 Introduction to cancer genomics
     3 Cancer genomics visualization: building a scalable platform
     4 Summary

  3. 1 Visualization and its role in scientific discovery

  4. Discovery loop: QUESTIONS → (experiments) → DATA → (interpretation) → INSIGHTS → (hypothesis generation) → QUESTIONS

  5. Discovery loop: QUESTIONS → (experiments) → DATA → (interpretation) → INSIGHTS → (communication) → PUBLICATIONS

  6. Discovery loop: QUESTIONS → (experiments) → DATA → (interpretation) → INSIGHTS → (communication) → PUBLICATIONS. Interpretation = computer automation + human expert

  7. Intelligence Amplifying System > Artificial Intelligence System. "That is, a machine and a mind can beat a mind-imitating machine working by itself." - Frederick Brooks

  8. Why visualization? Visualization
     • Leverages our ability to visually recognize patterns and enhances our ability to reason about data
     • Can reveal a level of detail that may be missed in summary statistics alone

     Anscombe's quartet (Figure 1a)

           I             II            III            IV
        x      y      x      y      x      y      x      y
       10   8.04     10   9.14     10   7.46      8   6.58
        8   6.95      8   8.14      8   6.77      8   5.76
       13   7.58     13   8.74     13  12.74      8   7.71
        9   8.81      9   8.77      9   7.11      8   8.84
       11   8.33     11   9.26     11   7.81      8   8.47
       14   9.96     14   8.10     14   8.84      8   7.04
        6   7.24      6   6.13      6   6.08      8   5.25
        4   4.26      4   3.10      4   5.39     19  12.5
       12  10.84     12   9.13     12   8.15      8   5.56
        7   4.82      7   7.26      7   6.42      8   7.91
        5   5.68      5   4.74      5   5.73      8   6.89
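
The point about summary statistics can be checked directly from the table. The following is a minimal sketch using only the Python standard library; the helper names (`pearson`, `summarize`) are illustrative and not part of the talk. It computes the mean, sample variance, and Pearson correlation for each of the four data sets.

```python
from statistics import mean, variance

# Anscombe's quartet, transcribed from the table on this slide.
X_SHARED = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X_SHARED, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X_SHARED, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X_SHARED, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.5, 5.56, 7.91, 6.89]),
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


def summarize(xs, ys):
    """Summary statistics that hide the differences between the four sets."""
    return {
        "mean_x": mean(xs),
        "var_x": variance(xs),  # sample variance
        "mean_y": mean(ys),
        "var_y": variance(ys),
        "r": pearson(xs, ys),
    }


SUMMARIES = {name: summarize(xs, ys) for name, (xs, ys) in QUARTET.items()}

for name, s in SUMMARIES.items():
    print(f"{name:>3}: mean_x={s['mean_x']:.2f} var_x={s['var_x']:.2f} "
          f"mean_y={s['mean_y']:.2f} var_y={s['var_y']:.2f} r={s['r']:.3f}")
```

The four rows of output agree to about two decimal places, even though the scatter plots of the four sets look nothing alike, which is the case for plotting the data rather than relying on the summaries.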

  9. Why visualization? Visualization
     • Is well suited to questions where the solution is too ill-defined to be automatically computed (DATA → interpretation → INSIGHTS)

  10. Why visualization? Visualization
     • Can be further enhanced with interactivity, which is key to dynamic data exploration
     Example: Visual Information-Seeking Mantra: "Overview first, zoom and filter, then details-on-demand." - Shneiderman 1996
     www.apple.com

  11. Why visualization? Visualization
     • Reduces the computational barrier posed by many data analysis workflows

  12. 2 Cancer Genomics

  13. Human genome. Image from UCSF School of Medicine Office of Educational Technology

  14. Cancer: disease of the genome. Li Ding et al. Nature 2012

  15. DNA Sequencing (Illumina HiSeq)
     Input: DNA prepared from a population of cells from a tissue sample
     Output: millions of sequencing reads, for example:
       AGCGCAGATACAGACAGGTGAAACAGTACAG
       TGACAACAGTACCAAGTCAGAGTCCACATAG
       TAGAGGAGAGGCCAACATATAGACAACAGTT
       TGACAACAGTACCACAGAGTACATAGAGGAG
       AGCGCAGATACAGACAGGTGACAACAGAGAG

  16. Detecting genomic alterations from sequence reads

     reads                       GATGACAACAGAGAGGTTACAC
                                AGATGACAACAGAGAGGTTACA
                               CAGATGACAACAGAGAGGTTAC
                             GACAGGTGACAACAGAGAGGTT
                            AGACAGATGACAACAGAGAGGT
                        ATACAGACAGGTGACAACAGAG
                      AGATACAGACAGGTGACAACAG
                  GCGCAGATACAGACAGATGACA
     reference  TAGCGCAGATACAGACAGGTGACAACAGAGAGGTTACACCAG

  17. Detecting genomic alterations from sequence reads: mutation

     reads                       GATGACAACAGAGAGGTTACAC
                                AGATGACAACAGAGAGGTTACA
                               CAGATGACAACAGAGAGGTTAC
                             GACAGGTGACAACAGAGAGGTT
                            AGACAGATGACAACAGAGAGGT
                        ATACAGACAGGTGACAACAGAG
                      AGATACAGACAGGTGACAACAG
                  GCGCAGATACAGACAGATGACA
     reference  TAGCGCAGATACAGACAGGTGACAACAGAGAGGTTACACCAG
                                  ^ mutation (reference G, some reads A)

  18. Detecting genomic alterations from sequence reads: mutation, coverage, and allele ratio

     reads                       GATGACAACAGAGAGGTTACAC
                                AGATGACAACAGAGAGGTTACA
                               CAGATGACAACAGAGAGGTTAC
                             GACAGGTGACAACAGAGAGGTT
                            AGACAGATGACAACAGAGAGGT
                        ATACAGACAGGTGACAACAGAG
                      AGATACAGACAGGTGACAACAG
                  GCGCAGATACAGACAGGTGACA
     reference  TAGCGCAGATACAGACAGGTGACAACAGAGAGGTTACACCAG
                                  ^ mutation (reference G, some reads A)

     coverage at the mutated position: 8 reads (4 with A, 4 with G)
     allele ratio = 0.5
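
The coverage and allele ratio on this slide can be reproduced with a few lines of code. This is a minimal sketch, not the method used by MutationSeq or any real variant caller: it assumes ungapped reads, places each read at the offset with the fewest mismatches, and takes the allele ratio to be the count of the most common non-reference base divided by coverage. The function names are illustrative, and positions are 0-based.

```python
from collections import Counter

# Reference and reads transcribed from slide 18.
REFERENCE = "TAGCGCAGATACAGACAGGTGACAACAGAGAGGTTACACCAG"
READS = [
    "GATGACAACAGAGAGGTTACAC",
    "AGATGACAACAGAGAGGTTACA",
    "CAGATGACAACAGAGAGGTTAC",
    "GACAGGTGACAACAGAGAGGTT",
    "AGACAGATGACAACAGAGAGGT",
    "ATACAGACAGGTGACAACAGAG",
    "AGATACAGACAGGTGACAACAG",
    "GCGCAGATACAGACAGGTGACA",
]


def best_offset(read, reference):
    """Naive ungapped alignment: the offset with the fewest mismatches."""
    def mismatches(offset):
        window = reference[offset:offset + len(read)]
        return sum(r != g for r, g in zip(read, window))
    return min(range(len(reference) - len(read) + 1), key=mismatches)


def pileup(reads, reference):
    """Count the bases observed at every reference position."""
    columns = [Counter() for _ in reference]
    for read in reads:
        start = best_offset(read, reference)
        for i, base in enumerate(read):
            columns[start + i][base] += 1
    return columns


def variant_sites(reads, reference):
    """Return (position, ref_base, alt_base, coverage, allele_ratio) tuples
    for every position where at least one read disagrees with the reference."""
    sites = []
    for pos, counts in enumerate(pileup(reads, reference)):
        coverage = sum(counts.values())
        alt = {b: n for b, n in counts.items() if b != reference[pos]}
        if alt:
            alt_base, alt_count = max(alt.items(), key=lambda kv: kv[1])
            sites.append((pos, reference[pos], alt_base, coverage, alt_count / coverage))
    return sites


SITES = variant_sites(READS, REFERENCE)
print(SITES)
```

For the eight reads above this reports a single site with reference base G, alternate base A, coverage 8, and allele ratio 0.5, matching the slide.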

  19. Genomic alterations
     • Mutation (e.g. G → A)
     • Copy number (e.g. deletion)
     • Rearrangement (e.g. translocation)

  20. Revolution in DNA sequencing technologies

  21. The promise of data. Green E. et al. Nature. February 10, 2011

  22. Cancer genomics data interpretation
     • Computer automation: to predict diverse genomic alterations
       o Mutations: MutationSeq (Ding et al., Bioinformatics 2012)
       o Copy number (deletions): Titan (Ha et al., Genome Research 2014)
       o Rearrangements (translocations): deStruct
     • Human expert: to integrate and interpret these alterations together with relevant patient metadata

  23. Cancer genomics data interpretation
     • Computer automation: MutationSeq (Ding et al., Bioinformatics 2012), Titan (Ha et al., Genome Research 2014), deStruct
     • Human expert
     Need interactive visualization tools to facilitate the human component and complement the computational one

  24. 3 Cancer Genomics Visualization

  25. Many tools for many tasks. Schroeder et al. Genome Medicine 2013, 5:9, http://genomemedicine.com/content/5/1/9. Review: "Visualizing multidimensional cancer genomics data", Michael P Schroeder, Abel Gonzalez-Perez and Nuria Lopez-Bigas
     Figure panels:
     • Genomic coordinates: chromosomal coordinates, omics data, clinical data
     • Matrix heatmaps: genes, samples, omics data, clinical data
     • Networks: interactions, genes, omics data, clinical data

  26. Many tools for many tasks. Schroeder et al. Genome Medicine 2013, 5:9, http://genomemedicine.com/content/5/1/9. Review: "Visualizing multidimensional cancer genomics data", Michael P Schroeder, Abel Gonzalez-Perez and Nuria Lopez-Bigas

  27. http://www.cbioportal.org

  28. Key Feature 1 Flexible integration of views

  29. Integrate multiple data types into one view. Example analysis: examine a mutation in its copy number context (mutation, deletion)

  30. Integrate multiple data types into one view. Example analysis: examine a mutation in its copy number context (mutations, copy number)

  31. Compare data filters on a single data set. Example analysis: examine the impact of the MutationSeq probability threshold on the coverage versus allele ratio distribution (MutationSeq predictions)

  32. Explore views of different data types. Example analysis: examine both the mutations and copy number alterations for a given sample (MutationSeq predictions, Titan copy number predictions)

  33. Components
     • View (v): visual representation
     • Region Filter: on genomic range
     • Data Filter: on data parameters
     • Data (d): sample(s) + data type

  34. Integrate multiple data types into one view: one view (v) drawing on two data components (d, d), mutations and copy number

  35. Compare data filters on a single data set: two views (v, v) of one data component (d), MutationSeq predictions

  36. Explore views of different data types: two views (v, v), each with its own data component (d, d), MutationSeq predictions and Titan copy number predictions
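
One way to read the component model and the three arrangements on slides 34 to 36 is as a small set of composable records. The sketch below is an assumption about how such a model could be expressed, not the platform's actual code: the `Layer` class, the representation names ("genome track", "scatterplot"), and the threshold values are all hypothetical. The sample id SA091 is taken from the example documents on slide 56.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class RegionFilter:
    """Restricts a view to a genomic range."""
    chrom: str
    start: int
    end: int


@dataclass(frozen=True)
class DataFilter:
    """Restricts data by a parameter, e.g. a minimum probability."""
    parameter: str
    minimum: float


@dataclass(frozen=True)
class Data:
    """Sample(s) plus a single data type."""
    samples: tuple
    data_type: str


@dataclass
class Layer:
    """Hypothetical glue: one data component with its optional filters."""
    data: Data
    data_filter: Optional[DataFilter] = None
    region_filter: Optional[RegionFilter] = None


@dataclass
class View:
    """A visual representation fed by one or more layers."""
    representation: str
    layers: List[Layer] = field(default_factory=list)


mutations = Data(samples=("SA091",), data_type="MutationSeq predictions")
copy_number = Data(samples=("SA091",), data_type="Titan copy number predictions")

# Slide 34: one view, two data components.
integrated = View("genome track", [Layer(mutations), Layer(copy_number)])

# Slide 35: two views of the same data, differing only in the data filter.
compared = [
    View("scatterplot", [Layer(mutations, DataFilter("probability", 0.5))]),
    View("scatterplot", [Layer(mutations, DataFilter("probability", 0.9))]),
]

# Slide 36: two views, each with its own data component.
explored = [
    View("scatterplot", [Layer(mutations)]),
    View("genome track", [Layer(copy_number)]),
]
```

Keeping the filters on the connection between data and view, rather than on the data itself, is what lets two views share one data component while showing different subsets of it.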

  37. Interface: web application implemented using D3.js

  38. Create: select a predefined structure

  39. Create: add to an existing structure

  40. Define Data
     • Sample(s): query by project name / tumour type / sample id
     • Single data type: e.g. mutations, copy number, etc.

  41. Filter Data: data filters depend on the previously selected data type

  42. Filter Regions: limit the view to genes or regions of interest

  43. Select a View: view types depend on the previously selected data type

  44. Adjust View

  45. Inspect/Modify

  46. Key Feature 2 Dynamic linking between views

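  47. Dynamically link views of different data types: two views (v, v), each with its own data component (d, d), MutationSeq predictions and Titan copy number predictions

  48. Dynamically link views of different data types: two views (v, v), two data components (d, d)

  49. Dynamically link views of different data types: selecting a mutation in one view highlights the corresponding deletion in the other (v, v; d, d)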
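
A plausible core of such linking is a lookup by genomic overlap: given the item selected in one view, find the items in the other view that cover the same position in the same sample. The sketch below illustrates that idea only; it is not the platform's implementation. The record classes and `linked_segments` are hypothetical, the first mutation and segment reuse the values from the example documents on slide 56, and the second segment is made up to show a non-match.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Mutation:
    sample_id: str
    chrom: str
    position: int
    ref_allele: str
    alt_allele: str
    probability: float


@dataclass(frozen=True)
class CopyNumberSegment:
    sample_id: str
    chrom: str
    start: int
    end: int
    state: str


def linked_segments(selected: Mutation,
                    segments: List[CopyNumberSegment]) -> List[CopyNumberSegment]:
    """Segments from the same sample and chromosome that contain the mutation."""
    return [
        seg for seg in segments
        if seg.sample_id == selected.sample_id
        and seg.chrom == selected.chrom
        and seg.start <= selected.position <= seg.end
    ]


# Values from slide 56.
MUTATION = Mutation("SA091", "1", 104589, "A", "T", 0.91)
SEGMENTS = [
    CopyNumberSegment("SA091", "1", 103062, 109114, "GAIN"),
    # Made-up segment on another chromosome, for contrast only.
    CopyNumberSegment("SA091", "2", 50000, 60000, "LOSS"),
]

HIGHLIGHTED = linked_segments(MUTATION, SEGMENTS)
print(HIGHLIGHTED)
```

In an interactive interface the same lookup would run on each selection event, with the returned segments highlighted in the copy number view.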

  50. Key Feature 3 Scalability

  51. "Research on big data visualization must address two major challenges: perceptual and interactive scalability." Zhicheng Liu, Biye Jiang, Jeffrey Heer, imMens, EuroVis 2013

  52. Interactive scalability: how to enable dynamic querying and rendering of millions of data points in real time?

  53. Search
     • Optimized for text search across documents
     • All fields are indexed for fast retrieval (bag-of-terms approach)
     • Query performance is a function of the number of query matches, not the total data set size
     • Scales well as the data set size grows
     • Appropriate for load-once-read-many workflows
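
The properties listed on this slide follow from the inverted index at the heart of a search engine. The toy version below is a sketch under simplifying assumptions (exact-match terms only, everything in memory) and is far simpler than what Lucene does; the class name, field names, and documents are illustrative. Every field of every document is indexed at load time, and a query touches only the posting lists of its terms instead of scanning all documents.

```python
from collections import defaultdict


class InvertedIndex:
    """Toy inverted index: maps each (field, value) term to matching doc ids."""

    def __init__(self):
        self._postings = defaultdict(set)
        self._documents = {}

    def add(self, doc_id, document):
        """Index every field of the document (load once)."""
        self._documents[doc_id] = document
        for field_name, value in document.items():
            self._postings[(field_name, value)].add(doc_id)

    def search(self, **terms):
        """Return documents matching all terms (read many).

        Only the posting lists for the query terms are visited, so the work
        depends on how many documents match, not on the size of the index.
        """
        posting_lists = [self._postings.get(term, set()) for term in terms.items()]
        if not posting_lists:
            return []
        matches = set.intersection(*posting_lists)
        return [self._documents[doc_id] for doc_id in sorted(matches)]


index = InvertedIndex()
index.add(1, {"sample_id": "SA091", "chrom": "1", "data_type": "mutation"})
index.add(2, {"sample_id": "SA091", "chrom": "1", "data_type": "cnv"})
index.add(3, {"sample_id": "SA092", "chrom": "2", "data_type": "mutation"})

HITS = index.search(sample_id="SA091", data_type="mutation")
print(HITS)
```

The cost of this design is paid when documents are added, which is why it suits workflows where data are loaded once and queried many times.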

  54. Elasticsearch
     • Chosen for ease of use (built on top of Apache Lucene)
     • Benefits include:
       o Built-in support for distributed data (manages shards across nodes)
       o Extensive caching
       o Sophisticated query language (DSL)
       o REST API

  55. Storing data

     Relational Database    Elasticsearch
     Database               Index
     Table                  Type
     Row                    Document
     Column                 Field

  56. Storing data: documents (records) and their fields

     mutation                 CNV
     sample id: SA091         sample id: SA091
     chrom: 1                 chrom: 1
     position: 104,589        start: 103,062
     ref_allele: A            end: 109,114
     alt_allele: T            state: GAIN
     probability: 0.91
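
As a rough illustration of how documents like these might be queried, the sketch below writes the two example records as Python dictionaries and builds a search body using Elasticsearch's bool, term, and range queries. Several details are assumptions rather than facts from the talk: the snake_case field name `sample_id`, storing positions as integers, the fields being mapped for exact-value matching, and the 0.9 probability cut-off. No request is sent; the body would be posted to the `_search` endpoint of whichever index holds the mutation documents.

```python
import json

# The two example documents from this slide, as JSON-style dictionaries.
MUTATION_DOC = {
    "sample_id": "SA091",
    "chrom": "1",
    "position": 104589,
    "ref_allele": "A",
    "alt_allele": "T",
    "probability": 0.91,
}
CNV_DOC = {
    "sample_id": "SA091",
    "chrom": "1",
    "start": 103062,
    "end": 109114,
    "state": "GAIN",
}


def mutation_query(sample_id, chrom, start, end, min_probability):
    """Search body: mutations for one sample inside a genomic range,
    kept only if their probability meets the threshold."""
    return {
        "query": {
            "bool": {
                "filter": [
                    {"term": {"sample_id": sample_id}},
                    {"term": {"chrom": chrom}},
                    {"range": {"position": {"gte": start, "lte": end}}},
                    {"range": {"probability": {"gte": min_probability}}},
                ]
            }
        }
    }


# Mutations that fall inside the example CNV segment.
QUERY = mutation_query(
    CNV_DOC["sample_id"], CNV_DOC["chrom"], CNV_DOC["start"], CNV_DOC["end"], 0.9
)
print(json.dumps(QUERY, indent=2))
```

With these values the example mutation document would match: it is on chromosome 1 at position 104,589, inside the 103,062 to 109,114 segment, with probability 0.91.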
