Why Aren't We Benchmarking Bioinformatics?

Joe Parker
Early Career Research Fellow (Phylogenomics)
Department of Biodiversity Informatics and Spatial Analysis
Royal Botanic Gardens, Kew
Richmond TW9 3AB
joe.parker@kew.org
Outline
• Introduction
• A brief history of bioinformatics
• Benchmarking in bioinformatics
• Case study 1: typical benchmarking across environments
• Case study 2: mean-variance relationship for repeated measures
• Conclusions: implications for statistical genomics
A (very) brief history of bioinformatics
Kluge & Farris (1969) Syst. Zool. 18: 1-32
A (very) brief history of bioinformatics
Stewart et al. (1987) Nature 330: 401-404
A (very) brief history of bioinformatics
ENCODE Consortium (2012) Nature 489: 57-74
Benchmarking to biologists
• Benchmarking as a comparative process
  – i.e. 'which software's best?' / 'which platform?'
• Benchmarking application logic / profiling is largely unknown to biologists
• Environments / runtimes are generally either assumed to be identical, or else loosely categorised into 'laptops vs clusters'
Case Study 1
aka
'Which program's the best?'
Bioinformatics environments are very heterogeneous
• Laptop:
  – Portable
  – Very costly form-factor
  – Maté? Beer?
• Raspi:
  – Low: cost, energy (& power)
  – Highly portable
  – Hackable form-factor
• Clusters:
  – Not portable, setup costs
• The cloud:
  – Power closely linked to budget (so only as limited as the budget)
  – Almost infinitely scalable
  – Have to have a connection to get data up there (and down!)
  – Fiddly setup
Benchmarking to biologists
Comparison

System               Arch     Cores   CPU type @ clock (GHz)   RAM (Gb)   HDD (Gb / type)
Haemodorum           i686     8       Xeon E5620 @ 2.4         33         1000 / SATA
Raspberry Pi 2 B+    ARMv7    1       ARM @ 1.0                1          8 / flash card
Macbook Pro (2011)   x64      4       Core i7 @ 2.2            8          250 / SSD
EC2 m4.10xlarge      x64      40      Xeon E5 @ 2.4            160        320 / SSD
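For comparisons like the table above to be reproducible, the environment should be recorded alongside the timings. A minimal sketch (not from the talk; psutil is an assumed third-party dependency used only for the RAM figure):

```python
# Record the run environment next to any benchmark numbers so that
# measurements taken on different machines stay comparable.
import json
import os
import platform
import psutil  # assumption: installed; only needed for total RAM

env = {
    "hostname": platform.node(),
    "arch": platform.machine(),
    "cores": os.cpu_count(),
    "cpu": platform.processor(),
    "ram_gb": round(psutil.virtual_memory().total / 1e9, 1),
    "os": platform.platform(),
}
print(json.dumps(env, indent=2))
```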
Reviewing / comparing new methods
• Biological problems often scale horribly and unpredictably
• Algorithmic analyses only tell part of the story
• So we need empirical measurements on different problem sets to predict how problems will scale… (see the sketch below)
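One way to make the empirical-measurement bullet concrete: time a step at several problem sizes and fit a power law to estimate how it scales. A minimal sketch (not from the talk; the timed function and the problem sizes are placeholders):

```python
# Estimate how a pipeline step scales empirically by timing it on a few
# problem sizes and fitting a power law t ~ a * n^k on a log-log scale.
import time
import numpy as np

def step_to_benchmark(n):
    """Placeholder workload; swap in e.g. an alignment or tree-search step."""
    x = np.random.rand(n, 50)
    return x @ x.T          # deliberately super-linear in n

sizes = [200, 400, 800, 1600, 3200]
times = []
for n in sizes:
    t0 = time.perf_counter()
    step_to_benchmark(n)
    times.append(time.perf_counter() - t0)

# Fit log(t) = k*log(n) + log(a); k approximates the empirical exponent.
k, log_a = np.polyfit(np.log(sizes), np.log(times), 1)
print(f"empirical scaling exponent ~ {k:.2f}")
print(f"predicted time at n=10000: {np.exp(log_a) * 10000**k:.1f} s")
```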
Workflow
• Setup: set up workflow, binaries, and reference / alignment data; deploy to machines.
• Short reads → BLAST 2.2.30: protein-protein BLAST of reads (from the MG-RAST repository, Bass Strait oil field) against 458 core eukaryote genes from CEGMA; keep only top hits; use max. num_threads available.
• CEGMA alignments: append top-hit sequences to the CEGMA alignments (concatenate hits to alignments).
• For each alignment:
  – Align in MUSCLE 3.8.31 using default parameters.
  – Infer de novo phylogeny in RAxML 7.2.8+ under Dayhoff, random starting tree, max. PTHREADS.
• Output and parse times (see the timing sketch below).
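A minimal sketch of how each step of the workflow above could be wrapped and timed on every machine. The file names, database name and thread counts are placeholders, and the flags shown should be checked against the installed BLAST+ 2.2.30 / MUSCLE 3.8.31 / RAxML 7.2.8 builds:

```python
# Wrap each pipeline step, record wall-clock time, and print a per-step
# summary so the limiting step on each machine can be compared directly.
import subprocess
import time

steps = {
    "blastp": ["blastp", "-query", "reads.fasta", "-db", "cegma_458",
               "-num_threads", "8", "-outfmt", "6", "-out", "hits.tsv"],
    "muscle": ["muscle", "-in", "alignment_with_hits.fasta",
               "-out", "alignment.aln"],
    "raxml":  ["raxmlHPC-PTHREADS", "-T", "8", "-m", "PROTCATDAYHOFF",
               "-p", "12345", "-s", "alignment.phy", "-n", "run1"],
}

timings = {}
for name, cmd in steps.items():
    t0 = time.perf_counter()
    subprocess.run(cmd, check=True)      # fail loudly if a step breaks
    timings[name] = time.perf_counter() - t0

for name, secs in timings.items():
    print(f"{name}\t{secs:.1f} s")
```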
Results - BLASTP

Results - RAxML
Case Study 2
aka
'What the hell's a random seed?'
Mean-variance plot for sitewise lnL estimates in PAML (n = 10)
[Figure: scatter of per-site variance (0.000 to 0.014) against mean log-likelihood (-50 to -10) across replicate runs.]
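A minimal sketch of how a mean-variance plot like the one above can be produced, assuming the per-site log-likelihoods from n replicate PAML runs (each started from a different random seed) have already been parsed into a matrix; here the matrix is filled with fake data purely for illustration:

```python
# Compute per-site mean and variance of log-likelihood across replicate runs
# and plot the mean-variance relationship.
import numpy as np
import matplotlib.pyplot as plt

# shape (n_replicates, n_sites); replace with values parsed from PAML output
rng = np.random.default_rng(0)
sitewise_lnl = rng.normal(loc=-25.0, scale=0.05, size=(10, 500))

site_mean = sitewise_lnl.mean(axis=0)
site_var = sitewise_lnl.var(axis=0, ddof=1)   # sample variance per site

plt.scatter(site_mean, site_var, marker="+")
plt.xlabel("Mean log-likelihood")
plt.ylabel("Variance")
plt.title("Mean-variance plot for sitewise lnL estimates (n=10)")
plt.savefig("mean_variance.png", dpi=150)
```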
Properly benchmarking workflows
• Limiting-step analyses are usually ignored
  – in many workflows the limiting step might actually be data cleaning / parsing / transformation
• Or (most common error) inefficient iteration
• Or even disk I/O! (see the profiling sketch below)
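A minimal profiling sketch (not from the talk): run the whole pipeline under cProfile so that the limiting step, which is often parsing or disk I/O rather than the headline analysis, shows up in the cumulative-time listing. The three pipeline functions and reads.tsv are stand-ins:

```python
# Profile an end-to-end run and list the ten most expensive calls by
# cumulative time.
import cProfile
import pstats

def parse_input():
    """Stand-in for data cleaning / parsing."""
    with open("reads.tsv") as fh:          # placeholder input file
        return [line.split() for line in fh]

def analyse(records):
    """Stand-in for the 'real' analysis step."""
    return sorted(records)

def write_output(result):
    """Stand-in for disk I/O."""
    with open("out.tsv", "w") as fh:
        fh.writelines("\t".join(r) + "\n" for r in result)

def pipeline():
    write_output(analyse(parse_input()))

cProfile.run("pipeline()", "pipeline.prof")
pstats.Stats("pipeline.prof").sort_stats("cumulative").print_stats(10)
```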
Workflow benchmarking is very rare
• Many bioinformatics workflows / pipelines are limited at odd steps: parsing, etc.
• Some examples exist, e.g. http://beast.bio.ed.ac.uk/benchmarks and many bioinformatics papers
• More harm than good?
Conclusion
• Biologists and error
• Current practice
• Help!
• Interesting challenges too…

Thanks:
RBG Kew, BI&SA, Mike Chester