
BioMake: Chris Mungall, Berkeley Drosophila - PowerPoint PPT Presentation



  1. BioMake. Chris Mungall, Berkeley Drosophila Genome Project, cjm@fruitfly.org

  2. build networks
     - Many bioinformaticians spend large amounts of time coding and running build/make networks
     - A build network is a recipe describing the execution of a collection of interdependent heterogeneous tasks
       - sequence analysis pipelines
       - data 'compilation'
       - importing, transforming and exporting data
       - LIMS
     - Error-prone, tedious, repetitive code that is hard to configure

  3. Existing approaches
     - Run tasks by hand, or with ad-hoc scripts
       - doesn't scale, leads to insanity
     - Unix/GNU makefiles
       - ++: concise, generic, high level abstraction
       - --: limited expressive power, hacky
     - makefile replacements (cons, scons, ant, build…)
       - geared towards software development
     - Bio compute pipeline software (biopipe, enspipe)
       - excellent for certain tasks
       - not completely generic

  4. biomake: executable computational protocols
     - A declarative language for specifying build networks
       - concise, Turing-complete, highly configurable
     - Dependency management
       - e.g. mask genomic sequence prior to genefinding
     - Local and remote job execution
       - compute farm job management
     - Filesystem or database oriented

  5. Example Genomic Sequence Analysis Pipeline
     - Prepare and cut assembled sequence into slices
     - Download latest NR peptide dataset and index it
     - Blast genomic slices against NR and other datasets
     - RepeatMask genomic slices and run genefinders on masked sequence
     - Synthesise gene models and do peptide analysis on results
     - Store everything in a relational db, and prepare files for export to public

  6. Target dependencies
     [Diagram: dependency graph between pipeline targets. Node labels: FastaDB (remote), FastaDB (local), BlastIndexed FastaDB, Assembled Genomic Sequence, Chunked Sequence, RepeatMasked Sequence, Blast Alignments, BlastP & HMM Alignments, Gene Predictions, Gene Models, Relational Database, XML Import, XML Export, Flatfile Export.]

  7. Specifying Targets
     - A generic target pattern has a name and arguments:

         formatdb(D)
           run: formatdb -i D

     - A target pattern has tags for specifying dependencies, actions and filesystem or database IDs:

         blast(P,S,D,A)
           req: formatdb(D)
           run: blastall -p P -i S -d D A
           flat: S-blast/bn(S).D.P.out

     [Diagram: BlastIndexed FastaDB, FastaDB (local), Chunked Sequence, Blast Alignments.]

  8. BioMake Execution
     Target patterns:

         blast(P,S,D,A)
           req: formatdb(D)
           run: blastall -p P -i S -d D A
           flat: S-blast/bn(S).D.P.out

         formatdb(D)
           run: formatdb -i D

     Execution trace:

         TARGET:    blast(blastx, my.seq, nr.fly, -)
         SUBTARGET: formatdb(nr.fly)
         RUN: formatdb -i nr -p T
         OUT: nr.fly.{psq,pin,phr}
              formatdb-nr.fly.OK
         RUN: blastall -p blastx -i my.seq -d nr.fly
         OUT: my.seq-blast/my.seq.nr.fly.blastx.out
              my.seq-blast/my.seq.nr.fly.blastx.out.OK
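The trace above can be mimicked with a minimal Python sketch. This is not biomake's implementation or syntax: the `PATTERNS` table, the `plan` function, the `bnS` placeholder standing in for `bn(S)`, the treatment of a `-` argument as "no extra options", and the `formatdb-{D}` ID for `formatdb` (guessed from the `formatdb-nr.fly.OK` marker) are all assumptions made for illustration.

```python
import os

# Hypothetical in-memory model of the two target patterns on the slide.
PATTERNS = {
    "formatdb": {
        "args": ["D"],
        "req": [],
        "run": "formatdb -i {D}",
        "flat": "formatdb-{D}",  # assumed default ID, not shown as a flat: tag
    },
    "blast": {
        "args": ["P", "S", "D", "A"],
        "req": [("formatdb", ["D"])],
        "run": "blastall -p {P} -i {S} -d {D} {A}",
        "flat": "{S}-blast/{bnS}.{D}.{P}.out",
    },
}


def plan(name, values, done=None):
    """Return (command, datastore_id) pairs for a target, dependencies first."""
    done = set() if done is None else done
    key = (name, tuple(values))
    if key in done:  # each target instantiation is planned only once
        return []
    done.add(key)

    pattern = PATTERNS[name]
    env = dict(zip(pattern["args"], values))  # bind pattern arguments

    steps = []
    for dep_name, dep_args in pattern["req"]:
        steps += plan(dep_name, [env[a] for a in dep_args], done)

    # Assumption: "-" means "no value" and expands to an empty string.
    fmt = {k: ("" if v == "-" else v) for k, v in env.items()}
    fmt["bnS"] = os.path.basename(env.get("S", ""))  # stand-in for bn(S)
    steps.append((pattern["run"].format(**fmt).strip(),
                  pattern["flat"].format(**fmt)))
    return steps


for command, target_id in plan("blast", ["blastx", "my.seq", "nr.fly", "-"]):
    print(command, "->", target_id)
```

The sketch only plans commands; it does not run them or write the `.OK` marker files shown in the trace.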

  9. Targets can be nested

         store(bop(my.seq, genscan(repeatmask(my.seq, drosophila))))

     Target instantiations can be thought of as skolem IDs.

         repeatmask(S,Org)
           run: repeatmask S -a Org
           flat: S.mask

         genscan(S)
           run: genscan S
           flat: S-pred/bn(S).genscan.out

         bop(S,B)
           run: apollo -bop -s S B -o target
           flat: B.game.xml

         store(XML)
           run: xml2db XML
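One way to read the nested example is that each inner target is replaced by its `flat:` ID before the outer pattern is expanded. The Python sketch below works through that reading; the `FLAT` table and `flatten` function are hypothetical names, and the innermost-first substitution rule is an assumption about how biomake resolves nesting, not something the slides state. `store` is left out because the slide gives it no `flat:` tag.

```python
import os

# flat: templates copied from the slide, written as Python functions.
FLAT = {
    "repeatmask": lambda S, Org: f"{S}.mask",
    "genscan": lambda S: f"{S}-pred/{os.path.basename(S)}.genscan.out",
    "bop": lambda S, B: f"{B}.game.xml",
}


def flatten(target):
    """Resolve a nested target (name, arg, ...) to a datastore ID, innermost first."""
    if isinstance(target, str):  # a plain argument such as a file name
        return target
    name, *args = target
    return FLAT[name](*[flatten(arg) for arg in args])


nested = ("bop", "my.seq", ("genscan", ("repeatmask", "my.seq", "drosophila")))
print(flatten(nested))
```

Under these assumptions the masked sequence resolves to `my.seq.mask`, and each outer layer builds its ID from the ID of the layer inside it.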

  10. Iteration
      - Pipelines frequently involve iterating over collections of data:
        - Perform a sequence analysis on every entry in a multi-fasta format file
        - Perform a peptide analysis on every gene prediction in some genscan output
        - Query a database for a list of IDs and perform some task on each
      - biomake has language constructs for iteration

  11. Iterating over datasets

          analyze_multifasta(F)
            iterate: analyze_seq(S)
              where S in splitfasta(F)

          splitfasta(F)
            run: splitfasta.pl -d F-split -md5 F
            flat: F-split/bn(F).pathlist
            comment: splitfasta.pl is part of the biomake distro

          analyze_seq(S)
            req: genie(S)
                 blast(blastx,S,nr,-)

      [Diagram: a MultiFasta file is split into one file per sequence, and each sequence is analyzed separately.]

      MultiFasta input:

          >seq1
          TAGGTATTGGTT
          AGGTGCGTCCTC
          >seq2
          GCGGTATAGCTT
          TTCCTTCTCTCT
          >seq3
          CAAAGCAGAGAT
          ATATTTATTCGC

      Output labels as extracted: seq1.genie.out, seq1.nr.blastx.out, seq2.genie, seq2.nr.blastx.out, seq3.genie, seq3.nr.blastx.out
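The iteration on this slide amounts to splitting a multi-FASTA input and expanding `analyze_seq` once per record. A minimal Python sketch of that expansion follows. It is an illustration only: `split_fasta` and `analyze_multifasta` are hypothetical helpers, records are kept in memory instead of being written to a split directory as `splitfasta.pl` does, and targets are represented as plain strings.

```python
def split_fasta(text):
    """Split multi-FASTA text into {record_id: sequence}."""
    records = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()
            # Fall back to a positional name if the header is empty.
            current = name[0] if name else f"unnamed{len(records) + 1}"
            records[current] = ""
        elif current is not None:
            records[current] += line
    return records


def analyze_multifasta(text):
    """Expand analyze_seq(S) for every S in the split input."""
    targets = []
    for seq_id in split_fasta(text):
        # analyze_seq(S) requires genie(S) and blast(blastx,S,nr,-)
        targets.append(f"genie({seq_id})")
        targets.append(f"blast(blastx,{seq_id},nr,-)")
    return targets


MULTIFASTA = """>seq1
TAGGTATTGGTT
AGGTGCGTCCTC
>seq2
GCGGTATAGCTT
TTCCTTCTCTCT
>seq3
CAAAGCAGAGAT
ATATTTATTCGC
"""

for target in analyze_multifasta(MULTIFASTA):
    print(target)
```

With the three-record input from the slide, this yields six targets, two per sequence.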

  12. Controlling the runmode
      - Tasks can be run locally or on a compute farm, synchronously or asynchronously
        - wrapper provided for PBS
      - runmode: tag states the mode and wrapper for a particular target pattern
        - can be set globally and per-pattern
      - special status targets provide execution status

  13. runmode example

          blast(P,S,D,A)
            req: formatdb(D)
            run: blastall -p P -i S -d D A
            flat: S-blast/bn(S).D.P.out
            runmode: async(qsubwrap)

      - The blast job will be executed on the compute farm via qsubwrap (comes with biomake distro)
      - Upon submission, the status target status_run(blast(P,S,D,A)) will be generated; on completion, the target status_ok(blast(P,S,D,A)) will be generated
      - biomake can automatically handle moving data in and out between user's filesystem (or db) and local cluster nodes

  14. Datastores
      - BioMake persists targets in Datastores
      - The flat: tag flattens targets to unique datastore IDs
      - Datastore can be filesystem or relational database
        - default is filesystem
        - can be set globally or per target
        - e.g. analysis result targets can be stored on filesystem, status targets stored in DB
      - NFS traffic can be avoided on compute farm by storing targets and status targets in a database

  15. Asynchronous Execution
      For each target T to be built, on the local machine running biomake:
        1) biomake fetches status of T and skips T if status = ok/run
        2) biomake stores status_run(T)
        3) biomake creates a runner agent script and submits it to the cluster scheduler
        4) continue onto next target
      On the node, each AGENT:
        1) agent fetches any input data
        2) agent runs command (e.g. blast) synchronously
        3) agent stores result
        4) agent stores status of T as 'ok' or 'err'
      [Diagram: local machine, NFS, and cluster nodes running agents.]
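The two numbered loops above can be sketched in Python. This is an illustration, not biomake's code: a `ThreadPoolExecutor` stands in for the cluster scheduler, plain dictionaries stand in for the status and result datastores, each command is a Python callable instead of a shell command, and the names `agent` and `submit_all` are hypothetical. The agent's input-fetching step is omitted.

```python
from concurrent.futures import ThreadPoolExecutor


def agent(target, command, status, results):
    """Agent side: run the command, store the result, store the final status."""
    try:
        results[target] = command()  # run synchronously on the "node"
        status[target] = "ok"
    except Exception:
        status[target] = "err"


def submit_all(targets, status, results, pool):
    """Driver side: skip targets that are ok/run, otherwise mark run and submit."""
    futures = []
    for target, command in targets:
        if status.get(target) in ("ok", "run"):
            continue  # already built or already in flight
        status[target] = "run"  # corresponds to storing status_run(T)
        futures.append(pool.submit(agent, target, command, status, results))
    return futures  # the driver moves on without waiting for each job


def failing_job():
    raise RuntimeError("simulated job failure")


status = {"formatdb(nr.fly)": "ok"}  # pretend this target was built earlier
results = {}
jobs = [
    ("formatdb(nr.fly)", lambda: "unused"),
    ("blast(blastx,seq1,nr.fly,-)", lambda: "alignment for seq1"),
    ("blast(blastx,seq2,nr.fly,-)", failing_job),
]

with ThreadPoolExecutor(max_workers=2) as pool:
    futures = submit_all(jobs, status, results, pool)
    for future in futures:
        future.result()  # wait here only so the final status can be printed

print(status)
```

Because the status store is consulted before every submission, rerunning the driver with the same store resubmits only targets whose status is `err` or missing.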
