CnC as workflow coordination language for scientific computing
Parallel Recipes
Yves Vandriessche
Sept. 08, 2015
Scripts deal with the complexity of gluing together applications
glue script + applications (GATK, BWA, Picard, TopHat, samtools, …) = Broad Institute best practices seq. pipeline, ~200 SLoC
Distribution and parallelisation explode the accidental complexity of scripts
The same applications (GATK, BWA, Picard, TopHat, samtools, …) in a distributed seq. pipeline: ~2000 SLoC
eHive exome pipeline: 28,066 SLoC [1] (Perl) => $898,255 est.
[1] generated using David A. Wheeler's 'SLOCCount'
parallel recipe:
What is the essential difference between a sequential and a parallel script?
Ordering dependencies!
In a sequential world: one single ordering of operations
In a parallel world: more orderings => more parallelism => more performance
Sources of ordering:
• data dependencies: produce/consume, consistency
• control dependencies: iteration, branching, recursion, …
• concurrency (shared resources)
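The claim that fewer ordering dependencies leave more valid orderings can be checked on a toy example. The sketch below is illustrative only: the task names and the brute-force counting are assumptions made for this example, not part of precipes or CnC.

```python
from itertools import permutations

def count_orderings(tasks, deps):
    """Count the execution orders of `tasks` that respect `deps`.

    `deps` maps a task to the set of tasks that must run before it.
    Brute force over all permutations, so only suitable for tiny examples.
    """
    count = 0
    for order in permutations(tasks):
        position = {task: i for i, task in enumerate(order)}
        if all(position[before] < position[task]
               for task, befores in deps.items()
               for before in befores):
            count += 1
    return count

TASKS = ["fetch", "extract", "report"]  # hypothetical task names

# Fully sequential script: every task depends on the previous one.
chain = count_orderings(TASKS, {"extract": {"fetch"}, "report": {"extract"}})
# Only the essential dependency kept: both later tasks need "fetch".
fork = count_orderings(TASKS, {"extract": {"fetch"}, "report": {"fetch"}})
# No dependencies at all.
free = count_orderings(TASKS, {})

print(chain, fork, free)
```

Each dependency that is dropped enlarges the set of legal orderings, which is the room a parallel runtime has to work with.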
Parallel Recipes: precipes
complex glue (reuse scripting) x complex coordination (ordering dependencies)
Intel Concurrent Collections inside, as Coordination Language
CnC offers:
• cluster-level and node-level parallelism
• determinate execution
• flexible parallel execution model
• stable & practical implementation (CnC++)
parallel hello world recipe:
stage B
    command: $ echo 'B'
    out: B_done
stage Bbis
    command: $ echo 'another thing for B'
    in: B_done
    out: B_or_C_done
stage finish
    command: $ echo 'finished'
    in: { A_done, B_or_C_done }
[Diagram: stages A, B, Bbis, C, finish; items B_finished, A_done, B_or_C_done]
command: what needs to happen when I start?
in: what dependencies need to be satisfied before I can start?
out: what dependencies are satisfied after I finished successfully?
parallel hello world recipe bis:
practical consideration: parallel scripts rarely run only once
[stage] fetch dosier
    command: $ wget ftp://citizenfiles.gov/dosiers/yves.txt .
[item] dosier
[stage] extract gross income
    command: $ grep 'gross' yves.txt > yves_gross.txt
[item] income
[stage] report income
    command: $ echo -n citizen yves is making; cat yves_gross.txt ; echo a year.
parallel hello world recipe bis:
practical considerations:
• parallel scripts rarely run only once
• parallel scripts typically run data-parallel
tags: tom, roel, yves
[stage] fetch dosier
    command: $ wget ftp://citizenfiles.gov/dosiers/{}.txt .
[item] dosier
[stage] extract gross income
    command: $ grep 'gross' {}.txt > {}_gross.txt
[item] income
[stage] report income
    command: $ echo -n citizen {} is making ; cat {}_gross.txt ; echo a year.
parallel hello world recipe bis:
out of the box: data-parallel runs
[Diagram: three independent instances of the recipe, one per tag (yves, tom, roel), each running fetch dosier -> dosier -> extract gross income -> income -> report income]
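The data-parallel runs rest on the {} placeholder in each command. How precipes substitutes it is not shown on the slides; the sketch below assumes plain textual replacement of every {} by the tag, and the helper names expand and expand_all are hypothetical.

```python
def expand(command, tag):
    """Replace every {} placeholder in a stage command with the tag.

    Assumption: a plain textual substitution; the real tool may differ.
    """
    return command.replace("{}", tag)

# Two of the stage commands from the recipe above.
COMMANDS = [
    "wget ftp://citizenfiles.gov/dosiers/{}.txt .",
    "grep 'gross' {}.txt > {}_gross.txt",
]

def expand_all(commands, tags):
    """Build one independent list of concrete commands per tag."""
    return {tag: [expand(command, tag) for command in commands] for tag in tags}

runs = expand_all(COMMANDS, ["tom", "roel", "yves"])
for tag, commands in runs.items():
    for command in commands:
        print(f"[{tag}] {command}")
```

Because each tag yields its own files, the three instances share no items and can run side by side.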
{
  "stages" : {
    "A" : {
      "command" : "echo A for {}.",
      "out" : "A_done"
    },
    "B" : {
      "command" : "echo B for {}.",
      "out" : "B_finished"
    },
    "Bbis" : {
      "command" : "echo One more thing for B and {}.",
      "in" : "B_finished",
      "out" : "B_or_C_done"
    },
    "C" : {
      "command" : "echo C for {}.",
      "out" : "B_or_C_done"
    },
    "finish" : {
      "command" : "echo Done with A and B for {}.",
      "in" : ["A_done", "B_or_C_done"]
    }
  }
}
[Diagram: stages A, B, Bbis, C, finish; items B_finished, A_done, B_or_C_done]
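To see what the in and out entries imply for execution, the recipe can be loaded and grouped into waves of stages that are ready together. This is a minimal sketch and not the precipes or CnC scheduler: the function names are hypothetical, and it assumes that a stage is ready once every item in its in list has been produced by some stage, so B_or_C_done counts as satisfied as soon as either Bbis or C has run.

```python
import json

RECIPE = json.loads("""
{ "stages" : {
    "A"      : { "command" : "echo A for {}.", "out" : "A_done" },
    "B"      : { "command" : "echo B for {}.", "out" : "B_finished" },
    "Bbis"   : { "command" : "echo One more thing for B and {}.",
                 "in" : "B_finished", "out" : "B_or_C_done" },
    "C"      : { "command" : "echo C for {}.", "out" : "B_or_C_done" },
    "finish" : { "command" : "echo Done with A and B for {}.",
                 "in" : ["A_done", "B_or_C_done"] }
} }
""")

def as_list(value):
    """Normalise an in/out entry: absent, a single name, or a list of names."""
    if value is None:
        return []
    return [value] if isinstance(value, str) else list(value)

def waves(stages):
    """Group stages into waves; every stage in a wave has all its inputs."""
    available, pending, result = set(), dict(stages), []
    while pending:
        ready = sorted(name for name, stage in pending.items()
                       if set(as_list(stage.get("in"))) <= available)
        if not ready:
            raise ValueError("unsatisfiable inputs for: " + ", ".join(sorted(pending)))
        for name in ready:
            available.update(as_list(pending.pop(name).get("out")))
        result.append(ready)
    return result

schedule = waves(RECIPE["stages"])
print(schedule)
```

Under that assumption finish lands in the same wave as Bbis, because C has already produced B_or_C_done in the first wave.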
JSON parallel recipe:
$ ./precipes -p bpp.dot exome_best_practices_pipeline.json
{
  "stages" : {
    "check_paired" : {
      "command" : "$CHECK_EXISTS $READS/{}_1.filt.fastq.gz",
      "out" : "has_paired_end_reads"
    },
    "fetch_unpaired" : {
      "command" : "$FETCH $READS/{}.filt.fastq.gz $LOCAL_DIR/{}.unpaired.fastq.gz",
      "out" : "unpaired.fastq.gz"
    },
    "fetch_paired_1" : {
      "command" : "$FETCH $READS/{}_1.filt.fastq.gz $LOCAL_DIR/{}.paired_1.fastq.gz",
      "in" : "has_paired_end_reads",
      "out" : "paired_1.fastq.gz"
    },
    "fetch_paired_2" : {
      "command" : "$FETCH $READS/{}_2.filt.fastq.gz $LOCAL_DIR/{}.paired_2.fastq.gz",
      "in" : "has_paired_end_reads",
      "out" : "paired_2.fastq.gz"
    },
    "alignment_paired" : {
      "command" : "\
        $BWA mem -R '@RG\\tID:Group1\\tLB:lib1\\tPL:illumina\\tSM:sample1' \
        -t $NUM_THREADS $REF/ucsc.hg19.fasta \
        $LOCAL_DIR/{}.paired_1.fastq.gz $LOCAL_DIR/{}.paired_2.fastq.gz \
        > $LOCAL_DIR/{}.paired.sam &&
        rm $LOCAL_DIR/{}.paired_1.fastq.gz $LOCAL_DIR/{}.paired_2.fastq.gz",
      "in" : ["paired_1.fastq.gz", "paired_2.fastq.gz"],
      "out" : "paired.sam"
    },
    …
[Figure: dependency graph of the exome best practices pipeline [1]. Node labels, top to bottom: check_paired, has_paired_end_reads, fetch_paired_1, fetch_paired_2, fetch_unpaired, paired_1.fastq.gz, paired_2.fastq.gz, unpaired.fastq.gz, alignment_paired, alignment_unpaired, paired.sam, unpaired.sam, check_no_unpaired, sort_for_coordinate_order_paired, sort_for_coordinate_order_unpaired, check_no_paired, no_unpaired_end_reads, sorted_paired.bam, sorted_unpaired.bam, no_paired_end_reads, merge_bams_paired, merge_bams_paired_unpaired, merge_bams_unpaired, sorted.bam, remove_duplicates, dedup.bam, build_bam_index_1, dedup.bai, realign_around_indels_1, intervals, realign_around_indels_2, 7.bam, build_bam_index_2, 7.bai, base_recalibrate_1, recal, base_recalibrate_2, 8.bam, 8.bai, call_variants, vcf]
[1] G. A. Van der Auwera, M. O. Carneiro, C. Hartl, et al., "From FastQ data to high-confidence variant calls: the Genome Analysis Toolkit best practices pipeline," Curr. Protoc. Bioinform., 11.10.1-11.10.33, October 2013.
Execution
bash$ ./precipes exome_best_practices_pipeline.json sample_{00..07}
[Diagram: the .json recipe is loaded by the ./precipes core, which runs on:]
• workstation
• cluster
• Amazon EC2
Execution
bash$ ./precipes exome_best_practices_pipeline.json sample_{00..07}
[Diagram: the ./precipes core turns the .json recipe into stage registrations:]
    add_stage( "fetch_paired_1", "$FETCH $READS/…", { "has_paired_end_reads" }, { "paired_1.fastq.gz" } );
    add_stage( "check_paired", "test -f …", { }, { "has_paired_end_reads" } );
    add_stage( … );
Execution
bash$ ./precipes exome_best_practices_pipeline.json sample_{00..07}
    // start running samples in parallel
    for( int i = 2; i < argc; ++i )
        pipeline.run( argv[i], i-2 );
    > pipeline.tags.put( "sample_00" )
    > pipeline.tags.put( "sample_01" )
    …
    pipeline.wait()
[Diagram: tags sample_00 … sample_07 entering the pipeline; item labels sai, sam, 1.bam]
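The driver loop above hands each sample name to the pipeline as a tag and then waits for all of them. A rough Python analogue is sketched below; it uses a thread pool in place of the CnC runtime, and run_pipeline is a hypothetical stand-in that only reports completion instead of executing stages. The tag list mirrors what bash produces for sample_{00..07}.

```python
from concurrent.futures import ThreadPoolExecutor

def run_pipeline(tag):
    """Hypothetical stand-in for running every stage of the recipe for one tag."""
    return f"{tag}: done"

def run_all(tags, workers=2):
    """Submit one pipeline instance per tag and wait for all of them.

    Leaving the `with` block waits for outstanding work, playing the
    role of pipeline.wait() in the loop above.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_pipeline, tags))

# bash expands sample_{00..07} into sample_00 sample_01 ... sample_07
tags = [f"sample_{i:02d}" for i in range(8)]
results = run_all(tags)
print(results)
```

The thread pool only approximates the idea of independent tagged instances; it says nothing about how CnC distributes work across cluster nodes.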
parallel scaling experiment: 32 samples from g1k NA12878
Exome Best Practices Scaling Experiment
[Plot: runtime (days) against # compute nodes, for 1 and 2 worker threads]
# compute nodes    1 worker thread    2 worker threads
1                  158h 7m            80h 31m
2                  79h 21m            41h 21m
4                  40h 21m            21h 38m
8                  21h 7m             12h 20m
Scaling Efficiency : single fat node (exome best practices, 32 samples)
[Plot: runtime (days) and efficiency (%) against # workers]
# workers    runtime     efficiency
1            336h 19m    100%
2            171h 12m    98.224%
4            87h 15m     96.369%
8            44h 7m      95.285%
16           25h 13m     83.361%
24           19h 22m     72.38%
32           15h 12m     69.132%
64           14h 55m     46.928%
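The slide does not say how efficiency was computed. The labels for 1 to 8 workers are consistent with the usual definition, efficiency = T1 / (n * Tn), so the sketch below assumes that formula and checks it against those four runtimes only. Runtimes are taken to the minute as printed, so the last digits differ slightly from the slide.

```python
def minutes(hours, mins):
    """Convert an 'Xh Ym' runtime label to minutes."""
    return hours * 60 + mins

def efficiency(t1, tn, n):
    """Parallel efficiency in percent, assuming the definition T1 / (n * Tn)."""
    return 100.0 * t1 / (n * tn)

# Runtimes for 1, 2, 4 and 8 workers on the single fat node.
RUNTIMES = {
    1: minutes(336, 19),
    2: minutes(171, 12),
    4: minutes(87, 15),
    8: minutes(44, 7),
}

table = {n: efficiency(RUNTIMES[1], t, n) for n, t in RUNTIMES.items()}
for n, value in table.items():
    print(f"{n:2d} workers: {value:6.2f}%")
```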
Scaling Efficiency : cluster
[Two plots: runtime (days) and efficiency (%) against # compute nodes]
Scaling Efficiency : 1 worker
# compute nodes    runtime    efficiency
1                  158h 7m    100.00%
2                  79h 21m    99.63%
4                  40h 21m    97.97%
8                  21h 7m     93.60%
Scaling Efficiency : 2 workers
# compute nodes    runtime    efficiency
1                  80h 31m    100.00%
2                  41h 21m    97.36%
4                  21h 38m    93.05%
8                  12h 20m    81.60%
execution trace: 32 samples, 4 nodes, 2 workers
[Figure: execution trace with rows labelled 0, 1, 2, 3]
Next!
Common Workflow Language [1] (CWL) integration
[Diagram: core running on:]
• workstation
• cluster
• amazon ec2
{
  …
  "run": {
    "inputs": [
      {
        "inputBinding": { "position": 1, "prefix": "--reverse" },
        "type": "boolean",
        "id": "#reverse"
      },
      {
        "inputBinding": { "position": 2 },
        "type": "File",
        "id": "#input"
      }
    ],
    …
    "class": "Workflow"
}
Shoutout to BOSC CodeFest2015!
[1] https://github.com/common-workflow-language/common-workflow-language