input input data
play

Input Input Data Next-generation sequencing data Single-end or - PowerPoint PPT Presentation

Input Input Data Next-generation sequencing data Single-end or paired-end Mix of viruses, host, bacteria and phage DNA and RNA Capture sequencing to enrich for viruses Aim of Vi VirMap Identify reads that map to known viral


  1. Input Input Data • Next-generation sequencing data • Single-end or paired-end • Mix of viruses, host, bacteria and phage DNA and RNA • Capture sequencing to enrich for viruses Aim of Vi VirMap • Identify reads that map to known viral sequences • De novo sequence assembly and recover putative novel virus sequences • Assign each viral sequence to species (e.g. identify the taxonomy)

  2. Instructions for Installation of Vi VirMap on Gadi • Lack of installation instructions (early 2019). Enlisted help from ResTech • Instructions from ResTech: • https://github.com/rsoftone/virmap • Download VirMap pipeline: https://github.com/cmmr/virmap.git Miniconda • Conda modules • (e.g. bbmap, diamond, megahit) • Perl • RocksDB – taxonomy lookup • Recommend not to install in default $TMPDIR (/scratch/projectID/userID/tmp). Please update $TMPDIR with some other location (e.g. export TMPDIR= /g/data/projectID/userID/tmp).

  3. Testing Vi VirMap Installation

  4. Pa Part 1: Taxonomy database generation • Run script to generate the taxaJson.dat file • The name is misleading - it is not JSON but actually Sereal- encoded, Zstd compressed Perl hash reference built up from NCBI taxonomy files. Code highlights • Download taxonomy data from NCBI • Construct edges to child-nodes from taxonomy data • Sereal encoding + Zstd compression of Perl hash-reference • https://github.com/rsoftone/virmap/tree/master/10- helper-scripts

  5. Pa Part 2: Accession -> > GI lookup • Downloads a table that converts NCBI accession to GI accession • NCBI accession: NM_000202.5 • GI accession: 262118207 FASTA format header: • blue = fields that are contained in the NCBI Genbank feature table. • yellow = not currently in Genbank’s feature table and need to be obtained via Accession -> GI lookup • Information obtained by correspondence with VirMap authors. Nucleotide: • >gi|2765944|gb|Z99337.1|.Human.immunodeficiency.virus.1.isolate.C49.DNA.for.reverse;taxId=11676 Protein: • >GI|GI:817036220|KR080198.1|AKF14503.1|GI:817036221|hypothetical.protein|Mycobacterium.phage.Ca mbiare;pos=46..537;codonStart=1;taxId=1647305 https://github.com/rsoftone/virmap/tree/master/10-helper-scripts

  6. Pa Part 3: Genbank division download • https://github.com/rsoftone/virmap/tree/master/referencedbs • Download aspera fast data transfer (ascp) utility • List files served by GenBank FTP server • Downloading required GenBank divisions • The GenBank database is divided into 18 divisions. For example: • PRI - primate sequences • VRT - other vertebrate sequences • PLN - plant, fungal, and algal sequences • BCT - bacterial sequences • VRL - viral sequences

  7. Ot Other Steps s for Setting g the Database se • Part 4: Nucleotide .fasta generation using Gadi PBS script • Part 5: Protein .fasta generation using Gadi PBS script • Part 6: Build BBmap virus database (reads mapping) • Part 7: Build Diamond virus database (protein sequence DB search) • Part 7.5: Build Diamond genbank database (gbBlastx) • Part 8: Build Kraken2 database – find lowest common ancestor (LCA) of all genomes containing the short sequence (k-mer), assign taxonomy to sequence • Part 9: Build blastn genbank database (gbBlastn)

  8. Con Configure virma rmap_con onfig.sh in the ## Under $TMPDIR/virmap/ virmap_config.sh ## File path, used as the value for the --gbBlastx parameter GB_BLASTX ="/g/data1a/projectID/VirMap/200625-gbblastx.dmnd" ## Partial file path, used as the value for the --gbBlastn parameter GB_BLASTN ="/g/data1a/projectID/VirMap/191205-gbBlastn/191205-gbBlastn" ## Folder path, used as the value for the --virBbmap parameter VIR_BBMAP ="/g/data1a/projectID/VirMap/191205-virBbmap" ## File path, used as the value for the --virDmnd parameter VIR_DMND ="/g/data1a/projectID/VirMap/200625-virdiamond.dmnd" ## File path, used as the value for the --taxaJson parameter TAXA_JSON ="/g/data1a/projectID/VirMap/200117-taxaJson.dat"

  9. Run Vi VirMap Wrapper Script (part 1) # Activate Conda Environment INSTALL_DIR=$TMPDIR/virmap MINICONDA_DIR=$TMPDIR/miniconda3 export TMPDIR= /g/data/projectID/userID/tmp mkdir -p/path/to/store/output cd /path/to/store/output cp $INSTALL_DIR/activate.sh /path/to/store/output source activate.sh

  10. Run Vi VirMap Wrapper Script (part 2) qsub -P projectID -q normal -l walltime=6:00:00,\ mem=48G,ncpus=12,wd,jobfs=100GB,storage=scratch/projectID +gdata/projectID -joe -- \ $INSTALL_DIR/virmap_wrapper.sh \ --readUnpaired "/path/to/my/sample/.fastq" \ --sampleName "sample_name"

  11. Vi VirMap Output • Final Output file (e.g. nCoV_228_S83_L001_R1_001.final.fa) • Fasta file contained list of all assembled virus genomes and taxonomy information, if available. • Example fasta header: >gi|1369125429|gb|MG772934.1|Bat.SARS-like.coronavirus.isolate.bat-SL- CoVZXC21,.complete.genome.,...;taxId=1508227;length=29566;size=53956; maxScore=9709;maxAlign=9964;maxPossibleScore=15495;protScore=9709; maxProtAlign=9964;nuclScore=0;maxNuclAlign=0;letterOccupancy=71.54;hi ghDivergence=1

  12. Ag Aggregate Results • Give the script a list of results directories and it will aggregate data from running many samples into one Excel file • Whether the run is successful and returned usable results • For each sample • List of viruses, listed by NCBI taxonomy ID • The taxonomy classification hierarchy • De-duplicate read counts https://github.com/rsoftone/virmap/tree/master/aggreg_stats

  13. Ag Aggregate number of errors for each sample

  14. Wa Walltimes for running each step for individual sa sample

  15. Duration for Each Step 12 cores and 48 Gb Ram 24 cores and 112 Gb RAM

  16. Ru Run Statistics • 94 samples from 93 individuals • Two of the 94 samples failed our first attempt at running VirMap • 8 samples failed de-duplication but still produced results • The total amount of SU used by the 92 samples is 5,894 SU • median of 62.5 SU ±2.5 SU standard error (se) per sample • Walltime: • 1:37 ±4 min se. for 12 cores and 48 Gb RAM, and • 1:22 ± 2.3 min. se for 24 cores and 112 Gb RAM

  17. Ge Get t de-dupl duplicated ed rea ead d coun unts • In a table format. Columns include: • NCBI tax id (e.g. 1508227) • Size = number of deduplicated reads (e.g. 408) • Weak or strong match, whether the sequence can be assigned to a taxa, potential for mis-annotation etc… • Taxonomy lineage in text. For example: • Viruses -> Riboviria -> Nidovirales -> Cornidovirineae -> Coronaviridae -> Orthocoronavirinae -> Betacoronavirus -> Sarbecovirus -> Severe acute respiratory syndrome-related coronavirus -> Bat SARS-like coronavirus

  18. Hea Heat ma map of 94 94 sam sample les s ve versus vir viruse ses s id identifie ified – pr prelim. . re results Coronaviruses

  19. Summary Results • 63/92 (68%) samples had coronavirus of some sort (2019 reference database) as detected by VirMap • 81/92 (88%) samples had other viruses that are not coronaviruses or phage. • 55/92 (59.8%) samples had co-infection of Coronavirus and other viruses. • Endogenous retrovirus K, found, in 89/92 samples, not Coronaviruses shown in heat map

  20. Re Re-map Reads to Vi Viral Sequences to get Read Cou Counts • Get read counts for each virus • Remove all reads that mapped to • The human genome • Bacterial Genbank division genomes • Bacteriophage Genbank division genomes • Map all remaining reads to the *.final.fa virus genomes assembled from the sample • Remove multi-mapped reads • Get read counts • Usage: ./get_read_counts.sh -g human_genome_bwa -p phage_bwa -b bacteria_bwa virmap_output.final.fa virmap_input_sequence.fq • https://github.com/rsoftone/virmap/blob/master/get_read_counts.sh

  21. Av Availability on Other Platforms • Katana • Duncan has attempted to install VirMap • Differences in CentOS version (Katana v6, Gadi v7) • Conda dependencies in Katana cannot be resolved • Microsoft Azure • VirMap will install in an Ubuntu instance • Need to transfer data into blob storage and transfer output back into blob storage • Around $300 test Credits for UNSW staff and students • Need to submit request for higher capacity machine, contact support. • Remember to switch off the instance when you don’t need it • Amazon • Original VirMap Github currently has instructions on how to install pipeline and reference DB in Amazon. (Was not available before in 2019.)

  22. Futur Future e Works • Comparison to other tools (e.g. IDSeq and Genome Detective) • Analyze more samples wit VirMap • Update sequence database to 2020 version on Gadi (for COVID19 analysis) • Possible developments: • Option to replace for BlastN with a faster tool MMseqs2 (https://github.com/soedinglab/MMseqs2) • Re-write parts of VirMap in a workflow scripting language (e.g. Snakemake or Nextflow)

Recommend


More recommend