[PPT] - Input Input Data Next-generation sequencing data Single-end or PowerPoint Presentation

SLIDE 1

Input Input Data

Next-generation sequencing data
Single-end or paired-end
Mix of viruses, host, bacteria and phage DNA and RNA
Capture sequencing to enrich for viruses
Identify reads that map to known viral sequences
De novo sequence assembly and recover putative novel virus

sequences

Assign each viral sequence to species (e.g. identify the taxonomy)

Aim of Vi VirMap

SLIDE 2

Instructions for Installation of Vi VirMap on Gadi

Lack of installation instructions (early 2019). Enlisted help from ResTech
Instructions from ResTech:
https://github.com/rsoftone/virmap
Download VirMap pipeline: https://github.com/cmmr/virmap.git

Miniconda

Conda modules
(e.g. bbmap, diamond, megahit)
Perl
RocksDB – taxonomy lookup
Recommend not to install in default $TMPDIR

(/scratch/projectID/userID/tmp). Please update $TMPDIR with some other location (e.g. export TMPDIR= /g/data/projectID/userID/tmp).

SLIDE 3

Testing Vi VirMap Installation

SLIDE 4

Pa Part 1: Taxonomy database generation

Run script to generate the taxaJson.dat file
The name is misleading - it is not JSON but actually Sereal-

encoded, Zstd compressed Perl hash reference built up from NCBI taxonomy files. Code highlights

Download taxonomy data from NCBI
Construct edges to child-nodes from taxonomy data
Sereal encoding + Zstd compression of Perl hash-reference
https://github.com/rsoftone/virmap/tree/master/10-

helper-scripts

SLIDE 5

Pa Part 2: Accession -> > GI lookup

Downloads a table that converts NCBI accession to GI accession
NCBI accession: NM_000202.5
GI accession: 262118207

FASTA format header:

blue = fields that are contained in the NCBI Genbank feature table.
yellow = not currently in Genbank’s feature table and need to be obtained via Accession -> GI lookup
Information obtained by correspondence with VirMap authors.

Nucleotide:

>gi|2765944|gb|Z99337.1|.Human.immunodeficiency.virus.1.isolate.C49.DNA.for.reverse;taxId=11676

Protein:

>GI|GI:817036220|KR080198.1|AKF14503.1|GI:817036221|hypothetical.protein|Mycobacterium.phage.Ca

mbiare;pos=46..537;codonStart=1;taxId=1647305

https://github.com/rsoftone/virmap/tree/master/10-helper-scripts

SLIDE 6

Pa Part 3: Genbank division download

https://github.com/rsoftone/virmap/tree/master/referencedbs
Download aspera fast data transfer (ascp) utility
List files served by GenBank FTP server
Downloading required GenBank divisions
The GenBank database is divided into 18 divisions. For example:
PRI - primate sequences
VRT - other vertebrate sequences
PLN - plant, fungal, and algal sequences
BCT - bacterial sequences
VRL - viral sequences

SLIDE 7

Ot Other Steps s for Setting g the Database se

Part 4: Nucleotide .fasta generation using Gadi PBS script
Part 5: Protein .fasta generation using Gadi PBS script
Part 6: Build BBmap virus database (reads mapping)
Part 7: Build Diamond virus database (protein sequence DB search)
Part 7.5: Build Diamond genbank database (gbBlastx)
Part 8: Build Kraken2 database – find lowest common ancestor (LCA)
f all genomes containing the short sequence (k-mer), assign

taxonomy to sequence

Part 9: Build blastn genbank database (gbBlastn)

SLIDE 8

Con Configure virma rmap_con

nfig.sh in the

## Under $TMPDIR/virmap/ virmap_config.sh ## File path, used as the value for the --gbBlastx parameter GB_BLASTX="/g/data1a/projectID/VirMap/200625-gbblastx.dmnd" ## Partial file path, used as the value for the --gbBlastn parameter GB_BLASTN="/g/data1a/projectID/VirMap/191205-gbBlastn/191205-gbBlastn" ## Folder path, used as the value for the --virBbmap parameter VIR_BBMAP="/g/data1a/projectID/VirMap/191205-virBbmap" ## File path, used as the value for the --virDmnd parameter VIR_DMND="/g/data1a/projectID/VirMap/200625-virdiamond.dmnd" ## File path, used as the value for the --taxaJson parameter TAXA_JSON="/g/data1a/projectID/VirMap/200117-taxaJson.dat"

SLIDE 9

Run Vi VirMap Wrapper Script (part 1)

# Activate Conda Environment INSTALL_DIR=$TMPDIR/virmap MINICONDA_DIR=$TMPDIR/miniconda3 export TMPDIR= /g/data/projectID/userID/tmp mkdir -p/path/to/store/output cd /path/to/store/output cp $INSTALL_DIR/activate.sh /path/to/store/output source activate.sh

SLIDE 10

Run Vi VirMap Wrapper Script (part 2)

qsub -P projectID -q normal -l walltime=6:00:00,\ mem=48G,ncpus=12,wd,jobfs=100GB,storage=scratch/projectID +gdata/projectID -joe -- \ $INSTALL_DIR/virmap_wrapper.sh \

-readUnpaired "/path/to/my/sample/.fastq" \
-sampleName "sample_name"

SLIDE 11

Vi VirMap Output

Final Output file (e.g. nCoV_228_S83_L001_R1_001.final.fa)
Fasta file contained list of all assembled virus genomes and taxonomy

information, if available.

Example fasta header:

>gi|1369125429|gb|MG772934.1|Bat.SARS-like.coronavirus.isolate.bat-SL- CoVZXC21,.complete.genome.,...;taxId=1508227;length=29566;size=53956; maxScore=9709;maxAlign=9964;maxPossibleScore=15495;protScore=9709; maxProtAlign=9964;nuclScore=0;maxNuclAlign=0;letterOccupancy=71.54;hi ghDivergence=1

SLIDE 12

Ag Aggregate Results

Give the script a list of results directories and it will

aggregate data from running many samples into one Excel file

Whether the run is successful and returned usable results
For each sample
List of viruses, listed by NCBI taxonomy ID
The taxonomy classification hierarchy
De-duplicate read counts

https://github.com/rsoftone/virmap/tree/master/aggreg_stats

SLIDE 13

Ag Aggregate number of errors for each sample

SLIDE 14

Wa Walltimes for running each step for individual sa sample

SLIDE 15

Duration for Each Step

12 cores and 48 Gb Ram 24 cores and 112 Gb RAM

SLIDE 16

Ru Run Statistics

94 samples from 93 individuals
Two of the 94 samples failed our first attempt at running VirMap
8 samples failed de-duplication but still produced results
The total amount of SU used by the 92 samples is 5,894 SU
median of 62.5 SU ±2.5 SU standard error (se) per sample
Walltime:
1:37 ±4 min se. for 12 cores and 48 Gb RAM, and
1:22 ± 2.3 min. se for 24 cores and 112 Gb RAM

SLIDE 17

Ge Get t de-dupl duplicated ed rea ead d coun unts

In a table format. Columns include:
NCBI tax id (e.g. 1508227)
Size = number of deduplicated reads (e.g. 408)
Weak or strong match, whether the sequence can be assigned to a

taxa, potential for mis-annotation etc…

Taxonomy lineage in text. For example:
Viruses -> Riboviria -> Nidovirales -> Cornidovirineae -> Coronaviridae
> Orthocoronavirinae -> Betacoronavirus -> Sarbecovirus -> Severe

acute respiratory syndrome-related coronavirus -> Bat SARS-like coronavirus

SLIDE 18

Hea Heat ma map of 94 94 sam sample les s ve versus vir viruse ses s id identifie ified – pr prelim. . re results

Coronaviruses

SLIDE 19

Summary Results

63/92 (68%) samples had

coronavirus of some sort (2019 reference database) as detected by VirMap

81/92 (88%) samples had
ther viruses that are not

coronaviruses or phage.

55/92 (59.8%) samples had

co-infection of Coronavirus and other viruses.

Endogenous retrovirus K,

found, in 89/92 samples, not shown in heat map

Coronaviruses

SLIDE 20

Re Re-map Reads to Vi Viral Sequences to get Read Cou Counts

Get read counts for each virus
Remove all reads that mapped to
The human genome
Bacterial Genbank division genomes
Bacteriophage Genbank division genomes
Map all remaining reads to the *.final.fa virus genomes assembled from the sample
Remove multi-mapped reads
Get read counts
Usage: ./get_read_counts.sh -g human_genome_bwa -p phage_bwa -b bacteria_bwa

virmap_output.final.fa virmap_input_sequence.fq

https://github.com/rsoftone/virmap/blob/master/get_read_counts.sh

SLIDE 21

Av Availability on Other Platforms

Katana
Duncan has attempted to install VirMap
Differences in CentOS version (Katana v6, Gadi v7)
Conda dependencies in Katana cannot be resolved
Microsoft Azure
VirMap will install in an Ubuntu instance
Need to transfer data into blob storage and transfer output back into blob storage
Around $300 test Credits for UNSW staff and students
Need to submit request for higher capacity machine, contact support.
Remember to switch off the instance when you don’t need it
Amazon
Original VirMap Github currently has instructions on how to install pipeline and

reference DB in Amazon. (Was not available before in 2019.)

SLIDE 22

Futur Future e Works

Comparison to other tools (e.g. IDSeq and Genome Detective)
Analyze more samples wit VirMap
Update sequence database to 2020 version on Gadi (for COVID19

analysis)

Possible developments:
Option to replace for BlastN with a faster tool MMseqs2

(https://github.com/soedinglab/MMseqs2)

Re-write parts of VirMap in a workflow scripting language (e.g.

Snakemake or Nextflow)

SLIDE 23

Acknowledgements

Vi

Virology Research h Labo boratory:

Prof. Maria Craig.
Prof. William Rawlinson
Sonia Isaacs (PhD)
Emily Gibson & Cynthia Lau (RA)
Clare Faulkner, Dylan Foskett, Jessica Lau (Students)
Coordinators: Jacki, Deb, Susan, Kelly, Jess
UN

UNSW:

Prof. Marc Wilkins
Dr. Ignatius Pang
A/Prof. Rowena Bull
A/Prof. Fabio Luciani
UNSW ResTech (Simon, David, Martin, Duncan, Robin)
Center for Infection & Immunity:
Prof. Ian Lipkin
A/Prof. Thomas Briese
EN

ENDIA Study:

Prof. Jennifer Couper (CI)
All investigators
Study coordinators
Study team
Microsoft (Andreas Wilm & Auda Eltahla)
Mo

Mothers rs & Childre ren support rting ENDIA

National Collaborative Research Infrastructure Strategy

(NCRIS) and Bioplatforms Australia

Serology & UNSW ResTech

SLIDE 24

Ta Taxonomy data from NCBI

Notes from Simon. Downloading and unzipping should produce the files: names.dmp nodes.dmp You'll need the following fields from each file, based on the VirMap krakenFilter.pl, which imports the following columns ($parents, $names, $ranks, $children) nodes.dmp tax_id

- node id in GenBank taxonomy database. The ‘children’ needs to be derived from reverse of ‘parents’, so some code is

needed to perform this. parent tax_id

- parent node id in GenBank taxonomy database (I think this corresponds to the 'parents' field)

rank

- rank of this node (superkingdom, kingdom, ... (I think this corresponds to the 'ranks' field)

names.dmp tax_id

- the id of node associated with this name (use this field to merge with the nodes.dmp file)

name_txt

- name itself (I think this corresponds to the ‘names’ field).