EMPOWERING VIRUS SEQUENCE RESEARCH THROUGH CONCEPTUAL MODELING


  1. EMPOWERING VIRUS SEQUENCE RESEARCH THROUGH CONCEPTUAL MODELING
     ANNA BERNASCONI, ARIF CANAKOGLU, PIETRO PINOLI, STEFANO CERI
     DEIB, POLITECNICO DI MILANO
     ER 2020 – ONLINE EVENT

  2. WHAT NEEDS ARE WE RESPONDING TO?
     - UNPRECEDENTED ATTENTION TOWARDS THE GENETIC MECHANISMS OF VIRUSES (caused by the pandemic outbreak of the coronavirus disease COVID-19)
     - LACK OF PREPARATION OF THE RESEARCH COMMUNITY TO FACE PANDEMIC CRISES (e.g., lack of well-organized databases and search systems)
     - NEED FOR FACILITATING CURRENT AND FUTURE RESEARCH STUDIES (we provide a novel conceptual model, repository and search system collecting virus sequences and their properties)

  3. OUR BACKGROUND
     Genomic Conceptual Model, organized in four views: Biological view, Extraction view, Management view, Technology view.
     Entities in the diagram: Dataset, Item, Replicate, BioSample, Donor, Case, Project, ExperimentType.
     Attributes in the diagram: DatasetId, ReplicateId, ItemId, BioSampleId, SourceId, Name, Type, Assembly, Tissue, IsAnn, CellLine, DataType, Annotation, BioReplicateNum, IsHealthy, Format, Size, Disease, TechReplicateNum, ExpTypeId, Pipeline, SourceUrl, LocalUri, DonorId, CaseId, Technique, Species, Feature, SourceSite, Platform, Age, Target, ProjectId, Gender, Antibody, Ethnicity, ProjectName, ProgramName.
     GenoSurf interface: http://gmql.eu/genosurf/
     Bernasconi et al. «Conceptual Modeling for Genomics: Building an Integrated Repository of Open Data». ER 2017. https://doi.org/10.1007/978-3-319-69904-2_26
     Canakoglu et al. «GenoSurf: metadata driven semantic search system for integrated genomic datasets». Database, Volume 2019, 2019, baz132, https://doi.org/10.1093/database/baz132

  4. BACKGROUND ANALYSIS: VIRUS RESOURCES SCENARIO
     (1) Major Database Institutions
     (2) Primary Sequence Deposition Databases
     (3) Direct Retrieval Tools
     (4) Secondary Virus Databases/Interfaces
     (5) Portals to NCBI and GISAID Resources
     (6) Integrative Search Systems
     (7) Satellite Resources (linked to sequences)

  5. BACKGROUND ANALYSIS: AVAILABLE METADATA in SARS-CoV2 search engines

  6. BACKGROUND ANALYSIS: REQUIREMENTS COLLECTION
     Extensive interviews with groups of virologists of various specializations:
     Ilaria Capua - One Health Center of Excellence (University of Florida, US)
     Matteo Chiara - Università degli Studi di Milano Statale (IT)
     Ana Conesa - University of Florida (US)
     Luca Ferretti - Oxford Big Data Institute (UK)
     Alice Fusaro - Istituto Zooprofilattico Sperimentale delle Venezie (IT)
     Ruba Al Khalaf - Politecnico di Milano (IT)
     Susanna Lamers - BioInfoExperts (Louisiana, US)
     Stefania Leopardi - Istituto Zooprofilattico Sperimentale delle Venezie (IT)
     Alessio Lorusso - Istituto Zooprofilattico Sperimentale Abruzzo Molise (IT)
     Francesca Mari - Università di Siena (IT)
     Carla Mavian - Department of Pathology, College of Medicine (University of Florida, US)
     Graziano Pesole - Università di Bari (IT)
     Alessandra Renieri - Università di Siena (IT)
     Anna Sandionigi - Università degli Studi di Milano-Bicocca (IT)
     Stephen Tsui - The Chinese University of Hong Kong (HK)
     Limsoon Wong - National University of Singapore (SGP)
     Federico Zambelli - Università degli Studi di Milano Statale (IT)
     Each researcher provided us with a viewpoint on applications of virology that serve as requirements for progressively adding relevant features to our database, as well as relevant search services to comply with their needs:
     - Diagnosis
     - Vaccine development
     - Drug-resistance and drug-resistance associated mutations

  7. PROPOSED CONCEPTUAL MODEL
     The Viral Conceptual Model (VCM), centered on the virus sequence described from four perspectives:
     - biological perspective (virus species and host environment)
     - technological perspective (sequencing technology)
     - organizational perspective (project responsible for producing the sequence)
     - analytical perspective (properties of the sequence, such as known annotations and variants)
     The schema is general and applies to any virus.
     Entities in the diagram: Virus, HostSample, Sequence, ExperimentType, SequencingProject, Annotation, NucleotideVariant, AminoacidVariant, Impact.
     Attributes in the diagram: Family, SubFamily, Genus, SpeciesName, SpeciesTaxonID, GenBankAcronym, EquivalentList, MoleculeType, IsSingleStranded, IsPositiveStranded, Species, CollectionDate, IsolationSource, OriginatingLab, Country, Region, GeoGroup, Gender, Age, SequencingTechnology, AssemblyMethod, Coverage, Authors, Title, Journal, PublicationDate, PubMedID, SequencingLab, SubmissionDate, PopSet [NCBI ID], BioProjectID [PRJNA], DatabaseSource [RefSeq, GenBank, GISAID], AccessionID [GenBank/RefSeq/EPI_ISL], StrainName [SARS-CoV-2/Hu/DP/Kng/19-020], IsReference, IsComplete, NucleotideSequence, Strand, Length, GC%, N%, Start, Stop, FeatureType [gene, CDS, stem_loop, 3’UTR], GeneName [E, S, M, ORF6], Product [leader protein, nsp2], ExternalReference, AminoacidSequence, AltSequence, Type [INS, DEL].

  8. EXAMPLE QUERY
     Application to the SARS-CoV2 virus: complex conceptual queries on the VCM can replicate the search results of recent articles, demonstrating strong potential for supporting research on viruses.
     Query: extract SARS-CoV2 sequences from samples of US patients that present nucleotide variants in genes that code for open reading frames.
     Query diagram: Virus (SARS-CoV2), HostSample (USA), Annotation (Gene in [ORF1ab, ORF3a, ORF6, …]), Variant (Type in [INS, DEL, SNP…]), Sequence.
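One way to read the query on this slide is as a filter over sequences joined to their host sample, annotations and variants. The following is a minimal in-memory sketch under stated assumptions: the class and function names, the `TOY-*` accession IDs and the gene coordinates are illustrative and do not come from the presentation, and the actual system's schema and query interface may differ.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Annotation:
    gene_name: str
    start: int  # first position covered by the annotation (assumed 1-based)
    stop: int   # last position covered by the annotation (assumed inclusive)


@dataclass(frozen=True)
class Variant:
    start: int
    alt_sequence: str
    type: str  # e.g. "SNP", "INS", "DEL"


@dataclass
class Sequence:
    accession_id: str
    species_name: str  # taken from the Virus entity
    country: str       # taken from the HostSample entity
    annotations: list = field(default_factory=list)
    variants: list = field(default_factory=list)


# Gene names listed on the slide; the set is an assumption, not exhaustive.
ORF_GENES = {"ORF1ab", "ORF3a", "ORF6"}


def has_variant_in_genes(seq, genes):
    """True if at least one variant starts inside an annotation of a wanted gene."""
    return any(
        a.gene_name in genes and a.start <= v.start <= a.stop
        for a in seq.annotations
        for v in seq.variants
    )


def select_accessions(sequences, species_name, country, genes):
    """Accession IDs of sequences matching virus, host country and gene variants."""
    return [
        s.accession_id
        for s in sequences
        if s.species_name == species_name
        and s.country == country
        and has_variant_in_genes(s, genes)
    ]


# Hypothetical toy data; coordinates are illustrative only.
orf1ab = Annotation("ORF1ab", 266, 21555)
toy = [
    Sequence("TOY-1", "SARS-CoV2", "USA", [orf1ab], [Variant(8782, "T", "SNP")]),
    Sequence("TOY-2", "SARS-CoV2", "USA", [orf1ab], []),
    Sequence("TOY-3", "SARS-CoV2", "Italy", [orf1ab], [Variant(8782, "T", "SNP")]),
]

print(select_accessions(toy, "SARS-CoV2", "USA", ORF_GENES))  # ['TOY-1']
```

Only `TOY-1` is returned: `TOY-2` has no variants and `TOY-3` comes from a different country.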

  9. EXAMPLE QUERY
     Select sequences from European patients affected by a SARS-CoV2 virus, only if they do not have a specific variant on the first gene (ORF1ab), selected by using the triple <position, alternative_sequence, type> (e.g., 8,782 SNP from C to T).
     Query diagram: Virus (SARS-CoV2), HostSample (Europe), Annotation (Gene = ORF1ab), Variant (Start = 8782, AltSequence = T, Length = 1, Type = SNP), Sequence.
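The exclusion condition on this slide can be sketched as a negative membership test on the triple <position, alternative_sequence, type>. This is an illustrative sketch only: the record layout, the function name and the `TOY-*` accession IDs are assumptions, and for brevity it does not check that the position falls inside the ORF1ab annotation.

```python
from typing import NamedTuple


class VariantKey(NamedTuple):
    """The <position, alternative_sequence, type> triple used on the slide."""
    position: int
    alternative_sequence: str
    type: str


# The example variant from the slide: SNP at position 8,782, C mutated to T.
EXCLUDED = VariantKey(8782, "T", "SNP")


def sequences_without(records, geo_group, excluded):
    """Accession IDs from the given geographic group that lack the variant."""
    return sorted(
        accession
        for accession, rec in records.items()
        if rec["geo_group"] == geo_group and excluded not in rec["variants"]
    )


# Hypothetical toy records keyed by accession ID.
toy_records = {
    "TOY-A": {"geo_group": "Europe", "variants": {VariantKey(8782, "T", "SNP")}},
    "TOY-B": {"geo_group": "Europe", "variants": {VariantKey(28144, "C", "SNP")}},
    "TOY-C": {"geo_group": "Asia", "variants": set()},
}

print(sequences_without(toy_records, "Europe", EXCLUDED))  # ['TOY-B']
```

`TOY-A` is dropped because it carries the excluded variant, and `TOY-C` because it is not from Europe.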

  10. EXAMPLE QUERY
     In Gudbjartsson et al. (2020), specific sequence variants are used to define clades/haplogroups (e.g., the A group is characterized by the 20,229 and 13,064 nucleotides, originally C mutated to T, by the 18,483 nucleotide T mutated to C, and by the 8,017, from A to G).
     Select sequences with all four variants corresponding to the A clade group defined in Gudbjartsson et al. (2020).
     Query diagram: four Variant selections (Start = 20229, AltSequence = T; Start = 13064, AltSequence = T; Start = 18483, AltSequence = C; Start = 8017, AltSequence = G; each with Length = 1, Type = SNP), intersected on the AccessionID of Sequence.
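The "all four variants" condition amounts to intersecting, on AccessionID, the sets of sequences that carry each variant. The sketch below assumes a hypothetical index from variant triples to accession IDs; the index layout, function name and `TOY-*` IDs are illustrative, not part of the presentation.

```python
# The four clade-A variants from the slide as (position, alt, type) triples.
CLADE_A = [
    (20229, "T", "SNP"),
    (13064, "T", "SNP"),
    (18483, "C", "SNP"),
    (8017, "G", "SNP"),
]


def accessions_with_all(variant_index, variants):
    """Accession IDs that carry every variant in `variants`.

    `variant_index` maps a variant triple to the set of accession IDs
    that contain it; a variant missing from the index matches nothing.
    """
    id_sets = [variant_index.get(v, set()) for v in variants]
    if not id_sets:
        return set()
    return set.intersection(*id_sets)


# Hypothetical toy index.
toy_index = {
    (20229, "T", "SNP"): {"TOY-1", "TOY-2"},
    (13064, "T", "SNP"): {"TOY-1", "TOY-2", "TOY-3"},
    (18483, "C", "SNP"): {"TOY-1", "TOY-3"},
    (8017, "G", "SNP"): {"TOY-1"},
}

print(accessions_with_all(toy_index, CLADE_A))  # {'TOY-1'}
```

The same pattern applies to any clade definition expressed as a list of variants that must all be present.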

  11. EXAMPLE QUERY
     According to Corman et al. (2020), E and RdRp genes are highly mutated and thus crucial in diagnosing COVID-19 disease; first-line screening tools of 2019-nCoV should perform an E gene assay, followed by confirmatory testing with the RdRp gene assay.
     Retrieve all sequences with mutations within genes E and RdRp of humans affected in China.
     Query diagram: Annotation (mature peptide, ORF1ab, RNA-dependent RNA polymerase) with Variant (Type in [INS, DEL, SNP…]), intersected with Annotation (Gene = E) with Variant (Type in [INS, DEL, SNP…]), together with Virus (SARS-CoV2), HostSample and Sequence.

  12. EXAMPLE QUERY
     Tang et al. (2020) claim that there are two clearly definable “major types” (S and L) of SARS-CoV2 in this outbreak, that can be differentiated by transmission rates. S and L types can be distinguished by two SNPs at positions 8,782 (within the ORF1ab gene from C to T) and 28,144 (within ORF8 from T to C).
     Retrieve all sequences with these two SNPs.
     Query diagram: two Variant selections (Start = 8782, AltSequence = T; Start = 28144, AltSequence = C; each with Length = 1, Type = SNP), intersected on the AccessionID of Sequence.

  13. EXAMPLE QUERY
     Morais Junior et al. (2020) propose a subdivision of the global SARS-CoV2 population into sixteen subtypes, defined using “widely shared polymorphisms” identified in nonstructural (nsp3, nsp4, nsp6, nsp12, nsp13 and nsp14) cistrons, structural (spike and nucleocapsid), and accessory (ORF8) genes.
     Extract sequences from subtype I.
     Query diagram: Annotation (Gene = S) with Variant (Start = 1841, relative; AltSequence = A; Type = SNP), intersected with Annotation (ORF1ab, nsp3) with Variant (Start = 318, relative; AltSequence = C; Type = SNP), on Sequence. …
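The query diagram on this slide marks the variant positions as "relative", which suggests they are counted from the start of the enclosing annotation instead of the start of the genome. The slide does not say whether these offsets are in nucleotides or amino acids, nor whether they are 1-based. The sketch below assumes 1-based offsets in the same unit as the annotation start; the function names and the annotation start value are hypothetical.

```python
def to_absolute(annotation_start, relative_position):
    """Convert a 1-based offset within an annotation to a sequence position.

    Assumption: `annotation_start` and `relative_position` use the same
    unit and are both 1-based, so offset 1 is the annotation's first position.
    """
    if relative_position < 1:
        raise ValueError("relative_position must be >= 1")
    return annotation_start + relative_position - 1


def has_relative_variant(variants, annotation_start, relative_position, alt):
    """True if a (position, alt) pair matches the relative position and allele."""
    target = to_absolute(annotation_start, relative_position)
    return any(pos == target and a == alt for pos, a in variants)


# Hypothetical annotation start (not the real coordinate of any gene).
TOY_START = 1000
toy_variants = [(1317, "C"), (2500, "A")]

print(to_absolute(TOY_START, 318))                            # 1317
print(has_relative_variant(toy_variants, TOY_START, 318, "C"))  # True
```

A subtype defined by several such variants could then be selected by requiring that every relative variant is present in the same sequence.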
