HIV tropism assessment using next generation sequencing
Mattia CF Prosperi
National Institute for Infectious Diseases “Lazzaro Spallanzani” (INMI), Dept. Virology
Via Portuense, 292 – 00149 – Rome, Italy
e-mail: ahnven@yahoo.it
Summary
• Next-generation (aka ultra-deep) sequencing (NGS)
• Technologies, features
• Low-level tools to analyse NGS data
• Sequence alignment
• Error correction
• High-level tool for clinical purposes
• Ultra-deep prediction of HIV-1 coreceptor usage
• Statistical learning model
• Web server
Next generation sequencing
• Technologies
  – 454, Illumina, ABI SOLiD, Polonator, Helicos
• Fields of application
  – De novo sequencing
  – Re-sequencing
  – Metagenomics
Next-generation sequencing data
• 454 GS FLX, Roche
  – A sequence read is ~400 bases long (with Titanium upgrade)
  – 400-600 million bases per 10-hour run
  – Higher error rate than Sanger sequencing
    • Approximately 0.1% and 0.05% for homopolymeric and non-homopolymeric regions, respectively (estimated on an HIV plasmid clone)
  – Possible presence of contaminants
• Other technologies: Illumina, ABI SOLiD, Helicos…
  – Shorter reads, higher base throughput
Web server
• Easy user interface
  – Parallelization of read alignment and error correction
    • Computational burden reduced from hours to minutes
  – Online tools for NGS-aided diagnostics:
    • HIV-1 tropism prediction
  – Graph generator for variability analysis
CASPUR associated universities
Sequence alignment
• Optimized local pairwise alignment against a given consensus sequence
  • Smith-Waterman-Gotoh in forward and reverse
    – Gap open/extension parameter optimisation via grid search in [1, 30] and [0.3, 3], with step sizes of 5 and 0.5 respectively
    – Two possible optimisation functions, where m is the number of matches, g is the number of gaps, and N is the alignment length:
      » m/N (similarity maximisation)
      » m - m*g/N (gap minimisation and similarity maximisation, accounting for alignment length)
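The grid search and the two optimisation functions can be sketched as follows. This is a minimal illustration, not the server's implementation: the aligner is passed in as a hypothetical callable `align(read, consensus, gap_open, gap_ext)` returning `(m, g, N)`, and the grids are assumed to start at the lower bound and step upward without exceeding the upper bound.

```python
from itertools import product


def frange(start, stop, step):
    """Grid start, start+step, ... for all values not exceeding stop."""
    values = []
    k = 0
    while start + k * step <= stop + 1e-9:
        values.append(round(start + k * step, 10))
        k += 1
    return values


# Assumed reading of the slide: gap open in [1, 30] step 5,
# gap extension in [0.3, 3] step 0.5.
GAP_OPEN_GRID = frange(1, 30, 5)     # 1, 6, 11, 16, 21, 26
GAP_EXT_GRID = frange(0.3, 3, 0.5)   # 0.3, 0.8, ..., 2.8


def similarity(m, g, n):
    """m/N: similarity maximisation (g is unused)."""
    return m / n


def gap_penalised_similarity(m, g, n):
    """m - m*g/N: rewards matches, penalises gaps relative to length."""
    return m - m * g / n


def grid_search(align, read, consensus, objective):
    """Return (best_score, gap_open, gap_ext) over the whole grid.

    `align` is a hypothetical aligner returning (matches, gaps, length).
    Ties keep the first parameter pair encountered.
    """
    best = None
    for gap_open, gap_ext in product(GAP_OPEN_GRID, GAP_EXT_GRID):
        m, g, n = align(read, consensus, gap_open, gap_ext)
        if n == 0:
            continue  # empty alignment, nothing to score
        score = objective(m, g, n)
        if best is None or score > best[0]:
            best = (score, gap_open, gap_ext)
    return best
```

With these assumed grids, each read is aligned 36 times per orientation, which is one reason the parallelization of the alignment step matters.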
Contaminant detection
• A random alignment score distribution is derived by
  – aligning n random sequences (at least n = 400), whose lengths are normally distributed around the mean and standard deviation of the actual lane read lengths
  – applying the given optimisation procedure to each random sequence
• A z test with Gumbel's extreme value distribution (like the BLAST e-value) is performed for each real read alignment score, corrected for multiple testing with Benjamini-Hochberg
• Sequences with adj. p > 0.01 are discarded
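The filtering step can be sketched as below. The slide does not say how the Gumbel distribution is fitted, so this sketch assumes a method-of-moments fit to the random alignment scores and a right-tail p-value; all function names are illustrative.

```python
import math
import statistics

EULER_GAMMA = 0.5772156649015329


def fit_gumbel(random_scores):
    """Method-of-moments fit of a Gumbel (maximum) distribution.

    Assumption: mean = mu + gamma*beta and variance = (pi*beta)^2 / 6.
    Requires at least two scores that are not all identical.
    """
    mean = statistics.mean(random_scores)
    sd = statistics.stdev(random_scores)
    beta = sd * math.sqrt(6) / math.pi
    mu = mean - EULER_GAMMA * beta
    return mu, beta


def gumbel_pvalue(score, mu, beta):
    """P(random score >= score) = 1 - exp(-exp(-(score - mu) / beta))."""
    z = max((score - mu) / beta, -700.0)  # clamp to avoid exp overflow
    return -math.expm1(-math.exp(-z))


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    for rank in range(n, 0, -1):  # from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


def keep_reads(real_scores, random_scores, alpha=0.01):
    """True for reads scoring significantly above random, False otherwise."""
    mu, beta = fit_gumbel(random_scores)
    pvals = [gumbel_pvalue(s, mu, beta) for s in real_scores]
    return [adj <= alpha for adj in benjamini_hochberg(pvals)]
```

Reads flagged `False` have an adjusted p-value above 0.01, meaning their alignment score is not distinguishable from that of a random sequence, so they are treated as contaminants.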
Error detection/correction
• For each position of the consensus (and the corresponding indels) we run a statistical test for over-representation of changes within the reads
  – chi-square statistic
• After Bonferroni correction for multiple testing, we exclude positions with adj. p > 0.01
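One way to read this test is sketched below. The slide names only the chi-square statistic and the Bonferroni correction; the null hypothesis used here (variant counts arising from a fixed background error rate, set to an illustrative 0.1%) and the one-sided handling are assumptions, and the function names are hypothetical.

```python
import math


def chi_square_pvalue_1df(observed, coverage, error_rate):
    """Chi-square goodness-of-fit (1 df) for an excess of changes.

    Compares the observed number of reads carrying a change with the
    number expected from sequencing error alone. Counts at or below
    the expectation are not over-represented and get p = 1.
    Requires coverage > 0 and 0 < error_rate < 1.
    """
    expected = coverage * error_rate
    if observed <= expected:
        return 1.0
    diff_sq = (observed - expected) ** 2
    chi2 = diff_sq / expected + diff_sq / (coverage - expected)
    # Survival function of chi-square with 1 df: erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi2 / 2.0))


def significant_positions(counts, error_rate=0.001, alpha=0.01):
    """counts: list of (variant_count, coverage), one per position.

    Returns True where the Bonferroni-adjusted p-value is <= alpha,
    i.e. where the change is kept as a real variant, not an error.
    """
    n_tests = len(counts)
    flags = []
    for observed, coverage in counts:
        p = chi_square_pvalue_1df(observed, coverage, error_rate)
        adjusted = min(1.0, p * n_tests)  # Bonferroni correction
        flags.append(adjusted <= alpha)
    return flags
```

For example, with 1000 reads covering a position and an assumed 0.1% error rate, a change seen in 50 reads is retained, while a change seen in one or two reads is indistinguishable from sequencing error and is excluded.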
Web Service Interface
Variations plot
Shannon entropy plot
HIV Diagnostics application
• Idea from Martin Daumer's group (Institute of Immunology, Kaiserslautern) and MPI
• HIV-1 coreceptor usage prediction
  – Uses statistical learning applied to NGS data
    • Existing methods are: geno2pheno, PSSM
    • We developed a new method based on logistic regression (Prosperi et al. AIDS Research and Human Retroviruses 2009; 25(3).)
  – Alternative to the TROFILE method
    • Pro: less expensive, quicker results, NGS also gives a description of the quasispecies
    • Contra: results not always concordant with TROFILE
Statistical Learning Model
• Logistic Regression
  – Accuracy: 92.76%
  – AUC: 0.93
CXCR4 usage prediction
People at CASPUR and INMI
• MR Capobianchi, G Ippolito
• A Desideri, G Chillemi
• I Abbate, G Rozera
• A Barbato, A Bruselles