  HIV tropism assessment using next generation sequencing
Mattia CF Prosperi
National Institute for Infectious Diseases "Lazzaro Spallanzani" (INMI)
Dept. Virology
Via Portuense, 292 – 00149 – Rome, Italy.

  2. Summary
• Next-generation (aka ultra-deep) sequencing (NGS)
  – Technologies, features
• Low-level tools to analyse NGS data
  – Sequence alignment
  – Error correction
• High-level tool for clinical purposes
  – Ultra-deep prediction of HIV-1 coreceptor usage
  – Statistical learning model
• Web server

  3. Next-generation sequencing
• Technologies
  – 454, Illumina, ABI SOLiD, Polonator, Helicos
• Fields of application
  – De novo sequencing
  – Re-sequencing
  – Metagenomics

  4. Next-generation sequencing data
• 454 GS FLX, Roche
  – A sequence read is ~400 bases long (with Titanium upgrade)
  – 400-600 million bases per 10-hour run
  – Higher error rate than Sanger sequencing
    • Approximately 0.1% and 0.05% for homopolymeric and non-homopolymeric regions, respectively (estimated on an HIV plasmid clone)
  – Possible presence of contaminants
• Other technologies: Illumina, ABI SOLiD, Helicos…
  – Shorter reads, higher base throughput
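
As a rough back-of-the-envelope check, the figures quoted on this slide can be turned into reads per run and expected errors per read. The function names are illustrative, and applying a single error rate to a whole read is a simplifying assumption (a real read mixes homopolymeric and non-homopolymeric stretches), so the two error figures are only bounds.

```python
# Figures quoted on the slide for the 454 GS FLX with Titanium upgrade.
READ_LENGTH = 400                            # approximate bases per read
BASES_PER_RUN = (400_000_000, 600_000_000)   # low and high end of a 10-hour run
ERROR_RATE_HOMOPOLYMER = 0.001               # 0.1 %
ERROR_RATE_OTHER = 0.0005                    # 0.05 %


def reads_per_run(bases, read_length=READ_LENGTH):
    """Approximate number of reads obtained from a given base throughput."""
    return bases // read_length


def expected_errors_per_read(error_rate, read_length=READ_LENGTH):
    """Expected number of erroneous bases if one error rate applied to the whole read."""
    return error_rate * read_length


low, high = (reads_per_run(b) for b in BASES_PER_RUN)
print(f"reads per run: {low:,} to {high:,}")
print("expected errors per read (bounds):",
      expected_errors_per_read(ERROR_RATE_OTHER),
      expected_errors_per_read(ERROR_RATE_HOMOPOLYMER))
```

With roughly a million reads per run, even a sub-percent error rate yields a large absolute number of erroneous bases, which is why the later slides devote separate steps to contaminant and error filtering.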

  5. Web server
• Easy user interface
  – Parallelization of read alignment and error correction
    • Computational burden reduced from hours to minutes
  – Online tools for NGS-aided diagnostics
    • HIV-1 tropism prediction
  – Graph generator for variability analysis

  7. Sequence alignment
• Optimized local pairwise alignment against a given consensus sequence
  – Smith-Waterman-Gotoh in forward and reverse
  – Gap open/extension parameter optimisation via grid search in [1, 30] and [0.3, 3], with step sizes of 5 and 0.5 respectively
  – Two possible optimisation functions, where m is the number of matches, g is the number of gaps, N is the alignment length:
    » m/N (similarity maximisation)
    » m - m*g/N (gap minimisation and similarity maximisation, accounting for alignment length)
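
The grid search and the two optimisation functions can be sketched as follows. This is a minimal illustration, not the original implementation: the aligner is passed in as a hypothetical callback `align(gap_open, gap_extend)` returning `(m, g, N)`, and the grid is assumed to start at the lower bound of each interval and advance by the stated step.

```python
from itertools import product


def grid(start, stop, step):
    """Values start, start + step, ... that do not exceed stop."""
    values = []
    k = 0
    while start + k * step <= stop + 1e-9:
        values.append(round(start + k * step, 6))
        k += 1
    return values


def similarity(m, g, n):
    """First objective on the slide: m / N."""
    return m / n


def gap_penalised_similarity(m, g, n):
    """Second objective on the slide: m - m*g/N."""
    return m - m * g / n


def optimise_gap_parameters(align, objective):
    """Try every (gap_open, gap_extend) pair and keep the best-scoring one.

    `align` is a hypothetical callback standing in for the real
    Smith-Waterman-Gotoh aligner; it must return (m, g, N).
    Returns (best_score, gap_open, gap_extend).
    """
    best = None
    for gap_open, gap_extend in product(grid(1, 30, 5), grid(0.3, 3, 0.5)):
        m, g, n = align(gap_open, gap_extend)
        score = objective(m, g, n)
        if best is None or score > best[0]:
            best = (score, gap_open, gap_extend)
    return best


if __name__ == "__main__":
    # Toy aligner: higher gap-open penalties produce fewer gaps.
    def toy_align(gap_open, gap_extend):
        return 90, max(0, 10 - gap_open // 3), 100

    print(optimise_gap_parameters(toy_align, gap_penalised_similarity))
```

Under these assumptions the grid holds 6 x 6 = 36 parameter pairs per read and orientation, which is one reason the web server parallelises the alignment step.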

  8. Contaminant detection
• A random alignment score distribution is derived by
  – aligning n random sequences (n ≥ 400), whose lengths are normally distributed with the mean and standard deviation of the actual lane read lengths
  – applying the given optimisation procedure to each random sequence
• A z-test against Gumbel's extreme value distribution (similar to the BLAST e-value) is performed for each real read alignment score, corrected for multiple testing with the Benjamini-Hochberg procedure
• Sequences with an adjusted p > 0.01 are discarded
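
One way to sketch this filter uses only the Python standard library. The slide does not say how the Gumbel distribution is fitted, so the method-of-moments fit below is an assumption, and all function names are illustrative. The null scores are expected to come from aligning the random sequences with the real aligner, which is not shown here.

```python
import math
import random
import statistics

EULER_GAMMA = 0.5772156649015329


def random_read_lengths(n, mean_len, sd_len, rng):
    """Draw n read lengths from a normal distribution (at least 1 base each)."""
    return [max(1, round(rng.gauss(mean_len, sd_len))) for _ in range(n)]


def random_sequence(length, rng):
    """Uniform random nucleotide sequence of the given length."""
    return "".join(rng.choice("ACGT") for _ in range(length))


def fit_gumbel(null_scores):
    """Method-of-moments estimate of the Gumbel location and scale (assumed fit)."""
    scale = statistics.stdev(null_scores) * math.sqrt(6) / math.pi
    location = statistics.mean(null_scores) - EULER_GAMMA * scale
    return location, scale


def gumbel_p_value(score, location, scale):
    """P(random score >= observed score) under the fitted Gumbel distribution."""
    z = (score - location) / scale
    if z < -700:  # avoid overflow in exp(); the p-value is 1 to machine precision
        return 1.0
    return -math.expm1(-math.exp(-z))


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


def keep_reads(read_scores, null_scores, alpha=0.01):
    """True for reads whose adjusted p-value is at most alpha (kept), else False."""
    location, scale = fit_gumbel(null_scores)
    p_values = [gumbel_p_value(s, location, scale) for s in read_scores]
    return [adj <= alpha for adj in benjamini_hochberg(p_values)]


if __name__ == "__main__":
    rng = random.Random(0)
    lengths = random_read_lengths(400, 400, 50, rng)
    print(random_sequence(lengths[0], rng)[:30], "...")
    # Stand-in null scores; in practice these come from aligning the random sequences.
    null = [rng.gauss(20, 3) for _ in range(400)]
    print(keep_reads([100.0, 21.0], null))
```

A read is kept only when its alignment score is significantly higher than what random sequences of comparable length achieve against the same consensus.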

  9. Error detection/correction
• For each position of the consensus (and the corresponding indels), we run a statistical test for over-representation of changes within the reads
  – Chi-square statistic
• After Bonferroni correction for multiple testing, we exclude positions with adjusted p > 0.01
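
A minimal sketch of such a per-position test follows. The slide does not specify how the chi-square test is constructed, so this assumes a one-degree-of-freedom goodness-of-fit test of the observed variant count against a fixed per-base error rate. The default of 0.1% is borrowed from the homopolymer figure on slide 4 and is an assumption, as are the function names.

```python
import math


def chi_square_p_value(variant_count, coverage, error_rate):
    """P-value of a 1-df chi-square goodness-of-fit statistic.

    Compares the observed variant/non-variant counts at one position with
    the counts expected from sequencing error alone. Counts at or below
    the expectation get p = 1, since only over-representation matters.
    """
    expected_variant = coverage * error_rate
    if variant_count <= expected_variant:
        return 1.0
    expected_other = coverage - expected_variant
    observed_other = coverage - variant_count
    stat = ((variant_count - expected_variant) ** 2 / expected_variant
            + (observed_other - expected_other) ** 2 / expected_other)
    # Survival function of the chi-square distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(stat / 2.0))


def bonferroni(p_values):
    """Bonferroni-adjusted p-values, capped at 1."""
    n = len(p_values)
    return [min(1.0, p * n) for p in p_values]


def significant_positions(variant_counts, coverages, error_rate=0.001, alpha=0.01):
    """True where a change is over-represented (adjusted p <= alpha)."""
    p_values = [chi_square_p_value(v, c, error_rate)
                for v, c in zip(variant_counts, coverages)]
    return [adj <= alpha for adj in bonferroni(p_values)]


if __name__ == "__main__":
    # Three positions at 1000x coverage with 50, 1 and 0 variant reads.
    print(significant_positions([50, 1, 0], [1000, 1000, 1000]))
```

Positions flagged `False` correspond to those the slide excludes: their changes are not distinguishable from sequencing error at the chosen threshold.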

  13. HIV Diagnostics application
• Idea from Martin Daumer's group (Institute of Immunology, Kaiserslautern) and MPI
• HIV-1 coreceptor usage prediction
  – Uses statistical learning applied to NGS data
    • Existing methods are: geno2pheno, PSSM
    • We developed a new method based on logistic regression (Prosperi et al. AIDS Research and Human Retroviruses 2009; 25(3).)
  – Alternative to the Trofile method
    • Pro: less expensive, quicker results, NGS also gives a description of the quasispecies
    • Con: results not always concordant with Trofile

  14. Statistical Learning Model
• Logistic regression
  – Accuracy: 92.76%
  – AUC: 0.93

  17. People at CASPUR and INMI • MR Capobianchi, G Ippolito • A Desideri, G Chillemi • I Abbate, G Rozera • A Barbato, A Bruselles

