COMET: A novel approach to HIV-1 subtype prediction
(Context-based Modeling for Expeditious Typing)
Daniel Struck
CRP-SANTÉ Laboratory of Retrovirology (daniel.struck@crp-sante.lu)
comet.retrovirology.lu

Background
• HIV-1 subtype is often used in epidemiological studies
• Many different subtyping tools exist:
  – jpHMM, RIP (LANL), NCBI genotyping, STAR, REGA Subtyping Tool, …
• Subtyping remains a controversial topic → compare the results from different approaches

COMET HIV-1 subtyping tool
• Context-based modeling for classification of HIV-1 sequences, adapted from the PPM compression algorithm (prediction by partial match)
  – takes ambiguities from population sequencing into consideration
• Software written in Java (Linux, Windows, Apple, …)
• Core algorithm fits in approx. 300 lines of code
• Does not require any external analysis tool (muscle / mafft / clustal, paup / raxml / phyml)
• Multi-threaded (takes advantage of all available CPU cores)

Algorithm
• Training of the model with the subtype reference sequences from Los Alamos National Lab (LANL) from 2008 and 30 additional near full-length sequences from LANL.
• Slide over the sequence and determine the probabilities for each subtype. Simplified example with a model 4:

  Sequence:   C T A G C A A C A

  Subtype A   0.5  0.5  0.1  0.2  0.3
  Subtype B   0.5  0.5  0.4  0.6  0.8
  Subtype C   0.3  0.2  0.1  0.2  0.1

• Determine the most probable subtype.
• Then slide over the table of probabilities with a window size of 250 bp and a step size of 2 bp to detect possible recombination events.
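
The steps on this slide can be sketched roughly as follows. This is a minimal illustration, not the COMET implementation: it assumes that "model 4" means a context of 4 nucleotides, it replaces PPM's escape mechanism with simple add-one smoothing, and it sums log-probabilities to pick the most probable subtype. The function names and the toy reference sequences are hypothetical.

```python
import math
from collections import defaultdict

ORDER = 4  # assumed context length ("model 4" on the slide)

def train(seqs, order=ORDER):
    """Count which nucleotide follows each context of `order` nucleotides."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in seqs:
        for i in range(order, len(seq)):
            counts[seq[i - order:i]][seq[i]] += 1
    return counts

def position_log_probs(model, query, order=ORDER, alphabet="ACGT"):
    """Log-probability of each query nucleotide given its preceding context.
    Add-one smoothing is a stand-in for PPM's escape mechanism."""
    out = []
    for i in range(order, len(query)):
        following = model.get(query[i - order:i], {})
        total = sum(following.values())
        p = (following.get(query[i], 0) + 1) / (total + len(alphabet))
        out.append(math.log(p))
    return out

def classify(models, query):
    """Return the subtype whose model gives the highest summed log-probability."""
    scores = {name: sum(position_log_probs(m, query)) for name, m in models.items()}
    return max(scores, key=scores.get), scores

def window_calls(models, query, window=250, step=2):
    """Slide a window over the per-position scores and call a subtype per window."""
    per_pos = {name: position_log_probs(m, query) for name, m in models.items()}
    n = len(next(iter(per_pos.values())))
    calls = []
    for start in range(0, max(n - window, 0) + 1, step):
        sums = {name: sum(lp[start:start + window]) for name, lp in per_pos.items()}
        calls.append((start, max(sums, key=sums.get)))
    return calls

# Toy, made-up "reference sequences" for two hypothetical subtypes.
models = {"A": train(["ACGTACGTACGTACGT"]), "B": train(["AAGGAAGGAAGGAAGG"])}
best, _ = classify(models, "ACGTACGTAC")
calls = window_calls(models, "ACGTACGTACGT" + "AAGGAAGGAAGG", window=4, step=2)
```

A change of the winning subtype along `calls` is what would hint at a recombination breakpoint; the window of 250 bp and step of 2 bp are the values given on the slide, while the tiny window in the toy call is only there to fit the short sequence.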

Analysis of 27017 prot-RT sequences from LANL
• Dataset for analysis:
  – 27017 prot-RT sequences downloaded from LANL.
• Query parameters:
  – HXB2 start point: 2253, end point: 3450 (prot-RT region)
  – Sequence length < 1700 bp
• Download subtype results from the STAR and REGA subtyping tools.
  – STAR: all PURE, CRF: 01_AE - 02_AG
  – REGA v2: all PURE, CRF: 01_AE - 14_BG

Subtype distribution of the dataset (27017 prot-RT sequences)

Subtype        STAR    REGA   COMET
B             19988   19722   20282
C              1329    1334    1329
A               672    1200    1194
D               555     186     499
G               246     441     393
F               205     206     193
H                19      21      19
J                 3       6       2
CRF02_AG        867     787     829
CRF01_AE        414     419     416
other CRF         0     653     806
unassigned     2719    2042    1055

Comparison of STAR, REGA & COMET (27017 prot-RT sequences)
• All 3 tools agreed in 88.3% of cases (23854)
  – 22352 PURE
  – 777 CRF
  – 725 unassigned
• All 3 tools gave different results in only 0.1% of cases (30).
• COMET & REGA agreed in 6.4% of cases (1722); STAR disagreed
  – 1034 PURE, 582 CRF, 106 unassigned
• COMET & STAR agreed in 4.0% of cases (1090); REGA disagreed
  – 910 PURE, 40 CRF, 140 unassigned
• REGA & STAR agreed in 1.2% of cases (321); COMET disagreed
  – 77 PURE, 8 CRF, 236 unassigned
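
As a quick consistency check, the five agreement categories above should partition the dataset and reproduce the quoted percentages. The sketch below uses only the counts from this slide; the helper names are hypothetical, and `agreement_pattern` is just one plausible way to bin the three calls for a single sequence.

```python
TOTAL = 27017

# Counts copied from the slide.
SLIDE_COUNTS = {
    "all agree": 23854,
    "all disagree": 30,
    "COMET & REGA agree": 1722,
    "COMET & STAR agree": 1090,
    "REGA & STAR agree": 321,
}

def agreement_pattern(star, rega, comet):
    """Name which of the three subtype calls for one sequence coincide."""
    if star == rega == comet:
        return "all agree"
    if comet == rega:
        return "COMET & REGA agree"
    if comet == star:
        return "COMET & STAR agree"
    if rega == star:
        return "REGA & STAR agree"
    return "all disagree"

def percent(count, total=TOTAL):
    """Share of the dataset in percent, rounded to one decimal as on the slide."""
    return round(100 * count / total, 1)

percentages = {name: percent(count) for name, count in SLIDE_COUNTS.items()}
```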

Comparison of REGA & COMET to LANL
Of the 27017 sequences in the dataset, 24735 had a subtype (PURE, CRF, URF) assigned in the LANL database.
For comparison, 24576 sequences were analyzed (PURE, CRF: 01_AE → 14_BG, URF).
• REGA & LANL agreed in 93.9% of cases (23077) and disagreed in 6.1% of cases (1499). Fleiss kappa = 0.84
• COMET & LANL agreed in 96.9% of cases (23818) and disagreed in 3.1% of cases (758). Fleiss kappa = 0.92
“The Fleiss kappa measure calculates the degree of agreement in classification over that which would be expected by chance and is scored as a number between 0 and 1.”

Cohen Kappa

Subtype   REGA ↔ LANL   COMET ↔ LANL   training set
01_AE     0.98          0.98           5
02_AG     0.92          0.93           6
03_AB     0             0              2
04_CPX    0.86          0.86           4
05_DF     1             0              3
06_CPX    0.83          0.77           5
07_BC     1             0.98           4
08_BC     0.97          0.97           2
09_CPX    0             0.8            4
10_CD     -1.09E-04     0              2
11_CPX    0.64          0.64           3
12_BF     0.65          0.61           5
13_CPX    0.8           0.8            3
14_BG     0             0              2
A         0.96          0.96           7 A1, 2 A2
B         0.92          0.98           7
C         0.99          0.99           6
D         0.41          0.94           6
F         0.94          0.92           6 F1, 2 F2
G         0.9           0.91           4
H         0.97          0.91           4
J         0.5           0.5            3
K         0             0              2
URF       0.38          0.55
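
For readers unfamiliar with the statistic, Cohen's kappa for two raters (here a tool and the LANL annotation) is (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance from each rater's label frequencies. The sketch below is a generic illustration with made-up labels; how the per-subtype values in the table were derived is not stated on the slide, so the one-vs-rest helper is only an assumption.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equally long, non-empty label lists")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:  # both raters used one identical label throughout
        return 1.0
    return (observed - expected) / (1 - expected)

def one_vs_rest(labels, subtype):
    """Collapse labels to 'this subtype' versus 'anything else' (assumed scheme)."""
    return [label == subtype for label in labels]

# Made-up example calls, not data from the study.
tool = ["B", "B", "C", "A", "B", "C"]
lanl = ["B", "B", "C", "B", "B", "A"]
overall = cohen_kappa(tool, lanl)
kappa_b = cohen_kappa(one_vs_rest(tool, "B"), one_vs_rest(lanl, "B"))
```

Kappa drops slightly below zero when agreement is worse than chance, which is consistent with the small negative value listed for 10_CD.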

Benchmark
Analysis of the 27017 prot-RT sequences:
• 392 ± 2 seconds (6 ½ minutes) on Opteron server (2 x quad-core, 2.5 GHz) => 68 prot-RT sequences / second
• 144 ± 0 seconds (2 ½ minutes) on new Intel server (2 x quad-core, newest generation, 2.93 GHz) => 187 prot-RT sequences / second
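
The throughput figures follow directly from the dataset size and the wall-clock times. A small sketch of the arithmetic (the function name is hypothetical; the slide's 68 and 187 appear to be truncated rather than rounded values):

```python
def sequences_per_second(n_sequences, seconds):
    """Average throughput over the whole run."""
    return n_sequences / seconds

N_SEQUENCES = 27017
for server, seconds in (("Opteron", 392), ("Intel", 144)):
    rate = sequences_per_second(N_SEQUENCES, seconds)
    print(f"{server}: {rate:.1f} sequences/second over {seconds / 60:.1f} minutes")
```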

Ultra-deep sequencing (UDS) application
In-house UDS (454) software:
• alignment, trimming
• filtering
• compressing
• automatic correction of homopolymer count & “carry forward” errors
• …
• added adapted COMET module with bootstrap analysis (100 values per sequence, threshold 75%)
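
The slide does not describe how the bootstrap is carried out. One plausible reading, sketched here purely as an assumption, is to resample the per-position subtype scores of a read with replacement 100 times and accept the subtype only if it wins in at least 75% of the replicates. The input format (`per_pos_scores`, a hypothetical mapping from subtype name to a list of per-position log-probabilities) and the function name are illustrative.

```python
import random

def bootstrap_call(per_pos_scores, replicates=100, threshold=0.75, seed=0):
    """Resample positions with replacement and report the subtype that wins
    most replicates, or 'unassigned' if its support is below the threshold."""
    rng = random.Random(seed)
    names = list(per_pos_scores)
    n = len(per_pos_scores[names[0]])
    wins = dict.fromkeys(names, 0)
    for _ in range(replicates):
        idx = [rng.randrange(n) for _ in range(n)]
        totals = {name: sum(per_pos_scores[name][i] for i in idx) for name in names}
        wins[max(totals, key=totals.get)] += 1
    best = max(wins, key=wins.get)
    support = wins[best] / replicates
    return (best if support >= threshold else "unassigned"), support

# Made-up scores: subtype A1 is favoured at every position.
clear = {"A1": [-0.5] * 50, "C": [-1.5] * 50}
label, support = bootstrap_call(clear)
```

Reads whose best subtype falls below the threshold would end up in an "unassigned" bin under this reading.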

UDS application, dataset: 64 patients from Rwanda AMATA study
454:
• Sequence length: 333 bp (454, RT, AA 88 → 198)
• Total sequences analyzed: 267749 (seq. with frameshifts excluded)
• Time needed for analysis (100 bootstraps / seq.): 5 ½ minutes
Sanger (prot-RT):
• URF: 2 AC, 5 CA, 1 CAC, 1 AD, 2 CD, 1 DC, 1 GH

UDS application, results
(REGA, STAR, jpHMM and man. align. insp. columns: COMET subtype confirmation)

patient  major subtype  number  minor subtype  number  unassigned  minority %  REGA   STAR   jpHMM  man. align. insp.  Sanger
5        A1             4312    C              1       0           0.02        ok     A1/u   ok     ok                 URF_CA
8        D              6853    A1             1       57          0.01        ok     ok     ok     ok                 D
9        C              6603    A1             14      28          0.21        u/A1   u/A1   H/A1   C-H?/A1            URF_GH
17       A1             5727    C              3       0           0.05        ok     ok     ok     ok                 A1
18       C              3279    A1             5       0           0.15        ok     ok     ok     ok                 C
21       A1             2856    C              4       0           0.14        u/u    ok     ok     ok                 A1
22       C              5995    A1             5       0           0.08        u/A1   ok     ok     ok                 C
24       A1             6361    C              13      0           0.2         u/C    ok     ok     ok                 A1
25       C              6412    A1             15      0           0.23        C/u    ok     ok     ok                 URF_CD
26       A1             7350    C              1       0           0.01        u/C    ok     ok     ok                 C
32       C              6094    A1             11      0           0.18        C/u    ok     ok     ok                 URF_DC
33       A1             2226    C              1       0           0.04        ok     ok     ok     ok                 A1
35       A1             4864    C              4       0           0.08        A1/u   ok     ok     ok                 A1
36       A1             670     C              1       0           0.15        ok     ok     ok     ok                 A1
47       A1             3290    C              2       0           0.06        u/C    ok     ok     ok                 A1
48       A1             4120    C              1       0           0.02        u/C    ok     ok     ok                 A1
49       C              5279    A1             58      0           1.09        ok     ok     ok     ok                 C
64       C              1695    A1             9       0           0.53        ok     ok     ok     ok                 URF_CA
65       A1             6346    C              8       0           0.13        A1/u   ok     ok     ok                 A1
73       C              3335    A1             1       0           0.03        ok     ok     ok     ok                 C
79       A1             3244    C              3       0           0.09        ok     ok     ok     ok                 A1

21 out of 64 patients (32.81%) seem to be dually infected with two different subtypes
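
The "minority %" column is consistent with the share of minor-subtype reads among all reads of a patient. The slide does not state the formula; the sketch below assumes the denominator includes unassigned reads (excluding them gives the same rounded values for the rows checked here). The function name is hypothetical and the rows are copied from the table.

```python
def minority_percent(major, minor, unassigned=0):
    """Minor-subtype reads as a percentage of all reads (assumed denominator)."""
    return 100 * minor / (major + minor + unassigned)

# (patient, major reads, minor reads, unassigned reads, minority % on the slide)
ROWS = [
    (5, 4312, 1, 0, 0.02),
    (9, 6603, 14, 28, 0.21),
    (49, 5279, 58, 0, 1.09),
    (64, 1695, 9, 0, 0.53),
]

recomputed = {
    patient: round(minority_percent(major, minor, unassigned), 2)
    for patient, major, minor, unassigned, _ in ROWS
}

# Share of patients with a detected minor subtype (21 table rows, 64 patients).
dual_share = 100 * 21 / 64
```

With read counts in the thousands per patient, a single minor-subtype read already corresponds to 0.01 to 0.15%, so the lowest values in the table rest on one read each.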

Summary
• Reliable prediction of HIV-1 subtype
• Generally it is best to compare the results of different approaches to define the subtype of a sequence
• High performance and scalability
  – suitable for deep sequencing (454) analysis
• In preparation: stand-alone desktop version with the possibility to inspect the recombination pattern

http://comet.retrovirology.lu
Subtype results can be downloaded in CSV format

Acknowledgements
CRP-Santé, Laboratory of Retrovirology
• Jean-Claude Schmit
• Carole Devaux
• Danielle Perez Bercoff
• Jean-Claude Karasi
CRP-Santé, Laboratory of Cardiovascular Research
• Francisco Azuaje