HIV-1 tropism prediction
Mattia CF Prosperi (ahnven@yahoo.it)
University of “Roma TRE”, Faculty of Computer Science Engineering
Dept. of Computer Science and Automation (DIA)
via della Vasca Navale, 79 – 00149 – Rome, ITALY
Summary
• State of the art
  – From charge rule to structural descriptors
• Roma TRE modelling
  – Data collection
    • Sequence manipulation
    • Enhanced domain coding
  – Univariable analysis and clustering
  – Model technologies
    • Logistic regression and feature selection
    • Validation and comparison with other models
    • Interpretation of relevant features
State of the art
• Charge rule (De Jong, 1992)
• Neural Networks, Decision Trees, Support Vector Machines (Resch, Pillai) on 200-300 examples
• Position Specific Scoring Matrices (Jensen)
• Support Vector Machines (Sing) on 1,100 examples with AUC maximisation, adding CD4+ cell count as an additional input variable
• Support Vector Machines + Structural Analysis (Sander, 2007) with AUC maximisation
• Neural Networks for dual-tropism prediction (Lamers, 2008)
• All models work on the V3 loop only
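The charge rule listed first above is commonly stated as the "11/25 rule": a virus is called X4 when a positively charged residue (R or K) sits at V3 loop position 11 or 25. A minimal sketch, assuming a gap-free V3 loop string numbered from 1; the function name `predicts_x4` is hypothetical and not taken from the talk.

```python
# Sketch of the 11/25 charge rule. Assumption: the input is an ungapped V3 loop
# amino acid string whose first residue is V3 position 1.
BASIC_RESIDUES = {"R", "K"}  # positively charged amino acids


def predicts_x4(v3_loop: str) -> bool:
    """Return True when V3 position 11 or 25 (1-based) carries a basic residue."""
    if len(v3_loop) < 25:
        raise ValueError("V3 loop too short to inspect position 25")
    return v3_loop[10] in BASIC_RESIDUES or v3_loop[24] in BASIC_RESIDUES


# Illustrative 35-residue V3 loop and a variant with arginine at position 11.
r5_like = "CTRPNNNTRKSIHIGPGRAFYTTGEIIGDIRQAHC"
x4_like = r5_like[:10] + "R" + r5_like[11:]
print(predicts_x4(r5_like), predicts_x4(x4_like))
```

The rule only looks at two positions, which is why the later models in this list add more of the V3 loop and, eventually, structural descriptors.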
State of the art (2)
• SVM + Structural Analysis (Sander, 2007) seems to be the best-performing model at present
  – 91.56% accuracy
  – 0.93 AUC
  – Minor criticism of the sample collection: all distinct sequences were used, regardless of patient, without accounting for the real sequence population distribution
  – Improvements gained with the structural analysis over a reference SVM trained only on the V3 dummy variable encoding
Roma TRE approach
• Data: collection of samples from the “Los Alamos” database
  – Only one sequence per patient (the longest available, no clones), except for sequences with different tropism
  – No problematic sequences
  – All subtypes
  – At least the V3 loop, possibly the whole envelope gene
  – Clinical markers recorded
• Goal: prediction of CXCR4 usage probability (regardless of CCR5 usage; dual-tropic strains are pooled with X4 strains)
Sequence manipulation
• Previous works used multiple alignment (ClustalW or MUSCLE) on either nucleotides or amino acids
• We used local pairwise alignment (Smith-Waterman-Gotoh) with ambiguity and frameshift detection/correction against the HXB2 strain (which is X4)
  – Minor differences with the output of other models
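To make the alignment step concrete, here is a minimal sketch of the Smith-Waterman local alignment score with Gotoh's affine gap recurrences. It computes the score only and does not cover the ambiguity and frameshift handling mentioned above; the function name and the scoring values (match 2, mismatch -1, gap open -3, gap extend -1) are illustrative assumptions, not the settings used in this work.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap_open=-3, gap_extend=-1):
    """Best local alignment score with affine gaps (Gotoh recurrences).

    A gap of length k costs gap_open + (k - 1) * gap_extend.
    Scoring values are illustrative assumptions.
    """
    neg_inf = float("-inf")
    cols = len(b) + 1
    h_prev = [0] * cols        # best score of an alignment ending at (i-1, j)
    f_prev = [neg_inf] * cols  # same, but ending with a gap in b (vertical move)
    best = 0
    for i in range(1, len(a) + 1):
        h_cur = [0] * cols
        f_cur = [neg_inf] * cols
        e = neg_inf            # ending with a gap in a (horizontal move)
        for j in range(1, cols):
            e = max(h_cur[j - 1] + gap_open, e + gap_extend)
            f_cur[j] = max(h_prev[j] + gap_open, f_prev[j] + gap_extend)
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # Local alignment: a score never drops below zero.
            h_cur[j] = max(0, h_prev[j - 1] + sub, e, f_cur[j])
            best = max(best, h_cur[j])
        h_prev, f_prev = h_cur, f_cur
    return best


print(local_alignment_score("AAAACCCC", "AAAAGGGCCCC"))
```

With affine gaps, the three-residue insertion in the second sequence is bridged by a single opened gap, so the alignment spans both matching blocks instead of stopping at the first one.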
Domain coding
• Binary dummy variables for specific amino acid changes (plus insertions/deletions and “any” substitution) in the V3 loop and in the envelope
• Physico-chemical coding for position changes
• Subtype
• Clinical markers (HIV RNA load, CD4 and CD8 cell counts)
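The binary dummy coding above can be sketched as follows, assuming the query has already been aligned to the reference (same length, `-` for gaps). The feature names follow the pattern seen elsewhere in these slides (`318A`, `ins317`, `306del`); the function name, the sparse dictionary output, and the choice to set the "any" flag for substitutions only are assumptions made for illustration.

```python
def encode_changes(reference: str, query: str, first_position: int = 1) -> dict:
    """Map an aligned query to sparse binary features relative to the reference.

    Absent keys mean 0. Positions are counted on reference residues only.
    """
    if len(reference) != len(query):
        raise ValueError("reference and query must be aligned to the same length")
    features = {}
    position = first_position - 1
    for ref_aa, qry_aa in zip(reference, query):
        if ref_aa == "-":
            # Gap in the reference: the query carries an insertion after `position`.
            if qry_aa != "-":
                features[f"ins{position}"] = 1
            continue
        position += 1
        if qry_aa == ref_aa:
            continue
        if qry_aa == "-":
            features[f"{position}del"] = 1
        else:
            features[f"{position}{qry_aa}"] = 1   # specific amino acid change
            features[f"{position}any"] = 1        # assumption: substitutions only
    return features


# Toy aligned pair (not real HXB2 data); numbering starts at 296 for illustration.
print(encode_changes("CTRPN-NT", "CTKPNGN-", first_position=296))
```

Physico-chemical coding, subtype and clinical markers would be added as further columns alongside these binary features.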
Univariable analysis
• CD4+ cell counts are significantly associated with tropism (low CD4+ → X4)
• Subtype B and D isolates are predominantly X4
• Subtype A, C and 02_AG isolates are predominantly R5
Univariable analysis
• Highly significant positions in the V3 loop (V3 loop position in parentheses)
  – 306 (11), 302 (7), 303 (8), 323 (28), 301 (6), 313 (18), 321 (26), 322 (27), 300 (5), 315 (20), 320 (25), 307 (12), 316 (21), 325 (30), 304 (9), …
• A few positions outside the V3 loop are significant, but slightly over the Benjamini-Hochberg adjusted threshold (adj. p < 0.1)
  – 440, 192, 169
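The Benjamini-Hochberg adjustment mentioned above can be sketched in a few lines. This is a generic implementation of the standard step-up procedure, not the code used for the analysis; the p-values in the example are made up.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone adjusted values.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p-values for four candidate positions.
raw = [0.01, 0.04, 0.03, 0.005]
adjusted = benjamini_hochberg(raw)
print([round(p, 4) for p in adjusted])
```

A position is then retained when its adjusted p-value falls below the chosen threshold.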
Hierarchical Clustering
• Threshold of 0.35: {318A, ins317}, {311I, 308S, 306del, 307del}, {322I, 320 hydrophilic, 326I}
• Mutations positively associated with X4 viruses tend to behave more independently (306S, 303I, 308K, 300Y and 307T)
Machine Learning
• Logistic Regression (LR)
• Feature selection via filter and embedded methods (univariable analysis, AIC selection, CFS, ridge shrinkage)
• Comparison with other (non-linear) machine learning techniques
  – SVM (same settings as Sander, 2007)
  – Random Forests and Decision Trees (RF, DT)
  – Rule Bases (RIPPER, JRIP)
  – Instance Based Reasoning (IBR)
• Multiple 10-fold cross-validation for model performance assessment and model comparison
  – Student’s t-test, adjusted (Bengio and Nadeau) for sample overlap and for multiple comparisons, over 10 independent runs
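Assuming the adjusted t-test above refers to the corrected resampled t-test of Nadeau and Bengio, the statistic can be sketched as below. It inflates the variance term by the test/train size ratio to account for overlapping training sets across folds. The function name and the example differences are hypothetical, and the multiple-comparison adjustment is not shown.

```python
import math
from statistics import mean, variance


def corrected_resampled_t(differences, test_train_ratio):
    """Corrected resampled t statistic for paired per-fold score differences.

    differences: score(model A) - score(model B) for each fold of each run.
    test_train_ratio: n_test / n_train, e.g. 1/9 for 10-fold cross-validation.
    """
    j = len(differences)
    if j < 2:
        raise ValueError("need at least two paired differences")
    var = variance(differences)  # sample variance (divides by j - 1)
    if var == 0:
        raise ValueError("zero variance: statistic is undefined")
    return mean(differences) / math.sqrt((1.0 / j + test_train_ratio) * var)


# Hypothetical accuracy differences between two models over four folds.
diffs = [0.01, 0.03, 0.02, 0.04]
t_corrected = corrected_resampled_t(diffs, test_train_ratio=1 / 9)
t_naive = mean(diffs) / math.sqrt(variance(diffs) / len(diffs))
print(round(t_corrected, 3), round(t_naive, 3))
```

The statistic is compared against a Student's t distribution with `len(differences) - 1` degrees of freedom. With 10 runs of 10-fold cross-validation there would be 100 paired differences. The corrected value is always smaller in magnitude than the naive paired t statistic, which makes the comparison more conservative.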
Results
• Logistic Regression
  – High accuracy (92.76%) and AUC (0.93)
  – Enhanced domain coding performs significantly better than naïve variable encoding and the V3 loop alone
  – Performs on par with the reference SVM
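For readers less familiar with the AUC figure quoted above: it is the probability that a randomly chosen X4 sample receives a higher predicted score than a randomly chosen R5 sample, with ties counted as one half. A minimal sketch with made-up labels and scores (1 = X4, 0 = R5); the function name is hypothetical.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive correctly ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


# Made-up predicted CXCR4-usage probabilities for four samples.
example_auc = auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
print(example_auc)
```

This quadratic-time form is fine for illustration; rank-based formulas give the same value more efficiently on large sets.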
Results (2)
Conclusions
• Logistic Regression is a powerful and interpretable tool for tropism prediction
  – Importance of envelope region analysis
  – Importance of enhanced variable encoding
  – Importance of feature selection techniques
  – Importance of robust validation and comparison statistics
  – We have a linear model: in the comparison analysis, non-linear models do not seem to improve performance
• The modelling technique is also suitable for combination with structure-based methods