Arevir
University of Cologne, Institute of Virology
Analysis of resistance mutations of HIV
Bioinformatics analysis of relations between mutations of the HIV genome and phenotypic drug resistance for the optimization of antiretroviral therapies
Eugen Schülter
Institut für Virologie der Universität zu Köln, Cologne
center of advanced european studies and research, Bonn
ES / Bonn, Apr 2007
Outline
- Background
- Database scheme
- Requirements of data protection
- Problems resulting from 'too much protection'
- Data cleansing
- Comparison between 'old' and 'new' Arevir DB
- Basic statistics / samples of derived information
- Collaborations / What next
Background
1999: The Arevir project 1) was founded by Daniel Hoffmann, Rolf Kaiser and Joachim Selbig. Aim: develop computer-based methods to enhance the interpretation of genotypic resistance tests.
2000: Niko Beerenwinkel created the basis for Arevir in his dissertation "Computational Analysis of HIV Drug Resistance Data".
2001: Barbara Schmidt, Hauke Walter and Klaus Korn provided ~650 genotype-phenotype pairs.
2001: First version of geno2pheno available online, predicting drug resistance from genotype.
1) The project was funded by the German Research Foundation (grants HO 1582/1-1 to -3 and KA 1569/1-1 to -3).
Background
Why develop computer-based methods?
[Figure: viral load (VL) over time t, annotated "Therapy must be changed!" and "Resistance test"]
Question: How will the current virus population change when faced with a new therapy?
Genotype: 41L, 67N, 68G, 70R, 86DE, 88S, 90I, 102Q, 103S, 118IV, 135T, 162H, 190A, 203DE, 210W, 211K, 214F, 215Y, 219E, 228H, 248D, 277K, 283I, 326V, 329IV, 334L
Which new therapy is the optimal choice? (~1500 drug combinations seen!)
Background
To give assistance with both questions, we tried to correlate genotype, VL, CD4 and therapy → we needed a database.
The Arevir DB is a MySQL database consisting mainly of 50 tables organized in 5 groups:
- Patient-related data (demographic data)
- Therapy data
- Isolate-related data (serology, clinical chemistry)
- Genotypic data (mutations)
- Administrative data (access rights etc.)
Arevir DB 'Scheme'
pseudonyms: patID, pseudonym
diagnoses: diagID, patID, ICD_code
patients: patID, year_of_birth, gender, transmission
therapies: therapyID, patID, therapy_start, therapy_stop
therapycomponents: therapyID, compound, dosage
isolates: isoID, patID, sampling_date
isolate_values: isoID, propID, methodID, value
sequences: seqID, isoID, nt_seq, nt_crc
mutations: seqID, mutations
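Arevir DB 'Scheme' (example values)
pseudonyms: patID = 2342, pseudonym = 'FA3E359D1...'
diagnoses: diagID = 7431, patID = 2342, ICD_code = 'R75'
patients: patID = 2342, year_of_birth = 1963, gender = 'M', transmission = 'IVDA'
therapies: therapyID = 4288, patID = 2342, therapy_start = 2002-06-23, therapy_stop = 2003-11-05
therapycomponents: therapyID = 4288, compound = '3TC', dosage = 150 mg
isolates: isoID = 12630, patID = 2342, sampling_date = 2002-06-02
isolate_values: isoID = 12630, propID = 'HIVRNA', methodID = 'bDNA', value = 165000
sequences: seqID = 2966, isoID = 12630, nt_seq = 'CCTCAGATC...', nt_crc = 3247583203
mutations: seqID = 2966, mutations = K65R, L74V...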
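The relations between these tables can be tried out in a minimal sketch. Arevir itself runs on MySQL; the example below uses Python's built-in sqlite3 module with an in-memory database only so that it is self-contained. The column types, the subset of tables and the helper names `build_example_db` and `viral_loads_with_therapy` are assumptions for illustration, not the actual Arevir definitions.

```python
import sqlite3

# Subset of the tables from the slide; all column types are assumptions.
SCHEMA = """
CREATE TABLE patients (
    patID INTEGER PRIMARY KEY, year_of_birth INTEGER,
    gender TEXT, transmission TEXT);
CREATE TABLE therapies (
    therapyID INTEGER PRIMARY KEY,
    patID INTEGER REFERENCES patients(patID),
    therapy_start TEXT, therapy_stop TEXT);
CREATE TABLE therapycomponents (
    therapyID INTEGER REFERENCES therapies(therapyID),
    compound TEXT, dosage TEXT);
CREATE TABLE isolates (
    isoID INTEGER PRIMARY KEY,
    patID INTEGER REFERENCES patients(patID),
    sampling_date TEXT);
CREATE TABLE isolate_values (
    isoID INTEGER REFERENCES isolates(isoID),
    propID TEXT, methodID TEXT, value REAL);
"""


def build_example_db():
    """Create an in-memory database holding the example values from the slide."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO patients VALUES (2342, 1963, 'M', 'IVDA')")
    conn.execute(
        "INSERT INTO therapies VALUES (4288, 2342, '2002-06-23', '2003-11-05')")
    conn.execute("INSERT INTO therapycomponents VALUES (4288, '3TC', '150 mg')")
    conn.execute("INSERT INTO isolates VALUES (12630, 2342, '2002-06-02')")
    conn.execute(
        "INSERT INTO isolate_values VALUES (12630, 'HIVRNA', 'bDNA', 165000)")
    return conn


def viral_loads_with_therapy(conn, pat_id):
    """Pair each HIVRNA value of a patient with the patient's therapy components."""
    query = """
        SELECT i.sampling_date, v.value, t.therapy_start, c.compound
        FROM isolates i
        JOIN isolate_values v ON v.isoID = i.isoID
        JOIN therapies t ON t.patID = i.patID
        JOIN therapycomponents c ON c.therapyID = t.therapyID
        WHERE i.patID = ? AND v.propID = 'HIVRNA'
    """
    return conn.execute(query, (pat_id,)).fetchall()


if __name__ == "__main__":
    for row in viral_loads_with_therapy(build_example_db(), 2342):
        print(row)
```

The join runs entirely over patID and therapyID, so the evaluation never needs the pseudonyms table.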
Data Protection
Requirements from data protection officials:
- Tracing back to the identity of a patient must be impossible
- Restricted access to the data (a physician can see only 'her/his' data; bioinformaticians have read access, but not to pseudonyms)
Names are not stored in Arevir at all:
- When a new patient record is added, a pseudonym is generated on the fly with the SHA-1 algorithm from the fields first name, last name and birthday.
The only way to connect to the server is:
- SSH (secure shell)
- Using public/private key authentication
- From a computer whose IP address is known to Arevir
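A minimal sketch of such a pseudonym, using Python's hashlib. The slide names only the algorithm (SHA-1) and the three input fields; the field order, the "|" separator, the UTF-8 encoding, the upper-case hex output and the function name `make_pseudonym` are assumptions for illustration.

```python
import hashlib


def make_pseudonym(first_name: str, last_name: str, birthday: str) -> str:
    """Derive a pseudonym from first name, last name and birthday via SHA-1.

    Assumption: fields are joined with '|' and encoded as UTF-8; the real
    Arevir concatenation rule is not stated in the slides.
    """
    raw = "|".join((first_name, last_name, birthday))
    return hashlib.sha1(raw.encode("utf-8")).hexdigest().upper()


if __name__ == "__main__":
    # Fictitious patient; the hash is 40 hex characters long.
    print(make_pseudonym("André", "Ramirez", "23.03.1969"))
    # A single differing letter yields a completely different pseudonym.
    print(make_pseudonym("Andre", "Ramires", "23.03.1969"))
```

Because the hash is computed over the raw fields, the same patient entered with two spellings receives two unrelated pseudonyms.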
Internet Access to Arevir
[Figure: connection to the Arevir server over the Internet (TCP/IP)]
- VNC tunneled via SSH
- Login with private key protected by passphrase
- VNC login with password
- Arevir web application login with password
Web Interface
[Screenshot: Arevir's web interface showing the input form for personal data]
Selbig's Data Funnel
For a meaningful correlation a large amount of data is needed!
[Figure: two funnels (2001 and 2005), each fed by sequences, therapies, patients, CD4 and VL]
2001: data from ~500 patients → ~150 records suitable for evaluation
2005: data from ~4500 patients → ~350 records suitable for evaluation
The yield stays small despite an increasing amount of data.
The data funnel
[Figure: Δn/time plotted over t (2000 to 2005) for clinical data and genotypes]
While the lab was producing more genotypes per month, the amount of follow-up data was decreasing steadily.
Bad patient identification was identified as the main cause of the funnel effect:
- The Arevir pseudonym algorithm depended on the exact spelling of the patient name
- The patient ↔ genotype list at the laboratory was maintained in an Excel sheet
Solution:
- 'New' pseudonym algorithm using some kind of fuzzy preprocessing
- Standalone application (Rosie) for the Institute of Virology
- Migration to a new MySQL version with some alterations to the scheme
Data cleansing I
Cleansing of patient names and assignment of a unique patient ID was done with new fuzzy indices (name aliases allowed):
André Ramirez, 23.03.1969 ↔ Andre Ramires, 23.03.1969
Hans-Peter Schmidt, 05.11.1982 ↔ Hans Schmidt, 05.11.1982
Georgios Koehler, 15.04.1958 ↔ Georgious Koeler, 15.04.1958
Anna Meier, 29.12.1978 ↔ Anna da Silveira, 29.12.1978
John Miller, 16.03.1970 ↔ John Miller, 10.03.1970
Mgabe Osamba, 12.04.1974 ↔ Ossamba Mgabe, 11.04.1974
Examples are fictitious!
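The slides do not describe how the fuzzy indices work. The sketch below is therefore not the Arevir algorithm; it only shows one common way to make name matching tolerant of the variations listed above: strip diacritics, lower-case, split hyphens, ignore token order, then compare with a similarity ratio from Python's difflib. The function names are hypothetical.

```python
import unicodedata
from difflib import SequenceMatcher


def normalize_name(name: str) -> str:
    """Lower-case, strip diacritics, split hyphens and sort the name tokens."""
    decomposed = unicodedata.normalize("NFKD", name)
    # Drop combining marks, so that 'é' becomes 'e'.
    plain = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    tokens = plain.lower().replace("-", " ").split()
    # Sorting makes 'Mgabe Osamba' and 'Osamba Mgabe' compare equal.
    return " ".join(sorted(tokens))


def name_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two normalized names."""
    return SequenceMatcher(None, normalize_name(a), normalize_name(b)).ratio()


if __name__ == "__main__":
    pairs = [
        ("André Ramirez", "Andre Ramires"),
        ("Mgabe Osamba", "Ossamba Mgabe"),
        ("Anna Meier", "Anna da Silveira"),
    ]
    for a, b in pairs:
        print(f"{a!r} vs {b!r}: {name_similarity(a, b):.2f}")
```

Spelling variants and swapped name order score high under this scheme, while a changed surname such as Meier / da Silveira does not. Cases of that kind need an explicit alias list, which is presumably why name aliases are allowed.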
Data cleansing II
Several checks were applied to uncover suspicious data:
- Genotypes without sampling date
- Duplicate genotypes
- Overlapping therapies
- Date checks (e.g. infection < first positive test < first treatment etc.)
- Therapies with 'forbidden' drug combinations
- More than one isolate value (of a kind) in a period of 7 days
- Lab values out of specified range (e.g. HIVRNA > 10,000,000 copies/ml)
- . . .
Data cleansing is hard, time-consuming and tedious work, but absolutely necessary!
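Three of these checks can be sketched in a few lines of Python. The record layout (tuples of ID, start and stop date), the function names and the handling of edge cases are assumptions for illustration; only the HIVRNA upper bound is taken from the slide.

```python
from datetime import date
from itertools import combinations

MAX_HIVRNA = 10_000_000  # copies/ml, upper bound quoted on the slide


def overlapping_therapies(therapies):
    """Return ID pairs of therapies whose time intervals overlap.

    therapies: iterable of (therapy_id, start, stop); stop is None if ongoing.
    Assumption: a therapy ending on the day the next one starts is no overlap.
    """
    conflicts = []
    for (id_a, start_a, stop_a), (id_b, start_b, stop_b) in combinations(therapies, 2):
        end_a = stop_a if stop_a is not None else date.max
        end_b = stop_b if stop_b is not None else date.max
        if start_a < end_b and start_b < end_a:
            conflicts.append((id_a, id_b))
    return conflicts


def dates_in_order(*dates):
    """Check that the known dates are chronological, e.g. infection, first
    positive test, first treatment. Unknown dates (None) are skipped.
    Assumption: equal dates are accepted."""
    known = [d for d in dates if d is not None]
    return all(a <= b for a, b in zip(known, known[1:]))


def viral_load_suspicious(copies_per_ml):
    """Flag HIVRNA values outside the plausible range."""
    return copies_per_ml < 0 or copies_per_ml > MAX_HIVRNA


if __name__ == "__main__":
    therapies = [
        (1, date(2002, 6, 23), date(2003, 11, 5)),
        (2, date(2003, 10, 1), None),  # starts before therapy 1 stops
    ]
    print(overlapping_therapies(therapies))
    print(dates_in_order(date(1995, 1, 1), None, date(1994, 5, 1)))
    print(viral_load_suspicious(165000))
```

Such checks only flag records as suspicious; whether a flagged record is actually wrong still has to be decided by someone who knows the data.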
Data cleansing III
Small errors can have quite big effects:
[Figure: viral load (VL) over time t across a change from Therapy A to Therapy B; labels: TCE 15.03.2002, 12W (15.06.2002), 20.06.2002, VL values 4.0 and 1.6]
Rosie
[Screenshot: two entry forms of Rosie]
- Data quality and consistency were improved by early checks
- The pseudonym is generated in Rosie
Old and new Arevir versions
                 Arevir in June 2005    Arevir in February 2007
Patients         ~4,500                 2,444
Diagnoses        ~2,900                 4,412
Therapies        ~12,000                6,154
CD4 values       ~52,000                53,031
VL values        ~41,000                25,972
Sequences        ~3,000                 2,180
Complete TCEs    ~350                   ~750
Basic statistics
[Figure: bar chart of transmission quotas (scale 0 to 45), male vs. female, by transmission route: blood transfusion, haemophiliac, heterosexual, homosexual, IVDA, pattern II, unknown]
Derived Data
[Figure: percentage of silent mutations per genotype (scale 5 to 14), 1999 to 2006, for PRO (treated), RT (treated), PRO (naïve) and RT (naïve)]
n naïve = 1013, n pretreated = 2327
Collaborations / What Next
- At the moment Arevir receives data mainly from the university clinics of Bonn, Cologne (through Medeora, www.medeora.com) and Düsseldorf.
- The system is open and any contribution is welcome!
- There is a close collaboration with EuResist (www.euresist.org)
Coming soon:
- Integrase genotypes
- Hepatitis B and hepatitis C data
Conclusions
Databases can help to improve the health of people infected with HIV:
- By supporting therapy optimization algorithms
- By enhancing our understanding of HIV
But to extract useful information from a database, a large amount of high-quality data is needed!
Thank you!
Bernd Kupfer, Christian von Behren, Claudia Müller, Clemens Kühn, Daniel Hoffmann, Dörte Hammerschmidt, Gerd Fätkenheuer, Hauke Walter, Heike Krause, Joachim Selbig, Jürgen Klein, Jürgen Rockstroh, Mark Oette