CAGI@AIMM – Update on Community Experiment on Genome Interpretation Silvio Tosatto BioComputing UP, Department of Biology, University of Padova, Italy URL: http://protein.bio.unipd.it/
http://www.genomeinterpretation.org/
Organizers
Steven E. Brenner, University of California, Berkeley
John Moult, IBBR, University of Maryland
Susanna Repo, University of California, Berkeley
Critical Assessment of Genome Interpretation
CASP-like effort for human genome variation interpretation
Molecular, Cellular, Organismal
Goals of the CAGI experiment • Determine the state of the art • Identify progress and innovations • Reveal bottlenecks and guide future effort • Highlight new challenges • Collaboratively develop new approaches
CAGI 2011 experiment
2011: 11 challenges, total of 114 submissions, ~160 registered on the website 2010: 6 challenges, total of 108 submissions, ~60 registered on website
CAGI 2011 participating groups CAGI 2011: 21 participating groups Participated both 2011 and 2010 CAGI 2010: 17 participating groups
Cystathionine β-Synthase (CBS) single amino acid mutations
Pathway: Homocysteine + Serine → (CBS, PLP) → Cystathionine → (cystathionase) → Cysteine
CBS variants associated with homocystinuria
Treat with high dose of B6
Cystathionine β-Synthase (CBS) single amino acid mutations
Total of 84 mutations assessed experimentally

Substituted residue   Growth rate (400 ng/ml PLP)
D140N                 103 +/- 25
A207G                 0
N225S                 70 +/- 12
I264T                 109 +/- 14
W323G                 0
A357G                 104 +/- 19

Dataset provided by Jasper Rine, University of California, Berkeley
Assessed by Iddo Friedberg, Miami University
[Figure: CBS predictions. Axes: experimental relative growth rate, predicted relative growth rate, probability of observing the experimental value]
[Figure: Experimental and predicted relative growth rate, with the probability of observing the experimental value]
CBS Challenge – Spearman’s rank correlation
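Spearman's rank correlation compares the ordering of predicted and experimental values rather than their magnitudes. A minimal Python sketch, assuming tied values receive averaged ranks; the `experimental` list holds the six growth rates shown on the CBS slide, while the `predicted` list is invented purely for illustration and is not any group's submission:

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Growth rates for D140N, A207G, N225S, I264T, W323G, A357G (from the slide).
experimental = [103, 0, 70, 109, 0, 104]
# Hypothetical predictions, invented for illustration only.
predicted = [90, 10, 60, 80, 5, 100]

rho = spearman(experimental, predicted)
print(f"Spearman rho = {rho:.3f}")
```

Because only ranks matter, a method can score well here even when its predicted growth rates are on a different scale from the experiment.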
p53 core domain mutations that restore activity of inactive p53
Baronio R et al. Nucl. Acids Res. 2010; nar.gkq571
14,668 variations to predict

p53 cancer mutation   Rescue mutant
G245S                 N239F
G245S                 F113L
G245S                 S240Y
G245S                 T123P
G245S                 N239Y

Dataset provided by Rick Lathrop and the p53 “cancer rescue” team, University of California, Irvine
Assessed by Gad Getz, Broad Institute
Comparing predictions to ground truth 1: Yana Bromberg Lab 2: Yana Bromberg 3: Yana Bromberg 4: SWITCH Lab, Greet De Baets 5: Rita Casadio Lab 6: George Shackelford Lab 7: Sean Mooney Lab 8: Sean Mooney Lab
ROC curves for submissions M237I 1: Yana Bromberg Lab 2: Yana Bromberg 3: Yana Bromberg 4: SWITCH Lab, Greet De Baets 5: Rita Casadio Lab 6: George Shackelford Lab 7: Sean Mooney Lab 8: Sean Mooney Lab
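A ROC curve is obtained by sweeping a decision threshold over the submitted scores and recording the false-positive and true-positive rates at each step; the area under it (AUC) summarizes the ranking quality. A minimal Python sketch, assuming each variant has one score and a binary ground-truth label (1 = rescues activity); the `scores` and `labels` values are hypothetical and do not come from any submission:

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points, sweeping the threshold from high to low."""
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every item with this score so ties produce a single step.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC points by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Hypothetical scores and ground-truth labels, for illustration only.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = [1, 1, 0, 1, 0, 0]

print(f"AUC = {auc(roc_points(scores, labels)):.3f}")
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect separation of positives from negatives.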
Identify Crohn’s disease patients from healthy individuals Exome sequences from 4 different groups sequenced on different machines in different batches Not a case/control study! Dataset provided by Andre Franke, Christian-Albrechts-University Kiel Assessed by Alexander Morgan, Stanford University
Challenge: Distinguish between exomes of Crohn’s disease patients and healthy individuals Exomes of 56 individuals Multifactorial or complex diseases Who has Crohn’s disease?
Assessment: 42 / 56 have Crohn’s disease
#119 (ySNAP?) #94 (UniPadova)
Today • Personalized genetics has been upon us for some time • How good are we at actually identifying phenotype from whole genome?
Personal genome project (PGP) - Predict individuals’ phenotype
Numerical traits
33. Birth weight (in g)
34. HDL level (in mg/dL) *
35. LDL level (in mg/dL) *
36. Triglyceride level (in mg/dL) *
37. Fasting blood glucose level (in mg/dL)
38. Warfarin dose (in mg)
39. Age at Menarche
40. Annual income (in $)
Dataset provided by George Church, Harvard Medical School
Assessed by Sean Mooney, Buck Institute
The Submitters
• s122, s123: UniPadova (2 submissions) PI: Silvio Tosatto
  – ANNOVAR + literature + database + expert knowledge
  – random prediction
• s125: Netbiolab PI: Insuk Lee
  – SIFT + database (for population frequency) + GWAS
• s126: KarchinLab PI: Rachel Karchin
  – Karchin: Bayes network + database (GWAS)
Late Submission: Shamil Sunyaev’s Lab, Harvard University
The Probabilities in the 10
• Mostly zero

Trait  Name                              Frequency  PositiveNum  PGPCount
    1  Asthma                            0.25       2            8
    2  Crohn's disease                   0          0            8
    3  Ulcerative colitis                0          0            8
    4  Irritable bowel syndrome          0.111      1            9
    5  Rheumatoid arthritis              0          0            8
    6  Type II Diabetes                  0          0            8
    7  Coronary artery disease           0          0            8
    8  Long QT Syndrome                  0          0            8
    9  Hypertrophic cardiomyopathy       0          0            8
   10  Glaucoma                          0.125      1            8
   11  Color blindness                   0.125      1            8
   12  Bipolar disorder                  0          0            8
   13  Celiac disease                    0          0            8
   14  Psoriasis                         0          0            8
   15  Lupus                             0          0            8
   16  Breast cancer                     0          0            8
   17  Prostate cancer                   0          0            8
   18  Migraine                          0          0            8
   19  Lactose intolerance               0          0            7
   20  Dyslexia                          0.125      1            8
   21  Autism                            0          0            8
   22  Osteoporosis                      0          0            7
   23  Incontinence                      0          0            8
   24  Kidney stones                     0          0            8
   25  Varicose veins                    0          0            8
   26  Sleep Apnea                       0.143      1            7
   27  Tongue rolling (tube)             0.875      7            8
   28  Phenylthiocarbamide tasting       1          4            4
   29  Blood type - Has A antigen?       0.625      5            8
   30  Blood type - Has B antigen?       0.143      1            7
   31  Blood type - Is Rh(D) positive?   0.875      7            8
   32  Absolute pitch                    0          0            6
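The Frequency column is consistent with PositiveNum divided by PGPCount, rounded to three decimals. A short Python check on a few rows copied from the slide; the function name `frequency` is introduced here for illustration only:

```python
# (trait name, PositiveNum, PGPCount) for a few rows copied from the slide.
rows = [
    ("Asthma", 2, 8),
    ("Irritable bowel syndrome", 1, 9),
    ("Sleep Apnea", 1, 7),
    ("Tongue rolling (tube)", 7, 8),
    ("Phenylthiocarbamide tasting", 4, 4),
    ("Crohn's disease", 0, 8),
]


def frequency(positive_num, pgp_count):
    """Fraction of individuals with a reported answer who have the trait."""
    return round(positive_num / pgp_count, 3)


for name, positive_num, pgp_count in rows:
    print(f"{name}: {frequency(positive_num, pgp_count)}")
```

With so few individuals per trait, most of these frequencies are zero, so a predictor's assumed prior for each trait has a large effect on its scores.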
The Binary Traits
Results by team – only the Karchin team is statistically significant

Submission   Total Traits  Predicted Traits  Precision  Recall  AUC    P
UniPadova    228           216               0.094      0.3     0.605  0.133
UniPadova    228           228               0.118      0.095   0.405  0.923
Netbiolab    228           220               0.024      0.214   0.225  1
KarchinLab   228           228               0.652      0.714   0.896  0
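Precision is the fraction of positive calls that are correct, and recall is the fraction of true positives that were called. A minimal Python sketch, assuming one binary call per individual-trait pair; the `predicted` and `actual` lists are hypothetical and are not taken from any submission:

```python
def precision_recall(predicted, actual):
    """Return (precision, recall) for paired binary calls and ground truth."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)        # called positive, is positive
    fp = sum(1 for p, a in pairs if p and not a)    # called positive, is negative
    fn = sum(1 for p, a in pairs if not p and a)    # missed positive
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical calls for eight individual-trait pairs, for illustration only.
predicted = [1, 1, 0, 1, 0, 0, 0, 1]
actual = [1, 0, 0, 1, 1, 0, 0, 0]

precision, recall = precision_recall(predicted, actual)
print(f"precision = {precision:.3f}, recall = {recall:.3f}")
```

When positives are rare, as in this trait panel, a method can reach moderate recall while its precision stays low, because even a few false positive calls outnumber the true ones.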
The Binary Traits ‐ ROC Only S126 (Karchin lab) is statistically significant Submissions: S122: UniPadova S123: UniPadova (random) S125: Netbiolab S126: KarchinLab
Numerical traits
We are still in the “game” phase…
Extra Questions
Special questions:
(a) One of the PGP10 individuals has irritable bowel syndrome. Who is that? (Answer: PGP7)
(b) One of the PGP10 individuals is color-blind. Which one? (Answer: PGP10)
(c) One of the PGP10 individuals is not color-blind but she has a color-blind father and an affected son. Who is that? (Answer: PGP9)
Karchin Lab got all correct, UniPadova got one correct
Some conclusions
• Knowledge of the individual gene is important (CBS)
• Methods are highly significant (P-value) but of questionable clinical applicability (r² ~0.7)
• Different methods succeed at different challenges, and with different assessments
• Predictions on the Personal Genome Project panel improved, but largely by better modeling the prior
• Metapredictors are unlikely to yield huge improvements currently
• Unexpected success in predicting Crohn’s disease
CAGI 2012
• Challenges about to be released… (September 2012)
• Conference scheduled for mid-December 2012
Acknowledgements
Organizers: Steven E. Brenner, University of California, Berkeley; John Moult, IBBR, University of Maryland; Susanna Repo, University of California, Berkeley
Data Providers: Adam P. Arkin, UC Berkeley; George Church, Harvard Medical School; Andre Franke, Christian-Albrechts-University Kiel; Joe W. Gray, OHSU; Rick Lathrop, UC Irvine; John Moult, University of Maryland; Jasper Rine, UC Berkeley; Jeremy Sanford, UC Santa Cruz; Nicole Schmitt, University of Copenhagen; Jay Shendure, University of Washington; Michael Snyder, Stanford University; Sean Tavtigian, University of Utah
Assessors: Rui Chen, Stanford University; Gad Getz, Broad Institute; Iddo Friedberg, Miami University; Sean Mooney, Buck Institute; Alexander A. Morgan, Stanford University; Artem Sokolov, University of California, Santa Cruz; Josh Stuart, University of California, Santa Cruz; Sean Tavtigian, University of Utah
Website Development and Administration, Data Analysis: Maya Zuhl, IBBR, University of Maryland; Sri Jyothsna Yeleswarapu, Tata Consultancy Services; Gaurav Pandey, Mount Sinai School of Medicine