In silico blood genotyping from exome sequencing data Silvio Tosatto BioComputing UP, Department of Biology, University of Padova, Italy URL: http://protein.bio.unipd.it/
Today • Personalized genetics has been upon us for some time • How good are we at actually identifying phenotype from a whole genome?
The CAGI Personal Genome Project (PGP) Challenge • Few goals are more pure to genome interpretation than predicting traits from raw sequence (or genotype) data • In this CAGI challenge, phenotypes/traits are predicted for real people with genetic data • Genetic information for 10 individuals from the Personal Genome Project is provided (PGP-10) Dataset provided by George Church
Personal genome project (PGP) ‐ Predict individuals’ phenotype Numerical traits 33. Birth weight (in g) 34. HDL level (in mg/dL) * 35. LDL level (in mg/dL) * 36. Triglyceride level (in mg/dL) * 37. Fasting blood glucose level (in mg/dL) 38. Warfarin dose (in mg) 39. Age at Menarche 40. Annual income (in $)
Blood Groups • Clear genetic cause of phenotypes • Model system for phenotype prediction • Good description in literature • High relevance, especially for blood transfusions (Blood. 2009;114: 248-256)
Example: ABO glycosyltransferase Amino acid residues differing between blood group A- and B-active transferases, respectively (Arg176Gly; Gly235Ser; Leu266Met; Gly268Ala) are shown with the single-letter code and their positions indicated. Blood group: ABO; Gene: ABO; Antigens: A, B, O
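The four residue differences listed on this slide can be turned into a small lookup. The following is a minimal sketch, not BOOGIE's actual code: the function name `classify_transferase` and the input format (a dict from residue position to single-letter amino acid) are assumptions made for illustration, and the O allele is deliberately left out because the slide does not describe it.

```python
# Residues that distinguish A- from B-active transferases, as listed on the slide:
# position -> (residue in A, residue in B), single-letter code.
AB_SITES = {
    176: ("R", "G"),  # Arg176Gly
    235: ("G", "S"),  # Gly235Ser
    266: ("L", "M"),  # Leu266Met
    268: ("G", "A"),  # Gly268Ala
}


def classify_transferase(residues):
    """Return 'A', 'B', or 'undetermined' for a dict {position: amino acid}.

    Hypothetical helper: a call is made only when all four sites agree.
    """
    a_hits = sum(residues.get(pos) == a for pos, (a, _) in AB_SITES.items())
    b_hits = sum(residues.get(pos) == b for pos, (_, b) in AB_SITES.items())
    if a_hits == len(AB_SITES):
        return "A"
    if b_hits == len(AB_SITES):
        return "B"
    return "undetermined"
```

Requiring all four sites to agree is a conservative choice for the sketch; mixed or missing sites are reported as undetermined rather than guessed.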
Relevant Blood Types 10 out of ca. 30 blood groups are relevant for transfusions
Blood group | Genes | Antigens
ABO | ABO | A, B, O
RH | RHCE, RHD | D, E, C plus 50 minor
DUFFY | DARC | FY(a), FY(b)
Kell | KEL | K1, K2 plus 23 minor
Diego | SLC4A1 | Di(a), Di(b), Wr(a), Wr(b)
Kidd | SLC14A1 | Jk(a), Jk(b)
Lewis | FUT3 | a, b
Lutheran | BCAM | Lu(a), Lu(b) plus 15 minor
MNS | GYPA, GYPB, GYPE | M, N, S plus 40 minor
Bombay | FUT1, FUT2 | H, secretor
BOOGIE: BlOOd Group IdEntifier • A knowledge-based system to predict blood groups from sequencing data • All 10 groups relevant for blood transfusions are predicted • A specialized genotype-phenotype knowledge base is required
BOOGIE: Knowledge representation • Stored in tree-like structure • Rules expressed in "if <mutation(s)> then <phenotype(s)>" form
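One way to picture the "if <mutation(s)> then <phenotype(s)>" rules in a tree-like store is the sketch below. It is an illustration under stated assumptions, not BOOGIE's implementation: the `Rule` class, the nesting (blood group, then gene, then rules), the `predict` function, and every identifier such as `GROUP_X` or `GENE_X:mut1` are hypothetical placeholders, and no real rule from the knowledge base is reproduced.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """if <mutations> then <phenotypes>; fires only when all mutations are observed."""
    mutations: frozenset
    phenotypes: tuple


# Tree-like nesting: blood group -> gene -> list of rules.
# All names below are placeholders, not entries from the real knowledge base.
KNOWLEDGE_BASE = {
    "GROUP_X": {
        "GENE_X": [
            Rule(frozenset({"GENE_X:mut1"}), ("antigen1+",)),
            Rule(frozenset({"GENE_X:mut1", "GENE_X:mut2"}), ("antigen2-",)),
        ],
    },
}


def predict(observed):
    """Collect the phenotypes of every rule whose mutations are all observed."""
    observed = set(observed)
    result = {}
    for group, genes in KNOWLEDGE_BASE.items():
        hits = []
        for rules in genes.values():
            for rule in rules:
                if rule.mutations <= observed:  # subset test: all mutations present
                    hits.extend(rule.phenotypes)
        result[group] = hits
    return result
```

Keeping mutations as a set makes multi-mutation rules a plain subset test; how the real system resolves conflicting or overlapping rules is not stated on the slide and is not modeled here.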
BOOGIE: Knowledge collection • Covers the ten blood groups (with their genes and antigens) listed under Relevant Blood Types – Manually curated – 580 rules derived
ANNOVAR (Wang et al., Nucleic Acids Research 2010) • ANNOVAR is used to reduce the SNVs to a manageable number • Workflow: millions of SNVs → gene-based annotation of variants → select conserved positions → remove unrelated genes → few relevant SNVs (relevant variants)
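The "remove unrelated genes" step can be sketched as a simple filter over variants that have already been annotated with a gene name. This is a hedged illustration only: it does not call ANNOVAR, the record layout (dicts with a `gene` key) is an assumption rather than ANNOVAR's output format, and `keep_relevant` is a hypothetical helper. The gene list is taken from the blood group table in this talk.

```python
# Genes from the blood group table in this talk.
BLOOD_GROUP_GENES = {
    "ABO", "RHCE", "RHD", "DARC", "KEL", "SLC4A1", "SLC14A1",
    "FUT3", "BCAM", "GYPA", "GYPB", "FUT1", "FUT2",
    "GYPE",  # written "GYBE" on the slide; assumed here to mean GYPE
}


def keep_relevant(variants):
    """Keep only variants annotated to a blood group gene.

    `variants` is assumed to be an iterable of dicts with a 'gene' key
    (a hypothetical layout, not ANNOVAR's native output).
    """
    return [v for v in variants if v.get("gene") in BLOOD_GROUP_GENES]
```

For example, `keep_relevant([{"gene": "ABO"}, {"gene": "OTHER_GENE"}])` keeps only the first record, which mirrors the reduction from millions of SNVs to a few relevant ones.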
BOOGIE Pipeline [Figure: pipeline diagram; the blood group, genes, and antigens table from Relevant Blood Types is shown again]
Benchmarking • BOOGIE covers all known blood group variants • Difficulty in finding genome sequences with known blood phenotypes • Personal Genome Project (PGP) as annotated benchmark set
Personal Genome Project (PGP) The mission of the PGP is to encourage the development of personal genomics • Genetic information for 10 individuals from the Personal Genome Project is provided (PGP-10) • A larger dataset (PGP-1K) aims to cover at least 1,000 genomes Unfortunately, only ABO and Rh blood group information is available
PGP-10 Data Back row (left to right): James Sherley, Misha Angrist, John Halamka, Keith Batchelder, Rosalynn Gill. Front row (left to right): Esther Dyson, George Church, Kirk Maxey. Not shown: Stan Lapidus and Steven Pinker.
PGP-10 Results BOOGIE correctly predicts all ABO types and all Rh groups except one (PGP-4)
Blood group | PGP1 | PGP4 | PGP8
Known | O + | A - | B +
ABO | O | A | B
Rh | c; e; weak D | c; e; weak D | c; e; weak D
DUFFY | FY(a+); FY(b-) | FY(a-); FY(b+) | FY(a-); FY(b+)
KELL | K2; K21+; K4-; K3-; K11; K17; K14; K24; K6+; K7- | K2; K21+; K4-; K3-; K11; K17; K14; K24; K6+; K7- | K2; K21+; K4-; K3-; K11; K17; K14; K24; K6+; K7-
Diego | Dib; Memph neg | Dib; Memph neg | Dib; Memph neg
KIDD | Jk(a-); Jk(b+) | Jk(a-); Jk(b+) | Jk(a+); Jk(b-)
Lewis | negative | negative | negative
Lutheran | Lu(a-); Lu(b+); Lu6+; Lu9-; Lu4; Lu8+; Aua+; Aub- | Lu(a-); Lu(b+); Lu6-; Lu9+; Lu4-; Lu8+; Aua-; Aub+ | Lu(a-); Lu(b+); Lu6+; Lu9-; Lu4-; Lu8+; Aua+; Aub-
MNS | M; S | M; s | M; s
Bombay | H+; secretor | H+; secretor | H+; secretor
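The comparison between known and predicted ABO/Rh types for the three individuals shown on this slide can be reproduced with a short concordance check. This is a sketch under one explicit assumption: a "weak D" prediction is counted as Rh-positive, which is how the slide's statement (all Rh groups correct except PGP-4) reads. The names `rh_sign` and `concordance` are hypothetical helpers.

```python
# Known and predicted (ABO, Rh) values for the three individuals on the slide.
KNOWN = {"PGP1": ("O", "+"), "PGP4": ("A", "-"), "PGP8": ("B", "+")}
PREDICTED = {"PGP1": ("O", "weak D"), "PGP4": ("A", "weak D"), "PGP8": ("B", "weak D")}


def rh_sign(d_call):
    """Map a predicted D call to '+' or '-'.

    Assumption: any D call, including 'weak D', counts as Rh-positive.
    """
    return "+" if d_call in {"D", "weak D"} else "-"


def concordance(known, predicted):
    """Return (ABO accuracy, sorted list of individuals with an Rh mismatch)."""
    abo_hits = sum(known[p][0] == predicted[p][0] for p in known)
    rh_misses = sorted(p for p in known if known[p][1] != rh_sign(predicted[p][1]))
    return abo_hits / len(known), rh_misses


abo_accuracy, rh_mismatches = concordance(KNOWN, PREDICTED)
```

With these three individuals the ABO accuracy is 1.0 and the only Rh mismatch is PGP4, in line with the slide's summary.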
PGP-1K Results • A second dataset was built from all PGP-1K participants with available blood group information, for a total of 22 individuals • This dataset contains microarray data (23andMe SNPs) Legend: P = predicted; R = real; * = blood-group-relevant SNPs missing from the dataset
Conclusions • We developed a method, called BOOGIE, to predict the ten blood groups relevant for transfusions from sequencing data – Specialized knowledge base with 580 genotype-to-phenotype rules – Novel variants can be easily considered • Benchmarking was (so far) only possible on PGP data for the ABO and Rh blood groups – The ABO and Rh systems are correctly predicted in 85-100% of cases – The Rh-negative type presents some additional difficulties
Acknowledgements Manuel Giollo, Giovanni Minervini, Marta Scalzotto (not shown), Emanuela Leonardi, Carlo Ferrari Funding: FIRB Futuro in Ricerca, Università di Padova, CARIPLO, AIRC URL: http://protein.bio.unipd.it/