My P value is lower than your P value! Beyond GWAS in livestock genomics Joanna Szyda
Motivation P value based inference
Motivation „Biology emerges from pathways, not from single genes” (Eric Lander)
Motivation • Combine various sources of biological information • Use computational resources (data analysis) • Use brain (biological conclusions)
Outline • Data set 1: illustration of methodology and biological conclusions; combine selected sources of information • Data set 2: illustration of the available genetic variability
Data Set 1 SNP
ARSBFGLBAC10172 4408169577_E B B 0.8830
ARSBFGLBAC1020 4408169577_E A B 0.8990
ARSBFGLBAC10245 4408169577_E B B 0.6582
ARSBFGLBAC10345 4408169577_E A B 0.9092
ARSBFGLBAC10365 4408169577_E B B 0.8021
ARSBFGLBAC10375 4408169577_E B B 0.8858
ARSBFGLBAC10591 4408169577_E A A 0.8670
ARSBFGLBAC10793 4408169577_E B B 0.8722
ARSBFGLBAC10867 4408169577_E A A 0.9316
ARSBFGLBAC10919 4408169577_E A B 0.7805
ARSBFGLBAC10952 4408169577_E A B 0.9314
ARSBFGLBAC10960 4408169577_E A B 0.5666
ARSBFGLBAC10975 4408169577_E A B 0.8665
ARSBFGLBAC10986 4408169577_E A B 0.8687
ARSBFGLBAC10993 4408169577_E B B 0.8146
ARSBFGLBAC11000 4408169577_E A A 0.9135
ARSBFGLBAC11003 4408169577_E A A 0.9454
ARSBFGLBAC11007 4408169577_E B B 0.9106
ARSBFGLBAC11025 4408169577_E B B 0.8742
ARSBFGLBAC11028 4408169577_E A A 0.8534
ARSBFGLBAC11034 4408169577_E B B 0.5769
ARSBFGLBAC11039 4408169577_E B B 0.8987
Data Set 1 SNP • 2 601 HF bulls, black-white & red-white • pedigree: 10 355 individuals • SNP: Illumina 50K chip, SNP positions, pairwise LD • Gene: genomic position (Ensembl), Gene Ontology terms (GO), metabolic pathways (KEGG) • Phenotype: deregressed national EBV, complex inheritance mode
Data set 1 SNP effect estimation • y: deregressed EBV for protein yield • µ: general mean • q: additive SNP effect • Z: SNP genotype codes { -1, 0, 1 } • e: residual
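The symbols on this slide suggest a linear model of the form y = µ + Zq + e. The sketch below fits that model for one SNP at a time by ordinary least squares. This is an assumption made for illustration: the slide does not state how the SNP effects were estimated, and the function name and toy data are hypothetical.

```python
from statistics import mean


def estimate_snp_effect(y, z):
    """Least-squares fit of y = mu + z*q + e for a single SNP.

    y: phenotypes (here standing in for deregressed EBV)
    z: genotype codes in {-1, 0, 1}
    Returns (mu_hat, q_hat).
    """
    if len(y) != len(z):
        raise ValueError("y and z must have the same length")
    z_bar, y_bar = mean(z), mean(y)
    # Sum of squares of the centred genotype codes.
    szz = sum((zi - z_bar) ** 2 for zi in z)
    if szz == 0:
        raise ValueError("SNP is monomorphic in this sample")
    # Cross-product of centred genotype codes and phenotypes.
    szy = sum((zi - z_bar) * (yi - y_bar) for zi, yi in zip(z, y))
    q_hat = szy / szz
    mu_hat = y_bar - q_hat * z_bar
    return mu_hat, q_hat


# Hypothetical toy data: six animals, one SNP.
toy_z = [-1, 0, 1, 1, 0, -1]
toy_y = [8.0, 10.0, 12.0, 12.0, 10.0, 8.0]
mu_hat, q_hat = estimate_snp_effect(toy_y, toy_z)
print(mu_hat, q_hat)
```

With tens of thousands of SNPs, a joint model with shrinkage is a common alternative; the single-SNP fit is shown only because it keeps the role of each symbol visible.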
Data set 1 gene networks identify physiological processes underlying complex traits + corresponding genes
Data set 1 gene effect estimation • 46 267 SNP estimates: varying LD to causal variants; multiple testing correction; only the most significant SNP associations detected (plot axis: -log10 P) • 4 345 gene estimates: SNPs within / close to genes; better interpretation • 6 „major” genes for PY: LHX8, HEPHL1, DHX34, FBP2, TANC2, AP1B1 • BTA: 3, 8, 17, 18, 19, 29 • … find the other genes
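To illustrate why only the most significant SNP associations survive a multiple testing correction, the sketch below computes a Bonferroni threshold for the 46 267 SNP tests. Both the Bonferroni method and the 0.05 significance level are assumptions; the slide does not name the correction that was applied.

```python
import math

N_SNP_TESTS = 46_267  # number of SNP estimates given on the slide
ALPHA = 0.05          # assumed family-wise significance level


def bonferroni_threshold(alpha, n_tests):
    """Per-test P value threshold under a Bonferroni correction."""
    return alpha / n_tests


def minus_log10(p):
    """Convert a P value to the -log10 scale used on Manhattan plots."""
    return -math.log10(p)


threshold = bonferroni_threshold(ALPHA, N_SNP_TESTS)
print(threshold, minus_log10(threshold))
```

Under these assumptions a SNP needs a -log10 P just under 6 to be declared significant, which is why moderate effects go undetected.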
Data set 1 network construction for PY • 44 genes • 660 GO • 75 KEGG
Data set 1 network validation • EBV permutation × 100 • for each permuted data set: SNP effect estimation → gene effect estimation → gene selection → network construction • functional information: GO, KEGG
Data set 1 testing functional features • For each GO / KEGG: odds for the original data / odds for permuted data
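A minimal sketch of this comparison follows. It assumes that the odds for a GO term or KEGG pathway are the odds that a selected gene is annotated with it, and that each permuted data set yields its own gene selection. Those definitions, the 0.5 continuity correction, and all gene names are assumptions, not details taken from the slides.

```python
def term_odds(selected_genes, term_genes):
    """Odds that a selected gene is annotated with a given term."""
    selected = set(selected_genes)
    hits = len(selected & set(term_genes))
    misses = len(selected) - hits
    # Adding 0.5 to both counts (Haldane-Anscombe correction) avoids
    # division by zero when every or no selected gene carries the term.
    return (hits + 0.5) / (misses + 0.5)


def odds_ratios(original_selection, permuted_selections, term_genes):
    """Ratio of original-data odds to the odds from each permutation."""
    original = term_odds(original_selection, term_genes)
    return [original / term_odds(p, term_genes) for p in permuted_selections]


# Hypothetical toy data: one term, one original and two permuted selections.
term = {"g1", "g2", "g3"}
original = ["g1", "g2", "g3", "g4"]
permuted = [["g1", "g5", "g6", "g7"], ["g4", "g5", "g6", "g7"]]
ratios = odds_ratios(original, permuted, term)
print(ratios)
```

Ratios consistently above 1 across the permutations would indicate that the term is over-represented among the genes selected from the original data.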
Data set 1 results Significant KEGG pathways for PY (examples) • Lysosome (bta04142), CI: 8.8-51.7 → P < 0.00001: protein degradation, tissue regression, inflammation • Cell cycle (bta04110), CI: 3.0-11.4 → P = 0.00005: development of mammary epithelium • Pentose phosphate (bta00030), CI: 7.5-245 → P = 0.00588: NADPH production in tissues engaged in biosynthesis
Data set 1 trait similarity identify similarities between complex traits
Data set 1 trait similarity [Diagram: the GO terms / genes of two trait networks are compared to give the trait similarity]
Data set 1 similarity metrics Cosine metric and Jaccard metric, where: • N_ij: number of GO / genes shared by the networks for traits i and j • N_i: number of GO / genes in the network for trait i • N_j: number of GO / genes in the network for trait j
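The formulas themselves did not survive on this slide. The sketch below uses the standard set-based definitions, cosine = N_ij / sqrt(N_i · N_j) and Jaccard = N_ij / (N_i + N_j - N_ij), on the assumption that these are the forms intended. The gene names are hypothetical.

```python
import math


def cosine_similarity(network_i, network_j):
    """Set-based cosine metric: N_ij / sqrt(N_i * N_j)."""
    a, b = set(network_i), set(network_j)
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))


def jaccard_similarity(network_i, network_j):
    """Jaccard metric: N_ij / (N_i + N_j - N_ij)."""
    a, b = set(network_i), set(network_j)
    union = len(a | b)
    return len(a & b) / union if union else 0.0


# Hypothetical gene networks for two traits.
network_py = {"g1", "g2", "g3", "g4"}
network_fy = {"g3", "g4", "g5"}
cos_sim = cosine_similarity(network_py, network_fy)
jac_sim = jaccard_similarity(network_py, network_fy)
print(cos_sim, jac_sim)
```

For the same pair of networks the Jaccard value is never larger than the cosine value, so the two metrics should be compared within, not across, metric.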
Data set 1 results Similarity between traits [Bar chart: similarity (0.0 to 0.7) by genes cosine, genes Jaccard and GO Jaccard for the trait pairs PY, FY; PY, MY; PY, SCS; PY, STA; FY, MY; FY, SCS; FY, STA; MY, SCS; MY, STA; SCS, STA]
Data Set 2 DNA sequence Far more informative data are available for this purpose
Data Set 2 DNA sequence @HWI-1KL157:67:D2AGFACXX:1:2316:10694:65033 2:N:0: CTATTACACGCCCCCGAAGCTCTAGCGGGTGTTCTCACGCACCCAAGGCATCCTCAACCACCACCATTTCTG + CCCFFADFHHGHHJJGGIIG@HIIFEHIJ;@F@DGGGGCCEB8BCDDDDBACDDCDDDBDDBDDDBDDDEE @HWI-1KL157:67:D2AGFACXX:1:2316:10671:65034 2:N:0: AGTGTATTACTGTCTTTGCACTCTTTAATCCTAGGTGACTTTTGGGGGTTCAGTATCAGATAGAGAACATATT + ?@@ADDDDHDBFHCEHIIBHEHEEHEH>BF?EFHCHFGFGFHH@HIG:6@=CGICAGG=7@@CHG===7 @HWI-1KL157:67:D2AGFACXX:1:2316:10609:65040 2:N:0: CTGGAGTGGGTATCCTTTCCCTTATCCAGGTTATCTTCCCAACCCAGGGATTGAACCCAGGTATCCTGGATT + @CCFDD2AFHDH<AFHII4CGIIJIJJGGIGIIJIIIJJJIHHIJJJIJEFGGICHHGGIIIHEHIHHGHHHFFFFFDDDDDD @HWI-1KL157:67:D2AGFACXX:1:2316:10717:65046 2:N:0: TACTCAAAAGAATCTGTGTTTAGACAGTTTAGAACATCTCCTACCTCTCACAGTTGGGAGGCTCTGAACAAT + @@@DD;DDHDBCFBEGGDHGHI<FBHIAEHE@GGEEFFHGDGIHGIGIIGBGGFGHIAFEGGHGIIIIIIEHH @HWI-1KL157:67:D2AGFACXX:1:2316:10507:65046 2:N:0: GAAGAAAAACTGTGTTTATGTCTCGAACATAATAAAGTCAACATGGATTATGTTAACTGTAATTGTACATCTA + @@@DDDDBHHHHBDBBHBHH3ACHHIIGBHIGCHGHGHIHHEGHII?4BFBDHHIGIDGDGFCCBF@FHI @HWI-1KL157:67:D2AGFACXX:1:2316:10653:65048 2:N:0: TATTGAAAACCTACCTACTAGGTAAATCTTAAGTAGGTTTAATCATGTCCACGTTTCCACTTGTTCACTCATTC
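The reads above are in FASTQ format: a header line starting with "@", the base sequence, a "+" separator, and one quality character per base. The sketch below parses such records and decodes the qualities, assuming the Phred+33 encoding that the quality characters shown suggest. The toy record is hypothetical, not one of the reads from the slide.

```python
def read_fastq(lines):
    """Yield (read_id, sequence, phred_scores) from FASTQ text lines."""
    stripped = [line.strip() for line in lines if line.strip()]
    if len(stripped) % 4 != 0:
        raise ValueError("FASTQ input must contain four lines per read")
    for i in range(0, len(stripped), 4):
        header, seq, plus, qual = stripped[i:i + 4]
        record_no = i // 4 + 1
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record {record_no}")
        if len(seq) != len(qual):
            raise ValueError(f"record {record_no}: sequence/quality length mismatch")
        # Phred+33 (assumed): quality score = ASCII code of the character - 33.
        phred = [ord(c) - 33 for c in qual]
        yield header[1:], seq, phred


# Hypothetical toy record with four bases.
toy_lines = ["@read1 2:N:0:", "ACGT", "+", "II5!"]
records = list(read_fastq(toy_lines))
print(records)
```

The length check matters in practice: a record whose quality string is shorter or longer than its sequence is truncated or corrupted and should not be passed on to alignment.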
Data Set 2 DNA sequence • 32 HF cows, paternal half-sibs • whole genome DNA sequence: Illumina HiSeq • alignment to the UMD3.1 reference genome: BWA, Smalt • variant calling: FreeBayes, GATK, Samtools, CNVnator
Data set 2 genomic variability • describe genetic variability at the DNA level • basis for complex trait modelling
Data set 2 averaged coverage Genome averaged coverage for each cow [Bar chart: coverage by cow ID, cows 1 to 32] • min: 5 • max: 17
Data set 2 coverage along the genome Chromosome-wise coverage for a particular cow [Plots: coverage along BTA01, BTA10, BTA20 and BTX; mean coverage values shown: 8.56, 8.03, 8.14, 8.60]
Data set 2 SNPs Total number of identified SNPs [Bar chart: # SNP by cow ID, cows 1 to 32] • min: 2 063 811 (0.08% of genome) • max: 6 117 976 (0.23% of genome) • sd: 663 223 • sd_-32: 216 861 • χ² P < 10^-4
Data set 2 SNPs Total number of identified SNPs [Bar chart: total number of SNPs by chromosome (BTA); insets: % of SNPs with 3 alleles and % of SNPs with 4 alleles by BTA] • 15 272 427 • 99.16% biallelic
Data set 2 SNPs Missense SNPs [Bar charts: number of missense SNPs and missense SNP density for three gene categories: HK (Housekeeping), SS (Strong Selection), NS (Neutral to Selection)]
Data Set 2 SNPs • Housekeeping: beta Actin, Beta-2-microglobulin, Glyceraldehyde-3-phosphate, Hydroxymethylbilane synthase, beta Heat shock 90kDa protein 1, Ubiquitin C • Strong Selection: diacylglycerol O-acyltransferase 1, alpha 6 integrin, ADP-ribosylation factor-like 4A, bone morphogenetic protein 4, myeloid differentiation primary response • Neutral to Selection: URI1 prefoldin-like chaperone, low density lipoprotein receptor-related protein, ATP/GTP binding protein 1, ankyrin repeat domain 32, spectrin repeat containing, nuclear envelope 2
Data set 2 SNPs Missense SNPs • ANOVA: SNP density = category + gene(category); F-test P = 0.230 for category, P = 0.008 for gene(category) • ANOVA: #SNP = category + gene(category); F-test P < 10^-4 for category, P < 10^-4 for gene(category) • groups: Housekeeping & Strong Selection; Neutral to Selection
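The nested model category + gene(category) can be sketched as a decomposition of sums of squares. The version below treats both factors as fixed and tests each against the residual mean square. That choice, the data layout, and the toy values are assumptions; the slide does not say how the observations were defined or which error term was used. Only the F statistics are computed, since P values would need an F distribution that the Python standard library does not provide.

```python
from statistics import mean


def nested_anova(data):
    """F statistics for y = category + gene(category) + error.

    data: {category: {gene: [observations]}}
    """
    all_obs = [v for genes in data.values() for obs in genes.values() for v in obs]
    grand_mean = mean(all_obs)
    ss_cat = ss_gene = ss_err = 0.0
    n_genes = 0
    for genes in data.values():
        cat_obs = [v for obs in genes.values() for v in obs]
        cat_mean = mean(cat_obs)
        # Between-category variation.
        ss_cat += len(cat_obs) * (cat_mean - grand_mean) ** 2
        for obs in genes.values():
            gene_mean = mean(obs)
            # Variation between genes within the same category.
            ss_gene += len(obs) * (gene_mean - cat_mean) ** 2
            # Residual variation within a gene.
            ss_err += sum((v - gene_mean) ** 2 for v in obs)
            n_genes += 1
    df_cat = len(data) - 1
    df_gene = n_genes - len(data)
    df_err = len(all_obs) - n_genes
    ms_err = ss_err / df_err
    return {
        "F_category": (ss_cat / df_cat) / ms_err,
        "F_gene_within_category": (ss_gene / df_gene) / ms_err,
    }


# Hypothetical toy data: two categories, two genes each, two observations per gene.
toy = {
    "HK": {"geneA": [1.0, 2.0], "geneB": [2.0, 3.0]},
    "NS": {"geneC": [5.0, 6.0], "geneD": [7.0, 8.0]},
}
result = nested_anova(toy)
print(result)
```

If genes were treated as a random sample within each category, the category effect would instead be tested against the gene(category) mean square, which generally gives a more conservative test.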