Genome Wide SNP Selection with Entropy Based Methods Zhenqiu Liu - PowerPoint PPT Presentation

  1. Genome Wide SNP Selection with Entropy Based Methods Zhenqiu Liu University of Maryland Greenebaum Cancer Center Genome Wide SNP Selection with Entropy Based Methods – p. 1/40

  2. The Genetic Diversity in Humans: Any two unrelated people are 99.9% identical in DNA sequence. The remaining 0.1% difference can help explain why one person has distinct physical features, is more susceptible to a disease, or responds differently to a drug or an environmental factor than another person.

  3. Background: The goal of much genetic research is to find genes that contribute to disease. Finding these genes should allow an understanding of the disease process, so that methods for preventing and treating the disease can be developed. For “single-gene disorders”, current methods are usually sufficient.

  4. Background: Most people, however, don't have single-gene disorders, but develop common diseases such as heart disease, stroke, diabetes, cancers or psychiatric disorders, which are affected by many genes and environmental factors. Common-Disease/Common-Variant Theory: The genetic contribution to these diseases is not clear, but many researchers consider common variants to be important.

  5. Single Nucleotide Polymorphisms: A SNP is a single nucleotide site where exactly two (of four) different nucleotides occur in a large percentage of the population. For example, 30% of the chromosomes may have an A, and 70% may have a G (at a specific site). These two forms, A and G, are called variants or alleles of that SNP. An individual may have a genotype for that SNP that is AA, AG, or GG.

  6. Genotype and Haplotype: Diploid populations (e.g., humans) have two copies of each chromosome (one copy inherited from the father and the other from the mother). The collection of SNP variants on a single chromosome copy is a haplotype. The conflated (mixed) data from the two haplotypes is called a genotype.

  7. An Example: Each individual has two “copies” of each chromosome. At each site, each chromosome has one of two alleles (states), denoted by 0 and 1 (motivated by SNPs).
     Haplotypes for the individual: 0 1 1 1 0 0 1 1 0 and 1 1 0 1 0 0 1 0 0.
     Genotype for the individual: 2 1 2 1 0 0 1 2 0, where 2 marks a site at which the two copies differ and 0 or 1 marks a site at which both copies carry that allele.
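The conflation in slide 7 can be sketched in a few lines of Python. The function name `genotype_from_haplotypes` is hypothetical, and the encoding (0 or 1 for a homozygous site, 2 for a heterozygous site) is inferred from the slide's example rather than stated in it.

```python
def genotype_from_haplotypes(hap_a, hap_b):
    """Conflate two haplotypes (sequences of 0/1 alleles) into one genotype.

    Encoding assumed from the slide's example: 0 or 1 when both copies carry
    that allele (homozygous), 2 when the two copies differ (heterozygous).
    """
    if len(hap_a) != len(hap_b):
        raise ValueError("haplotypes must cover the same sites")
    return [a if a == b else 2 for a, b in zip(hap_a, hap_b)]


# The two haplotypes from slide 7.
hap_1 = [0, 1, 1, 1, 0, 0, 1, 1, 0]
hap_2 = [1, 1, 0, 1, 0, 0, 1, 0, 0]
genotype = genotype_from_haplotypes(hap_1, hap_2)
print(genotype)  # [2, 1, 2, 1, 0, 0, 1, 2, 0]
```

The mapping is not invertible: a genotype with several heterozygous sites is compatible with more than one pair of haplotypes, which is why haplotypes usually have to be inferred.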

  8. A Graphic Presentation

  9. SNP Association Studies: (1) SNP Discovery: Where do I find SNPs to use in my association studies? (e.g. databases, direct resequencing) (2) SNP Selection: How do I choose SNPs that are informative? (i.e. assessing SNP correlation, or linkage disequilibrium) (3) SNP Associations: How do I find a gene or a group of SNPs associated with disease? (4) SNP Replication/Function: How is function predicted or assessed?

  10. Pairwise LD Measure $r^2$: Consider two bi-allelic markers, locus 1 with alleles $A$, $a$ and locus 2 with alleles $B$, $b$. The allele frequencies are $P_A$, $P_a$, $P_B$, $P_b$ and the haplotype frequencies are $P_{AB}$, $P_{Ab}$, $P_{aB}$, $P_{ab}$. The $r^2$ measure is $r^2 = \frac{(P_{AB} P_{ab} - P_{aB} P_{Ab})^2}{P_A P_B P_a P_b}$.
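As a rough illustration of the formula in slide 10, the sketch below computes $r^2$ from the four haplotype frequencies and derives the allele frequencies as their marginals. The function name `r_squared` and the example frequencies are assumptions made for illustration only.

```python
def r_squared(p_AB, p_Ab, p_aB, p_ab):
    """Pairwise LD measure r^2 from the four two-locus haplotype frequencies."""
    # Allele frequencies are the marginals of the haplotype frequencies.
    p_A, p_a = p_AB + p_Ab, p_aB + p_ab
    p_B, p_b = p_AB + p_aB, p_Ab + p_ab
    denom = p_A * p_B * p_a * p_b
    if denom == 0:
        raise ValueError("r^2 is undefined when a locus is monomorphic")
    d = p_AB * p_ab - p_aB * p_Ab
    return d * d / denom


# Complete LD: only haplotypes AB and ab are observed.
complete_ld = r_squared(0.9, 0.0, 0.0, 0.1)
# Linkage equilibrium: haplotype frequencies are products of allele frequencies.
equilibrium = r_squared(0.15, 0.15, 0.35, 0.35)
print(round(complete_ld, 6), round(equilibrium, 6))
```

The first call evaluates to 1 (up to floating-point rounding) and the second to 0, the two extremes of the measure.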

  11. Output with Haploview

  14. Objectives: (1) A multilocus LD measure (ER) based on generalized mutual information. (2) A criterion $\omega(\lambda)$ for tagging SNP selection that combines joint information and ER. (3) Algorithms for SNP selection (tagging).

  16. Introduction: Classical LD measures such as $D'$ and $r^2$ quantify pairwise LD between two loci. They cannot provide a direct measure of LD for multiple loci. The multilocus LD measure ε proposed by Nothnagel et al. (2002) is useful in many applications. It is defined as $\varepsilon = \frac{H_E - H}{H_E}$.

  17. Definition of ε: Given a chromosomal segment containing $n$ SNPs, let $p_j$ be the frequency of the major allele of the $j$th SNP, $j = 1, \dots, n$. Suppose there are $m$ observed haplotypes with frequencies $q_i$, $i = 1, \dots, m$. Then the entropy of the haplotype distribution is defined as $H = -\sum_{i=1}^{m} q_i \log_2(q_i)$. Under the assumption of linkage equilibrium, the expected frequency of the $k$th of the $2^n$ possible haplotypes is $q_k^E = \prod_{j=1}^{n} p_j^{I_j^k} (1 - p_j)^{1 - I_j^k}$.

  19. ε Continued: Here $I_j^k$ is an indicator function taking the value 0 or 1 (1 when haplotype $k$ carries the major allele at SNP $j$). Then $H_E = -\sum_{k=1}^{2^n} q_k^E \log_2(q_k^E)$ and $\varepsilon = \frac{H_E - H}{H_E}$, with $0 \le \varepsilon < 1$; ε can never reach 1. The larger the ε, the greater the LD.
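The definition of ε in slides 17 to 19 can be sketched directly. The code below is a minimal illustration, not the authors' implementation: the function names and the input format (a dict mapping haplotype tuples to frequencies, with 1 coding the major allele and 0 the minor allele) are assumptions. It enumerates all $2^n$ equilibrium haplotypes exactly as the formula does, which also shows why the measure becomes expensive as $n$ grows.

```python
import math
from itertools import product


def entropy(probs):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def epsilon(hap_freqs):
    """Multilocus LD measure (H_E - H) / H_E.

    hap_freqs: dict mapping haplotype tuples of 0/1 to frequencies, where
    1 codes the major allele (an assumed encoding for this sketch).
    """
    n = len(next(iter(hap_freqs)))
    h = entropy(hap_freqs.values())
    # p[j]: frequency of the major allele at SNP j.
    p = [sum(f for hap, f in hap_freqs.items() if hap[j] == 1) for j in range(n)]
    # Expected frequency of each of the 2^n haplotypes under linkage equilibrium.
    q_e = []
    for indicators in product((0, 1), repeat=n):
        q = 1.0
        for p_j, i_j in zip(p, indicators):
            q *= p_j if i_j == 1 else 1.0 - p_j
        q_e.append(q)
    h_e = entropy(q_e)
    if h_e == 0:
        raise ValueError("epsilon is undefined when every SNP is monomorphic")
    return (h_e - h) / h_e


# Two haplotypes in complete LD with frequencies 0.9 and 0.1.
two_loci = {(1, 1): 0.9, (0, 0): 0.1}
three_loci = {(1, 1, 1): 0.9, (0, 0, 0): 0.1}
print(round(epsilon(two_loci), 2), round(epsilon(three_loci), 2))
```

For these complete-LD inputs the value changes with the number of loci (about 0.5 for two loci and 0.67 for three), even though the LD itself is unchanged.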

  22. Drawbacks of ε: (1) ε can never reach its upper bound of 1. (2) For a block in which all SNPs are in complete LD, the value of ε depends on the number of SNPs considered. (3) It is computationally inefficient.

  24. Our Work: To overcome the above drawbacks, we proposed an ER measure. We also proposed a criterion and algorithms for SNP selection using the ER measure.

  25. Multilocus LD Measure ER: Assume that each haplotype has $n$ markers and that there are $m$ haplotypes overall. Let $x_i$ be the $i$th haplotype and $x_{ij}$ the allele at locus $j$ of haplotype $i$. Our LD measure is
      $$E = \sum_{i=1}^{m} p(x_i) \log_2 \frac{p(x_i)}{\prod_{j=1}^{n} p_j(x_{ij})}. \qquad (1)$$
      Because of the properties of the K-L distance, this LD measure is nonnegative and is zero if and only if the variables are independent. The measure is also bounded.

  26. ER Continued: The bound can be expressed in terms of the entropies of the component variables:
      $$E \le \sum_{j=1}^{n} H(x_j) - \max_j H(x_j) = E_{\max}.$$
      Consequently, we can use the normalized LD measure
      $$ER = \frac{E}{E_{\max}} = \frac{\sum_{i=1}^{m} p(x_i) \log_2 \frac{p(x_i)}{\prod_{j=1}^{n} p_j(x_{ij})}}{\sum_{j=1}^{n} H(x_j) - \max_j H(x_j)}. \qquad (2)$$
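Equations (1) and (2) translate into a short computation. The sketch below is one possible reading of them, not the authors' code: the function names and the dict-of-haplotype-tuples input are assumptions, and alleles may be any hashable labels.

```python
import math


def entropy(probs):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def er_measure(hap_freqs):
    """Normalized multilocus LD: ER = E / E_max, following equations (1) and (2).

    hap_freqs: dict mapping haplotype tuples to frequencies (assumed format).
    """
    n = len(next(iter(hap_freqs)))
    # Marginal allele distribution p_j at each locus j.
    marginals = []
    for j in range(n):
        dist = {}
        for hap, f in hap_freqs.items():
            dist[hap[j]] = dist.get(hap[j], 0.0) + f
        marginals.append(dist)
    # Equation (1): K-L distance between the joint haplotype distribution
    # and the product of the per-locus marginals.
    e = 0.0
    for hap, f in hap_freqs.items():
        if f > 0:
            independent = math.prod(marginals[j][hap[j]] for j in range(n))
            e += f * math.log2(f / independent)
    # E_max: sum of the per-locus entropies minus the largest of them.
    locus_entropy = [entropy(dist.values()) for dist in marginals]
    e_max = sum(locus_entropy) - max(locus_entropy)
    if e_max == 0:
        raise ValueError("ER needs at least two polymorphic loci")
    return e / e_max


# Complete LD (the two haplotypes of Example 1, restricted to three loci).
complete_ld = er_measure({(1, 1, 1): 0.9, (2, 2, 2): 0.1})
# Two independent loci with equal allele frequencies.
equilibrium = er_measure({(1, 1): 0.25, (1, 2): 0.25, (2, 1): 0.25, (2, 2): 0.25})
print(round(complete_ld, 6), round(equilibrium, 6))
```

Unlike ε, the work here scales with the number of observed haplotypes rather than with $2^n$, since only observed haplotypes enter the sum in equation (1).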

  27. Properties of ER: (1) $0 \le ER \le 1$; ER is 0 when the SNPs are in complete linkage equilibrium (LE) and 1 when they are in complete LD. (2) For two loci, $ER \approx r^2$ under certain conditions.

  28. Criterion for SNP Selection: The criterion for selecting tagging SNPs is defined as $\omega(S, \lambda) = (1 - \lambda)\, HD(S) + \lambda\, (1 - ER(S))$, where $HD(S) = \frac{H(S)}{H(X)}$ represents the normalized joint information of the selected SNPs, $0 \le \lambda \le 1$, and $ER(S)$ is the multilocus LD measure for the selected SNPs. With the proposed criterion $\omega$, we can select SNPs either by exhaustive search or by forward (or backward, or stepwise) selection.

  29. FSA($\lambda$): (1) Set the predetermined constants $\delta_1$, $\delta_2$, and $\lambda$, and the maximum number $N$ of selected SNPs. (2) Choose the first SNP $X_j$ that maximizes the entropy $H(X_j)$. Then set $t = 1$ and $X_s^t = \{X_j\}$. (3) Let $j = \arg\max_k \{\omega_k,\ k \in X_{-s}^t\}$, where $X_{-s}^t$ contains the remaining SNPs not in $X_s^t$. If $\frac{H(S)}{H(X)} > \delta_1$ or $ER(S) > \delta_2$ (or $t > N$, an additional criterion if one desires), then the algorithm is terminated and $X_s^t$ is the set of selected SNPs; otherwise, set $X_s^{t+1} = \{X_s^t, X_j\}$ and go back to step 3.
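The forward selection steps of slide 29 can be sketched as follows. This is an interpretation under stated assumptions, not the authors' implementation: $\omega_k$ is taken to be $\omega$ evaluated on the current set plus candidate $k$, ER of a single SNP is treated as 0, the stopping test is applied to the current set before each addition, and all function names, default thresholds, and the example data are hypothetical. ER is computed through the identity $E = \sum_j H(x_j) - H(x)$, which follows from expanding the logarithm in equation (1).

```python
import math


def entropy(probs):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def joint_entropy(hap_freqs, loci):
    """Entropy of the haplotype distribution restricted to the given loci."""
    projected = {}
    for hap, f in hap_freqs.items():
        key = tuple(hap[j] for j in loci)
        projected[key] = projected.get(key, 0.0) + f
    return entropy(projected.values())


def er(hap_freqs, loci):
    """ER on a subset of loci, using E = sum_j H(x_j) - H(x)."""
    locus_h = [joint_entropy(hap_freqs, [j]) for j in loci]
    e_max = sum(locus_h) - max(locus_h)
    if e_max <= 0:
        return 0.0  # assumption: a single or monomorphic SNP carries no LD
    return (sum(locus_h) - joint_entropy(hap_freqs, loci)) / e_max


def omega(hap_freqs, loci, lam, h_total):
    """Selection criterion (1 - lam) * HD(S) + lam * (1 - ER(S))."""
    hd = joint_entropy(hap_freqs, loci) / h_total
    return (1 - lam) * hd + lam * (1 - er(hap_freqs, loci))


def fsa(hap_freqs, lam=0.5, delta1=0.95, delta2=0.9, max_snps=None):
    """Forward selection of tagging SNPs; returns the selected locus indices."""
    n = len(next(iter(hap_freqs)))
    limit = n if max_snps is None else min(max_snps, n)
    h_total = joint_entropy(hap_freqs, range(n))
    if h_total == 0:
        raise ValueError("all haplotypes are identical; nothing to select")
    # Step 2: start from the SNP with the largest entropy.
    selected = [max(range(n), key=lambda j: joint_entropy(hap_freqs, [j]))]
    # Step 3: add the best remaining SNP until a stopping rule fires.
    while len(selected) < limit:
        if (joint_entropy(hap_freqs, selected) / h_total > delta1
                or er(hap_freqs, selected) > delta2):
            break
        remaining = [k for k in range(n) if k not in selected]
        best = max(remaining,
                   key=lambda k: omega(hap_freqs, selected + [k], lam, h_total))
        selected.append(best)
    return selected


# Hypothetical data: loci 0 and 1 are in complete LD, locus 2 is only
# weakly associated with them.
example = {(0, 0, 0): 0.4, (0, 0, 1): 0.1, (1, 1, 0): 0.1, (1, 1, 1): 0.4}
print(fsa(example))
```

On this toy input the procedure keeps one SNP of the redundant pair plus locus 2 and then stops, because the two selected SNPs already capture all of the haplotype entropy.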

  30. Assessment of ER: Example 1: There are only two haplotypes, 1111111111 and 2222222222, with frequencies 0.9 and 0.1 respectively. The values of ER, ε and $r^2$ are given in the following table.

      Table 1: LD outputs with various window sizes

      | No. Loci | ER  | ε    | $r^2$ |
      |----------|-----|------|-------|
      | 2        | 1.0 | 0.50 | 1.0   |
      | 3        | 1.0 | 0.67 | -     |
      | 4        | 1.0 | 0.75 | -     |
      | 5        | 1.0 | 0.80 | -     |
      | 10       | 1.0 | 0.9  | -     |
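The pattern in Table 1 can be checked by hand. The short derivation below is a worked sketch that uses two standard facts not spelled out in the slides: under linkage equilibrium the loci are independent, so $H_E$ is the sum of the per-locus entropies, and expanding the logarithm in equation (1) gives $E = \sum_j H(x_j) - H$. Here $h$ denotes the entropy of a two-outcome distribution with probabilities 0.9 and 0.1, and $n$ is the number of loci in the window.

```latex
\begin{align*}
h &= -0.9\log_2 0.9 - 0.1\log_2 0.1 \approx 0.469 \text{ bits},\\
H &= h, \qquad H(x_j) = h \ \text{for every locus } j, \qquad
H_E = \sum_{j=1}^{n} H(x_j) = n h,\\
\varepsilon &= \frac{H_E - H}{H_E} = \frac{n h - h}{n h} = 1 - \frac{1}{n},\\
E &= \sum_{j=1}^{n} H(x_j) - H = (n-1)\,h, \qquad
E_{\max} = n h - h = (n-1)\,h, \qquad
ER = \frac{E}{E_{\max}} = 1.
\end{align*}
```

So ε takes the values 0.50, 0.67, 0.75, 0.80 and 0.9 for $n = 2, 3, 4, 5, 10$, while ER stays at 1 for every window size; $r^2$ is a pairwise measure and is therefore reported only for two loci.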
