BTRY 4830/6830: Quantitative Genomics and Genetics Lecture 17: Logistic Regression III and Haplotypes Jason Mezey jgm45@cornell.edu Oct. 23, 2014 (Th) 8:40-9:55
Announcements • MY OFFICE HOURS TODAY ARE CANCELLED (This week only!!) • Homework #5 will be available tomorrow (Fri.) and will be due this coming Thurs. (11:59PM)
Summary of lecture 17 • In the previous two lectures, we expanded our GWAS analysis abilities by introducing the logistic regression model, which allows us to analyze case / control data • Today, we will complete our discussion of logistic regression by discussing (in brief) Generalized Linear Models (GLMs), the broader class of models that includes linear and logistic regression • We will also provide an (extremely) brief introduction to the concept of haplotypes and haplotype testing in a GWAS framework
Conceptual Overview • [Diagram] A genetic system is studied through a sample or experimental population of measured individuals (genotype, phenotype); a regression model of Pr(Y|X) is fit, model parameters are estimated, and a hypothesis test (e.g., an F-test) yields a reject / do-not-reject (DNR) decision about the question: does $A_1 \to A_2$ affect Y?
Review: linear vs. logistic • Recall that for a linear regression, the regression function was a line and the error term accounted for the difference between each point and the expected value (the linear regression line), which we assume follows a normal distribution • For a logistic regression, we use the logistic function, and the error term makes up the difference between the expected value and the observed phenotype of either 0 or 1 [plots of Y versus $X_a$ for the linear and logistic cases]
Review: the expected value for logistic regression • Recall that an individual with phenotype $Y_i$ is described by the following equation:

$Y_i = E(Y_i | X_i) + \epsilon_i$

$Y_i = \gamma^{-1}(\beta_\mu + X_{i,a}\beta_a + X_{i,d}\beta_d) + \epsilon_i$

$Y_i = \frac{e^{\beta_\mu + x_{i,a}\beta_a + x_{i,d}\beta_d}}{1 + e^{\beta_\mu + x_{i,a}\beta_a + x_{i,d}\beta_d}} + \epsilon_i$

• To understand how an individual with a phenotype $Y_i$ and a genotype $X_i$ breaks down in this equation, we need to consider the expected (predicted!) part and the error term
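As a quick numerical illustration (not from the original slides), here is a minimal Python sketch of the inverse logistic function $\gamma^{-1}$ and the expected value it produces; the parameter values and genotype coding below are made up for illustration:

```python
import numpy as np

def inv_logistic(z):
    """Inverse logistic function gamma^{-1}(z) = e^z / (1 + e^z), mapping the
    linear predictor onto (0, 1) so it can serve as E(Y|X) = Pr(Y = 1 | X)."""
    return np.exp(z) / (1.0 + np.exp(z))

# Hypothetical values for illustration (not from the lecture):
beta_mu, beta_a, beta_d = -0.5, 0.8, 0.2
x_a, x_d = 1.0, -1.0                     # Xa / Xd coding of one genotype
print(inv_logistic(beta_mu + x_a * beta_a + x_d * beta_d))  # E(Y_i | X_i)
```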
Review: Inference in logistic regression • Recall that our goal with using logistic regression was to model the probability distribution of a case / control phenotype when there is a causal polymorphism • To use this for a GWAS, we need to test the null hypothesis that a genotype is not a causal polymorphism (or more accurately, that the genetic marker we are testing is not in LD with a causal polymorphism!), i.e. that $\beta_\mu = c$, $\beta_a = 0$, $\beta_d = 0$:

$H_0: \beta_a = 0 \cap \beta_d = 0$

• To assess this null hypothesis, we will use the same approach as in linear regression, i.e. we will construct an LRT = likelihood ratio test (recall that an F-test is an LRT, and although we will not construct an F-test for logistic regression hypothesis testing, we will construct an LRT!)
Review: logistic inference for GWAS • Our goal is to perform a hypothesis test for each marker against the phenotype individually • We therefore have to construct an LRT and calculate a p-value for every marker • To calculate the LRT (and p-value) we need to calculate the MLE of our regression parameters (which we will do with an algorithm!) • We therefore need equations for: 1. the algorithm, 2. the LRT
Review: logistic MLE • Recall that an MLE is simply a statistic (a function that takes the sample as an input and outputs the estimate of the parameters)! • In this case, we want to construct the following MLE:

$MLE(\hat\beta) = MLE(\hat\beta_\mu, \hat\beta_a, \hat\beta_d)$

• To do this, we need to maximize the log-likelihood function for the logistic regression, which has the following form (sample size n):

$l(\beta) = \sum_{i=1}^{n} \left[ y_i \ln(\gamma^{-1}(\beta_\mu + x_{i,a}\beta_a + x_{i,d}\beta_d)) + (1 - y_i) \ln(1 - \gamma^{-1}(\beta_\mu + x_{i,a}\beta_a + x_{i,d}\beta_d)) \right]$

• Unlike the case of linear regression, where we had a "closed-form" equation that allows us to plug in the Y's and X's and returns the beta values that maximize the log-likelihood, there is no such simple equation for a logistic regression • We will therefore need an algorithm to calculate the MLE
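A minimal Python sketch of this log-likelihood, assuming numpy arrays for the phenotypes and genotype codings (the function name `log_likelihood` is mine, not from the lecture):

```python
import numpy as np

def log_likelihood(beta, x_a, x_d, y):
    """l(beta) for logistic regression; beta = [beta_mu, beta_a, beta_d],
    x_a and x_d are length-n genotype codings, y is a length-n 0/1 phenotype."""
    z = beta[0] + x_a * beta[1] + x_d * beta[2]   # linear predictor per individual
    p = np.exp(z) / (1.0 + np.exp(z))             # gamma^{-1}(.) per individual
    return np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))
```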
Review: IRLS algorithm I • algorithm - a sequence of instructions for taking an input and producing an output • For logistic regression (and GLMs in general!) we will construct an Iterative Re-weighted Least Squares (IRLS) algorithm, which has the following structure: 1. Choose starting values for the $\beta$'s. Since we have a vector of three $\beta$'s in our case, we assign these numbers and call the resulting vector $\beta^{[0]}$. 2. Using the re-weighting equation (described on the next slide), update the $\beta^{[t]}$ vector. 3. At each step t > 0, check whether $\beta^{[t+1]} \approx \beta^{[t]}$ (i.e. whether these are approximately equal) using an appropriate function. If the change is below a defined threshold, stop. If not, repeat steps 2 and 3.
Step 1: IRLS algorithm • These are simply values of the vector that we assign (!!) • In one sense, these can be anything we want (!!) although for algorithms in general there are usually some restrictions and / or certain starting values that are "better" than others in the sense that the algorithm will converge faster, find a more "optimal" solution, etc. • In our case, we can assign our starting values as follows:

$\beta^{[0]} = \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix}$
Step 2: IRLS algorithm • At step 2, we will update (= produce a new value of the vector) using the following equation (then do this again and again until we stop!):

$\beta^{[t+1]} = \beta^{[t]} + [\mathbf{x}^T \mathbf{W} \mathbf{x}]^{-1} \mathbf{x}^T (\mathbf{y} - \gamma^{-1}(\mathbf{x}\beta^{[t]}))$

where

$\mathbf{x} = \begin{bmatrix} 1 & x_{1,a} & x_{1,d} \\ 1 & x_{2,a} & x_{2,d} \\ \vdots & \vdots & \vdots \\ 1 & x_{n,a} & x_{n,d} \end{bmatrix}, \quad \mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}, \quad \beta^{[t]} = \begin{bmatrix} \beta_\mu^{[t]} \\ \beta_a^{[t]} \\ \beta_d^{[t]} \end{bmatrix}$

$\gamma^{-1}(\mathbf{x}\beta^{[t]}) = \frac{e^{\mathbf{x}\beta^{[t]}}}{1 + e^{\mathbf{x}\beta^{[t]}}}$ applied element-wise, i.e. $\gamma^{-1}(\beta_\mu^{[t]} + x_{i,a}\beta_a^{[t]} + x_{i,d}\beta_d^{[t]}) = \frac{e^{\beta_\mu^{[t]} + x_{i,a}\beta_a^{[t]} + x_{i,d}\beta_d^{[t]}}}{1 + e^{\beta_\mu^{[t]} + x_{i,a}\beta_a^{[t]} + x_{i,d}\beta_d^{[t]}}}$

and $\mathbf{W}$ is an n by n diagonal matrix ($W_{ij} = 0$ for $i \neq j$) with

$W_{ii} = \gamma^{-1}(\beta_\mu^{[t]} + x_{i,a}\beta_a^{[t]} + x_{i,d}\beta_d^{[t]})(1 - \gamma^{-1}(\beta_\mu^{[t]} + x_{i,a}\beta_a^{[t]} + x_{i,d}\beta_d^{[t]}))$
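A minimal Python sketch of this update step, assuming a design matrix X with columns [1, x_a, x_d]; the helper name `irls_update` is hypothetical:

```python
import numpy as np

def irls_update(beta_t, X, y):
    """One IRLS step: beta^{[t+1]} = beta^{[t]} + (X^T W X)^{-1} X^T (y - p),
    where p = gamma^{-1}(X beta^{[t]}) and W = diag(p * (1 - p))."""
    p = np.exp(X @ beta_t) / (1.0 + np.exp(X @ beta_t))  # gamma^{-1}(X beta^{[t]})
    W = np.diag(p * (1.0 - p))                # n x n diagonal, W_ij = 0 for i != j
    return beta_t + np.linalg.inv(X.T @ W @ X) @ X.T @ (y - p)
```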
Step 3: IRLS algorithm • At step 3, we "check" to see if we should stop the algorithm and, if we decide not to stop, we go back to step 2 • If we decide to stop, we will assume the final values of the vector are the MLE (it may not be exactly the true MLE, but we will assume that it is close if we do not stop the algorithm too early!), e.g. we stop when $\beta^{[t+1]} \approx \beta^{[t]}$ • There are many stopping rules; using the change in the Deviance D is one way to construct a rule (note the issue with ln(0)!!):

$\Delta D = |D^{[t+1]} - D^{[t]}|$, stop when, e.g., $\Delta D < 10^{-6}$

$D = 2 \sum_{i=1}^{n} \left[ y_i \ln\left(\frac{y_i}{\gamma^{-1}(\beta_\mu^{[t]} + x_{i,a}\beta_a^{[t]} + x_{i,d}\beta_d^{[t]})}\right) + (1 - y_i) \ln\left(\frac{1 - y_i}{1 - \gamma^{-1}(\beta_\mu^{[t]} + x_{i,a}\beta_a^{[t]} + x_{i,d}\beta_d^{[t]})}\right) \right]$

where $\gamma^{-1}(\cdot) = \frac{e^{\beta_\mu^{[t]} + x_{i,a}\beta_a^{[t]} + x_{i,d}\beta_d^{[t]}}}{1 + e^{\beta_\mu^{[t]} + x_{i,a}\beta_a^{[t]} + x_{i,d}\beta_d^{[t]}}}$ and D is evaluated at $\beta^{[t]}$ or $\beta^{[t+1]}$ as appropriate
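A minimal Python sketch of the deviance and the full stopping loop, reusing the hypothetical `irls_update` from the previous slide; the `eps` guard against ln(0) and the `max_iter` cap are my additions:

```python
import numpy as np

def deviance(beta, X, y, eps=1e-10):
    """Deviance D at beta; eps avoids evaluating ln(0) directly (the 0 * ln(0)
    terms that arise when y_i = 0 or 1 are taken as 0 by convention)."""
    p = np.exp(X @ beta) / (1.0 + np.exp(X @ beta))
    return 2.0 * np.sum(y * np.log(np.maximum(y, eps) / p)
                        + (1.0 - y) * np.log(np.maximum(1.0 - y, eps) / (1.0 - p)))

def irls(X, y, tol=1e-6, max_iter=100):
    """Run IRLS until the change in deviance drops below tol."""
    beta = np.zeros(X.shape[1])               # step 1: beta^{[0]} = [0, 0, 0]
    D_old = deviance(beta, X, y)
    for _ in range(max_iter):                 # steps 2 and 3, repeated
        beta = irls_update(beta, X, y)
        D_new = deviance(beta, X, y)
        if abs(D_new - D_old) < tol:          # Delta D < 10^-6 by default
            break
        D_old = D_new
    return beta
```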
Review: Logistic hypothesis testing I • Recall that our null and alternative hypotheses are:

$H_0: \beta_a = 0 \cap \beta_d = 0$

$H_A: \beta_a \neq 0 \cup \beta_d \neq 0$

• We will use the LRT for the null (0) and alternative (1):

$LRT = -2\ln\Lambda = -2\ln\frac{L(\hat\theta_0|\mathbf{y})}{L(\hat\theta_1|\mathbf{y})} = 2l(\hat\theta_1|\mathbf{y}) - 2l(\hat\theta_0|\mathbf{y})$

• For our case, we need the following:

$l(\hat\theta_1|\mathbf{y}) = l(\hat\beta_\mu, \hat\beta_a, \hat\beta_d|\mathbf{y})$

$l(\hat\theta_0|\mathbf{y}) = l(\hat\beta_\mu, 0, 0|\mathbf{y})$
Review: Logistic hypothesis testing II • For the alternative, we use the MLE estimates of our logistic regression parameters that we get from our IRLS algorithm and plug these into the log-likelihood equation:

$l(\hat\theta_1|\mathbf{y}) = \sum_{i=1}^{n} \left[ y_i \ln(\gamma^{-1}(\hat\beta_\mu + x_{i,a}\hat\beta_a + x_{i,d}\hat\beta_d)) + (1 - y_i) \ln(1 - \gamma^{-1}(\hat\beta_\mu + x_{i,a}\hat\beta_a + x_{i,d}\hat\beta_d)) \right]$

where $\gamma^{-1}(\hat\beta_\mu + x_{i,a}\hat\beta_a + x_{i,d}\hat\beta_d) = \frac{e^{\hat\beta_\mu + x_{i,a}\hat\beta_a + x_{i,d}\hat\beta_d}}{1 + e^{\hat\beta_\mu + x_{i,a}\hat\beta_a + x_{i,d}\hat\beta_d}}$

• For the null, we plug the following parameter estimates into this same equation:

$l(\hat\theta_0|\mathbf{y}) = \sum_{i=1}^{n} \left[ y_i \ln(\gamma^{-1}(\hat\beta_{\mu,0} + x_{i,a} * 0 + x_{i,d} * 0)) + (1 - y_i) \ln(1 - \gamma^{-1}(\hat\beta_{\mu,0} + x_{i,a} * 0 + x_{i,d} * 0)) \right]$

• where we use the same IRLS algorithm to provide estimates of $\hat\beta_{\mu,0}$ by running the algorithm EXACTLY the same EXCEPT we set $\hat\beta_a = 0$, $\hat\beta_d = 0$ and we do not update these! (see the sketch below)
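Putting the pieces together, a minimal Python sketch of the LRT and its p-value for one marker, reusing the hypothetical `log_likelihood` and `irls` helpers from the earlier slides; under the null the LRT is asymptotically chi-squared distributed with 2 degrees of freedom (the two constrained parameters, beta_a and beta_d):

```python
import numpy as np
from scipy.stats import chi2

def lrt_pvalue(X, y):
    """LRT = 2 l(theta_hat_1 | y) - 2 l(theta_hat_0 | y), with a p-value from
    the asymptotic chi-squared distribution with df = 2."""
    beta_alt = irls(X, y)                     # alternative: fit all three parameters
    beta_mu_0 = irls(X[:, [0]], y)            # null: intercept only, beta_a = beta_d = 0
    beta_null = np.array([beta_mu_0[0], 0.0, 0.0])
    l_alt = log_likelihood(beta_alt, X[:, 1], X[:, 2], y)
    l_null = log_likelihood(beta_null, X[:, 1], X[:, 2], y)
    lrt = 2.0 * (l_alt - l_null)
    return lrt, chi2.sf(lrt, df=2)            # p-value for this marker
```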