

  1. Nonlinear Regression 30.11.2016

  2. Goals of Today’s Lecture Understand the difference between linear and nonlinear regression models. See that not all functions are linearizable. Get an understanding of the fitting algorithm in a statistical sense (i.e. fitting many linear regressions). Know that tests etc. are based on approximations and be able to interpret computer output, profile t-plots and profile traces.

  3. Nonlinear Regression Model The nonlinear regression model is Y_i = h(x_i^(1), x_i^(2), ..., x_i^(m); θ_1, θ_2, ..., θ_p) + E_i = h(x_i; θ) + E_i, where the E_i are the error terms, E_i ∼ N(0, σ²) independent; x^(1), ..., x^(m) are the predictors; θ_1, ..., θ_p are the parameters; and h is the regression function, “any” function of the predictors and the parameters.

  4. Comparison with linear regression model In contrast to the linear regression model we now have a general function h. In the linear regression model we had h(x_i; θ) = x_i^T θ (there we denoted the parameters by β). Note that in linear regression we required that the parameters appear in linear form. In nonlinear regression, we don’t have that restriction anymore.

  5. Example: Puromycin The speed of an enzymatic reaction depends on the concentration of a substrate. The initial speed is the response variable (Y). The concentration of the substrate is used as predictor (x). Observations are from different runs. Model with Michaelis-Menten function h(x; θ) = θ_1 x / (θ_2 + x). Here we have one predictor x (the concentration) and two parameters: θ_1 and θ_2. Moreover, we observe two groups: one where we treat the enzyme with Puromycin and one without treatment (control group).
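As a small illustration in R (a sketch only; the parameter values 200 and 0.1 are made-up plotting values, not estimates), the Michaelis-Menten regression function and its typical shape can be written as:

    # Michaelis-Menten regression function h(x; theta) = theta_1 * x / (theta_2 + x)
    h <- function(x, theta1, theta2) theta1 * x / (theta2 + x)

    # Typical shape for illustrative (made-up) parameter values
    curve(h(x, theta1 = 200, theta2 = 0.1), from = 0, to = 1,
          xlab = "Concentration", ylab = "Velocity")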

  6. Illustration: Puromycin (two groups) [Figure: velocity vs. concentration, two panels] Left: data (• treated enzyme; △ untreated enzyme). Right: typical shape of the regression function.

  7. Example: Biochemical Oxygen Demand (BOD) Model the biochemical oxygen demand (Y) as a function of the incubation time (x): h(x; θ) = θ_1 (1 − e^(−θ_2 x)). [Figure: oxygen demand vs. days]

  8. Linearizable Functions Sometimes (but not always), the function h is linearizable. Example: let’s forget about the error term E for a moment. Assume we have y = h(x; θ) = θ_1 exp{θ_2 / x} ⟺ log(y) = log(θ_1) + θ_2 · (1/x). We can rewrite this as ỹ = θ̃_1 + θ̃_2 · x̃, where ỹ = log(y), θ̃_1 = log(θ_1), θ̃_2 = θ_2 and x̃ = 1/x. If we use this linear model, we assume additive errors E_i: Ỹ_i = θ̃_1 + θ̃_2 x̃_i + E_i.
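A minimal R sketch of this linearization (the data frame dat with columns x and y is hypothetical): fitting the transformed model is an ordinary linear regression.

    # Linearized fit: log(y) = log(theta_1) + theta_2 * (1/x), an ordinary lm()
    fit_log <- lm(log(y) ~ I(1/x), data = dat)
    theta1_hat <- exp(coef(fit_log)[1])  # back-transform the intercept
    theta2_hat <- coef(fit_log)[2]       # the slope is theta_2 directly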

  9. This means that we have multiplicative errors on the original scale: Y_i = θ_1 exp{θ_2 / x_i} · exp{E_i}. This is not the same as using a nonlinear model on the original scale (it would have additive errors!). Hence, transformations of Y modify the model with respect to the error term. In the Puromycin example: do not linearize, because the error term would fit worse (see next slide). Hence, for those cases where h is linearizable, it depends on the data whether it is advisable to do so or to perform a nonlinear regression.
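For comparison, a nonlinear fit on the original scale keeps the additive error assumption. A sketch, reusing the hypothetical data frame dat and the linearized estimates above as starting values:

    # Nonlinear fit with additive errors on the original scale
    fit_nls <- nls(y ~ theta1 * exp(theta2 / x), data = dat,
                   start = list(theta1 = unname(theta1_hat),
                                theta2 = unname(theta2_hat)))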

  10. Puromycin: Treated enzyme [Figure: velocity vs. concentration for the treated enzyme]

  11. Parameter Estimation Let’s now assume that we really want to fit a nonlinear model. Again, we use least squares: minimize S(θ) := Σ_{i=1}^n (Y_i − η_i(θ))², where η_i(θ) := h(x_i; θ) is the fitted value for the i-th observation (x_i is fixed, we only vary the parameter vector θ).
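The objective S(θ) is easy to write down directly. A sketch for the Michaelis-Menten model (x and y are assumed numeric vectors of concentrations and velocities):

    # Sum of squares S(theta) for the Michaelis-Menten model
    S <- function(theta, x, y) sum((y - theta[1] * x / (theta[2] + x))^2)

    # A general-purpose optimizer can minimize it as a sanity check
    # (this is not the Gauss-Newton algorithm discussed later):
    # optim(c(200, 0.1), S, x = x, y = y)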

  12. Geometrical Interpretation First we recall the situation for linear regression. By applying least squares we are looking for the parameter vector θ such that ‖Y − Xθ‖²_2 = Σ_{i=1}^n (Y_i − x_i^T θ)² is minimized. Or in other words: we are looking for the point on the plane spanned by the columns of X that is closest to Y ∈ R^n. This is nothing else than projecting Y onto that specific plane.

  13. Linear Regression: Illustration of Projection [Figure: projection of Y onto the plane spanned by the columns of X]

  14. Situation for nonlinear regression Conceptually, the same holds true for nonlinear regression. The difference is: the possible points no longer lie on a plane, but on a curved surface, the so-called model surface, defined by η(θ) ∈ R^n when varying the parameter vector θ. This is a p-dimensional surface because we parameterize it with p parameters.

  15. Nonlinear Regression: Projection on Curved Surface [Figure: projection of Y onto the curved model surface η(θ), with curves of constant θ_1 and θ_2]

  16. Computation Unfortunately, we cannot derive a closed-form solution for the parameter estimate θ̂. Iterative procedures are therefore needed. We use a Gauss-Newton approach. Starting from an initial value θ^(0), the idea is to approximate the model surface by a plane, to perform a projection on that plane, and to iterate many times. Remember η: R^p → R^n. Define the n × p matrix A(θ) with entries A_i^(j)(θ) = ∂η_i(θ) / ∂θ_j. This is the Jacobian matrix containing all partial derivatives.
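For the Michaelis-Menten model the Jacobian can be written out explicitly. A sketch (derivatives of h(x; θ) = θ_1 x / (θ_2 + x)):

    # n x 2 Jacobian A(theta): column j holds d eta_i / d theta_j
    jacobian_mm <- function(theta, x) {
      cbind(x / (theta[2] + x),                 # d h / d theta_1
            -theta[1] * x / (theta[2] + x)^2)   # d h / d theta_2
    }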

  17. Gauss-Newton Algorithm More formally, the Gauss-Newton algorithm is as follows. Start with an initial value θ̂^(0). For l = 1, 2, ...: calculate the tangent plane of η(θ) at θ̂^(l−1), η(θ) ≈ η(θ̂^(l−1)) + A(θ̂^(l−1)) · (θ − θ̂^(l−1)), and project Y onto that tangent plane to obtain θ̂^(l); the projection is a linear regression problem (see blackboard). Iterate until convergence.
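A bare-bones Gauss-Newton iteration for this model might look as follows (a sketch only: no step-size control or convergence check; it reuses jacobian_mm() from above, and x and y are assumed data vectors):

    gauss_newton <- function(theta, x, y, iterations = 20) {
      for (l in seq_len(iterations)) {
        r <- y - theta[1] * x / (theta[2] + x)  # residuals Y - eta(theta^(l-1))
        A <- jacobian_mm(theta, x)              # tangent-plane "design matrix"
        delta <- qr.solve(A, r)                 # linear least-squares step
        theta <- theta + delta
      }
      theta
    }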

  18. Initial Values How can we get initial values? Available knowledge; a linearized version (see Puromycin); interpretation of parameters (asymptotes, half-life, ...), “fitting by eye”; or a combination of these ideas (e.g., conditionally linearizable functions).

  19. Example: Puromycin (only treated enzyme) [Figure: left panel 1/velocity vs. 1/concentration, right panel velocity vs. concentration] Dashed line: solution of the linearized problem. Solid line: solution of the nonlinear least squares problem.
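In R the linearized (reciprocal) fit can supply the starting values for nls(). A sketch, assuming a data frame puro with columns conc and rate for the treated enzyme only (R ships a Puromycin data set of roughly this shape, but treat the names used here as assumptions):

    # Linearization: 1/velocity = 1/theta_1 + (theta_2/theta_1) * (1/conc)
    lin <- lm(I(1/rate) ~ I(1/conc), data = puro)
    theta1_0 <- unname(1 / coef(lin)[1])         # 1 / intercept
    theta2_0 <- unname(coef(lin)[2] * theta1_0)  # slope / intercept

    # Nonlinear least squares with those starting values
    fit <- nls(rate ~ theta1 * conc / (theta2 + conc), data = puro,
               start = list(theta1 = theta1_0, theta2 = theta2_0))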

  20. Approximate Tests and Confidence Intervals The algorithm “only” gives us θ̂. How accurate is this estimate in a statistical sense? In linear regression we knew the (exact) distribution of the estimated parameters (remember the animation!). In nonlinear regression the situation is more complex in the sense that we only have approximate results. It can be shown that, approximately, θ̂_j ∼ N(θ_j, V_jj) for some matrix V (V_jj is the j-th diagonal element).

  21. Tests and confidence intervals are then constructed as in the linear regression situation, i.e. (θ̂_j − θ_j) / √(V̂_jj) ∼ t_{n−p} approximately. The reason why we basically have the same result as in the linear regression case is that the algorithm is based on (many) linear regression problems. Once converged, the solution is not only the solution of the nonlinear regression problem but also of the linear one from the last iteration. In fact, V̂ = σ̂² (Â^T Â)^{−1}, where Â = A(θ̂).
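Given a fitted nls object, these approximate quantities are directly available. A sketch (fit is assumed to be an nls fit such as the Puromycin one above):

    # Approximate 95% confidence interval for theta_1, based on V_hat
    est <- coef(fit)["theta1"]
    se  <- sqrt(diag(vcov(fit)))["theta1"]   # square root of the diagonal of V_hat
    est + c(-1, 1) * qt(0.975, df = df.residual(fit)) * se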

  22. Example: Puromycin (two groups) Remember, we originally had two groups (treatment and control). [Figure: velocity vs. concentration for both groups] Question: Do the two groups need different regression parameters?

  23. To answer this question we set up a model of the form Y_i = (θ_1 + θ_3 z_i) x_i / (θ_2 + θ_4 z_i + x_i) + E_i, where z is the indicator variable for the treatment (z_i = 1 if treated, z_i = 0 otherwise). E.g., if θ_3 is nonzero we have a different asymptote for the treatment group (θ_1 + θ_3 vs. only θ_1 in the control group). Similarly for θ_2, θ_4. Let’s fit this model to the data.
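A sketch of how this model can be passed to nls() (the variable names velocity, conc and treated follow the computer output on the next slide, with treated assumed logical; the data frame name and the rough starting values are assumptions):

    fit2 <- nls(velocity ~ (T1 + T3 * (treated == TRUE)) * conc /
                           (T2 + T4 * (treated == TRUE) + conc),
                data = puromycin_both,
                start = list(T1 = 150, T2 = 0.05, T3 = 0, T4 = 0))
    summary(fit2)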

  24. Computer Output
Formula: velocity ~ (T1 + T3 * (treated == T)) * conc / (T2 + T4 * (treated == T) + conc)
Parameters:
     Estimate  Std. Error  t value  Pr(>|t|)
T1   160.280     6.896     23.242   2.04e-15
T2     0.048     0.008      5.761   1.50e-05
T3    52.404     9.551      5.487   2.71e-05
T4     0.016     0.011      1.436   0.167
We only get a significant test result for θ_3 (⇒ different asymptotes) and not for θ_4. A 95% confidence interval for θ_3 (= difference between asymptotes) is 52.404 ± q^{t_19}_{0.975} · 9.551 = [32.4, 72.4], where q^{t_19}_{0.975} ≈ 2.09 is the 97.5% quantile of the t_19 distribution.
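The quoted interval is simple to reproduce (a sketch of the arithmetic only):

    # 95% confidence interval for theta_3: estimate +/- t-quantile * std. error
    52.404 + c(-1, 1) * qt(0.975, df = 19) * 9.551   # approximately [32.4, 72.4]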

  25. More Precise Tests and Confidence Intervals The tests etc. that we have seen so far are only “usable” if the linear approximation of the problem around the solution θ̂ is good. We can use another approach that is better (but also more complicated). In linear regression we had a quick look at the F-test for testing simultaneous null hypotheses. This is also possible here. Say we have the null hypothesis H_0: θ = θ* (whole vector). Fact: under H_0 it holds that T = (n − p)/p · (S(θ*) − S(θ̂)) / S(θ̂) ∼ F_{p, n−p} approximately.
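A sketch of computing this test statistic, reusing the sum-of-squares function S() from before (theta_star, theta_hat, x, y, n and p are assumed to be available):

    T_stat  <- (n - p) / p * (S(theta_star, x, y) - S(theta_hat, x, y)) / S(theta_hat, x, y)
    p_value <- pf(T_stat, df1 = p, df2 = n - p, lower.tail = FALSE)  # approximate p-value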
