  1. Applied Statistical Regression HS 2011 – Week 03 Marcel Dettling Institute für Datenanalyse und Prozessdesign Zürcher Hochschule für Angewandte Wissenschaften marcel.dettling@zhaw.ch http://stat.ethz.ch/~dettling ETH Zürich, October 10, 2011 Marcel Dettling, Zurich University of Applied Sciences 1

  2. Simple Linear Regression
     Example: In India, it was observed that alkaline soil hampers plant growth. This gave rise to a search for tree species which show high tolerance against these conditions. An outdoor trial was performed, where 120 trees of a particular species were planted on a big field with considerable soil pH-value variation. After 3 years of growth, every tree's height was measured. Additionally, the pH-value of the soil in the vicinity of each tree was determined and recorded.

  3. Scatterplot: Tree Height vs. pH-value
     [Figure: scatterplot "Tree Height vs. pH-Value"; x-axis: phvalue (about 7.5 to 8.5), y-axis: height (about 2 to 7).]

  4. Systematic Relation
     What is a good description of the systematic relation between pH-value and tree height?
     1) a line connecting all the data points?
     [Figure: scatterplot "Tree Height vs. pH-Value"; x-axis: phvalue, y-axis: height.]

  5. Systematic Relation
     What is a good description of the systematic relation between pH-value and tree height?
     1) a line connecting all the data points?
     2) a smooth line that tries to follow the data?
     [Figure: scatterplot "Tree Height vs. pH-Value"; x-axis: phvalue, y-axis: height.]

  6. Systematic Relation
     What is a good description of the systematic relation between pH-value and tree height?
     1) a line connecting all the data points?
     2) a smooth line that tries to follow the data?
     3) a straight line?
     [Figure: scatterplot "Tree Height vs. pH-Value"; x-axis: phvalue, y-axis: height.]

  7. Simple Linear Regression
     The higher the pH-value, the smaller the trees tend to be. The relation seems to be linear, which is of course also the mathematically simplest way of describing the relation:
     $f(x) = \beta_0 + \beta_1 x$, resp. $\text{height} = \beta_0 + \beta_1 \cdot \text{pH-value}$
     Name/meaning of the two parameters in the equation: $\beta_0$ is the "intercept", $\beta_1$ is the "slope".
     Fitting a straight line into a 2-dimensional scatter plot is known as simple linear regression. This is because:
     • there is just one single predictor variable ("simple").
     • the relation is linear in the parameters ("linear").

  8. Model, Data & Random Errors
     Now we are bringing the data into play. The regression line will not run through all the data points. Thus, there are random errors:
     $y_i = \beta_0 + \beta_1 x_i + E_i$, for all $i = 1, \dots, n$
     Meaning of variables/parameters:
     • $y_i$ is the response variable (height) of observation $i$.
     • $x_i$ is the predictor variable (pH-value) of observation $i$.
     • $\beta_0, \beta_1$ are the regression coefficients. They are unknown beforehand and need to be estimated from the data.
     • $E_i$ is the residual or error, i.e. the random difference between observation and regression line.
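
To make the role of the random errors concrete, here is a minimal Python sketch that generates data from a straight line plus Gaussian noise (the slides themselves use R). The function name `simulate_heights`, the coefficient values and the pH grid are hypothetical and chosen for illustration only; they are not the data of the tree height trial.

```python
import random

def simulate_heights(ph_values, beta0, beta1, sigma, seed=0):
    """Draw y_i = beta0 + beta1 * x_i + E_i with E_i ~ N(0, sigma^2)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [beta0 + beta1 * x + rng.gauss(0.0, sigma) for x in ph_values]

# Hypothetical values, for illustration only (not the trial data).
ph_values = [7.5 + 0.01 * i for i in range(101)]
heights = simulate_heights(ph_values, beta0=28.0, beta1=-3.0, sigma=1.0)

# Each error E_i is the vertical distance between observation and true line.
errors = [y - (28.0 + (-3.0) * x) for x, y in zip(ph_values, heights)]
print(sum(errors) / len(errors))  # scatters around 0 without being exactly 0
```

In practice only `ph_values` and `heights` are observed; the coefficients and the errors have to be estimated from them.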

  9. Least Squares Fitting
     Demo: http://hspm.sph.sc.edu/courses/J716/demos/LeastSquares/LeastSquaresDemo.html
     We need to find a straight line that fits the data well. Many possible solutions exist; some are good, some are worse. Our paradigm is to fit the line such that the sum of squared errors is minimal.

  10. Least Squares: Mathematics
     The paradigm in words...
     Given a set of data points $(x_i, y_i)_{i=1,\dots,n}$, the goal is to fit the regression line such that the sum of squared differences between observed value $y_i$ and regression line is minimal. The function
     $Q(\beta_0, \beta_1) = \sum_{i=1}^{n} r_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - (\beta_0 + \beta_1 x_i))^2 \to \min!$
     measures how well the regression line, defined by $\beta_0, \beta_1$, fits the data. The goal is to minimize the function.
     Solution: see next slide...
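
The criterion can be evaluated directly for any candidate line. The following Python sketch computes the sum of squared differences for two candidate lines; the helper name `sum_of_squares`, the five data points and both candidate lines are made up for illustration and are not taken from the slides.

```python
def sum_of_squares(beta0, beta1, xs, ys):
    """Sum of squared vertical distances between the points and the line."""
    return sum((y - (beta0 + beta1 * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data for illustration only (not the tree height trial).
xs = [7.5, 7.8, 8.0, 8.2, 8.5]
ys = [6.1, 5.4, 4.6, 4.3, 3.1]

q_flat = sum_of_squares(4.7, 0.0, xs, ys)      # horizontal line at the mean height
q_sloped = sum_of_squares(28.0, -2.9, xs, ys)  # a decreasing candidate line
print(q_flat, q_sloped)  # the smaller value belongs to the better-fitting line
```

Least squares fitting amounts to searching over all pairs of intercept and slope for the one with the smallest such value.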

  11. Solution Idea: Partial Derivatives
     • We are taking partial derivatives of the function $Q(\beta_0, \beta_1)$ with respect to both arguments $\beta_0$ and $\beta_1$. As we are after the minimum of the function, we set them to zero:
       $\frac{\partial Q}{\partial \beta_0} = 0$ and $\frac{\partial Q}{\partial \beta_1} = 0$
     • This results in a linear equation system, which (here) has two unknowns $\beta_0, \beta_1$, but also two equations. These are also known under the name normal equations.
     • The solution for $\beta_0, \beta_1$ can be written explicitly as a function of the data pairs $(x_i, y_i)_{i=1,\dots,n}$, see next slide...
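
Written out, the two derivative conditions can be sketched as follows, assuming the least squares criterion $Q(\beta_0, \beta_1) = \sum_{i=1}^{n} (y_i - (\beta_0 + \beta_1 x_i))^2$ from the previous slide. This is the standard textbook derivation rather than a formula quoted from the slides.

```latex
\begin{align*}
\frac{\partial Q}{\partial \beta_0} &= -2 \sum_{i=1}^{n} \bigl(y_i - \beta_0 - \beta_1 x_i\bigr) = 0
  &&\Longrightarrow\quad n\,\beta_0 + \beta_1 \sum_{i=1}^{n} x_i = \sum_{i=1}^{n} y_i, \\
\frac{\partial Q}{\partial \beta_1} &= -2 \sum_{i=1}^{n} x_i \bigl(y_i - \beta_0 - \beta_1 x_i\bigr) = 0
  &&\Longrightarrow\quad \beta_0 \sum_{i=1}^{n} x_i + \beta_1 \sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} x_i y_i.
\end{align*}
```

Dividing the first normal equation by $n$ gives $\beta_0 = \bar{y} - \beta_1 \bar{x}$, which can then be substituted into the second equation to solve for $\beta_1$.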

  12. Least Squares: Solution
     According to the least squares paradigm, the best fitting regression line is defined by the following optimal coefficients:
     $\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$ and $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$
     • For a given set of data points $(x_i, y_i)_{i=1,\dots,n}$, we can determine the solution with a pocket calculator (...or better, with R).
     • The solution for our example "Tree Height": $\hat{\beta}_1 = -3.003$, $\hat{\beta}_0 = 28.723$
       lm(height ~ phvalue, data=treeheight)
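
The closed-form solution is short enough to compute by hand or in a few lines of code. Below is a minimal Python sketch of the two formulas (the slides use R's `lm` for this); the function name `fit_simple_regression` and the five data points are hypothetical, so the resulting numbers are not those of the tree height example.

```python
def fit_simple_regression(xs, ys):
    """Return (beta0_hat, beta1_hat) from the closed-form least squares formulas."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    beta1_hat = sxy / sxx                  # slope
    beta0_hat = y_bar - beta1_hat * x_bar  # intercept
    return beta0_hat, beta1_hat

# Hypothetical data for illustration only (not the tree height trial).
xs = [7.5, 7.8, 8.0, 8.2, 8.5]
ys = [6.1, 5.4, 4.6, 4.3, 3.1]
beta0_hat, beta1_hat = fit_simple_regression(xs, ys)
print(beta0_hat, beta1_hat)
```

The guard against identical x values reflects that a slope cannot be estimated when the predictor does not vary.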

  13. Least Squares Regression Line
     [Figure: scatterplot "Tree Height vs. pH-Value" with the fitted regression line; x-axis: pH-Value, y-axis: Tree Height.]

  14. Is This a Good Model for Predicting the Tree Height from the Soil pH-Value?
     a) Beyond the range of observed data: Unknown, but most likely not...
     b) Within the range of observed data: Yes, under the following conditions:
     - the relation is in truth a straight line, i.e. $E[E_i] = 0$
     - the scatter of the errors is constant, i.e. $Var(E_i) = \sigma^2$
     - the data are uncorrelated (from a representative sample)
     - the errors are approximately normally distributed
     Food for thought: irrigation, shaded corners...?

  15. Model Diagnostics
     For assessing the quality of the regression line, we need to (at least roughly) check whether the assumptions are met:
     $E[E_i] = 0$ and $Var(E_i) = \sigma^2$ can be reviewed by:
     [Figure: two panels, "Residuals vs. pH-Value" (x-axis: pH-Value, y-axis: Residuals) and "Residuals vs. Fitted Values" (x-axis: Fitted Values, y-axis: Residuals).]
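
As a rough numeric companion to looking at these residual plots, the Python sketch below fits the line, orders the observations by predictor value and compares mean and variance of the residuals in the lower and upper half. This split-in-halves check is an illustrative simplification and not a procedure from the slides; the helper names and the data are hypothetical.

```python
from statistics import mean, pvariance

def fit_line(xs, ys):
    """Closed-form least squares fit; returns (intercept, slope)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def residual_summary_by_half(xs, ys):
    """Mean and variance of the residuals for low-x and high-x observations."""
    intercept, slope = fit_line(xs, ys)
    pairs = sorted(zip(xs, ys))  # order observations by predictor value
    residuals = [y - (intercept + slope * x) for x, y in pairs]
    half = len(residuals) // 2
    lower, upper = residuals[:half], residuals[half:]
    return (mean(lower), pvariance(lower)), (mean(upper), pvariance(upper))

# Hypothetical data for illustration only (not the tree height trial).
xs = [7.5, 7.6, 7.8, 7.9, 8.0, 8.2, 8.3, 8.5]
ys = [6.3, 5.6, 5.5, 4.4, 4.9, 4.0, 4.2, 3.0]
(low_mean, low_var), (high_mean, high_var) = residual_summary_by_half(xs, ys)
print(low_mean, low_var)
print(high_mean, high_var)
```

Means far from zero in either half would hint at a non-linear relation, and very different variances at non-constant scatter; with so few points, the plots remain the more reliable tool.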

  16. Model Diagnostics
     For assessing the quality of the regression line, we need to (at least roughly) check whether the assumptions are met:
     Gaussian distribution can be reviewed by: Normal Plot
     [Figure: "Normal Plot"; x-axis: Quantiles of the Gaussian Distribution, y-axis: Residuals.]
     We will revisit model diagnostics later in this course, where they will be discussed in more depth. "Residuals vs. Fitted" and the "Normal Plot" will always stay at the heart of model diagnostics.
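
The coordinates behind a normal plot can be computed with the Python standard library alone. The sketch below pairs the sorted residuals with quantiles of the standard Gaussian distribution; the plotting positions $(i - 0.5)/n$ are one common convention and an assumption here, and the function name and residual values are hypothetical.

```python
from statistics import NormalDist

def normal_plot_points(residuals):
    """Pair each sorted residual with the matching standard normal quantile."""
    n = len(residuals)
    ordered = sorted(residuals)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    # Plotting positions (i - 0.5) / n: an assumed, commonly used convention.
    theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, ordered))

# Hypothetical residuals for illustration only.
residuals = [0.3, -1.2, 0.9, -0.1, 0.1]
points = normal_plot_points(residuals)
for quantile, residual in points:
    print(round(quantile, 3), residual)
```

If the errors are approximately Gaussian, these points scatter around a straight line when plotted.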

  17. Why Least Squares? History...
     Within a few years (1801, 1805), the method was developed independently by Gauss and Legendre. Both were trying to solve applied problems in astronomy...
     Source: http://de.wikipedia.org/wiki/Methode_der_kleinsten_Quadrate
     [Portraits: Carl Friedrich Gauss, Adrien-Marie Legendre.]

  18. Why Least Squares? Mathematics...
     • Least Squares is simple in the sense that the solution is known in closed form as a function of $(x_i, y_i)_{i=1,\dots,n}$.
     • The line runs through the center of gravity $(\bar{x}, \bar{y})$.
     • The sum of residuals adds up to zero: $\sum_{i=1}^{n} r_i = 0$.
     • Some deeper mathematical optimality can be shown when analyzing the large sample properties of the estimates $\hat{\beta}_0, \hat{\beta}_1$. This is especially true under the assumption of normally distributed errors $E_i$.
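
Two of these properties are easy to confirm numerically. The Python sketch below fits a line with the closed-form formulas and checks that it passes through the center of gravity and that the residuals sum to zero, up to floating-point rounding. The helper name `fit_line` and the data are hypothetical and serve only as an illustration.

```python
def fit_line(xs, ys):
    """Closed-form least squares fit; returns (intercept, slope)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

# Hypothetical data for illustration only (not the tree height trial).
xs = [7.5, 7.8, 8.0, 8.2, 8.5]
ys = [6.1, 5.4, 4.6, 4.3, 3.1]
intercept, slope = fit_line(xs, ys)

x_bar, y_bar = sum(xs) / len(xs), sum(ys) / len(ys)
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]

# Gap between the fitted line at x_bar and y_bar: zero up to rounding.
gap_at_center = (intercept + slope * x_bar) - y_bar
residual_sum = sum(residuals)
print(abs(gap_at_center) < 1e-9, abs(residual_sum) < 1e-9)
```

Both properties follow from the first normal equation, so they hold for any data set with at least two distinct x values.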
