Multivariable logistic regression
Generalized Linear Models in Python
Ita Cirovic Donev, Data Science Consultant
Multivariable setting

Model formula, starting from a single predictor:

logit(y) = β0 + β1x1

and extending to p predictors:

logit(y) = β0 + β1x1 + β2x2 + ... + βpxp

In Python:

import statsmodels.api as sm
from statsmodels.formula.api import glm

model = glm('y ~ x1 + x2 + x3 + x4', data=my_data,
            family=sm.families.Binomial()).fit()
Example - well switching

formula = 'switch ~ distance100 + arsenic'
wells_fit = glm(formula=formula, data=wells,
                family=sm.families.Binomial()).fit()

===============================================================================
                coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------
Intercept     0.0027      0.079      0.035      0.972      -0.153       0.158
distance100  -0.8966      0.104     -8.593      0.000      -1.101      -0.692
arsenic       0.4608      0.041     11.134      0.000       0.380       0.542
===============================================================================
Example - well switching

                coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------
Intercept     0.0027      0.079      0.035      0.972      -0.153       0.158
distance100  -0.8966      0.104     -8.593      0.000      -1.101      -0.692
arsenic       0.4608      0.041     11.134      0.000       0.380       0.542

- Both coefficients are statistically significant
- The signs of the coefficients are logical
- A unit change in distance100 corresponds to a negative difference of 0.89 in the logit
- A unit change in arsenic corresponds to a positive difference of 0.46 in the logit (exponentiating gives odds ratios, as sketched below)
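Because the coefficients are differences on the logit scale, exponentiating them gives odds ratios. A minimal follow-up, assuming the wells_fit model fitted above:

import numpy as np

# Odds ratios from the fitted logit coefficients (assumes wells_fit from above)
print(np.exp(wells_fit.params))

# exp(-0.8966) ≈ 0.41 -> each extra 100 m of distance cuts the odds of switching by roughly 59%
# exp(0.4608)  ≈ 1.59 -> each extra unit of arsenic raises the odds of switching by roughly 59%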
Impact of adding a variable

Model with distance100 only:

                coef    std err
---------------------------------
Intercept     0.6060      0.060
distance100  -0.6291      0.097

Model with distance100 and arsenic:

                coef    std err
---------------------------------
Intercept     0.0027      0.079
distance100  -0.8966      0.104
arsenic       0.4608      0.041

Impact of the arsenic variable:
- The distance100 coefficient changes from -0.62 to -0.89 (a refit sketch follows below)
- Further away from the safe well → more likely to have higher arsenic levels
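To reproduce the single-predictor column, one can refit the model without arsenic. A minimal sketch, assuming the wells data, glm and sm imports from the earlier slides (wells_dist is a hypothetical name):

# Refit with distance100 only
wells_dist = glm('switch ~ distance100', data=wells,
                 family=sm.families.Binomial()).fit()
print(wells_dist.params)   # distance100 coefficient near -0.63
print(wells_fit.params)    # changes to about -0.89 once arsenic is added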
Multicollinearity

- Variables that are correlated with other model variables
- Increases the standard errors of the coefficients
- Coefficients may not be statistically significant

[1] https://en.wikipedia.org/wiki/Correlation_and_dependence
Presence of multicollinearity

What to look for?
- A coefficient is not significant, but the variable is highly correlated with y
- Adding/removing a variable significantly changes the coefficients
- The sign of a coefficient is not logical
- Variables have high pairwise correlation (a quick check is shown below)
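A quick check of the last point, assuming the wells DataFrame from the well-switching example:

# Pairwise correlation between the explanatory variables (assumes wells from above)
print(wells[['distance100', 'arsenic']].corr())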
Variance inflation factor (VIF)

- Most widely used diagnostic for multicollinearity
- Computed for each explanatory variable
- Measures how inflated the variance of the coefficient is
- Suggested threshold: VIF > 2.5 (a computation sketch follows below)

In Python:

from statsmodels.stats.outliers_influence import variance_inflation_factor
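A computation sketch, assuming the wells data from the well-switching example: the model matrix is built with patsy and a VIF is computed for each column.

import pandas as pd
from patsy import dmatrix
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Design matrix for the explanatory variables (assumes the wells DataFrame from above)
X = dmatrix('distance100 + arsenic', data=wells, return_type='dataframe')

# One VIF per column of the design matrix
vif = pd.Series([variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
                index=X.columns)
print(vif)   # values above ~2.5 would flag potential multicollinearity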
Let's practice!
Comparing models
Generalized Linear Models in Python
Ita Cirovic Donev, Data Science Consultant
Deviance

Formula: D = -2 LL(β)

- Measure of error
- Lower deviance → better model fit
- Benchmark for comparison is the null deviance → intercept-only model

Evaluate:
- Adding a random noise variable would, on average, decrease the deviance by 1
- Adding p predictors should decrease the deviance by more than p (a minimal check is sketched below)
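A minimal sketch of the last rule, assuming two fitted statsmodels GLMs where model_large adds p predictors to model_small (both names are hypothetical):

p = 2  # hypothetical number of added predictors
deviance_drop = model_small.deviance - model_large.deviance
print(deviance_drop > p)   # True suggests the extra predictors genuinely improve the fit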
Deviance in Python
Compute deviance

Extract the null deviance and the model deviance, or compute the deviance from the log-likelihood:

# Extract null deviance
print(model.null_deviance)
4118.0992

# Compute deviance using the log-likelihood
print(-2*model.llf)
4076.2378

# Extract model deviance
print(model.deviance)
4076.2378

- Reduction in deviance of 41.86
- Including distance100 improved the fit
Model complexity

- model_1 and model_2, where L1 > L2
- The number of parameters is higher in model_2
- model_2 is overfitting
Let's practice!
Model formula
Generalized Linear Models in Python
Ita Cirovic Donev, Data Science Consultant
Formula and model matrix
Model matrix

Model matrix: y ~ X

Model formula:
'y ~ x1 + x2'

Check the model matrix structure:

from patsy import dmatrix

dmatrix('x1 + x2')

  Intercept  x1  x2
          1   1   4
          1   2   5
          1   3   6
Variable transformation

'y ~ x1 + np.log(x2)'

import numpy as np

dmatrix('x1 + np.log(x2)')

DesignMatrix with shape (3, 3)
  Intercept  x1  np.log(x2)
          1   1     1.38629
          1   2     1.60944
          1   3     1.79176
Centering and standardization

Stateful transforms:
'y ~ center(x1) + standardize(x2)'

dmatrix('center(x1) + standardize(x2)')

DesignMatrix with shape (3, 3)
  Intercept  center(x1)  standardize(x2)
          1          -1         -1.22474
          1           0          0.00000
          1           1          1.22474
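A small sketch of why these transforms are called stateful: the design matrix remembers the means and standard deviations it computed, so patsy's build_design_matrices can apply the same centering and scaling to new data (the new_data values here are only illustrative):

import numpy as np
from patsy import dmatrix, build_design_matrices

x1 = np.array([1, 2, 3])
x2 = np.array([4, 5, 6])

# Build the design matrix; center() and standardize() store the statistics they used
design = dmatrix('center(x1) + standardize(x2)')

# Apply the *same* centering and scaling to new data
new_data = {'x1': np.array([4]), 'x2': np.array([7])}
print(build_design_matrices([design.design_info], new_data)[0])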
Build your own transformation

def my_transformation(x):
    return 4 * x

dmatrix('x1 + x2 + my_transformation(x2)')

DesignMatrix with shape (3, 4)
  Intercept  x1  x2  my_transformation(x2)
          1   1   4                     16
          1   2   5                     20
          1   3   6                     24
Arithmetic operations

With NumPy arrays, I() performs element-wise addition:

x1 = np.array([1, 2, 3])
x2 = np.array([4, 5, 6])

dmatrix('I(x1 + x2)')

DesignMatrix with shape (3, 2)
  Intercept  I(x1 + x2)
          1           5
          1           7
          1           9

With plain Python lists, + concatenates instead:

x1 = [1, 2, 3]
x2 = [4, 5, 6]

dmatrix('I(x1 + x2)')

DesignMatrix with shape (6, 2)
  Intercept  I(x1 + x2)
          1           1
          1           2
          1           3
          1           4
          1           5
          1           6
Coding the categorical data
Patsy coding

- Strings and booleans are automatically coded
- Numerical → categorical: the C() function
- Reference group
  - Default: first group
  - Treatment levels
The C() function

How many levels?

crab['color'].value_counts()
2    95
3    44
4    22
1    12

Numeric variable:

dmatrix('color', data=crab)

DesignMatrix with shape (173, 2)
  Intercept  color
          1      2
          1      3
          1      1
  [... rows omitted]
The C() function

Categorical variable:

dmatrix('C(color)', data=crab)

DesignMatrix with shape (173, 4)
  Intercept  C(color)[T.2]  C(color)[T.3]  C(color)[T.4]
          1              1              0              0
          1              0              1              0
          1              0              0              0
  [... rows omitted]
Changing the reference group

dmatrix('C(color, Treatment(4))', data=crab)

DesignMatrix with shape (173, 4)
  Intercept  C(color)[T.1]  C(color)[T.2]  C(color)[T.3]
          1              0              1              0
          1              0              0              1
          1              1              0              0
  [... rows omitted]
Changing the reference group

l = [1, 2, 3, 4]

dmatrix('C(color, levels=l)', data=crab)

DesignMatrix with shape (173, 4)
  Intercept  C(color)[T.2]  C(color)[T.3]  C(color)[T.4]
          1              1              0              0
          1              0              1              0
          1              0              0              0
  [... rows omitted]
Multiple intercepts

'y ~ C(color)-1'

dmatrix('C(color)-1', data=crab)

DesignMatrix with shape (173, 4)
  C(color)[1]  C(color)[2]  C(color)[3]  C(color)[4]
            0            1            0            0
            0            0            1            0
            1            0            0            0
  [... rows omitted]
Let's practice!
Categorical and interaction terms
Generalized Linear Models in Python
Ita Cirovic Donev, Data Science Consultant