Multiple Linear Regression: Part II

This lab covers the verification of Multiple Linear Regression assumptions. As in Lab 05 (Model 1), all independent variables are used to predict 'mpi_urban'. In this lab, we are using another popular library, 'statsmodels'.

Download Files

To start with, download 'combined.csv' (the same dataset used in Lab 05) and 'Lab_06.ipynb'. The notebook contains the code that is used to create the model and the outputs presented in this task sheet.

Create a DataFrame

import pandas as pd

combined = pd.read_csv("combined.csv")

The 'combined' dataset features 79 rows × 10 columns.

Figure 1: Features presented in the 'combined' dataset.

The first 9 columns are used as the independent variables (X) and the last column, 'mpi_urban', is the dependent variable (y).

nrow, ncol = combined.shape
X = combined.iloc[:, :ncol - 1]
y = combined.iloc[:, -1]
To use the linear regression model of the 'statsmodels' library, you need to add a column of ones to serve as the intercept.

import statsmodels.api as sm

X_constant = sm.add_constant(X)

Fit OLS Model

lin_reg = sm.OLS(y, X_constant).fit()
lin_reg.summary()
1.1. Linearity of the Model

The dependent variable (y) is assumed to be a linear function of the independent variables (X). Using independent variables with non-linear patterns causes significant prediction errors.

Task 1: To detect nonlinearity, inspect scatter plots of 'observed vs. predicted values' or 'residuals vs. predicted values'. Use the provided sample code (a sketch is given after this section). Ideally, we are looking for points that are symmetrically distributed around a horizontal line in the residuals vs. predicted plot, or around a diagonal line in the observed vs. predicted plot, in both cases with a nearly constant variance. Nonlinearity can also be revealed by systematic patterns in plots of the residuals vs. individual features in a multidimensional dataset. What is your finding?

1.2. Zero Mean of Residuals

Task 2: Obtain the mean of your model's residuals and explain your finding. Use the provided sample code (see the sketch after this section).

1.3. Multicollinearity Inspection

Not having multicollinearity means the features should be linearly independent of each other. We used a Pearson correlation heatmap in Lab 05 to identify the independent variables that show multicollinearity. Another way to identify multicollinearity is to check the Variance Inflation Factor (VIF). All VIF values should be 1 if the features are not correlated.

Task 3: Obtain the VIF values of the independent variables and explain your finding. Use the provided sample code (see the sketch after this section).

A rule of thumb for interpreting the variance inflation factor:
- 1 = not correlated
- between 1 and 5 = moderately correlated
- greater than 5 = highly correlated
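For Task 1, the following is a minimal sketch of the two diagnostic plots, assuming the fitted model 'lin_reg' and the variables 'X' and 'y' from above (the plot styling is illustrative, not necessarily the exact code from 'Lab_06.ipynb'):

import matplotlib.pyplot as plt

fitted = lin_reg.fittedvalues
residuals = lin_reg.resid

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Observed vs. predicted: points should fall around the diagonal line
ax1.scatter(fitted, y)
ax1.plot([y.min(), y.max()], [y.min(), y.max()], color='red')
ax1.set_xlabel('Predicted values')
ax1.set_ylabel('Observed values')
ax1.set_title('Observed vs. Predicted')

# Residuals vs. predicted: points should scatter symmetrically around zero
ax2.scatter(fitted, residuals)
ax2.axhline(0, color='red')
ax2.set_xlabel('Predicted values')
ax2.set_ylabel('Residuals')
ax2.set_title('Residuals vs. Predicted')

plt.tight_layout()
plt.show()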
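For Task 2, the residual mean comes straight from the fitted model; for OLS with an intercept it should be approximately zero up to floating-point precision (a one-line sketch):

# Mean of the residuals; expected to be roughly zero for OLS with an intercept
print(lin_reg.resid.mean())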
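For Task 3, a sketch using 'variance_inflation_factor' from statsmodels, applied to the design matrix that includes the constant column (the VIF row for the constant itself is usually ignored):

from statsmodels.stats.outliers_influence import variance_inflation_factor

# VIF for each column of the design matrix (including the constant)
vif = pd.DataFrame()
vif['feature'] = X_constant.columns
vif['VIF'] = [variance_inflation_factor(X_constant.values, i)
              for i in range(X_constant.shape[1])]
print(vif)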
1.4. Homoscedasticity (Equal Variance) of Residuals

Heteroscedasticity occurs when the residuals do not have constant variance. Determining the true standard deviation of the forecast errors becomes challenging when the variance of the residuals is non-constant (heteroscedasticity).

Task 4: To investigate whether the residuals are homoscedastic, examine the plots of residuals vs. fitted values and standardized residuals vs. fitted values. Use the provided sample code (see the sketch after this section). What is your finding? If the residuals grow either as a function of the predicted value or of time (in the case of time series), then heteroscedasticity is detected.

Task 5: Statistical tests such as Breusch-Pagan can also be used to test the assumption of homoscedasticity. The null hypothesis assumes homoscedasticity. If the p-value of the test is less than some significance level (e.g. α = 0.05), then reject the null hypothesis and conclude that heteroscedasticity is present in the regression model. Use the provided sample code to perform the Breusch-Pagan test (see the sketch after this section). What is your finding?

1.5. Normality of the Residuals

Task 6: To investigate this assumption, generate the Q-Q plot of the residuals and perform the Anderson-Darling (AD) test. Use the provided sample code. If the returned AD statistic is larger than the critical value, then the null hypothesis that the data come from a normal distribution should be rejected. What is your finding?

1.6. Autocorrelation of Residuals

In time-series models, serial correlation in the residuals implies that there is room for improvement in the model. In non-time-series models, autocorrelation in the residuals can be a sign of systematic underprediction/overprediction.

Task 7: To investigate whether autocorrelation is present, use the result of the Durbin-Watson (DW) test to evaluate the assumption. See the output in Figure 2.
- the test statistic always has a value between 0 and 4
- a value of 2 means that there is no autocorrelation in the sample
- values < 2 indicate positive autocorrelation, values > 2 indicate negative autocorrelation
What is your finding?
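For Task 4, a minimal sketch of the two plots, assuming the fitted model 'lin_reg' from above; the standardized (internally studentized) residuals are taken from statsmodels' influence results:

import matplotlib.pyplot as plt

fitted = lin_reg.fittedvalues
standardized_resid = lin_reg.get_influence().resid_studentized_internal

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Residuals vs. fitted: the vertical spread should stay roughly constant
ax1.scatter(fitted, lin_reg.resid)
ax1.axhline(0, color='red')
ax1.set_xlabel('Fitted values')
ax1.set_ylabel('Residuals')
ax1.set_title('Residuals vs. Fitted')

# Standardized residuals vs. fitted: the same check on a standardized scale
ax2.scatter(fitted, standardized_resid)
ax2.axhline(0, color='red')
ax2.set_xlabel('Fitted values')
ax2.set_ylabel('Standardized residuals')
ax2.set_title('Standardized Residuals vs. Fitted')

plt.tight_layout()
plt.show()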
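For Task 5, a sketch of the Breusch-Pagan test using 'het_breuschpagan' from statsmodels; it takes the residuals and the design matrix (including the constant) and returns the LM and F statistics with their p-values:

from statsmodels.stats.diagnostic import het_breuschpagan

# Returns (LM statistic, LM p-value, F statistic, F p-value)
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(lin_reg.resid, X_constant)
print(f'LM statistic: {lm_stat:.4f}, p-value: {lm_pvalue:.4f}')
# p-value < 0.05 -> reject the null hypothesis of homoscedasticity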
Task 6: Function for drawing the normal Q-Q plot of the residuals and running statistical tests to investigate the normality of the residuals.

import statsmodels.api as sm
import matplotlib.pyplot as plt
from scipy import stats

def normality_of_residuals_test(model):
    # Arg: model - fitted OLS model from statsmodels
    # Normal Q-Q plot of the residuals
    sm.ProbPlot(model.resid).qqplot(line='s')
    plt.title('Q-Q plot')
    # Anderson-Darling and Kolmogorov-Smirnov tests for normality
    ad = stats.anderson(model.resid, dist='norm')
    ks = stats.kstest(model.resid, 'norm')
    print(f'Anderson-Darling test --- statistic: {ad.statistic:.4f}, '
          f'5% critical value: {ad.critical_values[2]:.4f}')

normality_of_residuals_test(lin_reg)
Task 7: Perform the Durbin-Watson test. The statistic can also be read from the OLS summary output.

from statsmodels.stats.stattools import durbin_watson

durbin_watson(lin_reg.resid)