Regression Modeling in Practice: Multiple Regression

The analysis is based on the Gapminder data set, which contains 15 socio-economic factors from 2007 for 213 countries. I chose to examine the association between income per person (a quantitative explanatory variable) and internet use rate (a quantitative response variable). A second quantitative explanatory variable, urban rate, was later added in the multiple regression model. The hypothesis is that high incomes indicate wealth in a country, and therefore countries with high incomes will have higher internet use rates. They also tend to have higher urbanization rates.

The analysis was generated using SAS Studio.

SAS Code

PROC IMPORT DATAFILE="/home/mst07221/gapminder.csv"
DBMS=CSV
OUT=WORK.GAPMINDER;
GETNAMES=YES;
RUN;

data new; set WORK.GAPMINDER;
run;

********************************************************************
POLYNOMIAL REGRESSION
Testing the relationship between:
Explanatory variable: Income per Person
Response variable: Internet Use Rate
Confounding variable: Urban Rate
********************************************************************;

* scatterplot with linear regression line (internetuserate is the response variable);
* CLM option generates 95% confidence intervals;
proc sgplot;
reg x=incomeperperson y=internetuserate / lineattrs=(color=blue thickness=2) clm;
yaxis label="Internet Use Rate";
xaxis label="Income per Person";
run;

* scatterplot with linear and quadratic regression line;
proc sgplot;
reg x=incomeperperson y=internetuserate / lineattrs=(color=blue thickness=2) degree=1 clm;
reg x=incomeperperson y=internetuserate / lineattrs=(color=green thickness=2) degree=2 clm;
yaxis label="Internet Use Rate";
xaxis label="Income per Person";
run;

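* centering quantitative explanatory variables by subtracting the mean;
* keep only observations with all three variables present;
* the constants are the complete-case means reported by proc means below;
data new2; set new;
if incomeperperson ne . and internetuserate ne . and urbanrate ne .;
incomeperperson_c=incomeperperson-8283.35;
urbanrate_c=urbanrate-55.8960440;
run;

* means of the uncentered variables (source of the constants above);
proc means; var incomeperperson urbanrate;
run;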

* check coding;
proc means; var incomeperperson_c urbanrate_c;
run;

* linear regression model;
PROC glm;
model internetuserate=incomeperperson_c/solution clparm;
run;

* polynomial regression model;
PROC glm;
model internetuserate=incomeperperson_c incomeperperson_c*incomeperperson_c/solution clparm;
run;

**********************************************************************
EVALUATING MODEL FIT
**********************************************************************;

* multiple regression adding urban rate;
PROC glm;
model internetuserate=incomeperperson_c incomeperperson_c*incomeperperson_c urbanrate_c/solution clparm;
run;

* request regression diagnostic plots;
PROC glm PLOTS(unpack)=all;
model internetuserate=incomeperperson_c incomeperperson_c*incomeperperson_c
urbanrate_c/solution clparm;
output residual=res student=stdres out=results;
run;

* plot of standardized residuals for each observation;
* vref=0 indicates a horizontal line at the mean;
proc gplot;
label stdres="Standardized Residual" country="Country";
plot stdres*country/vref=0;
run;

* using proc reg to get a partial regression plot;
* calculate quadratic terms;
data partial;
set new2;
incomeperperson2=incomeperperson_c*incomeperperson_c;
run;

*partial regression plot;
PROC reg plots=partial;
model internetuserate=incomeperperson incomeperperson2
urbanrate/partial;
run;

Results

The MEANS Procedure

Variable	N	Mean	Std Dev	Minimum	Maximum
incomeperperson	182	8283.35	12509.74	103.7758572	81647.10
urbanrate	182	55.8960440	23.6293130	10.4000000	100.0000000

The MEANS Procedure

Variable	N	Mean	Std Dev	Minimum	Maximum
incomeperperson_c	182	0.000498908	12509.74	-8179.57	73363.75
urbanrate_c	182	-4.395605E-8	23.6293130	-45.4960440	44.1039560

The GLM Procedure

Number of Observations Read	182
Number of Observations Used	182

The GLM Procedure

Dependent Variable: internetuserate

Source	DF	Sum of Squares	Mean Square	F Value	Pr > F
Model	1	80249.2619	80249.2619	232.46	<.0001
Error	180	62139.7493	345.2208
Corrected Total	181	142389.0113

R-Square	Coeff Var	Root MSE	internetuserate Mean
0.563592	52.65300	18.58012	35.28786

Source	DF	Type I SS	Mean Square	F Value	Pr > F
incomeperperson_c	1	80249.26194	80249.26194	232.46	<.0001

Source	DF	Type III SS	Mean Square	F Value	Pr > F
incomeperperson_c	1	80249.26194	80249.26194	232.46	<.0001

Parameter	Estimate	Standard Error	t Value	Pr > |t|	95% Confidence Limits
Intercept	35.28786163	1.37725007	25.62	<.0001	32.57022935	38.00549391
incomeperperson_c	0.00168319	0.00011040	15.25	<.0001	0.00146535	0.00190103

Fit Plot for internetuserate by incomeperperson_c

The GLM Procedure

Number of Observations Read	182
Number of Observations Used	182

The GLM Procedure

Dependent Variable: internetuserate

Source	DF	Sum of Squares	Mean Square	F Value	Pr > F
Model	2	100630.7155	50315.3578	215.68	<.0001
Error	179	41758.2958	233.2866
Corrected Total	181	142389.0113

R-Square	Coeff Var	Root MSE	internetuserate Mean
0.706731	43.28322	15.27372	35.28786

Source	DF	Type I SS	Mean Square	F Value	Pr > F
incomeperperson_c	1	80249.26194	80249.26194	343.99	<.0001
incomeper*incomeperp	1	20381.45356	20381.45356	87.37	<.0001

Source	DF	Type III SS	Mean Square	F Value	Pr > F
incomeperperson_c	1	81295.94134	81295.94134	348.48	<.0001
incomeper*incomeperp	1	20381.45356	20381.45356	87.37	<.0001

Parameter	Estimate	Standard Error	t Value	Pr > |t|	95% Confidence Limits
Intercept	40.77595082	1.27535818	31.97	<.0001	38.25927960	43.29262204
incomeperperson_c	0.00279846	0.00014991	18.67	<.0001	0.00250265	0.00309428
incomeper*incomeperp	-0.00000004	0.00000000	-9.35	<.0001	-0.00000004	-0.00000003

The GLM Procedure

Number of Observations Read	182
Number of Observations Used	182

The GLM Procedure

Dependent Variable: internetuserate

Source	DF	Sum of Squares	Mean Square	F Value	Pr > F
Model	3	102488.5489	34162.8496	152.40	<.0001
Error	178	39900.4624	224.1599
Corrected Total	181	142389.0113

R-Square	Coeff Var	Root MSE	internetuserate Mean
0.719778	42.42810	14.97197	35.28786

Source	DF	Type I SS	Mean Square	F Value	Pr > F
incomeperperson_c	1	80249.26194	80249.26194	358.00	<.0001
incomeper*incomeperp	1	20381.45356	20381.45356	90.92	<.0001
urbanrate_c	1	1857.83335	1857.83335	8.29	0.0045

Source	DF	Type III SS	Mean Square	F Value	Pr > F
incomeperperson_c	1	33786.50522	33786.50522	150.73	<.0001
incomeper*incomeperp	1	9640.21552	9640.21552	43.01	<.0001
urbanrate_c	1	1857.83335	1857.83335	8.29	0.0045

Parameter	Estimate	Standard Error	t Value	Pr > |t|	95% Confidence Limits
Intercept	39.73965684	1.30095296	30.55	<.0001	37.17238113	42.30693255
incomeperperson_c	0.00242017	0.00019713	12.28	<.0001	0.00203116	0.00280918
incomeper*incomeperp	-0.00000003	0.00000000	-6.56	<.0001	-0.00000004	-0.00000002
urbanrate_c	0.18291152	0.06353553	2.88	0.0045	0.05753172	0.30829132

The GLM Procedure

Number of Observations Read	182
Number of Observations Used	182

The GLM Procedure

Dependent Variable: internetuserate

Source	DF	Sum of Squares	Mean Square	F Value	Pr > F
Model	3	102488.5489	34162.8496	152.40	<.0001
Error	178	39900.4624	224.1599
Corrected Total	181	142389.0113

R-Square	Coeff Var	Root MSE	internetuserate Mean
0.719778	42.42810	14.97197	35.28786

Source	DF	Type I SS	Mean Square	F Value	Pr > F
incomeperperson_c	1	80249.26194	80249.26194	358.00	<.0001
incomeper*incomeperp	1	20381.45356	20381.45356	90.92	<.0001
urbanrate_c	1	1857.83335	1857.83335	8.29	0.0045

Source	DF	Type III SS	Mean Square	F Value	Pr > F
incomeperperson_c	1	33786.50522	33786.50522	150.73	<.0001
incomeper*incomeperp	1	9640.21552	9640.21552	43.01	<.0001
urbanrate_c	1	1857.83335	1857.83335	8.29	0.0045

Parameter	Estimate	Standard Error	t Value	Pr > |t|	95% Confidence Limits
Intercept	39.73965684	1.30095296	30.55	<.0001	37.17238113	42.30693255
incomeperperson_c	0.00242017	0.00019713	12.28	<.0001	0.00203116	0.00280918
incomeper*incomeperp	-0.00000003	0.00000000	-6.56	<.0001	-0.00000004	-0.00000002
urbanrate_c	0.18291152	0.06353553	2.88	0.0045	0.05753172	0.30829132

Plot of RStudent by Predicted for internetuserate

Plot of RStudent by Leverage for internetuserate

Q-Q Plot of Residuals for internetuserate.

Histogram of Residuals for internetuserate with normal and kernel densities overlaid.

Residual-Fit Spread Plot for internetuserate. This plot displays two uniform Q-Q plots that show the spread in the fitted values about their mean and the spread in the residuals.

Plot of Residual by incomeperperson_c for internetuserate

Plot of Residual by incomeper*incomeperp for internetuserate

Plot of Residual by urbanrate_c for internetuserate

The REG Procedure

Model: MODEL1

Dependent Variable: internetuserate

Number of Observations Read	182
Number of Observations Used	182

Analysis of Variance
Source	DF	Sum of Squares	Mean Square	F Value	Pr > F
Model	3	102489	34163	152.40	<.0001
Error	178	39900	224.15990
Corrected Total	181	142389

Root MSE	14.97197	R-Square	0.7198
Dependent Mean	35.28786	Adj R-Sq	0.7151
Coeff Var	42.42810

Parameter Estimates
Variable	DF	Parameter Estimate	Standard Error	t Value	Pr > |t|
Intercept	1	9.46850	3.14612	3.01	0.0030
incomeperperson	1	0.00242	0.00019713	12.28	<.0001
incomeperperson2	1	-2.86043E-8	4.361813E-9	-6.56	<.0001
urbanrate	1	0.18291	0.06354	2.88	0.0045
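
The PROC REG intercept (9.47) differs from the PROC GLM intercept (39.74) because the PROC REG model uses the uncentered incomeperperson and urbanrate alongside the centered squared term. The Python sketch below (outside the SAS workflow, with illustrative names) shows how the two intercepts relate, assuming the slopes are unchanged by the shift, as the output suggests.

```python
# Estimates copied from the PROC GLM output (centered predictors).
CENTERED_INTERCEPT = 39.73965684
INCOME_SLOPE = 0.00242017
URBAN_SLOPE = 0.18291152

# Centering constants from the SAS data step.
INCOME_MEAN = 8283.35
URBAN_MEAN = 55.8960440


def uncentered_intercept(intercept_c, income_slope, urban_slope,
                         income_mean, urban_mean):
    """Intercept after replacing (x - mean) by x in the two linear terms.

    The quadratic term is left centered, as in the PROC REG model.
    """
    return intercept_c - income_slope * income_mean - urban_slope * urban_mean


if __name__ == "__main__":
    value = uncentered_intercept(CENTERED_INTERCEPT, INCOME_SLOPE, URBAN_SLOPE,
                                 INCOME_MEAN, URBAN_MEAN)
    print(f"Implied PROC REG intercept: {value:.4f}")
```

The result agrees with the intercept of 9.46850 in the PROC REG table to within rounding of the printed estimates.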

The REG Procedure

Model: MODEL1

Dependent Variable: internetuserate

Panel of fit diagnostics for internetuserate.

Panel of scatterplots of residuals by regressors for internetuserate.

Summary of findings

Of the 213 observations available, 182 were used in the model.
The scatterplot shows a positive association between income per person and internet use rate in the Gapminder data set. The quadratic regression line is concave rather than bell-shaped: internet use rises steeply with income at lower incomes and flattens at higher incomes. It fits the data better than the linear regression line.
After centering the income per person explanatory variable, the linear regression model shows that the overall F test is significant (F = 232.46, p < 0.0001). We can reject the null hypothesis and conclude that income per person is significantly associated with internet use rate. The findings support the hypothesis.
The parameter estimates show a slope of 0.0017 and an intercept of 35.29 (beta0 = 35.29, beta1 = 0.0017). The best-fit line for the linear regression is therefore: internetuserate = 35.29 + 0.0017 * incomeperperson_c, where incomeperperson_c is income per person minus its mean of 8283.35. This indicates a positive association between the two variables, also evident in the fit plot. The R-square value of 0.564 indicates that income per person accounts for 56.4% of the variance in internet use rate. The remaining 43.6% is not explained by this model; it may reflect other explanatory variables, a non-linear relationship, or random error.
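
As an arithmetic check on the linear model, the following Python sketch (not part of the SAS analysis; function names are illustrative) recomputes R-square and the overall F value from the sums of squares in the ANOVA table, and evaluates the fitted line for a given income. It assumes income is centered on 8283.35, the constant used in the SAS data step.

```python
# Values copied from the PROC GLM output for the linear model.
SS_MODEL = 80249.2619
SS_ERROR = 62139.7493
SS_TOTAL = 142389.0113
DF_MODEL, DF_ERROR = 1, 180

INTERCEPT = 35.28786163
SLOPE = 0.00168319
INCOME_MEAN = 8283.35  # centering constant from the SAS data step


def r_square(ss_model, ss_total):
    """Proportion of total variance accounted for by the model."""
    return ss_model / ss_total


def f_value(ss_model, df_model, ss_error, df_error):
    """Overall F statistic: model mean square over error mean square."""
    return (ss_model / df_model) / (ss_error / df_error)


def predict_internet_use(income):
    """Fitted internet use rate for an uncentered income value."""
    return INTERCEPT + SLOPE * (income - INCOME_MEAN)


if __name__ == "__main__":
    print(f"R-square: {r_square(SS_MODEL, SS_TOTAL):.4f}")
    print(f"F value:  {f_value(SS_MODEL, DF_MODEL, SS_ERROR, DF_ERROR):.2f}")
    print(f"Fitted rate at mean income: {predict_internet_use(INCOME_MEAN):.2f}")
```

Because the predictor is centered, the fitted value at the mean income equals the intercept, which is why the intercept matches the mean internet use rate of 35.29.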
When a second-order polynomial term is added to the regression model, the overall F test remains significant (F = 215.68, p < 0.0001), as does the test for the linear income term (Type I F = 343.99, p < 0.0001). R-square rises to 0.707, so the linear and quadratic income terms together account for 70.7% of the variance in internet use rate. The quadratic term has a negative and significant coefficient (beta0 = 40.78, beta1 = 0.003, beta2 = -0.00000004), meaning the increase in internet use with income slows at higher incomes.
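
The F value for the quadratic term can be recovered by comparing the error sums of squares of the linear and quadratic models. The Python sketch below (illustrative names, numbers copied from the two PROC GLM outputs) is one way to do that comparison; it is a check on the arithmetic, not a replacement for the SAS output.

```python
# Error sums of squares and error degrees of freedom from the two PROC GLM runs.
SSE_LINEAR, DF_LINEAR = 62139.7493, 180
SSE_QUADRATIC, DF_QUADRATIC = 41758.2958, 179


def extra_ss_f(sse_reduced, df_reduced, sse_full, df_full):
    """F statistic for the terms added when moving from the reduced to the full model."""
    extra_ss = sse_reduced - sse_full   # variance newly accounted for
    extra_df = df_reduced - df_full     # number of added terms
    mse_full = sse_full / df_full       # error mean square of the full model
    return (extra_ss / extra_df) / mse_full


if __name__ == "__main__":
    f_quadratic = extra_ss_f(SSE_LINEAR, DF_LINEAR, SSE_QUADRATIC, DF_QUADRATIC)
    print(f"F for the quadratic term: {f_quadratic:.2f}")
```

The value agrees with the F of 87.37 reported for the quadratic term in the Type I and Type III tables.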
We can evaluate the fit of the model and check it for misspecification by adding another centered quantitative explanatory variable, urban rate. The results show that the coefficients for the linear and quadratic income per person terms remain significant after adjusting for urban rate. Urban rate is also statistically significant (beta = 0.18, p = 0.0045). The positive regression coefficient indicates that countries with higher urban rates tend to have higher internet use rates. The R-square value increases to 0.72.
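
Since R-square can only increase when a predictor is added, adjusted R-square is a useful companion figure. The Python sketch below (illustrative names, values copied from the output) recomputes the adjusted R-square reported by PROC REG and shows that the squared t value for urban rate equals its Type III F value.

```python
# Values copied from the multiple regression output.
N_OBS = 182
N_PREDICTORS = 3          # income, income squared, urban rate
R_SQUARE = 0.719778

URBAN_ESTIMATE = 0.18291152
URBAN_STD_ERROR = 0.06353553
URBAN_TYPE3_SS = 1857.83335
MSE = 224.1599            # error mean square


def adjusted_r_square(r2, n, p):
    """R-square penalized for the number of predictors p."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


def t_value(estimate, std_error):
    """t statistic for a single coefficient."""
    return estimate / std_error


if __name__ == "__main__":
    t_urban = t_value(URBAN_ESTIMATE, URBAN_STD_ERROR)
    f_urban = URBAN_TYPE3_SS / MSE  # one-degree-of-freedom partial F
    print(f"Adjusted R-square: {adjusted_r_square(R_SQUARE, N_OBS, N_PREDICTORS):.4f}")
    print(f"t for urban rate: {t_urban:.2f}, t squared: {t_urban ** 2:.2f}, F: {f_urban:.2f}")
```

The adjusted value is close to the unadjusted 0.72, so the gain from urban rate is small but not an artifact of simply adding a term.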
Income per person and urban rate together explain 72% of the variability in internet use rate, leaving 28% unexplained in the residuals. The Q-Q plot shows that the residuals generally follow a straight line but deviate somewhat at the lower and higher quantiles, i.e. the residuals are not perfectly normally distributed. This suggests that the curvilinear association observed in the scatterplot was not fully captured by the quadratic income per person term. There may be other explanatory variables that would improve estimation of the observed curvilinearity.
A plot of the standardized residuals by country indicates that the majority of the residuals fall within 2 standard deviations of the residual mean. A few extreme values exceed 2.5 standard deviations in absolute value. These point to observations that the model fits poorly and suggest that the fit could be improved by adding more explanatory variables to explain the variability in the response variable.
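
To judge whether the number of extreme residuals is unusual, it helps to know how many would be expected by chance. The Python sketch below (illustrative names) assumes the standardized residuals are approximately standard normal, which the Q-Q plot only partly supports, and computes the expected count beyond a given cutoff for 182 observations.

```python
import math

N_OBS = 182  # observations used in the model


def two_sided_tail_prob(z):
    """P(|Z| > z) for a standard normal variable Z."""
    return math.erfc(z / math.sqrt(2))


def expected_count_beyond(z, n=N_OBS):
    """Expected number of residuals larger than z in absolute value."""
    return n * two_sided_tail_prob(z)


if __name__ == "__main__":
    for cutoff in (2.0, 2.5):
        print(f"Expected beyond +/-{cutoff}: {expected_count_beyond(cutoff):.1f}")
```

Under this assumption roughly 8 residuals beyond 2 and about 2 beyond 2.5 would be expected, so counts noticeably above those figures are what would indicate a poorly fitting model.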
The Outlier and Leverage Diagnostics plot shows that the majority of the points have leverage close to zero (leverage measures how much influence an observation has on its own predicted value) and standardized residuals within 2 in absolute value. That is, most observations have little leverage on the model. However, there are 9 observations that are outliers (red), 7 that have high leverage (green), and one that is both an outlier and has high leverage (brown).

Saturday, May 14, 2016