Saturday, May 14, 2016

Multiple Regression

The analysis is based on the Gapminder dataset, which contains 15 socio-economic factors from 2007 drawn from 213 countries. I chose to examine the association between income per person (a quantitative explanatory variable) and internet use rate (a quantitative response variable). A second quantitative explanatory variable, urban rate, was later added in the multiple regression model. The hypothesis is that high incomes indicate wealth in a country, and therefore countries with high incomes will have higher internet use rates. They also tend to have higher urbanization rates.

The analysis was generated using SAS Studio.


SAS Code

PROC IMPORT DATAFILE="/home/mst07221/gapminder.csv"
DBMS=CSV
OUT=WORK.GAPMINDER;
GETNAMES=YES;
RUN;

data new; set WORK.GAPMINDER;
run;

********************************************************************
POLYNOMIAL REGRESSION
Testing the relationship between:
Explanatory variable: Income per Person
Response Variable: Internet Use Rate
Confounding variable: Urban Rate
********************************************************************;

* scatterplot with linear regression line internetuserate response variable;
* CLM option generates 95% confidence intervals;
proc sgplot;
  reg x=incomeperperson y=internetuserate / lineattrs=(color=blue thickness=2) clm;
  yaxis label="Internet Use Rate";
  xaxis label="Income per Person";
run;

* scatterplot with linear and quadratic regression line;
proc sgplot;
  reg x=incomeperperson y=internetuserate / lineattrs=(color=blue thickness=2) degree=1 clm;
  reg x=incomeperperson y=internetuserate / lineattrs=(color=green thickness=2) degree=2 clm;
  yaxis label="Internet Use Rate";
  xaxis label="Income per Person";
run;

* centering quantitative explanatory variables by subtracting the mean;
* the constants are the complete-case means reported by proc means below;
data new2; set new;
if incomeperperson ne . and internetuserate ne . and urbanrate ne .;
incomeperperson_c=incomeperperson-8283.35;
urbanrate_c=urbanrate-55.8960440;
run;
proc means; var incomeperperson urbanrate;
run;

* check coding;
proc means; var incomeperperson_c urbanrate_c;
run;

* linear regression model;
PROC glm; 
model internetuserate=incomeperperson_c/solution clparm;
run;

* polynomial regression model;
PROC glm; 
model internetuserate=incomeperperson_c incomeperperson_c*incomeperperson_c/solution clparm;
run;

**********************************************************************
EVALUATING MODEL FIT
**********************************************************************;

* multiple regression adding urban rate;
PROC glm; 
model internetuserate=incomeperperson_c incomeperperson_c*incomeperperson_c urbanrate_c/solution clparm;
run;

* request regression diagnostic plots;
PROC glm PLOTS(unpack)=all;
 model internetuserate=incomeperperson_c incomeperperson_c*incomeperperson_c 
 urbanrate_c/solution clparm;
 output residual=res student=stdres out=results;
run;

* plot of standardized residuals for each observation;
* vref=0 indicates a horizontal line at the mean;
proc gplot;
label stdres="Standardized Residual" country="Country";
plot stdres*country/vref=0; 
run;

* using proc reg to get a partial regression plot;
* calculate quadratic terms;
data partial;
set new2;
incomeperperson2=incomeperperson_c*incomeperperson_c;
run;

* partial regression plot;
* the linear terms are entered uncentered here, which shifts only the intercept;
PROC reg plots=partial;
 model internetuserate=incomeperperson incomeperperson2 
 urbanrate/partial;
run;
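
The centering step above types the means in by hand. As a minimal sketch, assuming the same input dataset new, the means could instead be computed and subtracted in one step with PROC SQL. The output name new2_auto is hypothetical and is not part of the original analysis.

```sas
* Sketch only: center the predictors without hard-coding the means;
* new2_auto is an illustrative dataset name, not used elsewhere in this post;
proc sql;
  create table new2_auto as
  select *,
         /* PROC SQL remerges the summary statistic with each row */
         incomeperperson - mean(incomeperperson) as incomeperperson_c,
         urbanrate - mean(urbanrate) as urbanrate_c
  from new
  /* keep complete cases only, matching the subsetting IF in new2 */
  where incomeperperson is not missing
    and internetuserate is not missing
    and urbanrate is not missing;
quit;

* check coding: both centered means should be approximately zero;
proc means data=new2_auto; var incomeperperson_c urbanrate_c;
run;
```

Because the WHERE clause is applied before the means are computed, the centering uses the same complete cases that the regression models use.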



Results






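The SGPlot Procedure
[Scatterplot of Internet Use Rate by Income per Person with linear regression line]

The SGPlot Procedure
[Scatterplot of Internet Use Rate by Income per Person with linear and quadratic regression lines]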

The MEANS Procedure

Variable             N          Mean       Std Dev       Minimum       Maximum
incomeperperson    182       8283.35      12509.74   103.7758572      81647.10
urbanrate          182    55.8960440    23.6293130    10.4000000   100.0000000

The MEANS Procedure

Variable               N          Mean       Std Dev       Minimum       Maximum
incomeperperson_c    182   0.000498908      12509.74      -8179.57      73363.75
urbanrate_c          182   -4.395605E-8    23.6293130   -45.4960440    44.1039560

The GLM Procedure
Number of Observations Read: 182
Number of Observations Used: 182

The GLM Procedure
Dependent Variable: internetuserate

Source             DF   Sum of Squares   Mean Square   F Value   Pr > F
Model               1       80249.2619    80249.2619    232.46   <.0001
Error             180       62139.7493      345.2208
Corrected Total   181      142389.0113

R-Square   Coeff Var   Root MSE   internetuserate Mean
0.563592    52.65300   18.58012               35.28786

Source              DF     Type I SS   Mean Square   F Value   Pr > F
incomeperperson_c    1   80249.26194   80249.26194    232.46   <.0001

Source              DF   Type III SS   Mean Square   F Value   Pr > F
incomeperperson_c    1   80249.26194   80249.26194    232.46   <.0001

Parameter              Estimate   Standard Error   t Value   Pr > |t|   95% Confidence Limits
Intercept           35.28786163       1.37725007     25.62     <.0001   32.57022935   38.00549391
incomeperperson_c    0.00168319       0.00011040     15.25     <.0001    0.00146535    0.00190103

Fit Plot for internetuserate by incomeperperson_c

The GLM Procedure
Number of Observations Read: 182
Number of Observations Used: 182

The GLM Procedure
Dependent Variable: internetuserate

Source             DF   Sum of Squares   Mean Square   F Value   Pr > F
Model               2      100630.7155    50315.3578    215.68   <.0001
Error             179       41758.2958      233.2866
Corrected Total   181      142389.0113

R-Square   Coeff Var   Root MSE   internetuserate Mean
0.706731    43.28322   15.27372               35.28786

Source                 DF     Type I SS   Mean Square   F Value   Pr > F
incomeperperson_c       1   80249.26194   80249.26194    343.99   <.0001
incomeper*incomeperp    1   20381.45356   20381.45356     87.37   <.0001

Source                 DF   Type III SS   Mean Square   F Value   Pr > F
incomeperperson_c       1   81295.94134   81295.94134    348.48   <.0001
incomeper*incomeperp    1   20381.45356   20381.45356     87.37   <.0001

Parameter                 Estimate   Standard Error   t Value   Pr > |t|   95% Confidence Limits
Intercept              40.77595082       1.27535818     31.97     <.0001   38.25927960   43.29262204
incomeperperson_c       0.00279846       0.00014991     18.67     <.0001    0.00250265    0.00309428
incomeper*incomeperp   -0.00000004       0.00000000     -9.35     <.0001   -0.00000004   -0.00000003

Fit Plot for internetuserate by incomeperperson_c

The GLM Procedure
Number of Observations Read: 182
Number of Observations Used: 182

The GLM Procedure
Dependent Variable: internetuserate

Source             DF   Sum of Squares   Mean Square   F Value   Pr > F
Model               3      102488.5489    34162.8496    152.40   <.0001
Error             178       39900.4624      224.1599
Corrected Total   181      142389.0113

R-Square   Coeff Var   Root MSE   internetuserate Mean
0.719778    42.42810   14.97197               35.28786

Source                 DF     Type I SS   Mean Square   F Value   Pr > F
incomeperperson_c       1   80249.26194   80249.26194    358.00   <.0001
incomeper*incomeperp    1   20381.45356   20381.45356     90.92   <.0001
urbanrate_c             1    1857.83335    1857.83335      8.29   0.0045

Source                 DF   Type III SS   Mean Square   F Value   Pr > F
incomeperperson_c       1   33786.50522   33786.50522    150.73   <.0001
incomeper*incomeperp    1    9640.21552    9640.21552     43.01   <.0001
urbanrate_c             1    1857.83335    1857.83335      8.29   0.0045

Parameter                 Estimate   Standard Error   t Value   Pr > |t|   95% Confidence Limits
Intercept              39.73965684       1.30095296     30.55     <.0001   37.17238113   42.30693255
incomeperperson_c       0.00242017       0.00019713     12.28     <.0001    0.00203116    0.00280918
incomeper*incomeperp   -0.00000003       0.00000000     -6.56     <.0001   -0.00000004   -0.00000002
urbanrate_c             0.18291152       0.06353553      2.88     0.0045    0.05753172    0.30829132

The GLM Procedure (run with PLOTS(unpack)=all)
Number of Observations Read: 182
Number of Observations Used: 182

The ANOVA table, fit statistics, Type I and Type III sums of squares, and parameter estimates for this run are identical to the multiple regression output above.

Diagnostic plots produced:
  • Plot of RStudent by Predicted for internetuserate
  • Plot of RStudent by Leverage for internetuserate
  • Q-Q Plot of Residuals for internetuserate
  • Plot of internetuserate by Predicted
  • Histogram of Residuals for internetuserate with normal and kernel densities overlaid
  • Residual-Fit Spread Plot for internetuserate (two uniform Q-Q plots showing the spread in the fitted values about their mean and the spread in the residuals)
  • Plot of Residual by incomeperperson_c for internetuserate
  • Plot of Residual by incomeper*incomeperp for internetuserate
  • Plot of Residual by urbanrate_c for internetuserate
  • Contour Fit Plot for internetuserate

Plot of stdres by country

The REG Procedure
Model: MODEL1
Dependent Variable: internetuserate
Number of Observations Read: 182
Number of Observations Used: 182

Analysis of Variance
Source             DF   Sum of Squares   Mean Square   F Value   Pr > F
Model               3           102489         34163    152.40   <.0001
Error             178            39900     224.15990
Corrected Total   181           142389

Root MSE         14.97197   R-Square   0.7198
Dependent Mean   35.28786   Adj R-Sq   0.7151
Coeff Var        42.42810

Parameter Estimates
Variable           DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept           1              9.46850          3.14612      3.01     0.0030
incomeperperson     1              0.00242       0.00019713     12.28     <.0001
incomeperperson2    1          -2.86043E-8      4.361813E-9     -6.56     <.0001
urbanrate           1              0.18291          0.06354      2.88     0.0045

The REG Procedure
Model: MODEL1
Dependent Variable: internetuserate
Panel of fit diagnostics for internetuserate.
Panel of scatterplots of residuals by regressors for internetuserate.
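
The REG intercept (9.47) differs from the GLM intercept (39.74) even though the slopes match. A short worked check, using only the estimates and means reported above, suggests this is simply because the linear terms in the REG model are uncentered while the quadratic term is still built from the centered variable. The symbols are shorthand introduced here: x is incomeperperson, u is urbanrate, and the bars denote the means used for centering.

```latex
\begin{align*}
\hat{y} &= b_0 + b_1 (x - \bar{x}) + b_2 (x - \bar{x})^2 + b_3 (u - \bar{u}) \\
        &= \left(b_0 - b_1 \bar{x} - b_3 \bar{u}\right) + b_1 x + b_2 (x - \bar{x})^2 + b_3 u \\[4pt]
b_0 - b_1 \bar{x} - b_3 \bar{u}
        &= 39.73966 - 0.00242017 \times 8283.35 - 0.18291152 \times 55.89604 \\
        &\approx 39.73966 - 20.04712 - 10.22403 \\
        &\approx 9.4685
\end{align*}
```

This matches the REG intercept of 9.46850 to the precision shown, so the two procedures are fitting the same model.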


Summary of findings
  • Of the 213 observations available, 182 were used in the model. The remaining 31 were excluded because at least one of the three variables was missing.
  • The scatterplot shows a positive association between income per person and internet use rate in the Gapminder data set. The quadratic regression line is a concave, inverted-U-shaped curve, and it fits the data better than the linear regression line.
  • After centering the incomeperperson quantitative explanatory variable, the linear regression model shows that the overall F test is significant (F = 232.46, p < 0.0001). We can reject the null hypothesis and conclude that income per person is significantly associated with the internet use rate. The findings support the hypothesis.
  • The parameter estimates show a slope of 0.0017 and an intercept of 35.29 (beta0 = 35.29, beta1 = 0.0017). The best-fit line for the linear regression is therefore: internetuserate = 35.29 + 0.0017 * incomeperperson_c, where incomeperperson_c is income per person minus its mean. This indicates a positive association between the two variables, also evident in the fit plot. The R-square value of 0.564 indicates that the explanatory variable accounts for 56.4% of the variance in the response variable. The remaining 43.6% is not explained by this model and may reflect other variables, including possible confounders.
  • When a second-order polynomial term is added to the regression model, the linear income term remains significant (F = 343.99, p < 0.0001), and the R-square value shows that the linear and quadratic income terms together explain 70.7% of the variance in internet use, a clear improvement over the linear model. The quadratic term has a negative and significant coefficient (beta0 = 40.78, beta1 = 0.003, beta2 = -0.00000004), meaning the increase in internet use slows as income rises.
  • We can check the regression model for misspecification by adding another centered quantitative explanatory variable, urban rate. The results show that the coefficients for the linear and quadratic income per person terms remain significant after adjusting for urban rate. Urban rate is also statistically significant (beta = 0.18, p = 0.0045). The positive regression coefficient indicates that countries with higher urban rates tend to have a higher internet use rate. The R-square value increases to 0.72.
  • Income per person and urban rate together explain 72% of the variability in internet use, so 28% remains as residual error in estimating the response variable. The Q-Q plot shows that the residuals generally follow a straight line but deviate somewhat at the lower and higher quantiles, i.e. the residuals are not perfectly normally distributed. This suggests that the curvilinear association observed in the scatterplot was not fully captured by the quadratic incomeperperson term. Other explanatory variables might improve estimation of the observed curvilinearity.
  • A plot of the standardized residuals by country indicates that the majority of the residuals fall within 2 standard deviations of the residual mean (between -2 and 2). A few extreme values exceed 2.5 standard deviations in absolute value. This is evidence that the level of error in the model is higher than acceptable, and that the fit could be improved by adding more explanatory variables to explain the variability in the response variable.
  • The Outlier and Leverage Diagnostics plot shows that the majority of the points have close to zero leverage (an influence on the estimate of the predicted value) and are within a residual standardized value of 2. That is, the majority of the observations have no leverage on the model. However, there are 9 observations that are outliers (red), 7 that have high leverage (green), and one that is both an outlier and has high leverage (brown). 
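
The three fitted models discussed above can be written out from the reported parameter estimates. This is a rounded restatement rather than new output. The symbols x_c and u_c are shorthand introduced here for incomeperperson_c and urbanrate_c, and the quadratic coefficient in the third model uses the higher-precision value from the REG output.

```latex
% x_c = incomeperperson - 8283.35, u_c = urbanrate - 55.896 (centered predictors)
\begin{align*}
\text{Linear:}    \quad & \hat{y} = 35.29 + 0.00168\,x_c \\
\text{Quadratic:} \quad & \hat{y} = 40.78 + 0.00280\,x_c - 0.00000004\,x_c^2 \\
\text{Multiple:}  \quad & \hat{y} = 39.74 + 0.00242\,x_c - 0.0000000286\,x_c^2 + 0.183\,u_c
\end{align*}

% R-square is the model sum of squares divided by the corrected total
\begin{align*}
R^2_{\text{linear}}    &= 80249.26 / 142389.01 \approx 0.564 \\
R^2_{\text{quadratic}} &= 100630.72 / 142389.01 \approx 0.707 \\
R^2_{\text{multiple}}  &= 102488.55 / 142389.01 \approx 0.720
\end{align*}

% Illustration: income 10000 above the mean, urban rate at its mean
\begin{align*}
\hat{y} &= 39.74 + 0.00242017 \times 10^4 - 2.86043 \times 10^{-8} \times 10^8 + 0.183 \times 0 \\
        &\approx 39.74 + 24.20 - 2.86 \approx 61.1
\end{align*}
```

Under the multiple regression model, a country with average urbanization and an income per person 10,000 above the mean would therefore have a predicted internet use rate of roughly 61.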

