The analysis was generated using SAS Studio
SAS Code
PROC IMPORT DATAFILE="/home/mst07221/gapminder.csv"
DBMS=CSV
OUT=WORK.GAPMINDER;
GETNAMES=YES;
RUN;
data new; set WORK.GAPMINDER;
run;
********************************************************************
POLYNOMIAL REGRESSION
Testing the relationship between:
Explanatory variable: Income per Person
Response Variable: Internet Use Rate
Confounding variable: Urban Rate
********************************************************************;
* scatterplot with linear regression line internetuserate response variable;
* CLM option generates 95% confidence intervals;
proc sgplot;
reg x=incomeperperson y=internetuserate / lineattrs=(color=blue thickness=2) clm;
yaxis label="Internet Use Rate";
xaxis label="Income per Person";
run;
* scatterplot with linear and quadratic regression line;
proc sgplot;
reg x=incomeperperson y=internetuserate / lineattrs=(color=blue thickness=2) degree=1 clm;
reg x=incomeperperson y=internetuserate / lineattrs=(color=green thickness=2) degree=2 clm;
yaxis label="Internet Use Rate";
xaxis label="Income per Person";
run;
* centering quantitative explanatory variables by subtracting the mean;
data new2; set new;
if incomeperperson ne . and internetuserate ne . and urbanrate ne .;
incomeperperson_c=incomeperperson-8283.35;
urbanrate_c=urbanrate-55.8960440;
run;
proc means; var incomeperperson urbanrate;
run;
* check coding;
proc means; var incomeperperson_c urbanrate_c;
run;
* linear regression model;
PROC glm;
model internetuserate=incomeperperson_c/solution clparm;
run;
* polynomial regression model;
PROC glm;
model internetuserate=incomeperperson_c incomeperperson_c*incomeperperson_c/solution clparm;
run;
**********************************************************************
EVALUATING MODEL FIT
**********************************************************************;
* multiple regression adding urban rate;
PROC glm;
model internetuserate=incomeperperson_c incomeperperson_c*incomeperperson_c urbanrate_c/solution clparm;
run;
* request regression diagnostic plots;
PROC glm PLOTS(unpack)=all;
model internetuserate=incomeperperson_c incomeperperson_c*incomeperperson_c
urbanrate_c/solution clparm;
output residual=res student=stdres out=results;
run;
* plot of standardized residuals for each observation;
* vref=0 indicates a horizontal line at the mean;
proc gplot;
label stdres="Standardized Residual" country="Country";
plot stdres*country/vref=0;
run;
* using proc reg to get a partial regression plot;
* calculate quadratic terms;
data partial;
set new2;
incomeperperson2=incomeperperson_c*incomeperperson_c;
run;
* partial regression plot (the linear terms are uncentered here, so only the intercept differs from the PROC GLM fit);
PROC reg plots=partial;
model internetuserate=incomeperperson incomeperperson2
urbanrate/partial;
run;
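The centering step in the program above subtracts typed-in means (8283.35 and 55.8960440), which is why the centered income variable has a mean of 0.000498908 rather than exactly zero. As a hedged alternative, the sketch below lets PROC SQL compute the means from the complete cases. The data set name new2_sql is hypothetical and is not part of the original program.

```sas
* Sketch only: center on means computed from the data instead of typed-in constants;
* new2_sql is a hypothetical data set name introduced for this example;
proc sql;
  create table new2_sql as
  select *,
         incomeperperson - mean(incomeperperson) as incomeperperson_c,
         urbanrate - mean(urbanrate) as urbanrate_c
  from new
  where incomeperperson is not missing
    and internetuserate is not missing
    and urbanrate is not missing;
quit;

* check coding: both centered means should be zero up to floating-point error;
proc means data=new2_sql;
  var incomeperperson_c urbanrate_c;
run;
```

Because the WHERE clause is applied before the means are computed, the centering uses the same complete cases that enter the models. SAS writes a note about remerging summary statistics with the original data, which is expected here.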
Results
The MEANS Procedure
Variable | N | Mean | Std Dev | Minimum | Maximum |
---|---|---|---|---|---|
incomeperperson | 182 | 8283.35 | 12509.74 | 103.7758572 | 81647.10 |
urbanrate | 182 | 55.8960440 | 23.6293130 | 10.4000000 | 100.0000000 |
The MEANS Procedure
Variable | N | Mean | Std Dev | Minimum | Maximum |
---|---|---|---|---|---|
incomeperperson_c | 182 | 0.000498908 | 12509.74 | -8179.57 | 73363.75 |
urbanrate_c | 182 | -4.395605E-8 | 23.6293130 | -45.4960440 | 44.1039560 |
The GLM Procedure
Number of Observations Read | 182 |
---|---|
Number of Observations Used | 182 |
The GLM Procedure
Dependent Variable: internetuserate
Source | DF | Sum of Squares | Mean Square | F Value | Pr > F |
---|---|---|---|---|---|
Model | 1 | 80249.2619 | 80249.2619 | 232.46 | <.0001 |
Error | 180 | 62139.7493 | 345.2208 | ||
Corrected Total | 181 | 142389.0113 |
R-Square | Coeff Var | Root MSE | internetuserate Mean |
---|---|---|---|
0.563592 | 52.65300 | 18.58012 | 35.28786 |
Source | DF | Type I SS | Mean Square | F Value | Pr > F |
---|---|---|---|---|---|
incomeperperson_c | 1 | 80249.26194 | 80249.26194 | 232.46 | <.0001 |
Source | DF | Type III SS | Mean Square | F Value | Pr > F |
---|---|---|---|---|---|
incomeperperson_c | 1 | 80249.26194 | 80249.26194 | 232.46 | <.0001 |
Parameter | Estimate | Standard Error | t Value | Pr > \|t\| | 95% Confidence Limits | |
---|---|---|---|---|---|---|
Intercept | 35.28786163 | 1.37725007 | 25.62 | <.0001 | 32.57022935 | 38.00549391 |
incomeperperson_c | 0.00168319 | 0.00011040 | 15.25 | <.0001 | 0.00146535 | 0.00190103 |
The GLM Procedure
Number of Observations Read | 182 |
---|---|
Number of Observations Used | 182 |
The GLM Procedure
Dependent Variable: internetuserate
Source | DF | Sum of Squares | Mean Square | F Value | Pr > F |
---|---|---|---|---|---|
Model | 2 | 100630.7155 | 50315.3578 | 215.68 | <.0001 |
Error | 179 | 41758.2958 | 233.2866 | ||
Corrected Total | 181 | 142389.0113 |
R-Square | Coeff Var | Root MSE | internetuserate Mean |
---|---|---|---|
0.706731 | 43.28322 | 15.27372 | 35.28786 |
Source | DF | Type I SS | Mean Square | F Value | Pr > F |
---|---|---|---|---|---|
incomeperperson_c | 1 | 80249.26194 | 80249.26194 | 343.99 | <.0001 |
incomeper*incomeperp | 1 | 20381.45356 | 20381.45356 | 87.37 | <.0001 |
Source | DF | Type III SS | Mean Square | F Value | Pr > F |
---|---|---|---|---|---|
incomeperperson_c | 1 | 81295.94134 | 81295.94134 | 348.48 | <.0001 |
incomeper*incomeperp | 1 | 20381.45356 | 20381.45356 | 87.37 | <.0001 |
Parameter | Estimate | Standard Error | t Value | Pr > \|t\| | 95% Confidence Limits | |
---|---|---|---|---|---|---|
Intercept | 40.77595082 | 1.27535818 | 31.97 | <.0001 | 38.25927960 | 43.29262204 |
incomeperperson_c | 0.00279846 | 0.00014991 | 18.67 | <.0001 | 0.00250265 | 0.00309428 |
incomeper*incomeperp | -0.00000004 | 0.00000000 | -9.35 | <.0001 | -0.00000004 | -0.00000003 |
The GLM Procedure
Number of Observations Read | 182 |
---|---|
Number of Observations Used | 182 |
The GLM Procedure
Dependent Variable: internetuserate
Source | DF | Sum of Squares | Mean Square | F Value | Pr > F |
---|---|---|---|---|---|
Model | 3 | 102488.5489 | 34162.8496 | 152.40 | <.0001 |
Error | 178 | 39900.4624 | 224.1599 | ||
Corrected Total | 181 | 142389.0113 |
R-Square | Coeff Var | Root MSE | internetuserate Mean |
---|---|---|---|
0.719778 | 42.42810 | 14.97197 | 35.28786 |
Source | DF | Type I SS | Mean Square | F Value | Pr > F |
---|---|---|---|---|---|
incomeperperson_c | 1 | 80249.26194 | 80249.26194 | 358.00 | <.0001 |
incomeper*incomeperp | 1 | 20381.45356 | 20381.45356 | 90.92 | <.0001 |
urbanrate_c | 1 | 1857.83335 | 1857.83335 | 8.29 | 0.0045 |
Source | DF | Type III SS | Mean Square | F Value | Pr > F |
---|---|---|---|---|---|
incomeperperson_c | 1 | 33786.50522 | 33786.50522 | 150.73 | <.0001 |
incomeper*incomeperp | 1 | 9640.21552 | 9640.21552 | 43.01 | <.0001 |
urbanrate_c | 1 | 1857.83335 | 1857.83335 | 8.29 | 0.0045 |
Parameter | Estimate | Standard Error | t Value | Pr > \|t\| | 95% Confidence Limits | |
---|---|---|---|---|---|---|
Intercept | 39.73965684 | 1.30095296 | 30.55 | <.0001 | 37.17238113 | 42.30693255 |
incomeperperson_c | 0.00242017 | 0.00019713 | 12.28 | <.0001 | 0.00203116 | 0.00280918 |
incomeper*incomeperp | -0.00000003 | 0.00000000 | -6.56 | <.0001 | -0.00000004 | -0.00000002 |
urbanrate_c | 0.18291152 | 0.06353553 | 2.88 | 0.0045 | 0.05753172 | 0.30829132 |
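The R-Square values reported for the three GLM fits can be recovered from their ANOVA tables as the model sum of squares divided by the corrected total sum of squares. The F value for the quadratic term is its Type I sum of squares divided by the error mean square of the quadratic model. The DATA step below is only an arithmetic sketch using the numbers printed above; the variable names are made up for illustration.

```sas
data _null_;
  * corrected total sum of squares is the same for all three models;
  ss_total = 142389.0113;
  * R-square = model sum of squares / corrected total sum of squares;
  r2_linear    =  80249.2619 / ss_total;
  r2_quadratic = 100630.7155 / ss_total;
  r2_urban     = 102488.5489 / ss_total;
  * F for the quadratic term = Type I SS / error mean square of the quadratic model;
  f_quadratic  = 20381.45356 / 233.2866;
  * write the results to the SAS log;
  put r2_linear= r2_quadratic= r2_urban= f_quadratic=;
run;
```

The values written to the log should agree, up to rounding, with the reported R-Square values (0.563592, 0.706731, 0.719778) and with the F value of 87.37 for the quadratic term.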
The GLM Procedure
Number of Observations Read | 182 |
---|---|
Number of Observations Used | 182 |
The GLM Procedure
Dependent Variable: internetuserate
Source | DF | Sum of Squares | Mean Square | F Value | Pr > F |
---|---|---|---|---|---|
Model | 3 | 102488.5489 | 34162.8496 | 152.40 | <.0001 |
Error | 178 | 39900.4624 | 224.1599 | ||
Corrected Total | 181 | 142389.0113 |
R-Square | Coeff Var | Root MSE | internetuserate Mean |
---|---|---|---|
0.719778 | 42.42810 | 14.97197 | 35.28786 |
Source | DF | Type I SS | Mean Square | F Value | Pr > F |
---|---|---|---|---|---|
incomeperperson_c | 1 | 80249.26194 | 80249.26194 | 358.00 | <.0001 |
incomeper*incomeperp | 1 | 20381.45356 | 20381.45356 | 90.92 | <.0001 |
urbanrate_c | 1 | 1857.83335 | 1857.83335 | 8.29 | 0.0045 |
Source | DF | Type III SS | Mean Square | F Value | Pr > F |
---|---|---|---|---|---|
incomeperperson_c | 1 | 33786.50522 | 33786.50522 | 150.73 | <.0001 |
incomeper*incomeperp | 1 | 9640.21552 | 9640.21552 | 43.01 | <.0001 |
urbanrate_c | 1 | 1857.83335 | 1857.83335 | 8.29 | 0.0045 |
Parameter | Estimate | Standard Error | t Value | Pr > \|t\| | 95% Confidence Limits | |
---|---|---|---|---|---|---|
Intercept | 39.73965684 | 1.30095296 | 30.55 | <.0001 | 37.17238113 | 42.30693255 |
incomeperperson_c | 0.00242017 | 0.00019713 | 12.28 | <.0001 | 0.00203116 | 0.00280918 |
incomeper*incomeperp | -0.00000003 | 0.00000000 | -6.56 | <.0001 | -0.00000004 | -0.00000002 |
urbanrate_c | 0.18291152 | 0.06353553 | 2.88 | 0.0045 | 0.05753172 | 0.30829132 |
The REG Procedure
Model: MODEL1
Dependent Variable: internetuserate
Number of Observations Read | 182 |
---|---|
Number of Observations Used | 182 |
Analysis of Variance
Source | DF | Sum of Squares | Mean Square | F Value | Pr > F |
---|---|---|---|---|---|
Model | 3 | 102489 | 34163 | 152.40 | <.0001 |
Error | 178 | 39900 | 224.15990 | ||
Corrected Total | 181 | 142389 |
Root MSE | 14.97197 | R-Square | 0.7198 |
---|---|---|---|
Dependent Mean | 35.28786 | Adj R-Sq | 0.7151 |
Coeff Var | 42.42810 |
Parameter Estimates
Variable | DF | Parameter Estimate | Standard Error | t Value | Pr > \|t\| |
---|---|---|---|---|---|
Intercept | 1 | 9.46850 | 3.14612 | 3.01 | 0.0030 |
incomeperperson | 1 | 0.00242 | 0.00019713 | 12.28 | <.0001 |
incomeperperson2 | 1 | -2.86043E-8 | 4.361813E-9 | -6.56 | <.0001 |
urbanrate | 1 | 0.18291 | 0.06354 | 2.88 | 0.0045 |
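The PROC REG intercept (9.46850) differs from the PROC GLM intercept (39.73965684) even though the slopes match. The REG model uses the uncentered incomeperperson and urbanrate together with the squared centered income term, so the contribution of the two means moves into the intercept. The DATA step below is a small arithmetic sketch of that relationship, using the printed estimates and the means typed into the centering step. The variable names are invented for this example.

```sas
data _null_;
  * estimates from the centered PROC GLM model with urban rate;
  b0_centered = 39.73965684;
  b_income    = 0.00242017;
  b_urban     = 0.18291152;
  * means that were subtracted when centering;
  mean_income = 8283.35;
  mean_urban  = 55.8960440;
  * uncentering the two linear terms shifts the intercept by slope times mean;
  b0_uncentered = b0_centered - b_income*mean_income - b_urban*mean_urban;
  put b0_uncentered=;
run;
```

The result is about 9.4685, which agrees with the REG intercept up to rounding of the printed estimates. This indicates that the two procedures fit the same model in different parameterizations.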
Summary of findings
- Of the 213 observations available, 182 had non-missing values for income per person, Internet use rate, and urban rate and were used in the models.
- The scatterplot shows a positive association between income per person and Internet use rate in the Gapminder data set. The quadratic regression line is concave (its slope decreases as income increases) rather than straight, and it fits the data better than the linear regression line.
- After centering the quantitative explanatory variable incomeperperson, the linear regression model has a significant overall F test (F = 232.46, p < 0.0001). We can reject the null hypothesis and conclude that income per person is significantly associated with Internet use rate. The findings support the hypothesis.
- The parameter estimates give an intercept of 35.29 and a slope of 0.0017 (beta0 = 35.29, beta1 = 0.0017). The best-fit line for the linear regression is therefore: internetuserate = 35.29 + 0.0017 * incomeperperson_c, where incomeperperson_c is income per person minus its mean of 8283.35. This indicates a positive association between the two variables, also evident in the fit plot. The R-square value of 0.564 indicates that the explanatory variable accounts for 56.4% of the variance in the response variable. The remaining 43.6% is not explained by this model and may reflect other variables, including potential confounders, as well as random error.
- When a second-order polynomial term is added to the regression model, the overall F test remains significant (F = 215.68, p < 0.0001), as does the linear income term (Type I F = 343.99, p < 0.0001). The R-square value shows that the linear and quadratic income terms together account for 70.7% of the variance in Internet use rate, up from 56.4% for the linear model. The quadratic term has a negative and statistically significant coefficient (beta0 = 40.78, beta1 = 0.0028, beta2 = -0.00000004).
- We can check the regression model for misspecification by adding another centered quantitative explanatory variable, urban rate. The coefficients for the linear and quadratic income per person terms remain significant after adjusting for urban rate. Urban rate is also statistically significant (beta = 0.18, p = 0.0045). Its positive regression coefficient indicates that countries with higher urban rates tend to have higher Internet use rates. The R-square value increases to 0.72.
- Income per person and urban rate together explain 72% of the variability in Internet use rate, so 28% remains as residual (unexplained) variation. The Q-Q plot shows that the residuals generally follow a straight line but deviate somewhat at the lower and higher quantiles, i.e. the residuals are not perfectly normally distributed. This suggests that the curvilinear association observed in the scatterplot is not fully captured by the quadratic incomeperperson term. Other explanatory variables might improve estimation of the observed curvilinearity.
- A plot of the standardized residuals by country shows that most residuals fall within 2 standard deviations of the residual mean of 0. A few extreme values exceed 2.5 standard deviations in absolute value. This suggests that the model fits some countries poorly, and that the fit could be improved by adding explanatory variables that account for more of the variability in the response variable.
- The Outlier and Leverage Diagnostics plot shows that most points have leverage close to zero (leverage measures an observation's influence on the estimated predicted values) and standardized residuals within 2 in absolute value. That is, most observations have little leverage on the model. However, 9 observations are outliers (red), 7 have high leverage (green), and one is both an outlier and a high-leverage point (brown).
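The residual cut-offs mentioned in the findings can also be counted directly rather than read off the plot. The sketch below assumes the results data set and the stdres variable created by the OUTPUT statement in the diagnostic PROC GLM step are still available. The names flagged, over2 and over2_5 are hypothetical and introduced only for this example.

```sas
* Sketch: flag standardized residuals beyond 2 and 2.5 in absolute value;
data flagged;
  set results;
  over2   = (abs(stdres) > 2);
  over2_5 = (abs(stdres) > 2.5);
run;

* SUM gives the number of flagged countries and MEAN gives the proportion;
proc means data=flagged n sum mean;
  var over2 over2_5;
run;

* list the countries with the most extreme residuals;
proc print data=flagged;
  where over2_5 = 1;
  var country stdres;
run;
```

Comparing the proportions with the shares expected under normality (roughly 5% beyond 2 standard deviations) gives a more concrete basis for judging whether the number of extreme residuals is a concern.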