Saturday, May 7, 2016

Basics of Linear Regression

This analysis uses the Gapminder dataset, whose codebook describes 15 socio-economic factors from 2007 for 213 countries. Based on a review of previous research, I chose to study the association between the urban rate (a quantitative explanatory variable) and the female employment rate (a quantitative response variable). The analysis was run in SAS Studio, using the GLM procedure to fit the data within the framework of a general linear model. Before fitting, the explanatory variable was standardized with PROC STANDARD, which converts the urban rate to a z-score with a mean of 0 and a standard deviation of 1. Standardizing rescales the variable without changing the shape of its distribution, and it makes the regression results easier to interpret: the intercept becomes the predicted female employment rate at the average urban rate.
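As a rough sketch of what the standardization does, the z-score for each country is (urbanrate - mean of urbanrate) / standard deviation of urbanrate. The PROC SQL step below computes it by hand so that it can be compared with the PROC STANDARD result. It assumes WORK.GAPMINDER has already been imported as in the SAS code in this post; the table name zcheck and the column name zurbanrate_manual are made up for illustration and are not part of the original analysis.

```sas
/* Illustrative cross-check only; not part of the original analysis. */
/* zcheck and zurbanrate_manual are hypothetical names.              */
PROC SQL;
  CREATE TABLE zcheck AS
  SELECT urbanrate,
         /* MEAN() and STD() are computed over all non-missing rows   */
         /* and remerged onto each row; STD() uses the n-1 divisor.   */
         (urbanrate - MEAN(urbanrate)) / STD(urbanrate) AS zurbanrate_manual
  FROM WORK.GAPMINDER;
QUIT;

/* The manual z-score should show a mean near 0 and a std dev of 1 */
PROC MEANS DATA=zcheck;
  VAR zurbanrate_manual;
RUN;
```

SAS will log a note about remerging summary statistics for this query; that is expected here, since each row is compared against the overall mean and standard deviation.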

SAS CODE

%web_drop_table(WORK.GAPMINDER);

PROC IMPORT DATAFILE="/home/mst07221/gapminder.csv"
DBMS=CSV
OUT=WORK.GAPMINDER;
GETNAMES=YES;
RUN;

***********************************************************************
PROC STANDARD: A new explanatory variable is created from the current explanatory variable. 
PROC MEANS: the results are confirmed
**********************************************************************;
DATA gapminder2;
  SET WORK.GAPMINDER;
  zurbanrate = urbanrate;
RUN;

PROC STANDARD DATA=gapminder2 MEAN=0 STD=1 OUT=zgapminder2;
  VAR zurbanrate;
RUN;

PROC MEANS DATA=zgapminder2;
  VAR zurbanrate urbanrate femaleemployrate;
RUN;
  
***********************************************************************
BASIC LINEAR REGRESSION. Generate a scatter plot.
**********************************************************************;
PROC SGPLOT DATA=zgapminder2;
  reg x=zurbanrate y=femaleemployrate / lineattrs=(color=blue thickness=2);
  title "Scatterplot for the association Between Urban Rate and Female Employment Rate";
  yaxis label= "Female Employment Rate";
  xaxis label="Standardized Urban Rate";
RUN;
title;

***********************************************************************
The GLM procedure uses the method of least squares to fit general linear models.
The option 'solution' produces parameter estimates.
**********************************************************************;
PROC GLM DATA=zgapminder2;
  MODEL femaleemployrate = zurbanrate / SOLUTION;
RUN;
QUIT;

%web_open_table(WORK.GAPMINDER);


RESULTS




The MEANS Procedure

Variable            N            Mean       Std Dev       Minimum       Maximum
zurbanrate        203    7.328566E-16     1.0000000    -1.9446211     1.8129907
urbanrate         203      56.7693596    23.8449326    10.4000000   100.0000000
femaleemployrate  178      47.5494381    14.6257425    11.3000002    83.3000031

The SGPlot Procedure

The GLM Procedure

Number of Observations Read  213
Number of Observations Used  173

Dependent Variable: femaleemployrate

Source             DF   Sum of Squares   Mean Square   F Value   Pr > F
Model               1       3446.74019    3446.74019     17.28   <.0001
Error             171      34100.73113     199.41948
Corrected Total   172      37547.47132

R-Square   Coeff Var   Root MSE   femaleemployrate Mean
0.091797    29.57241   14.12160                47.75260

Source       DF     Type I SS   Mean Square   F Value   Pr > F
zurbanrate    1   3446.740194   3446.740194     17.28   <.0001

Source       DF   Type III SS   Mean Square   F Value   Pr > F
zurbanrate    1   3446.740194   3446.740194     17.28   <.0001

Parameter       Estimate   Standard Error   t Value   Pr > |t|
Intercept    47.72703970       1.07366269     44.45     <.0001
zurbanrate   -4.58870848       1.10374814     -4.16     <.0001

Fit Plot for femaleemployrate by zurbanrate
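
PROC GLM drops any observation that is missing the response or the explanatory variable, which is why only 173 of the 213 observations read were used. A minimal sketch of how that count could be checked is shown here; the dataset name complete_cases is hypothetical and the step is not part of the original analysis.

```sas
/* Illustrative check only; complete_cases is a hypothetical name. */
DATA complete_cases;
  SET zgapminder2;
  /* CMISS counts how many of its arguments are missing;  */
  /* keep only rows where both model variables are present */
  IF CMISS(zurbanrate, femaleemployrate) = 0;
RUN;

/* N should match "Number of Observations Used" in the GLM output */
PROC MEANS DATA=complete_cases N MEAN;
  VAR zurbanrate femaleemployrate;
RUN;
```

If the subset matches what PROC GLM used, N should be 173 and the mean of femaleemployrate should agree with the "femaleemployrate Mean" reported by GLM rather than with the mean over all 178 non-missing values.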

SUMMARY OF FINDINGS

  • Of the 213 observations read, 173 were used in the model; the rest were dropped because they were missing the urban rate, the female employment rate, or both.
  • The overall F test is significant (F = 17.28, p < 0.0001), so we can reject the null hypothesis of no association and conclude that the urban rate is significantly associated with the female employment rate.
  • The parameter estimates show an intercept of 47.73 and a slope of -4.59 for the standardized urban rate (beta0 = 47.73, beta1 = -4.59).
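  • Therefore the best-fit line is: femaleemployrate = 47.73 - 4.59 * zurbanrate, where zurbanrate is the standardized urban rate. The female employment rate is predicted to fall by about 4.59 points for each one-standard-deviation increase in the urban rate. This negative association is also evident in the fit plot.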
  • The p-values for both the intercept and the slope are very small (both p < 0.0001). The slope's p-value indicates that it differs significantly from zero; on its own it does not establish that the relationship is a straight line, which is better judged from the fit plot.
  • The R-square value of 0.092 indicates that the urban rate accounts for only 9.2% of the variance in the female employment rate. Most of the variation is left unexplained, so other variables not included in this model, some of which may confound the association, are likely to be involved.
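
The figures in the summary can be traced back to the output tables with a little arithmetic. The DATA step below is only a sketch: every number in it is copied by hand from the PROC GLM and PROC MEANS output above, and the variable names are made up for illustration. It recomputes R-square and F from the sums of squares, converts the slope back to raw urban-rate units, and prints predicted values at one standard deviation below the mean, at the mean, and one standard deviation above it.

```sas
/* Illustrative arithmetic only; values are copied from the output above. */
DATA _NULL_;
  ss_model = 3446.74019;    /* Model sum of squares (DF = 1)      */
  ss_total = 37547.47132;   /* Corrected total sum of squares     */
  ms_error = 199.41948;     /* Error mean square                  */
  b0       = 47.72703970;   /* Intercept estimate                 */
  b1       = -4.58870848;   /* Slope for zurbanrate               */
  sd_urban = 23.8449326;    /* Std dev of urbanrate (PROC MEANS)  */

  r_square  = ss_model / ss_total;  /* share of variance explained        */
  f_value   = ss_model / ms_error;  /* model DF is 1, so MS model = SS    */
  slope_raw = b1 / sd_urban;        /* change per 1 unit of raw urbanrate */
  PUT r_square= 8.4 f_value= 8.2 slope_raw= 8.3;

  /* Predicted female employment rate at z = -1, 0 and 1 */
  DO z = -1 TO 1;
    predicted = b0 + b1 * z;
    PUT z= predicted= 8.2;
  END;
RUN;
```

The log should reproduce the reported R-square of about 0.092 and F of about 17.28. The raw-unit slope works out to roughly -0.19, meaning a predicted drop of about 0.19 points in the female employment rate for each additional point of urban rate.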
