/* hw8 -----------------------------------------------
The following data come from exercise 11.11, pages 395-396, in your text.
A response variable Y has been added to the SAS data set.
The purpose of this exercise is to use the residuals, influence, and
collinearity options in PROC REG for regression diagnostics.
We will regard the sand, silt, and clay percentages at the three
depths as 9 independent variables. If the percentages have been
computed accurately, they add to 100% at each depth. This introduces
near-singularities with the intercept and among the three depths,
because each depth's total is nearly the same constant. In addition,
the percentages at different depths may be highly correlated.
The dependent variable Y is a soil "suitability" index that I have
defined arbitrarily. In the true model, the regression coefficient for
each component is the same at all depths: the sand coefficient is 5,
the silt coefficient is 10, and the clay coefficient is -5. Knowing the
true regression coefficients makes it easier to see the impact of
collinearity on the stability of the estimated regression coefficients.
INDEX = 5*SAND + 10*SILT - 5*CLAY
Only a few options are needed in the programming, and most of these
are given to you. Your job is interpretation.
--------------------------------------------------------------------*/
data soilmix;
  input sand1 silt1 clay1 sand2 silt2 clay2
        sand3 silt3 clay3 Y;
cards;
27.3 25.3 47.4 34.9 24.2 40.7 20.7 36.7 42.6 609.79
40.3 20.4 39.4 42.0 19.8 38.2 45.0 25.3 29.8 832.86
12.7 30.3 57.0 25.7 25.4 49.0 13.1 37.6 49.3 417.96
7.9 27.9 64.2 8.0 26.6 64.4 22.1 30.8 47.1 181.72
16.1 24.2 59.7 14.3 30.4 55.3 5.6 33.4 61.0 194.52
10.4 27.8 61.8 18.3 27.6 54.1 8.2 34.4 57.4 263.81
19.0 33.5 47.5 27.5 37.6 34.9 0.0 30.1 69.9 456.85
15.5 34.4 50.2 11.9 38.8 49.2 4.4 40.8 54.8 635.17
21.4 27.8 50.8 20.2 30.3 49.3 18.9 36.1 45.0 601.82
19.4 25.1 55.5 15.4 35.7 48.9 3.2 44.4 52.4 609.03
39.4 25.5 35.6 42.6 23.6 33.8 38.4 32.5 29.1 1035.43
32.3 32.7 35.0 20.6 28.6 50.8 26.7 37.7 35.6 813.57
35.7 25.0 39.3 42.5 20.1 37.4 60.7 13.0 26.4 864.70
35.2 19.0 45.8 32.5 27.0 40.5 20.5 42.5 37.0 724.58
37.8 21.3 40.9 44.2 19.1 36.7 52.0 21.2 26.8 658.64
30.4 28.7 40.9 30.2 32.0 37.8 11.1 45.1 43.8 851.46
40.3 16.1 43.6 34.9 20.8 44.2 5.4 44.0 50.6 557.99
27.0 28.2 44.8 37.9 30.3 31.8 8.9 57.8 32.8 1077.38
32.8 18.0 49.2 23.2 26.3 50.5 33.2 26.8 40.0 502.18
26.2 26.1 47.7 29.5 34.9 35.6 13.2 34.8 52.0 556.71
;
proc reg data=soilmix;
  model y = sand1 silt1 clay1 sand2 silt2 clay2
            sand3 silt3 clay3 / r collin influence;
run;
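/* Sketch (not part of the original assignment): a quick check of the
   two claims made in the header. The data set name SOILSUMS and the
   variables TOTAL1, TOTAL2, and TOTAL3 are hypothetical names
   introduced here. If the percentages were computed accurately, each
   total should be close to 100. PROC CORR shows how strongly the
   percentages are correlated within and across depths. */
data soilsums;
  set soilmix;
  total1 = sand1 + silt1 + clay1;
  total2 = sand2 + silt2 + clay2;
  total3 = sand3 + silt3 + clay3;
run;
proc means data=soilsums n min max;
  var total1 total2 total3;
run;
proc corr data=soilmix;
  var sand1 silt1 clay1 sand2 silt2 clay2 sand3 silt3 clay3;
run;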
/*----------------------------------------------------------------
1. Since we expected a collinearity problem, start with inspection
of the collinearity diagnostics. Is there a severe
collinearity problem (a condition index exceeding 100)?
If the last principal component is eliminated from consideration, is
there a collinearity problem stemming from the next to last principal
component dimension? How many dimensions of the X-space show a severe
collinearity problem?
Is any coefficient in this model statistically significant at the
5% level? Contrast this with what is being tested by the overall model
F test.
From inspection of the variance decomposition proportions, identify the
variables (columns of X) that are primarily responsible for the most
severe singularity (last principal component dimension).
Make note of the estimates of the regression coefficients and their
standard errors. Compare the estimates to the known true values of
the regression coefficients given above. The differences are very
large. Are any of these differences STATISTICALLY significant? Should
they be?
2. Identify which observation has the greatest potential influence on
the regression results, i.e., the highest leverage according to the
hat-diagonal statistic.
Which observation has the greatest impact on the vector of estimated
regression coefficients, according to Cook's D? Which has the
greatest impact on its own fitted value, according to DFFITS? Are there
any observations that have major influence in several respects, i.e.,
as measured by several different influence measures? Is the observation
with the largest potential influence (hat diagonal) actually influential
by any of the measures?
3. Inspect the RSTUDENT residuals. Are any residuals large enough to
suggest an outlier in the residuals? Is the observation with the largest
RSTUDENT residual one of the observations that have major influence
according to Cook's D?
4. Insert the appropriate program steps to output the predicted values
and the RSTUDENT residuals, and then plot these RSTUDENT residuals
against Yhat. Does the pattern of the residuals suggest any problem?
Do you have any suggested solutions?
Now, to see how the measures of collinearity can change, rerun
the analysis with all clay variables deleted from the model. Note that
if we assume the components sand, silt, and clay add to 100% at each depth,
dropping "clay" from the model redefines each of the sand and silt
coefficients. That is, you are in effect replacing "clay" with
"100 - sand - silt". Make this substitution for each depth in the true
model and note how the regression coefficients for intercept, sand, and
silt are changed.
--------------------------------------------------------------------------*/
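/* Sketch for step 4 (one possible way, not the only one): save the
   predicted values and the RSTUDENT residuals from the full model, then
   plot one against the other. The data set name DIAG and the variable
   names YHAT and RSTUD are hypothetical. PROC SGPLOT assumes SAS 9.2 or
   later; PROC PLOT or PROC GPLOT would serve on older releases. */
proc reg data=soilmix noprint;
  model y = sand1 silt1 clay1 sand2 silt2 clay2
            sand3 silt3 clay3;
  output out=diag p=yhat rstudent=rstud;   /* predicted values, RSTUDENT */
run;
proc sgplot data=diag;
  scatter x=yhat y=rstud;
  refline 0 / axis=y;                      /* zero line for reference */
run;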
proc reg data=soilmix;
  model y = sand1 silt1 sand2 silt2 sand3 silt3 / collin;
run;
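/* Sketch: the clay substitution worked through under two assumptions
   that the header implies but does not state outright: the true model
   sums the index over the three depths, and its intercept is 0.
   Replacing clay with (100 - sand - silt) at one depth gives
      5*sand + 10*silt - 5*(100 - sand - silt) = -500 + 10*sand + 15*silt
   so over three depths the implied coefficients would be intercept
   -1500, sand 10, and silt 15. The names TRUECHK, INDEX9, INDEX6, and
   DIFF are hypothetical. INDEX9 and INDEX6 agree only to the extent
   that the recorded percentages add to exactly 100 at each depth. */
data truechk;
  set soilmix;
  index9 = 5*(sand1 + sand2 + sand3) + 10*(silt1 + silt2 + silt3)
         - 5*(clay1 + clay2 + clay3);      /* nine-variable form */
  index6 = -1500 + 10*(sand1 + sand2 + sand3)
         + 15*(silt1 + silt2 + silt3);     /* clay substituted out */
  diff = y - index9;                       /* what the assumed model leaves over */
run;
proc print data=truechk;
  var y index9 index6 diff;
run;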
/*------------------------------------------------------------------------
5. Compare the collinearity diagnostics from this analysis to the first
analysis. Pay particular attention to the stability of the regression
coefficients. Does there continue to be any remaining severe
collinearity problem?
6. Test the residuals for normality using PROC UNIVARIATE. What is
your conclusion? For residuals stored in a variable named R, this
would do the job:
PROC UNIVARIATE NORMAL PLOT; VAR R; RUN;
----------------------------------------------------------------------*/
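/* Sketch for step 6, assuming the residuals of interest are those from
   the reduced (no-clay) model; the assignment does not say which model
   to use. The data set name RESID6 is hypothetical. */
proc reg data=soilmix noprint;
  model y = sand1 silt1 sand2 silt2 sand3 silt3;
  output out=resid6 r=r;                   /* ordinary residuals saved as R */
run;
proc univariate data=resid6 normal plot;
  var r;
run;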