/*  hw8 -----------------------------------------------

  The following data come from exercise 11.11, pages 395-396, in your text.
  A response variable Y has been added to the SAS data set.
  
  The purpose of this exercise is to use the residuals, influence, and
  collinearity options in PROC REG for regression diagnostics.
  We will regard the sand, silt, and clay percentages at the three
  depths as 9 independent variables.  If the percentages have been
  computed accurately, they will add to 100% at each depth.  This
  introduces near singularities with the intercept and among the three
  depths.  In addition, the percentages at different depths may be
  highly correlated with one another.

  The dependent variable Y is a soil "suitability" index that I have
  defined arbitrarily.  In the true model the regression coefficient for
  each component is the same at all depths; i.e., the true regression
  coefficient for sand is 5 at all depths, the silt coefficient is 10 at all
  depths, and the clay coefficient is -5 at all depths.  Knowing the true
  regression coefficients makes it easier to see the impact of collinearity
  on the stability of the estimated regression coefficients.

       INDEX = 5*SAND + 10*SILT - 5*CLAY

  Obviously only a few options are needed in the programming and most
  of these are given to you.  Your job is interpretation.
 --------------------------------------------------------------------*/

Data soilmix; 
input sand1 silt1 clay1 sand2 silt2 clay2
   sand3 silt3 clay3 Y; 
cards;
27.3 25.3 47.4 34.9 24.2 40.7 20.7 36.7 42.6  609.79
40.3 20.4 39.4 42.0 19.8 38.2 45.0 25.3 29.8  832.86
12.7 30.3 57.0 25.7 25.4 49.0 13.1 37.6 49.3  417.96
7.9 27.9 64.2 8.0 26.6 64.4 22.1 30.8 47.1    181.72
16.1 24.2 59.7 14.3 30.4 55.3 5.6 33.4 61.0   194.52
10.4 27.8 61.8 18.3 27.6 54.1 8.2 34.4 57.4   263.81
19.0 33.5 47.5 27.5 37.6 34.9 0.0 30.1 69.9   456.85
15.5 34.4 50.2 11.9 38.8 49.2 4.4 40.8 54.8   635.17
21.4 27.8 50.8 20.2 30.3 49.3 18.9 36.1 45.0  601.82
19.4 25.1 55.5 15.4 35.7 48.9 3.2 44.4 52.4   609.03
39.4 25.5 35.6 42.6 23.6 33.8 38.4 32.5 29.1 1035.43
32.3 32.7 35.0 20.6 28.6 50.8 26.7 37.7 35.6  813.57
35.7 25.0 39.3 42.5 20.1 37.4 60.7 13.0 26.4  864.70
35.2 19.0 45.8 32.5 27.0 40.5 20.5 42.5 37.0  724.58
37.8 21.3 40.9 44.2 19.1 36.7 52.0 21.2 26.8  658.64
30.4 28.7 40.9 30.2 32.0 37.8 11.1 45.1 43.8  851.46
40.3 16.1 43.6 34.9 20.8 44.2 5.4 44.0 50.6   557.99
27.0 28.2 44.8 37.9 30.3 31.8 8.9 57.8 32.8  1077.38
32.8 18.0 49.2 23.2 26.3 50.5 33.2 26.8 40.0  502.18
26.2 26.1 47.7 29.5 34.9 35.6 13.2 34.8 52.0  556.71
 ; 
 proc reg data=soilmix;
  model y=sand1 silt1 clay1 sand2 silt2 clay2
   sand3 silt3 clay3 / r collin influence ;
 run;
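
/*----------------------------------------------------------------
   A minimal sketch, not part of the original assignment, for checking
   two statements made in the header: that the percentages add to
   about 100 at each depth, and that the percentages at different
   depths may be highly correlated.  The data set name SUMCHECK and
   the variables TOTAL1-TOTAL3 are hypothetical names introduced here.
 ----------------------------------------------------------------*/
 data sumcheck;
   set soilmix;
   total1 = sand1 + silt1 + clay1;   * depth 1 total, expected near 100;
   total2 = sand2 + silt2 + clay2;   * depth 2 total;
   total3 = sand3 + silt3 + clay3;   * depth 3 total;
 run;

 proc means data=sumcheck n min mean max;
   var total1 total2 total3;
 run;

 proc corr data=soilmix;
   var sand1 sand2 sand3 silt1 silt2 silt3 clay1 clay2 clay3;
 run;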

/*----------------------------------------------------------------
   1.  Since we expected a collinearity problem, start by inspecting
   the collinearity diagnostics.  Is there a severe collinearity
   problem (a condition index exceeding 100)?
   If the last principal component is eliminated from consideration, is
   there a collinearity problem stemming from the next to last principal
   component dimension?  How many dimensions of the X-space show a severe
   collinearity problem?

   Is any coefficient in this model statistically significant at the
   5% level?  Contrast this to what is being tested by the overall model
   F test.

   From inspection of the variance decomposition proportions, identify the
   variables (columns of X) that are primarily responsible for the most
   severe singularity (last principal component dimension).

   Make note of the estimates of the regression coefficients and their
   standard errors.  Compare the estimates to the known true values of
   the regression coefficients given above.  The differences are very
   large.  Are any of these differences STATISTICALLY significant?  Should
   they be?

   2.  Identify which observation has the greatest potential influence on
   the regression results, i.e. the highest leverage according to the
   hat-diagonal statistic.

   Which observation has the greatest impact on the vector of estimates of
   the regression coefficients, according to Cook's D?  Which has the
   greatest impact on its own predicted value, according to DFFITS?  Are
   there any observations that seem to have major influence in several
   respects, i.e., as measured by several different influence measures?
   Is the observation with the largest potential influence (hat diagonal)
   actually influential by any of the measures?

   3.  Inspect the RSTUDENT residuals.  Are any residuals large enough to
   suggest an outlier in the residuals?  Is the observation with the largest
   RSTUDENT residual one of the observations that have major influence
   according to Cook's D?

   4.  Insert the appropriate program steps to output the predicted values
   and the RSTUDENT residuals, and then plot these RSTUDENT residuals
   against Yhat.  Does the pattern of the residuals suggest any problem?
   Do you have any suggested solutions?

   Now, in order to see how the measures of collinearity can change, rerun
   the analysis with all clay variables deleted from the model.  Note that
   if we assume the components sand, silt, and clay add to 100% at each depth,
   dropping "clay" from the model redefines each of the sand and silt
   coefficients.  That is, you are in effect replacing "clay" with
   "100 - sand - silt".  Make this substitution for each depth in the true
   model and note how the regression coefficients for intercept, sand, and
   silt are changed.
--------------------------------------------------------------------------*/
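
/*----------------------------------------------------------------
   A minimal sketch of the program steps that questions 2-4 call for;
   one possible approach, not the required one.  The OUTPUT statement
   saves the predicted values, RSTUDENT residuals, hat diagonals,
   Cook's D, and DFFITS.  The data set name DIAG and the variable
   names YHAT, RSTUD, LEVERAGE, COOKSD, and DFFIT are hypothetical
   names introduced here.  PROC SGPLOT assumes SAS 9.2 or later; with
   an older release, PROC PLOT with PLOT RSTUD*YHAT could be used.
 ----------------------------------------------------------------*/
 proc reg data=soilmix noprint;
   model y=sand1 silt1 clay1 sand2 silt2 clay2
     sand3 silt3 clay3;
   output out=diag p=yhat rstudent=rstud h=leverage
          cookd=cooksd dffits=dffit;
 run;
 quit;

 proc print data=diag;
   var y yhat rstud leverage cooksd dffit;
 run;

 proc sgplot data=diag;
   scatter x=yhat y=rstud;      * RSTUDENT residuals against Yhat;
   refline 0 / axis=y;          * horizontal reference line at zero;
 run;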

   proc reg data=soilmix;
   model  y=sand1 silt1 sand2 silt2 sand3 silt3 / collin;
   run;
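
/*------------------------------------------------------------------------
   A hedged numerical check of the substitution described before this
   second PROC REG.  It ASSUMES the true index has no intercept and sums
   the three components over the three depths; the header gives only the
   coefficients 5, 10, and -5, so that form is an assumption.  Replacing
   clay with (100 - sand - silt) at one depth gives
        5 sand + 10 silt - 5 (100 - sand - silt)
           = -500 + 10 sand + 15 silt
   so over three depths the implied intercept would be -1500.  DIFF is
   zero only where the percentages add to exactly 100 at every depth.
   TRUECHECK, FULL, REDUCED, and DIFF are hypothetical names.
 ------------------------------------------------------------------------*/
   data truecheck;
     set soilmix;
     full    = 5*(sand1+sand2+sand3) + 10*(silt1+silt2+silt3)
               - 5*(clay1+clay2+clay3);
     reduced = -1500 + 10*(sand1+sand2+sand3) + 15*(silt1+silt2+silt3);
     diff    = full - reduced;
   run;

   proc print data=truecheck;
     var y full reduced diff;
   run;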

/*------------------------------------------------------------------------
   5.  Compare the collinearity diagnostics from this analysis to those
   from the first analysis.  Pay particular attention to the stability of
   the regression coefficients.  Does any severe collinearity problem
   remain?

   6.  Test the residuals for normality using PROC UNIVARIATE.  What is
   your conclusion? For residuals named R this would do the job:
        PROC UNIVARIATE NORMAL PLOT; VAR R;
 ----------------------------------------------------------------------*/
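
/*------------------------------------------------------------------------
   A minimal sketch for question 6.  It ASSUMES the residuals of interest
   come from the reduced model (clay variables deleted); swap in the full
   MODEL statement if the first analysis is intended.  The data set name
   RESID6 is a hypothetical name; R is the residual name used in the
   question.
 ------------------------------------------------------------------------*/
   proc reg data=soilmix noprint;
     model y=sand1 silt1 sand2 silt2 sand3 silt3;
     output out=resid6 r=r;      * ordinary residuals saved as R;
   run;
   quit;

   proc univariate data=resid6 normal plot;
     var r;
   run;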