ST 590G -- Computation for Data Analysis
Third Assignment -- due Thursday, 25 October 2011

Determining whether a regression model is correct is difficult unless the sample includes replication -- that is, observations with the same design points. Here we are looking at another, somewhat ad hoc, approach for cases where there are no replicates. This approach is a variation on cross-validation, which has been successfully employed to choose tuning parameters when trading off the complexity of a model against its fit.

Your assignment is to write a macro to compute suitable five-fold cross-validation statistics for testing lack of fit. For each of NREP replications (NREP is a macro parameter), select a random subset of four-fifths of the observations and compute estimates of the regression parameters from it. Using those estimates, compute forecasts for the remaining one-fifth of the observations. Compare the forecasts with the observations by computing

    d(i) = (observed Y(i) - forecast Y(i)) / (standard error of forecast)

and then compute

    V = SUM d(i)^2 / (estimate of variance from regression model)

If we have the correct model, V should have a chi-square distribution with degrees of freedom equal to the number of observations in the withheld one-fifth of the data.

Have the macro produce only the results from a proc univariate analysis of the NREP values of V. Write appropriate titles for the output, for example, 'this should look like a chi-square distribution with 57 degrees of freedom.' Also, clean up as a part of the macro by deleting the datasets that you created. (Hmmm, you might want to be careful to use atypical dataset names.)

KISS -- so write the macro for just simple linear regression. (If you can easily extend it to an arbitrary regression model -- bonus!)

The arguments to your macro should be
  a) the name of the dataset
  b) the names of the two variables (Y and X)
  c) the number of replications.

Demonstrate your macro on the salary data in the October directory.

***Also, email your macro code to me as a flat file attachment.***
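
As a language-neutral illustration of the arithmetic only (this is not the SAS macro the assignment asks for), the resampling scheme described above can be sketched in Python for simple linear regression. The function names fit_simple_ols, lack_of_fit_v and simulate_v are hypothetical, and the demonstration data are synthetic. The sketch assumes that "standard error of forecast" means the factor sqrt(1 + 1/n + (x - xbar)^2 / Sxx) without the residual standard deviation, so that dividing by the variance estimate completes the standardization; if your forecast standard error already contains that factor, the final division would count it twice.

```python
import random


def fit_simple_ols(x, y):
    """Least-squares fit of y = b0 + b1*x.

    Returns (b0, b1, s2, xbar, sxx, n), where s2 is the residual mean square.
    """
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    sse = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    s2 = sse / (n - 2)
    return b0, b1, s2, xbar, sxx, n


def lack_of_fit_v(x, y, rng):
    """One replication: fit on a random 4/5, forecast the withheld 1/5.

    Returns (V, m), where m is the number of withheld observations.
    """
    idx = list(range(len(x)))
    rng.shuffle(idx)
    m = len(x) // 5
    held, train = idx[:m], idx[m:]
    b0, b1, s2, xbar, sxx, nt = fit_simple_ols(
        [x[i] for i in train], [y[i] for i in train]
    )
    total = 0.0
    for i in held:
        # ASSUMPTION: forecast standard error taken without the sigma factor.
        se_unit = (1.0 + 1.0 / nt + (x[i] - xbar) ** 2 / sxx) ** 0.5
        d = (y[i] - (b0 + b1 * x[i])) / se_unit
        total += d * d
    return total / s2, m


def simulate_v(x, y, nrep, seed=0):
    """Return the NREP values of V from independent random splits."""
    rng = random.Random(seed)
    return [lack_of_fit_v(x, y, rng)[0] for _ in range(nrep)]


if __name__ == "__main__":
    # Synthetic data from a correctly specified straight-line model.
    data_rng = random.Random(1)
    x = [data_rng.uniform(0, 10) for _ in range(100)]
    y = [2.0 + 0.5 * xi + data_rng.gauss(0, 1) for xi in x]
    vs = simulate_v(x, y, nrep=200, seed=42)
    print("withheld per split:", len(x) // 5)
    print("mean of V over replications:", sum(vs) / len(vs))
```

With 100 observations each split withholds 20, so under the correct model the mean of V should land in the neighborhood of 20. The replications reuse the same data, so the NREP values are not independent draws, which is part of why the approach is described as ad hoc.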