/*******************************************************************

  EXAMPLE 1, CHAPTER 4

  Using SAS to obtain sample mean vectors, sample covariance
  matrices, and sample correlation matrices.

*******************************************************************/

options ls=80 ps=59 nodate; run;

data dent1;
  infile "dental.txt";
  input obsno child age distance gender;
run;

proc print data=dent1; run;

/*******************************************************************
  TRANSFORM DATA IN APPROPRIATE FORMAT
*******************************************************************/

/******************************************************************
  STEP1

  We recode AGE from 8, 10, 12, and 14 to 1, 2, 3, and 4, drop OBSNO,
  and sort the data by GENDER and CHILD. This prepares DENT1 for
  STEP2, where the observations at ages 8, 10, 12, and 14 are placed
  in variables AGE1, AGE2, AGE3, and AGE4, respectively, of the new
  data set DENT2. In STEP2 we also use PROC PRINT to print out the
  first 5 records of DENT2 (so data for the first 5 children, all
  girls) using the OBS= data set option on DATA=.
******************************************************************/

data dent1; set dent1;
  if age=8 then age=1;
  if age=10 then age=2;
  if age=12 then age=3;
  if age=14 then age=4;
  drop obsno;
run;

proc sort data=dent1; by gender child; run;

/******************************************************************
  The DATA looks like

  Obs    child    age    distance    gender

    1      1       1       21.0        0
    2      1       2       20.0        0
    3      1       3       21.5        0
    4      1       4       23.0        0
    5      2       1       21.0        0
    6      2       2       21.5        0
    7      2       3       24.0        0
    8      2       4       25.5        0
******************************************************************/

proc print data=dent1; run;

/*******************************************************************
  TRANSFORM DATA IN APPROPRIATE FORMAT
*******************************************************************/

/******************************************************************
  STEP2

  We redefined the values of AGE in STEP1 so that we may use AGE as
  an "index" in creating the new data set DENT2.
  The DATA step that creates DENT2 demonstrates one way (using the
  notion of an ARRAY) to transform a data set in the form of one
  observation per record (the original form) into a data set in the
  form of one record per individual. The data must be sorted prior to
  this operation; the PROC SORT call above serves this purpose.
*******************************************************************/

data dent2(keep=age1-age4 gender child);
  array aa{4} age1-age4;
  /* read the records for one child, storing DISTANCE by age index */
  do age=1 to 4;
    set dent1;
    by gender child;
    aa{age}=distance;
    /* RETURN ends this iteration of the DATA step; with no explicit
       OUTPUT statement, one record per child is written */
    if last.child then return;
  end;
run;

proc print data=dent2; run;

title "TRANSFORMED DATA -- 1 RECORD/INDIVIDUAL";
proc print data=dent2(obs=5); run;

/************************************************
  Obs    age1    age2    age3    age4    child    gender

    1    21.0    20.0    21.5    23.0      1        0
    2    21.0    21.5    24.0    25.5      2        0
    3    20.5    24.0    24.5    26.0      3        0
    4    23.5    24.5    25.0    26.5      4        0
    5    21.5    23.0    22.5    23.5      5        0
************************************************/

/*******************************************************************
********************************************************************
  PROC CORR
********************************************************************
*******************************************************************/

/*******************************************************************
  Here, we use PROC CORR to obtain the sample means at each age (the
  means of the variables AGE1,...,AGE4 in DENT2) and to calculate the
  sample covariance matrix and corresponding sample correlation
  matrix separately for each group (girls and boys). The COV option
  in the PROC CORR statement asks for the sample covariance matrix to
  be printed; without it, only the sample correlation matrix would
  appear in the output.
*******************************************************************/

proc sort data=dent2; by gender; run;

title "SAMPLE COVARIANCE AND CORRELATION MATRICES BY GENDER";
proc corr data=dent2 cov;
  by gender;
  var age1 age2 age3 age4;
run;

/********************************

----------------------------------- gender=0 -----------------------------------

                              The CORR Procedure

     4  Variables:    age1     age2     age3     age4

                          Covariance Matrix, DF = 10

                 age1           age2           age3           age4

age1      4.513636364    3.354545455    4.331818182    4.356818182
age2      3.354545455    3.618181818    4.027272727    4.077272727
age3      4.331818182    4.027272727    5.590909091    5.465909091
age4      4.356818182    4.077272727    5.465909091    5.940909091

                               Simple Statistics

Variable     N        Mean     Std Dev         Sum     Minimum     Maximum

age1        11    21.18182     2.12453   233.00000    16.50000    24.50000
age2        11    22.22727     1.90215   244.50000    19.00000    25.00000
age3        11    23.09091     2.36451   254.00000    19.00000    28.00000
age4        11    24.09091     2.43740   265.00000    19.50000    28.00000

                   Pearson Correlation Coefficients, N = 11
                          Prob > |r| under H0: Rho=0

            age1        age2        age3        age4

age1     1.00000     0.83009     0.86231     0.84136
                      0.0016      0.0006      0.0012

age2     0.83009     1.00000     0.89542     0.87942
          0.0016                  0.0002      0.0004

age3     0.86231     0.89542     1.00000     0.94841
          0.0006      0.0002                  <.0001

age4     0.84136     0.87942     0.94841     1.00000
          0.0012      0.0004      <.0001

********************************/

/*******************************************************************
********************************************************************
  PROC MEANS
********************************************************************
*******************************************************************/

/*******************************************************************
  We now obtain the "centered" and "scaled" values that may be used
  for plotting scatterplot matrices such as that in Figure 3. Here,
  we call PROC MEANS to calculate the sample mean (MAGE1,...,MAGE4)
  and standard deviation (SDAGE1,...,SDAGE4) for each of the
  variables AGE1,...,AGE4 for each gender.
  These are output to the data set DENTSTATS, which has two records,
  one for each gender (see the output). We then MERGE this data set
  with DENT2 BY GENDER, which has the effect of matching up the
  appropriate gender mean and SD to each child. We print out the
  first three records of the resulting data set to illustrate.
*******************************************************************/

proc sort data=dent2; by gender child; run;

proc means data=dent2 mean std noprint;
  by gender;
  var age1 age2 age3 age4;
  output out=dentstats mean=mage1 mage2 mage3 mage4
                       std=sdage1 sdage2 sdage3 sdage4;
run;

/*******************************************************************
  We use the NOPRINT option with PROC MEANS to suppress printing of
  its output. Instead, we set a title and print the output data set
  DENTSTATS with PROC PRINT.
*******************************************************************/

title "SAMPLE MEANS AND SDS BY GENDER FROM PROC MEANS";
proc print data=dentstats; run;

/******************************************************************

               SAMPLE MEANS AND SDS BY GENDER FROM PROC MEANS               42

Obs gender _TYPE_ _FREQ_   mage1    mage2    mage3    mage4   sdage1   sdage2   sdage3   sdage4

 1     0      0     11   21.1818  22.2273  23.0909  24.0909  2.12453  1.90215  2.36451  2.43740
 2     1      0     16   22.8750  23.8125  25.7188  27.4688  2.45289  2.13600  2.65185  2.08542

******************************************************************/

/*******************************************************************
  The variables CSAGE1,...,CSAGE4 created in the next DATA step
  contain the centered/scaled values. These may be plotted against
  each other to obtain plots like Figure 3. We have not done this
  here to save space.
*******************************************************************/

data dentstats; merge dentstats dent2; by gender;
  csage1=(age1-mage1)/sdage1;
  csage2=(age2-mage2)/sdage2;
  csage3=(age3-mage3)/sdage3;
  csage4=(age4-mage4)/sdage4;
run;

title "INDIVIDUAL DATA MERGED WITH MEANS AND SDS BY GENDER";
proc print data=dentstats(obs=3); run;

/******************************************************************

            INDIVIDUAL DATA MERGED WITH MEANS AND SDS BY GENDER             10

Obs gender _TYPE_ _FREQ_   mage1    mage2    mage3    mage4   sdage1   sdage2   sdage3

 1     0      0     11   21.1818  22.2273  23.0909  24.0909  2.12453  1.90215  2.36451
 2     0      0     11   21.1818  22.2273  23.0909  24.0909  2.12453  1.90215  2.36451
 3     0      0     11   21.1818  22.2273  23.0909  24.0909  2.12453  1.90215  2.36451

Obs  sdage4   age1   age2   age3   age4  child    csage1     csage2     csage3     csage4

 1   2.43740  21.0   20.0   21.5   23.0     1    -0.08558   -1.17092   -0.67283   -0.44757
 2   2.43740  21.0   21.5   24.0   25.5     2    -0.08558   -0.38234    0.38447    0.57811
 3   2.43740  20.5   24.0   24.5   26.0     3    -0.32093    0.93196    0.59593    0.78325

*******************************************************************/

/*******************************************************************
********************************************************************
  PROC DISCRIM
********************************************************************
*******************************************************************/

/*******************************************************************
  One straightforward way to have SAS calculate the pooled sample
  covariance matrix and the corresponding estimated correlation
  matrix is to use PROC DISCRIM. This procedure is focused on
  so-called discriminant analysis, which is discussed in standard
  texts on general multivariate analysis. The data are treated as
  vectors; here, the elements of a data vector are denoted as
  AGE1,...,AGE4. We use PROC DISCRIM only for its facility to print
  out the pooled sample covariance matrix and correlation matrix
  "automatically" (the PCOV and PCORR options), and disregard other
  portions of the output.
*******************************************************************/

proc discrim pcov pcorr data=dent2;
  class gender;
  var age1 age2 age3 age4;
run;

/*****************************************************************

            INDIVIDUAL DATA MERGED WITH MEANS AND SDS BY GENDER             44

                            The DISCRIM Procedure

        Total Sample Size      27          DF Total               26
        Variables               4          DF Within Classes      25
        Classes                 2          DF Between Classes      1

                 Number of Observations Read          27
                 Number of Observations Used          27

            INDIVIDUAL DATA MERGED WITH MEANS AND SDS BY GENDER             45

                            The DISCRIM Procedure

              Pooled Within-Class Covariance Matrix,     DF = 25

Variable             age1              age2              age3              age4

age1          5.415454545       2.716818182       3.910227273       2.710227273
age2          2.716818182       4.184772727       2.927159091       3.317159091
age3          3.910227273       2.927159091       6.455738636       4.130738636
age4          2.710227273       3.317159091       4.130738636       4.985738636

          Pooled Within-Class Correlation Coefficients / Pr > |r|

Variable            age1          age2          age3          age4

age1             1.00000       0.57070       0.66132       0.52158
                                0.0023        0.0002        0.0063

age2             0.57070       1.00000       0.56317       0.72622
                  0.0023                      0.0027        <.0001

age3             0.66132       0.56317       1.00000       0.72810
                  0.0002        0.0027                      <.0001

age4             0.52158       0.72622       0.72810       1.00000
                  0.0063        <.0001        <.0001

******************************************************************/

/*******************************************************************
  Although it is a bit cumbersome, we may use some DATA step
  manipulations and PROC CORR to obtain the values of the
  autocorrelation function for each gender. We first drop variables
  no longer needed from the data set DENTSTATS. We then create three
  data sets, LAG1, LAG2, and LAG3, and describe LAG1 here; the other
  two are similar. We create two new variables, PAIR1 and PAIR2. For
  LAG1, PAIR1 and PAIR2 are the two values in (5.43) for u=1. As
  there are 4 ages, each child has 3 such pairs. The output of PROC
  PRINT for LAG1 shows this for the first 2 children. We then sort
  the data by gender and call PROC CORR to find the sample
  correlation between the two variables for each gender.
  The same principle is used to obtain the correlation by gender for
  lags 2 and 3 [u=2,3]. There are other, more sophisticated ways to
  obtain the values of the autocorrelation function; however, for
  longitudinal data sets where the number of time points is small,
  the "manual" approach we have demonstrated here is easy to
  implement and understand. PAIR1 versus PAIR2 may be plotted for
  each lag to obtain a visual presentation of the results as in
  Figure 6.
*******************************************************************/

data dentstats; set dentstats;
  drop age1-age4 mage1-mage4 sdage1-sdage4;
run;

data lag1; set dentstats; by child;
  pair1=csage1; pair2=csage2; output;
  pair1=csage2; pair2=csage3; output;
  pair1=csage3; pair2=csage4; output;
  if last.child then return;
  drop csage1-csage4;
run;

title "AUTOCORRELATION FUNCTION AT LAG 1";
proc print data=lag1(obs=6); run;

proc sort data=lag1; by gender; run;
proc corr data=lag1; by gender;
  var pair1 pair2;
run;

/****************************

                     AUTOCORRELATION FUNCTION AT LAG 1                      49

Obs    gender    _TYPE_    _FREQ_    child      pair1       pair2

 1        0         0        11        1      -0.08558    -1.17092
 2        0         0        11        1      -1.17092    -0.67283
 3        0         0        11        1      -0.67283    -0.44757
 4        0         0        11        2      -0.08558    -0.38234
 5        0         0        11        2      -0.38234     0.38447
 6        0         0        11        2       0.38447     0.57811

                     AUTOCORRELATION FUNCTION AT LAG 1                      50

----------------------------------- gender=0 -----------------------------------

                              The CORR Procedure

                       2  Variables:    pair1    pair2

                               Simple Statistics

Variable     N        Mean     Std Dev         Sum     Minimum     Maximum

pair1       33           0     0.96825           0    -2.20369     2.07616
pair2       33           0     0.96825           0    -1.88353     2.07616

                   Pearson Correlation Coefficients, N = 33
                          Prob > |r| under H0: Rho=0

                                 pair1         pair2

                   pair1       1.00000       0.89130
                                              <.0001

                   pair2       0.89130       1.00000
                                <.0001

                     AUTOCORRELATION FUNCTION AT LAG 1                      51

----------------------------------- gender=1 -----------------------------------

                              The CORR Procedure

                       2  Variables:    pair1    pair2

                               Simple Statistics

Variable     N        Mean     Std Dev         Sum     Minimum     Maximum

pair1       48           0     0.97849           0    -2.39513     1.99154
pair2       48           0     0.97849           0    -1.55080     1.99154

                   Pearson Correlation Coefficients, N = 48
                          Prob > |r| under H0: Rho=0

                                 pair1         pair2

                   pair1       1.00000       0.47022
                                              0.0007

                   pair2       0.47022       1.00000
                                0.0007

***************************/

/**************************
  LAG2
***************************/

data lag2; set dentstats; by child;
  pair1=csage1; pair2=csage3; output;
  pair1=csage2; pair2=csage4; output;
  if last.child then return;
  drop csage1-csage4;
run;

title "AUTOCORRELATION FUNCTION AT LAG 2";
proc print data=lag2(obs=6); run;

proc sort data=lag2; by gender; run;
proc corr data=lag2; by gender;
  var pair1 pair2;
run;

/**************************
  LAG3
***************************/

data lag3; set dentstats; by child;
  pair1=csage1; pair2=csage4; output;
  if last.child then return;
  drop csage1-csage4;
run;

title "AUTOCORRELATION FUNCTION AT LAG 3";
proc print data=lag3(obs=6); run;

proc sort data=lag3; by gender; run;
proc corr data=lag3; by gender;
  var pair1 pair2;
run;
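
/*******************************************************************
  OPTIONAL SKETCH (not part of the original example)

  The comments above note that the centered/scaled values
  CSAGE1,...,CSAGE4 may be plotted against each other, and that
  PAIR1 may be plotted against PAIR2 for each lag, but no plotting
  code is given. A minimal sketch of one way to draw such plots
  follows. It ASSUMES a SAS release that provides the ODS Graphics
  procedures PROC SGSCATTER and PROC SGPLOT (SAS 9.2 or later);
  neither procedure is used elsewhere in this program, and the
  resulting plots are not claimed to match Figures 3 and 6 exactly.
  Both input data sets are already sorted by GENDER at this point,
  so BY GENDER can be used directly. The titles are illustrative.
*******************************************************************/

title "CENTERED/SCALED DISTANCES -- SCATTERPLOT MATRIX BY GENDER";
proc sgscatter data=dentstats;
  by gender;
  matrix csage1 csage2 csage3 csage4;
run;

title "LAG 1 PAIRS -- PAIR1 VERSUS PAIR2 BY GENDER";
proc sgplot data=lag1;
  by gender;
  scatter x=pair1 y=pair2;
run;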