Here SAS is saying that the data file ‘child1.day’ does not exist on a:. When entering the
programme the name of the data file was mistyped as ‘child1.day’ instead of ‘child1.dat’.
This led to a critical error and no output was produced. This, however, is not always the
case. Sometimes an error is not critical and output is produced, but it is unlikely to be
correct or to be what you intended.
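For reference, a minimal sketch of the corrected DATA step is given below; the variable
names on the INPUT statement are illustrative only (CASENO is an assumed case identifier)
and would need to match the actual layout of CHILD1.DAT.

   data child1;
     infile 'a:child1.dat';       /* file name typed correctly this time */
     input caseno ageyrs raven;   /* illustrative variable list only */
   run;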
Once the log file has been checked, the output should be consulted; see, for example,
Figure 3.5. Lines 1–5 simply describe what procedure was used and which data sets were
compared. Line 9 identifies a mismatch for observation 4; in this example this coincides
with case number 0004 in both data sets. In lines 10–11 the mismatch is identified for the
variable AGEYRS, and the values in the base data set (child1) and the compare data set
(child2) are given. The absolute and percentage differences are also given. A similar
description is given for the variable RAVEN in lines 14–15. Finally, a summary of all the
mismatches is given in line 17.
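Output of this kind is produced by the SAS procedure PROC COMPARE. A minimal sketch,
assuming the two ASCII files have already been read into SAS data sets named child1 and
child2, is:

   /* compare the two data sets observation by observation */
   proc compare base=child1 compare=child2;
   run;

The BASE= and COMPARE= options name the data sets that appear as the base and compare
data sets in the output.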
In this example the researcher would need to check observation 4 (here case number 4) in
both ASCII data sets to establish which, if any, of the data sets has the correct value
for this variable. The original raw data would need to be consulted to identify the
source of the error, that is, whether the wrong value was entered at the first data entry
stage when CHILD1.DAT was created or at the second data entry stage when CHILD2.DAT was created.
The same procedure would be followed for case 7, this time examining the variable
RAVEN.
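The queried cases can be listed from each SAS data set before the raw data are consulted;
in the sketch below CASENO is an assumed name for the case identifier.

   /* list the queried cases from both data sets */
   proc print data=child1;
     where caseno in (4, 7);
     var caseno ageyrs raven;
   run;
   proc print data=child2;
     where caseno in (4, 7);
     var caseno ageyrs raven;
   run;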
3.2 Data Cleaning
After data verification and editing of any data input errors, the next step is scrutiny of the
data structure and checking of any out of range or spurious data. To do this a listing of
the data and a simple frequency count for all variables are required.
Data Listing and Frequency Count
A data listing is simply a print out of the raw data held in the ASCII data file which has
been read into a SAS data file in a DATA step. A data listing is produced by the SAS
procedure PROC PRINT. Scrutiny of the listing enables you to be certain that the data
you are about to analyse is what you think it is.
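A minimal sketch, assuming the data have been read into a SAS data set named child1, is:

   /* print the complete data set for checking against the raw data */
   proc print data=child1;
     title 'Data listing for CHILD1';
   run;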
A frequency count of all variables should be checked against the coding sheet for any
odd or out of range values. This can be done using PROC SUMMARY, as shown in the sketch
after this list. Three simple checks will suffice:
1 Check the number of observations. Sometimes data is typed in twice or a datum point
may be omitted.
2 Check whether the maximum and minimum values are what you might expect. For
example, consulting the coding sheet in Figure 3.1, you would not expect a score of 8
on the RAVEN variable because the valid score range is 1–7 and missing data is coded
(.). This procedure is called checking out of range values and will identify impossible
values.
3 Check for cases which have missing values for variables. Missing values usually have
their own special value indicator such as a period (.) for numeric values and a blank
for character values.
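All three checks can be obtained from a single PROC SUMMARY step; a minimal sketch,
assuming a SAS data set named child1 containing the numeric variables AGEYRS and RAVEN, is:

   /* number of observations, missing values, minimum and maximum for each variable */
   proc summary data=child1 print n nmiss min max;
     var ageyrs raven;
   run;

The N column gives the number of non-missing observations for each variable, MIN and MAX
reveal any out of range values, and NMISS shows how many cases have missing values.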