Basic Statistics


If the investigator already has a printed copy of the data in a spreadsheet format, the
data can be scanned and then converted to text using optical character recognition
(OCR) software. Most OCR software will then let you export the results to Excel.
Copy and paste can also be used with some programs. Many of the major statistical
packages (Minitab, S-PLUS, SAS-JMP, SAS, SPSS, Stata) include the option for
spreadsheet entry along with other, less frequently used options.


3.3 SCREENING THE DATA


The next step is screening the data, which is done to enable the investigator to proceed
with confidence in performing statistical analyses. This can be as simple as scanning
the data set visually if it is not too large. One screening that should always be done is
to obtain the maximum and minimum value for each variable. For example, for the
smoking question, only 1, 2, or 3 are allowable values. If a 5 is entered, a mistake
has been made. If heights of adult males are listed, heights of less than 60 in. or more
than 84 in. should be questioned. The values this screening looks for are called
outliers. Outliers have been defined as observations that appear to be inconsistent
with the remainder of the data. Outliers can occur in several ways: for example, as
an error in taking the measurement, recording it, or entering it into the computer.
Sometimes, extreme biological, psychological, or environmental variation may result
in unusual values. An outlier can also reflect a sampling problem, in which measurements
are taken from, say, a patient who is not a member of the group that the investigator
intended to study.
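
The minimum and maximum check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the variable names smoking and height_in, the ALLOWED table, the function screen_ranges, and the four sample records are all hypothetical, while the limits themselves (codes 1 to 3 for smoking, 60 to 84 in. for adult male height) are the ones used in the text.

```python
# Allowable ranges per variable. The limits follow the text; the
# variable names and the layout of this table are assumptions.
ALLOWED = {
    "smoking": (1, 3),      # only codes 1, 2, or 3 are valid
    "height_in": (60, 84),  # adult male height in inches
}


def screen_ranges(records, allowed):
    """Return the (min, max) of each variable and a list of flagged values.

    Each flagged item is (row number, variable name, value), with rows
    counted from 1 so they match a printed listing of the data.
    """
    summary = {}
    flagged = []
    for name, (low, high) in allowed.items():
        values = [record[name] for record in records]
        summary[name] = (min(values), max(values))
        for row, value in enumerate(values, start=1):
            if not low <= value <= high:
                flagged.append((row, name, value))
    return summary, flagged


# Made-up records for illustration only.
records = [
    {"smoking": 1, "height_in": 70},
    {"smoking": 5, "height_in": 68},  # 5 is not an allowable code
    {"smoking": 3, "height_in": 58},  # shorter than 60 in.
    {"smoking": 2, "height_in": 74},
]

summary, flagged = screen_ranges(records, ALLOWED)
print(summary)
print(flagged)
```

A flagged value is only a candidate for checking against the original records. The sketch does not decide whether the value is an error or a genuine extreme.
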
Note that the removal of outliers is no guarantee that all incorrect observations
have been identified and removed. If measurement error lowers the recorded height of a
tall person, the result could be a height that is in the normal range and so is not
detected as an error.
It is also useful to make sure that the data have been entered into the correct column.
Additional screening procedures will be given in subsequent chapters. In data sets
where there are concerns about the correctness of the data entry from records, the
numerical data can be entered twice by different persons and then the results can
be compared by subtracting the results from one person from those from the second
person and seeing if only zeros are obtained.
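
As a rough sketch of the double-entry comparison just described, the Python code below subtracts the first person's entries from the second person's, cell by cell, and reports every position where the difference is not zero. The function name compare_double_entry and the two small data sets are hypothetical, and the sketch assumes both entries hold numeric data with rows and columns in the same order.

```python
def compare_double_entry(first, second):
    """Return (row, column, difference) for every cell that disagrees.

    The difference is second minus first. An empty list means that
    only zeros were obtained. Rows and columns are counted from 1.
    """
    if len(first) != len(second):
        raise ValueError("the two entries have different numbers of rows")
    mismatches = []
    for r, (row_a, row_b) in enumerate(zip(first, second), start=1):
        if len(row_a) != len(row_b):
            raise ValueError(f"row {r} has different numbers of columns")
        for c, (a, b) in enumerate(zip(row_a, row_b), start=1):
            difference = b - a
            if difference != 0:
                mismatches.append((r, c, difference))
    return mismatches


# Made-up data: the second person transposed 68 as 86 in row 2.
entry_a = [[1, 70, 154], [2, 68, 171], [3, 74, 190]]
entry_b = [[1, 70, 154], [2, 86, 171], [3, 74, 190]]

print(compare_double_entry(entry_a, entry_b))
```

A nonzero difference shows only that the two entries disagree. Which of the two is correct still has to be settled from the original records.
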
Additional screening procedures are given for various statistical analyses described
in this book. In general, graphic displays of the data are often the best way of
performing initial data screening, and simple graphic displays (given later in the
book) should always be considered. With available computer programs, they are easy
and quick to obtain.
After data entry and screening, the next recommended step is to make a protected
backup of your data on an external storage device such as a CD, DVD, or flash drive.
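
One possible way to script such a backup is sketched below in Python. The text does not say what "protected" should involve, so the two safeguards used here are assumptions: the copy is compared with the original by a SHA-256 checksum and is then marked read-only. The names sha256_of, protected_backup, and study_data.csv are hypothetical, and a temporary folder stands in for the external device.

```python
import hashlib
import os
import shutil
import stat
import tempfile


def sha256_of(path):
    """Return the SHA-256 checksum of a file as a hexadecimal string."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def protected_backup(source, backup_dir):
    """Copy source into backup_dir, verify the copy, and make it read-only."""
    os.makedirs(backup_dir, exist_ok=True)
    target = os.path.join(backup_dir, os.path.basename(source))
    shutil.copy2(source, target)  # copies the contents and the timestamps
    if sha256_of(source) != sha256_of(target):
        raise OSError("backup does not match the original file")
    os.chmod(target, stat.S_IREAD)  # owner may read but not write
    return target


# Demonstration in a temporary folder; in practice backup_dir would
# point at the external storage device.
with tempfile.TemporaryDirectory() as work:
    source = os.path.join(work, "study_data.csv")
    with open(source, "w") as handle:
        handle.write("id,smoking,height_in\n1,1,70\n")
    copy = protected_backup(source, os.path.join(work, "backup"))
    backup_matches = sha256_of(copy) == sha256_of(source)
    backup_mode = stat.S_IMODE(os.stat(copy).st_mode)

print(backup_matches)
```

Because the copy is read-only, running the function again for the same file will fail rather than silently overwrite the earlier backup. Writing each backup to a new, dated folder is one way around this.
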


3.4 CODEBOOK

For studies that include numerous variables and many possible users, it is also useful
to write a codebook so that everyone knows what data are available in the data set