Statistical Analysis for Education and Psychology Researchers


all levels of the suspect variable against other variables of interest. Initially it may be
more informative to examine the presence/absence of missing data than to be concerned
with the amount.
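As a rough illustration of examining the presence or absence of missing data against another variable, the sketch below computes the proportion of missing values within each level of a grouping variable. The data, the variable names (school type, test score) and the function `missing_rate_by_group` are hypothetical and only serve the example; markedly different rates across groups would suggest that the missing data are not random.

```python
from collections import defaultdict

# Hypothetical records: (school_type, test_score); None marks a missing score.
records = [
    ("state", 54), ("state", None), ("state", 61), ("state", None),
    ("private", 70), ("private", 66), ("private", 73), ("private", None),
]

def missing_rate_by_group(rows):
    """Return the proportion of missing values within each group."""
    counts = defaultdict(lambda: [0, 0])  # group -> [n_missing, n_total]
    for group, value in rows:
        counts[group][0] += value is None  # True counts as 1
        counts[group][1] += 1
    return {g: n_missing / n_total for g, (n_missing, n_total) in counts.items()}

rates = missing_rate_by_group(records)
for group, rate in rates.items():
    print(f"{group}: {rate:.2f} of scores missing")
```

In these invented data half of the state-school scores are missing against a quarter of the private-school scores; with real data a formal test of the difference would be needed before drawing any conclusion.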
If the missing data do appear to be non-random, then the cases with missing data
should be retained for further investigation. If the missing data seem to be random, two
general options exist: either estimate the missing values, or delete the cases or particular
variables that have missing data (an alternative to deleting a case is simply to drop the
variable with the missing value from a particular analysis).
How do you decide which of these two strategies to adopt?
The most radical procedure is to drop any cases with missing data. This is the default
option in many statistical programmes. If missing data are scattered at random throughout
cases and variables, dropping every case with any missing data may result in the
loss of a substantial amount of data. The consequence of losing cases is more serious in
some research designs, for example, balanced experimental designs with small numbers
of subjects, than in large survey designs where a margin for data loss is designed into the
sampling strategy. Where the loss of cases would be serious, it may be preferable to
estimate missing values, provided it makes sense to do so.
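To see how quickly case deletion can erode a data set, consider a minimal simulation. It assumes each value is missing independently with the same probability, which is an assumption made for the example and not a claim about any particular data set. Under that assumption the expected proportion of complete cases is (1 - p) raised to the power of the number of variables. The function name and the figures used (10 variables, 5 per cent missing) are hypothetical.

```python
import random

def complete_case_fraction(n_cases, n_vars, p_missing, seed=0):
    """Simulate random missingness and return the share of fully complete cases."""
    rng = random.Random(seed)
    complete = 0
    for _ in range(n_cases):
        # A case survives case deletion only if none of its values is missing.
        has_missing = any(rng.random() < p_missing for _ in range(n_vars))
        if not has_missing:
            complete += 1
    return complete / n_cases

# Assumed scenario: 10 variables, each value missing with probability 0.05.
expected = (1 - 0.05) ** 10
observed = complete_case_fraction(10_000, 10, 0.05)
print(f"expected complete cases: {expected:.3f}, simulated: {observed:.3f}")
```

With only 5 per cent of values missing on each of 10 variables, roughly 40 per cent of cases would be dropped, which is why case deletion is best reserved for situations where few cases are affected.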
Deleting cases is advised when only a few cases have missing data. Dropping
variables but retaining cases is an alternative but is generally only suitable when the
variable is not critical to the analysis.
Another alternative to deleting cases or dropping variables is to substitute missing
values with ‘best estimates’. In general there are five options, ranging in degree of
sophistication. These are to substitute a missing value with:


1 a best guess;
2 the overall mean for that variable;
3 a relevant group mean;
4 a regression equation based on complete data to predict missing values;
5 a generalized approach based on the likelihood function.
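Options 3 and 4 can be sketched in a few lines. The code below is only an illustration under simple assumptions: one grouping variable for the group mean, and a single predictor fitted by ordinary least squares for the regression estimate. The function names and the small data set (hours of study, test score) are hypothetical.

```python
from statistics import mean

def group_mean_impute(groups, values):
    """Option 3: replace each missing value with the mean of its own group."""
    observed = {}
    for g, v in zip(groups, values):
        if v is not None:
            observed.setdefault(g, []).append(v)
    group_means = {g: mean(vs) for g, vs in observed.items()}
    return [v if v is not None else group_means[g] for g, v in zip(groups, values)]

def regression_impute(predictor, target):
    """Option 4: predict missing target values from a line fitted to complete cases."""
    complete = [(x, y) for x, y in zip(predictor, target) if y is not None]
    xs = [x for x, _ in complete]
    ys = [y for _, y in complete]
    mx, my = mean(xs), mean(ys)
    slope = sum((x - mx) * (y - my) for x, y in complete) / sum((x - mx) ** 2 for x in xs)
    intercept = my - slope * mx
    return [y if y is not None else intercept + slope * x
            for x, y in zip(predictor, target)]

# Hypothetical data: hours of study and test score, one score missing.
hours = [1, 2, 3, 4, 5]
score = [12, 14, None, 18, 20]
group = ["a", "a", "a", "b", "b"]

by_group = group_mean_impute(group, score)
by_regression = regression_impute(hours, score)
print(by_group[2], by_regression[2])
```

The two estimates differ because the group mean ignores the predictor, whereas the regression estimate uses the relationship between hours and score in the complete cases.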


Advice on using each of these options is:


1 Do not use at all.
2 and 3 Do not use with binary data. For example, if the variable sex was coded 0 for
female and 1 for male, the overall mean on that variable is a proportion (the
proportion of males), and it would not make sense to substitute a proportion for a
code that can only be 0 or 1. Using the overall mean for a variable
reduces the variability (variance) of that variable, especially if there is a large amount
of missing data. This is because each substituted value lies exactly at the mean, whereas
the true missing value would usually have lain some distance from it (unless the
missing value happened to be the same as the overall mean). A
reduction in variability of a variable has the effect of reducing the correlation that
variable has with other variables (see Chapter 8). The net effect of many missing data
substitutions would be to reduce any underlying correlation between variables. This
could have a dramatic effect in some statistical procedures such as factor analysis.
4 Is only useful when other variables in the data set are likely to predict the variable(s)
with missing values, which are treated as the dependent variable(s). If there are no
suitable independent variables (predictors) then use of option 2 or 3 is probably best.
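The effect of mean substitution on variance and correlation, described under options 2 and 3, can be checked with a small simulation. This is a sketch under assumed conditions (normally distributed scores, about 30 per cent of values deleted completely at random). All names and figures are hypothetical, and the correlation after substitution is compared with the correlation among the complete cases.

```python
import random
from statistics import mean, variance

def pearson_r(xs, ys):
    """Pearson correlation computed from sums of squares and cross-products."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(1)
n = 200
x_true = [rng.gauss(50, 10) for _ in range(n)]
y = [0.8 * x + rng.gauss(0, 6) for x in x_true]

# Delete about 30% of the x values completely at random (None marks missing).
x_obs = [x if rng.random() > 0.3 else None for x in x_true]

# Option 2: substitute the overall mean of the observed values.
observed_x = [x for x in x_obs if x is not None]
x_mean = mean(observed_x)
x_imputed = [x if x is not None else x_mean for x in x_obs]

var_observed = variance(observed_x)
var_imputed = variance(x_imputed)

# Correlation among complete cases versus correlation after mean substitution.
pairs = [(x, yy) for x, yy in zip(x_obs, y) if x is not None]
r_complete = pearson_r([p[0] for p in pairs], [p[1] for p in pairs])
r_imputed = pearson_r(x_imputed, y)

print(f"variance: observed {var_observed:.1f}, after substitution {var_imputed:.1f}")
print(f"correlation: complete cases {r_complete:.3f}, after substitution {r_imputed:.3f}")
```

Because every substituted value adds nothing to the sum of squared deviations while still increasing the number of cases, the variance after substitution is smaller than the variance of the observed values, and the correlation with y is pulled towards zero.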

