Statistical Analysis for Education and Psychology Researchers

and a clear description of how to measure performance and use tests, including advice on test construction and interpretation, is given by Morris, Fitz-Gibbon and Lindheim (1987).


Interpretation of Validity and Reliability

Arguably, the most important aspect of a test or scale is its validity (see Chapter 1). Here the earlier description of validity, the extent to which a test measures what it is supposed to measure, is elaborated to include the idea of justification of inference. For example, it might not be known precisely what a test or scale is measuring, but there might be independent evidence that, when the test is used for selection or prediction, the test scores are related to a criterion of interest. It might be, for example, that a potential employer asks job applicants to complete a computer programming aptitude test knowing that high scores on the test are related to job success. Here, the validity of the inference (generalization) justifies the appropriateness of the test. It was noted in Chapter 1 that construct validity encompasses other forms of validity. These other forms of validity (concurrent, predictive and content validity) are described briefly below.
Content validity is a descriptive indication (not given as a statistic) of the extent to
which the content of the test covers all aspects of the attribute or trait of interest.
Predictive validity is usually measured by a correlation coefficient such as the Pearson correlation coefficient, r, which has possible values ranging from −1 through 0 to +1. Higher correlation coefficients indicate better validity. The predictive validity of a test is the extent to which the test score predicts some subsequent criterion variable of interest. For example, the predictive validity of A-level examinations is the extent to which A-level scores achieved by candidates predict the same candidates’ subsequent degree performance. In this example, A-level score is the predictor variable and degree performance is the criterion variable. As it happens, on average, the predictive validity of A-levels is poor, with r being about 0.3 (Peers and Johnston, 1994). Some authors, for example Kline (1990), claim that any correlation greater than 0.3 is an acceptable coefficient for the predictive validity of a test. This seems rather low: the proportion of variation in the criterion accounted for by a predictor is r², so a validity coefficient of 0.3 accounts for only about 10 per cent (0.3² = 0.09) of variation in the criterion variable. It is suggested that the minimally acceptable coefficient should be nearer 0.4. An obvious question to consider is whether a test that accounts for only a small proportion of variation in the criterion is of any use as a predictive test. Again, this judgment needs to be made by the researcher.
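As a concrete illustration, the following minimal Python sketch (using invented A-level and degree scores, purely for illustration) computes a predictive validity coefficient and the proportion of criterion variance it accounts for:

import numpy as np

# Invented (hypothetical) data: each pair is one candidate's A-level
# points (predictor) and subsequent degree mark (criterion).
a_level_points = np.array([12, 18, 24, 20, 30, 16, 22, 26, 14, 28])
degree_marks = np.array([52, 58, 60, 55, 68, 50, 62, 59, 56, 64])

# Pearson correlation coefficient r between predictor and criterion
r = np.corrcoef(a_level_points, degree_marks)[0, 1]

# r squared is the proportion of criterion variance accounted for,
# e.g., r = 0.3 implies about 9 per cent
print(f"predictive validity r = {r:.2f}")
print(f"variance accounted for = {r**2:.1%}")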
Concurrent validity is similar to predictive validity, but both the predictor and criterion tests are taken at the same time. If an end-of-course mathematics achievement test were constructed by a teacher and given to a group of students, and a standardized numeric ability test were administered at the same time, the correlation between the two sets of test scores would provide a measure of the concurrent validity of the mathematics achievement test. It is suggested that correlations of at least 0.5 are required for acceptable concurrent validity.
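A concurrent validity coefficient is computed in exactly the same way; the sketch below (again with invented scores, purely for illustration) adds a check against the 0.5 rule of thumb suggested above:

import numpy as np

# Invented (hypothetical) data: each pair is one student's mark on the
# teacher's mathematics test and on the standardized numeric ability
# test, taken at the same time.
maths_test = np.array([45, 62, 70, 55, 80, 48, 66, 74, 58, 52])
numeric_test = np.array([50, 60, 72, 58, 78, 46, 63, 70, 61, 55])

r = np.corrcoef(maths_test, numeric_test)[0, 1]

# Apply the 0.5 rule of thumb for acceptable concurrent validity
verdict = "acceptable" if r >= 0.5 else "questionable"
print(f"concurrent validity r = {r:.2f} ({verdict})")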
Construct validity embodies both content and concurrent or predictive validity. It
represents all the available evidence on the trustworthiness of a test. Strictly, validity and
trustworthiness refer to the inferences drawn from a test rather than to a property of the
test itself. Whenever validity coefficients are given, sample sizes on which the validity

