validity. An instrument can be reliable but not valid. If you weighed yourself
10 times in a row on your bathroom scale this morning, you would expect
the scale to show the same weight each time. You would conclude that the
scale is reliable. But even if you are anxious about your weight, you cannot
conclude that the scale measured your level of anxiety. The scale is a valid
instrument for measuring your weight; it is not a valid instrument for
measuring your anxiety, even though it was shown to be reliable. Estimates of
reliability are usually presented in the form of a correlation coefficient.
Although correlations range from −1.00 to +1.00, reliability coefficients are
typically reported as values between 0.00 and +1.00 because a reliability
estimate reflects the degree of consistency between measurements, and negative
values have no meaningful interpretation as reliability. Reliability
coefficients of 0.80 and above are acceptable for well-established instruments,
whereas coefficients of 0.70 and above are acceptable for newly developed
instruments (Griffin-Sobel, 2003).
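If you analyze your own data with statistical software, these benchmarks can
be applied directly. The brief sketch below is illustrative only; the function
name, argument, and example coefficient are invented for this example and are
not part of the chapter.

```python
def acceptable_reliability(coefficient, well_established=True):
    """Apply the benchmarks cited above (Griffin-Sobel, 2003):
    0.80 or higher for well-established instruments,
    0.70 or higher for newly developed instruments."""
    threshold = 0.80 if well_established else 0.70
    return coefficient >= threshold

# A hypothetical new instrument with a reliability coefficient of 0.74
print(acceptable_reliability(0.74, well_established=False))  # True
print(acceptable_reliability(0.74, well_established=True))   # False
```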
When testing instruments for reliability, researchers are interested in three
attributes: stability, equivalence, and internal consistency. Instruments are
stable when the same scores are obtained with repeated measures under the
same circumstances (as in the bathroom scale example). An instrument is
said to be equivalent when there is agreement between alternate forms or
alternate raters. Internal consistency, also known as homogeneity, exists when
all items on a questionnaire measure the same concept. Seven methods are
commonly used to test instruments for reliability: test-retest reliability,
parallel or alternate form testing, interrater reliability, split-half
reliability, item-to-total correlation, the Kuder-Richardson coefficient, and
Cronbach's alpha (Table 10-3).
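Internal consistency is most often estimated with Cronbach's alpha, the last
method in the list above. The sketch below applies the standard alpha formula
(k divided by k − 1, multiplied by 1 minus the sum of the item variances
divided by the variance of the total scores); the respondent data are invented
for illustration and do not come from the chapter.

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha for a respondents-by-items array of scores.
    alpha = k/(k-1) * (1 - sum of item variances / variance of total scores)"""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                              # number of items
    item_variances = scores.var(axis=0, ddof=1)      # variance of each item
    total_variance = scores.sum(axis=1).var(ddof=1)  # variance of summed scores
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Invented responses: 5 respondents answering 4 items intended to measure
# the same concept (higher alpha indicates greater internal consistency)
responses = [[3, 4, 3, 4],
             [2, 2, 3, 2],
             [4, 5, 4, 5],
             [3, 3, 3, 3],
             [5, 4, 5, 4]]
print(f"Cronbach's alpha = {cronbach_alpha(responses):.2f}")
```

The resulting value can then be judged against the 0.70 and 0.80 benchmarks
described earlier.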
Test-retest reliability determines stability by administering the instrument
to the same subjects under the same conditions at two different times. Scores
are used to calculate a Pearson r, a type of correlation coefficient. Parallel or
alternate form testing is used to test for both stability and equivalence. Research-
ers create parallel forms by altering the wording or layout of items. Because the
forms are similar, researchers expect high positive correlations. For example,
Beyer et al. (1992) compared a pocket-sized Oucher with a poster-sized Oucher
and obtained similar pain ratings.
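Because test-retest reliability is estimated by correlating scores from two
administrations of the same instrument, the calculation is straightforward
with statistical software. The scores below are invented for illustration.

```python
import numpy as np

# Invented scores from the same six subjects at two administrations
time_1 = np.array([12, 15, 9, 20, 17, 14])
time_2 = np.array([13, 14, 10, 19, 18, 13])

# Pearson r between the two administrations estimates test-retest reliability
r = np.corrcoef(time_1, time_2)[0, 1]
print(f"Test-retest reliability (Pearson r) = {r:.2f}")
```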
Interrater reliability also tests for equivalence. This method is used when
data are collected through observation. A common way to determine interrater
reliability is to have two observers independently score the same event. Their
ratings are then compared, and if the ratings are similar, the instrument is
considered to have strong interrater reliability.
Another way to establish interrater reliability is to have one individual make
multiple observations over time.
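A simple way to summarize agreement between two observers is the proportion of
events they score identically, as sketched below with invented ratings.
Chance-corrected indexes such as Cohen's kappa are also widely used, although
the chapter does not prescribe a particular statistic.

```python
# Invented ratings of the same 10 observed events by two raters
# (1 = behavior observed, 0 = behavior not observed)
rater_a = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]
rater_b = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1]

# Percent agreement: proportion of events both raters scored the same way
agreements = sum(a == b for a, b in zip(rater_a, rater_b))
print(f"Percent agreement = {agreements / len(rater_a):.0%}")  # 80%
```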
KEY TERMS

known group testing: A test for construct validity in which new instruments
are administered to individuals known to be high or low on the characteristic
being measured

factor analysis: A test for construct validity that is a statistical approach
to identify items that group together

reliability: Obtaining consistent measurements over time
correlation coefficient: A statistic, ranging from −1.00 to +1.00, used to
describe the relationship between two variables; when used as an estimate of
reliability, values typically fall between 0.00 and +1.00
stability: An attribute of reliability in which instruments render the same
scores with repeated measures under the same circumstances

equivalence: An attribute of reliability in which there is agreement between
alternate forms of an instrument or alternate raters