exactly where you will find it. (For additional material on the intraclass correlation, go to
http://www.uvm.edu/~dhowell/StatPages/More_Stuff/icc/icc.html.)
Suppose that we are interested in measuring the reliability with which judges rate the
degree of prosocial behavior in young children. We might investigate this reliability by
having two or more judges each rate a behavior sample of a number of children, assigning
a number from 1 to 10 to reflect the amount of prosocial behavior in each behavior sample.
I will demonstrate the procedure with some extreme data that were created to make a point.
Look at the data in Table 14.14.
In Table 14.14a the judges are in almost perfect agreement. They all see wide differences among children, they all agree on which children show high levels of prosocial behavior and which show low levels, and they are nearly in agreement on how high or low those levels are. In this case nearly all of the variability in the data involves differences among children; there is almost no variability among judges and almost no random error.
In Table 14.14b we see much the same pattern, but with a difference. The judges do see overall differences among the children, and they do agree on which children show the highest (and lowest) levels of the behavior. But the judges disagree on the amount of prosocial behavior they see. Judge II sees slightly less behavior than Judge I (his mean is 1 point lower), and Judge III sees relatively more behavior than do the others. In other words, while the judges agree on the ordering of children, they disagree on level. Here the data involve both variability among children and variability among judges. However, the random error component is still very small. This is often the most realistic model of how people rate behavior, because each of us has a different understanding of how much behavior is required to earn a rating of “7,” for example. Our assessment of the reliability of a rating system must normally take variability among judges into account.
Finally, Table 14.14c shows a pattern in which the judges disagree not only in level but also in their ordering of the children. A large percentage of the variability in these data is error variance.
So what do we do when we want to talk about reliability? One way to measure reliability when judges use only a few levels or categories is to calculate the percentage of times that two judges agree on their ratings, but this measure is biased because of high levels of chance agreement whenever one or two categories predominate. (But see the earlier discussion of Cohen’s kappa.) Another common approach is to correlate the ratings of two judges, and perhaps average the pairwise correlations if you have multiple judges. But this approach does not take differences between judges into account. (If one judge always rates five points higher than another judge, the correlation will be 1.00, but the judges are saying different things about the subjects.) A third way is to calculate what is called the intraclass correlation, taking differences due to judges into account. That is what we will do here.
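To see why a simple correlation overstates agreement, consider the following minimal sketch in Python. The ratings here are hypothetical (not from Table 14.14): Judge B simply rates every child 5 points higher than Judge A, yet the Pearson correlation is 1.00 even though the two judges disagree on level.

    # Hypothetical ratings: Judge B rates every child exactly 5 points
    # higher than Judge A, so they disagree on level but not on ordering.
    judge_a = [1, 3, 5, 5, 7]
    judge_b = [x + 5 for x in judge_a]

    def pearson_r(x, y):
        """Ordinary Pearson product-moment correlation."""
        n = len(x)
        mx, my = sum(x) / n, sum(y) / n
        cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
        sx = sum((a - mx) ** 2 for a in x) ** 0.5
        sy = sum((b - my) ** 2 for b in y) ** 0.5
        return cov / (sx * sy)

    print(pearson_r(judge_a, judge_b))   # prints 1.0 despite the 5-point disagreement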
Table 14.14 Data for intraclass correlation examples
             (a)              (b)              (c)
            Judge            Judge            Judge
Child    I   II   III     I   II   III     I   II   III
  1      1    1    2      1    0    3      1    3    7
  2      3    3    3      3    2    5      3    1    5
  3      5    5    5      5    4    7      5    7    4
  4      5    6    6      5    4    7      5    5    5
  5      7    7    7      7    6    8      7    6    7
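As a preview of the computation, the sketch below (in Python) carries out the two-way analysis of variance decomposition for each panel of Table 14.14 and computes one common single-rater form of the intraclass correlation, often labeled ICC(2,1), which treats both children and judges as random. The exact formula developed in the text may be written somewhat differently, so treat this only as an illustration of how the child, judge, and error components enter the calculation.

    # Sketch: two-way ANOVA decomposition and a single-rater intraclass
    # correlation (the ICC(2,1) form) for the Table 14.14 data.  This is an
    # illustration; the text's own formula may be presented differently.

    def icc(ratings):
        """ratings: one row per child, each row holding the judges' ratings."""
        n = len(ratings)            # number of children
        k = len(ratings[0])         # number of judges
        grand = sum(sum(row) for row in ratings) / (n * k)

        child_means = [sum(row) / k for row in ratings]
        judge_means = [sum(row[j] for row in ratings) / n for j in range(k)]

        ss_children = k * sum((m - grand) ** 2 for m in child_means)
        ss_judges = n * sum((m - grand) ** 2 for m in judge_means)
        ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
        ss_error = ss_total - ss_children - ss_judges

        ms_children = ss_children / (n - 1)
        ms_judges = ss_judges / (k - 1)
        ms_error = ss_error / ((n - 1) * (k - 1))

        return (ms_children - ms_error) / (
            ms_children + (k - 1) * ms_error + k * (ms_judges - ms_error) / n
        )

    # Rows are children 1-5; columns are Judges I, II, III.
    panel_a = [[1, 1, 2], [3, 3, 3], [5, 5, 5], [5, 6, 6], [7, 7, 7]]
    panel_b = [[1, 0, 3], [3, 2, 5], [5, 4, 7], [5, 4, 7], [7, 6, 8]]
    panel_c = [[1, 3, 7], [3, 1, 5], [5, 7, 4], [5, 5, 5], [7, 6, 7]]

    for label, data in (("(a)", panel_a), ("(b)", panel_b), ("(c)", panel_c)):
        print(label, round(icc(data), 3))

As you would expect from the discussion above, this coefficient is highest for panel (a), somewhat lower for panel (b) because of the differences among judges, and lowest for panel (c), where much of the variability is error.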