individual items, an “item-rest” correlation is one of the most popular tests. This
straightforward statistic simply tells us the correlation between answers to a given
item and the average score for all other items on the assessment (see [7]).
Overall, answers to each item in a given assessment should be correlated with
answers on the remaining items. We should be able to reasonably predict the answer
to any single item using answers given to the other items on the assessment. If
answers to an item are weakly correlated with answers to other items, this probably
means that the item shouldn’t be grouped with all of the others.
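To make the idea concrete, here is a minimal sketch of how an item-rest correlation might be computed in Python with pandas; the item names and responses below are entirely hypothetical.

```python
import pandas as pd

# Hypothetical responses: one row per examinee, one column per item (1-5 scale)
responses = pd.DataFrame({
    "item1": [4, 5, 3, 4, 2, 5],
    "item2": [4, 4, 3, 5, 2, 4],
    "item3": [3, 5, 2, 4, 1, 5],
    "item4": [2, 1, 4, 2, 5, 1],  # deliberately out of step with the others
})

# Item-rest correlation: each item against the mean of all *other* items
for item in responses.columns:
    rest_score = responses.drop(columns=item).mean(axis=1)
    print(f"{item}: item-rest r = {responses[item].corr(rest_score):.2f}")
```

In this made-up data, item4 was written to run opposite to the other items, so its item-rest correlation would be markedly negative; that is exactly the kind of item that probably should not be grouped with the rest.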
A measure called Cronbach’s alpha is the most common test of internal consis-
tency and reliability. It provides an overall estimate of how closely correlated our
items are with one another. If we find that our items are weakly correlated, it means
that our items are not all measuring the same thing – and therefore they are not all measuring our construct. A low level of internal consistency does not, however, mean that none of our items are capturing our construct. It is possible that a few items are weakly correlated with most of the others and are dragging down the overall consistency level. Virtually every statistics package that calculates Cronbach's alpha can help spot such items.
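For readers who want to see the mechanics, the formula is short: with k items, alpha = k/(k - 1) * (1 - (sum of the item variances)/(variance of the total score)). A minimal sketch, continuing the hypothetical responses DataFrame from the previous example:

```python
import pandas as pd

def cronbach_alpha(items: pd.DataFrame) -> float:
    """alpha = k/(k-1) * (1 - sum of item variances / variance of total score)."""
    k = items.shape[1]
    item_vars = items.var(ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars / total_var)

print(f"alpha = {cronbach_alpha(responses):.2f}")

# "Alpha if item deleted": if alpha rises when an item is dropped,
# that item may be the one dragging down internal consistency.
for item in responses.columns:
    print(f"without {item}: {cronbach_alpha(responses.drop(columns=item)):.2f}")
```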
Another popular, related measure explores the relationships among answers to our
assessment: factor analysis. Factor analysis can be used to explore internal consis-
tency, but it can also be used to explore dimensionality. Exploratory factor analysis,
in layman’s terms, can examine whether a subset of your items is intercorrelated more strongly (or weakly) than the full set of items taken together.
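As an illustration only (a real exploratory factor analysis involves choices about the number of factors, rotation, and sample size that we are glossing over here), scikit-learn's FactorAnalysis can extract factor loadings from the same hypothetical data:

```python
import pandas as pd
from sklearn.decomposition import FactorAnalysis

# Continuing the hypothetical `responses` DataFrame from the earlier sketches
fa = FactorAnalysis(n_components=2).fit(responses)

# Loadings: how strongly each item relates to each extracted factor.
# Items that load on different factors may not be measuring the same thing.
loadings = pd.DataFrame(fa.components_.T,
                        index=responses.columns,
                        columns=["factor1", "factor2"])
print(loadings.round(2))
```

In practice, dedicated psychometric tools (such as R's psych package or Python's factor_analyzer) report the additional diagnostics that analysts usually rely on.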
Cronbach’s alpha and factor analysis are powerful tools. They are surprisingly
easy to conduct using everyday statistics software. Too easy perhaps. You should
not proceed with these tests without studying them more closely than we have done
here. The purpose of this section is to highlight the intuition behind these psycho-
metric procedures – not the mathematical mechanics or the deeper theoretical underpinnings, which are nonetheless important. If you’ve never used these procedures before,
work with someone who has. Again, assessment is a social enterprise.
Our goal has been to create a measure that is drawn from a composite of answers
to multiple questions. Our data must pass certain tests in order for us to combine
answers into an average score (which can be raw or weighted). However, simply
passing tests of internal consistency is only a step. It means that our questions have
the appearance of measuring something in common. But it does not yet mean that
our questions measure what we think they measure.
Does Your Measure Predict Other Outcomes?
You are developing an assessment because you want to measure a skill (i.e., con-
struct) that you believe is important to the real world of surgery. Scores on your
assessment should be measurably related to outcomes in the real world – this is the
intuition behind a concept called predictive validity.
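In data terms, this usually reduces to a correlation (or a regression) between composite assessment scores and an independently measured outcome. A minimal sketch with made-up numbers:

```python
import pandas as pd

# Hypothetical data: each examinee's composite assessment score alongside an
# independently measured real-world outcome (e.g., a later performance rating)
scores = pd.DataFrame({
    "assessment_score": [3.2, 4.1, 2.8, 4.6, 3.9, 2.5],
    "later_outcome":    [62, 80, 55, 88, 75, 50],
})

# Predictive validity intuition: assessment scores should track the outcome
r = scores["assessment_score"].corr(scores["later_outcome"])
print(f"score-outcome correlation: r = {r:.2f}")
```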
We have mentioned the example of “grit” throughout this chapter. Grit was
defined by Duckworth and colleagues as perseverance and passion for long-term
goals. Therefore, scores on the Grit scale should be predictive of outcomes in instances
where these traits are important. In a now seminal article, Duckworth and colleagues