There are many measures of variation (e.g., variance, standard deviation).
Simple tabulations are the safest place to start: what is the frequency of each
response to each question? The goal is for each question to return a variety of
responses.
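As a minimal sketch of such a tabulation, assuming the responses sit in a pandas DataFrame with one column per item (the data and the column names q1, q2, q3 below are invented for illustration):

    import pandas as pd

    # Hypothetical data: one row per respondent, one column per item,
    # answers on a 1-5 agreement scale.
    responses = pd.DataFrame({
        "q1": [4, 5, 4, 3, 5, 4],
        "q2": [4, 4, 5, 3, 5, 4],
        "q3": [5, 4, 4, 2, 5, 3],
    })

    # Simple tabulation: the frequency of each response to each question.
    for item in responses.columns:
        print(responses[item].value_counts().sort_index())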
A lack of variation is a problem. If every respondent gives the same answer to a
given question, the item will not help us distinguish between people on a given
concept – because every answer is the same. Likewise, if nearly everyone returns
the same answer, the item’s usefulness in distinguishing between people is very
limited.
However, the opposite is not true. Just because there is variation in responses to
a given item does not mean that the variation is meaningful. Recall our conversation
above: a badly written item can produce a wide variety of answers simply
because your readers are confused.
Here, intuition plays a role. Does the spread of answers resemble what you would
expect? This is not a statistical question, but a conceptual one, based on how you’ve
defined the concept that you’re attempting to measure with that question.
Then there is the question of whether our items fit together. When examining a
single question, we want responses to vary. When looking across several questions,
we examine whether they covary.
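Both checks are straightforward with pandas; a sketch using the same hypothetical data as above:

    import pandas as pd

    # Same hypothetical data as in the earlier sketch.
    responses = pd.DataFrame({
        "q1": [4, 5, 4, 3, 5, 4],
        "q2": [4, 4, 5, 3, 5, 4],
        "q3": [5, 4, 4, 2, 5, 3],
    })

    # Variation within each question: does each item return a spread of answers?
    print(responses.var())

    # Covariation across questions: do answers to different items move together?
    print(responses.cov())
    print(responses.corr())  # the same relationship, on a standardized scale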


Are Answer Patterns Consistent?
This chapter has focused on the development of multiple-question assessments. We
might combine the answers into a single composite score. This is a somewhat strange
exercise: taking answers to qualitative questions and averaging them into a single,
quantitative score. After all, we couldn’t take answers to any two random questions
(e.g., “My co-workers in the operating room are hardworking” and “My co-workers
in the operating room are fans of major league baseball”) and combine them into
a meaningful measure. So what justifies our doing so with the data we’ve collected?
Psychometric tests are needed.
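For concreteness, the composite score in question is typically just the row-wise average of a respondent’s answers; a sketch using the same hypothetical data:

    import pandas as pd

    # Same hypothetical data as in the earlier sketches.
    responses = pd.DataFrame({
        "q1": [4, 5, 4, 3, 5, 4],
        "q2": [4, 4, 5, 3, 5, 4],
        "q3": [5, 4, 4, 2, 5, 3],
    })

    # Composite score: average each respondent's answers across all items.
    # Whether this average means anything is what psychometric tests must justify.
    composite = responses.mean(axis=1)
    print(composite)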
When trying to measure a single construct, we ask multiple questions for two main
reasons. The first is that our construct is complex (it has many parts), and we
need different questions to cover different parts of it. The second is that
language is messy: there is no perfect way of asking about a given thing, so we
sometimes ask redundant questions, in the hope that the common theme across
several answers will be more accurate than the answer to any single question.
Put more simply: in a multi-item assessment, each question is really just a different
way of asking about the same construct. Therefore, we would expect a person’s
answer to one question to resemble her answers to other questions on the same
assessment. That is, throughout our data, we would expect answers across items to
be correlated with one another.
Consider a hypothetical two-item assessment. If answers to the first item were
completely unrelated to answers to the second, it would be difficult to argue that the
items were measuring the same thing.
This intuition is the basis for what in psychometrics is called “internal consistency
reliability.” Various statistical tests of reliability exist for analyzing multi-item data.
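One widely used such test is Cronbach’s alpha, which compares the variance of respondents’ total scores with the sum of the individual item variances. A minimal sketch of the textbook formula, applied to the hypothetical data from the earlier sketches (a real analysis should use a vetted statistical package):

    import pandas as pd

    # Same hypothetical data as in the earlier sketches.
    responses = pd.DataFrame({
        "q1": [4, 5, 4, 3, 5, 4],
        "q2": [4, 4, 5, 3, 5, 4],
        "q3": [5, 4, 4, 2, 5, 3],
    })

    def cronbach_alpha(df):
        # alpha = k/(k-1) * (1 - sum of item variances / variance of total scores)
        k = df.shape[1]                          # number of items
        item_vars = df.var(axis=0, ddof=1)       # variance of each item
        total_var = df.sum(axis=1).var(ddof=1)   # variance of the summed scores
        return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

    print(cronbach_alpha(responses))

Values closer to 1 indicate that answers across items hang together; conventions for what counts as acceptable vary by field.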

