programs can greatly increase the ease and efficiency of frequent evaluations of
individual operative performances.
To conceptualize the various assessment tools already in use, as well as any developed in the future, it may be helpful to understand them in terms of their final product. Each assessment produces an account of the operative performance based on the parameters set forth in its design. This account or characterization is most often a numerical score, because such scores are simple and quick to record and have the advantages of being easy to quantify, combine, and analyze. What the numerical score actually represents will depend on the design of the assessment. The most commonly used design involves a generic assessment that can be applied to all procedures and asks the rater to provide a numerical score for various aspects of the operative performance. Examples include the objective structured assessment of technical skills (OSATS) system [12], the nontechnical skills for surgeons system [13], and the O-SCORE system [14]. In an effort to make the assessment thorough and descriptive, as many as eight separate elements of surgical performance will be rated on a scale with five to nine levels. Similar systems have also been developed with procedure-specific metrics or a combination of universal and specific parameters, such as the global operative assessment of laparoscopic skills (GOALS) [15] and the OPRS [16]. Procedure-specific evaluations may provide more detailed data or identify particular techniques that a resident needs to improve, but they yield fewer comparable data points to analyze.
Whichever of these two approaches is chosen, however, there is a growing body
of research highlighting some important drawbacks. It may seem straightforward to
assign a numerical value along a Likert-type scale to metrics such as preoperative
planning or tissue handling, particularly when clear descriptions of the metrics and
numerical values are given. However, raters appear to largely ignore the prescribed
categories and rankings and instead view the performance in terms of one or two
broad characterizations that are applied to all the metrics used in the assessment [17, 18]. This at least partially explains why many assessments used in medical teaching tend to show more correlation among a single assessor’s ratings of multiple different trainees than they do among different assessors’ ratings of the same trainee or performance [17]. The other main drawback to assessments based around multiple
categories with detailed descriptions is that they take more time to read, understand,
and complete. Although more experience with a given system and the use of technology such as smartphones may expedite the process, more questions will always require more time to answer. And the more time it takes to complete an evaluation, the fewer evaluations are likely to be completed.
We then seemingly have to choose between fewer evaluations with more detailed,
although possibly flawed, data or more evaluations with less detailed data. There is
evidence suggesting that a greater number of completed evaluations, even if each consists of only one or two questions, may be more valuable. Williams and colleagues reported that increasing the number of evaluations
had a greater impact on reliability than did increasing the number of items assessed
[11]. They later compared a single-item global assessment of operative performance
to the standard OPRS evaluation of approximately ten items and found nearly