to do so, Chen et al. took videos from a previous study, which filmed surgeons performing Fundamentals of Laparoscopic Surgery tasks using the da Vinci robot [52]. A total of 501 Amazon.com crowd workers and 110 Facebook users were selected as the crowdsourced reviewers, and ten expert robotic surgeons were recruited for the control group. All participants reviewed the same video and graded only three domains of GEARS: depth perception, bimanual dexterity, and efficiency. Facebook users and experts received no compensation; the Mechanical Turk workers received $1.00 per HIT.
Crowdsourced scores that did not fall within a 95% confidence interval of the gold standard set by the expert reviewers were excluded. This eliminated one expert, 92 of the Mechanical Turk workers, and 43 of the Facebook users (90%, 82%, and 63% retained, respectively) [51]. Response times were also highly variable: whereas it took the expert surgeons 24 days to review the video, it took the Mechanical Turk workers only 5; the Facebook users took the longest, at 25 days. Chen’s study was limited to a single video, but it showed that the Mechanical Turk workers were more efficient and reliable assessors than social media users in general.
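As a rough illustration of this kind of filter, the sketch below keeps only reviewers whose score falls inside a 95% confidence interval computed from the expert ratings. The function name, the t-based interval, and the sample data are assumptions made for illustration; the exact statistical procedure used in [51] may differ.

```python
import numpy as np
from scipy import stats

def filter_by_expert_ci(expert_scores, reviewer_scores, confidence=0.95):
    """Keep reviewers whose score falls within a confidence interval
    built from the expert ratings (illustrative sketch only)."""
    expert_scores = np.asarray(expert_scores, dtype=float)
    mean = expert_scores.mean()
    sem = stats.sem(expert_scores)  # standard error of the expert mean
    low, high = stats.t.interval(confidence, len(expert_scores) - 1,
                                 loc=mean, scale=sem)
    return {rid: score for rid, score in reviewer_scores.items()
            if low <= score <= high}

# Hypothetical summed GEARS scores (three domains, each rated 1-5).
experts = [12, 13, 11, 12, 14, 13, 12, 11, 13, 12]
crowd = {"worker_001": 12.3, "worker_002": 7.0, "worker_003": 13.1}
print(filter_by_expert_ci(experts, crowd))
```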
Kowalewski et al. advanced C-SATS further, beginning with 24 videos taken from the BLUS validation study of the pegboard and suturing exercises [10, 50]. Each was reviewed approximately 60 times by individual crowd workers, who had first been screened for calibration and attention: workers who failed to notice a trick question or whose answers strayed too far from the norm were excluded from further participation [50]. A total of 1438 reviews passing the exclusion tests arrived within 48 hours, far surpassing the 10-day period it took to obtain a mere 120 ratings from the expert reviewers [50]. The crowd workers were also more discriminating than the expert reviewers, marking 10 videos as failing versus the experts’ 8. Relative to the experts’ pass/fail designations, the crowd workers passed none of the performers the experts had failed and failed 89% of those the experts rated as only “questionably good” [50]. Direct comparison of expert and crowd-worker scores yielded lines of best fit with slopes between 1.16 and 1.57 for the suturing and pegboard tasks, illustrated in Fig. 5.27, again showing that the experts gave slightly higher scores than the crowd workers [50].
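To make the slope comparison concrete, a least-squares fit of expert scores against crowd scores can be computed as below. The paired values are invented for illustration and are not the data behind Fig. 5.27.

```python
import numpy as np

# Hypothetical paired per-video scores: crowd-worker mean (x) versus
# expert mean (y). Values are illustrative, not taken from [50].
crowd_means = np.array([8.2, 10.5, 12.1, 13.0, 14.4, 15.8])
expert_means = np.array([10.1, 12.9, 14.6, 15.9, 17.2, 19.0])

# Least-squares line of best fit: expert ~= slope * crowd + intercept.
slope, intercept = np.polyfit(crowd_means, expert_means, deg=1)
print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")

# A slope above 1 (the reported fits fell between 1.16 and 1.57) means
# the experts tended to assign higher scores than the crowd workers.
```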
When the Mechanical Turk assessments were evaluated against the EDGE tracking device, the crowd workers proved equally reliable and advantageous in terms of cost. Each worker earned $0.67 per video, for a total cost of around $1200. EDGE itself costs several times that amount and cannot evaluate every possible metric, as some metrics do not involve tool movement [10].
The goal of a global rating scale is to provide a universal and objective model for scoring. However, even with anchored points along the Likert scale, a measure of subjectivity remains: a human rater is still required and thus introduces his or her biases, perceptions, and potential for error. Some researchers consider global rating scales to be subjective measures, holding that only computer-generated scores can be truly objective [53].
All of these methods have been extensively validated by a multitude of different sites and teams. However, there is some evidence that this kind of assessment may fail to capture the end result – the actual surgery performed on a patient.

