high-skilled and intermediate-skilled surgeons. Then, they were asked to rate a videotaped
robotic suturing task on three GEARS domains. Embedded within this task
was an attention question. Ratings from the expert surgeons served as a “gold stan-
dard” for the true quality of the task performance. Mean scores among the groups
were markedly similar; the mean surgeon rating was 12.22 (95% CI 11.11–13.11)
as compared to 12.21 (95% CI 11.98–12.43) and 12.06 (95% CI 11.57–12.55) for
Amazon Mechanical Turk workers and Facebook users, respectively. Notably,
responses were obtained from all Amazon Mechanical Turk users within just 5 days,
as compared to 25 days for Facebook users and 24 days for the faculty surgeons.
Moreover, more complex feedback from the crowdworkers appeared to correlate
with the expert ratings, suggesting that it might also be possible to identify higher-
quality responses to optimize this form of feedback [47]. This study was the first to
suggest that inexperienced crowdworkers could evaluate surgical simulation task
performance in a manner consistent with that of expert surgeons and in a markedly more
expeditious fashion.
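As a rough illustration of how a group mean and its 95% confidence interval might be computed from individual composite scores, consider the minimal sketch below. It assumes a normal approximation to the sampling distribution of the mean; the study's actual statistical method is not described here, and the function name `mean_ci` and the sample scores are hypothetical.

```python
from statistics import NormalDist, mean, stdev


def mean_ci(scores, level=0.95):
    """Return (mean, lower, upper) using a normal approximation.

    Assumption: the normal (z) approximation is adequate; for small
    rater groups a t-based interval would be wider.
    """
    n = len(scores)
    m = mean(scores)
    # Two-sided critical value, about 1.96 for a 95% interval.
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * stdev(scores) / n ** 0.5
    return m, m - half_width, m + half_width


# Hypothetical composite GEARS scores from one rater group (not study data).
crowd_scores = [11, 12, 13, 12, 14, 11, 12, 13]
m, lower, upper = mean_ci(crowd_scores)
print(f"mean = {m:.2f}, 95% CI {lower:.2f} to {upper:.2f}")
```

Because the interval half-width shrinks with the square root of the number of raters, a larger rater pool yields a narrower interval around a similar mean.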
One of the most important aspects of crowd-based feedback is its apparent equiv-
alency to feedback provided by surgical experts for specific technical tasks.
Published studies using well-established objective scoring systems
have demonstrated good correlation between crowdsourced ratings and expert sur-
geon ratings for technical tasks across a wide range of specialties (Table 6.3).
Several studies have demonstrated a strong linear relationship between the two
groups, with Pearson’s coefficients ranging from 0.74 to 0.89 [48–54] and r² values
for such correlations ranging between 0.70 and 0.93 [27, 49, 55]. Others have quantified
this relationship by comparing mean composite rating scores between crowdworkers
and surgical experts using Cronbach’s α scores, with scores greater than 0.9
indicating “excellent agreement,” 0.7–0.9 indicating “good agreement,” and scores
below 0.5 indicating “poor and unacceptable” levels of agreement. Multiple studies
using this analysis have demonstrated Cronbach’s α from 0.79 to 0.92 across a wide
range of tasks, including robotic and laparoscopic pegboard transfer and suturing,
as well as a simulated cricothyroidotomy exercise [49, 50, 56].
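The two agreement measures discussed above can be sketched in a few lines. The example below is illustrative only: it computes Pearson's coefficient and Cronbach's α for paired crowd and expert composite scores, treating each rater group as one "item" in the α calculation. The function names and the sample scores are hypothetical, and the cited studies may have computed these statistics differently. The labels follow the thresholds quoted in the text, which do not name the band between 0.5 and 0.7.

```python
from statistics import mean, variance


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ss_x = sum((a - mx) ** 2 for a in x)
    ss_y = sum((b - my) ** 2 for b in y)
    return cov / (ss_x * ss_y) ** 0.5


def cronbach_alpha(*rater_scores):
    """Cronbach's alpha, treating each rater group as one item.

    Each argument is a list of scores, one per video, in the same order.
    """
    k = len(rater_scores)
    sum_item_var = sum(variance(scores) for scores in rater_scores)
    # Total score per video, summed across rater groups.
    totals = [sum(per_video) for per_video in zip(*rater_scores)]
    return k / (k - 1) * (1 - sum_item_var / variance(totals))


def agreement_label(alpha):
    """Map alpha to the labels quoted in the text."""
    if alpha > 0.9:
        return "excellent"
    if alpha >= 0.7:
        return "good"
    if alpha < 0.5:
        return "poor"
    # The quoted thresholds leave 0.5 to 0.7 without a label.
    return "unspecified"


# Hypothetical composite scores for six videos (not study data).
crowd = [12, 15, 18, 20, 22, 25]
expert = [11, 16, 17, 21, 23, 24]

r = pearson_r(crowd, expert)
alpha = cronbach_alpha(crowd, expert)
print(f"Pearson r = {r:.2f}, r^2 = {r * r:.2f}")
print(f"Cronbach's alpha = {alpha:.2f} ({agreement_label(alpha)})")
```

Note that Pearson's coefficient captures only the linear association between the two groups, whereas Cronbach's α also depends on how the score variances of the groups compare, so the two measures need not agree in magnitude.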
It is notable that in one study examining medical student performance on a variety
of surgical skills tasks, poor correlation has been described. Twenty-five medical
students performed four simulation-based tasks: open knot tying, robotic
suturing, laparoscopic peg transfer, and a fulguration skills task on the LAP Mentor
©, a commercially available virtual reality laparoscopic simulator. For the first three
tasks, videos were assessed both by faculty experts and crowds using the C-SATS
platform employing OSATS, GEARS, and GOALS, respectively. For the fulgura-
tion task, candidates were evaluated using a proprietary ranking score generated by
the LAP Mentor ©, in lieu of expert evaluation. There was fair agreement of crowd
assessments for the knot tying task (Cronbach’s α = 0.62), good agreement for the
robotic suturing task (Cronbach’s α = 0.86), and excellent agreement for the laparoscopic
peg transfer (Cronbach’s α = 0.92). However, the proprietary assessments
generated by the LAP Mentor © had poor agreement with crowd assessments with
Cronbach’s α of 0.32 [56]. Given the consistent agreement between crowds and
experts for the other simulation tasks, the authors attributed such poor correlation to
J.C. Dai and M.D. Sorensen