ample, the values associated with a few pixels
may be changed in a way that is medically
insignificant. The AI/ML system must now
make a prediction from a feature vector that
is vanishingly different from the initial vec-
tor. A stable algorithm should give predic-
tions that are similarly “close” in the output
space (in probability) when it is given a slight
variation relative to another input [Dwork et
al. (10) describe this property and use it as
the basis for a definition of fairness in ML].
When this condition is not satisfied, the
algorithm is not stable in the sense that med-
ically similar patients can receive dissimilar
diagnoses. From the perspective of patient
safety, it is undesirable to have a diagnostic
system that frequently classifies medically
similar lesions very differently. Paying atten-
tion to this encourages thinking not in terms
of same inputs/same outputs, but in terms of
similar inputs/similar outputs.
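To make the similar inputs/similar outputs criterion concrete, the minimal Python sketch below (illustrative only: predict_proba stands in for whatever probability interface a given classifier exposes, and the pixel count, perturbation size, and [0, 1] scaling are arbitrary assumptions) nudges a few pixel values by a negligible amount and measures how far the predicted probabilities move; a stable system keeps this gap small across many such perturbations.

    import numpy as np

    def stability_gap(predict_proba, image, n_pixels=3, eps=0.01, seed=0):
        # Perturb a few pixel values by a medically insignificant amount
        # and measure how far the predicted class probabilities move.
        # Assumes pixel values are scaled to [0, 1].
        rng = np.random.default_rng(seed)
        perturbed = image.astype(float)        # copy of the original image
        flat = perturbed.reshape(-1)           # view into `perturbed`
        idx = rng.choice(flat.size, size=n_pixels, replace=False)
        flat[idx] = np.clip(flat[idx] + rng.uniform(-eps, eps, n_pixels), 0.0, 1.0)
        p_orig = np.asarray(predict_proba(image))
        p_pert = np.asarray(predict_proba(perturbed))
        # Total-variation-style distance between the two output distributions;
        # large values under negligible perturbations indicate instability.
        return 0.5 * np.abs(p_orig - p_pert).sum()

Averaging this gap over many lesions gives one rough, quantitative handle on how far a system departs from the similar-inputs/similar-outputs ideal.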
In modern AI/ML, leading classification
systems are highly nonlinear. This makes
them especially vulnerable to such instabil-
ity [for example, (11)]. This problem extends
beyond adversarial attacks (6). “System lock-
ing” of an algorithm, though it can guarantee
that the same inputs will lead to the same
outputs, does not secure against the bigger
concern—instability. Meanwhile, any prede-
termined change control plan does not get to
the core of the problem either, because it is
impossible to know in advance what kinds of
instabilities will actually arise.
A CONTINUOUS RISK-MONITORING APPROACH
As regulators push forward, their emphasis
should be on developing a process to con-
tinuously monitor, identify, and manage the
risks arising from AI/ML features such as
concept drift, covariate shift, and instability.
Such a process can include, for example, the
following elements, some of which might
even be automated with improvements in
AI/ML technology:
Retesting
An AI/ML system may need to be regularly
retested (possibly continuously retested, with
dedicated infrastructure) on all past cases, or
a random subset of them, including but not
limited to the ones used for the initial mar-
keting authorization. Major discrepancies
with past verdicts may lead to regulatory action.
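A sketch of such a retesting loop appears below; it assumes past cases and the verdicts originally issued for them are available as NumPy arrays, that the deployed model exposes a scikit-learn-style predict method, and that the 5% discrepancy threshold is a placeholder a regulator would set.

    import numpy as np

    def retest_on_past_cases(model, past_inputs, past_verdicts,
                             sample_size=None, max_discrepancy_rate=0.05, seed=0):
        # Re-run the current model on previously decided cases (or on a
        # random subset) and report the rate at which verdicts have changed.
        rng = np.random.default_rng(seed)
        idx = np.arange(len(past_inputs))
        if sample_size is not None:
            idx = rng.choice(idx, size=sample_size, replace=False)
        new_verdicts = model.predict(past_inputs[idx])
        rate = float(np.mean(new_verdicts != past_verdicts[idx]))
        # A rate above the agreed threshold could trigger regulatory review.
        return rate, rate > max_discrepancy_rate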
Simulated checks
An AI/ML system can be continuously ap-
plied to “simulated patients”—generated,
for example, by perturbing the data of past
patients, an idea often used to examine the
robustness of AI/ML models—to evaluate
whether its behavior is reliable with respect
to a sufficient diversity of patient types.
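One simple way to implement this, sketched below under the assumption that patient records are rows of a numeric feature matrix and that the model again exposes a predict method (the names and noise scale are illustrative), is to add small Gaussian noise to past records and measure how often the model's verdict agrees with its verdict on the original patient.

    import numpy as np

    def simulated_patient_agreement(model, past_patients, noise_scale=0.02,
                                    n_copies=5, seed=0):
        # Create several "simulated patients" per real patient by adding
        # small Gaussian noise (scaled to each feature's spread), then
        # check how often predictions match those for the originals.
        rng = np.random.default_rng(seed)
        copies = np.repeat(past_patients, n_copies, axis=0)
        scale = noise_scale * past_patients.std(axis=0)
        simulated = copies + rng.normal(0.0, scale, copies.shape)
        original_preds = np.repeat(model.predict(past_patients), n_copies)
        simulated_preds = model.predict(simulated)
        # Low agreement flags unreliable behavior on near-identical patients.
        return float(np.mean(original_preds == simulated_preds))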
Adversarial stress tests
Every AI/ML system may need to be paired
with a monitoring mechanism to ensure ro-
bustness to adversarial examples (12). Regu-
lators could use the adversarial approach to
conduct algorithmic stress tests throughout
the AI/ML system’s lifecycle, borrowing
practices such as red teaming and adversar-
ial attack testing from cybersecurity.
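As an illustration of what such a stress test might look like, the sketch below applies the fast-gradient-sign method of (11), one standard way of constructing adversarial examples, to a toy binary logistic model; a real test would target the deployed nonlinear model through automatic differentiation, and the weights, inputs, and perturbation budget here are placeholders.

    import numpy as np

    def fgsm_flips_verdict(w, b, x, y_true, eps=0.01):
        # Fast-gradient-sign stress test for a binary logistic model
        # p(y = 1 | x) = sigmoid(w . x + b): nudge x by eps in the
        # direction that most increases the loss, then check whether
        # the predicted verdict flips under this small, bounded change.
        p = 1.0 / (1.0 + np.exp(-(w @ x + b)))
        grad_x = (p - y_true) * w      # gradient of cross-entropy loss w.r.t. x
        x_adv = x + eps * np.sign(grad_x)
        p_adv = 1.0 / (1.0 + np.exp(-(w @ x_adv + b)))
        return (p >= 0.5) != (p_adv >= 0.5)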
An appropriate division of labor
Monitoring of AI/ML systems should, in
general, be done by actors different from the
ones developing these systems. Separation of
development and testing is common in other
contexts. For example, in software develop-
ment, quality assurance and development
teams are separate, while risk management
and compliance departments are separated
from traders in the financial sector. Such divi-
sions may likewise be required for companies
developing medical AI/ML systems. More-
over, third-party organizations that monitor
AI/ML systems based on standards the in-
dustry develops, similar in spirit to those of
professional organizations like the Institute
of Electrical and Electronics Engineers and
the International Organization for Standard-
ization, may also play a role in the future.
Use of innovative electronic systems
Regulators could also use new electronic
systems and data analysis techniques, such
as change-point detection (a family of sta-
tistical techniques that attempt to identify
changes in the distribution of a stochastic
process) or anomaly detection (a family of
techniques used to identify rare items in a
data set), to continuously monitor AI/ML
systems. Existing elements of regulators’
product oversight could be adapted for mon-
itoring AI/ML. For example, systems such as
the FDA’s national medical product monitor-
ing system Sentinel (13, 14) could be used
to continuously monitor the behavior of ap-
proved AI/ML-based medical devices. Com-
bining information from electronic health
records and other data from such devices,
regulators could themselves perform some
of the tasks described. Indeed, in September
2019, the FDA announced an intention to ex-
pand Sentinel to three separate coordinating
centers monitoring more traditional medical
product safety (13). Its new “Operations Center” seeks to use
partnerships in epidemiology, statistics, and data science, among
other fields (13). Its new “Innovation Center” will explore new roads
“to extract and structure information from electronic health records”
(13). Both of these new centers could be used to implement aspects of
the AI/ML monitoring proposed here.
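As one concrete instance of the change-point detection mentioned above, the sketch below runs a basic one-sided CUSUM detector over a stream of monitoring statistics, such as a device's daily error rate; the thresholds and the example series are invented purely for illustration.

    import numpy as np

    def cusum_first_alarm(stream, expected_mean, drift=0.01, threshold=0.5):
        # One-sided CUSUM detector: accumulate deviations above the
        # expected mean (minus a small allowed drift) and raise an alarm
        # when the cumulative sum exceeds the threshold.
        s = 0.0
        for t, x in enumerate(stream):
            s = max(0.0, s + (x - expected_mean - drift))
            if s > threshold:
                return t               # index of the first alarm
        return None                    # no distribution change detected

    # Example: an error rate that jumps from 5% to 20% on day 30
    # is flagged a few days later (prints 33).
    rates = np.r_[np.full(30, 0.05), np.full(10, 0.20)]
    print(cusum_first_alarm(rates, expected_mean=0.05))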
Our suggestions refine the FDA’s aim
of implementing “real-world performance
monitoring” (1, 2) by articulating some key
features that risk-monitoring should focus
on (i.e., concept drift, covariate shift, and
instability) and suggesting some ways to
implement it. In principle, our goal is to
emphasize the risks that can arise from un-
anticipated changes in how medical AI/ML
systems react or adapt to their environ-
ments. Subtle, often unrecognized paramet-
ric updates or new types of data can cause
large and costly mistakes. Hence, they need
to be continuously monitored and tested.
Although this discussion has been moti-
vated by the FDA’s current approach to
AI/ML-based SaMD and the U.S. experi-
ence, with appropriate adaptations, the les-
sons here apply to other countries and their
regulators as well. j
REFERENCES AND NOTES
1. U.S. Food and Drug Administration (FDA), “Proposed regulatory framework for modifications to artificial intelligence/machine learning (AI/ML)-based software as a medical device (SaMD)” (discussion paper and request for feedback, 2019); http://www.fda.gov/media/122535/download.
2. U.S. Food and Drug Administration (FDA), “Developing a Software Precertification Program: A Working Model” (v.1.0, January 2019); http://www.fda.gov/media/119722/download.
3. The Learning Healthcare Project, “Background, learning healthcare system” (2019); http://www.learninghealthcareproject.org/section/background/learning-healthcare-system.
4. A. Yala, C. Lehman, T. Schuster, T. Portnoi, R. Barzilay, Radiology 292, 60 (2019).
5. A. S. Fauci, M. A. Marovich, C. W. Dieffenbach, E. Hunter, S. P. Buchbinder, Science 344, 49 (2014).
6. S. G. Finlayson et al., Science 363, 1287 (2019).
7. U.S. Food and Drug Administration (FDA), “Software as a medical device (SaMD)” (2018); http://www.fda.gov/medical-devices/digital-health/software-medical-device-samd.
8. A. Esteva et al., Nature 542, 115 (2017).
9. S. Bickel, M. Bruckner, T. Scheffer, Discriminative learning for differing training and test distributions, Proceedings of the 24th International Conference on Machine Learning (ICML), Corvallis, OR, 2007.
10. C. Dwork, M. Hardt, T. Pitassi, O. Reingold, R. Zemel, Fairness through awareness, Proceedings of the 3rd Innovations in Theoretical Computer Science Conference (ITCS), Cambridge, MA, 2012.
11. I. J. Goodfellow, J. Shlens, C. Szegedy, Explaining and harnessing adversarial examples, Proceedings of the 3rd International Conference on Learning Representations (ICLR), San Diego, CA, 2015.
12. S. Gu, L. Rigazio, Towards deep neural network architectures robust to adversarial examples, Proceedings of the 3rd International Conference on Learning Representations (ICLR), San Diego, CA, 2015.
13. U.S. Food and Drug Administration (FDA), “FDA’s Sentinel Initiative” (2019); http://www.fda.gov/safety/fdas-sentinel-initiative.
14. Sentinel Coordinating Center, “Sentinel is a national medical product monitoring system” (2019); http://www.sentinelinitiative.org.
ACKNOWLEDGMENTS
S.G. and I.G.C. were supported by a grant from the
Collaborative Research Program for Biomedical Innovation
Law, a scientifically independent collaborative research
program supported by a Novo Nordisk Foundation grant
(NNF17SA0027784). All authors contributed equally to the
analysis and drafting of the paper. I.G.C. served as a bioethics
consultant for Otsuka on their Abilify MyCite product. The
authors declare no other competing interests.
10.1126/science.aay9547