[Figure: three example decision trees (Tree #1, Tree #2, Tree #3) that sort species with yes/no questions about their traits: body mass (under 1 kg?), geographic range (under 150,000 km²?), life span (less than 1 year?), age of sexual maturity (more than 50 days?), group population size (less than 10?), litters per year (less than 3?) and period of activity (active during day?). Accuracy labels in the figure read 58%, 83% and 67% identified correctly, and the process repeats. Key: misclassified; not known to transmit disease; known to transmit disease; not yet classified by algorithm.]
60 DISCOVERMAGAZINE.COM
TOP: COURTESY OF BARBARA HAN. BOTTOM: ALISON MACKEY/DISCOVER. SHUTTERSTOCK ELEMENTS FROM BASEL101658, POTAPOV ALEXANDER, HEIN NOUWENS, A SK (CREATURES); BLACK CREATOR (MAN)
A Learning Process
In the world of epidemiology, diseases
that have seen an uptick in recent
years are called “emerging infectious
diseases.” But are there really more
cases of these diseases, or have we
just become better at spotting them?
According to Barbara Han, a disease
ecologist at the nonprofit Cary
Institute of Ecosystem Studies in New
York, it’s not just us getting better.
“It’s actually an increasing problem of
infectious diseases,” she says. And most
of these diseases originate in animals.
Han decided to figure out what
makes certain animals more likely
to host specific diseases. “There is
something inherent about a species that
enables it to carry disease, compared to
the vast majority that don’t,” she
says. “I want to know what the
data can give me, what can the
data show me, about what distinguishes
those two.” She turned to
algorithms and machine learning.
Han starts with a list of species
that researchers have already
flagged as disease carriers or
non-disease carriers. She then
trains a computer algorithm
to separate the species on the
list by dozens of traits. The species
are not labeled in any way, so the
algorithm doesn’t know which is which.
For example, the algorithm may start by
looking at an animal’s body mass, followed
by its age of sexual maturity and finally by
whether it’s nocturnal or not.
At the end of this sorting, the
algorithm will ideally have
grouped species by whether
they’re disease carriers or not.
But this first sort gets a fair
bit wrong. To make the algorithm
more accurate, Han has
the computer do another round
of sorting, this time focusing
on the species it miscategorized
the first time. When it does
this over and over again, the algorithm
learns. And, importantly, it learns which
factors contribute to a species carrying a
transferable disease or not. “At the end
of that process, you get a very powerful
predictor,” Han says. When the model
To train an algorithm to identify
zoonotic species, known disease
carriers and non-disease carriers
from an animal group, in this case
rodents, are fed into the algorithm
as unlabeled data points. NOTE: Only
around 10 percent of disease carriers are
currently known, but we show a 50/50
split here, for simplicity.
The algorithm then sorts them into
groups using randomly selected traits.
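As a rough illustration of this sorting step, here is a minimal Python sketch. The yes/no questions are copied from the figure, but the field names, the species records and the helper `sort_into_groups` are all invented for illustration. This is not Han's code or data.

```python
import random

# Yes/no questions copied from the figure. The field names are hypothetical.
QUESTIONS = {
    "body mass under 1 kg?": lambda s: s["body_mass_kg"] < 1,
    "life span less than 1 year?": lambda s: s["life_span_years"] < 1,
    "litters per year less than 3?": lambda s: s["litters_per_year"] < 3,
    "active during day?": lambda s: s["active_during_day"],
}

# Invented records for illustration only; not real species data.
SPECIES = [
    {"name": "rodent A", "body_mass_kg": 0.03, "life_span_years": 0.8,
     "litters_per_year": 5, "active_during_day": False},
    {"name": "rodent B", "body_mass_kg": 0.4, "life_span_years": 2.5,
     "litters_per_year": 4, "active_during_day": False},
    {"name": "rodent C", "body_mass_kg": 9.0, "life_span_years": 8.0,
     "litters_per_year": 1, "active_during_day": True},
    {"name": "rodent D", "body_mass_kg": 1.5, "life_span_years": 3.0,
     "litters_per_year": 2, "active_during_day": True},
]


def sort_into_groups(species, n_questions=2, seed=0):
    """Pick questions at random, then group species by their answers."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(QUESTIONS), n_questions)
    groups = {}
    for s in species:
        answers = tuple(QUESTIONS[q](s) for q in chosen)
        groups.setdefault(answers, []).append(s["name"])
    return chosen, groups


chosen, groups = sort_into_groups(SPECIES)
for answers, names in groups.items():
    print(dict(zip(chosen, answers)), names)
```

Each group holds the species that gave the same answers to the chosen questions. Whether a group lines up with carriers or non-carriers depends on which traits were drawn.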
The algorithm gets a
lot wrong (hollow
boxes) in its first
pass, so it repeats
the process multiple
times with other
randomly selected
traits, focusing on
misclassified species.
With each attempt, the
algorithm learns which
traits are most likely to
crop up in disease carriers.
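The repeat-and-refocus idea can be sketched in Python. The article does not name the algorithm, so this sketch assumes an AdaBoost-style scheme built from one-question trees, which may differ from Han's actual model. It also picks the best single question each round instead of drawing traits at random. The data are invented. The carrier labels (+1 carrier, -1 non-carrier) are used only to score each sort and to give more weight to the species it got wrong.

```python
import math

# Invented data: (body mass in kg, litters per year). Not real species records.
ROWS = [(0.2, 5), (0.3, 4), (0.1, 1), (0.25, 2),
        (2.0, 6), (3.0, 4), (2.5, 1), (4.0, 2)]
LABELS = [1, 1, -1, -1, -1, -1, -1, -1]  # +1 carrier, -1 non-carrier


def stump_predict(stump, row):
    """One yes/no question: is this trait below the threshold?"""
    trait, threshold, sign = stump
    return sign if row[trait] < threshold else -sign


def best_stump(rows, labels, weights):
    """Find the single question with the lowest weighted error."""
    best, best_err = None, float("inf")
    for trait in range(len(rows[0])):
        for threshold in sorted({r[trait] for r in rows}):
            for sign in (1, -1):
                stump = (trait, threshold, sign)
                err = sum(w for r, y, w in zip(rows, labels, weights)
                          if stump_predict(stump, r) != y)
                if err < best_err:
                    best, best_err = stump, err
    return best, best_err


def boost(rows, labels, rounds):
    """Repeat the sort, upweighting misclassified species each round."""
    weights = [1.0 / len(rows)] * len(rows)
    model = []
    for _ in range(rounds):
        stump, err = best_stump(rows, labels, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)  # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)  # this question's vote
        model.append((alpha, stump))
        weights = [w * math.exp(-alpha * y * stump_predict(stump, r))
                   for r, y, w in zip(rows, labels, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return model


def predict(model, row):
    score = sum(alpha * stump_predict(stump, row) for alpha, stump in model)
    return 1 if score >= 0 else -1


def accuracy(model, rows, labels):
    hits = sum(predict(model, r) == y for r, y in zip(rows, labels))
    return hits / len(rows)


print("one pass:", accuracy(boost(ROWS, LABELS, rounds=1), ROWS, LABELS))
print("40 passes:", accuracy(boost(ROWS, LABELS, rounds=40), ROWS, LABELS))
```

In this toy set no single question separates the carriers, so one pass always gets some species wrong, while the repeated passes classify every training example correctly. The votes (`alpha`) and the questions kept in `model` give a crude picture of which traits mattered.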
Diseases that can
pass between
animals and humans
are called zoonotic.
Barbara Han, disease ecologist, Cary Institute of Ecosystem Studies