Evolution, 4th Edition


A STATISTICS PRIMER A–9


Statistics can quantify how certain we are that a difference between two sets
of measurements represents a real difference between the two populations from
which the measurements came, rather than an accident of sampling. A calculation
shows that the probability is less than 1 in a billion that by chance so many of the
deer from Colorado would be heavier than those from Texas if the two populations
in fact did have the same mean.

This example illustrates a fundamental principle about how statistics is used to
make inferences. Statistics cannot prove a hypothesis; it can only reject one. Our
samples do not let us calculate the true mean weights of all the deer living in
Colorado and Texas. It is therefore impossible to know with certainty that deer in
Colorado are heavier on average. We can, however, determine how probable our data
would be if the two populations did not differ.

To do that, we begin with a null hypothesis that we will seek to reject. We then
calculate the probability that, if the null hypothesis were true, a sample would
produce a result as extreme as or more extreme than what our data show. This
probability is called the P value. The smaller the P value, the more confident we are
that the null hypothesis is false. In evolutionary biology, it is conventional to reject
a null hypothesis if the P value is less than 0.05. We say that a conclusion is
statistically significant if it meets that threshold.

To see how this idea is used, let’s return to the deer example. Our null hypothesis
is that the weights of deer in Texas and Colorado have the same distribution.
If that were true, the probability of what we see in the two samples (all but one
of the deer weighed in Colorado are heavier than those in Texas) turns out to be
1.5 × 10⁻¹⁰. An even more extreme outcome would be that all the deer weighed
in Colorado are heavier than those in Texas, and under the null hypothesis the
probability of that outcome is 1.5 × 10⁻³⁰. Adding together the two probabilities
gives us the P value, which is 1.5 × 10⁻¹⁰. In other words, if the null hypothesis
were true, a result at least this extreme would turn up by chance in only about
1.5 of every 10 billion samples. The P value is far smaller than the threshold of
5 percent, so we can conclude that deer in Colorado are highly significantly
heavier than those in Texas.
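
The addition in the deer example can be checked with a few lines of Python. This is only an arithmetic sketch: the two probabilities are copied from the text, while the variable names and the constant ALPHA are our own labels.

```python
# Probabilities under the null hypothesis, copied from the deer example.
p_observed = 1.5e-10      # all but one Colorado deer heavier
p_more_extreme = 1.5e-30  # all Colorado deer heavier

# The P value sums the observed outcome and every more extreme outcome.
p_value = p_observed + p_more_extreme

ALPHA = 0.05  # conventional significance threshold (our own name for it)

print(f"P value = {p_value:.2e}")
print("statistically significant:", p_value < ALPHA)
```

The second term is twenty orders of magnitude smaller than the first, so it does not change the sum at the precision reported.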

Several approaches are used to test null hypotheses. The most common strategy
is to use statistical tests that make assumptions about the distributions in the
populations that are being sampled. Tests of this sort that you may have encountered
already are the chi-square test and the t-test. The appropriate choice of which
test to use depends on the nature of the null hypothesis and the data. (The text
by Whitlock and Schluter has the details [2].) Returning to the example of heights
shown in Figure A.10, a t-test of the null hypothesis that the two distributions
have the same mean gives P = 0.06 for the samples of five females and males (left
panel), which is not statistically significant. In contrast, the same test for the
samples of 250 females and 250 males (right panel) gives P = 3.7 × 10⁻⁵⁰, which is
highly statistically significant.
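
To see why larger samples can give a much smaller P value, it can help to look at the t statistic itself. The sketch below is a minimal, standard-library-only version of the pooled-variance two-sample t statistic. The function name and the heights are hypothetical (they are not the data behind Figure A.10), and the code stops short of converting t to a P value, which needs the t distribution from a statistics package.

```python
import math
import statistics


def pooled_t_statistic(sample_a, sample_b):
    """Return (t, degrees of freedom) for a two-sample pooled-variance t-test."""
    n_a, n_b = len(sample_a), len(sample_b)
    df = n_a + n_b - 2
    # Pooled estimate of the variance, assumed equal in both populations.
    pooled_var = (
        (n_a - 1) * statistics.variance(sample_a)
        + (n_b - 1) * statistics.variance(sample_b)
    ) / df
    # Standard error of the difference between the two sample means.
    std_err = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    t = (statistics.mean(sample_a) - statistics.mean(sample_b)) / std_err
    return t, df


# Hypothetical heights in cm, invented for illustration only.
females = [160.0, 165.0, 158.0, 170.0, 162.0]
males = [172.0, 168.0, 180.0, 175.0, 166.0]

t_small, df_small = pooled_t_statistic(females, males)
# Repeating each sample 50 times keeps the means the same but mimics n = 250.
t_large, df_large = pooled_t_statistic(females * 50, males * 50)

print(f"n = 5 per group:   t = {t_small:.2f}, df = {df_small}")
print(f"n = 250 per group: t = {t_large:.2f}, df = {df_large}")
```

With the difference in means held fixed, the standard error shrinks as the samples grow, so the magnitude of t increases and the corresponding P value falls.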

A second strategy for testing hypotheses statistically is called randomization.
Say that we have weights of 14 deer from Texas and 19 deer from Colorado, and
the difference in their mean weights is 10 kg. We can use a computer to randomly
assign the 33 weights in this data set to two groups, one of size 14 (representing
a sample of deer from Texas) and the other of size 19 (representing a sample from
Colorado). We then record the difference between the mean weights of these two
groups. The aim of this procedure is to simulate what we might see under the null
hypothesis that there in fact is no difference in the distributions of weights in
Colorado and Texas. By repeating this randomization thousands of times, we determine
how often the difference in means of the two groups of randomized data is as large
as, or larger than, what we actually observed (FIGURE A.11). If the difference in the
randomized data is as big as in the real data less than 5 percent of the time, we
conclude that the difference in our sample is statistically significant.
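
The randomization procedure described above can be sketched in Python roughly as follows. This is an illustration under stated assumptions, not the method used to produce Figure A.11: the function name, the fixed random seed, and the simulated weights are all hypothetical, and the sketch compares the absolute difference in means, which is one reasonable reading of "as large as, or larger than."

```python
import random


def mean(values):
    return sum(values) / len(values)


def randomization_test(group_a, group_b, n_shuffles=10_000, seed=0):
    """Estimate a P value for a difference in means by random reassignment."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    as_extreme = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)
        # The first n_a values play the role of group A, the rest group B.
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            as_extreme += 1
    # Fraction of randomizations at least as extreme as the real data.
    return as_extreme / n_shuffles


# Hypothetical weights in kg, simulated for illustration only.
data_rng = random.Random(1)
texas = [data_rng.gauss(55, 8) for _ in range(14)]
colorado = [data_rng.gauss(65, 8) for _ in range(19)]

p_value = randomization_test(texas, colorado, n_shuffles=2_000)
print(f"estimated P value: {p_value:.4f}")
```

Because the result is an estimate from a finite number of shuffles, a returned value of 0 means only that the P value is smaller than about 1 divided by the number of shuffles.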
