Encyclopedia of Environmental Science and Engineering, Volume I and II


STATISTICAL METHODS FOR ENVIRONMENTAL SCIENCE 1127


a population which maximizes the probability of obtaining the
observed set of sample values, assuming random sampling. It
has the advantages of yielding estimates which fully utilize the
information in the sample, if such estimates exist, and which
are less variable under certain conditions for large samples
than other estimates.
The method consists of taking the equation for the prob-
ability, or probability density function, finding its maximum
value, either directly or by maximizing the natural loga-
rithm of the function, which has a maximum for the same
parameter values, and solving for these parameter values.
The sample mean, $\hat{\mu} = (\sum_{i=1}^{n} x_i)/n$, is a maximum likelihood
estimate of the true mean of the distribution for a number of
distributions. The variance, $\hat{\sigma}^2$, calculated from the sample
by $\hat{\sigma}^2 = (\sum_{i=1}^{n} (x_i - \hat{\mu})^2)/n$, is a maximum likelihood estimate of the
population $\sigma^2$ for the normal distribution.
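The closed-form estimates above can be checked numerically. The following is a minimal sketch, assuming a normal model and a small set of hypothetical measurements (the data and the function names are illustrative, not from the text): it computes the mean and the divisor-n variance, then confirms that moving either parameter away from those values lowers the log-likelihood.

```python
import math

def normal_log_likelihood(xs, mu, sigma2):
    """Log-likelihood of a normal model with mean mu and variance sigma2."""
    n = len(xs)
    return (-0.5 * n * math.log(2 * math.pi * sigma2)
            - sum((x - mu) ** 2 for x in xs) / (2 * sigma2))

def normal_mle(xs):
    """Closed-form maximum likelihood estimates: mean and divisor-n variance."""
    n = len(xs)
    mu_hat = sum(xs) / n
    sigma2_hat = sum((x - mu_hat) ** 2 for x in xs) / n
    return mu_hat, sigma2_hat

sample = [4.1, 5.3, 4.8, 6.0, 5.2, 4.6]  # hypothetical measurements
mu_hat, sigma2_hat = normal_mle(sample)
best = normal_log_likelihood(sample, mu_hat, sigma2_hat)

# Nudging either parameter away from its estimate lowers the log-likelihood.
for d_mu, d_s2 in [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.05), (0.0, -0.05)]:
    assert normal_log_likelihood(sample, mu_hat + d_mu, sigma2_hat + d_s2) < best
```

Working with the logarithm rather than the likelihood itself avoids multiplying many small densities together, which is the practical reason for the log transformation mentioned above.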
Note that such estimates may not be the best in some
other sense. In particular, they may not be unbiased. An
unbiased estimate is one whose value will, on the average,
equal that of the parameter for which it is an estimate, for
repeated sampling. In other words, the expected value of
an unbiased estimate is equal to the value of the parameter
being estimated. The variance estimate $\hat{\sigma}^2$ is, in fact, biased. To obtain an
unbiased estimate of the population variance it is necessary
to multiply $\hat{\sigma}^2$ by $n/(n - 1)$, to yield $s^2$, the sample variance,
and $s = \sqrt{s^2}$, the sample standard deviation.
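The bias can be seen exactly on a toy case. The sketch below assumes a hypothetical population (the six faces of a fair die) and enumerates every possible sample of size 3 drawn with replacement; exact fractions are used so that no rounding enters. The names are illustrative only.

```python
from fractions import Fraction
from itertools import product

def variance_divisor_n(xs):
    """Variance with divisor n (the maximum likelihood form), as an exact fraction."""
    m = Fraction(sum(xs), len(xs))
    return sum((x - m) ** 2 for x in xs) / len(xs)

population = [1, 2, 3, 4, 5, 6]             # hypothetical: one fair die
true_var = variance_divisor_n(population)   # population variance, 35/12

n = 3
samples = list(product(population, repeat=n))  # all 216 equally likely samples

# Expected value of each estimator = average over all equally likely samples.
mean_biased = sum(variance_divisor_n(s) for s in samples) / len(samples)
mean_unbiased = sum(variance_divisor_n(s) * Fraction(n, n - 1)
                    for s in samples) / len(samples)

print(true_var, mean_biased, mean_unbiased)
```

In this enumeration the divisor-n estimate averages to (n - 1)/n times the population variance, and the n/(n - 1) correction removes the shortfall.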
There are other situations in which the maximum like-
lihood estimate may not be “best” for the purposes of the
investigator. If a distribution is badly skewed, use of the
mean as a measure of central tendency may be quite mis-
leading. It is common in this case to use the median, which
may be defined as the value of the variable which divides the
distribution into two equal parts. Income statistics, which are
strongly skewed positively, commonly use the median rather
than the mean for this reason.
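A small illustration of the point, using hypothetical income figures (in thousands) with one very large value; the numbers are invented for the example.

```python
import statistics

# Hypothetical incomes in thousands; one extreme value skews the distribution.
incomes = [21, 24, 26, 28, 30, 33, 36, 41, 52, 309]

mean_income = statistics.mean(incomes)
median_income = statistics.median(incomes)
below_mean = sum(1 for x in incomes if x < mean_income)

print(mean_income, median_income, below_mean)
```

Here nine of the ten values fall below the mean, whereas the median splits the values evenly, which is why the median is the more representative figure in this case.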
If a distribution is very irregular, any measure of central
tendency which attempts to base itself on the entire range of
scores may be misleading. In this case, it may be more useful
to examine the maximum points of f ( x ); these are known as
modes. A distribution may have 1, 2 or more modes; it will
then be referred to as unimodal, bimodal, or multimodal,
respectively.
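For sample data, one rough way to look for modes is to find the local maxima of a histogram. The sketch below assumes hypothetical bin counts and a simple strict-inequality rule; it does not handle plateaus or noisy bins, which in practice would call for smoothing first.

```python
def local_maxima(counts):
    """Indices of bins whose count strictly exceeds both neighbours.

    Edge bins are compared with their single neighbour only.
    """
    peaks = []
    for i, c in enumerate(counts):
        left = counts[i - 1] if i > 0 else -1
        right = counts[i + 1] if i < len(counts) - 1 else -1
        if c > left and c > right:
            peaks.append(i)
    return peaks

def describe(n_modes):
    """Name the shape according to the number of modes found."""
    names = {0: "no distinct mode", 1: "unimodal", 2: "bimodal"}
    return names.get(n_modes, "multimodal")

histogram = [2, 7, 12, 6, 3, 5, 10, 4, 1]   # hypothetical bin counts
modes = local_maxima(histogram)
label = describe(len(modes))
print(modes, label)
```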
Other measures of dispersion may be used besides the
standard deviation. The probable error, p.e., has often been
used in engineering practice. It is a number such that

$$\int_{\mu - \mathrm{p.e.}}^{\mu + \mathrm{p.e.}} f(x)\,dx = 0.5 \qquad (15)$$

The p.e. is seldom used today, having been largely replaced
by $\sigma^2$.
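As an illustration of the definition, the sketch below assumes a normal distribution with hypothetical parameters. Because the normal density is symmetric about the mean, the probable error is the distance from the mean to the upper quartile, which works out to roughly 0.6745 standard deviations.

```python
from statistics import NormalDist

mu, sigma = 10.0, 2.0             # hypothetical parameters
dist = NormalDist(mu, sigma)

# By symmetry, half the probability lies between the lower and upper quartiles.
pe = dist.inv_cdf(0.75) - mu

# Check the defining property: the interval mu +/- p.e. holds probability 0.5.
coverage = dist.cdf(mu + pe) - dist.cdf(mu - pe)
print(pe, pe / sigma, coverage)
```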
The interquartile range may sometimes be used for a set
of observations whose true distribution is unknown. It con-
sists of the limits of the range of values which include the
middle half of sample values. The interquartile range is less
sensitive than the standard deviation to the presence of a few
very deviant data values.
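The contrast in sensitivity can be sketched with hypothetical data in which one value is replaced by an extreme one. Quartile conventions differ between texts and software; this example assumes the default method of Python's `statistics.quantiles`.

```python
import statistics

def interquartile_limits(xs):
    """Lower and upper quartiles, i.e. the limits enclosing the middle half."""
    q1, _, q3 = statistics.quantiles(xs, n=4)
    return q1, q3

clean = [12, 14, 15, 15, 16, 17, 18, 19, 20, 22, 23]   # hypothetical data
with_outlier = clean[:-1] + [230]                      # one very deviant value

print(interquartile_limits(clean), statistics.stdev(clean))
print(interquartile_limits(with_outlier), statistics.stdev(with_outlier))
```

In this example the quartile limits are unchanged by the deviant value, while the standard deviation grows many times over.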

The sample mean and standard deviation may be used to
describe the most likely true value of these parameters, and
to place confidence limits on that value. The standard error
of the mean is given by $s/\sqrt{n}$ ($n$ = sample size). The standard
error of the mean can be used to make a statement about
the probability that a range of values will include the true
mean. For example, assuming normality, the range of values
defined by the observed mean $\pm\, 1.96\,s/\sqrt{n}$ will be expected to
include the value of the true mean in 95% of all samples.
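The calculation can be sketched as follows, assuming hypothetical readings and the normal multiplier 1.96 used in the text. For small samples where s is itself estimated from the data, a Student's t multiplier would give somewhat wider limits; that refinement is not shown here.

```python
import math
import statistics

def mean_confidence_limits(xs, z=1.96):
    """Limits mean +/- z * s / sqrt(n), with s the sample standard deviation."""
    n = len(xs)
    m = statistics.mean(xs)
    standard_error = statistics.stdev(xs) / math.sqrt(n)
    return m - z * standard_error, m + z * standard_error

readings = [7.2, 6.8, 7.5, 7.1, 6.9, 7.4, 7.0, 7.3]   # hypothetical data
low, high = mean_confidence_limits(readings)
print(low, high)
```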
A more general approach to estimation problems can be
found in Bayesian decision theory (Pratt et al., 1965). It is possible
to appeal to decision theory to work out specific answers
to the “best estimate” problem for a variety of decision cri-
teria in specific situations. This approach is well described
in Weiss (1961). Although the method is not often applied
in routine statistical applications, it has received attention in
systems analysis problems and has been applied to such envi-
ronmentally relevant problems as resource allocation.

Frequency Data

The analysis of frequency data is a problem which often
arises in environmental work. Frequency data for a hypo-
thetical experiment in genetics are shown in Table 1. In this
example, the expected frequencies are assumed to be known
independently of the observed frequencies. The chi-square
statistic, $\chi^2$, is defined as

$$\chi^2 = \sum \frac{(E - O)^2}{E} \qquad (16)$$

where E is the expected frequency and O is the observed
frequency. It can be applied to frequency tables, such as that
shown in Table 1. Note that an important assumption of the
chi-square test is that the observations be independent. The
same samples or individuals must not appear in more than
one cell.
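Applied to the frequencies of Table 1 (expected 25, 50, 25; observed 28, 48, 24), the statistic can be computed as in the sketch below. The function name is illustrative. The closing comparison assumes 2 degrees of freedom (three categories minus one), for which the 5% critical value of chi-square is about 5.99.

```python
def chi_square(expected, observed):
    """Sum of (E - O)^2 / E over all cells of the frequency table."""
    return sum((e - o) ** 2 / e for e, o in zip(expected, observed))

expected = [25, 50, 25]   # red, pink, white (Table 1)
observed = [28, 48, 24]

stat = chi_square(expected, observed)   # 9/25 + 4/50 + 1/25

# Assumed comparison: 5% critical value for 2 degrees of freedom.
CRITICAL_5_PERCENT_2_DF = 5.991
consistent_with_expectation = stat < CRITICAL_5_PERCENT_2_DF
print(stat, consistent_with_expectation)
```

A value this small relative to the critical value gives no reason to doubt that the observed frequencies agree with the expected ones.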
In the example given above, the expected frequencies
were assumed to be known. In practice this is very often not
the case; the experimenter will have several sets
of observed frequencies, and will wish to determine whether
or not they represent samples from one population, but will not know the
expected frequency for samples from that population.

TABLE 1
Hypothetical data on the frequency of plants producing red, pink and white
flowers in the first generation of an experiment in which red and white
parent plants were crossed, assuming single gene inheritance, neither gene
dominant

                        Flower color
Number of plants    Red    Pink    White
expected             25      50       25
observed             28      48       24
