Encyclopedia of Environmental Science and Engineering, Volume I and II

STATISTICAL METHODS FOR ENVIRONMENTAL SCIENCE 1127

a population which maximizes the probability of obtaining the observed set of sample values, assuming random sampling. It has the advantages of yielding estimates which fully utilize the information in the sample, if such estimates exist, and which are less variable under certain conditions for large samples than other estimates. The method consists of taking the equation for the probability, or probability density function, finding its maximum value, either directly or by maximizing the natural logarithm of the function, which has a maximum for the same parameter values, and solving for these parameter values. The sample mean, $\hat{\mu} = (\sum_{i=1}^{n} x_i)/n$, is a maximum likelihood estimate of the true mean of the distribution for a number of distributions. The variance, $\hat{\sigma}^2$, calculated from the sample by $\hat{\sigma}^2 = (\sum_{i=1}^{n} (x_i - \hat{\mu})^2)/n$, is a maximum likelihood estimate of the population $\sigma^2$ for the normal distribution.

Note that such estimates may not be the best in some other sense. In particular, they may not be unbiased. An unbiased estimate is one whose value will, on the average, equal that of the parameter for which it is an estimate, for repeated sampling. In other words, the expected value of an unbiased estimate is equal to the value of the parameter being estimated. The maximum likelihood estimate of the variance is, in fact, biased. To obtain an unbiased estimate of the population variance it is necessary to multiply $\hat{\sigma}^2$ by $n/(n - 1)$, to yield $s^2$, the sample variance, and $s = \sqrt{s^2}$, the sample standard deviation.

There are other situations in which the maximum likelihood estimate may not be "best" for the purposes of the investigator. If a distribution is badly skewed, use of the mean as a measure of central tendency may be quite misleading. It is common in this case to use the median, which may be defined as the value of the variable which divides the distribution into two equal parts. Income statistics, which are strongly skewed positively, commonly use the median rather than the mean for this reason. If a distribution is very irregular, any measure of central tendency which attempts to base itself on the entire range of scores may be misleading. In this case, it may be more useful to examine the maximum points of $f(x)$; these are known as modes. A distribution may have 1, 2 or more modes; it will then be referred to as unimodal, bimodal, or multimodal, respectively.

Other measures of dispersion may be used besides the standard deviation. The probable error, p.e., has often been used in engineering practice. It is a number such that
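The estimators discussed above can be illustrated with a short Python sketch. The data values and function names here are hypothetical, chosen only for illustration; the sketch compares the maximum likelihood variance (divisor n) with the unbiased sample variance (divisor n - 1), and shows how the mean and median differ for a positively skewed sample.

```python
import statistics


def mle_variance(xs):
    """Maximum likelihood variance estimate: squared deviations divided by n."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / n


def unbiased_variance(xs):
    """Sample variance s^2: the MLE variance multiplied by n / (n - 1)."""
    n = len(xs)
    return mle_variance(xs) * n / (n - 1)


# Hypothetical sample, used only to illustrate the arithmetic.
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

sample_mean = sum(sample) / len(sample)      # 5.0
var_mle = mle_variance(sample)               # 32 / 8 = 4.0
var_unbiased = unbiased_variance(sample)     # 32 / 7, about 4.57
std_dev = var_unbiased ** 0.5                # sample standard deviation s

# Hypothetical positively skewed sample: one large value pulls the mean upward
# while the median stays near the bulk of the data.
skewed = [20.0, 22.0, 25.0, 27.0, 30.0, 400.0]
skewed_mean = sum(skewed) / len(skewed)      # about 87.3
skewed_median = statistics.median(skewed)    # 26.0

print(sample_mean, var_mle, var_unbiased, std_dev)
print(skewed_mean, skewed_median)
```

The difference between the two variance estimates shrinks as n grows, since the factor n/(n - 1) approaches 1.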

$$\int_{\mu - \mathrm{p.e.}}^{\mu + \mathrm{p.e.}} f(x)\,dx = 0.5 \tag{15}$$

The p.e. is seldom used today, having been largely replaced by $\sigma^2$. The interquartile range may sometimes be used for a set of observations whose true distribution is unknown. It consists of the limits of the range of values which include the middle half of sample values. The interquartile range is less sensitive than the standard deviation to the presence of a few very deviant data values.
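As a rough illustration of these dispersion measures, the following Python sketch computes the probable error under the added assumption that the distribution is normal (the text does not restrict f(x) to the normal case), and compares the interquartile range with the standard deviation on hypothetical data containing one deviant value. The function names and data are illustrative only, and the quartile convention is the default of Python's `statistics.quantiles`, one of several in common use.

```python
from statistics import NormalDist, quantiles, stdev


def probable_error_normal(sigma):
    """Half-width around the mean that contains probability 0.5,
    assuming a normal distribution with standard deviation sigma."""
    # The upper limit mu + p.e. is the 75th percentile of the distribution.
    return NormalDist(0.0, sigma).inv_cdf(0.75)


def interquartile_range(xs):
    """Distance between the first and third quartiles of the sample."""
    q1, _, q3 = quantiles(xs, n=4)
    return q3 - q1


pe = probable_error_normal(1.0)  # roughly 0.6745 for sigma = 1

# Check the defining property: the interval mu +/- p.e. holds half the probability.
unit_normal = NormalDist(0.0, 1.0)
coverage = unit_normal.cdf(pe) - unit_normal.cdf(-pe)

# Hypothetical observations, with and without one very deviant value.
clean = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0]
deviant = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 170.0]

print(pe, coverage)
print(interquartile_range(clean), interquartile_range(deviant))
print(stdev(clean), stdev(deviant))
```

In this example the single deviant value inflates the standard deviation substantially, while the interquartile range is unchanged because the quartiles depend only on the middle of the ordered sample.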

The sample mean and standard deviation may be used to describe the most likely true value of these parameters, and to place confidence limits on that value. The standard error of the mean is given by $s/\sqrt{n}$ ($n$ = sample size). The standard error of the mean can be used to make a statement about the probability that a range of values will include the true mean. For example, assuming normality, the range of values defined by the observed mean $\pm\, 1.96\,s/\sqrt{n}$ will be expected to include the value of the true mean in 95% of all samples.

A more general approach to estimation problems can be found in Bayesian decision theory (Pratt et al., 1965). It is possible to appeal to decision theory to work out specific answers to the "best estimate" problem for a variety of decision criteria in specific situations. This approach is well described in Weiss (1961). Although the method is not often applied in routine statistical applications, it has received attention in systems analysis problems and has been applied to such environmentally relevant problems as resource allocation.
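The 95% confidence statement for the mean can be sketched in Python as follows. The sample values and the function name are hypothetical; the sketch simply applies the standard error s/sqrt(n) and the normal multiplier 1.96 quoted in the text, and assumes normality as the text does.

```python
import math
import statistics


def mean_confidence_interval(xs, z=1.96):
    """Return (lower, upper) limits: observed mean +/- z * s / sqrt(n)."""
    n = len(xs)
    mean = statistics.fmean(xs)
    s = statistics.stdev(xs)            # sample standard deviation (divisor n - 1)
    standard_error = s / math.sqrt(n)   # standard error of the mean
    return mean - z * standard_error, mean + z * standard_error


# Hypothetical sample, used only to illustrate the calculation.
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
lower, upper = mean_confidence_interval(sample)
print(f"95% confidence limits for the mean: {lower:.3f} to {upper:.3f}")
```

Because the interval half-width is proportional to 1/sqrt(n), quadrupling the sample size would be expected to halve the width for the same s.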

Frequency Data

The analysis of frequency data is a problem which often arises in environmental work. Frequency data for a hypothetical experiment in genetics are shown in Table 1. In this example, the expected frequencies are assumed to be known independently of the observed frequencies. The chi-square statistic, $\chi^2$, is defined as

$$\chi^2 = \sum \frac{(E - O)^2}{E} \tag{16}$$

where E is the expected frequency and O is the observed frequency. It can be applied to frequency tables, such as that shown in Table 1. Note that an important assumption of the chi-square test is that the observations be independent. The same samples or individuals must not appear in more than one cell. In the example given above, the expected frequencies were assumed to be known. In practice this is very often not the case; the experimenter will have several sets of observed frequencies, and will wish to determine whether or not they represent samples from one population, but will not know the expected frequency for samples from that population.

TABLE 1 Hypothetical data on the frequency of plants producing red, pink and white flowers in the first generation of an experiment in which red and white parent plants were crossed, assuming single gene inheritance, neither gene dominant

Flower color         Red    Pink    White
Number of plants
  expected            25      50       25
  observed            28      48       24
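As a worked illustration, the chi-square statistic of Eq. (16) can be computed for the frequencies in Table 1 with a few lines of Python. The function name is illustrative. The p-value step relies on two points not stated in the text above: that three categories with known expected frequencies give 2 degrees of freedom, and that for exactly 2 degrees of freedom the upper-tail probability of the chi-square distribution is exp(-x/2).

```python
import math


def chi_square(expected, observed):
    """Chi-square statistic: sum over cells of (E - O)^2 / E."""
    return sum((e - o) ** 2 / e for e, o in zip(expected, observed))


# Frequencies from Table 1 (red, pink, white).
expected = [25, 50, 25]
observed = [28, 48, 24]

statistic = chi_square(expected, observed)
# 9/25 + 4/50 + 1/25 = 0.36 + 0.08 + 0.04 = 0.48

# Assumption: 3 categories with known expected frequencies, so 2 degrees of freedom.
# For 2 degrees of freedom the upper-tail probability is exp(-x / 2).
p_value = math.exp(-statistic / 2)

print(f"chi-square = {statistic:.2f}, p = {p_value:.3f}")
```

A statistic this small relative to its degrees of freedom gives no grounds for rejecting the expected 1:2:1 ratio; the observed counts are well within the range expected from sampling variation.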
