Robert V. Hogg, Joseph W. McKean, Allen T. Craig

234 Some Elementary Statistical Inferences

The histogram provides a somewhat crude but often used estimator of the pdf,
so a few remarks on it are pertinent. Let $x_1, \ldots, x_n$ be the realized values of the
random sample on a continuous random variable $X$ with pdf $f(x)$. Our histogram
estimate of $f(x)$ is obtained as follows. While for the discrete case there are natural
classes for the histogram, for the continuous case these classes must be chosen. One
way of doing this is to select a positive integer $m$, an $h > 0$, and a value $a$ such that
$a < \min x_i$, so that the $m$ intervals


$$(a-h,\, a+h],\ (a+h,\, a+3h],\ (a+3h,\, a+5h],\ \ldots,\ (a+(2m-3)h,\, a+(2m-1)h] \tag{4.1.15}$$

cover the range of the sample $[\min x_i, \max x_i]$. These intervals form our classes. Let
$A_j = (a+(2j-3)h,\, a+(2j-1)h]$ for $j = 1, \ldots, m$.
Let $\hat{f}_h(x)$ denote our histogram estimate. If $x \le a-h$ or $x > a+(2m-1)h$,
then define $\hat{f}_h(x) = 0$. For $a-h < x \le a+(2m-1)h$, $x$ is in one, and only one,
$A_j$. For $x \in A_j$, define $\hat{f}_h(x)$ to be

$$\hat{f}_h(x) = \frac{\#\{x_i \in A_j\}}{2hn}. \tag{4.1.16}$$


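Note that $\hat{f}_h(x) \ge 0$ and that

$$\begin{aligned}
\int_{-\infty}^{\infty} \hat{f}_h(x)\,dx
  &= \int_{a-h}^{a+(2m-1)h} \hat{f}_h(x)\,dx
   = \sum_{j=1}^{m} \int_{A_j} \frac{\#\{x_i \in A_j\}}{2hn}\,dx \\
  &= \frac{1}{2hn} \sum_{j=1}^{m} \#\{x_i \in A_j\}\,[h(2j-1-2j+3)]
   = \frac{2h}{2hn}\,n = 1;
\end{aligned}$$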

so $\hat{f}_h(x)$ satisfies the properties of a pdf.
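As a quick numerical check of (4.1.16) and of the integration argument, the following R sketch builds the classes $A_j$ by hand and evaluates the estimate. The sample and the values of a, h, and m are made up for illustration and do not come from the text; the function name fhat is likewise only a label chosen here.

```r
# Histogram estimate (4.1.16) on classes A_j = (a + (2j-3)h, a + (2j-1)h].
# The sample and the constants below are hypothetical, for illustration only.
x <- c(2.1, 3.4, 3.9, 4.4, 5.0, 5.2, 6.8, 7.5)
n <- length(x)
h <- 1   # half-width of each class
a <- 2   # chosen so that a < min(x)
m <- 4   # number of classes; a + (2m-1)h must be at least max(x)

# Class boundaries a - h, a + h, a + 3h, ..., a + (2m-1)h
breaks <- a + (2 * (0:m) - 1) * h

# Counts #{x_i in A_j}; cut() uses intervals open on the left, closed on the right
counts <- as.vector(table(cut(x, breaks, right = TRUE)))

# Height of the estimate on each class
heights <- counts / (2 * h * n)

# Evaluate the estimate at arbitrary points; it is 0 outside (a - h, a + (2m-1)h]
fhat <- function(x0) {
  j <- findInterval(x0, breaks, left.open = TRUE)
  ifelse(j >= 1 & j <= m, heights[pmin(pmax(j, 1), m)], 0)
}

print(heights)
print(sum(heights * 2 * h))   # total area under the estimate
print(fhat(c(0, 4, 10)))
```

With these breaks, hist(x, breaks = breaks, plot = FALSE)$density should return the same heights, since hist() also uses right-closed classes by default.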
For the discrete case, except when classes are merged, the histogram is unique.
For the continuous case, though, the histogram depends on the classes chosen. The
resulting picture can be quite different if the classes are changed. Unless there is
a compelling reason for the class selection, we recommend using the default classes
selected by the computational algorithm. The histogram algorithms in most statistical
packages such as R are current on recent research for selection of classes. The
histogram in the following example is based on default classes.
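To see how much the classes can depend on the selection rule, here is a minimal R sketch on a simulated sample (not data from the text). In base R, hist() uses breaks = "Sturges" unless another rule is named, and it treats the resulting number of classes as a suggestion that is rounded to "pretty" break points. The helper name area is introduced here only for the check at the end.

```r
set.seed(1)       # arbitrary seed, for reproducibility only
x <- rnorm(50)    # simulated sample (illustration only)

# Number of classes suggested by three rules available in base R
k <- c(Sturges = nclass.Sturges(x),
       Scott   = nclass.scott(x),
       FD      = nclass.FD(x))
print(k)

# Default classes versus classes from the Freedman-Diaconis rule
h_default <- hist(x, plot = FALSE)
h_fd      <- hist(x, breaks = "FD", plot = FALSE)

# Class boundaries actually used under each rule
print(h_default$breaks)
print(h_fd$breaks)

# Whatever the classes, the density-scale histogram has total area 1
area <- function(hh) sum(hh$density * diff(hh$breaks))
print(c(area(h_default), area(h_fd)))
```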


Example 4.1.7. In Example 4.1.3, we presented a data set involving sulfur dioxide
concentrations in a damaged Bavarian forest. The histogram of this data set is
found in Figure 4.1.3. There are only 24 data points in the sample, which is far
too few for density estimation. With this in mind, although the distribution of the data
is mound shaped, the center appears to be too flat for normality. We have overlaid
the histogram with the default R density estimate (solid line), which confirms some
caution on normality. Recall that the sample mean and standard deviation for these
data are 53.91667 and 10.07371, respectively. So we also plotted the normal pdf
with this mean and standard deviation (dashed line). The R code assumes that the
data are in the R vector sulfurdioxide.
hist(sulfurdioxide, xlab = "Sulfurdioxide", ylab = " ", probability = TRUE, ylim = c(0, 0.04))
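The excerpt shows only the hist() call. One way the two overlays described in the example could be added is sketched below. This is an assumption about how such a plot might be produced, not the book's own code, and the vector so2_demo is simulated placeholder data (24 values drawn with the reported mean and standard deviation), since the actual measurements are not reproduced here.

```r
# Placeholder data: NOT the real sulfurdioxide measurements.
set.seed(4)   # arbitrary seed, for reproducibility only
so2_demo <- rnorm(24, mean = 53.91667, sd = 10.07371)

# Density-scale histogram with R's default classes
hist(so2_demo, xlab = "Sulfurdioxide", ylab = " ",
     probability = TRUE, ylim = c(0, 0.04))

# Default R density estimate (Gaussian kernel), drawn as a solid line
dens <- density(so2_demo)
lines(dens)

# Normal pdf with the sample mean and standard deviation, drawn dashed
curve(dnorm(x, mean = mean(so2_demo), sd = sd(so2_demo)),
      add = TRUE, lty = 2)
```

Replacing so2_demo with the actual sulfurdioxide vector would give the kind of comparison described in the example.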
