range of texts were analysed, but it will not vary much.
It is not surprising that ‘the’ is the most common, and ‘of’ is second. The list
continues and you might want to know that ‘among’ is in 500th position and
‘neck’ is ranked 1000th. We shall only consider the top ten words. If you pick up a
text at random and count these words you will get more or less the same words
in rank order. The surprising fact is that the ranks have a bearing on the actual
number of appearances of the words in a text. The word ‘the’ will occur twice as
often as ‘of’ and three times as often as ‘and’, and so on. The actual
number is given by a well-known formula. This is an experimental law and was
discovered by Zipf from data. The theoretical Zipf’s law says that the percentage
of occurrences of the word ranked r is given by
$$\frac{k}{r} \times 100$$
where the number k depends only on the size of the author’s vocabulary. If an
author had command of all the words in the English language, of which there are
around a million by some estimates, the value of k would be about 0.0694. By
Zipf’s law the word ‘the’ would then account for about 6.94% of
all words in a text. In the same way ‘of’ would account for half of this, or about
3.47% of the words. An essay of 3000 words by such a talented author would
therefore contain 208 appearances of ‘the’ and 104 appearances of the word ‘of’.
For writers with only 20,000 words at their command, the value of k rises to
0.0954, so there would be 286 appearances of ‘the’ and 143 appearances of the
word ‘of’. The smaller the vocabulary, the more often you will see ‘the’
appearing.
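
The arithmetic above can be checked with a short Python sketch. The text only says that k depends on the size of the vocabulary; the sketch assumes the usual normalisation, in which the shares k/r for ranks 1 to N must add up to 1, so that k = 1/(1 + 1/2 + ... + 1/N). The function names `zipf_constant` and `expected_count` are illustrative and not taken from any library.

```python
import math


def zipf_constant(vocabulary_size: int) -> float:
    """Return k for a vocabulary of the given size.

    Assumption: the shares k/r over ranks 1..N sum to 1,
    so k is the reciprocal of the N-th harmonic number.
    """
    harmonic = math.fsum(1.0 / r for r in range(1, vocabulary_size + 1))
    return 1.0 / harmonic


def expected_count(rank: int, text_length: int, vocabulary_size: int) -> float:
    """Expected appearances of the word of a given rank in a text."""
    return text_length * zipf_constant(vocabulary_size) / rank


if __name__ == "__main__":
    for size in (1_000_000, 20_000):
        k = zipf_constant(size)
        the_count = round(expected_count(1, 3000, size))
        of_count = round(expected_count(2, 3000, size))
        print(f"vocabulary {size}: k = {k:.4f}, 'the' = {the_count}, 'of' = {of_count}")
```

Under this assumption the sketch gives k close to the quoted 0.0694 and 0.0954 (the first agrees to within rounding in the last digit), and the rounded counts for a 3000-word essay match the 208 and 104, and the 286 and 143, given in the text.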
Crystal ball gazing
Whether Poisson, Benford or Zipf, all these distributions allow us to make
predictions. We may not be able to predict a dead cert but knowing how the
probabilities distribute themselves is much better than taking a shot in the dark.
Add to these three other distributions like the binomial, the negative binomial,
the geometric, the hypergeometric, and many more, and the statistician has an
effective array of tools for analysing a vast range of human activity.