Working with Collections
Using sums and counts for statistics
The arithmetic mean has an appealingly trivial definition based on sum() and
len(), which is as follows:
def mean( iterable ):
    return sum(iterable)/len(iterable)
While elegant, this doesn't actually work for iterables; it only works for
sequences. The len() function isn't defined for an arbitrary iterable, and a
one-shot iterator would, in any case, be consumed by sum() before it could be counted.
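To see the limitation concretely, consider applying mean() to a generator expression. The sample list below is made up purely for illustration:
data = [2, 4, 4, 4, 5, 5, 7, 9]
mean(data)                # A list is a sequence; this works.
mean(x+1 for x in data)   # TypeError: object of type 'generator' has no len()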
Indeed, we have a hard time performing a simple computation of mean or standard
deviation based on iterables. In Python, we must either materialize a sequence object,
or resort to somewhat more complex operations.
We have a fairly elegant expression of mean and standard deviation in the
following definitions:
import math
s0 = len(data)               # sum(1 for x in data)  # x**0
s1 = sum(data)               # sum(x for x in data)  # x**1
s2 = sum(x*x for x in data)
mean = s1/s0
stdev = math.sqrt(s2/s0 - (s1/s0)**2)
These three sums, s0, s1, and s2, have a tidy, parallel structure. We can easily
compute the mean from two of the sums. The standard deviation is a bit more
complex, but it's still based on the three sums.
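If the data arrive as a one-shot iterable rather than a sequence, we can accumulate all three sums in a single pass. The following mean_stdev() function is a sketch of that idea; the function name and its return of a (mean, stdev) tuple are choices made here for illustration, not definitions from the text:
import math

def mean_stdev(iterable):
    # Accumulate the three sums in one pass so that a
    # one-shot iterator is consumed only once.
    s0 = s1 = s2 = 0
    for x in iterable:
        s0 += 1      # count: sum of x**0
        s1 += x      # sum of x**1
        s2 += x*x    # sum of x**2
    mean = s1/s0
    stdev = math.sqrt(s2/s0 - mean**2)
    return mean, stdev
Because each item is touched exactly once, this works for generator expressions as well as for materialized sequences.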
This kind of pleasant symmetry also works for more complex statistical functions
such as correlation and even least-squares linear regression.
The moment of correlation between two sets of samples can be computed from their
standardized values. The following is a function to compute the standardized value:
def z( x, μ_x, σ_x ):
    return (x-μ_x)/σ_x
The calculation is simply to subtract the mean, μ_x, from each sample, x, and divide
by the standard deviation, σ_x. This gives us a value measured in units of sigma, σ.
A value within ±1 σ is expected about two-thirds of the time. Larger values should be
less common. A value outside ±3 σ should happen less than 1 percent of the time.
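As a sketch of how standardized values lead to the moment of correlation, the following corr() function averages the products of the z scores of two samples. The function name, the reuse of the mean_stdev() sketch shown earlier, and the assumption that both arguments are equal-length sequences (they are traversed more than once) are all illustrative choices, not definitions from the text:
def corr(samples_x, samples_y):
    # Standardize both samples and average the products of their z scores.
    μ_x, σ_x = mean_stdev(samples_x)
    μ_y, σ_y = mean_stdev(samples_y)
    z_xy = (z(x, μ_x, σ_x) * z(y, μ_y, σ_y)
            for x, y in zip(samples_x, samples_y))
    return sum(z_xy) / len(samples_x)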