Functional Python Programming


Working with Collections


Using sums and counts for statistics


The arithmetic mean has an appealingly trivial definition based on sum() and len(),
which is as follows:


def mean(iterable):
    return sum(iterable)/len(iterable)


While elegant, this doesn't actually work for iterables in general. It only works
for sequences and other sized collections, because len() is not defined for
generators and other lazy iterables.


Indeed, we have a hard time performing a simple computation of mean or standard
deviation based on iterables. In Python, we must either materialize a sequence object,
or resort to somewhat more complex operations.
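
As a minimal sketch of the first option, the following shows the sequence-only definition failing on a generator and then succeeding once the values are materialized as a tuple. The name mean_materialized is a hypothetical helper introduced here for illustration; it is not part of the original text.

```python
def mean(iterable):
    # Sequence-only definition: relies on len().
    return sum(iterable) / len(iterable)


def mean_materialized(iterable):
    # Hypothetical helper: build a tuple first so that len() is available.
    data = tuple(iterable)
    return sum(data) / len(data)


squares = (x * x for x in range(1, 5))  # a generator has no len()
try:
    mean(squares)
except TypeError as exc:
    print("mean() failed:", exc)

# A fresh generator is needed, because the first one was consumed by sum().
print(mean_materialized(x * x for x in range(1, 5)))  # (1+4+9+16)/4 = 7.5
```

Materializing costs memory proportional to the number of items, which is the trade-off against the more complex single-pass approaches.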


We have a fairly elegant expression of mean and standard deviation in the
following definitions:


import math
s0 = len(data)  # sum(1 for x in data)  # x**0
s1 = sum(data)  # sum(x for x in data)  # x**1
s2 = sum(x*x for x in data)  # x**2


mean = s1/s0
stdev = math.sqrt(s2/s0 - (s1/s0)**2)


These three sums, s0, s1, and s2, have a tidy, parallel structure. We can easily
compute the mean from two of the sums. The standard deviation is a bit more
complex, but it's still based on the three sums.
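
One way to apply the three sums to an arbitrary iterable, without materializing it, is to accumulate them together in a single pass. The following is only a sketch under that assumption; the function names sums and mean_stdev are hypothetical, and functools.reduce is one of several ways to write the accumulation.

```python
import math
from functools import reduce


def sums(iterable):
    # Accumulate (s0, s1, s2) = (count, sum of x, sum of x*x) in one pass.
    return reduce(
        lambda acc, x: (acc[0] + 1, acc[1] + x, acc[2] + x * x),
        iterable,
        (0, 0, 0),
    )


def mean_stdev(iterable):
    s0, s1, s2 = sums(iterable)
    mean = s1 / s0
    # Population standard deviation, using the same formula as the text.
    stdev = math.sqrt(s2 / s0 - (s1 / s0) ** 2)
    return mean, stdev


m, s = mean_stdev(x for x in [2, 4, 4, 4, 5, 5, 7, 9])
print(m, s)  # 5.0 2.0
```

Note that this formula divides by the number of samples, so it gives the population standard deviation. Subtracting two large, nearly equal values can also lose floating-point precision on some data sets.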


This kind of pleasant symmetry also works for more complex statistical functions
such as correlation and even least-squares linear regression.


The moment of correlation between two sets of samples can be computed from their
standardized values. The following is a function to compute the standardized value:


def z(x, μ_x, σ_x):
    return (x-μ_x)/σ_x


The calculation simply subtracts the mean, μ_x, from each sample, x, and divides
by the standard deviation, σ_x. This gives us a value measured in units of sigma, σ.
For normally distributed data, a value within ±1σ is expected about two-thirds of
the time. Larger values should be less common. A value outside ±3σ should happen
less than 1 percent of the time.
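
To make the standardized value concrete, here is a small sketch that standardizes a sample and then combines two sets of standardized values into a correlation. The helpers mean, stdev, and corr are assumptions introduced for illustration: corr uses the usual Pearson formula (the average product of paired standardized values, with the population standard deviation), which this section does not spell out.

```python
import math


def z(x, μ_x, σ_x):
    return (x - μ_x) / σ_x


def mean(data):
    return sum(data) / len(data)


def stdev(data):
    # Population standard deviation from the three sums s0, s1, s2.
    s0, s1, s2 = len(data), sum(data), sum(x * x for x in data)
    return math.sqrt(s2 / s0 - (s1 / s0) ** 2)


def corr(xs, ys):
    # Assumed Pearson form: average product of paired standardized values.
    μ_x, σ_x = mean(xs), stdev(xs)
    μ_y, σ_y = mean(ys), stdev(ys)
    return sum(z(x, μ_x, σ_x) * z(y, μ_y, σ_y) for x, y in zip(xs, ys)) / len(xs)


xs = [2, 4, 4, 4, 5, 5, 7, 9]   # mean 5, standard deviation 2
ys = [2 * x + 1 for x in xs]    # an exact linear function of xs

print([z(x, mean(xs), stdev(xs)) for x in xs])
# [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
print(corr(xs, ys))  # perfectly correlated, so the result is 1.0
```

Here the sample 9 sits at +2σ, while most of the values fall within ±1σ of the mean.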
