Working with Collections
Using sums and counts for statistics
The arithmetic mean has an appealingly trivial definition based on sum() and
len(), which is as follows:
def mean( iterable ):
    return sum(iterable)/len(iterable)
While elegant, this doesn't actually work for iterables; it only works for
sequences. The len() function isn't defined for an arbitrary iterable, and a
one-shot iterator would, in any case, be consumed by sum() before it could be counted.
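To see the limitation concretely, consider applying mean() to a generator expression. The sample list below is made up purely for illustration:
data = [2, 4, 4, 4, 5, 5, 7, 9]
mean(data)                # A list is a sequence; this works.
mean(x+1 for x in data)   # TypeError: object of type 'generator' has no len()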
Indeed, we have a hard time performing a simple computation of mean or standard
deviation based on iterables. In Python, we must either materialize a sequence object,
or resort to somewhat more complex operations.
We have a fairly elegant expression of mean and standard deviation in the
following definitions:
import math
s0 = len(data)               # sum(1 for x in data)  # x**0
s1 = sum(data)               # sum(x for x in data)  # x**1
s2 = sum(x*x for x in data)
mean = s1/s0
stdev = math.sqrt(s2/s0 - (s1/s0)**2)
These three sums, s0, s1, and s2, have a tidy, parallel structure. We can easily
compute the mean from two of the sums. The standard deviation is a bit more
complex, but it's still based on the three sums.
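If the data arrive as a one-shot iterable rather than a sequence, we can accumulate all three sums in a single pass. The following mean_stdev() function is a sketch of that idea; the function name and its return of a (mean, stdev) tuple are choices made here for illustration, not definitions from the text:
import math

def mean_stdev(iterable):
    # Accumulate the three sums in one pass so that a
    # one-shot iterator is consumed only once.
    s0 = s1 = s2 = 0
    for x in iterable:
        s0 += 1      # count: sum of x**0
        s1 += x      # sum of x**1
        s2 += x*x    # sum of x**2
    mean = s1/s0
    stdev = math.sqrt(s2/s0 - mean**2)
    return mean, stdev
Because each item is touched exactly once, this works for generator expressions as well as for materialized sequences.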
This kind of pleasant symmetry also works for more complex statistical functions
such as correlation and even least-squares linear regression.
The moment of correlation between two sets of samples can be computed from their
standardized values. The following is a function to compute the standardized value:
def z( x, μ_x, σ_x ):
    return (x-μ_x)/σ_x
The calculation is simply to subtract the mean, μ_x, from each sample, x, and divide
by the standard deviation, σ_x. This gives us a value measured in units of sigma, σ.
A value within ±1 σ is expected about two-thirds of the time. Larger values should be
less common. A value outside ±3 σ should happen less than 1 percent of the time.
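As a sketch of how standardized values lead to the moment of correlation, the following corr() function averages the products of the z scores of two samples. The function name, the reuse of the mean_stdev() sketch shown earlier, and the assumption that both arguments are equal-length sequences (they are traversed more than once) are all illustrative choices, not definitions from the text:
def corr(samples_x, samples_y):
    # Standardize both samples and average the products of their z scores.
    μ_x, σ_x = mean_stdev(samples_x)
    μ_y, σ_y = mean_stdev(samples_y)
    z_xy = (z(x, μ_x, σ_x) * z(y, μ_y, σ_y)
            for x, y in zip(samples_x, samples_y))
    return sum(z_xy) / len(samples_x)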