Chapter 4
We can use this scalar function as follows:
>>> d = [2, 4, 4, 4, 5, 5, 7, 9]
>>> list(z(x, mean(d), stdev(d)) for x in d)
[-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
We've materialized a list that consists of normalized scores based on some raw
data in the variable d. We used a generator expression to apply the scalar function,
z(), to each item of the sequence object.
The mean() and stdev() functions are simply based on the examples shown above:
import math
def mean(x):
    return s1(x)/s0(x)
def stdev(x):
    return math.sqrt(s2(x)/s0(x) - (s1(x)/s0(x))**2)
The three sum functions, similarly, are based on the examples above:
def s0(data):
    return sum(1 for x in data) # or len(data)
def s1(data):
    return sum(x for x in data) # or sum(data)
def s2(data):
    return sum(x*x for x in data)
While this is very expressive and succinct, it's a little frustrating because we can't
simply use a one-shot iterable, such as a generator, here. Computing a mean requires
a sum of the values plus a count. Computing a standard deviation requires two
sums and a count. For this kind of statistical processing, we must
materialize a sequence object so that we can examine the data multiple times.
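The one-pass limitation can be seen directly. The following is a minimal sketch that repeats the s0(), s1(), and mean() definitions from above so that it runs on its own. The names one_shot, failed, mean_from_list, and mean_from_tuple are hypothetical and are introduced only for this illustration.

```python
def s0(data):
    return sum(1 for x in data)

def s1(data):
    return sum(x for x in data)

def mean(x):
    return s1(x)/s0(x)

d = [2, 4, 4, 4, 5, 5, 7, 9]

# A list can be traversed any number of times, so mean() works.
mean_from_list = mean(d)

# A generator is consumed by its first traversal: s1() reads every value,
# then s0() sees an exhausted iterator and counts zero items.
one_shot = (x for x in d)
try:
    mean(one_shot)
    failed = False
except ZeroDivisionError:
    failed = True

# Materializing the generator first restores the multi-pass behavior.
mean_from_tuple = mean(tuple(x for x in d))
```

With the generator, the division by a zero count raises ZeroDivisionError. Wrapping the same generator in tuple() or list() before calling mean() avoids the problem, at the cost of holding all of the data in memory.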
The following is how we can compute the correlation between two sets of samples:
def corr(sample1, sample2):
    μ_1, σ_1 = mean(sample1), stdev(sample1)
    μ_2, σ_2 = mean(sample2), stdev(sample2)
    z_1 = (z(x, μ_1, σ_1) for x in sample1)
    z_2 = (z(x, μ_2, σ_2) for x in sample2)
    r = sum(zx1*zx2 for zx1, zx2 in zip(z_1, z_2))/s0(sample1)
    return r
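As a quick sanity check, the correlation function can be exercised on small samples whose relationship is known in advance. The sketch below repeats the helper functions so that it is self-contained. The z() definition is an assumption: it is taken to be the usual standard score, (x - μ)/σ, which is consistent with the normalized values shown earlier. The sample data xs and ys, and the names r_up and r_down, are hypothetical.

```python
import math

def s0(data):
    return sum(1 for x in data)

def s1(data):
    return sum(x for x in data)

def s2(data):
    return sum(x*x for x in data)

def mean(x):
    return s1(x)/s0(x)

def stdev(x):
    return math.sqrt(s2(x)/s0(x) - (s1(x)/s0(x))**2)

def z(x, μ_x, σ_x):
    # Assumed definition: the standard score of x.
    return (x - μ_x)/σ_x

def corr(sample1, sample2):
    μ_1, σ_1 = mean(sample1), stdev(sample1)
    μ_2, σ_2 = mean(sample2), stdev(sample2)
    z_1 = (z(x, μ_1, σ_1) for x in sample1)
    z_2 = (z(x, μ_2, σ_2) for x in sample2)
    return sum(zx1*zx2 for zx1, zx2 in zip(z_1, z_2))/s0(sample1)

# Hypothetical samples: ys rises in exact proportion to xs.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]

r_up = corr(xs, ys)                    # perfectly increasing relationship
r_down = corr(xs, list(reversed(ys)))  # perfectly decreasing relationship
```

Here r_up should be very close to 1.0 and r_down very close to -1.0, up to floating-point rounding. Note that both arguments are lists: each sample is traversed several times (for the mean, the standard deviation, the normalized scores, and the count), so a one-shot generator would not work as an argument.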