Additional Tuple Techniques
A common statistical measure of correlation between two sets of data is the
Spearman rank correlation. This compares the rankings of two variables. Rather
than trying to compare values, which might have different scales, we'll compare
the relative orders. For more information, visit
http://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient.
Computing the Spearman rank correlation requires assigning a rank value to each
observation. It seems like we should be able to use enumerate(sorted()) to do this.
Given two sets of possibly correlated data, we can transform each set into a sequence
of rank values and compute a measure of correlation.
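As a minimal sketch of this idea, the following ranks a short list of made-up values with enumerate(sorted()). The sample values and the name ranks are illustrative assumptions and are not taken from the Anscombe data. One value is repeated on purpose: equal values still receive different rank numbers, which is one reason to treat this approach as a first attempt only.

```python
# Hypothetical sample values; 0.8 appears twice to show how ties behave.
values = [1.2, 0.8, 2.5, 0.8]

# enumerate() numbers the sorted values from 0, so each tuple is (rank, value).
ranks = tuple(enumerate(sorted(values)))

print(ranks)  # ((0, 0.8), (1, 0.8), (2, 1.2), (3, 2.5))
```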
We'll apply the Wrap-Unwrap design pattern to do this. We'll wrap data items with
their rank for the purposes of computing the correlation coefficient.
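The three phases of Wrap-Unwrap can be sketched on plain numbers before applying them to Pair objects. The data and the names observations, wrapped, lower_half, and unwrapped are hypothetical, and the "process" step (keeping the lower half by rank) is only a placeholder for whatever computation needs the rank.

```python
# Hypothetical observations; the names below are illustrative only.
observations = [8.04, 6.95, 7.58, 8.81]

# Wrap: attach a rank to each item, giving (rank, item) two-tuples.
wrapped = tuple(enumerate(sorted(observations)))

# Process: work with the rank while the original item rides along.
lower_half = tuple(pair for pair in wrapped if pair[0] < len(wrapped) // 2)

# Unwrap: discard the rank and recover the original items.
unwrapped = tuple(item for rank, item in lower_half)

print(unwrapped)  # (6.95, 7.58)
```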
In Chapter 3, Functions, Iterators, and Generators, we showed how to parse a simple
dataset. We'll extract the four samples from that dataset as follows:
from ch03_ex5 import series, head_map_filter, row_iter

with open("Anscombe.txt") as source:
    data = tuple(head_map_filter(row_iter(source)))

series_I = tuple(series(0, data))
series_II = tuple(series(1, data))
series_III = tuple(series(2, data))
series_IV = tuple(series(3, data))
Each of these series is a tuple of Pair objects. Each Pair object has x and y
attributes. The data looks as follows:
(Pair(x=10.0, y=8.04), Pair(x=8.0, y=6.95), ..., Pair(x=5.0, y=5.68))
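To experiment without the ch03_ex5 module, a stand-in along the following lines may be enough. It assumes that Pair behaves like a namedtuple with x and y fields. The three values are copied from the output above, and sample is a hypothetical name for this shortened series.

```python
from collections import namedtuple

# Assumption: Pair is a simple two-field named tuple.
Pair = namedtuple("Pair", ("x", "y"))

# A three-item sample; the full series has more observations.
sample = (Pair(10.0, 8.04), Pair(8.0, 6.95), Pair(5.0, 5.68))

print(sample[0])                 # Pair(x=10.0, y=8.04)
print(sample[0].x, sample[0].y)  # attribute access by name
```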
We can apply the enumerate() function to create sequences of values as follows:
y_rank = tuple(enumerate(sorted(series_I, key=lambda p: p.y)))
xy_rank = tuple(enumerate(sorted(y_rank, key=lambda rank: rank[1].x)))
The first step will create simple two-tuples with (0) a rank number and (1) the
original Pair object. As the data was sorted by the y value in each pair, the rank
value will reflect this ordering.
The resulting sequence looks as follows; the values shown here are those of series_IV:
((0, Pair(x=8.0, y=5.25)), (1, Pair(x=8.0, y=5.56)), ...,
(10, Pair(x=19.0, y=12.5)))
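The two enumerate() steps can be tried end to end on a small made-up dataset. This is a sketch under stated assumptions rather than a definitive implementation: Pair is a namedtuple stand-in, the four pairs are hypothetical, and the closing formula is the simplified Spearman form that is valid only when no x or y value is repeated.

```python
from collections import namedtuple

Pair = namedtuple("Pair", ("x", "y"))  # assumed stand-in for the Pair class

# Hypothetical data with no repeated x or y values.
pairs = (Pair(1.0, 2.0), Pair(2.0, 1.0), Pair(3.0, 4.0), Pair(4.0, 3.0))

# Step 1: wrap each Pair with its rank by y, giving (y_rank, Pair).
y_rank = tuple(enumerate(sorted(pairs, key=lambda p: p.y)))

# Step 2: wrap again with the rank by x, giving (x_rank, (y_rank, Pair)).
xy_rank = tuple(enumerate(sorted(y_rank, key=lambda rank: rank[1].x)))

# Unwrap both ranks and apply the no-ties formula:
# rho = 1 - 6 * sum(d**2) / (n * (n**2 - 1)), where d is the rank difference.
n = len(xy_rank)
d_squared = sum((rx - ry) ** 2 for rx, (ry, pair) in xy_rank)
rho = 1 - 6 * d_squared / (n * (n ** 2 - 1))

print(xy_rank[0])  # (0, (1, Pair(x=1.0, y=2.0)))
print(rho)
```

Each item in xy_rank carries both ranks alongside the original Pair, so the rank differences can be read off without another sort. For this data the squared differences sum to 4, which gives a coefficient of 0.6.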