Chapter 7
Similarly, we could have written two properly recursive functions to emit the
collection with the assigned rank values. Again, we've optimized that recursion
into nested for loops. To make it clear how we're computing the rank value,
we've included the low end of the range (base+1) and the high end of the range
(base+dups) and taken the midpoint of these two values. If there is only a single
duplicate, we evaluate (2*base+2)/2, which reduces to base+1; taking the
midpoint has the advantage of being a general solution that works for any
number of duplicates.
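One way the midpoint computation described above can be realized is sketched below. This is a sketch consistent with the behavior shown in the examples that follow; the chapter's actual listing may differ in detail:

```python
from collections import defaultdict

def rank(data, key=lambda x: x):
    """Yield (rank, item) pairs in key order; tied keys
    share the midpoint of their rank positions."""
    # Group the items by their key value to discover duplicates.
    duplicates = defaultdict(list)
    for item in data:
        duplicates[key(item)].append(item)
    base = 0
    for k in sorted(duplicates):
        dups = len(duplicates[k])
        for item in duplicates[k]:
            # Midpoint of the low rank (base+1) and the high rank (base+dups).
            yield ((base + 1) + (base + dups)) / 2, item
        base += dups
```

Note that grouping into a dict and iterating over the sorted keys replaces the recursion mentioned above with two nested for loops.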
The following is how we can test this to be sure it works:
>>> list(rank([0.8, 1.2, 1.2, 2.3, 18]))
[(1.0, 0.8), (2.5, 1.2), (2.5, 1.2), (4.0, 2.3), (5.0, 18)]
>>> data = ((2, 0.8), (3, 1.2), (5, 1.2), (7, 2.3), (11, 18))
>>> list(rank(data, key=lambda x: x[1]))
[(1.0, (2, 0.8)), (2.5, (3, 1.2)), (2.5, (5, 1.2)), (4.0, (7, 2.3)),
(5.0, (11, 18))]
The sample data included two identical values. The resulting ranks split positions 2
and 3 to assign position 2.5 to both values. This is the common statistical practice for
computing the Spearman rank-order correlation between two sets of values.
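As a concrete illustration of that practice, Spearman's coefficient can be computed directly from ranks. The spearman_rho() helper below is hypothetical (it is not part of the chapter's code) and uses the classic formula, which is only exact when there are no ties; with ties, the Pearson correlation of the midpoint ranks is used instead:

```python
def spearman_rho(x, y):
    """Spearman rank-order correlation via
    1 - 6*sum(d^2)/(n*(n^2 - 1)).
    Assumes no tied values within x or within y."""
    def ranks(values):
        # Map each value to its 1-based position in sorted order.
        order = {v: i + 1 for i, v in enumerate(sorted(values))}
        return [order[v] for v in values]
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d_squared = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))
```

For example, spearman_rho([1, 2, 3, 4, 5], [2, 4, 6, 10, 8]) compares ranks [1, 2, 3, 4, 5] against [1, 2, 3, 5, 4], giving 1 - 6*2/(5*24) = 0.9.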
The rank() function involves rearranging the input data as part of
discovering duplicated values. If we want to rank on both the x and y
values in each pair, we need to reorder the data twice.
Wrapping instead of state changing
We have two general strategies for this wrapping; they are as follows:
- Parallelism: We can create two copies of the data and rank each copy. We
then need to reassemble the two copies into a final result that includes both
rankings. This can be a bit awkward because we'll need to somehow merge
two sequences that are likely to be in different orders.
- Serialism: We can compute ranks on one variable and save the results as a
wrapper that includes the original raw data. We can then rank this wrapped
data on the other variable. While this can create a complex structure, we can
optimize it slightly to create a flatter wrapper for the final results.
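The serial strategy can be sketched as follows. The rank_xy() name is a hypothetical helper, and a compact midpoint-ranking rank() function is repeated here so that the sketch is self-contained:

```python
from collections import defaultdict

def rank(data, key=lambda x: x):
    # Midpoint-rank function, as described earlier in the chapter.
    groups = defaultdict(list)
    for item in data:
        groups[key(item)].append(item)
    base = 0
    for k in sorted(groups):
        dups = len(groups[k])
        for item in groups[k]:
            yield (2 * base + 1 + dups) / 2, item
        base += dups

def rank_xy(pairs):
    """Serial wrapping: rank on y, keeping the y-rank as a wrapper
    around the raw pair, then rank the wrapped items on x and
    flatten to ((x_rank, y_rank), (x, y))."""
    # Pass 1: each wrapped item is (y_rank, (x, y)).
    y_ranked = rank(pairs, key=lambda p: p[1])
    # Pass 2: rank the wrappers on the x value, w[1][0].
    for x_rank, (y_rank, pair) in rank(y_ranked, key=lambda w: w[1][0]):
        yield (x_rank, y_rank), pair
```

The second pass produces the flatter wrapper mentioned above: a single two-rank tuple in front of each raw pair, rather than a wrapper nested inside another wrapper.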