Chapter 9
This allows us to use the corr() function from Chapter 4, Working with Collections, to
compare two columns of data.
This is how we can compute the correlation for every combination of two columns:
from itertools import combinations
from Chapter_4.ch04_ex4 import corr
for p, q in combinations(range(9), 2):
    # Split each column into its header and the remaining data values.
    header_p, *data_p = list(column(source, p))
    header_q, *data_q = list(column(source, q))
    # Matching headers mean a variable is being compared to itself.
    if header_p == header_q:
        continue
    r_pq = corr(data_p, data_q)
    print("{2: 4.2f}: {0} vs {1}".format(header_p, header_q, r_pq))
For each combination of columns, we've extracted the two columns of data from our
data set and used multiple assignment to separate the header from the remaining
rows of data. If the headers match, we're comparing a variable to itself, so we skip
the pair. The comparison will be True for the three combinations of year and year
that stem from the redundant year columns.
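The pairing and header-splitting steps can be tried in isolation. The following is a minimal sketch, not the chapter's actual code: the three-column source table and the column() helper are hypothetical stand-ins, assumed here only so the example runs on its own.

```python
from itertools import combinations

# Hypothetical table: the first row holds the headers, and "year" appears twice.
source = [
    ("year", "cheese", "year"),
    (2000, 29.8, 2000),
    (2001, 30.1, 2001),
    (2002, 30.5, 2002),
]

def column(rows, index):
    """Yield the values of one column, header first (assumed helper)."""
    return (row[index] for row in rows)

compared, skipped = [], []
for p, q in combinations(range(3), 2):
    # Starred assignment: first item is the header, the rest is the data.
    header_p, *data_p = column(source, p)
    header_q, *data_q = column(source, q)
    if header_p == header_q:
        # A redundant column paired with its twin: nothing to compare.
        skipped.append((p, q))
        continue
    compared.append((header_p, header_q))

print(compared)
print(skipped)
```

With nine columns, as in the loop above, combinations(range(9), 2) yields 36 index pairs before any matching headers are skipped.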
Given a combination of columns, we compute the correlation and then print the two
headings along with the correlation of the columns. We've intentionally chosen some
datasets that show spurious correlations, along with a dataset that doesn't follow
the same pattern. Even so, the correlations are remarkably high.
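To make the computation concrete, here is a small sketch that assumes corr() is a Pearson correlation coefficient. The pearson() function below is a stand-in written for this illustration, not the Chapter 4 implementation, and the sample values are made up rather than taken from the data set.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (illustrative stand-in)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and the two sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Made-up series: one drifts upward over the years, one drifts downward.
years = [2000, 2001, 2002, 2003, 2004]
rising = [29.8, 30.1, 30.5, 30.6, 31.3]
falling = [62.6, 62.5, 62.8, 60.9, 59.8]

r_up = pearson(years, rising)
r_down = pearson(years, falling)

# Same format string as the loop: value first, then the two headings.
print("{2: 4.2f}: {0} vs {1}".format("year", "rising", r_up))
print("{2: 4.2f}: {0} vs {1}".format("year", "falling", r_down))
```

Any series that trends steadily over the years correlates strongly with year, whether or not the two are related. The full run over the nine columns applies the same computation to every pair.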
The results look like this:
0.96: year vs Per capita consumption of cheese (US)Pounds (USDA)
0.95: year vs Number of people who died by becoming tangled in their
bedsheetsDeaths (US) (CDC)
0.92: year vs Per capita consumption of mozzarella cheese (US)Pounds
(USDA)
0.98: year vs Civil engineering doctorates awarded (US)Degrees
awarded (National Science Foundation)
-0.80: year vs US crude oil imports from VenezuelaMillions of barrels
(Dept. of Energy)
-0.95: year vs Per capita consumption of high fructose corn syrup
(US)Pounds (USDA)
0.95: Per capita consumption of cheese (US)Pounds (USDA) vs Number of
people who died by becoming tangled in their bedsheetsDeaths (US)
(CDC)
0.96: Per capita consumption of cheese (US)Pounds (USDA) vs year