Sadly, we can't trivially process this with the csv module. We have to do a little bit
of parsing to extract the useful information from this file. Since the data is properly
tab-delimited, we can use the csv.reader() function to iterate through the various
rows. We can define a data iterator as follows:
import csv
def row_iter(source):
    return csv.reader(source, delimiter="\t")
We've simply wrapped a file in the csv.reader() function to create an iterator over rows.
We can use this iterator in the following context:
with open("Anscombe.txt") as source:
    print(list(row_iter(source)))
The problem with this is that the first three items in the resulting iterable aren't data.
The Anscombe's quartet file, parsed this way, begins as follows:
[["Anscombe's quartet"], ['I', 'II', 'III', 'IV'],
['x', 'y', 'x', 'y', 'x', 'y', 'x', 'y'],
We need to filter these rows from the iterable. Here is a function that will neatly
excise the three expected title rows and return an iterator over the remaining rows:
def head_split_fixed(row_iter):
    title = next(row_iter)
    assert len(title) == 1 and title[0] == "Anscombe's quartet"
    heading = next(row_iter)
    assert len(heading) == 4 and heading == ['I', 'II', 'III', 'IV']
    columns = next(row_iter)
    assert len(columns) == 8 and columns == ['x', 'y', 'x', 'y', 'x', 'y', 'x', 'y']
    return row_iter
This function plucks three rows from the iterable. It asserts that each row has
an expected value. If the file doesn't meet these basic expectations, it's a symptom
that the file is damaged, or perhaps that our analysis is focused on the wrong file.
Since row_iter() accepts any iterable of lines and head_split_fixed() accepts the
iterator that row_iter() returns, the two functions can be trivially combined as follows:
with open("Anscombe.txt") as source:
    print(list(head_split_fixed(row_iter(source))))
We've simply applied one function to the result of another. In effect, this
defines a composite function. We're not done, of course; we still need to convert the
string values to float values, and we also need to pick apart the four parallel
series of data in each row.
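As a preview of that remaining work, here is a minimal sketch of one way to do both steps. It reuses the row_iter() and head_split_fixed() functions defined above, and introduces a hypothetical series() helper that picks out the n-th (x, y) pair from each row; it isn't necessarily how we'll proceed, but it shows the shape of the problem:
from typing import Iterator, List, Tuple

def series(n: int, rows: List[List[str]]) -> Iterator[Tuple[float, float]]:
    # Hypothetical helper: each row holds four parallel (x, y) pairs,
    # so pair n occupies columns 2*n and 2*n + 1.
    for row in rows:
        yield float(row[2 * n]), float(row[2 * n + 1])

with open("Anscombe.txt") as source:
    # Materialize the rows so that all four series can be extracted
    # from a single pass over the file.
    rows = list(head_split_fixed(row_iter(source)))

quartet = [list(series(n, rows)) for n in range(4)]
print(quartet[0][:3])  # the first few (x, y) pairs of series I
Materializing the rows into a list is a deliberate choice here: the iterator returned by head_split_fixed() can be consumed only once, but a list can be traversed four times, once for each series.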