Chapter 8
If we zip together a finite collection of numbers, we'll get a sequence of triples,
each with a number and two flags showing whether or not the number is a multiple of 3
or a multiple of 5. It's important to introduce a finite iterable to place an upper
bound on the volume of data being generated. Here's a sequence of values and their
multiplier flags:
multipliers = zip(range(10), m3, m5)
We can now decompose the triples and use a filter to pass numbers which are
multiples and reject all others:
sum(i for i, *flags in multipliers if any(flags))
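The two steps above can be put together as a runnable sketch. The definitions of m3 and m5 are not shown in this section, so the versions below are an assumption: each is a cycle()-based generator that yields True at every third or fifth position.

```python
from itertools import cycle

# Assumed definitions of the flag iterators (not shown in this section):
# each yields True at positions 0, 3, 6, ... or 0, 5, 10, ... respectively.
m3 = (i == 0 for i in cycle(range(3)))
m5 = (i == 0 for i in cycle(range(5)))

# range(10) is finite, so zip() stops after ten triples even though
# m3 and m5 would otherwise continue forever.
multipliers = zip(range(10), m3, m5)

# Unpack each triple into the number and a list of its two flags,
# keeping the number only if at least one flag is True.
total = sum(i for i, *flags in multipliers if any(flags))
print(total)  # 0 + 3 + 5 + 6 + 9 = 23
```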
The cycle() function has another, more valuable use for exploratory data analysis.
We often need to work with samples of large sets of data. The initial phases of
cleansing and model creation are best developed with small sets of data and tested
with larger and larger sets of data. We can use the cycle() function to fairly select
rows from within a larger set. The population size, $N_P$, and the desired sample size,
$N_S$, determine the length of the cycle, c, that we can use:
$$c = \frac{N_P}{N_S}$$
We'll assume that the data can be parsed with the csv module. This leads to an
elegant way to create subsets. We can create subsets using the following commands:
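import csv
from itertools import cycle
# chooser yields True once in every c positions, starting with the first
chooser = (x == 0 for x in cycle(range(c)))
rdr = csv.reader(source_file)
wtr = csv.writer(target_file)
wtr.writerows(row for pick, row in zip(chooser, rdr) if pick)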
We built the chooser from a cycle() iterator based on the selection factor, c. For example, we
might have a population of 10 million records: a 1,000-record subset involves picking
1/10,000 of the records. We assumed that this snippet of code is nestled securely
inside a with statement that opens the relevant files. We also avoided showing
details of any dialect issues with the CSV format files.
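To see the whole process end to end, here is a minimal sketch under some assumptions. The function name sample_rows, the tiny population of 20 rows, and the use of io.StringIO objects as stand-ins for files opened in a with statement are all hypothetical choices made for illustration. The selection factor is computed with integer division, which assumes the population size is a multiple of the sample size.

```python
import csv
import io
from itertools import cycle

def sample_rows(source_file, target_file, c):
    """Copy every c-th row, starting with the first, from source to target."""
    chooser = (x == 0 for x in cycle(range(c)))
    rdr = csv.reader(source_file)
    wtr = csv.writer(target_file)
    wtr.writerows(row for pick, row in zip(chooser, rdr) if pick)

# Hypothetical sizes, kept small so the result is easy to inspect.
population_size = 20                 # N_P
sample_size = 5                      # N_S
c = population_size // sample_size   # selection factor: 4

# In-memory stand-ins for the real source and target files.
source = io.StringIO(
    "".join(f"{n},{n * n}\n" for n in range(population_size))
)
target = io.StringIO()

sample_rows(source, target, c)

# Read the sample back to check which rows were selected.
sampled = list(csv.reader(io.StringIO(target.getvalue())))
print(sampled)
```

With these sizes, the rows numbered 0, 4, 8, 12, and 16 are selected. With real files, the two StringIO objects would be replaced by file objects opened with newline='' as the csv module recommends.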
We can use a simple generator expression to filter the data using the cycle()
function and the source data that's available from the CSV reader. Since the chooser
expression and the expression used to write the rows are both non-strict, there's little
memory overhead from this kind of processing.
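The non-strict behavior can be observed directly. The following sketch uses a hypothetical generator, numbered_rows(), in place of a CSV reader and records how many rows were actually produced. Only the rows needed to satisfy the consumer are ever generated, even though the source could supply a million.

```python
from itertools import cycle, islice

produced = []  # records each row number the source actually generates

def numbered_rows(limit):
    """Hypothetical stand-in for a CSV reader: yields one-column rows lazily."""
    for n in range(limit):
        produced.append(n)
        yield [str(n)]

c = 4
chooser = (x == 0 for x in cycle(range(c)))

# Nothing is evaluated yet: this only builds the pipeline.
picked = (
    row for pick, row in zip(chooser, numbered_rows(1_000_000)) if pick
)

# Pull just three selected rows; the source advances only as far as needed.
first_three = list(islice(picked, 3))
print(first_three)    # [['0'], ['4'], ['8']]
print(len(produced))  # 9 rows generated, not 1,000,000
```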