Optimizations and Improvements
Once we've read the data, the next step is to develop two probabilities so that we
can properly compute expected defects for each shift and each type of defect. We
don't want to divide the total defect count by 12, since that doesn't reflect the actual
deviations by shift or defect type. The shifts may be more or less equally productive.
The defect frequencies are certainly not going to be similar. We expect some defects
to be very rare and others to be more common.
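As a rough illustration of the idea, the following sketch derives a probability for each shift and a probability for each defect type, then multiplies them to get an expected count per cell. The counts are hypothetical, and the sketch assumes that shift and defect type are independent, which is our reading of "two probabilities" rather than something stated above.

```python
from collections import Counter
from fractions import Fraction

# Hypothetical observed counts keyed by (shift, defect_code).
observed = Counter({
    ("1", "A"): 10, ("1", "B"): 30,
    ("2", "A"): 20, ("2", "B"): 40,
})
total = sum(observed.values())

# Accumulate the marginal totals by shift and by defect type.
shift_totals = Counter()
type_totals = Counter()
for (shift, defect), n in observed.items():
    shift_totals[shift] += n
    type_totals[defect] += n

# The two probabilities: P(shift) and P(defect type).
p_shift = {s: Fraction(n, total) for s, n in shift_totals.items()}
p_type = {d: Fraction(n, total) for d, n in type_totals.items()}

# Expected count per cell, assuming shift and defect type are independent.
expected = {
    (s, d): p_shift[s] * p_type[d] * total
    for s in p_shift
    for d in p_type
}

print(expected[("1", "A")])  # 12
print(expected[("2", "B")])  # 42
```

With these made-up numbers, dividing the total evenly across the four cells would give 25 for every cell, which hides the fact that defect B is far more common than defect A.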
Reading summarized data
As an alternative to reading all of the raw data, we can look at processing only
the summary counts. We want to create a Counter object similar to the previous
example; its keys are (shift, defect code) two-tuples and its values are defect counts.
Given summaries, we simply create a Counter object from the input dictionary.
Here's a function that will read our summary data:
from collections import Counter
import csv

def defect_counts(source):
    rdr = csv.DictReader(source)
    # Confirm the file has the expected column layout.
    assert rdr.fieldnames == ["shift", "defect_code", "count"]
    # Lazily map each row to ((shift, defect_code), count).
    convert = map(
        lambda d: ((d['shift'], d['defect_code']),
                   int(d['count'])),
        rdr)
    return Counter(dict(convert))
The function requires an open file as its input. It creates a csv.DictReader object to
parse the raw CSV data that we got from the database. We included an assert
statement to confirm that the file really has the expected columns.
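To make the reader's behavior concrete, here is a small sketch that feeds csv.DictReader an in-memory file in place of an open file on disk. The sample rows are hypothetical; only the three column names come from the function above.

```python
import csv
import io

# Hypothetical summary data; the shift and defect codes are made up.
sample = io.StringIO(
    "shift,defect_code,count\n"
    "1,A,15\n"
    "1,B,21\n"
    "2,A,26\n"
)

rdr = csv.DictReader(sample)

# fieldnames is taken from the first line of the file, so this is
# the value the assert statement compares against.
header = rdr.fieldnames
print(header)

# Each remaining line becomes a mapping from column name to string.
rows = list(rdr)
for row in rows:
    # Every value arrives as text, which is why count needs int().
    print(row["shift"], row["defect_code"], int(row["count"]))
```

Because io.StringIO behaves like an open text file, the same object could be passed directly to a function that expects an open file.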
We defined a lambda that creates a two-tuple with the key and the integer
conversion of the count. The key is itself a two-tuple with the shift and defect
code. The result will be a sequence such as ((shift, defect), count),
((shift, defect), count), .... When we apply the lambda to the DictReader object
with map(), we get a lazy iterator that emits this sequence of two-tuples.
We will create a dictionary from the collection of two-tuples and use this dictionary
to build a Counter object. Counter objects can be added together, which sums the
counts for matching keys. This allows us to combine details acquired from several sources.
In this case, we only have a single source.
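Although we have only one source here, a short sketch shows what combining would look like. The two lists of two-tuples below are hypothetical stand-ins for the sequences that two separate summary files might produce.

```python
from collections import Counter

# Hypothetical two-tuples of the form ((shift, defect_code), count),
# standing in for the output of two separate sources.
pairs_a = [(("1", "A"), 15), (("1", "B"), 21), (("2", "A"), 26)]
pairs_b = [(("1", "A"), 5), (("3", "C"), 33)]

# Same construction as in the function: two-tuples -> dict -> Counter.
counts_a = Counter(dict(pairs_a))
counts_b = Counter(dict(pairs_b))

# Adding two Counter objects sums the counts for matching keys.
combined = counts_a + counts_b

print(combined[("1", "A")])  # 20
print(combined[("3", "C")])  # 33
print(combined[("2", "D")])  # 0, since a missing key counts as zero
```

One consequence of building a dictionary first is that a key repeated within a single source keeps only its last count instead of being summed. For summary data with one row per shift and defect code, this should not matter.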