Chapter 10
Using map() and reduce() to sanitize raw data
When doing data cleansing, we'll often introduce filters of various degrees of
complexity to exclude invalid values. We may also include a mapping to sanitize
values in the cases where a valid but improperly formatted value can be replaced
with a valid but proper value.
We might use the following functions for this:
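from functools import reduce
import operator

def comma_fix(data):
    # Try a plain conversion first; strip commas only if that fails.
    try:
        return float(data)
    except ValueError:
        return float(data.replace(",", ""))

def clean_sum(cleaner, data):
    # Map the cleaner over the data, then reduce the results to a sum.
    return reduce(operator.add, map(cleaner, data))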
We've defined a simple mapping, the comma_fix() function, that converts data
from a nearly correct format into a usable floating-point value.
We've also defined a map-reduce, the clean_sum() function, that applies a given
cleaner function (comma_fix() in this case) to the data before reducing the
results with the reduce() function and the operator.add function.
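As a rough illustration of the two stages, the following sketch runs the mapping on its own before handing the result to reduce(). The sample values and the names sample, cleaned, and total are made up for this illustration; the two functions are repeated so that the block runs by itself.

```python
from functools import reduce
import operator

def comma_fix(data):
    try:
        return float(data)
    except ValueError:
        return float(data.replace(",", ""))

def clean_sum(cleaner, data):
    return reduce(operator.add, map(cleaner, data))

# Hypothetical sample values, not taken from the text.
sample = ["1,000", "250", "3.5"]

# Stage 1: the mapping sanitizes each value.
cleaned = list(map(comma_fix, sample))   # [1000.0, 250.0, 3.5]

# Stage 2: the reduction folds the clean values into a single sum.
total = reduce(operator.add, cleaned)    # 1253.5

# clean_sum() does both stages in one expression.
assert total == clean_sum(comma_fix, sample)
```

Note that reduce() called without an initial value raises a TypeError on an empty sequence. If empty input is possible, passing 0 as a third argument to reduce() is one way to get a sum of 0 instead.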
We can apply the previously described function as follows:
>>> d = ('1,196', '1,176', '1,269', '1,240', '1,307',
... '1,435', '1,601', '1,654', '1,803', '1,734')
>>> clean_sum(comma_fix, d)
14415.0
We've cleaned the data by removing the commas and computed a sum in a single
expression. The syntax is very convenient for combining these two operations.
We have to be careful, however, about applying the cleaning function more than once.
If we're also going to compute a sum of squares, we really should not execute the
following command:
comma_fix_squared = lambda x: comma_fix(x)**2
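One possible alternative, given here as a minimal sketch and assuming the cleaned values fit comfortably in memory, is to materialize the cleaned data once and run both reductions over that result. The names cleaned, total, and total_sq are introduced for illustration only.

```python
from functools import reduce
import operator

def comma_fix(data):
    try:
        return float(data)
    except ValueError:
        return float(data.replace(",", ""))

d = ('1,196', '1,176', '1,269', '1,240', '1,307',
     '1,435', '1,601', '1,654', '1,803', '1,734')

# Clean each value exactly once and keep the results.
cleaned = tuple(map(comma_fix, d))

# Both reductions reuse the same cleaned tuple, so comma_fix()
# is never applied to a given value a second time.
total = reduce(operator.add, cleaned)
total_sq = reduce(operator.add, map(lambda x: x**2, cleaned))
```

With this arrangement the squaring step works on floats that are already clean, so it does not need to call comma_fix() at all.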