Chapter 10
Using map() and reduce() to sanitize raw data
When doing data cleansing, we'll often introduce filters of various degrees of
complexity to exclude invalid values. We may also include a mapping to sanitize
values in cases where a valid but improperly formatted value can be replaced
with a valid and properly formatted value.
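As a minimal sketch of the filter-then-map idea described above, assume a hypothetical list of raw strings in which blank entries are invalid and padded entries are valid but improperly formatted. The names raw, is_present(), and tidy() are illustrative only.

```python
# Hypothetical raw readings: blank entries are invalid,
# padded entries are valid but improperly formatted.
raw = ["  12.5", "", "7.25  ", "   ", "3"]

def is_present(text):
    """Filter predicate: reject empty or whitespace-only strings."""
    return bool(text.strip())

def tidy(text):
    """Mapping: strip the padding and convert the text to a float."""
    return float(text.strip())

# Exclude the invalid values first, then sanitize what remains.
cleaned = list(map(tidy, filter(is_present, raw)))
print(cleaned)
```

The filter runs before the mapping so that tidy() never sees a value it cannot convert.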
We might define the following functions:
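from functools import reduce
import operator

def comma_fix(data):
    # Try a plain conversion first; on failure, drop the commas and retry.
    try:
        return float(data)
    except ValueError:
        return float(data.replace(",", ""))

def clean_sum(cleaner, data):
    # Apply the cleaner to each item, then add up the cleaned values.
    return reduce(operator.add, map(cleaner, data))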
We've defined a simple mapping, the comma_fix() function, that will convert data
from a nearly correct format into a usable floating-point value.
We've also defined a map-reduce, the clean_sum() function, that applies a given
cleaner function (comma_fix() in this case) to the data before reducing the
results with the operator.add() function.
We can apply the previously described function as follows:
>>> d = ('1,196', '1,176', '1,269', '1,240', '1,307',
...     '1,435', '1,601', '1,654', '1,803', '1,734')
>>> clean_sum(comma_fix, d)
14415.0
We've cleaned the data by fixing the commas, and we've also computed a sum. The
syntax is very convenient for combining these two operations.
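One caveat with this form: reduce() called without an initial value raises a TypeError when the data is empty. The following sketch shows one possible variant, here called clean_sum_safe() (a hypothetical name, not part of the original example), which passes 0.0 as the initial value.

```python
from functools import reduce
import operator

def comma_fix(data):
    # Same cleaner as in the example: retry without commas on failure.
    try:
        return float(data)
    except ValueError:
        return float(data.replace(",", ""))

def clean_sum_safe(cleaner, data):
    # Hypothetical variant: the third argument to reduce() is the
    # initial value, so an empty sequence yields 0.0 instead of an error.
    return reduce(operator.add, map(cleaner, data), 0.0)

total = clean_sum_safe(comma_fix, ('1,196', '1,176'))
empty_total = clean_sum_safe(comma_fix, ())
print(total, empty_total)
```

Whether an empty input should produce 0.0 or an error depends on the application; the initial value simply makes that choice explicit.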
We have to be careful, however, about using the cleaning function more than once.
If we're also going to compute a sum of squares, we really should not execute the
following command:
comma_fix_squared = lambda x: comma_fix(x)**2