Functional Python Programming

Chapter 12

Clearly, we want to interleave the other operations so that they are not waiting for
I/O to complete. We can interleave operations along a spectrum from individual
rows to whole files. We'll look at interleaving whole files first, as this is relatively
simple to implement.


The functional design for parsing Apache CLF files can look as follows:


data = path_filter(access_detail_iter(access_iter(local_gzip(filename))))


We've decomposed the larger parsing problem into a number of functions, each of
which handles one portion of it. The local_gzip() function reads rows from
locally cached GZIP files. The access_iter() function creates a simple
namedtuple object for each row in the access log. The access_detail_iter()
function expands some of the fields that are more difficult to parse. Finally, the
path_filter() function discards some paths and file extensions that aren't
of much analytical value.
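
The nesting works because each stage accepts an iterable and lazily produces another one. The following is a minimal sketch of that shape only. The stage names (source, parse, keep_paths), the Request type, and the simplified "METHOD path" line format are all hypothetical stand-ins chosen for illustration; they are not the functions described above.

```python
from typing import Iterable, Iterator, NamedTuple

class Request(NamedTuple):
    # Hypothetical record type; real access-log rows carry many more fields.
    method: str
    path: str

def source(lines: Iterable[str]) -> Iterator[str]:
    """Stand-in reader stage: yield stripped, non-empty lines."""
    for line in lines:
        if line.strip():
            yield line.strip()

def parse(lines: Iterable[str]) -> Iterator[Request]:
    """Stand-in parsing stage: split an assumed 'METHOD path' format."""
    for line in lines:
        method, path = line.split(" ", 1)
        yield Request(method, path)

def keep_paths(requests: Iterable[Request]) -> Iterator[Request]:
    """Stand-in filter stage: drop paths with uninteresting extensions."""
    for request in requests:
        if not request.path.endswith((".css", ".js")):
            yield request

raw = ["GET /index.html", "", "GET /style.css", "POST /login"]
# Nothing is read or parsed until list() pulls rows through all three stages.
data = list(keep_paths(parse(source(raw))))
```

Because every stage is a generator, only one row is in flight at a time, regardless of how large the input is.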


Parsing log files – gathering the rows


Here is the first stage in parsing a large number of files: reading each file and
producing a simple sequence of lines. As the log files are saved in the gzip format,
we need to open each file with the gzip.open() function instead of the io.open()
function or the builtins.open() function.


The local_gzip() function reads lines from locally cached files, as shown in the
following code snippet:


import glob
import gzip

def local_gzip(pattern):
    zip_logs = glob.glob(pattern)
    for zip_file in zip_logs:
        with gzip.open(zip_file, "rb") as log:
            # One lazy generator of decoded, right-stripped lines per file.
            yield (line.decode('us-ascii').rstrip() for line in log)


The preceding function iterates through all files. For each file, the yielded value
is a generator expression that will iterate through all lines within that file. We've
encapsulated several things, including wildcard file matching, the details of opening
a log file compressed in the gzip format, and breaking a file into a sequence of
lines without any trailing whitespace, including the \n character.
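
A short usage sketch may help show how the nested result is consumed. It repeats the function so that it runs on its own; the temporary directory, the file names, and the sample text are made up for the demonstration. One point worth noting is that each file stays open only while the outer generator is paused at its yield, so each per-file generator should be consumed before the next file is requested. Collecting the generators first and reading them later is likely to fail because the files will already have been closed.

```python
import glob
import gzip
import os
import tempfile

def local_gzip(pattern):
    # Same logic as the function above, repeated so this sketch is runnable.
    for zip_file in glob.glob(pattern):
        with gzip.open(zip_file, "rb") as log:
            yield (line.decode("us-ascii").rstrip() for line in log)

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical sample data: two small gzip files.
    samples = [("a.log.gz", "first\nsecond\n"), ("b.log.gz", "third\n")]
    for name, text in samples:
        with gzip.open(os.path.join(tmp, name), "wb") as target:
            target.write(text.encode("us-ascii"))

    # list(lines) drains each file's generator before the outer
    # generator is resumed and moves on to the next file.
    per_file = [list(lines) for lines in local_gzip(os.path.join(tmp, "*.gz"))]

# glob.glob() does not guarantee an order, so sort before inspecting.
per_file.sort()
```

The result is a list with one list of lines per file, which mirrors the generator-of-generators structure that the function yields.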


The essential design pattern here is to yield values that are generator expressions for
each file. The preceding function can be restated as a function and a mapping that
applies that function to each file.
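
One way this restatement might look is sketched below. The names lines_from_gzip and local_gzip_map are hypothetical, and the temporary file exists only to make the sketch runnable. The per-file function is a generator that opens the file itself, and the built-in map() applies it to every matching file name.

```python
import glob
import gzip
import os
import tempfile

def lines_from_gzip(zip_file):
    """Yield decoded lines, without trailing whitespace, from one gzip file."""
    with gzip.open(zip_file, "rb") as log:
        yield from (line.decode("us-ascii").rstrip() for line in log)

def local_gzip_map(pattern):
    """Lazily apply lines_from_gzip() to each file matching the pattern."""
    return map(lines_from_gzip, glob.glob(pattern))

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical sample data: a single small gzip file.
    path = os.path.join(tmp, "sample.log.gz")
    with gzip.open(path, "wb") as target:
        target.write(b"alpha\nbeta\n")

    result = [list(lines) for lines in local_gzip_map(os.path.join(tmp, "*.gz"))]
```

In this form, each file is opened only when its own generator is first iterated and is closed when that generator finishes, so the per-file function can be handed to other mapping tools without changing it.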
