Functional Python Programming

Chapter 12

r"(?P<status>\d+)\s+"
r"(?P<bytes>\S+)\s+"
r'"(?P<referrer>.*?)"\s+' # the HTTP header itself is spelled "Referer" [SIC]
r'"(?P<user_agent>.+?)"\s*'
)


We can use this regular expression to break each row into a dictionary of nine
individual data elements. The use of [] and " to delimit complex fields such as the
time, request, referrer, and user_agent parameters is handled gracefully by the
pattern: the delimiters sit outside the named groups, so they are not part of the
captured values.
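
As a rough illustration of that breakdown, the sketch below applies a complete pattern to one log line. The first five groups (host, identity, user, time, request) are not shown in this section, so their patterns here are assumptions based on the common Apache combined log layout. The sample line is invented for illustration.

```python
import re

# Assumed full pattern: the first five groups are a guess at the part of
# format_pat that this section does not show; the last four follow the
# fragment above, with names taken from the Access fields.
format_pat = re.compile(
    r"(?P<host>[\d\.]+)\s+"
    r"(?P<identity>\S+)\s+"
    r"(?P<user>\S+)\s+"
    r"\[(?P<time>.+?)\]\s+"
    r'"(?P<request>.+?)"\s+'
    r"(?P<status>\d+)\s+"
    r"(?P<bytes>\S+)\s+"
    r'"(?P<referrer>.*?)"\s+'
    r'"(?P<user_agent>.+?)"\s*'
)

# Invented sample line, for illustration only.
sample_line = (
    '192.0.2.10 - - [01/Jan/2024:12:00:00 +0000] '
    '"GET /index.html HTTP/1.1" 200 512 "-" "ExampleAgent/1.0"'
)

# groupdict() maps each group name to the text it captured.
fields = format_pat.match(sample_line).groupdict()
for name, value in fields.items():
    print(f"{name:>10}: {value}")
```

Every captured value is a string, and the brackets and quotation marks do not appear in the time, request, referrer, or user_agent values.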


Each individual access can be summarized as an instance of a namedtuple class, defined as follows:


from collections import namedtuple

Access = namedtuple('Access', ['host', 'identity', 'user', 'time',
    'request', 'status', 'bytes', 'referrer', 'user_agent'])


We've taken pains to ensure that the Access field names match the
regular expression group names in the (?P<name>) constructs for each
portion of the record. Because the names match, we can transform the
parsed dictionary into a tuple for further processing simply by passing
it as keyword arguments.
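
A minimal sketch of why the matching names matter: a dictionary whose keys equal the Access field names unpacks directly with **, while a single mismatched key raises TypeError. The dictionary values are invented, and the misspelled key is a hypothetical mistake shown only for contrast.

```python
from collections import namedtuple

Access = namedtuple('Access', ['host', 'identity', 'user', 'time',
    'request', 'status', 'bytes', 'referrer', 'user_agent'])

# Invented values standing in for the result of match.groupdict().
parsed = {
    'host': '192.0.2.10', 'identity': '-', 'user': '-',
    'time': '01/Jan/2024:12:00:00 +0000',
    'request': 'GET /index.html HTTP/1.1',
    'status': '200', 'bytes': '512',
    'referrer': '-', 'user_agent': 'ExampleAgent/1.0',
}

# Keys line up with the field names, so keyword unpacking just works.
access = Access(**parsed)
print(access.status, access.request)

# Hypothetical mismatch: 'referer' is not a field of Access.
mismatched = dict(parsed)
mismatched['referer'] = mismatched.pop('referrer')
try:
    Access(**mismatched)
    mismatch_error = None
except TypeError as exc:
    mismatch_error = exc
    print("mismatch rejected:", exc)
```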

Here is the access_iter() function, which requires each file to be represented
as an iterator over its lines:


def access_iter(source_iter):
    for log in source_iter:
        for line in log:
            match = format_pat.match(line)
            if match:
                yield Access(**match.groupdict())


The output from the local_gzip() function is a sequence of sequences. The outer
sequence consists of individual logfiles; each logfile is an iterable sequence
of lines. If a line matches the given pattern, it's a file access of some kind, and we
can create an Access namedtuple from the match dictionary. Lines that don't match
are skipped.
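
To see the nested iteration at work without real gzip files, the following sketch feeds access_iter() two in-memory lists of lines as a stand-in for the output of local_gzip(). The first five groups of the pattern, the sample lines, and the list-based input are all assumptions made for illustration.

```python
import re
from collections import namedtuple

# Assumed pattern: the first five groups are not shown in this section.
format_pat = re.compile(
    r"(?P<host>[\d\.]+)\s+"
    r"(?P<identity>\S+)\s+"
    r"(?P<user>\S+)\s+"
    r"\[(?P<time>.+?)\]\s+"
    r'"(?P<request>.+?)"\s+'
    r"(?P<status>\d+)\s+"
    r"(?P<bytes>\S+)\s+"
    r'"(?P<referrer>.*?)"\s+'
    r'"(?P<user_agent>.+?)"\s*'
)

Access = namedtuple('Access', ['host', 'identity', 'user', 'time',
    'request', 'status', 'bytes', 'referrer', 'user_agent'])

def access_iter(source_iter):
    """Yield an Access for each matching line of each log."""
    for log in source_iter:
        for line in log:
            match = format_pat.match(line)
            if match:
                yield Access(**match.groupdict())

# Stand-in for local_gzip(): a sequence of "files", each a sequence of lines.
fake_logs = [
    [
        '192.0.2.10 - - [01/Jan/2024:12:00:00 +0000] '
        '"GET / HTTP/1.1" 200 512 "-" "ExampleAgent/1.0"\n',
        'this line is not an access record\n',
    ],
    [
        '192.0.2.11 - alice [01/Jan/2024:12:00:05 +0000] '
        '"GET /a.css HTTP/1.1" 304 - "http://example.com/" "ExampleAgent/1.0"\n',
    ],
]

accesses = list(access_iter(fake_logs))
for access in accesses:
    print(access.host, access.status, access.bytes)
```

The unmatched line in the first list is dropped silently, so the three input lines produce two Access objects.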


The essential design pattern here is to build a static object from the results of a
parsing function. In this case, the parsing function is a regular expression matcher.
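
The same pattern is not tied to regular expressions. As a hedged sketch, the example below uses csv.DictReader from the standard library as the parsing function instead. The Reading type, the reading_iter() function, and the sample data are hypothetical names and values chosen only to illustrate the idea.

```python
import csv
import io
from collections import namedtuple

# Hypothetical record type; its fields match the CSV column headers.
Reading = namedtuple('Reading', ['sensor', 'value'])

def reading_iter(source):
    """Yield a Reading for each CSV row; csv.DictReader does the parsing."""
    for row in csv.DictReader(source):
        yield Reading(**row)

# Invented sample data, held in memory instead of a real file.
sample = io.StringIO("sensor,value\nt1,20.5\nt2,19.0\n")
readings = list(reading_iter(sample))
print(readings)
```

As with Access, the parser produces a dictionary of strings and the namedtuple turns it into an immutable object with named attributes.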
