Chapter 12
r"(?P
r"(?P
r'"(?P
r'"(?P
)
We can use this regular expression to break each row into a dictionary of nine
individual data elements. The [] and " characters that delimit complex fields such
as the time, request, referrer, and user_agent parameters are handled gracefully by
the regular expression pattern.
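As a minimal sketch of how named groups produce a dictionary, consider the following shortened pattern. The names sample_pat and sample_line are hypothetical, and the pattern has only four groups; it is an illustration, not the nine-group expression described above.

```python
import re

# Hypothetical, shortened pattern for illustration only: four named
# groups rather than the nine described in the text.
sample_pat = re.compile(
    r"(?P<host>\S+)\s+"
    r"\[(?P<time>.+?)\]\s+"     # [] delimit the time field
    r'"(?P<request>.+?)"\s+'    # "" delimit the request field
    r"(?P<status>\d+)"
)

# Made-up log fragment that fits the shortened pattern.
sample_line = (
    '127.0.0.1 [10/Oct/2000:13:55:36 -0700] '
    '"GET /index.html HTTP/1.0" 200'
)

match = sample_pat.match(sample_line)
# groupdict() maps each group name to the text it captured.
fields = match.groupdict() if match else {}
print(fields)
```

The delimiters are matched by the pattern but sit outside the named groups, so the captured values contain no brackets or quotes even though the time and request fields contain spaces.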
Each individual access can be summarized as a namedtuple() object as follows:
Access = namedtuple('Access', ['host', 'identity', 'user', 'time',
    'request', 'status', 'bytes', 'referrer', 'user_agent'])
We've taken pains to ensure that the namedtuple's field names
match the regular expression group names in the (?P<name>)
constructs for each portion of the record. Because the names
match, we can very easily transform the parsed dictionary into a tuple
for further processing.
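The name matching can be sketched as follows. The parsed dictionary here is a hand-written stand-in for what a match object's groupdict() method might return, and its values are invented for illustration.

```python
from collections import namedtuple

Access = namedtuple('Access', ['host', 'identity', 'user', 'time',
    'request', 'status', 'bytes', 'referrer', 'user_agent'])

# Hand-written stand-in for a groupdict() result; values are made up.
parsed = {
    'host': '127.0.0.1', 'identity': '-', 'user': 'frank',
    'time': '10/Oct/2000:13:55:36 -0700',
    'request': 'GET /index.html HTTP/1.0',
    'status': '200', 'bytes': '2326',
    'referrer': '-', 'user_agent': 'Mozilla/4.08',
}

# Keys become keyword arguments, so each key must be a field name.
access = Access(**parsed)

# A key that is not a field name is rejected with a TypeError.
misnamed = dict(parsed)
misnamed['referer'] = misnamed.pop('referrer')
mismatch_error = None
try:
    Access(**misnamed)
except TypeError as ex:
    mismatch_error = ex
```

A misspelled group name therefore fails loudly at construction time instead of silently producing a record with a missing field.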
Here is the access_iter() function, which requires each file to be represented
as an iterator over the lines of the file:
def access_iter(source_iter):
    for log in source_iter:
        for line in log:
            match = format_pat.match(line)
            if match:
                yield Access(**match.groupdict())
The output from the local_gzip() function is a sequence of sequences. The outer
sequence consists of individual logfiles. For each file, there is an iterable sequence
of lines. If a line matches the given pattern, it's a file access of some kind, and we
create an Access namedtuple from the match dictionary. Lines that don't match are
silently skipped.
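The whole flow can be exercised without any files by passing lists of strings in place of the logfiles. In this sketch, the format_pat shown is an assumed nine-group pattern for the common combined log layout, written only for illustration and not necessarily the expression used in this chapter. The sample lines are invented as well.

```python
import re
from collections import namedtuple

Access = namedtuple('Access', ['host', 'identity', 'user', 'time',
    'request', 'status', 'bytes', 'referrer', 'user_agent'])

# Assumed pattern for illustration; the group names match Access fields.
format_pat = re.compile(
    r"(?P<host>\S+)\s+"
    r"(?P<identity>\S+)\s+"
    r"(?P<user>\S+)\s+"
    r"\[(?P<time>.+?)\]\s+"
    r'"(?P<request>.+?)"\s+'
    r"(?P<status>\d+)\s+"
    r"(?P<bytes>\S+)\s+"
    r'"(?P<referrer>.*?)"\s+'
    r'"(?P<user_agent>.+?)"\s*'
)

def access_iter(source_iter):
    for log in source_iter:          # outer sequence: one item per logfile
        for line in log:             # inner sequence: lines of that file
            match = format_pat.match(line)
            if match:                # non-matching lines are skipped
                yield Access(**match.groupdict())

# Two in-memory "files" standing in for decompressed logs.
log_1 = [
    '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
    '"GET /a.html HTTP/1.0" 200 2326 "-" "Mozilla/4.08"\n',
    'not a log line\n',
]
log_2 = [
    '10.0.0.7 - - [10/Oct/2000:13:56:01 -0700] '
    '"GET /b.html HTTP/1.0" 404 - "http://example.com/" "curl/7.0"\n',
]

accesses = list(access_iter([log_1, log_2]))
for access in accesses:
    print(access.host, access.status, access.request)
```

Because access_iter() is a generator, nothing is parsed until the result is consumed; the list() call here forces the whole sequence only to make the example easy to inspect.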
The essential design pattern here is to build a static object from the results of a
parsing function. In this case, the parsing function is a regular expression matcher.