

An application that analyzes multiple logfiles is also a good candidate for
parallelization. We can create a Pool object of analytical processes. We can assign
each logfile to an analyzer; this allows reading and analysis to proceed in parallel
among the various workers in the Pool object. Each individual worker will involve
serialized I/O and computation. However, one worker can be performing its
computation while other workers are waiting for their I/O to complete.
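
As a quick sketch (not a definitive design), a Pool of analytical workers might be
set up as follows; the analysis() function and the filenames are hypothetical
placeholders:

from multiprocessing import Pool

def analysis(filename):
    # A hypothetical placeholder analysis: count the lines in one logfile.
    with open(filename) as source:
        return filename, sum(1 for line in source)

if __name__ == "__main__":
    filenames = ["access1.log", "access2.log", "access3.log"]  # assumed names
    # Each worker reads and analyzes its own file, so one worker's
    # computation can overlap with another worker's pending I/O.
    with Pool(4) as workers:
        results = workers.map(analysis, filenames)
    print(results)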


Processing many large files


Here is an example of a multiprocessing application. We'll scrape Common Log
Format (CLF) lines in web logfiles. This is the most commonly used format for an
access log. The lines tend to be long, but look like the following when wrapped to
the book's margins:


99.49.32.197 - - [01/Jun/2012:22:17:54 -0400] "GET /favicon.ico
HTTP/1.1" 200 894 "-" "Mozilla/5.0 (Windows NT 6.0)
AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.52
Safari/536.5"
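
As a hedged sketch, a single regular expression with named groups can decompose
a CLF line like the one above; the group names here are our own choices, not
fixed by the format:

import re

# Matches the nine fields of a Common Log Format line with extensions.
format_pat = re.compile(
    r"(?P<host>\S+)\s+"
    r"(?P<identity>\S+)\s+"
    r"(?P<user>\S+)\s+"
    r"\[(?P<time>.+?)\]\s+"
    r'"(?P<request>.+?)"\s+'
    r"(?P<status>\d+)\s+"
    r"(?P<bytes>\S+)\s+"
    r'"(?P<referer>.*?)"\s+'
    r'"(?P<user_agent>.+?)"\s*'
)

line = (
    '99.49.32.197 - - [01/Jun/2012:22:17:54 -0400] "GET /favicon.ico '
    'HTTP/1.1" 200 894 "-" "Mozilla/5.0 (Windows NT 6.0) '
    'AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.52 Safari/536.5"'
)
match = format_pat.match(line)
print(match.groupdict()["request"])  # GET /favicon.ico HTTP/1.1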

We often have large numbers of large files that we'd like to analyze. The presence
of many independent files means that concurrency will have some benefit for our
scraping process.


We'll decompose the analysis into two broad areas of functionality. The first phase of
any processing is the essential parsing of the logfiles to gather the relevant pieces of
information. We'll decompose this into four stages, sketched in the code after this
list. They are as follows:



  1. All the lines from multiple source logfiles are read.

  2. Simple namedtuples are created from the lines of the log entries in a
    collection of files.

  3. The details of more complex fields such as dates and URLs are parsed.

  4. Uninteresting paths from the logs are rejected; we can also think of this
    as passing only the interesting paths.
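
The following generator functions sketch one possible implementation of the four
stages; the function and field names are hypothetical, and the regular expression is
a simplified variant of the one shown earlier:

from collections import namedtuple
import datetime
import re

Access = namedtuple("Access", ["host", "time", "method", "path", "status", "bytes"])

# Simplified pattern: capture only the fields our Access tuple needs.
log_pat = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>.+?)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" '
    r'(?P<status>\d+) (?P<bytes>\S+)'
)

def line_iter(filenames):
    # Stage 1: read all the lines from multiple source logfiles.
    for name in filenames:
        with open(name) as source:
            yield from source

def access_iter(lines):
    # Stage 2: create simple namedtuples from the raw log lines.
    for line in lines:
        match = log_pat.match(line)
        if match:
            yield Access(**match.groupdict())

def detail_iter(accesses):
    # Stage 3: parse the more complex fields, such as the dates.
    for access in accesses:
        time = datetime.datetime.strptime(access.time, "%d/%b/%Y:%H:%M:%S %z")
        yield access._replace(time=time)

def path_filter(accesses):
    # Stage 4: reject uninteresting paths, passing only the interesting ones.
    uninteresting = (".ico", ".css", ".js")
    for access in accesses:
        if not access.path.endswith(uninteresting):
            yield access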


Once past the parsing phase, we can perform a large number of analyses. For our
purposes in demonstrating the multiprocessing module, we'll look at a simple
analysis to count occurrences of specific paths.
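
Using the hypothetical pipeline sketched above, this counting analysis might be no
more than a Counter applied to the filtered paths:

from collections import Counter

def path_counts(accesses):
    # Tally how often each path occurs among the filtered accesses.
    return Counter(access.path for access in accesses)

filenames = ["access1.log", "access2.log"]  # assumed names
counts = path_counts(
    path_filter(detail_iter(access_iter(line_iter(filenames))))
)
print(counts.most_common(5))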


The first portion, reading from the source files, involves the most input processing.
Python's use of file iterators will translate into lower-level OS requests for
buffering of data. Each OS request means that the process must wait for the data
to become available.
