Chapter 12
Here is an example to show just how similar they are:
import concurrent.futures
import glob
from collections import Counter

pool_size = 4
pattern = "*.gz"
combined = Counter()
# analysis() is the logfile-analysis function defined earlier in this chapter.
with concurrent.futures.ProcessPoolExecutor(
        max_workers=pool_size) as workers:
    for result in workers.map(analysis, glob.glob(pattern)):
        combined.update(result)
The most significant change between the preceding example and the previous examples
is that we're using an instance of the concurrent.futures.ProcessPoolExecutor
class instead of multiprocessing.Pool. The essential design pattern
is the same: map the analysis() function across the list of filenames using the pool
of available workers, then consolidate the resulting Counter objects into a final
result. The performance of the concurrent.futures module is nearly identical to
that of the multiprocessing module.
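For comparison, here is a sketch of the equivalent multiprocessing version; the analysis() function is assumed to be the same top-level function used above, and imap_unordered() is one reasonable choice for consuming results as workers finish:
import glob
import multiprocessing
from collections import Counter

pool_size = 4
pattern = "*.gz"
combined = Counter()
# analysis() is assumed to be the same top-level function used above.
with multiprocessing.Pool(pool_size) as workers:
    for result in workers.imap_unordered(analysis, glob.glob(pattern)):
        combined.update(result)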
Using concurrent.futures thread pools
The concurrent.futures module offers a second kind of executor that we can use in
our applications. Instead of creating a concurrent.futures.ProcessPoolExecutor
object, we can use the ThreadPoolExecutor object. This will create a pool of threads
within a single process.
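As a minimal sketch, assuming pool_size, pattern, combined, and the analysis() function are defined exactly as in the preceding example, the only change is the executor class:
# Identical structure; only the executor class differs.
with concurrent.futures.ThreadPoolExecutor(
        max_workers=pool_size) as workers:
    for result in workers.map(analysis, glob.glob(pattern)):
        combined.update(result)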
The syntax is otherwise identical to using a ProcessPoolExecutor object. The
performance, however, is remarkably different. The logfile processing is dominated
by I/O, and all of the threads in a single CPython process share the same I/O
channels and contend for the Global Interpreter Lock. Because of this, the overall
performance of multithreaded logfile analysis is about the same as processing the
logfiles serially.
Using sample logfiles and a small four-core laptop running Mac OS X, these are the
kinds of results that show the difference between threads that share I/O resources
and separate processes:
- Using the concurrent.futures thread pool, the elapsed time was 168 seconds
- Using a process pool, the elapsed time was 68 seconds
In both cases, the pool size was 4. It's not clear which kinds of applications
benefit from a multithreading approach; in general, multiprocessing seems to be the
better fit for Python applications like this one.
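One hypothetical way to reproduce this kind of comparison is to time each executor class over the same workload with time.perf_counter(); the analysis() function and the file pattern are assumptions carried over from the earlier examples:
import concurrent.futures
import glob
import time
from collections import Counter

def timed_run(executor_class, pool_size=4, pattern="*.gz"):
    # Time one complete map-and-combine pass with the given executor class.
    # analysis() is assumed to be defined at module level, as above.
    start = time.perf_counter()
    combined = Counter()
    with executor_class(max_workers=pool_size) as workers:
        for result in workers.map(analysis, glob.glob(pattern)):
            combined.update(result)
    return time.perf_counter() - start

print("thread pool:", timed_run(concurrent.futures.ThreadPoolExecutor))
print("process pool:", timed_run(concurrent.futures.ProcessPoolExecutor))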