Chapter 12
There are two general approaches to interleaving computation and I/O:
- We can try to interleave I/O and calculation for the problem as a
whole. We might create a pipeline of processing with read, compute, and
write as operations. The idea is to have individual data objects flowing
through the pipe from one stage to the next. Each stage can operate in
parallel.
- We can decompose the problem into separate, independent pieces that can be
processed from the beginning to the end in parallel.
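The first, pipeline-style approach can be sketched with the multiprocessing package's Queue objects connecting the stages. The stage functions and names below are illustrative, not from the text; the main process plays the role of the write stage, and squaring stands in for the real computation.

```python
from multiprocessing import Process, Queue

SENTINEL = None  # marks end-of-data; the items themselves must not be None


def read_stage(out_q, source):
    # Stage 1: feed raw items into the pipe.
    for item in source:
        out_q.put(item)
    out_q.put(SENTINEL)


def compute_stage(in_q, out_q):
    # Stage 2: transform each item as it arrives, running in
    # parallel with the reader, which may still be producing items.
    while (item := in_q.get()) is not SENTINEL:
        out_q.put(item * item)  # placeholder computation
    out_q.put(SENTINEL)


def run_pipeline(source):
    q1, q2 = Queue(), Queue()
    stages = [
        Process(target=read_stage, args=(q1, source)),
        Process(target=compute_stage, args=(q1, q2)),
    ]
    for p in stages:
        p.start()
    # Stage 3: the main process acts as the write stage,
    # consuming results as they flow out of the pipe.
    results = []
    while (item := q2.get()) is not SENTINEL:
        results.append(item)
    for p in stages:
        p.join()
    return results
```

With a single compute stage, results arrive in input order; running several compute stages side by side would give the hybrid design of multiple parallel pipelines, at the cost of a predictable ordering.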
The differences between these approaches aren't crisp; there is a blurry middle
region that's not clearly one or the other. For example, multiple parallel pipelines are
a hybrid of both designs. There are some formalisms that make it somewhat
easier to design concurrent programs. The Communicating Sequential Processes
(CSP) paradigm can help design message-passing applications. Packages such as
pycsp can be used to add CSP formalisms to Python.
I/O-intensive programs often benefit from concurrent processing. The idea is
to interleave I/O and processing. CPU-intensive programs rarely benefit from
attempting concurrent processing.
Using multiprocessing pools and tasks
To make non-strict evaluation available in a larger context, the multiprocessing
package introduces the concept of a Pool object. We can create a Pool object of
concurrent worker processes, assign tasks to them, and expect the tasks to be
executed concurrently. As noted previously, concurrent execution does not
actually mean simultaneous execution. It means that the order of completion is
difficult to predict because we've allowed OS scheduling to interleave the
execution of multiple processes.
For some applications, this permits more work to be done in less elapsed time.
To make the most use of this capability, we need to decompose our application into
components for which non-strict concurrent execution is beneficial. We'd like to
define discrete tasks that can be processed in an indefinite order.
An application that gathers data from the Internet via web scraping is often optimized
through parallel processing. We can create a Pool object of several identical website
scrapers. The tasks are URLs to be analyzed by the pooled processes.