Chapter 4
We've explicitly separated low-level XML parsing from the higher-level reorganization
of the data. The XML parsing produced a generic tuple-of-strings structure, which
is compatible with the output from the CSV parser. When working with SQL
databases, we'll have a similar iterable-of-tuples structure. This allows us to write
higher-level processing code that can work with data from a variety of sources.
We'll show a series of transformations that rearrange this data from a collection of
strings into a collection of waypoints along a route. We'll need to restructure the
data as well as convert from strings to floating-point values. We'll also look at a
few ways to simplify and clarify the subsequent processing steps. We'll use this
data set in later chapters because it's reasonably complex.
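As a minimal sketch of why a common tuple-of-strings structure helps, the following converts rows of strings to floating-point pairs without caring where the rows came from. The function name to_float_pairs(), the (latitude, longitude) column order, and the sample values are all assumptions made for illustration; they are not taken from the data set itself.

```python
import csv
import io
from typing import Iterable, Iterator, Sequence, Tuple

def to_float_pairs(rows: Iterable[Sequence[str]]) -> Iterator[Tuple[float, float]]:
    """Convert two-column rows of strings into tuples of floats.

    Assumes each row is (latitude, longitude); this ordering is hypothetical.
    """
    return ((float(lat), float(lon)) for lat, lon in rows)

# Made-up sample values, supplied once by a CSV reader...
csv_rows = csv.reader(io.StringIO("37.54,-76.33\n37.84,-76.27\n"))
# ...and once as plain tuples, standing in for parsed XML or SQL rows.
tuple_rows = [("37.54", "-76.33"), ("37.84", "-76.27")]

from_csv = list(to_float_pairs(csv_rows))
from_tuples = list(to_float_pairs(tuple_rows))
```

Both sources yield the same list of float pairs, because the conversion depends only on the iterable-of-sequences shape and not on the parser that produced it.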
Pairing up items from a sequence
A common restructuring requirement is to make start-stop pairs out of points in
a sequence. Given a sequence, $S = \{s_0, s_1, s_2, \ldots, s_n\}$, we want to create a paired
sequence $\hat{S} = \{(s_0, s_1), (s_1, s_2), \ldots, (s_{n-1}, s_n)\}$. When doing time-series analysis, we might
be combining more widely separated values. In this example, it's adjacent values.
A paired sequence will allow us to use each pair to compute distances from point to
point using a trivial application of a haversine function. This technique is also used
to convert a path of points into a series of line segments in a graphics application.
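One way to sketch the pairing of adjacent items is shown here. It follows the well-known tee-and-zip approach from the itertools recipes; the name pairs() and the sample route labels are assumptions for illustration, not necessarily the implementation this chapter develops.

```python
from itertools import tee
from typing import Iterable, Iterator, Tuple, TypeVar

T = TypeVar("T")

def pairs(items: Iterable[T]) -> Iterator[Tuple[T, T]]:
    """Lazily yield (s0, s1), (s1, s2), ..., (s[n-1], s[n])."""
    first, second = tee(items)  # two independent iterators over the same items
    next(second, None)          # advance the second iterator by one item
    return zip(first, second)   # zip stops when the shorter iterator is exhausted

# Hypothetical waypoint labels standing in for real points.
route = ["p0", "p1", "p2", "p3"]
legs = list(pairs(route))
# legs == [("p0", "p1"), ("p1", "p2"), ("p2", "p3")]
```

Because pairs() knows nothing about distances, it can be tested on its own and then combined with any function that accepts a begin-end pair.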
Why pair up items? Why not do something like this?
    begin = next(iterable)
    for end in iterable:
        compute_something(begin, end)
        begin = end
This, clearly, will process each leg of the data as a begin-end pair. However,
the processing function and the loop that restructures the data are tightly bound,
making reuse more complex than necessary. The algorithm for pairing is hard to
test in isolation because it's bound to the compute_something() function.
This combined function also limits our ability to reconfigure the application. There's
no easy way to inject an alternative implementation of the compute_something()
function. Additionally, we've got a piece of explicit state, the begin variable, which
makes life potentially complex. If we try to add features to the body of the loop, we
can easily fail to set the begin variable correctly when a point is dropped from
consideration. Filtering inside the loop, for example, introduces an if statement, and
a misplaced update of the begin variable then silently produces incorrect pairs.
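The hazard can be illustrated with a small sketch. The function names, the integer "points", and the is_valid() rule that rejects the value 99 are all hypothetical; the intent is only to show how a filter added inside the combined loop can mishandle the begin variable, and how separating filtering from pairing avoids the problem.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

def legs_inline_filter(
    points: Iterable[int], keep: Callable[[int], bool]
) -> List[Tuple[int, int]]:
    """Combined loop with a filter added to the body (deliberately buggy)."""
    it = iter(points)
    begin = next(it)
    result = []
    for end in it:
        if keep(end):
            result.append((begin, end))
        begin = end  # Bug: runs even when `end` was rejected by the filter.
    return result

def pairs(items: Iterable[int]) -> Iterator[Tuple[int, int]]:
    """Pairing only; the begin variable is confined to this one function."""
    it = iter(items)
    try:
        begin = next(it)
    except StopIteration:
        return  # An empty input produces no pairs.
    for end in it:
        yield begin, end
        begin = end

def is_valid(point: int) -> bool:
    """Hypothetical rule: 99 marks a bad reading."""
    return point != 99

raw = [0, 1, 99, 2, 3]
buggy = legs_inline_filter(raw, is_valid)    # [(0, 1), (99, 2), (2, 3)]
clean = list(pairs(filter(is_valid, raw)))   # [(0, 1), (1, 2), (2, 3)]
```

In the combined loop, the rejected point leaks back in as the start of the next leg. When the filter is applied before pairing, the pairing logic never sees the rejected point, and each piece can be tested separately.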