Chapter 7. Input/Output Operations
It is a capital mistake to theorize before one has data.
— Sherlock Holmes
As a general rule, the majority of data, be it in a finance context or any other application
area, is stored on hard disk drives (HDDs) or some other form of permanent storage
device, like solid state disks (SSDs) or hybrid disk drives. Storage capacities have been
steadily increasing over the years, while costs per storage unit (e.g., megabytes) have been
steadily falling.
At the same time, stored data volumes have been increasing at a much faster pace than the
typical random access memory (RAM) available even in the largest machines. This makes
it necessary not only to store data to disk for permanent storage, but also to compensate
for lack of sufficient RAM by swapping data from RAM to disk and back.
Input/output (I/O) operations are therefore generally very important tasks when it comes
to finance applications and data-intensive applications in general. Often they represent the
bottleneck for performance-critical computations, since I/O operations cannot in general
shuffle data fast enough from disk to RAM and from RAM back to disk. In a sense, CPUs
are often “starving” due to slow I/O operations.
Although the majority of today’s financial and corporate analytics efforts are confronted
with “big” data (e.g., of petascale size), single analytics tasks generally use data (sub)sets
that fall in the “mid” data category. A recent study concluded:
Our measurements as well as other recent work shows that the majority of real-world analytic jobs process less
than 100 GB of input, but popular infrastructures such as Hadoop/MapReduce were originally designed for
petascale processing.
— Appuswamy et al. (2013)
In terms of frequency, single financial analytics tasks generally process data of not more
than a couple of gigabytes (GB) in size — and this is a sweet spot for Python and the
libraries of its scientific stack, like NumPy, pandas, and PyTables. Data sets of such a size
can also be analyzed in-memory, leading to generally high speeds with today’s CPUs and
GPUs. However, the data has to be read into RAM and the results have to be written to
disk, all while meeting today’s performance requirements.
This chapter addresses the following areas:
Basic I/O
Python has built-in functions to serialize and store any object on disk and to read it
from disk into RAM; apart from that, Python is strong when it comes to working
with text files and SQL databases. NumPy also provides dedicated functions for fast
storage and retrieval of ndarray objects.
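As a minimal sketch of these two approaches, the following example serializes an arbitrary Python object with the built-in pickle module and stores an ndarray object with NumPy’s dedicated binary format (the temporary directory and file names are illustrative choices, not prescribed by the text):

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np

# Hypothetical scratch location for the example files.
path = Path(tempfile.mkdtemp())

# Serialize an arbitrary Python object to disk and read it back.
data = {"symbol": "XYZ", "prices": [100.5, 101.2, 99.8]}
with open(path / "data.pkl", "wb") as f:
    pickle.dump(data, f)
with open(path / "data.pkl", "rb") as f:
    restored = pickle.load(f)
assert restored == data

# NumPy's binary .npy format for fast storage/retrieval of ndarrays.
a = np.random.standard_normal((1000, 5))
np.save(path / "array.npy", a)
b = np.load(path / "array.npy")
assert (a == b).all()
```

Pickle works for (almost) any Python object, while `np.save`/`np.load` avoid the serialization overhead for homogeneous numerical arrays.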
I/O with pandas
The pandas library provides a wealth of convenience functions and methods to
read data stored in different formats (e.g., CSV, JSON) and to write data to files in
diverse formats.
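A brief sketch of this round-tripping with pandas, using the two formats mentioned above (the sample DataFrame and the temporary directory are made up for illustration):

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical scratch location for the example files.
path = Path(tempfile.mkdtemp())

df = pd.DataFrame({"symbol": ["ABC", "XYZ"], "price": [100.5, 99.8]})

# Write the same data in two formats ...
df.to_csv(path / "data.csv", index=False)
df.to_json(path / "data.json")

# ... and read each back into a DataFrame.
from_csv = pd.read_csv(path / "data.csv")
from_json = pd.read_json(path / "data.json")

assert from_csv.equals(df)
```

Each reader (`pd.read_csv`, `pd.read_json`, etc.) accepts many parameters to control parsing, which is what makes pandas convenient for heterogeneous data sources.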