Chapter 7. Input/Output Operations
It is a capital mistake to theorize before one has data.
— Sherlock Holmes
As a general rule, the majority of data, be it in a finance context or any other application
area, is stored on hard disk drives (HDDs) or some other form of permanent storage
device, like solid state disks (SSDs) or hybrid disk drives. Storage capacities have been
steadily increasing over the years, while costs per storage unit (e.g., megabytes) have been
steadily falling.
At the same time, stored data volumes have been increasing at a much faster pace than the
typical random access memory (RAM) available even in the largest machines. This makes
it necessary not only to store data to disk for permanent storage, but also to compensate
for lack of sufficient RAM by swapping data from RAM to disk and back.
Input/output (I/O) operations are therefore generally very important tasks when it comes
to finance applications and data-intensive applications in general. Often they represent the
bottleneck for performance-critical computations, since I/O operations cannot in general
shuffle data fast enough from disk to RAM and from RAM back to disk. In a sense, CPUs
are often “starving” due to slow I/O operations.
Although the majority of today’s financial and corporate analytics efforts are confronted
with “big” data (e.g., of petascale size), single analytics tasks generally use data (sub)sets
that fall in the “mid” data category. A recent study concluded:
Our measurements as well as other recent work shows that the majority of real-world analytic jobs process less
than 100 GB of input, but popular infrastructures such as Hadoop/MapReduce were originally designed for
petascale processing.
— Appuswamy et al. (2013)
In terms of frequency, single financial analytics tasks generally process data of not more
than a couple of gigabytes (GB) in size — and this is a sweet spot for Python and the
libraries of its scientific stack, like NumPy, pandas, and PyTables. Data sets of such a size
can also be analyzed in-memory, leading to generally high speeds with today’s CPUs and
GPUs. However, the data has to be read into RAM and the results have to be written to
disk, all while meeting today’s performance requirements.
This chapter addresses the following areas:
Basic I/O
Python has built-in functions to serialize and store any object on disk and to read it
from disk into RAM; apart from that, Python is strong when it comes to working
with text files and SQL databases. NumPy also provides dedicated functions for fast
storage and retrieval of ndarray objects.
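As a minimal sketch of these two approaches, the following example serializes an arbitrary Python object with the built-in pickle module and stores an ndarray object with NumPy’s dedicated binary format (the temporary directory and file names are illustrative choices, not prescribed by the text):

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np

# Hypothetical scratch location for the example files.
path = Path(tempfile.mkdtemp())

# Serialize an arbitrary Python object to disk and read it back.
data = {"symbol": "XYZ", "prices": [100.5, 101.2, 99.8]}
with open(path / "data.pkl", "wb") as f:
    pickle.dump(data, f)
with open(path / "data.pkl", "rb") as f:
    restored = pickle.load(f)
assert restored == data

# NumPy's binary .npy format for fast storage/retrieval of ndarrays.
a = np.random.standard_normal((1000, 5))
np.save(path / "array.npy", a)
b = np.load(path / "array.npy")
assert (a == b).all()
```

Pickle works for (almost) any Python object, while `np.save`/`np.load` avoid the serialization overhead for homogeneous numerical arrays.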
I/O with pandas
The pandas library provides a wealth of convenience functions and methods to
read data stored in different formats (e.g., CSV, JSON) and to write data to files in
diverse formats.
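A brief sketch of this round-tripping with pandas, using the two formats mentioned above (the sample DataFrame and the temporary directory are made up for illustration):

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical scratch location for the example files.
path = Path(tempfile.mkdtemp())

df = pd.DataFrame({"symbol": ["ABC", "XYZ"], "price": [100.5, 99.8]})

# Write the same data in two formats ...
df.to_csv(path / "data.csv", index=False)
df.to_json(path / "data.json")

# ... and read each back into a DataFrame.
from_csv = pd.read_csv(path / "data.csv")
from_json = pd.read_json(path / "data.json")

assert from_csv.equals(df)
```

Each reader (`pd.read_csv`, `pd.read_json`, etc.) accepts many parameters to control parsing, which is what makes pandas convenient for heterogeneous data sources.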