Python for Finance: Analyze Big Financial Data


Chapter 7. Input/Output Operations


It is a capital mistake to theorize before one has data.

— Sherlock Holmes

As a general rule, the majority of data, be it in a finance context or any other application area, is stored on hard disk drives (HDDs) or some other form of permanent storage device, like solid state disks (SSDs) or hybrid disk drives. Storage capacities have been steadily increasing over the years, while costs per storage unit (e.g., megabytes) have been steadily falling.


At the same time, stored data volumes have been increasing at a much faster pace than the typical random access memory (RAM) available even in the largest machines. This makes it necessary not only to store data to disk for permanent storage, but also to compensate for lack of sufficient RAM by swapping data from RAM to disk and back.


Input/output (I/O) operations are therefore generally very important tasks when it comes to finance applications and data-intensive applications in general. Often they represent the bottleneck for performance-critical computations, since I/O operations cannot in general shuffle data fast enough from disk to RAM and from RAM back to disk. In a sense, CPUs are often “starving” due to slow I/O operations.


Although the majority of today’s financial and corporate analytics efforts are confronted with “big” data (e.g., of petascale size), single analytics tasks generally use data (sub)sets that fall in the “mid” data category. A recent study concluded:


Our measurements as well as other recent work shows that the majority of real-world analytic jobs process less than 100 GB of input, but popular infrastructures such as Hadoop/MapReduce were originally designed for petascale processing.

— Appuswamy et al. (2013)

In terms of frequency, single financial analytics tasks generally process data of not more than a couple of gigabytes (GB) in size — and this is a sweet spot for Python and the libraries of its scientific stack, like NumPy, pandas, and PyTables. Data sets of such a size can also be analyzed in-memory, leading to generally high speeds with today’s CPUs and GPUs. However, the data has to be read into RAM and the results have to be written to disk, all while meeting today’s performance requirements.


This chapter addresses the following areas:


Basic I/O

Python has built-in functions to serialize and store any object on disk and to read it from disk into RAM; apart from that, Python is strong when it comes to working with text files and SQL databases. NumPy also provides dedicated functions for fast storage and retrieval of ndarray objects (see the first sketch after this list).


I/O with pandas

The pandas library provides a plenitude of convenience functions and methods to read data stored in different formats (e.g., CSV, JSON) and to write data to files in diverse formats (see the second sketch after this list).
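To make the first item concrete, here is a minimal sketch of serializing an arbitrary Python object with pickle and of NumPy's dedicated functions for fast ndarray storage; the file names and sample values are illustrative placeholders, not prescribed by the chapter:

    import pickle
    import numpy as np

    # serialize an arbitrary Python object to disk with pickle ...
    data = {'symbol': 'AAPL', 'prices': [99.5, 101.2, 100.8]}
    with open('data.pkl', 'wb') as f:
        pickle.dump(data, f)

    # ... and deserialize it from disk back into RAM
    with open('data.pkl', 'rb') as f:
        data = pickle.load(f)

    # NumPy's dedicated functions for fast ndarray storage and retrieval
    a = np.random.standard_normal((1000, 5))
    np.save('array.npy', a)   # writes the array as a binary .npy file
    b = np.load('array.npy')  # reads the array back into RAM

Note that np.save appends the .npy extension automatically if the file name lacks one; writing it out explicitly, as above, simply keeps the name predictable.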
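Similarly, a minimal sketch of the pandas convenience functions from the second item, again with placeholder file names and made-up sample values:

    import pandas as pd

    # a small sample DataFrame with made-up price data
    df = pd.DataFrame({'open': [100.0, 101.5], 'close': [101.2, 100.8]})

    # write to and read back from CSV
    df.to_csv('data.csv', index=False)
    df_csv = pd.read_csv('data.csv')

    # write to and read back from JSON
    df.to_json('data.json')
    df_json = pd.read_json('data.json')

Each storage format has a matching read_*/to_* pair in pandas, which is what makes switching between formats largely a one-line change.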
