Python for Finance: Analyze Big Financial Data


Conclusions


SQL-based (i.e., relational) databases have advantages when it comes to complex data structures that exhibit many relations between single objects/tables. In such circumstances, these advantages might justify their performance disadvantage relative to pure NumPy ndarray-based or pandas DataFrame-based approaches.
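The relational strength mentioned above can be illustrated with Python's built-in sqlite3 module. The following is a minimal sketch with hypothetical trades and prices tables; the schema and data are invented for illustration only:

```python
import sqlite3

# In-memory SQLite database with two related tables (hypothetical schema)
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE trades (id INTEGER PRIMARY KEY, sym TEXT, qty INTEGER)")
con.execute("CREATE TABLE prices (sym TEXT, px REAL)")
con.executemany("INSERT INTO trades (sym, qty) VALUES (?, ?)",
                [("AAPL", 100), ("MSFT", 50)])
con.executemany("INSERT INTO prices VALUES (?, ?)",
                [("AAPL", 150.0), ("MSFT", 300.0)])

# A join across tables -- exactly the kind of relational query
# that SQL databases handle naturally
rows = con.execute("""SELECT t.sym, t.qty * p.px AS notional
                      FROM trades t JOIN prices p ON t.sym = p.sym
                      ORDER BY t.sym""").fetchall()
print(rows)  # [('AAPL', 15000.0), ('MSFT', 15000.0)]
con.close()
```

Expressing the same join logic over raw ndarray objects would require manual index bookkeeping, which is where the relational model earns its keep.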


However, many application areas in finance, and in science in general, can succeed with a mainly array-based data modeling approach. In these cases, huge performance improvements can be realized by making use of native NumPy I/O capabilities, of a combination of NumPy and PyTables capabilities, or of the pandas approach via HDF5-based stores.
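As a small sketch of the first of these options, NumPy's native binary I/O writes an ndarray to disk without any serialization overhead (array shape and size here are illustrative only):

```python
import os
import tempfile

import numpy as np

# Sample array; sizes are illustrative, not those used in the chapter
data = np.random.standard_normal((1_000, 100))

# Native NumPy binary I/O: raw ndarray bytes plus a small header
path = os.path.join(tempfile.mkdtemp(), "data.npy")
np.save(path, data)      # write the .npy file
loaded = np.load(path)   # read it back (mmap_mode="r" would memory-map instead)

assert np.array_equal(data, loaded)
```

The pandas route works similarly at a higher level of convenience: `DataFrame.to_hdf()` and `pandas.read_hdf()` persist whole DataFrame objects in HDF5-based stores (with PyTables installed underneath).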


While a recent trend has been to use cloud-based solutions (where the cloud is made up of a large number of computing nodes based on commodity hardware), one should carefully consider, especially in a financial context, which hardware architecture best serves the analytics requirements. A recent study by Microsoft sheds some light on this topic:


We claim that a single “scale-up” server can process each of these jobs and do as well or better than a cluster in terms of performance, cost, power, and server density.

— Appuswamy et al. (2013)

Companies, research institutions, and others involved in data analytics should therefore first analyze what specific tasks have to be accomplished in general and then decide on the hardware/software architecture, in terms of:


Scaling out
Using a cluster with many commodity nodes with standard CPUs and relatively low memory


Scaling up
Using one or a few powerful servers with many-core CPUs, possibly a GPU, and large amounts of memory


Our out-of-memory analytics example in this chapter underpins this observation. The out-of-memory calculation of the numerical expression with PyTables takes roughly 1.5 minutes on standard hardware. The same task executed in-memory (using the numexpr library) takes about 4 seconds, while reading the whole data set from disk takes just over 5 seconds. These values are from an eight-core server with enough memory (in this particular case, 64 GB of RAM) and an SSD drive. Therefore, scaling up hardware and applying different implementation approaches might significantly influence performance. More on this in the next chapter.
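The in-memory variant can be sketched as follows. The expression and array size below are illustrative only (the chapter's actual expression and data set differ); the point is that numexpr compiles the whole expression and evaluates it in chunks across multiple threads, avoiding the large temporary arrays that plain NumPy allocates for each sub-expression:

```python
import numpy as np
import numexpr as ne

# Illustrative data set; the chapter's actual array is much larger
x = np.random.standard_normal(1_000_000)

# Plain NumPy: each sub-expression materializes a full temporary array
res_np = 3 * np.sin(x) + np.abs(x) ** 0.5

# numexpr: one compiled, multithreaded, chunked evaluation pass
res_ne = ne.evaluate("3 * sin(x) + abs(x) ** 0.5")

assert np.allclose(res_np, res_ne)
```

The same expression string can also be evaluated out-of-memory against disk-based arrays via PyTables' `tables.Expr`, which is the mechanism behind the roughly 1.5-minute figure quoted above.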
