Conclusions
SQL-based (i.e., relational) databases have advantages when it comes to complex data
structures that exhibit many relations between single objects/tables. In some
circumstances, this might justify their performance disadvantage compared with pure NumPy
ndarray-based or pandas DataFrame-based approaches.
However, many application areas in finance, or in science in general, can succeed with a
mainly array-based data modeling approach. In these cases, huge performance
improvements can be realized by making use of native NumPy I/O capabilities, a
combination of NumPy and PyTables capabilities, or the pandas approach via
HDF5-based stores.
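As a reminder of what native NumPy I/O looks like, the following is a minimal sketch rather than the chapter's original code. The array size, the random seed, and the file name array.npy are assumptions chosen only for illustration.

```python
import os
import tempfile

import numpy as np

# Hypothetical sample data: one million normally distributed floats.
rng = np.random.default_rng(seed=42)
data = rng.standard_normal(1_000_000)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "array.npy")  # illustrative file name
    np.save(path, data)     # write the array in NumPy's binary .npy format
    loaded = np.load(path)  # read the complete array back into memory

# The round trip preserves dtype, shape, and values.
same = np.array_equal(data, loaded)
print(same, loaded.dtype, loaded.shape)
```

Because the .npy format stores the raw array buffer together with a small header, no conversion between Python objects and database types is needed, which is one plausible reason for the speed advantage described above.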
While a recent trend has been to use cloud-based solutions — where the cloud is made up
of a large number of computing nodes based on commodity hardware — one should
carefully consider, especially in a financial context, which hardware architecture best
serves the analytics requirements. A recent study by Microsoft sheds some light on this
topic:
We claim that a single “scale-up” server can process each of these jobs and do as well or better than a cluster in
terms of performance, cost, power, and server density.
— Appuswamy et al. (2013)
Companies, research institutions, and others involved in data analytics should therefore
first analyze which specific tasks have to be accomplished and then decide on the
hardware/software architecture, in terms of:
Scaling out
Using a cluster with many commodity nodes with standard CPUs and relatively little
memory
Scaling up
Using one or a few powerful servers with many-core CPUs, possibly a GPU, and
large amounts of memory
Our out-of-memory analytics example in this chapter underpins this observation. The
out-of-memory calculation of the numerical expression with PyTables takes roughly 1.5
minutes on standard hardware. The same task executed in-memory (using the numexpr
library) takes about 4 seconds, while reading the whole data set from disk takes just over 5
seconds. This value is from an eight-core server with enough memory (in this particular
case, 64 GB of RAM) and an SSD. Therefore, scaling up hardware and applying
different implementation approaches might significantly influence performance. More on
this in the next chapter.
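The difference between the two implementation approaches can be sketched as follows. This is not the chapter's PyTables/numexpr code: it swaps in np.memmap with chunk-wise evaluation as a stand-in for the out-of-memory route and plain NumPy for the in-memory route. The expression, the array size, the chunk size, and the function names are all hypothetical, and the sketch makes no attempt to reproduce the timings quoted above.

```python
import os
import tempfile

import numpy as np


def expression(x):
    # Hypothetical numerical expression, chosen only for illustration.
    return 3.0 * np.sin(x) + np.sqrt(np.abs(x))


def evaluate_in_chunks(path, n_rows, chunk_rows=100_000):
    """Sum the expression over a disk-based array, one chunk at a time."""
    source = np.memmap(path, dtype="float64", mode="r", shape=(n_rows,))
    total = 0.0
    for start in range(0, n_rows, chunk_rows):
        # Copy only the current slice from disk into memory.
        chunk = np.array(source[start:start + chunk_rows])
        total += expression(chunk).sum()
    del source  # release the memory map before the file is removed
    return total


n_rows = 1_000_000
rng = np.random.default_rng(seed=0)
data = rng.standard_normal(n_rows)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "data.bin")  # illustrative file name
    data.tofile(path)  # raw float64 values on disk
    out_of_memory_total = evaluate_in_chunks(path, n_rows)

# In-memory route: the whole array is already in RAM.
in_memory_total = expression(data).sum()

print(np.isclose(out_of_memory_total, in_memory_total))
```

Both routes arrive at the same result. The chunked version trades speed for a memory footprint bounded by chunk_rows, whereas the in-memory version requires that the complete data set fit into RAM, which is what a scaled-up server provides.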