Python for Finance: Analyze Big Financial Data


Conclusions


SQL-based (i.e., relational) databases have advantages when it comes to complex data structures that exhibit many relations between single objects/tables. In such circumstances, these advantages might justify their performance disadvantage relative to pure NumPy ndarray-based or pandas DataFrame-based approaches.
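The relational strength mentioned above can be illustrated with Python's built-in sqlite3 module. The following is a minimal sketch with hypothetical trades and prices tables; the schema and data are invented for illustration only:

```python
import sqlite3

# In-memory SQLite database with two related tables (hypothetical schema)
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE trades (id INTEGER PRIMARY KEY, sym TEXT, qty INTEGER)")
con.execute("CREATE TABLE prices (sym TEXT, px REAL)")
con.executemany("INSERT INTO trades (sym, qty) VALUES (?, ?)",
                [("AAPL", 100), ("MSFT", 50)])
con.executemany("INSERT INTO prices VALUES (?, ?)",
                [("AAPL", 150.0), ("MSFT", 300.0)])

# A join across tables -- exactly the kind of relational query
# that SQL databases handle naturally
rows = con.execute("""SELECT t.sym, t.qty * p.px AS notional
                      FROM trades t JOIN prices p ON t.sym = p.sym
                      ORDER BY t.sym""").fetchall()
print(rows)  # [('AAPL', 15000.0), ('MSFT', 15000.0)]
con.close()
```

Expressing the same join logic over raw ndarray objects would require manual index bookkeeping, which is where the relational model earns its keep.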


However, many application areas in finance, and in science in general, can succeed with a mainly array-based data modeling approach. In these cases, huge performance improvements can be realized by making use of native NumPy I/O capabilities, of a combination of NumPy and PyTables capabilities, or of the pandas approach via HDF5-based stores.
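As a small sketch of the first of these options, NumPy's native binary I/O writes an ndarray to disk without any serialization overhead (array shape and size here are illustrative only):

```python
import os
import tempfile

import numpy as np

# Sample array; sizes are illustrative, not those used in the chapter
data = np.random.standard_normal((1_000, 100))

# Native NumPy binary I/O: raw ndarray bytes plus a small header
path = os.path.join(tempfile.mkdtemp(), "data.npy")
np.save(path, data)      # write the .npy file
loaded = np.load(path)   # read it back (mmap_mode="r" would memory-map instead)

assert np.array_equal(data, loaded)
```

The pandas route works similarly at a higher level of convenience: `DataFrame.to_hdf()` and `pandas.read_hdf()` persist whole DataFrame objects in HDF5-based stores (with PyTables installed underneath).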


While a recent trend has been to use cloud-based solutions (where the cloud is made up of a large number of computing nodes based on commodity hardware), one should carefully consider, especially in a financial context, which hardware architecture best serves the analytics requirements. A recent study by Microsoft sheds some light on this topic:


We claim that a single “scale-up” server can process each of these jobs and do as well or better than a cluster in terms of performance, cost, power, and server density.

— Appuswamy et al. (2013)

Companies, research institutions, and others involved in data analytics should therefore first analyze what specific tasks have to be accomplished in general and then decide on the hardware/software architecture, in terms of:


Scaling out
Using a cluster with many commodity nodes with standard CPUs and relatively low memory


Scaling up
Using one or a few powerful servers with many-core CPUs, possibly a GPU, and large amounts of memory


Our out-of-memory analytics example in this chapter underpins this observation. The out-of-memory calculation of the numerical expression with PyTables takes roughly 1.5 minutes on standard hardware. The same task executed in-memory (using the numexpr library) takes about 4 seconds, while reading the whole data set from disk takes just over 5 seconds. These values are from an eight-core server with enough memory (in this particular case, 64 GB of RAM) and an SSD drive. Therefore, scaling up hardware and applying different implementation approaches might significantly influence performance. More on this in the next chapter.
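The in-memory variant can be sketched as follows. The expression and array size below are illustrative only (the chapter's actual expression and data set differ); the point is that numexpr compiles the whole expression and evaluates it in chunks across multiple threads, avoiding the large temporary arrays that plain NumPy allocates for each sub-expression:

```python
import numpy as np
import numexpr as ne

# Illustrative data set; the chapter's actual array is much larger
x = np.random.standard_normal(1_000_000)

# Plain NumPy: each sub-expression materializes a full temporary array
res_np = 3 * np.sin(x) + np.abs(x) ** 0.5

# numexpr: one compiled, multithreaded, chunked evaluation pass
res_ne = ne.evaluate("3 * sin(x) + abs(x) ** 0.5")

assert np.allclose(res_np, res_ne)
```

The same expression string can also be evaluated out-of-memory against disk-based arrays via PyTables' `tables.Expr`, which is the mechanism behind the roughly 1.5-minute figure quoted above.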
