Python for Finance: Analyze Big Financial Data

This indeed takes much longer than before. However, the compressed file is only about 20% of the original size, saving roughly 80% of the space on disk. This can matter for backup routines or when shuffling large data sets between servers or even data centers:

In [130]: ll $path*
Out[130]: -rw-r--r-- 1 root 200313168 28. Sep 15:18 /flash/data/tab.h5
          -rw-r--r-- 1 root  41335178 28. Sep 15:18 /flash/data/tab.h5c

In [131]: h5c.close()
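As a quick sanity check, the figures quoted in the text can be recomputed from the two file sizes in the listing. The snippet below is only an arithmetic sketch; the variable names are illustrative and not part of the original session.

```python
# File sizes in bytes, copied from the directory listing.
size_uncompressed = 200313168  # tab.h5
size_compressed = 41335178     # tab.h5c

# Fraction of the original size that the compressed file occupies.
ratio = size_compressed / size_uncompressed
# Fraction of disk space saved by compression.
saving = 1 - ratio

print(f"compressed / original: {ratio:.1%}")
print(f"disk space saved:      {saving:.1%}")
```

The compressed file comes out at a little over one fifth of the original, which matches the "about 20%" stated in the text.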

Working with Arrays

We have already seen that NumPy has built-in fast writing and reading capabilities for ndarray objects. PyTables is also quite fast and efficient when it comes to storing and retrieving ndarray objects:

In [132]: %%time
          arr_int = h5.create_array('/', 'integers', ran_int)
          arr_flo = h5.create_array('/', 'floats', ran_flo)
Out[132]: CPU times: user 2 ms, sys: 33 ms, total: 35 ms
          Wall time: 35 ms

Writing these objects directly to an HDF5 database is of course much faster than looping over the objects and writing the data row by row to a Table object. A final inspection of the database now shows three objects in it, namely the table and the two arrays:

In [133]: h5
Out[133]: File(filename=/flash/data/tab.h5, title=u'', mode='w', root_uep='/',
          filters=Filters(complevel=0, shuffle=False, fletcher32=False,
          least_significant_digit=None))
          / (RootGroup) u''
          /floats (Array(2000000, 2)) ''
            atom := Float64Atom(shape=(), dflt=0.0)
            maindim := 0
            flavor := 'numpy'
            byteorder := 'little'
            chunkshape := None
          /integers (Array(2000000, 2)) ''
            atom := Int64Atom(shape=(), dflt=0)
            maindim := 0
            flavor := 'numpy'
            byteorder := 'little'
            chunkshape := None
          /ints_floats (Table(2000000,)) 'Integers and Floats'
            description := {
            "Date": StringCol(itemsize=26, shape=(), dflt='', pos=0),
            "No1": Int32Col(shape=(), dflt=0, pos=1),
            "No2": Int32Col(shape=(), dflt=0, pos=2),
            "No3": Float64Col(shape=(), dflt=0.0, pos=3),
            "No4": Float64Col(shape=(), dflt=0.0, pos=4)}
            byteorder := 'little'
            chunkshape := (2621,)

In [134]: ll $path*
Out[134]: -rw-r--r-- 1 root 200313168 28. Sep 15:18 /flash/data/tab.h5
          -rw-r--r-- 1 root  41335178 28. Sep 15:18 /flash/data/tab.h5c

In [135]: h5.close()

In [136]: !rm -f $path*
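The array workflow of this section can also be reproduced outside the interactive session. The following is a minimal sketch, assuming NumPy and PyTables are installed. The array size, the random data, and the temporary file location are hypothetical stand-ins for the book's ran_int, ran_flo, and /flash/data/ path, which are defined earlier in the chapter.

```python
import os
import tempfile

import numpy as np
import tables as tb

# Hypothetical small stand-ins for ran_int and ran_flo
# (the session in the text uses 2,000,000 rows).
rows = 10_000
rng = np.random.default_rng(0)
ran_int = rng.integers(0, 10_000, size=(rows, 2))
ran_flo = rng.standard_normal((rows, 2))

with tempfile.TemporaryDirectory() as tmp_dir:
    # Temporary file used here instead of the book's own path.
    path = os.path.join(tmp_dir, "arrays.h5")

    with tb.open_file(path, "w") as h5:
        # Each ndarray is stored with a single call; no row-by-row loop.
        h5.create_array("/", "integers", ran_int)
        h5.create_array("/", "floats", ran_flo)

    with tb.open_file(path, "r") as h5:
        # read() loads the complete data set back into an ndarray.
        int_back = h5.root.integers.read()
        flo_back = h5.root.floats.read()
        # Names of all nodes stored directly under the root group.
        node_names = sorted(node._v_name for node in h5.list_nodes("/"))

print(node_names)
print(np.array_equal(int_back, ran_int), np.array_equal(flo_back, ran_flo))
```

Using the file objects as context managers closes the database even if an error occurs, which replaces the explicit h5.close() call of the interactive session.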

HDF5-BASED DATA STORAGE

The HDF5 database (file) format is a powerful alternative to, for example, relational databases when it comes to structured numerical and financial data. Whether you use PyTables directly or combine it with the capabilities of pandas, you can expect to get almost the maximum I/O performance that the available hardware allows.

Out-of-Memory Computations