Python for Finance: Analyze Big Financial Data

This indeed takes much longer than before. However, the compressed file is only about 20% of the size of the original, saving roughly 80% of the space on disk. This may be of importance for backup routines or when shuffling large data sets between servers or even data centers:


In [130]: ll $path*
Out[130]: -rw-r--r-- 1 root 200313168 28. Sep 15:18 /flash/data/tab.h5
          -rw-r--r-- 1 root  41335178 28. Sep 15:18 /flash/data/tab.h5c
In [131]: h5c.close()
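
The 20% and 80% figures can be checked directly from the two file sizes in the listing above. A minimal sketch, using only the byte counts shown there (the variable names are illustrative, not from the original session):

```python
# File sizes in bytes, copied from the directory listing above.
size_plain = 200313168       # tab.h5  (uncompressed)
size_compressed = 41335178   # tab.h5c (compressed)

ratio = size_compressed / size_plain  # compressed size relative to original
saving = 1 - ratio                    # share of disk space saved

print(f"compressed/original: {ratio:.1%}")   # 20.6%
print(f"space saved:         {saving:.1%}")  # 79.4%
```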

Working with Arrays


We have already seen that NumPy has built-in fast writing and reading capabilities for ndarray objects. PyTables is also quite fast and efficient when it comes to storing and retrieving ndarray objects:


In [132]: %%time
          arr_int = h5.create_array('/', 'integers', ran_int)
          arr_flo = h5.create_array('/', 'floats', ran_flo)
Out[132]: CPU times: user 2 ms, sys: 33 ms, total: 35 ms
          Wall time: 35 ms
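
To make the storing-and-retrieving round trip concrete, the following self-contained sketch writes two small arrays to a temporary HDF5 file and reads them back. It assumes PyTables 3.x and NumPy are installed. The sample data, the file name, and the result variables are stand-ins chosen for illustration; the arrays `ran_int` and `ran_flo` in the session above are much larger and were created earlier in the chapter.

```python
import os
import tempfile

import numpy as np
import tables as tb

# Hypothetical sample data (small stand-ins for the session's arrays).
rng = np.random.default_rng(0)
ran_int = rng.integers(0, 10000, size=(1000, 2))
ran_flo = rng.standard_normal((1000, 2)).round(5)

with tempfile.TemporaryDirectory() as tmp:
    filename = os.path.join(tmp, "arrays.h5")  # assumed file name

    # Store each ndarray with a single call, as in the session above.
    with tb.open_file(filename, "w") as h5:
        h5.create_array("/", "integers", ran_int)
        h5.create_array("/", "floats", ran_flo)

    # Reopen read-only and load both arrays back into memory.
    with tb.open_file(filename, "r") as h5:
        int_back = h5.root.integers.read()
        flo_back = h5.root.floats.read()

# The retrieved arrays should match the originals element by element.
roundtrip_ok = (np.array_equal(ran_int, int_back)
                and np.array_equal(ran_flo, flo_back))
print(roundtrip_ok)
```

Opening the file in a `with` block closes it automatically, which stands in for the explicit `h5.close()` call used in the interactive session.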

Writing these objects directly to an HDF5 database is of course much faster than looping over the objects and writing the data row-by-row to a Table object. A final inspection of the database now shows three objects in it: the table and the two arrays.


In [133]: h5
Out[133]: File(filename=/flash/data/tab.h5, title=u'', mode='w', root_uep='/',
          filters=Filters(complevel=0, shuffle=False, fletcher32=False,
          least_significant_digit=None))
          / (RootGroup) u''
          /floats (Array(2000000, 2)) ''
            atom := Float64Atom(shape=(), dflt=0.0)
            maindim := 0
            flavor := 'numpy'
            byteorder := 'little'
            chunkshape := None
          /integers (Array(2000000, 2)) ''
            atom := Int64Atom(shape=(), dflt=0)
            maindim := 0
            flavor := 'numpy'
            byteorder := 'little'
            chunkshape := None
          /ints_floats (Table(2000000,)) 'Integers and Floats'
            description := {
            "Date": StringCol(itemsize=26, shape=(), dflt='', pos=0),
            "No1": Int32Col(shape=(), dflt=0, pos=1),
            "No2": Int32Col(shape=(), dflt=0, pos=2),
            "No3": Float64Col(shape=(), dflt=0.0, pos=3),
            "No4": Float64Col(shape=(), dflt=0.0, pos=4)}
            byteorder := 'little'
            chunkshape := (2621,)
In [134]: ll $path*
Out[134]: -rw-r--r-- 1 root 200313168 28. Sep 15:18 /flash/data/tab.h5
          -rw-r--r-- 1 root  41335178 28. Sep 15:18 /flash/data/tab.h5c
In [135]: h5.close()
In [136]: !rm -f $path*
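
The same kind of inspection can be done programmatically instead of reading the printed representation. The sketch below, assuming PyTables 3.x and NumPy, builds a small file with two arrays and one table and then walks over its leaf nodes. The data, the file name, the two-column table layout, and the `summary` dictionary are all hypothetical simplifications, not the objects from the session above.

```python
import os
import tempfile

import numpy as np
import tables as tb

# Hypothetical miniature versions of the three stored objects.
ints = np.arange(200, dtype=np.int64).reshape(100, 2)
floats = np.linspace(0.0, 1.0, 200).reshape(100, 2)
records = np.zeros(100, dtype=[("No1", "i4"), ("No3", "f8")])

summary = {}
with tempfile.TemporaryDirectory() as tmp:
    filename = os.path.join(tmp, "inspect.h5")  # assumed file name
    with tb.open_file(filename, "w") as h5:
        h5.create_array("/", "integers", ints)
        h5.create_array("/", "floats", floats)
        # Passing a structured array via obj= defines the columns
        # and fills the table in one step.
        h5.create_table("/", "ints_floats", obj=records,
                        title="Integers and Floats")

        # Visit every leaf (Array, Table, ...) below the root group.
        for node in h5.walk_nodes("/", classname="Leaf"):
            summary[node._v_pathname] = (type(node).__name__, node.shape)

for path, (kind, shape) in sorted(summary.items()):
    print(path, kind, shape)
```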

HDF5-BASED DATA STORAGE

The HDF5 database (file) format is a powerful alternative to, for example, relational databases when it comes to structured numerical and financial data. Both on a standalone basis when using PyTables directly and when combining it with the capabilities of pandas, you can expect to get almost the maximum I/O performance that the available hardware allows.

Out-of-Memory Computations