This indeed takes much longer than before. However, the compressed file is only about
20% of the size of the original, saving roughly 80% of the space on disk. This can matter
for backup routines or when shuffling large data sets between servers or even data centers:
In [130]: ll $path*
Out[130]: -rw-r--r-- 1 root 200313168 28. Sep 15:18 /flash/data/tab.h5
          -rw-r--r-- 1 root  41335178 28. Sep 15:18 /flash/data/tab.h5c
In [131]: h5c.close()
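As a quick check on the figures quoted above, the following minimal sketch recomputes the ratio from the two file sizes in the directory listing. The variable names are chosen for illustration only; the byte counts are copied from the listing.

```python
# File sizes in bytes, copied from the directory listing above.
size_plain = 200313168  # tab.h5, written without compression
size_comp = 41335178    # tab.h5c, written with compression

# Size of the compressed file relative to the uncompressed one.
ratio = size_comp / float(size_plain)
# Share of disk space saved by compressing.
saving = 1.0 - ratio

print('compressed/original: {:.1%}'.format(ratio))   # about 20.6%
print('space saved:         {:.1%}'.format(saving))  # about 79.4%
```

The exact values depend on the data and the compression settings, so they should be read as specific to this example rather than as a general property of compressed HDF5 files.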
Working with Arrays
We have already seen that NumPy has built-in fast writing and reading capabilities for
ndarray objects. PyTables is also quite fast and efficient when it comes to storing and
retrieving ndarray objects:
In [132]: %%time
          arr_int = h5.create_array('/', 'integers', ran_int)
          arr_flo = h5.create_array('/', 'floats', ran_flo)
Out[132]: CPU times: user 2 ms, sys: 33 ms, total: 35 ms
          Wall time: 35 ms
Writing these objects directly to an HDF5 database is of course much faster than looping
over the objects and writing the data row-by-row to a Table object. A final inspection of
the database now shows three objects in it, the table and the two arrays:
In [133]: h5
Out[133]: File(filename=/flash/data/tab.h5, title=u'', mode='w', root_uep='/',
          filters=Filters(complevel=0, shuffle=False, fletcher32=False,
          least_significant_digit=None))
          / (RootGroup) u''
          /floats (Array(2000000, 2)) ''
            atom := Float64Atom(shape=(), dflt=0.0)
            maindim := 0
            flavor := 'numpy'
            byteorder := 'little'
            chunkshape := None
          /integers (Array(2000000, 2)) ''
            atom := Int64Atom(shape=(), dflt=0)
            maindim := 0
            flavor := 'numpy'
            byteorder := 'little'
            chunkshape := None
          /ints_floats (Table(2000000,)) 'Integers and Floats'
            description := {
            "Date": StringCol(itemsize=26, shape=(), dflt='', pos=0),
            "No1": Int32Col(shape=(), dflt=0, pos=1),
            "No2": Int32Col(shape=(), dflt=0, pos=2),
            "No3": Float64Col(shape=(), dflt=0.0, pos=3),
            "No4": Float64Col(shape=(), dflt=0.0, pos=4)}
            byteorder := 'little'
            chunkshape := (2621,)
In [134]: ll $path*
Out[134]: -rw-r--r-- 1 root 200313168 28. Sep 15:18 /flash/data/tab.h5
          -rw-r--r-- 1 root  41335178 28. Sep 15:18 /flash/data/tab.h5c
In [135]: h5.close()
In [136]: !rm -f $path*
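The array workflow of this section can be condensed into a small, self-contained sketch. It assumes that PyTables 3.x and NumPy are installed. The file location, the reduced row count, and the randomly generated stand-ins for ran_int and ran_flo are assumptions made for illustration; they are not the data used in the session above.

```python
import os
import tempfile

import numpy as np
import tables as tb

# Hypothetical stand-ins for ran_int and ran_flo, scaled down from
# the 2,000,000 rows used in the text to keep the example quick.
rows = 10000
rng = np.random.default_rng(0)
ran_int = rng.integers(0, 10000, size=(rows, 2))   # int64 array
ran_flo = rng.standard_normal((rows, 2)).round(5)  # float64 array

# Hypothetical file location in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'arr.h5')

# Each create_array call stores a complete ndarray as one Array node.
h5 = tb.open_file(path, 'w')
h5.create_array('/', 'integers', ran_int)
h5.create_array('/', 'floats', ran_flo)
h5.close()

# Reopen the file read-only and load both arrays back into memory.
h5 = tb.open_file(path, 'r')
int_back = h5.root.integers.read()
flo_back = h5.root.floats.read()
h5.close()

# The data read from disk should match what was written.
round_trip_ok = (np.array_equal(int_back, ran_int)
                 and np.array_equal(flo_back, ran_flo))
print(round_trip_ok)

os.remove(path)  # clean up, mirroring the rm call above
```

Here read() returns the complete array as an ndarray object, which is convenient as long as the data fits into memory.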
HDF5-BASED DATA STORAGE
The HDF5 database (file) format is a powerful alternative to, for example, relational databases when it comes to
structured numerical and financial data. Whether you use PyTables directly or combine it with the capabilities of
pandas, you can expect to get almost the maximum I/O performance that the available hardware allows.
Out-of-Memory Computations