In [81]: %%time
         res = data[['No1', 'No2']][((data['No1'] > 0.5) | (data['No1'] < -0.5))
                                    & ((data['No2'] < -1) | (data['No2'] > 1))]
Out[81]: CPU times: user 49 ms, sys: 0 ns, total: 49 ms
         Wall time: 48.7 ms
In [82]: plt.plot(res.No1, res.No2, 'ro')
         plt.grid(True); plt.axis('tight')
Figure 7-2. Scatter plot of complex query result
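For comparison, the same complex query can be run directly against the SQLite table. The following is only a sketch: it assumes the sqlite3 connection object con and the table name numbers from the earlier examples of this chapter:

    # assumed names: `con` is the sqlite3 connection and `numbers` the
    # table created earlier in the chapter (both assumptions here)
    query = '''SELECT No1, No2 FROM numbers
               WHERE ((No1 > 0.5) OR (No1 < -0.5))
               AND ((No2 < -1) OR (No2 > 1))'''
    res_sql = con.execute(query).fetchall()  # list of (No1, No2) tuples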
As expected, using the in-memory analytics capabilities of pandas leads to a significant speedup, provided pandas is able to replicate the respective SQL statement. This is not the only advantage of using pandas, though. pandas is tightly integrated with PyTables, which is the topic of the next section. Here, it suffices to know that the combination of the two can speed up I/O operations considerably. This is shown in the following:
In [83]: h5s = pd.HDFStore(filename + '.h5s', 'w')
In [84]: %time h5s['data'] = data
Out[84]: CPU times: user 43 ms, sys: 60 ms, total: 103 ms
         Wall time: 161 ms
In [85]: h5s
Out[85]: <class 'pandas.io.pytables.HDFStore'>
         File path: /flash/data/numbs.h5s
         /data frame (shape->[1000000,5])
In [86]: h5s.close()
The whole DataFrame with all the data from the original SQL table is written in well below
1 second. Reading is even faster, as is to be expected:
In [87]: %%time
         h5s = pd.HDFStore(filename + '.h5s', 'r')
         temp = h5s['data']
         h5s.close()
Out[87]: CPU times: user 13 ms, sys: 22 ms, total: 35 ms
         Wall time: 32.7 ms
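Since HDFStore supports Python's context manager protocol, the same read can also be written with a with statement, which guarantees that the file is closed even if an exception occurs; a minimal sketch:

    # the with statement closes the store automatically
    with pd.HDFStore(filename + '.h5s', 'r') as h5s:
        temp = h5s['data']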
A brief check of whether the data sets are indeed the same; afterward, temp is rebound so that the large DataFrame object can be garbage collected:
In [88]: np.allclose(np.array(temp), np.array(data))
Out[88]: True
In [89]: temp = 0.0
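For such simple write/read round trips, pandas also provides the to_hdf and read_hdf convenience functions, which open and close the HDF5 file internally; a minimal sketch using the same key 'data' as above (the name df is hypothetical):

    # convenience wrappers around HDFStore; pandas handles open/close
    data.to_hdf(filename + '.h5s', key='data', mode='w')  # write
    df = pd.read_hdf(filename + '.h5s', key='data')       # read back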
A look at the two files now on disk shows that the HDF5 format consumes somewhat less disk space:
In [90]: ll $path*
Out[90]: -rw-r--r-- 1 root 54446080 28. Sep 15:16 /flash/data/numbs.db
         -rw-r--r-- 1 root 48007368 28. Sep 15:16 /flash/data/numbs.h5s
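If disk space matters more than raw write speed, HDFStore can also compress the data on the fly via its complevel and complib parameters; a sketch (the .h5c filename is made up for illustration, and the resulting file size and timing depend on the data):

    # write the same DataFrame with Blosc compression enabled
    h5c = pd.HDFStore(filename + '.h5c', 'w', complevel=5, complib='blosc')
    h5c['data'] = data
    h5c.close()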