In [81]: %%time
         res = data[['No1', 'No2']][((data['No1'] > 0.5) | (data['No1'] < -0.5))
                                    & ((data['No2'] < -1) | (data['No2'] > 1))]
Out[81]: CPU times: user 49 ms, sys: 0 ns, total: 49 ms
         Wall time: 48.7 ms
In [82]: plt.plot(res.No1, res.No2, 'ro')
         plt.grid(True); plt.axis('tight')
Figure 7-2. Scatter plot of complex query result
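For comparison, the same complex query can be run directly against the SQLite table. The following is only a sketch: it assumes the sqlite3 connection object con and the table name numbers from the earlier examples of this chapter:

    # assumed names: `con` is the sqlite3 connection and `numbers` the
    # table created earlier in the chapter (both assumptions here)
    query = '''SELECT No1, No2 FROM numbers
               WHERE ((No1 > 0.5) OR (No1 < -0.5))
               AND ((No2 < -1) OR (No2 > 1))'''
    res_sql = con.execute(query).fetchall()  # list of (No1, No2) tuples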
As expected, using the in-memory analytics capabilities of pandas leads to a significant speedup, provided pandas is able to replicate the respective SQL statement. This is not the only advantage of using pandas, though. pandas is tightly integrated with PyTables, which is the topic of the next section. Here, it suffices to know that the combination of the two can speed up I/O operations considerably. This is shown in the following:
In [83]: h5s = pd.HDFStore(filename + '.h5s', 'w')
In [84]: %time h5s['data'] = data
Out[84]: CPU times: user 43 ms, sys: 60 ms, total: 103 ms
         Wall time: 161 ms
In [85]: h5s
Out[85]: <class 'pandas.io.pytables.HDFStore'>
         File path: /flash/data/numbs.h5s
         /data frame (shape->[1000000,5])
In [86]: h5s.close()
The whole DataFrame with all the data from the original SQL table is written in well below
1 second. Reading is even faster, as is to be expected:
In [87]: %%time
         h5s = pd.HDFStore(filename + '.h5s', 'r')
         temp = h5s['data']
         h5s.close()
Out[87]: CPU times: user 13 ms, sys: 22 ms, total: 35 ms
         Wall time: 32.7 ms
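Since HDFStore supports Python's context manager protocol, the same read can also be written with a with statement, which guarantees that the file is closed even if an exception occurs; a minimal sketch:

    # the with statement closes the store automatically
    with pd.HDFStore(filename + '.h5s', 'r') as h5s:
        temp = h5s['data']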
A brief check of whether the data sets are indeed the same; afterward, temp is rebound so that the large DataFrame object can be garbage collected:
In [88]: np.allclose(np.array(temp), np.array(data))
Out[88]: True
In [89]: temp = 0.0
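For such simple write/read round trips, pandas also provides the to_hdf and read_hdf convenience functions, which open and close the HDF5 file internally; a minimal sketch using the same key 'data' as above (the name df is hypothetical):

    # convenience wrappers around HDFStore; pandas handles open/close
    data.to_hdf(filename + '.h5s', key='data', mode='w')  # write
    df = pd.read_hdf(filename + '.h5s', key='data')       # read back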
A look at the two files now on disk shows that the HDF5 format consumes somewhat less disk space:
In [90]: ll $path*
Out[90]: -rw-r--r-- 1 root 54446080 28. Sep 15:16 /flash/data/numbs.db
         -rw-r--r-- 1 root 48007368 28. Sep 15:16 /flash/data/numbs.h5s
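If disk space matters more than raw write speed, HDFStore can also compress the data on the fly via its complevel and complib parameters; a sketch (the .h5c filename is made up for illustration, and the resulting file size and timing depend on the data):

    # write the same DataFrame with Blosc compression enabled
    h5c = pd.HDFStore(filename + '.h5c', 'w', complevel=5, complib='blosc')
    h5c['data'] = data
    h5c.close()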