Python for Finance: Analyze Big Financial Data

In [81]: %%time
         res = data[['No1', 'No2']][((data['No1'] > 0.5) | (data['No1'] < -0.5))
                                    & ((data['No2'] < -1) | (data['No2'] > 1))]
Out[81]: CPU times: user 49 ms, sys: 0 ns, total: 49 ms
         Wall time: 48.7 ms
In [82]: plt.plot(res.No1, res.No2, 'ro')
         plt.grid(True); plt.axis('tight')

Figure 7-2. Scatter plot of complex query result
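The query above can be reproduced in self-contained form; the following sketch generates synthetic data that only mimics the original data set (the random numbers and column names are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the book's data set (column names assumed):
# 1,000,000 rows of standard normal random numbers in five columns.
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.standard_normal((1_000_000, 5)),
                    columns=['No1', 'No2', 'No3', 'No4', 'No5'])

# Boolean indexing replicating the compound SQL WHERE clause:
# select rows where |No1| > 0.5 AND |No2| > 1.
res = data[['No1', 'No2']][((data['No1'] > 0.5) | (data['No1'] < -0.5))
                           & ((data['No2'] < -1) | (data['No2'] > 1))]

print(len(res))  # number of rows satisfying both conditions
```

The whole condition is evaluated in memory as vectorized NumPy operations, which is where the speedup over an on-disk SQL query comes from.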

As expected, using the in-memory analytics capabilities of pandas leads to a significant speedup, provided pandas is able to replicate the respective SQL statement. This is not the only advantage of using pandas, though. pandas is tightly integrated with PyTables, which is the topic of the next section. Here, it suffices to know that the combination of both can speed up I/O operations considerably. This is shown in the following:


In [83]: h5s = pd.HDFStore(filename + '.h5s', 'w')
In [84]: %time h5s['data'] = data
Out[84]: CPU times: user 43 ms, sys: 60 ms, total: 103 ms
         Wall time: 161 ms
In [85]: h5s
Out[85]: <class 'pandas.io.pytables.HDFStore'>
         File path: /flash/data/numbs.h5s
         /data            frame        (shape->[1000000,5])
In [86]: h5s.close()

The whole DataFrame with all the data from the original SQL table is written in well below 1 second. Reading is even faster, as is to be expected:


In [87]: %%time
         h5s = pd.HDFStore(filename + '.h5s', 'r')
         temp = h5s['data']
         h5s.close()
Out[87]: CPU times: user 13 ms, sys: 22 ms, total: 35 ms
         Wall time: 32.7 ms

A brief check of whether the data sets are indeed the same:


In [88]: np.allclose(np.array(temp), np.array(data))
Out[88]: True
In [89]: temp = 0.0
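The whole write/read round trip can also be sketched in self-contained form; note that HDFStore works as a context manager as well, which closes the file automatically (the temporary file path and random data are for illustration only):

```python
import os
import tempfile
import numpy as np
import pandas as pd

# Small random data set standing in for the original DataFrame.
data = pd.DataFrame(np.random.standard_normal((10_000, 5)),
                    columns=['No1', 'No2', 'No3', 'No4', 'No5'])

path = os.path.join(tempfile.mkdtemp(), 'numbs.h5s')

# Write: the with statement is equivalent to the explicit close() above.
with pd.HDFStore(path, 'w') as h5s:
    h5s['data'] = data

# Read the DataFrame back from the HDF5 store.
with pd.HDFStore(path, 'r') as h5s:
    temp = h5s['data']

# The round trip is lossless.
print(np.allclose(temp.values, data.values))
```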

Also, a look at the two files now on disk, showing that the HDF5 format consumes somewhat less disk space:


In [90]: ll $path*
Out[90]: -rw-r--r-- 1 root 54446080 28. Sep 15:16 /flash/data/numbs.db
         -rw-r--r-- 1 root 48007368 28. Sep 15:16 /flash/data/numbs.h5s
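The disk footprint of the HDF5 store can be reduced further with HDF5's built-in compression; a minimal sketch using the standard `complevel`/`complib` options of pandas (file names and the random data set are illustrative, and the savings depend strongly on how compressible the data is):

```python
import os
import tempfile
import numpy as np
import pandas as pd

data = pd.DataFrame(np.random.standard_normal((100_000, 5)),
                    columns=['No1', 'No2', 'No3', 'No4', 'No5'])

d = tempfile.mkdtemp()
plain = os.path.join(d, 'plain.h5s')
comp = os.path.join(d, 'comp.h5s')

# Uncompressed store, as in the example above.
data.to_hdf(plain, key='data', mode='w')

# Compressed store: complevel 0-9, complib e.g. 'zlib' or 'blosc'.
data.to_hdf(comp, key='data', mode='w', complevel=9, complib='blosc')

print(os.path.getsize(plain), os.path.getsize(comp))
```

Reading a compressed store back works transparently with `pd.read_hdf`; no extra options are needed.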