
In [121]: %%time
          print "Max %18.3f" % values.max()
          print "Ave %18.3f" % values.mean()
          print "Min %18.3f" % values.min()
          print "Std %18.3f" % values.std()
Out[121]: Max 5.152
          Ave -0.000
          Min -5.537
          Std 1.000
          CPU times: user 44 ms, sys: 39 ms, total: 83 ms
          Wall time: 82.6 ms
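
The values array summarized in In [121] can be obtained by reading a single column of the table into memory via the cols accessor; slicing a column object yields a plain ndarray. A minimal sketch, not part of the original session (the column name No3 is taken from the later examples):

# hedged sketch: read one column of the PyTables table into a NumPy ndarray
values = tab.cols.No3[:]   # slicing a Column object returns an ndarray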
In [122]: %%time
          results = [(row['No1'], row['No2']) for row in
                     tab.where('((No1 > 9800) | (No1 < 200)) \
                                & ((No2 > 4500) & (No2 < 5500))')]
          for res in results[:4]:
              print res
Out[122]: (9987, 4965)
          (9934, 5263)
          (9960, 4729)
          (130, 5023)
          CPU times: user 167 ms, sys: 37 ms, total: 204 ms
          Wall time: 118 ms
In [123]: %%time
          results = [(row['No1'], row['No2']) for row in
                     tab.where('(No1 == 1234) & (No2 > 9776)')]
          for res in results:
              print res
Out[123]: (1234, 9805)
          (1234, 9785)
          (1234, 9821)
          CPU times: user 93 ms, sys: 40 ms, total: 133 ms
          Wall time: 90.1 ms
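
The same selection can also be carried out with the read_where method of the Table object, which evaluates the (numexpr-style) condition out of core and returns all matching rows as a NumPy structured array in a single call. This is a minimal sketch, not part of the original session, assuming the tab object from above:

# hedged sketch: read_where returns the rows satisfying the condition
# as a structured ndarray instead of an iterator of row objects
sel = tab.read_where('(No1 == 1234) & (No2 > 9776)')
print sel['No1']  # access the selected columns by field name
print sel['No2']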

Working with Compressed Tables


A major advantage of working with PyTables is the approach it takes to compression. It uses compression not only to save space on disk, but also to improve the performance of I/O operations. How does this work? When I/O is the bottleneck and the CPU is able to (de)compress data fast, the net effect of compression in terms of speed might be positive. Since the following examples are based on the I/O of a state-of-the-art (at the time of this writing) SSD, there is no speed advantage of compression to be observed. However, there is also almost no disadvantage of using compression:


In [124]: filename = path + 'tab.h5c'
          h5c = tb.open_file(filename, 'w')
In [125]: filters = tb.Filters(complevel=4, complib='blosc')
In [126]: tabc = h5c.create_table('/', 'ints_floats', sarray,
                                  title='Integers and Floats',
                                  expectedrows=rows, filters=filters)
In [127]: %%time
          res = np.array([(row['No3'], row['No4']) for row in
                          tabc.where('((No3 < -0.5) | (No3 > 0.5)) \
                                      & ((No4 < -1) | (No4 > 1))')])[::100]
Out[127]: CPU times: user 670 ms, sys: 41 ms, total: 711 ms
          Wall time: 602 ms

Generating the compressed table with the original data and doing analytics on it are slightly slower than with the uncompressed table. What about reading the data into an ndarray? Let's check:


In [128]: %time arr_non = tab.read()
Out[128]: CPU times: user 13 ms, sys: 49 ms, total: 62 ms
          Wall time: 61.3 ms
In [129]: %time arr_com = tabc.read()
Out[129]: CPU times: user 161 ms, sys: 33 ms, total: 194 ms
          Wall time: 193 ms
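
Two quick sanity checks round this off (a minimal sketch, not part of the original session): compression is lossless, so both reads should deliver identical data, and the compressed file should be noticeably smaller on disk. The uncompressed file name tab.h5 and the column name No3 are assumptions based on the surrounding examples:

import os.path
import numpy as np
# the compressed and uncompressed tables should hold identical data
print np.allclose(arr_non['No3'], arr_com['No3'])  # expected: True
# compare on-disk sizes; 'tab.h5' (uncompressed) is an assumed file name,
# 'tab.h5c' is the compressed file created in In [124]
h5c.flush()  # write pending data before measuring
size_unc = os.path.getsize(path + 'tab.h5')
size_com = os.path.getsize(path + 'tab.h5c')
print 'compression ratio: %.2f' % (float(size_unc) / size_com)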