
In [121]: %%time
          print "Max %18.3f" % values.max()
          print "Ave %18.3f" % values.mean()
          print "Min %18.3f" % values.min()
          print "Std %18.3f" % values.std()
Out[121]: Max 5.152
          Ave -0.000
          Min -5.537
          Std 1.000
          CPU times: user 44 ms, sys: 39 ms, total: 83 ms
          Wall time: 82.6 ms
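
The values array summarized in In [121] can be obtained by reading a single column of the table into memory via the cols accessor; slicing a column object yields a plain ndarray. A minimal sketch, not part of the original session (the column name No3 is taken from the later examples):

# hedged sketch: read one column of the PyTables table into a NumPy ndarray
values = tab.cols.No3[:]   # slicing a Column object returns an ndarray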
In [122]: %%time
          results = [(row['No1'], row['No2']) for row in
                     tab.where('((No1 > 9800) | (No1 < 200)) \
                                & ((No2 > 4500) & (No2 < 5500))')]
          for res in results[:4]:
              print res
Out[122]: (9987, 4965)
          (9934, 5263)
          (9960, 4729)
          (130, 5023)
          CPU times: user 167 ms, sys: 37 ms, total: 204 ms
          Wall time: 118 ms
In [123]: %%time
          results = [(row['No1'], row['No2']) for row in
                     tab.where('(No1 == 1234) & (No2 > 9776)')]
          for res in results:
              print res
Out[123]: (1234, 9805)
          (1234, 9785)
          (1234, 9821)
          CPU times: user 93 ms, sys: 40 ms, total: 133 ms
          Wall time: 90.1 ms
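
The same selection can also be carried out with the read_where method of the Table object, which evaluates the (numexpr-style) condition out of core and returns all matching rows as a NumPy structured array in a single call. This is a minimal sketch, not part of the original session, assuming the tab object from above:

# hedged sketch: read_where returns the rows satisfying the condition
# as a structured ndarray instead of an iterator of row objects
sel = tab.read_where('(No1 == 1234) & (No2 > 9776)')
print sel['No1']  # access the selected columns by field name
print sel['No2']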

Working with Compressed Tables


A major advantage of working with PyTables is the approach it takes to compression. It uses compression not only to save space on disk, but also to improve the performance of I/O operations. How does this work? When I/O is the bottleneck and the CPU is able to (de)compress data fast, the net effect of compression in terms of speed might be positive. Since the following examples are based on the I/O of a state-of-the-art (at the time of this writing) SSD, there is no speed advantage of compression to be observed. However, there is also almost no disadvantage of using compression:


In [124]: filename = path + 'tab.h5c'
          h5c = tb.open_file(filename, 'w')
In [125]: filters = tb.Filters(complevel=4, complib='blosc')
In [126]: tabc = h5c.create_table('/', 'ints_floats', sarray,
                                  title='Integers and Floats',
                                  expectedrows=rows, filters=filters)
In [127]: %%time
          res = np.array([(row['No3'], row['No4']) for row in
                          tabc.where('((No3 < -0.5) | (No3 > 0.5)) \
                                      & ((No4 < -1) | (No4 > 1))')])[::100]
Out[127]: CPU times: user 670 ms, sys: 41 ms, total: 711 ms
          Wall time: 602 ms

Generating the compressed table with the original data and doing analytics on it are slightly slower than with the uncompressed table. What about reading the data into an ndarray? Let's check:


In [128]: %time arr_non = tab.read()
Out[128]: CPU times: user 13 ms, sys: 49 ms, total: 62 ms
          Wall time: 61.3 ms
In [129]: %time arr_com = tabc.read()
Out[129]: CPU times: user 161 ms, sys: 33 ms, total: 194 ms
          Wall time: 193 ms
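
Two quick sanity checks round this off (a minimal sketch, not part of the original session): compression is lossless, so both reads should deliver identical data, and the compressed file should be noticeably smaller on disk. The uncompressed file name tab.h5 and the column name No3 are assumptions based on the surrounding examples:

import os.path
import numpy as np
# the compressed and uncompressed tables should hold identical data
print np.allclose(arr_non['No3'], arr_com['No3'])  # expected: True
# compare on-disk sizes; 'tab.h5' (uncompressed) is an assumed file name,
# 'tab.h5c' is the compressed file created in In [124]
h5c.flush()  # write pending data before measuring
size_unc = os.path.getsize(path + 'tab.h5')
size_com = os.path.getsize(path + 'tab.h5c')
print 'compression ratio: %.2f' % (float(size_unc) / size_com)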