Figure 7-1. Plot of the query result
From SQL to pandas
A generally more efficient approach, however, is to read either whole tables or
query results with pandas. When a whole table fits into memory, analytical queries
can be executed much faster than with the disk-based SQL approach. The sublibrary
pandas.io.sql contains functions to handle data stored in SQL databases:
In [77]: import pandas.io.sql as pds
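The read_sql call that follows uses a database connection object con, which is created in the preceding SQLite3 example and is not shown here. A minimal sketch of how such a connection might be set up, assuming a hypothetical database file name:

import sqlite3 as sq3

# hypothetical file name; the actual database containing the numbers
# table is created in the preceding SQLite3 example
con = sq3.connect('numbers.db')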
Reading the whole table with pandas takes roughly the same amount of time as reading it
into a NumPy ndarray object. In both cases, the bottleneck is the SQL database:
In [78]: %time data = pds.read_sql('SELECT * FROM numbers', con)
Out[78]: CPU times: user 2.16 s, sys: 60 ms, total: 2.22 s
         Wall time: 2.23 s
In [79]: data.head()
Out[79]:        No1      No2      No3      No4      No5
         0 -1.67378 -0.58292 -1.10616  1.14929 -0.03930
         1  1.38006  0.82665  0.34168 -1.16760 -0.53274
         2  0.79329  0.11947  2.06403 -0.36208  1.77442
         3 -0.33507 -0.00715 -1.01193  0.23157  1.30225
         4 -0.35292  0.67483  1.59507 -1.21263  0.14745

         [5 rows x 5 columns]
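In more recent pandas versions, the same reader is also available at the top level of the library, so the explicit import of pandas.io.sql is not strictly necessary. A minimal equivalent, assuming the same connection object:

import pandas as pd

# top-level equivalent of pds.read_sql; for SQLite, a plain DBAPI
# connection object is accepted directly
data = pd.read_sql('SELECT * FROM numbers', con)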
The data is now in memory, which allows for much faster analytics. A SQL query that
takes a few seconds with SQLite3 finishes in less than 0.1 second with pandas in
memory:
In [80]: %time data[(data['No1'] > 0) & (data['No2'] < 0)].head()
Out[80]: CPU times: user 50 ms, sys: 0 ns, total: 50 ms
         Wall time: 49.9 ms
                 No1      No2      No3      No4      No5
          6  1.17749 -1.13017 -0.24176 -0.64047  1.58002
          8  0.18625 -0.99949  2.29854  0.91816 -0.92661
          9  1.09481 -0.26301  1.11341  0.68716 -0.71524
         18  0.31836 -0.33039 -1.50109  0.52961  0.96595
         20  0.40261 -0.45917  0.37339 -1.09515  0.23972

         [5 rows x 5 columns]
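For comparison, the disk-based equivalent of this filter runs directly against the SQLite3 database. A sketch of the kind of query the timing comparison above refers to, assuming an IPython session and the numbers table from before:

# the same filter executed on disk by SQLite3; this is the kind of query
# that takes a few seconds rather than milliseconds
%time con.execute('SELECT * FROM numbers WHERE No1 > 0 AND No2 < 0').fetchall()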
pandas can handle even more complex queries, although it is neither meant for nor able
to replace SQL databases when it comes to complex relational data structures. The
result of the next query is shown in Figure 7-2: