Figure 7-1. Plot of the query result
From SQL to pandas
A generally more efficient approach, however, is to read either whole tables or
query results with pandas. When a whole table fits into memory, analytical queries
can be executed much faster than with the disk-based SQL approach. The sublibrary
pandas.io.sql contains functions to handle data stored in SQL databases:
In [77]: import pandas.io.sql as pds
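The read_sql call that follows uses a database connection object con, which is created in the preceding SQLite3 example and is not shown here. A minimal sketch of how such a connection might be set up, assuming a hypothetical database file name:

import sqlite3 as sq3

# hypothetical file name; the actual database containing the numbers
# table is created in the preceding SQLite3 example
con = sq3.connect('numbers.db')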
Reading the whole table with pandas takes roughly the same amount of time as reading it
into a NumPy ndarray object. In both cases, the bottleneck is the SQL database:
In [78]: %time data = pds.read_sql('SELECT * FROM numbers', con)
Out[78]: CPU times: user 2.16 s, sys: 60 ms, total: 2.22 s
         Wall time: 2.23 s
In [79]: data.head()
Out[79]:        No1      No2      No3      No4      No5
         0 -1.67378 -0.58292 -1.10616  1.14929 -0.03930
         1  1.38006  0.82665  0.34168 -1.16760 -0.53274
         2  0.79329  0.11947  2.06403 -0.36208  1.77442
         3 -0.33507 -0.00715 -1.01193  0.23157  1.30225
         4 -0.35292  0.67483  1.59507 -1.21263  0.14745

         [5 rows x 5 columns]
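In more recent pandas versions, the same reader is also available at the top level of the library, so the explicit import of pandas.io.sql is not strictly necessary. A minimal equivalent, assuming the same connection object:

import pandas as pd

# top-level equivalent of pds.read_sql; for SQLite, a plain DBAPI
# connection object is accepted directly
data = pd.read_sql('SELECT * FROM numbers', con)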
The data is now in memory, which allows for much faster analytics. A SQL query that
takes a few seconds with SQLite3 finishes in less than 0.1 second with pandas in
memory:
In [80]: %time data[(data['No1'] > 0) & (data['No2'] < 0)].head()
Out[80]: CPU times: user 50 ms, sys: 0 ns, total: 50 ms
         Wall time: 49.9 ms
                 No1      No2      No3      No4      No5
          6  1.17749 -1.13017 -0.24176 -0.64047  1.58002
          8  0.18625 -0.99949  2.29854  0.91816 -0.92661
          9  1.09481 -0.26301  1.11341  0.68716 -0.71524
         18  0.31836 -0.33039 -1.50109  0.52961  0.96595
         20  0.40261 -0.45917  0.37339 -1.09515  0.23972

         [5 rows x 5 columns]
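For comparison, the disk-based equivalent of this filter runs directly against the SQLite3 database. A sketch of the kind of query the timing comparison above refers to, assuming an IPython session and the numbers table from before:

# the same filter executed on disk by SQLite3; this is the kind of query
# that takes a few seconds rather than milliseconds
%time con.execute('SELECT * FROM numbers WHERE No1 > 0 AND No2 < 0').fetchall()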
pandas can handle even more complex queries, although it is neither meant for nor able
to replace SQL databases when it comes to complex relational data structures. The
result of the next query is shown in Figure 7-2: