This illustrates what kind of overhead the spreadsheet structure brings along with it.
Reading (and plotting) the data is a faster procedure (cf. Figure 7-4):
In [94]: %time pd.read_excel(filename + '.xlsx', 'Sheet1').cumsum().plot()
Out[94]: CPU times: user 12.9 s, sys: 6 ms, total: 12.9 s
Wall time: 12.9 s
Figure 7-4. Paths of random data from Excel file
Inspection of the generated files reveals that the combination of DataFrame and HDFStore is
the most compact alternative (using compression, as described later in this chapter, further
increases the benefits). The same amount of data stored as a CSV file (i.e., as a text file) is
somewhat larger in size. This is one reason for the slower performance when working with
CSV files; the other is the very fact that they are "only" general text files:
In [95]: ll $path*
Out[95]: -rw-r--r-- 1 root 48831681 28. Sep 15:17 /flash/data/numbs.csv
-rw-r--r-- 1 root 54446080 28. Sep 15:16 /flash/data/numbs.db
-rw-r--r-- 1 root 48007368 28. Sep 15:16 /flash/data/numbs.h5s
-rw-r--r-- 1 root 4311424 28. Sep 15:17 /flash/data/numbs.xlsx
In [96]: rm -f $path*
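The size comparison above relies on the shell's ll command. The same check can be sketched in plain Python, which is convenient when the numbers are needed inside a script. The helper names file_sizes and report are hypothetical and not part of the original session; the sketch assumes only that all files of interest share a common path prefix, as in the listing above.

```python
import glob
import os


def file_sizes(prefix):
    """Return {path: size in bytes} for all regular files whose path starts with prefix."""
    # glob.escape guards against special characters in the prefix itself
    paths = glob.glob(glob.escape(prefix) + '*')
    return {p: os.path.getsize(p) for p in paths if os.path.isfile(p)}


def report(prefix):
    """Print the matching files from smallest to largest."""
    sizes = file_sizes(prefix)
    for path, size in sorted(sizes.items(), key=lambda item: item[1]):
        print('%12d  %s' % (size, path))
```

Calling report with the common prefix of the generated files (for example, '/flash/data/numbs' for the files in the listing) has to happen before the files are removed.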