Python for Finance: Analyze Big Financial Data

This illustrates what kind of overhead the spreadsheet structure brings along with it.


Reading (and plotting) the data is a faster procedure (cf. Figure 7-4):


In [94]: %time pd.read_excel(filename + '.xlsx', 'Sheet1').cumsum().plot()
Out[94]: CPU times: user 12.9 s, sys: 6 ms, total: 12.9 s
Wall time: 12.9 s

Figure 7-4. Paths of random data from Excel file

Inspection of the generated files reveals that the combination of DataFrame and HDFStore is the most compact alternative (using compression, as described later in this chapter, further increases the benefits). The same amount of data as a CSV file (i.e., as a text file) is somewhat larger in size. This is one reason for the slower performance when working with CSV files, the other being the very fact that they are "only" general text files:


In [95]: ll $path*
Out[95]: -rw-r--r-- 1 root 48831681 28. Sep 15:17 /flash/data/numbs.csv
-rw-r--r-- 1 root 54446080 28. Sep 15:16 /flash/data/numbs.db
-rw-r--r-- 1 root 48007368 28. Sep 15:16 /flash/data/numbs.h5s
-rw-r--r-- 1 root 4311424 28. Sep 15:17 /flash/data/numbs.xlsx
In [96]: rm -f $path*
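The size difference between a text file and a binary file can be reproduced on a small scale with the standard library alone. The following is a minimal sketch, not the procedure used above: it swaps in a raw binary dump of 8-byte floats for the HDF5 file, and the function name write_both, the file names numbs_demo.csv and numbs_demo.bin, and the array dimensions are assumptions chosen for illustration. How much larger the text file turns out depends on how many digits are written per number, so the ratio seen here need not match the listing above.

```python
import array
import csv
import os
import random
import tempfile


def write_both(rows, cols, directory, seed=0):
    """Write the same random numbers as CSV text and as raw binary.

    Returns the two file sizes in bytes as (csv_size, bin_size).
    All names here are hypothetical and only serve the illustration.
    """
    rng = random.Random(seed)
    data = [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

    csv_path = os.path.join(directory, "numbs_demo.csv")
    bin_path = os.path.join(directory, "numbs_demo.bin")

    # Text representation: every float is written out digit by digit.
    with open(csv_path, "w", newline="") as f:
        csv.writer(f).writerows(data)

    # Binary representation: every float takes exactly 8 bytes ('d' = double).
    with open(bin_path, "wb") as f:
        for row in data:
            array.array("d", row).tofile(f)

    return os.path.getsize(csv_path), os.path.getsize(bin_path)


# Use a temporary directory so nothing is left behind on disk.
with tempfile.TemporaryDirectory() as tmp:
    csv_size, bin_size = write_both(10_000, 5, tmp)

print("CSV size in bytes:   ", csv_size)
print("Binary size in bytes:", bin_size)
```

The binary file always occupies rows × cols × 8 bytes, while the size of the CSV file grows with the number of characters needed to spell out each value. Rounding the numbers before writing them would shrink the text file, at the cost of precision.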