Python for Finance: Analyze Big Financial Data

In [ 65 ]: lines[: 6 ] Out[65]: [‘PriceIndices-EUROCurrency\n’, ‘Date;Blue-Chip;Blue-Chip;Broad;Broad;ExUK;ExEuroZone;Blue-Chip;Broad\ n’, ‘;Europe;Euro-Zone;Europe;Euro-Zone;;;Nordic;Nordic\n’, ‘;SX5P;SX5E;SXXP;SXXE;SXXF;SXXA;DK5F;DKXF\n’, ‘31.12.1986;775.00;900.82;82.76;98.58;98.06;69.06;645.26;65.56\n’, ‘01.01.1987;775.00;900.82;82.76;98.58;98.06;69.06;645.26;65.56\n’]

The above-mentioned format change can be seen between lines 3,883 and 3,990 of the

file. From December 27, there suddenly appears an additional semicolon at the end of

each data row:

In [ 66 ]: for line in lines[ 3883 : 3890 ]: print line[ 41 :], Out[66]: 317.10;267.23;5268.36;363.19 322.55;272.18;5360.52;370.94 322.69;272.95;5360.52;370.94 327.57;277.68;5479.59;378.69; 329.94;278.87;5585.35;386.99; 326.77;272.38;5522.25;380.09; 332.62;277.08;5722.57;396.12;

To make the data set easier to import, we do the following:

1 . Generate a new text file.

2 . Delete unneeded header lines.

3 . Write an appropriate new header line to the new file.

4 . Add a helper column, DEL (to catch the trailing semicolons).

5 . Write all data rows to the new file.

With these adjustments, the data set can be imported and the helper column deleted after

the import. But first, the cleaning code:

In [ 67 ]: new_file = open(‘./data/es50.txt’, ‘w’) # opens a new file new_file.writelines(‘date’ + lines[ 3 ][:- 1 ] + ‘;DEL’ + lines[ 3 ][- 1 ]) # writes the corrected third line of the original file # as first line of new file new_file.writelines(lines[ 4 :]) # writes the remaining lines of the orignial file new_file.close()

Let us see how the new header looks:

In [ 68 ]: new_lines = open(‘./data/es50.txt’, ‘r’).readlines() new_lines[: 5 ] Out[68]: [‘date;SX5P;SX5E;SXXP;SXXE;SXXF;SXXA;DK5F;DKXF;DEL\n’, ‘31.12.1986;775.00;900.82;82.76;98.58;98.06;69.06;645.26;65.56\n’, ‘01.01.1987;775.00;900.82;82.76;98.58;98.06;69.06;645.26;65.56\n’, ‘02.01.1987;770.89;891.78;82.57;97.80;97.43;69.37;647.62;65.81\n’, ‘05.01.1987;771.89;898.33;82.82;98.60;98.19;69.16;649.94;65.82\n’]

It looks appropriate for the import with the read_csv function of pandas, so we continue:

In [ 69 ]: es = pd.read_csv(‘./data/es50.txt’, index_col= 0 , parse_dates=True, sep=‘;’, dayfirst=True) In [ 70 ]: np.round(es.tail()) Out[70]: SX5P SX5E SXXP SXXE SXXF SXXA DK5F DKXF DEL date 2014-09-22 3096 3257 347 326 403 357 9703 565 NaN 2014-09-23 3058 3206 342 321 398 353 9602 558 NaN 2014-09-24 3086 3244 344 323 401 355 9629 560 NaN 2014-09-25 3059 3202 341 320 397 353 9538 556 NaN 2014-09-26 3064 3220 342 321 398 353 9559 557 NaN