Wall time: 5.61 s
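The retrieval code that produced this timing precedes this excerpt and is not shown here. Purely as a sketch of what it might look like — assuming pandas-datareader with a Yahoo! Finance source and the six DAX symbols visible in the output below plus the index ticker '^GDAXI' (the full chapter uses all 30 DAX constituents) — such a step could read:

import pandas as pd
from pandas_datareader import data as web   # assumed data source
from sklearn.decomposition import KernelPCA # used in the cells below

# hypothetical subset of the 30 DAX constituents plus the index itself
symbols = ['ADS.DE', 'ALV.DE', 'BAS.DE', 'BAYN.DE',
           'BEI.DE', 'BMW.DE', '^GDAXI']

data = pd.DataFrame()
for sym in symbols:
    # closing prices per symbol
    data[sym] = web.DataReader(sym, data_source='yahoo',
                               start='2010-1-1')['Close']
data = data.dropna()  # keep only days with quotes for all symbols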

Let us separate the index data since we need it regularly:


In [4]: dax = pd.DataFrame(data.pop('^GDAXI'))

The DataFrame object data now contains the closing price data for the 30 DAX stocks:


In [5]: data[data.columns[:6]].head()
Out[5]:             ADS.DE  ALV.DE  BAS.DE  BAYN.DE  BEI.DE  BMW.DE
        Date
        2010-01-04   38.51   88.54   44.85    56.40   46.44   32.05
        2010-01-05   39.72   88.81   44.17    55.37   46.20   32.31
        2010-01-06   39.40   89.50   44.45    55.02   46.17   32.81
        2010-01-07   39.74   88.47   44.15    54.30   45.70   33.10
        2010-01-08   39.60   87.99   44.02    53.82   44.38   32.65

Applying PCA


Usually, PCA works with normalized data sets. Therefore, the following convenience function proves helpful:


In [6]: scale_function = lambda x: (x - x.mean()) / x.std()
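As a quick sanity check, a scaled series should end up with a mean of (roughly) 0 and a standard deviation of (roughly) 1; the following snippet is a hypothetical illustration, not part of the chapter:

import numpy as np
import pandas as pd

scale_function = lambda x: (x - x.mean()) / x.std()

# apply the normalization to a random sample series
s = pd.Series(np.random.standard_normal(500))
scaled = scale_function(s)
print(round(scaled.mean(), 12), round(scaled.std(), 12))
# prints values numerically indistinguishable from 0.0 and 1.0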

To begin with, consider a PCA with multiple components (i.e., we do not restrict the number of components):



In [7]: pca = KernelPCA().fit(data.apply(scale_function))

The importance, or explanatory power, of each component is given by its eigenvalue. The eigenvalues are found in an attribute of the KernelPCA object. The analysis gives too many components:


In [8]: len(pca.lambdas_)
Out[8]: 655
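The count corresponds to the number of observations rather than the number of stocks: kernel PCA performs an eigendecomposition of the kernel matrix, whose dimension equals the number of samples (here, 655 trading days), so up to one eigenvalue per observation can result. A minimal check, assuming the data set retrieved above:

# KernelPCA diagonalizes the n_samples x n_samples kernel matrix,
# yielding (up to) one eigenvalue per trading day, not per stock
print(len(pca.lambdas_), len(data))  # 655 655 for this sample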

Therefore, let us only have a look at the first 10 components. The tenth component already has almost negligible influence:


In [9]: pca.lambdas_[:10].round()
Out[9]: array([ 22816.,   6559.,   2535.,   1558.,    697.,    442.,    378.,
                  255.,    183.,    151.])

We are mainly interested in the relative importance of each component, so we will normalize these values. Again, we use a convenience function for this:


In [10]: get_we = lambda x: x / x.sum()

In [11]: get_we(pca.lambdas_)[:10]
Out[11]: array([ 0.6295725 ,  0.1809903 ,  0.06995609,  0.04300101,  0.01923256,
                 0.01218984,  0.01044098,  0.00704461,  0.00505794,  0.00416612])

With this information, the picture becomes much clearer. The first component alone already explains about 63% of the variability in the 30 time series. The first five components explain about 94% of the variability:


In [12]: get_we(pca.lambdas_)[:5].sum()
Out[12]: 0.94275246704834414

Constructing a PCA Index


Next, we use PCA to construct a PCA (or factor) index over time and compare it with the original index. First, we construct a PCA index with a single component only:


In [13]: pca = KernelPCA(n_components=1).fit(data.apply(scale_function))
         dax['PCA_1'] = pca.transform(-data)
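The chapter then compares this single-component PCA index with the DAX itself; the negated input (-data) in the cell above presumably serves to flip the sign of the resulting component so that it moves in the same direction as the index. A minimal sketch of such a comparison plot, assuming matplotlib is available:

import matplotlib.pyplot as plt

# normalize both the DAX and the PCA index so their scales are comparable
dax.apply(scale_function).plot(figsize=(8, 4))
plt.show()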