Wall time: 5.61 s
Let us separate the index data since we need it regularly:
In [4]: dax = pd.DataFrame(data.pop('^GDAXI'))
The DataFrame object data now holds price data for the 30 DAX stocks:
In [5]: data[data.columns[:6]].head()
Out[5]:             ADS.DE  ALV.DE  BAS.DE  BAYN.DE  BEI.DE  BMW.DE
        Date
        2010-01-04   38.51   88.54   44.85    56.40   46.44   32.05
        2010-01-05   39.72   88.81   44.17    55.37   46.20   32.31
        2010-01-06   39.40   89.50   44.45    55.02   46.17   32.81
        2010-01-07   39.74   88.47   44.15    54.30   45.70   33.10
        2010-01-08   39.60   87.99   44.02    53.82   44.38   32.65
Applying PCA
Usually, PCA works with normalized data sets. Therefore, the following convenience
function proves helpful:
In [6]: scale_function = lambda x: (x - x.mean()) / x.std()
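As a quick check of what this function does, the following minimal sketch applies it to a small made-up DataFrame (the column names and values are hypothetical, not taken from the DAX data). It assumes that DataFrame.apply hands the function one column at a time, which is the pandas default, and that std() uses the sample standard deviation:

```python
import pandas as pd

# Hypothetical toy prices; the column names are made up for illustration.
prices = pd.DataFrame({
    "AAA": [10.0, 11.0, 12.0, 13.0],
    "BBB": [100.0, 90.0, 110.0, 100.0],
})

scale_function = lambda x: (x - x.mean()) / x.std()

# DataFrame.apply passes one column (a Series) at a time by default,
# so each column is standardized with its own mean and standard deviation.
scaled = prices.apply(scale_function)

col_means = scaled.mean()  # approximately 0 for every column
col_stds = scaled.std()    # approximately 1 for every column (sample std, ddof=1)
```

After scaling, every column has mean zero and unit standard deviation, so stocks with high price levels do not dominate the analysis merely because of their scale.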
To begin, consider a PCA with multiple components (i.e., we do not restrict the
number of components):[43]
In [7]: pca = KernelPCA().fit(data.apply(scale_function))
The importance or explanatory power of each component is given by its eigenvalue.
The eigenvalues are stored in the lambdas_ attribute of the KernelPCA object. The
analysis returns far more components than we need:
In [8]: len(pca.lambdas_)
Out[8]: 655
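The attribute name depends on the installed library version: scikit-learn 1.0 renamed lambdas_ to eigenvalues_, and the old name was removed in version 1.2. A small helper along the following lines can read the eigenvalues under either name. The helper name kernel_pca_eigenvalues is hypothetical, and the SimpleNamespace objects are only stand-ins for fitted estimators:

```python
from types import SimpleNamespace

def kernel_pca_eigenvalues(fitted_pca):
    """Return the eigenvalues of a fitted KernelPCA-like object.

    Tries the newer attribute name first and falls back to the older one.
    """
    for name in ("eigenvalues_", "lambdas_"):
        if hasattr(fitted_pca, name):
            return getattr(fitted_pca, name)
    raise AttributeError("object has neither eigenvalues_ nor lambdas_")

# Stand-ins for fitted estimators, used only to exercise the helper.
old_style = SimpleNamespace(lambdas_=[3.0, 1.0])
new_style = SimpleNamespace(eigenvalues_=[3.0, 1.0])
```

With such a helper, a call like len(kernel_pca_eigenvalues(pca)) should work regardless of which attribute name the installed version provides.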
Therefore, let us look only at the first 10 components. The tenth component already
has almost negligible influence:
In [9]: pca.lambdas_[:10].round()
Out[9]: array([ 22816.,   6559.,   2535.,   1558.,    697.,    442.,    378.,
                  255.,    183.,    151.])
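One plausible reason for the rapid decay is structural. With its default linear kernel, KernelPCA decomposes a matrix with one row and one column per observation, but the nonzero eigenvalues of that matrix coincide with those of the much smaller matrix with one row and one column per data column. At most as many eigenvalues as there are columns can therefore differ materially from zero. The following sketch illustrates this with NumPy on random toy data (the sizes and the seed are arbitrary assumptions, not the DAX data):

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 50, 3             # toy sizes, not the DAX data
X = rng.normal(size=(n_samples, n_features))
X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)  # same scaling as scale_function

gram = X @ X.T      # n_samples x n_samples (linear kernel on centered data)
scatter = X.T @ X   # n_features x n_features

# eigvalsh returns eigenvalues in ascending order; reverse for descending.
gram_eig = np.linalg.eigvalsh(gram)[::-1]
scatter_eig = np.linalg.eigvalsh(scatter)[::-1]

leading = gram_eig[:n_features]    # agrees with scatter_eig
trailing = gram_eig[n_features:]   # zero up to rounding error
```

In this toy setup only the three leading eigenvalues carry weight, and the remaining 47 are numerical noise.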
We are mainly interested in the relative importance of each component, so we will
normalize these values. Again, we use a convenience function for this:
In [10]: get_we = lambda x: x / x.sum()
In [11]: get_we(pca.lambdas_)[:10]
Out[11]: array([ 0.6295725 ,  0.1809903 ,  0.06995609,  0.04300101,  0.01923256,
                 0.01218984,  0.01044098,  0.00704461,  0.00505794,  0.00416612])
With this information, the picture becomes much clearer. The first component already
explains about 63% of the variability in the 30 time series. The first five components
together explain about 94% of the variability:
In [12]: get_we(pca.lambdas_)[:5].sum()
Out[12]: 0.94275246704834414
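The same figure can be read off a running total of the weights. The sketch below copies the ten relative weights from Out[11] and accumulates them. The 90% threshold is an illustrative choice, not something prescribed by the text:

```python
import numpy as np

# Relative weights of the first 10 components, copied from Out[11].
weights = np.array([0.6295725, 0.1809903, 0.06995609, 0.04300101, 0.01923256,
                    0.01218984, 0.01044098, 0.00704461, 0.00505794, 0.00416612])

# Running total: entry k is the share explained by the first k + 1 components.
cumulative = np.cumsum(weights)

# Smallest number of components whose combined weight reaches the threshold.
threshold = 0.90  # illustrative choice, not from the text
n_needed = int(np.searchsorted(cumulative, threshold) + 1)
```

Here cumulative[4] reproduces the value in Out[12] up to the rounding of the printed weights, and n_needed shows that four components already pass the 90% mark.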
Constructing a PCA Index
Next, we use PCA to construct a PCA (or factor) index over time and compare it with the
original index. First, we have a PCA index with a single component only:
In [13]: pca = KernelPCA(n_components=1).fit(data.apply(scale_function))
         dax['PCA_1'] = pca.transform(-data)