Wall time: 5.61 s
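The retrieval code that produced this timing precedes this excerpt and is not shown here. Purely as a sketch of what it might look like — assuming pandas-datareader with a Yahoo! Finance source and the six DAX symbols visible in the output below plus the index ticker '^GDAXI' (the full chapter uses all 30 DAX constituents) — such a step could read:

import pandas as pd
from pandas_datareader import data as web   # assumed data source
from sklearn.decomposition import KernelPCA # used in the cells below

# hypothetical subset of the 30 DAX constituents plus the index itself
symbols = ['ADS.DE', 'ALV.DE', 'BAS.DE', 'BAYN.DE',
           'BEI.DE', 'BMW.DE', '^GDAXI']

data = pd.DataFrame()
for sym in symbols:
    # closing prices per symbol
    data[sym] = web.DataReader(sym, data_source='yahoo',
                               start='2010-1-1')['Close']
data = data.dropna()  # keep only days with quotes for all symbols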

Let us separate the index data since we need it regularly:


In [4]: dax = pd.DataFrame(data.pop('^GDAXI'))

The DataFrame object data now contains the closing price data for the 30 DAX stocks:


In [5]: data[data.columns[:6]].head()
Out[5]:             ADS.DE  ALV.DE  BAS.DE  BAYN.DE  BEI.DE  BMW.DE
        Date
        2010-01-04   38.51   88.54   44.85    56.40   46.44   32.05
        2010-01-05   39.72   88.81   44.17    55.37   46.20   32.31
        2010-01-06   39.40   89.50   44.45    55.02   46.17   32.81
        2010-01-07   39.74   88.47   44.15    54.30   45.70   33.10
        2010-01-08   39.60   87.99   44.02    53.82   44.38   32.65

Applying PCA


Usually, PCA works with normalized data sets. Therefore, the following convenience function proves helpful:


In [6]: scale_function = lambda x: (x - x.mean()) / x.std()
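As a quick sanity check, a scaled series should end up with a mean of (roughly) 0 and a standard deviation of (roughly) 1; the following snippet is a hypothetical illustration, not part of the chapter:

import numpy as np
import pandas as pd

scale_function = lambda x: (x - x.mean()) / x.std()

# apply the normalization to a random sample series
s = pd.Series(np.random.standard_normal(500))
scaled = scale_function(s)
print(round(scaled.mean(), 12), round(scaled.std(), 12))
# prints values numerically indistinguishable from 0.0 and 1.0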

To begin with, consider a PCA with multiple components (i.e., we do not restrict the number of components):



In [7]: pca = KernelPCA().fit(data.apply(scale_function))

The importance, or explanatory power, of each component is given by its eigenvalue. The eigenvalues are found in an attribute of the KernelPCA object. The analysis gives too many components:


In [8]: len(pca.lambdas_)
Out[8]: 655
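The count corresponds to the number of observations rather than the number of stocks: kernel PCA performs an eigendecomposition of the kernel matrix, whose dimension equals the number of samples (here, 655 trading days), so up to one eigenvalue per observation can result. A minimal check, assuming the data set retrieved above:

# KernelPCA diagonalizes the n_samples x n_samples kernel matrix,
# yielding (up to) one eigenvalue per trading day, not per stock
print(len(pca.lambdas_), len(data))  # 655 655 for this sample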

Therefore, let us only have a look at the first 10 components. The tenth component already has almost negligible influence:


In [9]: pca.lambdas_[:10].round()
Out[9]: array([ 22816.,   6559.,   2535.,   1558.,    697.,    442.,    378.,
                  255.,    183.,    151.])

We are mainly interested in the relative importance of each component, so we will normalize these values. Again, we use a convenience function for this:


In [10]: get_we = lambda x: x / x.sum()

In [11]: get_we(pca.lambdas_)[:10]
Out[11]: array([ 0.6295725 ,  0.1809903 ,  0.06995609,  0.04300101,  0.01923256,
                 0.01218984,  0.01044098,  0.00704461,  0.00505794,  0.00416612])

With this information, the picture becomes much clearer. The first component alone already explains about 63% of the variability in the 30 time series. The first five components explain about 94% of the variability:


In [12]: get_we(pca.lambdas_)[:5].sum()
Out[12]: 0.94275246704834414

Constructing a PCA Index


Next, we use PCA to construct a PCA (or factor) index over time and compare it with the original index. First, we construct a PCA index with a single component only:


In [13]: pca = KernelPCA(n_components=1).fit(data.apply(scale_function))
         dax['PCA_1'] = pca.transform(-data)
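The chapter then compares this single-component PCA index with the DAX itself; the negated input (-data) in the cell above presumably serves to flip the sign of the resulting component so that it moves in the same direction as the index. A minimal sketch of such a comparison plot, assuming matplotlib is available:

import matplotlib.pyplot as plt

# normalize both the DAX and the PCA index so their scales are comparable
dax.apply(scale_function).plot(figsize=(8, 4))
plt.show()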