Python for Finance: Analyze Big Financial Data

Principal Component Analysis

Principal component analysis (PCA) has become a popular tool in finance. Wikipedia defines the technique as follows:

Principal component analysis (PCA) is a statistical procedure that uses orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. The number of principal components is less than or equal to the number of original variables. This transformation is defined in such a way that the first principal component has the largest possible variance (that is, accounts for as much of the variability in the data as possible), and each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to (i.e., uncorrelated with) the preceding components.
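To make the definition concrete, the following is a minimal sketch that uses only NumPy on synthetic data. The data, the random seed, and all variable names are assumptions made for illustration; they are not part of the DAX example. The sketch carries out the orthogonal transformation through an eigendecomposition of the covariance matrix, one common way to compute a linear PCA, and prepares the quantities needed to check the two properties named in the quote: the components are uncorrelated, and their variances are ordered from largest to smallest.

```python
import numpy as np

# Synthetic data (an assumption for illustration): 500 observations of
# 3 correlated variables, built by mixing independent noise.
rng = np.random.default_rng(0)
mixing = np.array([[1.0, 0.0, 0.0],
                   [0.8, 0.5, 0.0],
                   [0.6, 0.3, 0.2]])
X = rng.standard_normal((500, 3)) @ mixing.T

# Center the data and compute the covariance matrix of the variables.
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)

# Eigendecomposition of the symmetric covariance matrix. eigh returns the
# eigenvalues in ascending order, so sort them to put the largest first.
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Orthogonal transformation: project the centered data onto the eigenvectors.
components = Xc @ eigvecs

# Covariance matrix of the components; it should be (numerically) diagonal,
# with the eigenvalues on the diagonal in decreasing order.
comp_cov = np.cov(components, rowvar=False)
```

Inspecting `comp_cov` should show off-diagonal entries that are zero up to floating-point error, while the diagonal reproduces `eigvals`, i.e., the variances of the first, second, and third principal components.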

Consider, for example, a stock index like the German DAX index, composed of 30 different stocks. The stock price movements of all stocks taken together determine the movement in the index (via some well-documented formula). In addition, the price movements of the individual stocks are generally correlated, for example, due to general economic conditions or sector-specific developments.

For statistical applications, it is generally quite hard to use 30 correlated factors to explain the movements of a stock index. This is where PCA comes into play. It derives single, uncorrelated components that are “well suited” to explain the movements in the stock index. One can think of these components as linear combinations (i.e., portfolios) of selected stocks from the index. Instead of working with 30 correlated index constituents, one can then work with maybe 5, 3, or even only 1 principal component.
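The idea of replacing 30 correlated series with a handful of components can be sketched on synthetic data. Everything in the following block is an assumption for illustration: the "returns" are generated from one common factor plus noise rather than taken from market data, and the names `betas`, `weights`, and `reduced` are hypothetical. The sketch uses a plain linear PCA via NumPy's singular value decomposition, not the KernelPCA class used later in this section.

```python
import numpy as np

# Synthetic stand-in for 30 correlated return series (an assumption; no
# market data is used): one common "market" factor plus specific noise.
rng = np.random.default_rng(42)
n_days, n_stocks = 1000, 30
market = rng.standard_normal((n_days, 1))
betas = rng.uniform(0.5, 1.5, size=(1, n_stocks))
returns = market @ betas + 0.5 * rng.standard_normal((n_days, n_stocks))

# Linear PCA via singular value decomposition of the centered data matrix.
centered = returns - returns.mean(axis=0)
U, s, Vt = np.linalg.svd(centered, full_matrices=False)

# Share of total variance explained by each component, and cumulatively.
explained = s ** 2 / np.sum(s ** 2)
cumulative = np.cumsum(explained)

# Each row of Vt holds the weights of one component, i.e., one linear
# combination ("portfolio") of the 30 series. Keep the first k components.
k = 3
weights = Vt[:k]                 # shape (k, n_stocks)
reduced = centered @ weights.T   # shape (n_days, k)
```

Because the synthetic series share a single common factor by construction, the first entry of `explained` dominates here. With real index constituents, the split across components depends on the data, and `cumulative` is one simple way to judge how many components are worth keeping.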

The example in this section illustrates the use of PCA in such a context. We retrieve data for both the German DAX index and all stocks that make up the index. We then use PCA to derive principal components, which we use to construct what we call a pca_index.

First, some imports. In particular, we use the KernelPCA class of the scikit-learn machine learning library (cf. the documentation for KernelPCA):

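In [1]: import numpy as np
        import pandas as pd
        # pandas.io.data was later moved out of pandas into the separate
        # pandas-datareader package (import pandas_datareader.data as web)
        import pandas.io.data as web
        from sklearn.decomposition import KernelPCA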

The DAX Index and Its 30 Stocks

The following list object contains the 30 symbols for the stocks contained in the German DAX index, as well as the symbol for the index itself:

In [2]: symbols = ['ADS.DE', 'ALV.DE', 'BAS.DE', 'BAYN.DE', 'BEI.DE',
                   'BMW.DE', 'CBK.DE', 'CON.DE', 'DAI.DE', 'DB1.DE',
                   'DBK.DE', 'DPW.DE', 'DTE.DE', 'EOAN.DE', 'FME.DE',
                   'FRE.DE', 'HEI.DE', 'HEN3.DE', 'IFX.DE', 'LHA.DE',
                   'LIN.DE', 'LXS.DE', 'MRK.DE', 'MUV2.DE', 'RWE.DE',
                   'SAP.DE', 'SDF.DE', 'SIE.DE', 'TKA.DE', 'VOW3.DE',
                   '^GDAXI']

We work only with the closing values of each data set that we retrieve (for details on how to retrieve stock data with pandas, see Chapter 6):

In [3]: %%time
        data = pd.DataFrame()
        for sym in symbols:
            data[sym] = web.DataReader(sym, data_source='yahoo')['Close']
        data = data.dropna()
Out[3]: CPU times: user 408 ms, sys: 68 ms, total: 476 ms
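Since the download step depends on an external data service, the effect of collecting closing prices into one DataFrame and calling dropna() can be illustrated offline. The following is only a sketch under stated assumptions: the symbols 'AAA', 'BBB', and 'CCC', the helper fake_close, and the random-walk prices are all made up and stand in for the real DAX data. One difference from the loop above is that the sketch builds the DataFrame from a dict of Series, which aligns the columns on the union of all dates, whereas assigning columns one by one aligns each new column to the index already present.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for downloaded data: three made-up symbols with
# random-walk closing prices on slightly different sets of business days.
rng = np.random.default_rng(1)
dates = pd.date_range('2014-01-01', periods=10, freq='B')

def fake_close(index):
    """Return a synthetic closing-price Series (not real market data)."""
    steps = rng.standard_normal(len(index))
    return pd.Series(100 + np.cumsum(steps), index=index)

series = {
    'AAA': fake_close(dates),        # full history
    'BBB': fake_close(dates[2:]),    # starts two days later
    'CCC': fake_close(dates[:-1]),   # ends one day earlier
}

# Columns are aligned on the date index; dates missing for a symbol
# show up as NaN in that symbol's column.
data = pd.DataFrame(series)

# dropna() keeps only the rows (dates) on which every column has a value.
clean = data.dropna()
```

After the call, `clean` contains only the dates shared by all three synthetic series, which mirrors what the final `data = data.dropna()` line does for the index and its constituents.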