Principal Component Analysis
Principal component analysis (PCA) has become a popular tool in finance. Wikipedia
defines the technique as follows:
Principal component analysis (PCA) is a statistical procedure that uses orthogonal transformation to convert a set
of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called
principal components. The number of principal components is less than or equal to the number of original
variables. This transformation is defined in such a way that the first principal component has the largest possible
variance (that is, accounts for as much of the variability in the data as possible), and each succeeding component
in turn has the highest variance possible under the constraint that it is orthogonal to (i.e., uncorrelated with) the
preceding components.
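The definition above can be checked numerically. The following is a minimal sketch using only NumPy and synthetic data (the three variables and their coefficients are made up for illustration). It diagonalizes the covariance matrix, which is one common way to compute principal components, and shows that the transformed variables are uncorrelated and ordered by variance:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (illustrative only): 500 observations of 3 variables,
# two of which share a common driver and are therefore correlated
base = rng.standard_normal((500, 1))
X = np.hstack([
    base + 0.1 * rng.standard_normal((500, 1)),
    2.0 * base + 0.5 * rng.standard_normal((500, 1)),
    rng.standard_normal((500, 1)),
])

Xc = X - X.mean(axis=0)                  # center each variable
cov = np.cov(Xc, rowvar=False)           # 3 x 3 sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]        # reorder: largest variance first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

components = Xc @ eigvecs                # orthogonal transformation of the data
comp_cov = np.cov(components, rowvar=False)

print(np.round(comp_cov, 6))             # (numerically) diagonal matrix
```

The columns of `eigvecs` are orthonormal, the off-diagonal entries of `comp_cov` vanish up to floating-point error, and its diagonal holds the component variances in descending order.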
Consider, for example, a stock index like the German DAX index, composed of 30
different stocks. The price movements of all constituents taken together determine the
movement of the index (via a well-documented formula). In addition, the price
movements of the individual stocks are generally correlated, for example, due to general
economic conditions or developments in particular sectors.
For statistical applications, it is generally quite hard to use 30 correlated factors to explain
the movements of a stock index. This is where PCA comes into play. It derives individual,
uncorrelated components that are “well suited” to explain the movements in the stock
index. One can think of these components as linear combinations (i.e., portfolios) of
the stocks in the index. Instead of working with 30 correlated index constituents,
one can then work with perhaps five, three, or even just one principal component.
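To make the idea of replacing 30 correlated constituents with a few components concrete, here is a small sketch on simulated data. Everything in it is an assumption for illustration: the returns are generated from a single hypothetical market factor plus noise, the names `betas` and `index_proxy` are invented, and an equal-weighted average stands in for an index. It uses plain NumPy and is not the procedure used later in this section:

```python
import numpy as np

rng = np.random.default_rng(42)
n_days, n_stocks = 1000, 30

# Hypothetical daily returns: one common market factor plus stock-specific noise
market = 0.01 * rng.standard_normal(n_days)
betas = rng.uniform(0.5, 1.5, n_stocks)
returns = np.outer(market, betas) + 0.005 * rng.standard_normal((n_days, n_stocks))

centered = returns - returns.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(centered, rowvar=False))
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]   # largest variance first

explained = eigvals / eigvals.sum()    # share of total variance per component
weights = eigvecs[:, 0]                # first component read as portfolio weights
pc1 = centered @ weights               # return series of that "portfolio"
index_proxy = returns.mean(axis=1)     # equal-weighted stand-in for an index

print("variance share of first component:", explained[0])
print("variance share of first five components:", explained[:5].sum())
print("abs. correlation of first component with index proxy:",
      abs(np.corrcoef(pc1, index_proxy)[0, 1]))
```

In this setup the first component captures most of the total variance because all simulated stocks load on the same factor. With real stock data the shares are typically spread over more components.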
The example of this section illustrates the use of PCA in such a context. We retrieve data
for both the German DAX index and all stocks that make up the index. We then use PCA
to derive principal components, which we use to construct what we call a pca_index.
First, some imports. In particular, we use the KernelPCA class of the scikit-learn
machine learning library (cf. the documentation for KernelPCA):
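In [1]: import numpy as np
        import pandas as pd
        # pandas.io.data was removed from pandas; the separate
        # pandas-datareader package took over this module
        import pandas_datareader.data as web
        from sklearn.decomposition import KernelPCA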
The DAX Index and Its 30 Stocks
The following list object contains the 30 symbols for the stocks contained in the German
DAX index, as well as the symbol for the index itself:
In [2]: symbols = ['ADS.DE', 'ALV.DE', 'BAS.DE', 'BAYN.DE', 'BEI.DE',
                   'BMW.DE', 'CBK.DE', 'CON.DE', 'DAI.DE', 'DB1.DE',
                   'DBK.DE', 'DPW.DE', 'DTE.DE', 'EOAN.DE', 'FME.DE',
                   'FRE.DE', 'HEI.DE', 'HEN3.DE', 'IFX.DE', 'LHA.DE',
                   'LIN.DE', 'LXS.DE', 'MRK.DE', 'MUV2.DE', 'RWE.DE',
                   'SAP.DE', 'SDF.DE', 'SIE.DE', 'TKA.DE', 'VOW3.DE',
                   '^GDAXI']
We work only with the closing values of each data set that we retrieve (for details on how
to retrieve stock data with pandas, see Chapter 6):
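In [3]: %%time
        data = pd.DataFrame()
        for sym in symbols:
            # keep only the closing prices of each symbol
            data[sym] = web.DataReader(sym, data_source='yahoo')['Close']
        # drop all dates for which at least one symbol has no value
        data = data.dropna()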
Out[3]: CPU times: user 408 ms, sys: 68 ms, total: 476 ms
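The effect of the column-by-column assignment followed by `dropna` is easy to miss, so here is a small offline sketch of the same pattern. The symbols `AAA`, `BBB`, and `CCC` and all prices are made up, and no data is downloaded. It assumes standard pandas behavior: the first assigned Series sets the index of the empty `DataFrame`, later Series are aligned to that index, and `dropna` then removes every date on which any column is missing:

```python
import pandas as pd

# Hypothetical closing prices with different date coverage (not real data)
dates = pd.date_range("2014-01-01", periods=6, freq="D")
closes = {
    "AAA": pd.Series([10.0, 10.5, 10.2, 10.8, 11.0, 11.1], index=dates),
    "BBB": pd.Series([20.0, 20.4, 20.1], index=dates[3:]),          # starts later
    "CCC": pd.Series([5.0, 5.1, 5.3, 5.2, 5.4], index=dates[:5]),   # ends earlier
}

data = pd.DataFrame()
for sym, series in closes.items():
    data[sym] = series            # aligned on the index set by the first column

rows_before = len(data)           # all dates of the first symbol, with NaN gaps
data = data.dropna()              # keep only dates where every symbol has a value

print(rows_before, len(data))
print(data)
```

Only the dates covered by all three series survive, which is why the resulting data set can be noticeably shorter than the history of any single symbol.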