Python for Finance: Analyze Big Financial Data


Principal Component Analysis


Principal component analysis (PCA) has become a popular tool in finance. Wikipedia defines the technique as follows:


Principal component analysis (PCA) is a statistical procedure that uses orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. The number of principal components is less than or equal to the number of original variables. This transformation is defined in such a way that the first principal component has the largest possible variance (that is, accounts for as much of the variability in the data as possible), and each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to (i.e., uncorrelated with) the preceding components.
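The definition above can be traced step by step with plain NumPy. The following is only a minimal sketch on synthetic data (the random numbers and all variable names are assumptions made for illustration, not part of the example that follows): it diagonalizes the covariance matrix and projects the centered data onto the eigenvectors.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data: 500 observations of 3 correlated variables
# (an assumption for illustration only, not financial data).
common = rng.standard_normal((500, 1))
X = common + 0.5 * rng.standard_normal((500, 3))

# Center the data and compute the covariance matrix.
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)

# Eigen-decomposition of the symmetric covariance matrix. eigh returns
# eigenvalues in ascending order, so reorder to put the largest first.
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# The orthogonal transformation: project the data onto the eigenvectors.
components = Xc @ eigvecs

# Covariance of the components: diagonal (uncorrelated), with the
# variances on the diagonal sorted from largest to smallest.
comp_cov = np.cov(components, rowvar=False)
print(np.round(comp_cov, 6))
```

The off-diagonal entries of `comp_cov` vanish up to floating-point error, which is the "linearly uncorrelated" property from the definition, and the diagonal holds the component variances in decreasing order.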

Consider, for example, a stock index like the German DAX index, composed of 30 different stocks. The price movements of all the stocks taken together determine the movement in the index (via some well-documented formula). In addition, the price movements of the individual stocks are generally correlated, for example, due to general economic conditions or developments in particular sectors.


For statistical applications, it is generally quite hard to use 30 correlated factors to explain the movements of a stock index. This is where PCA comes into play. It derives single, uncorrelated components that are “well suited” to explain the movements in the stock index. One can think of these components as linear combinations (i.e., portfolios) of selected stocks from the index. Instead of working with 30 correlated index constituents, one can then work with maybe five, three, or even only one principal component.
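To see why a handful of components can stand in for 30 correlated series, consider a small simulation. This is a sketch under stated assumptions only: the returns are hypothetical, generated from one common "market" factor plus stock-specific noise, and the factor loadings (`betas`) and volatilities are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_days, n_stocks = 1000, 30

# Hypothetical daily returns: a common market factor scaled by a
# stock-specific loading, plus independent stock-specific noise.
market = 0.01 * rng.standard_normal((n_days, 1))
betas = rng.uniform(0.5, 1.5, size=(1, n_stocks))
returns = market * betas + 0.005 * rng.standard_normal((n_days, n_stocks))

# Standardize each series; the squared singular values of the
# standardized data are proportional to the component variances.
Z = (returns - returns.mean(axis=0)) / returns.std(axis=0)
singular_values = np.linalg.svd(Z, compute_uv=False)
explained = singular_values ** 2 / np.sum(singular_values ** 2)

print("share explained by first component:", explained[0].round(3))
print("share explained by first five:     ", explained[:5].sum().round(3))
```

In this artificial setup the first component captures most of the total variance because all 30 series share the same driver. Real index constituents are less tidy, so the shares for actual data have to be measured rather than assumed.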


The example in this section illustrates the use of PCA in such a context. We retrieve data for both the German DAX index and all stocks that make up the index. We then use PCA to derive principal components, which we use to construct what we call a pca_index.


First, some imports. In particular, we use the KernelPCA class of the scikit-learn machine learning library (cf. the documentation for KernelPCA):


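In [1]: import numpy as np
        import pandas as pd
        # pandas.io.data was removed from later pandas releases;
        # the reader now lives in the separate pandas-datareader package
        from pandas_datareader import data as web
        from sklearn.decomposition import KernelPCA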

The DAX Index and Its 30 Stocks


The following list object contains the 30 symbols for the stocks contained in the German DAX index, as well as the symbol for the index itself:


In [2]: symbols = ['ADS.DE', 'ALV.DE', 'BAS.DE', 'BAYN.DE', 'BEI.DE',
                   'BMW.DE', 'CBK.DE', 'CON.DE', 'DAI.DE', 'DB1.DE',
                   'DBK.DE', 'DPW.DE', 'DTE.DE', 'EOAN.DE', 'FME.DE',
                   'FRE.DE', 'HEI.DE', 'HEN3.DE', 'IFX.DE', 'LHA.DE',
                   'LIN.DE', 'LXS.DE', 'MRK.DE', 'MUV2.DE', 'RWE.DE',
                   'SAP.DE', 'SDF.DE', 'SIE.DE', 'TKA.DE', 'VOW3.DE',
                   '^GDAXI']

We work only with the closing values of each data set that we retrieve (for details on how to retrieve stock data with pandas, see Chapter 6):


In [3]: %%time
        data = pd.DataFrame()
        for sym in symbols:
            data[sym] = web.DataReader(sym, data_source='yahoo')['Close']
        data = data.dropna()
Out[3]: CPU times: user 408 ms, sys: 68 ms, total: 476 ms
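The column-by-column assembly followed by dropna can be tried without a network connection. The sketch below uses two hypothetical symbols (`AAA`, `BBB`) with invented prices and dates; it is only meant to show how pandas aligns each new column on the existing index and how dropna then removes every date with a missing value.

```python
import pandas as pd

# Hypothetical closing prices; BBB has no quote on two of the dates.
dates = pd.date_range("2014-01-01", periods=5, freq="D")
closes = {
    "AAA": pd.Series([10.0, 10.2, 10.1, 10.4, 10.3], index=dates),
    "BBB": pd.Series([20.0, 20.5, 20.7], index=dates[[0, 1, 3]]),
}

data = pd.DataFrame()
for sym, series in closes.items():
    # The first column sets the index of the empty frame; later columns
    # are aligned on it, with NaN wherever a date is missing.
    data[sym] = series

print(data)

# Keep only the dates for which every symbol has a closing value.
clean = data.dropna()
print(clean)
```

One consequence of this pattern is that the first symbol determines which dates can appear at all, and a single symbol with a short history shrinks the whole data set after dropna.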