Python for Finance: Analyze Big Financial Data

generally most useful when seen as diachronic — i.e., in the sense that new data revealed

over time allows for better regressions and estimates.

To incorporate this concept in the current example, we assume that the regression

parameters are not only random and distributed in some fashion, but that they follow some

kind of random walk over time. It is the same generalization used when making the

transition in finance theory from random variables to stochastic processes (which are

essentially ordered sequences of random variables):

To this end, we define a new PyMC3 model, this time specifying parameter values as

random walks with the variance parameter values transformed to log space (for better

sampling characteristics).

In [ 42 ]: model_randomwalk = pm.Model() with model_randomwalk: # std of random walk best sampled in log space sigma_alpha, log_sigma_alpha = \ model_randomwalk.TransformedVar(‘sigma_alpha’, pm.Exponential.dist(1. / . 02 , testval=. 1 ), pm.logtransform) sigma_beta, log_sigma_beta = \ model_randomwalk.TransformedVar(‘sigma_beta’, pm.Exponential.dist(1. / . 02 , testval=. 1 ), pm.logtransform)

After having specified the distributions of the random walk parameters, we can proceed

with specifying the random walks for alpha and beta. To make the whole procedure more

efficient, 50 data points at a time share common coefficients:

In [ 43 ]: from pymc.distributions.timeseries import GaussianRandomWalk # to make the model simpler, we will apply the same coefficients # to 50 data points at a time subsample_alpha = 50 subsample_beta = 50

with model_randomwalk: alpha = GaussianRandomWalk(‘alpha’, sigma_alpha**- 2 , shape=len(data) / subsample_alpha) beta = GaussianRandomWalk(‘beta’, sigma_beta**- 2 , shape=len(data) / subsample_beta)

# make coefficients have the same length as prices alpha_r = np.repeat(alpha, subsample_alpha) beta_r = np.repeat(beta, subsample_beta)

The time series data sets have a length of 1,967 data points:

In [ 44 ]: len(data.dropna().GDX.values) # a bit longer than 1,950 Out[44]: 1967

For the sampling to follow, the number of data points must be divisible by 50. Therefore,

only the first 1,950 data points are taken for the regression:

In [ 45 ]: with model_randomwalk: # define regression regression = alpha_r + beta_r * data.GDX.values[: 1950 ]

# assume prices are normally distributed # the mean comes from the regression sd = pm.Uniform(‘sd’, 0 , 20 ) likelihood = pm.Normal(‘GLD’, mu=regression, sd=sd, observed=data.GLD.values[: 1950 ])

All these definitions are a bit more involved than before due to the use of random walks

instead of a single random variable. However, the inference steps with the MCMC remain