This is better, as we’ve reduced the number of memory allocations to
just two, down from possibly many in the first version of the code.
If we really need the speed, we might consider recoding this in C, as
discussed in Chapter 14.
2.5.2 Extended Example: Predicting Discrete-Valued Time Series
Suppose we observe 0- and 1-valued data, one per time period. To make
things concrete, say it’s daily weather data: 1 for rain and 0 for no rain.
Suppose we wish to predict whether it will rain tomorrow, knowing whether it
rained or not in recent days. Specifically, for some number k, we will predict
tomorrow’s weather based on the weather record of the last k days. We’ll use
majority rule: If the number of 1s in the previous k time periods is at least
k/2, we’ll predict the next value to be 1; otherwise, our prediction is 0. For
instance, if k = 3 and the data for the last three periods is 1,0,1, we’ll predict
the next period to be a 1.
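The majority rule itself can be sketched in a few lines of R. The function name majority_pred is hypothetical, introduced here only for illustration; it takes the k most recent observations and returns the predicted next value.

```r
# Hypothetical helper (not from the text): predict the next value
# from the k most recent observations by majority rule.
majority_pred <- function(recent) {
   k <- length(recent)
   # predict 1 if at least k/2 of the recent values are 1s
   if (sum(recent) >= k/2) 1 else 0
}

majority_pred(c(1,0,1))  # two 1s out of three: 2 >= 1.5, so the rule predicts 1
majority_pred(c(0,0,1))  # one 1 out of three: 1 < 1.5, so the rule predicts 0
```

Note that because the rule says "at least k/2," a tie with an even k (say, 1,0 with k = 2) is resolved in favor of predicting 1.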
But how should we choose k? Clearly, if we choose too small a value, it
may give us too small a sample from which to predict. Too large a value will
cause us to rely on data from the distant past that may have little or no
predictive value.
A common solution to this problem is to take known data, called a
training set, and then ask how well various values of k would have performed on
that data.
In the weather case, suppose we have 500 days of data and suppose we
are considering using k = 3. To assess the predictive ability of that value for
k, we “predict” each day in our data from the previous three days and then
compare the predictions with the known values. After doing this throughout
our data, we have an error rate for k = 3. We do the same for k = 1, k = 2,
k = 4, and so on, up to some maximum value of k that we feel is enough. We
then use whichever value of k worked best in our training data for future
predictions.
So how would we code that in R? Here’s a naive approach:
1  preda <- function(x,k) {
2     n <- length(x)
3     k2 <- k/2
4     # the vector pred will contain our predicted values
5     pred <- vector(length=n-k)
6     for (i in 1:(n-k)) {
7        if (sum(x[i:(i+(k-1))]) >= k2) pred[i] <- 1 else pred[i] <- 0
8     }
9     return(mean(abs(pred-x[(k+1):n])))
10 }
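As a usage sketch, the function can be applied to each candidate k in turn to find the one with the lowest training error. The simulated data, the seed, and the names kmax, errs, and bestk are assumptions made for illustration, not part of the text; preda() is repeated without line numbers only so that the block runs on its own.

```r
# preda() repeated from the numbered listing so this sketch is self-contained
preda <- function(x,k) {
   n <- length(x)
   k2 <- k/2
   pred <- vector(length=n-k)
   for (i in 1:(n-k)) {
      if (sum(x[i:(i+(k-1))]) >= k2) pred[i] <- 1 else pred[i] <- 0
   }
   return(mean(abs(pred-x[(k+1):n])))
}

set.seed(1)                           # assumed seed, for reproducibility
x <- sample(0:1, 500, replace=TRUE)   # assumed data: 500 simulated days
kmax <- 10                            # assumed largest value of k to try
# training error rate for each k from 1 to kmax
errs <- sapply(1:kmax, function(k) preda(x,k))
bestk <- which.min(errs)              # first k with the lowest error rate
```

With purely random data like this, no value of k should be expected to predict well; real weather records would be needed for the comparison to be meaningful.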
The heart of the code is line 7. There, we’re predicting day i+k
(prediction to be stored in pred[i]) from the k days previous to it, that is, days
i,...,i+k-1. Thus, we need to count the 1s among those days. Since we’re