  1. Leveraged the same sliding-window approach for training the model (sketched
    after this list). Initial results showed that, like GBR, GP's extrapolation
    capability degraded substantially as newer data was received.

  2. Added a validation set, used first to select better solutions for extrapolation
    and later to enable an ensemble approach.

  3. Added ensembles formed by simple averaging of the best solutions.
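
To make the sliding-window step concrete, the sketch below shows one way such a
retraining loop can be organized in Python; the window and step sizes and the
train_model callable are illustrative assumptions, not the implementation we
actually used.

    def sliding_window_fit(X, y, train_model, window=500, step=100):
        # Refit on the most recent `window` samples every `step` new samples,
        # so the model keeps tracking the newest data.
        models = []
        for end in range(window, len(X) + 1, step):
            models.append(train_model(X[end - window:end], y[end - window:end]))
        return models

With scikit-learn, for instance, train_model could simply be
lambda X, y: GradientBoostingRegressor().fit(X, y).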


The biggest implication of these steps was that the GP system's code had to be
changed to output solutions for the ensemble method, which made the prototype
system considerably more brittle: a custom library, as well as additional custom
scripts, would now need to be maintained.
Regarding the infrastructure required to run the GP system as a competitive Data
Science tool, we leveraged our access to a very large cluster of compute nodes with
multiple processors, each with a large amount of dedicated memory. This allowed
us to develop GP solutions within a somewhat reasonable timescale relative to the
GBR method. GBR could develop models within several seconds on a basic desktop
machine, approximately 17 s on average. For GP, one run took approximately 2 min
to complete. We executed this 30 times in parallel; without parallelization the 30
runs would take an hour, and additional parallelization could bring the time down
to around 5 min. While the compute time puts the GP method at a distinct
disadvantage, in the era of Big Data this kind of infrastructure difference is less
critical: data sets will become larger and larger and will natively be stored in
massively distributed storage systems, giving easily parallelizable and distributed
methods like GP an advantage.
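
The sketch below illustrates how the 30 independent runs can be farmed out with
Python's multiprocessing module; run_gp here is a hypothetical stand-in for a
single GP run, not part of our system.

    from multiprocessing import Pool

    def run_gp(seed, params):
        # Hypothetical entry point for one complete GP run (about 2 min each
        # in our setup); it should return the models stored from that run.
        raise NotImplementedError

    def run_all_parallel(params, n_runs=30):
        # With one worker per run, roughly an hour of sequential compute
        # shrinks to a few minutes of wall time.
        with Pool(processes=n_runs) as pool:
            return pool.starmap(run_gp, [(seed, params) for seed in range(n_runs)])
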
We now describe some of the GP settings we used. We set the population size to
1000 and used a generation-based model with a maximum of 100 generations. The
function set consisted of the standard operators (+, -, *, /, sqrt, square, exp)
over 32 variables. The remaining parameters were set to typical, competent GP
values: an initial tree depth of 15, a tournament size of 10, a crossover rate of
0.7, 10 tries to produce a unique tree during crossover, a mutation rate of 0.2,
and a replication rate of 0.1.
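
For reference, these settings can be collected into a single configuration
object; the key names below are our own shorthand and would be spelled
differently in any particular GP package.

    GP_PARAMS = {
        "population_size": 1000,
        "max_generations": 100,
        "functions": ["+", "-", "*", "/", "sqrt", "square", "exp"],
        "n_variables": 32,
        "initial_tree_depth": 15,
        "tournament_size": 10,
        "crossover_rate": 0.7,
        "unique_tree_tries": 10,  # retries to produce a unique tree in crossover
        "mutation_rate": 0.2,
        "replication_rate": 0.1,
    }
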
After some brief initial probing of hyperparameters, we determined the following
to be an effective GP approach. First, we split our training data into five
equal-sized, time-sequential groups. The three earliest periods were used to
train our GP system: we ran the system 30 times to produce a selection of best
solutions of various sizes and accuracies. Second, we used the fourth training
period (the next consecutive one) as validation data to select the top 15 models,
measured by RMSE and model complexity, across all populations and runs.
Specifically, for each run we stored the models that had the best accuracy and
were the least complex (smallest), the least complex model, and the most accurate
models, so after 30 runs we had approximately 90 candidate models. We then tested
all of these models against the fourth training period and scored them by their
RMSE. Finally, we selected the top 15 models (the number 15 was determined with
minimal trial and error) and allowed each of them to extrapolate
on each new data point (sensors_1, ..., sensors_{N-1}) to predict sensors_N. The
final prediction is then the average of the predictions from the 15 models,
together with a 1 standard deviation band around that average.
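
A minimal sketch of this selection-and-averaging procedure follows, assuming
each candidate model exposes a predict method; the interface is our own choice,
not the actual code.

    import numpy as np

    def rmse(y_true, y_pred):
        return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))

    def select_top_models(candidates, X_val, y_val, k=15):
        # Rank the ~90 candidates by RMSE on the fourth (validation) period
        # and keep the best k.
        return sorted(candidates, key=lambda m: rmse(y_val, m.predict(X_val)))[:k]

    def ensemble_predict(top_models, X_new):
        # Final prediction: the mean over the selected models, with the
        # per-point standard deviation as a simple 1-sigma uncertainty band.
        preds = np.array([m.predict(X_new) for m in top_models])
        return preds.mean(axis=0), preds.std(axis=0)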
