1. Diversity promotion
Training takes place on a diversity of mutually exclusive training sets, which
can potentially lead to a diversity of solutions.
2. Generalization
The method allows for much more effective use of the data set, removing
the requirement to hold out part of the data from all training for use as an
out-of-sample set. The method also significantly reduces sensitivity to any
selection bias on the training set by allowing n originating segments, each to
act as a training set for a subset of the candidates. This method removes, or at
least reduces, the need for a separate process for verifying evolved candidates on
out-of-sample data, as that step is built into the production system. In addition,
generalization of any evolved candidate is much more reliable, as its respective
out-of-sample evaluation set is n - 1 times larger than the segment on which it
evolved (a toy sketch of this evaluation scheme follows this list).
3. Scale
Many more candidates are tested on unseen data; this testing is done in parallel
and runs simultaneously with the training.
4. Speed
A lower top-layer max age means faster convergence over the training sets
(i.e., the originating segments). However, this is evened out somewhat because
more time and processing capacity is spent on validation, leaving less capacity
available for training. The age-layered nature of the system filters out overfitting
candidates, so segments with uneven distributions of data points have less of
an impact. This manner of associating segments with Evolution Engines has
the added benefit of allowing data points to be cached at the worker
nodes, reducing the need to move data packages around. The infrastructure
can then be smarter about moving the candidates around rather than
the data, reducing the bandwidth requirements and, as a result, improving the
efficiency of the system.
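The following Python listing is a minimal, hypothetical sketch of the evaluation scheme described in items 2 and 3 above, not EC-Star code: the data set is split into n mutually exclusive segments, each candidate carries the index of its originating segment, and its out-of-sample fitness is taken over the other n - 1 segments. All names here (partition, Candidate, out_of_sample_fitness) and the toy threshold classifier are assumptions made for illustration.

import random

def partition(data, n):
    """Split the data set into n mutually exclusive segments."""
    shuffled = random.sample(data, len(data))
    return [shuffled[i::n] for i in range(n)]

class Candidate:
    """Toy candidate: a one-threshold classifier tagged with the
    index of the segment it originated from."""
    def __init__(self, threshold, origin):
        self.threshold = threshold
        self.origin = origin

    def score(self, segment):
        # Fraction of points classified correctly (stand-in fitness).
        correct = sum((x > self.threshold) == label for x, label in segment)
        return correct / len(segment)

def out_of_sample_fitness(candidate, segments):
    """Average score over the n - 1 segments the candidate did not
    originate from -- the validation step built into the run."""
    held_out = [s for i, s in enumerate(segments) if i != candidate.origin]
    return sum(candidate.score(s) for s in held_out) / len(held_out)

# Toy data set: points above 0.5 are labeled True.
data = [(x, x > 0.5) for x in (random.random() for _ in range(100))]
segments = partition(data, n=5)

cand = Candidate(threshold=0.5, origin=0)     # notionally trained on segment 0
print(out_of_sample_fitness(cand, segments))  # validated on the other 4 segments

Note that in this sketch the out-of-sample evaluation reuses the same scoring function as training; what changes is only which segments the candidate is scored on.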


5.1 Future Work


One of the drawbacks to the system just described is that the single fold of data
on which the candidates are trained is fixed throughout training, with no overlap.
This means that there is a chance that the training pools associated with each data
segment may converge relatively quickly, so that the diversity of data is not translated
into a diversity of genotypic solutions. In other words, if the data segments are small
enough, there is a risk that candidates with the same source segment evolve to local
optima much more quickly than if we trained on the entire data set.
To combat this problem, we can alter the approach so that an Evolution
Coordinator designates each Evolution Engine to evaluate material from M different
sub-segments, where M is less than the total number of data segments in the system
(often less than half). Candidates originating from an Evolution Engine are also
marked as originating from all of the segments available to that Evolution Engine,
as in the sketch below.
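The listing below is a hedged sketch of how such an assignment might look, assuming a simple coordinator that samples M segment indices per engine; assign_segments and held_out_segments are illustrative names, not EC-Star's API.

import random

def assign_segments(num_engines, num_segments, m):
    """Coordinator-side assignment: each Evolution Engine receives M
    distinct sub-segments, with M below the total segment count."""
    assert m < num_segments, "M must be less than the number of segments"
    return {engine: sorted(random.sample(range(num_segments), m))
            for engine in range(num_engines)}

# Candidates produced by an engine are marked as originating from all
# of that engine's segments, so out-of-sample validation uses the rest.
def held_out_segments(engine, assignments, num_segments):
    origin = set(assignments[engine])
    return [s for s in range(num_segments) if s not in origin]

assignments = assign_segments(num_engines=8, num_segments=10, m=4)
print(assignments[0])                         # e.g. [1, 4, 6, 9]
print(held_out_segments(0, assignments, 10))  # the remaining 6 segments

Under this variant, each candidate's training pool mixes M segments while its out-of-sample set shrinks from n - 1 to n - M segments, trading some validation coverage for more training diversity per engine.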
