Genetic_Programming_Theory_and_Practice_XIII

(C. Jardin) #1

Using GP for Data Science 129


[Flowchart: New input data with possible “drift” is scored two ways: predict using the existing model, and build and try out an updated model (updated predictors) and predict using it. The results of the two models are compared, including feature sensitivity (do the coefficients differ significantly?). If significantly different: “publish” the updated model. If within the set threshold: “retain” the existing model.]

Fig. 5 Schematic showing criteria for handling sensor “Drift”
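
The publish-or-retain decision in Fig. 5 could be sketched roughly as follows. This is a minimal illustration, not the chapter's implementation: the function names, the relative mean absolute difference used as the comparison measure, and the 5% threshold are all assumptions chosen for the example, and the coefficient (feature sensitivity) check shown in the figure is not covered.

```python
from statistics import mean


def relative_difference(existing_preds, updated_preds):
    """Mean absolute difference between two prediction lists, scaled by the
    mean magnitude of the existing model's predictions (assumed measure)."""
    if len(existing_preds) != len(updated_preds) or not existing_preds:
        raise ValueError("prediction lists must be non-empty and equal in length")
    diff = mean(abs(a - b) for a, b in zip(existing_preds, updated_preds))
    scale = mean(abs(a) for a in existing_preds)
    return diff / scale if scale else diff


def handle_drift(existing_preds, updated_preds, threshold=0.05):
    """Return "publish" if the updated model's predictions differ from the
    existing model's by more than the threshold, otherwise "retain"."""
    if relative_difference(existing_preds, updated_preds) > threshold:
        return "publish"
    return "retain"


# Hypothetical predictions from the existing model on new input data.
existing = [10.0, 12.0, 11.0, 13.0]

# An updated model that barely disagrees, and one that disagrees strongly.
decision_small = handle_drift(existing, [10.1, 12.1, 10.9, 13.1])
decision_large = handle_drift(existing, [12.0, 15.0, 13.5, 16.0])
print(decision_small, decision_large)
```

In practice the threshold would have to be tuned to the sensor and the task, and a statistical test could replace the simple relative difference used here.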


like selection, evaluation, and recombination. The GP system also contains several
configuration (parameter) files, where settings such as population size, function set,
initialization method, and selection pressure can be specified. The goal of the
configuration files is to allow customization of the system without modifying
the class files, which would require recompiling the source. As with most
systems for EC, there is a considerable learning curve in understanding how certain
functionality is represented and programmed.
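
To make the role of such parameter files concrete, here is a small sketch of reading run settings from a text file instead of hard-coding them. The `key = value` format, the key names, and the `parse_params` helper are all hypothetical and do not reproduce the actual file format of the GP system discussed here.

```python
def parse_params(text):
    """Parse a simple "key = value" parameter file; "#" starts a comment."""
    params = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue  # skip blank lines and comment-only lines
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"expected 'key = value', got: {raw!r}")
        params[key.strip()] = value.strip()
    return params


# Hypothetical parameter file contents (names are illustrative only).
EXAMPLE = """
# run settings
population.size = 500
functions = add, sub, mul, div
init.method = ramped-half-and-half
selection.tournament.size = 7
"""

params = parse_params(EXAMPLE)
population_size = int(params["population.size"])
functions = [name.strip() for name in params["functions"].split(",")]
print(population_size, functions)
```

Changing the population size or the function set then means editing the text file, with no change to the code and no recompilation.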
Looking across existing GP solutions, each has strengths in different
areas: user interfaces; cloud and distributed compute support; integration with
data management and visualization tools such as Mathematica, Matlab, or R;
or advanced GP features such as ensembles (e.g., FlexGP). We chose a
package that was more mature in its advanced features but less mature in its user
experience. This choice suits users with a high degree of expertise
but, as we will see later, has downsides for novice users and for integration with
other systems and prototyping. The process used to create a competitive GP solution
for our Data Science task was as follows:



  1. Feature selection, as in GBR.
  2. Simplification of the mathematical operators.
  3. Increasing the training data size; initial results showed that GP benefited from
     more data.
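
The three steps above might be sketched as below. Everything here is illustrative: the importance scores are made-up stand-ins for what a GBR model would report, "simplification" is assumed to mean restricting the GP function set to basic arithmetic, and the helper names and sample sizes are hypothetical.

```python
def select_features(importances, k):
    """Step 1: keep the k features with the highest importance scores."""
    ranked = sorted(importances, key=importances.get, reverse=True)
    return ranked[:k]


def simplify_operators(function_set, allowed):
    """Step 2: keep only operators in the allowed set, preserving order."""
    return [name for name in function_set if name in allowed]


def training_sizes(n_available, start, factor=2):
    """Step 3: growing training-set sizes, capped at the rows available."""
    sizes = []
    size = start
    while size < n_available:
        sizes.append(size)
        size *= factor
    sizes.append(n_available)
    return sizes


# Made-up importance scores, standing in for those from a GBR model.
importances = {"x1": 0.42, "x2": 0.05, "x3": 0.31, "x4": 0.02, "x5": 0.20}
features = select_features(importances, k=3)

operators = simplify_operators(
    ["add", "sub", "mul", "div", "sin", "exp", "log"],
    allowed={"add", "sub", "mul", "div"},
)

sizes = training_sizes(n_available=10000, start=1000)
print(features, operators, sizes)
```

Running GP at each size in `sizes` and comparing the validation error would show whether additional data is still helping.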
