Genetic_Programming_Theory_and_Practice_XIII

Using GP for Data Science 133

un-optimized, or un-simplified, postfix expressions, or counting frequent subtrees or variables, is not effective and could be misleading.

GP needs to iterate faster and reduce the time needed to create a good first
model. Diversity could indicate when to stop the model building process early
and return a result. Other code optimizations that reduce compute time should
also be pursued.
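
One way to picture diversity as a stopping signal is the minimal sketch below.
It is an illustration only, not the method used in this chapter: the diversity
measure (fraction of distinct expressions), the function names, and the
`threshold` and `patience` values are all assumptions chosen for the example.

```python
def population_diversity(population):
    """Return the fraction of distinct individuals in the population."""
    if not population:
        raise ValueError("population must not be empty")
    return len(set(population)) / len(population)


def should_stop(diversity_history, threshold=0.2, patience=3):
    """Return True once diversity has stayed below `threshold` for the
    last `patience` consecutive generations (hypothetical criterion)."""
    if len(diversity_history) < patience:
        return False
    return all(d < threshold for d in diversity_history[-patience:])


# Toy generations: each individual is an expression string (assumed encoding).
generations = [
    ["x+y", "x*y", "x-y", "sin(x)"],
    ["x+y", "x+y", "x*y", "x-y"],
    ["x+y", "x+y", "x+y", "x+y"],
    ["x+y", "x+y", "x+y", "x+y"],
    ["x+y", "x+y", "x+y", "x+y"],
]

history = []
stopped_at = None
for generation, population in enumerate(generations):
    history.append(population_diversity(population))
    if should_stop(history, threshold=0.3, patience=2):
        stopped_at = generation  # return the best model found so far
        break
```

In a real GP system the diversity measure would more likely be based on
behavior or tree structure than on exact string equality, and the threshold
would need to be calibrated against model quality.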

Frequently updating a model or adding new features during model retraining can
confuse the end user and make maintenance difficult. When the GP algorithm adds
a new feature as a result of retraining the model with new data, the change
should be intuitive to the user. A new direction of research could aim to
illuminate what the new features used during model retraining might mean for
solving or modeling the system.

Open source implementations that have been matured by communities of users,
as is the case for our GBR approach, typically perform well out of the box.
When this is the case, it is best not to tweak the runtime parameters that
others have optimized over long periods of time. GP should seek such
broad community development to improve method robustness and out-of-the-box
performance.

Code optimization should be saved for much later in the development phase.
Giving priority to ‘working scripts’ leads to quick results that can be shared
with clients, so that approaches can be altered based on user feedback.
Practitioners should avoid modifying the core GP system and instead tune the
system parameters contained in configuration files.
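
As a rough illustration of tuning through configuration rather than through
changes to the core system, the sketch below reads GP run parameters from an
INI-style file with Python's standard `configparser` module. The section name
`gp`, the parameter names, and the default values are hypothetical and do not
come from the system described in this chapter.

```python
import configparser

# Hypothetical defaults; a real GP system would define its own parameters.
DEFAULTS = {
    "population_size": "500",
    "generations": "50",
    "max_tree_depth": "8",
    "crossover_rate": "0.9",
}


def load_gp_params(config_text):
    """Merge user-supplied settings over the defaults and return typed values."""
    parser = configparser.ConfigParser()
    parser.read_dict({"gp": DEFAULTS})   # start from the defaults
    parser.read_string(config_text)      # user settings override them
    section = parser["gp"]
    return {
        "population_size": section.getint("population_size"),
        "generations": section.getint("generations"),
        "max_tree_depth": section.getint("max_tree_depth"),
        "crossover_rate": section.getfloat("crossover_rate"),
    }


# The practitioner edits only this text (normally a file on disk),
# never the GP engine itself.
example_config = """
[gp]
population_size = 1000
crossover_rate = 0.8
"""

params = load_gp_params(example_config)
```

Keeping every tunable value in one file of this kind also makes it easy to
record exactly which settings produced a result shared with a client.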

In our work, comparing the two Data Science approaches, GBR and GP, let us see
quite clearly the strengths and weaknesses of GP relative to GBR. The lessons
learned above reflect both the positive attributes of GP and the places where
more work is needed. We see a lot of potential in GP as a new Data Science tool,
particularly for use on Big Data and in complex, nonlinear, and
domain-knowledge-intensive problems. To gain mainstream adoption, however, GP
needs further work, and we believe these lessons learned help identify future
areas of both research and development of GP systems.

5 Conclusions

This chapter described a case study of applying GP to a real-world Data Science
task in the problem domain of operations optimization. We believe the application
is quite novel in attempting to build an online sensor estimation method that
validates data quality (it could be used to signal when a sensor is starting to
drift), provides an estimate of a sensor's reading when the sensor fails or goes
offline, and provides transparency into dynamic systems when they change by
highlighting how the underlying GP solution changes. While both methods achieved
acceptable and similar accuracy, the GBR method won out in the client application
