Genetic_Programming_Theory_and_Practice_XIII


Using GP for Data Science 133


un-optimized, or un-simplified, postfix expressions, or counting frequent subtrees
or variables, is not effective and could be misleading.


  1. GP needs to iterate faster and reduce the time needed to create a good first
    model. Population diversity could serve as an indicator of when to stop the
    model-building process early and return a result. Other code optimization
    opportunities to reduce compute time should also be pursued.

  2. Frequently updating a model or adding new features during model retraining can
    confuse the end user and make maintenance difficult. When a new feature is
    added by the GP algorithm as a result of retraining the model with new data, the
    change should be intuitive to the user. A new direction of research could
    clarify what the features newly introduced during retraining imply about the
    system being modeled.

  3. Open-source implementations that have matured through communities of users,
    as is the case for our GBR approach, are typically high performing out-of-the-
    box. When this is the case, it is best not to tweak the runtime parameters that
    have been optimized by others over long periods of time. GP should seek such
    broad community development to improve method robustness and out-of-the-box
    performance.

  4. Code optimization should be saved for much later in the development phase.
    Giving priority to ‘working scripts’ leads to quick results that can be shared
    with clients, so that approaches can be altered based on user feedback.
    Practitioners should avoid modifying the core GP system and instead tune the
    system parameters contained in configuration files.
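
The idea in lesson 1 of using diversity as a stopping signal can be illustrated with a minimal sketch. It is not the authors' implementation: the diversity measure (fraction of distinct expression strings), the function names, and the `threshold` and `patience` values are all assumptions chosen for illustration, and the expressions are assumed to have been simplified to a canonical string form before being compared.

```python
from typing import List, Sequence


def population_diversity(population: Sequence[str]) -> float:
    """Fraction of distinct expressions in the population (0.0 to 1.0).

    Assumption: each individual is a simplified, canonical expression string,
    so equal strings mean equal models.
    """
    if not population:
        return 0.0
    return len(set(population)) / len(population)


def should_stop(history: Sequence[float],
                threshold: float = 0.5,
                patience: int = 2) -> bool:
    """True once diversity stayed below `threshold` for `patience` generations."""
    if len(history) < patience:
        return False
    return all(d < threshold for d in history[-patience:])


# Toy run: each generation is a list of (hypothetical) expression strings.
generations = [
    ["x+y", "x*y", "x-y", "y/x"],   # diversity 1.0
    ["x+y", "x+y", "x*y", "x-y"],   # diversity 0.75
    ["x+y", "x+y", "x+y", "x*y"],   # diversity 0.5
    ["x+y", "x+y", "x+y", "x+y"],   # diversity 0.25
    ["x+y", "x+y", "x+y", "x+y"],   # diversity 0.25
    ["x+y", "x+y", "x+y", "x+y"],   # never reached in this run
]

history: List[float] = []
stopped_at = None
for gen_index, population in enumerate(generations):
    history.append(population_diversity(population))
    if should_stop(history):
        stopped_at = gen_index  # return the best model found so far here
        break

print("diversity per generation:", history)
print("stopped at generation:", stopped_at)
```

In this toy run the loop stops at generation 4, the second consecutive generation whose diversity is below 0.5. A real GP system would more likely measure diversity on behavior (for example, model outputs on the training data) than on syntax, and both the threshold and the patience would need tuning per problem.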


In our work, comparing the two Data Science approaches, GBR and GP, let us see
quite clearly the strengths and weaknesses of GP relative to GBR. The lessons
learned above reflect both the positive attributes of GP and the places where
more work is needed. We see a lot of potential in GP as a new Data Science tool,
particularly for use on Big Data and in complex, nonlinear, and
domain-knowledge-intensive problems. To reach mainstream adoption, however, GP
systems need further work, and we believe these lessons learned help identify
future areas of both research and development.


5 Conclusions


This chapter described a case study of applying GP to a real-world Data Science
task in the problem domain of operations optimization. We believe the application
is quite novel in attempting to build an online sensor estimation method that
validates data quality (it could be used to signal when a sensor is starting to
drift), provides an estimate of a sensor's value when it fails or goes offline,
and provides transparency into dynamic systems as they change by highlighting
how the underlying GP solution changes. While both methods were able to achieve
acceptable and similar accuracy, the GBR method won out in the client application
