Genetic_Programming_Theory_and_Practice_XIII

(C. Jardin) #1

132 S. Gustafson et al.



  • Spread of accuracy values for the population. If all programs in the population are
    giving similar accuracy, then they are all very similar. However, if the accuracies
    range is large, then the population is more diverse.
    During typical GP model building, the accuracy stops improving at some point.
    As the accuracy improves, the diversity tends to decline before the accuracy has
    stopped improving. This is because as the accuracy is increasing, the more fit
    member of populations are generating offspring that are similar to them. Eventually
    the population becomes less diverse because all the members of the population are
    descended from similar fit individuals. So even though the population as a whole
    becomes more accurate, it becomes less diverse. In Data Sciences applications,
    diversity is a potential indicator of when to stop a run early, which could enable
    GP to have faster iteration time. In addition, having diverse sets of models is useful
    for creating ensembles of models. After the GP run loses diversity, the models are
    likely to have similar errors, or residuals, which when combined with each other in
    an ensemble is likely to be less beneficial. Ensemble methods benefit when solutions
    are combined that have different errors. Our initial experiments showed promise to
    use diversity as an early stopping criterion as well as a way to find better ensembles.


4 Lessons Learned


In the previous section, we described how two methods were used to solve the
operations optimization problem for a real-world Data Science client. We also
discussed several advantages, disadvantages and considerations of each method.
Several lessons were extracted from our experiences that could be useful for future
Data Science engagements as well as for future development of GP as a Data
Science tool.



  1. Data Science needs data management, and GP needs better linkage to the ‘data
    environment’. Without a strong linkage to data management, GP must rely on
    additional tools to prepare data, intermediate data files, extra scripts to manage
    those intermediate files, which all create an additional burden on the Data
    Scientist and possibly the application developer who will turn the model into
    a prototype. GP systems like DEAP and Data Modeler are examples where a
    linkage to the data environment is strong.

  2. The development of more competent GP systems needs to continue to be
    pursued, making the out-of-box performance of GP better. System enhancements
    during model development by a Data Scientist should not occur within core
    libraries. Elements like ensembles should be made as default.

  3. Simplification of understanding and visualization of solutions can differentiate
    GP in the Data Science category. While some tools are better than others, the
    community should embrace this feature with new and insightful ways to optimize
    and visualize solutions—the effort should be very low to query a population of
    solutions for “how” they are solving the problem. Current approaches of listing

Free download pdf