132 S. Gustafson et al.
- Spread of accuracy values for the population. If all programs in the population are
giving similar accuracy, then they are all very similar. However, if the accuracies
range is large, then the population is more diverse.
During typical GP model building, the accuracy stops improving at some point.
As the accuracy improves, the diversity tends to decline before the accuracy has
stopped improving. This is because as the accuracy is increasing, the more fit
member of populations are generating offspring that are similar to them. Eventually
the population becomes less diverse because all the members of the population are
descended from similar fit individuals. So even though the population as a whole
becomes more accurate, it becomes less diverse. In Data Sciences applications,
diversity is a potential indicator of when to stop a run early, which could enable
GP to have faster iteration time. In addition, having diverse sets of models is useful
for creating ensembles of models. After the GP run loses diversity, the models are
likely to have similar errors, or residuals, which when combined with each other in
an ensemble is likely to be less beneficial. Ensemble methods benefit when solutions
are combined that have different errors. Our initial experiments showed promise to
use diversity as an early stopping criterion as well as a way to find better ensembles.
4 Lessons Learned
In the previous section, we described how two methods were used to solve the
operations optimization problem for a real-world Data Science client. We also
discussed several advantages, disadvantages and considerations of each method.
Several lessons were extracted from our experiences that could be useful for future
Data Science engagements as well as for future development of GP as a Data
Science tool.
- Data Science needs data management, and GP needs better linkage to the ‘data
environment’. Without a strong linkage to data management, GP must rely on
additional tools to prepare data, intermediate data files, extra scripts to manage
those intermediate files, which all create an additional burden on the Data
Scientist and possibly the application developer who will turn the model into
a prototype. GP systems like DEAP and Data Modeler are examples where a
linkage to the data environment is strong. - The development of more competent GP systems needs to continue to be
pursued, making the out-of-box performance of GP better. System enhancements
during model development by a Data Scientist should not occur within core
libraries. Elements like ensembles should be made as default. - Simplification of understanding and visualization of solutions can differentiate
GP in the Data Science category. While some tools are better than others, the
community should embrace this feature with new and insightful ways to optimize
and visualize solutions—the effort should be very low to query a population of
solutions for “how” they are solving the problem. Current approaches of listing