to distribute the search process. Recent work has shown how GP can directly
leverage Big Data effectively (Arnaldo et al. 2014; Fazenda et al. 2012).
- Data Science extracts knowledge about a particular problem using data. GP
produces easy-to-inspect solutions, which makes it a particularly valuable method
for Data Scientists.
- Data Science leverages any existing knowledge to get to an answer. GP has
a direct way of encoding knowledge into the algorithm, through functions,
terminals, and objective functions, or indirectly through selection pressure or
operators.
- Data Science tools are used by many people not necessarily trained in machine
learning. While there are some new tools that have intuitive interfaces (Schmidt
and Lipson 2009; Wagner and Kronberger 2011; De Rainville et al. 2012; Smits
et al. 2010), the core GP system is still quite complex with many parameters.
- Data Science requires Data Scientists to iterate quickly on building new models,
get feedback, and build more models. Setting up GP, tuning its parameters, and
searching for good models can often take a significant amount of time.
- Data Science tools integrate with other tools, particularly data management and
visualization tools. Data Modeller and DEAP leverage the built-in capabilities
of Mathematica and Python, respectively, whereas tools like FlexGP
(Veeramachaneni et al. 2015) and Eureqa are standalone solutions that must
provide their own implementations or leverage external tools.
- Data Science tools need to perform relatively well out of the box. Approaches like
Random Forest and Gradient Boosted Regression have become popular as
they are robust with their default settings. In general, GP requires a lot of
customization to make it perform well. Recent work looks to improve the
basic performance capability of GP by combining it with other machine learning
techniques (Icke and Bongard 2013). In O’Neill et al. (2010), several open issues
are highlighted to further improve GP.
- Data Science tools produce models that need to be implemented quickly for
client-facing prototypes and demos. GP still exists in many standalone
environments, or requires a fair amount of tweaking of the source packages.
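To make the point about encoding knowledge concrete, the following is a minimal sketch in plain Python of the three direct entry points named above: the function set, the terminal set, and the objective function. It assumes a toy representation in which expression trees are nested tuples. All names (FUNCTIONS, TERMINALS, random_tree, evaluate, size, objective) and the parsimony weight are hypothetical and chosen for illustration; they are not taken from DEAP, FlexGP, or any other tool cited here, and the sketch omits selection, crossover, and mutation entirely.

```python
import operator
import random

# Function set: the operators a candidate model may use, with their arity.
# Restricting or extending this set is one way to inject domain knowledge.
FUNCTIONS = {"add": (operator.add, 2), "mul": (operator.mul, 2)}

# Terminal set: the input variable and constants assumed to be relevant.
TERMINALS = ["x", 1.0, 2.0]


def random_tree(rng, depth):
    """Grow a random expression tree as nested tuples, e.g. ("add", "x", 1.0)."""
    if depth == 0 or rng.random() < 0.3:
        return rng.choice(TERMINALS)
    name = rng.choice(sorted(FUNCTIONS))
    arity = FUNCTIONS[name][1]
    return (name,) + tuple(random_tree(rng, depth - 1) for _ in range(arity))


def evaluate(tree, x):
    """Evaluate a tree for one value of the input variable x."""
    if isinstance(tree, tuple):
        func, _ = FUNCTIONS[tree[0]]
        return func(*(evaluate(arg, x) for arg in tree[1:]))
    if tree == "x":
        return x
    return tree  # numeric constant


def size(tree):
    """Count the nodes in a tree (functions plus terminals)."""
    if isinstance(tree, tuple):
        return 1 + sum(size(arg) for arg in tree[1:])
    return 1


def objective(tree, data, parsimony=0.01):
    """Mean squared error plus a small size penalty; lower is better.

    The penalty encodes a preference for compact, easy-to-inspect models.
    """
    mse = sum((evaluate(tree, x) - y) ** 2 for x, y in data) / len(data)
    return mse + parsimony * size(tree)


# Toy data generated from y = x*x + 1, and the tree that represents it.
data = [(x, x * x + 1.0) for x in range(-3, 4)]
target = ("add", ("mul", "x", "x"), 1.0)

# Random sampling only, not a GP run: score a batch of random trees.
rng = random.Random(0)
candidates = [random_tree(rng, 3) for _ in range(200)]
best = min(candidates, key=lambda t: objective(t, data))
```

In this sketch, changing FUNCTIONS or TERMINALS changes which models can be expressed at all, while changing objective changes which of them are preferred. The indirect routes mentioned above, selection pressure and operators, would act on top of these choices in a full GP system.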
2.2 Summary of Attributes of Data Science and GP
In Table 2, we summarize the attributes from the previous section and identify which
GP capabilities are potential areas of concern in Data Science. Of the attributes of
GP for Data Science, the inability to iterate and create new models quickly was
the biggest concern for us. Some of the other issues, such as integration with big
data infrastructure or simple-to-use interfaces, have seen progress with new
commercial tools.
In Data Science, building the first model is usually a very informative and
valuable task: it forces assumptions and data issues out into the open and
demonstrates a viable end-to-end pipeline from data to insight. There are
two main challenges in building the first model. Firstly, the integration of the tool