Genetic_Programming_Theory_and_Practice_XIII

Using GP for Data Science 121


to distribute the search process. Recent work has shown how GP can directly
leverage Big Data effectively (Arnaldo et al. 2014; Fazenda et al. 2012).


  1. Data Science extracts knowledge about a particular problem using data. GP
    produces easy-to-inspect solutions, which makes it a particularly valuable method
    for Data Scientists.

  2. Data Science leverages any existing knowledge to get to an answer. GP can
    encode knowledge into the algorithm directly, through functions, terminals,
    and objective functions, or indirectly, through selection pressure or
    operators.

  3. Data Science tools are used by many people not necessarily trained in machine
    learning. While there are some new tools that have intuitive interfaces (Schmidt
    and Lipson 2009; Wagner and Kronberger 2011; De Rainville et al. 2012; Smits
    et al. 2010), the core GP system is still quite complex with many parameters.

  4. Data Science requires Data Scientists to iterate quickly on building new models,
    get feedback, and build more models. GP can often take a significant amount of
    time to set up, tune parameters, and search for good models.

  5. Data Science tools integrate with other tools, particularly data management and
    visualization tools. Data Modeller and DEAP leverage the built-in capability
    of Mathematica and Python, whereas tools like FlexGP (Veeramachaneni et al.
    2015) and Eureqa are standalone solutions that must provide their own
    implementations or leverage external tools.

  6. Data Science tools need to perform relatively out-of-the-box. Approaches like
    Random Forest and Gradient Boosted Regression have become popular as
    they are robust with their default settings. In general, GP requires a lot of
    customization to make it perform well. Recent work looks to improve the
    basic performance capability of GP by combining it with other machine learning
    techniques (Icke and Bongard 2013). O’Neill et al. (2010) highlight several
    open issues to address in order to further improve GP.

  7. Data Science tools produce models that need to be implemented quickly for
    client-facing prototypes and demos. GP still exists in many standalone
    environments, or requires a fair amount of tweaking of the source packages.
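
To make item 2 concrete, the following is a minimal sketch in plain Python of how prior knowledge might enter a GP-style search through the function set, the terminal set, and the objective function. It assumes a toy symbolic regression setting. All names here (FUNCTIONS, TERMINALS, objective, random_search) are hypothetical and do not come from any of the GP tools cited above, and plain random sampling stands in for a full evolutionary loop with selection, crossover, and mutation.

```python
import math
import operator
import random

# Function set: (name, callable, arity). Including "sin" encodes a hypothetical
# prior belief that the target behaviour is periodic.
FUNCTIONS = [
    ("add", operator.add, 2),
    ("mul", operator.mul, 2),
    ("sin", math.sin, 1),
]

# Terminal set: the input variable plus constants assumed to be plausible.
TERMINALS = ["x", 1.0, 2.0]


def random_tree(rng, depth):
    """Grow a random expression tree represented as nested tuples."""
    if depth == 0 or rng.random() < 0.3:
        return rng.choice(TERMINALS)
    name, _, arity = rng.choice(FUNCTIONS)
    return (name,) + tuple(random_tree(rng, depth - 1) for _ in range(arity))


def evaluate(tree, x):
    """Evaluate an expression tree at the point x."""
    if tree == "x":
        return x
    if isinstance(tree, float):
        return tree
    name, *args = tree
    func = next(f for n, f, _ in FUNCTIONS if n == name)
    return func(*(evaluate(a, x) for a in args))


def size(tree):
    """Count the nodes (functions and terminals) in a tree."""
    if not isinstance(tree, tuple):
        return 1
    return 1 + sum(size(a) for a in tree[1:])


def objective(tree, data, parsimony=0.01):
    """Mean squared error plus a size penalty (lower is better).

    The penalty encodes a preference for small, easy-to-inspect models.
    """
    mse = sum((evaluate(tree, x) - y) ** 2 for x, y in data) / len(data)
    return mse + parsimony * size(tree)


def random_search(data, n_candidates=200, max_depth=3, seed=0):
    """Return the best of n_candidates random trees (not a full GP loop)."""
    rng = random.Random(seed)
    candidates = [random_tree(rng, max_depth) for _ in range(n_candidates)]
    return min(candidates, key=lambda t: objective(t, data))
```

Calling random_search on a list of (x, y) pairs returns the sampled tree with the lowest penalized error. Changing FUNCTIONS, TERMINALS, or the parsimony weight changes which models the search can reach and which it prefers, which is the direct route for encoding knowledge that item 2 describes.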


2.2 Summary of Attributes of Data Science and GP


In Table 2, we summarize the attributes from the previous section and identify which
GP capabilities are potential areas of concern in Data Science. Of the attributes of
GP for Data Science, the inability to iterate and create new models quickly was
the biggest concern for us. Some of the other issues, such as integration with big data
infrastructure or simple-to-use interfaces, have seen progress with new commercial
tools.

In Data Science, building the first model is usually a very informative and
valuable task: it forces assumptions and data issues out into the open and
demonstrates a viable end-to-end pipeline from data to insight. There are
two main challenges in building the first model. Firstly, the integration of the tool
