Genetic_Programming_Theory_and_Practice_XIII


120 S. Gustafson et al.


Table 1 Data from various Google queries suggesting low popularity and adoption by Data
Scientists in Data Science competitions on Kaggle.com, as a topic of research in Universities,
on new Data Science courses, within the Government, and also within job market on LinkedIn

Approach             Kaggle forums  site:edu  site:edu syllabus  site:gov  site:linkedin.com
Logistic regression             76      3620                551        50             20,800
Neural network                  59      3200                 46        34               4710
Random forest                   79       773                  9        19               2920
Genetic programming              1        92                 17        13                497

The queries were constructed as (Data Science + logistic regression + modifier), where
modifier would be site.edu for example
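
As an illustration of the query pattern described in the table note, the following is a minimal sketch of how such query strings could be assembled. The function name `build_query`, the use of plain space-joined terms, and the exact modifier strings are assumptions made here for illustration; the source does not give the precise query syntax or quoting that was used.

```python
# Hypothetical helper: the source only states the pattern
# (Data Science + approach + modifier), not the exact syntax.
def build_query(approach, modifier=""):
    """Join the fixed topic, the approach name, and an optional modifier."""
    parts = ["Data Science", approach, modifier]
    return " ".join(p for p in parts if p)

# Approaches are the row labels of Table 1.
APPROACHES = [
    "logistic regression",
    "neural network",
    "random forest",
    "genetic programming",
]

# Modifiers are assumed from the column headers of Table 1.
MODIFIERS = ["site:edu", "site:edu syllabus", "site:gov", "site:linkedin.com"]

# One query string per (approach, modifier) cell of the table.
queries = {(a, m): build_query(a, m) for a in APPROACHES for m in MODIFIERS}
```

Each resulting string would then be submitted to a search engine by hand or by script, and the reported hit count recorded in the corresponding cell.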

GP is at least 25 years old. From its earliest days, learning models to fit
data has been a focus, usually referred to as Symbolic Regression. For the past
6 years, the first author of this paper has run a workshop at the annual Genetic
and Evolutionary Computation Conference on the topic of symbolic regression
research and industry tools: the Symbolic Regression and Modeling Workshop. The
workshop produced several interesting papers and talks, led to new lines of research,
and enabled new software tools to be highlighted to the community. Symbolic
regression, the identification of a model, its variables, and their relationship (both
linear and nonlinear), is at the heart of Data Science.
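
To make the idea of symbolic regression concrete, the following is a minimal sketch that searches over small expression trees for a formula fitting a set of (x, y) points. It is a deliberately simplified, mutation-only evolutionary loop with truncation selection, not a full GP system (there is no crossover), and every name in it (`OPS`, `random_expr`, `mutate`, `evolve`, the parameter defaults) is an assumption chosen for illustration rather than anything taken from the chapter.

```python
import operator
import random

# Hypothetical primitive set; real GP systems usually offer many more functions.
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def random_expr(rng, depth):
    """Build a random expression tree over the variable x and small integers."""
    if depth == 0 or rng.random() < 0.3:
        return "x" if rng.random() < 0.5 else rng.randint(-2, 2)
    op = rng.choice(sorted(OPS))
    return (op, random_expr(rng, depth - 1), random_expr(rng, depth - 1))

def evaluate(expr, x):
    """Evaluate a tree: 'x' is the variable, ints are constants, tuples are operations."""
    if expr == "x":
        return x
    if isinstance(expr, int):
        return expr
    op, left, right = expr
    return OPS[op](evaluate(left, x), evaluate(right, x))

def error(expr, data):
    """Sum of squared errors of the expression over the data points."""
    return sum((evaluate(expr, x) - y) ** 2 for x, y in data)

def mutate(rng, expr, depth=2):
    """Replace one randomly chosen subtree with a fresh random subtree."""
    if not isinstance(expr, tuple) or rng.random() < 0.3:
        return random_expr(rng, depth)
    op, left, right = expr
    if rng.random() < 0.5:
        return (op, mutate(rng, left, depth), right)
    return (op, left, mutate(rng, right, depth))

def evolve(data, generations=200, pop_size=50, seed=0):
    """Keep the better half each generation and refill with mutated copies."""
    rng = random.Random(seed)
    population = [random_expr(rng, 3) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=lambda e: error(e, data))
        survivors = population[: pop_size // 2]
        children = [mutate(rng, rng.choice(survivors))
                    for _ in range(pop_size - len(survivors))]
        population = survivors + children
    return min(population, key=lambda e: error(e, data))

if __name__ == "__main__":
    # Toy data sampled from y = x*x + 1; the search does not know this form.
    points = [(x, x * x + 1) for x in range(-3, 4)]
    best = evolve(points)
    print(best, error(best, points))
```

The point of the sketch is that the search returns a readable expression, so both the model structure and the variables it uses are discovered together rather than fixed in advance. Because the better half of the population is always kept, the best error never gets worse from one generation to the next, although finding an exact fit is not guaranteed.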
While GP has been shown to successfully solve problems in countless papers, it
is still seen as an outsider in the mainstream machine learning and artificial
intelligence community. As such, GP is still not widely available in many commercial
analytical tools, and Data Scientists have often not received any relevant training
in how to use the method. For Table 1, we performed several online searches to
get a rough idea of the popularity of GP compared with other popular Data
Science tools, namely logistic regression, neural networks, and random forests. One
search was performed on the Kaggle.com user forums, a popular place where
thousands of Data Scientists competing in Data Science challenges on the Kaggle.com
site share and discuss their approaches and methods. The forums offer
both a fascinating and educational look into how Data Scientists work. Table 1
shows that GP lags behind all other methods in all but one case. While one shouldn’t
place too much significance on the actual numbers in Table 1, they do support our
belief that GP is not being considered as a Data Science tool.


2.1 Attributes of GP for Data Science


We now look at several requirements of Data Science and how GP can meet them
as a Data Science tool.



  1. Data Science leverages Big Data for data-driven decisions and outcomes. GP is
    a distributed search algorithm, and countless studies have looked at better ways
