Genetic_Programming_Theory_and_Practice_XIII

120 S. Gustafson et al.

Table 1 Data from various Google queries suggesting low popularity and adoption by Data Scientists in Data Science competitions on Kaggle.com, as a topic of research in universities, in new Data Science courses, within the government, and also within the job market on LinkedIn

Approach             Kaggle forums  site:edu  site:edu syllabus  site:gov  site:linkedin.com
Logistic regression             76      3620                551        50             20,800
Neural network                  59      3200                 46        34               4710
Random forest                   79       773                  9        19               2920
Genetic programming              1        92                 17        13                497

The queries were constructed as (Data Science + logistic regression + modifier), where the modifier would be site:edu, for example

GP is at least 25 years old. From its earliest days, learning models to fit
data has been a focus, usually referred to as Symbolic Regression. For the past
6 years, the first author of this paper has run a workshop at the annual Genetic
and Evolutionary Computation Conference on symbolic regression research and
industry tools: the Symbolic Regression and Modeling Workshop. The workshop
produced several interesting papers and talks, led to new lines of research,
and enabled new software tools to be highlighted to the community. Symbolic
regression, the identification of a model, its variables, and their relationships
(both linear and nonlinear), is at the heart of Data Science.
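
To make the idea of symbolic regression concrete, the following Python fragment is a minimal sketch, not the method used by any particular GP system. It searches over small expression trees for a formula that fits sampled data. The function set, the terminals, and all names (random_tree, evaluate, mse, mutate, evolve) are assumptions chosen for illustration, and the search is deliberately simplified to mutation with truncation selection rather than full GP with crossover.

```python
import math
import operator
import random

# Illustrative function set: name -> (callable, arity). An assumption, not a standard.
FUNCTIONS = {"add": (operator.add, 2), "sub": (operator.sub, 2), "mul": (operator.mul, 2)}
# Illustrative terminals: the single input variable and two constants.
TERMINALS = ["x", 1.0, 2.0]


def random_tree(rng, depth):
    """Grow a random expression tree as nested tuples, e.g. ("add", "x", 1.0)."""
    if depth == 0 or rng.random() < 0.3:
        return rng.choice(TERMINALS)
    name = rng.choice(sorted(FUNCTIONS))
    arity = FUNCTIONS[name][1]
    return (name,) + tuple(random_tree(rng, depth - 1) for _ in range(arity))


def evaluate(tree, x):
    """Evaluate a tree for one input value x."""
    if isinstance(tree, tuple):
        func, _ = FUNCTIONS[tree[0]]
        return func(*(evaluate(child, x) for child in tree[1:]))
    if tree == "x":
        return x
    return tree  # numeric constant


def mse(tree, data):
    """Mean squared error over (x, y) pairs; non-finite errors count as infinitely bad."""
    total = 0.0
    for x, y in data:
        diff = evaluate(tree, x) - y
        total += diff * diff
    return total / len(data) if math.isfinite(total) else math.inf


def mutate(rng, tree, depth=2):
    """Replace one randomly chosen subtree with a fresh random subtree."""
    if not isinstance(tree, tuple) or rng.random() < 0.3:
        return random_tree(rng, depth)
    i = rng.randrange(1, len(tree))
    return tree[:i] + (mutate(rng, tree[i], depth),) + tree[i + 1:]


def evolve(data, pop_size=200, generations=30, seed=0):
    """Keep the best quarter each generation and refill with mutated copies."""
    rng = random.Random(seed)
    population = [random_tree(rng, 3) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=lambda t: mse(t, data))
        survivors = population[: pop_size // 4]
        children = [mutate(rng, rng.choice(survivors))
                    for _ in range(pop_size - len(survivors))]
        population = survivors + children
    return min(population, key=lambda t: mse(t, data))


if __name__ == "__main__":
    # Hypothetical target y = x*x + 1, sampled on a small grid.
    samples = [(x / 2.0, (x / 2.0) ** 2 + 1.0) for x in range(-4, 5)]
    best = evolve(samples)
    print(best, mse(best, samples))
```

The returned tree is itself the model: its structure, the variables it uses, and the operators that relate them are all found by the search, which is what distinguishes symbolic regression from fitting the coefficients of a fixed model form.
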
While GP has been shown to successfully solve problems in countless papers, it
is still seen as an outsider in the mainstream machine learning and artificial
intelligence community. As a result, GP is still not available in many commercial
analytical tools, and Data Scientists have often received no training in how to
use the method. To get a rough idea of the popularity of GP compared with other
popular Data Science tools, namely logistic regression, neural networks, and
random forests, we performed several online searches, summarized in Table 1. One
search was performed on the Kaggle.com user forums, a popular place for the
thousands of Data Scientists competing in Data Science challenges on Kaggle.com
to share and discuss their approaches and methods. The forums offer a
fascinating and educational look into how Data Scientists work. Table 1
shows that GP lags behind all other methods in all but one case. While one shouldn’t
place too much significance on the actual numbers in Table 1, they do support our
belief that GP is not being considered as a Data Science tool.
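
The statement that GP lags behind all other methods in all but one case can be checked directly against the counts in Table 1. The short Python sketch below does so; the column labels and the helper name columns_where_last are our own choices for illustration, and the counts are transcribed from the table.

```python
# Hit counts transcribed from Table 1; column labels are illustrative.
COLUMNS = ["Kaggle forums", "site:edu", "site:edu syllabus", "site:gov", "site:linkedin.com"]
HITS = {
    "Logistic regression": [76, 3620, 551, 50, 20800],
    "Neural network": [59, 3200, 46, 34, 4710],
    "Random forest": [79, 773, 9, 19, 2920],
    "Genetic programming": [1, 92, 17, 13, 497],
}


def columns_where_last(method, hits=HITS, columns=COLUMNS):
    """Return the columns in which `method` has strictly fewer hits than every other approach."""
    result = []
    for i, column in enumerate(columns):
        others = [counts[i] for name, counts in hits.items() if name != method]
        if hits[method][i] < min(others):
            result.append(column)
    return result


gp_last = columns_where_last("Genetic programming")
exceptions = [c for c in COLUMNS if c not in gp_last]
print("GP has the fewest hits in:", gp_last)
print("Exceptions:", exceptions)
```

GP has the fewest hits in four of the five columns. The single exception is the site:edu syllabus query, where random forest (9) returns fewer hits than GP (17).
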

2.1 Attributes of GP for Data Science

We now look at several requirements of Data Science and how GP can meet them
as a Data Science tool.

Data Science leverages Big Data for data-driven decisions and outcomes. GP is
a distributed search algorithm, and countless studies have looked at better ways
