Genetic_Programming_Theory_and_Practice_XIII


118 S. Gustafson et al.


Chemical and compared against the typical statistical approaches. They found that
GP had positive attributes, and where there were weaknesses, they recommended
solutions. That work led to several enhancements to GP and motivated a workshop
series led by this author for 6 years. Whereas that work stemmed from GP
applied to symbolic regression and modeling, today GP is on the verge
of breaking out of the traditional machine learning communities into much wider
adoption and impact as a potential Data Science tool.
In a recent O’Reilly report (Loukides 2010), the importance of data was stated
as: “The future belongs to the companies and people that turn data into products”.
According to Dhar (2013), Data Science is about extracting knowledge from data,
discovering new relationships between things in this world, their interactions,
outcomes and predictors. Data Science in practice is about speed and the ability
to answer meaningful questions effectively. It is about empowering analysts with
a new skillset to leverage Big Data and analytics to make effective computational
policy and business decisions. Data Scientists are often not computer scientists, and
hence often lack formal machine learning or artificial intelligence training. Data Scientists
often come from the physical sciences where domain knowledge is leveraged to turn
data into a meaningful product or outcome.
Genetic Programming is poised to become a significant enabler for Data Science.
But it isn’t today. In this article, we review a recent attempt to use GP for Data
Science and discuss the lessons learned. We identified a novel factor that is holding
GP back and that is not being addressed by existing work: the iteration speed at which
a Data Scientist can generate new results. In non-GP systems, iteration speed
is primarily determined by how fast someone can change the Python/R/MATLAB scripts
and re-run the code. But in GP, iterating might mean modifying the source
code, it might mean building extra scripts to work with the data, or in some cases it
might mean performing novel research into advanced topics like ensemble learning.
Whereas some tools like Data Modeler (Castillo et al. 2004) are currently well
positioned to become Data Science tools, a new tool like DEAP (De Rainville et al. 2012),
which is based on Python and open source, may gain wider adoption. But both tools
need more advanced capabilities in the core system to become capable GP systems
for Data Science. This chapter describes the case study, introduces both methods,
and reports the outcomes, lessons learned, and recommendations as to how GP could
become a more effective Data Science tool.


2 Background


There are several trends that are enabling artificial intelligence technologies to
increase their value and outcomes. The first trend has been well documented and
publicized: Big Data. As computer storage and compute cycles have become cheaper,
new industries have grown and changed around massive collections of data and the
algorithms that run on top of them. Companies like Google, Amazon and Netflix are
good examples of this trend. The second trend started out as Semantic Web and later
