Genetic_Programming_Theory_and_Practice_XIII


Using GP for Data Science


became Linked Data, and as the growing Internet began to connect devices, the trend
of the Internet of Things (IoT) became popular. Inside industry, the IoT is being
shaped as the Industrial Internet or The Internet of Everything, and also the Web of
Things, among others (see Gustafson and Sheth (2014) for a brief introduction).
This trend of massively connecting devices with data and analytic systems not only
produces more data, for example from personal devices like iPhones and smart
watches, but also makes more sensors and actuators available for artificial
intelligence technologies to sit between. The third trend, Data Science, which is in
many ways a direct result of the first two, is the increasing demand to leverage data
to direct outcomes, whether in business, health, or global and economic policy.
Data Science originally described the study of turning data into insights. More
recently, however, the focus of Data Science has become the search for, and training
of, the skills required to practice it effectively. Today, Data Science has
emerged as a popular topic for students, a workforce reskilling opportunity, a major
focus of national funding agencies, a focus of both private investors and startup
companies, and an outcomes-delivery mechanism on top of the Big Data initiatives
begun years earlier. Proponents of the IoT predict that billions of machines will be
connected in the near future, that all those machines will produce massive amounts
of data, and that the algorithms and insights gleaned from that data will enable
optimization, new businesses, and understanding that shapes policy and society.
As members of the General Electric Global Research center, we have participated
in many Data Science-related activities for finance, healthcare, aviation, oil and
gas, power and water, and media. Several activities are consistently shared across
the Data Science tasks in these industries: accessing data, learning domain
knowledge, and building descriptive and predictive models. Genetic Programming
presents a compelling approach because it can learn nonlinear models, makes it
relatively easy to insert domain knowledge, naturally produces a range of
possible solutions, and finds solutions that can be further optimized, inspected, and
simplified. The latter characteristics, in particular, are interesting to a Data
Scientist because they mean the solution can be communicated to customers and
engineers, who can then “understand” how the data is being used. This is in contrast
to a forest of decision trees or a neural network, for example.
Data Science as a discipline is usually described as the combination of several
skills: computer science skills are needed to work efficiently with data,
statistical and math skills allow one to find complex patterns within data, a physical
science background helps one to understand how to find and ask meaningful
questions, and creativity is required to elegantly display and communicate results.
Of course, very few people are highly skilled in each area, and thus Data Science
is often practiced by teams of people working together. Data Scientists are usually
embedded within industries and are measured by their efficiency in working with data
and finding patterns, and by their ability to find and answer the big, high-value
questions. GP, in particular the Symbolic Regression branch of GP that deals with
learning regression models, has particular relevance for the Data Science community.
At least one GP software package, Eureqa (Schmidt and Lipson 2009; Dubcakova 2011),
explicitly targets the Data Science space using symbolic regression.
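
The properties attributed to symbolic regression above (a ranked range of candidate
models, each readable as an expression) can be illustrated with a minimal sketch.
Everything in it is an assumption made for illustration: the primitive set, the
tuple-based tree encoding, the function names, and the simplified search loop, which
uses only truncation selection and subtree mutation (no crossover). It is not the
method used by Eureqa or by the authors.

```python
import math
import operator
import random

# Hypothetical primitive set chosen for illustration only.
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}
TERMINALS = ["x", 1.0, 2.0]


def random_tree(rng, depth):
    """Grow a random expression tree as nested tuples (op, left, right)."""
    if depth == 0 or rng.random() < 0.3:
        return rng.choice(TERMINALS)
    op = rng.choice(list(OPS))
    return (op, random_tree(rng, depth - 1), random_tree(rng, depth - 1))


def evaluate(tree, x):
    """Evaluate an expression tree at the point x."""
    if isinstance(tree, tuple):
        op, left, right = tree
        return OPS[op](evaluate(left, x), evaluate(right, x))
    return x if tree == "x" else tree


def to_string(tree):
    """Render a tree as a human-readable infix expression."""
    if isinstance(tree, tuple):
        op, left, right = tree
        return f"({to_string(left)} {op} {to_string(right)})"
    return str(tree)


def error(tree, data):
    """Mean squared error over (x, y) pairs; non-finite results score as inf."""
    total = 0.0
    for x, y in data:
        diff = evaluate(tree, x) - y
        total += diff * diff
    mse = total / len(data)
    return mse if math.isfinite(mse) else float("inf")


def mutate(rng, tree, depth=2):
    """Replace one randomly chosen subtree with a freshly grown subtree."""
    if not isinstance(tree, tuple) or rng.random() < 0.3:
        return random_tree(rng, depth)
    op, left, right = tree
    if rng.random() < 0.5:
        return (op, mutate(rng, left, depth), right)
    return (op, left, mutate(rng, right, depth))


def evolve(data, generations=30, pop_size=100, seed=0):
    """Return (population ranked by error, best error recorded per generation)."""
    rng = random.Random(seed)
    population = [random_tree(rng, 3) for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        population.sort(key=lambda t: error(t, data))
        history.append(error(population[0], data))
        survivors = population[: pop_size // 4]  # the best quarter is kept unchanged
        children = [mutate(rng, rng.choice(survivors))
                    for _ in range(pop_size - len(survivors))]
        population = survivors + children
    population.sort(key=lambda t: error(t, data))
    history.append(error(population[0], data))
    return population, history
```

Calling `evolve` on a list of `(x, y)` samples returns the whole ranked population
rather than a single winner, so several candidate expressions can be printed with
`to_string` and compared by a person. Because the best quarter of each generation is
carried over unchanged, the best error in `history` never increases from one
generation to the next.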
