Genetic_Programming_Theory_and_Practice_XIII


Using GP for Data Science


prediction (and withhold the test points from within that time period), can accurately
interpolate the test data. However, this would not be a true test, since in an actual
deployment future values would never be available; we would have only
historical data for model building. To address this, and to assess the
capabilities of the two methods fairly, we restricted both methods to training
data that preceded each prediction in time. Our test
measure of RMSE is therefore a measure of extrapolation (forward-looking)
rather than interpolation. How do we know whether the RMSE that we obtained is really
good, or just good enough? As a guide, we can use the raw data variability
of sensorsN. Ideally, the RMSE and the raw data variability should be of
the same order of magnitude. In our study, over six different test periods, the
average standard deviation of the actual sensorsN value (a percentage) was around
3.5. The GP system achieved an average RMSE of 6.0 over the same six periods,
and the GBR system achieved 5.5. Thus, both approaches achieved reasonable
extrapolation capabilities.
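
The forward-looking evaluation described above can be illustrated with a minimal sketch. It is not the procedure used in the study: the toy univariate series, the function names, and the "last value" stand-in model are all assumptions made for illustration, and reading "same magnitude" as "within a factor of ten" is likewise an assumption. The point is only that each prediction sees strictly earlier data, and that the resulting RMSE is compared with the spread of the raw values.

```python
import math
import statistics


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(
        sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    )


def walk_forward_rmse(series, fit, test_start, test_end):
    """Predict each point in [test_start, test_end) from earlier points only."""
    predictions = []
    for t in range(test_start, test_end):
        history = series[:t]        # strictly earlier observations, no future values
        model = fit(history)        # refit on the history available at time t
        predictions.append(model(t))
    return rmse(series[test_start:test_end], predictions)


def fit_last_value(history):
    """Hypothetical stand-in for GP or GBR: repeat the most recent observation."""
    last = history[-1]
    return lambda t: last


def same_order_of_magnitude(error, spread):
    """Assumed reading of 'same magnitude': the ratio is within a factor of ten."""
    return abs(math.log10(error / spread)) < 1.0


# Invented toy data, not values from the study.
series = [50.0, 52.0, 51.0, 55.0, 53.0, 58.0, 56.0, 60.0]
error = walk_forward_rmse(series, fit_last_value, test_start=4, test_end=8)
spread = statistics.pstdev(series[4:8])
print(error, spread, same_order_of_magnitude(error, spread))
```

Applied to the figures reported above, `same_order_of_magnitude(6.0, 3.5)` and `same_order_of_magnitude(5.5, 3.5)` both hold under this reading.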


3.2 Data Management


One aspect that is related to Data Science but often not mentioned in the GP
literature is the common task of managing data. In academic settings, artificial
intelligence and machine learning research is often carried out using pre-cleaned or
benchmark data sets. However, in industry, the challenge of gathering, organizing,
and preparing data is significant. Particularly when several people collaborate
using different approaches, poor data management can lead to
significant issues and, in some cases, call the validity of results into question. In our
work, we used an approach that is gaining interest in industry: ontology-based data
access. We created a model of the domain, linked our data to that domain model, and
then employed a suitable query system to shape and access the data. Given that our
data corresponds to sensor readings, we created models of the different types
of things that have sensors, their properties, and the relationships between those
things. For example, an electric submersible pump may be part of an oil well, or a
high pressure turbine blade may be part of a gas turbine, and a pump may have
some rotation frequency. The sensor readings typically correspond to properties
such as rotation frequency.
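
The three steps (model the domain, link data to the model, query through the model) can be sketched in plain Python, independent of any particular ontology language or query system. Everything here is hypothetical: the class and relation names, the sensor tag `tag_0042`, the readings, and the helper functions are invented for illustration and do not come from the system described in this chapter.

```python
# Hypothetical domain model as (subject, predicate, object) triples.
TRIPLES = {
    ("ElectricSubmersiblePump", "subClassOf", "Pump"),
    ("pump_1", "type", "ElectricSubmersiblePump"),
    ("well_1", "type", "OilWell"),
    ("pump_1", "partOf", "well_1"),
    ("pump_1", "hasProperty", "pump_1_rotation_frequency"),
    ("pump_1_rotation_frequency", "type", "RotationFrequency"),
}

# Invented raw readings as (time step, value), keyed by sensor tag.
READINGS = {"tag_0042": [(0, 58.1), (1, 58.4), (2, 57.9)]}

# Linking step: each sensor tag is tied to one property in the domain model.
SENSOR_LINKS = {"tag_0042": "pump_1_rotation_frequency"}


def objects(subject, predicate):
    """All objects o such that (subject, predicate, o) is in the model."""
    return {o for s, p, o in TRIPLES if s == subject and p == predicate}


def subjects(predicate, obj):
    """All subjects s such that (s, predicate, obj) is in the model."""
    return {s for s, p, o in TRIPLES if p == predicate and o == obj}


def readings_for(container, property_type):
    """Readings of one property type for every direct part of a container."""
    result = {}
    for part in subjects("partOf", container):
        for prop in objects(part, "hasProperty"):
            if property_type not in objects(prop, "type"):
                continue
            for tag, linked_prop in SENSOR_LINKS.items():
                if linked_prop == prop:
                    result[(part, tag)] = READINGS[tag]
    return result


# Ask in domain terms instead of by raw sensor tag.
rotation = readings_for("well_1", "RotationFrequency")
print(rotation)
```

The design point is that a collaborator asks for "rotation frequency of the parts of this well" without needing to know which tag holds the data, so every approach draws on the same, consistently selected readings.
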
We used the Web Ontology Language (OWL), which is a W3C standard for
representing ontologies, to capture the domain model. Figure 3 (generated using
the OntoGraf plugin in Protégé^1) shows a sample ontology describing a part of our
system.


(^1) http://protege.stanford.edu/
