Genetic Programming Theory and Practice XIII


124 S. Gustafson et al.


Fig. 2 Schematic illustrating the input Time Series Data, with one sensor that goes offline
periodically


data from sensor s_N (when it was available). An operation has many moving parts,
and system dynamics that change over time will cause the sensors to drift from their
relationships to each other and to various business objectives. In our case, we had data
for 1 year, with measurements taken consistently every 10 min for all sensors and a binary
value indicating whether sensor s_N was offline or online. A conceptual schematic of
the input data is shown in Fig. 2.
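
As a rough illustration of this data layout, the sketch below builds a 10-minute time grid and recovers the contiguous periods during which the target sensor is online from the binary flag. The names (`online_intervals`, `SAMPLES_PER_YEAR`), the start date, and the toy flag pattern are assumptions made for illustration; none of them are taken from the study.

```python
from datetime import datetime, timedelta

STEP = timedelta(minutes=10)          # sampling interval stated in the text
SAMPLES_PER_YEAR = 365 * 24 * 6       # samples in one non-leap year at 10-min spacing

def online_intervals(flags):
    """Return (start, end) index pairs, end exclusive, for each run of 1s."""
    intervals = []
    start = None
    for i, flag in enumerate(flags):
        if flag and start is None:
            start = i                      # an online run begins here
        elif not flag and start is not None:
            intervals.append((start, i))   # the run ended at index i - 1
            start = None
    if start is not None:                  # the series ends while still online
        intervals.append((start, len(flags)))
    return intervals

# Toy example: a hypothetical flag pattern, not real data.
t0 = datetime(2014, 1, 1)                  # arbitrary start date (assumption)
flags = [0, 1, 1, 1, 0, 0, 1, 1]
timestamps = [t0 + i * STEP for i in range(len(flags))]
for start, end in online_intervals(flags):
    print(timestamps[start], "to", timestamps[end - 1])
```

Each recovered interval marks a stretch where the target value is known, so those stretches are the only places where a model can be fitted or scored.
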
As part of the Data Science study, two different but related methods were used.
The first method, Gradient Boosted Regression (GBR), has been commonly and
successfully employed in multiple Data Science competitions and is available in
several open source packages. The other method is GP for Symbolic Regression, in
particular a system developed by MIT that Arnaldo et al. (2014) refer to
as a competent GP because it contains many state-of-the-art features. Since this
was an actual Data Science engagement, with real data and a client waiting for the
results, and not a simulated experiment for publication, we could only attempt two
different methods given our deadlines and commitments. To measure and compare
the performance of the two methods, we used the Root Mean Squared Error (RMSE)
over the period of extrapolation as a measure of accuracy.
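
For reference, RMSE over an extrapolation period can be computed as in the minimal sketch below. The function name `rmse`, the cutoff index, and the toy numbers are illustrative assumptions, not the study's data or code; the only point is that the error is measured on samples that come strictly after the training period.

```python
import math

def rmse(actual, predicted):
    """Root Mean Squared Error between two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("sequences must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical target-sensor readings and a model's predictions.
actual    = [10.0, 10.5, 11.0, 11.5, 12.0, 12.5]
predicted = [10.0, 10.4, 11.1, 11.0, 12.5, 13.5]

cutoff = 3  # samples before this index are treated as training data (assumption)
extrapolation_rmse = rmse(actual[cutoff:], predicted[cutoff:])
print(f"RMSE over the extrapolation period: {extrapolation_rmse:.3f}")
```

Because RMSE squares each error before averaging, a few large misses late in the extrapolation period weigh more heavily than many small ones.
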
As mentioned before, sensor s_N is available (online) only during certain time
intervals, and those periods in the historical data constitute our training and testing
data. On this dataset, with the power of either of the two techniques being compared,
getting a very low RMSE was easy if we resorted to interpolation. In other words,
if we used input data that sandwiched the time period of interest (from both the
past and the future) to predict the value of sensor s_N, we obtained good predictions.
This is because very strong temporal relationships exist within the
data. Therefore, creating a model using training data that spans the time period of
