Genetic_Programming_Theory_and_Practice_XIII


Using GP for Data Science


the data sets and preparation. In a recent edition of AI Magazine, several articles
highlight the value of semantic approaches to data management (van Harmelen
et al. 2015).


3.3 Gradient Boosted Regression


We needed another method to compare against, and for this we focused on a popular
Data Science technique called Gradient Boosted Regression (GBR). GBR can be
used to build predictive models for regression problems and was first proposed
by Friedman (2001). In brief, GBR uses a weighted ensemble of very simple
classification and regression trees (CART) to estimate a function. After creating
an initial model, GBR then uses gradient descent to find subsequent models that
minimize the residual errors of the previously selected model. It has since emerged
as a very powerful and efficient technique for regression as well as classification
problems.
GBR combines several shallow trees (called weak learners). Each tree by
itself is fairly weak in its ability to predict, but when combined they ‘grow’ into
a strong predictor. The trees are built iteratively, based on the residuals, or
model error: the difference between the predicted values and the actual values. In
GBR, the goal is to minimize that difference, or residual, as measured by a loss
function.
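The residual-fitting loop just described can be sketched in a few lines of plain Python; the stump learner, toy data, and parameter values below are illustrative assumptions (squared-error loss, one-dimensional inputs), not the chapter's actual implementation:

```python
# A toy gradient-boosting loop for squared-error loss, assuming
# one-dimensional inputs and depth-1 regression trees (stumps) as the
# weak learners. Each stage fits a stump to the current residuals and
# adds a shrunken copy of it to the ensemble.

def fit_stump(xs, residuals):
    """Find the threshold split that minimizes squared error on the residuals."""
    best = None
    for t in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        if not left or not right:
            continue
        lmean = sum(left) / len(left)
        rmean = sum(right) / len(right)
        err = (sum((r - lmean) ** 2 for r in left)
               + sum((r - rmean) ** 2 for r in right))
        if best is None or err < best[0]:
            best = (err, t, lmean, rmean)
    _, t, lmean, rmean = best
    return lambda x: lmean if x <= t else rmean

def gradient_boost(xs, ys, n_stages=50, learning_rate=0.1):
    """Build an additive ensemble, stage by stage, on the residuals."""
    base = sum(ys) / len(ys)            # initial imperfect model: the mean
    stumps, preds = [], [base] * len(ys)
    for _ in range(n_stages):
        # For squared-error loss, the residuals are the negative gradient.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for p, x in zip(preds, xs)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)

# Toy usage: recover a step function from six samples.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0]
model = gradient_boost(xs, ys)
```

With 50 stages and a learning rate of 0.1, each stage shrinks the remaining residual, so the ensemble's predictions converge toward 1.0 left of the split and 5.0 right of it.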
In gradient boosting, we start with an imperfect model, and stage by stage try to
improve it. Each improvement is informed by the residuals of the preceding imperfect
model. For the squared-error loss, it turns out that the residuals are the negative
gradients of the loss function (hence the name). In summary, in GBR we successively
reduce the loss function and thereby generate a better set of parameters. By using
a selection of freely available Python libraries, we were able to very easily import
the data and format it to feed it to the model for training, testing and validation.
Specifically, we used Pandas for data manipulation, and the SciPy, NumPy, and
Scikit-Learn Python modules for GBR itself (Jones et al. 2001; van der Walt
et al. 2011; Pedregosa et al. 2011).
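As a rough sketch of this workflow, the following uses Scikit-Learn's GradientBoostingRegressor on synthetic data; the data set, split sizes, and parameter values are illustrative stand-ins, not those of the original experiments:

```python
# Sketch of the GBR workflow: load data, split it, fit a boosted
# ensemble, and score it on held-out data. Synthetic data stands in
# for the chapter's real data set.
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression problem (illustrative only).
X, y = make_regression(n_samples=500, n_features=8, n_informative=5,
                       noise=5.0, random_state=0)

# Split into training and held-out test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# n_estimators is the number of weak learners, learning_rate shrinks
# each stage's contribution, and max_depth keeps the trees shallow.
model = GradientBoostingRegressor(
    n_estimators=200, learning_rate=0.1, max_depth=3, random_state=0)
model.fit(X_train, y_train)

mse = mean_squared_error(y_test, model.predict(X_test))
```

The shallow-tree depth and shrinkage rate are the usual levers for trading off fit against overfitting in GBR.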


3.4 Practical Considerations When Implementing the Gradient Boosting Method


We created the initial GBR scripts in Python. Note that there are excellent and well-
documented GBR implementations available in Python, MATLAB and R. The initial
solution script was fairly straightforward to build. Once the experiment setup was
complete (splitting of the data into training and testing sets, and a way to cross-
validate), there were numerous online resources available demonstrating how to use
the various Python libraries to build a GBR model and to tune its parameters.
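One common way to carry out that tuning, sketched below with illustrative grid values rather than those used in our experiments, is a cross-validated grid search with Scikit-Learn's GridSearchCV:

```python
# Hedged sketch of parameter tuning for GBR: a small grid search with
# 3-fold cross-validation. The grid values are illustrative only.
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for the real data set.
X, y = make_regression(n_samples=300, n_features=6, n_informative=4,
                       noise=3.0, random_state=1)

# Candidate values for two of the most influential GBR parameters.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [2, 3],
}

# Each combination is scored by 3-fold cross-validated (negated) MSE.
search = GridSearchCV(GradientBoostingRegressor(random_state=1),
                      param_grid, cv=3,
                      scoring="neg_mean_squared_error")
search.fit(X, y)
best_params = search.best_params_
```

Larger grids (adding learning_rate, subsample, and so on) follow the same pattern, at the cost of more fits per fold.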
