Genetic_Programming_Theory_and_Practice_XIII

V.V. de Melo and W. Banzhaf

their importance, and then with the reduced feature set to measure the actual
solution quality. The feature set is therefore first expanded and then reduced by
feature selection. Finally, to reduce the risk of overfitting, we used cross-validation
during training.
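
As an illustration of the cross-validation step, the following is a minimal sketch of how instances could be partitioned into k folds. It is not the authors' code: the function name `k_fold_indices`, the choice of k = 5, the fixed seed, and the plain (non-stratified) split are all assumptions made for this example.

```python
import random


def k_fold_indices(n_instances, k, seed=0):
    """Yield (training, validation) index lists for plain k-fold cross-validation.

    Hypothetical helper: k, the seed, and the non-stratified split
    are assumptions, not settings taken from the text.
    """
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of nearly equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation


splits = list(k_fold_indices(n_instances=10, k=5))
```

Each instance appears in exactly one validation fold, so every candidate feature set is scored on data that was not used to fit the model in that round.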

5 Experiments

This section presents the experiments performed to evaluate KP for classification.
KP was tested on publicly available two-class medical datasets from the UCI
online repository (Lichman 2013). Some characteristics of the datasets are presented
in Table 1. The datasets were chosen after selecting the papers from the literature
that serve as the basis for comparison.

5.1 Pre-processing

Because KP generates mathematical expressions from the features of the dataset,
the data must be prepared first. The Weka machine learning tool (Hall et al.
2009) was used to replace missing values with the means from the training data,
instead of removing incomplete instances. No other transformation, normalization,
or standardization was performed on the data.
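
The imputation step was done in Weka; the sketch below only illustrates the same idea in plain Python and is not the authors' procedure. The helper names (`is_missing`, `column_means`, `impute`), the toy data, the fallback of 0.0 for a column with no observed values, and the reuse of the training means on the test rows are assumptions for this example.

```python
import math


def is_missing(value):
    """Treat None and NaN as missing (assumed encoding of missing values)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def column_means(rows):
    """Mean of each column, computed over observed values only."""
    means = []
    for j in range(len(rows[0])):
        observed = [r[j] for r in rows if not is_missing(r[j])]
        # Assumption: a column with no observed values falls back to 0.0.
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return means


def impute(rows, means):
    """Replace each missing entry with the mean of its column."""
    return [[means[j] if is_missing(v) else v for j, v in enumerate(r)]
            for r in rows]


# Toy data, not taken from the UCI datasets.
train = [[1.0, 10.0], [3.0, None], [None, 30.0]]
test = [[None, 5.0], [2.0, float("nan")]]

train_means = column_means(train)        # means come from training data only
train_filled = impute(train, train_means)
test_filled = impute(test, train_means)  # test rows reuse the training means
```

Computing the means on the training data alone keeps information from the test instances out of the pre-processing step, and no instance is discarded.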

5.2 Computational Environment

KP was implemented in the Python programming language (version 2.7.6), using
the GP implementation from the DEAP (Distributed Evolutionary Algorithms in Python)
library (version 1.0.1) and the scikit-learn library (version 0.14.2) for CART. To
evaluate the features discovered by KP, tests were performed using CART in Weka
(version 3.6.11) running on Java (version 1.7.0_55) via OpenJDK Runtime Environment
(IcedTea version 2.4.7). The experiments were executed on an Intel i7 920 desktop with 6 GB
of RAM, Archbang Linux (kernel version 3.14.5-1), GCC (version 4.9.0 20140521).
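
For readers who want to record a comparable environment description, the following is a hedged sketch. It is not part of the original study: the function name `describe_environment` is hypothetical, and `importlib.metadata` requires Python 3.8 or later, whereas the experiments used Python 2.7.6.

```python
import platform
from importlib import metadata  # available from Python 3.8 onward


def describe_environment(packages):
    """Collect interpreter, OS, and package versions in a dictionary.

    Hypothetical helper; packages that are not installed map to None.
    """
    info = {
        "python": platform.python_version(),
        "platform": platform.platform(),
    }
    for name in packages:
        try:
            info[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            info[name] = None
    return info


# Distribution names as published on PyPI (assumed here).
env = describe_environment(["deap", "scikit-learn"])
```

Storing such a dictionary next to the results makes it easier to reproduce a run later with the same library versions.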

Table 1 Summary of the two-class datasets employed in the experiments

Dataset                  Continuous attributes   Instances
Breast-w (Wisconsin)     9                       699
Diabetes (PIMA)          8                       768
Liver-disorders (BUPA)   6                       345
Parkinson                22                      195
