Genetic Programming Theory and Practice XIII


V.V. de Melo and W. Banzhaf


their importance, and then with the reduced feature set to measure the actual
solution quality. The feature set is therefore first expanded and then reduced by
feature selection. Finally, to reduce the risk of overfitting, we used cross-validation
during training.
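As an illustration of the cross-validation step, the following is a minimal sketch of k-fold evaluation in plain Python. It is not the authors' implementation: the number of folds, the contiguous (unshuffled, unstratified) split, and the callback name `fit_and_score` are assumptions made for the example, since this excerpt does not specify them.

```python
def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    # The first (n_samples % k) folds receive one extra sample.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_validate(n_samples, k, fit_and_score):
    """Average the score returned by fit_and_score over k folds.

    fit_and_score is a hypothetical callback: it receives the training
    indices and the validation indices and returns a numeric score.
    """
    folds = k_fold_indices(n_samples, k)
    scores = []
    for i, valid_idx in enumerate(folds):
        # Every fold except fold i is used for training.
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        scores.append(fit_and_score(train_idx, valid_idx))
    return sum(scores) / float(len(scores))
```

In practice the instances would usually be shuffled, and often stratified by class, before being split into folds; the sketch omits both to stay short.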


5 Experiments


This section presents the experiments performed to evaluate KP for classification.
KP was tested on publicly available two-class medical datasets from the UCI
online repository (Lichman 2013). Some characteristics of the datasets are presented
in Table 1. The datasets were chosen because they are used in the papers from the
literature selected for comparison.


5.1 Pre-processing


Because KP generates mathematical expressions using features from the dataset,
the data must be prepared beforehand. The Weka machine learning tool (Hall et al.
2009) was used to replace missing values with the means from the training data,
instead of removing incomplete instances. No other transformation, normalization,
or standardization was performed on the data.
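The imputation described above was done with Weka. The sketch below only illustrates the same idea in plain Python, under the assumption that missing values are encoded as `None` or NaN; the helper names `is_missing`, `column_means`, and `impute` are hypothetical. The important point is that the means are computed from the training data only and then reused for any other split.

```python
import math


def is_missing(value):
    """Treat None and NaN as missing (an assumption about the encoding)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def column_means(train_rows):
    """Per-column mean over the observed (non-missing) training values.

    Assumes every column has at least one observed training value.
    """
    means = []
    for j in range(len(train_rows[0])):
        observed = [row[j] for row in train_rows if not is_missing(row[j])]
        means.append(sum(observed) / float(len(observed)))
    return means


def impute(rows, means):
    """Return a copy of rows with each missing entry replaced by its column mean."""
    return [[means[j] if is_missing(v) else v for j, v in enumerate(row)]
            for row in rows]


# Usage: the means come from the training split and are applied to both splits.
train = [[1.0, None], [3.0, 4.0]]
test = [[None, 5.0]]
means = column_means(train)
train_filled = impute(train, means)
test_filled = impute(test, means)
```

Reusing the training means for the test split keeps information from the test instances out of the pre-processing step.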


5.2 Computational Environment


KP was implemented in the Python programming language (version 2.7.6), using
the GP implementation from the DEAP (Distributed Evolutionary Algorithms in
Python) library (version 1.0.1) and the scikit-learn library (version 0.14.2) for
CART. To evaluate the features discovered by KP, tests were performed using CART
in Weka (version 3.6.11) running on Java (version 1.7.0_55) via the OpenJDK
Runtime Environment (IcedTea version 2.4.7). The experiments were executed on an
Intel i7 920 desktop with 6 GB of RAM, running Archbang Linux (kernel version
3.14.5-1) with GCC (version 4.9.0 20140521).


Table 1 Summary of the two-class datasets employed in the experiments

Dataset                   Continuous attributes   Instances
Breast-w (Wisconsin)      9                       699
Diabetes (PIMA)           8                       768
Liver-disorders (BUPA)    6                       345
Parkinson                 22                      195