Genetic Programming Theory and Practice XIII

M. Kommenda et al.


the necessary model structure to describe the data implicitly. Another benefit of
the model being expressed as a mathematical formula is that such formulas can
be easily interpreted, validated, and incorporated into other programs (Affenzeller
et al. 2014). However, the interpretability of symbolic regression models is often
hampered by overly complex and large formulas, bloat and introns, and the
excessive usage of variables. Furthermore, complex models tend to overfit and
memorize the training data, which results in poor prediction performance on unseen
data. Hence, simpler models with similar training accuracy are generally preferred
to complex ones.
Symbolic regression problems are commonly solved by genetic programming,
where the formulas are generated during the optimization process and internally
represented as symbolic expression trees. One approach to avoiding overly complex
formulas and limiting the growth of the symbolic expression trees is to specify
static tree size and depth limits (Koza 1992; Poli et al. 2008). Appropriate values
for these two parameters, allowing the trees to grow large enough to model the data
accurately while avoiding unnecessary complexity, cannot be known a priori and
must be adapted to the concrete problem. Other methods of controlling the tree size
include dynamic size limits (Silva and Costa 2009), parsimony pressure methods
(Luke and Panait 2002; Poli 2010), and controlling the distribution of tree sizes
through so-called Tarpeian bloat control (Dignum and Poli 2008).
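Of these size-control techniques, Tarpeian bloat control is the easiest to illustrate in code. The following Python sketch (the function name, parameter values, and the minimization convention are our own illustration, not taken from the cited papers) assigns the worst possible fitness to a random fraction of above-average-sized individuals, so that selection discourages growth without imposing a hard size limit:

```python
import random

def tarpeian_adjust(sizes, fitnesses, p=0.3, worst=float("inf"), rng=random):
    """Tarpeian bloat control sketch (fitness minimization assumed).

    With probability p, an individual whose tree size exceeds the current
    population average receives the worst possible fitness, removing it
    from selection contention in this generation.
    """
    avg = sum(sizes) / len(sizes)
    adjusted = []
    for size, fit in zip(sizes, fitnesses):
        if size > avg and rng.random() < p:
            adjusted.append(worst)  # penalize an above-average-sized tree
        else:
            adjusted.append(fit)    # leave the fitness unchanged
    return adjusted
```

Because only a fraction of the oversized individuals is penalized, some large trees still survive, which keeps selection pressure toward small trees gentle rather than absolute.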
In this work, we follow another approach for evolving simple symbolic regression
models: we change the problem formulation from single-objective to multi-objective
(Smits and Kotanchek 2005), so that the prediction error and the model
complexity are minimized simultaneously. Hence, no complexity-related parameter
values, such as the maximum size of the evolved trees or the allowed function
symbols, have to be predefined, because the multi-objective algorithm implicitly
optimizes those as well. Furthermore, no additional methods for bloat or size control
have to be incorporated into the algorithm to evolve simple and parsimonious models.
The result of such a multi-objective algorithm execution is not a single solution, but
a complete Pareto set of models of varying complexity and prediction accuracy. The
questions remain how to measure the complexity of symbolic regression models,
what effects the selected complexity measure has on the overall algorithm
performance, and to what extent the evolved models differ syntactically.
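To illustrate what such a Pareto set is, the following Python sketch (entirely our own, with made-up numbers) keeps exactly those candidate models, scored by the two minimized objectives of prediction error and model complexity, that are not dominated by any other model:

```python
def pareto_set(models):
    """Return the non-dominated models.

    models: list of (error, complexity) tuples, both objectives minimized.
    Model a dominates model b if a is at least as good in both objectives
    and strictly better in at least one.
    """
    def dominates(a, b):
        return a[0] <= b[0] and a[1] <= b[1] and a != b

    return [m for m in models if not any(dominates(o, m) for o in models)]

# Hypothetical (error, complexity) scores for five evolved models:
candidates = [(0.10, 25), (0.10, 12), (0.25, 5), (0.40, 5), (0.05, 40)]
front = pareto_set(candidates)
# (0.10, 25) is dominated by (0.10, 12): same error, lower complexity;
# (0.40, 5) is dominated by (0.25, 5): same complexity, lower error.
```

The surviving models form the trade-off curve the practitioner chooses from: accepting a slightly larger error buys a simpler, more interpretable formula.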


2 Complexity Measures for Symbolic Regression


Several measures for calculating the complexity of symbolic regression models have
been previously proposed. The simplest ones are based only on the characteristics
of the symbolic expression tree representing the regression model, such as the tree
length (Eq. (1)) or the total visitation length (Eq. (2), also denoted as expressional
complexity by Smits and Kotanchek (2005)). The visitation length (Keijzer and
Foster 2007) has the advantage that it not only includes the size of the trees, but
also incorporates information about the skewness of the tree and favors balanced
trees.