Genetic_Programming_Theory_and_Practice_XIII

(C. Jardin) #1

Evolving Simple Symbolic Regression Models 3


trees. Another proposed complexity measure is the number of variable symbols
(either the number of unique variables in the expression, or the total number of
leaf nodes representing variables, Eq. ( 3 )). A benefit of those complexity measures
is that they can be calculated efficiently within a single tree iteration with the use
of caching strategies for already calculated subtree lengths and thus the runtime
of the optimization algorithm is hardly affected. A drawback of those complexity
measures is that semantic information about the regression models is not included
and only the tree shape is taken into account.
This drawback is overcome by theorder of nonlinearitymetric defined by
Vladislavleva et al. ( 2009 ). The order of nonlinearity is recursively calculated by
aggregating the complexity of the subtrees (e.g.,comp.aCb/D max.comp.a/,
comp.b//) and includes the minimal degree of a Chebyshev polynomial approxi-
mating the response of individual subtrees sufficiently well. This gives an accurate
and intuitive definition of the complexity of a regression model, but Chebyshev
polynomial approximation can be computationally expensive, although simplifica-
tions to reduce the computation time have been proposed, and depends on the range
and number of the presented data points.
Another interesting complexity measure is thefunctional complexity(Vanneschi
et al. 2010 ), which is based on the curvature of the model’s response. It basically
expresses how many times the slope of the model’s response changes in each
dimension and can be calculated in polynomial time with the number of presented
data points. However, the functional complexity includes no information about the
tree length or shape and on its own does not prefer smaller models as the other
complexity measures do, which is desired when performing multi-objective genetic
programming for symbolic regression.
Based on the characteristics of the previously suggested complexity measures,
we have derived a new complexity measure that provides an intuitive definition,
is independent of the actual data points, and can be calculated efficiently. The
goal of this measure is to be used in multi-objective genetic programming to steer
the algorithm towards simple symbolic regression models and to strengthen its
ability to identify the necessary function symbols to build highly accurate models.
The complexity measure is recursively defined by assigning a complexity of 1
to constants and 2 to variable symbols and aggregating the complexity values of
those leaf nodes according to specified rules (Eq. ( 4 )). Most of the aggregation
rules originate from the mathematical semantics of the encountered symbol. Due
to its recursive definition the complexity measure can be calculated with a single
iteration of the symbolic expression tree without evaluating the model itself.
Another reason for defining the complexity measure in that way, has been that
while the complexity of sin.x/ D 22 D 4 is still rather small, the complexity
increases exponentially with the size of the subtree beneath the symbol. Therefore,
the total complexity of a symbolic regression model is heavily dependent on the
level in which more complicated function symbols occur and when performing
multi-objective optimization these are pushed towards the leaf nodes of a tree.
An alternative definition could be to reduce the complexity values of constants
and variables to 0 and 1 respectively, but this would not penalize large models

Free download pdf