Genetic Programming Theory and Practice XIII

M. Kommenda et al.


The length of the evolved symbolic regression models for all single-objective
genetic programming configurations reaches or slightly exceeds the predefined
limit. The length constraint can be exceeded due to the additive and multiplicative
linear scaling terms which are added to the models to account for the scaling
invariance of Pearson's R². All multi-objective algorithms perform similarly
with respect to the model length, with the exception of the variable complexity
measure, which exerts almost no selection pressure towards smaller models.
Noteworthy is that multi-objective genetic programming finds exactly the
data-generating formula for the fourth problem F4 (small prediction errors on
the test partition result from slightly inaccurate numerical constants).
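The linear scaling terms mentioned above can be sketched as follows: after evaluation, a multiplicative term a and an additive term b are fitted by least squares so that a correctly shaped but wrongly scaled model still achieves a perfect fit. This is a minimal illustration, not the authors' implementation; the function name is hypothetical.

```python
import numpy as np

def linearly_scale(y_pred, y_true):
    """Fit multiplicative (a) and additive (b) scaling terms so that
    a * y_pred + b minimizes the squared error against y_true.
    These two terms absorb the linear transformations to which
    Pearson's R^2 is invariant."""
    a, b = np.polyfit(y_pred, y_true, deg=1)  # least-squares line
    return a * y_pred + b

# Example: a model output that is correct only up to scale and offset
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = 0.5 * y_true + 3.0             # perfectly correlated with y_true
scaled = linearly_scale(y_pred, y_true)
assert np.allclose(scaled, y_true)      # scaling recovers the targets
```

Because a and b are attached to the evolved expression as extra nodes, a model that sits exactly at the length limit can exceed it after scaling.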
Next to the accuracy and length of the final models, we are interested in the
functions used in the obtained models. Therefore, we analyzed how often and
where trigonometric, exponential, and power symbols occur in those models. This
is calculated by summing over the sizes of the affected subtrees whose symbols fall
into the defined categories (trigonometric: sin, cos, tan; exponential: exp, log;
power: x², √x). If a symbol occurs multiple times, all occurrences are counted,
and the affected subtree size can therefore exceed the model length.
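The counting rule described above can be sketched as a recursive traversal: whenever a node's symbol falls into a category, the size of its whole subtree is added to the total. The nested-tuple tree representation and the category names here are assumptions for illustration, not the authors' data structures.

```python
# Hypothetical sketch of the affected-subtree-size analysis.
# A tree node is (symbol, [child nodes]).

CATEGORIES = {
    "trigonometric": {"sin", "cos", "tan"},
    "exponential": {"exp", "log"},
    "power": {"square", "sqrt"},
}

def tree_size(node):
    symbol, children = node
    return 1 + sum(tree_size(c) for c in children)

def affected_size(node, symbols):
    """Sum the sizes of all subtrees rooted at a symbol in the given
    category. Nested occurrences are each counted in full, so the
    total can exceed the model length."""
    symbol, children = node
    total = tree_size(node) if symbol in symbols else 0
    return total + sum(affected_size(c, symbols) for c in children)

# sin(sqrt(x)) + x  ->  model length 5
model = ("+", [("sin", [("sqrt", [("x", [])])]), ("x", [])])
assert affected_size(model, CATEGORIES["trigonometric"]) == 3  # sin subtree
assert affected_size(model, CATEGORIES["power"]) == 2          # sqrt subtree
```

For a nested expression such as sin(cos(x)), both the sin subtree (size 3) and the cos subtree (size 2) are counted, giving an affected size of 5 for a model of length 3.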
The results of this analysis are displayed in Table 4. The interpretation is eased
by comparing the values with the affected subtree size of the shortest model solving
the problem exactly (shown next to the problem name). The calculated subtree size
can fall below the optimal value for power symbols, because x² can be reformulated
as x·x, yielding a slightly larger model. This happens, for example, on problems F1
and F3.
The standard genetic programming algorithms with a length constraint of 50 and
100 include all available symbols rather often. Standard genetic programming with
the smallest length constraint of 20 works quite well due to the strict limitation of
the search space. NSGA-II with the newly defined complexity measure achieves
the best overall results in terms of the affected subtree size of the investigated
symbols, which indicates that the combination of syntactical information and the
semantics of the symbols improves the algorithm's ability to determine the
complexity necessary to evolve simple yet accurate models. Compared with our
complexity measure, the two algorithms using the tree size and the visitation
length generate models with a slightly more complex structure, as more nodes are
affected by the investigated functions. However, the optimization towards more
parsimonious models still helps these algorithms to produce models using fewer
trigonometric, exponential, or power functions compared to single-objective
algorithms using the same length constraints.


4.2.1 Exemplary Models


The advantages of multi-objective symbolic regression are illustrated by the best
models (Eqs. (6)–(9)) generated for problem F2. The best training model out of
the 50 repetitions for every algorithm variant has been extracted, and after constant
folding and numeric optimization all models obtained a test NMSE of at most 10⁻¹⁰.
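Assuming NMSE here denotes the mean squared error normalized by the variance of the targets (a common convention, so that 0 is a perfect fit and 1 corresponds to predicting the mean), a test NMSE below 10⁻¹⁰ is consistent with a model that matches the data-generating formula up to slightly inaccurate constants. A minimal sketch, with a hypothetical target function:

```python
import numpy as np

def nmse(y_true, y_pred):
    # Normalized MSE: 0 for a perfect fit, 1 for predicting the mean.
    return np.mean((y_true - y_pred) ** 2) / np.var(y_true)

x = np.linspace(0.1, 4.0, 200)
y_true = 1.5 * np.sin(x)            # hypothetical data-generating formula
y_model = 1.5000001 * np.sin(x)     # slightly inaccurate numeric constant
assert nmse(y_true, y_model) < 1e-10
```

A relative error of 10⁻⁷ in a single constant already pushes the NMSE far below the 10⁻¹⁰ threshold, which is why such models are effectively exact rediscoveries of the formula.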
