Symbolic regression of sine function

Please change directory to `PERLGP_BASE/demos/sin` and look at the README file before running this demo.

Find an approximation to sin(x) using the arithmetic operations plus, minus, multiply and protected division, the integers 1, 2, 3, 5 and 7, and the input x.

The fitness measure is essentially the root mean squared error of the predicted function against the target function, measured on a set of randomly sampled points. Initially the points lie between -1.0 and +1.0, but both the range and the number of points increase during the run (see below).
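As a minimal sketch of this fitness measure (PerlGP itself is written in Perl; the function names and parameters here are illustrative only, not PerlGP's API):

```python
import math
import random

def fitness(predict, n_points=50, rng=1.0):
    """RMS error of `predict` against sin(x) on randomly sampled points.
    `rng` plays the role of the demo's Range attribute: samples are
    drawn uniformly from [-rng, +rng]."""
    total = 0.0
    for _ in range(n_points):
        x = random.uniform(-rng, rng)
        total += (predict(x) - math.sin(x)) ** 2
    return math.sqrt(total / n_points)
```

Lower values are better; a perfect individual would score 0.0.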

Here, `loadSet()` reads in data (randomly sampled values) from the files `TrainingSet` and `TestingSet`, which are generated by a simple script, `./generate_data.pl` (run by the `refresh()` method; more on this later). The general flow of data is discussed in detail in Section 2.4.
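A rough Python stand-in for what such a data-generating script does, assuming one whitespace-separated (x, sin(x)) pair per line (the real `generate_data.pl` may use a different file format):

```python
import math
import random

def generate_data(path, n_points, rng):
    """Write `n_points` randomly sampled (x, sin(x)) pairs to `path`,
    one tab-separated pair per line, with x drawn from [-rng, +rng]."""
    with open(path, "w") as fh:
        for _ in range(n_points):
            x = random.uniform(-rng, rng)
            fh.write(f"{x}\t{math.sin(x)}\n")

# Regenerating both files resamples the training and testing points.
generate_data("TrainingSet", 50, 1.0)
generate_data("TestingSet", 50, 1.0)
```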

Two subtle "hacks" seemed to be needed to get well-behaved GP runs
on this problem; both use domain-specific knowledge.
Firstly, the error function incorporates a cap on the per-sample
error: if the squared difference between the predicted and known
values is greater than 1.0, it is counted as 1.0. This may not be
essential, but from the Taylor expansions of trigonometric functions
we know that any polynomial approximation eventually diverges without
bound from the axis, so capping the error helps avoid over-punishing
functions containing high powers of x. The second hack is to start
training on a small range of x values centred around zero and then
increase the range (attribute `Range`) as the evolved solutions reach
a certain target fitness. Again we are using domain-specific
knowledge: the early terms of the Taylor expansion fit the sine
function best for values of x near zero. You can play around with
more complex target functions where there is no trivial solution
early on.
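The per-sample error cap can be sketched in a few lines (the cap value 1.0 is from this demo; the function name is illustrative):

```python
def capped_sq_error(predicted, actual, cap=1.0):
    """Per-sample squared error, clipped at `cap` so that wildly
    diverging individuals (e.g. high powers of x evaluated far from
    zero) are penalised no more than any other bad sample."""
    err = (predicted - actual) ** 2
    return min(err, cap)
```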

Take a look at the `refresh()` method in `Algorithm.pm`; it is
called on the first tournament and every
`RefreshInterval` tournaments thereafter. When the best training
fitness goes below a threshold, `Range` is increased and the data
is regenerated and loaded up again (and the whole population's
fitnesses are reset). An "anti-stagnation" measure is also
implemented in `refresh()`: if more than 1000 tournaments go by
without an improvement in fitness, the data is resampled and reloaded,
but `Range` does not increase.
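The decision logic just described can be sketched as follows (a Python sketch only: the threshold, step size and state-variable names are illustrative, not PerlGP's actual attributes apart from `Range`):

```python
def refresh(state, best_fitness, fitness_threshold=0.01,
            stagnation_limit=1000, range_step=1.0):
    """Return True if the data should be regenerated and reloaded.
    state is a dict holding 'Range' (sampling half-width) and
    'stagnant' (tournaments since the last fitness improvement)."""
    if best_fitness < fitness_threshold:
        state["Range"] += range_step   # widen the sampling range
        state["stagnant"] = 0
        return True                    # regenerate data, reset fitnesses
    if state["stagnant"] > stagnation_limit:
        state["stagnant"] = 0
        return True                    # anti-stagnation: resample only,
                                       # Range stays the same
    return False
```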

This demo also uses self-adapting `NodeXoverProb` and
`NodeMutationProb` attributes.
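Self-adaptation here means each genome carries its own operator probabilities, which are themselves perturbed during reproduction. One common way to do this (a hedged sketch of the general technique; see the `NodeXoverProb`/`NodeMutationProb` handling in the PerlGP source for the actual scheme and constants):

```python
import math
import random

def mutate_rate(rate, lo=1e-4, hi=0.5):
    """Perturb an individual's own crossover/mutation probability with
    a log-normal multiplicative step, clamped to [lo, hi]. The step
    size 0.2 and the bounds are illustrative values, not PerlGP's."""
    rate *= math.exp(random.gauss(0.0, 0.2))
    return min(max(rate, lo), hi)
```

In this style, rates that help an individual win tournaments are propagated along with its genome.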

The various devices employed here may or may not be necessary to get a good result, but they illustrate some of the ways you can customise PerlGP.

Note: the fitness function and data structures are not really optimised for speed (there is wasteful use of hashes and so on). You can read about speed comparisons against lilgp on the same problem in my EuroGP 2003 paper (available on request, and later on the web). It's always a good idea to have something else to do while jobs are running, however...