Visual Analytic Tools
This section gives a quick roundup of the visual tools in GPTIPS. These tools help you interpret your GPTIPS models and select the models that best suit your use case. They also let you fine-tune the structure of your models in order to optimise their accuracy-simplicity ratio (ASR).
Graphical run summary
A simple graphical representation of a GPTIPS run, expressed in terms of predictive performance (RMS error on the training data).
The upper blue line shows the loss function/fitness of the 'best' individual in the population against the number of iterations/generations completed.
The lower orange line shows the mean performance of the population of models.
To generate this from the command line use:
>>summary(gp)
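As a rough illustration of the two quantities plotted in the run summary, the sketch below computes the best and the mean RMS error per generation. The matrix fitnessHistory and its values are hypothetical and made up for this example; they are not a GPTIPS data structure.

```matlab
% Hypothetical fitness history: one row per model, one column per generation.
% Values are RMS errors (lower is better); the numbers are made up.
fitnessHistory = [0.90 0.70 0.55 0.50;
                  1.20 0.80 0.60 0.45;
                  1.50 1.10 0.90 0.70];

% Best (lowest) RMS error in the population at each generation.
bestPerGen = min(fitnessHistory, [], 1);

% Mean RMS error of the whole population at each generation.
meanPerGen = mean(fitnessHistory, 1);

disp(bestPerGen)
disp(meanPerGen)
```

The best curve can never lie above the mean curve at any generation, because the minimum of a set of errors is never larger than their mean.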
Interactive population browser
The population browser (popbrowser) is an interactive tool for exploring a population (library of models).
This may be used to identify models that lie on the trade off surface of predictive performance and model complexity.
For regression models, green dots represent the Pareto optimal surface of models in terms of model performance (1 - R²) and model complexity.
Blue dots represent non-Pareto models and are not usually worth looking at.
The red circled dot represents the ‘best’ model in the population purely in terms of predictive performance (R²) on the training data.
Note: Each model in the population has a numerical ID (MID) associated with it, e.g. 65 is model number 65 in the population. The MID is an indexing method and is unrelated to model performance.
Clicking on a dot shows a popup containing the MID and the simplified model equation.
To run this visualisation tool from the command line use:
>>popbrowser(gp,'train')
to show the population performance on the training data.
Additionally:
>>popbrowser(gp,'test')
shows the performance on the test data set (if present). And:
>>popbrowser(gp,'val')
shows the performance on the validation data set (if present).
Example of popbrowser on a symbolic regression problem. The population contains 600 models. The green dots represent the Pareto optimal set of models that lie on the accuracy-simplicity trade off curve. The blue dots represent sub-optimal models. The popup shows the symbolic regression equation for a selected model.
Interactive gene browser - how to fine tune your models
The gene browser provides a mechanism for you to create tailored, fine-tuned models that express desired accuracy-simplicity ratios (ASR) by:
(1) progressively reducing model complexity in a controlled manner, and/or
(2) adding genes from the population that are not in your model.
After a GPTIPS run, you can extract a MATLAB data structure containing all of the ‘unique’ model trees (genes) in a population using the uniquegenes function as indicated below.
>>genes = uniquegenes(gp)
Note: Within the genes structure each unique tree/gene is allocated a numerical ID called the Unique Gene Number (UGN), not to be confused with a model ID (MID) in the GPTIPS population. The UGN is for indexing and is unrelated to any predictive properties of the gene.
The genebrowser function provides an interactive visualisation of the genes in the population and in a selected model. In the example below, it is used on the model that performed best (in terms of predictive R²) on the training data.
>>genebrowser(gp,genes,'best')
Clicking on any blue bar shows a popup containing the equation of the tree/gene and the reduction in R² that would result if that gene were to be removed from the model.
Clicking on any orange bar in the lower axis does the same for genes that are not in the current model and shows the increase in R² that would be attained if that gene were added to the model.
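To make the idea of a gene's contribution to R² concrete, the following minimal sketch compares a model with and without one gene. It assumes the model is a bias term plus a weighted sum of gene outputs and that the weights are refit by ordinary least squares each time. That is an assumption made for illustration; how genebrowser computes its values is not described here. The data, the gene outputs and the names G, design and fitR2 are all hypothetical.

```matlab
% Deterministic synthetic data (made up for illustration): 50 samples, 4 inputs.
t = (1:50)';
x = [sin(t), cos(2*t), sin(3*t + 1), cos(5*t)];
y = 2*x(:,1) + 0.5*x(:,2).^2 + 0.3*x(:,1).*x(:,4);

% Hypothetical gene outputs, one column per gene.
G = [x(:,1), x(:,2).^2, x(:,1).*x(:,4)];

% Design matrix: a bias column plus the selected gene columns.
design = @(cols) [ones(numel(y),1), G(:,cols)];

% R2 after refitting the weights by ordinary least squares (backslash).
fitR2 = @(cols) 1 - sum((y - design(cols)*(design(cols)\y)).^2) / sum((y - mean(y)).^2);

r2Full    = fitR2([1 2 3]);    % model containing all three genes
r2Reduced = fitR2([1 2]);      % same model with gene 3 (x1*x4) removed

fprintf('R2 with all genes:  %.4f\n', r2Full);
fprintf('R2 without gene 3:  %.4f\n', r2Reduced);
fprintf('Reduction in R2:    %.4f\n', r2Full - r2Reduced);
```

The same helper could be used in the other direction: evaluating fitR2 with an extra column appended to G would give the increase in R² from adding a gene that is not yet in the model.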
Once you have identified a suitable gene to remove from (or add to) the model, a new model can be generated with the genes2gpmodel function, using the unique gene IDs (UGNs) as input arguments.
The model data structure returned from this function can be examined and exported in exactly the same way as any model contained within the population.
Example: In the figure below, it can be seen that the best model on the training data contains the 6 unique genes with IDs (UGNs)
82 207 226 145 232 44
It can also be seen that removing the gene with UGN 44 (x1x4) from this model will reduce its R² from 0.99 to around 0.97 on the training data.
So to create the lower complexity model the genes2gpmodel function can be used as follows.
You provide a list of UGNs (shown in the model gene list above) and exclude the gene you identified for removal (UGN 44).
>>reducedModel = genes2gpmodel(gp,genes,[82 207 226 145 232]) % UGN 44 excluded!
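The UGN list passed to genes2gpmodel above can also be built programmatically rather than typed by hand. The sketch below is one way to do this in plain MATLAB; the variable names modelUGNs, removeUGN and keptUGNs are hypothetical and chosen for this example.

```matlab
% UGNs of the genes in the current model (taken from the example above).
modelUGNs = [82 207 226 145 232 44];
removeUGN = 44;                          % gene chosen for removal

% Logical indexing keeps the remaining UGNs in their original order
% (setdiff would also work, but it sorts its output by default).
keptUGNs = modelUGNs(modelUGNs ~= removeUGN);

disp(keptUGNs)

% The result can then be passed on, e.g.:
% reducedModel = genes2gpmodel(gp,genes,keptUGNs);
```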
The regression model reducedModel can now be analysed using the standard GPTIPS functions, e.g.
>>runtree(gp,reducedModel)
>>gpmodelreport(gp,reducedModel)
It can also be exported to an m file:
>>gpmodel2mfile(gp,reducedModel)
Or analysed using genebrowser to identify further terms for possible removal:
>>genebrowser(gp,genes,reducedModel)
Pareto report
GP populations contain numerous models. However, the models that lie on the Pareto optimal trade off surface of predictive ability and complexity are almost always of the most interest (e.g. see the popbrowser function). The set of models on this trade off surface is known as the Pareto front.
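The notion of a Pareto front can be sketched in a few lines of MATLAB. The example below is a generic non-dominated filter over two objectives (complexity and 1 - R2, both to be minimised). It is not the GPTIPS implementation; the population values are made up and the vector index simply plays the role of a model ID.

```matlab
% Hypothetical population: one entry per model (values are made up).
complexity = [ 5    8    12   12   20   25   30  ];
r2         = [0.60 0.75 0.70 0.85 0.90 0.88 0.95];
err        = 1 - r2;                 % lower is better, as is complexity

n = numel(complexity);
onFront = true(1, n);
for i = 1:n
    for j = 1:n
        % Model j dominates model i if it is no worse on both objectives
        % and strictly better on at least one of them.
        noWorse = complexity(j) <= complexity(i) && err(j) <= err(i);
        better  = complexity(j) <  complexity(i) || err(j) <  err(i);
        if noWorse && better
            onFront(i) = false;
            break
        end
    end
end

paretoIDs = find(onFront);           % indices of the non-dominated models
disp(paretoIDs)
```

In this made-up population, model 3 is dominated by model 2 (simpler and more accurate) and model 6 by model 5, so neither would appear on the front.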
GPTIPS can generate a standalone interactive HTML report listing the multigene regression models on the Pareto front in terms of their (simplified) equation structure, model complexity and predictive performance (R²).
The report table is interactive and the models can be sorted by predictive performance (training data) or complexity by clicking on the appropriate column header. The model ID (MID) is shown in the left hand column.
An example of an extract from such a report is shown below. It shows the typical trend of predictive performance increasing with model complexity.
Hence, the report assists you in rapidly identifying the most promising model or models to investigate in more detail.
After a GPTIPS run is complete, the report can be generated using the paretoreport function as follows:
>>paretoreport(gp)