HELP
What is GPTIPS?
It is a free, open source MATLAB toolbox for explainable symbolic machine learning to generate predictive and explanatory models.
It uses a biologically inspired machine learning method called multigene genetic programming (MGGP) as the Hypothesis-ML engine that drives the automatic model discovery process.
GPTIPS is built around an MGGP engine for MATLAB (Hypothesis-ML), which generates rules/models/hypotheses in the form of multiple trees.
The most popular use of GPTIPS is to perform explainable symbolic non-linear regression. GPTIPS provides a stack of additional functions to help you do this (referred to collectively as the symXAI module).
That is, to allow you to automatically discover and interpret empirical symbolic non-linear regression models from data. Hypothesis-ML generates the models and symXAI lets you analyse, interpret, visualise and export them.
Non-linear regression models are typically of the form
y = f(x1, ... , xN) + E
where y is an output/response variable (the thing you are trying to predict) and x1, ... , xN are feature (input/predictor) variables. These are things you know and want to use to predict y. Here f is a model, i.e. a symbolic non-linear function (or a collection of non-linear functions). E is the error (residual) - the difference between the observed value of y and the model's prediction of y. The typical aim is to choose an f that minimises, in some sense, the size of E. In most practical cases there will be many f's that will minimise E.
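To make this concrete, here is a minimal MATLAB sketch (purely illustrative, with synthetic data and an arbitrary hand-picked candidate for f) of how the residuals E and a typical error metric (the sum of squared errors) are computed:
x1 = rand(50,1); x2 = rand(50,1);            % two (synthetic) feature variables
y  = 2*x1.^2 + sin(x2) + 0.1*randn(50,1);    % observed response values
f  = @(x1,x2) 2*x1.^2 + sin(x2);             % one possible hypothesis for f
E  = y - f(x1,x2);                           % residuals: observed minus predicted
sse = sum(E.^2);                             % sum of squared errors to be minimised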
In symbolic regression the form(s) of f is determined automatically by GPTIPS.
A key concept is that the GPTIPS machine learning algorithm does not build a single model equation, rather it builds a library (sometimes called a population) of f models. This is because it uses a biologically inspired algorithm that loosely mimics the processes of biological evolution in populations. These equations breed at a frankly indecent rate.
It may be helpful to think of this as a process that generates a set of different hypotheses about your data.
This means that you need to choose the model equations/hypotheses that you want when you have run GPTIPS on your data. GPTIPS provides functionality to help you do this.
Is this explainable AI (XAI)? Is this important?
Yes and yes.
Symbolic machine learning in GPTIPS is the process of extracting hidden, meaningful relationships from data in the form of symbolic equations. The models are 100% transparent in both their mode of operation (the sequence and type of computations) and the features they include. This often yields new insight into the physical systems or processes that generated the data. GPTIPS is intrinsically explainable AI (XAI) by design.
Note: Although GPTIPS is heavily engineered towards regression models that look much like traditional regression equations, it can also generate more exotic models containing both traditional regression terms and procedural flow control structures such as IF-THEN-ELSE constructs, threshold functions and similar. You can use these in your models by defining the right building blocks (e.g. IF-THEN-ELSE functionality is provided by the iflte tree node in your GPTIPS configuration). See the next section for an example using the thresh threshold function node.
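For example, a hedged illustration of how such building blocks might be enabled in your config file (using node names mentioned on this page; the exact function set is up to you):
gp.nodes.functions.name = {'plus','minus','times','iflte','thresh'};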
Society and corresponding international legal regulatory frameworks (e.g. GDPR) increasingly demand that ML models are transparent and accountable. Pure black box models are becoming societally unacceptable and possibly illegal for some uses in a number of legislative regions.
I love machine learning and I hope you do too - but I don't really want either of us to end up in front of a court because of it!
By design, GPTIPS models are completely transparent and have nothing to hide. They explicitly define an understandable audit trail of what features you used and what computations you did to get your prediction. This is in stark contrast to many other machine learning and statistical methods such as neural networks, support vector machines etc. which are essentially black boxes. In theory you could examine the internals of these black boxes to establish the chain of the model's computational reasoning but - as they say - good luck with that!
To get a near 360° view of the 'explanation credentials' of your GPTIPS models I recommend using GPTIPS in conjunction with the SHAP (SHapley Additive exPlanations) XAI framework. The two form a good complement in explaining what your model actually is and does. Each provides a different part of the 'explainability' picture: GPTIPS lets you look into the internal mechanisms of your models while SHAP describes a game theoretic linear approximation of how your models will actually behave when let loose in the world.
I have significant reservations about SHAP-like approaches (it's actually using a simple post hoc linear approximation to your ML model, so your SHAP explanations are in fact post hoc simplifications of the 'true' explanations), but it's extremely clever, beautifully implemented and can be applied to models from any ML methodology. Just remember that SHAP explanations are approximate explanations and do not constitute a 100% accurate description of how your model crunches features to get a prediction.
Read this! A must-read on the subject of XAI and explainability (and lack thereof) in black box models is a recent-ish paper by Professor Cynthia Rudin of Duke University in Nature Machine Intelligence. It's called Stop Explaining Black Box Machine Learning Models for High Stakes Decisions and Use Interpretable Models Instead. It's insightful, well written and, as a bonus, contains pictures of dogs and is largely free of hard math(s). It's not behind a paywall either!
Explainable AI - how do I visualise, interpret and explain the trees in a GPTIPS model?
In GPTIPS, the tree representations comprising any model can be drawn to a web page / HTML file using the drawtrees function.
The tree for a model comprising one tree is shown below. For models comprising more than one tree all trees will be drawn. An example of how to use a tree visualisation to interpret and explain your model as a sequence of 100% transparent simple computational steps is shown below.
Visualisation of a GPTIPS model tree using drawtrees. Note that this model tree contains a procedural flow control element (the thresh node which acts like a switch).
This particular model can be interpreted as the following human-friendly transparent explanation:
1. Raise the value of feature x4 to the third power.
2. Add the result of step 1 to x4.
3. Check if the value of x8 is larger than or equal to x6. If so, then add x8 to the result from step 2; otherwise, add zero to the result of step 2.
4. Output the value of the tree as the result from step 3.
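The same sequence of steps can be written as a short MATLAB sketch (illustrative only - it assumes scalar feature values named x4, x6 and x8, and is not code generated by GPTIPS):
step1 = x4^3;              % 1. raise x4 to the third power
step2 = step1 + x4;        % 2. add x4
if x8 >= x6                % 3. the thresh node: is x8 >= x6?
    step3 = step2 + x8;    %    ...if so, add x8
else
    step3 = step2 + 0;     %    ...otherwise add zero
end
treeOutput = step3;        % 4. the output of this tree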
This is just a simple example. In GPTIPS symbolic regression your model will contain one or more trees (explanations) like this that are optimally linearly weighted to give your final model output.
Are GPTIPS regression models as 'good' as neural net models?
It depends on (a) what you mean by good and (b) the problem at hand and (c) what you want the model for. It's kind of like wondering whether oranges are better than apples (which, admittedly, I've done).
GPTIPS sometimes (usually when only a few feature variables are involved) lags behind a roughly equivalent neural net model in terms of raw predictive performance but the equivalent GPTIPS models are simpler and may be open to physical interpretation. It's not always an easy question to answer.
To put it another way:
Is the model y = 3x1^2 + 2x1x2 (R^2 = 0.93) "better" or "worse" than a black box neural net model with R^2 = 0.95?
Is that 0.02 of R^2 uplift worth the lack of explainability and the additional effort to deploy it?
Is that 0.02 boost even "true" in the sense you will see it in real world use?
Can you explain that neural net to your non-technical boss or your clients in a way they will understand?
GPTIPS generates models that are intended to be interpreted by humans, but neural networks - whilst very powerful when trained correctly - are not.
How does symbolic non-linear regression work?
This will be explained using old school ordinary linear regression as a starting point.
In classical linear regression (and non-linear regression), a pre-determined model structure is assumed (by you) and then the problem is to find the 'optimal' parameters of the model to minimise some prediction error metric of y (typically the sum of squared errors; SSE) over a set of known y and X data values (this known data set is usually called the 'training' data). Once these 'optimal' parameters have been found, you now have a model which can (you hope) predict unknown y values for new X values.
Consider a typical linear regression scenario:
There is an output (response) variable y and N feature variables x1, ... , xN. There are a number of observations of y and corresponding observations of the N inputs. These comprise the training data set. It is then assumed that y is an unknown linear function of the features x1, ... , xN. The problem is then to find the optimal values of the unknown parameters a0, ... , aN in the expression below that minimise a metric of the error E over the training data set.
y = a0 + a1 x1 + a2 x2 + ... + aN xN + E
For instance, the SSE (sum of squared errors) metric can be minimised by using the least squares normal equation to give optimal a0, ... , aN. The problem is that linear regression models will generally not capture non-linear relationships.
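As a minimal MATLAB sketch of the linear least squares step just described (illustrative only - it uses the backslash operator rather than forming the normal equation explicitly, and assumes a feature matrix X and response vector y):
A = [ones(size(X,1),1) X];   % design matrix: a column of ones (for a0) plus the features
a = A \ y;                   % least squares estimates of [a0; a1; ... ; aN]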
However, in GPTIPS symbolic non-linear regression a model structure is not explicitly assumed, and a machine learning algorithm is used to find the structure and parameters of a non-linear model that minimises the chosen error metric. In GPTIPS, a variant of an algorithm class called MGGP is used to do this. A typical (regression) model generated by GPTIPS is
y = 0.23 x1 + 0.33 (x1 - x5) + 1.23 x3^2 - 3.34 cos(x1) + 0.22
This model contains both linear and non-linear terms.
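Evaluated in MATLAB this model is simply (illustrative only, assuming column vectors of feature data named x1, x3 and x5):
ypred = 0.23*x1 + 0.33*(x1 - x5) + 1.23*x3.^2 - 3.34*cos(x1) + 0.22;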
In GPTIPS, you don't specify the functional form of the models; you only specify which 'building blocks' the models can be constructed from, e.g. plus, minus, times, cos, tanh etc. The advantage is that you can build non-linear models without having to know beforehand what the model structure should be.
Models evolved by GPTIPS are usually much more accurate and interpretable than those created using other methods.
Another advantage is that collinearity and correlation of the inputs (which can cause severe problems in many regression methods) is not generally problematic in GPTIPS.
How do I run GPTIPS?
GPTIPS is run from the MATLAB command line using a configuration file to specify the run settings and to load or generate user data. This config file is a standard MATLAB m file.
Once you have performed a run you can use a wide range of GPTIPS command line functions and graphical tools to delve into the details of your new models and to export selected models for use elsewhere (including outside of MATLAB).
The easiest way to see how this is done is to run the symbolic regression demos (gpdemo1, gpdemo2, gpdemo3, gpdemo4) from the command line, e.g.
>>gpdemo3
To run GPTIPS with your own data use the rungp function from the command line with a function handle @ to your configuration file as a parameter. E.g.
>>gp = rungp(@myconfig);
This performs a run and creates a data structure gp. The gp data structure is the common currency of GPTIPS and is used extensively in post run analysis and model selection.
A config file can be as simple as
function gp = myconfig(gp)
gp.userdata.xtrain = rand(100,10); % feature (X) training data: 100 observations of 10 features
gp.userdata.ytrain = rand(100,1); % output (y) training data - this must be a column vector!
gp.nodes.functions.name = {'plus','minus','times','rdivide','cube','sqrt','abs'}; % tree building blocks
This example randomly generates feature (X) and output/target (y) data and specifies that the trees should be built using plus, minus, times, rdivide (unprotected divide), cube, square root (sqrt) and abs nodes. Other nodes typically used in symbolic regression are square, sin, cos, exp, power, add3, mult3, log, negexp and neg.
All other settings use the GPTIPS regression defaults, including the fitness / objective / loss function regressmulti_fitfun which performs multi-tree (genes) symbolic regression model discovery.
Your config file acts as a full specification to the model discovery module (hypothesis engine).
You can overwrite any default setting by adding an appropriate line to your config file. For example, the default number of genes is 4 and to override this to use 8 genes add the following line
gp.genes.max_genes = 8;
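Putting this together, a slightly fuller config file might look like the sketch below (it uses only settings and node names shown on this page; the values are illustrative, not recommendations):
function gp = myconfig(gp)
gp.userdata.xtrain = rand(100,10);          % feature (X) data
gp.userdata.ytrain = rand(100,1);           % output (y) data - column vector
gp.nodes.functions.name = {'plus','minus','times','rdivide','square','sin','cos'};
gp.genes.max_genes = 8;                     % override the default of 4 genes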
When the run is complete the hypothesis engine outputs a gp data structure. This can then be visualised and processed using a variety of post-run GPTIPS functions, e.g. to run the 'best' individual in the population (as evaluated in terms of predictive performance on the training data) use:
>>runtree(gp, 'best');
To run the best on the testing data (as evaluated in terms of predictive performance on the testing data) use:
>>runtree(gp, 'testbest');
To graphically browse the created models in terms of goodness of fit and model complexity, the popbrowser function can be used as follows:
>>popbrowser(gp);
When using popbrowser - it's a good idea to also generate an HTML report of the model equations that form the Pareto front (i.e. the green dots on the popbrowser figure window) using the paretoreport function.
In this report the equations can be sorted by model complexity and model performance by clicking on the appropriate header. To generate the report use:
>>paretoreport(gp)
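A typical end-to-end session might therefore look something like this sketch (it simply combines the commands above; @myconfig is a placeholder for your own config file):
>>gp = rungp(@myconfig);     % perform the run using your config file
>>runtree(gp,'best');        % evaluate the 'best' model on the training data
>>popbrowser(gp);            % browse models by complexity vs goodness of fit
>>paretoreport(gp)           % HTML report of the models on the Pareto front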
In addition to the demos provided with GPTIPS 2 (gpdemo1, gpdemo2, gpdemo3, gpdemo4) the following example config files for symbolic regression are also included for reference purposes and for you to play around with.
cubic_config - Symbolic regression on data from a cubic polynomial.
y = 3.4 x^3 + 2.9 x^2 + 6.2 x + 0.75
e.g. to run this use
>>gp = rungp(@cubic_config)
uball_config - Symbolic regression on data from the '5 dimensional Unwrapped Ball' function.
e.g. to run this use
>>gp = rungp(@uball_config)
ripple_config - Symbolic regression on data from a mathematical function f(x1, x2) of two input variables
y = f(x1, x2) = (x1 - 3)(x2 - 3) + 2 sin((x1 - 4) (x2 - 4))
e.g. to run this use
>>gp = rungp(@ripple_config)
salustowicz1d_config - Symbolic regression on data from a mathematical function f(x) of a single input variable
y = f(x) = exp(-x) x^3 cos(x) sin(x) (sin(x)^2 cos(x) - 1)
e.g. to run this use
>>gp = rungp(@salustowicz1d_config)
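For reference, the underlying salustowicz1d function can be evaluated directly in MATLAB as sketched below (illustrative only - the input range is arbitrary and the supplied config file generates its own data):
x = linspace(0,10,200)';                                       % an illustrative input range
y = exp(-x).*x.^3.*cos(x).*sin(x).*(sin(x).^2.*cos(x) - 1);    % the salustowicz1d target function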
In addition, see the following tutorial for a step by step example on how to use GPTIPS to accurately model a synthetic nonlinear regression problem.
Is there a manual for GPTIPS 2?
No.
There are, however, these help pages, a tutorial, an academic paper and extensive documentation in the MATLAB command line help for each of the files in GPTIPS 2.
Additionally, there is a PDF manual for GPTIPS version 1 in the downloads section of this site - there is useful information in it but bear in mind that the latest version does not work exactly the same way. I shall endeavour to transfer the most useful info from it to these help pages.
How do I run GPTIPS in parallel mode on a single machine?
Add the following line to your config file:
gp.runcontrol.parallel.auto = true;
You must have the Parallel Computing Toolbox installed and licensed for this to work.
The first time you run it in a session there is a short delay whilst the parallel mode initialises. GPTIPS will autodetect the number of cores you have on your machine.
I'm getting weird Java errors when running GPTIPS 2 in parallel mode. Why is this happening?
There is a known issue with the JVM in older versions of MATLAB (all platforms) prior to version R2013b (6.3). This causes a failure of the Parallel Computing Toolbox in most cases.
There is a fix/workaround for this here:
http://www.mathworks.com/support/bugreports/919688
Please apply this fix if you are somehow trapped in the past and using a version prior to R2013b.
Error messages in GPTIPS when running Symbolic Math functions
There is an issue with the Symbolic Math Toolbox that currently affects the use of GPTIPS 2 in MATLAB R2018a onwards. It causes error messages in a wide variety of GPTIPS functions including, but not limited to, the provided demo files (gpdemo1 etc.), gppretty, popbrowser, paretoreport and gpmodel2mfile.
A fix will be released early 2021.
How do I run GPTIPS 2 for a fixed amount of time?
To perform a GPTIPS run that terminates after a set amount of time (in seconds) add the following line to your config file:
gp.runcontrol.timeout = 60;
Where, in this case, the run terminates after 60 seconds regardless of how many generations (iterations) it was set to run for.
This can be used effectively in combination with multiple runs. Because Hypothesis-ML is a non-deterministic machine learning algorithm, results vary from run to run and so it is most often a good idea to perform multiple runs. You can perform multiple runs of fixed time duration that are merged at the end to form a single population. For instance, to perform 5 runs of 30 seconds each use the following settings in your config file.
gp.runcontrol.runs = 5;
gp.runcontrol.timeout = 30;
How do I export GPTIPS symbolic regression models as standalone M files for use outside GPTIPS?
The GPTIPS (SymXAI-regress module) function gpmodel2mfile does this (it requires the MATLAB Symbolic Math Toolbox to create the standalone model file - but this toolbox is not required to run the standalone model file).
For example, to convert the 'best' (as evaluated in terms of predictive performance on the training data) symbolic model to a standalone M file use
>>gpmodel2mfile(gp,'best','mymodel');
This writes the model to the file mymodel.m
You can then run the model on a new data input matrix x using mymodel.m as follows:
>> yprediction = mymodel(x);
Additionally, if you want to see the vector of model predictions for (say) your original training data you can use:
>>yprediction_train = mymodel(gp.userdata.xtrain);
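You could then, for example, compute the training RMSE of the exported model - a sketch, assuming your training data is still held in gp.userdata as described earlier:
>>rmse_train = sqrt(mean((gp.userdata.ytrain - yprediction_train).^2))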
How do I export a GPTIPS symbolic regression model as a Symbolic Math object?
This is done with the SymXAI-regress module function gpmodel2sym (it requires the MATLAB Symbolic Math Toolbox).
For instance, to convert the 'best' model (as evaluated in terms of predictive performance on the testing data) use:
>>gpmodel2sym(gp,'testbest');
The symbolic model can then be manipulated like any other MATLAB symbolic math object.
Note: The word 'best' is used here to denote the regression model with the best predictive performance, i.e. the lowest RMSE on the data. This does not imply it is the best model for your use case - for instance, the 'best' model selected using the command above may actually be a highly complex model which doesn't perform well when deployed in the real world. A better way to select models is to use the popbrowser and paretoreport functions to choose models from the Pareto front of complexity and predictive performance.
Example
Run GPTIPS on the supplied cubic polynomial function.
>>gp = rungp(@cubic_config);
Extract the 'best' model (in terms of predictive performance on the training data) to a symbolic math object.
>>modelsym = gpmodel2sym(gp,'best')
Set the model equation display precision to 2 significant figures using MATLAB's vpa function (variable precision arithmetic). This simplified form of the equation is what appears in the chart title in the next step.
>>modelsym = vpa(modelsym,2)
Plot the model's symbolic math object using MATLAB's ezplot function. Note that the model equation appears above the graph.
>>ezplot(modelsym)
What GP selection methods does GPTIPS support?
The GPTIPS Hypothesis-ML engine supports Pareto tournaments based on performance and tree complexity. For instance, to use only Pareto tournaments add the following line to your config file
gp.selection.tournament.p_pareto = 1;
To set a quarter of all selection events to Pareto tournaments use the following (the remaining 3/4 will be regular tournaments based only on predictive performance).
gp.selection.tournament.p_pareto = 0.25;
Note
Lexicographic selection is enabled by default for regular tournament selection.
'Selection' refers to the machine learning process of selecting models from the current population (based on their performance and complexity) to create new models in the next iteration of learning.
More details on using the drawtrees command line function to help you interpret your models.
>> drawtrees(gp,'best')
draws the best model in the population (as evaluated on the training data in terms of R^2 predictive performance).
>> drawtrees(gp,'valbest')
draws the best model in the population (as evaluated on the validation data - if it exists).
>> drawtrees(gp,'testbest')
draws the best model in the population (as evaluated on the test data - if it exists).
>> drawtrees(gp,5)
draws the model in the population with numerical MID (model ID) 5 where the MID is an integer that can range from 1 to the population size.
You can control the formatting of the drawn trees (colour, line width, font etc.) using additional CSS arguments to the drawtrees function.
For instance, to change the font to 'Comic Sans MS' use:
>> drawtrees(gp,'best',[],'Comic Sans MS')
For further advanced formatting see the help for the drawtrees function at the MATLAB command line:
>>help drawtrees
Note: You need an internet connection for this function because it uses the Google Charts Javascript API. Your data is not sent to Google (according to Google). Internet Explorer does not render the trees well - so use another browser for best results.
What license is GPTIPS distributed under?
GPTIPS is free, subject to the GNU General Public License (GPL) v3, which can be viewed here: http://www.gnu.org/licenses/gpl-3.0.html
How do I cite GPTIPS?
If you use GPTIPS in any published work then please use the following citations.
GPTIPS 2: an open-source software platform for symbolic data mining, Searson, D.P., Chapter 22 in Handbook of Genetic Programming Applications, A.H. Gandomi et al. (Eds.), Springer, New York, NY, 2015.
GPTIPS: an open source genetic programming toolbox for multigene symbolic regression, Searson, D.P., Leahy, D.E. & Willis, M.J., Proceedings of the International MultiConference of Engineers and Computer Scientists 2010 (IMECS 2010), Hong Kong, 17-19 March, 2010.
How are regression models represented in GPTIPS?
Each symbolic regression model is represented as one or more trees (genes) and a bias term - all multiplied by regression weights.
Each tree can be thought of as a partial model fragment which has a weighted contribution to the overall model. The regression weights are determined by a least squares procedure to minimise the sum of squared errors (SSE) with respect to the training data and are guaranteed to be optimal in the least squares sense.
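In other words, a model with M genes has the overall form y = b0 + b1*tree1(X) + ... + bM*treeM(X), where b0 is the bias term and b1 ... bM are the regression weights. The sketch below illustrates the idea (it is not the GPTIPS internal code; it assumes the outputs of each tree over the training data have been collected as the columns of a matrix G, and that ytrain is the training response vector - see also the technical note below):
% G is (observations x M): column i holds the output of tree/gene i on the training data
A = [ones(size(G,1),1) G];     % a bias column plus one column per gene
b = pinv(A) * ytrain;          % least squares weights [b0; b1; ... ; bM] via the pseudo-inverse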
Internally within the Hypothesis-ML engine, each tree is represented by a compact coded string. These coded strings facilitate the machine learning process of simulated evolution to create populations of better trees from existing ones using tree mutation and crossover operations.
Technical note: Regression weights are computed by means of the Moore-Penrose pseudo-inverse to mitigate collinearity problems caused by the possible existence of duplicate trees in candidate models.
How many genes do I need to model my data in symbolic regression?
This really depends on the data and your expectations of the resulting model. More genes usually result in a more "accurate" regression model, but the complexity of the model may be high.
That said - as a very broad guideline - start with 3 or 4 genes (with a maximum depth of 4 or 5 nodes) and work upwards from there.
After a few runs you should begin to find the "sweet spot" that gives a decent trade-off between model accuracy and model complexity. As a side note: I would avoid large (> 7) maximum tree depths as this tends to encourage overfitting and bloated models. It also makes GPTIPS run more slowly.
Finally, too many genes can also lead to bloated models (this is "horizontal bloat" in contrast to the usual "vertical bloat" found in single tree GP models) which may not generalise well. GPTIPS 2 contains tools to explicitly identify horizontal bloat in multigene models.
Does GPTIPS do feature selection?
GPTIPS implicitly performs feature selection.
The simulated evolutionary processes driving the GPTIPS hypothesis engine will "try" to pick the input variables that give the best overall performance. However, how well it does this is dependent on several factors, like how many inputs there are, what population size is used, how many iterations GPTIPS is run for etc.
Does GPTIPS do L1 and/or L2 model regularisation?
Not yet - but it seems like a good idea. Watch this space.
Does GPTIPS scale my data?
No - GPTIPS does not perform any scaling at all of your feature/input or your target/output data.
This is a deliberate design choice and in most circumstances scaling is not required. Obviously there are exceptions to this - if the numerical scales of some of your feature and/or target variables differ by several orders of magnitude you may want to consider rescaling prior to running GPTIPS.
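If you do decide to rescale, a simple column-wise standardisation in MATLAB is sketched below (illustrative only - GPTIPS itself does none of this; xtrain stands for your own feature matrix):
xs = (xtrain - mean(xtrain)) ./ std(xtrain);   % zero mean, unit standard deviation per column (uses implicit expansion, R2016b+)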
What are the system requirements for GPTIPS?
GPTIPS 2 has been tested on 64 bit Windows (R2011b) and 64 bit Mac OSX (R2014a, R2014b and R2015a).
For certain GPTIPS 2 functions, namely drawtrees, gpmodelreport and paretoreport you will need a web browser (please, not Internet Explorer) and an internet connection. This is to allow the automatic download of Google JavaScript visualisation APIs - your data is not sent to any servers however, all processing is done locally in your browser.
GPTIPS 2.01 (to be available early 2021) has only been tested on Windows R2020b.
What Mathworks toolboxes are required?
None for the core tree generating module of GPTIPS (the hypothesis engine).
BUT for post run analysis of symbolic non-linear regression the MATLAB Symbolic Math Toolbox is required.
Note: You can still build non-linear symbolic regression models without the Symbolic Math Toolbox - but you won't be able to do very much with them.
You could, however, build the models on one machine that doesn't have the Symbolic Math Toolbox (say, in the cloud) then - after the run - save & download the gp data structure (the output of the hypothesis engine) onto another desktop machine that does have the Symbolic Math toolbox for analysis and model export.
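For example (a sketch - the file name is arbitrary):
>>save('myrun.mat','gp')     % on the machine that performed the run
>>load('myrun.mat')          % later, on the machine with the Symbolic Math Toolbox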
The MATLAB Statistics Toolbox is optional and is used for a very small number of GPTIPS features, such as computing the statistical significance of model terms; you don't need it unless you want to do that.
The MATLAB Parallel Computing Toolbox is optional. It speeds up the execution time of runs significantly on multicore machines and clustered MATLAB instances and is recommended for power users. If you don't need this kind of setup you don't need this toolbox - GPTIPS will run fine without it.
But MATLAB costs money ...
True. But MATLAB really has a best-in-class Symbolic Math capability - without this I would have found it nearly impossible to get the functionality I wanted.
Corporate licences for MATLAB can be pricey* - but if you are a student or staff member in a Higher Educational Institute (e.g. a university) you may well have a campus licence for it already.
* I have no financial interests in MATLAB :)
Does GPTIPS run on Octave?
No. There are too many limitations in Octave to make it work properly.
Will there be a Python version of GPTIPS?
It's a possibility - but it would likely be a long while off due to demands on my time.