API Documentation

This is the API documentation for MESS. It provides detailed information on the Python programming interface. See the Intro API Tutorial for an introduction to using this API to run simulations.

Simulation model

Region

Metacommunity

Local Community

Inference Procedure

class PTA.inference.Ensemble(empirical_df, sims='', algorithm='rf', verbose=False)

The Ensemble class is a parent class from which Classifiers and Regressors inherit shared methods. You normally will not want to create an Ensemble class directly, but the methods documented here are inherited by both Classifier() and Regressor() so may be called on either of them.

The base Ensemble class takes care of reading in the empirical dataframe, calculating summary stats, reading the simulated data, and reshaping the sim sumstats to match the stats of the real data.

Attention: Ensemble objects should never be created directly. It is a base class that provides functionality to Classifier() and Regressor().
cross_val_predict(cv=5, features='', quick=False, verbose=False)

Perform K-fold cross-validation prediction. For each of the cv folds, the simulations are split into a training set containing (K-1)/K of the simulations and a held-out test set containing the remaining 1/K.

Note

CV predictions are not appropriate for evaluating model generalizability; they should only be used for visualization and exploration.

Parameters:
  • cv (int) – The number of K-fold cross-validation splits to perform.
  • quick (bool) – If True skip feature selection and hyper-parameter tuning, and subset simulations. Runs fast but does a bad job. For testing.
  • verbose (bool) – Report on progress. Depending on the number of CV folds this will be more or less chatty (mostly useless except for debugging).
Returns:

The array of predicted targets for each set of features when it was a member of the held-out testing set. Also saves the results in the Estimator.cv_preds variable.
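
For example, a minimal sketch of the CV prediction workflow (the file names "empirical.csv" and "SIMOUT.csv" are placeholders, and the empirical DataFrame must follow the documented input format):

    import pandas as pd
    import PTA

    # Placeholder file names; the empirical DataFrame must follow the
    # documented input format.
    emp_df = pd.read_csv("empirical.csv")

    # Both Classifier() and Regressor() inherit cross_val_predict() from Ensemble.
    est = PTA.inference.Regressor(empirical_df=emp_df, sims="SIMOUT.csv",
                                  algorithm="rf")

    # Run 5-fold CV; predictions are returned and also stored in est.cv_preds.
    cv_preds = est.cross_val_predict(cv=5, quick=True, verbose=True)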

cross_val_score(cv=5, quick=False, verbose=False)

Perform K-fold cross-validation scoring. For each of the cv folds, the simulations are split into a training set containing (K-1)/K of the simulations and a held-out test set containing the remaining 1/K.

Parameters:
  • cv (int) – The number of K-fold cross-validation splits to perform.
  • quick (bool) – If True skip feature selection and hyper-parameter tuning, and subset simulations. Runs fast but does a bad job. For testing.
  • verbose (bool) – Report on progress. Depending on the number of CV folds this will be more or less chatty (mostly useless except for debugging).
Returns:

The array of scores of the estimator for each K-fold. Also saves the results in the Estimator.cv_scores variable.

dump(outfile)

Save the model to a file on disk. Useful for saving trained models to prevent having to retrain them.

Parameters: outfile (str) – The file to save the model to.
feature_importances()

Assuming predict() has already been called, this method will return the feature importances of all features used for prediction.

Returns: A pandas.DataFrame of feature importances.
feature_selection(quick=False, verbose=False)

Access to the feature selection routine. Uses BorutaPy, an all-relevant feature selection method: https://github.com/scikit-learn-contrib/boruta_py http://danielhomola.com/2015/05/08/borutapy-an-all-relevant-feature-selection-method/

Hint:

Normally you will not run this on your own, but will use it indirectly through the predict() methods.

Parameters:
  • quick (bool) – Run fast but do a bad job.
  • verbose (bool) – Print lots of quasi-informative messages.
static load(infile)

Load a PTA.inference model from disk. This is complementary to the PTA.inference.Ensemble.dump() method.

Parameters: infile (str) – The file to load a trained model from.
Returns: The PTA.inference.Ensemble object loaded from the input file.
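
A minimal sketch of saving and restoring a trained model (here `est` is assumed to be an already-trained Classifier() or Regressor(), and "trained_model.pta" is a placeholder file name):

    # Save a trained estimator to disk so it does not need to be retrained.
    est.dump("trained_model.pta")

    # Later, restore it; load() is a static method on the Ensemble class.
    est = PTA.inference.Ensemble.load("trained_model.pta")
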
plot_feature_importance(cutoff=0.05, figsize=(10, 12), layout=None, subplots=True, legend=False)

Construct a somewhat crude plot of feature importances, useful for a quick and dirty view of these values. If more than one feature is present in the model, a grid layout is constructed and each individual feature is displayed within its own subplot. This function is a thin wrapper around pandas.DataFrame.plot.barh().

Parameters:
  • cutoff (float) – Remove any features that do not have greater importance than this value across all plotted features. Just remove uninteresting features to reduce the amount of visual noise in the figures.
  • figsize (tuple) – A tuple specifying figure width, height in inches.
  • layout (tuple) – A tuple specifying the row, column layout of the sub-panels. By default we do our best, and it’s normally okay.
  • subplots (bool) – Whether to plot each feature individually, or just cram them all into one huge plot. Unless you have only a few features, setting this option to False will look insane.
  • legend (bool) – Whether to plot the legend.
Returns:

Returns all of the matplotlib axes on which the feature importances were plotted.
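
A minimal sketch of inspecting feature importances (assuming `est` is a Classifier() or Regressor() on which predict() has already been called):

    # Feature importances are only available after predict() has been called.
    importances = est.feature_importances()
    print(importances)

    # Quick-and-dirty grid of per-feature importance plots.
    axes = est.plot_feature_importance(cutoff=0.05, figsize=(10, 12))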

set_data(empirical_df, verbose=False)

A convenience function to allow using pre-trained models to make predictions on new datasets without retraining the model. This will calculate summary statistics on input data (recycling metacommunity traits if these were previously input), and reshape the statistics to match the features selected during initial model construction.

This is only sensible if the input community data has exactly the same data axes as the data used to build the model. It is useful if, for example, you have community data from multiple islands in the same archipelago: different communities that share common features and a common metacommunity.

Parameters:
  • empirical_df (pandas.DataFrame) – A DataFrame containing the empirical data. This df has a very specific format which is documented here.
  • verbose (bool) – Print progress information.
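
A minimal sketch of reusing a trained model on new data (assuming `est` is an already-trained Classifier() or Regressor(), and "island2.csv" is a placeholder file containing data in the documented format):

    import pandas as pd

    # New empirical data in the same documented format, sharing the same data
    # axes as the data used to train the model.
    new_df = pd.read_csv("island2.csv")

    # Recalculate and reshape summary statistics for the new data, then predict
    # without retraining.
    est.set_data(new_df)
    est.predict()
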
set_features(feature_list='')

Specify the feature list to use for classification/regression. By default the methods use all features, but if you want to specify exact feature sets to use you may call this method.

Parameters: feature_list (list) – The list of features (summary statistics) to retain for downstream analysis. Items in this list should correspond exactly to summary statistics in the simulations or else it will complain.
set_params(params={})

Allows you to directly specify the parameters of the underlying sklearn model, rather than doing a parameter search. Useful if the parameter search takes a long time and you only want to do it once, then reuse the best parameter set for multiple models.

Parameters: params (dict) – A dictionary of parameter values and settings to pass to the underlying sklearn model. It’s up to you to be sure the passed in parameters make sense for whatever model you’re using.
set_targets(target_list='')

Specify the target (parameter) list to use for classification/regression. By default the classifier will only consider psi and the regressor will use all targets, but if you want to specify exact target sets to use you may call this method.

Parameters: target_list (list) – The list of targets (model parameters) to retain for downstream analysis. Items in this list should correspond exactly to parameters in the simulations or else it will complain.
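
A minimal sketch of restricting the features and targets of an estimator, and pinning sklearn parameters with set_params() (all names used here are illustrative placeholders, not values shipped with the package):

    # Retain only a subset of summary statistics (names are illustrative and
    # must match statistics present in the simulations).
    est.set_features(["stat_1", "stat_2", "stat_3"])

    # Retain only a subset of model parameters to estimate (names illustrative).
    est.set_targets(["param_1", "param_2"])

    # Optionally pin the underlying sklearn model's hyperparameters instead of
    # running a parameter search (valid keys depend on the chosen algorithm).
    est.set_params({"n_estimators": 500})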

Model Selection (Classification)

class PTA.inference.Classifier(empirical_df, sims='', algorithm='rf', verbose=False)

This class wraps all the model selection machinery.

Parameters:
  • empirical_df (pandas.DataFrame) – A DataFrame containing the empirical data. This df has a very specific format which is documented here.
  • sims (pd.DataFrame/string) – A DataFrame of simulations, or the path to the file containing all the simulations.
  • algorithm (string) – One of the Supported Ensemble Methods to use for parameter estimation.
  • verbose (bool) – Print detailed progress information.
cross_val_predict(cv=5, quick=False, verbose=False)

A thin wrapper around Ensemble.cross_val_predict() that basically just calculates some Classifier-specific statistics after the cross-validation procedure. This function will calculate and populate class variables:

  • Classifier.classification_report: Per-class precision, recall, and F1 scores for the CV predictions
Parameters:
  • cv (int) – The number of cross-validation folds to perform.
  • quick (bool) – Whether to downsample to run fast but do a bad job.
  • verbose (bool) – Whether to print progress messages.
Returns:

A numpy.array of model class predictions for each simulation when it was a member of the held-out test set.

plot()

Simple method for visualizing the classification probabilities.

plot_confusion_matrix(ax='', figsize=(8, 8), cmap=<matplotlib.colors.LinearSegmentedColormap object>, cbar=False, title='', normalize=False, outfile='')

Plot the confusion matrix for CV predictions. Assumes Classifier.cross_val_predict() has been called. If not it complains and tells you to do that first.

Parameters:
  • ax (matplotlib.pyplot.axis) – The matplotlib axis to draw the plot on.
  • figsize (tuple) – If not passing in an axis, specify the size of the figure to plot.
  • cmap (matplotlib.pyplot.cm) – Specify the colormap to use.
  • cbar (bool) – Whether to add a colorbar to the figure.
  • title (str) – Add a title to the figure.
  • normalize (bool) – Whether to normalize the bin values (scale to 1/# simulations).
  • outfile (str) – Where to save the figure. This parameter should include the desired output file format, e.g. .png or .svg.
Returns:

The matplotlib.axis on which the confusion matrix was plotted.
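
A minimal sketch of the CV prediction and confusion matrix workflow (assuming `cla` is an already-constructed Classifier()):

    # Generate cross-validated class predictions first, then plot them.
    cla.cross_val_predict(cv=5, quick=True)
    ax = cla.plot_confusion_matrix(normalize=True, title="Model selection CV")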

predict(select_features=False, param_search=False, by_target=False, quick=False, force=False, verbose=False)

Predict the community assembly model class probabilities.

Parameters:
  • select_features (bool) – Whether to perform relevant feature selection. This will remove features with little information useful for model prediction. Should improve classification performance, but does take time.
  • param_search (bool) – Whether to perform ML classifier hyperparameter tuning. If False then classification will be performed with default classifier options, which will almost certainly result in poor performance, but it will run really fast!
  • by_target (bool) – Whether to predict multiple target variables simultaneously, or each individually and sequentially.
  • quick (bool) – Reduce the number of retained simulations and the number of feature selection and hyperparameter tuning iterations to make the prediction step run really fast! Useful for testing.
  • force (bool) – Force re-running feature selection and hyper-parameter tuning. This is basically here to prevent you from shooting yourself in the foot inside a for loop with select_features=True when really what you want (most of the time) is to just run this once, and call predict() multiple times without redoing this.
  • verbose (bool) – Print detailed progress information.
Returns:

A tuple including the predicted model and the probabilities per model class.
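
A minimal sketch of the model selection workflow (the file names "empirical.csv" and "SIMOUT.csv" are placeholders, and the empirical DataFrame must follow the documented input format):

    import pandas as pd
    import PTA

    # Placeholder file names; the empirical DataFrame must follow the
    # documented input format.
    emp_df = pd.read_csv("empirical.csv")

    cla = PTA.inference.Classifier(empirical_df=emp_df, sims="SIMOUT.csv",
                                   algorithm="rf")

    # Feature selection and hyperparameter tuning improve performance but take
    # time; quick=True subsamples everything for a fast test run.
    result = cla.predict(select_features=True, param_search=True, quick=True,
                         verbose=True)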

Parameter Estimation (Regression)

class PTA.inference.Regressor(empirical_df, sims='', algorithm='rf', verbose=False)

This class wraps all the parameter estimation machinery.

Parameters:
  • empirical_df (pandas.DataFrame) – A DataFrame containing the empirical data. This df has a very specific format which is documented here.
  • sims (string) – The path to the file containing all the simulations.
  • algorithm (string) – The ensemble method to use for parameter estimation.
  • verbose (bool) – Print lots of status messages. Good for debugging, or if you’re really curious about the process.
cross_val_predict(cv=5, quick=False, verbose=False)

A thin wrapper around Ensemble.cross_val_predict() that basically just calculates some Regressor-specific statistics after the cross-validation procedure. This function will calculate and populate class variables:

  • Regressor.MAE: Mean absolute error
  • Regressor.RMSE: Root mean squared error
  • Regressor.vscore: Explained variance score
  • Regressor.r2: Coefficient of determination regression score

It also populates Regressor.cv_stats, which is just a pandas.DataFrame of the above stats.

Parameters:
  • cv (int) – The number of cross-validation folds to perform.
  • quick (bool) – Whether to downsample to run fast but do a bad job.
  • verbose (bool) – Whether to print progress messages.
Returns:

A numpy.array of parameter estimates for each simulation when it was a member of the held-out test set.

plot_cv_predictions(ax='', figsize=(10, 5), figdims=(2, 3), n_cvs=1000, title='', targets='', outfile='')

Plot the cross validation predictions for this Regressor. Assumes Regressor.cross_val_predict() has been called. If not it complains and tells you to do that first.

Parameters:
  • ax (matplotlib.pyplot.axis) – The matplotlib axis to draw the plot on.
  • figsize (tuple) – If not passing in an axis, specify the size of the figure to plot.
  • figdims (tuple) – The number of rows and columns (specified in that order) of the output figure. There will be one plot per target parameter, so there should be at least as many available cells in the specified grid.
  • n_cvs (int) – The number of true/estimated points to plot on the figure.
  • title (str) – Add a title to the figure.
  • targets (list) – Specify which of the targets to include in the plot.
  • outfile (str) – Where to save the figure. This parameter should include the desired output file format, e.g. .png or .svg.
Returns:

The flattened list of matplotlib axes on which the scatter plots were drawn, one per target.

predict(select_features=False, param_search=False, by_target=False, quick=False, force=True, verbose=False)

Predict parameter estimates for selected targets.

Parameters:
  • select_features (bool) – Whether to perform relevant feature selection. This will remove features with little information useful for parameter estimation. Should improve parameter estimation performance, but does take time.
  • param_search (bool) – Whether to perform ML regressor hyperparameter tuning. If False then prediction will be performed with default options, which will almost certainly result in poor performance, but it will run really fast!
  • by_target (bool) – Whether to estimate all parameters simultaneously, or each individually and sequentially. Some ensemble methods are only capable of performing individual parameter estimation, in which case this parameter is forced to True.
  • quick (bool) – Reduce the number of retained simulations and the number of feature selection and hyperparameter tuning iterations to make the prediction step run really fast! Useful for testing.
  • force (bool) – Force re-running feature selection and hyper-parameter tuning. This is basically here to prevent you from shooting yourself in the foot inside a for loop with select_features=True when really what you want (most of the time) is to just run this once, and call predict() multiple times without redoing this.
  • verbose (bool) – Print detailed progress information.
Returns:

A pandas.DataFrame including the predicted value per target parameter, and 95% prediction intervals if the ensemble method specified for this Regressor supports it.
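
A minimal sketch of the parameter estimation workflow (the file names "empirical.csv" and "SIMOUT.csv" are placeholders, and the empirical DataFrame must follow the documented input format):

    import pandas as pd
    import PTA

    # Placeholder file names; the empirical DataFrame must follow the
    # documented input format.
    emp_df = pd.read_csv("empirical.csv")

    # 'rfq' supports quantile regression, so predict() can also report 95%
    # prediction intervals.
    rgr = PTA.inference.Regressor(empirical_df=emp_df, sims="SIMOUT.csv",
                                  algorithm="rfq")

    estimates = rgr.predict(select_features=True, param_search=True, quick=True)
    print(estimates)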

prediction_interval(interval=0.95, quick=False, verbose=False)

Add upper and lower prediction interval for algorithms that support quantile regression (rfq, gb).

Hint:

You normally won’t have to call this by hand, as it is incorporated automatically into the predict() methods. We allow access to it for experimental purposes.

Parameters:
  • interval (float) – The prediction interval to generate.
  • quick (bool) – Subsample the data to make it run fast, for testing. The quick parameter doesn’t do anything for rfq because it’s already really fast (the model doesn’t have to be refit).
  • verbose (bool) – Print information about progress.
Returns:

A pandas.DataFrame containing the model predictions and the prediction intervals.

Classification Cross-Validation

PTA.inference.classification_cv(sims, sep=' ', algorithm='rf', quick=True, verbose=False)

A convenience function to make it easier and more straightforward to run classification CV. It basically wraps the work of generating the synthetic community (dummy data), selecting which input data axes to retain (which determines which summary statistics the ML uses), creating the Classifier, and calling Classifier.cross_val_predict() and Classifier.cross_val_score().

Feature selection is independent of the real data, so it does not matter that synthetic empirical data is passed in here; it only chooses which features (summary statistics) are relevant. Searching for the best model hyperparameters is likewise done independently of the observed data.

Parameters:
  • sims (pd.DataFrame/str) – A DataFrame of simulations, or the path to the file containing copious simulations.
  • sep (str) – Separator for loading in a DataFrame, if this was passed.
  • data_axes (list) – A list of the data axis identifiers to prune the simulations with. One or more of ‘abundance’, ‘pi’, ‘dxy’, ‘trait’. If this parameter is left blank it will use all data axes.
  • algorithm (str) – One of the supported Ensemble.Regressor algorithm identifier strings: ‘ab’, ‘gb’, ‘rf’, ‘rfq’.
  • quick (bool) – Whether to run fast but do a bad job.
  • verbose (bool) – Whether to print progress information.
Returns:

Returns the trained PTA.inference.Classifier with the cross-validation predictions for each simulation in the cv_preds member variable and the cross-validation scores per K-fold in the cv_scores member variable.
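
A minimal sketch of running classification CV with this convenience function ("SIMOUT.csv" is a placeholder path to a file of simulations):

    import PTA

    # Builds synthetic data, constructs the Classifier, and runs CV prediction
    # and scoring in one call.
    cla = PTA.inference.classification_cv("SIMOUT.csv", algorithm="rf", quick=True)

    # CV predictions and per-fold scores are stored on the returned Classifier.
    print(cla.cv_scores)
    cla.plot_confusion_matrix()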

Parameter Estimation Cross-Validation

PTA.inference.parameter_estimation_cv(sims, sep=' ', data_axes='', algorithm='rf', quick=True, verbose=False)

A convenience function to make it easier and more straightforward to run parameter estimation CV. It basically wraps the work of generating the synthetic community (dummy data), selecting which input data axes to retain (which determines which summary statistics the ML uses), creating the Regressor, and calling Regressor.cross_val_predict() and Regressor.cross_val_score().

Feature selection is independent of the real data, so it does not matter that synthetic empirical data is passed in here; it only chooses which features (summary statistics) are relevant. Searching for the best model hyperparameters is likewise done independently of the observed data.

Parameters:
  • sims (pd.DataFrame/str) – A DataFrame of simulations, or the path to the file containing copious simulations.
  • sep (str) – Separator for loading in a DataFrame, if this was passed.
  • data_axes (list) – A list of the data axis identifiers to prune the simulations with. One or more of ‘abundance’, ‘pi’, ‘dxy’, ‘trait’. If this parameter is left blank it will use all data axes.
  • algorithm (str) – One of the supported Ensemble.Regressor algorithm identifier strings: ‘ab’, ‘gb’, ‘rf’, ‘rfq’.
  • quick (bool) – Whether to run fast but do a bad job.
  • verbose (bool) – Whether to print progress information.
Returns:

Returns the trained PTA.inference.Regressor with the cross-validation predictions for each simulation in the cv_preds member variable and the cross-validation scores per K-fold in the cv_scores member variable.
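
A minimal sketch of running parameter estimation CV with this convenience function ("SIMOUT.csv" is a placeholder path to a file of simulations):

    import PTA

    # Builds synthetic data, constructs the Regressor, and runs CV prediction
    # and scoring in one call.
    rgr = PTA.inference.parameter_estimation_cv("SIMOUT.csv", algorithm="gb",
                                                quick=True)

    # CV predictions and per-fold scores are stored on the returned Regressor.
    print(rgr.cv_scores)
    axes = rgr.plot_cv_predictions(n_cvs=1000)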

Posterior Predictive Checks

PTA.inference.posterior_predictive_check(empirical_df, parameter_estimates, ax='', ipyclient=None, est_only=False, nsims=100, outfile='', use_lambda=True, force=False, verbose=False)

Currently not working.

Perform posterior predictive simulations. This function will take parameter estimates and perform PTA simulations using these parameter values. It will then plot the resulting summary statistics in PC space, along with the summary statistics of the observed data. The logic of posterior predictive checks is that if the estimated parameters are a good fit to the data, then summary statistics generated using these parameters should resemble those of the real data.

Parameters:
  • empirical_df (pandas.DataFrame) – A DataFrame containing the empirical data. This df has a very specific format which is documented here.
  • parameter_estimates (pandas.DataFrame) – A DataFrame containing the parameter estimates from a PTA.inference.Regressor.predict() call and optional prediction interval upper and lower bounds.
  • ax (matplotlib.pyplot.axis) – The matplotlib axis to use for plotting. If not specified then a new axis will be created.
  • ipyclient (ipyparallel.Client) – Allow to pass in an ipyparallel client to allow parallelization of the posterior predictive simulations. If no ipyclient is specified then simulations will be performed serially (i.e. SLOW).
  • est_only (bool) – If True, drop the lower and upper prediction interval (PI) and just use the mean estimated parameters for generating posterior predictive simulations. If False, and PIs exist, then parameter values will be sampled uniformly between the lower and upper PI.
  • nsims (int) – The number of posterior predictive simulations to perform.
  • outfile (str) – A file path for saving the figure. If not specified the figure is simply not saved to the filesystem.
  • use_lambda (bool) – Whether to generate simulations using time as measured in _lambda or in generations.
  • force (bool) – Force overwriting previously generated simulations. If False, re-running will append new simulations to previous ones.
  • verbose (bool) – Print detailed progress information.
Returns:

A matplotlib.pyplot.axis containing the plot.

Stats

Plotting

PTA.plotting.plot_simulations_hist(sims, ax='', figsize=(12, 6), feature_set='', nsims=1000, bins=20, alpha=0.6, select='', tol='', title='', outfile='', verbose=False)

Simple histogram for each summary statistic. Useful for inspecting model performance. Invariant summary statistics will be removed.

Parameters:
  • sims (pd.DataFrame/str) – A DataFrame of simulations, or the path to the file containing the simulations.
  • ax (matplotlib.pyplot.axis) – The matplotlib axis to draw the plot on.
  • figsize (tuple) – A tuple specifying figure width, height in inches.
  • feature_set (list) – The list of summary statistics (features) to plot.
  • nsims (int) – The number of simulations to plot.
  • bins (int) – The number of bins per histogram.
  • alpha (float) – Set alpha value to determine transparency [0-1]; larger values increase opacity.
  • select (int/float) –
  • tol (int/float) –
  • title (str) – Add a title to the figure.
  • outfile (str) – Where to save the figure. This parameter should include the desired output file format, e.g. .png or .svg.
  • verbose (bool) – Print progress information.
Returns:

Return a list of matplotlib.pyplot.axis on which the simulated summary statistics have been plotted. This list can be _long_ depending on how many statistics you plot.
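
A minimal sketch of plotting simulation histograms ("SIMOUT.csv" is a placeholder path to a file of simulations):

    import PTA

    # One histogram per retained summary statistic; invariant stats are dropped.
    axes = PTA.plotting.plot_simulations_hist("SIMOUT.csv", nsims=1000, bins=20,
                                              alpha=0.6)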

PTA.plotting.plot_simulations_pca(sims, ax='', figsize=(8, 8), target='', feature_set='', loadings=False, nsims=1000, select='', tol='', title='', outfile='', colorbar=True, verbose=False)

Plot summary statistics for simulations projected into PC space.

Parameters:
  • sims (pd.DataFrame/str) – A DataFrame of simulations, or the path to the file containing the simulations.
  • ax (matplotlib.pyplot.axis) – The matplotlib axis to draw the plot on.
  • figsize (tuple) – A tuple specifying figure width, height in inches.
  • target (str) –
  • feature_set (list) – The list of summary statistics (features) to include.
  • loadings (bool) – BROKEN! Whether to plot the loadings in the figure.
  • nsims (int) – The number of simulations to plot.
  • select (int/float) –
  • tol (int/float) –
  • title (str) – Add a title to the figure.
  • outfile (str) – Where to save the figure. This parameter should include the desired output file format, e.g. .png or .svg.
  • verbose (bool) – Print progress information.
Returns:

Return the matplotlib.pyplot.axis on which the simulations are plotted.
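
A minimal sketch of plotting simulations in PC space ("SIMOUT.csv" is a placeholder path to a file of simulations):

    import PTA

    # Project the simulated summary statistics into PC space and plot them.
    ax = PTA.plotting.plot_simulations_pca("SIMOUT.csv", nsims=1000, colorbar=True)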