# PEtab Tutorial¶

## Overview¶

In the following, we demonstrate how to set up a parameter estimation problem in PEtab based on a realistic application example. To this end, we consider the model and experimental data by Boehm et al. (2014). The model describes the dynamics of phosphorylation and dimerization of the transcription factors STAT5A and STAT5B. A visualization and the corresponding reactions of the model are provided below, although the details of the model are not relevant for the purpose of this tutorial. For more details, we refer to the original publication (Boehm et al., 2014).

A PEtab problem consists of 1) an SBML model of a biological system, 2) condition, observable and measurement definitions, and 3) the specification of the parameters. We will show how to generate the respective files in the following.

## 1. The model¶

PEtab assumes that an SBML file of the model exists. Here, we use the SBML model provided in the original publication, which is also available on Biomodels (https://www.ebi.ac.uk/biomodels/BIOMD0000000591). For illustration purposes we slightly modified the SBML model and shortened some parts of the PEtab files. The full PEtab problem introduced in this tutorial is available online.

Reactions included in the example model.
ID Reaction Rate law
R1 2 STAT5A → pApA cyt * BaF3_Epo * STAT5A^2 * k_phos
R2 STAT5A + STAT5B → pApB cyt * BaF3_Epo * STAT5A * STAT5B * k_phos
R3 2 STAT5B → pBpB cyt * BaF3_Epo * STAT5B^2 * k_phos
R4 pApA → nucpApA cyt * k_imp_homo * pApA
R5 pApB → nucpApB cyt * k_imp_hetero * pApB
R6 pBpB → nucpBpB cyt * k_imp_homo * pBpB
R7 nucpApA → 2 STAT5A nuc * k_exp_homo * nucpApA
R8 nucpApB → STAT5A + STAT5B nuc * k_exp_hetero * nucpApB
R9 nucpBpB → 2 STAT5B nuc * k_exp_homo * nucpBpB

## 2. Linking model and measurements¶

The model by Boehm et al. (2014) was calibrated on measurements on phosphorylation levels of STAT5A and STAT5B as well as relative STAT5A abundance for different timepoints between 0 - 240 minutes after stimulation with erythropoietin (Epo):

To define a parameter estimation problem in PEtab, we need to map measurements to the model states. To this end, we need to 1) specify the experimental conditions the measurements were generated from, 2) specify observation functions and error models, and 3) specify the measurements themselves. For this, we need to define observation functions as well as experimental conditions under which a measurement was performed.

### 2.1 Specifying experimental conditions¶

All measurements were collected under the same experimental condition, which is a stimulation with Epo. This is specified in the experimental condition PEtab file, a tab-separated values (TSV) file[1], by providing a condition identifier and listing all condition-specific parameters and their respective values.

In the problem considered here, the relevant parameter is Epo_concentration which we want to set to a value of 1.25E-7, as the only condition-specific parameter. Since in this example we include data from only one single experiment, it would not be necessary to specify the condition parameter here, but instead the value could have been also set in the model or in the parameter table. However, the benefit of specifying it in the condition table is, that it allows us to easily add measurements from other experiments performed with different Epo concentrations later on.

The condition table looks as follows:

Conditions table experimental_conditions.tsv.
conditionId conditionName Epo_concentration
epo_stimulation Stimulation with 1.25E-7 Epo 1.25E-7
• conditionId is a unique identifier to define the different conditions and link them to the measurements (see measurement file below). Additional measurements e.g. for different Epo concentrations can be defined by adding new rows.
• conditionName can be used as a human readable description of the condition e.g. for plotting.

The following column headers (here Epo_concentration) refer to different parameters or species in the model, the values of which are overridden by these condition-specific values. Here, we define the Epo concentration, but additional columns could be used to e.g. set different initial concentrations of STAT5A/B. In addition to numeric values, also parameter identifiers can be used here to introduce condition specific optimization parameters.

### 2.2 Specifying the observation model¶

To link the model states to the measured values, we specify observation functions. Additionally, a noise model can be introduced to account for the measurement errors. In PEtab, this can be encoded in the observable file:

Observables table observables.tsv.
observableId observableName
pSTAT5A_rel Rel. STAT5A phosphorylation [%]
pSTAT5B_rel Rel. STAT5B phosphorylation [%]
rSTAT5A_rel Rel. STAT5A abundance [%]
Observables table observables.tsv (continued).
observableFormula
100*(2*pApA + pApB) / (2*pApA + pApB + STAT5A)
100*(2*pBpB + pApB) / (2*pBpB + pApB + STAT5B)
100*(STAT5A + pApB + 2*pApA) / (2 * pApB + 2* pApA + STAT5A + STAT5B + 2*pBpB)
Observables table observables.tsv (continued).
noiseFormula noiseDistribution
noiseParameter1_pSTAT5A_rel normal
noiseParameter1_pSTAT5B_rel normal
noiseParameter1_rSTAT5A_rel normal
• observableId specifies a unique identifier to the observables that can be used to link them to the measurements (see below).
• observableName can be used as a human readable description of the observable. Here, this corresponds to the y-label in the figure above.
• observableFormula is a mathematical expression defining how the model output is calculated. The formula can consist of species and parameters defined in the SBML file. In our example, we measure e.g. the relative phosphorylation level of STAT5A (pSTAT5A_rel), which is the sum of all species containing phosphorylated STAT5A over the sum of all species containing any form of STAT5A.
• noiseFormula is used to describe the formula for the measurement noise. Together with noiseDistribution, it defines the noise model. In this example, we assume additive, normally distributed measurement noise. In this scenario, noiseParameter1_{observableId} is the standard deviation of the measurement noise. Parameters following this naming scheme are expected to be overridden in a measurement-specific manner in the noiseParameters column of the measurement table (see below).

### 2.3 Specifying measurements¶

The experimental data is linked to the conditions via the conditionId and to the observables via the observableId. This is defined in the PEtab measurement file:

Measurement table measurement_data.tsv.
observableId simulationConditionId measurement time noiseParameters
pSTAT5A_rel epo_stimulation 7.9 0 sd_pSTAT5A_rel
pSTAT5A_rel epo_stimulation 15.4 240 sd_pSTAT5A_rel
pSTAT5B_rel epo_stimulation 4.6 0 sd_pSTAT5B_rel
pSTAT5B_rel epo_stimulation 10.96 240 sd_pSTAT5B_rel
rSTAT5A_rel epo_stimulation 14.7 0 sd_rSTAT5A_rel
rSTAT5A_rel epo_stimulation 32.2 240 sd_rSTAT5A_rel
• observableId references the observableId from the observable file.
• simulationConditionId references the conditionId from the experimental condition file.
• measurement defines the values that are measured for the respective observable and experimental condition.
• time is the time point at which the measurement was performed. For brevity, only the first and last time point of the example are shown here (the omitted measurements are indicated by “…” in the example).
• noiseParameters relates to the noiseParameters in the observables file. In our example, the measurement noise is unknown. Therefore we define parameters here which have to be estimated (see parameters sheet below). If the noise is known, e.g. from multiple replicates, numeric values can be used in this column.

## 3. Defining parameters¶

The model by Boehm et al. (2014) contains nine unknown parameters that need to be estimated from the experimental data. Additionally, it has two known parameters that are fixed to literature values.

The parameters file for this is given by:

Parameter table parameters.tsv.
parameterId parameterScale lowerBound upperBound nominalValue estimate
k_exp_hetero log10 1E-5 1E+5 1
k_exp_homo log10 1E-5 1E+5 1
k_imp_hetero log10 1E-5 1E+5 1
k_imp_homo log10 1E-5 1E+5 1
k_phos log10 1E-5 1E+5 1
ratio lin 0.693 0
sd_pSTAT5A_rel log10 1E-5 1E+5 1
sd_pSTAT5B_rel log10 1E-5 1E+5 1
sd_rSTAT5A_rel log10 1E-5 1E+5 1
• parameterId references parameters defined in the SBML file. Additionally, parameters defined in the measurement table can be used here. In this example, the standard deviations for the different observables (sd_{observableId}) are estimated.
• parameterScale is the scale on which parameters are estimated. Often, a logarithmic scale improves optimization. Alternatively, a linear scale can be used, e.g. when parameters can be negative.
• lowerBound and upperBound define the bounds for the parameters used during optimization. These are usually biologically plausible ranges.
• nominalValue are known values used for simulation. The entry can be left empty, if a value is unknown and subject to optimization.
• estimate defines whether the parameter is subject to optimization (1) or if it is fixed (0) to the value in the nominalValue column.

## 4. Visualization file¶

Optionally, a visualization file can be specified in PEtab which defines how the measurement data and potentially model simulations are plotted. So far, the visualization files are only supported by the PEtab Python library. Here, we describe a file that specifies the visualization of the measurement data similar to the figure above.

Visualization specification table visualization_specification.tsv.
plotId plotTypeData xLabel yValues yLabel
plot1 MeanAndSD Time [min] pSTAT5A_rel Rel. STAT5A phosphorylation [%]
plot2 MeanAndSD Time [min] pSTAT5B_rel Rel. STAT5B phosphorylation [%]
plot3 MeanAndSD Time [min] rSTAT5A_rel Rel. STAT5A abundance [%]
• plotId corresponds to a specific plot. All lines which share the same plotId are combined into one plot.
• plotTypeData defines the plotting style of the measurement data. Here, we use mean and (if available) standard deviations.
• xLabel and yLabel are the labels of the x- and y-axes for the corresponding plot.
• yValues defines what is plotted. In this example the different observables are plotted individually.

There are various ways of further individualizing the plots, e.g. by defining legend entries or data plotted on log-scale (see the documentation for further information https://petab.readthedocs.io/en/latest/documentation_data_format.html#visualization-table).

## 5. YAML file¶

To group the previously mentioned PEtab files, a YAML file can be used, defining which files constitute a PEtab problem. While being optional, this makes it easier to import a PEtab problem into tools, and allows reusing files for different PEtab problems. This file has the following format (Boehm_JProteomeRes2014.yaml):

format_version: 1
parameter_file: parameters.tsv
problems:
- condition_files:
- experimental_conditions.tsv
measurement_files:
- measurement_data.tsv
observable_files:
- observables.tsv
sbml_files:
- model_Boehm_JProteomeRes2014.xml
visualization_files:
- visualization_specification.tsv


The first line specifies the version this file and the files referenced adhere to. The current version number is 1. The second line references the parameter file. This is followed by a list of (sub-)problems, in this case only one, referencing the respective condition, measurement observable, model, and visualization files. There can be multiple of those files, e.g. for large numbers of measurements, one could split those up into separate files, e.g. by experimental condition or observable.

## 6. Model simulation¶

To simulate the model and compare it to the experimental data, the nominal parameters in the parameters file need to be set. As some parameters are a priori unknown, we here consider randomly sampled parameters to get a glance of model behaviour and fit to the data.

Parameter table parameters.tsv with nominalValue set to random values.
parameterId parameterScale lowerBound upperBound nominalValue estimate
Epo_degradation_BaF3 log10 1E-5 1E+5 0.105 1
k_exp_hetero log10 1E-5 1E+5 1.85 1
k_exp_homo log10 1E-5 1E+5 9.83 1
k_imp_hetero log10 1E-5 1E+5 1048.96 1
k_imp_homo log10 1E-5 1E+5 10.136 1
k_phos log10 1E-5 1E+5 10.136 1
ratio lin 0.693 0
sd_pSTAT5A_rel log10 1E-5 1E+5 51.7 1
sd_pSTAT5B_rel log10 1E-5 1E+5 0.257 1
sd_rSTAT5A_rel log10 1E-5 1E+5 0.017 1

With this, the model can be simulated using the different tools that support PEtab. The easiest tool to get started with is probably COPASI which comes with a graphical user interface (see https://github.com/copasi/python-petab-importer for further instructions).

It is apparent from the figure, that the random parameters yield a poor fit of the model with the data. Therefore, it is important to optimize the parameters to improve the model fit. This can be done using various parameter estimation tools. Links to detailed descriptions how to use the individual toolboxes are provided at the PEtab Github page.

## 7. Further information¶

This tutorial only demonstrates a subset of PEtab functionality. For full reference, consult the PEtab reference. After finishing the implementation of the PEtab problem, its correctness can be verified using the PEtab library (see https://github.com/PEtab-dev/PEtab/blob/master/doc/example/example_petablint.ipynb for instructions). The PEtab problem can then be used as input to the supporting toolboxes to estimate the unknown parameters or calculate parameter uncertainties. Links to tutorials for the different tools can be found at the PEtab Github page (https://github.com/PEtab-dev/PEtab#petab-support-in-systems-biology-tools).

 [1] TSV files can be created using any standard spreadsheet application, or for small files, text editor.