Store Model Inputs and Outputs

Simulation inputs and/or outputs can be kept in memory or saved on disk using either xarray’s or zarr’s I/O capabilities.

Using xarray

The xarray.Dataset structure, used for both simulation inputs and outputs, already supports serialization and I/O to several file formats, among which netCDF is the recommended format. For more information, see Section reading and writing files in xarray’s docs.

Before showing some examples, let's first create the same initial setup as the one used in Section Setup and Run Models:

In [1]: advect_model
Out[1]: 
<xsimlab.Model (4 processes, 5 inputs)>
grid
    spacing     [in] uniform spacing
    length      [in] total length
init
    loc         [in] location of initial pulse
    scale       [in] scale of initial pulse
advect
    v           [in] () or ('x',) velocity
profile
In [2]: import xsimlab as xs

In [3]: ds = xs.create_setup(
   ...:     model=advect_model,
   ...:     clocks={
   ...:         'time': np.linspace(0., 1., 101),
   ...:         'otime': [0, 0.5, 1]
   ...:     },
   ...:     master_clock='time',
   ...:     input_vars={
   ...:         'grid': {'length': 1.5, 'spacing': 0.01},
   ...:         'init': {'loc': 0.3, 'scale': 0.1},
   ...:         'advect__v': 1.
   ...:     },
   ...:     output_vars={
   ...:         'grid__x': None,
   ...:         'profile__u': 'otime'
   ...:     }
   ...: )
   ...: 

In [4]: ds
Out[4]: 
<xarray.Dataset>
Dimensions:        (otime: 3, time: 101)
Coordinates:
  * time           (time) float64 0.0 0.01 0.02 0.03 0.04 ... 0.97 0.98 0.99 1.0
  * otime          (otime) float64 0.0 0.5 1.0
Data variables:
    grid__length   float64 1.5
    grid__spacing  float64 0.01
    init__loc      float64 0.3
    init__scale    float64 0.1
    advect__v      float64 1.0
Attributes:
    __xsimlab_output_vars__:  grid__x

You can save the dataset above, e.g., using xarray.Dataset.to_netcdf():

In [5]: ds.to_netcdf("advect_model_setup.nc")

You can then reload this setup later or elsewhere before starting a new simulation:

In [6]: import xarray as xr

In [7]: in_ds = xr.load_dataset("advect_model_setup.nc")

In [8]: out_ds = in_ds.xsimlab.run(model=advect_model)

In [9]: out_ds
Out[9]: 
<xarray.Dataset>
Dimensions:        (otime: 3, time: 101, x: 150)
Coordinates:
  * otime          (otime) float64 0.0 0.5 1.0
  * time           (time) float64 0.0 0.01 0.02 0.03 0.04 ... 0.97 0.98 0.99 1.0
  * x              (x) float64 0.0 0.01 0.02 0.03 0.04 ... 1.46 1.47 1.48 1.49
Data variables:
    advect__v      float64 1.0
    grid__length   float64 1.5
    grid__spacing  float64 0.01
    grid__x        (x) float64 0.0 0.01 0.02 0.03 0.04 ... 1.46 1.47 1.48 1.49
    init__loc      float64 0.3
    init__scale    float64 0.1
    profile__u     (otime, x) float64 0.0001234 0.0002226 ... 0.03916 0.02705

The resulting dataset out_ds contains both the inputs and the outputs of this model run. Likewise, you can write it to a netCDF file or to any other format supported by xarray, e.g.,

In [10]: out_ds.to_netcdf("advect_model_run.nc")
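As a side note, xarray.open_dataset() reads a file's metadata eagerly but loads array values lazily, which is convenient for large output files. A minimal sketch of this round trip, using a small stand-in dataset (not the advection model) and a temporary file:

```python
import os
import tempfile

import numpy as np
import xarray as xr

# small stand-in dataset; real model outputs work the same way
ds = xr.Dataset(
    {"profile__u": (("otime", "x"), np.random.rand(3, 10))},
    coords={"otime": [0.0, 0.5, 1.0], "x": np.linspace(0.0, 0.09, 10)},
)

path = os.path.join(tempfile.mkdtemp(), "run.nc")
ds.to_netcdf(path)

# open_dataset defers loading array values until they are needed,
# unlike load_dataset which reads everything into memory at once
reloaded = xr.open_dataset(path)
```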

Using zarr

When xarray.Dataset.xsimlab.run() is called, xarray-simlab uses the zarr library to efficiently store (i.e., with compression) both simulation input and output data. Input data is stored before the simulation starts and output data is stored progressively as the simulation proceeds.

By default, all this data is kept in memory. For large amounts of model I/O data, however, it is recommended to save the data on disk. For example, you can specify a directory in which to store it:

In [11]: out_ds = in_ds.xsimlab.run(model=advect_model, store="advect_model_run.zarr")

You can also store the data in a temporary directory:

In [12]: import zarr

In [13]: out_ds = in_ds.xsimlab.run(model=advect_model, store=zarr.TempStore())

Or you can directly use zarr.group() for more options, e.g., if you want to overwrite a directory that has been used for old model runs:

In [14]: zg = zarr.group("advect_model_run.zarr", overwrite=True)

In [15]: out_ds = in_ds.xsimlab.run(model=advect_model, store=zg)

Note

The zarr library provides many storage alternatives, including support for distributed/cloud and database storage systems (see storage alternatives in zarr's tutorial). Note, however, that a few alternatives won't work well with xarray-simlab. For example, zarr.ZipStore doesn't support writing new data to a zarr dataset after it has been created.

Regardless of the chosen alternative, xarray.Dataset.xsimlab.run() returns an xarray.Dataset object that contains the data (lazily) loaded from the zarr store:

In [16]: out_ds
Out[16]: 
<xarray.Dataset>
Dimensions:        (otime: 3, time: 101, x: 150)
Coordinates:
  * otime          (otime) float64 0.0 0.5 1.0
  * time           (time) float64 0.0 0.01 0.02 0.03 0.04 ... 0.97 0.98 0.99 1.0
  * x              (x) float64 0.0 0.01 0.02 0.03 0.04 ... 1.46 1.47 1.48 1.49
Data variables:
    advect__v      float64 1.0
    grid__length   float64 1.5
    grid__spacing  float64 0.01
    grid__x        (x) float64 dask.array<chunksize=(150,), meta=np.ndarray>
    init__loc      float64 0.3
    init__scale    float64 0.1
    profile__u     (otime, x) float64 dask.array<chunksize=(3, 150), meta=np.ndarray>

Zarr stores large multi-dimensional arrays as contiguous chunks. When opened as an xarray.Dataset, xarray keeps track of those chunks, which enables efficient and parallel post-processing via the dask library (see Section parallel computing with Dask in xarray's docs).
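The chunk-aware, lazy workflow described above can be sketched with a synthetic dataset (chunked manually here with .chunk(), whereas a zarr-backed dataset would come pre-chunked):

```python
import numpy as np
import xarray as xr

# synthetic stand-in for a large simulation output, split into chunks
ds = xr.Dataset(
    {"profile__u": (("otime", "x"), np.ones((4, 100)))}
).chunk({"otime": 2, "x": 50})

# operations on chunked variables only build a lazy dask graph...
mean_u = ds["profile__u"].mean(dim="x")

# ...which is executed (chunk by chunk, possibly in parallel) on .compute()
result = mean_u.compute()
```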

Advanced usage

Dynamically sized arrays

Model variables may have one or more dimensions that are dynamically resized during a simulation. When such variables are saved as outputs, the corresponding zarr datasets are resized as needed so that, at the end of the simulation, all values are stored in arrays of fixed shape, possibly containing missing values (note: depending on chunk size, zarr doesn't need to physically store regions of contiguous missing values).

The example below illustrates how such variables are returned as outputs:

In [17]: import numpy as np

In [18]: @xs.process
   ....: class Particles:
   ....:     """Generate at each step a random number of particles
   ....:     at random positions along an axis.
   ....:     """
   ....: 
   ....:     position = xs.variable(dims='pt', intent='out')
   ....: 
   ....:     def initialize(self):
   ....:         self._rng = np.random.default_rng(123)
   ....: 
   ....:     def run_step(self):
   ....:         nparticles = self._rng.integers(1, 4)
   ....:         self.position = self._rng.uniform(0, 10, size=nparticles)
   ....: 

In [19]: model = xs.Model({'pt': Particles})

In [20]: with model:
   ....:     in_ds = xs.create_setup(clocks={'steps': range(4)},
   ....:                             output_vars={'pt__position': 'steps'})
   ....:     out_ds = in_ds.xsimlab.run()
   ....: 

In [21]: out_ds.pt__position
Out[21]: 
<xarray.DataArray 'pt__position' (steps: 4, pt: 3)>
array([[0.53821019,        nan,        nan],
       [2.20359873, 1.84371811, 1.75905901],
       [9.23344998, 2.76574398,        nan],
       [9.23344998, 2.76574398,        nan]])
Coordinates:
  * steps    (steps) int64 0 1 2 3
Dimensions without coordinates: pt

N-dimensional arrays with missing values might not be the best format for this kind of output data. It can still be converted into a denser format, such as a pandas.DataFrame with a multi-index, thanks to the xarray Dataset or DataArray stack(), dropna() and to_dataframe() methods:

In [22]: (out_ds.stack(particles=('steps', 'pt'))
   ....:        .dropna('particles')
   ....:        .to_dataframe())
   ....: 
Out[22]: 
          pt__position
steps pt              
0     0       0.538210
1     0       2.203599
      1       1.843718
      2       1.759059
2     0       9.233450
      1       2.765744
3     0       9.233450
      1       2.765744
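Another handy reduction on such NaN-padded output is count(), which skips missing values, e.g., to get the number of particles at each step. A sketch using a hand-made array that mirrors the pt__position output above:

```python
import numpy as np
import xarray as xr

# NaN-padded array mimicking the dynamically sized pt__position output
position = xr.DataArray(
    np.array([[0.54, np.nan, np.nan],
              [2.20, 1.84, 1.76],
              [9.23, 2.77, np.nan],
              [9.23, 2.77, np.nan]]),
    dims=("steps", "pt"),
    coords={"steps": [0, 1, 2, 3]},
)

# count() ignores NaN, yielding the number of particles per step
n_particles = position.count(dim="pt")
```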

Encoding options

You can control how zarr stores simulation data via encoding options. These options can be set for variables declared in process classes; see the encoding parameter of variable() for all available options. Encoding options may also be set or overridden when calling run().

Warning

Zarr uses 0 as the default fill value for numeric value types. This may badly affect the results, since array elements equal to the fill value are replaced by NA in the output xarray Dataset. For variables that accept 0 as a possible (non-missing) value, it is highly recommended to explicitly set another fill_value. Alternatively, you can deactivate this masking behavior by setting mask_and_scale=False and passing it via the decoding parameter of run().
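The masking behavior described in this warning can be reproduced with xarray's CF decoding alone. In this sketch, a variable for which 0 is a legitimate value also declares 0 as its fill value, so decoding turns its zeros into NaN unless mask_and_scale=False is passed:

```python
import numpy as np
import xarray as xr

# a variable where 0 is a valid value, but also the declared fill value
ds = xr.Dataset({"v": ("x", np.array([0.0, 1.0, 2.0]))})
ds["v"].attrs["_FillValue"] = 0.0

# default decoding masks the fill value: zeros become NaN
masked = xr.decode_cf(ds)

# with mask_and_scale=False, the zeros are preserved
unmasked = xr.decode_cf(ds, mask_and_scale=False)
```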