Store Model Inputs and Outputs

Simulation inputs and/or outputs can be kept in memory or saved on disk using either xarray’s or zarr’s I/O capabilities.

Using xarray

The xarray.Dataset structure, used for both simulation inputs and outputs, already supports serialization and I/O to several file formats, among which netCDF is the recommended one. For more information, see the section on reading and writing files in xarray’s docs.

Before showing some examples, let’s first create the same initial setup as the one used in Section Setup and Run Models:

In [1]: advect_model
Out[1]: 
<xsimlab.Model (4 processes, 5 inputs)>
grid
    spacing     [in] uniform spacing
    length      [in] total length
init
    loc         [in] location of initial pulse
    scale       [in] scale of initial pulse
advect
    v           [in] () or ('x',) velocity
profile
In [2]: import numpy as np
   ...: import xsimlab as xs

In [3]: ds = xs.create_setup(
   ...:     model=advect_model,
   ...:     clocks={
   ...:         'time': np.linspace(0., 1., 101),
   ...:         'otime': [0, 0.5, 1]
   ...:     },
   ...:     main_clock='time',
   ...:     input_vars={
   ...:         'grid': {'length': 1.5, 'spacing': 0.01},
   ...:         'init': {'loc': 0.3, 'scale': 0.1},
   ...:         'advect__v': 1.
   ...:     },
   ...:     output_vars={
   ...:         'grid__x': None,
   ...:         'profile__u': 'otime'
   ...:     }
   ...: )
   ...: 

In [4]: ds
Out[4]: 
<xarray.Dataset>
Dimensions:        (otime: 3, time: 101)
Coordinates:
  * time           (time) float64 0.0 0.01 0.02 0.03 0.04 ... 0.97 0.98 0.99 1.0
  * otime          (otime) float64 0.0 0.5 1.0
Data variables:
    grid__length   float64 1.5
    grid__spacing  float64 0.01
    init__loc      float64 0.3
    init__scale    float64 0.1
    advect__v      float64 1.0
Attributes:
    __xsimlab_output_vars__:  grid__x

You can save the dataset above, e.g., using xarray.Dataset.to_netcdf():

In [5]: ds.to_netcdf("advect_model_setup.nc")

You can then reload this setup later or elsewhere before starting a new simulation:

In [6]: import xarray as xr

In [7]: in_ds = xr.load_dataset("advect_model_setup.nc")

In [8]: out_ds = in_ds.xsimlab.run(model=advect_model)

In [9]: out_ds
Out[9]: 
<xarray.Dataset>
Dimensions:        (otime: 3, time: 101, x: 150)
Coordinates:
  * otime          (otime) float64 0.0 0.5 1.0
  * time           (time) float64 0.0 0.01 0.02 0.03 0.04 ... 0.97 0.98 0.99 1.0
  * x              (x) float64 0.0 0.01 0.02 0.03 0.04 ... 1.46 1.47 1.48 1.49
Data variables:
    advect__v      float64 1.0
    grid__length   float64 1.5
    grid__spacing  float64 0.01
    grid__x        (x) float64 0.0 0.01 0.02 0.03 0.04 ... 1.46 1.47 1.48 1.49
    init__loc      float64 0.3
    init__scale    float64 0.1
    profile__u     (otime, x) float64 0.0001234 0.0002226 ... 0.03916 0.02705

The latter dataset out_ds contains both the inputs and the outputs of this model run. Likewise, you can write it to a netCDF file or to any other format supported by xarray, e.g.,

In [10]: out_ds.to_netcdf("advect_model_run.nc")
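
Later on, you could reopen those saved results for post-processing. A minimal sketch, assuming the file written above is in the current directory:

import xarray as xr

# open the saved results (open_dataset loads the data lazily)
saved_ds = xr.open_dataset("advect_model_run.nc")

# e.g., extract the advected profile at the last snapshot time
final_profile = saved_ds.profile__u.sel(otime=1.0)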

Using zarr

When xarray.Dataset.xsimlab.run() is called, xarray-simlab uses the zarr library to efficiently store (i.e., with compression) both simulation input and output data. Input data is stored before the simulation starts and output data is stored progressively as the simulation proceeds.

By default, all this data is saved in memory. For large amounts of model I/O data, however, it is recommended to save the data on disk. For example, you can specify a directory where to store it:

In [11]: out_ds = in_ds.xsimlab.run(model=advect_model, store="advect_model_run.zarr")

You can also store the data in a temporary directory:

In [12]: import zarr

In [13]: out_ds = in_ds.xsimlab.run(model=advect_model, store=zarr.TempStore())

Or you can directly use zarr.group() for more options, e.g., if you want to overwrite a directory that was used for previous model runs:

In [14]: zg = zarr.group("advect_model_run.zarr", overwrite=True)

In [15]: out_ds = in_ds.xsimlab.run(model=advect_model, store=zg)

Note

The zarr library provides many storage alternatives, including support for distributed/cloud and database storage systems (see storage alternatives in zarr’s tutorial). Note, however, that a few alternatives won’t work well with xarray-simlab. For example, zarr.ZipStore doesn’t support writing data to a zarr dataset after it has been created.
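
As a hypothetical sketch, output could also be written directly to a cloud object store through the fsspec library. The bucket URL below is a placeholder, and this assumes the relevant backend (e.g., s3fs) is installed and configured with credentials:

import fsspec

# map a bucket path to a mutable mapping usable as a zarr store
# (hypothetical bucket name)
store = fsspec.get_mapper("s3://my-bucket/advect_model_run.zarr")

out_ds = in_ds.xsimlab.run(model=advect_model, store=store)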

Regardless of the chosen alternative, xarray.Dataset.xsimlab.run() returns an xarray.Dataset object that contains the data (lazily) loaded from the zarr store:

In [16]: out_ds
Out[16]: 
<xarray.Dataset>
Dimensions:        (otime: 3, time: 101, x: 150)
Coordinates:
  * otime          (otime) float64 0.0 0.5 1.0
  * time           (time) float64 0.0 0.01 0.02 0.03 0.04 ... 0.97 0.98 0.99 1.0
  * x              (x) float64 0.0 0.01 0.02 0.03 0.04 ... 1.46 1.47 1.48 1.49
Data variables:
    advect__v      float64 1.0
    grid__length   float64 1.5
    grid__spacing  float64 0.01
    grid__x        (x) float64 dask.array<chunksize=(150,), meta=np.ndarray>
    init__loc      float64 0.3
    init__scale    float64 0.1
    profile__u     (otime, x) float64 dask.array<chunksize=(3, 150), meta=np.ndarray>

Zarr stores large multi-dimensional arrays as contiguous chunks. When opened as an xarray.Dataset, xarray keeps track of those chunks, which enables efficient and parallel post-processing via the dask library (see the section on parallel computing with Dask in xarray’s docs).
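
For example (a small sketch, assuming out_ds was loaded from a zarr store as above and that dask is installed), operations on those chunked variables are deferred until a computation is explicitly triggered:

# this builds a lazy dask graph instead of computing immediately
mean_u = out_ds.profile__u.mean(dim='x')

# trigger the (possibly parallel) computation and load the result
mean_u.compute()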

Advanced usage

Encoding options

Some encoding options let you control how zarr stores simulation data. Those options can be set for variables declared in process classes (see the encoding parameter of variable() for all available options) and may also be set or overridden when calling run(). An often-encountered use case for encoding is suppressing nan values in the output dataset:

Default fill values

By default, output variables get a fill value that is determined at model runtime from the variable’s datatype (see the table below) and that is decoded to np.nan in the output dataset. This only affects the output arrays: during the model run, the actual values are used.

Fill values

datatype             fill value               example
(unsigned) integer   maximum possible value   uint8: 255
float                np.nan
string               empty string             ""
bool                 False
complex values       default for each dtype
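
For integer datatypes, for instance, the default fill value is the maximum value representable by the dtype, which numpy exposes via np.iinfo:

import numpy as np

# maximum representable values, used as default fill values for integers
np.iinfo(np.uint8).max   # 255
np.iinfo(np.int32).max   # 2147483647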

Especially for boolean datatypes, where the default fill value is False, it is desirable to suppress this behaviour. There are several options, with different benefits and drawbacks, as outlined below.

If you know beforehand what the fill_value should be, this can be set directly in the process class:

@xs.process
class Foo:
    v_bool_nan = xs.variable(dims="x", intent="out")
    # suppress nan values by setting an explicit fill value:
    v_bool = xs.variable(dims="x", intent="out", encoding={"fill_value": None})

    def initialize(self):
        self.v_bool_nan = [True, False]
        self.v_bool = [True, False]

In the resulting output, values equal to the default fill value are decoded to nan for the variable without an explicit fill_value:

In [17]: model = xs.Model({"foo": Foo})

In [18]: in_ds = xs.create_setup(
   ....:    model=model,
   ....:    clocks={"clock": [0, 1]},
   ....:    output_vars={"foo__v_bool": None, "foo__v_bool_nan": None},
   ....: )
   ....: 

# this will result in nan values in the output
In [19]: in_ds.xsimlab.run(model=model)
Out[19]: 
<xarray.Dataset>
Dimensions:          (clock: 2, x: 2)
Coordinates:
  * clock            (clock) int64 0 1
Dimensions without coordinates: x
Data variables:
    foo__v_bool      (x) bool True False
    foo__v_bool_nan  (x) object True nan

Alternatively, encoding (or decoding) options can be set during model run:

# set encoding options during model run
In [20]: in_ds.xsimlab.run(model=model, encoding={"foo__v_bool_nan": {"fill_value": None}})
Out[20]: 
<xarray.Dataset>
Dimensions:          (clock: 2, x: 2)
Coordinates:
  * clock            (clock) int64 0 1
Dimensions without coordinates: x
Data variables:
    foo__v_bool      (x) bool True False
    foo__v_bool_nan  (x) bool True False

# set mask_and_scale to false
In [21]: in_ds.xsimlab.run(model=model, decoding={"mask_and_scale": False})
Out[21]: 
<xarray.Dataset>
Dimensions:          (clock: 2, x: 2)
Coordinates:
  * clock            (clock) int64 0 1
Dimensions without coordinates: x
Data variables:
    foo__v_bool      (x) bool True False
    foo__v_bool_nan  (x) bool True False

However, using mask_and_scale=False results in non-serializable attributes in the output dataset, so the other alternatives are preferable.
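
If you do use this option and later need to write the dataset to disk, one possible workaround (a sketch, with the caveat that it also drops any other variable metadata) is to clear the variable attributes first:

out_ds = in_ds.xsimlab.run(model=model, decoding={"mask_and_scale": False})

# drop the non-serializable attributes before writing to disk
for var in out_ds.variables:
    out_ds[var].attrs.clear()

# hypothetical output path
out_ds.to_netcdf("foo_model_run.nc")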

Dynamically sized arrays

Model variables may have one or several of their dimensions dynamically resized during a simulation. When those variables are saved as outputs, the corresponding zarr datasets are resized so that, at the end of the simulation, all values are stored in arrays of fixed shape, possibly containing missing values (note: depending on chunk size, zarr doesn’t need to physically store regions of contiguous missing values).

The example below illustrates how such variables are returned as outputs:

In [22]: import numpy as np

In [23]: @xs.process
   ....: class Particles:
   ....:     """Generate at each step a random number of particles
   ....:     at random positions along an axis.
   ....:     """
   ....: 
   ....:     position = xs.variable(dims='pt', intent='out')
   ....: 
   ....:     def initialize(self):
   ....:         self._rng = np.random.default_rng(123)
   ....: 
   ....:     def run_step(self):
   ....:         nparticles = self._rng.integers(1, 4)
   ....:         self.position = self._rng.uniform(0, 10, size=nparticles)
   ....: 

In [24]: model = xs.Model({'pt': Particles})

In [25]: with model:
   ....:     in_ds = xs.create_setup(clocks={'steps': range(4)},
   ....:                             output_vars={'pt__position': 'steps'})
   ....:     out_ds = in_ds.xsimlab.run()
   ....: 

In [26]: out_ds.pt__position
Out[26]: 
<xarray.DataArray 'pt__position' (steps: 4, pt: 3)>
array([[0.53821019,        nan,        nan],
       [2.20359873, 1.84371811, 1.75905901],
       [9.23344998, 2.76574398,        nan],
       [9.23344998, 2.76574398,        nan]])
Coordinates:
  * steps    (steps) int64 0 1 2 3
Dimensions without coordinates: pt

N-dimensional arrays with missing values might not be the best format for dealing with this kind of output data. It can still be converted into a denser format, e.g., a pandas.DataFrame with a multi-index, thanks to the stack(), dropna() and to_dataframe() methods of xarray Dataset or DataArray objects:

In [27]: (out_ds.stack(particles=('steps', 'pt'))
   ....:        .dropna('particles')
   ....:        .to_dataframe())
   ....: 
Out[27]: 
          pt__position
steps pt              
0     0       0.538210
1     0       2.203599
      1       1.843718
      2       1.759059
2     0       9.233450
      1       2.765744
3     0       9.233450
      1       2.765744