User Guide

Users can either interact with the command line interface (CLI) run_gents or develop custom Python workflows by importing the gents module.

Using the Command Line Interface (CLI)

To get help output:

run_gents --help

This prints the CLI usage, available arguments, and GenTS version. Depending on the installation method, run_gents may need a prefix, such as apptainer run docker://agentoxygen/gents:latest, to run inside a container.

run_gents <hf_head_dir> [options]

The path to the head directory containing model output history files (hf_head_dir) is the only required argument. GenTS recursively searches this directory for .nc files, groups them by sub-directory and file name pattern, and generates a corresponding set of time series files. By default, no filters are applied, which will likely produce errors for case output directories. To apply the default configuration for a model, specify the --model argument:

run_gents /scratch/my_case/raw_output/ --model CESM3 --dryrun

The --dryrun flag restricts GenTS to read-only operations so that the configuration can be validated before any new files are generated. By default, the output directory is the same as the case directory (for CESM3, a proc/tseries/ directory is created at the same level as hist/). To specify a separate output directory, use the -o flag followed by the output path:

run_gents /scratch/my_case/raw_output/ -o /scratch/my_case/timeseries --model CESM3 --dryrun

Parallelization is set up by default with 64 cores for reading metadata and 8 cores for reading/writing time series files. History file metadata reads scale strongly while time series writes scale weakly, so on a 128-core machine you may want to raise both values to leverage more parallelism (128 for metadata, 16 for writing):

run_gents /scratch/my_case/raw_output/ -o /scratch/my_case/timeseries/ --model CESM3 --dryrun --tscores 16 --hfcores 128

Remove the --dryrun argument to read the CESM3 history file structure in /scratch/my_case/raw_output/ and generate time series in a similar directory structure under /scratch/my_case/timeseries/. In some cases, users may want to process individual model components. To do this, simply change the history file directory (example for processing only the atm directory):

run_gents /scratch/my_case/raw_output/atm/ -o /scratch/my_case/timeseries/atm/ --model CESM3 --dryrun

GenTS will automatically create missing subdirectories in the output path. In some cases, the default configuration for a model may not perfectly handle the case output and may read NetCDF files that should not be included. To exclude additional NetCDF files from GenTS, use the --exclude argument in combination with the --append flag:

run_gents /scratch/my_case/raw_output/atm/ --model CESM3 --dryrun --exclude "*log_file*" --append

The --append flag adds this filter on top of the CESM3 default configuration (as specified by --model CESM3). To replace all of the exclusive filters in the configuration, simply remove the --append flag. To exclude multiple glob patterns, use the --exclude argument multiple times:

run_gents /scratch/my_case/raw_output/atm/ --model CESM3 --dryrun --exclude "*log_file.nc" --exclude "*static.nc"
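As a rough illustration of how several exclude globs combine, the sketch below drops any path that matches at least one pattern. It assumes shell-style matching against the full path string, as provided by Python's fnmatch module; this is an assumption for illustration, not a description of GenTS internals, and the helper name apply_excludes and the file names are hypothetical.

```python
from fnmatch import fnmatchcase


def apply_excludes(paths, exclude_globs):
    """Drop every path that matches at least one exclude pattern."""
    return [
        path for path in paths
        if not any(fnmatchcase(path, pattern) for pattern in exclude_globs)
    ]


# Hypothetical file listing, for illustration only.
paths = [
    "/scratch/my_case/raw_output/atm/hist/case.cam.h0.2000-01.nc",
    "/scratch/my_case/raw_output/atm/hist/run_log_file.nc",
    "/scratch/my_case/raw_output/atm/hist/grid_static.nc",
]

# Same two patterns as the command above; "*" also matches "/" in fnmatch,
# so a leading "*" covers the whole directory part of the path.
kept = apply_excludes(paths, ["*log_file.nc", "*static.nc"])
print(kept)
```

Testing patterns this way against a listing of your own paths can be a quick complement to --dryrun.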

Similarly, you can include files that are being excluded by default:

run_gents /scratch/my_case/raw_output/atm/ --model CESM3 --dryrun --include "*h4i*.nc"

To adjust the slice length for time series output (10-year slices by default), use the --slice argument. For example, to limit the slice length to 5 years:

run_gents /scratch/my_case/raw_output/atm/ --model CESM3 --dryrun --slice 5
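To make the effect of the slice length concrete, the sketch below splits an inclusive range of years into consecutive slices. It assumes slices are counted from the first year of the data; how GenTS actually aligns slice boundaries is not specified here, and slice_years is a hypothetical helper.

```python
def slice_years(start_year, end_year, slice_length=10):
    """Split an inclusive year range into consecutive (first, last) slices."""
    if slice_length < 1:
        raise ValueError("slice_length must be at least 1")
    slices = []
    first = start_year
    while first <= end_year:
        # The final slice is shortened if the range does not divide evenly.
        last = min(first + slice_length - 1, end_year)
        slices.append((first, last))
        first = last + 1
    return slices


# 13 years of output split into 5-year slices, then with the 10-year default.
five_year = slice_years(2000, 2012, slice_length=5)
ten_year = slice_years(2000, 2012)
print(five_year)
print(ten_year)
```

Shorter slices produce more, smaller files per variable; longer slices produce fewer, larger ones.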

Using the Python Package

An example code snippet is featured below:

from gents.hfcollection import HFCollection
from gents.timeseries import TSCollection

input_head_dir = "... case directory with model output ..."
output_head_dir = "... scratch directory to output time series to ..."

hf_collection = HFCollection(input_head_dir, num_processes=64)
hf_collection = hf_collection.include(["*/atm/*", "*/ocn/*", "*.h4.*"])

ts_collection = TSCollection(hf_collection.include_years(0, 5), output_head_dir, num_processes=32)
ts_collection = ts_collection.apply_overwrite("*")
ts_collection.execute()

The bulk of functionality in this package is provided by two Python classes: gents.hfcollection.HFCollection and gents.timeseries.TSCollection. These classes centralize the organization of history files and provide an interface for customizing time series output. In general, the user begins by defining an HFCollection, which searches recursively through a directory structure for history files. The user can then optionally apply filters to the selection to include only specific history file types. Once the desired history files have been identified, HFCollection automatically groups them by sub-directory and file name patterns. The user then creates a TSCollection from the populated HFCollection, which organizes the history file groupings into a list of executable functions that create the time series files. These functions run independently of each other in an embarrassingly parallel scheme using ProcessPoolExecutor from the Python standard library (concurrent.futures). They may also be ported to third-party distributed computing libraries such as Dask.
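The grouping step described above can be pictured with a small sketch. It assumes that files belonging to one history stream share a directory and a file name prefix, and differ only in a trailing date stamp; the regular expression, the file names, and the helper group_history_files are hypothetical and are not taken from the GenTS source.

```python
import re
from collections import defaultdict
from pathlib import PurePosixPath

# Hypothetical date stamp: ".YYYY", ".YYYY-MM", ".YYYY-MM-DD" or
# ".YYYY-MM-DD-SSSSS" immediately before the ".nc" extension.
DATE_SUFFIX = re.compile(r"\.\d{4}(-\d{2}){0,2}(-\d{5})?\.nc$")


def group_history_files(paths):
    """Group paths by (parent directory, file name prefix without date stamp)."""
    groups = defaultdict(list)
    for raw_path in paths:
        path = PurePosixPath(raw_path)
        prefix = DATE_SUFFIX.sub("", path.name)
        groups[(str(path.parent), prefix)].append(raw_path)
    return {key: sorted(members) for key, members in groups.items()}


# Hypothetical file listing, for illustration only.
paths = [
    "/run/atm/hist/case.cam.h0.2000-01.nc",
    "/run/atm/hist/case.cam.h0.2000-02.nc",
    "/run/atm/hist/case.cam.h1.2000-01-01-00000.nc",
    "/run/ocn/hist/case.pop.h.2000-01.nc",
]
groups = group_history_files(paths)
for key, members in groups.items():
    print(key, len(members))
```

Each group then corresponds to one set of time series files, one per variable.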

Creating the HFCollection

The HFCollection class provides an intuitive interface for the user to interactively filter for target history files by mapping paths to metadata. To get started, create an HFCollection object by pointing it to the head directory of your history file collection:

from gents.hfcollection import HFCollection
hf_collection = HFCollection(hf_dir="my/file/system/scratch/GCM_run/output/history_files/")

hf_collection now contains an internal dictionary that maps history files to metadata stored in the gents.meta.netCDFMeta class. For example, to print all history files by path and obtain the first entry’s metadata:

print(list(hf_collection))
first_entry_path = list(hf_collection)[0]
hf_collection.pull_metadata()
first_entry_meta = hf_collection[first_entry_path]

The gents.meta.netCDFMeta class stores useful metadata that can be obtained quickly by reading the netCDF headers. When initialized, HFCollection does not pull the metadata and leaves the internal dictionary values empty (the keys effectively act as pointers to files from which metadata will eventually be pulled). This allows the user to apply filters based purely on path characteristics before reading every history file in the collection, thereby reducing the total number of header reads. The above code block assumes the user wants all of the history files under the head directory. If the user were only interested in history files with .h1. in the path, the following code would be optimal:

hf_collection = hf_collection.include(["*.h1.*"])
first_entry_path = list(hf_collection)[0]
hf_collection.pull_metadata()
first_entry_meta = hf_collection[first_entry_path]

Note that HFCollection.include is called before the metadata is pulled. This allows GenTS to filter out history files that do not match the specified patterns and avoid unnecessary header reads. Similarly, we can exclude patterns using HFCollection.exclude:

hf_collection = hf_collection.exclude(glob=["*.once.*", "*/rof/*"])
first_entry_path = list(hf_collection)[0]
hf_collection.pull_metadata()
first_entry_meta = hf_collection[first_entry_path]

Note that the user can specify multiple glob patterns, and these can filter directories too (each glob pattern is applied to the absolute path string). Both HFCollection.include and HFCollection.exclude should be executed before pulling metadata for optimal performance. Although header reads are lightweight, thousands of files can start to add up. Metadata can be pulled in serial (as above), but it is recommended to specify multiple cores when initializing HFCollection to parallelize the process. Since gathering metadata is lightweight and read-only, the throughput generally scales strongly with the number of cores:

from gents.hfcollection import HFCollection
hf_collection = HFCollection(hf_dir="my/file/system/scratch/GCM_run/output/history_files/", num_processes=64)
hf_collection.pull_metadata() # distributed across 64 cores

HFCollection.include and HFCollection.exclude return copies of the HFCollection, which allows the user to create multiple objects for better organization:

hf_atm_only = hf_collection.include(glob=["*/atm/*"])
hf_ocn_only = hf_collection.include(glob=["*/ocn/*"])
hf_lnd_only = hf_collection.include(glob=["*/lnd/*"])

Note that pulling metadata for hf_atm_only in this case does not pull metadata for the other two collections. However, if metadata was pulled for hf_collection, all three sub-collections would inherit those metadata objects (and thus would not need to pull again).
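The interplay between lazy metadata, path filters, and copies can be sketched with a toy class. This is a minimal model written for illustration, assuming a path-to-metadata dictionary whose values start empty; LazyCollection, its methods, and the fake reader are hypothetical and are not the HFCollection implementation.

```python
from fnmatch import fnmatchcase


class LazyCollection:
    """Toy path -> metadata mapping whose values stay None until pulled."""

    def __init__(self, entries, reader):
        self._entries = dict(entries)
        self._reader = reader  # callable that "reads a header" for one path

    @classmethod
    def from_paths(cls, paths, reader):
        return cls({path: None for path in paths}, reader)

    def include(self, globs):
        """Return a filtered copy; metadata that already exists is carried over."""
        kept = {
            path: meta for path, meta in self._entries.items()
            if any(fnmatchcase(path, glob) for glob in globs)
        }
        return LazyCollection(kept, self._reader)

    def pull_metadata(self):
        """Read headers only for entries that have no metadata yet."""
        for path in list(self._entries):
            if self._entries[path] is None:
                self._entries[path] = self._reader(path)

    def __iter__(self):
        return iter(self._entries)

    def __getitem__(self, path):
        return self._entries[path]


reads = []  # records every simulated header read


def fake_reader(path):
    reads.append(path)
    return {"path": path}


everything = LazyCollection.from_paths(
    ["/run/atm/a.nc", "/run/ocn/b.nc", "/run/lnd/c.nc"], fake_reader
)
atm_only = everything.include(["*/atm/*"])
atm_only.pull_metadata()      # one header read, only for the atm file
reads_after_atm = list(reads)

everything.pull_metadata()    # the parent still has to read all three files
ocn_only = everything.include(["*/ocn/*"])
ocn_only.pull_metadata()      # nothing left to read: metadata was inherited
print(reads_after_atm, len(reads))
```

In this model, filtering before pulling keeps the number of header reads proportional to the files actually kept, while filtering after pulling reuses the metadata objects that already exist.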

A common step may be to filter by a date-time string in the file name:

hf_2010_2019 = hf_collection.include(glob=["*20100101-20191231.nc"])

This may work in most cases, but file names are not always reliable, and a single pattern may be difficult to apply across multiple model components. A more robust way of filtering is to operate over the time bounds provided in the metadata. This requires a metadata pull before running, so there is a performance hit for large datasets, but for smaller datasets the added cost is negligible:

hf_2010_2019 = hf_collection.include_years(2010, 2019)

Additionally, the user may combine this filter with an inclusive filter by using the glob argument:

hf_atm_2010_2019 = hf_collection.include_years(2010, 2019, glob=["*/atm/*"])

Note that the glob patterns are applied after pulling metadata, so this argument is designed for convenience rather than performance (HFCollection.include is preferred). HFCollection.include_years automatically pulls metadata if the user has not already done so.
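A filter over time bounds can be sketched as an interval overlap test. The example assumes each file's metadata reduces to a (first_year, last_year) pair and that a file is kept when its span overlaps the requested inclusive range; whether GenTS uses overlap or full containment is not stated here, and the function and file names are hypothetical.

```python
def filter_by_years(year_spans, start_year, end_year):
    """Keep files whose (first_year, last_year) span overlaps the inclusive range."""
    return {
        path: (first, last)
        for path, (first, last) in year_spans.items()
        if first <= end_year and last >= start_year
    }


# Hypothetical time bounds, as they might be derived from netCDF metadata.
year_spans = {
    "/run/atm/hist/case.cam.h0.a.nc": (2000, 2009),
    "/run/atm/hist/case.cam.h0.b.nc": (2010, 2019),
    "/run/atm/hist/case.cam.h0.c.nc": (2020, 2029),
}
selected = filter_by_years(year_spans, 2010, 2019)
print(sorted(selected))
```

Because the test relies on metadata rather than file names, it works the same way for components that use different naming schemes.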

Creating the TSCollection

Once an HFCollection has been created and configured, a TSCollection may be derived from it to map out and execute the post-processing. TSCollection only requires a valid HFCollection object and a head directory to eventually output time series datasets to:

ts_collection = TSCollection(hf_collection, output_head_dir, num_processes=16)

Metadata for hf_collection will automatically be pulled if this has not been done already. Note that the num_processes argument allows the user to parallelize time series generation across multiple cores. This is an I/O-heavy process because netCDF files are fully read and written, so there is a limit to how strongly it scales with the number of cores allocated (scaling depends on the file system and networking). In general, scaling is much weaker than for the metadata reads with HFCollection. As with HFCollection, include and exclude operations may be applied over the history file paths. The difference is that TSCollection filters take a single path glob together with a variable glob, whereas HFCollection filters accept multiple path globs but offer no per-variable filtering:

ts_tmax_only = ts_collection.include(path_glob="*", var_glob="TMAX")
ts_prec_only = ts_collection.include(path_glob="*", var_glob="PREC*")
ts_h1_prec_only = ts_collection.include(path_glob="*.h1.*", var_glob="PREC*")

Note that the last inclusive filter only includes history files with a path that contains “.h1.” and only derives time series for variables that start with “PREC”. You can also exclude time series in the same manner:

ts_without_h4_hurs = ts_collection.exclude(path_glob="*.h4.*", var_glob="HURS")

Just like with HFCollection, both TSCollection.include and TSCollection.exclude operations return copies, allowing for advanced filtering:

ts_h2_temps_only = ts_collection.include(path_glob="*.h2.*", var_glob="T*")
ts_h2_temps_no_pop = ts_h2_temps_only.exclude(path_glob="*.pop.*", var_glob="*")

Once filtered, custom arguments can be applied to all time series or just a subset. Currently supported arguments include whether to overwrite existing time series, compression level, and compression algorithm. These arguments are passed to the netCDF4 Python API. The arguments can be applied using glob patterns for both paths and variable names:

ts_collection = ts_collection.add_args("*", "*", overwrite=True)
ts_collection = ts_collection.apply_compression(alg="zlib", level=5, path_glob="*/atm/*", var_glob="*")
ts_collection = ts_collection.add_args("*", "*HD*", alg="zlib", level=2)

The first line sets all time series output to overwrite existing files. The second line applies level 5 compression using the “zlib” algorithm only to time series output derived from history files that contain “/atm/” in their path. The third line applies level 2 compression to all time series output with primary variables that contain the characters “HD”. Note that the third line takes precedence over the second wherever their matches overlap.
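The precedence between overlapping calls can be modeled with a list of dictionaries and a helper that updates every matching entry. This sketch assumes that each call simply overwrites earlier values for the entries it matches; the dictionary keys, the helper add_args_sketch, and the variable names are hypothetical stand-ins rather than the TSCollection internals.

```python
from fnmatch import fnmatchcase


def add_args_sketch(orders, path_glob, var_glob, **kwargs):
    """Return copies of the orders, with kwargs merged into every match."""
    updated = []
    for order in orders:
        new_order = dict(order)
        if fnmatchcase(order["path"], path_glob) and fnmatchcase(order["variable"], var_glob):
            new_order.update(kwargs)  # later calls overwrite earlier values
        updated.append(new_order)
    return updated


# Hypothetical time series orders, for illustration only.
orders = [
    {"path": "/run/atm/hist/case.cam.h0.", "variable": "HDEPTH"},
    {"path": "/run/atm/hist/case.cam.h0.", "variable": "T"},
    {"path": "/run/ocn/hist/case.pop.h.", "variable": "HDIFT"},
]
orders = add_args_sketch(orders, "*", "*", overwrite=True)
orders = add_args_sketch(orders, "*/atm/*", "*", alg="zlib", level=5)
orders = add_args_sketch(orders, "*", "*HD*", alg="zlib", level=2)
for order in orders:
    print(order["variable"], order.get("level"))
```

Here the atm variable containing “HD” ends up with level 2, the other atm variable keeps level 5, and the ocn variable only receives the “HD” rule.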

By default, the output path templates used for writing the time series netCDF files mirror the directory structure of the given HFCollection. A template is an incomplete path string in which only the file prefix is provided, so that the date-time range and variable name can be assigned during generation. To modify the path template, the user may replace substrings. For example, to replace the “/hist/” subdirectory with “/tseries/”:

ts_collection = ts_collection.apply_path_swap(string_match="/hist/", string_swap="/tseries/")

Note that swaps are made using Python's built-in str.replace method, so a match can occur anywhere in the path string, and the match string is taken literally: glob and regular expression patterns are not interpreted.
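The behavior of a plain string replacement on a path template is easy to check directly. The sketch below uses only str.replace on a hypothetical template and shows why the trailing slash in the match string matters and why glob characters have no special meaning.

```python
# Hypothetical path template (file prefix only), for illustration.
template = "/scratch/my_case/atm/hist/case.cam.h0."

# Matching the full "/hist/" segment swaps the directory cleanly.
clean = template.replace("/hist/", "/tseries/")

# Dropping the trailing slash from the match leaves a doubled slash behind.
doubled = template.replace("/hist", "/tseries/")

# Glob characters are treated literally, so this pattern matches nothing.
unchanged = template.replace("*/hist/*", "/tseries/")

print(clean)
print(doubled)
print(unchanged)
```

Since every occurrence of the match string is replaced, it is worth choosing a substring that appears only once in the template.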

TSCollection stores all time series as dictionaries in a list. Each dictionary contains arguments that can be passed to gents.timeseries.generate_time_series to generate a time series file.

print(list(ts_collection))

The above code will print the list of time series dictionaries. By default, TSCollection passes this list of arguments to a ProcessPoolExecutor if num_processes > 1. This allows the user to execute all time series generation functions with a single call:

ts_collection.execute()
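The execution pattern described above, a list of keyword-argument dictionaries mapped over a pool of workers, can be sketched with concurrent.futures. This is a simplified model under stated assumptions: fake_generate_time_series stands in for the real generator, the dictionary keys are hypothetical, and execute_orders is not the TSCollection.execute implementation.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def fake_generate_time_series(variable, output_template):
    """Stand-in for the real generator; only reports the file it would write."""
    return f"{output_template}{variable}.nc"


def execute_orders(orders, num_processes=1, executor_cls=ProcessPoolExecutor):
    """Run every order serially, or submit each one as an independent task."""
    if num_processes <= 1:
        return [fake_generate_time_series(**order) for order in orders]
    with executor_cls(max_workers=num_processes) as pool:
        futures = [pool.submit(fake_generate_time_series, **order) for order in orders]
        # Results are collected in submission order.
        return [future.result() for future in futures]


# Hypothetical orders, for illustration only.
orders = [
    {"variable": "T", "output_template": "/out/atm/case.cam.h0."},
    {"variable": "PRECT", "output_template": "/out/atm/case.cam.h0."},
]

# Threads are used here only so the demo also runs in interactive sessions;
# the default executor_cls uses separate processes.
results = execute_orders(orders, num_processes=2, executor_cls=ThreadPoolExecutor)
print(results)
```

Because the tasks share no state, the same list of dictionaries can be handed to any executor-style interface.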

Custom Dask Workflows with TSCollection

The list-type interface of TSCollection allows the user to directly modify the inputs to gents.timeseries.generate_time_series and build custom workflows if necessary. For example, if using Dask:

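from dask import delayed
from dask.distributed import LocalCluster
from gents.timeseries import generate_time_series

# Start a local Dask cluster of single-threaded workers and connect a client to it
cluster = LocalCluster(n_workers=30, threads_per_worker=1, memory_limit="2GB")
client = cluster.get_client()

# Wrap each time series order in a lazy task; nothing runs until compute is called
delayed_orders = []
for args in ts_collection:
    delayed_orders.append(delayed(generate_time_series)(**args))

# Submit all tasks and block until they have finished
client.compute(delayed_orders, sync=True)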