API

hfcollection.py

Developer: Cameron Cummins Contact: cameron.cummins@utexas.edu Last Header Update: 04/30/25

class gents.hfcollection.HFCollection(hf_dir, num_processes=1, meta_map=None, hf_groups=None, step_map=None, hf_glob_pattern='*.nc*', dask_client=None, multistep_slice_map={})

Manages a collection of netCDF history files and their metadata.

Holds a {path: netCDFMeta | None} mapping and lazily loads metadata on demand via pull_metadata(). All filter and slice operations return new HFCollection instances, preserving an immutable-style fluent API.

check_pulled()

Ensures metadata is loaded, triggering pull_metadata() if necessary.

check_validity()

Removes history files with missing or invalid metadata from the collection.

Iterates over the metadata map and drops any entry where the metadata is None or is_valid() returns False, logging a warning for each removed file.

Returns:

Dictionary of the removed {path: metadata} entries.

Return type:

dict

copy(num_processes=None, meta_map=None, hf_groups=None, step_map=None, multistep_slice_map=None)

Creates a new HFCollection derived from this one with optional overrides.

Shares the same hf_dir as the original. Used as the return mechanism for all filter and transform operations to preserve immutability.

Parameters:
  • num_processes (int or None) – Worker process count for the copy. Defaults to the current value.

  • meta_map (dict or None) – Metadata map to assign to the copy. Defaults to the current map.

  • hf_groups (dict or None) – Group dictionary to assign to the copy. Defaults to the current groups.

  • step_map (dict or None) – Timestep delta map to assign to the copy. Defaults to the current map.

  • multistep_slice_map (dict or None) – Map of multistep slice indices to assign to the copy. Defaults to the current map.

Returns:

New HFCollection instance.

Return type:

HFCollection

exclude(glob_patterns)

Returns a new collection with files whose paths match the patterns removed.

A file is excluded if its path matches any of the provided glob patterns via fnmatch.

Parameters:

glob_patterns (list[str] or str) – One or more fnmatch-style glob patterns. A single string is also accepted.

Returns:

New HFCollection with matching files removed.

Return type:

HFCollection
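
The "matches any pattern" rule can be illustrated with the standard library alone. This is a minimal sketch, not the GenTS implementation: exclude_paths is a hypothetical helper name, and matching against the full path string is an assumption.

```python
from fnmatch import fnmatch
from pathlib import Path


def exclude_paths(paths, glob_patterns):
    """Hypothetical helper: drop every path that matches any pattern."""
    # Accept a single pattern string as well as a list of patterns.
    if isinstance(glob_patterns, str):
        glob_patterns = [glob_patterns]
    return [
        path for path in paths
        if not any(fnmatch(str(path), pattern) for pattern in glob_patterns)
    ]


paths = [
    Path("/data/hist/model.h0.0001-01.nc"),
    Path("/data/hist/model.h1.0001-01.nc"),
    Path("/data/rest/model.r.0001-01.nc"),
]
# Only the h0 file survives: the other two each match one of the patterns.
kept = exclude_paths(paths, ["*.h1.*", "*/rest/*"])
```

Inverting the condition (keeping a path when any pattern matches) gives the behaviour described for include().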

exclude_patterns(glob_patterns)

Deprecated: use exclude() instead.

get_groups(check_fragmented=True)

Returns the dictionary of history file groups.

On the first call, groups are built by sort_hf_groups(). If check_fragmented is True, spatially tiled groups are additionally merged via merge_fragmented_groups(). Subsequent calls return the cached result.

Parameters:

check_fragmented (bool) – If True (default), detect and merge spatially fragmented file groups.

Returns:

Dictionary mapping group ID strings to lists of history file paths.

Return type:

dict[str, list[pathlib.Path]]

get_input_dir()

Returns the head directory this collection was initialised from.

Returns:

Root input directory path.

Return type:

str

get_multistep_slices(hf_path)

Checks if the history file is multistep and returns the slice indices if they are needed.

This matters only for history files whose multiple time steps straddle a year boundary that is being sliced on.

Parameters:

hf_path (pathlib.Path) – Path to the history file.

Returns:

Dictionary mapping slice labels to the start and end indices used to slice the history file, or None if the history file is not multistep or does not need to be sliced.

Return type:

dict or None

get_timestep_delta(hf_path)

Returns the pre-computed time-step duration for a given history file.

Triggers pull_metadata() if metadata has not yet been loaded.

Parameters:

hf_path (pathlib.Path) – Path to the history file.

Returns:

Duration of one time step as a datetime.timedelta object.

Return type:

datetime.timedelta

include(glob_patterns)

Returns a new collection containing only files whose paths match the patterns.

A file is retained if its path matches at least one of the provided glob patterns via fnmatch.

Parameters:

glob_patterns (list[str] or str) – One or more fnmatch-style glob patterns. A single string is also accepted.

Returns:

New HFCollection restricted to matching files.

Return type:

HFCollection

include_patterns(glob_patterns)

Deprecated: use include() instead.

include_years(start_year, end_year, glob_patterns=['*'])

Returns a new collection filtered to files whose midpoint time falls within a year range.

Only files whose paths also match glob_patterns are considered for filtering. The representative year is the midpoint of the first time bound, or the first time value if no bounds are present. Requires metadata to have been loaded.

Parameters:
  • start_year (int) – First year in the range (inclusive).

  • end_year (int) – Last year in the range (inclusive).

  • glob_patterns (list[str]) – Glob patterns restricting which files are subject to the year filter. Defaults to ['*'] (all files).

Returns:

New HFCollection containing only files within the year range.

Return type:

HFCollection

is_pulled()

Returns whether metadata has been loaded for all files in the collection.

Returns:

True if every path has a non-None metadata value, False otherwise.

Return type:

bool

pull_metadata(check_valid=True, raise_errors=False)

Loads metadata for all history files in the collection in parallel.

Submits get_meta_from_path() calls to a ProcessPoolExecutor worker pool and populates the internal metadata map with the results. After loading, computes the timestep delta for each group by sorting all CFTime values and taking the interval between the last two steps.

Parameters:
  • check_valid (bool) – If True (default), calls check_validity() after loading to remove files with incomplete or invalid metadata.

  • raise_errors (bool) – If True, errors are raised rather than just logged. Defaults to False.

slice_groups(slice_size_years=10, start_year=0, pattern='*', time_alignment_method='midpoint')

Returns a new collection with history file groups partitioned into time slices.

For each group (optionally filtered by pattern), determines the year range via get_year_bounds(), computes slice boundaries via calculate_year_slices(), and assigns each file to the appropriate sub-group based on its midpoint year. Sub-group keys are suffixed with [sorting_pivot]<start>-<end> to carry the year range through to TSCollection.

Time alignment methods:

  • 'midpoint' (default): midpoint of the first time bound.

  • 'direct_time': raw time coordinate values (ignores bounds).

  • 'start_bound': lower edge of the first time bound.

  • 'end_bound': upper edge of the first time bound.

Parameters:
  • slice_size_years (int) – Maximum width of each time slice in years. Defaults to 10.

  • start_year (int or None) – Override for the starting year; set to None to begin at the dataset’s own minimum year. Defaults to 0.

  • pattern (str or None) – fnmatch glob to restrict slicing to matching group IDs. Defaults to * (all groups).

  • time_alignment_method (str) – Method used to select the representative time value from each file’s time bounds. Must be one of 'midpoint', 'direct_time', 'start_bound', or 'end_bound'. Defaults to 'midpoint'.

Returns:

New HFCollection with sliced groups embedded.

Return type:

HFCollection

sort_along_time()

Returns a new HFCollection with files sorted by their first time value.

Returns:

New HFCollection with the metadata map re-ordered in ascending time order.

Return type:

HFCollection

gents.hfcollection.calculate_year_slices(slice_size_years, min_year, max_year)

Computes non-overlapping year-range tuples covering a given span.

Each slice is at most slice_size_years wide. The upper boundary is aligned by rounding max_year up to the next multiple of slice_size_years. Returns a single tuple if the full span fits within one slice.

Parameters:
  • slice_size_years (int) – Maximum width of each slice in years.

  • min_year (int) – First year in the range (inclusive).

  • max_year (int) – Last year in the range (inclusive).

Returns:

List of (start_year, end_year) tuples, one per slice.

Return type:

list[tuple[int, int]]

Raises:

ValueError – If max_year is less than min_year.
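
One plausible reading of this description is sketched below. It is not the GenTS implementation: year_slices_sketch is a hypothetical name, and starting the first slice at min_year, treating both ends as inclusive, and capping the last slice at the rounded-up boundary are all assumptions.

```python
import math


def year_slices_sketch(slice_size_years, min_year, max_year):
    """Hypothetical sketch of non-overlapping (start, end) year slices."""
    if max_year < min_year:
        raise ValueError("max_year must not be less than min_year")
    # Assumed alignment: round max_year up to a multiple of the slice size.
    upper = math.ceil(max_year / slice_size_years) * slice_size_years
    slices = []
    start = min_year
    while start <= max_year:
        # Each slice is at most slice_size_years wide, ends inclusive.
        end = min(start + slice_size_years - 1, upper)
        slices.append((start, end))
        start = end + 1
    return slices


three_slices = year_slices_sketch(10, 1, 25)
single_slice = year_slices_sketch(10, 0, 9)
```

Under these assumptions, a span that fits inside one slice yields a single tuple, which matches the last sentence of the description above.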

gents.hfcollection.check_config(config)

Validates that a configuration dictionary contains the required keys and types.

Asserts the presence of 'name', 'include', and 'exclude' keys and that their values are of the expected types.

Parameters:

config (dict) – Configuration dictionary to validate.

Raises:

AssertionError – If any required key is missing or has an unexpected type.

gents.hfcollection.check_groups_by_variables(sliced_groups)

Filters history file groups to ensure variable-set consistency within each group.

For each group, calls filter_by_variables() to identify the majority variable set, discards minority files with a logged warning, and re-sorts the retained files by time via sort_metas_by_time(). Groups for which no majority can be determined are dropped entirely with a warning.

Parameters:

sliced_groups (dict) – Dictionary mapping group IDs to lists of netCDFMeta objects.

Returns:

Filtered dictionary containing only the majority-consistent metadata objects per group, sorted by time.

Return type:

dict

gents.hfcollection.filter_by_variables(meta_datasets)

Identifies the majority variable set among a list of history file metadata objects.

Groups metadata objects by their sorted variable-name fingerprint and returns the set belonging to the most common variable list alongside any outliers.

Parameters:

meta_datasets (list[gents.meta.netCDFMeta]) – List of metadata objects to examine.

Returns:

Tuple of (majority, others) where majority is the list of metadata objects sharing the most common variable set and others contains the remainder. others is None if all objects share the same variable set.

Return type:

tuple[list, list or None]
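
The fingerprint-and-majority idea can be sketched as follows. FakeMeta is a stand-in for netCDFMeta that exposes only get_variables(), and majority_by_variables is a hypothetical name. Resolving ties in favour of the first fingerprint seen is an assumption; the library may treat ties differently.

```python
from collections import defaultdict


class FakeMeta:
    """Stand-in for netCDFMeta exposing only get_variables()."""

    def __init__(self, path, variables):
        self.path = path
        self.variables = variables

    def get_variables(self):
        return self.variables


def majority_by_variables(metas):
    """Hypothetical sketch: split metas into (majority, others)."""
    buckets = defaultdict(list)
    for meta in metas:
        # Sorting makes the fingerprint independent of variable order.
        fingerprint = tuple(sorted(meta.get_variables()))
        buckets[fingerprint].append(meta)
    if len(buckets) <= 1:
        return list(metas), None
    # Largest bucket wins; ties go to the first bucket seen (assumption).
    majority_key = max(buckets, key=lambda key: len(buckets[key]))
    others = [
        meta
        for key, group in buckets.items()
        if key != majority_key
        for meta in group
    ]
    return buckets[majority_key], others


metas = [
    FakeMeta("a.nc", ["T", "PS"]),
    FakeMeta("b.nc", ["PS", "T"]),
    FakeMeta("c.nc", ["T"]),
]
majority, others = majority_by_variables(metas)
```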

gents.hfcollection.find_all_indices(string, substring)

Returns all start indices where substring occurs within string.

Uses a sliding-window str.find loop so overlapping occurrences are all reported.

Parameters:
  • string (str) – The string to search in.

  • substring (str) – The substring to search for.

Returns:

List of integer indices where substring begins.

Return type:

list[int]
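
A sliding str.find loop of the kind described above could look like this. It is a minimal sketch under the stated behaviour, with find_all_indices_sketch as a hypothetical name rather than the library function itself.

```python
def find_all_indices_sketch(string, substring):
    """Hypothetical sketch: every start index of substring in string."""
    indices = []
    start = string.find(substring)
    while start != -1:
        indices.append(start)
        # Advance by one character (not by len(substring)) so that
        # overlapping occurrences are also found.
        start = string.find(substring, start + 1)
    return indices


overlapping = find_all_indices_sketch("aaaa", "aa")
dots = find_all_indices_sketch("model.h0.0001-01.nc", ".")
```

Advancing by a single character is what distinguishes this from a search that skips past each match and would miss overlaps.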

gents.hfcollection.find_files(head_path, pattern)

Recursively searches a directory tree for files matching a glob pattern.

Walks head_path with os.walk and collects every file whose name matches pattern via fnmatch.

Parameters:
  • head_path (str or pathlib.Path) – Root directory to begin the recursive search from.

  • pattern (str) – fnmatch-style wildcard pattern to match file names against (e.g. '*.nc').

Returns:

Sorted list of matching file paths.

Return type:

list[pathlib.Path]
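
The walk-and-match approach can be sketched with os.walk and fnmatch. find_files_sketch is a hypothetical name, and the temporary directory layout exists only for illustration.

```python
import os
import tempfile
from fnmatch import fnmatch
from pathlib import Path


def find_files_sketch(head_path, pattern):
    """Hypothetical sketch: recursively collect files matching pattern."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(head_path):
        for filename in filenames:
            # The pattern is applied to the file name, not the full path.
            if fnmatch(filename, pattern):
                matches.append(Path(dirpath) / filename)
    return sorted(matches)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "atm" / "hist").mkdir(parents=True)
    (root / "atm" / "hist" / "model.h0.0001-01.nc").touch()
    (root / "atm" / "hist" / "notes.txt").touch()
    (root / "model.h0.0001-02.nc").touch()
    found = [
        path.relative_to(root).as_posix()
        for path in find_files_sketch(root, "*.nc")
    ]
```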

gents.hfcollection.generate_output_template(hf_head_dir, group_path_id, output_head_dir=None, directory_swaps={'hist': 'tseries'}, filename_delimiter='.', cutoff_index=None)

Constructs a time-series output path template from a history file group path.

Builds the output path (excluding the variable-name and timestamp suffix) by extracting the subdirectory structure relative to hf_head_dir, applying any directory_swaps renames, and stripping date tokens from the filename prefix up to cutoff_index.

Parameters:
  • hf_head_dir (str) – Head directory used when reading the history files.

  • group_path_id (str or pathlib.Path) – Group path pattern produced by sort_hf_groups() (e.g. '/data/hist/model.h0*').

  • output_head_dir (str or None) – Alternate head directory for output. Defaults to None (uses hf_head_dir).

  • directory_swaps (dict) – Mapping of directory name substrings to replace (e.g. {'hist': 'tseries'}). Defaults to {'hist': 'tseries'}.

  • filename_delimiter (str) – Delimiter used to split the filename into tokens. Defaults to '.'.

  • cutoff_index (int or None) – Character index at which to truncate the filename prefix. Defaults to None (cuts at the last delimiter occurrence).

Returns:

Path template for time-series output (without variable/timestamp suffix).

Return type:

pathlib.Path

gents.hfcollection.get_default_config()

Returns a configuration dictionary populated with default GenTS settings.

Returns:

Dictionary with name="default", include=None, exclude=None.

Return type:

dict

gents.hfcollection.get_year_bounds(hf_to_meta_map)

Determines the minimum and maximum year covered by a set of history files.

Uses the midpoint of each time bound (or the time value itself if no bounds are present) to determine which year each file belongs to.

Parameters:

hf_to_meta_map (dict) – Dictionary mapping file paths to their netCDFMeta objects.

Returns:

Tuple of (min_year, max_year) as integers.

Return type:

tuple[int, int]

gents.hfcollection.is_ds_within_years(ds_meta, min_year, max_year)

Checks whether a dataset’s representative time falls within a year range.

Uses the midpoint of the first time bound as the representative year, or the first time value directly if no time bounds are present.

Parameters:
  • ds_meta (gents.meta.netCDFMeta) – Metadata object for the dataset to check.

  • min_year (int) – Lower bound of the year range (inclusive).

  • max_year (int) – Upper bound of the year range (inclusive).

Returns:

True if the representative year falls within [min_year, max_year], False otherwise.

Return type:

bool
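
The effect of using the bound midpoint rather than the time stamp can be seen in a small sketch. It uses datetime.datetime as a stand-in for CFTime objects and plain lists in place of netCDFMeta; representative_year and within_years are hypothetical names.

```python
from datetime import datetime


def representative_year(times, time_bounds=None):
    """Hypothetical sketch: year of the first bound's midpoint, else first time."""
    if time_bounds is not None:
        lower, upper = time_bounds[0]
        return (lower + (upper - lower) / 2).year
    return times[0].year


def within_years(times, time_bounds, min_year, max_year):
    """Inclusive range check on the representative year."""
    return min_year <= representative_year(times, time_bounds) <= max_year


# A December 1999 monthly mean stamped at the end of its averaging window.
times = [datetime(2000, 1, 1)]
bounds = [(datetime(1999, 12, 1), datetime(2000, 1, 1))]

with_bounds = within_years(times, bounds, 2000, 2009)
without_bounds = within_years(times, None, 2000, 2009)
```

With bounds, the midpoint falls in mid-December 1999, so the file is outside 2000 to 2009. Without bounds, the raw time stamp places it in 2000.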

gents.hfcollection.merge_fragmented_groups(hf_groups, hf_meta_map)

Merges spatially fragmented (tiled) history file groups into unified groups.

Iterates through hf_groups and separates fragmented files (identified by paths that do not end with .nc) from standard files. Fragmented groups are hashed by their non-time dimension bounds; groups sharing the same hash are merged into a single entry under a new wildcard key. Non-fragmented files are passed through unchanged.

Parameters:
  • hf_groups (dict) – Dictionary mapping group pattern strings to lists of history file paths.

  • hf_meta_map (dict) – Dictionary mapping file paths to their netCDFMeta objects (used to retrieve dimension bounds).

Returns:

New group dictionary with fragmented groups merged.

Return type:

dict

Raises:

KeyError – If a merged fragmented group label already exists among the non-fragmented groups.

gents.hfcollection.sort_hf_groups(hf_paths, delimiter='.', substring_index=2)

Groups history file paths by directory and shared filename prefix.

Files are first grouped by their parent directory, then within each directory by a common filename prefix derived by dropping the last substring_index delimiter-delimited tokens from each filename.

For example, model.h0.0001-01.nc and model.h0.0001-02.nc share the prefix model.h0 and end up in the same group.

Parameters:
  • hf_paths (list[pathlib.Path]) – List of history file paths to group.

  • delimiter (str) – Token delimiter used to parse the filename prefix. Defaults to '.'.

  • substring_index (int) – Number of trailing delimiter-separated tokens to strip when deriving the group prefix. Defaults to 2.

Returns:

Dictionary mapping '<parent_dir>/<prefix>*' pattern strings to lists of matching file paths.

Return type:

dict[str, list[pathlib.Path]]
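
The grouping rule can be sketched with pathlib alone. group_by_prefix is a hypothetical name, the file paths are illustrative, and the sketch assumes substring_index is at least 1.

```python
from collections import defaultdict
from pathlib import Path


def group_by_prefix(hf_paths, delimiter=".", substring_index=2):
    """Hypothetical sketch: group paths by parent directory and name prefix."""
    groups = defaultdict(list)
    for path in hf_paths:
        tokens = path.name.split(delimiter)
        # Drop the trailing tokens (by default the date token and extension).
        prefix = delimiter.join(tokens[:-substring_index])
        groups[f"{path.parent.as_posix()}/{prefix}*"].append(path)
    return dict(groups)


paths = [
    Path("/data/hist/model.h0.0001-01.nc"),
    Path("/data/hist/model.h0.0001-02.nc"),
    Path("/data/hist/model.h1.0001-01-01.nc"),
]
groups = group_by_prefix(paths)
```

Here the two h0 files share the key '/data/hist/model.h0*', while the h1 file forms its own group.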

gents.hfcollection.sort_metas_by_time(metas)

Returns a new list of metadata objects sorted by their first CFTime value.

Performs an insertion sort; the original list is not modified.

Parameters:

metas (list[gents.meta.netCDFMeta]) – Unsorted list of metadata objects.

Returns:

New list sorted in ascending time order.

Return type:

list[gents.meta.netCDFMeta]

meta.py

Developer: Cameron Cummins Contact: cameron.cummins@utexas.edu Last Header Update: 07/03/25

gents.meta.get_attributes(dataset)

Extracts all attributes from a netCDF4 dataset or variable into a dictionary.

Parameters:

dataset – A netCDF4.Dataset, netCDF4.MFDataset, or netCDF4.Variable object from which to read attributes.

Returns:

Dictionary mapping attribute names to their values.

Return type:

dict

gents.meta.get_meta_from_path(path: str)

Opens a netCDF file, constructs a netCDFMeta object, and returns it.

Serves as a picklable factory wrapper around netCDFMeta so that instances can be created inside ProcessPoolExecutor worker processes. Any exception raised during construction is re-raised with the file path appended to the message for easier debugging.

Parameters:

path (str) – Path to the netCDF history file.

Returns:

Metadata object populated from the specified file.

Return type:

netCDFMeta

Raises:

Exception – Re-raises any exception from netCDFMeta.__init__ with the file path appended to the message.

gents.meta.get_time_variables_names(ds)

Locates the time and time-bounds variable names in a netCDF dataset.

Performs a case-insensitive scan of all variable names and returns the canonical name of the time variable and, if present, the name of the corresponding time-bounds variable (time_bnds, time_bnd, time_bounds, or time_bound).

Parameters:

ds (netCDF4.Dataset) – Open netCDF4 dataset to inspect.

Returns:

Tuple of (time_name, time_bounds_name). Either element is None if the corresponding variable is not found.

Return type:

tuple[str or None, str or None]
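
A case-insensitive scan of this kind can be sketched over a plain list of variable names. find_time_names is a hypothetical name, and returning each name exactly as it is spelled in the file is an assumption about what "canonical" means here.

```python
BOUNDS_NAMES = ("time_bnds", "time_bnd", "time_bounds", "time_bound")


def find_time_names(variable_names):
    """Hypothetical sketch: locate the time and time-bounds variable names."""
    time_name = None
    bounds_name = None
    for name in variable_names:
        lowered = name.lower()
        if lowered == "time":
            time_name = name
        elif lowered in BOUNDS_NAMES:
            bounds_name = name
    return time_name, bounds_name


mixed_case = find_time_names(["lat", "lon", "Time", "TIME_BNDS", "T"])
no_time = find_time_names(["lat", "lon"])
```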

gents.meta.is_var_secondary(variable, secondary_vars: list = ['time_bnds', 'time_bnd', 'time_bounds', 'time_bound'], secondary_dims: list = ['nbnd', 'chars', 'string_length', 'hist_interval'], max_num_dims: int = 1, primary_dims: list = ['time']) → bool

Classifies a netCDF variable as secondary or primary.

Secondary variables (e.g. coordinate and auxiliary fields such as time, time_bnds, lat, lon) are written unchanged into every time-series output file. Primary variables (multi-dimensional, time-varying scientific fields) each warrant their own time-series output file.

Rules are evaluated in order:

  1. Variable name is in secondary_vars → secondary.

  2. Any dimension name is in secondary_dims → secondary.

  3. Variable has more than max_num_dims dimensions and none are in primary_dims → secondary.

  4. Otherwise the variable is primary (has a time dimension and more than one dimension total).

Parameters:
  • variable (netCDF4._netCDF4.Variable) – netCDF4 variable object to classify.

  • secondary_vars (list) – Variable names that are unconditionally secondary. Defaults to ['time_bnds', 'time_bnd', 'time_bounds', 'time_bound'].

  • secondary_dims (list) – Dimension names whose presence makes a variable secondary. Defaults to ['nbnd', 'chars', 'string_length', 'hist_interval'].

  • max_num_dims (int) – Maximum number of dimensions a variable may have before the primary_dims check is applied. Defaults to 1.

  • primary_dims (list) – Dimension names whose presence keeps a variable primary. Defaults to ['time'].

Returns:

True if the variable is secondary, False if primary.

Return type:

bool

class gents.meta.netCDFMeta(ds, path: str)

Stores metadata extracted from a single netCDF history file.

Caches time values (as raw floats and as CFTime objects), optional time-bounds values, global file attributes, variable lists partitioned into primary vs. secondary sets, and per-dimension coordinate bounds. Instances are constructed by get_meta_from_path() and consumed throughout gents.hfcollection.

get_attributes()

Returns the global attributes dictionary cached from the history file.

Returns:

Dictionary mapping global attribute names to their values.

Return type:

dict

get_cftime_bounds()

Returns the time-bounds array as CFTime objects.

Returns:

Array of CFTime bound pairs, or None if the history file contains no time-bounds variable.

Return type:

numpy.ndarray or None

get_cftimes()

Returns the time values converted to CFTime objects.

Returns:

Array of CFTime datetime objects corresponding to each time step.

Return type:

numpy.ndarray

get_dim_bounds()

Returns coordinate bounds for each dimension in the history file.

For each dimension that has an associated coordinate variable, maps the dimension name to a single-element list holding the value for a scalar coordinate, or to [min_value, max_value] for a range. Used by merge_fragmented_groups() to identify spatial extent when merging tiled files.

Returns:

Dictionary mapping dimension names to their coordinate bound lists.

Return type:

dict

get_float_time_bounds()

Returns the time-bounds array as raw float values.

Returns:

Array of float time-bound pairs, or None if the history file contains no time-bounds variable.

Return type:

numpy.ndarray or None

get_float_times()

Returns the raw float time values read from the time variable.

Returns:

1-D array of float time values.

Return type:

numpy.ndarray

get_path()

Returns the file-system path of the history file this object was built from.

Returns:

Path to the source history file.

Return type:

str

get_primary_variables()

Returns the names of primary variables in the history file.

Primary variables are multi-dimensional, time-varying scientific fields that each warrant their own time-series output file.

Returns:

List of primary variable name strings.

Return type:

list

get_secondary_variables()

Returns the names of secondary variables in the history file.

Secondary variables are coordinate and auxiliary fields (e.g. time, time_bnds, lat, lon) that are written unchanged into every time-series output file.

Returns:

List of secondary variable name strings.

Return type:

list

get_variable_dims(variable)

Returns the dimension names for the given variable.

Parameters:

variable (str) – Name of the variable to look up.

Returns:

Tuple or list of dimension name strings for the variable.

Return type:

tuple

get_variable_dtype(variable)

Returns the data type of the given variable.

Parameters:

variable (str) – Name of the variable to look up.

Returns:

NumPy dtype describing the element type of the variable.

Return type:

numpy.dtype

get_variable_shapes(variable)

Returns the shape of the given variable.

Parameters:

variable (str) – Name of the variable to look up.

Returns:

Tuple of integers describing the size of each dimension.

Return type:

tuple

get_variables()

Returns the full list of variable names present in the history file.

Returns:

List of all variable name strings.

Return type:

list

is_valid()

Returns whether this history file is usable for time-series generation.

A file is considered invalid if any of the following are true:

  • Both get_cftime_bounds() and get_cftimes() are None (no usable time coordinate).

  • The file contains zero primary and zero secondary variables.

  • A gents_version global attribute is present (the file is already a GenTS-generated time-series output, not a raw history file).

Returns:

True if the file is valid for processing, False otherwise.

Return type:

bool

timeseries.py

Developer: Cameron Cummins Contact: cameron.cummins@utexas.edu Last Header Update: 01/31/25

class gents.timeseries.TSCollection(hf_collection, output_dir, ts_orders=None, num_processes=None, dask_client=None)

Manages the set of time-series generation orders derived from an HFCollection.

Each order is a dictionary describing one output file: source history file paths, output path template, primary variable name, secondary variable names, and generation arguments (compression, overwrite flag, etc.). All modifier methods return new TSCollection instances, preserving an immutable-style fluent API.

add_args(path_glob='*', var_glob='*', level=None, alg=None, overwrite=None)

Updates generation arguments on orders that match both filters.

Only arguments that are not None are applied; others are left unchanged.

Parameters:
  • path_glob (str) – fnmatch glob applied to source history file paths. Defaults to '*'.

  • var_glob (str) – fnmatch glob applied to primary variable names. Defaults to '*'.

  • level (int or None) – netCDF4 compression level (0–9). Defaults to None (unchanged).

  • alg (str or None) – netCDF4 compression algorithm (e.g. 'zlib'). Defaults to None (unchanged).

  • overwrite (bool or None) – Overwrite flag to apply. Defaults to None (unchanged).

Returns:

New TSCollection with updated order arguments.

Return type:

TSCollection

add_attrs(attrs)

Adds attributes to all output time series files associated with this collection.

Parameters:

attrs (dict) – Dictionary of key/value strings to append to the output netCDF files.

Returns:

New TSCollection with updated attributes.

Return type:

TSCollection

append_timestep_dirs(var_glob='*')

Inserts a time-step frequency subdirectory into each matching order’s output path.

Determines the frequency label from the group’s timestep delta: 'hour_N', 'day_N', 'month_N', or 'year_N'. The label is inserted as a new directory level immediately before the filename in the output path template, organising outputs by observation frequency.

Parameters:

var_glob (str) – fnmatch glob applied to primary variable names. Defaults to '*'.

Returns:

New TSCollection with updated output path templates.

Return type:

TSCollection

apply_compression(level, alg, path_glob, var_glob='*')

Applies compression settings to matching time-series orders.

Convenience wrapper around add_args().

Parameters:
  • level (int) – netCDF4 compression level (0–9).

  • alg (str) – netCDF4 compression algorithm (e.g. 'zlib').

  • path_glob (str) – fnmatch glob applied to source history file paths.

  • var_glob (str) – fnmatch glob applied to primary variable names. Defaults to '*'.

Returns:

New TSCollection with compression arguments applied.

Return type:

TSCollection

apply_overwrite(path_glob, var_glob='*')

Sets the overwrite flag on matching time-series orders.

Convenience wrapper around add_args() with overwrite=True.

Parameters:
  • path_glob (str) – fnmatch glob applied to source history file paths.

  • var_glob (str) – fnmatch glob applied to primary variable names. Defaults to '*'.

Returns:

New TSCollection with overwrite enabled on matching orders.

Return type:

TSCollection

apply_path_swap(string_match, string_swap, path_glob='*', var_glob='*')

Replaces a substring in the output path template of matching orders.

Iterates over orders whose source paths match path_glob and replaces string_match with string_swap in each order’s ts_path_template. Used to redirect outputs to a different directory structure (e.g. '/hist/' → '/proc/tseries/').

Parameters:
  • string_match (str) – Substring to find in the output path template.

  • string_swap (str) – Replacement string.

  • path_glob (str) – fnmatch glob applied to source history file paths. Defaults to '*'.

  • var_glob (str) – fnmatch glob applied to primary variable names. Defaults to '*'.

Returns:

New TSCollection with updated path templates.

Return type:

TSCollection

copy(hf_collection=None, output_dir=None, ts_orders=None, num_processes=None)

Creates a new TSCollection derived from this one with optional overrides.

Used as the return mechanism for all modifier methods to preserve immutability.

Parameters:
  • hf_collection (gents.hfcollection.HFCollection or None) – HFCollection to assign to the copy. Defaults to the current collection.

  • output_dir (str or None) – Output directory to assign to the copy. Defaults to the current directory.

  • ts_orders (list or None) – Order list to assign to the copy. Defaults to the current orders.

  • num_processes (int or None) – Worker process count for the copy. Defaults to the current value.

Returns:

New TSCollection instance.

Return type:

TSCollection

create_directories(exist_ok=True)

Creates the output directory tree for all time-series orders.

Parameters:

exist_ok (bool) – If True (default), no error is raised when a directory already exists.

exclude(path_glob, var_glob='')

Returns a new collection with orders that match both filters removed.

An order is excluded if any of its source paths matches path_glob and its primary variable matches var_glob.

Parameters:
  • path_glob (str) – fnmatch glob applied to source history file paths.

  • var_glob (str) – fnmatch glob applied to primary variable names. Defaults to ''.

Returns:

New TSCollection with matching orders removed.

Return type:

TSCollection

execute(optimize=True, optimize_batch_n=200, raise_errors=False)

Executes all time-series generation orders in parallel.

When optimize=True (default), orders that share the same first source file are batched together (up to optimize_batch_n per batch) so that generate_time_series() opens each group of history files only once and writes multiple primary-variable output files per worker invocation, significantly reducing file I/O overhead.

When optimize=False, each order is submitted as a separate worker task (one file open per variable).

Parameters:
  • optimize (bool) – If True (default), batch orders sharing the same source files into single worker calls.

  • optimize_batch_n (int) – Maximum number of variables per optimised batch. Defaults to 200.

  • raise_errors (bool) – If True, errors are raised rather than just logged. Defaults to False.

Returns:

List of paths to all generated time-series output files.

Return type:

list[str]
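
The batching idea behind optimize=True can be sketched as follows. This is not the GenTS scheduler: batch_orders is a hypothetical name, and the order dictionaries use assumed keys ('hf_paths', 'primary_var') purely for illustration.

```python
from collections import defaultdict


def batch_orders(orders, batch_n=200):
    """Hypothetical sketch: batch orders that share the same first source file."""
    by_first_source = defaultdict(list)
    for order in orders:
        by_first_source[order["hf_paths"][0]].append(order)
    batches = []
    for group in by_first_source.values():
        # Split each group into chunks of at most batch_n orders.
        for start in range(0, len(group), batch_n):
            batches.append(group[start:start + batch_n])
    return batches


orders = [
    {"hf_paths": ["a.0001.nc", "a.0002.nc"], "primary_var": name}
    for name in ("T", "U", "V", "Q", "PS")
]
orders.append({"hf_paths": ["b.0001.nc"], "primary_var": "SST"})

batches = batch_orders(orders, batch_n=2)
batch_sizes = [len(batch) for batch in batches]
```

Each batch then corresponds to one worker call that opens the shared history files once and writes several variables.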

get_hf_collection()

Returns the underlying HFCollection.

Returns:

The history file collection this TSCollection was derived from.

Return type:

gents.hfcollection.HFCollection

get_output_dir()

Returns the output directory path for generated time series files.

Returns:

Absolute path to the output directory.

Return type:

str

include(path_glob, var_glob='*')

Returns a new collection containing only orders that match both filters.

An order is retained if at least one of its source paths matches path_glob and its primary variable matches var_glob.

Parameters:
  • path_glob (str) – fnmatch glob applied to source history file paths.

  • var_glob (str) – fnmatch glob applied to primary variable names. Defaults to '*'.

Returns:

New TSCollection restricted to matching orders.

Return type:

TSCollection

remove_overwrite(path_glob, var_glob='*')

Clears the overwrite flag on matching time-series orders.

Convenience wrapper around add_args() with overwrite=False.

Parameters:
  • path_glob (str) – fnmatch glob applied to source history file paths.

  • var_glob (str) – fnmatch glob applied to primary variable names. Defaults to '*'.

Returns:

New TSCollection with overwrite disabled on matching orders.

Return type:

TSCollection

update_ts_orders(strfrmt_kwargs={}, time_alignment_method='midpoint')

Rebuilds the time-series order list and returns a new TSCollection.

Re-derives one order per primary variable per history file group, applying strfrmt_kwargs to override individual timestamp format strings and time_alignment_method to control which point within each time bound is used when computing start_time / end_time for the output filename.

Time alignment methods:

  • 'midpoint' (default): midpoint of the first time bound.

  • 'direct_time': raw time coordinate values (ignores bounds).

  • 'start_bound': lower edge of the first time bound.

  • 'end_bound': upper edge of the first time bound.

Parameters:
  • strfrmt_kwargs (dict) – Format-string overrides forwarded to get_timestamp_format() (e.g. {'monthly_format': '%Y%m%d'}). Defaults to {}.

  • time_alignment_method (str) – Method used to select the representative time value from each file’s time bounds. Must be one of 'midpoint', 'direct_time', 'start_bound', or 'end_bound'. Defaults to 'midpoint'.

Returns:

A new TSCollection with the rebuilt order list.

Return type:

TSCollection

Raises:

ValueError – If time_alignment_method is not one of the accepted values.

gents.timeseries.check_timeseries_conform(ts_path: str)

Checks whether a time-series file meets the GenTS chunking conventions.

A conforming file satisfies:

  • The time variable is stored contiguously (chunk sizes equal shape).

  • Every multi-dimensional variable is either stored contiguously, or its per-time-step chunk occupies at least 4 MiB.

Parameters:

ts_path (str) – Path to the time-series netCDF file to inspect.

Returns:

True if the file conforms to the chunking conventions, False otherwise.

Return type:

bool
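
One possible reading of the per-variable rule is sketched below. It assumes that "contiguous" means either no chunking or chunk sizes equal to the shape, and that the 4 MiB threshold applies to the byte size of one chunk. variable_conforms() is a hypothetical helper that works on plain tuples rather than an open netCDF file.

```python
FOUR_MIB = 4 * 1024 ** 2


def chunk_nbytes(chunks, itemsize):
    """Return the size in bytes of a single chunk."""
    nbytes = itemsize
    for length in chunks:
        nbytes *= length
    return nbytes


def variable_conforms(shape, chunks, itemsize):
    """Apply the rule above; chunks is None for contiguous storage."""
    if chunks is None or tuple(chunks) == tuple(shape):
        return True
    return chunk_nbytes(chunks, itemsize) >= FOUR_MIB


# float32 variable with dimensions (time, lat, lon) = (120, 192, 288).
print(variable_conforms((120, 192, 288), (19, 192, 288), 4))  # True
print(variable_conforms((120, 192, 288), (1, 192, 288), 4))   # False
```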

gents.timeseries.check_timeseries_integrity(ts_path: str)

Checks whether a time-series file was written completely by GenTS.

Opens the file and looks for the gents_version global attribute, which is stamped on every successfully completed output file.

Parameters:

ts_path (str) – Path to the time-series netCDF file to inspect.

Returns:

True if gents_version is present (file likely complete), False if absent or the file cannot be opened (possible corruption).

Return type:

bool

gents.timeseries.generate_time_series(hf_paths, ts_path_template, secondary_vars, ts_args)

Generates time-series files for a group of history files.

Opens an MHFDataset over hf_paths, pre-loads all secondary variable data, then calls write_timeseries_file() for each primary variable described in ts_args.

Parameters:
  • hf_paths (list[str or pathlib.Path]) – Paths to the history files forming the group.

  • ts_path_template (str) – Output path prefix (without variable name or timestamp suffix).

  • secondary_vars (list[str]) – Names of secondary variables to read and embed in every output file.

  • ts_args (dict) – Dictionary mapping each primary variable name to a dict of keyword arguments for write_timeseries_file(). Each inner dict must include a 'ts_string' key for the timestamp suffix.

Returns:

List of paths to the generated time-series files.

Return type:

list[str]

gents.timeseries.get_timestamp_format(dt, subhour_format='%Y%m%d%H%M%S', hourly_format='%Y%m%d%H', daily_format='%Y%m%d', monthly_format='%Y%m', yearly_format='%Y')

Returns a strftime format string appropriate for a given time-step duration.

Parameters:
  • dt (datetime.timedelta) – Duration of a single model time step.

  • subhour_format (str) – Format string for sub-hour time steps (< 1 h). Defaults to '%Y%m%d%H%M%S'.

  • hourly_format (str) – Format string for hour-level time steps (< 24 h). Defaults to '%Y%m%d%H'.

  • daily_format (str) – Format string for day-level time steps (< 28 days). Defaults to '%Y%m%d'.

  • monthly_format (str) – Format string for month-level time steps (< 12 months). Defaults to '%Y%m'.

  • yearly_format (str) – Format string for year-level time steps. Defaults to '%Y'.

Returns:

strftime-compatible format string.

Return type:

str
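
The selection can be sketched as a chain of threshold checks. The thresholds below follow the parameter descriptions, with two assumptions labeled in the code: the sub-hour cutoff is taken as 1 hour, and "12 months" is approximated as 365 days. pick_format() is a hypothetical stand-in, not the GenTS function.

```python
from datetime import datetime, timedelta


def pick_format(dt, subhour_format="%Y%m%d%H%M%S", hourly_format="%Y%m%d%H",
                daily_format="%Y%m%d", monthly_format="%Y%m", yearly_format="%Y"):
    """Return a strftime format matched to the time-step duration dt."""
    if dt < timedelta(hours=1):   # assumption: sub-hour means shorter than 1 h
        return subhour_format
    if dt < timedelta(hours=24):
        return hourly_format
    if dt < timedelta(days=28):
        return daily_format
    if dt < timedelta(days=365):  # assumption: 12 months taken as 365 days
        return monthly_format
    return yearly_format


stamp = datetime(2000, 1, 15)
print(stamp.strftime(pick_format(timedelta(days=30))))  # 200001
print(stamp.strftime(pick_format(timedelta(hours=6))))  # 2000011500
```

Passing a keyword such as monthly_format='%Y%m%d' overrides only that tier, which is how the strfrmt_kwargs dictionary of update_ts_orders() is described as being used.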

gents.timeseries.write_timeseries_file(agg_hf_ds, ts_out_path, primary_var, secondary_vars_data, overwrite=False, complevel=0, compression=None, ts_start_index=None, ts_end_index=None, append_attrs=None)

Writes a single time-series netCDF file for one primary variable.

Behaviour when the output file already exists:

  • overwrite=True: the existing file is deleted and recreated.

  • overwrite=False: check_timeseries_integrity() is called; if the file passes the integrity check it is returned immediately (skipped); otherwise the corrupt file is deleted and recreated.

The primary variable is written with adaptive chunk sizes: files smaller than 4 MiB are stored contiguously, while larger files are chunked along the time axis to keep each chunk near 4 MiB. Secondary variables are written with their full shape as chunk sizes. On completion, a gents_version entry is stamped into the global attributes.

Parameters:
  • agg_hf_ds (gents.mhfdataset.MHFDataset) – Open MHFDataset providing aggregated data for the history file group.

  • ts_out_path (str) – Full output path for the time-series file.

  • primary_var (str) – Name of the primary variable to extract, or 'auxiliary' to write only secondary variables.

  • secondary_vars_data (dict) – Pre-loaded secondary variable data as a {var_name: numpy.ndarray} dictionary.

  • overwrite (bool) – If True, overwrite any existing file. Defaults to False.

  • complevel (int) – netCDF4 compression level (0–9). Defaults to 0 (no compression).

  • compression (str or None) – netCDF4 compression algorithm (e.g. 'zlib'). Defaults to None.

  • ts_start_index (int or None) – Time index to start reading from aggregated history files. If None, read from the first time step for the full aggregation. Defaults to None.

  • ts_end_index (int or None) – Time index to stop reading from aggregated history files. If None, read to the last time step for the full aggregation. Defaults to None.

  • append_attrs (dict or None) – Attributes to append to the output netCDF file. Defaults to None.

Returns:

Path to the written (or skipped) output file.

Return type:

str
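
The adaptive chunking rule can be sketched as follows. This is an illustration under stated assumptions, not the GenTS code: time is taken to be the first axis, the 4 MiB test is applied to the primary variable's total size, and the number of time steps per chunk is rounded up so that each chunk reaches 4 MiB. plan_chunks() is a hypothetical helper.

```python
import math

TARGET_CHUNK_BYTES = 4 * 1024 ** 2  # 4 MiB


def plan_chunks(shape, itemsize):
    """Return None for contiguous storage, else chunk sizes with time as axis 0."""
    total_bytes = itemsize * math.prod(shape)
    if total_bytes < TARGET_CHUNK_BYTES:
        return None  # small enough to store contiguously
    # Bytes needed for one time step of the variable.
    step_bytes = itemsize * math.prod(shape[1:])
    # Round up so a chunk is at least 4 MiB, but never longer than the time axis.
    steps_per_chunk = min(shape[0], math.ceil(TARGET_CHUNK_BYTES / step_bytes))
    return (steps_per_chunk,) + tuple(shape[1:])


print(plan_chunks((12, 10, 10), 4))      # None
print(plan_chunks((120, 192, 288), 4))   # (19, 192, 288)
```

Only the time axis is split in this sketch, so reading a full spatial field for one time step touches a single chunk.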

class gents.mhfdataset.MHFDataset(hf_paths)

Aggregating dataset interface over a group of related history files.

Presents multiple history files (covering the same time range and/or different spatial tiles) as a single virtual dataset. All file handles are opened together on open() (or __enter__) and closed together on close() (or __exit__).

close()

Closes all open netCDF4 file handles.

get_global_attrs()

Returns a merged dictionary of global attributes from all files in the group.

Attributes from later files overwrite those from earlier files when keys conflict.

Returns:

Dictionary mapping global attribute names to their values.

Return type:

dict
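
The "later files win" merge rule corresponds to repeated dict.update() calls. The sketch below uses plain dictionaries in place of netCDF global attributes; merge_attrs() is a hypothetical name.

```python
def merge_attrs(attr_dicts):
    """Merge attribute dictionaries in file order; later values overwrite earlier ones."""
    merged = {}
    for attrs in attr_dicts:
        merged.update(attrs)
    return merged


first = {"case": "b.e21", "time_period_freq": "month_1"}
second = {"case": "b.e21.rerun", "source": "CAM"}
print(merge_attrs([first, second]))
# {'case': 'b.e21.rerun', 'time_period_freq': 'month_1', 'source': 'CAM'}
```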

get_time_vals()

Returns the sorted array of unique float time values across the group.

Returns:

1-D array of sorted, unique float time values.

Return type:

numpy.ndarray

get_var_attrs(var_name)

Returns the attribute dictionary for a variable from the first file in the group.

Parameters:

var_name (str) – Name of the variable to inspect.

Returns:

Dictionary mapping attribute names to their values.

Return type:

dict

get_var_data_shape(var_name)

Returns the full expected output shape of a variable across the entire group.

Accounts for the total number of aggregated time steps and, for fragmented groups, the combined spatial extents. Returns a single-element list for coordinate variables.

Parameters:

var_name (str) – Name of the variable to inspect.

Returns:

List of dimension sizes representing the aggregated output shape.

Return type:

list[int]

get_var_dimensions(var_name)

Returns the dimension names for a variable, read from the first file in the group.

Parameters:

var_name (str) – Name of the variable to inspect.

Returns:

List of dimension name strings in the order they appear on the variable.

Return type:

list[str]

get_var_dtype(var_name)

Returns the NumPy dtype of a variable, read from the first file in the group.

Parameters:

var_name (str) – Name of the variable to inspect.

Returns:

NumPy dtype of the variable.

Return type:

numpy.dtype

get_var_vals(var_name, time_index_start=0, time_index_end=None)

Reads and returns a variable’s data across the group for a time slice.

Two execution paths are used depending on fragmentation:

  • Non-fragmented: iterates over the requested time values and reads each time step from the appropriate single file.

  • Fragmented: for each time step, reads from all spatial-tile files and inserts each tile into the correct slice of a pre-allocated output array by matching tile coordinate values against the combined coordinate map.

Parameters:
  • var_name (str) – Name of the variable to read.

  • time_index_start (int) – Index of the first time step to include (inclusive). Defaults to 0.

  • time_index_end (int or None) – Index of the last time step to include (exclusive). Defaults to None (all remaining time steps).

Returns:

Array containing the variable data for the requested time slice.

Return type:

numpy.ndarray

is_fragmented()

Returns whether the group consists of spatially fragmented (tiled) files.

Returns:

True if the first time value is covered by more than one file, False otherwise.

Return type:

bool

is_time_consistent()

Checks that every time step is covered by the same number of files.

Required for spatially fragmented groups to ensure every tile is present for every time step.

Returns:

True if all time values have the same fragment count, False otherwise.

Return type:

bool

open()

Opens all history file handles and builds the internal time mapping.

Constructs __time_mapping: a dictionary from each unique float time value to the list of file indices that contain it. Raises an exception if the number of files per time step is not consistent across all time values (i.e. fragmentation is inconsistent).

Raises:

Exception – If the spatial fragmentation is not consistent over time.
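
The time mapping and the two checks built on it (is_fragmented() and is_time_consistent()) can be sketched with plain lists of time values, one list per file. The function names below mirror the documented methods but are stand-alone illustrations, not the class implementation.

```python
def build_time_mapping(times_per_file):
    """Map each unique float time value to the indices of the files containing it."""
    mapping = {}
    for file_index, times in enumerate(times_per_file):
        for t in times:
            mapping.setdefault(float(t), []).append(file_index)
    return dict(sorted(mapping.items()))


def is_time_consistent(mapping):
    """True if every time value is covered by the same number of files."""
    return len({len(indices) for indices in mapping.values()}) <= 1


def is_fragmented(mapping):
    """True if the first time value is covered by more than one file."""
    first_time = next(iter(mapping))
    return len(mapping[first_time]) > 1


tiled = build_time_mapping([[0.0, 1.0], [0.0, 1.0]])   # two tiles, same times
sequential = build_time_mapping([[0.0], [1.0]])        # one file per time step
broken = build_time_mapping([[0.0, 1.0], [0.0]])       # a tile is missing at t=1.0

print(tiled)                       # {0.0: [0, 1], 1.0: [0, 1]}
print(is_fragmented(tiled))        # True
print(is_fragmented(sequential))   # False
print(is_time_consistent(broken))  # False
```

In this sketch the "broken" case is the situation in which open() is documented to raise an exception.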

gents.mhfdataset.get_concat_coords(hf_datasets)

Builds a combined coordinate map across all datasets in a spatially fragmented group.

For each dimension across all open datasets:

  • If the dimension has a coordinate variable, its values are merged and de-duplicated with numpy.unique across all files.

  • If there is no coordinate variable, a 0-indexed integer range matching the dimension size is used.

Parameters:

hf_datasets (list[netCDF4.Dataset]) – List of open netCDF4.Dataset objects from the group.

Returns:

Dictionary mapping dimension names to their combined coordinate arrays.

Return type:

dict
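
A dependency-free sketch of the merge is shown below. It swaps numpy.unique for sorted(set(...)), which likewise sorts and de-duplicates, and it represents each dataset as a plain dictionary with hypothetical 'dims' and 'coords' keys instead of a netCDF4.Dataset.

```python
def concat_coords(datasets):
    """Combine coordinates across tiles.

    Each dataset is a dict with 'dims' ({name: size}) and
    'coords' ({name: list of coordinate values}).
    """
    combined = {}
    for ds in datasets:
        for dim, size in ds["dims"].items():
            if dim in ds["coords"]:
                # Merge and de-duplicate coordinate values across files.
                merged = set(combined.get(dim, [])) | set(ds["coords"][dim])
                combined[dim] = sorted(merged)
            elif dim not in combined:
                # No coordinate variable: fall back to a 0-indexed range.
                combined[dim] = list(range(size))
    return combined


tile_a = {"dims": {"lat": 2, "lon": 2, "nbnd": 2},
          "coords": {"lat": [0.0, 1.0], "lon": [10.0, 20.0]}}
tile_b = {"dims": {"lat": 2, "lon": 2, "nbnd": 2},
          "coords": {"lat": [2.0, 3.0], "lon": [10.0, 20.0]}}

print(concat_coords([tile_a, tile_b]))
# {'lat': [0.0, 1.0, 2.0, 3.0], 'lon': [10.0, 20.0], 'nbnd': [0, 1]}
```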

utils.py

Developer: Cameron Cummins Contact: cameron.cummins@utexas.edu Last Header Update: 07/03/25

class gents.utils.ProgressBar(total, length=40, label='')

Terminal progress bar for visualising long-running loops.

Displays a continuously-updated bar, percentage, item count, and elapsed time by overwriting a single terminal line in place.

step()

Advances the progress bar by one iteration and redraws the terminal line.

Writes a newline once the counter reaches total.
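
The in-place redraw can be sketched with a carriage return. The class below is a hypothetical re-creation built from the description (same constructor arguments, a step() method), not the GenTS source; the bar characters and the layout of the line are assumptions.

```python
import sys
import time


class SketchProgressBar:
    """Minimal progress bar that overwrites a single terminal line."""

    def __init__(self, total, length=40, label=""):
        self.total = total
        self.length = length
        self.label = label
        self.count = 0
        self.start = time.time()

    def render(self):
        """Build the text for the current state of the bar."""
        fraction = self.count / self.total if self.total else 1.0
        filled = int(self.length * fraction)
        bar = "#" * filled + "-" * (self.length - filled)
        elapsed = time.time() - self.start
        return (f"{self.label} [{bar}] {fraction:6.1%} "
                f"({self.count}/{self.total}) {elapsed:.1f}s")

    def step(self):
        """Advance by one iteration and redraw the line in place."""
        self.count += 1
        sys.stdout.write("\r" + self.render())
        if self.count >= self.total:
            sys.stdout.write("\n")  # finish the line once the loop is done
        sys.stdout.flush()


bar = SketchProgressBar(total=4, length=8, label="demo")
for _ in range(4):
    bar.step()
```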

gents.utils.enable_logging(verbose=False, output_path=None)

Configures the gents package logger and begins emitting log messages.

With verbose=True, the log level is set to LOG_LEVEL_IO_WARNING (5), which enables per-file I/O trace messages. With the default verbose=False, the level is DEBUG (10), which suppresses those low-level traces. The installed GenTS version is logged immediately on initialisation.

Parameters:
  • verbose (bool) – If True, enable per-file I/O trace messages at LOG_LEVEL_IO_WARNING level. Defaults to False.

  • output_path (str or None) – Optional file path to additionally write log output to. Defaults to None (stdout only).
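
The level arithmetic can be reproduced with the standard logging module. The sketch below is an assumption-laden stand-in: the logger name 'gents_sketch', the level name 'IO_WARNING', and the returned logger are illustrative choices, and only the numeric levels (5 and 10) come from the description above.

```python
import logging
import sys

LOG_LEVEL_IO_WARNING = 5  # numeric value taken from the description above


def enable_sketch_logging(verbose=False, output_path=None, name="gents_sketch"):
    """Configure a logger whose threshold depends on the verbose flag."""
    logging.addLevelName(LOG_LEVEL_IO_WARNING, "IO_WARNING")  # hypothetical name
    logger = logging.getLogger(name)
    logger.setLevel(LOG_LEVEL_IO_WARNING if verbose else logging.DEBUG)
    logger.handlers.clear()
    logger.addHandler(logging.StreamHandler(sys.stdout))
    if output_path is not None:
        # Additionally write log records to a file.
        logger.addHandler(logging.FileHandler(output_path))
    return logger


logger = enable_sketch_logging(verbose=False)
logger.log(LOG_LEVEL_IO_WARNING, "opened file.nc")  # below DEBUG, so suppressed
logger.debug("visible at the default level")
```

Because 5 is lower than DEBUG (10), level-5 records pass the threshold only when verbose=True lowers it.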

gents.utils.get_time_stamp()

Returns the current system date and time as a formatted string.

Returns:

Date-time string formatted as 'YYYY-MM-DD HH:MM'.

Return type:

str

gents.utils.get_version()

Returns the version string of the installed gents package.

Returns:

Package version string (e.g. '1.0.0').

Return type:

str

gents.utils.log_hfcollection_info(hfc)

Logs summary statistics for an HFCollection at INFO level.

Iterates over all groups in the collection to compute aggregate metrics and identify outliers. Requires metadata to have been pulled (calls hfc.check_pulled()). A progress bar is displayed on stdout during the scan.

Statistics logged:

  • Input directory and total number of history files found.

  • Number of output groups formed.

  • Total mapped data volume in TB and GB.

  • Group with the most variables.

  • Group with the most history files.

  • Variable with the largest single-timestep memory footprint (shape, dimensions, and size in MB).

Parameters:

hfc (gents.hfcollection.HFCollection) – A pulled HFCollection instance to inspect.

gents.utils.log_tscollection_info(tsc)

Logs summary statistics for a TSCollection at INFO level.

Iterates over all time series orders in the collection to compute aggregate metrics and identify the largest output file. Auxiliary-only orders are skipped. A progress bar is displayed on stdout during the scan.

Statistics logged:

  • Output directory and total number of time series files to generate.

  • Largest time series file by estimated total size, including the sample history file path, variable name, shape, dimensions, number of source history files, and projected size in GB.

Parameters:

tsc (gents.timeseries.TSCollection) – A TSCollection instance to inspect.

gents.cli.main()

Entry point for the gents command-line interface.

Performs the following steps:

  1. Calls parse_arguments() to obtain the parsed CLI namespace.

  2. Defaults outputdir to hf_head_dir when -o is not supplied.

  3. Selects the appropriate model configuration:

    • --model e3sm → imports run_config() (E3SM).

    • --model cesm3 → imports run_config() (CESM3).

  4. If --verbose is set, prints a summary of all active settings to stdout.

  5. Delegates execution to the selected run_config(args) function.

gents.cli.parse_arguments()

Parses command-line arguments for the gents CLI entry point.

Constructs an argparse.ArgumentParser with all supported flags and positional arguments, then parses sys.argv and returns the resulting namespace.

Supported arguments:

  • hf_head_dir (positional): Path to the head directory containing history files.

  • -o / --outputdir: Output directory for time-series files (defaults to hf_head_dir if omitted).

  • -v / --verbose: Enable verbose console output.

  • -V / --version: Print the installed gents version and exit.

  • -d / --dryrun: Parse metadata only; do not write time-series files.

  • -w / --overwrite: Overwrite existing time-series output files.

  • -sl / --slice: Maximum length of individual time-series files in years (default 10).

  • -hc / --hfcores: Maximum number of cores for parallel metadata reads (default 64).

  • -tc / --tscores: Maximum number of cores for parallel time-series writes (default 8).

  • -m / --model: Model default configuration to apply ('CESM3', 'CESM2', or 'E3SM'; default 'none').

  • --exclude: Glob pattern to exclude; may be specified multiple times. Overrides the model default unless --append is also set.

  • --include: Glob pattern to include; may be specified multiple times. Overrides the model default unless --append is also set.

  • --append: Append --exclude/--include filters to the model default configuration instead of replacing them.

Returns:

Namespace object populated with parsed argument values.

Return type:

argparse.Namespace
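
A parser with the documented flags and defaults can be reconstructed roughly as follows. This is a sketch built from the argument list above, not the GenTS source: the integer types, the use of action='append' for repeatable flags, and the omission of -V / --version are assumptions. The last lines mirror the outputdir defaulting step described for main().

```python
import argparse


def build_parser():
    """Build an argument parser mirroring the documented gents flags."""
    parser = argparse.ArgumentParser(prog="gents")
    parser.add_argument("hf_head_dir")
    parser.add_argument("-o", "--outputdir", default=None)
    parser.add_argument("-v", "--verbose", action="store_true")
    parser.add_argument("-d", "--dryrun", action="store_true")
    parser.add_argument("-w", "--overwrite", action="store_true")
    parser.add_argument("-sl", "--slice", type=int, default=10)
    parser.add_argument("-hc", "--hfcores", type=int, default=64)
    parser.add_argument("-tc", "--tscores", type=int, default=8)
    parser.add_argument("-m", "--model", default="none")
    # Repeatable flags collect their values into a list.
    parser.add_argument("--exclude", action="append", default=None)
    parser.add_argument("--include", action="append", default=None)
    parser.add_argument("--append", action="store_true")
    return parser


args = build_parser().parse_args(
    ["/scratch/case/hist", "-sl", "5", "--exclude", "*.rs.*", "--exclude", "*.i.*"]
)
# Fall back to the input directory when -o is not supplied.
if args.outputdir is None:
    args.outputdir = args.hf_head_dir
print(args.outputdir, args.slice, args.exclude)
```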