API¶
hfcollection.py
Developer: Cameron Cummins Contact: cameron.cummins@utexas.edu Last Header Update: 04/30/25
- class gents.hfcollection.HFCollection(hf_dir, num_processes=1, meta_map=None, hf_groups=None, step_map=None, hf_glob_pattern='*.nc*', dask_client=None, multistep_slice_map={})¶
Manages a collection of netCDF history files and their metadata.
Holds a {path: netCDFMeta | None} mapping and lazily loads metadata on demand via pull_metadata(). All filter and slice operations return new HFCollection instances, preserving an immutable-style fluent API.
- check_pulled()¶
Ensures metadata is loaded, triggering pull_metadata() if necessary.
- check_validity()¶
Removes history files with missing or invalid metadata from the collection.
Iterates over the metadata map and drops any entry where the metadata is None or is_valid() returns False, logging a warning for each removed file.
- Returns:
Dictionary of the removed {path: metadata} entries.
- Return type:
dict
- copy(num_processes=None, meta_map=None, hf_groups=None, step_map=None, multistep_slice_map=None)¶
Creates a new HFCollection derived from this one with optional overrides.
Shares the same hf_dir as the original. Used as the return mechanism for all filter and transform operations to preserve immutability.
- Parameters:
num_processes (int or None) – Worker process count for the copy. Defaults to the current value.
meta_map (dict or None) – Metadata map to assign to the copy. Defaults to the current map.
hf_groups (dict or None) – Group dictionary to assign to the copy. Defaults to the current groups.
step_map (dict or None) – Timestep delta map to assign to the copy. Defaults to the current map.
multistep_slice_map (dict or None) – Slice-index map for multistep files to assign to the copy. Defaults to the current map.
- Returns:
New HFCollection instance.
- Return type:
gents.hfcollection.HFCollection
- exclude(glob_patterns)¶
Returns a new collection with files whose paths match the patterns removed.
A file is excluded if its path matches any of the provided glob patterns via fnmatch.
- Parameters:
glob_patterns (list[str] or str) – One or more fnmatch-style glob patterns. A single string is also accepted.
- Returns:
New HFCollection with matching files removed.
- Return type:
gents.hfcollection.HFCollection
- get_groups(check_fragmented=True)¶
Returns the dictionary of history file groups.
On the first call, groups are built by sort_hf_groups(). If check_fragmented is True, spatially tiled groups are additionally merged via merge_fragmented_groups(). Subsequent calls return the cached result.
- Parameters:
check_fragmented (bool) – If True (default), detect and merge spatially fragmented file groups.
- Returns:
Dictionary mapping group ID strings to lists of history file paths.
- Return type:
dict[str, list[pathlib.Path]]
- get_input_dir()¶
Returns the head directory this collection was initialised from.
- Returns:
Root input directory path.
- Return type:
str
- get_multistep_slices(hf_path)¶
Checks whether the history file is multistep and returns the slice indices if they are needed.
This only matters for history files with multiple time steps that overlap a yearly boundary being sliced on.
- Parameters:
hf_path (pathlib.Path) – Path to the history file.
- Returns:
Dictionary with slice labels as keys and the start and end indices to slice the history file by as values, or None if the history file is not multistep or doesn’t need to be sliced.
- Return type:
dict or None
- get_timestep_delta(hf_path)¶
Returns the pre-computed time-step duration for a given history file.
Triggers pull_metadata() if metadata has not yet been loaded.
- Parameters:
hf_path (pathlib.Path) – Path to the history file.
- Returns:
Duration of one time step as a cftime timedelta object.
- Return type:
datetime.timedelta
- include(glob_patterns)¶
Returns a new collection containing only files whose paths match the patterns.
A file is retained if its path matches at least one of the provided glob patterns via fnmatch.
- Parameters:
glob_patterns (list[str] or str) – One or more fnmatch-style glob patterns. A single string is also accepted.
- Returns:
New HFCollection restricted to matching files.
- Return type:
gents.hfcollection.HFCollection
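The include() and exclude() filters above both come down to fnmatch-style matching of file paths. The following is a minimal standard-library sketch of that matching logic, not the GenTS implementation; the helper names and sample paths are made up for illustration.

```python
from fnmatch import fnmatch


def _as_list(glob_patterns):
    # Accept a single pattern string as well as a list of patterns.
    return [glob_patterns] if isinstance(glob_patterns, str) else list(glob_patterns)


def include_paths(paths, glob_patterns):
    """Keep paths that match at least one pattern (hypothetical helper)."""
    patterns = _as_list(glob_patterns)
    return [p for p in paths if any(fnmatch(str(p), pat) for pat in patterns)]


def exclude_paths(paths, glob_patterns):
    """Drop paths that match any pattern (hypothetical helper)."""
    patterns = _as_list(glob_patterns)
    return [p for p in paths if not any(fnmatch(str(p), pat) for pat in patterns)]


# Illustrative paths only.
paths = [
    "/data/atm/hist/model.h0.0001-01.nc",
    "/data/atm/hist/model.h1.0001-01.nc",
    "/data/ocn/hist/ocean.h.0001-01.nc",
]
atm_only = include_paths(paths, "*/atm/*")
no_h1 = exclude_paths(paths, ["*.h1.*"])
```

Both helpers return new lists and leave the input untouched, mirroring the immutable style described for HFCollection.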
- include_years(start_year, end_year, glob_patterns=['*'])¶
Returns a new collection filtered to files whose midpoint time falls within a year range.
Only files whose paths also match glob_patterns are considered for filtering. The representative year is the midpoint of the first time bound, or the first time value if no bounds are present. Requires metadata to have been loaded.
- Parameters:
start_year (int) – First year in the range (inclusive).
end_year (int) – Last year in the range (inclusive).
glob_patterns (list[str]) – Glob patterns restricting which files are subject to the year filter. Defaults to ['*'] (all files).
- Returns:
New HFCollection containing only files within the year range.
- Return type:
gents.hfcollection.HFCollection
- is_pulled()¶
Returns whether metadata has been loaded for all files in the collection.
- Returns:
True if every path has a non-None metadata value, False otherwise.
- Return type:
bool
- pull_metadata(check_valid=True, raise_errors=False)¶
Loads metadata for all history files in the collection in parallel.
Submits get_meta_from_path() calls to a ProcessPoolExecutor worker pool and populates the internal metadata map with the results. After loading, computes the timestep delta for each group by sorting all CFTime values and taking the interval between the last two steps.
- Parameters:
check_valid (bool) – If True (default), calls check_validity() after loading to remove files with incomplete or invalid metadata.
raise_errors (bool) – If True (default False), errors are raised rather than just logged.
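The timestep-delta step described above (sort the time values, then take the interval between the last two) can be sketched as follows. This is an illustration only: standard-library datetime objects stand in for CFTime values, the function name is hypothetical, and raising on fewer than two values is a choice of the sketch rather than documented GenTS behaviour.

```python
from datetime import datetime, timedelta


def timestep_delta_sketch(times):
    """Return the interval between the last two time values after sorting."""
    ordered = sorted(times)
    if len(ordered) < 2:
        # Assumption of this sketch: a single value gives no usable delta.
        raise ValueError("need at least two time values to compute a delta")
    return ordered[-1] - ordered[-2]


# Unordered daily time stamps (illustrative values).
times = [datetime(2000, 1, 3), datetime(2000, 1, 1), datetime(2000, 1, 2)]
delta = timestep_delta_sketch(times)
```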
- slice_groups(slice_size_years=10, start_year=0, pattern='*', time_alignment_method='midpoint')¶
Returns a new collection with history file groups partitioned into time slices.
For each group (optionally filtered by pattern), determines the year range via get_year_bounds(), computes slice boundaries via calculate_year_slices(), and assigns each file to the appropriate sub-group based on its midpoint year. Sub-group keys are suffixed with [sorting_pivot]<start>-<end> to carry the year range through to TSCollection.
Time alignment methods:
'midpoint' (default): midpoint of the first time bound.
'direct_time': raw time coordinate values (ignores bounds).
'start_bound': lower edge of the first time bound.
'end_bound': upper edge of the first time bound.
- Parameters:
slice_size_years (int) – Maximum width of each time slice in years. Defaults to 10.
start_year (int or None) – Override for the starting year; set to None to begin at the dataset’s own minimum year. Defaults to 0.
pattern (str or None) – fnmatch glob to restrict slicing to matching group IDs. Defaults to '*' (all groups).
time_alignment_method (str) – Method used to select the representative time value from each file’s time bounds. Must be one of 'midpoint', 'direct_time', 'start_bound', or 'end_bound'. Defaults to 'midpoint'.
- Returns:
New HFCollection with sliced groups embedded.
- Return type:
gents.hfcollection.HFCollection
- sort_along_time()¶
Returns a new HFCollection with files sorted by their first time value.
- Returns:
New HFCollection with the metadata map re-ordered in ascending time order.
- Return type:
gents.hfcollection.HFCollection
- gents.hfcollection.calculate_year_slices(slice_size_years, min_year, max_year)¶
Computes non-overlapping year-range tuples covering a given span.
Each slice is at most slice_size_years wide. The upper boundary is aligned by rounding max_year up to the next multiple of slice_size_years. Returns a single tuple if the full span fits within one slice.
- Parameters:
slice_size_years (int) – Maximum width of each slice in years.
min_year (int) – First year in the range (inclusive).
max_year (int) – Last year in the range (inclusive).
- Returns:
List of (start_year, end_year) tuples, one per slice.
- Return type:
list[tuple[int, int]]
- Raises:
ValueError – If max_year is less than min_year.
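To make the slicing idea concrete, here is a small sketch of partitioning a year span into non-overlapping windows whose boundaries fall on multiples of the slice size. The exact boundary conventions of calculate_year_slices() are not spelled out above, so the sketch assumes inclusive (start, end) tuples, a first slice that starts at min_year, and a final slice that ends on the year before the next multiple of slice_size_years. The function name is hypothetical and the real output may differ.

```python
def year_slices_sketch(slice_size_years, min_year, max_year):
    """Partition [min_year, max_year] into aligned, inclusive year windows."""
    if slice_size_years < 1:
        raise ValueError("slice_size_years must be positive")
    if max_year < min_year:
        raise ValueError("max_year must not be less than min_year")
    slices = []
    start = min_year
    while start <= max_year:
        # End on the year just before the next multiple of the slice size,
        # so no window is wider than slice_size_years.
        end = (start // slice_size_years + 1) * slice_size_years - 1
        slices.append((start, end))
        start = end + 1
    return slices


decades = year_slices_sketch(10, 1995, 2014)
single = year_slices_sketch(10, 2001, 2008)
```

Under these assumptions a span that fits inside one aligned window yields a single tuple, and the first window may be shorter than the rest.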
- gents.hfcollection.check_config(config)¶
Validates that a configuration dictionary contains the required keys and types.
Asserts the presence of 'name', 'include', and 'exclude' keys and that their values are of the expected types.
- Parameters:
config (dict) – Configuration dictionary to validate.
- Raises:
AssertionError – If any required key is missing or has an unexpected type.
- gents.hfcollection.check_groups_by_variables(sliced_groups)¶
Filters history file groups to ensure variable-set consistency within each group.
For each group, calls filter_by_variables() to identify the majority variable set, discards minority files with a logged warning, and re-sorts the retained files by time via sort_metas_by_time(). Groups for which no majority can be determined are dropped entirely with a warning.
- Parameters:
sliced_groups (dict) – Dictionary mapping group IDs to lists of netCDFMeta objects.
- Returns:
Filtered dictionary containing only the majority-consistent metadata objects per group, sorted by time.
- Return type:
dict
- gents.hfcollection.filter_by_variables(meta_datasets)¶
Identifies the majority variable set among a list of history file metadata objects.
Groups metadata objects by their sorted variable-name fingerprint and returns the set belonging to the most common variable list alongside any outliers.
- Parameters:
meta_datasets (list[gents.meta.netCDFMeta]) – List of metadata objects to examine.
- Returns:
Tuple of (majority, others) where majority is the list of metadata objects sharing the most common variable set and others contains the remainder. others is None if all objects share the same variable set.
- Return type:
tuple[list, list or None]
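The majority-vote idea behind filter_by_variables() can be sketched with plain dictionaries. In this illustration each file is represented by its name and a list of variable names rather than a netCDFMeta object. The function name is hypothetical, and ties are resolved by first occurrence, which is an assumption of the sketch rather than documented behaviour.

```python
from collections import defaultdict


def split_by_majority(variable_lists):
    """Split files into (majority, others) by their sorted variable fingerprint.

    variable_lists maps a file name to its list of variable names.
    """
    by_fingerprint = defaultdict(list)
    for name, variables in variable_lists.items():
        # Sorting makes the fingerprint independent of variable order.
        by_fingerprint[tuple(sorted(variables))].append(name)
    majority_key = max(by_fingerprint, key=lambda key: len(by_fingerprint[key]))
    majority = by_fingerprint[majority_key]
    others = [
        name
        for key, names in by_fingerprint.items()
        if key != majority_key
        for name in names
    ]
    # Mirror the documented convention: None when there are no outliers.
    return majority, (others or None)


files = {"a.nc": ["T", "U"], "b.nc": ["U", "T"], "c.nc": ["T"]}
majority, others = split_by_majority(files)
```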
- gents.hfcollection.find_all_indices(string, substring)¶
Returns all start indices where substring occurs within string.
Uses a sliding-window str.find loop so overlapping occurrences are all reported.
- Parameters:
string (str) – The string to search in.
substring (str) – The substring to search for.
- Returns:
List of integer indices where substring begins.
- Return type:
list[int]
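A sliding-window str.find loop of the kind described above might look like the following. It is a sketch written from the description, not a copy of the library source, and the function name carries a _sketch suffix to make that clear.

```python
def find_all_indices_sketch(string, substring):
    """Collect every start index of substring, including overlapping matches."""
    indices = []
    start = string.find(substring)
    while start != -1:
        indices.append(start)
        # Advance by one character (not by len(substring)) so that
        # overlapping occurrences are found as well.
        start = string.find(substring, start + 1)
    return indices


overlapping = find_all_indices_sketch("aaaa", "aa")
dots = find_all_indices_sketch("model.h0.0001-01.nc", ".")
```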
- gents.hfcollection.find_files(head_path, pattern)¶
Recursively searches a directory tree for files matching a glob pattern.
Walks head_path with os.walk and collects every file whose name matches pattern via fnmatch.
- Parameters:
head_path (str or pathlib.Path) – Root directory to begin the recursive search from.
pattern (str) – fnmatch-style wildcard pattern to match file names against (e.g. '*.nc').
- Returns:
Sorted list of matching file paths.
- Return type:
list[pathlib.Path]
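The os.walk plus fnmatch combination described for find_files() can be sketched as below. The function name and the temporary directory layout are illustrative assumptions; the pattern is matched against file names only, as stated above.

```python
import os
import tempfile
from fnmatch import fnmatch
from pathlib import Path


def find_files_sketch(head_path, pattern):
    """Recursively collect files under head_path whose names match pattern."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(head_path):
        for name in filenames:
            if fnmatch(name, pattern):
                matches.append(Path(dirpath) / name)
    return sorted(matches)


# Build a throwaway directory tree to search (illustrative layout).
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "atm" / "hist").mkdir(parents=True)
    (root / "atm" / "hist" / "model.h0.0001-01.nc").touch()
    (root / "atm" / "hist" / "notes.txt").touch()
    (root / "top.nc").touch()
    found = [p.relative_to(root).as_posix() for p in find_files_sketch(root, "*.nc")]
```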
- gents.hfcollection.generate_output_template(hf_head_dir, group_path_id, output_head_dir=None, directory_swaps={'hist': 'tseries'}, filename_delimiter='.', cutoff_index=None)¶
Constructs a time-series output path template from a history file group path.
Builds the output path (excluding the variable-name and timestamp suffix) by extracting the subdirectory structure relative to hf_head_dir, applying any directory_swaps renames, and stripping date tokens from the filename prefix up to cutoff_index.
- Parameters:
hf_head_dir (str) – Head directory used when reading the history files.
group_path_id (str or pathlib.Path) – Group path pattern produced by sort_hf_groups() (e.g. '/data/hist/model.h0*').
output_head_dir (str or None) – Alternate head directory for output. Defaults to None (uses hf_head_dir).
directory_swaps (dict) – Mapping of directory name substrings to replace (e.g. {'hist': 'tseries'}). Defaults to {'hist': 'tseries'}.
filename_delimiter (str) – Delimiter used to split the filename into tokens. Defaults to '.'.
cutoff_index (int or None) – Character index at which to truncate the filename prefix. Defaults to None (cuts at the last delimiter occurrence).
- Returns:
Path template for time-series output (without variable/timestamp suffix).
- Return type:
pathlib.Path
- gents.hfcollection.get_default_config()¶
Returns a configuration dictionary populated with default GenTS settings.
- Returns:
Dictionary with name="default", include=None, exclude=None.
- Return type:
dict
- gents.hfcollection.get_year_bounds(hf_to_meta_map)¶
Determines the minimum and maximum year covered by a set of history files.
Uses the midpoint of each time bound (or the time value itself if no bounds are present) to determine which year each file belongs to.
- Parameters:
hf_to_meta_map (dict) – Dictionary mapping file paths to their netCDFMeta objects.
- Returns:
Tuple of (min_year, max_year) as integers.
- Return type:
tuple[int, int]
- gents.hfcollection.is_ds_within_years(ds_meta, min_year, max_year)¶
Checks whether a dataset’s representative time falls within a year range.
Uses the midpoint of the first time bound as the representative year, or the first time value directly if no time bounds are present.
- Parameters:
ds_meta (gents.meta.netCDFMeta) – Metadata object for the dataset to check.
min_year (int) – Lower bound of the year range (inclusive).
max_year (int) – Upper bound of the year range (inclusive).
- Returns:
True if the representative year falls within [min_year, max_year], False otherwise.
- Return type:
bool
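The midpoint rule matters because many model outputs stamp a time step at the end of its averaging interval. The sketch below illustrates the rule with standard-library datetime objects standing in for CFTime values; the function names are hypothetical and the snippet is not the GenTS implementation.

```python
from datetime import datetime


def representative_year(first_time, first_bound=None):
    """Year of the bound midpoint, or of the time value if no bounds exist."""
    if first_bound is None:
        return first_time.year
    lower, upper = first_bound
    return (lower + (upper - lower) / 2).year


def within_years(year, min_year, max_year):
    """Inclusive range check on the representative year."""
    return min_year <= year <= max_year


# A December 1999 monthly mean stamped at the end of its interval.
stamp = datetime(2000, 1, 1)
bounds = (datetime(1999, 12, 1), datetime(2000, 1, 1))
year_from_bounds = representative_year(stamp, bounds)
year_from_stamp = representative_year(stamp)
```

With bounds available the file is attributed to 1999, whereas the raw time stamp alone would place it in 2000.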
- gents.hfcollection.merge_fragmented_groups(hf_groups, hf_meta_map)¶
Merges spatially fragmented (tiled) history file groups into unified groups.
Iterates through hf_groups and separates fragmented files (identified by paths that do not end with .nc) from standard files. Fragmented groups are hashed by their non-time dimension bounds; groups sharing the same hash are merged into a single entry under a new wildcard key. Non-fragmented files are passed through unchanged.
- Parameters:
hf_groups (dict) – Dictionary mapping group pattern strings to lists of history file paths.
hf_meta_map (dict) – Dictionary mapping file paths to their netCDFMeta objects (used to retrieve dimension bounds).
- Returns:
New group dictionary with fragmented groups merged.
- Return type:
dict
- Raises:
KeyError – If a merged fragmented group label already exists among the non-fragmented groups.
- gents.hfcollection.sort_hf_groups(hf_paths, delimiter='.', substring_index=2)¶
Groups history file paths by directory and shared filename prefix.
Files are first grouped by their parent directory, then within each directory by a common filename prefix derived by dropping the last substring_index delimiter-separated tokens from each filename.
For example, model.h0.0001-01.nc and model.h0.0001-02.nc share the prefix model.h0 and end up in the same group.
- Parameters:
hf_paths (list[pathlib.Path]) – List of history file paths to group.
delimiter (str) – Token delimiter used to parse the filename prefix. Defaults to '.'.
substring_index (int) – Number of trailing delimiter-separated tokens to strip when deriving the group prefix. Defaults to 2.
- Returns:
Dictionary mapping '<parent_dir>/<prefix>*' pattern strings to lists of matching file paths.
- Return type:
dict[str, list[pathlib.Path]]
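The prefix-based grouping can be sketched in a few lines. This is an illustration written from the description above, with a hypothetical function name. It uses PurePosixPath so the example behaves the same on any platform, and it assumes substring_index is at least 1.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def group_by_prefix(paths, delimiter=".", substring_index=2):
    """Group paths by parent directory and filename prefix."""
    groups = defaultdict(list)
    for raw in paths:
        path = PurePosixPath(raw)
        tokens = path.name.split(delimiter)
        # Drop the trailing tokens (by default the date stamp and extension).
        prefix = delimiter.join(tokens[:-substring_index])
        groups[f"{path.parent}/{prefix}*"].append(path)
    return dict(groups)


groups = group_by_prefix(
    [
        "/data/hist/model.h0.0001-01.nc",
        "/data/hist/model.h0.0001-02.nc",
        "/data/hist/model.h1.0001-01.nc",
    ]
)
group_sizes = {key: len(members) for key, members in groups.items()}
```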
- gents.hfcollection.sort_metas_by_time(metas)¶
Returns a new list of metadata objects sorted by their first CFTime value.
Performs an insertion sort; the original list is not modified.
- Parameters:
metas (list[gents.meta.netCDFMeta]) – Unsorted list of metadata objects.
- Returns:
New list sorted in ascending time order.
- Return type:
list[gents.meta.netCDFMeta]
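A non-mutating insertion sort of the kind described above could be written as follows. The sketch sorts small dictionaries by a numeric first_time field in place of netCDFMeta objects and their first CFTime value; both the function name and the record layout are assumptions made for illustration.

```python
def insertion_sorted(items, key):
    """Return a new list sorted by key using insertion sort."""
    result = []
    for item in items:
        index = len(result)
        # Walk left past every element that should come after the new item.
        while index > 0 and key(result[index - 1]) > key(item):
            index -= 1
        result.insert(index, item)
    return result


# Stand-ins for metadata objects (illustrative records).
metas = [
    {"path": "b.nc", "first_time": 2.0},
    {"path": "a.nc", "first_time": 1.0},
    {"path": "c.nc", "first_time": 3.0},
]
ordered = insertion_sorted(metas, key=lambda meta: meta["first_time"])
ordered_paths = [meta["path"] for meta in ordered]
```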
meta.py
Developer: Cameron Cummins Contact: cameron.cummins@utexas.edu Last Header Update: 07/03/25
- gents.meta.get_attributes(dataset)¶
Extracts all attributes from a netCDF4 dataset or variable into a dictionary.
- Parameters:
dataset – A netCDF4.Dataset, netCDF4.MFDataset, or netCDF4.Variable object from which to read attributes.
- Returns:
Dictionary mapping attribute names to their values.
- Return type:
dict
- gents.meta.get_meta_from_path(path: str)¶
Opens a netCDF file, constructs a netCDFMeta object, and returns it.
Serves as a picklable factory wrapper around netCDFMeta so that instances can be created inside ProcessPoolExecutor worker processes. Any exception raised during construction is re-raised with the file path appended to the message for easier debugging.
- Parameters:
path (str) – Path to the netCDF history file.
- Returns:
Metadata object populated from the specified file.
- Return type:
gents.meta.netCDFMeta
- Raises:
Exception – Re-raises any exception from netCDFMeta.__init__ with the file path appended to the message.
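The pattern of attaching the file path to a failure can be sketched as below. For simplicity the sketch wraps the original error in a RuntimeError and chains it with "from", which is a substitution made here and not necessarily how GenTS re-raises. The loader argument and function names are hypothetical.

```python
def load_with_path_context(path, loader):
    """Call loader(path) and add the path to any error it raises."""
    try:
        return loader(path)
    except Exception as err:
        # Chain the original exception so its traceback is preserved.
        raise RuntimeError(f"{err} (file: {path})") from err


def failing_loader(path):
    # Stand-in for a constructor that rejects the file.
    raise ValueError("no time variable found")


try:
    load_with_path_context("/data/hist/model.h0.0001-01.nc", failing_loader)
except RuntimeError as err:
    message = str(err)
```

A module-level function like this can be pickled by reference, which is what allows such a wrapper to be submitted to worker processes.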
- gents.meta.get_time_variables_names(ds)¶
Locates the time and time-bounds variable names in a netCDF dataset.
Performs a case-insensitive scan of all variable names and returns the canonical name of the time variable and, if present, the name of the corresponding time-bounds variable (time_bnds, time_bnd, time_bounds, or time_bound).
- Parameters:
ds (netCDF4.Dataset) – Open netCDF4 dataset to inspect.
- Returns:
Tuple of (time_name, time_bounds_name). Either element is None if the corresponding variable is not found.
- Return type:
tuple[str or None, str or None]
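The case-insensitive lookup can be illustrated on a plain list of variable names instead of an open dataset. The function name is hypothetical, and returning the first match when several candidates exist is an assumption of this sketch.

```python
BOUNDS_NAMES = ("time_bnds", "time_bnd", "time_bounds", "time_bound")


def find_time_names(variable_names):
    """Return (time_name, bounds_name) using case-insensitive matching."""
    time_name = None
    bounds_name = None
    for name in variable_names:
        lowered = name.lower()
        if lowered == "time" and time_name is None:
            # Keep the name exactly as it is spelled in the file.
            time_name = name
        elif lowered in BOUNDS_NAMES and bounds_name is None:
            bounds_name = name
    return time_name, bounds_name


names = find_time_names(["lat", "Time", "TIME_BNDS", "T"])
missing = find_time_names(["lat", "lon"])
```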
- gents.meta.is_var_secondary(variable, secondary_vars: list = ['time_bnds', 'time_bnd', 'time_bounds', 'time_bound'], secondary_dims: list = ['nbnd', 'chars', 'string_length', 'hist_interval'], max_num_dims: int = 1, primary_dims: list = ['time']) bool¶
Classifies a netCDF variable as secondary or primary.
Secondary variables (e.g. coordinate and auxiliary fields such as time, time_bnds, lat, lon) are written unchanged into every time-series output file. Primary variables (multi-dimensional, time-varying scientific fields) each warrant their own time-series output file.
Rules are evaluated in order:
1. Variable name is in secondary_vars → secondary.
2. Any dimension name is in secondary_dims → secondary.
3. Variable has more than max_num_dims dimensions and none are in primary_dims → secondary.
4. Otherwise the variable is primary (has a time dimension and more than one dimension total).
- Parameters:
variable (netCDF4._netCDF4.Variable) – netCDF4 variable object to classify.
secondary_vars (list) – Variable names that are unconditionally secondary. Defaults to ['time_bnds', 'time_bnd', 'time_bounds', 'time_bound'].
secondary_dims (list) – Dimension names whose presence makes a variable secondary. Defaults to ['nbnd', 'chars', 'string_length', 'hist_interval'].
max_num_dims (int) – Maximum number of dimensions a variable may have before the primary_dims check is applied. Defaults to 1.
primary_dims (list) – Dimension names whose presence keeps a variable primary. Defaults to ['time'].
- Returns:
True if the variable is secondary, False if primary.
- Return type:
bool
- class gents.meta.netCDFMeta(ds, path: str)¶
Stores metadata extracted from a single netCDF history file.
Caches time values (as raw floats and as CFTime objects), optional time-bounds values, global file attributes, variable lists partitioned into primary vs. secondary sets, and per-dimension coordinate bounds. Instances are constructed by get_meta_from_path() and consumed throughout gents.hfcollection.
- get_attributes()¶
Returns the global attributes dictionary cached from the history file.
- Returns:
Dictionary mapping global attribute names to their values.
- Return type:
dict
- get_cftime_bounds()¶
Returns the time-bounds array as CFTime objects.
- Returns:
Array of CFTime bound pairs, or None if the history file contains no time-bounds variable.
- Return type:
numpy.ndarray or None
- get_cftimes()¶
Returns the time values converted to CFTime objects.
- Returns:
Array of CFTime datetime objects corresponding to each time step.
- Return type:
numpy.ndarray
- get_dim_bounds()¶
Returns coordinate bounds for each dimension in the history file.
For each dimension that has an associated coordinate variable, maps the dimension name to a list containing its minimum value (single-element list for a scalar coordinate) or [min_value, max_value] for a range. Used by merge_fragmented_groups() to identify spatial extent when merging tiled files.
- Returns:
Dictionary mapping dimension names to their coordinate bound lists.
- Return type:
dict
- get_float_time_bounds()¶
Returns the time-bounds array as raw float values.
- Returns:
Array of float time-bound pairs, or None if the history file contains no time-bounds variable.
- Return type:
numpy.ndarray or None
- get_float_times()¶
Returns the raw float time values read from the time variable.
- Returns:
1-D array of float time values.
- Return type:
numpy.ndarray
- get_path()¶
Returns the file-system path of the history file this object was built from.
- Returns:
Path to the source history file.
- Return type:
str
- get_primary_variables()¶
Returns the names of primary variables in the history file.
Primary variables are multi-dimensional, time-varying scientific fields that each warrant their own time-series output file.
- Returns:
List of primary variable name strings.
- Return type:
list
- get_secondary_variables()¶
Returns the names of secondary variables in the history file.
Secondary variables are coordinate and auxiliary fields (e.g. time, time_bnds, lat, lon) that are written unchanged into every time-series output file.
- Returns:
List of secondary variable name strings.
- Return type:
list
- get_variable_dims(variable)¶
Returns the dimension names for the given variable.
- Parameters:
variable (str) – Name of the variable to look up.
- Returns:
Tuple or list of dimension name strings for the variable.
- Return type:
tuple
- get_variable_dtype(variable)¶
Returns the data type of the given variable.
- Parameters:
variable (str) – Name of the variable to look up.
- Returns:
NumPy dtype describing the element type of the variable.
- Return type:
numpy.dtype
- get_variable_shapes(variable)¶
Returns the shape of the given variable.
- Parameters:
variable (str) – Name of the variable to look up.
- Returns:
Tuple of integers describing the size of each dimension.
- Return type:
tuple
- get_variables()¶
Returns the full list of variable names present in the history file.
- Returns:
List of all variable name strings.
- Return type:
list
- is_valid()¶
Returns whether this history file is usable for time-series generation.
A file is considered invalid if any of the following are true:
Both get_cftime_bounds() and get_cftimes() are None (no usable time coordinate).
The file contains zero primary and zero secondary variables.
A gents_version global attribute is present (the file is already a GenTS-generated time-series output, not a raw history file).
- Returns:
True if the file is valid for processing, False otherwise.
- Return type:
bool
timeseries.py
Developer: Cameron Cummins Contact: cameron.cummins@utexas.edu Last Header Update: 01/31/25
- class gents.timeseries.TSCollection(hf_collection, output_dir, ts_orders=None, num_processes=None, dask_client=None)¶
Manages the set of time-series generation orders derived from an HFCollection.
Each order is a dictionary describing one output file: source history file paths, output path template, primary variable name, secondary variable names, and generation arguments (compression, overwrite flag, etc.). All modifier methods return new TSCollection instances, preserving an immutable-style fluent API.
- add_args(path_glob='*', var_glob='*', level=None, alg=None, overwrite=None)¶
Updates generation arguments on orders that match both filters.
Only arguments that are not None are applied; others are left unchanged.
- Parameters:
path_glob (str) – fnmatch glob applied to source history file paths. Defaults to '*'.
var_glob (str) – fnmatch glob applied to primary variable names. Defaults to '*'.
level (int or None) – netCDF4 compression level (0–9). Defaults to None (unchanged).
alg (str or None) – netCDF4 compression algorithm (e.g. 'zlib'). Defaults to None (unchanged).
overwrite (bool or None) – Overwrite flag to apply. Defaults to None (unchanged).
- Returns:
New TSCollection with updated order arguments.
- Return type:
gents.timeseries.TSCollection
- add_attrs(attrs)¶
Adds attributes to all output time series files associated with this collection.
- Parameters:
attrs (dict) – Dictionary of key/value strings to append to output netCDF files.
- Returns:
New TSCollection with updated attributes.
- Return type:
gents.timeseries.TSCollection
- append_timestep_dirs(var_glob='*')¶
Inserts a time-step frequency subdirectory into each matching order’s output path.
Determines the frequency label from the group’s timestep delta: 'hour_N', 'day_N', 'month_N', or 'year_N'. The label is inserted as a new directory level immediately before the filename in the output path template, organising outputs by observation frequency.
- Parameters:
var_glob (str) – fnmatch glob applied to primary variable names. Defaults to '*'.
- Returns:
New TSCollection with updated output path templates.
- Return type:
gents.timeseries.TSCollection
- apply_compression(level, alg, path_glob, var_glob='*')¶
Applies compression settings to matching time-series orders.
Convenience wrapper around add_args().
- Parameters:
level (int) – netCDF4 compression level (0–9).
alg (str) – netCDF4 compression algorithm (e.g. 'zlib').
path_glob (str) – fnmatch glob applied to source history file paths.
var_glob (str) – fnmatch glob applied to primary variable names. Defaults to '*'.
- Returns:
New TSCollection with compression arguments applied.
- Return type:
gents.timeseries.TSCollection
- apply_overwrite(path_glob, var_glob='*')¶
Sets the overwrite flag on matching time-series orders.
Convenience wrapper around add_args() with overwrite=True.
- Parameters:
path_glob (str) – fnmatch glob applied to source history file paths.
var_glob (str) – fnmatch glob applied to primary variable names. Defaults to '*'.
- Returns:
New TSCollection with overwrite enabled on matching orders.
- Return type:
gents.timeseries.TSCollection
- apply_path_swap(string_match, string_swap, path_glob='*', var_glob='*')¶
Replaces a substring in the output path template of matching orders.
Iterates over orders whose source paths match path_glob and replaces string_match with string_swap in each order’s ts_path_template. Used to redirect outputs to a different directory structure (e.g. '/hist/' → '/proc/tseries/').
- Parameters:
string_match (str) – Substring to find in the output path template.
string_swap (str) – Replacement string.
path_glob (str) – fnmatch glob applied to source history file paths. Defaults to '*'.
var_glob (str) – fnmatch glob applied to primary variable names. Defaults to '*'.
- Returns:
New TSCollection with updated path templates.
- Return type:
gents.timeseries.TSCollection
- copy(hf_collection=None, output_dir=None, ts_orders=None, num_processes=None)¶
Creates a new TSCollection derived from this one with optional overrides.
Used as the return mechanism for all modifier methods to preserve immutability.
- Parameters:
hf_collection (gents.hfcollection.HFCollection or None) – HFCollection to assign to the copy. Defaults to the current collection.
output_dir (str or None) – Output directory to assign to the copy. Defaults to the current directory.
ts_orders (list or None) – Order list to assign to the copy. Defaults to the current orders.
num_processes (int or None) – Worker process count for the copy. Defaults to the current value.
- Returns:
New TSCollection instance.
- Return type:
gents.timeseries.TSCollection
- create_directories(exist_ok=True)¶
Creates the output directory tree for all time-series orders.
- Parameters:
exist_ok (bool) – If True (default), no error is raised when a directory already exists.
- exclude(path_glob, var_glob='')¶
Returns a new collection with orders that match both filters removed.
An order is excluded if any of its source paths matches path_glob and its primary variable matches var_glob.
- Parameters:
path_glob (str) – fnmatch glob applied to source history file paths.
var_glob (str) – fnmatch glob applied to primary variable names. Defaults to ''.
- Returns:
New TSCollection with matching orders removed.
- Return type:
gents.timeseries.TSCollection
- execute(optimize=True, optimize_batch_n=200, raise_errors=False)¶
Executes all time-series generation orders in parallel.
When optimize=True (default), orders that share the same first source file are batched together (up to optimize_batch_n per batch) so that generate_time_series() opens each group of history files only once and writes multiple primary-variable output files per worker invocation, significantly reducing file I/O overhead.
When optimize=False, each order is submitted as a separate worker task (one file open per variable).
- Parameters:
optimize (bool) – If True (default), batch orders sharing the same source files into single worker calls.
optimize_batch_n (int) – Maximum number of variables per optimised batch. Defaults to 200.
raise_errors (bool) – If True (default False), errors are raised rather than just logged.
- Returns:
List of paths to all generated time-series output files.
- Return type:
list[str]
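The batching strategy described for optimize=True can be sketched with plain dictionaries. The order layout used here (the 'hf_paths' and 'primary_var' keys) and the function name are assumptions made for illustration; the actual order structure in GenTS may differ.

```python
from collections import defaultdict


def batch_orders(orders, batch_n=200):
    """Group orders by their first source file, then cap each batch size."""
    by_first_file = defaultdict(list)
    for order in orders:
        by_first_file[order["hf_paths"][0]].append(order)
    batches = []
    for group in by_first_file.values():
        # Split large groups so no batch exceeds batch_n orders.
        for start in range(0, len(group), batch_n):
            batches.append(group[start:start + batch_n])
    return batches


# Five variables from one file group plus one from another (illustrative).
orders = [{"hf_paths": ["a.nc", "b.nc"], "primary_var": f"VAR{i}"} for i in range(5)]
orders.append({"hf_paths": ["c.nc"], "primary_var": "OTHER"})
batches = batch_orders(orders, batch_n=2)
batch_sizes = [len(batch) for batch in batches]
```

Each batch can then be handed to a single worker, so the shared history files are opened once per batch instead of once per variable.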
- get_hf_collection()¶
Returns the underlying HFCollection.
- Returns:
The history file collection this TSCollection was derived from.
- Return type:
gents.hfcollection.HFCollection
- get_output_dir()¶
Returns the output directory path for generated time series files.
- Returns:
Absolute path to the output directory.
- Return type:
str
- include(path_glob, var_glob='*')¶
Returns a new collection containing only orders that match both filters.
An order is retained if at least one of its source paths matches path_glob and its primary variable matches var_glob.
- Parameters:
path_glob (str) – fnmatch glob applied to source history file paths.
var_glob (str) – fnmatch glob applied to primary variable names. Defaults to '*'.
- Returns:
New TSCollection restricted to matching orders.
- Return type:
gents.timeseries.TSCollection
- remove_overwrite(path_glob, var_glob='*')¶
Clears the overwrite flag on matching time-series orders.
Convenience wrapper around add_args() with overwrite=False.
- Parameters:
path_glob (str) – fnmatch glob applied to source history file paths.
var_glob (str) – fnmatch glob applied to primary variable names. Defaults to '*'.
- Returns:
New TSCollection with overwrite disabled on matching orders.
- Return type:
gents.timeseries.TSCollection
- update_ts_orders(strfrmt_kwargs={}, time_alignment_method='midpoint')¶
Rebuilds the time-series order list and returns a new TSCollection.
Re-derives one order per primary variable per history file group, applying strfrmt_kwargs to override individual timestamp format strings and time_alignment_method to control which point within each time bound is used when computing start_time/end_time for the output filename.
Time alignment methods:
'midpoint' (default): midpoint of the first time bound.
'direct_time': raw time coordinate values (ignores bounds).
'start_bound': lower edge of the first time bound.
'end_bound': upper edge of the first time bound.
- Parameters:
strfrmt_kwargs (dict) – Format-string overrides forwarded to get_timestamp_format() (e.g. {'monthly_format': '%Y%m%d'}). Defaults to {}.
time_alignment_method (str) – Method used to select the representative time value from each file’s time bounds. Must be one of 'midpoint', 'direct_time', 'start_bound', or 'end_bound'. Defaults to 'midpoint'.
- Returns:
A new TSCollection with the rebuilt order list.
- Return type:
gents.timeseries.TSCollection
- Raises:
ValueError – If time_alignment_method is not one of the accepted values.
- gents.timeseries.check_timeseries_conform(ts_path: str)¶
Checks whether a time-series file meets the GenTS chunking conventions.
A conforming file satisfies:
The time variable is stored contiguously (chunk sizes equal shape).
Every multi-dimensional variable is either stored contiguously, or its per-time-step chunk occupies at least 4 MiB.
- Parameters:
ts_path (str) – Path to the time-series netCDF file to inspect.
- Returns:
True if the file conforms to the chunking conventions, False otherwise.
- Return type:
bool
- gents.timeseries.check_timeseries_integrity(ts_path: str)¶
Checks whether a time-series file was written completely by GenTS.
Opens the file and looks for the gents_version global attribute, which is stamped on every successfully completed output file.
- Parameters:
ts_path (str) – Path to the time-series netCDF file to inspect.
- Returns:
True if gents_version is present (file likely complete), False if absent or the file cannot be opened (possible corruption).
- Return type:
bool
- gents.timeseries.generate_time_series(hf_paths, ts_path_template, secondary_vars, ts_args)¶
Generates time-series files for a group of history files.
Opens an
MHFDatasetoverhf_paths, pre-loads all secondary variable data, then callswrite_timeseries_file()for each primary variable described ints_args.- Parameters:
hf_paths (list[str or pathlib.Path]) – Paths to the history files forming the group.
ts_path_template (str) – Output path prefix (without variable name or timestamp suffix).
secondary_vars (list[str]) – Names of secondary variables to read and embed in every output file.
ts_args (dict) – Dictionary mapping each primary variable name to a dict of keyword arguments for write_timeseries_file() (must include a 'ts_string' key for the timestamp suffix).
- Returns:
List of paths to the generated time-series files.
- Return type:
list[str]
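As a rough illustration of the ts_args structure, the sketch below builds a small dictionary and checks it for the required 'ts_string' key. The variable names, the timestamp strings, and the missing_ts_string helper are hypothetical; only the 'ts_string' requirement and the keyword names taken from the write_timeseries_file() signature come from this reference.

```python
# Hypothetical variable names and timestamp strings, for illustration only.
ts_args = {
    "TS": {"ts_string": "000101-001012", "overwrite": False},
    "PRECT": {"ts_string": "000101-001012", "complevel": 1, "compression": "zlib"},
}


def missing_ts_string(ts_args):
    """Return the primary variables whose argument dict lacks the 'ts_string' key."""
    return [name for name, kwargs in ts_args.items() if "ts_string" not in kwargs]


print(missing_ts_string(ts_args))
```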
- gents.timeseries.get_timestamp_format(dt, subhour_format='%Y%m%d%H%M%S', hourly_format='%Y%m%d%H', daily_format='%Y%m%d', monthly_format='%Y%m', yearly_format='%Y')¶
Returns a strftime format string appropriate for a given time-step duration.
- Parameters:
dt (datetime.timedelta) – Duration of a single model time step.
subhour_format (str) – Format string for sub-hour time steps. Defaults to '%Y%m%d%H%M%S'.
hourly_format (str) – Format string for hour-level time steps (< 24 h). Defaults to '%Y%m%d%H'.
daily_format (str) – Format string for day-level time steps (< 28 days). Defaults to '%Y%m%d'.
monthly_format (str) – Format string for month-level time steps (< 12 months). Defaults to '%Y%m'.
yearly_format (str) – Format string for year-level time steps. Defaults to '%Y'.
- Returns:
strftime-compatible format string.
- Return type:
str
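The tier selection can be sketched as a chain of threshold comparisons. This is a hypothetical re-implementation for illustration (pick_timestamp_format is not part of GenTS); it assumes the finest tier ends at one hour and treats "12 months" as 365 days, neither of which is stated in this reference.

```python
from datetime import timedelta


def pick_timestamp_format(
    dt,
    subhour_format="%Y%m%d%H%M%S",
    hourly_format="%Y%m%d%H",
    daily_format="%Y%m%d",
    monthly_format="%Y%m",
    yearly_format="%Y",
):
    """Pick the format for the first tier whose upper bound exceeds dt."""
    if dt < timedelta(hours=1):  # assumption: finest tier ends at one hour
        return subhour_format
    if dt < timedelta(hours=24):
        return hourly_format
    if dt < timedelta(days=28):
        return daily_format
    if dt < timedelta(days=365):  # assumption: "12 months" taken as 365 days
        return monthly_format
    return yearly_format


print(pick_timestamp_format(timedelta(hours=6)))
print(pick_timestamp_format(timedelta(days=30)))
```

Keyword overrides work the same way as in the documented signature, for example pick_timestamp_format(dt, monthly_format='%Y%m%d').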
- gents.timeseries.write_timeseries_file(agg_hf_ds, ts_out_path, primary_var, secondary_vars_data, overwrite=False, complevel=0, compression=None, ts_start_index=None, ts_end_index=None, append_attrs=None)¶
Writes a single time-series netCDF file for one primary variable.
Behaviour when the output file already exists:
overwrite=True: the existing file is deleted and recreated.
overwrite=False: check_timeseries_integrity() is called; if the file passes the integrity check, its path is returned immediately and the write is skipped; otherwise the corrupt file is deleted and recreated.
The primary variable is written with adaptive chunk sizes: files smaller than 4 MiB are stored contiguously; larger files are chunked along the time axis to keep each chunk near 4 MiB. Secondary variables are written with their full shape as chunk sizes. The global attributes are stamped with a gents_version entry on completion.
- Parameters:
agg_hf_ds (gents.mhfdataset.MHFDataset) – Open MHFDataset providing aggregated data for the history file group.
ts_out_path (str) – Full output path for the time-series file.
primary_var (str) – Name of the primary variable to extract, or 'auxiliary' to write only secondary variables.
secondary_vars_data (dict) – Pre-loaded secondary variable data as a {var_name: numpy.ndarray} dictionary.
overwrite (bool) – If True, overwrite any existing file. Defaults to False.
complevel (int) – netCDF4 compression level (0–9). Defaults to 0 (no compression).
compression (str or None) – netCDF4 compression algorithm (e.g. 'zlib'). Defaults to None.
ts_start_index (int or None) – Time index at which to start reading from the aggregated history files. If None, reading starts at the first time step of the full aggregation. Defaults to None.
ts_end_index (int or None) – Time index at which to stop reading from the aggregated history files. If None, reading continues to the last time step of the full aggregation. Defaults to None.
append_attrs (dict or None) – Attributes to append to the output netCDF file. Defaults to None.
- Returns:
Path to the written (or skipped) output file.
- Return type:
str
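The adaptive chunking rule for the primary variable can be sketched as follows. This is an illustrative approximation, not the GenTS code: the adaptive_chunks name is hypothetical, time is assumed to be the first axis, and rounding the step count up (so that a full chunk reaches 4 MiB) is an assumption about what "near 4 MiB" means.

```python
from math import ceil, prod

TARGET_CHUNK_BYTES = 4 * 1024 * 1024  # 4 MiB target quoted above


def adaptive_chunks(shape, itemsize):
    """Return chunk sizes for an array whose first axis is time."""
    total_bytes = prod(shape) * itemsize
    if total_bytes < TARGET_CHUNK_BYTES:
        return tuple(shape)  # small enough to store as one contiguous block
    bytes_per_step = prod(shape[1:]) * itemsize
    # Whole time steps per chunk: at least one, at most the full time axis.
    steps = min(shape[0], max(1, ceil(TARGET_CHUNK_BYTES / bytes_per_step)))
    return (steps,) + tuple(shape[1:])


print(adaptive_chunks((120, 192, 288), 4))
print(adaptive_chunks((12, 10, 10), 4))
```

Chunking only along time keeps each spatial field intact within a chunk, which suits readers that pull whole maps for a range of time steps.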
- class gents.mhfdataset.MHFDataset(hf_paths)¶
Aggregating dataset interface over a group of related history files.
Presents multiple history files, covering the same time range and/or different spatial tiles, as a single virtual dataset. All file handles are opened together on open() (or __enter__) and closed together on close() (or __exit__).
- close()¶
Closes all open netCDF4 file handles.
- get_global_attrs()¶
Returns a merged dictionary of global attributes from all files in the group.
Attributes from later files overwrite those from earlier files when keys conflict.
- Returns:
Dictionary mapping global attribute names to their values.
- Return type:
dict
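The "later files win" rule is the same behaviour as successive dict.update() calls. A minimal sketch, with merge_global_attrs as a hypothetical helper operating on plain dictionaries rather than open files:

```python
def merge_global_attrs(per_file_attrs):
    """Merge attribute dicts in file order; later files win on key conflicts."""
    merged = {}
    for attrs in per_file_attrs:
        merged.update(attrs)
    return merged


print(merge_global_attrs([{"title": "a", "case": "x"}, {"title": "b"}]))
```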
- get_time_vals()¶
Returns the sorted array of unique float time values across the group.
- Returns:
1-D array of sorted, unique float time values.
- Return type:
numpy.ndarray
- get_var_attrs(var_name)¶
Returns the attribute dictionary for a variable from the first file in the group.
- Parameters:
var_name (str) – Name of the variable to inspect.
- Returns:
Dictionary mapping attribute names to their values.
- Return type:
dict
- get_var_data_shape(var_name)¶
Returns the full expected output shape of a variable across the entire group.
Accounts for the total number of aggregated time steps and, for fragmented groups, the combined spatial extents. Returns a single-element list for coordinate variables.
- Parameters:
var_name (str) – Name of the variable to inspect.
- Returns:
List of dimension sizes representing the aggregated output shape.
- Return type:
list[int]
- get_var_dimensions(var_name)¶
Returns the dimension names for a variable, read from the first file in the group.
- Parameters:
var_name (str) – Name of the variable to inspect.
- Returns:
List of dimension name strings in the order they appear on the variable.
- Return type:
list[str]
- get_var_dtype(var_name)¶
Returns the NumPy dtype of a variable, read from the first file in the group.
- Parameters:
var_name (str) – Name of the variable to inspect.
- Returns:
NumPy dtype of the variable.
- Return type:
numpy.dtype
- get_var_vals(var_name, time_index_start=0, time_index_end=None)¶
Reads and returns a variable’s data across the group for a time slice.
Two execution paths are used depending on fragmentation:
Non-fragmented: iterates over the requested time values and reads each time step from the appropriate single file.
Fragmented: for each time step, reads from all spatial-tile files and inserts each tile into the correct slice of a pre-allocated output array by matching tile coordinate values against the combined coordinate map.
- Parameters:
var_name (str) – Name of the variable to read.
time_index_start (int) – Index of the first time step to include (inclusive). Defaults to 0.
time_index_end (int or None) – Index at which to stop reading (exclusive). Defaults to None (all remaining time steps).
- Returns:
Array containing the variable data for the requested time slice.
- Return type:
numpy.ndarray
- is_fragmented()¶
Returns whether the group consists of spatially fragmented (tiled) files.
- Returns:
True if the first time value is covered by more than one file, False otherwise.
- Return type:
bool
- is_time_consistent()¶
Checks that every time step is covered by the same number of files.
Required for spatially fragmented groups to ensure every tile is present for every time step.
- Returns:
True if all time values have the same fragment count, False otherwise.
- Return type:
bool
- open()¶
Opens all history file handles and builds the internal time mapping.
Constructs __time_mapping: a dictionary from each unique float time value to the list of file indices that contain it. Raises an exception if the number of files per time step is not consistent across all time values (i.e. fragmentation is inconsistent).
- Raises:
Exception – If the spatial fragmentation is not consistent over time.
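The time mapping and the two checks built on it (is_fragmented() and is_time_consistent()) can be sketched on plain lists of time values. The function names below mirror the documented methods but are standalone, hypothetical helpers; the input is assumed to be one list of time values per file.

```python
def build_time_mapping(times_per_file):
    """Map each unique time value to the indices of the files that contain it."""
    mapping = {}
    for file_index, times in enumerate(times_per_file):
        for t in times:
            mapping.setdefault(float(t), []).append(file_index)
    return dict(sorted(mapping.items()))  # keep time values in ascending order


def is_fragmented(mapping):
    """True if the first time value is covered by more than one file."""
    first_time = next(iter(mapping))
    return len(mapping[first_time]) > 1


def is_time_consistent(mapping):
    """True if every time value is covered by the same number of files."""
    return len({len(files) for files in mapping.values()}) <= 1


tiles = build_time_mapping([[0.5, 1.5], [0.5, 1.5]])  # two tiles, same times
series = build_time_mapping([[0.5], [1.5]])           # two files, consecutive times
print(tiles, is_fragmented(tiles), is_fragmented(series))
```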
- gents.mhfdataset.get_concat_coords(hf_datasets)¶
Builds a combined coordinate map across all datasets in a spatially fragmented group.
For each dimension across all open datasets:
If the dimension has a coordinate variable, its values are merged and de-duplicated with numpy.unique across all files.
If there is no coordinate variable, a 0-indexed integer range matching the dimension size is used.
- Parameters:
hf_datasets (list[netCDF4.Dataset]) – List of open netCDF4.Dataset objects from the group.
- Returns:
Dictionary mapping dimension names to their combined coordinate arrays.
- Return type:
dict
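The merge rule can be sketched without netCDF4 by describing each file as a plain dictionary. Both the concat_coords helper and the {"dims": ..., "coords": ...} layout are assumptions made for this example, and sorted(set(...)) stands in for numpy.unique.

```python
def concat_coords(datasets):
    """datasets: list of {"dims": {name: size}, "coords": {name: [values]}} dicts."""
    combined = {}
    for ds in datasets:
        for dim, size in ds["dims"].items():
            if dim in ds["coords"]:
                # Coordinate variable present: pool its values across files.
                combined.setdefault(dim, set()).update(ds["coords"][dim])
            else:
                # No coordinate variable: fall back to a 0-indexed range.
                combined.setdefault(dim, set()).update(range(size))
    return {dim: sorted(values) for dim, values in combined.items()}


west = {"dims": {"lat": 2, "lon": 2, "nbnd": 2},
        "coords": {"lat": [-10.0, 10.0], "lon": [0.0, 90.0]}}
east = {"dims": {"lat": 2, "lon": 2, "nbnd": 2},
        "coords": {"lat": [-10.0, 10.0], "lon": [180.0, 270.0]}}
print(concat_coords([west, east]))
```

In the example, the shared lat values collapse to one copy while the lon values from the two tiles are joined into a single sorted axis.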
utils.py
Developer: Cameron Cummins Contact: cameron.cummins@utexas.edu Last Header Update: 07/03/25
- class gents.utils.ProgressBar(total, length=40, label='')¶
Terminal progress bar for visualising long-running loops.
Displays a continuously-updated bar, percentage, item count, and elapsed time by overwriting a single terminal line in place.
- step()¶
Advances the progress bar by one iteration and redraws the terminal line.
Writes a newline once the counter reaches total.
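In-place redrawing of a single terminal line is usually done with a carriage return. The sketch below shows that general technique under assumed names (render_bar, run_with_bar) and an assumed bar layout; it is not the ProgressBar source.

```python
import sys
import time


def render_bar(done, total, length=40, label="", elapsed=0.0):
    """Build one line of progress text: bar, percentage, item count, elapsed time."""
    filled = int(length * done / total) if total else length
    bar = "#" * filled + "-" * (length - filled)
    percent = 100.0 * done / total if total else 100.0
    return f"{label} [{bar}] {percent:5.1f}% ({done}/{total}) {elapsed:.1f}s"


def run_with_bar(total, label="work"):
    """Redraw the bar once per iteration, then finish with a newline."""
    start = time.monotonic()
    for done in range(1, total + 1):
        elapsed = time.monotonic() - start
        # "\r" returns to the start of the line so the bar overwrites itself.
        sys.stdout.write("\r" + render_bar(done, total, label=label, elapsed=elapsed))
        sys.stdout.flush()
    sys.stdout.write("\n")  # newline once the counter reaches total
```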
- gents.utils.enable_logging(verbose=False, output_path=None)¶
Configures the gents package logger and begins emitting log messages.
At verbose=True, the log level is set to LOG_LEVEL_IO_WARNING (5), enabling per-file I/O trace messages. At the default verbose=False, the level is DEBUG (10), suppressing those low-level traces. The installed GenTS version is logged immediately on initialisation.
- Parameters:
verbose (bool) – If True, enable per-file I/O trace messages at LOG_LEVEL_IO_WARNING level. Defaults to False.
output_path (str or None) – Optional file path to additionally write log output to. Defaults to None (stdout only).
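A custom level below DEBUG can be registered with the standard logging module, which is one way the level selection described above could work. The sketch uses a separate logger name and a hypothetical function (enable_logging_sketch) so it does not touch the real gents logger; only the numeric values 5 and 10 come from this reference.

```python
import logging
import sys

IO_TRACE_LEVEL = 5  # numeric value quoted above for LOG_LEVEL_IO_WARNING


def enable_logging_sketch(verbose=False, output_path=None, name="gents_sketch"):
    """Hypothetical stand-in showing the level selection, not the GenTS code."""
    logging.addLevelName(IO_TRACE_LEVEL, "IO_WARNING")
    logger = logging.getLogger(name)
    logger.setLevel(IO_TRACE_LEVEL if verbose else logging.DEBUG)
    logger.handlers.clear()  # avoid duplicate handlers on repeated calls
    logger.addHandler(logging.StreamHandler(sys.stdout))
    if output_path is not None:
        logger.addHandler(logging.FileHandler(output_path))
    return logger


log = enable_logging_sketch(verbose=True)
log.log(IO_TRACE_LEVEL, "reading file 1 of 3")  # emitted only when verbose=True
```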
- gents.utils.get_time_stamp()¶
Returns the current system date and time as a formatted string.
- Returns:
Date-time string formatted as 'YYYY-MM-DD HH:MM'.
- Return type:
str
- gents.utils.get_version()¶
Returns the version string of the installed gents package.
- Returns:
Package version string (e.g. '1.0.0').
- Return type:
str
- gents.utils.log_hfcollection_info(hfc)¶
Logs summary statistics for an HFCollection at INFO level.
Iterates over all groups in the collection to compute aggregate metrics and identify outliers. Requires metadata to have been pulled (calls hfc.check_pulled()). A progress bar is displayed on stdout during the scan.
Statistics logged:
Input directory and total number of history files found.
Number of output groups formed.
Total mapped data volume in TB and GB.
Group with the most variables.
Group with the most history files.
Variable with the largest single-timestep memory footprint (shape, dimensions, and size in MB).
- Parameters:
hfc (gents.hfcollection.HFCollection) – A pulled HFCollection instance to inspect.
- gents.utils.log_tscollection_info(tsc)¶
Logs summary statistics for a TSCollection at INFO level.
Iterates over all time series orders in the collection to compute aggregate metrics and identify the largest output file. Auxiliary-only orders are skipped. A progress bar is displayed on stdout during the scan.
Statistics logged:
Output directory and total number of time series files to generate.
Largest time series file by estimated total size, including the sample history file path, variable name, shape, dimensions, number of source history files, and projected size in GB.
- Parameters:
tsc (gents.timeseries.TSCollection) – A TSCollection instance to inspect.
- gents.cli.main()¶
Entry point for the gents command-line interface.
Performs the following steps:
Calls parse_arguments() to obtain the parsed CLI namespace.
Defaults outputdir to hf_head_dir when -o is not supplied.
Selects the appropriate model configuration:
--model e3sm → imports run_config() (E3SM).
--model cesm3 → imports run_config() (CESM3).
If --verbose is set, prints a summary of all active settings to stdout.
Delegates execution to the selected run_config(args) function.
- gents.cli.parse_arguments()¶
Parses command-line arguments for the gents CLI entry point.
Constructs an argparse.ArgumentParser with all supported flags and positional arguments, then parses sys.argv and returns the resulting namespace.
Supported arguments:
hf_head_dir (positional): Path to the head directory containing history files.
-o/--outputdir: Output directory for time-series files (defaults to hf_head_dir if omitted).
-v/--verbose: Enable verbose console output.
-V/--version: Print the installed gents version and exit.
-d/--dryrun: Parse metadata only; do not write time-series files.
-w/--overwrite: Overwrite existing time-series output files.
-sl/--slice: Maximum length of individual time-series files in years (default 10).
-hc/--hfcores: Maximum number of cores for parallel metadata reads (default 64).
-tc/--tscores: Maximum number of cores for parallel time-series writes (default 8).
-m/--model: Model default configuration to apply ('CESM3', 'CESM2', or 'E3SM'; default 'none').
--exclude: Glob pattern to exclude; may be specified multiple times. Overrides the model default unless --append is also set.
--include: Glob pattern to include; may be specified multiple times. Overrides the model default unless --append is also set.
--append: Append --exclude/--include filters to the model default configuration instead of replacing them.
- Returns:
Namespace object populated with parsed argument values.
- Return type:
argparse.Namespace
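A subset of these flags can be reproduced with argparse to show how the namespace and the override/append rule fit together. This is a sketch under assumptions: build_parser and resolve_filters are hypothetical helpers, the int type for --slice is a guess, the sample path and glob patterns are invented, and falling back to the model defaults when no pattern is given is assumed rather than documented.

```python
import argparse


def build_parser():
    """Subset of the documented flags; help strings paraphrase the list above."""
    parser = argparse.ArgumentParser(prog="gents")
    parser.add_argument("hf_head_dir", help="head directory containing history files")
    parser.add_argument("-o", "--outputdir", default=None, help="output directory")
    parser.add_argument("-d", "--dryrun", action="store_true", help="parse metadata only")
    parser.add_argument("-sl", "--slice", type=int, default=10, help="slice length in years")
    parser.add_argument("--exclude", action="append", default=None, help="glob to exclude")
    parser.add_argument("--append", action="store_true", help="extend model default filters")
    return parser


def resolve_filters(model_defaults, user_patterns, append):
    """User patterns replace the model defaults unless append is set."""
    if not user_patterns:
        return list(model_defaults)  # assumption: defaults apply when nothing is given
    if append:
        return list(model_defaults) + list(user_patterns)
    return list(user_patterns)


args = build_parser().parse_args(["/data/hist", "--exclude", "*.rs.*", "--append"])
outputdir = args.outputdir or args.hf_head_dir  # fallback described for main()
print(outputdir, resolve_filters(["*.r.*"], args.exclude, args.append))
```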