feat: lazy dataset ingestion by lewisjared · Pull Request #515 · Climate-REF/climate-ref

lewisjared · 2026-02-05T11:47:38Z

Description

Implements lazy dataset ingestion to support millions of CMIP6 files on HPC parallel file systems without opening every netCDF at ingest time.

Two-phase approach:

Phase 1 (Bootstrap): The DRS parser extracts metadata from directory/filename structure only (finalised=False). A TABLE_ID_TO_FREQUENCY mapping infers frequency from table_id without file I/O.
Phase 2 (Lazy Finalization): At solve time, after filtering and grouping narrows candidates, only matched files are opened via the complete parser. Metadata is extracted, persisted to the DB, and cached so subsequent solves skip re-parsing.
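
To illustrate Phase 1, here is a minimal sketch of path-only parsing. It assumes the standard CMIP6 DRS directory layout (`activity_id/institution_id/source_id/experiment_id/member_id/table_id/variable_id/grid_label/version/filename`). The function name `parse_drs_path`, the small mapping shown, and the example path are illustrative assumptions, not the code in this PR; the real `TABLE_ID_TO_FREQUENCY` mapping may contain different or additional entries.

```python
from pathlib import PurePosixPath

# Illustrative subset only; the mapping in the PR may differ or be larger.
TABLE_ID_TO_FREQUENCY = {
    "Amon": "mon",
    "Omon": "mon",
    "Lmon": "mon",
    "day": "day",
    "3hr": "3hr",
    "fx": "fx",
}

# Directory facets of the CMIP6 DRS, in path order.
DRS_DIRECTORY_FACETS = (
    "activity_id",
    "institution_id",
    "source_id",
    "experiment_id",
    "member_id",
    "table_id",
    "variable_id",
    "grid_label",
    "version",
)


def parse_drs_path(path: str) -> dict:
    """Extract metadata from a CMIP6 DRS path without opening the file (hypothetical helper)."""
    parts = PurePosixPath(path).parts
    # The last element is the filename; the nine before it are the DRS directories.
    if len(parts) < len(DRS_DIRECTORY_FACETS) + 1:
        raise ValueError(f"Path is too short to be a CMIP6 DRS path: {path}")
    facets = dict(zip(DRS_DIRECTORY_FACETS, parts[-10:-1]))
    facets["path"] = path
    # Frequency is looked up from table_id, so no file I/O is needed.
    facets["frequency"] = TABLE_ID_TO_FREQUENCY.get(facets["table_id"])
    # Marks the record as needing the complete parser later.
    facets["finalised"] = False
    return facets


# Made-up example path that follows the DRS layout.
example = parse_drs_path(
    "/data/CMIP6/CMIP/CSIRO/ACCESS-ESM1-5/historical/r1i1p1f1/Amon/tas/gn/"
    "v20191115/tas_Amon_ACCESS-ESM1-5_historical_r1i1p1f1_gn_185001-201412.nc"
)
print(example["source_id"], example["frequency"], example["finalised"])
```

A table_id missing from the mapping yields `frequency=None` in this sketch, leaving the value to be filled in when the file is finalised.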

Key components:

FinaliseableDatasetAdapterMixin — base class for adapters supporting two-phase ingestion
CMIP6DatasetAdapter.finalise_datasets() — opens unfinalised netCDFs, extracts metadata, applies fixes, persists to DB
DataCatalog wrapper — replaces raw pd.DataFrame in the solver with lazy loading (to_frame()) and per-group finalization (finalise(subset))
Solver integration — finalization happens after filter+group_by but before constraint checking
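
As a rough picture of how the wrapper could behave, the sketch below caches a lazily loaded frame and finalises only the unfinalised rows of a subset. The class name `DataCatalogSketch`, the `loader` and `finaliser` callables, and the column names are assumptions made for illustration; the actual `DataCatalog` in this PR loads from the database and delegates to the adapter's `finalise_datasets()`.

```python
from typing import Callable, Optional

import pandas as pd


class DataCatalogSketch:
    """Hypothetical stand-in for the DataCatalog wrapper described above."""

    def __init__(
        self,
        loader: Callable[[], pd.DataFrame],
        finaliser: Callable[[pd.DataFrame], pd.DataFrame],
    ) -> None:
        self._loader = loader
        self._finaliser = finaliser
        self._frame: Optional[pd.DataFrame] = None

    def to_frame(self) -> pd.DataFrame:
        # Load once on first access, then reuse the cached frame.
        if self._frame is None:
            self._frame = self._loader()
        return self._frame

    def finalise(self, subset: pd.DataFrame) -> pd.DataFrame:
        frame = self.to_frame()
        pending = subset[~subset["finalised"]]
        if pending.empty:
            return subset
        updated = self._finaliser(pending)
        # Write the completed metadata back into the cache, column by column.
        for column in updated.columns:
            frame.loc[updated.index, column] = updated[column]
        return frame.loc[subset.index]


load_calls = []
opened = []


def load_from_db() -> pd.DataFrame:
    # Stand-in for a database query.
    load_calls.append(1)
    return pd.DataFrame(
        {
            "variable_id": ["tas", "tas", "pr"],
            "finalised": [False, True, False],
            "units": [None, "K", None],
        }
    )


def open_and_parse(pending: pd.DataFrame) -> pd.DataFrame:
    # Stand-in for opening netCDF files with the complete parser.
    opened.extend(pending.index)
    out = pending.copy()
    out["units"] = "K"
    out["finalised"] = True
    return out


catalog = DataCatalogSketch(load_from_db, open_and_parse)
frame = catalog.to_frame()
tas_only = frame[frame["variable_id"] == "tas"]
result = catalog.finalise(tas_only)
print(opened, len(load_calls))
```

Only the one unfinalised `tas` row reaches `open_and_parse`; the `pr` row stays unfinalised, and repeated `to_frame()` calls do not trigger another load.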

Also includes a refactor removing the unused config parameter from register_dataset.

Checklist

Please confirm that this pull request has done the following:

Tests added
Documentation added (where applicable)
Changelog item added to changelog/

…lver

DataCatalog wraps the per-adapter catalog DataFrame with lazy loading from the database and per-group finalization of unfinalised datasets. The solver now finalizes datasets after filter+group_by but before constraint checking, so only matched candidates trigger file I/O.
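
The ordering described here (filter, group, finalise, then check constraints) can be sketched as follows. The function `solve_sketch`, its parameters, and the toy constraint are hypothetical and only show why files outside the matched groups are never opened; they are not the solver's real interface.

```python
import pandas as pd


def solve_sketch(frame, facets, group_by, finalise, is_satisfied):
    """Hypothetical outline of filter -> group_by -> finalise -> constraint check."""
    # 1. Filter on facets already known from the DRS path.
    mask = pd.Series(True, index=frame.index)
    for column, value in facets.items():
        mask &= frame[column] == value
    selected = frame[mask]

    results = {}
    # 2. Group the candidates, 3. finalise each group, 4. check constraints
    #    against the now-complete metadata.
    for key, group in selected.groupby(group_by):
        finalised = finalise(group)
        if is_satisfied(finalised):
            results[key] = finalised
    return results


catalog_frame = pd.DataFrame(
    {
        "path": ["a.nc", "b.nc", "c.nc", "d.nc"],
        "variable_id": ["tas", "tas", "pr", "tas"],
        "source_id": ["M1", "M1", "M1", "M2"],
        "finalised": [False, False, False, False],
    }
)
opened_paths = []


def fake_finalise(group: pd.DataFrame) -> pd.DataFrame:
    # Stand-in for opening the files; records which paths would be read.
    opened_paths.extend(group["path"])
    out = group.copy()
    out["finalised"] = True
    return out


def has_two_files(group: pd.DataFrame) -> bool:
    # Toy constraint for the example.
    return len(group) >= 2


groups = solve_sketch(
    catalog_frame, {"variable_id": "tas"}, "source_id", fake_finalise, has_two_files
)
print(opened_paths, list(groups))
```

In this toy run `c.nc` is filtered out before finalisation, so it is never opened, while the `M2` group is finalised but then rejected by the constraint.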

Add unit tests for DataCatalog (lazy loading, cache, finalization paths), CMIP6DatasetAdapter.finalise_datasets (DRS-to-complete round trip), and solver integration (extract_covered_datasets with DataCatalog). Fix a bug where finalise_datasets did not convert start_time/end_time strings from the complete parser to datetime objects before persisting.
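
The start_time/end_time fix amounts to coercing string columns to datetimes before they are written to the database. A minimal sketch, assuming the strings are ISO-like dates that pandas can parse; `coerce_time_columns` is a hypothetical helper, and `pd.to_datetime` stands in for whatever conversion the PR actually uses.

```python
import pandas as pd


def coerce_time_columns(frame, columns=("start_time", "end_time")):
    """Return a copy with the given string columns converted to datetimes."""
    out = frame.copy()
    for column in columns:
        out[column] = pd.to_datetime(out[column])
    return out


# Shape of the data as it might come back from a file-based parser (illustrative).
parsed = pd.DataFrame(
    {
        "path": ["tas.nc"],
        "start_time": ["1850-01-16 12:00:00"],
        "end_time": ["2014-12-16 12:00:00"],
    }
)
fixed = coerce_time_columns(parsed)
print(fixed.dtypes)
```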

lewisjared · 2026-02-05T12:29:15Z

Replaces #369

…zy-dataset-ingestion

* origin/cmip7-data-requirements: (30 commits)
  test: improve coverage for CMIP7, solver, and test_cases CLI
  fix: resolve merge conflict in solver OR-logic after InvalidDiagnosticException removal
  fix(tests): set up context manager mocks for xr.open_dataset in CMIP7 tests
  Add changelog
  Add codecov.yml with relaxed coverage thresholds
  feat(cmip7): add license_id and external_variables to CMIP7 model
  Also update output collection of regional historical trend diagnostic
  Add changelog
  Faster listing of regression data
  Add another variable to test case
  Do not try to push to container registry from forks
  Update regression test output
  Add changelog
  Split trends recipe and fix the other two regional recipes
  Update recipes
  Update regional historical annual cycle and timeseries
  Bump version: 0.9.0 → 0.9.1
  docs: fix admonition syntax and typo in getting-started guides
  fix(tests): update patch paths for lazy imports
  chore: add changelog and fix lint after merge
  ...

# Conflicts:
# docs/getting-started/03-ingest.md
# packages/climate-ref/src/climate_ref/cli/datasets.py

lewisjared · 2026-02-06T05:09:00Z

Closes #175

codecov · 2026-02-06T05:09:31Z

Codecov Report

❌ Patch coverage is 74.31193% with 28 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
...ages/climate-ref/src/climate_ref/datasets/cmip6.py	54.38%	22 Missing and 4 partials ⚠️
...mate-ref/src/climate_ref/datasets/cmip6_parsers.py	75.00%	0 Missing and 1 partial ⚠️
...ges/climate-ref/src/climate_ref/datasets/mixins.py	80.00%	1 Missing ⚠️

Files with missing lines	Coverage Δ
...es/climate-ref-pmp/src/climate_ref_pmp/__init__.py	`100.00% <100.00%> (ø)`
...ckages/climate-ref/src/climate_ref/cli/datasets.py	`87.36% <100.00%> (ø)`
...ckages/climate-ref/src/climate_ref/data_catalog.py	`100.00% <100.00%> (ø)`
...s/climate-ref/src/climate_ref/datasets/__init__.py	`96.82% <100.00%> (-0.05%)`	⬇️
...kages/climate-ref/src/climate_ref/datasets/base.py	`98.41% <ø> (-0.02%)`	⬇️
packages/climate-ref/src/climate_ref/solver.py	`98.96% <100.00%> (+0.03%)`	⬆️
...mate-ref/src/climate_ref/datasets/cmip6_parsers.py	`90.14% <75.00%> (-0.91%)`	⬇️
...ges/climate-ref/src/climate_ref/datasets/mixins.py	`80.00% <80.00%> (ø)`
...ages/climate-ref/src/climate_ref/datasets/cmip6.py	`74.33% <54.38%> (-20.41%)`	⬇️

- Extract DB persistence into _persist_finalised_metadata method
- Skip parse_datetime and _apply_fixes when no rows were updated
- Remove redundant unfinalised guard (caller already checks)
- Use pd.isna(path) instead of row.get("path") for clarity
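
One possible reading of the last item: a truthiness check on a looked-up value treats `NaN` as present, whereas `pd.isna` flags both `None` and `NaN` as missing. The values below are made up for illustration and are not taken from the PR.

```python
import pandas as pd

# A real path, a None, and a NaN such as pandas uses for missing values.
candidates = ["/data/tas.nc", None, float("nan")]

# Truthiness: NaN is a non-zero float, so it counts as "present".
looks_present = [bool(value) for value in candidates]

# pd.isna: None and NaN are both reported as missing.
is_missing = [bool(pd.isna(value)) for value in candidates]

print(looks_present)
print(is_missing)
```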

lewisjared added 6 commits February 5, 2026 22:23

refactor: remove unused config param from register_dataset

e9c1c40

The Config object was passed to register_dataset but never used inside the method. Removing it simplifies the API and makes the dependency on Database explicit.

feat: infer frequency from table_id in DRS parser

51b952d

Add a TABLE_ID_TO_FREQUENCY mapping so the DRS parser can determine frequency without opening netCDF files. This is a prerequisite for lazy dataset ingestion where full file I/O is deferred.

feat: add FinaliseableDatasetAdapterMixin and implement on CMIP6 adapter

45c0f88

Introduce a mixin for adapters that support two-phase ingestion. CMIP6DatasetAdapter.finalise_datasets() opens unfinalised netCDF files, extracts full metadata, applies fixes, and persists back to the database.

docs: add changelog for lazy dataset ingestion

142c629

lewisjared changed the title ~~feat: lazy dataset ingestion with two-phase finalization~~ feat: lazy dataset ingestion Feb 5, 2026

lewisjared added 3 commits February 5, 2026 23:52

docs: update ingestion documentation for lazy finalization

f9fba90

Update documentation across tutorials, background, and how-to guides to reflect the two-phase ingestion approach where CMIP6 DRS parsing extracts metadata from paths only, with full metadata completed lazily at solve time.

docs: use British English spelling conventions

df08c0f

Base automatically changed from cmip7-data-requirements to main February 6, 2026 04:55

fix(tests): remove unused config variable from test_db fixture and re…

4b076b1

…lated tests

lewisjared wants to merge 11 commits into main from lazy-dataset-ingestion
