feat: lazy dataset ingestion #515

Open
lewisjared wants to merge 11 commits into main from lazy-dataset-ingestion

@lewisjared
Contributor

Description

Implements lazy dataset ingestion to support millions of CMIP6 files on HPC parallel file systems without opening every netCDF at ingest time.

Two-phase approach:

  • Phase 1 (Bootstrap): The DRS parser extracts metadata from directory/filename structure only (finalised=False). A TABLE_ID_TO_FREQUENCY mapping infers frequency from table_id without file I/O.
  • Phase 2 (Lazy Finalization): At solve time, after filtering and grouping narrows candidates, only matched files are opened via the complete parser. Metadata is extracted, persisted to the DB, and cached so subsequent solves skip re-parsing.
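
As a rough illustration of Phase 1 (not the code in this PR), the sketch below derives metadata from a CMIP6 DRS-style path and looks up the frequency from table_id without touching the file. The helper name `parse_drs_path`, the example path, and the handful of table entries are assumptions for illustration; the PR's actual TABLE_ID_TO_FREQUENCY mapping and DRS parser may look different.

```python
from pathlib import PurePosixPath

# Illustrative subset only; the PR's real TABLE_ID_TO_FREQUENCY may differ.
TABLE_ID_TO_FREQUENCY = {
    "Amon": "mon",
    "Omon": "mon",
    "Lmon": "mon",
    "day": "day",
    "fx": "fx",
}

# Directory components of the CMIP6 DRS, in order, ending just before the filename.
DRS_DIRECTORY_KEYS = (
    "mip_era", "activity_id", "institution_id", "source_id", "experiment_id",
    "member_id", "table_id", "variable_id", "grid_label", "version",
)


def parse_drs_path(path: str) -> dict:
    """Hypothetical helper: build a metadata record from the path alone (no file I/O)."""
    parts = PurePosixPath(path).parts
    n = len(DRS_DIRECTORY_KEYS)
    if len(parts) < n + 1:
        raise ValueError(f"Path does not look like a CMIP6 DRS path: {path}")
    # The last component is the filename; the ten before it are the DRS directories.
    metadata = dict(zip(DRS_DIRECTORY_KEYS, parts[-(n + 1):-1]))
    metadata["path"] = path
    # Unknown tables map to None and would be resolved when the file is finalised.
    metadata["frequency"] = TABLE_ID_TO_FREQUENCY.get(metadata["table_id"])
    metadata["finalised"] = False
    return metadata


example = parse_drs_path(
    "/data/CMIP6/CMIP/CSIRO/ACCESS-ESM1-5/historical/r1i1p1f1/Amon/tas/gn/"
    "v20191115/tas_Amon_ACCESS-ESM1-5_historical_r1i1p1f1_gn_185001-201412.nc"
)
```

Fields that cannot be read from the path (for example the exact time bounds) are simply absent from the record until Phase 2 fills them in.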

Key components:

  • FinaliseableDatasetAdapterMixin — base class for adapters supporting two-phase ingestion
  • CMIP6DatasetAdapter.finalise_datasets() — opens unfinalised netCDFs, extracts metadata, applies fixes, persists to DB
  • DataCatalog wrapper — replaces raw pd.DataFrame in the solver with lazy loading (to_frame()) and per-group finalization (finalise(subset))
  • Solver integration — finalization happens after filter+group_by but before constraint checking
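
To make the DataCatalog idea concrete, here is a minimal sketch of a wrapper with a cached `to_frame()` and a per-subset `finalise()`. The class name `DataCatalogSketch` and the two injected callables are hypothetical stand-ins (for the database load and for the adapter's file-opening step); the PR's real class and signatures may differ.

```python
from typing import Callable, Optional

import pandas as pd


class DataCatalogSketch:
    """Hypothetical stand-in for DataCatalog; the PR's real class may differ."""

    def __init__(
        self,
        load: Callable[[], pd.DataFrame],
        finalise_rows: Callable[[pd.DataFrame], pd.DataFrame],
    ) -> None:
        self._load = load  # assumed: reads the catalog from the database
        self._finalise_rows = finalise_rows  # assumed: opens files, returns updated rows
        self._frame: Optional[pd.DataFrame] = None

    def to_frame(self) -> pd.DataFrame:
        # Load on first access only; later calls reuse the cached frame.
        if self._frame is None:
            self._frame = self._load()
        return self._frame

    def finalise(self, subset: pd.DataFrame) -> pd.DataFrame:
        frame = self.to_frame()
        # Consult the cache rather than the (possibly stale) subset passed in.
        current = frame.loc[subset.index]
        pending = current[~current["finalised"]]
        if not pending.empty:
            updated = self._finalise_rows(pending)
            # Write results back so later solves skip these rows.
            for column in updated.columns:
                frame.loc[updated.index, column] = updated[column]
        return frame.loc[subset.index]


load_calls = []
opened = []


def load() -> pd.DataFrame:
    load_calls.append(1)
    return pd.DataFrame(
        {
            "path": ["a.nc", "b.nc", "c.nc"],
            "variable_id": ["tas", "tas", "pr"],
            "finalised": [False, False, False],
        }
    )


def finalise_rows(pending: pd.DataFrame) -> pd.DataFrame:
    opened.extend(pending["path"])  # stands in for opening each netCDF file
    return pending.assign(finalised=True)


catalog = DataCatalogSketch(load, finalise_rows)
frame = catalog.to_frame()
tas = frame[frame["variable_id"] == "tas"]
first = catalog.finalise(tas)
second = catalog.finalise(tas)  # already finalised in the cache, so nothing is opened
```

In this toy run only the two `tas` files are ever "opened"; the `pr` row stays unfinalised because no solve asked for it.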

Also includes a refactor removing the unused config parameter from register_dataset.

Checklist

Please confirm that this pull request has done the following:

  • Tests added
  • Documentation added (where applicable)
  • Changelog item added to changelog/

The Config object was passed to register_dataset but never used
inside the method. Removing it simplifies the API and makes the
dependency on Database explicit.

Add a TABLE_ID_TO_FREQUENCY mapping so the DRS parser can determine
frequency without opening netCDF files. This is a prerequisite for
lazy dataset ingestion where full file I/O is deferred.

Introduce a mixin for adapters that support two-phase ingestion.
CMIP6DatasetAdapter.finalise_datasets() opens unfinalised netCDF files,
extracts full metadata, applies fixes, and persists back to the database.
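
The contract such a mixin implies could look roughly like the sketch below. The names `FinaliseableSketchMixin` and `ToyAdapter` and the injected `read_metadata` callable are hypothetical; they stand in for the PR's mixin, adapter, and complete parser, and database persistence is left out entirely.

```python
import pandas as pd


class FinaliseableSketchMixin:
    """Hypothetical outline: adapters that support two-phase ingestion implement this."""

    def finalise_datasets(self, datasets: pd.DataFrame) -> pd.DataFrame:
        raise NotImplementedError


class ToyAdapter(FinaliseableSketchMixin):
    def __init__(self, read_metadata):
        # read_metadata(path) -> dict; stands in for opening the file with the complete parser.
        self._read_metadata = read_metadata

    def finalise_datasets(self, datasets: pd.DataFrame) -> pd.DataFrame:
        records = []
        for record in datasets.to_dict("records"):
            if not record["finalised"]:
                # Overlay the full metadata on top of the path-derived record.
                record.update(self._read_metadata(record["path"]))
                record["finalised"] = True
            records.append(record)
        # Return a new frame with the same index; the input is left untouched.
        return pd.DataFrame(records, index=datasets.index)


adapter = ToyAdapter(lambda path: {"frequency": "mon"})
bootstrapped = pd.DataFrame(
    {"path": ["tas.nc"], "frequency": [None], "finalised": [False]}
)
finalised = adapter.finalise_datasets(bootstrapped)
```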
…lver

DataCatalog wraps the per-adapter catalog DataFrame with lazy loading
from the database and per-group finalization of unfinalised datasets.
The solver now finalizes datasets after filter+group_by but before
constraint checking, so only matched candidates trigger file I/O.
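
The ordering described above (filter, group, finalise, then check constraints) can be sketched as follows. The function `solve_sketch` and its parameters are hypothetical and much simpler than the real solver; the point is only that the finalise step sees groups that survived filtering.

```python
import pandas as pd


def solve_sketch(catalog, filters, group_by, finalise, constraint):
    """Hypothetical solver loop: finalise only groups that survive filtering."""
    frame = catalog
    for column, allowed in filters.items():
        frame = frame[frame[column].isin(allowed)]
    covered = {}
    for key, group in frame.groupby(group_by):
        group = finalise(group)  # the only step that would trigger file I/O
        if constraint(group):
            covered[key] = group
    return covered


opened = []


def finalise(group: pd.DataFrame) -> pd.DataFrame:
    opened.extend(group["path"])  # stands in for opening each netCDF file
    return group.assign(frequency="mon", finalised=True)


catalog = pd.DataFrame(
    {
        "source_id": ["A", "A", "B", "A"],
        "variable_id": ["tas", "tas", "tas", "pr"],
        "path": ["a1.nc", "a2.nc", "b1.nc", "a3.nc"],
        "finalised": [False, False, False, False],
    }
)

covered = solve_sketch(
    catalog,
    filters={"variable_id": ["tas"]},
    group_by=["source_id", "variable_id"],
    finalise=finalise,
    constraint=lambda group: (group["frequency"] == "mon").all(),
)
```

Here the `pr` file is filtered out before grouping, so it is never passed to `finalise`.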
Add unit tests for DataCatalog (lazy loading, cache, finalization paths),
CMIP6DatasetAdapter.finalise_datasets (DRS-to-complete round trip), and
solver integration (extract_covered_datasets with DataCatalog).

Fix a bug where finalise_datasets did not convert start_time/end_time
strings from the complete parser to datetime objects before persisting.
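
A minimal sketch of this kind of conversion, assuming the parser returns ISO-8601 strings in a standard calendar. The helper `to_datetime_or_none` is hypothetical and is not the `parse_datetime` function used in the PR.

```python
from datetime import datetime
from typing import Optional, Union


def to_datetime_or_none(value: Union[str, datetime, None]) -> Optional[datetime]:
    """Hypothetical helper: accept a string, a datetime, or None and normalise it."""
    if value is None or isinstance(value, datetime):
        return value
    return datetime.fromisoformat(value)


# Example record as it might come back from a parser (values are illustrative).
record = {"start_time": "1850-01-16 12:00:00", "end_time": "2014-12-16 12:00:00"}
for key in ("start_time", "end_time"):
    record[key] = to_datetime_or_none(record[key])
```

Dates that only exist in non-standard calendars (for example 30 February in a 360-day calendar) are rejected by `datetime.fromisoformat` and would need separate handling.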
@lewisjared changed the title from "feat: lazy dataset ingestion with two-phase finalization" to "feat: lazy dataset ingestion" on Feb 5, 2026
@lewisjared
Contributor Author

Replaces #369

Update documentation across tutorials, background, and how-to guides
to reflect the two-phase ingestion approach where CMIP6 DRS parsing
extracts metadata from paths only, with full metadata completed lazily
at solve time.
…zy-dataset-ingestion

* origin/cmip7-data-requirements: (30 commits)
  test: improve coverage for CMIP7, solver, and test_cases CLI
  fix: resolve merge conflict in solver OR-logic after InvalidDiagnosticException removal
  fix(tests): set up context manager mocks for xr.open_dataset in CMIP7 tests
  Add changelog
  Add codecov.yml with relaxed coverage thresholds
  feat(cmip7): add license_id and external_variables to CMIP7 model
  Also update output collection of regional historical trend diagnostic
  Add changelog
  Faster listing of regression data
  Add another variable to test case
  Do not try to push to container registry from forks
  Update regression test output
  Add changelog
  Split trends recipe and fix the other two regional recipes
  Update recipes
  Update regional historical annual cycle and timeseries
  Bump version: 0.9.0 → 0.9.1
  docs: fix admonition syntax and typo in getting-started guides
  fix(tests): update patch paths for lazy imports
  chore: add changelog and fix lint after merge
  ...

# Conflicts:
#	docs/getting-started/03-ingest.md
#	packages/climate-ref/src/climate_ref/cli/datasets.py
Base automatically changed from cmip7-data-requirements to main February 6, 2026 04:55
@lewisjared
Contributor Author

Closes #175

@codecov

codecov bot commented Feb 6, 2026

Codecov Report

❌ Patch coverage is 74.31193% with 28 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...ages/climate-ref/src/climate_ref/datasets/cmip6.py | 54.38% | 22 Missing and 4 partials ⚠️ |
| ...mate-ref/src/climate_ref/datasets/cmip6_parsers.py | 75.00% | 0 Missing and 1 partial ⚠️ |
| ...ges/climate-ref/src/climate_ref/datasets/mixins.py | 80.00% | 1 Missing ⚠️ |

| Files with missing lines | Coverage Δ |
|---|---|
| ...es/climate-ref-pmp/src/climate_ref_pmp/__init__.py | 100.00% <100.00%> (ø) |
| ...ckages/climate-ref/src/climate_ref/cli/datasets.py | 87.36% <100.00%> (ø) |
| ...ckages/climate-ref/src/climate_ref/data_catalog.py | 100.00% <100.00%> (ø) |
| ...s/climate-ref/src/climate_ref/datasets/__init__.py | 96.82% <100.00%> (-0.05%) ⬇️ |
| ...kages/climate-ref/src/climate_ref/datasets/base.py | 98.41% <ø> (-0.02%) ⬇️ |
| packages/climate-ref/src/climate_ref/solver.py | 98.96% <100.00%> (+0.03%) ⬆️ |
| ...mate-ref/src/climate_ref/datasets/cmip6_parsers.py | 90.14% <75.00%> (-0.91%) ⬇️ |
| ...ges/climate-ref/src/climate_ref/datasets/mixins.py | 80.00% <80.00%> (ø) |
| ...ages/climate-ref/src/climate_ref/datasets/cmip6.py | 74.33% <54.38%> (-20.41%) ⬇️ |
- Extract DB persistence into _persist_finalised_metadata method
- Skip parse_datetime and _apply_fixes when no rows were updated
- Remove redundant unfinalised guard (caller already checks)
- Use pd.isna(path) instead of row.get("path") for clarity
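
The previous code is not shown here, so the following is only one plausible reading of the last item: a missing path stored as NaN is truthy, so a truthiness check on `row.get("path")` can be misleading, while `pd.isna` states the intent directly. The row contents are made up for illustration.

```python
import pandas as pd

# A catalog row whose path is missing (stored as NaN).
row = pd.Series({"path": float("nan"), "variable_id": "tas"})

path = row.get("path")
looks_present_by_truthiness = bool(path)  # NaN is truthy, so this check passes
is_missing = bool(pd.isna(path))  # explicit missing-value check
```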