Conversation

@griffinsharps (Contributor)

This PR converts the ComEd meter-data CSV files on S3 into Parquet for more efficient storage and querying.

Griffin Sharps and others added 10 commits January 26, 2026 23:22
Co-authored-by: Cursor <cursoragent@cursor.com>
…onth validator

Enhance validate_month_output.py with three preflight checks needed before
scaling to full-month execution:
- Duplicate (zip_code, account_identifier, datetime) detection per batch file
- Row count reporting (total + per-file) in validation report JSON
- Run artifact integrity via --run-dir flag (plan.json, run_summary.json,
  manifests, batch summaries)

Add PREFLIGHT_200.md checklist for 200-file EC2 validation run.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
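
For context, a minimal sketch of what the --run-dir integrity check could look like. plan.json and run_summary.json are named above; the function name and the glob patterns for manifests and batch summaries are assumptions:

```python
from pathlib import Path

# Hypothetical sketch of the --run-dir artifact integrity check.
# plan.json and run_summary.json are the artifacts named in the commit;
# the glob patterns for manifests and batch summaries are assumptions.
def check_run_artifacts(run_dir: Path) -> list[str]:
    problems: list[str] = []
    for name in ("plan.json", "run_summary.json"):
        if not (run_dir / name).is_file():
            problems.append(f"missing artifact: {name}")
    if not list(run_dir.glob("*manifest*.json")):
        problems.append("no manifest files found")
    if not list(run_dir.glob("*batch*summary*.json")):
        problems.append("no batch summary files found")
    return problems
```
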
…s_sorted()

Polars 1.38 removed is_sorted() from Expr. Collect the composite key first,
then check sortedness on the resulting Series, which retains the method.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
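
A minimal sketch of that workaround, assuming the validator's key columns and a string-encoded composite key (the encoding must preserve the tuple sort order, e.g. fixed-width fields):

```python
import polars as pl

# Sketch of the fix described above: materialize the composite key as a
# Series, then call Series.is_sorted(), which Polars kept after removing
# Expr.is_sorted(). Assumes the string encoding preserves the tuple sort
# order (e.g. zero-padded, fixed-width fields).
def composite_key_is_sorted(lf: pl.LazyFrame) -> bool:
    keys = lf.select(
        pl.concat_str(
            [
                pl.col("zip_code"),
                pl.col("account_identifier"),
                pl.col("datetime").cast(pl.String),
            ],
            separator="|",
        ).alias("key")
    ).collect()
    return keys.get_column("key").is_sorted()
```
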
Restores Justfile from main and adds migrate-month recipe.
Usage: just migrate-month 202307
- batch-size 100, workers 6, lazy_sink, --resume
- Reads ~/s3_paths_<YYYYMM>_full.txt, writes to /ebs/.../out_<YYYYMM>_production
- Uses bare python (no uv) for EC2 compatibility

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
.gitignore: block *.txt, .tmp/, archive_quarantine/, tmp_polars_run_*/,
subagent_packages/ from being tracked.

pre-commit: add detect-private-key hook and a local forbid-secrets hook
that blocks .env, .secrets, credentials.json, .pem, .key, .p12, .pfx,
.jks files from being committed (even via git add -f).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
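
A sketch of what the local forbid-secrets hook script could look like. pre-commit passes staged filenames as arguments; the script itself, including its name and exact matching rules, is illustrative:

```python
#!/usr/bin/env python3
"""Hypothetical forbid-secrets hook: pre-commit passes staged filenames
as argv; exiting non-zero blocks the commit, even after git add -f."""
import sys
from pathlib import Path

FORBIDDEN_NAMES = {".env", ".secrets", "credentials.json"}
FORBIDDEN_SUFFIXES = {".pem", ".key", ".p12", ".pfx", ".jks"}

def main() -> int:
    bad = [
        f for f in sys.argv[1:]
        if Path(f).name in FORBIDDEN_NAMES or Path(f).suffix in FORBIDDEN_SUFFIXES
    ]
    for f in bad:
        print(f"forbid-secrets: refusing to commit {f}")
    return 1 if bad else 0

if __name__ == "__main__":
    raise SystemExit(main())
```
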
Replace hash-based duplicate detection (n_unique, ~400 MB of memory per
file) with adjacent-key streaming that leverages the required global sort
order. Sortedness and uniqueness now share a single PyArrow iter_batches
pass in full mode.

Key changes:
- _streaming_sort_and_dup_check: combined sort+dup via PyArrow
  batch iteration, O(batch_size) memory, cross-file boundary state
- Per-file datetime stats with merge (_DtStats dataclass)
- Per-file DST stats with merge (_DstFileStats dataclass)
- Enhanced sample mode: strict-increasing check (catches dups in windows)
- Row counts from parquet metadata (O(1), no data scan)
- Phase-based main() architecture (discovery -> metadata -> streaming
  -> datetime -> DST -> artifacts -> report)
- _fail() typed as NoReturn for mypy narrowing
- Add pyarrow mypy override in pyproject.toml

Removed dead functions: _check_sorted_full, _validate_no_duplicates_file,
_validate_datetime_invariants_partition, _validate_dst_option_b_partition,
_keys_is_sorted_df

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
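
A condensed sketch of the adjacent-key idea: because output must be globally sorted on (zip_code, account_identifier, datetime), any duplicate key is adjacent to its twin, so one pass catches both problems. Function name, batch size, and error reporting are illustrative:

```python
import pyarrow.parquet as pq

KEY_COLS = ["zip_code", "account_identifier", "datetime"]

def check_sorted_and_unique(paths: list[str]) -> tuple[int, list[str]]:
    """One streaming pass over the key columns: O(batch_size) memory,
    with prev_key carrying state across batch and file boundaries."""
    errors: list[str] = []
    total_rows = 0
    prev_key = None
    for path in paths:
        pf = pq.ParquetFile(path)
        total_rows += pf.metadata.num_rows  # O(1): footer metadata, no data scan
        for batch in pf.iter_batches(columns=KEY_COLS, batch_size=65_536):
            cols = [batch.column(i).to_pylist() for i in range(batch.num_columns)]
            for key in zip(*cols):
                if prev_key is not None:
                    if key < prev_key:
                        errors.append(f"{path}: out of order at {key}")
                    elif key == prev_key:
                        errors.append(f"{path}: duplicate key {key}")
                prev_key = key
    return total_rows, errors
```
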
Two fixes in validate_month_output.py:

1. _slice_keys: use lf.collect() instead of streaming engine for slice
   reads — streaming may reorder rows, defeating sortedness validation.
   Slices are small (5K rows x 3 cols) so default engine is correct and fast.

2. _check_sorted_sample: track prev_end and only perform cross-slice
   boundary comparison when off >= prev_end (non-overlapping). Random
   windows can overlap head/tail/each other, making boundary checks
   invalid under overlap. Within-slice strict-monotonic checks still
   run unconditionally.

Also updates remaining collect(streaming=True) calls to
collect(engine="streaming") to fix Polars deprecation warnings.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
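
The engine swap is mechanical (collect(streaming=True) becomes collect(engine="streaming")); the boundary rule in fix 2 is the subtle part. A rough sketch, assuming a single precomputed "key" column and (offset, length) windows:

```python
import polars as pl

def check_sorted_sample(lf: pl.LazyFrame, windows: list[tuple[int, int]]) -> bool:
    """Hypothetical sketch of the overlap-aware sample check."""
    prev_end = 0
    prev_tail = None
    for off, length in sorted(windows):
        # Fix 1: default engine for small slices; streaming may reorder rows.
        keys = lf.slice(off, length).collect().get_column("key")
        # Within-slice strict-monotonic check always runs.
        for a, b in zip(keys[:-1], keys[1:]):
            if not a < b:
                return False
        # Fix 2: cross-slice boundary check only when windows don't overlap.
        if prev_tail is not None and off >= prev_end:
            if not prev_tail < keys[0]:
                return False
        prev_end = off + length
        prev_tail = keys[-1]
    return True
```
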
Restores Justfile from main and adds migrate-month YEAR_MONTH recipe:
- Guards against non-EC2 environments (checks /ebs mount)
- Auto-generates S3 input list via aws s3 ls + awk + sort
- Validates non-empty input list before running
- Runs migrate_month_runner.py with standard production params
  (batch-size 100, workers 6, --resume, lazy_sink)

Usage: just migrate-month 202307

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
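
For readers outside the Justfile, a rough Python equivalent of the input-list generation step; the bucket, prefix, and filename filtering are assumptions, not the repo's actual values:

```python
import subprocess

def s3_paths_for_month(bucket: str, prefix: str, yyyymm: str) -> list[str]:
    """Rough equivalent of the `aws s3 ls` + awk + sort pipeline above."""
    listing = subprocess.run(
        ["aws", "s3", "ls", f"s3://{bucket}/{prefix}/"],
        check=True, capture_output=True, text=True,
    ).stdout
    names = [line.split()[-1] for line in listing.splitlines() if line.strip()]
    return sorted(
        f"s3://{bucket}/{prefix}/{name}"
        for name in names
        if yyyymm in name and name.endswith(".csv")
    )
```
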
@griffinsharps griffinsharps linked an issue Feb 6, 2026 that may be closed by this pull request
@griffinsharps griffinsharps self-assigned this Feb 6, 2026
@griffinsharps griffinsharps added the enhancement New feature or request label Feb 6, 2026
@griffinsharps griffinsharps changed the title 43 convert comed meter data from csv to parquet [smart-meter-analysis] 43 convert comed meter data from csv to parquet Feb 6, 2026
@griffinsharps griffinsharps changed the title [smart-meter-analysis] 43 convert comed meter data from csv to parquet [smart-meter-analysis] Convert comed meter data from csv to parquet Feb 6, 2026
Griffin Sharps and others added 4 commits February 6, 2026 21:41
Annotate migrate_month_runner.py, validate_month_output.py, and Justfile
with industry-standard "why" comments for senior code review. Additions
include module-level architecture docstrings, function-level design
rationale, and parameter tuning explanations. No logic changes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Refactor migrate-month to use configurable variables (S3_PREFIX,
MIGRATE_OUT_BASE, etc.) instead of hardcoded bucket names and
usernames, preparing the repo for open-source release. Add five recipes:
months-from-s3, migrate-months, validate-month, validate-months,
and migration-status. Multi-month recipes support fail-fast (default)
or continue-on-error mode with per-invocation UTC-timestamped logs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>