[smart-meter-analysis] Convert comed meter data from csv to parquet #59
Open
griffinsharps wants to merge 14 commits into main from 43-convert-coned-meter-data-from-csv-to-parquet
Conversation
Co-authored-by: Cursor <cursoragent@cursor.com>
…onth validator

Enhance validate_month_output.py with three preflight checks needed before scaling to full-month execution:

- Duplicate (zip_code, account_identifier, datetime) detection per batch file
- Row count reporting (total + per-file) in the validation report JSON
- Run artifact integrity via the --run-dir flag (plan.json, run_summary.json, manifests, batch summaries)

Add PREFLIGHT_200.md checklist for the 200-file EC2 validation run.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
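The per-batch duplicate check described above can be sketched in plain Python. Since the migration requires rows to arrive sorted by the composite key, a single adjacent-comparison pass is enough; the function and field names here are illustrative, not the PR's actual code.

```python
def find_duplicate_keys(rows):
    """Detect duplicate (zip_code, account_identifier, datetime) keys
    in a batch whose rows are already sorted by that composite key."""
    dups = []
    prev = None
    for row in rows:
        key = (row["zip_code"], row["account_identifier"], row["datetime"])
        if key == prev:  # adjacent equality is sufficient on sorted input
            dups.append(key)
        prev = key
    return dups

batch = [
    {"zip_code": "60601", "account_identifier": "A1", "datetime": "2023-07-01T00:00"},
    {"zip_code": "60601", "account_identifier": "A1", "datetime": "2023-07-01T00:30"},
    {"zip_code": "60601", "account_identifier": "A1", "datetime": "2023-07-01T00:30"},  # duplicate
]
print(find_duplicate_keys(batch))
```

On unsorted input this check would miss non-adjacent duplicates, which is why the validator also verifies sortedness.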
…s_sorted()

Polars 1.38 removed is_sorted() from Expr. Collect the composite key first, then check sortedness on the resulting Series, which retains the method.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
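The pattern behind this fix (materialize the composite key, then test sortedness on concrete values rather than through a lazy expression; in Polars that means collecting the key column and calling Series.is_sorted()) can be sketched library-free with illustrative names:

```python
def keys_are_sorted(keys):
    """Check sortedness on materialized composite-key tuples, mirroring
    collecting a key column and calling Series.is_sorted() on the result."""
    return all(a <= b for a, b in zip(keys, keys[1:]))

keys = [
    ("60601", "A1", "2023-07-01T00:00"),
    ("60601", "A1", "2023-07-01T00:30"),
    ("60602", "A1", "2023-07-01T00:00"),
]
print(keys_are_sorted(keys))  # True
```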
Restores Justfile from main and adds migrate-month recipe.

Usage: just migrate-month 202307

- batch-size 100, workers 6, lazy_sink, --resume
- Reads ~/s3_paths_<YYYYMM>_full.txt, writes to /ebs/.../out_<YYYYMM>_production
- Uses bare python (no uv) for EC2 compatibility

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
.gitignore: block *.txt, .tmp/, archive_quarantine/, tmp_polars_run_*/, and subagent_packages/ from being tracked.

pre-commit: add a detect-private-key hook and a local forbid-secrets hook that blocks .env, .secrets, credentials.json, .pem, .key, .p12, .pfx, and .jks files from being committed (even via git add -f).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace hash-based duplicate detection (n_unique, ~400MB/file) with adjacent-key streaming that leverages the required global sort order. Sortedness and uniqueness now share a single PyArrow iter_batches pass in full mode.

Key changes:

- _streaming_sort_and_dup_check: combined sort+dup check via PyArrow batch iteration, O(batch_size) memory, cross-file boundary state
- Per-file datetime stats with merge (_DtStats dataclass)
- Per-file DST stats with merge (_DstFileStats dataclass)
- Enhanced sample mode: strict-increasing check (catches dups in windows)
- Row counts from Parquet metadata (O(1), no data scan)
- Phase-based main() architecture (discovery -> metadata -> streaming -> datetime -> DST -> artifacts -> report)
- _fail() typed as NoReturn for mypy narrowing
- Add pyarrow mypy override in pyproject.toml

Removed dead functions: _check_sorted_full, _validate_no_duplicates_file, _validate_datetime_invariants_partition, _validate_dst_option_b_partition, _keys_is_sorted_df

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
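The adjacent-key streaming idea above can be sketched in plain Python (names are illustrative, not the PR's actual code): because the output is required to be globally sorted, a strictly increasing composite key across every batch proves both sortedness and uniqueness in a single pass, with only the previous key retained as cross-batch boundary state.

```python
def check_sorted_and_unique(batches):
    """Single pass over key batches: verify the composite key is strictly
    increasing, which proves both global sortedness and uniqueness.
    `prev` carries boundary state across batch (and file) boundaries,
    so memory stays O(batch size)."""
    prev = None
    n_rows = 0
    for batch in batches:
        for key in batch:
            if prev is not None and key <= prev:
                kind = "duplicate" if key == prev else "out-of-order"
                raise ValueError(f"{kind} key at row {n_rows}: {key!r}")
            prev = key
            n_rows += 1
    return n_rows

batches = [
    [("60601", "A1", "2023-07-01T00:00"), ("60601", "A1", "2023-07-01T00:30")],
    [("60601", "A2", "2023-07-01T00:00")],  # new batch: checked against carried-over prev
]
print(check_sorted_and_unique(batches))  # 3
```

In the real validator the batches would come from PyArrow's iter_batches so that only one record batch is resident at a time.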
Two fixes in validate_month_output.py:

1. _slice_keys: use lf.collect() instead of the streaming engine for slice reads, since streaming may reorder rows and defeat sortedness validation. Slices are small (5K rows x 3 cols), so the default engine is correct and fast.
2. _check_sorted_sample: track prev_end and only perform the cross-slice boundary comparison when off >= prev_end (non-overlapping). Random windows can overlap the head, the tail, and each other, making boundary checks invalid under overlap. Within-slice strict-monotonic checks still run unconditionally.

Also updates remaining collect(streaming=True) calls to collect(engine="streaming") to fix Polars deprecation warnings.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
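The second fix can be sketched as follows, with illustrative names and integer keys: each sampled window is checked for strict order internally, and the boundary comparison against the previous window only runs when the windows do not overlap (off >= prev_end), since an overlapping window legitimately starts at or before the previous window's last key.

```python
def check_sorted_sample(windows):
    """Validate sortedness over sampled (offset, keys) windows.
    Within each window keys must be strictly increasing; across windows,
    the boundary comparison is valid only when the next window starts at
    or after the previous one ended (off >= prev_end)."""
    prev_end = 0
    prev_last = None
    for off, keys in sorted(windows):  # process windows in offset order
        assert all(a < b for a, b in zip(keys, keys[1:])), "within-window order"
        if prev_last is not None and off >= prev_end:  # non-overlapping: boundary check valid
            assert prev_last < keys[0], "cross-window order"
        prev_end = off + len(keys)
        prev_last = keys[-1]

windows = [(0, [1, 2, 3]), (2, [3, 4, 5]), (10, [20, 21])]
check_sorted_sample(windows)  # window at off=2 overlaps, so its boundary check is skipped
```

Without the off >= prev_end guard, the overlapping window starting at offset 2 would fail the boundary comparison (3 is not greater than 3) even though the underlying data is perfectly sorted.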
Restores Justfile from main and adds a migrate-month YEAR_MONTH recipe:

- Guards against non-EC2 environments (checks the /ebs mount)
- Auto-generates the S3 input list via aws s3 ls + awk + sort
- Validates a non-empty input list before running
- Runs migrate_month_runner.py with standard production params (batch-size 100, workers 6, --resume, lazy_sink)

Usage: just migrate-month 202307

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Annotate migrate_month_runner.py, validate_month_output.py, and Justfile with industry-standard "why" comments for senior code review. Additions include module-level architecture docstrings, function-level design rationale, and parameter tuning explanations. No logic changes. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…r-data-from-csv-to-parquet
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Refactor migrate-month to use configurable variables (S3_PREFIX, MIGRATE_OUT_BASE, etc.) instead of hardcoded bucket names and usernames, preparing the repo for open-source release. Add five recipes: months-from-s3, migrate-months, validate-month, validate-months, and migration-status. Multi-month recipes support fail-fast (default) or continue-on-error mode with per-invocation UTC-timestamped logs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This is a project to convert the ComEd CSV files on S3 into Parquet for efficiency.