Open
Description
Part of #271 — Gastown Cloud Proposal A (Sandbox-per-Town)
Goal
Handle edge cases identified in the implementation plan and apply security hardening to the sandbox infrastructure.
Requirements
Edge Case Handling
| Edge Case | Implementation |
|---|---|
| Volume loss during active work | R2 restore from latest snapshot. Gastown's checkPolecatSessionHealth() + checkpoint recovery restarts polecats with hooked work. Verify this flow end-to-end. |
| Dolt corruption after restore | Add dolt verify-constraints to restore script (PR 4). If verification fails, attempt dolt backup restore from previous snapshot. If all else fails, fresh Dolt init. Log corruption events. |
| Git repo too large for R2 | Support shallow clone option in rig config: { "shallow_depth": 100 }. Use incremental git bundles: git bundle create <file> --since=<last_backup_timestamp> <refs>. For repos >2GB, log warning and recommend external git remote. |
| Rapid machine restarts | Track consecutive restart failures in gastown_towns.config: { "restart_failures": N, "last_failure_at": "..." }. After 3 failures in 10 minutes → set status to error, stop auto-restart, notify user via dashboard alert. |
| JWT token expiry during long work | Agents get 401 from gateway → LLM calls fail but processes don't crash. Health monitor detects stale last_heartbeat_at and alerts. Dashboard shows "Token expired" warning with manual refresh button. |
| Concurrent sandbox API calls | Add request queuing in the sandbox internal API for gt CLI operations (which are not thread-safe). Use a mutex/queue for commands that modify state. Read-only queries (beads, convoys) can run concurrently via Dolt SQL. |
| Disk space exhaustion | Health check reports disk usage. Alert at 80% capacity in dashboard. Auto-clean: gt gc to remove old worktrees, prune Dolt garbage, clean /tmp. If >95%, stop spawning new polecats. |
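
The rapid-restart rule in the table (3 failures in 10 minutes, then stop auto-restart) could be sketched as below. This is a minimal sketch, not the actual health-monitor.ts code: the function name `recordRestartFailure` and the `failureTimes` list are hypothetical, and it assumes failure timestamps are kept alongside the `restart_failures` and `last_failure_at` fields so the 10-minute window can be evaluated exactly.

```typescript
// Thresholds taken from the edge-case table: 3 failures within 10 minutes.
const MAX_FAILURES = 3;
const WINDOW_MS = 10 * 60 * 1000;

// Hypothetical result shape; only `config` mirrors fields named in the issue.
interface BreakerResult {
  failureTimes: number[]; // epoch ms of failures still inside the window
  config: { restart_failures: number; last_failure_at: string };
  tripped: boolean; // true means: set status to error and stop auto-restart
}

function recordRestartFailure(failureTimes: number[], nowMs: number): BreakerResult {
  // Keep only failures that fall inside the sliding window, including this one.
  const recent = [...failureTimes, nowMs].filter((t) => nowMs - t <= WINDOW_MS);
  return {
    failureTimes: recent,
    config: {
      restart_failures: recent.length,
      last_failure_at: new Date(nowMs).toISOString(),
    },
    tripped: recent.length >= MAX_FAILURES,
  };
}
```

A caller would persist `config` into gastown_towns.config after each failed restart and clear the stored timestamps after a successful one. Whether a success resets the counter is an assumption here, although "consecutive" in the table suggests it.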
Security Hardening
| Concern | Implementation |
|---|---|
| Sandbox network isolation | Verify Fly machine is on private network. Only the internal API port (8080) and terminal proxy port (8081) exposed via Fly proxy. No direct SSH access. Document network policy. |
| User code execution | Agents run arbitrary code in the sandbox. Sandbox is per-user — no multi-tenancy within a machine. Document that user code runs with full sandbox permissions. |
| Gateway token scope | Verify JWT is scoped to user's billing context only. No cross-user access possible. Add test for token validation with wrong user. |
| Internal API auth | Verify x-internal-api-key is required on all endpoints. Key is encrypted via NaCl box in transit. Add rate limiting on internal API (prevent abuse if key is leaked). |
| R2 backup access | Verify R2 credentials are scoped to the gastown-backups bucket. Per-town key prefix isolation — sandbox can only read/write its own town's backups. Add IAM policy documentation. |
| Secrets in env vars | Audit all env vars passed to sandbox. Ensure secrets (KILO_JWT, INTERNAL_API_KEY, R2_SECRET_ACCESS_KEY) are encrypted via NaCl box and not logged. |
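
For the "not logged" part of the secrets audit, one option is to redact known secret variables before any environment dump reaches a log line or error response. The sketch below is illustrative only: `redactEnv` and the `[REDACTED]` placeholder are hypothetical, and the key list contains just the three variables named in the table.

```typescript
// Secret variable names taken from the security table above.
const SECRET_KEYS = new Set(["KILO_JWT", "INTERNAL_API_KEY", "R2_SECRET_ACCESS_KEY"]);

// Returns a copy of the environment that is safe to log (hypothetical helper).
function redactEnv(env: Record<string, string | undefined>): Record<string, string> {
  const out: Record<string, string> = {};
  for (const [key, value] of Object.entries(env)) {
    if (value === undefined) continue; // skip unset variables
    out[key] = SECRET_KEYS.has(key) ? "[REDACTED]" : value;
  }
  return out;
}
```

Logging `redactEnv(process.env)` instead of `process.env` at startup would keep the secret values out of the logs, while still showing that the variables are set.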
Resource Monitoring
- Add resource usage to health check response: CPU %, memory %, disk %
- Dashboard shows resource usage on town card and detail page
- Alerts at thresholds: 80% disk, 90% memory
- Log resource metrics for capacity planning
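
The thresholds above, together with the 95% disk cutoff from the edge-case table, could be evaluated in a single pure function that the health monitor calls on each health check response. This is a sketch under assumptions: the `ResourceUsage` field names and `evaluateResources` are hypothetical, and "alert at 80%" is read as "at or above 80%".

```typescript
// Hypothetical shape for the resource section of the health check response.
interface ResourceUsage {
  cpuPercent: number;
  memoryPercent: number;
  diskPercent: number;
}

interface ResourceDecision {
  alerts: Array<"disk" | "memory">; // alerts to surface in the dashboard
  allowNewPolecats: boolean; // false once disk usage exceeds 95%
}

function evaluateResources(usage: ResourceUsage): ResourceDecision {
  const alerts: Array<"disk" | "memory"> = [];
  if (usage.diskPercent >= 80) alerts.push("disk"); // 80% disk alert
  if (usage.memoryPercent >= 90) alerts.push("memory"); // 90% memory alert
  return { alerts, allowNewPolecats: usage.diskPercent <= 95 };
}
```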
Files
- Updates to cloud/infra/gastown-sandbox/r2-restore.sh: Dolt corruption handling
- Updates to cloud/infra/gastown-sandbox/internal-api/: request queuing, disk monitoring
- Updates to cloud/src/lib/gastown/health-monitor.ts: restart circuit breaker, resource alerts
- Updates to cloud/src/server/api/routers/gastown.ts: token refresh UI endpoint
- Updates to dashboard components: resource usage display, alerts
Acceptance Criteria
- Dolt corruption is detected and handled during R2 restore (falls back to previous snapshot)
- Large repo handling: shallow clone config option works, incremental bundles reduce sync time
- Rapid restart circuit breaker: 3 failures in 10 min → status error, user notified
- JWT expiry: dashboard shows warning, manual refresh button works
- Concurrent API calls: state-modifying operations are serialized, reads are concurrent
- Disk space: alert at 80%, auto-clean triggers, new polecats blocked at 95%
- Network: only ports 8080 and 8081 are accessible via Fly proxy
- Internal API: all endpoints require auth, rate limiting applied
- R2: credentials scoped to bucket, per-town prefix isolation verified
- Secrets: not logged in startup, health check, or error responses
- Resource metrics included in health check and displayed in dashboard
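
The "state-modifying operations are serialized, reads are concurrent" criterion can be met with a small promise-chain queue in the internal API. The class below is a hypothetical sketch, not the existing implementation: `SerialQueue` is an invented name, and it assumes each state-modifying gt CLI invocation is wrapped in an async function.

```typescript
// Runs async tasks one at a time, in submission order (hypothetical helper).
class SerialQueue {
  // Tail of the chain; it never rejects, so one failed task cannot block the rest.
  private tail: Promise<unknown> = Promise.resolve();

  run<T>(task: () => Promise<T>): Promise<T> {
    const result = this.tail.then(() => task());
    this.tail = result.catch(() => undefined); // swallow errors on the chain only
    return result; // the caller still sees the task's own result or error
  }
}
```

State-modifying handlers would go through one shared `SerialQueue` instance, while read-only queries bypass it and hit Dolt SQL directly.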
Dependencies
- All prior PRs (1–10) — this is the final hardening pass