Part of #271 — Gastown Cloud Proposal A (Sandbox-per-Town)
Goal
Implement cloud-side health monitoring, stop/start lifecycle, idle auto-stop, and disaster recovery for gastown sandboxes.
Requirements
Health Monitoring
- Cloud-side cron job (every 3 minutes, matching the gastown daemon heartbeat)
- For each active town:
  - Call the sandbox internal API: `GET /health`
  - Health response includes:
    - `gt` daemon running (PID alive)
    - Dolt server running
    - Number of active tmux sessions
    - Last heartbeat timestamp from the daemon
    - Disk usage (volume capacity)
    - Memory/CPU usage
  - Update `gastown_towns.last_heartbeat_at`
  - Track consecutive health-check failures per town (see the sketch after this list)
  - If 3 consecutive failures → mark the town `unhealthy` → attempt a machine restart
  - If the restart fails → mark as `error` → notify the user
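A minimal sketch of the per-town check, assuming placeholder `Town`/`Db`/`FlyClient` shapes (the real types come from PR 2) and illustrative `HealthResponse` field names rather than the actual `/health` contract:

```ts
// health-monitor.ts sketch: per-town check with consecutive-failure counting.
// Town, Db, and FlyClient are assumed shapes, not the real PR 2 types.

interface HealthResponse {
  daemonRunning: boolean;   // gt daemon PID alive
  doltRunning: boolean;     // Dolt server responding
  tmuxSessions: number;     // active tmux sessions
  lastHeartbeatAt: string;  // last daemon heartbeat (ISO timestamp)
  diskUsedPct: number;      // volume capacity used
  memUsedPct: number;
  cpuUsedPct: number;
}

interface Town {
  id: string;
  machineId: string;
  internalUrl: string;         // base URL of the sandbox internal API
  consecutiveFailures: number;
}

interface Db {
  updateTown(
    id: string,
    patch: Partial<{ status: string; lastHeartbeatAt: string; consecutiveFailures: number }>,
  ): Promise<void>;
}

interface FlyClient {
  restartMachine(machineId: string): Promise<boolean>; // true if the restart succeeded
}

const MAX_FAILURES = 3;

export async function checkTownHealth(town: Town, db: Db, fly: FlyClient): Promise<void> {
  try {
    // A hung sandbox should count as a failure, not stall the whole cron run.
    const res = await fetch(`${town.internalUrl}/health`, { signal: AbortSignal.timeout(10_000) });
    if (!res.ok) throw new Error(`GET /health -> ${res.status}`);
    const health = (await res.json()) as HealthResponse;
    await db.updateTown(town.id, {
      lastHeartbeatAt: health.lastHeartbeatAt,
      consecutiveFailures: 0, // a healthy response resets the counter
    });
  } catch {
    const failures = town.consecutiveFailures + 1;
    if (failures < MAX_FAILURES) {
      await db.updateTown(town.id, { consecutiveFailures: failures });
      return;
    }
    // Third consecutive failure: mark unhealthy, then attempt a machine restart.
    await db.updateTown(town.id, { status: "unhealthy", consecutiveFailures: failures });
    if (!(await fly.restartMachine(town.machineId))) {
      await db.updateTown(town.id, { status: "error" }); // caller notifies the user
    }
  }
}
```

Resetting the counter on any healthy response keeps a transient blip from accumulating into a false restart.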
Stop Flow (user-initiated or idle timeout)
- Call the sandbox internal API: `POST /sync` (trigger an immediate R2 backup)
- Send SIGTERM to the sandbox → handler runs `gt down` + final R2 sync
- Call the Fly API: stop machine (volume persists)
- Update `gastown_towns.status` → `stopped`
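In code, a minimal sketch of the stop flow, assuming hypothetical `StopDeps` names and assuming the Fly stop call is what delivers SIGTERM to the sandbox:

```ts
// stopTown sketch. Types are assumed shapes, not the real PR 2 definitions.
interface StopDeps {
  updateTownStatus(id: string, status: "stopped"): Promise<void>;
  stopMachine(machineId: string): Promise<void>; // Fly stop; delivers SIGTERM to the sandbox
}

export async function stopTown(
  town: { id: string; machineId: string; internalUrl: string },
  deps: StopDeps,
): Promise<void> {
  // 1. Trigger an immediate R2 backup while the sandbox is still healthy.
  await fetch(`${town.internalUrl}/sync`, { method: "POST" });
  // 2. Stop the machine. The SIGTERM handler inside the sandbox runs
  //    `gt down` plus a final R2 sync before the process exits.
  await deps.stopMachine(town.machineId);
  // 3. The volume persists, so only the DB status changes.
  await deps.updateTownStatus(town.id, "stopped");
}
```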
Start Flow
- Call the Fly API: start machine
- Machine boots → `startup.sh` runs → volume data is already there → `gt up`
- Daemon heartbeat restarts all agent roles within 3 minutes
- Update `gastown_towns.status` → `running`
- Total cold start: ~10–15s (no R2 restore needed since the volume persists)
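A matching `startTown` sketch with the same style of assumed dependencies; the polling ceiling (30 attempts at 2-second intervals) is an assumption sized well above the ~10–15s cold start:

```ts
// startTown sketch: start the machine, then poll /health until it passes.
interface StartDeps {
  updateTownStatus(id: string, status: "running"): Promise<void>;
  startMachine(machineId: string): Promise<void>;
}

export async function startTown(
  town: { id: string; machineId: string; internalUrl: string },
  deps: StartDeps,
): Promise<void> {
  await deps.startMachine(town.machineId);
  // startup.sh finds the volume already populated, skips the R2 restore,
  // and runs `gt up`; the town should answer /health within seconds.
  for (let attempt = 0; attempt < 30; attempt++) {
    try {
      const res = await fetch(`${town.internalUrl}/health`, { signal: AbortSignal.timeout(5_000) });
      if (res.ok) {
        await deps.updateTownStatus(town.id, "running");
        return;
      }
    } catch {
      // Not reachable yet; keep polling.
    }
    await new Promise((resolve) => setTimeout(resolve, 2_000));
  }
  throw new Error(`town ${town.id} did not pass its health check after start`);
}
```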
Idle Auto-Stop
- If no active polecats AND no user interaction for 30 minutes → auto-stop the town
- "User interaction" = any tRPC call or WebSocket connection to this town
- Track the last interaction timestamp on `gastown_towns` (or in memory)
- Resume automatically on the next dashboard visit or API call (`startTown` if stopped)
- Idle timeout is configurable per town (default 30 min), as sketched below
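A sketch of the idle predicate, assuming `lastInteractionAt` is bumped by tRPC middleware and the WebSocket handler (not shown), and that the polecat count comes from the health response:

```ts
// Idle check sketch, run from the same cron as the health monitor.
const DEFAULT_IDLE_TIMEOUT_MS = 30 * 60 * 1000; // 30 minutes

export function isIdle(
  town: { lastInteractionAt: Date; idleTimeoutMs?: number },
  activePolecats: number,
  now: Date = new Date(),
): boolean {
  const timeout = town.idleTimeoutMs ?? DEFAULT_IDLE_TIMEOUT_MS; // per-town override
  const idleFor = now.getTime() - town.lastInteractionAt.getTime();
  return activePolecats === 0 && idleFor >= timeout;
}
```

When this returns true, the cron runs the same stop flow as a user-initiated `stopTown`.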
Destroy Flow
- Call the sandbox internal API: `POST /sync` (final backup)
- Soft-delete the DB row: `gastown_towns.destroyed_at = NOW()`
- Call the Fly API: destroy machine + volume
- Retain R2 backup for 30 days (user can request restore)
- After 30 days: R2 lifecycle policy deletes backup
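A `destroyTown` sketch with assumed dependency names; the 30-day retention is enforced by the R2 bucket lifecycle rule, not by this code path:

```ts
// destroyTown sketch. Types are assumed shapes, not the real PR 2 definitions.
interface DestroyDeps {
  softDeleteTown(id: string): Promise<void>; // sets gastown_towns.destroyed_at = NOW()
  destroyMachine(machineId: string): Promise<void>;
  destroyVolume(volumeId: string): Promise<void>;
}

export async function destroyTown(
  town: { id: string; machineId: string; volumeId: string; internalUrl: string },
  deps: DestroyDeps,
): Promise<void> {
  // Final backup first, so the retained R2 copy is as fresh as possible.
  await fetch(`${town.internalUrl}/sync`, { method: "POST" });
  await deps.softDeleteTown(town.id);
  await deps.destroyMachine(town.machineId);
  await deps.destroyVolume(town.volumeId);
  // The R2 backup stays for 30 days; the bucket lifecycle policy expires it.
}
```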
Disaster Recovery (Volume Loss)
If the Fly machine or volume is catastrophically lost:
- Health check detects unreachable machine (3 consecutive failures)
- Attempt machine restart via Fly API
- If machine is gone: create new Fly machine + volume in same region
- `startup.sh` detects an empty volume → triggers R2 restore → `gt up`
- Gastown's built-in recovery handles the rest:
  - Daemon cleans stale PIDs, acquires flock
  - Dolt health ticker restarts the Dolt server
  - Heartbeat recreates deacon, mayor, witnesses, refineries
  - `checkPolecatSessionHealth()` restarts polecats with hooked work
  - `gt prime --hook` re-injects role context on every new session
Data loss window: up to 5 minutes (R2 sync interval).
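A sketch of the cloud-side recovery branch once a restart attempt has already failed; `machineExists` and `createMachineWithVolume` are hypothetical helpers:

```ts
// Disaster-recovery branch sketch: runs after fly.restartMachine() failed.
interface RecoveryDeps {
  machineExists(machineId: string): Promise<boolean>;
  createMachineWithVolume(townId: string, region: string): Promise<{ machineId: string }>;
  updateTown(id: string, patch: { machineId?: string; status: string }): Promise<void>;
}

export async function recoverTown(
  town: { id: string; machineId: string; region: string },
  deps: RecoveryDeps,
): Promise<void> {
  if (await deps.machineExists(town.machineId)) {
    // Machine exists but will not restart: leave it in `error` for a human.
    await deps.updateTown(town.id, { status: "error" });
    return;
  }
  // Machine (and volume) is gone: recreate in the same region. startup.sh
  // sees the empty volume, restores from R2, and runs `gt up`; the daemon
  // heartbeat then rebuilds the agent roles on its own.
  const { machineId } = await deps.createMachineWithVolume(town.id, town.region);
  // Marked running optimistically; the next health-monitor pass verifies it.
  await deps.updateTown(town.id, { machineId, status: "running" });
}
```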
Files
- `cloud/src/lib/gastown/health-monitor.ts` (new)
- Updates to `cloud/src/server/api/routers/gastown.ts` — lifecycle endpoints, idle tracking
- `cloud/src/lib/gastown/fly-client.ts` — stop/start/restart/destroy machine methods
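For `fly-client.ts`, a minimal sketch against the Fly Machines REST API (`api.machines.dev`); the endpoint paths follow the public Machines API docs, but the app/token wiring is an assumption and worth checking against the current spec:

```ts
// fly-client.ts sketch. Endpoint paths follow the public Fly Machines API;
// app name and token wiring are assumptions.
const FLY_API = "https://api.machines.dev/v1";

export class FlyClient {
  constructor(
    private readonly app: string,
    private readonly token: string,
  ) {}

  private async request(method: string, path: string, body?: unknown): Promise<Response> {
    const res = await fetch(`${FLY_API}/apps/${this.app}${path}`, {
      method,
      headers: {
        Authorization: `Bearer ${this.token}`,
        "Content-Type": "application/json",
      },
      body: body === undefined ? undefined : JSON.stringify(body),
    });
    if (!res.ok) throw new Error(`${method} ${path} -> ${res.status}`);
    return res;
  }

  startMachine(id: string) {
    return this.request("POST", `/machines/${id}/start`);
  }

  // Passing SIGTERM lets the sandbox handler run `gt down` + a final R2 sync.
  stopMachine(id: string) {
    return this.request("POST", `/machines/${id}/stop`, { signal: "SIGTERM" });
  }

  restartMachine(id: string) {
    return this.request("POST", `/machines/${id}/restart`);
  }

  destroyMachine(id: string) {
    return this.request("DELETE", `/machines/${id}?force=true`);
  }

  destroyVolume(volumeId: string) {
    return this.request("DELETE", `/volumes/${volumeId}`);
  }
}
```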
Acceptance Criteria
- Health monitor cron runs every 3 minutes for all active towns
- Health check failures are tracked per-town with consecutive failure counting
- 3 consecutive failures trigger automatic restart attempt
- `stopTown` triggers R2 sync, sends SIGTERM, stops the Fly machine, updates DB status
- `startTown` starts the Fly machine, waits for the health check to pass, updates DB status
- Idle auto-stop triggers after 30 minutes of no polecats and no user interaction
- Auto-resume works when user visits dashboard for a stopped town
- `destroyTown` soft-deletes, destroys Fly resources, retains the R2 backup
- Disaster recovery creates a new machine and restores from R2 on volume loss
- Idle timeout is configurable per town
Dependencies
- PR 2 (Provisioning API) — DB schema and Fly client
- PR 3 (Sandbox Internal API) — `/health` and `/sync` endpoints
- PR 4 (R2 Backup System) — R2 sync and restore scripts