Health Monitoring & Lifecycle Management #286

Part of #271 — Gastown Cloud Proposal A (Sandbox-per-Town)

## Goal

Implement cloud-side health monitoring, stop/start lifecycle, idle auto-stop, and disaster recovery for gastown sandboxes.

## Requirements

### Health Monitoring

A cloud-side cron job runs every 3 minutes, matching the gastown daemon heartbeat interval.

For each active town:
1. Call sandbox internal API `GET /health`
2. Health response includes:
   - `gt` daemon running (PID alive)
   - Dolt server running
   - Number of active tmux sessions
   - Last heartbeat timestamp from daemon
   - Disk usage (volume capacity)
   - Memory/CPU usage
3. Update `gastown_towns.last_heartbeat_at`
4. Track consecutive health check failures per town
5. After 3 consecutive failures → mark town as `unhealthy` → attempt machine restart
6. If restart fails → mark as `error` → notify user
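
Steps 4 to 6 amount to a small per-town state machine. The following is a minimal sketch of that bookkeeping, not the actual implementation: `HealthState`, `recordHealthCheck`, and `recordRestartResult` are hypothetical names, and two behaviours are assumptions that the steps above do not specify: a restart is requested only on the check that crosses the threshold, and a passing check resets the town to `running`.

```typescript
// `unhealthy` and `error` come from the steps above; `running` is borrowed
// from the Start Flow. All other names here are hypothetical.
type TownStatus = "running" | "unhealthy" | "error";

interface HealthState {
  consecutiveFailures: number;
  status: TownStatus;
}

const FAILURE_THRESHOLD = 3; // step 5: 3 consecutive failures

// Fold one health-check result into the per-town state and report whether
// the caller should attempt a machine restart.
function recordHealthCheck(
  state: HealthState,
  ok: boolean,
): { state: HealthState; restart: boolean } {
  if (ok) {
    // Assumption: any passing check resets the streak and the status.
    return { state: { consecutiveFailures: 0, status: "running" }, restart: false };
  }
  const consecutiveFailures = state.consecutiveFailures + 1;
  // Assumption: request a restart only on the check that crosses the
  // threshold, so a town that stays down is not restarted on every tick.
  const restart = consecutiveFailures === FAILURE_THRESHOLD;
  const status: TownStatus =
    consecutiveFailures >= FAILURE_THRESHOLD && state.status !== "error"
      ? "unhealthy"
      : state.status;
  return { state: { consecutiveFailures, status }, restart };
}

// Step 6: a failed restart moves the town to `error`; the caller is then
// responsible for notifying the user.
function recordRestartResult(state: HealthState, succeeded: boolean): HealthState {
  return succeeded
    ? { ...state, consecutiveFailures: 0 }
    : { ...state, status: "error" };
}
```

Keeping this logic free of I/O means the cron handler only has to load the state, call `GET /health`, and persist whatever `recordHealthCheck` returns.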

### Stop Flow (user-initiated or idle timeout)

1. Call sandbox internal API: `POST /sync` (trigger immediate R2 backup)
2. Send SIGTERM to sandbox → handler runs `gt down` + final R2 sync
3. Call Fly API: stop machine (volume persists)
4. Update `gastown_towns.status` → `stopped`
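
The ordering of these steps matters: the DB status should only flip to `stopped` once the sync, shutdown, and machine stop have all succeeded. A rough sketch of that sequencing follows, with every dependency injected. The `StopDeps` interface and its method names are hypothetical stand-ins for the sandbox internal API, the Fly client, and the DB layer, not names from the codebase.

```typescript
// Hypothetical dependency interface: each method stands in for one step of
// the Stop Flow above.
interface StopDeps {
  triggerSync(townId: string): Promise<void>; // POST /sync on the sandbox
  sendSigterm(townId: string): Promise<void>; // handler runs `gt down` + final R2 sync
  stopMachine(townId: string): Promise<void>; // Fly API stop; the volume persists
  setStatus(townId: string, status: "stopped"): Promise<void>;
}

// Run the four steps strictly in order. If any step rejects, the error
// propagates and the status update is never reached.
async function stopTown(townId: string, deps: StopDeps): Promise<void> {
  await deps.triggerSync(townId);
  await deps.sendSigterm(townId);
  await deps.stopMachine(townId);
  await deps.setStatus(townId, "stopped");
}
```

Because the same function serves both the user-initiated and idle-timeout paths, the two triggers cannot drift apart in behaviour.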

### Start Flow

1. Call Fly API: start machine
2. Machine boots → `startup.sh` runs → volume data is already there → `gt up`
3. Daemon heartbeat restarts all agent roles within 3 minutes
4. Wait for the sandbox `GET /health` check to pass, then update `gastown_towns.status` → `running`

Expected cold start: ~10–15s (no R2 restore is needed because the volume persists).
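
The acceptance criteria ask `startTown` to wait for a health check to pass before updating the DB status. One way to do that is a bounded polling loop, sketched below. `waitForHealthy`, `WaitOptions`, and the attempt and delay parameters are hypothetical; the issue does not specify a timeout or polling interval.

```typescript
interface WaitOptions {
  attempts: number; // hypothetical: maximum number of health polls
  delayMs: number; // hypothetical: pause between polls
  sleep?: (ms: number) => Promise<void>; // injectable so tests need not wait
}

const defaultSleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Poll `checkHealth` until it reports healthy or the attempts run out.
// Resolves to true on the first healthy response, false otherwise.
async function waitForHealthy(
  checkHealth: () => Promise<boolean>,
  { attempts, delayMs, sleep = defaultSleep }: WaitOptions,
): Promise<boolean> {
  for (let i = 0; i < attempts; i++) {
    // A thrown error (for example, connection refused while the machine is
    // still booting) is treated as "not healthy yet" rather than fatal.
    const ok = await checkHealth().catch(() => false);
    if (ok) return true;
    if (i < attempts - 1) await sleep(delayMs);
  }
  return false;
}
```

With a cold start of roughly 10–15s, something like 15 attempts at 2-second intervals would leave headroom, though the right values depend on observed boot times.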

### Idle Auto-Stop

- If no active polecats AND no user interaction for 30 minutes → auto-stop the town
- "User interaction" = any tRPC call or WebSocket connection to this town
- Track last interaction timestamp on `gastown_towns` (or in-memory)
- Resume automatically on next dashboard visit or API call (`startTown` if stopped)
- Idle timeout is configurable per town (default 30 min)
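
The idle rule reduces to a pure predicate over the polecat count and the last interaction timestamp. A minimal sketch, assuming hypothetical field names (`activePolecats`, `lastInteractionAt`, `idleTimeoutMs`) that may not match the real `gastown_towns` schema:

```typescript
const DEFAULT_IDLE_TIMEOUT_MS = 30 * 60 * 1000; // default of 30 minutes

// Hypothetical shape; the real column names may differ.
interface IdleInput {
  activePolecats: number;
  lastInteractionAt: Date; // last tRPC call or WebSocket connection
  idleTimeoutMs?: number; // per-town override
}

// True only when there are no active polecats AND the town has seen no
// user interaction for at least the configured timeout.
function shouldAutoStop(input: IdleInput, now: Date): boolean {
  if (input.activePolecats > 0) return false;
  const timeoutMs = input.idleTimeoutMs ?? DEFAULT_IDLE_TIMEOUT_MS;
  return now.getTime() - input.lastInteractionAt.getTime() >= timeoutMs;
}
```

Passing `now` in explicitly keeps the function deterministic, which makes the boundary cases (29 versus 30 minutes, a per-town override) easy to test.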

### Destroy Flow

1. Call sandbox internal API: `POST /sync` (final backup)
2. Soft-delete DB row: `gastown_towns.destroyed_at = NOW()`
3. Call Fly API: destroy machine + volume
4. Retain R2 backup for 30 days (user can request restore)
5. After 30 days: R2 lifecycle policy deletes backup
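
Since the row is soft-deleted with a `destroyed_at` timestamp, the restore window can be derived from that value alone. The sketch below shows the arithmetic; `backupExpiresAt` and `canRequestRestore` are hypothetical helpers, and the actual deletion is performed by the R2 lifecycle policy, whose exact timing this sketch does not model.

```typescript
const RETENTION_DAYS = 30;
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Earliest moment at which the backup is eligible for deletion.
function backupExpiresAt(destroyedAt: Date): Date {
  return new Date(destroyedAt.getTime() + RETENTION_DAYS * MS_PER_DAY);
}

// Whether a restore request should still be accepted for a destroyed town.
function canRequestRestore(destroyedAt: Date, now: Date): boolean {
  return now.getTime() < backupExpiresAt(destroyedAt).getTime();
}
```

For example, a town destroyed at `2024-01-01T00:00:00Z` would remain restorable until `2024-01-31T00:00:00Z`.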

### Disaster Recovery (Volume Loss)

If the Fly machine or volume is catastrophically lost:

1. Health check detects unreachable machine (3 consecutive failures)
2. Attempt machine restart via Fly API
3. If machine is gone: create new Fly machine + volume in same region
4. `startup.sh` detects empty volume → triggers R2 restore → `gt up`
5. Gastown's built-in recovery handles the rest:
   - Daemon cleans stale PIDs, acquires flock
   - Dolt health ticker restarts Dolt server
   - Heartbeat recreates deacon, mayor, witnesses, refineries
   - `checkPolecatSessionHealth()` restarts polecats with hooked work
   - `gt prime --hook` re-injects role context on every new session

Data loss window: up to 5 minutes (R2 sync interval).
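
The cloud-side part of this flow (steps 1 to 3) and the sandbox-side part (step 4) each come down to one decision. The sketch below illustrates both under stated assumptions: `chooseRecoveryAction` and `startupPlan` are hypothetical names, and how the cloud learns that the machine is gone (for instance, a not-found response from the Fly API) is left to the caller.

```typescript
type RecoveryAction = "none" | "restart_machine" | "create_machine_and_volume";

const UNREACHABLE_THRESHOLD = 3; // step 1: 3 consecutive failures

// Cloud side: decide what to do once health checks have been failing.
function chooseRecoveryAction(
  consecutiveFailures: number,
  machineExists: boolean,
): RecoveryAction {
  if (consecutiveFailures < UNREACHABLE_THRESHOLD) return "none";
  // Step 2 if the machine is still there, step 3 if it is gone.
  return machineExists ? "restart_machine" : "create_machine_and_volume";
}

type StartupStep = "r2_restore" | "gt_up";

// Sandbox side: what `startup.sh` is described as doing. An empty volume
// triggers an R2 restore before `gt up`; otherwise `gt up` runs directly.
function startupPlan(volumeEmpty: boolean): StartupStep[] {
  return volumeEmpty ? ["r2_restore", "gt_up"] : ["gt_up"];
}
```

The second function also covers the normal Start Flow, where the volume persists and no restore is needed.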

### Files

- `cloud/src/lib/gastown/health-monitor.ts` (new)
- Updates to `cloud/src/server/api/routers/gastown.ts` — lifecycle endpoints, idle tracking
- `cloud/src/lib/gastown/fly-client.ts` — stop/start/restart/destroy machine methods

## Acceptance Criteria

- [ ] Health monitor cron runs every 3 minutes for all active towns
- [ ] Health check failures are tracked per-town with consecutive failure counting
- [ ] 3 consecutive failures trigger automatic restart attempt
- [ ] `stopTown` triggers R2 sync, sends SIGTERM, stops Fly machine, updates DB status
- [ ] `startTown` starts Fly machine, waits for health check to pass, updates DB status
- [ ] Idle auto-stop triggers after 30 minutes of no polecats and no user interaction
- [ ] Auto-resume works when user visits dashboard for a stopped town
- [ ] `destroyTown` soft-deletes, destroys Fly resources, retains R2 backup
- [ ] Disaster recovery creates new machine and restores from R2 on volume loss
- [ ] Idle timeout is configurable per town

## Dependencies

- PR 2 (Provisioning API) — DB schema and Fly client
- PR 3 (Sandbox Internal API) — `/health` and `/sync` endpoints
- PR 4 (R2 Backup System) — R2 sync and restore scripts
