Health Monitoring & Lifecycle Management #286

@jrf0110

Part of #271 — Gastown Cloud Proposal A (Sandbox-per-Town)

Goal

Implement cloud-side health monitoring, stop/start lifecycle, idle auto-stop, and disaster recovery for gastown sandboxes.

Requirements

Health Monitoring

Cloud-side cron job (every 3 minutes, matching gastown daemon heartbeat).

For each active town:

  1. Call sandbox internal API GET /health
  2. Health response includes:
    • gt daemon running (PID alive)
    • Dolt server running
    • Number of active tmux sessions
    • Last heartbeat timestamp from daemon
    • Disk usage (volume capacity)
    • Memory/CPU usage
  3. Update gastown_towns.last_heartbeat_at
  4. Track consecutive health check failures per town
  5. If 3 consecutive failures → mark town as unhealthy → attempt machine restart
  6. If restart fails → mark as error → notify user
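
The failure-counting rule in steps 4 to 6 can be sketched as a pair of pure functions. This is a minimal sketch, not the intended implementation: the type names, `applyHealthResult`, and `applyRestartResult` are hypothetical, and what counts as a failed check (timeout, non-200 response, daemon down) is left to the caller.

```typescript
// Hypothetical types; the issue only fixes the threshold (3) and the outcomes.
type TownHealthStatus = "healthy" | "unhealthy" | "error";
type HealthAction = "none" | "restart" | "notify_user";

interface HealthState {
  consecutiveFailures: number;
  status: TownHealthStatus;
}

interface HealthStep {
  state: HealthState;
  action: HealthAction;
}

const FAILURE_THRESHOLD = 3;

// Fold one health check result into the per-town state.
function applyHealthResult(state: HealthState, checkPassed: boolean): HealthStep {
  if (checkPassed) {
    // Any passing check resets the streak.
    return { state: { consecutiveFailures: 0, status: "healthy" }, action: "none" };
  }
  const consecutiveFailures = state.consecutiveFailures + 1;
  if (consecutiveFailures >= FAILURE_THRESHOLD && state.status !== "error") {
    return { state: { consecutiveFailures, status: "unhealthy" }, action: "restart" };
  }
  return { state: { ...state, consecutiveFailures }, action: "none" };
}

// Fold the outcome of the restart attempt into the state.
function applyRestartResult(state: HealthState, restartSucceeded: boolean): HealthStep {
  if (restartSucceeded) {
    // Reset the streak; status returns to "healthy" on the next passing check.
    return { state: { ...state, consecutiveFailures: 0 }, action: "none" };
  }
  return { state: { ...state, status: "error" }, action: "notify_user" };
}
```

Keeping the transition logic free of I/O means the cron handler only has to load the state, call the sandbox, and persist the result, and the thresholds can be unit-tested without Fly or the database.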

Stop Flow (user-initiated or idle timeout)

  1. Call sandbox internal API: POST /sync (trigger immediate R2 backup)
  2. Send SIGTERM to sandbox → handler runs gt down + final R2 sync
  3. Call Fly API: stop machine (volume persists)
  4. Update gastown_towns.status → stopped
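
The ordering of the stop flow can be sketched as follows. The `StopDeps` interface and its method names are hypothetical stand-ins for the sandbox client, Fly client, and DB layer, none of which are defined in this issue.

```typescript
// Hypothetical dependency surface; the real clients may look different.
interface StopDeps {
  triggerSync(townId: string): Promise<void>;  // POST /sync on the sandbox
  sendSigterm(townId: string): Promise<void>;  // handler runs gt down + final R2 sync
  stopMachine(townId: string): Promise<void>;  // Fly API stop; volume persists
  setStatus(townId: string, status: "stopped"): Promise<void>;
}

// Runs the four steps strictly in order. If any step rejects, the later
// steps are skipped and the DB status is left unchanged.
async function stopTown(townId: string, deps: StopDeps): Promise<void> {
  await deps.triggerSync(townId);
  await deps.sendSigterm(townId);
  await deps.stopMachine(townId);
  await deps.setStatus(townId, "stopped");
}
```

Updating the status last means a town is only reported as stopped once the backup and the machine stop have both succeeded.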

Start Flow

  1. Call Fly API: start machine
  2. Machine boots → startup.sh runs → volume data is already there → gt up
  3. Daemon heartbeat restarts all agent roles within 3 minutes
  4. Once GET /health passes, update gastown_towns.status → running
  5. Total cold start: ~10–15s (no R2 restore needed since volume persists)
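
The acceptance criteria ask startTown to wait for a health check to pass before updating the DB status. A minimal polling sketch, assuming a hypothetical `check` callback that wraps GET /health and an injectable `sleep` so the loop can be tested without real delays:

```typescript
interface WaitOptions {
  attempts: number; // maximum number of health checks
  delayMs: number;  // pause between checks
}

// Polls `check` until it returns true or the attempts run out.
// A rejected check (e.g. connection refused while booting) counts as "not yet healthy".
async function waitForHealthy(
  check: () => Promise<boolean>,
  opts: WaitOptions,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<boolean> {
  for (let i = 0; i < opts.attempts; i++) {
    const ok = await check().catch(() => false);
    if (ok) return true;
    if (i < opts.attempts - 1) await sleep(opts.delayMs);
  }
  return false;
}
```

With the stated cold start of roughly 10 to 15 seconds, something like `{ attempts: 15, delayMs: 2000 }` would leave some headroom; those numbers are illustrative and not part of the spec.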

Idle Auto-Stop

  • If no active polecats AND no user interaction for 30 minutes → auto-stop the town
  • "User interaction" = any tRPC call or WebSocket connection to this town
  • Track last interaction timestamp on gastown_towns (or in-memory)
  • Resume automatically on next dashboard visit or API call (startTown if stopped)
  • Idle timeout is configurable per town (default 30 min)
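
The idle rule above reduces to a small predicate. This is a sketch under assumptions: `IdleInput` and its field names are hypothetical, and the last interaction timestamp is taken as already tracked somewhere (DB column or in-memory).

```typescript
// Hypothetical input shape; field names are not taken from the schema.
interface IdleInput {
  activePolecats: number;
  lastInteractionAt: Date;     // last tRPC call or WebSocket connection
  idleTimeoutMinutes?: number; // per-town override
}

const DEFAULT_IDLE_TIMEOUT_MINUTES = 30;

// True when the town has no active polecats AND has seen no user
// interaction for at least the configured timeout.
function shouldAutoStop(input: IdleInput, now: Date): boolean {
  if (input.activePolecats > 0) return false;
  const timeoutMinutes = input.idleTimeoutMinutes ?? DEFAULT_IDLE_TIMEOUT_MINUTES;
  const idleMs = now.getTime() - input.lastInteractionAt.getTime();
  return idleMs >= timeoutMinutes * 60 * 1000;
}
```

Passing `now` in explicitly keeps the check deterministic, so the same function can run inside the health cron and in tests.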

Destroy Flow

  1. Call sandbox internal API: POST /sync (final backup)
  2. Soft-delete DB row: gastown_towns.destroyed_at = NOW()
  3. Call Fly API: destroy machine + volume
  4. Retain R2 backup for 30 days (user can request restore)
  5. After 30 days: R2 lifecycle policy deletes backup
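
The actual deletion is done by the R2 lifecycle policy, but the cloud side still needs to know whether a restore request is within the retention window. A minimal sketch, assuming the window is measured from gastown_towns.destroyed_at; the function names are hypothetical.

```typescript
const BACKUP_RETENTION_DAYS = 30;
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// When the R2 backup for a destroyed town is expected to be deleted.
function backupExpiresAt(destroyedAt: Date): Date {
  return new Date(destroyedAt.getTime() + BACKUP_RETENTION_DAYS * MS_PER_DAY);
}

// Whether a user restore request can still be honored.
function canRestore(destroyedAt: Date, now: Date): boolean {
  return now.getTime() < backupExpiresAt(destroyedAt).getTime();
}
```

This only gates the restore request; it does not replace the lifecycle rule, and the two should be configured with the same retention period.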

Disaster Recovery (Volume Loss)

If the Fly machine or volume is catastrophically lost:

  1. Health check detects unreachable machine (3 consecutive failures)
  2. Attempt machine restart via Fly API
  3. If machine is gone: create new Fly machine + volume in same region
  4. startup.sh detects empty volume → triggers R2 restore → gt up
  5. Gastown's built-in recovery handles the rest:
    • Daemon cleans stale PIDs, acquires flock
    • Dolt health ticker restarts Dolt server
    • Heartbeat recreates deacon, mayor, witnesses, refineries
    • checkPolecatSessionHealth() restarts polecats with hooked work
    • gt prime --hook re-injects role context on every new session

Data loss window: up to 5 minutes (R2 sync interval).
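
The cloud-side decision in steps 1 to 3 can be sketched as a pure planning function. `TownMachine`, `RecoveryPlan`, and `planRecovery` are hypothetical names; the R2 restore in step 4 is triggered by startup.sh on the sandbox, so the plan only records that a restore is expected.

```typescript
// Hypothetical view of what the Fly API reports for a town's machine.
interface TownMachine {
  exists: boolean; // false if the machine is gone
  region: string;  // region the town was provisioned in
}

type RecoveryPlan =
  | { action: "none" }
  | { action: "restart" }
  | { action: "recreate"; region: string; restoreFromR2: true };

const UNREACHABLE_THRESHOLD = 3;

function planRecovery(consecutiveFailures: number, machine: TownMachine): RecoveryPlan {
  if (consecutiveFailures < UNREACHABLE_THRESHOLD) return { action: "none" };
  if (machine.exists) return { action: "restart" };
  // Machine is gone: new machine + volume in the same region; the empty
  // volume makes startup.sh run the R2 restore before gt up.
  return { action: "recreate", region: machine.region, restoreFromR2: true };
}
```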

Files

  • cloud/src/lib/gastown/health-monitor.ts (new)
  • Updates to cloud/src/server/api/routers/gastown.ts — lifecycle endpoints, idle tracking
  • cloud/src/lib/gastown/fly-client.ts — stop/start/restart/destroy machine methods

Acceptance Criteria

  • Health monitor cron runs every 3 minutes for all active towns
  • Health check failures are tracked per-town with consecutive failure counting
  • 3 consecutive failures trigger automatic restart attempt
  • stopTown triggers R2 sync, sends SIGTERM, stops Fly machine, updates DB status
  • startTown starts Fly machine, waits for health check to pass, updates DB status
  • Idle auto-stop triggers after 30 minutes of no polecats and no user interaction
  • Auto-resume works when user visits dashboard for a stopped town
  • destroyTown soft-deletes, destroys Fly resources, retains R2 backup
  • Disaster recovery creates new machine and restores from R2 on volume loss
  • Idle timeout is configurable per town

Dependencies

  • PR 2 (Provisioning API) — DB schema and Fly client
  • PR 3 (Sandbox Internal API) — /health and /sync endpoints
  • PR 4 (R2 Backup System) — R2 sync and restore scripts
