Gastown Cloud: Sandbox-per-Town hosted Gastown (Proposal A) #271

@jrf0110

Description

Disclaimer: This is experimental. Proposal A is the fast-path MVP approach: validate the product quickly with zero gastown code changes.

Overview

Hosted Gastown via the "lift & shift" approach: run the unmodified gt binary inside a Fly.io sandbox (one per user/town), configured to use Kilo CLI as the agent runtime with KILO_API_URL pointed at the Kilo gateway. All LLM calls route through the gateway for billing, model routing, and observability.

Key design decisions:

  • Zero gastown code changes — the gt binary runs unmodified with default_agent: "opencode" and KILO_API_URL set
  • One Fly.io machine per town with a persistent volume (survives restarts/stops). R2 is disaster-recovery backup only, not critical-path persistence
  • All gastown features work: tmux sessions, Dolt databases, git worktrees, hooks, mail, handoffs, convoys, watchdog chain
  • Sandbox contains: gt, kilo (CLI), dolt, tmux, git, node — pre-configured in a Docker image
  • A lightweight internal API inside the sandbox handles control-plane communication (add rig, query Dolt, stream tmux, health checks)
  • Web terminal UI streams tmux pane output via WebSocket (xterm.js)
  • Dashboard queries the sandbox directly (Dolt is the source of truth)
  • Gateway integration is trivial: KILO_API_URL + a per-user JWT token (see the env sketch after this list). The existing billing pipeline handles everything
  • Gastown's battle-tested crash recovery handles machine restarts identically to a local reboot: daemon heartbeat recreates all roles, restarts polecats with hooked work, re-injects context via gt prime --hook
  • Follows KiloClaw's provisioning patterns: DB row per sandbox, deterministic sandbox ID, soft-delete lifecycle, Fly.io machine management, NaCl-encrypted secrets
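
To make the zero-code-change claim concrete, a minimal sketch of the wiring the provisioner hands each sandbox. Only KILO_API_URL, KILO_JWT, and default_agent: "opencode" come from this proposal; the remaining names and the TypeScript shape are illustrative assumptions.

```typescript
// Illustrative sketch only: gt runs unmodified, so the whole integration is
// environment variables plus the town config baked into the Docker image.

// Env block the provisioner passes to the Fly machine (encrypted at rest):
const sandboxEnv = {
  KILO_API_URL: "https://gateway.kilo.example",  // all LLM calls route through the gateway
  KILO_JWT: "<per-user, per-town token>",        // 24h JWT, refreshed every 12h in-sandbox
  GASTOWN_INTERNAL_API_KEY: "<shared secret>",   // assumed name; auths the control-plane API
};

// town-config.json equivalent: point gt at the OpenCode/Kilo CLI agent preset.
const townConfig = {
  default_agent: "opencode",
};

export { sandboxEnv, townConfig };
```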

Architecture:

Browser ── Dashboard UI ──> Cloud App ──> Fly.io Sandbox
                                           ├── gt binary (unmodified)
                                           ├── kilo-cli (KILO_API_URL → gateway)
                                           ├── tmux (session management)
                                           ├── dolt (per-rig SQL database)
                                           ├── git worktrees (per-polecat)
                                           ├── Internal API (control plane)
                                           ├── Terminal Proxy (WebSocket)
                                           └── persistent volume (50GB)

What already works (verified against codebase):

  • OpenCode agent preset: builtinPresets["opencode"] at gastown/internal/config/agents.go:266-292
  • gastown.js plugin hooks: session.created, session.compacted, session.deleted at gastown/internal/opencode/plugin/gastown.js
  • Daemon heartbeat (3-min cycle) with crash recovery at gastown/internal/daemon/daemon.go
  • Dolt server management with 30s health ticker at gastown/internal/doltserver/
  • Session lifecycle at gastown/internal/session/lifecycle.go
  • KiloClaw provisioning patterns at cloud/src/lib/kiloclaw/
  • Gateway API at cloud/src/app/api/gateway/ (100% reuse)
  • R2 client at cloud/src/lib/r2/client.ts

Workflow

All work for this project lives on the 271-gt-prop-a feature branch. Nothing merges to main.

For each sub-issue:

  1. Branch: Create a branch off 271-gt-prop-a using the naming scheme {issue_number}-{short-description} (e.g., 300-sandbox-image).
  2. Implement: Work through the issue's acceptance criteria. Write tests alongside the implementation.
  3. Validate: Run the test suite and type checks (pnpm typecheck) to confirm everything passes.
  4. Self-review: Spawn a sub-agent to review the diff against the sub-issue's requirements and acceptance criteria. Fix any issues raised.
  5. Commit: Stage all changes and create a commit with a descriptive message referencing the sub-issue number.
  6. Push & PR: Push the branch to the remote and create a pull request targeting 271-gt-prop-a (not main). Link the sub-issue in the PR body with Closes #NNN. Do not skip this step — every sub-issue must result in a pushed branch and an open PR.

Branch structure:

main
 └── 271-gt-prop-a                        (feature branch for entire project)
      ├── NNN-sandbox-image
      ├── NNN-db-schema-provisioning
      ├── NNN-sandbox-internal-api
      └── ...

Phase 1: Sandbox Image & Provisioning

Build the sandbox image and cloud-side provisioning infrastructure.

  • PR 1: Sandbox Docker Image (#281) — Fly.io machine image with gt, kilo-cli, dolt, tmux, git, node, jq pre-installed. Ubuntu 22.04 base. Pre-configured default_agent: "opencode". Startup script: R2 restore (if needed) → gt up → R2 sync daemon → terminal proxy. SIGTERM handler for graceful shutdown.

    • Files: cloud/infra/gastown-sandbox/Dockerfile, startup.sh, town-config.json
  • PR 2: Database Schema & Provisioning API (#282) — gastown_towns table (mirrors KiloClaw's kiloclaw_instances pattern: user_id, sandbox_id, fly_machine_id, fly_volume_id, status, last_heartbeat_at, destroyed_at). gastown_rigs table. Partial unique index enforcing one active town per user+name. tRPC router with createTown, destroyTown, getTownStatus, listTowns, addRig, removeRig, stopTown, startTown. Provisioning flow (sketched after this list): generate deterministic sandbox ID → insert DB row → mint gateway JWT → create Fly machine with encrypted env vars → update DB with Fly IDs. Rollback on failure via soft-delete.

    • Files: cloud/src/db/migrations/NNNN_gastown.sql, cloud/src/server/api/routers/gastown.ts
  • PR 3: Sandbox Internal API (#283) — Lightweight HTTP server inside the sandbox for control-plane communication. Endpoints: GET /health (daemon + Dolt status), GET /rigs, POST /rigs (gt rig add), GET /rigs/:name/beads (Dolt query), GET /rigs/:name/convoys, GET /rigs/:name/polecats, POST /message (tmux send-keys), GET /sessions, WebSocket /sessions/:name/stream, POST /sync (trigger R2 backup). Auth via x-internal-api-key (a minimal server sketch follows this list).

    • Files: cloud/infra/gastown-sandbox/internal-api/
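
To make the PR 2 flow concrete, a sketch of createTown under the assumption that the DB and Fly helpers look roughly like KiloClaw's. Every helper below (insertTownRow, updateTownRow, softDeleteTown, createFlyMachine) and the exact ID scheme are hypothetical placeholders, not the actual implementation.

```typescript
import { createHash } from "node:crypto";

// Hypothetical placeholders standing in for the real KiloClaw-style helpers:
declare function insertTownRow(row: { userId: string; sandboxId: string; status: string }): Promise<{ id: string }>;
declare function updateTownRow(townId: string, patch: Record<string, unknown>): Promise<void>;
declare function softDeleteTown(townId: string): Promise<void>;
declare function mintGastownToken(userId: string, townId: string): Promise<string>;
declare function createFlyMachine(sandboxId: string, env: Record<string, string>): Promise<{ machineId: string; volumeId: string }>;

// Deterministic sandbox ID (exact scheme is an assumption).
function sandboxId(userId: string, townName: string): string {
  return "gt-" + createHash("sha256").update(`${userId}:${townName}`).digest("hex").slice(0, 12);
}

export async function createTown(userId: string, townName: string): Promise<string> {
  const sid = sandboxId(userId, townName);

  // 1. Insert the gastown_towns row first; the partial unique index rejects a
  //    second active town with the same user + name.
  const town = await insertTownRow({ userId, sandboxId: sid, status: "provisioning" });

  try {
    // 2. Mint the 24-hour gateway JWT for this user/town.
    const jwt = await mintGastownToken(userId, town.id);

    // 3. Create the Fly machine + volume with encrypted env vars.
    const machine = await createFlyMachine(sid, {
      KILO_API_URL: process.env.KILO_GATEWAY_URL ?? "",
      KILO_JWT: jwt,
    });

    // 4. Record the Fly identifiers and mark the town running.
    await updateTownRow(town.id, { flyMachineId: machine.machineId, flyVolumeId: machine.volumeId, status: "running" });
    return town.id;
  } catch (err) {
    // Rollback on failure via soft-delete, as described in the bullet above.
    await softDeleteTown(town.id);
    throw err;
  }
}
```

And a sketch of the PR 3 control-plane server using Node's built-in http module. The x-internal-api-key check and the endpoint list come from the bullet above; the port, env var name, and the specific health probes are assumptions.

```typescript
import { createServer } from "node:http";
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const exec = promisify(execFile);

// Shared secret injected at provisioning time (env var name is an assumption).
const API_KEY = process.env.GASTOWN_INTERNAL_API_KEY ?? "";

const server = createServer(async (req, res) => {
  // Every control-plane request must carry the shared secret.
  if (req.headers["x-internal-api-key"] !== API_KEY) {
    res.writeHead(401).end();
    return;
  }

  if (req.method === "GET" && req.url === "/health") {
    // Placeholder probes: is a gt process alive, and does Dolt answer a trivial query?
    const daemonOk = await exec("pgrep", ["-f", "gt"]).then(() => true, () => false);
    const doltOk = await exec("dolt", ["sql", "-q", "SELECT 1"]).then(() => true, () => false);
    res.writeHead(daemonOk && doltOk ? 200 : 503, { "content-type": "application/json" });
    res.end(JSON.stringify({ daemon: daemonOk, dolt: doltOk }));
    return;
  }

  // Remaining routes (/rigs, /message, /sync, session streaming) are omitted from this sketch.
  res.writeHead(404).end();
});

server.listen(8080); // port is an assumption
```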

Phase 2: R2 Persistence & Gateway Auth

R2 disaster recovery and gateway token integration.

  • PR 4: R2 Backup System (#284) — R2 sync daemon (5-min timer): dolt backup per rig → R2, git bundle per bare repo → R2, config + runtime tar → R2. Key structure: gastown/{town_id}/snapshots/{timestamp}/. Atomic swap via staging prefix + latest.json pointer (sketched after this list). Restore script: check volume → if empty, fetch latest snapshot → restore Dolt + git + config → verify integrity. SIGTERM handler flushes to R2 before shutdown. Heartbeat reporting to cloud API.

    • Files: cloud/infra/gastown-sandbox/r2-sync-daemon.sh, r2-restore.sh
  • PR 5: Gateway Token Minting & Refresh (#285) — mintGastownToken(userId, townId): 24-hour JWT with { sub, type: "gastown_sandbox", town_id, organization_id } (minting sketched after this list). No gateway changes needed — existing getUserFromAuth() extracts user context. Token passed as KILO_JWT env var. 12-hour refresh cron in sandbox calls POST /api/gastown/refresh-token. Model configuration flow: dashboard → tRPC → sandbox internal API PATCH /config → writes gt config files → new sessions pick up changes.

    • Files: cloud/src/lib/gastown/auth.ts, updates to cloud/src/server/api/routers/gastown.ts
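
The part of PR 4 that is easiest to get subtly wrong is the pointer swap, so here is a short sketch of it against R2's S3-compatible API. The gastown/{town_id}/snapshots/{timestamp}/ layout and latest.json name come from the bullet above; the bucket/endpoint env vars and the TypeScript form (the real daemon is a shell script) are assumptions.

```typescript
import { S3Client, PutObjectCommand } from "@aws-sdk/client-s3";

// R2 is S3-compatible; endpoint and credentials come from sandbox env vars (names assumed).
const r2 = new S3Client({ region: "auto", endpoint: process.env.R2_ENDPOINT });

// Upload the snapshot under a fresh timestamped prefix (this acts as the staging area),
// then flip the latest.json pointer with a single small write. The restore script only
// ever follows latest.json, so a half-uploaded snapshot is never observed.
export async function publishSnapshot(townId: string, files: Map<string, Buffer>): Promise<void> {
  const bucket = process.env.R2_BUCKET ?? ""; // assumed env var
  const timestamp = new Date().toISOString().replace(/[:.]/g, "-");
  const prefix = `gastown/${townId}/snapshots/${timestamp}/`;

  for (const [name, body] of files) {
    await r2.send(new PutObjectCommand({ Bucket: bucket, Key: prefix + name, Body: body }));
  }

  await r2.send(new PutObjectCommand({
    Bucket: bucket,
    Key: `gastown/${townId}/latest.json`,
    Body: JSON.stringify({ prefix, createdAt: timestamp }),
    ContentType: "application/json",
  }));
}
```

And a minimal sketch of mintGastownToken from PR 5, using jose for signing. The claim set matches the bullet above; the HS256 choice, the secret's env var name, and the extra organizationId parameter (so the organization_id claim has a source) are assumptions.

```typescript
import { SignJWT } from "jose";

// Sketch only: key management and algorithm choice are assumptions; claims match the issue.
export async function mintGastownToken(
  userId: string,
  townId: string,
  organizationId: string,
): Promise<string> {
  const secret = new TextEncoder().encode(process.env.GATEWAY_JWT_SECRET ?? "");
  return new SignJWT({ type: "gastown_sandbox", town_id: townId, organization_id: organizationId })
    .setProtectedHeader({ alg: "HS256" })
    .setSubject(userId)        // sub: the user, so the gateway's getUserFromAuth() resolves context
    .setIssuedAt()
    .setExpirationTime("24h")  // 24-hour lifetime; the sandbox cron refreshes it every 12 hours
    .sign(secret);
}
```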

Phase 3: Lifecycle Management

Health monitoring, stop/start, idle timeout, disaster recovery.

  • PR 6: Health Monitoring & Lifecycle Management (#286) — Cloud-side health monitor: 3-min cron checks the sandbox /health endpoint. Track consecutive failures → 3 failures = unhealthy → attempt restart (failure-counting sketched below). Stop flow: trigger R2 sync → SIGTERM → stop Fly machine (volume persists). Start flow: start Fly machine → volume data intact → gt up → daemon restarts all roles in ~10–15s. Idle timeout: auto-stop after 30 min with no active polecats or user interaction. Resume on next dashboard visit. Destroy flow: final R2 sync → soft-delete DB → destroy Fly machine + volume → retain R2 backup 30 days. Disaster recovery: health check detects unreachable → create new machine + volume → R2 restore → gt up.
    • Files: cloud/src/lib/gastown/health-monitor.ts, updates to gastown tRPC router
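
A sketch of the consecutive-failure logic the PR 6 cron would run per town. The 3-failure threshold and the /health endpoint come from the bullet above; the storage helpers, timeout, and restart call are placeholders.

```typescript
// Sketch of the per-town check the 3-minute cron would run.
// The helpers below are hypothetical placeholders for the real DB/Fly operations.
declare function listActiveTowns(): Promise<Array<{ id: string; internalApiUrl: string; apiKey: string; failures: number }>>;
declare function recordFailures(townId: string, failures: number): Promise<void>;
declare function restartMachine(townId: string): Promise<void>;

const MAX_FAILURES = 3; // 3 consecutive misses => unhealthy => attempt restart

export async function checkTowns(): Promise<void> {
  for (const town of await listActiveTowns()) {
    let healthy = false;
    try {
      const res = await fetch(`${town.internalApiUrl}/health`, {
        headers: { "x-internal-api-key": town.apiKey },
        signal: AbortSignal.timeout(10_000), // don't let one slow sandbox stall the sweep
      });
      healthy = res.ok;
    } catch {
      healthy = false;
    }

    const failures = healthy ? 0 : town.failures + 1;
    await recordFailures(town.id, failures);

    if (failures >= MAX_FAILURES) {
      await restartMachine(town.id); // stop/start the Fly machine; the volume persists
    }
  }
}
```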

Phase 4: Web Terminal UI

Terminal streaming for observing and interacting with agents.

  • PR 7: Terminal Proxy (#287) — Custom Go binary inside the sandbox: WebSocket server that streams tmux capture-pane -p output (200ms interval) and accepts tmux send-keys input. Supports multiple concurrent viewers (broadcast). Auth via internal API key + session name. Handles terminal resize.

    • Files: cloud/infra/gastown-sandbox/terminal-proxy/
  • PR 8: WebSocket Proxy & Terminal Component (#288) — Cloud app authenticates and proxies WebSocket connections to the sandbox terminal proxy. Stream ticket minting (60s JWT, same pattern as cloud-agent signStreamTicket). GastownTerminal React component using xterm.js with the fit addon (sketched after this list). Read-only mode for polecats/witnesses, read-write for Mayor. Session picker sidebar: lists tmux sessions, color-coded by role (Mayor gold, Witness blue, Refinery green, Polecat gray), click to switch view.

    • Files: cloud/src/app/api/gastown/terminal/route.ts, cloud/src/components/gastown/GastownTerminal.tsx, updates to gastown tRPC router
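
A sketch of the PR 8 GastownTerminal component. xterm.js, the fit addon, and the read-only split come from the bullet above; the WebSocket message format, the ticket-bearing URL, and the package paths are assumptions about the final wiring.

```tsx
"use client";

import { useEffect, useRef } from "react";
import { Terminal } from "@xterm/xterm";
import { FitAddon } from "@xterm/addon-fit";
import "@xterm/xterm/css/xterm.css";

interface Props {
  wsUrl: string;     // proxied WebSocket URL carrying the 60s stream ticket (assumption)
  readOnly: boolean; // true for polecats/witnesses, false for the Mayor
}

export function GastownTerminal({ wsUrl, readOnly }: Props) {
  const containerRef = useRef<HTMLDivElement>(null);

  useEffect(() => {
    if (!containerRef.current) return;

    const term = new Terminal({ convertEol: true, disableStdin: readOnly });
    const fit = new FitAddon();
    term.loadAddon(fit);
    term.open(containerRef.current);
    fit.fit();

    const ws = new WebSocket(wsUrl);
    // The proxy streams captured pane text; binary frames are ignored in this sketch.
    ws.onmessage = (ev) => term.write(typeof ev.data === "string" ? ev.data : "");

    // Forward keystrokes only in read-write mode; the proxy turns them into tmux send-keys.
    const sub = term.onData((data) => {
      if (!readOnly && ws.readyState === WebSocket.OPEN) ws.send(data);
    });

    const onResize = () => fit.fit();
    window.addEventListener("resize", onResize);

    return () => {
      window.removeEventListener("resize", onResize);
      sub.dispose();
      ws.close();
      term.dispose();
    };
  }, [wsUrl, readOnly]);

  return <div ref={containerRef} style={{ height: "100%", width: "100%" }} />;
}
```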

Phase 5: Dashboard & Status Views

Web dashboard for managing towns, rigs, convoys, and agents.

  • PR 9: Town Overview & Rig Detail Dashboard Pages (#289) — /gastown page: town cards (name, status, agent count, heartbeat, resource usage), create town wizard, quick actions (start/stop/destroy). /gastown/[townId] page: agent grid (card per tmux session with role, name, status, current bead, click to terminal), rig list (repo URL, branch, polecats, pending beads, expand for bead list), convoy tracker (progress bars, color-coded). All data comes from the sandbox internal API. Polling: 5s for agents, 30s for beads/convoys (sketched after this list).

    • Files: cloud/src/app/gastown/ (new pages), cloud/src/components/gastown/
  • PR 10: Configuration Page (#290) — /gastown/[townId]/settings: default model selector (gateway-available models), per-role model overrides, max concurrent polecats slider, rig management (add/remove repos), auto-stop idle timeout. tRPC mutations → sandbox internal API PATCH /config → writes gt config files.

    • Files: updates to gastown pages and tRPC router
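
The PR 9 polling split maps directly onto React Query's refetchInterval; a small sketch assuming tRPC procedures named getAgents and getBeads exist (those names and the import path are placeholders).

```typescript
"use client";

import { api } from "~/trpc/react"; // assumed location of the app's tRPC React client

export function useTownData(townId: string) {
  // Agent cards refresh every 5s; bead/convoy data every 30s, matching the issue's targets.
  const agents = api.gastown.getAgents.useQuery({ townId }, { refetchInterval: 5_000 });
  const beads = api.gastown.getBeads.useQuery({ townId }, { refetchInterval: 30_000 });
  return { agents, beads };
}
```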

Phase 6: Hardening

Edge cases, security, resource tuning.

  • PR 11: Edge Case Handling & Security Hardening (#291) — Volume loss recovery (R2 restore + gastown crash recovery). Dolt corruption detection (dolt verify-constraints). Large repo handling (shallow clones, incremental bundles). Rapid restart circuit breaker (3 failures in 10 min → error state; sketched below). JWT expiry handling (401 → stop LLM calls, manual refresh). Concurrent API call queuing. Disk space monitoring (alert at 80%). Security: Fly private network, per-user sandbox isolation, scoped JWT, encrypted internal API key, R2 credential scoping.
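
The rapid-restart circuit breaker from PR 11 is small enough to sketch directly; the thresholds come from the bullet above, while keeping the timestamps in memory (rather than in the DB) is a simplification.

```typescript
// Sketch: trip the breaker when 3 restarts happen inside a 10-minute window.
const WINDOW_MS = 10 * 60 * 1000;
const MAX_RESTARTS = 3;

export class RestartBreaker {
  private restarts: number[] = []; // timestamps of recent restart attempts

  /** Record a restart attempt; returns true if the town should move to an error state. */
  recordRestart(now: number = Date.now()): boolean {
    this.restarts = this.restarts.filter((t) => now - t < WINDOW_MS);
    this.restarts.push(now);
    return this.restarts.length >= MAX_RESTARTS;
  }
}
```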

Resource Sizing & Cost

| Resource | MVP Default |
| --- | --- |
| CPU | 4 vCPUs |
| Memory | 8 GB |
| Disk | 50 GB persistent volume |
| Max polecats | 8 (configurable) |

| Cost | Per Town / Month |
| --- | --- |
| Always-on | ~$68 |
| With idle auto-stop (4 hrs/day avg) | ~$18 |
| R2 storage + operations | ~$0.06 |
| LLM costs | Pass-through via gateway |

Risks

| Risk | Likelihood | Impact | Mitigation |
| --- | --- | --- | --- |
| tmux-over-WebSocket UX | High | Medium | xterm.js polish. Long-term: Proposal D web-native UI |
| Fly provisioning latency | Medium | Low | Pre-warm pool. "Provisioning" state in UI |
| Large repos slow R2 sync | Medium | Medium | Shallow clones, incremental bundles, external remote fallback |
| Dolt backup slow (>500 MB) | Low | Medium | Incremental mode, increased interval, Dolt remotes |
| Fly pricing changes | Low | High | Provider abstraction. CF Containers as future alternative |
