Disclaimer: This is experimental. Proposal A is the fast-path MVP approach: validate the product quickly with zero gastown code changes.
Overview
Hosted Gastown via the "lift & shift" approach: run the unmodified `gt` binary inside a Fly.io sandbox (one per user/town), configured to use Kilo CLI as the agent runtime with `KILO_API_URL` pointed at the Kilo gateway. All LLM calls route through the gateway for billing, model routing, and observability.
Key design decisions:
- Zero gastown code changes — the `gt` binary runs unmodified with `default_agent: "opencode"` and `KILO_API_URL` set
- One Fly.io machine per town with a persistent volume (survives restarts/stops). R2 is disaster-recovery backup only, not critical-path persistence
- All gastown features work: tmux sessions, Dolt databases, git worktrees, hooks, mail, handoffs, convoys, watchdog chain
- Sandbox contains `gt`, `kilo` (CLI), `dolt`, `tmux`, `git`, `node` — pre-configured in a Docker image
- A lightweight internal API inside the sandbox handles control-plane communication (add rig, query Dolt, stream tmux, health checks)
- Web terminal UI streams tmux pane output via WebSocket (xterm.js)
- Dashboard queries the sandbox directly (Dolt is the source of truth)
- Gateway integration is trivial: `KILO_API_URL` + per-user JWT token. The existing billing pipeline handles everything (a minimal sketch of the sandbox environment follows this list)
- Gastown's battle-tested crash recovery handles machine restarts identically to a local reboot: daemon heartbeat recreates all roles, restarts polecats with hooked work, re-injects context via `gt prime --hook`
- Follows KiloClaw's provisioning patterns: DB row per sandbox, deterministic sandbox ID, soft-delete lifecycle, Fly.io machine management, NaCl-encrypted secrets
Architecture:
```
Browser ── Dashboard UI ──> Cloud App ──> Fly.io Sandbox
                                          ├── gt binary (unmodified)
                                          ├── kilo-cli (KILO_API_URL → gateway)
                                          ├── tmux (session management)
                                          ├── dolt (per-rig SQL database)
                                          ├── git worktrees (per-polecat)
                                          ├── Internal API (control plane)
                                          ├── Terminal Proxy (WebSocket)
                                          └── persistent volume (50GB)
```
What already works (verified against codebase):
- OpenCode agent preset: `builtinPresets["opencode"]` at `gastown/internal/config/agents.go:266-292`
- gastown.js plugin hooks: `session.created`, `session.compacted`, `session.deleted` at `gastown/internal/opencode/plugin/gastown.js`
- Daemon heartbeat (3-min cycle) with crash recovery at `gastown/internal/daemon/daemon.go`
- Dolt server management with 30s health ticker at `gastown/internal/doltserver/`
- Session lifecycle at `gastown/internal/session/lifecycle.go`
- KiloClaw provisioning patterns at `cloud/src/lib/kiloclaw/`
- Gateway API at `cloud/src/app/api/gateway/` (100% reuse)
- R2 client at `cloud/src/lib/r2/client.ts`
Workflow
All work for this project lives on the `271-gt-prop-a` feature branch. Nothing merges to `main`.
For each sub-issue:
- Branch: Create a branch off `271-gt-prop-a` using the naming scheme `{issue_number}-{short-description}` (e.g., `300-sandbox-image`).
- Implement: Work through the issue's acceptance criteria. Write tests alongside the implementation.
- Validate: Run the test suite and type checks (`pnpm typecheck`) to confirm everything passes.
- Self-review: Spawn a sub-agent to review the diff against the sub-issue's requirements and acceptance criteria. Fix any issues raised.
- Commit: Stage all changes and create a commit with a descriptive message referencing the sub-issue number.
- Push & PR: Push the branch to the remote and create a pull request targeting `271-gt-prop-a` (not `main`). Link the sub-issue in the PR body with `Closes #NNN`. Do not skip this step — every sub-issue must result in a pushed branch and an open PR.
Branch structure:
```
main
└── 271-gt-prop-a   (feature branch for entire project)
    ├── NNN-sandbox-image
    ├── NNN-db-schema-provisioning
    ├── NNN-sandbox-internal-api
    └── ...
```
Phase 1: Sandbox Image & Provisioning
Build the sandbox image and cloud-side provisioning infrastructure.
- Sandbox Docker Image #281 — PR 1: Sandbox Docker Image — Fly.io machine image with `gt`, `kilo-cli`, `dolt`, `tmux`, `git`, `node`, `jq` pre-installed. Ubuntu 22.04 base. Pre-configured `default_agent: "opencode"`. Startup script: R2 restore (if needed) → `gt up` → R2 sync daemon → terminal proxy. SIGTERM handler for graceful shutdown.
  - Files: `cloud/infra/gastown-sandbox/Dockerfile`, `startup.sh`, `town-config.json`
- Database Schema & Provisioning API #282 — PR 2: Database Schema & Provisioning API — `gastown_towns` table (mirrors KiloClaw's `kiloclaw_instances` pattern: user_id, sandbox_id, fly_machine_id, fly_volume_id, status, last_heartbeat_at, destroyed_at). `gastown_rigs` table. Partial unique index for one active town per user+name. tRPC router with `createTown`, `destroyTown`, `getTownStatus`, `listTowns`, `addRig`, `removeRig`, `stopTown`, `startTown`. Provisioning flow (sketched after this item): generate deterministic sandbox ID → insert DB row → mint gateway JWT → create Fly machine with encrypted env vars → update DB with Fly IDs. Rollback on failure via soft-delete.
  - Files: `cloud/src/db/migrations/NNNN_gastown.sql`, `cloud/src/server/api/routers/gastown.ts`
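A sketch of that provisioning flow follows. All helpers here are assumed stand-ins for the real DB and Fly wrappers, not confirmed APIs; only `mintGastownToken` is named elsewhere in this plan:

```typescript
import { createHash } from "node:crypto";

// Deterministic sandbox ID: the same user + town name always maps to one ID.
function sandboxId(userId: string, townName: string): string {
  const digest = createHash("sha256").update(`${userId}:${townName}`).digest("hex");
  return `gt-${digest.slice(0, 16)}`;
}

async function createTown(userId: string, townName: string): Promise<void> {
  const id = sandboxId(userId, townName);
  await insertTownRow({ id, userId, status: "provisioning" });     // DB row first
  try {
    const jwt = await mintGastownToken(userId, id);                // PR 5
    const machine = await createFlyMachine(id, { KILO_JWT: jwt }); // encrypted env vars
    await updateTownRow(id, {
      flyMachineId: machine.id,
      flyVolumeId: machine.volumeId,
      status: "running",
    });
  } catch (err) {
    await softDeleteTown(id); // rollback on failure via soft-delete
    throw err;
  }
}

// Assumed helpers, declared for type-checking only.
declare function insertTownRow(row: { id: string; userId: string; status: string }): Promise<void>;
declare function updateTownRow(id: string, patch: Record<string, string>): Promise<void>;
declare function softDeleteTown(id: string): Promise<void>;
declare function mintGastownToken(userId: string, townId: string): Promise<string>;
declare function createFlyMachine(id: string, env: Record<string, string>): Promise<{ id: string; volumeId: string }>;
```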
- Sandbox Internal API #283 — PR 3: Sandbox Internal API — Lightweight HTTP server inside the sandbox for control-plane communication. Endpoints: `GET /health` (daemon + Dolt status), `GET /rigs`, `POST /rigs` (gt rig add), `GET /rigs/:name/beads` (Dolt query), `GET /rigs/:name/convoys`, `GET /rigs/:name/polecats`, `POST /message` (tmux send-keys), `GET /sessions`, WebSocket `/sessions/:name/stream`, `POST /sync` (trigger R2 backup). Auth via `x-internal-api-key` (see the sketch after this item).
  - Files: `cloud/infra/gastown-sandbox/internal-api/`
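A minimal sketch of the auth check and `/health` endpoint, assuming the server runs on plain `node:http` (the actual framework choice is open; `checkDaemon`/`checkDolt` are placeholder probes):

```typescript
import { createServer } from "node:http";

const API_KEY = process.env.INTERNAL_API_KEY ?? "";

createServer(async (req, res) => {
  // Every request must carry the shared internal API key.
  if (req.headers["x-internal-api-key"] !== API_KEY) {
    res.writeHead(401).end();
    return;
  }
  if (req.method === "GET" && req.url === "/health") {
    // Report daemon + Dolt liveness for the cloud-side health monitor (PR 6).
    const body = JSON.stringify({ daemon: await checkDaemon(), dolt: await checkDolt() });
    res.writeHead(200, { "content-type": "application/json" }).end(body);
    return;
  }
  res.writeHead(404).end();
}).listen(8080);

// Placeholder probes: "is the gt daemon alive" / "is Dolt serving queries".
declare function checkDaemon(): Promise<boolean>;
declare function checkDolt(): Promise<boolean>;
```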
Phase 2: R2 Persistence & Gateway Auth
R2 disaster recovery and gateway token integration.
- R2 Backup System #284 — PR 4: R2 Backup System — R2 sync daemon (5-min timer): `dolt backup` per rig → R2, `git bundle` per bare repo → R2, config + runtime tar → R2. Key structure: `gastown/{town_id}/snapshots/{timestamp}/`. Atomic swap via staging prefix + `latest.json` pointer (sketched after this item). Restore script: check volume → if empty, fetch latest snapshot → restore Dolt + git + config → verify integrity. SIGTERM handler flushes to R2 before shutdown. Heartbeat reporting to cloud API.
  - Files: `cloud/infra/gastown-sandbox/r2-sync-daemon.sh`, `r2-restore.sh`
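The sync daemon itself is a shell script per the file list above; this TypeScript fragment just illustrates the staging-prefix + `latest.json` pointer idea against R2's S3-compatible API. The bucket name and env var are assumptions:

```typescript
import { S3Client, PutObjectCommand } from "@aws-sdk/client-s3";

// R2 speaks the S3 protocol; credentials come from the standard AWS env vars.
const r2 = new S3Client({
  region: "auto",
  endpoint: process.env.R2_ENDPOINT, // e.g. https://<account>.r2.cloudflarestorage.com
});

// Upload a snapshot under its own prefix, then flip the latest.json pointer.
// Readers always resolve latest.json first, so a half-written snapshot is
// never observed: the pointer moves only after every object has landed.
async function publishSnapshot(townId: string, timestamp: string): Promise<void> {
  const prefix = `gastown/${townId}/snapshots/${timestamp}/`;
  // ... upload dolt backups, git bundles, and the config tar under `prefix` ...
  await r2.send(new PutObjectCommand({
    Bucket: "gastown-backups", // assumed bucket name
    Key: `gastown/${townId}/latest.json`,
    Body: JSON.stringify({ snapshot: prefix, completedAt: timestamp }),
    ContentType: "application/json",
  }));
}
```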
- Gateway Token Minting & Refresh #285 — PR 5: Gateway Token Minting & Refresh — `mintGastownToken(userId, townId)`: 24-hour JWT with `{ sub, type: "gastown_sandbox", town_id, organization_id }`. No gateway changes needed — the existing `getUserFromAuth()` extracts user context. Token passed as `KILO_JWT` env var. 12-hour refresh cron in sandbox calls `POST /api/gastown/refresh-token`. Model configuration flow: dashboard → tRPC → sandbox internal API `PATCH /config` → writes `gt` config files → new sessions pick up changes. (A minting sketch follows this item.)
  - Files: `cloud/src/lib/gastown/auth.ts`, updates to `cloud/src/server/api/routers/gastown.ts`
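A sketch of the minting function, assuming the common `jsonwebtoken` package and a shared-secret signing setup (the actual key handling may differ; `getOrganizationId` is a hypothetical lookup):

```typescript
import jwt from "jsonwebtoken";

// Claim names come straight from this plan; key handling is an assumption.
export async function mintGastownToken(userId: string, townId: string): Promise<string> {
  const organizationId = await getOrganizationId(userId); // hypothetical lookup
  return jwt.sign(
    { sub: userId, type: "gastown_sandbox", town_id: townId, organization_id: organizationId },
    process.env.GATEWAY_JWT_SECRET!, // assumed env var for the signing secret
    { expiresIn: "24h" },            // 24-hour lifetime per PR 5
  );
}

declare function getOrganizationId(userId: string): Promise<string>;
```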
Phase 3: Lifecycle Management
Health monitoring, stop/start, idle timeout, disaster recovery.
- Health Monitoring & Lifecycle Management #286 — PR 6: Health Monitoring & Lifecycle — Cloud-side health monitor: 3-min cron checks the sandbox `/health` endpoint. Track consecutive failures → 3 failures = unhealthy → attempt restart (sketched after this item). Stop flow: trigger R2 sync → SIGTERM → stop Fly machine (volume persists). Start flow: start Fly machine → volume data intact → `gt up` → daemon restarts all roles in ~10–15s. Idle timeout: auto-stop after 30 min with no active polecats or user interaction. Resume on next dashboard visit. Destroy flow: final R2 sync → soft-delete DB → destroy Fly machine + volume → retain R2 backup 30 days. Disaster recovery: health check detects unreachable → create new machine + volume → R2 restore → `gt up`.
  - Files: `cloud/src/lib/gastown/health-monitor.ts`, updates to gastown tRPC router
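A sketch of the consecutive-failure rule (three missed `/health` checks triggers a restart); `restartMachine` is a placeholder for the real Fly API wrapper:

```typescript
const FAILURE_THRESHOLD = 3;
const failures = new Map<string, number>(); // townId -> consecutive failures

async function checkTown(townId: string, healthUrl: string): Promise<void> {
  try {
    const res = await fetch(healthUrl, {
      headers: { "x-internal-api-key": process.env.INTERNAL_API_KEY! },
    });
    if (!res.ok) throw new Error(`status ${res.status}`);
    failures.set(townId, 0); // healthy: reset the counter
  } catch {
    const n = (failures.get(townId) ?? 0) + 1;
    failures.set(townId, n);
    if (n >= FAILURE_THRESHOLD) {
      failures.set(townId, 0);
      await restartMachine(townId); // assumed helper wrapping the Fly machines API
    }
  }
}

declare function restartMachine(townId: string): Promise<void>;
```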
Phase 4: Web Terminal UI
Terminal streaming for observing and interacting with agents.
- Terminal Proxy #287 — PR 7: Terminal Proxy — Custom Go binary inside the sandbox: WebSocket server that streams `tmux capture-pane -p` output (200ms interval) and accepts `tmux send-keys` input. Supports multiple concurrent viewers (broadcast). Auth via internal API key + session name. Handles terminal resize.
  - Files: `cloud/infra/gastown-sandbox/terminal-proxy/`
- WebSocket Proxy & Terminal Component #288 — PR 8: WebSocket Proxy & Terminal Component — Cloud app authenticates and proxies WebSocket connections to the sandbox terminal proxy. Stream ticket minting (60s JWT, same pattern as cloud-agent `signStreamTicket`). `GastownTerminal` React component using xterm.js with the fit addon (sketched after this item). Read-only mode for polecats/witnesses, read-write for the Mayor. Session picker sidebar: lists tmux sessions, color-coded by role (Mayor gold, Witness blue, Refinery green, Polecat gray), click to switch view.
  - Files: `cloud/src/app/api/gastown/terminal/route.ts`, `cloud/src/components/gastown/GastownTerminal.tsx`, updates to gastown tRPC router
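A sketch of the component, assuming the `@xterm/xterm` and `@xterm/addon-fit` packages and a `wsUrl` that already carries the 60s stream ticket (the prop names are assumptions):

```tsx
import { useEffect, useRef } from "react";
import { Terminal } from "@xterm/xterm";
import { FitAddon } from "@xterm/addon-fit";

// Streams pane output from the ticketed WebSocket into an xterm.js terminal;
// keystrokes are forwarded only in read-write mode (Mayor sessions).
export function GastownTerminal({ wsUrl, readOnly }: { wsUrl: string; readOnly: boolean }) {
  const ref = useRef<HTMLDivElement>(null);

  useEffect(() => {
    if (!ref.current) return;
    const term = new Terminal();
    const fit = new FitAddon();
    term.loadAddon(fit);
    term.open(ref.current);
    fit.fit();

    const ws = new WebSocket(wsUrl);
    ws.onmessage = (ev) => term.write(ev.data); // pane output -> screen
    const sub = term.onData((data) => {
      if (!readOnly) ws.send(data); // keystrokes -> tmux send-keys
    });

    return () => { sub.dispose(); ws.close(); term.dispose(); };
  }, [wsUrl, readOnly]);

  return <div ref={ref} style={{ height: "100%" }} />;
}
```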
Phase 5: Dashboard & Status Views
Web dashboard for managing towns, rigs, convoys, and agents.
- Town Overview & Rig Detail Dashboard Pages #289 — PR 9: Town Overview & Rig Detail Pages — `/gastown` page: town cards (name, status, agent count, heartbeat, resource usage), create-town wizard, quick actions (start/stop/destroy). `/gastown/[townId]` page: agent grid (card per tmux session with role, name, status, current bead; click to open terminal), rig list (repo URL, branch, polecats, pending beads; expand for bead list), convoy tracker (progress bars, color-coded). All data from the sandbox internal API. Polling: 5s for agents, 30s for beads/convoys (see the sketch after this item).
  - Files: `cloud/src/app/gastown/` (new pages), `cloud/src/components/gastown/`
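A sketch of the polling split using tRPC's React Query hooks; the `api` import path and the `listBeads` procedure are assumptions (only the PR 2 procedures are confirmed):

```tsx
import { api } from "~/trpc/react"; // assumed client location (T3-style layout)

// 5s polling for live agent state, 30s for slower-moving bead/convoy data.
export function useTownData(townId: string) {
  const agents = api.gastown.getTownStatus.useQuery({ townId }, { refetchInterval: 5_000 });
  const beads = api.gastown.listBeads.useQuery({ townId }, { refetchInterval: 30_000 }); // hypothetical procedure
  return { agents, beads };
}
```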
- Configuration Page #290 — PR 10: Configuration Page — `/gastown/[townId]/settings`: default model selector (gateway-available models), per-role model overrides, max concurrent polecats slider, rig management (add/remove repos), auto-stop idle timeout. tRPC mutations → sandbox internal API `PATCH /config` → writes gt config files (see the sketch after this item).
  - Files: updates to gastown pages and tRPC router
Phase 6: Hardening
Edge cases, security, resource tuning.
- Edge Case Handling & Security Hardening #291 — PR 11: Edge Case Handling & Security — Volume loss recovery (R2 restore + gastown crash recovery). Dolt corruption detection (`dolt verify-constraints`). Large repo handling (shallow clones, incremental bundles). Rapid-restart circuit breaker (3 failures in 10 min → error state; sketched below). JWT expiry handling (401 → stop LLM calls, manual refresh). Concurrent API call queuing. Disk space monitoring (alert at 80%). Security: Fly private network, per-user sandbox isolation, scoped JWT, encrypted internal API key, R2 credential scoping.
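A self-contained sketch of the rapid-restart circuit breaker described above (three restarts within ten minutes parks the town in an error state instead of retrying forever):

```typescript
const WINDOW_MS = 10 * 60 * 1000;
const MAX_RESTARTS = 3;
const restartLog = new Map<string, number[]>(); // townId -> restart timestamps

function recordRestart(townId: string): "retry" | "error" {
  const now = Date.now();
  // Keep only restarts inside the sliding 10-minute window, then add this one.
  const recent = (restartLog.get(townId) ?? []).filter((t) => now - t < WINDOW_MS);
  recent.push(now);
  restartLog.set(townId, recent);
  return recent.length >= MAX_RESTARTS ? "error" : "retry";
}
```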
Resource Sizing & Cost
| Resource | MVP Default |
|---|---|
| CPU | 4 vCPUs |
| Memory | 8 GB |
| Disk | 50 GB persistent volume |
| Max polecats | 8 (configurable) |

| Cost | Per Town/Month |
|---|---|
| Always-on | ~$68 |
| With idle auto-stop (4 hrs/day avg) | ~$18 |
| R2 storage + operations | ~$0.06 |
| LLM costs | Pass-through via gateway |

(The idle figure sits above a straight 4/24 proration of $68, presumably because the persistent volume continues to bill while the machine is stopped; only compute pauses.)
Risks
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| tmux-over-WebSocket UX | High | Medium | xterm.js polish. Long-term: Proposal D web-native UI |
| Fly provisioning latency | Medium | Low | Pre-warm pool. "Provisioning" state in UI |
| Large repos slow R2 sync | Medium | Medium | Shallow clones, incremental bundles, external remote fallback |
| Dolt backup slow (>500MB) | Low | Medium | Incremental mode, increased interval, Dolt remotes |
| Fly pricing changes | Low | High | Provider abstraction. CF Containers as future alternative |