[Gastown] PR 5.5: Container — Adopt kilo serve for Agent Management #305

@jrf0110

Description

Overview

Replace the current stdin/stdout-based agent process management in the Town Container with Kilo's built-in HTTP server (kilo serve). The container currently spawns kilo code --non-interactive as fire-and-forget child processes and communicates via raw stdin pipes. This is fragile and provides no structured observability.

Decision: We are going forward with the kilo serve route. See analysis: docs/gt/opencode-server-analysis.md

Context

kilo serve starts a headless HTTP server (OpenAPI 3.1) with session management, structured message sending, SSE event streaming, abort/fork/revert, diff inspection, and more. The SDK (@kilocode/sdk/v2/server) provides createOpencodeServer() to manage the server lifecycle.

Current flow:

Container Control Server (port 8080)
  └── Bun.spawn('kilo code --non-interactive') × N agents
      └── stdin/stdout pipes (fragile, unstructured)

Target flow:

Container Control Server (port 8080)
  └── kilo serve (port 4096+N) × M server instances (one per worktree)
      └── HTTP API: POST /session/:id/message, GET /event (SSE), etc.

Scope

1. Replace process-manager.ts internals

  • Instead of Bun.spawn(['kilo', 'code', '--non-interactive', ...]), use createOpencodeServer() from @kilocode/sdk/v2/server (or equivalent) to start kilo serve instances
  • One kilo serve instance per worktree/project directory (since a server is scoped to one project)
  • Manage port allocation for multiple server instances within the container
  • Track server instances and their sessions instead of raw child processes
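A minimal sketch of the per-worktree port allocation described above, following the "4096+N" scheme from the target flow. The class and method names (`ServePortAllocator`, `acquire`, `release`) are illustrative, not SDK APIs:

```typescript
// Hypothetical per-worktree port allocator for kilo serve instances.
// Ports start at 4096 and increment, matching the "4096+N" scheme above.
class ServePortAllocator {
  private next = 4096;
  private byWorktree = new Map<string, number>();

  // Reuse the port of an already-running server for this worktree,
  // or hand out the next free port for a new instance.
  acquire(worktreeDir: string): number {
    const existing = this.byWorktree.get(worktreeDir);
    if (existing !== undefined) return existing;
    const port = this.next++;
    this.byWorktree.set(worktreeDir, port);
    return port;
  }

  // Forget the mapping once the server for a worktree is stopped.
  release(worktreeDir: string): void {
    this.byWorktree.delete(worktreeDir);
  }
}
```

Agents sharing a worktree resolve to the same port (and thus the same server instance); agents in different worktrees get distinct instances.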

2. Replace stdin-based messaging with HTTP API

  • sendMessage(agentId, prompt) → POST /session/:id/message or POST /session/:id/prompt_async
  • getProcessStatus(agentId) → GET /session/status (structured session-level status)
  • Agent abort → POST /session/:id/abort (clean abort instead of SIGTERM)
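A hedged sketch of the HTTP-based `sendMessage` replacement; the endpoint path comes from the mapping above, but the request body shape and the `Fetcher` indirection are assumptions for illustration:

```typescript
// Minimal HTTP messaging sketch replacing stdin writes. The injectable
// fetcher makes the function testable without a running kilo serve instance.
type Fetcher = (url: string, init?: RequestInit) => Promise<Response>;

async function sendMessage(
  baseUrl: string,
  sessionId: string,
  prompt: string,
  fetcher: Fetcher = fetch,
): Promise<Response> {
  // Body shape is an assumption; the real schema is defined by the
  // server's OpenAPI 3.1 spec.
  return fetcher(`${baseUrl}/session/${sessionId}/message`, {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify({ parts: [{ type: "text", text: prompt }] }),
  });
}
```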

3. Replace agent-runner.ts startup flow

  • After git clone/worktree setup, start a kilo serve instance for the worktree (if not already running)
  • Create a new session on the server: POST /session
  • Send the initial prompt via POST /session/:id/message with model/agent/system-prompt configuration
  • Return session ID as the agent's handle (instead of process PID)
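The session-as-handle flow above might be sketched like this (request and response shapes are assumptions; only the endpoint paths come from this issue):

```typescript
// Hypothetical agent startup: create a session, send the initial prompt,
// return the session id as the agent handle (replacing the process PID).
type HttpFetch = (url: string, init?: RequestInit) => Promise<Response>;

async function startAgent(
  baseUrl: string,
  prompt: string,
  fetcher: HttpFetch = fetch,
): Promise<string> {
  const created = await fetcher(`${baseUrl}/session`, { method: "POST" });
  const { id } = (await created.json()) as { id: string };

  await fetcher(`${baseUrl}/session/${id}/message`, {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify({ parts: [{ type: "text", text: prompt }] }),
  });

  return id; // the control server stores this instead of a PID
}
```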

4. Wire up SSE event streaming

  • Subscribe to GET /event on each kilo serve instance
  • Forward relevant events (tool calls, completions, errors) to the heartbeat reporter
  • This replaces the raw stdout pipe reading with typed, structured events
  • Enables the future WebSocket streaming endpoint (/agents/:agentId/stream) referenced in the TODO
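For context, a bare-bones parser for blocks on the GET /event stream; in practice the SDK client likely handles this, so this is only to show the shape of the data crossing the wire:

```typescript
// Minimal SSE block parser: one event per blank-line-delimited block.
interface SseEvent {
  event: string; // event name, defaulting to "message" per the SSE spec
  data: string;  // concatenated data lines
}

function parseSseBlock(block: string): SseEvent {
  let event = "message";
  const data: string[] = [];
  for (const line of block.split("\n")) {
    if (line.startsWith("event:")) event = line.slice("event:".length).trim();
    else if (line.startsWith("data:")) data.push(line.slice("data:".length).trim());
  }
  return { event, data: data.join("\n") };
}
```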

5. Update control server endpoints

| Endpoint | Current | After |
| --- | --- | --- |
| POST /agents/start | Spawns kilo process | Creates session on kilo server |
| POST /agents/:id/message | Writes to stdin pipe | POST /session/:id/message |
| GET /agents/:id/status | Process lifecycle (pid, exit code) | Session status (active tools, message count, etc.) |
| POST /agents/:id/stop | SIGTERM/SIGKILL on process | POST /session/:id/abort + optionally stop server if no more sessions |
| GET /health | Process count | Server instance count + session count |

6. Update heartbeat reporter

  • Report session-level status instead of process-level status
  • Include active tool calls and last message info from SSE events
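One way to fold SSE events into session-level heartbeat status; the event type strings and field names here are placeholders, not the real kilo event schema:

```typescript
// Hypothetical reducer from SSE events to heartbeat status.
interface HeartbeatStatus {
  sessionId: string;
  activeTools: string[];   // tools currently running in the session
  lastMessageAt?: number;  // timestamp of the most recent message event
}

interface AgentEvent {
  type: string;   // placeholder names: "tool.start", "tool.end", "message"
  tool?: string;
  at?: number;
}

function applyEvent(status: HeartbeatStatus, event: AgentEvent): HeartbeatStatus {
  switch (event.type) {
    case "tool.start":
      return { ...status, activeTools: [...status.activeTools, event.tool ?? "unknown"] };
    case "tool.end":
      return { ...status, activeTools: status.activeTools.filter((t) => t !== event.tool) };
    case "message":
      return { ...status, lastMessageAt: event.at };
    default:
      return status;
  }
}
```

The heartbeat reporter can then serialize the accumulated status on its existing schedule instead of polling process liveness.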

What stays the same

  • Git clone/worktree management (git-manager.ts) — unchanged
  • Container control server (port 8080) — same interface for TownContainer DO
  • Agent environment variable setup — still needed for gastown plugin config
  • Dockerfile — still needs kilo installed globally

Acceptance Criteria

  • Container starts kilo serve instances instead of kilo code --non-interactive processes
  • Agents are managed as sessions within kilo server instances
  • Follow-up messages use HTTP API instead of stdin pipes
  • Agent status reflects session-level detail (not just process alive/dead)
  • SSE event subscription is wired up for observability
  • Clean abort via server API works
  • Existing control server endpoints maintain the same external contract (no breaking changes for TownContainer DO)
  • All existing container tests pass (or are updated to reflect new internals)

Risks & Notes

  • Port management: Each kilo serve instance needs its own port, so we need an allocation strategy (e.g., 4096 plus an incrementing counter)
  • One server per worktree: A kilo server is scoped to one project dir. Multiple agents sharing a worktree can share a server with separate sessions; agents in different worktrees need separate servers
  • Resource overhead: Marginal — kilo serve is a single Bun process either way, just with HTTP server overhead instead of raw stdin/stdout
  • Migration path: Can be done incrementally — start with HTTP messaging, then add SSE, then refine status reporting

Parent issue: #204
