
Copilot AI commented Jan 16, 2026

Adds a pre-built 6-step investigation workflow for debugging production issues using Grafana MCP (Prometheus, Alertmanager, Loki). Delegates all observability queries to isolated subagents to prevent context bloat from log dumps and metric time series.

New Job: env_investigate

Six-step workflow: triage → alert_check → metrics_analysis → log_investigation → root_cause → remediation

Each step produces structured markdown artifacts. Observability steps delegate to specialized subagents that enforce strict output contracts (max 10 alerts/metrics, 5 log lines, 200 char truncation).

Subagent Templates (Claude Code)

Three analyst subagents in src/deepwork/templates/claude/agents/:

  • alertmanager-analyst - Queries mcp__grafana__get_alerts, returns YAML summary grouped by severity
  • prometheus-analyst - Queries mcp__grafana__query_prometheus, samples 10 data points per metric with trends
  • loki-analyst - Queries mcp__grafana__query_loki, returns pattern counts + max 5 truncated log samples

Generated to .claude/agents/*.md during sync. Subagents return structured YAML, never raw observability data.

Platform Support

Claude Code: Full subagent isolation via Task tool delegation
Gemini CLI: Inline summarization rules in step instructions (no subagent support)
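
For illustration, a Gemini step instruction might embed summarization rules along these lines (the wording is an assumption, not the shipped text; the limits come from the output contracts above): "Query mcp__grafana__query_loki directly, then summarize before writing logs.md: report pattern counts rather than raw output, include at most 5 sample log lines, and truncate each sample to 200 characters."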

Agent Generation System

New AgentGenerator class renders Jinja2 templates from src/deepwork/templates/{platform}/agents/:

# In sync.py
agent_generator = AgentGenerator()
agent_paths = agent_generator.generate_agents(adapter, project_path)
# Creates .claude/agents/*.md for Claude, nothing for Gemini

Integrated into deepwork sync and deepwork install flows. Install now includes env_investigate alongside deepwork_jobs and deepwork_policy.
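
A minimal sketch of how such a generator could be implemented, assuming the adapter exposes a platform name and an agents output directory (both attribute names below are assumptions, not the real interface):

# Hypothetical sketch of agent_generator.py; adapter.platform_name and
# adapter.agents_dir are assumed attribute names.
from pathlib import Path
from jinja2 import Environment, FileSystemLoader

TEMPLATES_ROOT = Path(__file__).parent.parent / "templates"  # src/deepwork/templates/

class AgentGenerator:
    """Render per-platform agent templates, e.g. templates/claude/agents/*.md.j2."""

    def generate_agents(self, adapter, project_path):
        template_dir = TEMPLATES_ROOT / adapter.platform_name / "agents"
        if not template_dir.is_dir():
            return []  # platforms without subagent support (e.g. Gemini) ship no templates

        env = Environment(loader=FileSystemLoader(str(template_dir)))
        output_dir = Path(project_path) / adapter.agents_dir  # e.g. ".claude/agents"
        output_dir.mkdir(parents=True, exist_ok=True)

        agent_paths = []
        for name in env.list_templates(extensions=["j2"]):
            rendered = env.get_template(name).render()
            target = output_dir / name.removesuffix(".j2")  # alertmanager-analyst.md.j2 -> .md
            target.write_text(rendered)
            agent_paths.append(target)
        return agent_paths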

Standard Job Pattern

The job follows the standard DeepWork job pattern with:

  • AGENTS.md - Comprehensive documentation (8.7KB) explaining the observability subagent meta-framework:

    • How isolated subagents prevent context bloat from high-volume data sources
    • Subagent output contracts and platform support
    • How to adapt the pattern for other tools (traces, databases, events)
    • Complete workflow walkthrough with examples
  • templates/ - 7 artifact templates for investigation outputs:

    • triage.md.template - Investigation scope structure
    • alerts.md.template - Alertmanager analysis format
    • metrics.md.template - Prometheus metrics format
    • logs.md.template - Loki logs format
    • root_cause.md.template - Root cause analysis structure
    • timeline.md.template - Incident timeline format
    • remediation.md.template - Remediation plan structure
  • make_new_job.sh - Script to create investigation directories with README tracking

Example Usage

deepwork install --platform claude
# Generates:
#   .claude/agents/alertmanager-analyst.md
#   .claude/agents/prometheus-analyst.md  
#   .claude/agents/loki-analyst.md
#   .claude/commands/env_investigate.*.md
#   .deepwork/jobs/env_investigate/AGENTS.md
#   .deepwork/jobs/env_investigate/templates/*.md.template
#   .deepwork/jobs/env_investigate/make_new_job.sh

Step instructions delegate queries:

Use the Task tool to spawn the alertmanager-analyst subagent with this prompt:
"Query Alertmanager for alerts related to the investigation in triage.md..."

Subagent returns:

summary:
  total_alerts: 23
  critical: 5
alerts:
  - name: HighErrorRate
    severity: critical
    # ... max 10 total

Files Added

src/deepwork/standard_jobs/env_investigate/
├── AGENTS.md              # Meta-framework documentation
├── job.yml                # Job definition
├── make_new_job.sh        # Investigation directory creator
├── steps/                 # 6 step instruction files
└── templates/             # 7 artifact templates

src/deepwork/templates/claude/agents/
├── alertmanager-analyst.md.j2
├── prometheus-analyst.md.j2
└── loki-analyst.md.j2

src/deepwork/core/
└── agent_generator.py     # Agent file generator
Original prompt

This section details the original issue you should resolve

<issue_title>Implement Environment Investigation Workflow with Observability Subagents (Grafana MCP)</issue_title>
<issue_description>## Summary

Add a pre-built DeepWork job for investigating issues in deployed environments using Grafana MCP (Prometheus, Alertmanager, Loki). The workflow delegates all observability queries to isolated subagents to prevent context bloat.

Problem

When debugging production issues with AI agents, observability queries (especially Loki logs) return large amounts of data that blow out the context window. This makes iterative investigation difficult and forces users to manually manage context.

Solution

A multi-step DeepWork job that:

  1. Structures investigation into discrete steps with clear artifacts
  2. Delegates all Grafana MCP queries to isolated subagents
  3. Returns only structured summaries to the main context

Scope

In Scope

  • New job: env_investigate with 6 steps
  • Subagent definitions for Claude Code (Alertmanager, Prometheus, Loki analysts)
  • Step instructions that enforce subagent delegation
  • Platform-specific command generation (Claude Code, Gemini CLI)

Out of Scope

  • Grafana MCP server implementation (assumes user has it configured)
  • Custom MCP wrappers or summarization middleware
  • Dashboard creation or modification

Job Definition

name: env_investigate
version: "1.0.0"
summary: "Investigate deployed environment issues via isolated observability subagents"

steps:
  - id: triage
    name: "Triage & Scope"
    outputs: [triage.md]
    dependencies: []

  - id: alert_check
    name: "Check Alerts"
    outputs: [alerts.md]
    dependencies: [triage]

  - id: metrics_analysis
    name: "Metrics Analysis"
    outputs: [metrics.md]
    dependencies: [alert_check]

  - id: log_investigation
    name: "Log Investigation"
    outputs: [logs.md]
    dependencies: [metrics_analysis]

  - id: root_cause
    name: "Root Cause Analysis"
    outputs: [root_cause.md, timeline.md]
    dependencies: [log_investigation]

  - id: remediation
    name: "Remediation Plan"
    outputs: [remediation.md]
    dependencies: [root_cause]

Subagent Definitions (Claude Code)

Three subagents in .claude/agents/:

| Agent | Tools | Output Contract |
|---|---|---|
| alertmanager-analyst | mcp__grafana__get_alerts | Max 10 alerts, grouped by severity |
| prometheus-analyst | mcp__grafana__query_prometheus | Max 10 data points, trend summary |
| loki-analyst | mcp__grafana__query_loki | Max 5 log lines, truncated traces |

Each returns structured YAML, never raw query results.
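
For illustration, a contract like this could be spot-checked programmatically when verifying the acceptance criteria; the helper below is hypothetical (the limits come from the table above, and the alerts field follows the example reply shown earlier in the PR description):

import yaml  # PyYAML; assumed available in the test environment

ALERT_LIMIT = 10      # from the alertmanager-analyst contract
LOG_SAMPLE_LIMIT = 5  # from the loki-analyst contract

def check_alert_summary(summary_text: str) -> None:
    """Fail if a subagent reply exceeds the documented alert contract."""
    data = yaml.safe_load(summary_text)
    alerts = data.get("alerts", [])
    if len(alerts) > ALERT_LIMIT:
        raise ValueError(f"contract allows at most {ALERT_LIMIT} alerts, got {len(alerts)}")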

Platform Support

| Platform | Subagent Support | Implementation |
|---|---|---|
| Claude Code | Yes | .claude/agents/*.md + step instructions with context: fork |
| Gemini CLI | No (global only) | Step instructions include inline summarization rules |

For Gemini: steps should instruct the agent to self-limit output since isolated subagents aren't available.

Files to Add

job_library/
└── env_investigate/
    ├── job.yml
    └── steps/
        ├── triage.md
        ├── alert_check.md
        ├── metrics_analysis.md
        ├── log_investigation.md
        ├── root_cause.md
        └── remediation.md

src/deepwork/templates/claude/
└── agents/
    ├── alertmanager-analyst.md.j2
    ├── prometheus-analyst.md.j2
    └── loki-analyst.md.j2

Acceptance Criteria

  • deepwork install with this job generates working commands for Claude Code
  • deepwork install with this job generates working commands for Gemini CLI
  • Claude Code: subagents are generated in .claude/agents/
  • Step instructions reference correct subagent names
  • Each step produces its documented output artifact
  • No step returns raw observability data to main context (verified via output contract)

Testing

  1. Install job in test repo with Grafana MCP configured
  2. Run /env_investigate.triage → verify triage.md created
  3. Run /env_investigate.alert_check → verify subagent spawned, alerts.md contains summary not raw data
  4. Complete full workflow → verify all artifacts on work branch
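
The generation checks above could also be automated; a minimal pytest sketch, assuming the deepwork CLI is installed and can run in a temporary directory (the fixture usage is an assumption about the test setup):

import subprocess
from pathlib import Path

def test_install_generates_claude_agents(tmp_path: Path) -> None:
    # Assumes the deepwork CLI is on PATH and can install into an empty directory.
    subprocess.run(["deepwork", "install", "--platform", "claude"], cwd=tmp_path, check=True)

    agents_dir = tmp_path / ".claude" / "agents"
    expected = {"alertmanager-analyst.md", "prometheus-analyst.md", "loki-analyst.md"}
    assert expected <= {p.name for p in agents_dir.glob("*.md")}

    # Step commands for each of the six steps should also be generated.
    assert list((tmp_path / ".claude" / "commands").glob("env_investigate.*.md"))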

Dependencies

  • User must have Grafana MCP server configured with Prometheus, Loki, Alertmanager datasources
  • Claude Code for full subagent support</issue_description>

Comments on the Issue (you are @copilot in this section)



Copilot AI and others added 2 commits January 16, 2026 22:06
Co-authored-by: ncrmro <8276365+ncrmro@users.noreply.github.com>
Co-authored-by: ncrmro <8276365+ncrmro@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Add Environment Investigation Workflow with Grafana MCP" to "Add env_investigate job for production debugging with Grafana MCP subagents" on Jan 16, 2026
Copilot AI requested a review from ncrmro January 16, 2026 22:12

@ncrmro ncrmro left a comment


@copilot can you rewrite this using the existing job pattern and update or create an AGENTS.md explaining how to use this meta framework


Copilot AI commented Jan 17, 2026

@copilot can you rewrite this using the existing job pattern and update or create an AGENTS.md explaining how to use this meta framework

Done! I've restructured env_investigate to follow the standard job pattern:

Added files (commit 79b8ba0):

  • AGENTS.md - Comprehensive documentation of the observability subagent meta-framework, including:
    • How the subagent pattern prevents context bloat
    • Subagent definitions and output contracts
    • Platform support (Claude Code vs Gemini CLI)
    • How to adapt the pattern for other tools
    • Workflow walkthrough
  • templates/ - 7 artifact templates (triage, alerts, metrics, logs, root_cause, timeline, remediation)
  • make_new_job.sh - Script to create investigation directories with README tracking

Meta-framework explanation in AGENTS.md:
The document explains how the job uses isolated subagents to query observability tools (Grafana MCP) and return structured YAML summaries instead of raw data. This prevents context bloat from large log dumps and metric time series. It also shows how to adapt this pattern for other high-volume data sources.

The structure now matches deepwork_jobs with AGENTS.md, templates/, and make_new_job.sh.

Copilot AI requested a review from ncrmro January 17, 2026 01:06

ncrmro commented Jan 17, 2026

@copilot How would you have us use the claude serve mcp or gemini --acp flag to do subagents ourselves in a unified way?


Copilot AI commented Jan 17, 2026

@copilot How would you have us use the claude serve mcp or gemini --acp flag to do subagents ourselves in a unified way?

The current implementation uses Claude Code's Task tool for subagent isolation (spawning isolated contexts) rather than running separate MCP servers. This works for Claude Code but has no equivalent in Gemini CLI.

If you want to use native MCP server features (claude serve mcp or gemini --acp) to achieve unified subagent behavior:

Approach:

  1. Run Grafana MCP as a separate server process
  2. Have both platforms connect to it via their native MCP configuration
  3. Remove the custom .claude/agents/*.md subagent files
  4. Update step instructions to call MCP tools directly with strict output rules

Trade-offs:

  • ✅ Unified: Both platforms use same MCP connection
  • ✅ Simpler: No custom subagent definitions needed
  • ❌ No isolation: Main context receives all tool output (context bloat issue returns)
  • ❌ Manual limits: Agent must self-enforce output contracts

The current implementation with isolated subagents (via Task tool) was chosen because it guarantees context isolation for Claude Code users. The MCP server approach would require agents to manually truncate/summarize, which is less reliable.

Would you like me to add documentation for the MCP server approach as an alternative pattern, or modify the implementation to use native MCP features?


ncrmro commented Jan 17, 2026

@copilot I was thinking more along the lines of either agent being able to delegate work to another agent.

@ncrmro ncrmro closed this Jan 17, 2026
2 participants