
Copilot AI commented Jan 16, 2026

Adds a pre-built 6-step investigation workflow for debugging production issues using Grafana MCP (Prometheus, Alertmanager, Loki). Delegates all observability queries to isolated subagents to prevent context bloat from log dumps and metric time series.

New Job: env_investigate

Six-step workflow: triage → alert_check → metrics_analysis → log_investigation → root_cause → remediation

Each step produces structured markdown artifacts. Observability steps delegate to specialized subagents that enforce strict output contracts (max 10 alerts/metrics, 5 log lines, 200 char truncation).

Subagent Templates (Claude Code)

Three analyst subagents in src/deepwork/templates/claude/agents/:

  • alertmanager-analyst - Queries mcp__grafana__get_alerts, returns YAML summary grouped by severity
  • prometheus-analyst - Queries mcp__grafana__query_prometheus, samples 10 data points per metric with trends
  • loki-analyst - Queries mcp__grafana__query_loki, returns pattern counts + max 5 truncated log samples

Generated to .claude/agents/*.md during sync. Subagents return structured YAML, never raw observability data.

Platform Support

Claude Code: Full subagent isolation via Task tool delegation
Gemini CLI: Inline summarization rules in step instructions (no subagent support)
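
For illustration, a Gemini step instruction might embed summarization rules along these lines (the wording is an assumption, not the shipped text; the limits come from the output contracts above): "Query mcp__grafana__query_loki directly, then summarize before writing logs.md: report pattern counts rather than raw output, include at most 5 sample log lines, and truncate each sample to 200 characters."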

Agent Generation System

New AgentGenerator class renders Jinja2 templates from src/deepwork/templates/{platform}/agents/:

# In sync.py
agent_generator = AgentGenerator()
agent_paths = agent_generator.generate_agents(adapter, project_path)
# Creates .claude/agents/*.md for Claude, nothing for Gemini

Integrated into deepwork sync and deepwork install flows. Install now includes env_investigate alongside deepwork_jobs and deepwork_policy.
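
A minimal sketch of how such a generator could be implemented, assuming the adapter exposes a platform name and an agents output directory (both attribute names below are assumptions, not the real interface):

# Hypothetical sketch of agent_generator.py; adapter.platform_name and
# adapter.agents_dir are assumed attribute names.
from pathlib import Path
from jinja2 import Environment, FileSystemLoader

TEMPLATES_ROOT = Path(__file__).parent.parent / "templates"  # src/deepwork/templates/

class AgentGenerator:
    """Render per-platform agent templates, e.g. templates/claude/agents/*.md.j2."""

    def generate_agents(self, adapter, project_path):
        template_dir = TEMPLATES_ROOT / adapter.platform_name / "agents"
        if not template_dir.is_dir():
            return []  # platforms without subagent support (e.g. Gemini) ship no templates

        env = Environment(loader=FileSystemLoader(str(template_dir)))
        output_dir = Path(project_path) / adapter.agents_dir  # e.g. ".claude/agents"
        output_dir.mkdir(parents=True, exist_ok=True)

        agent_paths = []
        for name in env.list_templates(extensions=["j2"]):
            rendered = env.get_template(name).render()
            target = output_dir / name.removesuffix(".j2")  # alertmanager-analyst.md.j2 -> .md
            target.write_text(rendered)
            agent_paths.append(target)
        return agent_paths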

Standard Job Pattern

The job follows the standard DeepWork job pattern with:

  • AGENTS.md - Comprehensive documentation (8.7KB) explaining the observability subagent meta-framework:

    • How isolated subagents prevent context bloat from high-volume data sources
    • Subagent output contracts and platform support
    • How to adapt the pattern for other tools (traces, databases, events)
    • Complete workflow walkthrough with examples
  • templates/ - 7 artifact templates for investigation outputs:

    • triage.md.template - Investigation scope structure
    • alerts.md.template - Alertmanager analysis format
    • metrics.md.template - Prometheus metrics format
    • logs.md.template - Loki logs format
    • root_cause.md.template - Root cause analysis structure
    • timeline.md.template - Incident timeline format
    • remediation.md.template - Remediation plan structure
  • make_new_job.sh - Script to create investigation directories with README tracking

Example Usage

deepwork install --platform claude
# Generates:
#   .claude/agents/alertmanager-analyst.md
#   .claude/agents/prometheus-analyst.md  
#   .claude/agents/loki-analyst.md
#   .claude/commands/env_investigate.*.md
#   .deepwork/jobs/env_investigate/AGENTS.md
#   .deepwork/jobs/env_investigate/templates/*.md.template
#   .deepwork/jobs/env_investigate/make_new_job.sh

Step instructions delegate queries:

Use the Task tool to spawn the alertmanager-analyst subagent with this prompt:
"Query Alertmanager for alerts related to the investigation in triage.md..."

Subagent returns:

summary:
  total_alerts: 23
  critical: 5
alerts:
  - name: HighErrorRate
    severity: critical
    # ... max 10 total

Files Added

src/deepwork/standard_jobs/env_investigate/
├── AGENTS.md              # Meta-framework documentation
├── job.yml                # Job definition
├── make_new_job.sh        # Investigation directory creator
├── steps/                 # 6 step instruction files
└── templates/             # 7 artifact templates

src/deepwork/templates/claude/agents/
├── alertmanager-analyst.md.j2
├── prometheus-analyst.md.j2
└── loki-analyst.md.j2

src/deepwork/core/
└── agent_generator.py     # Agent file generator
Original prompt

This section details the original issue you should resolve

<issue_title>Implement Environment Investigation Workflow with Observability Subagents (Grafana MCP)</issue_title>
<issue_description>## Summary

Add a pre-built DeepWork job for investigating issues in deployed environments using Grafana MCP (Prometheus, Alertmanager, Loki). The workflow delegates all observability queries to isolated subagents to prevent context bloat.

Problem

When debugging production issues with AI agents, observability queries (especially Loki logs) return large amounts of data that blow out the context window. This makes iterative investigation difficult and forces users to manually manage context.

Solution

A multi-step DeepWork job that:

  1. Structures investigation into discrete steps with clear artifacts
  2. Delegates all Grafana MCP queries to isolated subagents
  3. Returns only structured summaries to the main context

Scope

In Scope

  • New job: env_investigate with 6 steps
  • Subagent definitions for Claude Code (Alertmanager, Prometheus, Loki analysts)
  • Step instructions that enforce subagent delegation
  • Platform-specific command generation (Claude Code, Gemini CLI)

Out of Scope

  • Grafana MCP server implementation (assumes user has it configured)
  • Custom MCP wrappers or summarization middleware
  • Dashboard creation or modification

Job Definition

name: env_investigate
version: "1.0.0"
summary: "Investigate deployed environment issues via isolated observability subagents"

steps:
  - id: triage
    name: "Triage & Scope"
    outputs: [triage.md]
    dependencies: []

  - id: alert_check
    name: "Check Alerts"
    outputs: [alerts.md]
    dependencies: [triage]

  - id: metrics_analysis
    name: "Metrics Analysis"
    outputs: [metrics.md]
    dependencies: [alert_check]

  - id: log_investigation
    name: "Log Investigation"
    outputs: [logs.md]
    dependencies: [metrics_analysis]

  - id: root_cause
    name: "Root Cause Analysis"
    outputs: [root_cause.md, timeline.md]
    dependencies: [log_investigation]

  - id: remediation
    name: "Remediation Plan"
    outputs: [remediation.md]
    dependencies: [root_cause]

Subagent Definitions (Claude Code)

Three subagents in .claude/agents/:

| Agent | Tools | Output Contract |
|---|---|---|
| alertmanager-analyst | mcp__grafana__get_alerts | Max 10 alerts, grouped by severity |
| prometheus-analyst | mcp__grafana__query_prometheus | Max 10 data points, trend summary |
| loki-analyst | mcp__grafana__query_loki | Max 5 log lines, truncated traces |

Each returns structured YAML, never raw query results.
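
For illustration, a contract like this could be spot-checked programmatically when verifying the acceptance criteria; the helper below is hypothetical (the limits come from the table above, and the alerts field follows the example reply shown earlier in the PR description):

import yaml  # PyYAML; assumed available in the test environment

ALERT_LIMIT = 10      # from the alertmanager-analyst contract
LOG_SAMPLE_LIMIT = 5  # from the loki-analyst contract

def check_alert_summary(summary_text: str) -> None:
    """Fail if a subagent reply exceeds the documented alert contract."""
    data = yaml.safe_load(summary_text)
    alerts = data.get("alerts", [])
    if len(alerts) > ALERT_LIMIT:
        raise ValueError(f"contract allows at most {ALERT_LIMIT} alerts, got {len(alerts)}")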

Platform Support

| Platform | Subagent Support | Implementation |
|---|---|---|
| Claude Code | Yes | .claude/agents/*.md + step instructions with context: fork |
| Gemini CLI | No (global only) | Step instructions include inline summarization rules |

For Gemini: steps should instruct the agent to self-limit output since isolated subagents aren't available.

Files to Add

job_library/
└── env_investigate/
    ├── job.yml
    └── steps/
        ├── triage.md
        ├── alert_check.md
        ├── metrics_analysis.md
        ├── log_investigation.md
        ├── root_cause.md
        └── remediation.md

src/deepwork/templates/claude/
└── agents/
    ├── alertmanager-analyst.md.j2
    ├── prometheus-analyst.md.j2
    └── loki-analyst.md.j2

Acceptance Criteria

  • deepwork install with this job generates working commands for Claude Code
  • deepwork install with this job generates working commands for Gemini CLI
  • Claude Code: subagents are generated in .claude/agents/
  • Step instructions reference correct subagent names
  • Each step produces its documented output artifact
  • No step returns raw observability data to main context (verified via output contract)

Testing

  1. Install job in test repo with Grafana MCP configured
  2. Run /env_investigate.triage → verify triage.md created
  3. Run /env_investigate.alert_check → verify subagent spawned, alerts.md contains summary not raw data
  4. Complete full workflow → verify all artifacts on work branch
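
The generation checks above could also be automated; a minimal pytest sketch, assuming the deepwork CLI is installed and can run in a temporary directory (the fixture usage is an assumption about the test setup):

import subprocess
from pathlib import Path

def test_install_generates_claude_agents(tmp_path: Path) -> None:
    # Assumes the deepwork CLI is on PATH and can install into an empty directory.
    subprocess.run(["deepwork", "install", "--platform", "claude"], cwd=tmp_path, check=True)

    agents_dir = tmp_path / ".claude" / "agents"
    expected = {"alertmanager-analyst.md", "prometheus-analyst.md", "loki-analyst.md"}
    assert expected <= {p.name for p in agents_dir.glob("*.md")}

    # Step commands for each of the six steps should also be generated.
    assert list((tmp_path / ".claude" / "commands").glob("env_investigate.*.md"))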

Dependencies

  • User must have Grafana MCP server configured with Prometheus, Loki, Alertmanager datasources
  • Claude Code for full subagent support</issue_description>

Comments on the Issue (you are @copilot in this section)



Copilot AI and others added 2 commits January 16, 2026 22:06
Co-authored-by: ncrmro <8276365+ncrmro@users.noreply.github.com>
Co-authored-by: ncrmro <8276365+ncrmro@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Add Environment Investigation Workflow with Grafana MCP" to "Add env_investigate job for production debugging with Grafana MCP subagents" on Jan 16, 2026
Copilot AI requested a review from ncrmro January 16, 2026 22:12

@ncrmro ncrmro left a comment


@copilot can you rewrite this using the existing job pattern and update or create an AGENTS.md explaining how to use this meta framework


Copilot AI commented Jan 17, 2026

@copilot can you rewrite this using the existing job pattern and update or create an AGENTS.md explaining how to use this meta framework

Done! I've restructured env_investigate to follow the standard job pattern:

Added files (commit 79b8ba0):

  • AGENTS.md - Comprehensive documentation of the observability subagent meta-framework, including:
    • How the subagent pattern prevents context bloat
    • Subagent definitions and output contracts
    • Platform support (Claude Code vs Gemini CLI)
    • How to adapt the pattern for other tools
    • Workflow walkthrough
  • templates/ - 7 artifact templates (triage, alerts, metrics, logs, root_cause, timeline, remediation)
  • make_new_job.sh - Script to create investigation directories with README tracking

Meta-framework explanation in AGENTS.md:
The document explains how the job uses isolated subagents to query observability tools (Grafana MCP) and return structured YAML summaries instead of raw data. This prevents context bloat from large log dumps and metric time series. It also shows how to adapt this pattern for other high-volume data sources.

The structure now matches deepwork_jobs with AGENTS.md, templates/, and make_new_job.sh.

Copilot AI requested a review from ncrmro January 17, 2026 01:06

ncrmro commented Jan 17, 2026

@copilot How would you have us use the claude serve mcp or gemini --acp flag to do subagents ourselves in a unified way?


Copilot AI commented Jan 17, 2026

@copilot How would you have us use the claude serve mcp or gemini --acp flag to do subagents ourselves in a unified way?

The current implementation uses Claude Code's Task tool for subagent isolation (spawning isolated contexts) rather than running separate MCP servers. This works for Claude Code but has no equivalent in Gemini CLI.

If you want to use native MCP server features (claude serve mcp or gemini --acp) to achieve unified subagent behavior:

Approach:

  1. Run Grafana MCP as a separate server process
  2. Have both platforms connect to it via their native MCP configuration
  3. Remove the custom .claude/agents/*.md subagent files
  4. Update step instructions to call MCP tools directly with strict output rules

Trade-offs:

  • ✅ Unified: Both platforms use same MCP connection
  • ✅ Simpler: No custom subagent definitions needed
  • ❌ No isolation: Main context receives all tool output (context bloat issue returns)
  • ❌ Manual limits: Agent must self-enforce output contracts

The current implementation with isolated subagents (via Task tool) was chosen because it guarantees context isolation for Claude Code users. The MCP server approach would require agents to manually truncate/summarize, which is less reliable.

Would you like me to add documentation for the MCP server approach as an alternative pattern, or modify the implementation to use native MCP features?


ncrmro commented Jan 17, 2026

@copilot I was thinking more along the lines of either agent being able to delegate work to another agent.

@ncrmro ncrmro closed this Jan 17, 2026
2 participants