
[FEATURE] Add GenAI Evaluation Support Following OpenTelemetry Semantic Conventions #1633

@anirudha

Description

Problem Statement

The OpenTelemetry community recently standardized how GenAI evaluation results should be captured (PR #2563, merged Aug 2025, now in v1.39.0). This defines a gen_ai.evaluation.result event that any OTel-compliant observability platform can consume.

Currently, strands_evals evaluators (OutputEvaluator, TrajectoryEvaluator, etc.) produce evaluation results, but these are not emitted as standardized OpenTelemetry events. This means:

  1. No interoperability — Evaluation data stays siloed in custom formats rather than flowing to any OTel-compatible backend
  2. No trace correlation — Evaluations can't be linked to the agent spans they're evaluating
  3. No experiment tracking — Observability platforms can't organize evaluations into experiments/test cases without proprietary SDKs
  4. Divergence from standards — As OTel GenAI conventions gain adoption, Strands risks being incompatible with the emerging ecosystem

Proposed Solution

Implement the OTel GenAI Evaluation Semantic Conventions in Strands' telemetry system.

Core Components

1. EvaluationResult data class

# strands/telemetry/evaluation.py
from dataclasses import dataclass
from typing import Any, Optional
from opentelemetry.trace import Span

@dataclass
class EvaluationResult:
    """Represents a single evaluation result conforming to OTel GenAI conventions."""
    
    name: str                           # gen_ai.evaluation.name (required)
    score_value: Optional[float] = None # gen_ai.evaluation.score.value
    score_label: Optional[str] = None   # gen_ai.evaluation.score.label
    explanation: Optional[str] = None   # gen_ai.evaluation.explanation
    response_id: Optional[str] = None   # gen_ai.response.id
    error_type: Optional[str] = None    # error.type (if evaluation failed)
    
    def to_otel_attributes(self) -> dict[str, Any]:
        """Convert to OpenTelemetry event attributes."""
        attrs = {"gen_ai.evaluation.name": self.name}
        if self.score_value is not None:
            attrs["gen_ai.evaluation.score.value"] = self.score_value
        if self.score_label is not None:
            attrs["gen_ai.evaluation.score.label"] = self.score_label
        if self.explanation is not None:
            attrs["gen_ai.evaluation.explanation"] = self.explanation
        if self.response_id is not None:
            attrs["gen_ai.response.id"] = self.response_id
        if self.error_type is not None:
            attrs["error.type"] = self.error_type
        return attrs
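
For illustration only (not additional API surface), converting a result to event attributes with the class above would look like this:

result = EvaluationResult(name="accuracy", score_value=0.92, score_label="pass")
result.to_otel_attributes()
# {"gen_ai.evaluation.name": "accuracy",
#  "gen_ai.evaluation.score.value": 0.92,
#  "gen_ai.evaluation.score.label": "pass"}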

2. Event emitter

class EvaluationEventEmitter:
    """Emits evaluation results as OpenTelemetry events."""
    
    EVENT_NAME = "gen_ai.evaluation.result"
    
    def emit(self, span: Span, result: EvaluationResult) -> None:
        """Add an evaluation result event to the given span."""
        span.add_event(
            name=self.EVENT_NAME,
            attributes=result.to_otel_attributes()
        )

3. Integration with existing evaluators

# strands_evals/evaluators/output_evaluator.py
class OutputEvaluator(Evaluator):
    def __init__(
        self, 
        name: str = "output_quality",  # NEW: explicit evaluator name
        rubric: str = ...,
        emit_otel_events: bool = True,  # NEW: opt-in/out flag
    ):
        self.name = name
        self.emit_otel_events = emit_otel_events
        ...
    
    def evaluate(self, case: Case, output: str, span: Optional[Span] = None) -> EvaluationReport:
        score, reasoning = self._run_evaluation(case, output)
        
        # Emit OTel event if enabled
        if self.emit_otel_events and span is not None:
            result = EvaluationResult(
                name=self.name,
                score_value=score,
                score_label=self._score_to_label(score),
                explanation=reasoning,
            )
            EvaluationEventEmitter().emit(span, result)
        
        return EvaluationReport(...)
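
The _score_to_label helper referenced above is left unspecified; one possible implementation (continuing the class, with the 0.8 threshold purely an assumption) is a simple threshold mapping, consistent with the "pass" label shown in the expected output below:

    # Sketch of _score_to_label; the threshold is an assumption, not part of the proposal
    def _score_to_label(self, score: float, threshold: float = 0.8) -> str:
        """Map a numeric score to a gen_ai.evaluation.score.label value."""
        return "pass" if score >= threshold else "fail"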

4. Experiment context attributes (for observability UI organization)

# Resource-level attributes
resource_attributes = {
    "experiment.id": "exp_20250205_greeting",
    "experiment.name": "Greeting Quality Test",
}

# Span-level attributes  
span_attributes = {
    "test_case.id": case.name,
    "test_case.input": case.input,
    "test_case.expected": case.expected_output,
}
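
For reference, these attributes could be attached with the plain OpenTelemetry SDK (standard OTel APIs, not a new Strands interface; the values are the same illustrative ones as above):

from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider

# Experiment context lives on the Resource so it applies to every span in the run
resource = Resource.create({
    "experiment.id": "exp_20250205_greeting",
    "experiment.name": "Greeting Quality Test",
})
trace.set_tracer_provider(TracerProvider(resource=resource))
tracer = trace.get_tracer("strands_evals")

# Test-case context goes on the individual agent span
with tracer.start_as_current_span("invoke_agent helpful_assistant") as span:
    span.set_attributes({
        "test_case.id": "greeting",
        "test_case.input": "Hello!",
        "test_case.expected": "A friendly greeting",
    })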

Expected OTel Output

{
  "resourceSpans": [{
    "resource": {
      "attributes": [
        {"key": "agent.name", "value": {"stringValue": "strands-agent"}},
        {"key": "experiment.id", "value": {"stringValue": "exp_20250205"}}
      ]
    },
    "scopeSpans": [{
      "spans": [{
        "name": "invoke_agent helpful_assistant",
        "attributes": [
          {"key": "test_case.id", "value": {"stringValue": "greeting"}}
        ],
        "events": [{
          "name": "gen_ai.evaluation.result",
          "attributes": [
            {"key": "gen_ai.evaluation.name", "value": {"stringValue": "response_quality"}},
            {"key": "gen_ai.evaluation.score.value", "value": {"doubleValue": 0.92}},
            {"key": "gen_ai.evaluation.score.label", "value": {"stringValue": "pass"}},
            {"key": "gen_ai.evaluation.explanation", "value": {"stringValue": "Response is friendly..."}}
          ]
        }]
      }]
    }]
  }]
}

Use Cases

1. Automatic evaluation export to any OTel backend

from strands import Agent
from strands_evals import Case, Experiment
from strands_evals.evaluators import OutputEvaluator

# Existing strands_evals code works unchanged
evaluator = OutputEvaluator(name="accuracy", rubric="...")
experiment = Experiment(cases=test_cases, evaluators=[evaluator])
reports = experiment.run_evaluations(get_response)

# 🎉 Evaluation events automatically sent to OTLP endpoint
# View in Datadog, Jaeger, Honeycomb, or any OTel-compatible platform
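
This assumes a tracer provider with an OTLP exporter is already configured; that is ordinary OpenTelemetry SDK setup rather than anything Strands-specific, roughly:

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Export spans (and their evaluation events) to a local OTLP collector
provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)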

2. Experiment tracking in observability platforms

With the proposed attributes, platforms can build UIs like:

Experiments
├── Greeting Quality Test (Feb 5) — 50 cases, 92% pass
│   ├── accuracy: avg 0.95
│   └── relevance: avg 0.91
└── RAG Test (Feb 4) — 120 cases, 87% pass

[Drill into case]
Test Case: greeting
├── Input: "Hello!"  
├── Evaluations:
│   └── accuracy: 0.92 (pass)
└── [View full trace →]

3. Manual event emission for custom evaluations

from strands.telemetry.evaluation import add_evaluation_event
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("my_evaluation") as span:
    score = my_custom_evaluator.evaluate(input, output)
    
    add_evaluation_event(
        span=span,
        name="custom_metric",
        score_value=score,
        score_label="pass" if score > 0.8 else "fail",
        explanation="Custom reasoning...",
    )
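
The add_evaluation_event helper used here is not defined elsewhere in this proposal; a minimal sketch, assuming it simply builds an EvaluationResult and delegates to EvaluationEventEmitter:

# strands/telemetry/evaluation.py (sketch of the convenience wrapper)
from typing import Optional
from opentelemetry.trace import Span

def add_evaluation_event(
    span: Span,
    name: str,
    score_value: Optional[float] = None,
    score_label: Optional[str] = None,
    explanation: Optional[str] = None,
    response_id: Optional[str] = None,
    error_type: Optional[str] = None,
) -> None:
    """Attach a gen_ai.evaluation.result event for a single evaluation to the given span."""
    result = EvaluationResult(
        name=name,
        score_value=score_value,
        score_label=score_label,
        explanation=explanation,
        response_id=response_id,
        error_type=error_type,
    )
    EvaluationEventEmitter().emit(span, result)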

4. CI/CD integration

Evaluation events flow through standard OTel pipelines, enabling:

  • Automated regression detection when scores drop
  • Blocking deployments if evaluation thresholds aren't met (see the sketch after this list)
  • Historical analysis across experiments
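
As one example of a deployment gate (the report shape, scores, and thresholds here are hypothetical, not part of the strands_evals API):

import sys

# Hypothetical aggregated scores, e.g. averaged from EvaluationReport objects
# or queried back from the OTel backend after an experiment run.
scores = {"accuracy": 0.95, "relevance": 0.91}
thresholds = {"accuracy": 0.90, "relevance": 0.85}

failures = [
    f"{name}: {scores.get(name, 0.0):.2f} < {minimum:.2f}"
    for name, minimum in thresholds.items()
    if scores.get(name, 0.0) < minimum
]
if failures:
    print("Evaluation gate failed:", "; ".join(failures))
    sys.exit(1)  # a non-zero exit blocks the deployment step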

Alternatives Considered

1. Proprietary evaluation API

Build a Strands-specific evaluation reporting API separate from OTel.

Rejected because:

  • Creates vendor lock-in
  • Requires users to adopt additional SDKs
  • Doesn't integrate with existing observability infrastructure
  • Diverges from industry direction

We could also support both patterns: emit evaluation results as events by default, with optional dedicated evaluation spans for advanced use cases.

Additional Context

References

  • OpenTelemetry semantic-conventions PR #2563: GenAI evaluation result event (merged Aug 2025)
  • OpenTelemetry Semantic Conventions v1.39.0: gen_ai.evaluation.result event definition

Open Questions

  1. Opt-in vs opt-out: Should OTel event emission be enabled by default? [ Opt-in ]

  2. Async evaluations: When evaluations run after the agent span closes, should we use gen_ai.response.id correlation, span links, or both? [ Use gen_ai.response.id; don't send duplicate traces ]

  3. Experiment attributes: Should experiment.id/experiment.name be proposed upstream to OTel, or remain Strands-specific? [ Yes, propose upstream ]

Implementation Scope

Component                                         Effort (S = small, M = medium)
Core EvaluationResult + EvaluationEventEmitter    S
OutputEvaluator integration                       M
TrajectoryEvaluator integration                   M
Experiment/test case attributes                   S
Documentation + examples                          M
