Description
Problem Statement
The OpenTelemetry community recently standardized how GenAI evaluation results should be captured (PR #2563, merged Aug 2025, now in v1.39.0). This creates a gen_ai.evaluation.result event specification that any OTEL-compliant observability platform can consume.
Currently, strands_evals evaluators (OutputEvaluator, TrajectoryEvaluator, etc.) produce evaluation results, but these are not emitted as standardized OpenTelemetry events. This means:
- No interoperability — Evaluation data stays siloed in custom formats rather than flowing to any OTel-compatible backend
- No trace correlation — Evaluations can't be linked to the agent spans they're evaluating
- No experiment tracking — Observability platforms can't organize evaluations into experiments/test cases without proprietary SDKs
- Divergence from standards — As OTel GenAI conventions gain adoption, Strands risks being incompatible with the emerging ecosystem
Proposed Solution
Implement the OTel GenAI Evaluation Semantic Conventions in Strands' telemetry system.
Core Components
1. EvaluationResult data class
```python
# strands/telemetry/evaluation.py
from dataclasses import dataclass
from typing import Any, Optional

from opentelemetry.trace import Span


@dataclass
class EvaluationResult:
    """Represents a single evaluation result conforming to OTel GenAI conventions."""

    name: str                             # gen_ai.evaluation.name (required)
    score_value: Optional[float] = None   # gen_ai.evaluation.score.value
    score_label: Optional[str] = None     # gen_ai.evaluation.score.label
    explanation: Optional[str] = None     # gen_ai.evaluation.explanation
    response_id: Optional[str] = None     # gen_ai.response.id
    error_type: Optional[str] = None      # error.type (if evaluation failed)

    def to_otel_attributes(self) -> dict[str, Any]:
        """Convert to OpenTelemetry event attributes."""
        attrs: dict[str, Any] = {"gen_ai.evaluation.name": self.name}
        if self.score_value is not None:
            attrs["gen_ai.evaluation.score.value"] = self.score_value
        if self.score_label is not None:
            attrs["gen_ai.evaluation.score.label"] = self.score_label
        if self.explanation is not None:
            attrs["gen_ai.evaluation.explanation"] = self.explanation
        if self.response_id is not None:
            attrs["gen_ai.response.id"] = self.response_id
        if self.error_type is not None:
            attrs["error.type"] = self.error_type
        return attrs
```
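For illustration, here is how a result built with this class maps onto the convention's attributes (a usage sketch only; the values are made up):

```python
# Hypothetical usage of the EvaluationResult sketch above.
result = EvaluationResult(
    name="accuracy",
    score_value=0.92,
    score_label="pass",
    explanation="Answer matches the expected output.",
)

# Produces:
# {
#     "gen_ai.evaluation.name": "accuracy",
#     "gen_ai.evaluation.score.value": 0.92,
#     "gen_ai.evaluation.score.label": "pass",
#     "gen_ai.evaluation.explanation": "Answer matches the expected output.",
# }
print(result.to_otel_attributes())
```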
2. Event emitter
```python
class EvaluationEventEmitter:
    """Emits evaluation results as OpenTelemetry events."""

    EVENT_NAME = "gen_ai.evaluation.result"

    def emit(self, span: Span, result: EvaluationResult) -> None:
        """Add an evaluation result event to the given span."""
        span.add_event(
            name=self.EVENT_NAME,
            attributes=result.to_otel_attributes(),
        )
```
3. Integration with existing evaluators
```python
# strands_evals/evaluators/output_evaluator.py
class OutputEvaluator(Evaluator):
    def __init__(
        self,
        name: str = "output_quality",    # NEW: explicit evaluator name
        rubric: str = ...,
        emit_otel_events: bool = True,   # NEW: opt-in/out flag
    ):
        self.name = name
        self.emit_otel_events = emit_otel_events
        ...

    def evaluate(self, case: Case, output: str, span: Optional[Span] = None) -> EvaluationReport:
        score, reasoning = self._run_evaluation(case, output)

        # Emit OTel event if enabled
        if self.emit_otel_events and span is not None:
            result = EvaluationResult(
                name=self.name,
                score_value=score,
                score_label=self._score_to_label(score),
                explanation=reasoning,
            )
            EvaluationEventEmitter().emit(span, result)

        return EvaluationReport(...)
```
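The sketch above calls a `_score_to_label` helper that isn't shown. A minimal version might look like the following; the 0.8 threshold is an assumption for illustration, not part of the proposal:

```python
def _score_to_label(self, score: float, threshold: float = 0.8) -> str:
    """Map a numeric score to a label for gen_ai.evaluation.score.label.

    The 0.8 cutoff is illustrative; a real evaluator would make it configurable.
    """
    return "pass" if score >= threshold else "fail"
```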
4. Experiment context attributes (for observability UI organization)
```python
# Resource-level attributes
resource_attributes = {
    "experiment.id": "exp_20250205_greeting",
    "experiment.name": "Greeting Quality Test",
}

# Span-level attributes
span_attributes = {
    "test_case.id": case.name,
    "test_case.input": case.input,
    "test_case.expected": case.expected_output,
}
```
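One way these attributes could be attached, using the standard OpenTelemetry SDK (the `configure_experiment_tracing` helper and the literal values are illustrative, not a proposed API):

```python
# Illustrative wiring only; the exact integration point in strands_evals is an open design question.
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider


def configure_experiment_tracing(experiment_id: str, experiment_name: str) -> trace.Tracer:
    """Create a tracer whose Resource carries the experiment-level attributes."""
    resource = Resource.create({
        "experiment.id": experiment_id,
        "experiment.name": experiment_name,
    })
    provider = TracerProvider(resource=resource)
    trace.set_tracer_provider(provider)
    return trace.get_tracer("strands_evals")


tracer = configure_experiment_tracing("exp_20250205_greeting", "Greeting Quality Test")

# Per-case span carrying the test_case.* attributes shown above.
with tracer.start_as_current_span(
    "invoke_agent helpful_assistant",
    attributes={"test_case.id": "greeting", "test_case.input": "Hello!"},
):
    pass  # run the agent and evaluators here
```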
Expected OTel Output
```json
{
  "resourceSpans": [{
    "resource": {
      "attributes": [
        {"key": "agent.name", "value": {"stringValue": "strands-agent"}},
        {"key": "experiment.id", "value": {"stringValue": "exp_20250205"}}
      ]
    },
    "scopeSpans": [{
      "spans": [{
        "name": "invoke_agent helpful_assistant",
        "attributes": [
          {"key": "test_case.id", "value": {"stringValue": "greeting"}}
        ],
        "events": [{
          "name": "gen_ai.evaluation.result",
          "attributes": [
            {"key": "gen_ai.evaluation.name", "value": {"stringValue": "response_quality"}},
            {"key": "gen_ai.evaluation.score.value", "value": {"doubleValue": 0.92}},
            {"key": "gen_ai.evaluation.score.label", "value": {"stringValue": "pass"}},
            {"key": "gen_ai.evaluation.explanation", "value": {"stringValue": "Response is friendly..."}}
          ]
        }]
      }]
    }]
  }]
}
```
Use Cases
1. Automatic evaluation export to any OTel backend
```python
from strands import Agent
from strands_evals import Case, Experiment
from strands_evals.evaluators import OutputEvaluator

# Existing strands_evals code works unchanged
evaluator = OutputEvaluator(name="accuracy", rubric="...")
experiment = Experiment(cases=test_cases, evaluators=[evaluator])
reports = experiment.run_evaluations(get_response)

# 🎉 Evaluation events automatically sent to OTLP endpoint
# View in Datadog, Jaeger, Honeycomb, or any OTel-compatible platform
```
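For the events to actually reach a backend, the application still needs a normal OTel export pipeline. A typical setup might look like this; the endpoint and service name are placeholders:

```python
# Standard OpenTelemetry SDK setup; nothing here is Strands-specific.
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

provider = TracerProvider(resource=Resource.create({"service.name": "strands-agent"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317"))  # placeholder endpoint
)
trace.set_tracer_provider(provider)
```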
2. Experiment tracking in observability platforms
With the proposed attributes, platforms can build UIs like:
```
Experiments
├── Greeting Quality Test (Feb 5) — 50 cases, 92% pass
│   ├── accuracy: avg 0.95
│   └── relevance: avg 0.91
└── RAG Test (Feb 4) — 120 cases, 87% pass

[Drill into case]

Test Case: greeting
├── Input: "Hello!"
├── Evaluations:
│   └── accuracy: 0.92 (pass)
└── [View full trace →]
```
3. Manual event emission for custom evaluations
```python
from strands.telemetry.evaluation import add_evaluation_event
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("my_evaluation") as span:
    score = my_custom_evaluator.evaluate(input, output)
    add_evaluation_event(
        span=span,
        name="custom_metric",
        score_value=score,
        score_label="pass" if score > 0.8 else "fail",
        explanation="Custom reasoning...",
    )
```
4. CI/CD integration
Evaluation events flow through standard OTel pipelines, enabling:
- Automated regression detection when scores drop
- Blocking deployments if evaluation thresholds aren't met (see the sketch after this list)
- Historical analysis across experiments
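As a rough sketch of the deployment-gate idea above, assuming each report exposes a numeric score (the field name and 0.85 threshold are assumptions, not the strands_evals API):

```python
# ci_gate.py — illustrative only; the report fields are assumed, not actual strands_evals API.
import sys

THRESHOLD = 0.85  # example gate value, tune per experiment


def gate(reports) -> None:
    """Exit non-zero (failing the CI job) if the average evaluation score is below the threshold."""
    scores = [r.score for r in reports]  # assumes each report exposes a numeric score
    average = sum(scores) / len(scores)
    if average < THRESHOLD:
        print(f"Evaluation gate failed: avg score {average:.2f} < {THRESHOLD}")
        sys.exit(1)
    print(f"Evaluation gate passed: avg score {average:.2f}")
```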
Alternatives Considered
1. Proprietary evaluation API
Build a Strands-specific evaluation reporting API separate from OTel.
Rejected because:
- Creates vendor lock-in
- Requires users to adopt additional SDKs
- Doesn't integrate with existing observability infrastructure
- Diverges from industry direction
We could support both patterns, with events as the default and optional span emission for advanced use cases (a sketch follows below).
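A minimal sketch of what that optional span emission could look like, reusing the EvaluationResult attributes from above (the span name is illustrative):

```python
# Illustrative alternative: one child span per evaluation instead of an event.
# Assumes the EvaluationResult class proposed earlier in this issue.
from opentelemetry import trace

tracer = trace.get_tracer("strands_evals")


def emit_evaluation_span(result: "EvaluationResult") -> None:
    """Record an evaluation as its own span carrying the same gen_ai.evaluation.* attributes."""
    with tracer.start_as_current_span(
        "evaluation " + result.name,  # span name is illustrative, not part of the convention
        attributes=result.to_otel_attributes(),
    ):
        pass
```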
Additional Context
References
Open Questions
- Opt-in vs opt-out: Should OTel event emission be enabled by default? [Opt-in]
- Async evaluations: When evaluations run after the agent span closes, should we use `gen_ai.response.id` correlation, span links, or both? (a sketch follows this list) [Use `gen_ai.response.id`; do not send duplicate traces]
- Experiment attributes: Should `experiment.id`/`experiment.name` be proposed upstream to OTel, or remain Strands-specific? [Yes, propose upstream]
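For the async-evaluation question, a sketch of the response-id approach noted above (it reuses the proposed EvaluationResult class; names are illustrative):

```python
# Late (async) evaluation emission correlated by gen_ai.response.id:
# the evaluation is recorded on its own small span, and no agent spans are re-sent.
from opentelemetry import trace

tracer = trace.get_tracer("strands_evals")


def emit_delayed_evaluation(response_id: str, result: "EvaluationResult") -> None:
    """Emit an evaluation event after the agent span has already closed."""
    result.response_id = response_id  # backends join on gen_ai.response.id
    with tracer.start_as_current_span("delayed_evaluation") as span:
        span.add_event(
            "gen_ai.evaluation.result",
            attributes=result.to_otel_attributes(),
        )
```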
Implementation Scope
| Component | Effort |
|---|---|
| Core EvaluationResult + EvaluationEventEmitter | S |
| OutputEvaluator integration | M |
| TrajectoryEvaluator integration | M |
| Experiment/test case attributes | S |
| Documentation + examples | M |