
[FEATURE] Add GenAI Evaluation Support Following OpenTelemetry Semantic Conventions #1633

@anirudha

Description

Problem Statement

The OpenTelemetry community recently standardized how GenAI evaluation results should be captured (PR #2563, merged Aug 2025, now in v1.39.0). This defines a gen_ai.evaluation.result event that any OTel-compliant observability platform can consume.

Currently, strands_evals evaluators (OutputEvaluator, TrajectoryEvaluator, etc.) produce evaluation results, but these are not emitted as standardized OpenTelemetry events. This means:

  1. No interoperability — Evaluation data stays siloed in custom formats rather than flowing to any OTel-compatible backend
  2. No trace correlation — Evaluations can't be linked to the agent spans they're evaluating
  3. No experiment tracking — Observability platforms can't organize evaluations into experiments/test cases without proprietary SDKs
  4. Divergence from standards — As OTel GenAI conventions gain adoption, Strands risks being incompatible with the emerging ecosystem

Proposed Solution

Implement the OTel GenAI Evaluation Semantic Conventions in Strands' telemetry system.

Core Components

1. EvaluationResult data class

# strands/telemetry/evaluation.py
from dataclasses import dataclass
from typing import Any, Optional
from opentelemetry.trace import Span

@dataclass
class EvaluationResult:
    """Represents a single evaluation result conforming to OTel GenAI conventions."""
    
    name: str                           # gen_ai.evaluation.name (required)
    score_value: Optional[float] = None # gen_ai.evaluation.score.value
    score_label: Optional[str] = None   # gen_ai.evaluation.score.label
    explanation: Optional[str] = None   # gen_ai.evaluation.explanation
    response_id: Optional[str] = None   # gen_ai.response.id
    error_type: Optional[str] = None    # error.type (if evaluation failed)
    
    def to_otel_attributes(self) -> dict[str, Any]:
        """Convert to OpenTelemetry event attributes."""
        attrs = {"gen_ai.evaluation.name": self.name}
        if self.score_value is not None:
            attrs["gen_ai.evaluation.score.value"] = self.score_value
        if self.score_label is not None:
            attrs["gen_ai.evaluation.score.label"] = self.score_label
        if self.explanation is not None:
            attrs["gen_ai.evaluation.explanation"] = self.explanation
        if self.response_id is not None:
            attrs["gen_ai.response.id"] = self.response_id
        if self.error_type is not None:
            attrs["error.type"] = self.error_type
        return attrs
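
For illustration only (not additional API surface), converting a result to event attributes with the class above would look like this:

result = EvaluationResult(name="accuracy", score_value=0.92, score_label="pass")
result.to_otel_attributes()
# {"gen_ai.evaluation.name": "accuracy",
#  "gen_ai.evaluation.score.value": 0.92,
#  "gen_ai.evaluation.score.label": "pass"}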

2. Event emitter

class EvaluationEventEmitter:
    """Emits evaluation results as OpenTelemetry events."""
    
    EVENT_NAME = "gen_ai.evaluation.result"
    
    def emit(self, span: Span, result: EvaluationResult) -> None:
        """Add an evaluation result event to the given span."""
        span.add_event(
            name=self.EVENT_NAME,
            attributes=result.to_otel_attributes()
        )

3. Integration with existing evaluators

# strands_evals/evaluators/output_evaluator.py
class OutputEvaluator(Evaluator):
    def __init__(
        self, 
        name: str = "output_quality",  # NEW: explicit evaluator name
        rubric: str = ...,
        emit_otel_events: bool = True,  # NEW: opt-in/out flag
    ):
        self.name = name
        self.emit_otel_events = emit_otel_events
        ...
    
    def evaluate(self, case: Case, output: str, span: Optional[Span] = None) -> EvaluationReport:
        score, reasoning = self._run_evaluation(case, output)
        
        # Emit OTel event if enabled
        if self.emit_otel_events and span is not None:
            result = EvaluationResult(
                name=self.name,
                score_value=score,
                score_label=self._score_to_label(score),
                explanation=reasoning,
            )
            EvaluationEventEmitter().emit(span, result)
        
        return EvaluationReport(...)
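
The _score_to_label helper referenced above is left unspecified; one possible implementation (continuing the class, with the 0.8 threshold purely an assumption) is a simple threshold mapping, consistent with the "pass" label shown in the expected output below:

    # Sketch of _score_to_label; the threshold is an assumption, not part of the proposal
    def _score_to_label(self, score: float, threshold: float = 0.8) -> str:
        """Map a numeric score to a gen_ai.evaluation.score.label value."""
        return "pass" if score >= threshold else "fail"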

4. Experiment context attributes (for observability UI organization)

# Resource-level attributes
resource_attributes = {
    "experiment.id": "exp_20250205_greeting",
    "experiment.name": "Greeting Quality Test",
}

# Span-level attributes  
span_attributes = {
    "test_case.id": case.name,
    "test_case.input": case.input,
    "test_case.expected": case.expected_output,
}
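
For reference, these attributes could be attached with the plain OpenTelemetry SDK (standard OTel APIs, not a new Strands interface; the values are the same illustrative ones as above):

from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider

# Experiment context lives on the Resource so it applies to every span in the run
resource = Resource.create({
    "experiment.id": "exp_20250205_greeting",
    "experiment.name": "Greeting Quality Test",
})
trace.set_tracer_provider(TracerProvider(resource=resource))
tracer = trace.get_tracer("strands_evals")

# Test-case context goes on the individual agent span
with tracer.start_as_current_span("invoke_agent helpful_assistant") as span:
    span.set_attributes({
        "test_case.id": "greeting",
        "test_case.input": "Hello!",
        "test_case.expected": "A friendly greeting",
    })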

Expected OTel Output

{
  "resourceSpans": [{
    "resource": {
      "attributes": [
        {"key": "agent.name", "value": {"stringValue": "strands-agent"}},
        {"key": "experiment.id", "value": {"stringValue": "exp_20250205"}}
      ]
    },
    "scopeSpans": [{
      "spans": [{
        "name": "invoke_agent helpful_assistant",
        "attributes": [
          {"key": "test_case.id", "value": {"stringValue": "greeting"}}
        ],
        "events": [{
          "name": "gen_ai.evaluation.result",
          "attributes": [
            {"key": "gen_ai.evaluation.name", "value": {"stringValue": "response_quality"}},
            {"key": "gen_ai.evaluation.score.value", "value": {"doubleValue": 0.92}},
            {"key": "gen_ai.evaluation.score.label", "value": {"stringValue": "pass"}},
            {"key": "gen_ai.evaluation.explanation", "value": {"stringValue": "Response is friendly..."}}
          ]
        }]
      }]
    }]
  }]
}

Use Cases

1. Automatic evaluation export to any OTel backend

from strands import Agent
from strands_evals import Case, Experiment
from strands_evals.evaluators import OutputEvaluator

# Existing strands_evals code works unchanged
evaluator = OutputEvaluator(name="accuracy", rubric="...")
experiment = Experiment(cases=test_cases, evaluators=[evaluator])
reports = experiment.run_evaluations(get_response)

# 🎉 Evaluation events automatically sent to OTLP endpoint
# View in Datadog, Jaeger, Honeycomb, or any OTel-compatible platform
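
This assumes a tracer provider with an OTLP exporter is already configured; that is ordinary OpenTelemetry SDK setup rather than anything Strands-specific, roughly:

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Export spans (and their evaluation events) to a local OTLP collector
provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)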

2. Experiment tracking in observability platforms

With the proposed attributes, platforms can build UIs like:

Experiments
├── Greeting Quality Test (Feb 5) — 50 cases, 92% pass
│   ├── accuracy: avg 0.95
│   └── relevance: avg 0.91
└── RAG Test (Feb 4) — 120 cases, 87% pass

[Drill into case]
Test Case: greeting
├── Input: "Hello!"  
├── Evaluations:
│   └── accuracy: 0.92 (pass)
└── [View full trace →]

3. Manual event emission for custom evaluations

from strands.telemetry.evaluation import add_evaluation_event
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("my_evaluation") as span:
    score = my_custom_evaluator.evaluate(input, output)
    
    add_evaluation_event(
        span=span,
        name="custom_metric",
        score_value=score,
        score_label="pass" if score > 0.8 else "fail",
        explanation="Custom reasoning...",
    )
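
The add_evaluation_event helper used here is not defined elsewhere in this proposal; a minimal sketch, assuming it simply builds an EvaluationResult and delegates to EvaluationEventEmitter:

# strands/telemetry/evaluation.py (sketch of the convenience wrapper)
from typing import Optional
from opentelemetry.trace import Span

def add_evaluation_event(
    span: Span,
    name: str,
    score_value: Optional[float] = None,
    score_label: Optional[str] = None,
    explanation: Optional[str] = None,
    response_id: Optional[str] = None,
    error_type: Optional[str] = None,
) -> None:
    """Attach a gen_ai.evaluation.result event for a single evaluation to the given span."""
    result = EvaluationResult(
        name=name,
        score_value=score_value,
        score_label=score_label,
        explanation=explanation,
        response_id=response_id,
        error_type=error_type,
    )
    EvaluationEventEmitter().emit(span, result)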

4. CI/CD integration

Evaluation events flow through standard OTel pipelines, enabling:

  • Automated regression detection when scores drop
  • Blocking deployments if evaluation thresholds aren't met (see the sketch after this list)
  • Historical analysis across experiments
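
As one example of a deployment gate (the report shape, scores, and thresholds here are hypothetical, not part of the strands_evals API):

import sys

# Hypothetical aggregated scores, e.g. averaged from EvaluationReport objects
# or queried back from the OTel backend after an experiment run.
scores = {"accuracy": 0.95, "relevance": 0.91}
thresholds = {"accuracy": 0.90, "relevance": 0.85}

failures = [
    f"{name}: {scores.get(name, 0.0):.2f} < {minimum:.2f}"
    for name, minimum in thresholds.items()
    if scores.get(name, 0.0) < minimum
]
if failures:
    print("Evaluation gate failed:", "; ".join(failures))
    sys.exit(1)  # a non-zero exit blocks the deployment step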

Alternatives Considered

1. Proprietary evaluation API

Build a Strands-specific evaluation reporting API separate from OTel.

Rejected because:

  • Creates vendor lock-in
  • Requires users to adopt additional SDKs
  • Doesn't integrate with existing observability infrastructure
  • Diverges from industry direction

We could also support both patterns: emit evaluation results as events by default, with optional dedicated evaluation spans for advanced use cases.

Additional Context

References

  • OpenTelemetry semantic-conventions PR #2563: GenAI evaluation result event (merged Aug 2025)
  • OpenTelemetry Semantic Conventions v1.39.0: gen_ai.evaluation.result event definition

Open Questions

  1. Opt-in vs opt-out: Should OTel event emission be enabled by default? [ Opt-in ]

  2. Async evaluations: When evaluations run after the agent span closes, should we use gen_ai.response.id correlation, span links, or both? [ Use gen_ai.response.id; don't send duplicate traces ]

  3. Experiment attributes: Should experiment.id/experiment.name be proposed upstream to OTel, or remain Strands-specific? [ Yes, propose upstream ]

Implementation Scope

Component                                         Effort (S = small, M = medium)
Core EvaluationResult + EvaluationEventEmitter    S
OutputEvaluator integration                       M
TrajectoryEvaluator integration                   M
Experiment/test case attributes                   S
Documentation + examples                          M
