
Support for hosted evals #880

Open
willccbb wants to merge 3 commits into main from hosted-eval-plugin

@willccbb willccbb commented Feb 9, 2026

Description

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Test improvement

Testing

  • All existing tests pass when running uv run pytest locally.
  • New tests have been added to cover the changes

Checklist

  • My code follows the style guidelines of this project as outlined in AGENTS.md
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

Additional Notes


Note

Medium Risk
Adds new networked evaluation execution paths (API requests, polling, log handling) and changes CLI argument normalization, which could affect evaluation runs and environment resolution if edge cases are missed.

Overview
Adds --hosted support to the verifiers.cli.commands.eval adapter, keeping local vf-eval behavior but enabling creation (and optional --follow polling/log streaming) of Prime-hosted evaluations via the Prime API, including API key/config resolution, slug/version resolution (arg/header/local metadata), TOML multi-eval configs, and payload options like timeouts, env args, access flags, secrets, and naming.

Updates the Prime CLI plugin to better locate a workspace root/venv and to normalize/auto-fill environment directory arguments (e.g., --path, --env-dir-path) to absolute workspace environments/ paths. Adds unit tests covering hosted payload construction, header-based slug parsing, TOML behavior, and plugin command/path resolution. Also pins dev ruff to >=0.15.0.

Written by Cursor Bugbot for commit 9e363a9. This will update automatically on new commits.

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 4 potential issues.

)
raise SystemExit(2)

_run_vf_eval(args)

Missing documentation for hosted evaluation feature

Medium Severity

This PR adds significant new user-facing functionality with the --hosted mode for evaluations, including new flags: --hosted, --follow, --poll-interval, --timeout-minutes, --allow-sandbox-access, --allow-instances-access, --custom-secrets, and --eval-name. The existing docs/evaluation.md describes the evaluation command in detail but isn't updated to document these new hosted evaluation capabilities. Per the review rules, PRs that add or modify core user-facing functionality as described in docs must update the relevant documentation.

)
raise SystemExit(2)

_run_vf_eval(args)

Missing skills update for hosted evaluation workflow

Low Severity

This PR changes user-facing Prime evaluation workflows by adding a new hosted execution mode with --hosted. The existing skills/evaluate-environments/SKILL.md describes evaluation workflows but doesn't include the new hosted evaluation patterns such as running evaluations on the Prime platform, following logs with --follow, or configuring hosted-specific options. Per the review rules, changes to user-facing evaluation workflows must update the corresponding skills.

slug, version = value.rsplit("@", 1)
if not version:
    raise HostedEvalError(f"Invalid environment version in '{value}'")
return slug, version

Malformed slug with @ before / causes crash

Low Severity

_split_slug_and_version doesn't validate that the resulting slug contains a / separator. An unusual input like @version/name passes _is_slug_reference (because it contains / and doesn't start with ./, ../, or /), then rsplit("@", 1) produces an empty slug "". When _run_hosted_eval later calls env_slug.split("/", 1) on an empty string, it raises a ValueError with a confusing message rather than a descriptive hosted eval error.
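One way to address this is to validate the slug right where it is split, so a malformed reference fails with a descriptive error instead of a later unpacking failure. The sketch below is an illustration under assumptions, not the PR's actual code: `HostedEvalError` is a stand-in class, the function name drops the leading underscore, and returning `None` for a missing version is a guess at the intended behavior.

```python
from typing import Optional, Tuple


class HostedEvalError(Exception):
    """Stand-in for the PR's HostedEvalError; the real class may differ."""


def split_slug_and_version(value: str) -> Tuple[str, Optional[str]]:
    """Split 'owner/name@version' into (slug, version), validating both parts.

    Assumption: a reference without '@' is allowed and yields version None.
    """
    if "@" in value:
        slug, version = value.rsplit("@", 1)
        if not version:
            raise HostedEvalError(f"Invalid environment version in '{value}'")
    else:
        slug, version = value, None

    # Reject slugs that lack a non-empty owner or name, such as the empty
    # slug produced by '@version/name'.
    owner, sep, name = slug.partition("/")
    if not sep or not owner or not name:
        raise HostedEvalError(
            f"Invalid environment slug in '{value}': expected 'owner/name'"
        )
    return slug, version
```

With this check in place, a later `env_slug.split("/", 1)` always receives a slug that contains a separator with text on both sides.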


    i += 1
    continue
if token in HOSTED_VALUE_FLAGS:
    i += 2

Help flag skipped when following hosted value flag

Low Severity

In _strip_hosted_flags_for_help, when a hosted value flag like --poll-interval is encountered, the code does i += 2 to skip the flag and its value without checking if the "value" position contains --help or -h. For input like ["my-env", "--poll-interval", "--help"], the help flag is treated as --poll-interval's value and skipped. After stripping, _run_vf_eval(["my-env"]) is called, running an evaluation instead of showing help as the user intended.
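A possible fix is to look at the token after a value flag before skipping it, and to keep it when it is a help flag. The following is a minimal sketch, not the PR's implementation: the flag names come from this review, but which of them take a value (other than `--poll-interval`) is an assumption, as are the `HOSTED_BOOL_FLAGS` and `HELP_FLAGS` names. The `--flag=value` form is not handled here.

```python
from typing import List

# Assumed grouping; only --poll-interval is confirmed as a value flag above.
HOSTED_BOOL_FLAGS = {
    "--hosted",
    "--follow",
    "--allow-sandbox-access",
    "--allow-instances-access",
}
HOSTED_VALUE_FLAGS = {
    "--poll-interval",
    "--timeout-minutes",
    "--custom-secrets",
    "--eval-name",
}
HELP_FLAGS = {"--help", "-h"}


def strip_hosted_flags_for_help(argv: List[str]) -> List[str]:
    """Drop hosted-only flags from argv without swallowing a help flag."""
    stripped: List[str] = []
    i = 0
    while i < len(argv):
        token = argv[i]
        if token in HOSTED_BOOL_FLAGS:
            i += 1
            continue
        if token in HOSTED_VALUE_FLAGS:
            # Skip the flag's value only if it is not itself a help flag.
            next_is_help = i + 1 < len(argv) and argv[i + 1] in HELP_FLAGS
            i += 1 if next_is_help else 2
            continue
        stripped.append(token)
        i += 1
    return stripped
```

For the input from the report, `["my-env", "--poll-interval", "--help"]`, this keeps `--help` in the result, so the downstream parser shows help rather than starting an evaluation.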
