As AI coding agents such as Claude Code and OpenAI Codex rapidly improve, we need more than cherry-picked demos, especially in specialized domains like GPU programming.
AgentKernelArena is a standardized evaluation arena built by AMD to measure how well AI coding agents perform on real GPU kernel optimization tasks.
AgentKernelArena provides an end-to-end, isolated benchmarking environment where LLM-powered agents (Cursor Agent, Claude Code, Codex, SWE-agent, GEAK, and custom agents) are evaluated side by side on the same kernel tasks using objective, reproducible metrics.
AgentKernelArena enables systematic evaluation of AI agents on GPU kernel optimization tasks by combining:
- Multi-Agent Arena: Cursor, Claude Code, SWE-agent, OpenEvolve (GEAK), single LLM calls (Codex/others), and custom agents
- Multi-Model Support: OpenAI (GPT-5), Anthropic Claude (Opus 4.5, Sonnet 4.5), and other models via OpenRouter or vLLM
- Task Categories: HIP (ROCm examples, rocPRIM, customer HIP), Triton (TritonBench, ROCmBench), and Torch2HIP conversions
- Real Metrics: Automated evaluation of compilation success, correctness, and real GPU performance speedups
- Designed for Fair Comparison: Standardized tasks, environments, prompts, and scoring for leaderboard-style evaluation
- Workspace Isolation: Each task runs in a timestamped duplicate workspace for reproducibility
- Comprehensive Logging: Detailed logs with timestamps, prompts, outputs, and results for every task execution
- Flexible Configuration: YAML-based configuration for tasks, agents, and LLM parameters
AgentKernelArena is actively under development. Upcoming releases will publish detailed evaluation results comparing agent performance across multiple task categories, using standardized correctness and performance scores.
| Agent | Compiled | Correctness | Performance | Score |
|---|---|---|---|---|
| Cursor Agent | xx | xx | xx | xx |
| Claude Code | xx | xx | xx | xx |
| OpenAI Codex | xx | xx | xx | xx |
| SWE-agent | xx | xx | xx | xx |
| GEAK | xx | xx | xx | xx |
AgentKernelArena/
├── main.py # Main orchestration entry point
├── config.yaml # Global configuration
├── src/
│ ├── module_registration.py # Dynamic agent/prompt/post-processing loading
│ ├── preprocessing.py # Workspace setup and environment checks
│ ├── prompt_builder.py # Task prompt construction
│ ├── postprocessing.py # Result analysis and report generation
│ ├── score.py # Scoring logic for evaluation metrics
│ ├── tasks.py # Task discovery and registration
│ └── utils/
│ └── report_generation.py # Aggregate report analysis utilities
├── agents/
│ ├── cursor/ # Cursor agent integration
│ ├── claude_code/ # Claude Code agent integration
│ ├── SWE_agent/ # SWE-agent integration
│ ├── openevolve/ # OpenEvolve (GEAK) integration
│ ├── geak_optimagentv2/ # GEAK OptimAgent v2 integration
│ ├── geak_hip/ # GEAK HIP integration
│ ├── geak_ourllm_kernel2kernel/ # GEAK OurLLM kernel-to-kernel integration
│ ├── single_llm_call/ # Single LLM call implementation
│ └── __init__.py # Agent registry
└── tasks/ # Task definitions
├── rocm-examples/ # ROCm example kernels
├── rocprim/ # rocPRIM kernels
├── customer_hip/ # Custom HIP kernels
├── triton/ # Triton benchmark kernels
└── torch2hip/ # Torch2HIP conversion tasks
- Configuration Loading: Load config.yaml with agent, task, and LLM settings
- Agent Registration: Dynamically load the agent launcher, prompt builder, and post-processing handler based on the AgentType enum
- Task Discovery: Scan the tasks/ directory for task configurations matching the specified categories
- Workspace Setup: Create an isolated, timestamped workspace for each task
- Prompt Building: Construct task-specific prompts from config, source code, and instructions/cheatsheets
- Agent Execution: Launch agent in workspace with constructed prompt
- Result Collection: Save agent output, logs, and modified code
- Post-Processing: Run compilation, correctness tests, performance profiling, and scoring
- Report Generation: Generate comprehensive evaluation report with metrics
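Conceptually, the per-task flow looks like the sketch below. This is illustrative only: the real main.py and the src/ modules may be organized differently, and the prompt format shown here is a placeholder.

```python
# Illustrative sketch of the per-task pipeline; not the repository's main.py.
import datetime
import shutil
import subprocess
from pathlib import Path

import yaml


def run_task(task_dir: str, cfg_path: str = "config.yaml") -> None:
    # Configuration Loading
    cfg = yaml.safe_load(Path(cfg_path).read_text())
    task_cfg = yaml.safe_load((Path("tasks") / task_dir / "config.yaml").read_text())

    # Workspace Setup: duplicate the task into a timestamped workspace
    stamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
    workspace = Path(f"{cfg['workspace_directory_prefix']}_{stamp}") / task_dir
    shutil.copytree(Path("tasks") / task_dir, workspace)

    # Prompt Building: combine instructions with the kernel source (placeholder format)
    sources = "\n".join((workspace / p).read_text() for p in task_cfg["source_file_path"])
    prompt = f"Optimize kernel(s) {task_cfg['target_kernel_functions']}:\n{sources}"

    # Agent Execution: the selected agent would be launched here with `prompt`
    # (omitted), after which Post-Processing re-runs the task's commands:
    for cmd in task_cfg["compile_command"] + task_cfg["correctness_command"]:
        subprocess.run(cmd, shell=True, cwd=workspace, check=False)
    # Result Collection, scoring, and Report Generation would follow.
```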
- Python 3.12+
- ROCm toolkit (for HIP kernels): hipcc, rocprof-compute
- Triton (for Triton kernels)
- Git
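A quick way to sanity-check these prerequisites (a hypothetical helper script, not part of the repository):

```python
# check_env.py -- hypothetical prerequisite check; not part of the repository.
import importlib.util
import os
import shutil
import sys


def check_prerequisites() -> None:
    assert sys.version_info >= (3, 12), "Python 3.12+ is required"
    for tool in ("hipcc", "rocprof-compute", "git"):
        if shutil.which(tool) is None:
            print(f"warning: {tool} not found on PATH")
    if importlib.util.find_spec("triton") is None:
        print("warning: Triton is not installed (needed for Triton tasks)")
    if not any(os.environ.get(k) for k in
               ("OPENAI_API_KEY", "ANTHROPIC_API_KEY", "OPENROUTER_API_KEY")):
        print("warning: no LLM API key found in the environment")


if __name__ == "__main__":
    check_prerequisites()
```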
# Clone the repository
git clone <repository-url>
cd AgentKernelArena
# Install dependencies
pip install -r requirements.txt
# Set up API keys (choose one or more)
export OPENAI_API_KEY="your_openai_key"
export ANTHROPIC_API_KEY="your_anthropic_key"
export OPENROUTER_API_KEY="your_openrouter_key"
# Install agent CLIs (using claude_code as an example)
# For Claude Code:
npm install -g @anthropic-ai/claude-code
- Configure config.yaml:
# Select agent type
agent:
  template: claude_code  # Options: cursor, claude_code, swe_agent, single_llm_call, openevolve, geak_optimagentv2, geak_hip, geak_ourllm_kernel2kernel
  max_iterations: 5
# Specify tasks to run
tasks:
- rocm-examples/bitonic_sort
- customer_hip/silu
# - all # Run ALL tasks
target_gpu_model: MI300
log_directory: logs
workspace_directory_prefix: workspace
- Run evaluation:
python main.py

The tasks list in config.yaml also accepts wildcard patterns to select whole categories:

tasks:
- rocm-examples/* # All ROCm examples
- rocprim/* # All rocPRIM tasks
- customer_hip/mmcv/* # All MMCV HIP kernels
- triton/tritonbench/* # All Triton benchmarks
- torch2hip/*              # All Torch2HIP conversion tasks

Each task is defined by a config.yaml in its directory:
# tasks/rocm-examples/bitonic_sort/config.yaml
source_file_path:
- main.hip
target_kernel_functions:
- bitonic_sort_kernel
compile_command:
- make
correctness_command:
- ./applications_bitonic_sort -l 15
performance_command:
- rocprof-compute profile -n kernelgen --path rocprof_compute_profile --no-roof --join-type kernel -b SQ -b TCP -b TCC -- ./applications_bitonic_sort -l 15
- rocprof-compute analyze --path rocprof_compute_profile -b 2
task_type: hip2hip
prompt:
  source_code: null   # Optional: override default source code inclusion
  instructions: null  # Optional: custom instructions
  cheatsheet: null    # Optional: provide cheatsheet/reference

AgentKernelArena uses a cumulative scoring system:
| Metric | Points | Description |
|---|---|---|
| Compilation | 20 | Code compiles successfully without errors |
| Correctness | 100 | Code produces correct output (passes tests) |
| Speedup | ratio × 100 | Performance improvement over baseline |
Example: A submission that compiles (20), passes correctness (100), and achieves 1.5× speedup (150) would score 270 points.
Note: This is not the only possible scoring scheme; users can define their own.
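A minimal sketch of this cumulative rule, for reference (the actual score.py may differ in detail):

```python
# Sketch of the cumulative scoring rule above; the repository's score.py may differ.
def compute_score(compiled: bool, correct: bool, speedup: float | None) -> float:
    score = 0.0
    if compiled:
        score += 20                 # Compilation: 20 points
    if correct:
        score += 100                # Correctness: 100 points
        if speedup is not None:
            score += speedup * 100  # Speedup: ratio x 100
    return score


# Example from the text: compiles, passes correctness, 1.5x speedup -> 270 points
assert compute_score(True, True, 1.5) == 270
```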
- Create agent directory: agents/your_agent/
- Implement launch function:
# agents/your_agent/launch_agent.py
from agents import register_agent
@register_agent("your_agent")
def launch_agent(prompt: str, log_directory: str, workspace: str) -> str:
"""
Launch your agent.
Returns:
str: Agent output
"""
# Your agent implementation
return result- Register in module_registration.py:
# Add to AgentType enum
class AgentType(Enum):
    YOUR_AGENT = "your_agent"

# Add import in load_agent_launcher
if agent_type == AgentType.YOUR_AGENT:
    from agents.your_agent import launch_agent

- Add prompt builder support (if needed):
# In load_prompt_builder
if agent_type in [..., AgentType.YOUR_AGENT]:
    return prompt_builder

- Add post-processing support (if needed):
# In load_post_processing_handler
if agent_type in [..., AgentType.YOUR_AGENT]:
    return general_post_processing
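For illustration, a launcher registered this way that issues a single LLM call could look like the sketch below. It assumes the openai Python client; the model name, output file, and prompt handling are placeholders, not the repository's single_llm_call implementation.

```python
# agents/your_agent/launch_agent.py -- illustrative sketch only.
# Assumes the `openai` package; model and file names are placeholders.
from pathlib import Path

from openai import OpenAI

from agents import register_agent


@register_agent("your_agent")
def launch_agent(prompt: str, log_directory: str, workspace: str) -> str:
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    result = response.choices[0].message.content or ""

    # A full agent would apply the suggested edits to files under `workspace`;
    # this sketch only records the raw model output next to the run's logs.
    Path(log_directory).mkdir(parents=True, exist_ok=True)
    (Path(log_directory) / "your_agent_output.txt").write_text(result)
    return result
```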
- Create task directory: tasks/<task_type>/<task_name>/
- Add source files and scripts following this structure (a sketch of a possible task_runner.py appears after this list):
tasks/<task_type>/<task_name>/
├── config.yaml # Task configuration (required)
├── scripts/
│ └── task_runner.py # Compile/correctness/performance runner (recommended)
└── src/
└── <kernel files> # .cu, .hip, .py, etc.
- Create config.yaml with all required fields as lists (not scalar strings):
source_file_path:
- src/my_kernel.hip
target_kernel_functions:
- my_kernel_function
compile_command:
- python3 scripts/task_runner.py --mode compile
correctness_command:
- python3 scripts/task_runner.py --mode correctness
performance_command:
- python3 scripts/task_runner.py --mode performance
task_type: hip2hip # one of: hip2hip, cuda2hip, triton2triton, torch2hip
prompt:
  source_code: null
  instructions: null
  cheatsheet: null
- Add baseline performance (optional): Create baseline.txt with expected performance metrics
- Run the Task Validator Agent (required):
All new tasks must pass the task validator agent before being merged. The validator runs 10 automated checks covering config schema, source file existence, kernel symbol resolution, compilation, correctness, performance, self-containedness, GPU hang detection, correctness implementation review, and result template compatibility.
# Configure the validator to target your new task
# In config.yaml at repo root:
agent:
  template: task_validator
tasks:
- <task_type>/<task_name>
# Run validation
python3 main.py

Review the generated validation_report.yaml in the workspace directory. The task must achieve an overall PASS status (all checks pass). A WARN status (warnings but no failures) is acceptable with justification. A FAIL status means the task must be fixed before merging.
See agents/task_validator/README.md for the full list of validation checks and requirements.
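As mentioned above, a task_runner.py for a simple HIP task might look like the following sketch. It is hypothetical: the source file, compiler flags, binary name, and timing approach are placeholders, and existing tasks may structure their runners differently.

```python
# scripts/task_runner.py -- hypothetical runner sketch for a HIP task.
# File names, flags, and the timing method are placeholders.
import argparse
import subprocess
import sys
import time

BINARY = "./my_kernel_bin"


def compile_task() -> int:
    # hipcc is the ROCm compiler listed in the prerequisites
    return subprocess.run(["hipcc", "-O3", "src/my_kernel.hip", "-o", BINARY]).returncode


def run_correctness() -> int:
    # Assumes the binary checks its own results and sets the exit code accordingly
    return subprocess.run([BINARY, "--check"]).returncode


def run_performance() -> int:
    start = time.perf_counter()
    rc = subprocess.run([BINARY, "--bench"]).returncode
    print(f"elapsed_seconds: {time.perf_counter() - start:.4f}")
    return rc


def main() -> int:
    parser = argparse.ArgumentParser()
    parser.add_argument("--mode", choices=["compile", "correctness", "performance"], required=True)
    mode = parser.parse_args().mode
    return {"compile": compile_task,
            "correctness": run_correctness,
            "performance": run_performance}[mode]()


if __name__ == "__main__":
    sys.exit(main())
```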
- Enhance A/B Testing with Better Interactivity and User Experience
- Benchmarking State-of-the-Art Agents for Technical Reporting
- Standardize Holdout Tests with Comprehensive Shape Coverage
- Add Holdout Test Evaluation via Independent Agent
- New Feature: Support Multiple Agents on a Multi-GPU Server
- New Feature: Resume Evaluation From a Previous Experiment
- Agents Can Hang During Task Execution, Blocking Test Completion
- Expand Pytorch2HIP Task Set to 100+ Tasks
- Expand CUDA2HIP Task Set to 100+ Tasks
- Expand Triton2Triton Task Set to 100+ Tasks
- Expand HIP2HIP Task Set to 100+ Tasks
- Restructure Task Directory by Task Type and Difficulty Level