Real-world benchmarks for AI coding agents
PinchBench measures how well LLMs perform as the brain of an OpenClaw agent. Instead of synthetic tests, we throw real tasks at agents: scheduling meetings, writing code, triaging email, researching topics, and managing files.
| Repo | Description |
|---|---|
| skill | Benchmark runner and task definitions — run it yourself |
| leaderboard | The pinchbench.com leaderboard frontend |
```sh
git clone https://github.com/pinchbench/skill.git
cd skill
./scripts/run.sh --model anthropic/claude-sonnet-4
```

Results upload to the public leaderboard. Get started →
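To benchmark a different model, the same `--model` flag should accept other provider/model slugs; the slug below is a hypothetical placeholder for illustration, not a verified supported value.

```sh
# Hypothetical example: substitute any model slug your OpenClaw setup supports.
./scripts/run.sh --model openai/gpt-4o
```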
Claw-some AI agent testing 🦞