3 changes: 2 additions & 1 deletion _includes/blogs.html
@@ -8,7 +8,8 @@ <h2 class="text-3xl md:text-4xl font-bold mb-4 md:mb-0">Latest Research</h2>

<!-- Blog Posts Grid -->
<div class="grid grid-cols-1 md:grid-cols-2 lg:grid-cols-3 gap-8 mb-16">
{% for post in site.posts limit:3 %}
{% assign visible_posts = site.posts | where_exp: "post", "post.exclude_from_collection != true" %}
{% for post in visible_posts limit:3 %}
<a href="{{ post.url | relative_url }}" class="post-card block hover:shadow-lg transition-shadow duration-200 rounded overflow-hidden bg-white">
{% if post.image %}
<div class="overflow-hidden mb-4">
198 changes: 198 additions & 0 deletions _posts/2026-02-15-everyevalever-launch.md
@@ -0,0 +1,198 @@
---
layout: post
title: "Every Eval Ever: Toward a Common Language for AI Eval Reporting"
date: 2026-02-15
published: true
exclude_from_collection: true
category: Infrastructure
image: "/assets/img/long-site-banner.webp"

Copilot AI Feb 15, 2026

The image path uses a leading slash "/assets/img/long-site-banner.webp" while other blog posts in the codebase use "assets/img/..." without the leading slash. This inconsistency may cause issues depending on how Jekyll processes the paths. Consider using "assets/img/long-site-banner.webp" to match the convention seen in other posts like "2025-07-13-eval-science-kickoff.md" and "2025-08-10-blog-chart-crisis.md".

Suggested change
image: "/assets/img/long-site-banner.webp"
image: "assets/img/long-site-banner.webp"

authors:
- name: "Jan Batzner*"
- name: "Leshem Coshen*"

Copilot AI Feb 15, 2026

The author name "Leshem Coshen" appears to be spelled inconsistently with other posts in the repository. In the post "_posts/2025-08-10-blog-chart-crisis.md", the same author is listed as "Leshem Choshen" (with an 'h'). Please verify the correct spelling and ensure consistency across all posts.

Suggested change
- name: "Leshem Coshen*"
- name: "Leshem Choshen*"

- name: "Avijit Ghosh*"
- name: "Sree Harsha Nelaturu*"
- name: "Anastassia Kornilova*"
- name: "Damian Stachura*"
- name: "Anka Reuel"
- name: "Yifan Mai"
- name: "Asaf Yehudai"
- name: "Irene Solaiman"
- name: "Stella Biderman"
tags:
- "infrastructure"
- "eval metadata"
- "reproducibility"
description: "The multistakeholder coalition EvalEval launches Every Eval Ever, a shared format and central eval repository. We're working to resolve AI evaluation fragmentation, improving formatting, settings, and ways to compare and build on each other's work."
---

As AI models advance, we encounter more and more evaluation results and benchmarks, yet evaluation itself rarely takes center stage. Evaluation plays an important role across the entire development cycle, from design decisions and [research to marketing and maintenance](https://dl.acm.org/doi/full/10.1145/3708359.3712152). This rapid progression forces evaluation solutions and infrastructure to keep up, often pushing them to adapt to settings [they were not originally designed for](https://arxiv.org/abs/2405.14782). For example, the popular evaluation tool [lm-eval-harness](https://github.com/EleutherAI/lm-evaluation-harness) was initially designed for base model evaluation, but it now supports instruction-tuned and reasoning models that didn't even exist when it was created.

This wide range of evaluations has led to fragmentation. Lacking standards, each framework reports different attributes in different formats, preventing the community from reliably comparing results, replicating experiments, determining which components fail, separating signal from noise, reusing others' (often costly) evaluations, and performing large-scale analysis. This lack of reusability makes it difficult to build on the momentum of previous efforts. As model training has moved past the point where we retrain models from scratch or rewrite their training code, we must ask: why are we still rerunning every evaluation from scratch?

As part of the [EvalEval Coalition](https://evalevalai.com), **a cross-institutional, cross-sector initiative,** we are announcing [Every Eval Ever](https://evalevalai.com/projects/every-eval-ever/) to improve the state of evaluation building and comparison:

(1) **Defining a shared schema** so results from different frameworks can be compared, and

(2) Providing a **crowdsourced eval database** so researchers don't have to start from scratch every time.

Today, we're launching [Every Eval Ever](https://huggingface.co/spaces/evaleval/every_eval_ever_space), built upon valuable feedback from AI evaluation ecosystem stakeholders including researchers and practitioners at the U.S. Center for AI Standards and Innovation (CAISI), EleutherAI, Hugging Face, Noma Security, Trustible, Inspect AI, Meridian, AVERI, Collective Intelligence Project, Stanford HELM, Weizenbaum, Evidence Prime, MIT, TUM, and IBM Research.

## The Hidden Problem

The infrastructure is fragmented, and the results are scattered.

How does Model A compare to Model B on a given benchmark? One lab reports a five-shot score from its own harness. Another uses zero-shot through a different framework. A third pulls numbers from a leaderboard that does not disclose generation parameters.

This fragmentation has concrete costs. Large-scale analysis of evaluation trends or improving evaluation methodologies requires weeks of data wrangling before any research can begin, if such analysis is even possible without rerunning full leaderboards at extreme cost. Comparison is unreliable when evaluations use different settings but carry the same benchmark name. Reproducibility is elusive when the details are lacking or inconsistent.

We have seen this before in other parts of the ML pipeline. The community stopped retraining models from scratch or rewriting training code for each project long ago. Evaluations are next.

## Why Us, Why Now

We understand the pain firsthand. The EvalEval Coalition is a community of researchers working to fix how AI evaluations are built, run, documented, shared, and compared. We have worked on numerous projects where collecting evaluations either restricts what can be done or consumes most of the project's effort. Need examples? See [1](https://arxiv.org/abs/2602.03344), [2](https://arxiv.org/abs/2503.01622), [3](https://proceedings.neurips.cc/paper_files/paper/2024/hash/28236482f64a72eec43706b6f3a6c511-Abstract-Conference.html), [4](https://arxiv.org/abs/2412.06540), [5](https://arxiv.org/abs/2410.11840), [6](https://aclanthology.org/2024.acl-long.456/), [7](https://arxiv.org/abs/2407.13696), [8](https://par.nsf.gov/servlets/purl/10547932), [9](https://aclanthology.org/2024.naacl-long.139/), [10](https://aclanthology.org/2025.acl-long.34.com), among others.

Copilot AI Feb 15, 2026

The ACL Anthology URL for reference [10] has an incorrect format. ACL Anthology URLs should end with a forward slash, not ".com". The URL "https://aclanthology.org/2025.acl-long.34.com" should be "https://aclanthology.org/2025.acl-long.34/" to be a valid link.

Suggested change
We understand the pain firsthand. The EvalEval Coalition is a community of researchers working to fix how AI evaluations are built, run, documented, shared, and compared. We have worked on numerous projects where collecting evaluations either restricts what can be done or consumes most of the project's effort. Need examples? See [1](https://arxiv.org/abs/2602.03344), [2](https://arxiv.org/abs/2503.01622), [3](https://proceedings.neurips.cc/paper_files/paper/2024/hash/28236482f64a72eec43706b6f3a6c511-Abstract-Conference.html), [4](https://arxiv.org/abs/2412.06540), [5](https://arxiv.org/abs/2410.11840), [6](https://aclanthology.org/2024.acl-long.456/), [7](https://arxiv.org/abs/2407.13696), [8](https://par.nsf.gov/servlets/purl/10547932), [9](https://aclanthology.org/2024.naacl-long.139/), [10](https://aclanthology.org/2025.acl-long.34.com), among others.
We understand the pain firsthand. The EvalEval Coalition is a community of researchers working to fix how AI evaluations are built, run, documented, shared, and compared. We have worked on numerous projects where collecting evaluations either restricts what can be done or consumes most of the project's effort. Need examples? See [1](https://arxiv.org/abs/2602.03344), [2](https://arxiv.org/abs/2503.01622), [3](https://proceedings.neurips.cc/paper_files/paper/2024/hash/28236482f64a72eec43706b6f3a6c511-Abstract-Conference.html), [4](https://arxiv.org/abs/2412.06540), [5](https://arxiv.org/abs/2410.11840), [6](https://aclanthology.org/2024.acl-long.456/), [7](https://arxiv.org/abs/2407.13696), [8](https://par.nsf.gov/servlets/purl/10547932), [9](https://aclanthology.org/2024.naacl-long.139/), [10](https://aclanthology.org/2025.acl-long.34/), among others.


The urgency of standardized AI evaluation has reached a tipping point, driven by the shift toward evaluations as a primary mechanism of governance. With the EU AI Act and the U.S. Executive Order now mandating rigorous risk assessments, standardized data is no longer a luxury but a prerequisite for meaningful sociotechnical safety standards. This need is further intensified by the growing complexity of modern AI, where frameworks like Inspect AI and HELM must navigate multi-turn agentic behaviors and human preferences that defy simple scoring.

Failing to adopt reusable formats imposes technical debt on the community, forcing researchers to waste resources rerunning redundant evaluations rather than advancing the scientific frontier. Even what is shared is often just a single score per dataset, obscuring many questions.

## What We're Building

Every Eval Ever is a schema to describe evaluation results and a community collection of those results. It is by the community and for the community, designed to make contributing evaluations or code simple. Now for the details.

The repository is organized by benchmark, model, and evaluation run. Each result file captures not just scores but the context needed to interpret and reuse them:
- who ran the evaluation,
- what model,
- with what settings,
- what these scores actually mean,
- and instance-level scores, if available.

```
data/
└── {benchmark}/
    └── {developer_name}/
        └── {model_name}/
            ├── {uuid}.json
            └── {uuid}.jsonl
```
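
To make the layout concrete, here is a minimal sketch of reading one run from a local clone: the `{uuid}.json` file holds the run's metadata and aggregate scores, while the optional `{uuid}.jsonl` file holds instance-level records. The paths and the UUID below are illustrative placeholders following the layout above, not real entries in the repository.

```python
import json
from pathlib import Path

# Hypothetical run, following the data/{benchmark}/{developer_name}/{model_name}/ layout above.
run_dir = Path("data/example_benchmark/example_developer/example_model")
run_id = "00000000-0000-0000-0000-000000000000"  # placeholder UUID

# Aggregate results and metadata for this run.
metadata = json.loads((run_dir / f"{run_id}.json").read_text())
print(metadata.get("source_metadata", {}))

# Instance-level scores, if the contributor provided them (one JSON object per line).
jsonl_path = run_dir / f"{run_id}.jsonl"
if jsonl_path.exists():
    instances = [json.loads(line) for line in jsonl_path.read_text().splitlines() if line.strip()]
    print(f"{len(instances)} instance-level records")
```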

The three components that make Every Eval Ever work:

📋 A [metadata schema](https://github.com/evaleval/every_eval_ever/blob/main/eval.schema.json) that defines the information needed for meaningful comparison of evaluation results

🔧 [Validation](https://github.com/evaleval/every_eval_ever/blob/main/utils/validate_data.py) that checks data against the schema before it enters the repository

🔌 [Converters](https://github.com/evaleval/every_eval_ever/tree/main/eval_converters) for popular evaluation tools like [Inspect AI](https://inspect.aisi.org.uk/), [HELM](https://github.com/stanford-crfm/helm), and [lm-eval-harness](https://github.com/EleutherAI/lm-evaluation-harness), so you can transform your existing evaluation logs into the standard format

## 📋 The Schema

One thing we realized early on: evaluation harnesses are valuable tools, but each was built for its own purpose, and we cannot simply aggregate their scores. Take MMLU: the lm-eval-harness, HELM, and the original Berkeley implementation all evaluate the same dataset, but with different prompt formatting, different answer extraction methods, and different ordering of few-shot examples. The result? [LLaMA 65B scored 0.637 on HELM but 0.488 on the EleutherAI harness](https://huggingface.co/blog/open-llm-leaderboard-mmlu), both reporting an "MMLU score" on the same dataset, yet with a significant gap.

Think about all the evaluation data that doesn't come from a harness at all: hand-built benchmarks, leaderboard scrapes, results pulled from research papers and blog posts. We wanted one format that could capture all evaluations, whether harness-run or hand-built, truly every evaluation ever.

Here's how the schema looks in practice, using [LiveCodeBench Pro](https://github.com/GavinZhengOI/LiveCodeBench-Pro) as an example from our repository.

First, we capture where the evaluation came from and who ran it:

```json
{
"source_metadata": {
"source_name": "LiveCodeBench Pro",
"source_type": "evaluation_platform",
"source_organization_name": "LiveCodeBench",
"evaluator_relationship": "third_party"
}
}
```

The schema also records the exact model details: name, developer, and how it was accessed. Was it called through the developer's own API (like OpenAI or Anthropic), through a third-party provider (like OpenRouter or Together AI), or run locally with an inference engine like vLLM? This matters because the same model, accessed through different providers or run with different engine configurations, [can produce different outputs](https://arxiv.org/pdf/2312.03886) and therefore different scores.

Next, generation settings. We all know how much they matter: changing temperature or the number of samples alone can shift scores by several points. Yet they're routinely absent from leaderboards and incomplete even in papers. When a model's score is reported without this context, we're left guessing whether differences reflect actual model capability or just different settings. We capture this information and, where it's missing, record that gap explicitly so anyone interpreting the results knows what context they do and don't have:

```json
{
"generation_config": {
"generation_args": {
"temperature": 0.2,
"top_p": 0.95,
"max_tokens": 2048
},
"additional_details": {
"n_samples": 10,
"stop_sequences": ["\n```"]
}
}
}
```

Then there's the score itself. Consider a model on the coding benchmark [HumanEval](https://arxiv.org/abs/2107.03374): scoring 0.31 on the first try (called pass@1) means it solved 31% of coding problems, where higher is better. In contrast, if the same model scores 0.31 on [RealToxicityPrompts](https://github.com/allenai/real-toxicity-prompts), lower scores would be better. Every Eval Ever standardizes this to enable better interpretation of evaluation results:

```json
{
"evaluation_results": [
{
"evaluation_name": "code_generation",
"metric_config": {
"evaluation_description": "pass@1 on code generation tasks",
"lower_is_better": false,
"score_type": "continuous",
"min_score": 0,
"max_score": 1
},
"score_details": {
"score": 0.31
}
}
]
}
```

## 🔧 Validation & Converters

Better eval infrastructure should be easy and frictionless for practitioners to adopt. That's why [Every Eval Ever](https://evalevalai.com/projects/every-eval-ever/) provides:

**Converters:** If you're already running evaluations with existing eval tools, you shouldn't have to manually parse your results. Our converters transform evaluation logs into the Every Eval Ever format automatically. Converters for [HELM](https://github.com/evaleval/every_eval_ever/tree/main/eval_converters/helm), [lm-eval](https://github.com/evaleval/every_eval_ever/tree/main/eval_converters/lm_eval), and [Inspect AI](https://github.com/evaleval/every_eval_ever/tree/main/eval_converters/inspect) are already available, with more on the way. In one step, your eval data fits the format.

**Validation:** Validation runs automatically on every submission via Hugging Face Jobs. Before any result file is merged, it's checked against the schema to catch missing fields and structural issues early, not months later when someone tries to use the data.
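
If you want a quick local sanity check before opening a submission, you can run the same kind of schema validation yourself. The sketch below is an approximation, not the official check: it assumes you have downloaded `eval.schema.json` from the repository and have a result file on disk, and it uses the generic `jsonschema` Python library rather than the repository's `validate_data.py`.

```python
import json
from jsonschema import validate, ValidationError

# Placeholders: point these at the downloaded schema and your own result file.
with open("eval.schema.json") as f:
    schema = json.load(f)
with open("my_eval_result.json") as f:
    result = json.load(f)

try:
    validate(instance=result, schema=schema)
    print("Result file conforms to the schema.")
except ValidationError as err:
    # Report where the first violation occurred and why.
    location = "/".join(str(part) for part in err.absolute_path) or "<root>"
    print(f"Schema violation at {location}: {err.message}")
```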

## 🧩 Design Decisions

What counts as a unique evaluation? This can be a thorny question. Two [GSM8K](https://huggingface.co/datasets/openai/gsm8k) runs might differ in prompt template, chat template, vLLM version, GPU kernels, or dataset version, each affecting scores. Within our own coalition, members used GSM8K for purposes as different as measuring mathematical reasoning and benchmarking speculative decoding.

We considered defining a canonical parameter set to fingerprint unique runs. In practice, the space of score-affecting variables is too large for any fixed set. Our solution: each run gets its own file, identified by a UUID, with as much metadata as possible captured alongside it. Deduplication and grouping happen at the analysis layer, not the schema layer. This keeps data lossless while letting consumers apply their own equivalence criteria.
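
To illustrate what deduplication and grouping at the analysis layer can look like, the sketch below buckets run files by benchmark and model using only the directory layout shown earlier, leaving it to the analyst to decide which runs within a bucket count as equivalent. It assumes a local clone with the `data/` tree; the grouping key is one possible choice, not something the schema prescribes.

```python
from collections import defaultdict
from pathlib import Path

# Collect every run file under data/{benchmark}/{developer_name}/{model_name}/{uuid}.json
# and bucket the runs by (benchmark, model). A bucket may hold several runs with
# different prompts, library versions, or generation settings.
runs = defaultdict(list)
for path in Path("data").glob("*/*/*/*.json"):
    benchmark, _developer, model = path.parts[1], path.parts[2], path.parts[3]
    runs[(benchmark, model)].append(path)

# Apply whatever equivalence criteria your analysis needs, e.g. simply counting runs.
for (benchmark, model), files in sorted(runs.items()):
    print(f"{benchmark} / {model}: {len(files)} run(s)")
```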

We intentionally allow contributors to report whatever information is available. For many, this is an aggregate score with some metadata; for others, it is per-example data and extensive hyperparameters. Why does this matter? Most evaluation research requires more than aggregate scores. Researchers need to check for biases, errors, redundant questions, and questions that measure the wrong thing, explore alternative dataset aggregations, and perform error analysis. All of these tasks require at least the model outputs or scores per example, not per dataset.

## What's Next

Every Eval Ever grew out of a need we kept running into in our own research. When [EvalEval researchers mapped how social impact evaluations are reported across the field](https://arxiv.org/abs/2511.05613), examining 186 first-party reports and 183 third-party sources, the lack of a common format turned what should have been a straightforward analysis into weeks of manual data wrangling. That work made the case for why something like Every Eval Ever needed to exist: even though evaluation data is openly available, it exists in incompatible formats, with no shared infrastructure to aggregate or compare it.

This schema enables research. Beyond good documentation hygiene, [researchers have already used the repository in a multi-author EvalEval effort to analyze benchmark saturation across 60 benchmarks](https://evalevalai.com/projects/bench-sat/), finding that nearly half had lost their ability to differentiate top-performing models. Centralized, standardized evaluation data enables additional research questions: identifying where the ecosystem is thin, which capabilities are over-measured, and which risks are neglected. With instance-level data, researchers can move beyond leaderboard averages to study item difficulty, robustness, and temporal drift. Every Eval Ever enables meta-evaluation: testing evaluation methods themselves to distinguish real progress from artifacts of setup and reporting.

We need your help. We're launching a [Shared Task](https://evalevalai.com/events/) for practitioners alongside this post: two tracks for contributing public and proprietary eval data to the repository, with co-authorship for qualifying contributors and a [workshop at ACL 2026 in San Diego](https://evalevalai.com/events/2026-acl-workshop/).

*Submissions open now, deadline May 1, 2026.*

## Get Involved

- Try the schema 📋: [Hugging Face Space](https://huggingface.co/spaces/evaleval/every_eval_ever_space) and [GitHub](https://github.com/evaleval/every_eval_ever)

- Join the Shared Task 🏁: [Call for Participation](https://evalevalai.com/events/)

- Join the community 💬: [Reach out to be added](mailto:jan.batzner@tum.de)

```bibtex
@misc{evaleval2026everyevalever,
  title = {Every Eval Ever: Toward a Common Language for AI Eval Reporting},
  author = {Jan Batzner and Leshem Coshen and Avijit Ghosh and Sree Harsha Nelaturu and Anastassia Kornilova and Damian Stachura and Anka Reuel and Yifan Mai and Asaf Yehudai and Irene Solaiman and Stella Biderman},

Copilot AI Feb 15, 2026

The author name "Leshem Coshen" in the BibTeX citation appears to be spelled inconsistently with other posts in the repository. In the post "_posts/2025-08-10-blog-chart-crisis.md", the same author is listed as "Leshem Choshen" (with an 'h'). Please verify the correct spelling and ensure consistency across all posts and citations.

  year = {2026},
  month = {February},
  url = {https://evaleval.github.io/2026/02/16/everyevalever-launch/},

Copilot AI Feb 15, 2026

The URL in the BibTeX citation has a date mismatch. The file is dated 2026-02-15 (February 15) in both the filename and the frontmatter date field, but the URL uses "2026/02/16" (February 16). This should be corrected to "2026/02/15" to match the post date.

Suggested change
url = {https://evaleval.github.io/2026/02/16/everyevalever-launch/},
url = {https://evaleval.github.io/2026/02/15/everyevalever-launch/},

  note = {Blog Post, EvalEval Coalition}
}
```

### Feedback and Advice

We acknowledge feedback from JJ Allaire (Inspect, Meridian Labs), Ryan Steed (US CAISI), Zee Talat (University of Edinburgh), Gal Moyal (Noma Security), Sean McGregor (AVERI), Joal Stein (WeVal/CiP), Srishti Yadav (ELLIS Copenhagen), Andrew Tran (AWS), Sanchit Ahuja (Northeastern), Volker Stocker (Weizenbaum, TUB), Marek Šuppa (Slido), Stefan Schmid (Weizenbaum, TUB), and Gjergji Kasneci (TUM, MCML).

---