From 40a8ccb0b9e4796ec8296f368d9e00d977c1e85a Mon Sep 17 00:00:00 2001 From: Jan Batzner <91485870+janbatzner@users.noreply.github.com> Date: Sun, 15 Feb 2026 18:24:48 +0100 Subject: [PATCH 1/3] everyevalever launch --- _includes/blogs.html | 3 +- _posts/2026-02-15-everyevalever-launch.md | 191 ++++++++++++++++++++++ 2 files changed, 193 insertions(+), 1 deletion(-) create mode 100644 _posts/2026-02-15-everyevalever-launch.md diff --git a/_includes/blogs.html b/_includes/blogs.html index 61a27d6..d3bcb29 100644 --- a/_includes/blogs.html +++ b/_includes/blogs.html @@ -8,7 +8,8 @@

Latest Research

- {% for post in site.posts limit:3 %} + {% assign visible_posts = site.posts | where_exp: "post", "post.exclude_from_collection != true" %} + {% for post in visible_posts limit:3 %} {% if post.image %}
diff --git a/_posts/2026-02-15-everyevalever-launch.md b/_posts/2026-02-15-everyevalever-launch.md new file mode 100644 index 0000000..2f0571b --- /dev/null +++ b/_posts/2026-02-15-everyevalever-launch.md @@ -0,0 +1,191 @@ +--- +layout: post +title: "Every Eval Ever: Toward a Common Language for AI Eval Reporting" +date: 2026-02-15 +published: true +exclude_from_collection: true +category: Infrastructure +image: "/assets/img/long-site-banner.webp" +authors: + - name: "Jan Batzner*" + - name: "Leshem Coshen*" + - name: "Avijit Ghosh*" + - name: "Sree Harsha Nelaturu*" + - name: "Anastassia Kornilova*" + - name: "Damian Stachura*" + - name: "Anka Reuel" + - name: "Yifan Mai" + - name: "Asaf Yehudai" + - name: "Irene Solaiman" + - name: "Stella Biderman" +tags: + - "infrastructure" + - "eval metadata" + - "reproducibility" +description: "The multistakeholder coalition EvalEval launches Every Eval Ever, a shared format and central eval repository. We're working to resolve AI evaluation fragmentation, improving formatting, settings, and ways to compare and build on each other's work." +--- +As AI models advance, we encounter more and more evaluation results and benchmarks—yet evaluation itself rarely takes center stage. It plays an important role across the entire development cycle, from design decisions and [research to marketing and maintenance](https://dl.acm.org/doi/full/10.1145/3708359.3712152). This rapid progression forces evaluation solutions and eval infrastructure to keep up, often pushing them to adapt to settings [they were not originally designed for](https://arxiv.org/abs/2405.14782). For example, the popular evaluation tool [lm-eval-harness](https://github.com/EleutherAI/lm-evaluation-harness) was initially designed for base model evaluation, but it now supports instruction-tuned and reasoning models that didn’t even exist when it was created. + +This rapid progression and wide range of evals led to fragmentation. Lacking standards, each framework reports different attributes in different formats, preventing the community from reliably comparing, replicating, determining which component fails, separating signal from noise, reusing others’ (often costly) evaluations, and performing large-scale analysis. This lack of reusability makes it difficult to build on the momentum of previous efforts. As model training has moved past the point where we retrain models from scratch or rewrite their training code, we must ask: why are we still rerunning every evaluation from scratch? + +As part of a **cross-institutional, cross-sector initiative, the [EvalEval Coalition](https://evalevalai.com),** we are announcing [Every Eval Ever](https://huggingface.co/spaces/evaleval/every_eval_ever_space) to improve the state of evaluation building and comparison: + +(1) **Defining a shared schema** so results from different frameworks can be compared, and + +(2) Providing a **crowdsourced eval database** so researchers don’t have to start from scratch every time. + +Today, we're launching [Every Eval Ever](https://huggingface.co/spaces/evaleval/every_eval_ever_space), built upon valuable feedback from AI Eval Ecosystem actors including researchers and practitioners at the U.S. Center for AI Standards and Innovation (CAISI), EleutherAI, Hugging Face, Noma Security, Trustible, Inspect AI, Meridian, AVERI, Collective Intelligence Project, Stanford HELM, Weizenbaum, Evidence Prime, MIT, TUM, and IBM Research. + +## The Hidden Problem +The infrastructure is fragmented, and the results scattered. 
+How does Model A compare to Model B on a given benchmark? One lab reports a five-shot score from its own harness. Another uses zero-shot through a different framework. A third pulls numbers from a leaderboard that does not disclose generation parameters. +This fragmentation has concrete costs. Large-scale analysis of evaluation trends or improving evaluation methodologies requires weeks of data wrangling before any research can begin. If they are possible at all, without (re)running full leaderboards at extreme costs. Comparison is unreliable when evaluations use different settings but carry the same benchmark name. Reproducibility is elusive when the details are lacking and inconsistent. +It is time for a change. We have seen this before in other parts of the ML pipeline. The community stopped retraining models from scratch or rewriting training code for each project long ago. Evaluations are next. + +## Why Us, Why Now +We just know the pain. The EvalEval Coalition is a community of researchers working to fix how AI evaluations are built, run, documented, shared, and compared. We worked on a myriad of projects where collecting evaluations restricts what can be done or takes most of the project’s efforts. Need examples? See [1](https://arxiv.org/abs/2602.03344), [2](https://arxiv.org/abs/2503.01622), [3](https://proceedings.neurips.cc/paper_files/paper/2024/hash/28236482f64a72eec43706b6f3a6c511-Abstract-Conference.html), [4](https://arxiv.org/abs/2412.06540), [5](https://arxiv.org/abs/2410.11840), [6](https://aclanthology.org/2024.acl-long.456/), [7](https://arxiv.org/abs/2407.13696), [8](https://par.nsf.gov/servlets/purl/10547932), [9](https://aclanthology.org/2024.naacl-long.139/), [10](https://aclanthology.org/2025.acl-long.34.com) among others. + +The urgency of standardized AI evaluation has reached a critical tipping point, driven by the shift toward evaluations as a primary mechanism of governance. With the EU AI Act and the U.S. Executive Order now mandating rigorous risk assessments, standardized data is no longer a luxury but a prerequisite for meaningful sociotechnical safety standards. +This need is further intensified by the exploding complexity of modern AI, where frameworks like Inspect AI and HELM must navigate multi-turn agentic behaviors and human preferences that defy simple scoring. + +Ultimately, failing to adopt reusable formats imposes a technical debt on the community, forcing researchers to waste resources rerunning redundant evaluations rather than advancing the scientific frontier. Oh, and of course, even what is shared is often just a single score per dataset, obscuring many questions. + +## What We're Building +The [Every Eval Ever](https://huggingface.co/spaces/evaleval/every_eval_ever_space) is a schema to describe evaluation results and a community collection of those results. It is by the community and for the community, made simple to contribute to, evaluation or code. Enough high-level, let’s get into the details. + +The repository is organized by benchmark, model, and evaluation run. Each result file captures not just scores but the context you need to interpret and reuse them: +- who ran the evaluation, +- what model, +- with what settings, +- what these scores actually mean, +- and instance-level scores, if you have them. 
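To make this concrete, here is a minimal sketch of what such a result file could contain, assembled from the schema excerpts shown later in this post. The published [schema](https://github.com/evaleval/every_eval_ever/blob/main/eval.schema.json) defines the full set of fields (model details, identifiers, and so on), so treat this as an illustration rather than a complete file:

```json
{
  "source_metadata": {
    "source_name": "LiveCodeBench Pro",
    "source_type": "evaluation_platform",
    "source_organization_name": "LiveCodeBench",
    "evaluator_relationship": "third_party"
  },
  "generation_config": {
    "generation_args": {
      "temperature": 0.2,
      "top_p": 0.95,
      "max_tokens": 2048
    }
  },
  "evaluation_results": [
    {
      "evaluation_name": "code_generation",
      "metric_config": {
        "evaluation_description": "pass@1 on code generation tasks",
        "lower_is_better": false,
        "score_type": "continuous",
        "min_score": 0,
        "max_score": 1
      },
      "score_details": {
        "score": 0.31
      }
    }
  ]
}
```

Instance-level outputs, where available, go into the companion `{uuid}.jsonl` file that sits next to this JSON, as the repository layout below shows.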
+ +``` +data/ +└── {benchmark}/ + └── {developer_name}/ + └── {model_name}/ + ├── {uuid}.json + └── {uuid}.jsonl +``` + +The three components that make [Every Eval Ever](https://huggingface.co/spaces/evaleval/every_eval_ever_space) work 💙 + +📋 A [metadata schema](https://github.com/evaleval/every_eval_ever/blob/main/eval.schema.json) that defines the information needed for meaningful comparison of evaluation results + +🔧 [Validation](https://github.com/evaleval/every_eval_ever/blob/main/utils/validate_data.py) that checks data against the schema before it enters the repository + +🔌 [Converters](https://github.com/evaleval/every_eval_ever/tree/main/eval_converters) for popular evaluation tools like [Inspect AI](https://inspect.aisi.org.uk/), [HELM](https://github.com/stanford-crfm/helm), and [lm-eval-harness](https://github.com/EleutherAI/lm-evaluation-harness), so you can transform your existing evaluation logs into the standard format. + +## 📋 The Schema +One thing we realized early on: evaluation harnesses are amazing — we love them all — but they were each built for their own purposes, and we cannot simply aggregate their scores. Take MMLU: the lm-eval-harness, HELM, and the original Berkeley implementation all evaluate the same dataset, but with different prompt formatting, different answer extraction methods, and different ordering of few-shot examples. The result? [LLaMA 65B scored 0.637 on HELM but 0.488 on the EleutherAI harness](https://huggingface.co/blog/open-llm-leaderboard-mmlu) — both called "MMLU score," same dataset, big gap. + +And then, think about all the evaluation data that doesn't come from a harness at all: hand-built benchmarks, leaderboard scrapes, results pulled from research papers and blog posts. We wanted one format that could capture whether harness-run or hand-built alike, truly every evaluation ever. +Here's how the schema looks in practice, using [LiveCodeBench Pro](https://github.com/GavinZhengOI/LiveCodeBench-Pro) as an example from our repo. + +First, we capture where the evaluation came from and who ran it: + +```json +{ + "source_metadata": { + "source_name": "LiveCodeBench Pro", + "source_type": "evaluation_platform", + "source_organization_name": "LiveCodeBench", + "evaluator_relationship": "third_party" + } +} +``` + +The schema also records the exact model details — name, developer, and how it was accessed. Was it called through the developer's own API (like OpenAI or Anthropic), through a third-party provider (like OpenRouter or Together AI), or run locally with an inference engine like vLLM? This isn't just because we love eval metadata. The same model, accessed through different providers or run with different engine configurations, [can produce different outputs](https://arxiv.org/pdf/2312.03886) — and therefore different scores. + +Next, generation settings. We all know how much they matter — changing temperature or the number of samples alone can shift scores by several points. Yet they're routinely absent from leaderboards and incomplete even in papers. When a model's score is reported without this context, we're left guessing whether differences reflect actual model capability or just different settings. 
So we want to capture this information — and where it's missing, record that gap explicitly, so anyone interpreting the results knows what context they do and don't have: + +```json +{ + "generation_config": { + "generation_args": { + "temperature": 0.2, + "top_p": 0.95, + "max_tokens": 2048 + }, + "additional_details": { + "n_samples": 10, + "stop_sequences": ["\n```"] + } + } +} +``` + +And then there's the score itself. Let’s take a model on the coding benchmark [HumanEval](https://arxiv.org/abs/2107.03374): scoring 0.31 on the first try (called pass@1) represents a fraction of coding problems it solved — higher would be better. On the contrary, if the same model scores again 0.31 but on [RealToxicityPrompts](https://github.com/allenai/real-toxicity-prompts), lower scores would be better. [Every Eval Ever](https://huggingface.co/spaces/evaleval/every_eval_ever_space) standardizes to enable better eval result interpretation: + +```json +{ + "evaluation_results": [ + { + "evaluation_name": "code_generation", + "metric_config": { + "evaluation_description": "pass@1 on code generation tasks", + "lower_is_better": false, + "score_type": "continuous", + "min_score": 0, + "max_score": 1 + }, + "score_details": { + "score": 0.31 + } + } + ] +} +``` + +## 🔧 Validation & Converters +Better eval infrastructure should be easy and frictionless for practitioners. That's why we're proud that [Every Eval Ever](https://huggingface.co/spaces/evaleval/every_eval_ever_space) provides: + +**Converters:** If you're already running evaluations with existing eval tools, you shouldn't have to manually parse your results. Our converters transform evaluation logs into the Every Eval Ever format automatically. Converters for [HELM](https://github.com/evaleval/every_eval_ever/tree/main/eval_converters/helm), [lm-eval](https://github.com/evaleval/every_eval_ever/tree/main/eval_converters/lm_eval), and [Inspect AI](https://github.com/evaleval/every_eval_ever/tree/main/eval_converters/inspect) are already available, with more on the way. One step, and your eval data fits in. + +**Validation:** Validation runs automatically on every submission via Hugging Face Jobs — before any result file is merged, it's checked against the schema to catch missing fields and structural issues early, not months later when someone tries to use the data. + +## 🧩 Design Decisions +What counts as a unique evaluation for you? This can get quite a thorny question! Imagine two [GSM8K](https://huggingface.co/datasets/openai/gsm8k) runs might differ in prompt template, chat template, vLLM version, GPU kernels, or dataset version — each affecting scores. Within our own coalition, members used GSM8K for purposes as different as measuring mathematical reasoning and benchmarking speculative decoding. + +We considered defining a canonical parameter set to fingerprint unique runs. In practice, the space of score-affecting variables is too large for any fixed set. Our solution: each run gets its own file, identified by a UUID, with as much metadata as possible captured alongside it. Deduplication and grouping happen at the analysis layer, not the schema layer. This keeps data lossless while letting consumers apply their own equivalence criteria. + +On purpose we allowed reporting whatever one has. For many, this is an aggregation score with some metadata; for others, it is per-example data and a lot of hyperparameters. Why should you care? +Hardly any evaluation research is done on the aggregation scores. 
You want to check whether there are biases, errors, redundant questions, questions that measure the wrong thing, aggregate datasets differently, and analyse errors. All of those questions require at least the model outputs or scores per example, not per dataset. + +## What’s Next +[Every Eval Ever](https://huggingface.co/spaces/evaleval/every_eval_ever_space) grew out of a need we kept running into in our own research. When [EvalEval researchers mapped how social impact evaluations are reported across the field](https://arxiv.org/abs/2511.05613), examining 186 first-party reports and 183 third-party sources, the lack of a common format turned what should have been a straightforward analysis into weeks of manual data wrangling. That work made the case for why something like Every Eval Ever needed to exist: even though a lot of evaluation data is available open source, it is in incompatible formats, with no shared infrastructure to aggregate or compare it. + +This schema enables research. Beyond just good documentation hygiene, [researchers already used the repository in a multi-author EvalEval effort to analyze benchmark saturation across 60 benchmarks](https://evalevalai.com/projects/bench-sat/), finding that nearly half had lost their ability to differentiate top-performing models. Centralized, standardized evaluation data opens up more: seeing where the ecosystem is thin, which capabilities are over-measured, and which risks are neglected. With instance-level data, researchers can move beyond leaderboard averages to study item difficulty, robustness, and temporal drift. Every Eval Ever enables meta-evaluation: testing evaluation methods themselves to distinguish real progress from artifacts of setup and reporting. +We need your help. We're launching a [Shared Task](evalevalai.com/events/) for practitioners alongside this post — two tracks for contributing public and proprietary eval data to the repository, with co-authorship for qualifying contributors and a [workshop at ACL 2026 in San Diego](https://evalevalai.com/events/2026-acl-workshop/). + +*Submissions open now, deadline May 1, 2026.* + +## Get involved +- Try the schema 📋 : [Hugging Face Space](https://huggingface.co/spaces/evaleval/every_eval_ever_space) and [GitHub](github.com/evaleval/every_eval_ever) + +- Join the Shared Task 🏁 : [Call for Participation](evalevalai.com/events/) + +- Join the community 💬 : [Reach out to be added](mailto:jan.batzner@tum.de) + + +```bibtex +@misc{evaleval2026everyevalever, + title = {Every Eval Ever: Toward a Common Language for AI Eval Reporting}, + author = {Jan Batzner and Leshem Coshen and Avijit Ghosh and Sree Harsha Nelaturu and Anastassia Kornilova and Damian Stachura and Anka Reuel and Yifan Mai and Asaf Yehudai and Irene Solaiman and Stella Biderman}, + year = {2026}, + month = {February}, + url = {https://evaleval.github.io/2026/02/16/everyevalever-launch/}, + note = {Blog Post, EvalEval Coalition} +} +``` + +### Feedback and Advise +We acknowledge feedback by, but not limited to, JJ Allaire (Inspect, Meridian Labs), Ryan Steed (US CAISI), Zee Talat +(University of Edinburgh), Gal Moyal (Noma Security), Sean McGregor (AVERI), Joal Stein +(WeVal/CiP), Srishti Yadav (ELLIS Copenhagen), Andrew Tran (AWS), Sanchit Ahuja (Northeastern), Volker Stocker (Weizenbaum, TUB), Marek Šuppa (Slido), Stefan Schmid (Weizenbaum, TUB), Gjergji Kasneci (TUM, MCML). 
+ + + + + From 0fbfac8bc8bd4cd5122aad91bf41f2ad1f130a0c Mon Sep 17 00:00:00 2001 From: Avijit Ghosh Date: Sun, 15 Feb 2026 17:33:59 -0500 Subject: [PATCH 2/3] grammar and formatting fix --- _posts/2026-02-15-everyevalever-launch.md | 95 ++++++++++++----------- 1 file changed, 51 insertions(+), 44 deletions(-) diff --git a/_posts/2026-02-15-everyevalever-launch.md b/_posts/2026-02-15-everyevalever-launch.md index 2f0571b..9ef5c20 100644 --- a/_posts/2026-02-15-everyevalever-launch.md +++ b/_posts/2026-02-15-everyevalever-launch.md @@ -24,41 +24,47 @@ tags: - "reproducibility" description: "The multistakeholder coalition EvalEval launches Every Eval Ever, a shared format and central eval repository. We're working to resolve AI evaluation fragmentation, improving formatting, settings, and ways to compare and build on each other's work." --- -As AI models advance, we encounter more and more evaluation results and benchmarks—yet evaluation itself rarely takes center stage. It plays an important role across the entire development cycle, from design decisions and [research to marketing and maintenance](https://dl.acm.org/doi/full/10.1145/3708359.3712152). This rapid progression forces evaluation solutions and eval infrastructure to keep up, often pushing them to adapt to settings [they were not originally designed for](https://arxiv.org/abs/2405.14782). For example, the popular evaluation tool [lm-eval-harness](https://github.com/EleutherAI/lm-evaluation-harness) was initially designed for base model evaluation, but it now supports instruction-tuned and reasoning models that didn’t even exist when it was created. -This rapid progression and wide range of evals led to fragmentation. Lacking standards, each framework reports different attributes in different formats, preventing the community from reliably comparing, replicating, determining which component fails, separating signal from noise, reusing others’ (often costly) evaluations, and performing large-scale analysis. This lack of reusability makes it difficult to build on the momentum of previous efforts. As model training has moved past the point where we retrain models from scratch or rewrite their training code, we must ask: why are we still rerunning every evaluation from scratch? +As AI models advance, we encounter more and more evaluation results and benchmarks, yet evaluation itself rarely takes center stage. Evaluation plays an important role across the entire development cycle, from design decisions and [research to marketing and maintenance](https://dl.acm.org/doi/full/10.1145/3708359.3712152). This rapid progression forces evaluation solutions and infrastructure to keep up, often pushing them to adapt to settings [they were not originally designed for](https://arxiv.org/abs/2405.14782). For example, the popular evaluation tool [lm-eval-harness](https://github.com/EleutherAI/lm-evaluation-harness) was initially designed for base model evaluation, but it now supports instruction-tuned and reasoning models that didn't even exist when it was created. -As part of a **cross-institutional, cross-sector initiative, the [EvalEval Coalition](https://evalevalai.com),** we are announcing [Every Eval Ever](https://huggingface.co/spaces/evaleval/every_eval_ever_space) to improve the state of evaluation building and comparison: +This wide range of evaluations has led to fragmentation. 
Lacking standards, each framework reports different attributes in different formats, preventing the community from reliably comparing results, replicating experiments, determining which components fail, separating signal from noise, reusing others' (often costly) evaluations, and performing large-scale analysis. This lack of reusability makes it difficult to build on the momentum of previous efforts. As model training has moved past the point where we retrain models from scratch or rewrite their training code, we must ask: why are we still rerunning every evaluation from scratch? + +As part of the [EvalEval Coalition](https://evalevalai.com), **a cross-institutional, cross-sector initiative,** we are announcing [Every Eval Ever](https://evalevalai.com/projects/every-eval-ever/) to improve the state of evaluation building and comparison: (1) **Defining a shared schema** so results from different frameworks can be compared, and -(2) Providing a **crowdsourced eval database** so researchers don’t have to start from scratch every time. +(2) Providing a **crowdsourced eval database** so researchers don't have to start from scratch every time. -Today, we're launching [Every Eval Ever](https://huggingface.co/spaces/evaleval/every_eval_ever_space), built upon valuable feedback from AI Eval Ecosystem actors including researchers and practitioners at the U.S. Center for AI Standards and Innovation (CAISI), EleutherAI, Hugging Face, Noma Security, Trustible, Inspect AI, Meridian, AVERI, Collective Intelligence Project, Stanford HELM, Weizenbaum, Evidence Prime, MIT, TUM, and IBM Research. +Today, we're launching [Every Eval Ever](https://huggingface.co/spaces/evaleval/every_eval_ever_space), built upon valuable feedback from AI evaluation ecosystem stakeholders including researchers and practitioners at the U.S. Center for AI Standards and Innovation (CAISI), EleutherAI, Hugging Face, Noma Security, Trustible, Inspect AI, Meridian, AVERI, Collective Intelligence Project, Stanford HELM, Weizenbaum, Evidence Prime, MIT, TUM, and IBM Research. ## The Hidden Problem -The infrastructure is fragmented, and the results scattered. + +The infrastructure is fragmented, and the results are scattered. + How does Model A compare to Model B on a given benchmark? One lab reports a five-shot score from its own harness. Another uses zero-shot through a different framework. A third pulls numbers from a leaderboard that does not disclose generation parameters. -This fragmentation has concrete costs. Large-scale analysis of evaluation trends or improving evaluation methodologies requires weeks of data wrangling before any research can begin. If they are possible at all, without (re)running full leaderboards at extreme costs. Comparison is unreliable when evaluations use different settings but carry the same benchmark name. Reproducibility is elusive when the details are lacking and inconsistent. -It is time for a change. We have seen this before in other parts of the ML pipeline. The community stopped retraining models from scratch or rewriting training code for each project long ago. Evaluations are next. + +This fragmentation has concrete costs. Large-scale analysis of evaluation trends or improving evaluation methodologies requires weeks of data wrangling before any research can begin, if such analysis is even possible without rerunning full leaderboards at extreme cost. Comparison is unreliable when evaluations use different settings but carry the same benchmark name. 
Reproducibility is elusive when the details are lacking or inconsistent. + +We have seen this before in other parts of the ML pipeline. The community stopped retraining models from scratch or rewriting training code for each project long ago. Evaluations are next. ## Why Us, Why Now -We just know the pain. The EvalEval Coalition is a community of researchers working to fix how AI evaluations are built, run, documented, shared, and compared. We worked on a myriad of projects where collecting evaluations restricts what can be done or takes most of the project’s efforts. Need examples? See [1](https://arxiv.org/abs/2602.03344), [2](https://arxiv.org/abs/2503.01622), [3](https://proceedings.neurips.cc/paper_files/paper/2024/hash/28236482f64a72eec43706b6f3a6c511-Abstract-Conference.html), [4](https://arxiv.org/abs/2412.06540), [5](https://arxiv.org/abs/2410.11840), [6](https://aclanthology.org/2024.acl-long.456/), [7](https://arxiv.org/abs/2407.13696), [8](https://par.nsf.gov/servlets/purl/10547932), [9](https://aclanthology.org/2024.naacl-long.139/), [10](https://aclanthology.org/2025.acl-long.34.com) among others. -The urgency of standardized AI evaluation has reached a critical tipping point, driven by the shift toward evaluations as a primary mechanism of governance. With the EU AI Act and the U.S. Executive Order now mandating rigorous risk assessments, standardized data is no longer a luxury but a prerequisite for meaningful sociotechnical safety standards. -This need is further intensified by the exploding complexity of modern AI, where frameworks like Inspect AI and HELM must navigate multi-turn agentic behaviors and human preferences that defy simple scoring. +We understand the pain firsthand. The EvalEval Coalition is a community of researchers working to fix how AI evaluations are built, run, documented, shared, and compared. We have worked on numerous projects where collecting evaluations either restricts what can be done or consumes most of the project's effort. Need examples? See [1](https://arxiv.org/abs/2602.03344), [2](https://arxiv.org/abs/2503.01622), [3](https://proceedings.neurips.cc/paper_files/paper/2024/hash/28236482f64a72eec43706b6f3a6c511-Abstract-Conference.html), [4](https://arxiv.org/abs/2412.06540), [5](https://arxiv.org/abs/2410.11840), [6](https://aclanthology.org/2024.acl-long.456/), [7](https://arxiv.org/abs/2407.13696), [8](https://par.nsf.gov/servlets/purl/10547932), [9](https://aclanthology.org/2024.naacl-long.139/), [10](https://aclanthology.org/2025.acl-long.34.com), among others. + +The urgency of standardized AI evaluation has reached a tipping point, driven by the shift toward evaluations as a primary mechanism of governance. With the EU AI Act and the U.S. Executive Order now mandating rigorous risk assessments, standardized data is no longer a luxury but a prerequisite for meaningful sociotechnical safety standards. This need is further intensified by the growing complexity of modern AI, where frameworks like Inspect AI and HELM must navigate multi-turn agentic behaviors and human preferences that defy simple scoring. -Ultimately, failing to adopt reusable formats imposes a technical debt on the community, forcing researchers to waste resources rerunning redundant evaluations rather than advancing the scientific frontier. Oh, and of course, even what is shared is often just a single score per dataset, obscuring many questions. 
+Failing to adopt reusable formats imposes technical debt on the community, forcing researchers to waste resources rerunning redundant evaluations rather than advancing the scientific frontier. Even what is shared is often just a single score per dataset, obscuring many questions. ## What We're Building -The [Every Eval Ever](https://huggingface.co/spaces/evaleval/every_eval_ever_space) is a schema to describe evaluation results and a community collection of those results. It is by the community and for the community, made simple to contribute to, evaluation or code. Enough high-level, let’s get into the details. -The repository is organized by benchmark, model, and evaluation run. Each result file captures not just scores but the context you need to interpret and reuse them: +Every Eval Ever is a schema to describe evaluation results and a community collection of those results. It is by the community and for the community, designed to make contributing evaluations or code simple. Now for the details. + +The repository is organized by benchmark, model, and evaluation run. Each result file captures not just scores but the context needed to interpret and reuse them: - who ran the evaluation, - what model, - with what settings, - what these scores actually mean, -- and instance-level scores, if you have them. +- and instance-level scores, if available. ``` data/ @@ -69,19 +75,21 @@ data/ └── {uuid}.jsonl ``` -The three components that make [Every Eval Ever](https://huggingface.co/spaces/evaleval/every_eval_ever_space) work 💙 +The three components that make Every Eval Ever work: 📋 A [metadata schema](https://github.com/evaleval/every_eval_ever/blob/main/eval.schema.json) that defines the information needed for meaningful comparison of evaluation results 🔧 [Validation](https://github.com/evaleval/every_eval_ever/blob/main/utils/validate_data.py) that checks data against the schema before it enters the repository -🔌 [Converters](https://github.com/evaleval/every_eval_ever/tree/main/eval_converters) for popular evaluation tools like [Inspect AI](https://inspect.aisi.org.uk/), [HELM](https://github.com/stanford-crfm/helm), and [lm-eval-harness](https://github.com/EleutherAI/lm-evaluation-harness), so you can transform your existing evaluation logs into the standard format. +🔌 [Converters](https://github.com/evaleval/every_eval_ever/tree/main/eval_converters) for popular evaluation tools like [Inspect AI](https://inspect.aisi.org.uk/), [HELM](https://github.com/stanford-crfm/helm), and [lm-eval-harness](https://github.com/EleutherAI/lm-evaluation-harness), so you can transform your existing evaluation logs into the standard format ## 📋 The Schema -One thing we realized early on: evaluation harnesses are amazing — we love them all — but they were each built for their own purposes, and we cannot simply aggregate their scores. Take MMLU: the lm-eval-harness, HELM, and the original Berkeley implementation all evaluate the same dataset, but with different prompt formatting, different answer extraction methods, and different ordering of few-shot examples. The result? [LLaMA 65B scored 0.637 on HELM but 0.488 on the EleutherAI harness](https://huggingface.co/blog/open-llm-leaderboard-mmlu) — both called "MMLU score," same dataset, big gap. -And then, think about all the evaluation data that doesn't come from a harness at all: hand-built benchmarks, leaderboard scrapes, results pulled from research papers and blog posts. 
We wanted one format that could capture whether harness-run or hand-built alike, truly every evaluation ever. -Here's how the schema looks in practice, using [LiveCodeBench Pro](https://github.com/GavinZhengOI/LiveCodeBench-Pro) as an example from our repo. +One thing we realized early on: evaluation harnesses are valuable tools, but each was built for its own purpose, and we cannot simply aggregate their scores. Take MMLU: the lm-eval-harness, HELM, and the original Berkeley implementation all evaluate the same dataset, but with different prompt formatting, different answer extraction methods, and different ordering of few-shot examples. The result? [LLaMA 65B scored 0.637 on HELM but 0.488 on the EleutherAI harness](https://huggingface.co/blog/open-llm-leaderboard-mmlu), both reporting an "MMLU score" on the same dataset, yet with a significant gap. + +Think about all the evaluation data that doesn't come from a harness at all: hand-built benchmarks, leaderboard scrapes, results pulled from research papers and blog posts. We wanted one format that could capture all evaluations, whether harness-run or hand-built, truly every evaluation ever. + +Here's how the schema looks in practice, using [LiveCodeBench Pro](https://github.com/GavinZhengOI/LiveCodeBench-Pro) as an example from our repository. First, we capture where the evaluation came from and who ran it: @@ -96,9 +104,9 @@ First, we capture where the evaluation came from and who ran it: } ``` -The schema also records the exact model details — name, developer, and how it was accessed. Was it called through the developer's own API (like OpenAI or Anthropic), through a third-party provider (like OpenRouter or Together AI), or run locally with an inference engine like vLLM? This isn't just because we love eval metadata. The same model, accessed through different providers or run with different engine configurations, [can produce different outputs](https://arxiv.org/pdf/2312.03886) — and therefore different scores. +The schema also records the exact model details: name, developer, and how it was accessed. Was it called through the developer's own API (like OpenAI or Anthropic), through a third-party provider (like OpenRouter or Together AI), or run locally with an inference engine like vLLM? This matters because the same model, accessed through different providers or run with different engine configurations, [can produce different outputs](https://arxiv.org/pdf/2312.03886) and therefore different scores. -Next, generation settings. We all know how much they matter — changing temperature or the number of samples alone can shift scores by several points. Yet they're routinely absent from leaderboards and incomplete even in papers. When a model's score is reported without this context, we're left guessing whether differences reflect actual model capability or just different settings. So we want to capture this information — and where it's missing, record that gap explicitly, so anyone interpreting the results knows what context they do and don't have: +Next, generation settings. We all know how much they matter: changing temperature or the number of samples alone can shift scores by several points. Yet they're routinely absent from leaderboards and incomplete even in papers. When a model's score is reported without this context, we're left guessing whether differences reflect actual model capability or just different settings. 
We capture this information and, where it's missing, record that gap explicitly so anyone interpreting the results knows what context they do and don't have: ```json { @@ -116,7 +124,7 @@ Next, generation settings. We all know how much they matter — changing tempera } ``` -And then there's the score itself. Let’s take a model on the coding benchmark [HumanEval](https://arxiv.org/abs/2107.03374): scoring 0.31 on the first try (called pass@1) represents a fraction of coding problems it solved — higher would be better. On the contrary, if the same model scores again 0.31 but on [RealToxicityPrompts](https://github.com/allenai/real-toxicity-prompts), lower scores would be better. [Every Eval Ever](https://huggingface.co/spaces/evaleval/every_eval_ever_space) standardizes to enable better eval result interpretation: +Then there's the score itself. Consider a model on the coding benchmark [HumanEval](https://arxiv.org/abs/2107.03374): scoring 0.31 on the first try (called pass@1) means it solved 31% of coding problems, where higher is better. In contrast, if the same model scores 0.31 on [RealToxicityPrompts](https://github.com/allenai/real-toxicity-prompts), lower scores would be better. Every Eval Ever standardizes this to enable better interpretation of evaluation results: ```json { @@ -139,35 +147,38 @@ And then there's the score itself. Let’s take a model on the coding benchmark ``` ## 🔧 Validation & Converters -Better eval infrastructure should be easy and frictionless for practitioners. That's why we're proud that [Every Eval Ever](https://huggingface.co/spaces/evaleval/every_eval_ever_space) provides: -**Converters:** If you're already running evaluations with existing eval tools, you shouldn't have to manually parse your results. Our converters transform evaluation logs into the Every Eval Ever format automatically. Converters for [HELM](https://github.com/evaleval/every_eval_ever/tree/main/eval_converters/helm), [lm-eval](https://github.com/evaleval/every_eval_ever/tree/main/eval_converters/lm_eval), and [Inspect AI](https://github.com/evaleval/every_eval_ever/tree/main/eval_converters/inspect) are already available, with more on the way. One step, and your eval data fits in. +Better eval infrastructure should be easy and frictionless for practitioners. That's why [Every Eval Ever](https://evalevalai.com/projects/every-eval-ever/) provides: + +**Converters:** If you're already running evaluations with existing eval tools, you shouldn't have to manually parse your results. Our converters transform evaluation logs into the Every Eval Ever format automatically. Converters for [HELM](https://github.com/evaleval/every_eval_ever/tree/main/eval_converters/helm), [lm-eval](https://github.com/evaleval/every_eval_ever/tree/main/eval_converters/lm_eval), and [Inspect AI](https://github.com/evaleval/every_eval_ever/tree/main/eval_converters/inspect) are already available, with more on the way. In one step, your eval data fits the format. -**Validation:** Validation runs automatically on every submission via Hugging Face Jobs — before any result file is merged, it's checked against the schema to catch missing fields and structural issues early, not months later when someone tries to use the data. +**Validation:** Validation runs automatically on every submission via Hugging Face Jobs. Before any result file is merged, it's checked against the schema to catch missing fields and structural issues early, not months later when someone tries to use the data. 
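If you want to run the same kind of check locally before opening a pull request, a small script is enough. The sketch below is not the repository's own validation job; it is an illustration using the standard `jsonschema` Python package against the published `eval.schema.json`, with placeholder file paths:

```python
import json
from pathlib import Path

from jsonschema.validators import validator_for  # pip install jsonschema

# Placeholder paths: point these at your local copy of eval.schema.json
# and at the result file you plan to submit.
SCHEMA_PATH = Path("eval.schema.json")
RESULT_PATH = Path("my_eval_result.json")

schema = json.loads(SCHEMA_PATH.read_text())
result = json.loads(RESULT_PATH.read_text())

# Pick the validator class declared by the schema's "$schema" field,
# confirm the schema itself is well-formed, then collect every violation
# in one pass instead of stopping at the first error.
ValidatorClass = validator_for(schema)
ValidatorClass.check_schema(schema)
validator = ValidatorClass(schema)

errors = list(validator.iter_errors(result))
if not errors:
    print("OK: result file conforms to the schema")
else:
    for error in errors:
        location = "/".join(str(part) for part in error.absolute_path) or "<root>"
        print(f"{location}: {error.message}")
```

A local check like this surfaces the same class of problems the automated job looks for (missing required fields, wrong types, malformed structure) before the file ever reaches a pull request.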
## 🧩 Design Decisions -What counts as a unique evaluation for you? This can get quite a thorny question! Imagine two [GSM8K](https://huggingface.co/datasets/openai/gsm8k) runs might differ in prompt template, chat template, vLLM version, GPU kernels, or dataset version — each affecting scores. Within our own coalition, members used GSM8K for purposes as different as measuring mathematical reasoning and benchmarking speculative decoding. + +What counts as a unique evaluation? This can be a thorny question. Two [GSM8K](https://huggingface.co/datasets/openai/gsm8k) runs might differ in prompt template, chat template, vLLM version, GPU kernels, or dataset version, each affecting scores. Within our own coalition, members used GSM8K for purposes as different as measuring mathematical reasoning and benchmarking speculative decoding. We considered defining a canonical parameter set to fingerprint unique runs. In practice, the space of score-affecting variables is too large for any fixed set. Our solution: each run gets its own file, identified by a UUID, with as much metadata as possible captured alongside it. Deduplication and grouping happen at the analysis layer, not the schema layer. This keeps data lossless while letting consumers apply their own equivalence criteria. -On purpose we allowed reporting whatever one has. For many, this is an aggregation score with some metadata; for others, it is per-example data and a lot of hyperparameters. Why should you care? -Hardly any evaluation research is done on the aggregation scores. You want to check whether there are biases, errors, redundant questions, questions that measure the wrong thing, aggregate datasets differently, and analyse errors. All of those questions require at least the model outputs or scores per example, not per dataset. +We intentionally allow reporting whatever information is available. For many, this is an aggregation score with some metadata; for others, it is per-example data and extensive hyperparameters. Why does this matter? Most evaluation research requires more than aggregation scores. Researchers need to check for biases, errors, redundant questions, questions that measure the wrong thing, alternative dataset aggregations, and error analysis. All of these questions require at least the model outputs or scores per example, not per dataset. + +## What's Next -## What’s Next -[Every Eval Ever](https://huggingface.co/spaces/evaleval/every_eval_ever_space) grew out of a need we kept running into in our own research. When [EvalEval researchers mapped how social impact evaluations are reported across the field](https://arxiv.org/abs/2511.05613), examining 186 first-party reports and 183 third-party sources, the lack of a common format turned what should have been a straightforward analysis into weeks of manual data wrangling. That work made the case for why something like Every Eval Ever needed to exist: even though a lot of evaluation data is available open source, it is in incompatible formats, with no shared infrastructure to aggregate or compare it. +Every Eval Ever grew out of a need we kept running into in our own research. When [EvalEval researchers mapped how social impact evaluations are reported across the field](https://arxiv.org/abs/2511.05613), examining 186 first-party reports and 183 third-party sources, the lack of a common format turned what should have been a straightforward analysis into weeks of manual data wrangling. 
That work made the case for why something like Every Eval Ever needed to exist: even though evaluation data is openly available, it exists in incompatible formats, with no shared infrastructure to aggregate or compare it. -This schema enables research. Beyond just good documentation hygiene, [researchers already used the repository in a multi-author EvalEval effort to analyze benchmark saturation across 60 benchmarks](https://evalevalai.com/projects/bench-sat/), finding that nearly half had lost their ability to differentiate top-performing models. Centralized, standardized evaluation data opens up more: seeing where the ecosystem is thin, which capabilities are over-measured, and which risks are neglected. With instance-level data, researchers can move beyond leaderboard averages to study item difficulty, robustness, and temporal drift. Every Eval Ever enables meta-evaluation: testing evaluation methods themselves to distinguish real progress from artifacts of setup and reporting. -We need your help. We're launching a [Shared Task](evalevalai.com/events/) for practitioners alongside this post — two tracks for contributing public and proprietary eval data to the repository, with co-authorship for qualifying contributors and a [workshop at ACL 2026 in San Diego](https://evalevalai.com/events/2026-acl-workshop/). +This schema enables research. Beyond good documentation hygiene, [researchers have already used the repository in a multi-author EvalEval effort to analyze benchmark saturation across 60 benchmarks](https://evalevalai.com/projects/bench-sat/), finding that nearly half had lost their ability to differentiate top-performing models. Centralized, standardized evaluation data enables additional research questions: identifying where the ecosystem is thin, which capabilities are over-measured, and which risks are neglected. With instance-level data, researchers can move beyond leaderboard averages to study item difficulty, robustness, and temporal drift. Every Eval Ever enables meta-evaluation: testing evaluation methods themselves to distinguish real progress from artifacts of setup and reporting. + +We need your help. We're launching a [Shared Task](https://evalevalai.com/events/) for practitioners alongside this post: two tracks for contributing public and proprietary eval data to the repository, with co-authorship for qualifying contributors and a [workshop at ACL 2026 in San Diego](https://evalevalai.com/events/2026-acl-workshop/). *Submissions open now, deadline May 1, 2026.* -## Get involved -- Try the schema 📋 : [Hugging Face Space](https://huggingface.co/spaces/evaleval/every_eval_ever_space) and [GitHub](github.com/evaleval/every_eval_ever) +## Get Involved -- Join the Shared Task 🏁 : [Call for Participation](evalevalai.com/events/) +- Try the schema 📋: [Hugging Face Space](https://huggingface.co/spaces/evaleval/every_eval_ever_space) and [GitHub](https://github.com/evaleval/every_eval_ever) -- Join the community 💬 : [Reach out to be added](mailto:jan.batzner@tum.de) +- Join the Shared Task 🏁: [Call for Participation](https://evalevalai.com/events/) +- Join the community 💬: [Reach out to be added](mailto:jan.batzner@tum.de) ```bibtex @misc{evaleval2026everyevalever, @@ -180,12 +191,8 @@ We need your help. 
We're launching a [Shared Task](evalevalai.com/events/) for p } ``` -### Feedback and Advise -We acknowledge feedback by, but not limited to, JJ Allaire (Inspect, Meridian Labs), Ryan Steed (US CAISI), Zee Talat -(University of Edinburgh), Gal Moyal (Noma Security), Sean McGregor (AVERI), Joal Stein -(WeVal/CiP), Srishti Yadav (ELLIS Copenhagen), Andrew Tran (AWS), Sanchit Ahuja (Northeastern), Volker Stocker (Weizenbaum, TUB), Marek Šuppa (Slido), Stefan Schmid (Weizenbaum, TUB), Gjergji Kasneci (TUM, MCML). - - - +### Feedback and Advice +We acknowledge feedback from JJ Allaire (Inspect, Meridian Labs), Ryan Steed (US CAISI), Zee Talat (University of Edinburgh), Gal Moyal (Noma Security), Sean McGregor (AVERI), Joal Stein (WeVal/CiP), Srishti Yadav (ELLIS Copenhagen), Andrew Tran (AWS), Sanchit Ahuja (Northeastern), Volker Stocker (Weizenbaum, TUB), Marek Šuppa (Slido), Stefan Schmid (Weizenbaum, TUB), and Gjergji Kasneci (TUM, MCML). +--- \ No newline at end of file From 36653671ebb62d0e0405c7dd4ee2bcba69516303 Mon Sep 17 00:00:00 2001 From: Jan Batzner <91485870+janbatzner@users.noreply.github.com> Date: Mon, 16 Feb 2026 20:08:01 +0100 Subject: [PATCH 3/3] corrected typos, fixed links, incorporated feedback --- _events/shared-task-every-eval-ever.md | 187 ++++++++++++++++++++++ _posts/2026-02-15-everyevalever-launch.md | 32 ++-- 2 files changed, 203 insertions(+), 16 deletions(-) create mode 100644 _events/shared-task-every-eval-ever.md diff --git a/_events/shared-task-every-eval-ever.md b/_events/shared-task-every-eval-ever.md new file mode 100644 index 0000000..8c0fcf6 --- /dev/null +++ b/_events/shared-task-every-eval-ever.md @@ -0,0 +1,187 @@ +--- +layout: event +title: "Shared Task: Every Eval Ever" +subtitle: Building a Unifying, Standardized Database of LLM Evaluations +status: active +order: 2 +category: Infrastructure +event_date: 2026-05-01 +location: 🌐 Online +host: EvalEval +description: | + Help us build the first unifying, open database of LLM evaluation results! Convert evaluation data from leaderboards, papers, or your own runs into a shared format — and join as co-author on the resulting paper. +--- +As the cost of genAI model evaluation is rapidly increasing, researchers, non-profits, small companies, and civil society orgs need to rely on existing evaluation data on the web. Evaluation data refers to Large Language Model evaluations on popular benchmarks or domain-specific tasks, which are commonly saved under HuggingFace leaderboards or reported in research papers. However, with numerous evaluation frameworks emerging across research and industry, evaluation data is scattered across different platforms, stored in inconsistent formats, and lacks standardization that would enable meaningful comparison and meta-analysis. + +The [Every Eval Ever](https://github.com/evaleval/every_eval_ever) Shared Task aims to address this fragmentation by establishing a unified metadata schema for LLM evaluations to populate a comprehensive, standardized database of evaluation results. Qualifying contributors will be invited to join the paper write-up as co-authors. + +## 🎯 Task + +Participants will contribute to building a comprehensive database of LLM evaluations by converting existing evaluation data into our standardized schema. The task is divided into two tracks: + +### 🏁 Track 1: Public Eval Data Parsing + +**Objective 1:** Parse and convert evaluation data from existing public leaderboards (e.g., Chatbot Arena, Open LLM Leaderboard, AlpacaEval, MT-Bench, etc.) 
into our standardized metadata schema. + +**Objective 2:** Extract evaluation results from academic papers and technical reports, converting them into our standardized schema. This includes results from tables, figures, and text descriptions in published research. + +**Deliverables:** +1. Python scripts that programmatically extract data from leaderboard APIs or web interfaces, or Python scripts for automated or (semi-)automated extraction from papers +2. Converted datasets in our schema format (JSON) +3. Documentation of data extraction methodology and any issues encountered (txt) + +### 🔒 Track 2: Proprietary Evaluation Data + +**Objective:** Convert proprietary evaluation datasets (from companies, research labs, or private benchmarks) into our schema and contribute them to the shared database. This track welcomes both public release of new data and private contributions under appropriate data use agreements. + +**Deliverables:** +1. Converted datasets in standardized schema format (JSON) +2. Python conversion scripts (if data structure can be shared according to your org policies) +3. Documentation of data extraction methodology and any issues encountered (txt) + +## 👥 Participation Guidelines + +**Who can participate:** Anyone! Academic researchers, industry practitioners, independent developers. + +**Submission Requirements:** +- ✓ All submissions should include conversion scripts (Python preferred) +- ✓ Converted data must validate against our schema +- ✓ Documentation explaining the conversion process +- ✓ Participants may use any publicly available leaderboard or paper (link attribution), if not already in database +- ✓ Proprietary data submissions must include appropriate permissions +- ✓ Optional: use the data you parsed for a research question and submit a workshop paper for [EvalEval at ACL](/events/2026-acl-workshop/) + +## 📅 Important Dates + +- **Schema and example data release:** February 17, 2026 +- **Shared Task period begins:** February 17, 2026 +- **Submission deadline:** May 1, 2026 +- **Results announcement:** June 1, 2026 +- **Workshop/presentation:** [ACL San Diego, July 7, 2026](/events/2026-acl-workshop/) + +## 🔍 Schema at a Glance + +For the full story, see our blog post: [Every Eval Ever: Toward a Common Language for AI Eval Reporting](/infrastructure/2026/02/15/everyevalever-launch/). + +The repository is organized by benchmark, model, and evaluation run. Each result file captures not just scores but the context you need to interpret and reuse them: +- who ran the evaluation, +- what model, +- with what settings, +- what these scores actually mean, +- and instance-level scores, if you have them. 
+ +``` +data/ +└── {benchmark}/ + └── {developer_name}/ + └── {model_name}/ + ├── {uuid}.json + └── {uuid}.jsonl +``` + +The three components that make [Every Eval Ever](https://github.com/evaleval/every_eval_ever) work: + +📋 A [metadata schema](https://github.com/evaleval/every_eval_ever/blob/main/eval.schema.json) that defines the information needed for meaningful comparison of evaluation results + +🔧 [Validation](https://github.com/evaleval/every_eval_ever/blob/main/utils/validate_data.py) that checks data against the schema before it enters the repository + +🔌 [Converters](https://github.com/evaleval/every_eval_ever/tree/main/eval_converters) for popular evaluation tools like [Inspect AI](https://inspect.aisi.org.uk/), [HELM](https://github.com/stanford-crfm/helm), and [lm-eval-harness](https://github.com/EleutherAI/lm-evaluation-harness), so you can transform your existing evaluation logs into the standard format. + +### 📋 The Schema + +One thing we realized early on: evaluation harnesses are amazing — we love them all — but they were each built for their own purposes, and we cannot simply aggregate their scores. Take MMLU: the lm-eval-harness, HELM, and the original Berkeley implementation all evaluate the same dataset, but with different prompt formatting, different answer extraction methods, and different ordering of few-shot examples. The result? [LLaMA 65B scored 0.637 on HELM but 0.488 on the EleutherAI harness](https://huggingface.co/blog/open-llm-leaderboard-mmlu) — both called "MMLU score," same dataset, big gap. + +Here's how the schema looks in practice, using [LiveCodeBench Pro](https://github.com/GavinZhengOI/LiveCodeBench-Pro) as an example from our repo. + +First, we capture where the evaluation came from and who ran it: + +```json +{ + "source_metadata": { + "source_name": "LiveCodeBench Pro", + "source_type": "evaluation_platform", + "source_organization_name": "LiveCodeBench", + "evaluator_relationship": "third_party" + } +} +``` + +The schema also records the exact model details — name, developer, and how it was accessed. Was it called through the developer's own API (like OpenAI or Anthropic), through a third-party provider (like OpenRouter or Together AI), or run locally with an inference engine like vLLM? The same model, accessed through different providers or run with different engine configurations, [can produce different outputs](https://arxiv.org/pdf/2312.03886) — and therefore different scores. + +Next, generation settings. We all know how much they matter — changing temperature or the number of samples alone can shift scores by several points. Yet they're routinely absent from leaderboards and incomplete even in papers. So we want to capture this information — and where it's missing, record that gap explicitly, so anyone interpreting the results knows what context they do and don't have: + +```json +{ + "generation_config": { + "generation_args": { + "temperature": 0.2, + "top_p": 0.95, + "max_tokens": 2048 + }, + "additional_details": { + "n_samples": 10, + "stop_sequences": ["\n```"] + } + } +} +``` + +And then there's the score itself. A model scoring 0.31 on [HumanEval](https://arxiv.org/abs/2107.03374) (pass@1) means higher is better. But 0.31 on [RealToxicityPrompts](https://github.com/allenai/real-toxicity-prompts) means lower is better. 
[Every Eval Ever](https://github.com/evaleval/every_eval_ever) standardizes to enable better eval result interpretation: + +```json +{ + "evaluation_results": [ + { + "evaluation_name": "code_generation", + "metric_config": { + "evaluation_description": "pass@1 on code generation tasks", + "lower_is_better": false, + "score_type": "continuous", + "min_score": 0, + "max_score": 1 + }, + "score_details": { + "score": 0.31 + } + } + ] +} +``` + +### 🔧 Validation & Converters + +Better eval infrastructure should be easy and frictionless for practitioners. That's why [Every Eval Ever](https://github.com/evaleval/every_eval_ever) provides: + +**Converters:** If you're already running evaluations with existing eval tools, you shouldn't have to manually parse your results. Our converters transform evaluation logs into the Every Eval Ever format automatically. Converters for [HELM](https://github.com/evaleval/every_eval_ever/tree/main/eval_converters/helm), [lm-eval](https://github.com/evaleval/every_eval_ever/tree/main/eval_converters/lm_eval), and [Inspect AI](https://github.com/evaleval/every_eval_ever/tree/main/eval_converters/inspect) are already available, with more on the way. + +**Validation:** Validation runs automatically on every submission via Hugging Face Jobs — before any result file is merged, it's checked against the schema to catch missing fields and structural issues early, not months later when someone tries to use the data. + +## 🧑‍🔬 Core Organizers + +- [Jan Batzner](mailto:jan.batzner@tum.de), Weizenbaum Institute, MCML, TU Munich +- Leshem Choshen, MIT, IBM Research, MIT-IBM Watson AI Lab +- Sree Harsha Nelaturu, Zuse Institute Berlin +- Usman Gohar, Iowa State University +- Damian Stachura, Evidence Prime +- Andrew Tran, Independent +- Avijit Ghosh, Hugging Face + +## 🔗 Resources +- 📂 **Schema:** [Documentation and examples on GitHub](https://github.com/evaleval/every_eval_ever) +- ✓ **Validation:** [Script to validate your data against the schema](https://github.com/evaleval/every_eval_ever/blob/main/scripts/validate_data.py) +- 🚀 **Submit:** [Drag & drop or PR on our HuggingFace Datastore](https://huggingface.co/datasets/evaleval/EEE_datastore) +- 💬 **Slack:** [Reach out to join our discussion forum](mailto:jan.batzner@tum.de) + +## ❓ FAQ + +**Q: Do I need to submit data for both tracks?** +A: No, you can participate in any single track or combination of tracks that interests you. + +**Q: Can I submit data from multiple leaderboards/papers?** +A: Yes! We encourage comprehensive contributions covering multiple sources. + +**Q: What if I find errors or inconsistencies in source data?** +A: Document these in your submission. Our goal is transparency about data quality. + +**Q: Will my conversion scripts be made public?** +A: Yes, to enable reproducibility and allow others to update the data as leaderboards evolve. diff --git a/_posts/2026-02-15-everyevalever-launch.md b/_posts/2026-02-15-everyevalever-launch.md index 2f0571b..0db3985 100644 --- a/_posts/2026-02-15-everyevalever-launch.md +++ b/_posts/2026-02-15-everyevalever-launch.md @@ -8,7 +8,7 @@ category: Infrastructure image: "/assets/img/long-site-banner.webp" authors: - name: "Jan Batzner*" - - name: "Leshem Coshen*" + - name: "Leshem Choshen*" - name: "Avijit Ghosh*" - name: "Sree Harsha Nelaturu*" - name: "Anastassia Kornilova*" @@ -28,13 +28,13 @@ As AI models advance, we encounter more and more evaluation results and benchmar This rapid progression and wide range of evals led to fragmentation. 
Lacking standards, each framework reports different attributes in different formats, preventing the community from reliably comparing, replicating, determining which component fails, separating signal from noise, reusing others’ (often costly) evaluations, and performing large-scale analysis. This lack of reusability makes it difficult to build on the momentum of previous efforts. As model training has moved past the point where we retrain models from scratch or rewrite their training code, we must ask: why are we still rerunning every evaluation from scratch? -As part of a **cross-institutional, cross-sector initiative, the [EvalEval Coalition](https://evalevalai.com),** we are announcing [Every Eval Ever](https://huggingface.co/spaces/evaleval/every_eval_ever_space) to improve the state of evaluation building and comparison: +As part of a **cross-institutional, cross-sector initiative, the [EvalEval Coalition](https://evalevalai.com),** we are announcing [Every Eval Ever](https://github.com/evaleval/every_eval_ever) to improve the state of evaluation building and comparison: (1) **Defining a shared schema** so results from different frameworks can be compared, and (2) Providing a **crowdsourced eval database** so researchers don’t have to start from scratch every time. -Today, we're launching [Every Eval Ever](https://huggingface.co/spaces/evaleval/every_eval_ever_space), built upon valuable feedback from AI Eval Ecosystem actors including researchers and practitioners at the U.S. Center for AI Standards and Innovation (CAISI), EleutherAI, Hugging Face, Noma Security, Trustible, Inspect AI, Meridian, AVERI, Collective Intelligence Project, Stanford HELM, Weizenbaum, Evidence Prime, MIT, TUM, and IBM Research. +Today, we're launching [Every Eval Ever](https://github.com/evaleval/every_eval_ever), built upon valuable feedback from AI Eval Ecosystem actors including researchers and practitioners at the U.S. Center for AI Standards and Innovation (CAISI), EleutherAI, Hugging Face, Noma Security, Trustible, Inspect AI, Meridian, AVERI, Collective Intelligence Project, Stanford HELM, Weizenbaum, Evidence Prime, MIT, TUM, and IBM Research. ## The Hidden Problem The infrastructure is fragmented, and the results scattered. @@ -43,7 +43,7 @@ This fragmentation has concrete costs. Large-scale analysis of evaluation trends It is time for a change. We have seen this before in other parts of the ML pipeline. The community stopped retraining models from scratch or rewriting training code for each project long ago. Evaluations are next. ## Why Us, Why Now -We just know the pain. The EvalEval Coalition is a community of researchers working to fix how AI evaluations are built, run, documented, shared, and compared. We worked on a myriad of projects where collecting evaluations restricts what can be done or takes most of the project’s efforts. Need examples? See [1](https://arxiv.org/abs/2602.03344), [2](https://arxiv.org/abs/2503.01622), [3](https://proceedings.neurips.cc/paper_files/paper/2024/hash/28236482f64a72eec43706b6f3a6c511-Abstract-Conference.html), [4](https://arxiv.org/abs/2412.06540), [5](https://arxiv.org/abs/2410.11840), [6](https://aclanthology.org/2024.acl-long.456/), [7](https://arxiv.org/abs/2407.13696), [8](https://par.nsf.gov/servlets/purl/10547932), [9](https://aclanthology.org/2024.naacl-long.139/), [10](https://aclanthology.org/2025.acl-long.34.com) among others. +We just know the pain. 
The EvalEval Coalition is a community of researchers working to fix how AI evaluations are built, run, documented, shared, and compared. We worked on a myriad of projects where collecting evaluations restricts what can be done or takes most of the project’s efforts. Need examples? See [1](https://arxiv.org/abs/2602.03344), [2](https://arxiv.org/abs/2503.01622), [3](https://proceedings.neurips.cc/paper_files/paper/2024/hash/28236482f64a72eec43706b6f3a6c511-Abstract-Conference.html), [4](https://arxiv.org/abs/2412.06540), [5](https://arxiv.org/abs/2410.11840), [6](https://aclanthology.org/2024.acl-long.456/), [7](https://arxiv.org/abs/2407.13696), [8](https://par.nsf.gov/servlets/purl/10547932), [9](https://aclanthology.org/2024.naacl-long.139/), [10](https://aclanthology.org/2025.acl-long.34/) among others. The urgency of standardized AI evaluation has reached a critical tipping point, driven by the shift toward evaluations as a primary mechanism of governance. With the EU AI Act and the U.S. Executive Order now mandating rigorous risk assessments, standardized data is no longer a luxury but a prerequisite for meaningful sociotechnical safety standards. This need is further intensified by the exploding complexity of modern AI, where frameworks like Inspect AI and HELM must navigate multi-turn agentic behaviors and human preferences that defy simple scoring. @@ -51,7 +51,7 @@ This need is further intensified by the exploding complexity of modern AI, where Ultimately, failing to adopt reusable formats imposes a technical debt on the community, forcing researchers to waste resources rerunning redundant evaluations rather than advancing the scientific frontier. Oh, and of course, even what is shared is often just a single score per dataset, obscuring many questions. ## What We're Building -The [Every Eval Ever](https://huggingface.co/spaces/evaleval/every_eval_ever_space) is a schema to describe evaluation results and a community collection of those results. It is by the community and for the community, made simple to contribute to, evaluation or code. Enough high-level, let’s get into the details. +The [Every Eval Ever](https://github.com/evaleval/every_eval_ever) is a schema to describe evaluation results and a community collection of those results. It is by the community and for the community, made simple to contribute to, evaluation or code. Enough high-level, let’s get into the details. The repository is organized by benchmark, model, and evaluation run. Each result file captures not just scores but the context you need to interpret and reuse them: - who ran the evaluation, @@ -69,7 +69,7 @@ data/ └── {uuid}.jsonl ``` -The three components that make [Every Eval Ever](https://huggingface.co/spaces/evaleval/every_eval_ever_space) work 💙 +The three components that make [Every Eval Ever](https://github.com/evaleval/every_eval_ever) work 💙 📋 A [metadata schema](https://github.com/evaleval/every_eval_ever/blob/main/eval.schema.json) that defines the information needed for meaningful comparison of evaluation results @@ -116,7 +116,7 @@ Next, generation settings. We all know how much they matter — changing tempera } ``` -And then there's the score itself. Let’s take a model on the coding benchmark [HumanEval](https://arxiv.org/abs/2107.03374): scoring 0.31 on the first try (called pass@1) represents a fraction of coding problems it solved — higher would be better. 
On the contrary, if the same model scores again 0.31 but on [RealToxicityPrompts](https://github.com/allenai/real-toxicity-prompts), lower scores would be better. [Every Eval Ever](https://huggingface.co/spaces/evaleval/every_eval_ever_space) standardizes to enable better eval result interpretation: +And then there's the score itself. Let’s take a model on the coding benchmark [HumanEval](https://arxiv.org/abs/2107.03374): scoring 0.31 on the first try (called pass@1) represents a fraction of coding problems it solved — higher would be better. On the contrary, if the same model scores again 0.31 but on [RealToxicityPrompts](https://github.com/allenai/real-toxicity-prompts), lower scores would be better. [Every Eval Ever](https://github.com/evaleval/every_eval_ever) standardizes to enable better eval result interpretation: ```json { @@ -139,7 +139,7 @@ And then there's the score itself. Let’s take a model on the coding benchmark ``` ## 🔧 Validation & Converters -Better eval infrastructure should be easy and frictionless for practitioners. That's why we're proud that [Every Eval Ever](https://huggingface.co/spaces/evaleval/every_eval_ever_space) provides: +Better eval infrastructure should be easy and frictionless for practitioners. That's why we're proud that [Every Eval Ever](https://github.com/evaleval/every_eval_ever) provides: **Converters:** If you're already running evaluations with existing eval tools, you shouldn't have to manually parse your results. Our converters transform evaluation logs into the Every Eval Ever format automatically. Converters for [HELM](https://github.com/evaleval/every_eval_ever/tree/main/eval_converters/helm), [lm-eval](https://github.com/evaleval/every_eval_ever/tree/main/eval_converters/lm_eval), and [Inspect AI](https://github.com/evaleval/every_eval_ever/tree/main/eval_converters/inspect) are already available, with more on the way. One step, and your eval data fits in. @@ -154,17 +154,17 @@ On purpose we allowed reporting whatever one has. For many, this is an aggregati Hardly any evaluation research is done on the aggregation scores. You want to check whether there are biases, errors, redundant questions, questions that measure the wrong thing, aggregate datasets differently, and analyse errors. All of those questions require at least the model outputs or scores per example, not per dataset. ## What’s Next -[Every Eval Ever](https://huggingface.co/spaces/evaleval/every_eval_ever_space) grew out of a need we kept running into in our own research. When [EvalEval researchers mapped how social impact evaluations are reported across the field](https://arxiv.org/abs/2511.05613), examining 186 first-party reports and 183 third-party sources, the lack of a common format turned what should have been a straightforward analysis into weeks of manual data wrangling. That work made the case for why something like Every Eval Ever needed to exist: even though a lot of evaluation data is available open source, it is in incompatible formats, with no shared infrastructure to aggregate or compare it. +[Every Eval Ever](https://github.com/evaleval/every_eval_ever) grew out of a need we kept running into in our own research. When [EvalEval researchers mapped how social impact evaluations are reported across the field](https://arxiv.org/abs/2511.05613), examining 186 first-party reports and 183 third-party sources, the lack of a common format turned what should have been a straightforward analysis into weeks of manual data wrangling. 
That work made the case for why something like Every Eval Ever needed to exist: even though a lot of evaluation data is available open source, it is in incompatible formats, with no shared infrastructure to aggregate or compare it. This schema enables research. Beyond just good documentation hygiene, [researchers already used the repository in a multi-author EvalEval effort to analyze benchmark saturation across 60 benchmarks](https://evalevalai.com/projects/bench-sat/), finding that nearly half had lost their ability to differentiate top-performing models. Centralized, standardized evaluation data opens up more: seeing where the ecosystem is thin, which capabilities are over-measured, and which risks are neglected. With instance-level data, researchers can move beyond leaderboard averages to study item difficulty, robustness, and temporal drift. Every Eval Ever enables meta-evaluation: testing evaluation methods themselves to distinguish real progress from artifacts of setup and reporting. -We need your help. We're launching a [Shared Task](evalevalai.com/events/) for practitioners alongside this post — two tracks for contributing public and proprietary eval data to the repository, with co-authorship for qualifying contributors and a [workshop at ACL 2026 in San Diego](https://evalevalai.com/events/2026-acl-workshop/). +We need your help. We're launching a [Shared Task](https://evalevalai.com/events/) for practitioners alongside this post — two tracks for contributing public and proprietary eval data to the repository, with co-authorship for qualifying contributors and a [workshop at ACL 2026 in San Diego](https://evalevalai.com/events/2026-acl-workshop/). *Submissions open now, deadline May 1, 2026.* ## Get involved -- Try the schema 📋 : [Hugging Face Space](https://huggingface.co/spaces/evaleval/every_eval_ever_space) and [GitHub](github.com/evaleval/every_eval_ever) +- Try the schema 📋 : [HuggingFace Datastore](https://huggingface.co/datasets/evaleval/EEE_datastore) and [GitHub](https://github.com/evaleval/every_eval_ever) -- Join the Shared Task 🏁 : [Call for Participation](evalevalai.com/events/) +- Join the Shared Task 🏁 : [Call for Participation](https://evalevalai.com/events/shared-task-every-eval-ever/) - Join the community 💬 : [Reach out to be added](mailto:jan.batzner@tum.de) @@ -172,18 +172,18 @@ We need your help. 
We're launching a [Shared Task](evalevalai.com/events/) for p ```bibtex @misc{evaleval2026everyevalever, title = {Every Eval Ever: Toward a Common Language for AI Eval Reporting}, - author = {Jan Batzner and Leshem Coshen and Avijit Ghosh and Sree Harsha Nelaturu and Anastassia Kornilova and Damian Stachura and Anka Reuel and Yifan Mai and Asaf Yehudai and Irene Solaiman and Stella Biderman}, + author = {Jan Batzner and Leshem Choshen and Avijit Ghosh and Sree Harsha Nelaturu and Anastassia Kornilova and Damian Stachura and Anka Reuel and Yifan Mai and Asaf Yehudai and Irene Solaiman and Stella Biderman}, year = {2026}, month = {February}, - url = {https://evaleval.github.io/2026/02/16/everyevalever-launch/}, + url = {https://evaleval.github.io/2026/02/15/everyevalever-launch/}, note = {Blog Post, EvalEval Coalition} } ``` -### Feedback and Advise +### Acknowledgement and Feedback We acknowledge feedback by, but not limited to, JJ Allaire (Inspect, Meridian Labs), Ryan Steed (US CAISI), Zee Talat (University of Edinburgh), Gal Moyal (Noma Security), Sean McGregor (AVERI), Joal Stein -(WeVal/CiP), Srishti Yadav (ELLIS Copenhagen), Andrew Tran (AWS), Sanchit Ahuja (Northeastern), Volker Stocker (Weizenbaum, TUB), Marek Šuppa (Slido), Stefan Schmid (Weizenbaum, TUB), Gjergji Kasneci (TUM, MCML). +(WeVal/CiP), Srishti Yadav (ELLIS Copenhagen), Andrew Tran (AWS), Sanchit Ahuja (Northeastern), Marek Šuppa (Slido), Stefan Schmid (Weizenbaum, TUB), Gjergji Kasneci (TUM, MCML).
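Before opening a PR against the datastore, contributors may want to sanity-check a result file locally. The snippet below is a minimal sketch of what that could look like with the generic `jsonschema` package; the file paths are hypothetical placeholders, and the repository's own `validate_data.py` remains the authoritative check (its interface may differ from this sketch).

```python
# Minimal local sanity check (illustrative only; the result-file path is a
# placeholder and the repo's validate_data.py is the authoritative validator).
import json
from jsonschema import validate, ValidationError  # pip install jsonschema

with open("eval.schema.json") as f:  # schema file from the Every Eval Ever repo
    schema = json.load(f)
with open("data/HumanEval/acme/acme-7b/0000.json") as f:  # hypothetical result file
    record = json.load(f)

try:
    validate(instance=record, schema=schema)
    print("Record conforms to the schema.")
except ValidationError as err:
    print(f"Schema violation at {list(err.absolute_path)}: {err.message}")
```

Validation in the repository itself runs automatically on every submission, so a local check like this is purely a convenience for catching issues before pushing.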