EvalEval Coalition — "We are a researcher community developing scientifically grounded research outputs and robust deployment infrastructure for broader impact evaluations."
Every Eval Ever is a shared schema and crowdsourced eval database. It defines a standardized metadata format for storing AI evaluation results — from leaderboard scrapes and research papers to local evaluation runs — so that results from different frameworks can be compared, reproduced, and reused. The three components that make it work:
- 📋 A metadata schema (`eval.schema.json`) that defines the information needed for meaningful comparison of evaluation results, including instance-level data
- 🔧 Validation that checks data against the schema before it enters the repository
- 🔌 Converters for Inspect AI, HELM, and lm-eval-harness, so you can transform your existing evaluation logs into the standard format
| Term | Our Definition | Example |
|---|---|---|
| Single Benchmark | Standardized eval using one dataset to test a single capability, producing one score | MMLU — ~15k multiple-choice QA across 57 subjects |
| Composite Benchmark | A collection of simple benchmarks aggregated into one overall score, testing multiple capabilities at once | BIG-Bench bundles >200 tasks with a single aggregate score |
| Metric | Any numerical or categorical value used to score performance on a benchmark (accuracy, F1, precision, recall, …) | A model scores 92% accuracy on MMLU |
New data can be contributed to the Hugging Face Dataset using the following process:
Leaderboard/evaluation data is split into files by individual model, and data for each model is stored using `eval.schema.json`. The repository is structured into folders as `data/{benchmark_name}/{developer_name}/{model_name}/`.
- Data must conform to `eval.schema.json` (current version: `0.2.0`)
- Validation runs automatically on every PR via `validate_data.py`
- An EvalEval member will review and merge your submission
Each JSON file is named with a UUID (Universally Unique Identifier) in the format {uuid}.json. The UUID is automatically generated (using standard UUID v4) when creating a new evaluation result file. This ensures that:
- Multiple evaluations of the same model can exist without conflicts (each gets a unique UUID)
- Different timestamps are stored as separate files with different UUIDs (not as separate folders)
- A model may have multiple result files, with each file representing different iterations or runs of the leaderboard/evaluation
- UUIDs can be generated using Python's `uuid.uuid4()` function.
Example: The model `openai/gpt-4o-2024-11-20` might have multiple files like:
- `e70acf51-30ef-4c20-b7cc-51704d114d70.json` (evaluation run #1)
- `a1b2c3d4-5678-90ab-cdef-1234567890ab.json` (evaluation run #2)
Note: Each file can contain multiple individual results related to one model. See examples in /data.
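For reference, here is a minimal Python sketch for generating such a filename. The benchmark codename and the placeholder content are hypothetical; real files must conform to `eval.schema.json`.

```python
import json
import uuid
from pathlib import Path

# Hypothetical benchmark codename and model, following
# data/{benchmark_name}/{developer_name}/{model_name}/
out_dir = Path("data/my-benchmark/openai/gpt-4o-2024-11-20")
out_dir.mkdir(parents=True, exist_ok=True)

# A standard UUID v4 gives each evaluation run its own file name.
out_path = out_dir / f"{uuid.uuid4()}.json"
# Placeholder content only; real files must conform to eval.schema.json.
out_path.write_text(json.dumps({}, indent=2))
print(out_path)
```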
- Add a new folder under `data/` with a codename for your eval.
- For each model, use the HuggingFace (`developer_name/model_name`) naming convention to create a 2-tier folder structure.
- Add a JSON file with results for each model and name it `{uuid}.json`.
- [Optional] Include a `utils/` folder in your benchmark name folder with any scripts used to generate the data (see e.g. `utils/global-mmlu-lite/adapter.py`).
- [Validate] Validation runs automatically via `validate-data.yml` using `validate_data.py` to check JSON files against the schema before merging.
- [Submit] Two ways to submit your evaluation data:
  - Option A: Drag & drop via Hugging Face — Go to evaleval/EEE_datastore → click "Files and versions" → "Contribute" → "Upload files" → drag and drop your data → select "Open as a pull request to the main branch". See step-by-step screenshots.
  - Option B: Clone & PR — Clone the HuggingFace repository, add your data under `data/`, and open a pull request.
- `model_info`: Use HuggingFace formatting (`developer_name/model_name`). If a model does not come from HuggingFace, use the exact API reference. Check examples in `/data/livecodebenchpro`. Note that some model names include a date and others do not. For example:
  - OpenAI: `gpt-4o-2024-11-20`, `gpt-5-2025-08-07`, `o3-2025-04-16`
  - Anthropic: `claude-3-7-sonnet-20250219`, `claude-3-sonnet-20240229`
  - Google: `gemini-2.5-pro`, `gemini-2.5-flash`
  - xAI (Grok): `grok-2-2024-08-13`, `grok-3-2025-01-15`
- `evaluation_id`: Use the `{benchmark_name/model_id/retrieved_timestamp}` format, e.g. `livecodebenchpro/qwen3-235b-a22b-thinking-2507/1760492095.8105888` (see the illustrative fragment after this list).
- `inference_platform` vs `inference_engine`: Where possible, specify where the evaluation was run using one of these two fields.
  - `inference_platform`: Use this field when the evaluation was run through a remote API (e.g. `openai`, `huggingface`, `openrouter`, `anthropic`, `xai`).
  - `inference_engine`: Use this field when the evaluation was run locally. This is now an object with `name` and `version` (e.g. `{"name": "vllm", "version": "0.6.0"}`).
- The `source_type` on `source_metadata` has two options: `documentation` and `evaluation_run`. Use `documentation` when results are scraped from a leaderboard or paper. Use `evaluation_run` when the evaluation was run locally (e.g. via an eval converter).
- `source_data` is specified per evaluation result (inside `evaluation_results`), with three variants:
  - `source_type: "url"` — link to a web source (e.g. a leaderboard API)
  - `source_type: "hf_dataset"` — reference to a HuggingFace dataset (e.g. `{"hf_repo": "google/IFEval"}`)
  - `source_type: "other"` — for private or proprietary datasets
- The schema is designed to accommodate both numeric and level-based (e.g. Low, Medium, High) metrics. For level-based metrics, the actual `value` should be converted to an integer (e.g. Low = 1, Medium = 2, High = 3), and the `level_names` property should be used to specify the mapping of levels to integers.
- Timestamps: The schema has three timestamp fields — use them as follows:
  - `retrieved_timestamp` (required) — when this record was created, in Unix epoch format (e.g. `1760492095.8105888`)
  - `evaluation_timestamp` (top-level, optional) — when the evaluation was run
  - `evaluation_results[].evaluation_timestamp` (per-result, optional) — when a specific evaluation result was produced, if different results were run at different times
- Additional details can be provided in several places in the schema. They are not required, but can be useful for detailed analysis.
  - `model_info.additional_details`: Use this field to provide any additional information about the model itself (e.g. number of parameters)
  - `evaluation_results.generation_config.generation_args`: Specify additional arguments used to generate outputs from the model
  - `evaluation_results.generation_config.additional_details`: Use this field to provide any additional information about the evaluation process that is not captured elsewhere
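To tie several of these conventions together, here is a minimal illustrative fragment. It is a sketch, not a complete record: the benchmark and organization names ("my-benchmark", "My Lab") are hypothetical, required fields are omitted for brevity, and the top-level placement of `inference_engine` is an assumption; consult `eval.schema.json` for the authoritative structure.

```json
{
  "evaluation_id": "my-benchmark/gpt-4o-2024-11-20/1760492095.8105888",
  "retrieved_timestamp": 1760492095.8105888,
  "inference_engine": {"name": "vllm", "version": "0.6.0"},
  "source_metadata": {
    "source_name": "My Benchmark",
    "source_type": "evaluation_run",
    "source_organization_name": "My Lab",
    "evaluator_relationship": "third_party"
  }
}
```

A real submission would also include the model, source, and result fields shown in the walkthrough below.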
For evaluations that include per-sample results, the individual results should be stored in a companion `{uuid}.jsonl` file in the same folder (one JSONL per JSON, sharing the same UUID). The aggregate JSON file refers to its JSONL via the `detailed_evaluation_results` field. The instance-level schema (`instance_level_eval.schema.json`) supports three interaction types:
- `single_turn`: Standard QA, MCQ, classification — uses the `output` object
- `multi_turn`: Conversational evaluations with multiple exchanges — uses the `interactions` array
- `agentic`: Tool-using evaluations with function calls and sandbox execution — uses the `interactions` array with `tool_calls`

Each instance captures: `input` (raw question + reference answer), `answer_attribution` (how the answer was extracted), `evaluation` (score, `is_correct`), and optional `token_usage` and performance metrics. Instance-level JSONL files are produced automatically by the eval converters.
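For example, assuming `detailed_evaluation_results` simply holds the companion file's name (check `eval.schema.json` for the exact value format), the aggregate file might include:

```json
{
  "detailed_evaluation_results": "e70acf51-30ef-4c20-b7cc-51704d114d70.jsonl"
}
```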
Example `single_turn` instance:

```json
{
  "schema_version": "instance_level_eval_0.2.0",
  "evaluation_id": "math_eval/meta-llama/Llama-2-7b-chat/1706000000",
  "model_id": "meta-llama/Llama-2-7b-chat",
  "evaluation_name": "math_eval",
  "sample_id": 4,
  "interaction_type": "single_turn",
  "input": { "raw": "If 2^10 = 4^x, what is the value of x?", "reference": "5" },
  "output": { "raw": "Rewrite 4 as 2^2, so 4^x = 2^(2x). Since 2^10 = 2^(2x), x = 5." },
  "answer_attribution": [{ "source": "output.raw", "extracted_value": "5" }],
  "evaluation": { "score": 1.0, "is_correct": true }
}
```

For agentic evaluations (e.g., SWE-Bench, GAIA), the aggregate schema captures configuration under `generation_config.generation_args`:
```json
{
  "agentic_eval_config": {
    "available_tools": [
      {"name": "bash", "description": "Execute shell commands"},
      {"name": "edit_file", "description": "Edit files in the repository"}
    ]
  },
  "eval_limits": {"message_limit": 30, "token_limit": 100000},
  "sandbox": {"type": "docker", "config": "compose.yaml"}
}
```

At the instance level, agentic evaluations use `interaction_type: "agentic"` with full tool call traces recorded in the `interactions` array. See the Inspect AI test fixture for a GAIA example with a docker sandbox and tool usage.
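A heavily simplified sketch of what such an instance might contain is shown below. Only `interaction_type`, `interactions`, and `tool_calls` are field names confirmed above; the keys inside each interaction (`role`, `content`, `arguments`, `cmd`) are hypothetical, so see `instance_level_eval.schema.json` and the Inspect AI test fixture for the real structure.

```json
{
  "schema_version": "instance_level_eval_0.2.0",
  "interaction_type": "agentic",
  "interactions": [
    {
      "role": "assistant",
      "content": "Let me inspect the repository first.",
      "tool_calls": [{"name": "bash", "arguments": {"cmd": "ls"}}]
    }
  ],
  "evaluation": { "score": 1.0, "is_correct": true }
}
```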
This repository has a pre-commit hook that validates that JSON files conform to the JSON schema. The pre-commit requires using `uv` for dependency management.
To run the pre-commit on git staged files only:

```bash
uv run pre-commit run
```

To run the pre-commit on all files:

```bash
uv run pre-commit run --all-files
```

To run the pre-commit on specific files:

```bash
uv run pre-commit run --files a.json b.json c.json
```

To install the pre-commit so that it will run before `git commit` (optional):

```bash
uv run pre-commit install
```

The data in this repository is organized as follows:

```
data/
└── {benchmark_name}/
    └── {developer_name}/
        └── {model_name}/
            ├── {uuid}.json    # aggregate results
            └── {uuid}.jsonl   # instance-level results (optional)
```
Example evaluations included in the schema v0.2 release:
| Evaluation | Data |
|---|---|
| Global MMLU Lite | data/global-mmlu-lite/ |
| HELM Capabilities v1.15 | data/helm_capabilities/ |
| HELM Classic | data/helm_classic/ |
| HELM Instruct | data/helm_instruct/ |
| HELM Lite | data/helm_lite/ |
| HELM MMLU | data/helm_mmlu/ |
| HF Open LLM Leaderboard v2 | data/hfopenllm_v2/ |
| LiveCodeBench Pro | data/livecodebenchpro/ |
| RewardBench | data/reward-bench/ |
Schemas: `eval.schema.json` (aggregate) · `instance_level_eval.schema.json` (per-sample JSONL)
Each evaluation has its own directory under `data/`. Within each evaluation, models are organized by developer and model name. Instance-level data is stored in optional `{uuid}.jsonl` files alongside aggregate `{uuid}.json` results.
For a detailed walk-through, see the blogpost.
Each result file captures not just scores but the context needed to interpret and reuse them. Here's how it works, piece by piece:
Where did the evaluation come from? Source metadata tracks who ran it, where the data was published, and the relationship to the model developer:
"source_metadata": {
"source_name": "Live Code Bench Pro",
"source_type": "documentation",
"source_organization_name": "LiveCodeBench",
"evaluator_relationship": "third_party"
}Generation settings matter. Changing temperature or the number of samples alone can shift scores by several points — yet they're routinely absent from leaderboards. We capture them explicitly:
"generation_config": {
"generation_args": {
"temperature": 0.2,
"top_p": 0.95,
"max_tokens": 2048
}
}The score itself. A score of 0.31 on a coding benchmark (pass@1) means higher is better. The same 0.31 on RealToxicityPrompts means lower is better. The schema standardizes this interpretation:
"evaluation_results": [{
"evaluation_name": "code_generation",
"metric_config": {
"evaluation_description": "pass@1 on code generation tasks",
"lower_is_better": false,
"score_type": "continuous",
"min_score": 0,
"max_score": 1
},
"score_details": {
"score": 0.31
}
}]The schema also supports level-based metrics (e.g. Low/Medium/High) and uncertainty reporting (confidence intervals, standard errors). See eval.schema.json for the full specification.
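For instance, a level-based metric might be recorded roughly as follows. This is a hedged sketch: `risk_rating` is a made-up metric, and the `score_type` value and the placement of `level_names` inside `metric_config` are assumptions.

```json
"evaluation_results": [{
  "evaluation_name": "risk_rating",
  "metric_config": {
    "evaluation_description": "Qualitative risk rating mapped to integers (Low = 1, Medium = 2, High = 3)",
    "lower_is_better": true,
    "score_type": "levels",
    "level_names": {"Low": 1, "Medium": 2, "High": 3},
    "min_score": 1,
    "max_score": 3
  },
  "score_details": {
    "score": 2
  }
}]
```

The authoritative property names and allowed values live in `eval.schema.json`.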
Run following bash commands to generate pydantic classes for eval.schema.json and instance_level_eval.schema.json (to easier use in data converter scripts):
uv run datamodel-codegen --input eval.schema.json --output eval_types.py --class-name EvaluationLog --output-model-type pydantic_v2.BaseModel --input-file-type jsonschema --formatters ruff-format ruff-check
uv run datamodel-codegen --input instance_level_eval.schema.json --output instance_level_types.py --class-name InstanceLevelEvaluationLog --output-model-type pydantic_v2.BaseModel --input-file-type jsonschema --formatters ruff-format ruff-checkWe have prepared converters to make adapting to our schema as easy as possible. At the moment, we support converting local evaluation harness logs from Inspect AI, HELM and lm-evaluation-harness into our unified schema. Each converter produces aggregate JSON and optionally instance-level JSONL output.
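As a quick sketch of how the generated classes might be used to load and validate a result file (assuming the commands above produced `eval_types.py`; the file path is hypothetical):

```python
from pathlib import Path

from eval_types import EvaluationLog  # generated by the datamodel-codegen command above

# Hypothetical aggregate result file, following the datastore layout.
path = Path("data/my-benchmark/openai/gpt-4o-2024-11-20/e70acf51-30ef-4c20-b7cc-51704d114d70.json")

# Pydantic v2: parse the JSON and validate it against the generated model.
record = EvaluationLog.model_validate_json(path.read_text())
print(record.model_dump_json(indent=2)[:300])
```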
We have prepared converters to make adapting to our schema as easy as possible. At the moment, we support converting local evaluation harness logs from Inspect AI, HELM and lm-evaluation-harness into our unified schema. Each converter produces aggregate JSON and optionally instance-level JSONL output.

| Framework | Command | Instance-Level JSONL |
|---|---|---|
| Inspect AI | `uv run python3 -m eval_converters.inspect --log_path <path>` | Yes, if samples in log |
| HELM | `uv run python3 -m eval_converters.helm --log_path <path>` | Always |
| lm-evaluation-harness | `uv run python -m eval_converters.lm_eval --log_path <path>` | With `--include_samples` |
For full CLI usage and required input files, see the Eval Converters README.
We are running a Shared Task at ACL 2026 in San Diego (July 7, 2026). The task invites participants to contribute to a unifying database of eval results:
- Track 1: Public Eval Data Parsing — Parse leaderboards (Chatbot Arena, Open LLM Leaderboard, AlpacaEval, etc.) and academic papers into our schema.
- Track 2: Proprietary Evaluation Data — Convert proprietary evaluation datasets into our schema.
| Milestone | Date |
|---|---|
| Submission deadline | May 1, 2026 |
| Results announced | June 1, 2026 |
| Workshop at ACL 2026 | July 7, 2026 |
Qualifying contributors will be invited as co-authors on the shared task paper.
```bibtex
@misc{everyevalever2026schema,
  title  = {Every Eval Ever Metadata Schema v0.2},
  author = {EvalEval Coalition},
  year   = {2026},
  month  = {February},
  url    = {https://github.com/evaleval/every_eval_ever},
  note   = {Schema Release}
}
```