TRACEBACK is a multi-agent attribution framework that traces table-based answers back to their supporting cells, providing fine-grained attribution at the cell level rather than coarse row or column granularity.
Table QA systems can produce correct answers yet offer no way to verify which cells actually support them. Existing approaches either skip attribution entirely or operate at coarse row/column granularity, leaving fine-grained evidence trails unaddressed. TRACEBACK closes this gap with a modular, multi-agent pipeline for cell-level attribution in single-table QA.
TRACEBACK works in five key steps:
- Column Pruning — Identify columns relevant to the question via an LLM agent, reducing noise from large tables.
- Row Filtering — Generate and execute SQL to retain only the rows needed for answering (optional, via MySQL/SQLite).
- Sub-query Decomposition — Break the question into atomic sub-queries, each targeting a single fact, with NLI-based filtering for faithfulness.
- Sub-query Attribution — Align each sub-query to specific table cells, capturing both direct evidence and intermediate reasoning steps.
- Final Attribution — Consolidate cell-level evidence across all sub-queries into a unified attribution map.
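The shared orchestration lives in `traceback_workflow.py`. As a rough mental model only, the toy below walks the same five stages on a tiny in-memory table, with simple string heuristics standing in for the LLM, SQL, and NLI agents; the names and logic are illustrative and are not the repository's implementation.

```python
# Toy walk-through of the five TraceBack stages on a tiny in-memory table.
# String heuristics stand in for the LLM/SQL/NLI agents of the real pipeline.
table = {
    "columns": ["Team", "Wins", "Losses"],
    "rows": [["Alpha", "10", "2"], ["Beta", "7", "5"]],
}
question = "How many wins does Alpha have?"
answer = "10"

# Step 1: Column Pruning - keep columns the question actually mentions.
kept_cols = [c for c in table["columns"] if c.lower() in question.lower()]

# Step 2: Row Filtering - keep rows whose values appear in the question.
kept_rows = [i for i, row in enumerate(table["rows"])
             if any(v.lower() in question.lower() for v in row)]

# Step 3: Sub-query Decomposition - this question is already atomic,
# so it yields a single sub-query (no NLI filtering needed here).
subqueries = [question]

# Step 4: Sub-query Attribution - align each sub-query to (row, column) cells.
attribution = {sq: [(r, c) for r in kept_rows for c in kept_cols] for sq in subqueries}

# Step 5: Final Attribution - merge cell evidence across all sub-queries.
final_cells = sorted({cell for cells in attribution.values() for cell in cells})
print(final_cells)  # [(0, 'Wins')] -> the cell supporting the answer "10"
```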
We also introduce CITEBench, a benchmark with phrase-to-cell annotations drawn from ToTTo, FetaQA, and AITQA, and FairScore, a reference-less metric that estimates attribution precision and recall without human cell labels.
This repository contains the code, prompts, datasets, and evaluation pipelines for reproducing and extending TRACEBACK, CITEBench, and FairScore.
```
Traceback/
├── src/
│   ├── main_aitqa.py               # TraceBack runner for AITQA
│   ├── main_fetaqa.py              # TraceBack runner for FetaQA
│   ├── totto_traceback.py          # TraceBack runner for ToTTo
│   ├── LLM.py                      # Unified LLM backends (OpenAI, Gemini, DeepSeek, Novita, HF)
│   ├── database.py                 # MySQL/SQLAlchemy interface for row filtering
│   ├── relevant_col_gen.py         # Step 1: Column relevance via LLM
│   ├── relevant_row_gen.py         # Step 2: Row filtering via SQL generation
│   ├── query_attribution.py        # Step 4: Subquery attribution
│   ├── subqueries.py               # Step 3: Subquery generation
│   ├── eval_aitqa.py               # AITQA precision/recall evaluation
│   ├── eval_fetaqa.py              # FetaQA precision/recall evaluation
│   ├── eval_totto.py               # ToTTo evaluation
│   ├── eval_traceback_md.py        # Combined markdown table (AITQA/FetaQA/ToTTo)
│   ├── eval_traceback_hf_all.py    # HF model results aggregation
│   ├── traceback_citebench_full.py # TraceBack over CITEBENCH
│   ├── eval_citebench_all.py       # CITEBENCH evaluator
│   ├── eval_fairscore.py           # FAIRScore (reference-less) evaluation
│   ├── icl_runner.py               # ICL baseline for CITEBENCH
│   ├── metrics_to_md*.py           # CITEBENCH metrics → markdown converters
│   └── utils.py                    # Shared utilities
├── traceback_workflow.py           # Core 5-step TraceBack workflow
├── Prompts/                        # TraceBack prompt templates
├── Prompts_2/                      # ICL baseline prompt templates
├── Datasets/                       # AITQA, FetaQA, ToTTo, CITEBENCH
└── README.md                       # This file
```
Key scripts:
- `traceback_workflow.py`: shared 5-step TraceBack workflow used by dataset runners
- `src/main_aitqa.py`: TraceBack runner for AITQA
- `src/main_fetaqa.py`: TraceBack runner for FetaQA
- `src/totto_traceback.py`: TraceBack runner for ToTTo
- `src/eval_aitqa.py`, `src/eval_fetaqa.py`, `src/eval_totto.py`: per-dataset P/R evaluation
- `src/eval_traceback_md.py`: one-shot markdown table for AITQA/FetaQA/ToTTo
- `src/traceback_citebench_full.py`: TraceBack over unified CITEBENCH
- `src/eval_citebench_all.py`: CITEBENCH evaluator for style-based outputs
- `src/eval_fairscore.py`: FAIRScore evaluation
- `src/icl_runner.py`: CITEBENCH ICL baseline
Prompt files:
- TraceBack prompts: `Prompts/`
- ICL prompt templates: `Prompts_2/`
Datasets:
- `Datasets/AITQA/aitqa_processed.jsonl`
- `Datasets/FetaQA/fetaQA_dev_processed.jsonl`
- `Datasets/Totto/totto_processed.jsonl`
- `Datasets/CITEBENCH.json`
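A quick way to sanity-check the unified benchmark file is to load it directly; the snippet below is a minimal sketch, assuming you run it from the repository root and that the file is plain JSON (its per-example schema is defined by the evaluators in `src/`).

```python
import json

# Parse the bundled CITEBench file and report how many top-level entries it holds.
with open("Datasets/CITEBENCH.json", encoding="utf-8") as f:
    citebench = json.load(f)

print(f"CITEBENCH top-level entries: {len(citebench)}")
```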
Required (core):
```
pip install pandas sqlalchemy pymysql tqdm openai
```

Optional by backend:
- Gemini backend: `pip install google-genai`
- Local HF backend: `pip install torch transformers accelerate`
Environment variables (as needed by backend):
- `OPENAI_API_KEY`
- `GEMINI_API_KEY` or `GEMINI_API_KEYS`
- `DEEPINFRA_API_KEY` or `DEEPSEEK_API_KEY`
- `NOVITA_API_KEY`
- `HF_TOKEN` or `HUGGINGFACE_HUB_TOKEN` (for gated HF models)
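Optionally, a short pre-flight script can confirm that a key for your chosen backend is set before starting a long run. The backend-to-variable mapping below is an assumption based on the list above, not something the repository ships:

```python
import os
import sys

# Hypothetical pre-flight check (not part of the repository): map each backend
# to the environment variables that can satisfy it, then verify one is set.
BACKEND_KEYS = {
    "openai": ["OPENAI_API_KEY"],
    "gemini": ["GEMINI_API_KEY", "GEMINI_API_KEYS"],
    "deepseek": ["DEEPINFRA_API_KEY", "DEEPSEEK_API_KEY"],
    "novita": ["NOVITA_API_KEY"],
    "hf": ["HF_TOKEN", "HUGGINGFACE_HUB_TOKEN"],
}

backend = sys.argv[1] if len(sys.argv) > 1 else "openai"
keys = BACKEND_KEYS.get(backend)
if keys is None:
    sys.exit(f"Unknown backend '{backend}'; expected one of {sorted(BACKEND_KEYS)}")
if not any(os.environ.get(k) for k in keys):
    sys.exit(f"No key found for backend '{backend}'; set one of: {', '.join(keys)}")
print(f"Backend '{backend}' looks configured.")
```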
Run (AITQA):
```
python src/main_aitqa.py
```
Evaluate (AITQA):
```
python src/eval_aitqa.py
```
Run (FetaQA):
```
python src/main_fetaqa.py
```
Evaluate (FetaQA):
```
python src/eval_fetaqa.py
```
Run (ToTTo):
```
python src/totto_traceback.py
```
Evaluate (ToTTo):
```
python src/eval_totto.py
```
Run with a local HF backend (examples):
```
python src/main_aitqa.py --backend hf --model Qwen/Qwen2.5-7B-Instruct --resume
python src/main_fetaqa.py --backend hf --model google/gemma-3-4b-it --resume
python src/totto_traceback.py --backend hf --model Qwen/Qwen2.5-3B-Instruct --resume
```
Notes:
- For paper-faithful Step 2 row filtering, use the MySQL setup in `MYSQL_USAGE.txt`.
- For quick runs without MySQL, add `--no-mysql` or `--no-row-filtering`.
- Use `--output` to avoid overwriting existing prediction files.
- Use `--resume` to continue interrupted runs.
Combined markdown metrics table (AITQA/FetaQA/ToTTo):
```
python src/eval_traceback_md.py --percent --output results/eval/metrics_traceback.md
```
Run (TraceBack over CITEBENCH):
```
python src/traceback_citebench_full.py \
  --citebench Datasets/CITEBENCH.json \
  --outdir results/TraceBack_full \
  --model gpt-4o
```
Evaluate:
```
python src/eval_citebench_all.py \
  --results-root results/TraceBack_full \
  --gt Datasets/CITEBENCH.json \
  --styles traceback-full \
  --output results/eval/metrics_citebench_traceback_full.json
```
ICL prompt templates:
- `Prompts_2/answer_attr_zero.txt`
- `Prompts_2/answer_attr_zero_cot.txt`
- `Prompts_2/answer_attr_few.txt`
- `Prompts_2/answer_attr_few_cot.txt`
Run (OpenAI example):
```
python src/icl_runner.py \
  --backend openai \
  --models gpt-4o \
  --styles zero zero-cot few few-cot \
  --limit 0 \
  --resume
```
Evaluate:
```
python src/eval_citebench_all.py \
  --results-root results/ICL \
  --gt Datasets/CITEBENCH.json \
  --styles zero zero-cot few few-cot \
  --output results/eval/metrics_citebench_icl.json
```
Default run over TraceBack outputs:
```
python src/eval_fairscore.py --cells pred --backend openai --model gpt-4o
```
Parallel requests (be mindful of rate limits):
```
python src/eval_fairscore.py --cells pred --backend openai --model gpt-4o --workers 4 --max-inflight 4
```
Score both predicted and gold cells:
```
python src/eval_fairscore.py --cells both --backend openai --model gpt-4o
```
Outputs:
- Summary markdown: `results/eval/metrics_fairscore.md`
- Summary JSON: `results/eval/metrics_fairscore.json`
- Cache files: `results/fairscore/`
Use a unified CITEBENCH-style prediction file via `--preds`:
```
python src/eval_fairscore.py \
  --preds results/ICL/gpt-4o/few-cot.json \
  --pred-tag few-cot \
  --cells pred \
  --backend openai --model gpt-4o \
  --workers 4 --max-inflight 4 \
  --summary-md results/eval/metrics_fairscore_few-cot.md \
  --summary-json results/eval/metrics_fairscore_few-cot.json
```
See `MYSQL_USAGE.txt` for full setup/start/stop instructions.
If you use this repository in your research, please cite the accompanying paper (TRACEBACK).
```
@misc{anvekar2026tracebackmultiagentdecompositionfinegrained,
  title={TraceBack: Multi-Agent Decomposition for Fine-Grained Table Attribution},
  author={Tejas Anvekar and Junha Park and Rajat Jha and Devanshu Gupta and Poojah Ganesan and Puneeth Mathur and Vivek Gupta},
  year={2026},
  eprint={2602.13059},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2602.13059},
}
```
Please see the LICENSE file if provided. If absent, contact the authors for licensing information.
Contributions are welcome. Please open an issue or a pull request for fixes and improvements.
