The repository is designed to make it easy to run multiple models on the MapVerse dataset, apply controlled perturbations, and analyze results in a reproducible way.
```
MapVerse/
├── data/
│   ├── imgs/                       # Image assets
│   └── typed_questions.csv         # Input questions CSV
├── eval/
│   ├── Eval_Analysis.ipynb         # Eval: Boolean, Single Entity, Counting, Listing
│   └── Eval_Analysis2.ipynb        # Eval: Ranking, Reasoning
├── legacy/                         # Legacy per-model QA scripts
│   ├── ayavision_qa.py
│   ├── cogvlm_qa.py
│   ├── deepseek_qa.py
│   ├── gemini_qa.py
│   ├── idefics_qa.py
│   ├── intern_qa.py
│   ├── llama_qa.py
│   ├── molmo_qa.py
│   └── qwen_qa.py
├── models/                         # Current unified model wrappers
│   ├── __init__.py
│   ├── ayavision.py
│   ├── cogvlm.py
│   ├── deepseek.py
│   ├── gemini.py
│   ├── idefics.py
│   ├── internvl.py
│   ├── llama.py
│   ├── molmo.py
│   └── qwen.py
├── utils/
│   ├── __init__.py
│   ├── common.py                   # CSV & image loading helpers
│   └── perturb.py                  # Image perturbation functions
├── viz/
│   ├── heat.py                     # Heatmap visualizations
│   └── plot.py                     # UMAP / t-SNE embedding plots
├── openai_qa_with_image.ipynb      # OpenAI batch QA (with images)
├── openai_qa_without_image.ipynb   # OpenAI batch QA (text-only)
├── openai_result_fetch.ipynb       # Fetch & parse OpenAI batch results
├── predict.py                      # Central runner for model wrappers
├── requirements.txt                # Dependencies for all models except DeepSeek
├── requirements_deepseek.txt       # Dependencies specific to DeepSeek models
└── README.md
```
The repository uses separate dependency files to simplify environment setup:
- `requirements.txt`: required for all models except DeepSeek
- `requirements_deepseek.txt`: additional/separate dependencies required to run the DeepSeek models
Typical setup:
```bash
pip install -r requirements.txt
```

For DeepSeek:

```bash
pip install -r requirements_deepseek.txt
```

Data layout:

- `data/typed_questions.csv`: contains the question–answer pairs along with associated metadata (e.g., question type, identifiers, and other annotations used during evaluation).
- `data/imgs/`: contains the corresponding map images referenced by the CSV.
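A quick way to inspect the dataset before running anything (nothing about the schema is assumed here; print the columns to see the actual fields):

```python
import pandas as pd

# Peek at the questions CSV; inspect the columns to see the actual schema
# (question text, answers, question type, image identifiers, etc.).
df = pd.read_csv("data/typed_questions.csv")
print(df.columns.tolist())
print(df.head())
```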
`predict.py` is the main execution script. It dynamically imports a model wrapper from `models/` and runs it over an input CSV containing questions and (optionally) images.
Each model wrapper must define:
- `ask_image_question(...)`: the main inference function
- `CUSTOM_PROMPT`: the prompt template used by the model
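A minimal wrapper skeleton might look like the sketch below; the parameter names of `ask_image_question` are assumptions for illustration, so check an existing wrapper such as `models/qwen.py` for the exact signature:

```python
# models/example.py — hypothetical wrapper skeleton, not an actual file in the repo.

CUSTOM_PROMPT = "Answer the question about the map.\nQuestion: {question}"

def ask_image_question(question, image_path=None):
    """Run one inference call; image_path may be None for text-only runs."""
    prompt = CUSTOM_PROMPT.format(question=question)
    # Model-specific loading, image encoding, and generation would go here.
    raise NotImplementedError
```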
The `models/` directory contains the per-model wrappers. Each wrapper standardizes how a given model is queried, making it easy to swap models without changing the evaluation pipeline.
`utils/common.py` provides shared utilities for:
- Loading CSV files
- Resolving image paths
- Handling missing or optional image inputs
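The helper behavior might look roughly like this sketch (function names here are assumptions, not the actual `utils/common.py` API):

```python
import os
import pandas as pd

def load_questions(csv_path):
    """Load the questions CSV into a DataFrame."""
    return pd.read_csv(csv_path)

def resolve_image_path(image_base, filename):
    """Join the image base dir with a filename; return None if the file is absent."""
    if not isinstance(filename, str) or not filename:
        return None  # handles missing/optional image inputs
    path = os.path.join(image_base, filename)
    return path if os.path.exists(path) else None
```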
`utils/perturb.py` defines the image perturbation functions used for robustness and stress testing (e.g., compression artifacts, occlusions).
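For illustration, a JPEG-compression perturbation could be written with Pillow as below; this is a sketch, not necessarily how `utils/perturb.py` implements it:

```python
from io import BytesIO
from PIL import Image

def jpeg_compress(img, quality=40):
    """Re-encode the image as JPEG at the given quality to add compression artifacts."""
    buf = BytesIO()
    img.convert("RGB").save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    return Image.open(buf)
```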
Run a model over the dataset with:

```bash
python predict.py <model> \
    [--input INPUT] \
    [--output OUTPUT] \
    [--image-base IMAGE_BASE] \
    [--no-image] \
    [--perturb PERTURB]
```

Defaults:

- `--input`: `typed_questions.csv`
- `--image-base`: `data/imgs/`
- `--output`: auto-generated under `results/` if not specified
Run Qwen with image input:
```bash
python predict.py qwen \
    --input typed_questions.csv \
    --image-base data/imgs/
```

Run a text-only model (no image input):

```bash
python predict.py llama --no-image
```

Run with an image perturbation:

```bash
python predict.py qwen --perturb jpeg_compress
python predict.py qwen --perturb add_random_black_box
```

Notes:

- `--perturb` accepts the name of a function defined in `utils/perturb.py`.
- Some perturbation functions also accept an argument from the CLI via the `name:spec` syntax (e.g., `jpeg_compress:40`), provided the function takes an external argument.
- For functions that do not take arguments, use the plain function name only.
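One plausible way the runner could resolve a `name:spec` string into a callable (illustrative only; `predict.py` may parse it differently):

```python
from utils import perturb

def resolve_perturbation(arg):
    """Turn 'jpeg_compress:40' into a one-argument callable over a PIL image."""
    name, _, spec = arg.partition(":")
    fn = getattr(perturb, name)                 # look the function up by name
    if spec:
        return lambda img: fn(img, int(spec))   # assumes a numeric spec
    return fn
```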
Three notebooks at the repository root contain the workflows for running large-scale OpenAI batch jobs:
- `openai_qa_with_image.ipynb`: build and submit batch jobs that include base64-encoded images; poll for completion and save results as JSONL and CSV.
- `openai_qa_without_image.ipynb`: build and submit text-only batch jobs (no images); poll and save results.
- `openai_result_fetch.ipynb`: inspect, retrieve, and download results from previously submitted batch jobs, and parse JSONL outputs.
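For reference, a single request line in an OpenAI batch input file pairs a `custom_id` with a chat-completions body; with a base64-encoded image it looks roughly like this (the model name is a placeholder):

```python
import base64
import json

def build_request_line(custom_id, question, image_path):
    """Serialize one chat-completion request for a Batch API input JSONL file."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    body = {
        "model": "gpt-4o",  # placeholder; use whatever model the notebook targets
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    }
    return json.dumps({"custom_id": custom_id, "method": "POST",
                       "url": "/v1/chat/completions", "body": body})
```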
`viz/heat.py`:

- Edit the CSV path at the top of the script if needed.
- Generates heatmap visualizations.
- Output images are saved to `heat_maps/`.
`viz/plot.py`:

- Edit the constants at the top of the script before running: `CSV_PATH`, `IMAGE_ROOT`, `USE_UMAP`.
- Supports UMAP and t-SNE visualizations.
- Outputs are saved to `plots_umap/` and `plots_tsne/` (see the sketch below).
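The projection step typically reduces high-dimensional features to 2-D; a sketch under the assumption that the features live in a NumPy array (not the exact code in `viz/plot.py`):

```python
import os
import numpy as np
import matplotlib.pyplot as plt

USE_UMAP = True  # mirrors the toggle described above

def project_2d(embeddings):
    """Reduce high-dimensional embeddings to 2-D with UMAP or t-SNE."""
    if USE_UMAP:
        import umap  # from the umap-learn package
        return umap.UMAP(n_components=2).fit_transform(embeddings)
    from sklearn.manifold import TSNE
    return TSNE(n_components=2).fit_transform(embeddings)

coords = project_2d(np.random.rand(200, 512))  # dummy features for illustration
outdir = "plots_umap" if USE_UMAP else "plots_tsne"
os.makedirs(outdir, exist_ok=True)
plt.scatter(coords[:, 0], coords[:, 1], s=8)
plt.savefig(os.path.join(outdir, "example.png"))
```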
`Eval_Analysis.ipynb` evaluates the following question types:
- Boolean
- Single Entity
- Counting
- Listing
`Eval_Analysis2.ipynb` evaluates more complex question types:
- Ranking
- Reasoning
These notebooks aggregate model outputs and compute task-specific metrics to compare performance across models and settings.
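As a toy illustration of the kind of metric computed there, exact-match accuracy on Boolean questions might be calculated like this (the column names are assumptions about the merged results CSV):

```python
import pandas as pd

def boolean_accuracy(df):
    """Exact-match accuracy on Boolean questions, normalizing case and whitespace."""
    subset = df[df["question_type"] == "Boolean"]          # assumed column name
    pred = subset["prediction"].astype(str).str.strip().str.lower()
    gold = subset["answer"].astype(str).str.strip().str.lower()
    return float((pred == gold).mean())
```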