SharedLLM

Introduction

This repository contains the core code of the paper "Stacked from One: Multi-Scale Self-Injection for Context Window Extension", accepted at ICLR 2026.

Usage

Data Preparation

To train the model on downsampled RedPajama and Activation Beacon data, refer to the following resources to prepare the data (a minimal download sketch for the SFT datasets follows the list):

  1. Pre-training, downsampled RedPajama: https://github.com/princeton-nlp/CEPE

  2. SFT: LongAlpaca-12k (https://huggingface.co/datasets/Yukang/LongAlpaca-12k) and BookSum (https://huggingface.co/datasets/kmfoda/booksum)

  3. Synthetic data (highly encouraged): see Llama-3-8B-Instruct-262k and Synthetic Data for Multi-Doc QA for details
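
For the SFT data, a minimal download sketch using the Hugging Face datasets library is shown below. The output paths and JSONL format are assumptions; the actual preprocessing should follow the repositories linked above.

from datasets import load_dataset

# Minimal sketch: pull the SFT datasets referenced above.
# Output paths and the JSONL format are placeholders; follow the linked
# repositories for the exact preprocessing expected by the training scripts.
long_alpaca = load_dataset("Yukang/LongAlpaca-12k")   # long-instruction SFT data
booksum = load_dataset("kmfoda/booksum")              # long-document summarization data

long_alpaca["train"].to_json("data/longalpaca_12k.jsonl")
booksum["train"].to_json("data/booksum_train.jsonl")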

Training

To train SharedLLM on RedPajama, use the following command to start training (by default we use 8x NVIDIA A800 GPUs with DeepSpeed):

CUDA_VISIBLE_DEVICES=$CVD deepspeed train.py --model_name_or_path <llama_path> \
                                             --encoder_name_or_path <llama_path> \
                                             --config <path_to_config> \
                                             --model_class sharedllm \
                                             --output_dir output/sharedllm_7b \
                                             --deepspeed <path_to_deepspeed_config>
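
The DeepSpeed configuration passed via --deepspeed is not specified here. The snippet below writes a typical ZeRO-2 + bf16 config as an illustration only; the values are assumptions rather than the settings used in the paper.

import json

# Illustrative DeepSpeed config (ZeRO-2 + bf16). All values are assumptions;
# tune them to your hardware and pass the resulting file path via --deepspeed.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 8,
    "bf16": {"enabled": True},
    "zero_optimization": {"stage": 2, "overlap_comm": True},
    "gradient_clipping": 1.0,
}

with open("configs/ds_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)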

For mixed-dataset training, simply replace train.py with train_beacon.py and use the corresponding configuration files.

Testing

For language-modeling evaluation, here is an example of testing the model on 8K-token inputs from the arXiv domain. We use a single A800 (80 GB) GPU for this experiment:

python eval_lm.py --config configs/test/test_ab_4x1024_4096 \
                  --model_name_or_path <path_to_model_ckpt> \
                  --model_class sharedllm  \
                  --validation_domains arxiv \
                  --output_dir output/<experiment_name>

For evaluation on LongBench and InfBench, refer to their respective repositories and insert the model-loading code into their original evaluation scripts (a rough sketch is given below). Note that the input loader must also be modified, since the original implementations only support plain decoder-only (GPT-style) architectures, which differ from ours.
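
As a rough illustration of the model-loading code to insert, the sketch below loads a checkpoint and wraps generation. The SharedLLM import path, class name, and the context_input_ids keyword are assumptions and must be adapted to the actual classes in this repository and to each benchmark's loader.

from transformers import AutoTokenizer
# Hypothetical import path; use the actual model definition from this repository.
from modeling_sharedllm import SharedLLM

ckpt = "output/sharedllm_7b"  # placeholder checkpoint path
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = SharedLLM.from_pretrained(ckpt).cuda().eval()

def generate_answer(context, question, max_new_tokens=128):
    # Unlike plain decoder-only baselines, the long context and the query are
    # fed separately so the encoder branch can ingest the context.
    # `context_input_ids` is an assumed keyword; match it to the real
    # generate/forward signature of the model class.
    query = tokenizer(question, return_tensors="pt").to("cuda")
    context_ids = tokenizer(context, return_tensors="pt").input_ids.to("cuda")
    output = model.generate(**query,
                            context_input_ids=context_ids,
                            max_new_tokens=max_new_tokens)
    return tokenizer.decode(output[0], skip_special_tokens=True)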

Citation

@misc{anonymous2025stacked,
  title={Stacked from One: Multi-Scale Self-Injection for Context Window Extension},
  author={Han, Wei and Zhou, Pan and Yan, Shuicheng},
  year={2025},
  url={https://openreview.net/forum?id=w1Qpbkb7C6}
}

If you have any further questions about this work, feel free to contact me via henryhan88888@gmail.com.
