Official code release for the paper:
MemRL: Self-Evolving Agents via Runtime Reinforcement Learning on Episodic Memory
The hallmark of human intelligence is the self-evolving ability to master new skills by learning from past experiences. However, current AI agents struggle to emulate this self-evolution: fine-tuning is computationally expensive and prone to catastrophic forgetting, while existing memory-based methods rely on passive semantic matching that often retrieves noise. To address these challenges, we propose MemRL, a non-parametric approach that evolves via reinforcement learning on episodic memory. By decoupling stable reasoning from plastic memory, MemRL employs a Two-Phase Retrieval mechanism to filter noise and identify high-utility strategies through environmental feedback. Extensive experiments on HLE, BigCodeBench, ALFWorld, and Lifelong Agent Bench demonstrate that MemRL significantly outperforms state-of-the-art baselines, confirming that MemRL effectively reconciles the stability-plasticity dilemma, enabling continuous runtime improvement without weight updates.
Framework overview figure: `framework_overview.png` (preview), `framework_overview.pdf` (vector).
This repo is a Python package under the memrl namespace.
Install MemRL plus the dependencies needed to run all 4 benchmark entrypoints under run/.
```bash
conda create -n memoryrl python=3.10 -y
conda activate memoryrl
pip install -U pip
pip install -r requirements.txt
```

All benchmarks read LLM and embedding settings from the YAML configs under `configs/`.
Before running, set at least (see the sketch below):
- `llm.api_key`
- `embedding.api_key`
- (optional) `llm.base_url` / `embedding.base_url` for OpenAI-compatible endpoints (vLLM, etc.)
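A minimal sketch of the relevant config section, assuming the dotted key names above map to nested YAML sections (values are placeholders, not shipped defaults):

```yaml
llm:
  api_key: "YOUR_LLM_API_KEY"            # required
  base_url: "http://localhost:8000/v1"   # optional: OpenAI-compatible endpoint (e.g. vLLM)
embedding:
  api_key: "YOUR_EMBEDDING_API_KEY"      # required
  base_url: "http://localhost:8000/v1"   # optional
```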
Example configs:
- `configs/rl_bcb_config.yaml` (BigCodeBench)
- `configs/rl_llb_config.yaml` (Lifelong Agent Bench)
- `configs/rl_alf_config.yaml` (ALFWorld)
- `configs/rl_hle_config.yaml` (HLE)
All runners write logs under logs/ and results under results/ (configurable via experiment.output_dir).
Run (HLE):
```bash
python run/run_hle.py \
  --config configs/rl_hle_config.yaml \
  --train /path/to/hle_train.parquet
```

Notes:
- The runner accepts `--categories` and `--category_ratio` for category filtering/sampling.
- Data can be found at HLE.
- `--judge_model` controls an optional separate judge LLM. We use GPT-4o to align with artificialanalysis.
Run (ALFWorld):
```bash
python run/run_alfworld.py --config configs/rl_alf_config.yaml
```

Important notes:
- You must install ALFWorld and prepare its data according to the ALFWorld setup.
- This repo expects an ALFWorld environment config at `configs/envs/alfworld.yaml` (provided).
- Few-shot examples are expected at `data/alfworld/alfworld_examples.json` (provided, same as ReAct), configurable via `experiment.few_shot_path` (see the sketch below).
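If you want to point the runner at a different few-shot file, a minimal sketch of that setting, assuming `experiment` is a top-level section of `configs/rl_alf_config.yaml`:

```yaml
experiment:
  few_shot_path: data/alfworld/alfworld_examples.json  # default (same as ReAct); replace with your own file
```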
This repo vendors LifelongAgentBench under 3rdparty/LifelongAgentBench and runs it through memrl/run/llb_rl_runner.py.
Docker setup:
- LLB tasks (`db`/`os`) require Docker environments. Please follow the Docker deployment instructions in LifelongAgentBench to build and start the required containers before running.
Quick start:
- Edit `configs/rl_llb_config.local.yaml` if it exists (preferred by `run/run_llb.py`); otherwise edit `configs/rl_llb_config.yaml`. Set at least the following (sketched after this list):
  - `llm.api_key` / `embedding.api_key`
  - `experiment.task` (`db`|`os`; also accepts `db_bench`/`os_interaction`)
  - `experiment.split_file` (and optional `experiment.valid_file`)
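A minimal sketch of these fields, again assuming the dotted keys map to nested YAML sections (API keys are placeholders; the split paths shown are the files shipped under `data/llb/`):

```yaml
llm:
  api_key: "YOUR_LLM_API_KEY"
embedding:
  api_key: "YOUR_EMBEDDING_API_KEY"
experiment:
  task: db                            # or: os, db_bench, os_interaction
  split_file: data/llb/db_train.json
  valid_file: data/llb/db_val.json    # optional
```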
- Run: `python run/run_llb.py`

Dataset:
- `experiment.split_file` / `experiment.valid_file` should point to a JSON dictionary keyed by `sample_index` (i.e., the top level is an object/dict; keys are strings like `"0"`, values are per-sample dicts).
- This repo provides LLB datasets under `data/llb/`:
  - OSInteraction (task = `os`/`os_interaction`):
    - `data/llb/os_interaction_data.json` (500 samples)
    - `data/llb/os_interaction_train.json` (350 samples)
    - `data/llb/os_interaction_val.json` (150 samples)
  - DBBench (task = `db`/`db_bench`):
    - `data/llb/db_bench_data.json` (500 samples)
    - `data/llb/db_train.json` (361 samples)
    - `data/llb/db_val.json` (139 samples)
Note:
- This open-source release currently supports the LLB tasks `db` and `os` (no `kg`).
Optional tracing (LLB):
- `configs/rl_llb_config.yaml` includes `experiment.trace_jsonl_path` (a minimal sketch follows below).
- You can also control tracing with environment variables (see `memrl/trace/llb_jsonl.py`).
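A minimal sketch of the tracing key, assuming `experiment` is a top-level section (the path is just an illustrative placeholder):

```yaml
experiment:
  trace_jsonl_path: logs/llb_trace.jsonl  # where the JSONL trace is written
```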
Run multi-epoch BCB memory benchmark:
```bash
python run/run_bcb.py \
  --config configs/rl_bcb_config.yaml \
  --split instruct \
  --epochs 10
```

Dataset:
- Default path: `data/bigcodebench/bigcodebench_{hard|full}.jsonl`
- Override with `--data_path /path/to/bigcodebench_hard.jsonl`
If the JSONL is missing, the runner prints an actionable download command (via the `datasets` library).
Splits:
- Default: `configs/bigcodebench/splits/{hard_seed42|full_seed123}.json`
- Override with `--split_file /path/to/split.json`
Notes:
- BigCodeBench evaluation uses the vendored repo under `3rdparty/bigcodebench-main`.
- Default subset is `full`. Use `--subset hard` for the smaller hard subset.
- Retrieval threshold: use `--retrieve_threshold` to override; otherwise it falls back to `rl_config.sim_threshold` (then `rl_config.tau`); see the sketch after these notes.
- TensorBoard (optional): BCB writes scalars under `logs/tensorboard/` when TensorBoard support is available. View with `tensorboard --logdir logs/tensorboard`.
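A minimal sketch of the retrieval-threshold fallback keys, assuming `rl_config` is a top-level section of `configs/rl_bcb_config.yaml` (the numbers are illustrative, not the shipped defaults):

```yaml
rl_config:
  sim_threshold: 0.35  # used when --retrieve_threshold is not passed on the CLI
  tau: 0.35            # consulted only if sim_threshold is absent
```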
On some hosts, the dynamic loader may forcibly preload an old system libstdc++.so.6 (e.g. via /etc/ld.so.preload),
which can break import sqlite3 in a conda environment (and therefore MemOS / SQLAlchemy initialization).
Workaround (run after activating your conda environment, before running any run/run_*.py):
```bash
export LD_PRELOAD="$CONDA_PREFIX/lib/libstdc++.so.6${LD_PRELOAD:+:$LD_PRELOAD}"
python -c "import sqlite3; print('sqlite ok')"
```

If you have root access, you can also inspect the host preload configuration:
```bash
cat /etc/ld.so.preload
```

Repository layout:
- `memrl/`: main library code (MemoryService, runners, providers, tracing)
- `run/`: benchmark entrypoints (`run_bcb.py`, `run_llb.py`, `run_alfworld.py`, `run_hle.py`)
- `configs/`: benchmark configs
- `3rdparty/`: vendored benchmark repos (BigCodeBench, LifelongAgentBench)
If you use MemRL in your research, please cite our paper:
```bibtex
@misc{zhang2026memrlselfevolvingagentsruntime,
  title         = {MemRL: Self-Evolving Agents via Runtime Reinforcement Learning on Episodic Memory},
  author        = {Shengtao Zhang and Jiaqian Wang and Ruiwen Zhou and Junwei Liao and Yuchen Feng and Weinan Zhang and Ying Wen and Zhiyu Li and Feiyu Xiong and Yutao Qi and Bo Tang and Muning Wen},
  year          = {2026},
  eprint        = {2601.03192},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
  url           = {https://arxiv.org/abs/2601.03192},
}
```