This is the PyTorch implementation of E2LLM from the EMNLP'25 paper: E2LLM: Encoder Elongated Large Language Models for Long-Context Understanding and Reasoning.
Abstract

- We propose E2LLM, a novel long-context modeling framework built on pre-trained text encoders and decoder-only LLMs to effectively address the "impossible triangle" challenge.
- We introduce two training objectives: soft prompt reconstruction and long-context instruction fine-tuning, enabling the LLM to understand the soft prompt while reasoning about accurate outputs.
- Comprehensive experiments conducted on diverse benchmarks demonstrate the efficiency and practicality of E2LLM and reveal its superiority over 8 SOTA baselines and its competency on LongBench v2.
Requirements:

- Ubuntu OS
- python==3.10
- torch==2.0.1
- cuda==11.7
- accelerate==0.23.0
- transformers==4.36.0
- deepspeed==0.9.3
- flash-attn==2.3.6
- peft==0.7.0
- scikit-learn==1.3.0
Dependencies can be installed with:

```bash
pip install -r requirements.txt
```

The overall directory structure is as follows:
```
${CODE_ROOT}
|-- configs
|   |-- eval_config.json
|   |-- lora_modules.json
|   |-- model2maxlen.json
|   |-- train_config.json
|-- dataset
|   |-- __init__.py
|   |-- dataset.py
|-- evaluate
|   |-- __init__.py
|   |-- em_quality.py
|   |-- f1_qa.py
|   |-- niah_metric.py
|   |-- rouge_sum.py
|-- local
|   |-- ds_config_zero2.yaml
|-- model
|   |-- __init__.py
|   |-- encoder_model_bert.py
|   |-- pma.py
|   |-- pro_model.py
|-- pefts
|   |-- __init__.py
|   |-- e2llm_args.py
|   |-- e2llm_trainer.py
|-- preprocess
|   |-- preshuffle_data_and_chunk.py
|-- prompts
|-- utils
|   |-- __init__.py
|   |-- common_utils.py
|-- eval.py
|-- eval.sh
|-- train_accelerate.py
|-- train_local_machine.sh
|-- train_multi_node.sh
```

The five datasets (QMSum, GovReport, Quality, NarrativeQA, and TriviaQA) used in this paper can be downloaded from the following links:
Before training, first convert the data into a JSONL file in which each line has the format {'context': 'xxx', 'prompt': 'xxx', 'answer': 'xxx'}. Then run

```bash
python preprocess/preshuffle_data_and_chunk.py
```

and set the chunk_size parameter during execution.
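As an illustration, a conversion script might look like the sketch below; the raw input schema (the `document`, `question`, and `answer` keys) and the output file name `train.jsonl` are assumptions to adapt to your own data:

```python
# Hypothetical conversion sketch: write raw records to the JSONL format expected
# by E2LLM, i.e. one JSON object per line with 'context', 'prompt', and 'answer'.
# The input keys ('document', 'question', 'answer') are assumptions about your raw data.
import json

def convert_to_jsonl(raw_records, output_path):
    """Write one {'context', 'prompt', 'answer'} JSON object per line."""
    with open(output_path, "w", encoding="utf-8") as f:
        for rec in raw_records:
            example = {
                "context": rec["document"],   # the long input text
                "prompt": rec["question"],    # the instruction / question
                "answer": rec["answer"],      # the reference output
            }
            f.write(json.dumps(example, ensure_ascii=False) + "\n")

if __name__ == "__main__":
    demo = [
        {
            "document": "A very long report ...",
            "question": "Summarize the report.",
            "answer": "The report discusses ...",
        }
    ]
    convert_to_jsonl(demo, "train.jsonl")
```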
During training, first set the desired parameters in configs/train_config.json, then run the script that matches your environment (a minimal config sanity-check sketch follows this list):

- If you are training on a local machine:

  ```bash
  sh train_local_machine.sh
  ```

- If you are training on a cluster / multi-node setup:

  ```bash
  sh train_multi_node.sh
  ```
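Before launching, you can optionally verify that configs/train_config.json parses cleanly; this small sketch only loads and prints the file and assumes nothing about its specific keys:

```python
# Illustrative sanity check (not part of the repository): confirm that
# configs/train_config.json is valid JSON and review its settings before training.
# No specific keys are assumed; the file is simply loaded and printed.
import json
from pprint import pprint

with open("configs/train_config.json", "r", encoding="utf-8") as f:
    train_config = json.load(f)  # raises json.JSONDecodeError if the file is malformed

pprint(train_config)
```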
For inference, run

```bash
sh eval.sh
```

and adjust its parameters so that they match the ones used during training.
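As a convenience, the following illustrative snippet (not part of the repository) cross-checks whatever top-level keys configs/eval_config.json and configs/train_config.json have in common and flags any values that differ; it makes no assumptions about the specific fields inside either file:

```python
# Illustrative helper (not part of the repository): flag evaluation settings that
# differ from the training run, since eval parameters should match the training ones.
# It only compares keys the two config files happen to share; no key names are assumed.
import json

def load_config(path):
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)

train_cfg = load_config("configs/train_config.json")
eval_cfg = load_config("configs/eval_config.json")

for key in sorted(set(train_cfg) & set(eval_cfg)):
    if train_cfg[key] != eval_cfg[key]:
        print(f"[mismatch] {key}: train={train_cfg[key]!r} vs eval={eval_cfg[key]!r}")
```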
If you find our repository helpful, please cite us as follows:
```bibtex
@misc{liao2025e2llmencoderelongatedlarge,
  title={E2LLM: Encoder Elongated Large Language Models for Long-Context Understanding and Reasoning},
  author={Zihan Liao and Jun Wang and Hang Yu and Lingxiao Wei and Jianguo Li and Jun Wang and Wei Zhang},
  year={2025},
  eprint={2409.06679},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2409.06679},
}
```