```bash
conda create -n HaystackCraft python=3.10 -y
conda activate HaystackCraft
pip install -r requirements.txt
```
If you have trouble running Qwen2.5-1M models, you may create a separate environment with `requirements_0-7-2.txt`.
If you need to evaluate models from OpenAI, specify your OpenAI API key with

```bash
export OPENAI_API_KEY=...
```
If you need to evaluate Gemini models, specify your Gemini API key with

```bash
export GEMINI_API_KEY=...
```
For access to certain open source LLMs, you may need to first specify your Hugging Face token with

```bash
export HUGGING_FACE_HUB_TOKEN=...
```
We use vLLM for serving open source LLMs, e.g.,

```bash
vllm serve meta-llama/Llama-3.1-8B-Instruct --api-key token-abc123 --gpu-memory-utilization 0.95 --trust-remote-code --port 8000
```
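The served model exposes an OpenAI-compatible API. As a quick sanity check of the deployment (this request is just an illustration, not part of the pipeline), you can query the endpoint directly:

```bash
# Minimal smoke test against the OpenAI-compatible endpoint served by vLLM.
curl http://localhost:8000/v1/chat/completions \
  -H "Authorization: Bearer token-abc123" \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": "Say hello."}]
      }'
```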
For LLM inference,

```bash
python infer_static.py --llm MODEL_TO_EVALUATE --retriever RETRIEVER_FOR_HAYSTACK_CONSTRUCTION --context_size TARGET_CONTEXT_SIZE --order HAYSTACK_ORDERING
```
Additionally specify `--ppr` for graph-based reranking with Personalized PageRank (PPR) in haystack construction.
For inference with locally deployed open source LLMs, specify the port you use in vLLM deployment, e.g., `--port 8000`.
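As a concrete example, the following invocation builds BM25 haystacks in descending order at an 8K context size. The flag values here are inferred from the result path in the evaluation command below and are not guaranteed to match the script's exact accepted names; run `python infer_static.py --help` to confirm.

```bash
# Hypothetical example; values inferred from results/bm25/Llama-3.1-8B-Instruct/8000/descending_order/.
python infer_static.py --llm meta-llama/Llama-3.1-8B-Instruct --retriever bm25 \
    --context_size 8000 --order descending_order --port 8000
```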
For evaluation, run, for example,

```bash
python eval.py --result_dir results/bm25/Llama-3.1-8B-Instruct/8000/descending_order/
```
Install Java 21, for example with SDKMAN:

```bash
curl -s "https://get.sdkman.io" | bash
source "$HOME/.sdkman/bin/sdkman-init.sh"
sdk install java 21.0.3-tem
```
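You can then confirm the installation with:

```bash
# Should report a Java 21 runtime if the install succeeded.
java -version
```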
Deploy a local embedding server with vLLM:

```bash
vllm serve Qwen/Qwen3-Embedding-0.6B --port QWEN_RETRIEVER_EMB_PORT --api-key token-abc123 --gpu-memory-utilization 0.95 --trust-remote-code --enforce-eager
```
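Like the LLM server, this exposes an OpenAI-compatible endpoint; a minimal request to confirm it is running (substitute your actual port for QWEN_RETRIEVER_EMB_PORT; the request itself is just an illustration):

```bash
# Smoke test for the vLLM embedding server; not part of the pipeline itself.
curl http://localhost:QWEN_RETRIEVER_EMB_PORT/v1/embeddings \
  -H "Authorization: Bearer token-abc123" \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen3-Embedding-0.6B", "input": ["haystack engineering"]}'
```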
```bash
python infer_multi.py --llm MODEL_TO_EVALUATE --retriever RETRIEVER_FOR_HAYSTACK_CONSTRUCTION --context_size TARGET_CONTEXT_SIZE --num_rounds NUM_REASONING_ROUNDS
```
Additional args:

- `--port`: For inference with locally deployed open source LLMs, specify the port you use in vLLM deployment, e.g., `--port 8000`.
- `--emb_port`: If you use `Qwen3-Embedding-0.6B` for haystack construction, specify the `QWEN_RETRIEVER_EMB_PORT` used above.
- `--ppr`: Specify `--ppr` for graph-based reranking with Personalized PageRank (PPR) in haystack construction.
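For example, a two-round run matching the result path used in the evaluation command below. These values are inferred from that path, not confirmed against the script; run `python infer_multi.py --help` for the exact accepted names.

```bash
# Hypothetical example; values inferred from 2_round_results/qwen3_0.6/gemini-2.5-flash-lite/8000/descending_order.
python infer_multi.py --llm gemini-2.5-flash-lite --retriever qwen3_0.6 \
    --context_size 8000 --num_rounds 2 --emb_port QWEN_RETRIEVER_EMB_PORT
```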
```bash
python infer_variable.py --llm MODEL_TO_EVALUATE --retriever RETRIEVER_FOR_HAYSTACK_CONSTRUCTION --context_size TARGET_CONTEXT_SIZE --max_rounds MAX_REASONING_ROUNDS
```
Additional args:

- `--port`: For inference with locally deployed open source LLMs, specify the port you use in vLLM deployment, e.g., `--port 8000`.
- `--emb_port`: If you use `Qwen3-Embedding-0.6B` for haystack construction, specify the `QWEN_RETRIEVER_EMB_PORT` used above.
- `--ppr`: Specify `--ppr` for graph-based reranking with Personalized PageRank (PPR) in haystack construction.
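For example, a run capped at two reasoning rounds. The values follow the same conventions as the `infer_multi.py` sketch above and are likewise assumptions; check `python infer_variable.py --help`.

```bash
# Hypothetical example; flag values mirror the infer_multi.py sketch above.
python infer_variable.py --llm gemini-2.5-flash-lite --retriever qwen3_0.6 \
    --context_size 8000 --max_rounds 2 --emb_port QWEN_RETRIEVER_EMB_PORT
```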
For example,

```bash
python eval_100.py --result_dir 2_round_results/qwen3_0.6/gemini-2.5-flash-lite/8000/descending_order
```
```bibtex
@article{li2025haystack,
  title={Haystack Engineering: Context Engineering for Heterogeneous and Agentic Long-Context Evaluation},
  author={Mufei Li and Dongqi Fu and Limei Wang and Si Zhang and Hanqing Zeng and Kaan Sancak and Ruizhong Qiu and Haoyu Wang and Xiaoxin He and Xavier Bresson and Yinglong Xia and Chonglin Sun and Pan Li},
  journal={arXiv preprint arXiv:2510.07414},
  year={2025}
}
```