
Haystack Engineering: Context Engineering for Heterogeneous and Agentic Long-Context Evaluation


Table of Contents

  • Environment Setup
  • Static NIAH with Heterogeneous Retrieval Strategies
  • Dynamic NIAH
  • Citation

Environment Setup

conda create -n HaystackCraft python=3.10 -y
conda activate HaystackCraft
pip install -r requirements.txt

If you have trouble running Qwen2.5-1M models, you may create a separate environment with requirements_0-7-2.txt.
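A minimal sketch of setting up that separate environment (the environment name here is arbitrary):

conda create -n HaystackCraft-0-7-2 python=3.10 -y
conda activate HaystackCraft-0-7-2
pip install -r requirements_0-7-2.txt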

If you need to evaluate models from OpenAI, specify your OpenAI API key with

export OPENAI_API_KEY=...

If you need to evaluate Gemini models, specify your Gemini API key with

export GEMINI_API_KEY=...

Static NIAH with Heterogeneous Retrieval Strategies

For access to certain open source LLMs, you may need to first specify your Hugging Face token with

export HUGGING_FACE_HUB_TOKEN=...

We use vLLM for serving open source LLMs, e.g.,

vllm serve meta-llama/Llama-3.1-8B-Instruct --api-key token-abc123 --gpu-memory-utilization 0.95 --trust-remote-code --port 8000

For LLM inference,

python infer_static.py --llm MODEL_TO_EVALUATE --retriever RETRIEVER_FOR_HAYSTACK_CONSTRUCTION --context_size TARGET_CONTEXT_SIZE --order HAYSTACK_ORDERING

Additionally specify --ppr for graph-based reranking with Personalized PageRank (PPR) in haystack construction.

For inference with locally deployed open source LLMs, specify the port you use in vLLM deployment, e.g., --port 8000.
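Putting these together, a hedged example run against the Llama-3.1-8B-Instruct server deployed above (the values mirror the result directory in the evaluation example below; the exact strings accepted for --llm and --order are assumptions):

python infer_static.py --llm Llama-3.1-8B-Instruct --retriever bm25 --context_size 8000 --order descending_order --port 8000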

For evaluation, run, for example,

python eval.py --result_dir results/bm25/Llama-3.1-8B-Instruct/8000/descending_order/

Dynamic NIAH

Retrieval Environment Setup

BM25

Install Java 21, for example with SDKMAN:

curl -s "https://get.sdkman.io" | bash
source "$HOME/.sdkman/bin/sdkman-init.sh"
sdk install java 21.0.3-tem
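To verify the installation:

java -version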

qwen3_0.6

Deploy a local embedding server with vLLM.

vllm serve Qwen/Qwen3-Embedding-0.6B --port QWEN_RETRIEVER_EMB_PORT --api-key token-abc123 --gpu-memory-utilization 0.95 --trust-remote-code --enforce-eager
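As a quick sanity check, the server exposes vLLM's OpenAI-compatible API, so you can request an embedding directly (replace QWEN_RETRIEVER_EMB_PORT with the port you chose; the input string is illustrative):

curl http://localhost:QWEN_RETRIEVER_EMB_PORT/v1/embeddings -H "Authorization: Bearer token-abc123" -H "Content-Type: application/json" -d '{"model": "Qwen/Qwen3-Embedding-0.6B", "input": "needle in a haystack"}'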

LLM Inference (Enforced Multi-Round)

python infer_multi.py --llm MODEL_TO_EVALUATE --retriever RETRIEVER_FOR_HAYSTACK_CONSTRUCTION --context_size TARGET_CONTEXT_SIZE --num_rounds NUM_REASONING_ROUNDS

Additional args (a combined example follows this list):

  • --port: For inference with locally deployed open source LLMs, specify the port you use in vLLM deployment, e.g., --port 8000.
  • --emb_port: If you use Qwen3-Embedding-0.6B for haystack construction, specify QWEN_RETRIEVER_EMB_PORT used above.
  • --ppr: Specify --ppr for graph-based reranking with Personalized PageRank (PPR) in haystack construction.
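For instance, a sketch of a two-round run with the Qwen3 embedding retriever (the values mirror the result directory in the evaluation example below and are otherwise assumptions):

python infer_multi.py --llm gemini-2.5-flash-lite --retriever qwen3_0.6 --context_size 8000 --num_rounds 2 --emb_port QWEN_RETRIEVER_EMB_PORT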

LLM Inference (Variable-Round)

python infer_variable.py --llm MODEL_TO_EVALUATE --retriever RETRIEVER_FOR_HAYSTACK_CONSTRUCTION --context_size TARGET_CONTEXT_SIZE --max_rounds MAX_REASONING_ROUNDS

Additional args (a combined example follows this list):

  • --port: For inference with locally deployed open source LLMs, specify the port you use in vLLM deployment, e.g., --port 8000.
  • --emb_port: If you use Qwen3-Embedding-0.6B for haystack construction, specify QWEN_RETRIEVER_EMB_PORT used above.
  • --ppr: Specify --ppr for graph-based reranking with Personalized PageRank (PPR) in haystack construction.
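For instance, a sketch of a run capped at 4 rounds with a locally served open source LLM (the model name and round cap are illustrative):

python infer_variable.py --llm Llama-3.1-8B-Instruct --retriever qwen3_0.6 --context_size 8000 --max_rounds 4 --port 8000 --emb_port QWEN_RETRIEVER_EMB_PORT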

Evaluation

For example

python eval_100.py --result_dir 2_round_results/qwen3_0.6/gemini-2.5-flash-lite/8000/descending_order

Citation

@article{li2025haystack,
  title={Haystack Engineering: Context Engineering for Heterogeneous and Agentic Long-Context Evaluation},
  author={Mufei Li and Dongqi Fu and Limei Wang and Si Zhang and Hanqing Zeng and Kaan Sancak and Ruizhong Qiu and Haoyu Wang and Xiaoxin He and Xavier Bresson and Yinglong Xia and Chonglin Sun and Pan Li},
  journal={arXiv preprint arXiv:2510.07414},
  year={2025}
}
