```bash
conda create -n HaystackCraft python=3.10 -y
conda activate HaystackCraft
pip install -r requirements.txt
```
If you have trouble running Qwen2.5-1M models, you may create a separate environment with `requirements_0-7-2.txt`.
If you need to evaluate models from OpenAI, specify your OpenAI API key with

```bash
export OPENAI_API_KEY=...
```
If you need to evaluate Gemini models, specify your Gemini API key with

```bash
export GEMINI_API_KEY=...
```
For access to certain open source LLMs, you may need to first specify your Hugging Face token with

```bash
export HUGGING_FACE_HUB_TOKEN=...
```
We use vLLM for serving open source LLMs, e.g.,

```bash
vllm serve meta-llama/Llama-3.1-8B-Instruct --api-key token-abc123 --gpu-memory-utilization 0.95 --trust-remote-code --port 8000
```
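The served model exposes an OpenAI-compatible API. As a quick sanity check of the deployment (this request is just an illustration, not part of the pipeline), you can query the endpoint directly:

```bash
# Minimal smoke test against the OpenAI-compatible endpoint served by vLLM.
curl http://localhost:8000/v1/chat/completions \
  -H "Authorization: Bearer token-abc123" \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": "Say hello."}]
      }'
```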
For LLM inference,

```bash
python infer_static.py --llm MODEL_TO_EVALUATE --retriever RETRIEVER_FOR_HAYSTACK_CONSTRUCTION --context_size TARGET_CONTEXT_SIZE --order HAYSTACK_ORDERING
```
Additionally specify `--ppr` for graph-based reranking with Personalized PageRank (PPR) in haystack construction.
For inference with locally deployed open source LLMs, specify the port you use in vLLM deployment, e.g., `--port 8000`.
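As a concrete example, the following invocation builds BM25 haystacks in descending order at an 8K context size. The flag values here are inferred from the result path in the evaluation command below and are not guaranteed to match the script's exact accepted names; run `python infer_static.py --help` to confirm.

```bash
# Hypothetical example; values inferred from results/bm25/Llama-3.1-8B-Instruct/8000/descending_order/.
python infer_static.py --llm meta-llama/Llama-3.1-8B-Instruct --retriever bm25 \
    --context_size 8000 --order descending_order --port 8000
```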
For evaluation, run, for example,

```bash
python eval.py --result_dir results/bm25/Llama-3.1-8B-Instruct/8000/descending_order/
```
Install Java 21, for example with SDKMAN:

```bash
curl -s "https://get.sdkman.io" | bash
source "$HOME/.sdkman/bin/sdkman-init.sh"
sdk install java 21.0.3-tem
```
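You can then confirm the installation with:

```bash
# Should report a Java 21 runtime if the install succeeded.
java -version
```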
Deploy a local embedding server with vLLM:

```bash
vllm serve Qwen/Qwen3-Embedding-0.6B --port QWEN_RETRIEVER_EMB_PORT --api-key token-abc123 --gpu-memory-utilization 0.95 --trust-remote-code --enforce-eager
```
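Like the LLM server, this exposes an OpenAI-compatible endpoint; a minimal request to confirm it is running (substitute your actual port for QWEN_RETRIEVER_EMB_PORT; the request itself is just an illustration):

```bash
# Smoke test for the vLLM embedding server; not part of the pipeline itself.
curl http://localhost:QWEN_RETRIEVER_EMB_PORT/v1/embeddings \
  -H "Authorization: Bearer token-abc123" \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen3-Embedding-0.6B", "input": ["haystack engineering"]}'
```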
```bash
python infer_multi.py --llm MODEL_TO_EVALUATE --retriever RETRIEVER_FOR_HAYSTACK_CONSTRUCTION --context_size TARGET_CONTEXT_SIZE --num_rounds NUM_REASONING_ROUNDS
```
Additional args:

- `--port`: For inference with locally deployed open source LLMs, specify the port you use in vLLM deployment, e.g., `--port 8000`.
- `--emb_port`: If you use `Qwen3-Embedding-0.6B` for haystack construction, specify the `QWEN_RETRIEVER_EMB_PORT` used above.
- `--ppr`: Specify `--ppr` for graph-based reranking with Personalized PageRank (PPR) in haystack construction.
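For example, a two-round run matching the result path used in the evaluation command below. These values are inferred from that path, not confirmed against the script; run `python infer_multi.py --help` for the exact accepted names.

```bash
# Hypothetical example; values inferred from 2_round_results/qwen3_0.6/gemini-2.5-flash-lite/8000/descending_order.
python infer_multi.py --llm gemini-2.5-flash-lite --retriever qwen3_0.6 \
    --context_size 8000 --num_rounds 2 --emb_port QWEN_RETRIEVER_EMB_PORT
```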
```bash
python infer_variable.py --llm MODEL_TO_EVALUATE --retriever RETRIEVER_FOR_HAYSTACK_CONSTRUCTION --context_size TARGET_CONTEXT_SIZE --max_rounds MAX_REASONING_ROUNDS
```
Additional args:

- `--port`: For inference with locally deployed open source LLMs, specify the port you use in vLLM deployment, e.g., `--port 8000`.
- `--emb_port`: If you use `Qwen3-Embedding-0.6B` for haystack construction, specify the `QWEN_RETRIEVER_EMB_PORT` used above.
- `--ppr`: Specify `--ppr` for graph-based reranking with Personalized PageRank (PPR) in haystack construction.
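For example, a run capped at two reasoning rounds. The values follow the same conventions as the `infer_multi.py` sketch above and are likewise assumptions; check `python infer_variable.py --help`.

```bash
# Hypothetical example; flag values mirror the infer_multi.py sketch above.
python infer_variable.py --llm gemini-2.5-flash-lite --retriever qwen3_0.6 \
    --context_size 8000 --max_rounds 2 --emb_port QWEN_RETRIEVER_EMB_PORT
```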
For example,

```bash
python eval_100.py --result_dir 2_round_results/qwen3_0.6/gemini-2.5-flash-lite/8000/descending_order
```
```bibtex
@article{li2025haystack,
  title={Haystack Engineering: Context Engineering for Heterogeneous and Agentic Long-Context Evaluation},
  author={Mufei Li and Dongqi Fu and Limei Wang and Si Zhang and Hanqing Zeng and Kaan Sancak and Ruizhong Qiu and Haoyu Wang and Xiaoxin He and Xavier Bresson and Yinglong Xia and Chonglin Sun and Pan Li},
  journal={arXiv preprint arXiv:2510.07414},
  year={2025}
}
```