A comprehensive framework for multi-node, multi-GPU scalable LLM inference on HPC systems using vLLM and Ollama. Includes distributed deployment templates, benchmarking workflows, and chatbot/RAG pipelines for high-throughput, production-grade AI services.
hpc rag ai-inference vllm llm-inference ollama distributed-inference vllm-serve ai-inference-server multinode-training
Updated Dec 10, 2025 · Python
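As a quick orientation to the kind of serving stack the description refers to, here is a minimal client sketch. It assumes a vLLM OpenAI-compatible server has already been launched (for example with `vllm serve <model> --tensor-parallel-size 2`); the model name and endpoint are illustrative placeholders, not paths or defaults taken from this repository.

```python
# Minimal sketch: query a running vLLM OpenAI-compatible endpoint.
# Assumes a server was started separately, e.g.:
#   vllm serve meta-llama/Llama-3.1-8B-Instruct --tensor-parallel-size 2
# Endpoint and model id below are placeholder assumptions.
import requests

ENDPOINT = "http://localhost:8000/v1/chat/completions"  # vLLM's default port is 8000

payload = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",  # hypothetical model id
    "messages": [
        {"role": "user", "content": "Summarize multi-node vLLM serving in one sentence."}
    ],
    "max_tokens": 128,
}

resp = requests.post(ENDPOINT, json=payload, timeout=60)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```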