Run local LLMs on iGPU, APU and CPU (AMD, Intel, and Qualcomm (Coming Soon)). The easiest way to launch an OpenAI-API-compatible server on Windows, Linux and macOS.
| Support matrix | Supported now | Under Development | On the roadmap |
|---|---|---|---|
| Model architectures | Gemma, Llama *, Mistral +, Phi | | |
| Platform | Linux, Windows | | |
| Architecture | x86, x64 | Arm64 | |
| Hardware Acceleration | CUDA, DirectML, IpexLLM | QNN, ROCm | OpenVINO |
* The Llama model architecture supports similar model families such as CodeLlama, Vicuna, Yi, and more.
+ The Mistral model architecture supports similar model families such as Zephyr.
- [2024/06] Added support for Phi-3 (mini, small, medium), Phi-3-Vision-Mini, Llama-2, Llama-3, Gemma (v1), Mistral v0.3, Starling-LM, and Yi-1.5.
- [2024/06] Added support for vision/chat inference on iGPU, APU, CPU, and CUDA.
- Supported Models
- Getting Started
- Compile OpenAI-API Compatible Server into Windows Executable
- Prebuilt Binary (Alpha)
- Acknowledgements
| Models | Parameters | Context Length | Link |
|---|---|---|---|
| Gemma-2b-Instruct v1 | 2B | 8192 | EmbeddedLLM/gemma-2b-it-onnx |
| Llama-2-7b-chat | 7B | 4096 | EmbeddedLLM/llama-2-7b-chat-int4-onnx-directml |
| Llama-2-13b-chat | 13B | 4096 | EmbeddedLLM/llama-2-13b-chat-int4-onnx-directml |
| Llama-3-8b-chat | 8B | 8192 | luweigen/Llama-3-8B-Instruct-int4-onnx-directml |
| Mistral-7b-v0.3-instruct | 7B | 32768 | EmbeddedLLM/mistral-7b-instruct-v0.3-onnx |
| Phi-3-mini-4k-instruct-062024 | 3.8B | 4096 | EmbeddedLLM/Phi-3-mini-4k-instruct-062024-onnx |
| Phi3-mini-4k-instruct | 3.8B | 4096 | microsoft/Phi-3-mini-4k-instruct-onnx |
| Phi3-mini-128k-instruct | 3.8B | 128k | microsoft/Phi-3-mini-128k-instruct-onnx |
| Phi3-medium-4k-instruct | 17B | 4096 | microsoft/Phi-3-medium-4k-instruct-onnx-directml |
| Phi3-medium-128k-instruct | 17B | 128k | microsoft/Phi-3-medium-128k-instruct-onnx-directml |
| Openchat-3.6-8b | 8B | 8192 | EmbeddedLLM/openchat-3.6-8b-20240522-onnx |
| Yi-1.5-6b-chat | 6B | 32k | EmbeddedLLM/01-ai_Yi-1.5-6B-Chat-onnx |
| Phi-3-vision-128k-instruct | | 128k | EmbeddedLLM/Phi-3-vision-128k-instruct-onnx |
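The Link column lists Hugging Face repository IDs. As a sketch (not part of this repo's tooling), one way to pull the ONNX weights to a local folder with `huggingface_hub` before pointing `ellm_server` at them; the repo ID and target folder here are examples only:

```python
# Sketch: download one of the ONNX model repos listed above with huggingface_hub.
# Assumptions: `pip install huggingface_hub`; the repo id and local_dir are placeholders.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="EmbeddedLLM/Phi-3-mini-4k-instruct-062024-onnx",
    local_dir="./Phi-3-mini-4k-instruct-062024-onnx",
)
print("Model downloaded to:", local_dir)
```

For DirectML builds the weights typically sit in a subfolder such as `onnx/directml/...`, as in the Powershell usage example further below.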
- Windows
  - Custom Setup:
    - IPEX(XPU): Requires an Anaconda environment. `conda create -n ellm python=3.10 libuv; conda activate ellm`.
    - DirectML: If you are using a Conda environment, install the additional dependency: `conda install conda-forge::vs2015_runtime`.
  - Install the embeddedllm package: `$env:ELLM_TARGET_DEVICE='directml'; pip install -e .`. Note: `cpu`, `directml` and `cuda` are currently supported. (A quick verification sketch follows the Windows and Linux lists below.)
    - DirectML: `$env:ELLM_TARGET_DEVICE='directml'; pip install -e .[directml]`
    - CPU: `$env:ELLM_TARGET_DEVICE='cpu'; pip install -e .[cpu]`
    - CUDA: `$env:ELLM_TARGET_DEVICE='cuda'; pip install -e .[cuda]`
    - IPEX: `$env:ELLM_TARGET_DEVICE='ipex'; python setup.py develop`
    - OpenVINO: `$env:ELLM_TARGET_DEVICE='openvino'; pip install -e .[openvino]`
    - With Web UI:
      - DirectML: `$env:ELLM_TARGET_DEVICE='directml'; pip install -e .[directml,webui]`
      - CPU: `$env:ELLM_TARGET_DEVICE='cpu'; pip install -e .[cpu,webui]`
      - CUDA: `$env:ELLM_TARGET_DEVICE='cuda'; pip install -e .[cuda,webui]`
      - IPEX: `$env:ELLM_TARGET_DEVICE='ipex'; python setup.py develop; pip install -r requirements-webui.txt`
      - OpenVINO: `$env:ELLM_TARGET_DEVICE='openvino'; pip install -e .[openvino,webui]`
- Linux
  - Custom Setup:
    - IPEX(XPU): Requires an Anaconda environment. `conda create -n ellm python=3.10 libuv; conda activate ellm`.
    - DirectML: If you are using a Conda environment, install the additional dependency: `conda install conda-forge::vs2015_runtime`.
  - Install the embeddedllm package: `ELLM_TARGET_DEVICE='directml' pip install -e .`. Note: `cpu`, `directml` and `cuda` are currently supported.
    - DirectML: `ELLM_TARGET_DEVICE='directml' pip install -e .[directml]`
    - CPU: `ELLM_TARGET_DEVICE='cpu' pip install -e .[cpu]`
    - CUDA: `ELLM_TARGET_DEVICE='cuda' pip install -e .[cuda]`
    - IPEX: `ELLM_TARGET_DEVICE='ipex' python setup.py develop`
    - OpenVINO: `ELLM_TARGET_DEVICE='openvino' pip install -e .[openvino]`
    - With Web UI:
      - DirectML: `ELLM_TARGET_DEVICE='directml' pip install -e .[directml,webui]`
      - CPU: `ELLM_TARGET_DEVICE='cpu' pip install -e .[cpu,webui]`
      - CUDA: `ELLM_TARGET_DEVICE='cuda' pip install -e .[cuda,webui]`
      - IPEX: `ELLM_TARGET_DEVICE='ipex' python setup.py develop; pip install -r requirements-webui.txt`
      - OpenVINO: `ELLM_TARGET_DEVICE='openvino' pip install -e .[openvino,webui]`
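Not part of the official steps, but after installing one of the ONNX Runtime based targets (`cpu`, `directml`, `cuda`) you can quickly check which execution providers are visible. A minimal sketch, assuming the chosen extra pulled in an ONNX Runtime build:

```python
# Minimal install sanity check (assumption: the chosen extra installs an
# onnxruntime build; the provider names below are the standard ONNX Runtime ones).
import onnxruntime as ort

providers = ort.get_available_providers()
print("Available execution providers:", providers)

# 'DmlExecutionProvider'  -> DirectML build is active
# 'CUDAExecutionProvider' -> CUDA build is active
# 'CPUExecutionProvider'  -> always present as the fallback
```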
- Custom Setup:
  - IPEX:
    - For Intel iGPU: `set SYCL_CACHE_PERSISTENT=1` and `set BIGDL_LLM_XMX_DISABLED=1`
    - For Intel Arc™ A-Series Graphics: `set SYCL_CACHE_PERSISTENT=1`
- Launch the server: `ellm_server --model_path <path/to/model/weight>`.
- Example code to connect to the API server can be found in `scripts/python`; a minimal client sketch is also shown below. Note: run `ellm_server --help` to see the full list of supported arguments.
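A minimal client sketch using the `openai` Python package, assuming the server was started on port 5555 (as in the usage examples further below) and that the `model` field matches the served model name; both values are placeholders to adjust:

```python
# Sketch: talk to the OpenAI-API-compatible ellm_server with the openai client (>=1.0).
# Assumptions: server on localhost:5555; the model name below is a placeholder and
# should match the model you launched (see /v1/models or --served_model_name).
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:5555/v1",
    api_key="not-needed-for-local-server",  # any non-empty string works locally
)

response = client.chat.completions.create(
    model="Phi-3-mini-4k-instruct-062024-int4",  # replace with your served model name
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    max_tokens=64,
)
print(response.choices[0].message.content)
```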
- Launch the chatbot web UI: `ellm_chatbot --port 7788 --host localhost --server_port <ellm_server_port> --server_host localhost`. Note: run `ellm_chatbot --help` to see the full list of supported arguments.
The model management UI (`ellm_modelui`) is an interface that allows you to download models and deploy an OpenAI-API-compatible server. The UI also shows the disk space required to download each model.

- Launch the model management UI: `ellm_modelui --port 6678`. Note: run `ellm_modelui --help` to see the full list of supported arguments.
NOTE: The OpenVINO packaging currently uses `torch==2.4.0` and will not run out of the box because of a missing dependency, libomp. Make sure to install libomp and copy the `libomp-xxxxxxx.dll` into `C:\Windows\System32`.
- Install `embeddedllm`.
- Install PyInstaller: `pip install pyinstaller==6.9.0`.
- Compile the Windows executable: `pyinstaller .\ellm_api_server.spec`.
- You can find the executable in the `dist\ellm_api_server` directory.
- Use it like `ellm_server`: `.\ellm_api_server.exe --model_path <path/to/model/weight>`.

Powershell/Terminal Usage:
```powershell
ellm_server --model_path <path/to/model/weight>

# DirectML
ellm_server --model_path 'EmbeddedLLM_Phi-3-mini-4k-instruct-062024-onnx\onnx\directml\Phi-3-mini-4k-instruct-062024-int4' --port 5555

# IPEX-LLM
ellm_server --model_path '.\meta-llama_Meta-Llama-3.1-8B-Instruct\' --backend 'ipex' --device 'xpu' --port 5555 --served_model_name 'meta-llama_Meta/Llama-3.1-8B-Instruct'

# OpenVINO
ellm_server --model_path '.\meta-llama_Meta-Llama-3.1-8B-Instruct\' --backend 'openvino' --device 'gpu' --port 5555 --served_model_name 'meta-llama_Meta/Llama-3.1-8B-Instruct'
```
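To confirm the server is reachable and to see the exact model name it is serving, you can query the model list. A minimal sketch, assuming `ellm_server` exposes the standard OpenAI `/v1/models` route and is listening on port 5555 as in the commands above:

```python
# Sketch: check that ellm_server is up and list the served model name(s).
# Assumptions: standard OpenAI /v1/models route; host/port match your launch command.
import requests

resp = requests.get("http://localhost:5555/v1/models", timeout=10)
resp.raise_for_status()
for model in resp.json().get("data", []):
    print(model["id"])
```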
You can find the prebuilt OpenAI-API-compatible Windows executable on the Releases page.

Powershell/Terminal Usage (use it like `ellm_server`):
```powershell
.\ellm_api_server.exe --model_path <path/to/model/weight>

# DirectML
.\ellm_api_server.exe --model_path 'EmbeddedLLM_Phi-3-mini-4k-instruct-062024-onnx\onnx\directml\Phi-3-mini-4k-instruct-062024-int4' --port 5555

# IPEX-LLM
.\ellm_api_server.exe --model_path '.\meta-llama_Meta-Llama-3.1-8B-Instruct\' --backend 'ipex' --device 'xpu' --port 5555 --served_model_name 'meta-llama_Meta/Llama-3.1-8B-Instruct'

# OpenVINO
.\ellm_api_server.exe --model_path '.\meta-llama_Meta-Llama-3.1-8B-Instruct\' --backend 'openvino' --device 'gpu' --port 5555 --served_model_name 'meta-llama_Meta/Llama-3.1-8B-Instruct'
```

- Excellent open-source projects: vLLM, onnxruntime-genai, Ipex-LLM, and many others.

