Run local LLMs on iGPU, APU and CPU (AMD, Intel, and Qualcomm (coming soon)). The easiest way to launch an OpenAI-API-compatible server on Windows, Linux and macOS.
| Support matrix | Supported now | Under Development | On the roadmap |
|---|---|---|---|
| Model architectures | Gemma, Llama *, Mistral +, Phi | | |
| Platform | Linux, Windows | | |
| Architecture | x86, x64 | Arm64 | |
| Hardware Acceleration | CUDA, DirectML | QNN, ROCm | OpenVINO |
* The Llama model architecture supports similar model families such as CodeLlama, Vicuna, Yi, and more.
+ The Mistral model architecture supports similar model families such as Zephyr.
- [2024/06] Support Phi-3 (mini, small, medium), Phi-3-Vision-Mini, Llama-2, Llama-3, Gemma (v1), Mistral v0.3, Starling-LM, Yi-1.5.
- [2024/06] Support vision/chat inference on iGPU, APU, CPU and CUDA.
| Models | Parameters | Context Length | Link |
|---|---|---|---|
| Gemma-2b-Instruct v1 | 2B | 8192 | EmbeddedLLM/gemma-2b-it-onnx |
| Llama-2-7b-chat | 7B | 4096 | EmbeddedLLM/llama-2-7b-chat-int4-onnx-directml |
| Llama-2-13b-chat | 13B | 4096 | EmbeddedLLM/llama-2-13b-chat-int4-onnx-directml |
| Llama-3-8b-chat | 8B | 8192 | EmbeddedLLM/mistral-7b-instruct-v0.3-onnx |
| Mistral-7b-v0.3-instruct | 7B | 32768 | EmbeddedLLM/mistral-7b-instruct-v0.3-onnx |
| Phi3-mini-4k-instruct | 3.8B | 4096 | microsoft/Phi-3-mini-4k-instruct-onnx |
| Phi3-mini-128k-instruct | 3.8B | 128k | microsoft/Phi-3-mini-128k-instruct-onnx |
| Phi3-medium-4k-instruct | 14B | 4096 | microsoft/Phi-3-medium-4k-instruct-onnx-directml |
| Phi3-medium-128k-instruct | 14B | 128k | microsoft/Phi-3-medium-128k-instruct-onnx-directml |
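To run one of these models locally, download its ONNX weights from Hugging Face and point the server at the resulting folder. Below is a minimal sketch using the `huggingface_hub` package; the repository and local directory are only examples, and some repositories keep the weights in a per-backend subfolder, in which case pass the folder that actually contains the ONNX files to `--model_path`.

```python
# Sketch: download ONNX weights for one of the models listed above.
# The repo_id and local_dir below are examples, not the only options.
from huggingface_hub import snapshot_download

model_dir = snapshot_download(
    repo_id="EmbeddedLLM/mistral-7b-instruct-v0.3-onnx",
    local_dir="models/mistral-7b-instruct-v0.3-onnx",
)
print(model_dir)  # use this path (or the relevant subfolder) as --model_path for ellm_server
```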
Windows
- Install the embeddedllm package, e.g. `$env:ELLM_TARGET_DEVICE='directml'; pip install -e .`. Note: the currently supported targets are `cpu`, `directml` and `cuda`.
  - DirectML: `$env:ELLM_TARGET_DEVICE='directml'; pip install -e .[directml]`
  - CPU: `$env:ELLM_TARGET_DEVICE='cpu'; pip install -e .[cpu]`
  - CUDA: `$env:ELLM_TARGET_DEVICE='cuda'; pip install -e .[cuda]`
  - With Web UI:
    - DirectML: `$env:ELLM_TARGET_DEVICE='directml'; pip install -e .[directml,webui]`
    - CPU: `$env:ELLM_TARGET_DEVICE='cpu'; pip install -e .[cpu,webui]`
    - CUDA: `$env:ELLM_TARGET_DEVICE='cuda'; pip install -e .[cuda,webui]`
Linux
- Install the embeddedllm package, e.g. `ELLM_TARGET_DEVICE='directml' pip install -e .`. Note: the currently supported targets are `cpu`, `directml` and `cuda`.
  - DirectML: `ELLM_TARGET_DEVICE='directml' pip install -e .[directml]`
  - CPU: `ELLM_TARGET_DEVICE='cpu' pip install -e .[cpu]`
  - CUDA: `ELLM_TARGET_DEVICE='cuda' pip install -e .[cuda]`
  - With Web UI:
    - DirectML: `ELLM_TARGET_DEVICE='directml' pip install -e .[directml,webui]`
    - CPU: `ELLM_TARGET_DEVICE='cpu' pip install -e .[cpu,webui]`
    - CUDA: `ELLM_TARGET_DEVICE='cuda' pip install -e .[cuda,webui]`
```
usage: ellm_server.exe [-h] [--port int] [--host str] [--response_role str] [--uvicorn_log_level str]
                       [--served_model_name str] [--model_path str] [--vision bool]

options:
  -h, --help               show this help message and exit
  --port int               Server port. (default: 6979)
  --host str               Server host. (default: 0.0.0.0)
  --response_role str      Server response role. (default: assistant)
  --uvicorn_log_level str  Uvicorn logging level: debug, info, trace, warning, critical. (default: info)
  --served_model_name str  Model name. (default: phi3-mini-int4)
  --model_path str         Path to model weights. (required)
  --vision bool            Enable vision capability, only if the model supports vision input. (default: False)
```

- Launch the server with `ellm_server --model_path <path/to/model/weight>`.
- Example code to connect to the API server can be found in `scripts/python`; a minimal client sketch is also shown below.
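For reference, here is a minimal client sketch using the `openai` Python package. It assumes the server is running locally on the default port 6979 and serving the default model name `phi3-mini-int4`; adjust both if you changed `--port` or `--served_model_name`.

```python
# Sketch: talk to the local OpenAI-compatible endpoint.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:6979/v1",  # default --host/--port
    api_key="EMPTY",                      # placeholder; assumes no auth is enforced locally
)

response = client.chat.completions.create(
    model="phi3-mini-int4",  # must match --served_model_name
    messages=[{"role": "user", "content": "What is an iGPU?"}],
)
print(response.choices[0].message.content)
```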
- Launch the Chatbot Web UI with `ellm_chatbot --port 7788 --host localhost --server_port <ellm_server_port> --server_host localhost`.
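If you prefer scripting over the Web UI, token-by-token output can be consumed through the same endpoint, assuming the server implements the standard OpenAI streaming protocol; a sketch under that assumption:

```python
# Sketch: stream a chat completion from the local server (assumes standard streaming support).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:6979/v1", api_key="EMPTY")

stream = client.chat.completions.create(
    model="phi3-mini-int4",
    messages=[{"role": "user", "content": "Explain DirectML in one paragraph."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
```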
- Excellent open-source projects: vLLM, onnxruntime-genai and many others.