Motivation.
vLLM provides an easy-to-use backend access mechanism, and many backends have already been integrated.
As shown in #6368, #6728, and #6066, many users want to run vLLM on Ascend NPU.
The main purpose of this RFC is to follow the existing backend access mechanism and make Ascend NPU available to vLLM.
Proposed Change.
We introduce an Ascend Executor and Ascend Worker(s), based on the existing GPU Executor/Worker(s), to handle runtime management and per-device execution on NPU. We also add an Ascend attention backend as a replacement for the attention layer; the PagedAttention/FlashAttention ops are implemented there. A minimal sketch of the intended class layout follows.
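A hedged sketch of the layout, assuming vLLM's existing `GPUExecutor` and `Worker` base classes; the Ascend class names and the override point shown are illustrative, not a final interface:

```python
# Illustrative sketch only: the base-class names follow upstream vLLM at the
# time of writing, but the Ascend classes and overrides are hypothetical.
import torch
import torch_npu  # noqa: F401  (side effect: registers the "npu" device)

from vllm.executor.gpu_executor import GPUExecutor
from vllm.worker.worker import Worker


class AscendWorker(Worker):
    """A worker pinned to one NPU device instead of a CUDA device."""

    def init_device(self) -> None:
        # Bind this worker rank to its NPU (mirrors the CUDA path).
        self.device = torch.device(f"npu:{self.local_rank}")
        torch.npu.set_device(self.device)


class AscendExecutor(GPUExecutor):
    """Runtime management on NPU: creates AscendWorkers instead of GPU workers,
    so the rest of the engine stays unchanged."""
    ...
```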
Because torch_npu has supported torch natively since version 2.1.0, we will keep the implementation consistent with the GPU code path and make as few code changes as possible, as the example below illustrates.
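As an illustration of why the diff against the GPU path stays small (this is not code from the RFC itself): importing torch_npu registers the "npu" device with torch, so ordinary tensor code carries over with little more than a device-string change.

```python
import torch
import torch_npu  # noqa: F401  (enables torch.npu and the "npu" device)

# Fall back to CPU so the snippet also runs on machines without an NPU.
device = torch.device("npu:0" if torch.npu.is_available() else "cpu")

q = torch.randn(8, 128, device=device)
k = torch.randn(8, 128, device=device)
# The same tensor ops used on the CUDA path run unchanged on NPU.
scores = (q @ k.transpose(0, 1)) / (128 ** 0.5)
print(scores.shape, scores.device)
```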
Feedback Period.
A month
CC List.
Any Other Things.
Background
Ascend NPU is a range of AI processors built around a Neural Processing Unit. It efficiently handles matrix-matrix multiplication, dot products, and scalar operations. Many projects already support Ascend NPU, such as onnxruntime, deepspeed, and llama.cpp.
MindIE is the Ascend inference engine, a high-performance deep learning inference framework designed for Ascend hardware.
Roadmap
The initial version will include the following components (a hedged selection sketch for the two backends follows the list):
- Ascend Executor
- Ascend Worker
- Ascend Model Runner
- Ascend MindIE Backend
- Ascend SingleOps Backend
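
As a sketch of how the two attention backends might be dispatched: the `VLLM_ASCEND_BACKEND` environment variable, the module paths, and the class names below are hypothetical, shown only to illustrate the split between the MindIE path and the SingleOps path.

```python
# Hypothetical dispatch between the two planned backends; nothing here is
# existing vLLM API.
import os


def get_ascend_attn_backend():
    choice = os.environ.get("VLLM_ASCEND_BACKEND", "singleops").lower()
    if choice == "mindie":
        # Fused, graph-level path backed by the MindIE inference engine.
        from vllm.attention.backends.ascend_mindie import AscendMindIEBackend
        return AscendMindIEBackend
    # Default: eager execution via individual torch_npu operators.
    from vllm.attention.backends.ascend import AscendSingleOpsBackend
    return AscendSingleOpsBackend
```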

