
[RFC]: Add Ascend NPU as a new backend #7692

@wangshuai09

Description

Motivation.

vLLM provides an easy-to-use backend integration mechanism, and many backends have already been integrated.
As shown in #6368, #6728, and #6066, many users want to use vLLM on Ascend NPU.
The main purpose of this RFC is to follow the existing backend integration mechanism and make Ascend NPU available in vLLM.

Proposed Change.

[Figure 1]

We introduce an Ascend Executor and Ascend Worker(s), based on the GPU Executor/Worker(s), to manage the Ascend runtime and run workers on the NPU. We also add an Ascend attention backend as a replacement for the attention layer; the PagedAttention/FlashAttention ops are implemented there.
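As a rough sketch of how these pieces could mirror the existing GPU classes (the class names, module paths, and method bodies below are illustrative assumptions, not vLLM's final API):

```python
# Hypothetical sketch of the proposed classes; names, paths, and
# signatures are assumptions for illustration, not vLLM internals.
import torch

from vllm.executor.gpu_executor import GPUExecutor
from vllm.worker.worker import Worker


class AscendWorker(Worker):
    """A worker that places the model and KV cache on an Ascend NPU."""

    def init_device(self) -> None:
        import torch_npu  # noqa: F401  # registers the "npu" device with torch
        self.device = torch.device(f"npu:{self.local_rank}")
        torch.npu.set_device(self.device)


class AscendExecutor(GPUExecutor):
    """Reuses the GPU executor's orchestration but spawns AscendWorker and
    selects the Ascend attention backend instead of the CUDA one."""
```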

[Figure 2]

Because torch_npu has natively supported torch since version 2.1.0, we should keep the implementation consistent with the GPU code and make as few code changes as possible.
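For illustration, a minimal torch_npu snippet (assuming torch_npu is installed); apart from the device string it is identical to the CUDA equivalent, which is what lets the NPU path reuse most of the GPU code:

```python
import torch
import torch_npu  # noqa: F401  # importing this registers the "npu" device with torch

torch.npu.set_device("npu:0")

# Same code as the CUDA path except for the device string.
a = torch.randn(4, 4, device="npu")
b = torch.randn(4, 4, device="npu")
print(torch.matmul(a, b).cpu())
```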

Feedback Period.

A month

CC List.

@mgoin
@WoosukKwon

Any Other Things.

Background

Ascend NPU is a family of AI processors built around a Neural Processing Unit, which efficiently handles matrix-matrix multiplication, dot products, and scalar operations. Many projects already support Ascend NPU, such as onnxruntime, DeepSpeed, and llama.cpp.

MindIE is Ascend's inference engine, a high-performance deep learning inference framework designed for Ascend hardware.

Roadmap

The initial version will include the following:

  • Ascend Executor
  • Ascend Worker
  • Ascend Model Runner
  • Ascend MindIE Backend
  • Ascend SingleOps Backend
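
As a sketch of how these components might be wired in, the engine could dispatch on the configured device type, as it does for other backends (the "npu" device string and the ascend_executor module path below are assumptions; the module does not exist yet):

```python
# Hypothetical dispatch sketch; vLLM's actual executor selection
# logic may differ.
def get_executor_class(device_type: str):
    if device_type == "npu":
        from vllm.executor.ascend_executor import AscendExecutor
        return AscendExecutor
    from vllm.executor.gpu_executor import GPUExecutor
    return GPUExecutor
```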
