  [NeuralChat] CUDA serving with Triton Inference Server #1293
 
Type of Change
Task
Description
Support serving and deploying NeuralChat models with Triton Inference Server on CUDA devices, in single-card or multi-card configurations.
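As a rough illustration of what a multi-card deployment involves, the sketch below shows a Triton `config.pbtxt` with an `instance_group` spanning two GPUs. The model name, backend, and tensor names here are placeholders, not the actual files shipped in this PR.

```
# Hypothetical model config; names, shapes, and backend are illustrative only.
name: "neuralchat_text_generation"
backend: "python"
max_batch_size: 0

input [
  {
    name: "INPUT0"
    data_type: TYPE_STRING
    dims: [ 1 ]
  }
]
output [
  {
    name: "OUTPUT0"
    data_type: TYPE_STRING
    dims: [ 1 ]
  }
]

# With gpus listed, Triton creates `count` instances on each listed GPU;
# this is the mechanism by which a model is spread across multiple CUDA cards.
instance_group [
  {
    count: 1
    kind: KIND_GPU
    gpus: [ 0, 1 ]
  }
]
```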
Wrapped a new Docker image: spycsh/triton_neuralchat_gpu:v2.
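Launching the image might look like the following minimal sketch. The port mappings assume Triton's default HTTP/gRPC/metrics ports, and it is an assumption that the image's entrypoint starts `tritonserver` with a bundled model repository; check the image documentation before relying on this.

```sh
# Pull the wrapped image and start Triton with all visible GPUs.
# Ports (Triton defaults, assumed here): 8000 = HTTP, 8001 = gRPC, 8002 = metrics.
docker pull spycsh/triton_neuralchat_gpu:v2
docker run --rm --gpus all --shm-size 1g \
  -p 8000:8000 -p 8001:8001 -p 8002:8002 \
  spycsh/triton_neuralchat_gpu:v2
```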
The v2 tag is the version that also enables multi-card instance group initialization (see the instance_group sketch above).

Expected Behavior & Potential Risk
Serving and deploying NeuralChat models with Triton Inference Server on CUDA devices works as expected.
How has this PR been tested?
Verified with the accompanying example; a minimal smoke test is sketched below.
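For reference, a smoke test against a running server could look like the request below, using Triton's standard KServe v2 HTTP inference endpoint. The model and tensor names are hypothetical and must match the actual model repository.

```sh
# Hypothetical smoke test: send one prompt to the served model over HTTP.
# Model and tensor names are placeholders matching the config sketch above.
curl -s -X POST http://localhost:8000/v2/models/neuralchat_text_generation/infer \
  -H 'Content-Type: application/json' \
  -d '{
        "inputs": [
          {
            "name": "INPUT0",
            "shape": [1],
            "datatype": "BYTES",
            "data": ["Tell me about Intel Xeon Scalable Processors."]
          }
        ]
      }'
```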
Dependency Change?
None. It requires numba, but that dependency is already wrapped into the Docker image; there is no change to itrex itself.