Running DeepSeek R1 on 8x RTX 6000 PRO #5581

@aikitoria

Description

Hi, is this planned to be supported? With 768 GB of total VRAM (8x 96 GB), this setup should be able to load the original FP8 weights, and with ~14 TB/s of combined memory bandwidth it should also be very fast at low batch sizes in tensor parallel mode!

However, trying to load the model with the sample script currently fails with:

Unsupported SM version for FP8 block scaling GEMM
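
For context, this is roughly the kind of load that hits the error (a minimal sketch using the TensorRT-LLM Python LLM API; the model path and settings here are illustrative, not the exact sample script):

```python
# Minimal sketch, assuming the TensorRT-LLM Python LLM API (PyTorch backend).
# Model path and settings are illustrative, not the exact sample script.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-R1",  # original FP8 block-scaled checkpoint
    tensor_parallel_size=8,           # one shard per RTX 6000 PRO
)

# Loading the FP8 checkpoint is where "Unsupported SM version for
# FP8 block scaling GEMM" is raised on these sm120 cards.
outputs = llm.generate(["Hello"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```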

The FP4 model does load, but it is vastly slower than expected: only 27 t/s at batch size 1, where for 37B active parameters I would have expected an order of magnitude more (see the rough estimate below).
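
As a back-of-envelope sanity check on that expectation (my own numbers, assuming batch-1 decode is memory-bandwidth bound and ~1.79 TB/s per card, which is an assumed spec, not a measured figure):

```python
# Back-of-envelope decode ceiling: at batch 1, each generated token must read
# every active weight once, so tokens/s <= aggregate bandwidth / bytes per token.
active_params = 37e9        # DeepSeek R1 active parameters per token
aggregate_bw = 8 * 1.79e12  # bytes/s across 8 cards (~14.3 TB/s, assumed spec)

for name, bytes_per_param in [("FP8", 1.0), ("FP4", 0.5)]:
    ceiling = aggregate_bw / (active_params * bytes_per_param)
    print(f"{name}: ~{ceiling:.0f} t/s upper bound")

# FP8: ~387 t/s, FP4: ~774 t/s -- so 27 t/s is more than an order of
# magnitude below even the FP8 ceiling, before any overlap inefficiency.
```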

Maybe RTX 6000 PRO support is still a work in progress in general? I have been seeing a number of sm120-related commits recently.

Metadata

Labels

not a bug (Some known limitation, but not a bug.)
