Approximately, how many tokens per second would the (edited) >~ $ 40k x 8 >=~ $3...

andersa · on Sept 22, 2024

If you really want to know an exact number for a specific use case, you can rent an 8xH100 node on RunPod and benchmark it.
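For what it's worth, the measurement itself is simple. Here is a minimal sketch of timing single-stream tokens per second, assuming your inference client exposes some kind of token stream. `tokens_per_second` and `fake_stream` are hypothetical names for illustration; the fake stream only stands in for a real client so the snippet runs on its own.

```python
import time
from typing import Callable, Iterable


def tokens_per_second(stream_factory: Callable[[], Iterable[str]]) -> float:
    """Consume one token stream and return tokens generated per second."""
    start = time.perf_counter()
    n_tokens = sum(1 for _ in stream_factory())
    elapsed = time.perf_counter() - start
    return n_tokens / elapsed if elapsed > 0 else float("inf")


def fake_stream() -> Iterable[str]:
    # Hypothetical stand-in for a real streaming response:
    # yields 50 tokens, sleeping about 1 ms before each one.
    for _ in range(50):
        time.sleep(0.001)
        yield "tok"


rate = tokens_per_second(fake_stream)
print(f"{rate:.1f} tokens/s")
```

On a real node you would swap `fake_stream` for a function that streams one response from the server, and run it several times with the prompt and output lengths you actually care about.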

You should expect somewhere around 30 t/s for a single response when running the FP8 rowwise quant that would typically be used on such a node with TensorRT-LLM. Total throughput with batching would be massively higher.

That quant is roughly twice the size of the 4.5bpw one used on the Mac, though. A lower-quality quant would be faster.
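The size comparison is just bits-per-weight arithmetic. A rough sketch, assuming weights dominate the footprint and ignoring quantization scales and the KV cache; `N_PARAMS` is a purely illustrative placeholder, not the model from this thread, and the ratio does not depend on it.

```python
def weight_bytes(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in bytes at a given bits-per-weight."""
    return n_params * bits_per_weight / 8


N_PARAMS = 100e9  # hypothetical parameter count, for illustration only

fp8_bytes = weight_bytes(N_PARAMS, 8.0)  # FP8: 8 bits per weight
q45_bytes = weight_bytes(N_PARAMS, 4.5)  # the 4.5bpw quant
ratio = fp8_bytes / q45_bytes

print(f"FP8:    {fp8_bytes / 1e9:.1f} GB")
print(f"4.5bpw: {q45_bytes / 1e9:.1f} GB")
print(f"ratio:  {ratio:.2f}x")
```

If single-stream decoding is mostly limited by how fast the weights can be read from memory, which is the usual assumption, then fewer bytes per weight means more tokens per second on the same hardware. That is why a smaller quant would be faster.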