
Approximately how many tokens per second would the (edited) roughly $40k x 8, so about $320k or more, version process? Would this result in a roughly 32x boost in performance compared to other setups? Thanks!


If you really want to know an exact number for a specific use case, you can rent an 8xH100 node on RunPod and benchmark it.

You should expect somewhere around 30 t/s for a single response when running the FP8 rowwise quant that would typically be used on such a node, with TensorRT-LLM. Total throughput is massively higher with batching.

That quant is roughly twice the size of the 4.5bpw one used on the Mac, though. A lower-quality quant would be faster.
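
To put rough numbers on the size comparison: FP8 stores 8 bits per weight against 4.5, so the ratio is closer to 1.8x than a full 2x. Here is a back-of-the-envelope sketch in Python. The parameter count is a made-up placeholder, not the model from this thread, and the function names are mine. It only counts the weights (no KV cache, activations, or per-block quantization overhead). The usual intuition, which I'm assuming here rather than measuring, is that single-response decoding is largely limited by how many weight bytes have to be read per token, which is why a smaller quant tends to be faster.

```python
def weight_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GiB."""
    return n_params * bits_per_weight / 8 / 2**30


def size_ratio(bpw_a: float, bpw_b: float) -> float:
    """How many times larger a quant at bpw_a is than one at bpw_b."""
    return bpw_a / bpw_b


if __name__ == "__main__":
    N_PARAMS = 100e9  # hypothetical parameter count, for illustration only

    fp8 = weight_gib(N_PARAMS, 8.0)   # FP8: 8 bits per weight
    q45 = weight_gib(N_PARAMS, 4.5)   # the 4.5bpw quant mentioned above
    ratio = size_ratio(8.0, 4.5)

    print(f"FP8: {fp8:.1f} GiB, 4.5bpw: {q45:.1f} GiB, ratio: {ratio:.2f}x")
```

Swap in the real parameter count for your model to get actual sizes. For real throughput numbers you still have to benchmark, as suggested above.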



