405B is beyond homelab-scale. I recently obtained a 4x4090 rig, and I am comfortable running 70B and occasionally 128B-class models. For 405B, you need 8xH100 or better. A single H100 costs around $40k.
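For anyone wondering why 405B lands in 8xH100 territory: weights alone take roughly params × bits-per-weight / 8 bytes, and that has to fit in aggregate VRAM with headroom for the KV cache. A back-of-envelope sketch (the bpw figures are illustrative, not exact for any particular quant):

    # Rough weight footprint in GB: params (billions) * bits per weight / 8.
    # Ignores KV cache and activations, so treat these as lower bounds.
    def weight_gb(params_b: float, bpw: float) -> float:
        return params_b * bpw / 8

    print(weight_gb(70, 4.5))   # ~39 GB  -> fits in 4x4090 (96 GB aggregate)
    print(weight_gb(405, 8.0))  # ~405 GB -> needs 8xH100 (640 GB aggregate)
    print(weight_gb(405, 4.5))  # ~228 GB -> still far beyond 4x4090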
Approximately how many tokens per second would the ~$40k × 8 ≈ $320k version process? Would that be a roughly 32x performance boost over other setups? Thanks!
If you really want to know an exact number for a specific use case, you can rent an 8xH100 node on RunPod and benchmark it.
You should expect somewhere around 30 t/s for a single response when running the FP8 rowwise quant that would typically be used on such a node with TensorRT-LLM, and massively more in aggregate with batching.
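If you do rent a node, a minimal measurement sketch, assuming the server exposes an OpenAI-compatible /v1/completions endpoint (vLLM does, and TensorRT-LLM can via Triton's OpenAI-compatible frontend). The URL and model id are placeholders for whatever the node actually serves:

    # Minimal throughput check against an OpenAI-compatible server.
    import time
    from concurrent.futures import ThreadPoolExecutor
    import requests

    URL = "http://localhost:8000/v1/completions"  # placeholder endpoint
    MODEL = "llama-3.1-405b-fp8"                  # placeholder model id

    def one_request():
        t0 = time.time()
        r = requests.post(URL, json={
            "model": MODEL,
            "prompt": "Explain KV caching in one paragraph.",
            "max_tokens": 256,
        }, timeout=600)
        r.raise_for_status()
        return r.json()["usage"]["completion_tokens"], time.time() - t0

    # Single stream: this is the "~30 t/s for a single response" number.
    tokens, dt = one_request()
    print(f"single stream: {tokens / dt:.1f} t/s")

    # 32 concurrent requests: the server's continuous batching is what makes
    # aggregate throughput massively higher than any one stream.
    t0 = time.time()
    with ThreadPoolExecutor(max_workers=32) as ex:
        results = list(ex.map(lambda _: one_request(), range(32)))
    wall = time.time() - t0
    print(f"aggregate: {sum(t for t, _ in results) / wall:.1f} t/s")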
That quant is roughly twice the size of the 4.5bpw one used on the Mac, though (8 bpw vs 4.5 bpw). A lower-quality quant would be faster.
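The speed difference follows from single-stream decode being memory-bandwidth bound: every generated token has to read the weights once, so t/s is capped near aggregate HBM bandwidth divided by model size. A rough roofline under ideal assumptions (real numbers land well below these ceilings):

    # Single-stream decode roofline: t/s <= aggregate bandwidth / weight bytes.
    # Assumes ideal 8-way tensor parallelism; ignores KV cache reads and
    # inter-GPU communication, so treat these as upper bounds.
    bw = 8 * 3.35e12  # 8x H100 SXM at ~3.35 TB/s HBM each
    for name, gbytes in [("FP8 (8 bpw)", 405), ("4.5 bpw quant", 228)]:
        print(f"{name}: <= {bw / (gbytes * 1e9):.0f} t/s")  # ~66 vs ~118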