405B is beyond homelab-scale. I recently obtained a 4x4090 rig, and I am comfortable running 70B and occasionally 128B-class models. For 405B, you need 8xH100 or better. A single H100 costs around $40k.
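For anyone wondering why 405B lands in 8xH100 territory: weights alone take roughly params × bits-per-weight / 8 bytes, and that has to fit in aggregate VRAM with headroom for the KV cache. A back-of-envelope sketch (the bpw figures are illustrative, not exact for any particular quant):

    # Rough weight footprint in GB: params (billions) * bits per weight / 8.
    # Ignores KV cache and activations, so treat these as lower bounds.
    def weight_gb(params_b: float, bpw: float) -> float:
        return params_b * bpw / 8

    print(weight_gb(70, 4.5))   # ~39 GB  -> fits in 4x4090 (96 GB aggregate)
    print(weight_gb(405, 8.0))  # ~405 GB -> needs 8xH100 (640 GB aggregate)
    print(weight_gb(405, 4.5))  # ~228 GB -> still far beyond 4x4090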
Approximately how many tokens per second would the ~$40k × 8 ≈ $320k version process? Would that be a roughly 32x performance boost over other setups? Thanks!
If you really want to know an exact number for a specific use case, you can rent an 8xH100 node on RunPod and benchmark it.
You should expect somewhere around 30 t/s for a single response when running the FP8 rowwise quant that would typically be used on such a node with TensorRT-LLM, and massively more in aggregate with batching.
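If you do rent a node, a minimal measurement sketch, assuming the server exposes an OpenAI-compatible /v1/completions endpoint (vLLM does, and TensorRT-LLM can via Triton's OpenAI-compatible frontend). The URL and model id are placeholders for whatever the node actually serves:

    # Minimal throughput check against an OpenAI-compatible server.
    import time
    from concurrent.futures import ThreadPoolExecutor
    import requests

    URL = "http://localhost:8000/v1/completions"  # placeholder endpoint
    MODEL = "llama-3.1-405b-fp8"                  # placeholder model id

    def one_request():
        t0 = time.time()
        r = requests.post(URL, json={
            "model": MODEL,
            "prompt": "Explain KV caching in one paragraph.",
            "max_tokens": 256,
        }, timeout=600)
        r.raise_for_status()
        return r.json()["usage"]["completion_tokens"], time.time() - t0

    # Single stream: this is the "~30 t/s for a single response" number.
    tokens, dt = one_request()
    print(f"single stream: {tokens / dt:.1f} t/s")

    # 32 concurrent requests: the server's continuous batching is what makes
    # aggregate throughput massively higher than any one stream.
    t0 = time.time()
    with ThreadPoolExecutor(max_workers=32) as ex:
        results = list(ex.map(lambda _: one_request(), range(32)))
    wall = time.time() - t0
    print(f"aggregate: {sum(t for t, _ in results) / wall:.1f} t/s")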
That quant is roughly twice the size of the 4.5bpw one used on the Mac, though (8 bpw vs 4.5 bpw). A lower-quality quant would be faster.
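The speed difference follows from single-stream decode being memory-bandwidth bound: every generated token has to read the weights once, so t/s is capped near aggregate HBM bandwidth divided by model size. A rough roofline under ideal assumptions (real numbers land well below these ceilings):

    # Single-stream decode roofline: t/s <= aggregate bandwidth / weight bytes.
    # Assumes ideal 8-way tensor parallelism; ignores KV cache reads and
    # inter-GPU communication, so treat these as upper bounds.
    bw = 8 * 3.35e12  # 8x H100 SXM at ~3.35 TB/s HBM each
    for name, gbytes in [("FP8 (8 bpw)", 405), ("4.5 bpw quant", 228)]:
        print(f"{name}: <= {bw / (gbytes * 1e9):.0f} t/s")  # ~66 vs ~118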