
405B is beyond homelab-scale. I recently obtained a 4x4090 rig, and I am comfortable running 70B and occasionally 128B-class models. For 405B, you need 8xH100 or better. A single H100 costs around $40k.
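
Rough weights-only math (Python sketch; ignores KV cache and activation overhead, which add more on top):

    params = 405e9
    for name, bits in [("FP16", 16), ("FP8", 8), ("4.5bpw", 4.5)]:
        print(f"{name}: {params * bits / 8 / 1e9:.0f} GB of weights")
    print("4x4090:", 4 * 24, "GB VRAM")   # 96 GB
    print("8xH100:", 8 * 80, "GB VRAM")   # 640 GB

Even a ~4.5-bit quant needs ~228 GB just for weights, so 96 GB of 4090s is out, while 8x80 GB fits FP8 with room to spare for KV cache.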


Here is someone running 405B on 12x3090s at 4.5bpw. Total cost was around $10k.

https://www.reddit.com/r/LocalLLaMA/comments/1ej9uzh/local_l...

Admittedly it's slow (~3.5 tokens/sec).
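
Quick fit check for that setup (weights only; runtime overhead and KV cache eat into the remainder):

    weights_gb = 405e9 * 4.5 / 8 / 1e9   # ~228 GB at 4.5 bits per weight
    vram_gb = 12 * 24                    # 288 GB across twelve 3090s
    print(f"{weights_gb:.0f} GB of weights, {vram_gb - weights_gb:.0f} GB left over")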


This is really suboptimal. They should buy 4 more 3090s: most inference engines require the tensor-parallel degree to evenly divide the model's attention heads (128 for 405B), so 16 cards work where 12 don't, and they could run it massively faster with tensor parallelism.
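
A minimal sketch of what that launch might look like with vLLM, assuming a quantized checkpoint that actually fits across 16 cards (the model name is a placeholder):

    from vllm import LLM, SamplingParams

    # 16-way tensor parallelism: each layer's weight matrices are sharded across all 16 GPUs.
    llm = LLM(
        model="meta-llama/Llama-3.1-405B-Instruct",  # in practice, a quantized build
        tensor_parallel_size=16,
    )
    out = llm.generate(["Hello"], SamplingParams(max_tokens=32))
    print(out[0].outputs[0].text)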


Approximately how many tokens per second would the 8xH100 version (~$40k x 8 ≈ $320k) process? Would it give a roughly 32x performance boost over the cheaper setups? Thanks!


If you really want to know an exact number for a specific use case, you can rent an 8xH100 node on RunPod and benchmark it.
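
A minimal way to measure it yourself, assuming you already have an OpenAI-compatible server (vLLM, TensorRT-LLM behind Triton, etc.) running on the node; the URL and model name here are placeholders:

    import time
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
    start = time.time()
    resp = client.chat.completions.create(
        model="llama-3.1-405b",  # whatever name the server registers
        messages=[{"role": "user", "content": "Explain tensor parallelism in one paragraph."}],
        max_tokens=512,
    )
    elapsed = time.time() - start
    tokens = resp.usage.completion_tokens
    print(f"{tokens} tokens in {elapsed:.1f}s -> {tokens / elapsed:.1f} tok/s")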

You should expect somewhere around 30 tokens/sec for a single response when running the FP8 row-wise quant typically used on such a node with TensorRT-LLM. Total throughput is massively higher with batching.

That quant is roughly twice the size of the 4.5bpw one mentioned above, though. A lower-quality quant would be faster.
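
Taking the numbers in this thread at face value (different quant sizes, so it's apples to oranges), the single-stream gap is nowhere near 32x; the bigger multiplier only shows up in batched throughput:

    h100_single = 30.0   # tok/s, 8xH100 FP8 estimate above
    rig_3090 = 3.5       # tok/s, 12x3090 at 4.5bpw
    print(f"single-stream speedup: ~{h100_single / rig_3090:.1f}x")  # ~8.6x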


I bought a used dual-socket Xeon workstation with 768GB of RAM for ~$3k, and it runs the 405B model at ~0.3 tokens/sec.
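
That is roughly what a bandwidth-bound estimate predicts, since each generated token has to stream every weight out of RAM once. The numbers below are assumptions (an ~8-bit quant, ~230 GB/s aggregate DDR4 bandwidth), not from the post:

    model_gb = 405e9 * 8 / 8 / 1e9   # ~405 GB of weights at an assumed 8-bit quant
    bandwidth_gbs = 230              # assumed dual-socket aggregate memory bandwidth
    print(f"upper bound ~{bandwidth_gbs / model_gb:.2f} tok/s")  # ~0.57 before NUMA/compute overhead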



