
I experimented with both Exo and llama.cpp in RPC-server mode this week. With Exo specifically, using an M3 Max and an M1 Ultra, I was able to get around 13 tok/s on DeepSeek 2.5 236B (using MLX and a 4-bit quant with a very small test prompt, so maybe 140 GB total of model + cache). It definitely took some trial and error, but the Exo community folks were super helpful and responsive with debugging and advice.
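For anyone wanting to sanity-check that memory figure, here is a rough back-of-the-envelope sketch. It assumes exactly 4 bits per weight and decimal gigabytes, and the helper name `quantized_weight_gb` is made up for illustration. Real quantized checkpoints also store per-group scales, so the actual weight size will be somewhat higher than this.

```python
def quantized_weight_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes).

    Assumption: every parameter takes exactly `bits_per_weight` bits;
    quantization metadata (scales, zero points) is ignored.
    """
    return n_params * bits_per_weight / 8 / 1e9


# 236B parameters at 4 bits each
weights_gb = quantized_weight_gb(236e9, 4)
print(f"weights: ~{weights_gb:.0f} GB")  # weights: ~118 GB
```

That puts the weights alone at roughly 118 GB, so a total around 140 GB would leave on the order of 20 GB for the cache, quantization metadata, and runtime overhead.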

