I experimented with both Exo and llama.cpp in RPC-server mode this week. Using a...

I experimented with both Exo and llama.cpp in RPC-server mode this week. Using an M3 Max and an M1 Ultra in Exo specifically I was able to get around 13 tok/s on DeepSeek 2.5 236B (using MLX and a 4 bit quant with a very small test prompt - so maybe 140 gigs total of model+cache). It definitely took some trial and error but the Exo community folks were super helpful/responsive with debugging/advice.