You actually need a lot less than that if you use the mmap option, because then ...

noman-land · on Sept 21, 2024

Can you say a bit more about this? Based on my non-scientific personal experience on an M1 with 64gb memory, that's approximately what it seems to be. If the model is 4gb in size, loading it up and doing inference takes about 4gb of memory. I've used LM Studio and llamafiles directly and both seem to exhibit this behavior. I believe llamafiles use mmap by default based on what I've seen jart talk about. LM Studio allows you to "GPU offload" the model by loading it partially or completely into GPU memory, so not sure what that means.

ycombinatrix · on Sept 21, 2024

How does one set this up?

ignoramous · on Sept 21, 2024

With ggml the mmap part is the default. It isn't a panacea though [0]. Note that most runtimes (like MLX, ONNX, TensorFlow, JAX/XLA etc) will employ a number of techniques for efficient inference and mmap is just one part of it.

[0] https://news.ycombinator.com/item?id=35455930

andersa · on Sept 22, 2024

You missed the part where this is slow as hell.