9 Comments
Michael Tan

Thanks for sharing this.

My main concern is the hardware lifecycle - how quickly it becomes outdated, depreciates, or becomes unserviceable. That’s currently holding me back from committing to the investment. Curious how you view that risk.

Sean Chase

What a great write-up. I’ve been debating an M5 128 a lot lately. I’ve been renting on Vast.

Joey Hipolito

Been running local models on an M2 Max for about the same amount of time, and the thing that surprised me most was how much I stopped caring about the "vs cloud" debate. It's just always there, with no context leaking anywhere. The model quality tradeoff is real though. I still hit the cloud for anything that needs serious reasoning.

Sven

Excellent post. Thanks for sharing!

Simon Ryan

Great post! I'm using your mlx-lm fork, very nice!

Here is my GLM-5.1 llama-swap config:

(This one works with the model in a non-Hugging Face cache directory, in case that helps anyone.)

"GLM-5.1-mlx":
  env:
    - "CUDA_VISIBLE_DEVICES=0"
  proxy: "http://0.0.0.0:${PORT}"
  # Increase timeouts for long-running GLM-5.1 requests
  timeouts:
    connect: 60
    responseHeader: 3600
    tlsHandshake: 30
    idleConn: 300
  # This must match the model given on the command line or it will try to use hf
  useModelName: /Users/simon/models/mlx/GLM-5.1-RAM-420GB-MLX
  cmd: >
    /opt/local/bin/uvx --from git+https://github.com/spicyneuron/mlx-lm.git@cache-warmup mlx_lm.server
    --host 127.0.0.1
    --port ${PORT}
    --max-tokens 150000
    --prompt-cache-warmup
    --temp 1.0
    --top-p 0.95
    --top-k 48
    --min-p 0.01
    --model mlx/GLM-5.1-RAM-420GB-MLX/

(It seems your comments don't support markdown.)

Tatsu Ikeda

Really great report! I did a 3090 setup 2 years ago and it was disappointing. I am ready to trade speed for model size at this point.

Alastair

Amazing level of detail here. Thanks so much for sharing it.

Makuchaku

Important question: were you able to use this setup to code any complex project?

I found that running Qwen on my so-smol 5080 works, but it is not intelligent enough for my coding needs. Perhaps Kimi changes that experience? I even tried Gemma 4, and it cannot even complete basic tasks for me.

spicyneuron

Kimi and Qwen 3.5 397 are definitely above the “can do useful coding” threshold, but still far from the same level as Codex or Claude. I mostly use them to soak up low-stakes / well-defined tasks that aren’t worth spending $20/month plan tokens on.