My main concern is the hardware lifecycle - how quickly it becomes outdated, depreciates, or unserviceable. That’s currently holding me back from committing to the investment. Curious how you view that risk.
Been running local models on an M2 Max for about the same time and the thing that surprised me most was how much I stopped caring about the "vs cloud" debate. It's just always there, no context leaking anywhere. Model quality tradeoff is real though. I still hit the cloud for anything that needs serious reasoning.
Important question : were you able to use this setup to code any complex project?
I found running Qwen on my so-smol 5080 works, but is not intelligent enough for my coding needs. Perhaps Kimi changes that experience? I even tried Gemma 4 and it cannot even complete basic tasks for me.
Kimi and Qwen 3.5 397 are definitely above the “can do useful coding” threshold, but still far from the same level as Codex or Claude. I mostly use them to soak up low-stakes / well-defined tasks that aren’t worth spending $20/month plan tokens on.
Thanks for sharing this.
My main concern is the hardware lifecycle - how quickly it becomes outdated, depreciates, or unserviceable. That’s currently holding me back from committing to the investment. Curious how you view that risk.
What a great write up. I debate an M5 128 a lot lately. I’ve been renting on Vast.
Been running local models on an M2 Max for about the same time and the thing that surprised me most was how much I stopped caring about the "vs cloud" debate. It's just always there, no context leaking anywhere. Model quality tradeoff is real though. I still hit the cloud for anything that needs serious reasoning.
Excellent post. Thanks for sharing!
Great Post, I'm using your mlx-lm fork, very nice!
Here is my GLM-5.1 llama-swap config:
(This one works with the model in a non hugging-face cache directory - in case that helps anyone)
"GLM-5.1-mlx":
env:
- "CUDA_VISIBLE_DEVICES=0"
proxy: "http://0.0.0.0:${PORT}"
# Increase timeouts for long-running GLM-5.1 requests
timeouts:
connect: 60
responseHeader: 3600
tlsHandshake: 30
idleConn: 300
# This must match the model given on the command line or it will try to use hf
useModelName: /Users/simon/models/mlx/GLM-5.1-RAM-420GB-MLX
cmd: >
/opt/local/bin/uvx --from git+https://github.com/spicyneuron/mlx-lm.git@cache-warmup mlx_lm.server
--host 127.0.0.1
--port ${PORT}
--max-tokens 150000
--prompt-cache-warmup
--temp 1.0
--top-p 0.95
--top-k 48
--min-p 0.01
--model mlx/GLM-5.1-RAM-420GB-MLX/
(Apologies for lack of indent, it seems your comments dont support markdown)
Really great report! I did a 3090 setup 2 years ago and it was disappointing. I am ready to trade speed for model size at this point.
Amazing level of detail here. Thanks so much for sharing it.
Important question : were you able to use this setup to code any complex project?
I found running Qwen on my so-smol 5080 works, but is not intelligent enough for my coding needs. Perhaps Kimi changes that experience? I even tried Gemma 4 and it cannot even complete basic tasks for me.
Kimi and Qwen 3.5 397 are definitely above the “can do useful coding” threshold, but still far from the same level as Codex or Claude. I mostly use them to soak up low-stakes / well-defined tasks that aren’t worth spending $20/month plan tokens on.