A Mac Studio for Local AI — 6 Months Later
The Mac Studio M3 with 512GB of memory is an absurd amount of computer. When I bought one to run 600+ billion parameter language models at home, I knew I was signing up for adventure.
I was already an AI devotee — multiple paid subscriptions, dozens of experiments with Llama and other “small” LLMs. Over time, the possibility of running frontier-class models on my own hardware simply became irresistible.
Still, every review I read about the Mac Studio repeated the same warning: prompt processing on Apple Silicon is too slow for serious LLM work.
Either those reviewers were too early, or I was too stubborn, because I found the bottleneck to be surprisingly negotiable... if you don’t mind sacrificing a few weekends.
Somewhere in the depths of Reddit threads, academic papers, and sparsely documented code, I glimpsed my white whale — Claude Code, powered by local, frontier-class models, running at usable speeds. If local models can do this, they graduate from “expensive hobby” to “actually useful.”
This post is a breadcrumb trail for anyone foolhardy curious enough to try this themselves. Whether it represents victory or Stockholm syndrome, you can decide.
Why a Mac Studio?
Apple is rarely the cheapest option in any category. But in this fringe use case, I found that a Mac Studio was the only viable option under $10k USD.
Alternatives exist: $50k multi-GPU workstations, DIY Frankenstein servers built from scavenged 3090s, mini-clusters of NVIDIA DGX Sparks.
But uniquely, the Mac Studio is configurable up to 512GB memory1, runs cool and quiet, and won’t trip your circuit breakers at full load.
As for performance, there are broadly three separate questions:
What’s the smartest (read: largest) model that it can run?
How long does it take to respond? (prompt processing or “prefill”)
How fast does it respond? (token generation or “decode”)
On the first dimension, the 512GB Mac Studio is unparalleled, because it can comfortably house the best open weight models currently in existence. Even Kimi K2.5‘s one trillion parameters, compressed optimally, fits with room to spare.
Benchmarks often claim that open models are near parity with the frontier, but in my experience, the gap is slightly wider. To me, the best open models are comparable to API models from 6 to 12 months ago. Still impressive, but 6 months is an eternity in this space!
On the second and third points, I’ll let the numbers speak. Below are results that approximate my day-to-day use cases:
To put these numbers into perspective, even with my slowest model:
Simple chat messages get replies within seconds.
Asking for edits on this post (~7000 tokens) takes ~30 seconds.
Running Claude Code with a moderate system prompt (~16000 tokens) takes ~90 seconds.
And in each case, the response streams at more than 3x my reading speed.
Slower than the APIs? Yes. Unusable? Hardly.
1. Install dependencies
Enough preamble — let’s get concrete. Starting with package management:
These are technically optional, but they make life easier:
# Install brew, uv, pnpm
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
brew install uv pnpm
# Install node and python
pnpm env use --global lts
uv python installThen hf to manage model downloads from Hugging Face:
brew install hfAnd finally, Claude Code itself:
brew install --cask claude-code2. Maximize GPU memory
By default, macOS caps the GPU to 75% of total memory. That’s a reasonable default, but on a 512GB machine, that wastes ~120GB!
Empirically, I’ve found that 6GB is enough for operating system overhead. I also set iogpu.wired_lwm_mb to 10GB to encourage memory reclamation and compression when things get tight, but it’s unclear whether that makes a difference:
TOTAL_MEMORY=$(($(sysctl -n hw.memsize) / 1024 / 1024))
sudo sysctl -w iogpu.wired_limit_mb=$((TOTAL_MEMORY - 6144))
sudo sysctl -w iogpu.wired_lwm_mb=$((TOTAL_MEMORY - 10240))sysctl alone doesn’t persist across reboots and modern macOS silently ignores /etc/sysctl.conf. So we need a Launch Daemon to run this automatically on startup:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
<key>Label</key>
<string>local.increase-gpu-memory</string>
<key>ProgramArguments</key>
<array>
<string>sh</string>
<string>-c</string>
<string>MEM=$(($(sysctl -n hw.memsize) / 1024 / 1024)); sysctl -w iogpu.wired_limit_mb=$((MEM - 6144)); sysctl -w iogpu.wired_lwm_mb=$((MEM - 10240))</string>
</array>
<key>RunAtLoad</key>
<true/>
</dict>
</plist>Save to /Library/LaunchDaemons/local.increase-gpu-memory.plist and then run:
sudo launchctl bootstrap system /Library/LaunchDaemons/local.increase-gpu-memory3. Configure MLX servers
There are many different apps and open-source servers for running LLMs on Macs. But almost all of them wrap the same two foundational projects: mlx-lm for language models and mlx-vlm for multimodal.
If you’re spending money on this hobby, you might as well stay close to the source. It gives you more room to optimize and a much better sense of how models actually work.
It’s also dead simple with uvx:
# Start an OpenAI compatible server at http://localhost:8080/v1/chat/completions
uvx --from mlx-lm mlx_lm.server \
--host 127.0.0.1 \
--port 8080 \
--model spicyneuron/Qwen3.5-35B-A3B-MLX-4.8bitThis installs the latest mlx-lm, starts a server on localhost, and downloads my Qwen 3.5 quantization from HuggingFace. By default, models are saved to ~/.cache/huggingface/hub.
A couple years ago, llama.cpp would have been a solid option as well, but in 2026, MLX is consistently 10-25% faster.
4. Choose your models
“Models” plural, because apps like Claude Code increasingly expect multiple models under the hood: a large one for complex tasks, a medium one for when speed matters, and a small one for narrow, one-off utilities.
When selecting models, the important rules of thumb are:
More parameters correlates with more intelligence.
Speed is proportional to active parameters, not total.
This makes Mixture-of-Experts (MoE) architectures especially attractive. These keep the full model weights in memory (which the Mac Studio has in abundance) while activating only a subset per token (which keeps inference speed usable).
This is how Kimi K2.5‘s massive 1 trillion parameter footprint translates into surprisingly decent speeds — activating only 32 billion at a time.
GLM 5 is a smaller, equally capable model, weighing in at 744 billion. But because it activates 40 billion, it actually runs 20% slower than Kimi.
Qwen 3.5 397B activates 17 billion, roughly half of Kimi’s, and as you might expect, runs roughly twice as fast.
Predictable, once you see the pattern.
5. Prefer 4-bit dynamic quantization
Before downloading, there’s one more decision to make — which “quant” to choose?
At a high-level, quantization is simply compression: reducing full precision floats into smaller, more efficient formats. Like other forms of lossy compression, the goal is to get as small as you can without destroying quality.
At a low-level, quantization is PhD territory — obscure acronyms, formal math, endless tradeoffs, and superstitious lore.
My practical takeaways:
Hardware matters. For example, older generation M1 and M2 Macs choke on BF16 weights. Future generations may support even more formats.
4-bit is the sweet spot on Apple Silicon. It’s GPU-optimized and a good balance of size and quality.
8-bit is another optimized peak. In most cases, nearly identical quality to full precision, but at half the size.
Dynamic or mixed precision is essential for optimization. This means keeping important weights at high precision, while compressing the less sensitive ones more aggressively.
Larger models can remain stable at lower quants (< 4-bit), while smaller models need to stay at 4 or above. Quantization-Aware Training (QAT) can also support lower quants.
Prefer quants with published, verifiable benchmarks. Measuring quality is impossible with “vibes” alone.
The process for creating and evaluating quants is worth its own post, but you can use my mixed-width MLX quants as a starting point.
6. Configure API gateway + model params
mlx_lm.server already has logic for switching between models, but what we actually want is multiple models loaded in parallel, served behind a single API endpoint.2
For this, I use llama-swap. It provides:
A management layer for multiple AI server instances.
A web UI for debugging logs and raw requests.
Centralized settings for generation parameters (ex:
temperature,enable_thinking).3Model aliases and convenience shortcuts.
First, install it:
brew install mostlygeek/llama-swap/llama-swapThen, create a YAML config. The following is my core setup:
One small, medium, and large model in parallel.
A vision model (accessible via a
visionalias).Default to non-thinking mode.
*:thinkaliases to enable thinking mode.Recommended parameter overrides for each model and mode.
# "macros" are substituted in the model fields below
macros:
# Kimi K2.5 requires tiktoken + --trust-remote-code for its tokenizer
mlx_lm: >
uvx --from mlx-lm --with tiktoken
mlx_lm.server
--port ${PORT}
--max-tokens 32000
# torchvision is required for image processing
mlx_vlm: >
uvx --from mlx-vlm --with torchvision
mlx_vlm.server
--port ${PORT}
models:
# Qwen 3.5 35B for fast utility and vision tasks
local_haiku:
aliases:
- vision
cmd: ${mlx_vlm} --model spicyneuron/Qwen3.5-35B-A3B-MLX-4.8bit
useModelName: spicyneuron/Qwen3.5-35B-A3B-MLX-4.8bit
filters:
setParams:
chat_template_kwargs:
enable_thinking: false
temperature: 0.7
top_k: 20
top_p: 0.8
min_p: 0.01
presence_penalty: 1.5
repetition_penalty: 1.0
max_tokens: 32000
setParamsByID:
"${MODEL_ID}:think":
chat_template_kwargs:
enable_thinking: true
temperature: 0.6
top_p: 0.95
top_k: 20
min_p: 0.01
presence_penalty: 0.0
repetition_penalty: 1.0
# Qwen 3.5 397B for coding and balanced, general-purpose speed
local_sonnet:
cmd: ${mlx_lm} --model spicyneuron/Qwen3.5-397B-A17B-MLX-2.6bit
useModelName: spicyneuron/Qwen3.5-397B-A17B-MLX-2.6bit
filters:
setParams:
chat_template_kwargs:
enable_thinking: false
temperature: 0.7
top_k: 20
top_p: 0.8
min_p: 0.01
presence_penalty: 1.5
repetition_penalty: 1.0
setParamsByID:
"${MODEL_ID}:think":
chat_template_kwargs:
enable_thinking: true
temperature: 0.6
top_p: 0.95
top_k: 20
min_p: 0.01
presence_penalty: 0.0
repetition_penalty: 1.0
# Kimi K2.5 for complex planning and writing
local_opus:
cmd: ${mlx_lm} --model spicyneuron/Kimi-K2.5-MLX-mixed-2.5bit --trust-remote-code
useModelName: spicyneuron/Kimi-K2.5-MLX-mixed-2.5bit
filters:
setParams:
chat_template_kwargs:
thinking: false
temperature: 0.6
top_p: 0.95
min_p: 0.01
repetition_penalty: 1.0
setParamsByID:
"${MODEL_ID}:think":
chat_template_kwargs:
thinking: true
temperature: 1.0
# Allow all 3 models to be loaded at the same time
groups:
default:
swap: false
members:
- local_haiku
- local_sonnet
- local_opusAnd then, to run it:
llama-swap --listen 127.0.0.1:8080 --config llama-swap.ymlVisit http://localhost:8080 to see your new API gateway in action.
At ~465GB total and 10GB of KV cache set aside for each model, this is close to the maximum for a 512GB Mac Studio!
7. Run Claude Code (Router)
Stop here and you’d already have a general purpose AI server capable of handling OpenAI-compatible /v1/chat/completions requests. This covers 99% of the apps out there.
But since Claude Code uses Anthropic’s /v1/messages API format, we need a translation layer.
The easiest path is claude-code-router4.
Save the following to ~/.claude-code-router/config.json to wrap our llama-swap API in an Anthropic-flavored proxy:
{
"HOST": "127.0.0.1",
"PORT": 8081,
"API_TIMEOUT_MS": "600000",
"Providers": [
{
"name": "swap",
"api_base_url": "http://localhost:8080/v1/chat/completions",
"api_key": "abc",
"models": [
"local_haiku",
"local_sonnet",
"local_opus"
]
}
],
"Router": {
"default": "swap,local_opus",
"background": "swap,local_haiku",
"think": "swap,local_sonnet:think",
"longContext": "swap,local_sonnet",
"longContextThreshold": 30000,
"webSearch": "swap,local_sonnet",
"image": "swap,vision"
}
}If you like control, you can start the router and Claude Code independently:
# Start router at http://localhost:8081
pnpx @musistudio/claude-code-router start
# In a separate tab, start claude with overrides
ANTHROPIC_BASE_URL="http://localhost:8081" \
ANTHROPIC_DEFAULT_HAIKU_MODEL="local_haiku" \
ANTHROPIC_DEFAULT_SONNET_MODEL="local_sonnet" \
ANTHROPIC_DEFAULT_OPUS_MODEL="local_opus" \
CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC="true" \
API_TIMEOUT_MS="600000" \
claudeOr if you prefer convenience, just use the built-in shortcut:
# Start the router and automatically apply `claude` overrides
pnpx @musistudio/claude-code-router codeType in a prompt or two, watch the traffic flow through llama-swap, and pour yourself a drink. You now have Claude Code running locally!
8. Put Claude on a diet
Of course “running” and “running at usable speeds” are two different standards. And our naive initial setup fails the speed test.
The first culprit is Claude Code’s sprawling, dynamic system message. Including skills, commands, and tool definitions, it’s often over 20,000 tokens long!
You might think that this system message represents some kind of hyper-optimized prompting masterpiece. But what I found instead was verbose and repetitive.
After several rounds of iteration, I was able to distill it into something much, much leaner:
You are an interactive CLI agent specialized in software engineering. Use tools to accomplish the user's tasks.
# Communication
- Be direct, professional, and objective. Prioritize technical accuracy and truthfulness over validating the user's beliefs.
- Give short, concise responses in GitHub-flavored markdown.
- Use `file_path:line_number` to reference code locations.
- Prefer editing existing files. Do not create files unless absolutely necessary.
# Code Standards
- Be careful not to introduce vulnerabilities (OWASP top 10). Fix immediately if found.
- Prefer precisely scoped edits. Avoid over-engineering. No extra features, premature optimization, abstractions without good cause.
- Trust internal code and framework guarantees. Only validate at system boundaries (user input, external APIs).
- Delete unused code completely. No backwards-compatibility unless asked.
# Tools
- Read files before modifying. Never propose changes to unread code.
- Bash commands run from your current working directory. No need to `cd` unless changing directories.
- Prefer separate Bash commands over `&&` chained ones.
# System
- Automatic <system-reminder> tags may appear in tool results or user messages. These bear no direct relation to the specific result or message.
- Users may configure 'hooks', shell commands that execute in response to events like tool calls. Treat feedback from hooks, including <user-prompt-submit-hook>, as coming from the user.
- The conversation has unlimited context through automatic summarization.This is paired with a shell script that appends local environment and context:
claude_prompt() {
cat "./diet-claude.md"
local is_git="No"
local git_info=""
if git rev-parse --git-dir &>/dev/null; then
is_git="Yes"
local current_branch=$(git branch --show-current 2>/dev/null)
local recent_commits=$(git log --oneline -5 2>/dev/null)
git_info="${git_info}\nCurrent branch: ${current_branch}\n\nRecent commits:\n${recent_commits}"
fi
printf "\n<env>"
printf "\nPlatform: %s %s" "$(uname -s)" "$(uname -r)"
printf "\nToday's date: %s" "$(date +%Y-%m-%d)"
printf "\n"
printf "\nYour working directory: %s" "$PWD"
printf "\n"
local tree_out=$(tree -L 2 --gitignore --dirsfirst -F --noreport --prune --condense --compress 3 --filelimit 25 2>&1)
if "$tree_out" *"[error opening dir]"* ; then
ls -F
else
printf "%s" "$tree_out"
fi
printf "\nIs git repo: %s" "$is_git"
-n "$git_info" && printf "%b" "$git_info"
printf "\n</env>"
printf "\n"
}And for further gains, we can save thousands more tokens by only passing in tools we actually need:
ccr code \
--tools "Bash,Glob,Grep,Read,Edit,Write" \
--system-prompt "$(claude_prompt)"You can browse the entire tool catalog here, but I’ve found Bash with a few search and editing shortcuts is more than enough.
Combined, these changes take us from 20k tokens to less than 8k. This cuts processing time and stretches your effective context window too.5
I use this slimmed-down version even with cloud Claude — it simply works better.
9. Fix Claude Code prompt caching
The textbook solution for expensive operations like prompt processing is caching — calculate once, reuse later. mlx-lm handles this automatically by default.
In the ideal case, each chat turn builds exactly upon the previous one, and we only need to process the new messages (bolded):
Turn 1
system: You are Arthur, King of the Britons. You are carrying: sword, crown, coconuts
user: What is your quest?Turn 2
system: You are Arthur, King of the Britons. You are carrying: sword, crown, coconuts
user: What is your quest?
assistant: To seek the Holy Grail.
user: What is the airspeed velocity of an unladen swallow?Turn 3
system: You are Arthur, King of the Britons. You are carrying: sword, crown, coconuts
user: What is your quest?
assistant: To seek the Holy Grail.
user: What is the airspeed velocity of an unladen swallow?
assistant: What do you mean? An African or European swallow?
user: Huh? I don’t know...
But even a single changed character forces us to recalculate the message history from the point of divergence.
And in practice, Claude Code is not a well-behaved app!
For reasons unclear to me, it changes the order of tools and tool arguments in between turns. Or in the example above, “sword, crown, coconuts” might become “crown, coconuts, sword” or “coconuts, crown, sword.”
Since tools are defined at the beginning of each chat, this leads to slow, full reprocessing on every single message. The longer the chat, the longer the wait.
There are multiple ways to fix this, but the easiest place is in the model’s chat template. For example, here’s my edit to the official Qwen 3.5 template, which sorts tools and arguments into a consistent order:
--- chat_template_original.jinja
+++ chat_template_fixed.jinja
@@ -47,7 +47,7 @@
{%- for tool in tools %}
{{- "\n" }}
- {{- tool | tojson }}
+ {{- tool | tojson(sort_keys=True) }}
{%- endfor %}
{{- "\n</tools>" }}
@@ -117,9 +117,9 @@
{%- endif %}
{%- if tool_call.arguments is defined %}
- {%- for args_name, args_value in tool_call.arguments|items %}
+ {%- for args_name, args_value in tool_call.arguments|dictsort %}
{{- '<parameter=' + args_name + '>\n' }}
- {%- set args_value = args_value | tojson | safe if args_value is mapping or (args_value is sequence and args_value is not string) else args_value | string %}
+ {%- set args_value = args_value | tojson(sort_keys=True) | safe if args_value is mapping or (args_value is sequence and args_value is not string) else args_value | string %}
{{- args_value }}
{{- '\n</parameter>\n' }}A similar fix could be configured globally on the server or proxy, to avoid needing to patch every model.
10. Enable speculative prompt caching
With Claude Code tamed, performance is now nearly acceptable.
“Nearly,” because the ability to break prompt caching is not limited to the unruly clients. Models themselves can also create tricky caching dilemmas.
For example, Qwen 3.5 uses dynamic <think> tag placement. Because these are automatically added and removed from history, they always prevent the previous turn’s assistant response from matching:
Turn 1
system: You are Arthur, King of the Britons.
user: What is your quest?Turn 2
system: You are Arthur, King of the Britons.
user: What is your quest?
assistant: <think>Ah, the classic Bridge of Death scene.</think>To seek the Holy Grail.
user: What is the airspeed velocity of an unladen swallow?Turn 3
system: You are Arthur, King of the Britons.
user: What is your quest?
assistant: To seek the Holy Grail.
user: What is the airspeed velocity of an unladen swallow?
assistant: <think>The answer is 11 meters per second, but I should stay in character.</think>What do you mean? An African or European swallow?
user: Huh? I don’t know...
Notice how the first assistant message is processed (bolded) twice? That’s because the <think> tag shifted in turn 3.
Redoing the previous message is better than restarting from the beginning, but still a major drag over the course of a multi-turn chat. Unlike fixing tool order though, this issue is inherent to the model’s chat template logic. Changing it would mean deviating from the model’s chat training — not a good idea.
I spent a few days diving into the mlx-lm implementation and eventually came up with a workaround — speculative prompt processing.
As soon as the model responds, calculate how the thinking tags would shift next turn, and immediately start processing up to the point of the new user message.
The extra work still happens, but it gets a head start. So while I’m reading the model’s previous message and drafting up my own response, the server makes use of those seconds to reduce future wait time.
This feature isn’t available in the official mlx-lm release yet, but feel free to comment on the PR or give my fork a try:
uvx --from git+https://github.com/spicyneuron/mlx-lm.git@cache-warmup \
mlx_lm.server \
--prompt-cache-warmup \
--model ...To be continued?
Here, deep in the weeds of experimental fixes and unmerged PRs, our guided tour draws to a close. Beyond this lies the true frontier.
By this point, you have a blueprint for setting up a powerful local AI server. Fully-private, no rate limits, no vendor lock-in.
Only one question remains — is it worth it?
For the price of a Mac Studio, I could have bought 1 billion uncached Opus tokens... or 4 years of the $200/month Max plan. And that’s not counting all the hours I spent chasing the whale through git diffs, network logs, and benchmark results.
Yet whenever I see the silver box humming away on my shelf, I feel genuinely proud. Via Claude Code, it aces my merge conflicts. Through my phone’s camera, it processes my bills. In my notes app, it proofreads my drafts.
By 2026 standards, these are mundane feats. But compared to yesterday’s frontier? The models running on my Mac Studio today are at least as capable as Sonnet 3.7... which itself was state-of-the-art just over one year ago.
Local AI definitely isn’t a drop-in replacement for cloud APIs. If you expect everything to work on the first try, you’re bound to be disappointed. But if reading this post intrigued rather than exhausted you — if part of you finds joy in the hunt for solutions — then I hope this helps justify your next hardware splurge. :)
Besides, at the current rate of new model releases, architecture innovations, software and hardware improvements... this expensive hobby might yet pay off.
At the time of publishing, Apple is no longer selling the 512GB option, but this is likely temporary as they build inventory for the upcoming M5 Ultra.
This setup is fully capable of handling multiple simultaneous requests, but if the combined workload is enough to saturate compute or memory bandwidth, performance will suffer. When using larger models, the Mac Studio is effectively limited to one concurrent user.
Optimal settings are model-specific, so it’s easier to configure this on the server instead of each downstream client. Incorrect parameters are a common reason why people report vastly different experiences with local models, so always check the official model card and config.json.
Many alternatives exist, catering to more or less advanced setups. I actually run llama-swap behind OpenResty, using Lua code to perform API transformations, among other things. But that’s a story for another day.
At the time of writing, I haven’t found any of the “1 million token context window” claims to be worth using. Even the best models start to degrade somewhere after the first ~50k tokens, and context poisoning is still a very real phenomenon.






Thanks for sharing this.
My main concern is the hardware lifecycle - how quickly it becomes outdated, depreciates, or unserviceable. That’s currently holding me back from committing to the investment. Curious how you view that risk.
What a great write up. I debate an M5 128 a lot lately. I’ve been renting on Vast.