Ollama MLX: Run Local AI 3x Faster on Apple Silicon
Ollama just shipped version 0.19, and it's a big deal for anyone running local LLMs on a Mac. The new release swaps in Apple's MLX framework as the backend on Apple Silicon — and the numbers are wild.
On an M5 Max, Ollama can now prefill at 1,851 tokens per second and decode at 134 tokens per second with the Qwen3.5-35B-A3B model. That's not cloud inference. That's your laptop.
What Changed Under the Hood
Previous versions of Ollama relied on GGML (via llama.cpp) for inference. That worked well, but Apple's MLX framework takes better advantage of the unified memory architecture that makes Apple Silicon special. Instead of copying data between CPU and GPU memory, MLX keeps everything in one shared pool.
The result is dramatically lower latency and higher throughput — especially on M5 chips, which add dedicated GPU Neural Accelerators that MLX can tap into directly.
Here's what Ollama 0.19 brings to the table:
- MLX backend on macOS — Apple's native ML framework handles inference, optimized for unified memory
- NVFP4 quantization support — The same 4-bit format NVIDIA uses in production, so your local model is quantized the same way as many cloud deployments and behaves much closer to its hosted counterpart
- Smarter KV cache — Reuses cache across conversations, stores snapshots at intelligent checkpoints, and keeps shared prefixes alive longer
- Coding-optimized Qwen3.5 model — Tuned specifically for agentic and coding tasks with tools like Claude Code
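To see why the smarter KV cache matters, here's a toy sketch of prefix-keyed cache reuse. This is an illustration of the idea only, not Ollama's actual implementation: conversations that share a prefix (say, the same system prompt) hit the same cached snapshot, so the shared tokens are only ever processed once.

```python
import hashlib

class PrefixCache:
    """Toy prefix-keyed KV-cache: illustrative, not Ollama's internals."""

    def __init__(self):
        self._snapshots = {}  # prefix hash -> precomputed state

    def _key(self, tokens: list) -> str:
        return hashlib.sha256("\x1f".join(tokens).encode()).hexdigest()

    def store(self, tokens: list, state) -> None:
        # Snapshot the state after processing this exact token prefix.
        self._snapshots[self._key(tokens)] = state

    def lookup(self, tokens: list):
        """Return (cached_state, remaining_tokens) for the longest cached prefix."""
        for end in range(len(tokens), 0, -1):
            state = self._snapshots.get(self._key(tokens[:end]))
            if state is not None:
                return state, tokens[end:]
        return None, tokens

cache = PrefixCache()
cache.store(["SYSTEM:", "You are a coding agent."], state="kv-snapshot")

# A new turn that reuses the system prompt only has to process the new tokens:
state, todo = cache.lookup(["SYSTEM:", "You are a coding agent.", "USER:", "fix the bug"])
print(state, todo)  # kv-snapshot ['USER:', 'fix the bug']
```

The real cache stores attention keys and values rather than strings, but the payoff is the same: a branching conversation or a repeated tool call skips straight past the shared prefix.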
Why This Matters for Builders
If you're running Claude Code, OpenCode, Codex, or any coding agent that uses Ollama as its local backend, this update makes everything snappier. The improved cache means repeated tool calls and branching conversations don't waste time reprocessing the same system prompt.
For indie hackers and solo builders, this is about cost and privacy:
- No API costs — Run capable models like Qwen3.5-35B locally for $0 per token
- No data leaving your machine — Your code, prompts, and context stay on your Mac
- Production-quality quantization — NVFP4 means your local model behaves much like its cloud-hosted counterparts
- Works offline — Code on a plane, in a café with bad Wi-Fi, wherever
Apple even highlighted Ollama alongside OpenClaw in the announcement. The local AI stack is maturing fast — your Mac is becoming a serious inference machine, not just a dev terminal.
The Performance Numbers
Ollama tested with Qwen3.5-35B-A3B quantized to NVFP4 on Apple's M5 lineup. The improvements over the previous GGML-based backend are significant:
- M5: Noticeable speedup in both prefill and decode
- M5 Pro: Even bigger gains with more GPU cores
- M5 Max: 1,851 tok/s prefill, 134 tok/s decode — competitive with small cloud instances
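To put those rates in perspective, here's some back-of-the-envelope arithmetic using the M5 Max figures above (a Python sketch; real latency also includes model load and sampling overhead):

```python
PREFILL_TOK_S = 1851.0  # M5 Max prefill rate from Ollama's benchmark
DECODE_TOK_S = 134.0    # M5 Max decode rate

def response_time(prompt_tokens: int, output_tokens: int) -> float:
    """Rough end-to-end latency: prompt prefill plus token-by-token decode."""
    return prompt_tokens / PREFILL_TOK_S + output_tokens / DECODE_TOK_S

# An 8,000-token prompt (a healthy chunk of codebase context) with a 500-token answer:
print(f"{response_time(8000, 500):.1f} s")  # roughly 8.1 s
```

In other words, even a large agentic prompt turns around in seconds, which is why the prefill number matters as much as the decode number for coding workflows.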
You'll need at least 32GB of unified memory to run the 35B model comfortably. If you've got an M5 Max with 64GB or more, you're in an excellent spot.
Getting Started
Download Ollama 0.19 and pull the coding-optimized model:
ollama run qwen3.5:35b-a3b-coding-nvfp4
To use it with Claude Code:
ollama launch claude --model qwen3.5:35b-a3b-coding-nvfp4
Or with OpenClaw:
ollama launch openclaw --model qwen3.5:35b-a3b-coding-nvfp4
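Ollama also exposes a local HTTP API on port 11434, so you can script the model directly. A minimal sketch using only the standard library, assuming the server is running and the model above has been pulled:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint
MODEL = "qwen3.5:35b-a3b-coding-nvfp4"              # the tag pulled above

def build_payload(prompt: str, model: str = MODEL) -> dict:
    # Non-streaming request body for Ollama's /api/generate endpoint.
    return {"model": model, "prompt": prompt, "stream": False}

def generate(prompt: str) -> str:
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Example call (requires a running Ollama server):
# print(generate("Write a Python one-liner that reverses a string."))
```

The same endpoint works for any pulled model; set `stream` to `True` to receive tokens as they're generated instead of waiting for the full response.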
The Bigger Picture
Local AI is no longer a toy. Between Ollama's MLX backend, Apple's Neural Accelerators on M5, and open-weight models like Qwen3.5 getting better every month, the gap between local and cloud inference is closing fast.
For builders who care about privacy, cost control, and offline capability, this is the update you've been waiting for. Your Mac just became a much more capable AI workstation — and it didn't cost you a cent.
Running Local AI on Your Mac?
I write about AI tools, automation, and building smarter every week. Check out more posts on local models, coding agents, and the builder's toolkit.