Ollama MLX: Run Local AI 3x Faster on Apple Silicon
Ollama just shipped version 0.19, and it's a big deal for anyone running local LLMs on a Mac. The new release swaps in Apple's MLX framework as the backend on Apple Silicon — and the numbers are wild.
On an M5 Max, Ollama can now prefill at 1,851 tokens per second and decode at 134 tokens per second with the Qwen3.5-35B-A3B model. That's not cloud inference. That's your laptop.
What Changed Under the Hood
Previous versions of Ollama relied on GGML (via llama.cpp) for inference. That worked well, but Apple's MLX framework takes better advantage of the unified memory architecture that makes Apple Silicon special. Instead of copying data between CPU and GPU memory, MLX keeps everything in one shared pool.
The result is dramatically lower latency and higher throughput — especially on M5 chips, which add dedicated GPU Neural Accelerators that MLX can tap into directly.
Here's what Ollama 0.19 brings to the table:
- MLX backend on macOS — Apple's native ML framework handles inference, optimized for unified memory
- NVFP4 quantization support — The same 4-bit format NVIDIA uses in production, so your local model is quantized the same way as many cloud deployments and behaves much closer to its hosted counterpart
- Smarter KV cache — Reuses cache across conversations, stores snapshots at intelligent checkpoints, and keeps shared prefixes alive longer
- Coding-optimized Qwen3.5 model — Tuned specifically for agentic and coding tasks with tools like Claude Code
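To see why the smarter KV cache matters, here's a toy sketch of prefix-keyed cache reuse. This is an illustration of the idea only, not Ollama's actual implementation: conversations that share a prefix (say, the same system prompt) hit the same cached snapshot, so the shared tokens are only ever processed once.

```python
import hashlib

class PrefixCache:
    """Toy prefix-keyed KV-cache: illustrative, not Ollama's internals."""

    def __init__(self):
        self._snapshots = {}  # prefix hash -> precomputed state

    def _key(self, tokens: list) -> str:
        return hashlib.sha256("\x1f".join(tokens).encode()).hexdigest()

    def store(self, tokens: list, state) -> None:
        # Snapshot the state after processing this exact token prefix.
        self._snapshots[self._key(tokens)] = state

    def lookup(self, tokens: list):
        """Return (cached_state, remaining_tokens) for the longest cached prefix."""
        for end in range(len(tokens), 0, -1):
            state = self._snapshots.get(self._key(tokens[:end]))
            if state is not None:
                return state, tokens[end:]
        return None, tokens

cache = PrefixCache()
cache.store(["SYSTEM:", "You are a coding agent."], state="kv-snapshot")

# A new turn that reuses the system prompt only has to process the new tokens:
state, todo = cache.lookup(["SYSTEM:", "You are a coding agent.", "USER:", "fix the bug"])
print(state, todo)  # kv-snapshot ['USER:', 'fix the bug']
```

The real cache stores attention keys and values rather than strings, but the payoff is the same: a branching conversation or a repeated tool call skips straight past the shared prefix.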
Why This Matters for Builders
If you're running Claude Code, OpenCode, Codex, or any coding agent that uses Ollama as its local backend, this update makes everything snappier. The improved cache means repeated tool calls and branching conversations don't waste time reprocessing the same system prompt.
For indie hackers and solo builders, this is about cost and privacy:
- No API costs — Run capable models like Qwen3.5-35B locally for $0 per token
- No data leaving your machine — Your code, prompts, and context stay on your Mac
- Production-quality quantization — NVFP4 means your local model behaves much like its cloud-hosted counterparts
- Works offline — Code on a plane, in a café with bad Wi-Fi, wherever
Apple even highlighted Ollama alongside OpenClaw in the announcement. The local AI stack is maturing fast — your Mac is becoming a serious inference machine, not just a dev terminal.
The Performance Numbers
Ollama tested with Qwen3.5-35B-A3B quantized to NVFP4 on Apple's M5 lineup. The improvements over the previous GGML-based backend are significant:
- M5: Noticeable speedup in both prefill and decode
- M5 Pro: Even bigger gains with more GPU cores
- M5 Max: 1,851 tok/s prefill, 134 tok/s decode — competitive with small cloud instances
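To put those rates in perspective, here's some back-of-the-envelope arithmetic using the M5 Max figures above (a Python sketch; real latency also includes model load and sampling overhead):

```python
PREFILL_TOK_S = 1851.0  # M5 Max prefill rate from Ollama's benchmark
DECODE_TOK_S = 134.0    # M5 Max decode rate

def response_time(prompt_tokens: int, output_tokens: int) -> float:
    """Rough end-to-end latency: prompt prefill plus token-by-token decode."""
    return prompt_tokens / PREFILL_TOK_S + output_tokens / DECODE_TOK_S

# An 8,000-token prompt (a healthy chunk of codebase context) with a 500-token answer:
print(f"{response_time(8000, 500):.1f} s")  # roughly 8.1 s
```

In other words, even a large agentic prompt turns around in seconds, which is why the prefill number matters as much as the decode number for coding workflows.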
You'll need at least 32GB of unified memory to run the 35B model comfortably. If you've got an M5 Max with 64GB or more, you're in an excellent spot.
Getting Started
Download Ollama 0.19 and pull the coding-optimized model:
ollama run qwen3.5:35b-a3b-coding-nvfp4
To use it with Claude Code:
ollama launch claude --model qwen3.5:35b-a3b-coding-nvfp4
Or with OpenClaw:
ollama launch openclaw --model qwen3.5:35b-a3b-coding-nvfp4
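Ollama also exposes a local HTTP API on port 11434, so you can script the model directly. A minimal sketch using only the standard library, assuming the server is running and the model above has been pulled:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint
MODEL = "qwen3.5:35b-a3b-coding-nvfp4"              # the tag pulled above

def build_payload(prompt: str, model: str = MODEL) -> dict:
    # Non-streaming request body for Ollama's /api/generate endpoint.
    return {"model": model, "prompt": prompt, "stream": False}

def generate(prompt: str) -> str:
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Example call (requires a running Ollama server):
# print(generate("Write a Python one-liner that reverses a string."))
```

The same endpoint works for any pulled model; set `stream` to `True` to receive tokens as they're generated instead of waiting for the full response.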
The Bigger Picture
Local AI is no longer a toy. Between Ollama's MLX backend, Apple's Neural Accelerators on M5, and open-weight models like Qwen3.5 getting better every month, the gap between local and cloud inference is closing fast.
For builders who care about privacy, cost control, and offline capability, this is the update you've been waiting for. Your Mac just became a much more capable AI workstation — and it didn't cost you a cent.
Running Local AI on Your Mac?
I write about AI tools, automation, and building smarter every week. Check out more posts on local models, coding agents, and the builder's toolkit.