# Your 32GB Mac Can Now Run a 40GB AI Model
There's a new open-source tool that does something that sounds impossible. It runs AI models bigger than your Mac's physical memory. No crashing. No OOM killer. Just works.
It's called Hypura, and if you've been hitting memory limits running local LLMs on Apple Silicon, you need to see this.
## The Problem Nobody Talks About
Here's the reality: most of us can't afford a Mac Studio with 192GB of RAM. We're running 32GB MacBook Pros, maybe a Mac Mini. And we want to run models like Llama 70B, which still weighs in around 40GB even at 4-bit quantization.
Try loading that in vanilla llama.cpp and watch what happens. Your Mac starts swapping like crazy. The fan kicks in. Then boom: the OOM killer takes the process down, and you're back to using APIs.
Until now, the only options were: pay for cloud inference, or buy more RAM. Neither is great for indie hackers watching costs.
## What Hypura Actually Does
Hypura is a storage-tier-aware LLM inference scheduler. That's a mouthful, but here's what it means in practice:
It reads your GGUF model file, profiles your hardware (GPU working set, RAM, NVMe bandwidth), and assigns each tensor to the optimal storage tier:
- GPU (Metal) — Attention layers, norms, embeddings. The stuff accessed every token.
- RAM — Overflow layers that don't fit in GPU working set.
- NVMe (SSD) — Remaining layers loaded on-demand via direct I/O, prefetched ahead of the forward pass.
Key insight: not all model weights need to be in RAM at the same time. MoE models only use 2 of 8 experts per token. Dense models stream FFN weights while keeping attention on GPU. Hypura exploits this.
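The scheduling idea above can be sketched in a few lines of Rust. This is a hypothetical illustration of greedy tier assignment, not Hypura's actual code; the tensor names, budgets, and `hot` flag are made up for the example:

```rust
// Hypothetical sketch of greedy storage-tier assignment.
// Tensors accessed every token ("hot": attention, norms, embeddings)
// go to GPU first; overflow spills to RAM, and whatever remains
// is left on NVMe for on-demand loads.

#[derive(Debug, PartialEq)]
enum Tier { Gpu, Ram, Nvme }

struct Tensor { name: &'static str, bytes: u64, hot: bool }

fn assign_tiers(tensors: &[Tensor], gpu_budget: u64, ram_budget: u64) -> Vec<(&'static str, Tier)> {
    let (mut gpu_used, mut ram_used) = (0u64, 0u64);
    // Place hot tensors before cold ones (false sorts before true).
    let mut order: Vec<&Tensor> = tensors.iter().collect();
    order.sort_by_key(|t| !t.hot);
    let mut out = Vec::new();
    for t in order {
        let tier = if t.hot && gpu_used + t.bytes <= gpu_budget {
            gpu_used += t.bytes;
            Tier::Gpu
        } else if ram_used + t.bytes <= ram_budget {
            ram_used += t.bytes;
            Tier::Ram
        } else {
            Tier::Nvme
        };
        out.push((t.name, tier));
    }
    out
}

fn main() {
    let tensors = [
        Tensor { name: "attn.0", bytes: 4, hot: true },
        Tensor { name: "ffn.0", bytes: 8, hot: false },
        Tensor { name: "ffn.1", bytes: 8, hot: false },
    ];
    // GPU budget holds the hot tensor; RAM holds one FFN block;
    // the second FFN block falls through to NVMe.
    for (name, tier) in assign_tiers(&tensors, 4, 8) {
        println!("{name} -> {tier:?}");
    }
}
```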
## The Numbers That Matter
All benchmarks are from the developer's tests on an M1 Max with 32GB of unified memory:
| Model | Size | Hypura | llama.cpp |
|---|---|---|---|
| Qwen 2.5 14B | 8.4 GB | 21 tok/s | ~21 tok/s |
| Mixtral 8x7B | 30.9 GB | 2.2 tok/s | OOM 💀 |
| Llama 70B | 39.6 GB | 0.3 tok/s | OOM 💀 |
Read that Mixtral line again. 30.9GB model on a 32GB machine. Running at 2.2 tokens per second. llama.cpp crashes. Hypura makes it work.
Is 0.3 tok/s fast? No. But it's the difference between "usable for experimentation" and "completely broken." You can run queries, test prompts, evaluate outputs. That's worth something.
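A quick back-of-envelope explains the slowdown: once weights have to stream from disk, throughput is capped by NVMe bandwidth divided by bytes streamed per token, before any compute cost. The numbers below are illustrative, not measured:

```rust
// Back-of-envelope bound on streaming-mode speed: each token must
// pull its streamed weights over NVMe, so token throughput can't
// exceed bandwidth / bytes-per-token. Inputs here are hypothetical.

fn max_tokens_per_sec(nvme_gb_per_sec: f64, streamed_gb_per_token: f64) -> f64 {
    nvme_gb_per_sec / streamed_gb_per_token
}

fn main() {
    // Example: a 5 GB/s SSD streaming 2 GB of weights per token
    // tops out at 2.5 tok/s, no matter how fast the GPU is.
    println!("{:.1} tok/s", max_tokens_per_sec(5.0, 2.0));
}
```

This is why caching and prefetching matter so much: every byte that stays resident is a byte that doesn't count against this bound.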
## Three Modes of Operation
Hypura automatically picks the right mode based on your model size and hardware:
### 1. Full-Resident
Model fits in GPU+RAM. No NVMe I/O. Zero overhead. Runs at full Metal speed. This is baseline llama.cpp territory.
### 2. Expert-Streaming (MoE models)
For Mixtral and similar. Only non-expert tensors (~1GB) stay on GPU. Expert tensors stream from SSD with a neuron cache hitting 99.5% after warmup.
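The warm-cache effect is easy to demonstrate with a toy LRU cache. This is an illustration of the concept only, not Hypura's actual cache design:

```rust
use std::collections::{HashMap, VecDeque};

// Toy LRU cache for expert/neuron weight blocks, tracking hit rate.
// Once the working set is warm, repeated lookups hit the cache
// instead of the SSD. Illustrative sketch only.
struct BlockCache {
    capacity: usize,
    blocks: HashMap<u32, Vec<u8>>,
    order: VecDeque<u32>, // least recently used at the front
    hits: u64,
    lookups: u64,
}

impl BlockCache {
    fn new(capacity: usize) -> Self {
        Self { capacity, blocks: HashMap::new(), order: VecDeque::new(), hits: 0, lookups: 0 }
    }

    /// Fetch a block, calling `load` on a miss and evicting the LRU entry.
    fn get(&mut self, id: u32, load: impl Fn(u32) -> Vec<u8>) -> &Vec<u8> {
        self.lookups += 1;
        if self.blocks.contains_key(&id) {
            self.hits += 1;
            // Move to the back: most recently used.
            self.order.retain(|&x| x != id);
            self.order.push_back(id);
        } else {
            if self.blocks.len() == self.capacity {
                if let Some(evict) = self.order.pop_front() {
                    self.blocks.remove(&evict);
                }
            }
            self.blocks.insert(id, load(id));
            self.order.push_back(id);
        }
        &self.blocks[&id]
    }

    fn hit_rate(&self) -> f64 {
        self.hits as f64 / self.lookups as f64
    }
}

fn main() {
    let mut cache = BlockCache::new(2);
    let load = |id: u32| vec![id as u8; 4]; // stand-in for an SSD read
    for id in [0, 1, 0, 0, 1] {
        cache.get(id, load);
    }
    // Two cold misses, then every lookup hits the warm cache.
    println!("hit rate: {:.2}", cache.hit_rate());
}
```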
### 3. Dense FFN-Streaming
For big dense models like Llama 70B. Attention + norms stay on GPU (~8GB). FFN weights (~32GB) stream from SSD with dynamic prefetch.
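The prefetch idea can be sketched with a background reader thread that stays a layer ahead of compute, so the SSD read and the GPU work overlap instead of serializing. Disk reads are simulated with sleeps here; this is a conceptual sketch, not Hypura's pipeline:

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

// Sketch of prefetch overlap: a background thread streams the next
// layer's FFN weights while the main thread computes the current one.
// Returns the order in which layers were computed.
fn stream_layers(n: usize) -> Vec<usize> {
    let (tx, rx) = mpsc::channel();

    // Prefetcher thread: streams layer weights in order, ahead of compute.
    let reader = thread::spawn(move || {
        for layer in 0..n {
            thread::sleep(Duration::from_millis(5)); // simulated NVMe read
            tx.send((layer, vec![0u8; 16])).unwrap(); // the layer's FFN weights
        }
    });

    // Compute loop: blocks only when the prefetcher falls behind.
    let mut computed = Vec::new();
    for _ in 0..n {
        let (layer, _weights) = rx.recv().unwrap();
        // ... run this layer's FFN matmul with `_weights` here ...
        computed.push(layer);
    }
    reader.join().unwrap();
    computed
}

fn main() {
    println!("computed layers in order: {:?}", stream_layers(4));
}
```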
## Why This Matters for Indie Hackers
If you're building AI-powered products and trying to keep costs down, here's what this changes:
1. Experiment with bigger models for free. Stop paying for API access to test whether Llama 70B gives you better results than 13B. Run it locally. It's slow, but it works.
2. Privacy-first prototyping. Building something with sensitive data? Run the best open models locally without sending anything to the cloud.
3. Hardware flexibility. You don't need to upgrade your Mac just to try bigger models. Make the hardware you have work harder.
## How to Try It
Hypura builds from source with Rust. You need Rust 1.75+ and CMake:
```bash
git clone --recurse-submodules https://github.com/t8/hypura.git
cd hypura
cargo build --release
```
Then run it like so:

```bash
./target/release/hypura infer --model llama-70b-q4.gguf "Explain quantum computing"
```
It profiles your hardware automatically. No manual config needed.
Hypura is open source. Star it on GitHub, try it out, and let the developer know how it goes; they're actively looking for feedback.
The era of "you need 128GB of RAM to run good AI models" is ending. Tools like Hypura prove that smart software can make consumer hardware punch way above its weight.
## Ship Faster with AI
Running local models is just the start. The OpenClaw Ultimate Setup shows you how to build AI agents that work while you sleep.