
How I Built a RAG System That Actually Works

March 26, 2026 · 6 min read

Everyone talks about RAG like it's magic. Plug in your documents, ask questions, get answers.

It's not magic. It's a war against bad data, slow GPUs, and file formats that shouldn't exist.

One developer just shared their entire journey building a production RAG system — 1TB of company documents, local LLMs, zero cloud API calls. Here's what actually happened.

The Setup (Sounds Simple)

Company needs an internal chat tool. Engineers ask questions in plain English, get answers backed by source documents. Decade of projects. Technical reports, simulations, CSVs, regulations.

The stack decision seemed straightforward: LlamaIndex for orchestration, Ollama for local LLMs and embeddings, everything on-premises with zero cloud API calls.

First tests? Worked great with sample data. "I thought it would be a project of a few weeks. I couldn't have been more wrong."

Problem 1: Your Files Are Chaos

1TB of "organized" documents. Videos mixed with PDFs. Simulations next to reports. Backup files everywhere.

The result: LlamaIndex tried to load a multi-gigabyte video file into RAM. The laptop died. Not crashed. Died.

Fix: aggressive filtering. Cut 54% of files by extension alone. Videos, images, executables, compressed archives, simulation files, temp files — all gone.

Lesson: Before you touch any RAG framework, audit your data. Ruthlessly.
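That audit step can be sketched in a few lines. The extension blocklist below is illustrative, covering the categories the post mentions, not the author's actual list:

```python
from pathlib import Path

# Illustrative blocklist mirroring the categories the post cuts:
# videos, images, executables, archives, temp/backup files.
SKIP_EXTS = {
    ".mp4", ".avi", ".mov", ".mkv",   # videos
    ".png", ".jpg", ".jpeg", ".gif",  # images
    ".exe", ".dll", ".bin",           # executables
    ".zip", ".tar", ".gz", ".7z",     # compressed archives
    ".tmp", ".bak", ".old",           # temp and backup files
}

def keep(path: Path) -> bool:
    """True if the file is worth sending to the indexer."""
    return path.suffix.lower() not in SKIP_EXTS

def collect_files(root: Path) -> list[Path]:
    """Walk the corpus and return only indexable files."""
    return [p for p in root.rglob("*") if p.is_file() and keep(p)]
```

Run it once, look at what survives, and tune the blocklist before any framework touches the data.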

Problem 2: Indexing 451GB Without Dying

LlamaIndex's default storage? JSON files. Works for demos. Falls apart at scale.

Every restart = reprocess everything. Days of work lost to a single error. Data corruption. Slow searches.

The move to ChromaDB changed everything: persistent storage that survived restarts, no corruption, and fast searches.

Final numbers: 738,470 vectors. 54GB index. Zero corruption. Multiple indexing sessions over weeks.

Problem 3: GPUs Aren't Free

Integrated laptop GPU = 500MB of documents in 4-5 hours. At that rate, indexing everything would take months.

Rented an NVIDIA RTX 4000 SFF Ada (20GB VRAM) from Hetzner. Cost: €184 for 2-3 weeks of indexing.

Worth it? Absolutely. But budget for it if you're building something real.

Problem 4: The Architecture That Worked

After all the failures, here's the production setup:

User → Streamlit (UI) → Flask API → Python Backend → Ollama (LLM + Embeddings) → LlamaIndex → ChromaDB

Original documents? Stayed in Azure Blob Storage. The system generates download links with SAS tokens on-demand. Your server doesn't need 500GB of disk space.

The Real Takeaways

  1. Filter your data first. Not everything needs to be indexed.
  2. Use a real vector database. JSON won't cut it past a few thousand documents.
  3. GPU costs are real. Budget €200+ for indexing large corpora.
  4. Checkpoint everything. Your indexing WILL crash. Plan for it.
  5. Separate index from source. Store vectors locally, documents in the cloud.
  6. Start small, then scale. Test with 100 files before you try 100,000.
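Takeaway 4 deserves a sketch. A crash-safe indexing loop only needs a manifest of finished files, flushed after every item (the manifest filename and `index_one` callback here are hypothetical, not from the post):

```python
import json
from pathlib import Path

CHECKPOINT = Path("indexed_files.json")  # hypothetical manifest name

def load_done() -> set[str]:
    """Read the set of files already indexed, if any."""
    if CHECKPOINT.exists():
        return set(json.loads(CHECKPOINT.read_text()))
    return set()

def mark_done(done: set[str], name: str) -> None:
    """Record one finished file, flushing to disk immediately."""
    done.add(name)
    CHECKPOINT.write_text(json.dumps(sorted(done)))

def index_all(names: list[str], index_one) -> None:
    """Index every file not already in the manifest."""
    done = load_done()
    for name in names:
        if name in done:
            continue  # finished before the last crash; skip it
        index_one(name)
        mark_done(done, name)
```

After a crash, rerunning `index_all` picks up exactly where the last flush left off, which is what turns "days of work lost to a single error" into minutes.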

RAG isn't hard because of the AI part. It's hard because of the data engineering part. The models are commodity now. Your documents are the chaos.

If you're thinking about building a RAG system for your startup or side project — respect the data. Everything else follows.

Source: From zero to a RAG system: successes and failures by Andros