
RAG (Retrieval-Augmented Generation) from Scratch

Published on November 24, 2025

Large language models (LLMs) are powerful, but they have a clear limit: they don’t know your documentation, your knowledge base, or your internal data. If you ask them something that wasn’t in what they “saw” during training, they can hallucinate: invent answers that sound plausible but are false. RAG (Retrieval-Augmented Generation) is a technique that reduces that problem: instead of relying only on what the model “knows”, you retrieve relevant fragments from your own documents and inject them into the LLM’s context. The AI then answers based on your data, not just its memory.

In this article I explain what RAG is, why it helps against hallucinations, and how to build it from scratch with your own documents.

The problem: hallucinations and closed knowledge

An LLM generates text from learned patterns. It has no “memory” of your PDFs, your wiki, or your database. If you ask about an internal procedure or a product that only exists in your company, it can:

  • Invent details that don’t exist.
  • Mix training data with “guesses”.
  • Give coherent but wrong answers.

Also, the model’s knowledge has a cutoff date: it doesn’t know what happened after that or what only exists in your systems. To use AI for support, internal docs, or product assistants, you need to anchor answers to your sources.

What RAG is

RAG combines two things:

  1. Retrieval: Given a user question, you search your document collection for the most relevant fragments (e.g. by semantic similarity using embeddings).
  2. Augmented Generation: Those fragments are passed to the LLM as context along with the question. The model generates the answer from that context, not just from what it “knows” by heart.

Typical flow:

  1. User asks a question.
  2. The question is turned into a vector (embedding) with the same model you used to index.
  3. You search a vector store (or similarity index) for fragments whose embeddings are closest to the question.
  4. You build a prompt that includes: instructions + retrieved fragments + question.
  5. The LLM generates the answer using that prompt.
  6. Optionally you cite sources (which fragments you used) so the user can verify.

So the AI “reads” your documents at query time and answers from what it finds, which reduces hallucinations when the right answer is in your data.
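
The six-step flow can be sketched as a single function. Here `embed`, `search`, and `call_llm` are placeholder names for whichever embedding model, vector store, and LLM you end up choosing; this is a sketch of the wiring, not a specific implementation:

```python
def answer(question, embed, search, call_llm, k=3):
    """Minimal RAG flow: embed the question, retrieve, build a prompt, generate."""
    q_vec = embed(question)                       # step 2: question -> embedding
    chunks = search(q_vec, k=k)                   # step 3: k nearest fragments
    context = "\n\n".join(c["text"] for c in chunks)
    prompt = (                                    # step 4: instructions + context + question
        "Answer only using the following context. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    answer_text = call_llm(prompt)                # step 5: generation
    sources = [c.get("source") for c in chunks]   # step 6: optional citations
    return answer_text, sources
```

In a real service, `embed` would call your embedding model, `search` would query your vector store, and `call_llm` would hit your LLM provider; the rest of the article fills in those pieces.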

Why it reduces hallucinations

  • Explicit context: The model doesn’t have to “remember” your docs; you hand them to it with every question. If the answer is in the context, it tends to rely on it.
  • Verifiable sources: You can show which paragraph or document the information came from (citations), so you can audit and correct.
  • Updates without retraining: You change documents, reindex, and the system uses the new version. You’re not stuck with the model’s cutoff date.

RAG doesn’t eliminate hallucinations completely (the model can ignore context or mix it badly), but it greatly reduces them when the relevant content is retrieved and presented well in the prompt.

How to build RAG from scratch (concepts)

1. Documents and chunking

  • Sources: PDFs, Markdown, web pages, tickets, FAQs, etc.
  • Chunking: You split each document into manageable fragments (chunks), e.g. 500–1000 tokens or by paragraph. Too small and you lose context; too large and relevance gets diluted.
  • Metadata: Store title, source, date, etc., for filtering and to show “source: X” to the user.
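
A minimal paragraph-based chunker might look like this; `max_chars` and the metadata fields are illustrative choices, not fixed requirements:

```python
def chunk_text(text, max_chars=1000, source="unknown"):
    """Split a document into paragraph-based chunks of at most max_chars
    characters, keeping metadata alongside each chunk."""
    chunks, current = [], ""
    for para in text.split("\n\n"):
        para = para.strip()
        if not para:
            continue
        # If adding this paragraph would overflow the chunk, flush it first.
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append({"text": current, "source": source})
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append({"text": current, "source": source})
    return chunks
```

Splitting on paragraph boundaries keeps each chunk semantically coherent, which matters more for retrieval quality than hitting an exact size.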

2. Embeddings

  • Each chunk is passed through an embedding model (e.g. OpenAI, Cohere, or open-source like sentence-transformers) and you get a vector of fixed dimension.
  • Those vectors are stored in a vector store (Pinecone, Weaviate, Chroma, pgvector, etc.) or a similarity index (FAISS, Annoy).
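
To show the shape of the data without pulling in a real model, here is a toy hashed bag-of-words embedding. A real system would use sentence-transformers or an embeddings API instead, but the output is the same kind of object: a fixed-dimension, normalized vector stored next to its chunk:

```python
import hashlib
import math

def toy_embed(text, dim=64):
    """Toy fixed-dimension embedding: hash each word into one of `dim`
    buckets and L2-normalize. Stands in for a trained embedding model."""
    vec = [0.0] * dim
    for word in text.lower().split():
        bucket = int(hashlib.md5(word.encode()).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

# A "vector store" at its simplest: a list of (vector, chunk) pairs.
store = []
for chunk in [{"text": "reset your password in settings", "source": "faq.md"}]:
    store.append((toy_embed(chunk["text"]), chunk))
```

Dedicated vector stores (Pinecone, Chroma, pgvector, …) do the same thing at scale, with approximate nearest-neighbor indexes instead of a flat list.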

3. Retrieval

  • When a question arrives, you turn it into an embedding with the same model.
  • You search for the k most similar chunks (ranked by cosine similarity or dot product).
  • Optional: filters by metadata (e.g. only docs for one product, one year).
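
Retrieval itself is a few lines once vectors are unit-length, because the dot product of two normalized vectors is their cosine similarity. The `where` predicate below is one way to sketch the optional metadata filter:

```python
def top_k(query_vec, store, k=3, where=None):
    """Return the k chunks whose embeddings are most similar to the query.
    `store` is a list of (vector, chunk) pairs; with unit-length vectors
    the dot product equals the cosine similarity. `where` optionally
    filters chunks by metadata before ranking."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    candidates = [(v, c) for v, c in store if where is None or where(c)]
    candidates.sort(key=lambda pair: dot(query_vec, pair[0]), reverse=True)
    return [chunk for _, chunk in candidates[:k]]
```

A production store replaces the linear scan with an approximate index, but the interface (query vector in, top-k chunks out) stays the same.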

4. Prompt and generation

  • You build the prompt, e.g.: “Answer only using the following context. If the answer is not in the context, say you don’t know. Context: [chunks]. Question: [question].”
  • You call the LLM (GPT-4, Claude, Llama, etc.) with that prompt.
  • You return the answer and, if you want, the fragments used as “sources”.
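
Prompt assembly, using the instruction wording from above; labeling each fragment with its source is one possible way to make citations easy for the model:

```python
def build_prompt(question, chunks):
    """Assemble the RAG prompt: instructions + retrieved context + question.
    Each fragment is prefixed with its source so the model can cite it."""
    context = "\n\n".join(
        f"[{c.get('source', 'unknown')}] {c['text']}" for c in chunks
    )
    return (
        "Answer only using the following context. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
```

The resulting string is what you send to the LLM; the exact wording of the instructions is worth iterating on, since it is your main lever against the model answering from memory.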

5. Optional improvements

  • Re-ranking: A second model reorders retrieved chunks by actual relevance to the question.
  • HyDE: You generate a “hypothetical answer” to the question, embed it, and search with that; sometimes improves retrieval.
  • Multi-query: You generate several reformulations of the question and retrieve with all of them; merge results.
  • Chunk overlap: Chunks that overlap a bit so you don’t cut sentences in half.
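
Chunk overlap, for instance, is a small change to the chunker: fixed-size windows that share `overlap` characters with their neighbors, so a sentence cut at one boundary still appears whole in the adjacent chunk (the sizes here are arbitrary):

```python
def sliding_chunks(text, size=500, overlap=100):
    """Fixed-size character windows that overlap by `overlap` characters.
    Requires size > overlap; consecutive chunks share their edges."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]
```

The trade-off is some duplicated storage and retrieval of near-identical chunks, which re-ranking or deduplication can clean up later.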

Typical stack (example)

  • Documents: Markdown/PDF → parser + chunker (LangChain, LlamaIndex, or custom code).
  • Embeddings: OpenAI/Cohere API or local model (sentence-transformers).
  • Vector store: Chroma, Pinecone, Weaviate, pgvector (if you already use Postgres).
  • LLM: OpenAI/Anthropic API or local model (Llama, Mistral).
  • Orchestration: A service that receives the question, does retrieval, builds the prompt, and calls the LLM; optionally a framework (LangChain, LlamaIndex) to chain steps.

My personal perspective

RAG is the most practical way to connect an LLM to your knowledge without fine-tuning: you add documents, reindex, and the assistant can answer from your wiki, manuals, or data. Hallucinations drop when the answer is in the context; when it’s not, a good prompt (“say you don’t know”) keeps the model from inventing one.

Building it “from scratch” means deciding chunking, embeddings, vector store, and prompt; then you can iterate (re-ranking, HyDE, multiple sources). You don’t need the most complex stack to start: one document type, one embedding model, and a simple vector store already give you most of the value. From there, you improve as you need more precision, more sources, or better citation.