
RAG Zero-to-One (Tiny Corpus)

Learn Retrieval-Augmented Generation for small document sets using a clear librarian-and-writer model. Build a minimal RAG workflow from chunking to citations, connect an LLM to a retriever, format grounded prompts, and test answers in a hands-on mini lab.

September 4, 2025
8 min read
Promptise Team
Beginner
RAG · Prompt Engineering · Retrieval · Embeddings · Grounding · LLM Architecture

In this short guide you’ll build a mental model for Retrieval-Augmented Generation (RAG) and ship a tiny, working pattern: retrieve the right passages and answer with citations. You’ll see how RAG plugs into an LLM, what the workflow looks like end-to-end, and you’ll run a mini-lab comparing a naïve answer to a retrieved one while tracking faithfulness qualitatively.

RAG, defined: Retrieval-Augmented Generation means the model looks up relevant documents first, then writes an answer grounded in those documents. This reduces “hallucinations,” which are confident but unsupported statements.

Why it matters for a tiny corpus: When your knowledge base is small—say a few PDFs, wiki pages, or notes—RAG is cheap to set up and gives a big reliability boost. You don’t need a data platform to see real gains.


The mental model

Think of RAG as a two-step brain: a librarian and a writer. The librarian finds likely passages. The writer assembles an answer that quotes or cites those passages.

Core loop in words: User asks → system retrieves top-k passages from your corpus → system injects those passages into the LLM’s prompt → LLM answers and cites.

A compact example

Your corpus has three notes:

  • D1: “Pour-over coffee tastes best with water between 90–96 °C. Use ~15:1 water:coffee ratio.”

  • D2: “Boiling water (100 °C) can scorch light-roast flavors. Let the kettle rest 30–45 s after boil.”

  • D3: “Cold brew uses room-temp water; steep 12–18 h.”

Question: “What water temperature should I use for pour-over?”

A naïve LLM might say “Use boiling water.” A RAG-grounded answer should say “90–96 °C” and cite D1, possibly D2 for nuance.

💡 Insight: Retrieval does not “teach” the model; it reminds it. The LLM still writes, but the passages anchor what it writes.


Architecture at a glance

Below is the minimal, practical shape of a tiny-corpus RAG. Keep it small and explicit before you add bells and whistles.

text

[Documents] → [Chunker] → [Embedding Model] → [Vector Index]
                                                    ↑
[Query Embedding] ─────────────→ [Retriever] ───────┘
                                      │ top-k passages
                              [Prompt Composer]
                                      │
                               [LLM Answerer]
                                      │
                              Answer + Citations

Key parts in one sentence each: Chunker splits documents into bite-sized passages. Embedding model turns text into vectors (number lists) that capture meaning. Vector index stores those vectors for fast similarity search. Retriever finds the closest passages to the question. Prompt composer formats the question and passages for the LLM. LLM answerer writes the response and includes citations.

⚠️ Pitfall: Skipping the chunker and embedding whole documents usually hurts precision. The retriever returns entire files when you only needed a paragraph.


Workflow, step by step

Start with the end in mind: faithful answers with visible sources. Then wire the pipeline.

  1. Collect & prep documents. Keep a folder of plain text or lightly cleaned markdown. For a tiny corpus, consistency beats automation.

  2. Chunk. Split each file into passages of ~200–400 words with a small overlap (e.g., 40–60 words) to preserve context.

  3. Embed & index. Use a single embedding model across both documents and queries. Store vectors in a simple local index.

  4. Retrieve. For each question, embed it and pull the top 3–5 passages by cosine similarity.

  5. Compose the prompt. Insert the retrieved passages into a grounded template that forces citations.

  6. Answer or abstain. If nothing relevant is found, say so explicitly and invite a narrower question.
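The steps above can be sketched end-to-end in a few lines. This toy deliberately substitutes word-count vectors for a real embedding model (a simplification, not a recommendation), but the index-then-retrieve contract is the same one a production stack follows:

```python
import re
from collections import Counter
from math import sqrt

# Toy corpus: the three coffee notes, keyed by passage ID.
CORPUS = {
    "D1 p1": "Pour-over coffee tastes best with water between 90–96 °C. Use ~15:1 water:coffee ratio.",
    "D2 p1": "Boiling water (100 °C) can scorch light-roast flavors. Let the kettle rest 30–45 s after boil.",
    "D3 p1": "Cold brew uses room-temp water; steep 12–18 h.",
}

def embed(text):
    """Stand-in 'embedding': a bag-of-words count vector.
    A real setup would call a sentence-embedding model here."""
    return Counter(re.findall(r"[a-z0-9-]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# "Embed & index" step: embed every passage once, up front.
INDEX = {pid: embed(text) for pid, text in CORPUS.items()}

def retrieve(question, k=3):
    """Embed the query with the SAME model and return top-k passage IDs."""
    q = embed(question)
    ranked = sorted(INDEX, key=lambda pid: cosine(q, INDEX[pid]), reverse=True)
    return ranked[:k]
```

Note that the query and the documents go through the same `embed` function; mixing two embedding models is one of the classic ways retrieval silently degrades.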

💡 Insight: Retrieval quality matters more than model size here. A smaller LLM with great retrieval often beats a larger LLM answering from memory.


Walkthrough: naïve vs. retrieved answers

Tiny corpus (reuse from above):

  • D1 (p1): “Pour-over coffee tastes best with water between 90–96 °C. Use ~15:1 water:coffee ratio.”

  • D2 (p1): “Boiling water (100 °C) can scorch light-roast flavors. Let the kettle rest 30–45 s after boil.”

  • D3 (p1): “Cold brew uses room-temp water; steep 12–18 h.”

Question: “What water temperature should I use for pour-over?”

Naïve answer (no retrieval): “Use boiling water for best extraction.”

Retrieved passages (k=2): [D1 p1], [D2 p1]

RAG answer with citations: “Use 90–96 °C for pour-over. If you just boiled the kettle, let it rest 30–45 seconds to avoid hitting 100 °C, which can mute light-roast flavors. [D1 p1; D2 p1]”

Why it’s better: The number is anchored to the exact passage, and the answer includes a small operational step with a source.


Practical: copy-paste prompts and scaffolds

Below are minimal templates you can drop into your tool or orchestrator. They assume you already retrieved passages.

System prompt — RAG answerer

text

You are a careful assistant that answers ONLY using the provided context.
Rules:
- If the context is irrelevant or insufficient, say "I don't have enough context to answer."
- When you do answer, cite passage IDs like [D1 p1] inline at the end of sentences they support.
- Prefer direct quotes for numbers, definitions, and names.
- Be concise and avoid speculation.

User message template — with retrieved context

text

Question: {{QUESTION}}

Context passages (each has an ID):
{{PASSAGES}}

Write a short answer (2–5 sentences) grounded ONLY in the passages. Include citations like [D3 p2] next to the claims they support. If multiple passages support the same claim, cite both.

Passage formatting — one per chunk

text

[D1 p1] "Pour-over coffee tastes best with water between 90–96 °C. Use ~15:1 water:coffee ratio."
[D2 p1] "Boiling water (100 °C) can scorch light-roast flavors. Let the kettle rest 30–45 s after boil."
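Filling the user-message template with formatted passages is a short prompt-composer function. The names `format_passages` and `compose_prompt` below are illustrative, not from any particular library:

```python
def format_passages(passages):
    """Render retrieved passages one per line, each tagged with its ID."""
    return "\n".join(f'[{pid}] "{text}"' for pid, text in passages)

def compose_prompt(question, passages):
    """Assemble the grounded user message from a question and (ID, text) pairs."""
    return (
        f"Question: {question}\n\n"
        "Context passages (each has an ID):\n"
        f"{format_passages(passages)}\n\n"
        "Write a short answer (2–5 sentences) grounded ONLY in the passages. "
        "Include citations like [D1 p1] next to the claims they support. "
        "If multiple passages support the same claim, cite both."
    )

prompt = compose_prompt(
    "What water temperature should I use for pour-over?",
    [("D1 p1", "Pour-over coffee tastes best with water between 90–96 °C."),
     ("D2 p1", "Boiling water (100 °C) can scorch light-roast flavors.")],
)
```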

Optional structured output Use a light schema to encourage explicit citations and confidence.

json

{
  "answer": "Use 90–96 °C for pour-over... [D1 p1; D2 p1]",
  "citations": [
    {"id": "D1 p1", "quote": "Pour-over ... 90–96 °C."},
    {"id": "D2 p1", "quote": "Boiling water (100 °C) ... 30–45 s after boil."}
  ],
  "confidence": "medium"
}

💡 Insight: Asking for quotes nudges the model to copy exact phrases for critical claims. This improves faithfulness without heavy tooling.


Troubleshooting and trade-offs

Chunk size controls how specific your retrieval is. Chunks that are too large pull in fluff; chunks that are too small strip the context that explains a claim. Start at ~300 words with 10–20% overlap and adjust based on recall vs. precision in your results.
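A minimal word-based chunker with overlap can make the trade-off concrete. This is a sketch under the simplification that words are the unit; production chunkers often split on sentences or tokens instead:

```python
def chunk_words(text, size=300, overlap=50):
    """Split text into ~size-word passages, repeating `overlap` words
    at each boundary so a claim's surrounding context is not cut off."""
    words = text.split()
    step = size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already covered the tail
    return chunks

# Demo on a synthetic 700-word document.
chunks = chunk_words(" ".join(f"w{i}" for i in range(700)), size=300, overlap=50)
```

With `size=300` and `overlap=50`, each new chunk restarts 50 words before the previous one ended, which matches the 10–20% overlap suggested above.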

Top-k is your breadth knob. If answers feel shallow or miss edge cases, increase k from 3 to 5. If answers become rambly or contradictory, reduce k or add a simple reranker that scores “question-passage match” using an LLM in judge mode.
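A reranker of that kind can be a thin wrapper around a judge prompt. The `judge` callable below is a hypothetical stand-in for your LLM call (any function that maps the filled prompt to a 0–3 score works); a stub judge demonstrates the flow:

```python
JUDGE_PROMPT = """Rate 0–3 how well the passage answers the question.
Reply with a single digit.

Question: {question}
Passage: {passage}"""

def rerank(question, passages, judge):
    """Re-order candidate passages by judge score, best first.
    `judge` takes the filled prompt and returns an int 0–3."""
    return sorted(
        passages,
        key=lambda p: judge(JUDGE_PROMPT.format(question=question, passage=p)),
        reverse=True,
    )

# Stub judge for demonstration only: favors passages naming a temperature range.
stub_judge = lambda prompt: 3 if "90–96" in prompt else 1
ordered = rerank(
    "What water temperature should I use for pour-over?",
    ["Cold brew uses room-temp water.", "Pour-over is best at 90–96 °C."],
    stub_judge,
)
```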

If the model still hallucinates, check your prompt rules. Make “abstain when unsure” explicit and visible. Also check retrieval quality: mismatched embeddings for queries vs. docs, missing normalization, or indexing the wrong text (e.g., including boilerplate) all degrade results.

Citations must be meaningful. Citing an entire document ID for a specific number is weak grounding. Encourage per-passage IDs and short quotes for critical facts like figures, dates, and thresholds.

⚠️ Pitfall: Letting the LLM both retrieve and answer in one shot often re-introduces memory-based guesses. Keep retrieval external and deterministic at first.


Mini exercise / lab

Goal: Compare naïve vs. retrieved answers and track faithfulness qualitatively.

Setup: Use the three coffee passages above as your entire corpus. Ask two questions:

  1. “What brew ratio does the note suggest for pour-over?”

  2. “How long should I steep cold brew?”

Procedure: First, ask your LLM each question with no context. Save the answers. Then run the same questions through the RAG prompt with retrieved passages (k=2). For each answer, rate faithfulness on a 3-point qualitative scale (Unsupported, Partially supported, Fully supported) and list the exact quoted lines that back each claim.

Expected output snippet (for Q2):

text

Answer: Steep cold brew for 12–18 hours. [D3 p1]
Support: "Cold brew uses room-temp water; steep 12–18 h."
Faithfulness: Fully supported
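The "Fully supported" grade for verbatim quotes can even be checked mechanically. A minimal sketch, assuming exact substring matching is good enough for your corpus (the middle grade, "Partially supported", still needs a human or an LLM judge):

```python
# Tiny corpus slice relevant to Q2, keyed by passage ID.
CORPUS = {
    "D3 p1": "Cold brew uses room-temp water; steep 12–18 h.",
}

def support_level(quote, citation_id, corpus):
    """Grade one claim: 'Fully supported' if the quoted support appears
    verbatim in the cited passage, else 'Unsupported'."""
    passage = corpus.get(citation_id, "")
    return "Fully supported" if quote in passage else "Unsupported"
```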

💡 Insight: Even a tiny corpus surfaces the habit: cite first, write second. That habit carries over when your corpus grows.


How the LLM “connects” in practice

At tiny scale you don’t need heavy infrastructure. A script or low-code tool can handle the steps: load files, chunk, embed, store vectors locally, retrieve top-k, and compose the prompt. The LLM never “reads your folder” directly. It only sees the selected passages you pass in. That boundary is the safety rail: if the passages don’t support the claim, the model is instructed to abstain.

As you grow, you can swap the local index for a vector database, add a reranker stage, and log queries, retrieved IDs, and outputs for later review. The core contract remains the same: retrieval produces candidates; the prompt enforces grounding; the answer carries citations.


Summary & Conclusion

You learned the RAG pattern as a librarian-plus-writer loop: retrieve first, then generate with citations. For a tiny corpus, this is enough to improve accuracy without complex systems. You saw how chunking, embeddings, and a vector index feed a simple retriever, and how a grounded prompt keeps the LLM honest.

The main trade-offs live in chunk size and top-k. Too big or too many, and answers sprawl; too small or too few, and answers miss key details. Make abstention explicit and reward quotes for critical facts. When answers go off-track, fix retrieval quality before swapping models.

The lab asked you to compare naïve vs. retrieved answers and grade faithfulness qualitatively. That habit—reading the passage that “pays for” each claim—will guide your future iterations.

Next steps:

  • Add a one-line reranker that scores passage relevance with an LLM judge and compare outputs.

  • Log each run’s question, retrieved IDs, answer, and faithfulness rating; review ten examples to tune chunk size and k.

  • Expand your corpus to 10–20 documents and repeat the lab with two new questions that need synthesis, not just one number.
