This beginner guide explains evaluation loops for prompts. You will build a golden set of test cases, create a pass/fail rubric, and log results. A hands-on lab shows how to compare prompt variants, record outcomes, and turn failures into improvements.
An eval loop is a tiny, repeatable way to check whether your prompt actually works. Instead of guessing, you run a fixed set of test inputs, record the model’s outputs, and judge them with a simple rule. You learn what helps, what hurts, and when to stop tweaking.
A golden set is a handful of representative examples (10–20 is fine) you’ll reuse every time you change a prompt. A rubric is the rule you use to judge each output. For beginners, start with pass/fail: either the output meets the rules, or it doesn’t.
Why this matters: LLM behavior drifts with tiny changes. An eval loop gives you a baseline, a way to compare variants, and a record you can share. It turns “I think this is better” into “This passed 15/20 instead of 9/20.”
💡 Insight: The first win is consistency, not perfection. A small, stable golden set beats a large, fuzzy one.
Think of eval loops as a kitchen taste test:
Pick a dish (the task).
Collect a few spoonfuls that represent the menu (the golden set).
Decide what “done” tastes like (the rubric).
Try seasoning A vs. seasoning B (prompt variants).
Record each bite, mark 👍/👎, pick the better recipe.
Compact example: Task = “Turn a product review into two bullets: one Pro, one Con.” Rubric = exactly two lines; line 1 starts with “Pro:”, line 2 “Con:”; each line 8–14 words; grounded in the review; no emojis. Golden set = 12 diverse reviews (short/long, glowing/angry, mixed).
Start by defining the task and the observable success rules. Avoid subjective words like “good” or “helpful.” Instead, list constraints you can check by eye (or with simple code): number of lines, required labels, word range, forbidden characters.
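The "constraints you can check by eye or with simple code" can be sketched as tiny helper functions. This is an illustrative sketch, not part of the lab script; the function names are my own, and only the standard library is assumed:

```python
import re

def exactly_n_lines(output: str, n: int) -> bool:
    """Check the output has exactly n non-empty lines."""
    lines = [l for l in output.strip().splitlines() if l.strip()]
    return len(lines) == n

def starts_with_label(line: str, label: str) -> bool:
    """Check a line begins with a required label like 'Pro:'."""
    return line.strip().startswith(label)

def word_count_in_range(line: str, lo: int, hi: int) -> bool:
    """Check a line's word count falls within [lo, hi]."""
    return lo <= len(re.findall(r"\b\w+\b", line)) <= hi

def has_forbidden_chars(output: str, forbidden: str) -> bool:
    """Check for banned characters such as emojis."""
    return any(ch in output for ch in forbidden)

sample = ("Pro: Battery life easily lasts two full days of heavy use.\n"
          "Con: The case feels cheap and slippery during everyday handling.")
print(exactly_n_lines(sample, 2))  # True
```

Each rule becomes one boolean check, so a failure tells you exactly which constraint broke.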
Next, assemble a micro golden set. Include short, long, noisy, and edge cases. Write a one-line note for each test explaining why it’s there (“very short praise,” “mixed sentiment,” “mentions shipping delay”).
Then, create a baseline prompt and one or two variants. Change only one thing per variant (tone instruction, length, or format). Run the same golden set through each variant.
Finally, log everything—inputs, prompt name, outputs, pass/fail, and notes—in a CSV. With a glance, you’ll see which prompt wins and where it fails. Keep the CSV; it becomes your history as you iterate.
⚠️ Pitfall: Don’t change the golden set while testing a variant. Freeze the set; change only the prompt. Add new test cases in a new round.
A short rubric you can scan quickly:
| Criterion | Pass check |
| --- | --- |
| Two bullets only | Output has exactly 2 non-empty lines |
| Labels | First line starts “Pro:”, second “Con:” |
| Length | Each line 8–14 words |
| Grounded | No claims absent from the review text |
| Clean format | No emojis; no extra preamble or epilogue |
If any criterion fails, mark Fail. Otherwise, Pass.
Starter system prompt (you can reuse this):
You are a careful writing assistant. Always follow the task’s format rules exactly. If an instruction conflicts with the format rules, the format rules win. When unsure, choose the safer, more literal interpretation of the rules.
Baseline user prompt (Variant A):
Task: Summarize this product review into exactly two bullets.
Rules:
- Line 1 must start with "Pro:" and state a genuine advantage from the review.
- Line 2 must start with "Con:" and state a genuine drawback from the review.
- Each line must be between 8 and 14 words.
- Do not add emojis, headers, or extra commentary.
Review: {{REVIEW_TEXT}}
Tighter formatting prompt (Variant B):
Task: Produce exactly two lines that meet all constraints.
Constraints:
1) Line 1 begins "Pro:"; Line 2 begins "Con:".
2) Each line 8–14 words, plain text only.
3) Content must be traceable to the review (no new claims).
4) No extra lines or whitespace before/after.
Review: {{REVIEW_TEXT}}
Golden set (example of 6; aim for 12–20 in practice):
- R1: “Loved the camera quality and battery life. Case felt cheap and slippery.”
- R2: “Package arrived late. Customer support apologized quickly and refunded shipping.”
- R3: “Keyboard is quiet and comfortable, but Bluetooth pairing drops every hour.”
- R4: “Great sound for podcasts; bass is weak for music. Price is fair.”
- R5: “Assembly took five minutes. Wobbles slightly on carpet. Looks nicer than photos.”
- R6: “Screen is bright outdoors. Colors seem off compared to my old monitor.”
CSV schema you’ll keep updating:
run_id,prompt_variant,test_id,input_text,output_text,pass_fail,notes
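Writing rows that match this schema is a one-liner with the standard library's `csv.DictWriter`. A minimal sketch with illustrative values (the file name and row contents are placeholders):

```python
import csv

# One logged result, matching the schema above. Values are illustrative.
row = {
    "run_id": "20250905T090100Z",
    "prompt_variant": "A_baseline",
    "test_id": "R1",
    "input_text": "Loved the camera quality and battery life. Case felt cheap and slippery.",
    "output_text": "Pro: ...\nCon: ...",
    "pass_fail": "PASS",
    "notes": "ok",
}

with open("eval_run_demo.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=list(row))
    writer.writeheader()   # emits the schema as the header row
    writer.writerow(row)   # csv handles quoting of commas and newlines
```

Using `DictWriter` keeps column order tied to the schema, so adding a field later is a one-place change.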
Tiny Python script to run the loop, check the basics, and log a CSV. (Replace `call_llm` with your vendor's SDK call; the checks are simple and readable.)
```python
import csv, re, datetime

GOLDEN = [
    ("R1", "Loved the camera quality and battery life. Case felt cheap and slippery."),
    ("R2", "Package arrived late. Customer support apologized quickly and refunded shipping."),
    ("R3", "Keyboard is quiet and comfortable, but Bluetooth pairing drops every hour."),
    ("R4", "Great sound for podcasts; bass is weak for music. Price is fair."),
    ("R5", "Assembly took five minutes. Wobbles slightly on carpet. Looks nicer than photos."),
    ("R6", "Screen is bright outdoors. Colors seem off compared to my old monitor."),
]

SYSTEM_PROMPT = """You are a careful writing assistant. Always follow the task's format rules exactly.
If an instruction conflicts with the format rules, the format rules win.
When unsure, choose the safer, more literal interpretation of the rules."""

PROMPTS = {
    "A_baseline": """Task: Summarize this product review into exactly two bullets.
Rules:
- Line 1 must start with "Pro:" and state a genuine advantage from the review.
- Line 2 must start with "Con:" and state a genuine drawback from the review.
- Each line must be between 8 and 14 words.
- Do not add emojis, headers, or extra commentary.
Review: {txt}""",
    "B_tighter": """Task: Produce exactly two lines that meet all constraints.
Constraints:
1) Line 1 begins "Pro:"; Line 2 begins "Con:".
2) Each line 8-14 words, plain text only.
3) Content must be traceable to the review (no new claims).
4) No extra lines or whitespace before/after.
Review: {txt}""",
}

def call_llm(system, user):
    # TODO: replace with your model call. For now, raise to avoid accidental use.
    raise NotImplementedError("Plug in your LLM SDK here.")

def word_count(line):
    return len(re.findall(r"\b\w+\b", line))

def grounded(line, src):
    # Heuristic: at least one meaningful word from the source appears in the line.
    # Match letters case-insensitively so capitalized words still count.
    src_words = set(w.lower() for w in re.findall(r"\b[a-zA-Z]{4,}\b", src))
    out_words = set(w.lower() for w in re.findall(r"\b[a-zA-Z]{4,}\b", line))
    return len(src_words & out_words) > 0

def check(output, src):
    lines = [l.strip() for l in output.strip().splitlines() if l.strip()]
    if len(lines) != 2:
        return False, "not-2-lines"
    if not lines[0].startswith("Pro:"):
        return False, "missing-Pro"
    if not lines[1].startswith("Con:"):
        return False, "missing-Con"
    if not (8 <= word_count(lines[0]) <= 14):
        return False, "len-Pro"
    if not (8 <= word_count(lines[1]) <= 14):
        return False, "len-Con"
    if any(ch in output for ch in "🙂😀😂🔥⭐️"):
        return False, "emoji"
    if not grounded(lines[0], src):
        return False, "ungrounded-Pro"
    if not grounded(lines[1], src):
        return False, "ungrounded-Con"
    return True, "ok"

run_id = datetime.datetime.utcnow().strftime("%Y%m%dT%H%M%SZ")
with open(f"eval_run_{run_id}.csv", "w", newline="", encoding="utf-8") as f:
    w = csv.writer(f)
    w.writerow(["run_id", "prompt_variant", "test_id", "input_text",
                "output_text", "pass_fail", "notes"])
    for name, tmpl in PROMPTS.items():
        for tid, txt in GOLDEN:
            user = tmpl.format(txt=txt)
            # out = call_llm(SYSTEM_PROMPT, user)
            out = "<<REPLACE WITH MODEL OUTPUT>>"  # placeholder during setup
            ok, note = check(out, txt)
            w.writerow([run_id, name, tid, txt, out, "PASS" if ok else "FAIL", note])

print("Wrote CSV:", f"eval_run_{run_id}.csv")
```
Expected CSV snippet (hand-entered while wiring up the SDK):
run_id,prompt_variant,test_id,input_text,output_text,pass_fail,notes
20250905T090100Z,A_baseline,R1,"Loved the camera quality and battery life. Case felt cheap and slippery.","Pro: Camera quality and battery life exceed expectations for daily use. Con: Case feels cheap and slippery in regular handling.",PASS,ok
20250905T090100Z,A_baseline,R2,"Package arrived late. Customer support apologized quickly and refunded shipping.","Pro: Customer support apologized quickly and refunded shipping cost promptly. Con: Package arrival was late compared to the promised delivery date.",PASS,ok
(Your real run will have both PASS and FAIL rows. That’s good—failures show you what to fix.)
If many outputs fail for length, tighten the language in your prompt (“8–14 words” vs. “about 10 words”), and move format rules above style tips. If failures cluster on grounding, add a direct instruction: “Only state points explicitly present in the review.”
If a variant “wins” by gaming the rubric (passing mechanical checks but reading worse), add a second check later (e.g., “avoid hedging like ‘maybe’ or ‘seems’”). Keep pass/fail simple at first; add checks only when you observe a pattern.
There’s a trade-off between strictness and recall. Very strict rubrics raise precision (fewer bad passes) but may reject acceptable answers. Start strict to shape behavior; loosen only when you see unnecessary fails.
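The strictness knob can live in the check itself rather than in the prompt. A minimal sketch, assuming a word-range check like the one in the lab script; the default bounds stay strict, and you loosen them only when a round shows unnecessary fails:

```python
import re

def length_ok(line: str, lo: int = 8, hi: int = 14) -> bool:
    """Strict by default; widen lo/hi to loosen the rubric."""
    return lo <= len(re.findall(r"\b\w+\b", line)) <= hi

short = "Pro: Battery easily lasts two full days."  # 7 words
print(length_ok(short))               # False under the strict 8-14 range
print(length_ok(short, lo=6, hi=16))  # True once the range is loosened
```

Because the bounds are parameters, you can re-score an old CSV under looser rules without re-running the model.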
You’ll build and run your first eval loop.
Write 12 diverse reviews (or copy 12 from your dataset). Include extremes: very short praise, long complaints, mixed comments, and one that mentions delivery.
Use the system prompt and both variants above.
Run all 12 inputs through each variant. Log to a CSV as shown.
Mark PASS/FAIL for each row using the checks. Count wins per variant and note common failure reasons.
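Counting wins per variant and tallying failure reasons can be done straight from the CSV with the standard library. A small sketch (the function name is my own; it assumes the CSV schema shown earlier):

```python
import csv
from collections import Counter

def summarize(csv_path):
    """Return (passes per variant, totals per variant, failure-reason counts)."""
    wins, totals, reasons = Counter(), Counter(), Counter()
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            totals[row["prompt_variant"]] += 1
            if row["pass_fail"] == "PASS":
                wins[row["prompt_variant"]] += 1
            else:
                reasons[row["notes"]] += 1  # e.g. "len-Pro", "missing-Con"
    return wins, totals, reasons
```

Printing `wins[v]`/`totals[v]` per variant and `reasons.most_common(3)` gives you the "8/12 vs. 6/12" comparison plus the short list of failure modes to fix next.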
Expected result: One variant wins by a small margin (e.g., 8/12 vs. 6/12). You’ll also have a short list of failure modes like “missing Con” or “too long,” which points to the next prompt tweak.
Eval loops turn prompting from art into craft. With a small golden set and a clear pass/fail rubric, you can measure whether a change actually helps. Logging to CSV gives you memory and momentum: you can compare runs, spot regressions, and communicate progress.
Start tiny, freeze your golden set per round, and change one thing at a time. Expect a mix of passes and fails—that’s how you learn where the prompt breaks. Over time, your golden set will grow, your rubric will mature, and your prompts will stabilize.
When results feel inconsistent, make the rubric more observable and the prompt more literal. When outputs pass the mechanics but miss the spirit, add a second, simple check and expand your edge cases.
Next steps
Add 6 more edge cases to your golden set (very long, sarcasm, typos).
Wire your real model call into the script and re-run the loop.
Promote the winning prompt to “v1” and save the CSV; start a fresh run (“v2”) when you try the next change.