This beginner guide explains evaluation loops for prompts. You will build a golden set of test cases, create a pass/fail rubric, and log results. A hands-on lab shows how to compare prompt variants, record outcomes, and turn failures into improvements.
An eval loop is a tiny, repeatable way to check whether your prompt actually works. Instead of guessing, you run a fixed set of test inputs, record the model’s outputs, and judge them with a simple rule. You learn what helps, what hurts, and when to stop tweaking.
A golden set is a handful of representative examples (10–20 is fine) you’ll reuse every time you change a prompt. A rubric is the rule you use to judge each output. For beginners, start with pass/fail: either the output meets the rules, or it doesn’t.
Why this matters: LLM behavior drifts with tiny changes. An eval loop gives you a baseline, a way to compare variants, and a record you can share. It turns “I think this is better” into “This passed 15/20 instead of 9/20.”
💡 Insight: The first win is consistency, not perfection. A small, stable golden set beats a large, fuzzy one.
Think of eval loops as a kitchen taste test:
Pick a dish (the task).
Collect a few spoonfuls that represent the menu (the golden set).
Decide what “done” tastes like (the rubric).
Try seasoning A vs. seasoning B (prompt variants).
Record each bite, mark 👍/👎, pick the better recipe.
Compact example: Task = “Turn a product review into two bullets: one Pro, one Con.” Rubric = exactly two lines; line 1 starts with “Pro:”, line 2 “Con:”; each line 8–14 words; grounded in the review; no emojis. Golden set = 12 diverse reviews (short/long, glowing/angry, mixed).
Start by defining the task and the observable success rules. Avoid subjective words like “good” or “helpful.” Instead, list constraints you can check by eye (or with simple code): number of lines, required labels, word range, forbidden characters.
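The "constraints you can check by eye or with simple code" can be sketched as tiny helper functions. This is an illustrative sketch, not part of the lab script; the function names are my own, and only the standard library is assumed:

```python
import re

def exactly_n_lines(output: str, n: int) -> bool:
    """Check the output has exactly n non-empty lines."""
    lines = [l for l in output.strip().splitlines() if l.strip()]
    return len(lines) == n

def starts_with_label(line: str, label: str) -> bool:
    """Check a line begins with a required label like 'Pro:'."""
    return line.strip().startswith(label)

def word_count_in_range(line: str, lo: int, hi: int) -> bool:
    """Check a line's word count falls within [lo, hi]."""
    return lo <= len(re.findall(r"\b\w+\b", line)) <= hi

def has_forbidden_chars(output: str, forbidden: str) -> bool:
    """Check for banned characters such as emojis."""
    return any(ch in output for ch in forbidden)

sample = ("Pro: Battery life easily lasts two full days of heavy use.\n"
          "Con: The case feels cheap and slippery during everyday handling.")
print(exactly_n_lines(sample, 2))  # True
```

Each rule becomes one boolean check, so a failure tells you exactly which constraint broke.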
Next, assemble a micro golden set. Include short, long, noisy, and edge cases. Write a one-line note for each test explaining why it’s there (“very short praise,” “mixed sentiment,” “mentions shipping delay”).
Then, create a baseline prompt and one or two variants. Change only one thing per variant (tone instruction, length, or format). Run the same golden set through each variant.
Finally, log everything—inputs, prompt name, outputs, pass/fail, and notes—in a CSV. With a glance, you’ll see which prompt wins and where it fails. Keep the CSV; it becomes your history as you iterate.
⚠️ Pitfall: Don’t change the golden set while testing a variant. Freeze the set; change only the prompt. Add new test cases in a new round.
A short rubric you can scan quickly:
| Criterion | Pass check |
| --- | --- |
| Two bullets only | Output has exactly 2 non-empty lines |
| Labels | First line starts “Pro:”, second “Con:” |
| Length | Each line 8–14 words |
| Grounded | No claims absent from the review text |
| Clean format | No emojis; no extra preamble or epilogue |
If any criterion fails, mark Fail. Otherwise, Pass.
Starter system prompt (you can reuse this):
You are a careful writing assistant. Always follow the task’s format rules exactly. If an instruction conflicts with the format rules, the format rules win. When unsure, choose the safer, more literal interpretation of the rules.
Baseline user prompt (Variant A):
Task: Summarize this product review into exactly two bullets.
Rules:
- Line 1 must start with "Pro:" and state a genuine advantage from the review.
- Line 2 must start with "Con:" and state a genuine drawback from the review.
- Each line must be between 8 and 14 words.
- Do not add emojis, headers, or extra commentary.
Review: {{REVIEW_TEXT}}
Tighter formatting prompt (Variant B):
Task: Produce exactly two lines that meet all constraints.
Constraints:
1) Line 1 begins "Pro:"; Line 2 begins "Con:".
2) Each line 8–14 words, plain text only.
3) Content must be traceable to the review (no new claims).
4) No extra lines or whitespace before/after.
Review: {{REVIEW_TEXT}}
Golden set (example of 6; aim for 12–20 in practice):
- R1: “Loved the camera quality and battery life. Case felt cheap and slippery.”
- R2: “Package arrived late. Customer support apologized quickly and refunded shipping.”
- R3: “Keyboard is quiet and comfortable, but Bluetooth pairing drops every hour.”
- R4: “Great sound for podcasts; bass is weak for music. Price is fair.”
- R5: “Assembly took five minutes. Wobbles slightly on carpet. Looks nicer than photos.”
- R6: “Screen is bright outdoors. Colors seem off compared to my old monitor.”
CSV schema you’ll keep updating:
run_id,prompt_variant,test_id,input_text,output_text,pass_fail,notes
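Writing rows that match this schema is a one-liner with the standard library's `csv.DictWriter`. A minimal sketch with illustrative values (the file name and row contents are placeholders):

```python
import csv

# One logged result, matching the schema above. Values are illustrative.
row = {
    "run_id": "20250905T090100Z",
    "prompt_variant": "A_baseline",
    "test_id": "R1",
    "input_text": "Loved the camera quality and battery life. Case felt cheap and slippery.",
    "output_text": "Pro: ...\nCon: ...",
    "pass_fail": "PASS",
    "notes": "ok",
}

with open("eval_run_demo.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=list(row))
    writer.writeheader()   # emits the schema as the header row
    writer.writerow(row)   # csv handles quoting of commas and newlines
```

Using `DictWriter` keeps column order tied to the schema, so adding a field later is a one-place change.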
Tiny Python script to run the loop, check the basics, and log a CSV. (Replace `call_llm` with your vendor's SDK call; the checks are simple and readable.)
```python
import csv, re, datetime

GOLDEN = [
    ("R1", "Loved the camera quality and battery life. Case felt cheap and slippery."),
    ("R2", "Package arrived late. Customer support apologized quickly and refunded shipping."),
    ("R3", "Keyboard is quiet and comfortable, but Bluetooth pairing drops every hour."),
    ("R4", "Great sound for podcasts; bass is weak for music. Price is fair."),
    ("R5", "Assembly took five minutes. Wobbles slightly on carpet. Looks nicer than photos."),
    ("R6", "Screen is bright outdoors. Colors seem off compared to my old monitor."),
]

SYSTEM_PROMPT = """You are a careful writing assistant. Always follow the task's format rules exactly.
If an instruction conflicts with the format rules, the format rules win.
When unsure, choose the safer, more literal interpretation of the rules."""

PROMPTS = {
    "A_baseline": """Task: Summarize this product review into exactly two bullets.
Rules:
- Line 1 must start with "Pro:" and state a genuine advantage from the review.
- Line 2 must start with "Con:" and state a genuine drawback from the review.
- Each line must be between 8 and 14 words.
- Do not add emojis, headers, or extra commentary.
Review: {txt}""",
    "B_tighter": """Task: Produce exactly two lines that meet all constraints.
Constraints:
1) Line 1 begins "Pro:"; Line 2 begins "Con:".
2) Each line 8-14 words, plain text only.
3) Content must be traceable to the review (no new claims).
4) No extra lines or whitespace before/after.
Review: {txt}""",
}

def call_llm(system, user):
    # TODO: replace with your model call. For now, raise to avoid accidental use.
    raise NotImplementedError("Plug in your LLM SDK here.")

def word_count(line):
    return len(re.findall(r"\b\w+\b", line))

def grounded(line, src):
    # Heuristic: at least one meaningful word from the source appears in the line.
    # Match letters case-insensitively so capitalized words still count.
    src_words = set(w.lower() for w in re.findall(r"\b[a-zA-Z]{4,}\b", src))
    out_words = set(w.lower() for w in re.findall(r"\b[a-zA-Z]{4,}\b", line))
    return len(src_words & out_words) > 0

def check(output, src):
    lines = [l.strip() for l in output.strip().splitlines() if l.strip()]
    if len(lines) != 2:
        return False, "not-2-lines"
    if not lines[0].startswith("Pro:"):
        return False, "missing-Pro"
    if not lines[1].startswith("Con:"):
        return False, "missing-Con"
    if not (8 <= word_count(lines[0]) <= 14):
        return False, "len-Pro"
    if not (8 <= word_count(lines[1]) <= 14):
        return False, "len-Con"
    if any(ch in output for ch in "🙂😀😂🔥⭐️"):
        return False, "emoji"
    if not grounded(lines[0], src):
        return False, "ungrounded-Pro"
    if not grounded(lines[1], src):
        return False, "ungrounded-Con"
    return True, "ok"

run_id = datetime.datetime.utcnow().strftime("%Y%m%dT%H%M%SZ")
with open(f"eval_run_{run_id}.csv", "w", newline="", encoding="utf-8") as f:
    w = csv.writer(f)
    w.writerow(["run_id", "prompt_variant", "test_id", "input_text",
                "output_text", "pass_fail", "notes"])
    for name, tmpl in PROMPTS.items():
        for tid, txt in GOLDEN:
            user = tmpl.format(txt=txt)
            # out = call_llm(SYSTEM_PROMPT, user)
            out = "<<REPLACE WITH MODEL OUTPUT>>"  # placeholder during setup
            ok, note = check(out, txt)
            w.writerow([run_id, name, tid, txt, out, "PASS" if ok else "FAIL", note])

print("Wrote CSV:", f"eval_run_{run_id}.csv")
```
Expected CSV snippet (hand-entered while wiring up the SDK):
run_id,prompt_variant,test_id,input_text,output_text,pass_fail,notes
20250905T090100Z,A_baseline,R1,"Loved the camera quality and battery life. Case felt cheap and slippery.","Pro: Camera quality and battery life exceed expectations for daily use. Con: Case feels cheap and slippery in regular handling.",PASS,ok
20250905T090100Z,A_baseline,R2,"Package arrived late. Customer support apologized quickly and refunded shipping.","Pro: Customer support apologized quickly and refunded shipping cost promptly. Con: Package arrival was late compared to the promised delivery date.",PASS,ok
(Your real run will have both PASS and FAIL rows. That’s good—failures show you what to fix.)
If many outputs fail for length, tighten the language in your prompt (“8–14 words” vs. “about 10 words”), and move format rules above style tips. If failures cluster on grounding, add a direct instruction: “Only state points explicitly present in the review.”
If a variant “wins” by gaming the rubric (passing mechanical checks but reading worse), add a second check later (e.g., “avoid hedging like ‘maybe’ or ‘seems’”). Keep pass/fail simple at first; add checks only when you observe a pattern.
There’s a trade-off between strictness and recall. Very strict rubrics raise precision (fewer bad passes) but may reject acceptable answers. Start strict to shape behavior; loosen only when you see unnecessary fails.
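The strictness knob can live in the check itself rather than in the prompt. A minimal sketch, assuming a word-range check like the one in the lab script; the default bounds stay strict, and you loosen them only when a round shows unnecessary fails:

```python
import re

def length_ok(line: str, lo: int = 8, hi: int = 14) -> bool:
    """Strict by default; widen lo/hi to loosen the rubric."""
    return lo <= len(re.findall(r"\b\w+\b", line)) <= hi

short = "Pro: Battery easily lasts two full days."  # 7 words
print(length_ok(short))               # False under the strict 8-14 range
print(length_ok(short, lo=6, hi=16))  # True once the range is loosened
```

Because the bounds are parameters, you can re-score an old CSV under looser rules without re-running the model.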
You’ll build and run your first eval loop.
Write 12 diverse reviews (or copy 12 from your dataset). Include extremes: very short praise, long complaints, mixed comments, and one that mentions delivery.
Use the system prompt and both variants above.
Run all 12 inputs through each variant. Log to a CSV as shown.
Mark PASS/FAIL for each row using the checks. Count wins per variant and note common failure reasons.
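Counting wins per variant and tallying failure reasons can be done straight from the CSV with the standard library. A small sketch (the function name is my own; it assumes the CSV schema shown earlier):

```python
import csv
from collections import Counter

def summarize(csv_path):
    """Return (passes per variant, totals per variant, failure-reason counts)."""
    wins, totals, reasons = Counter(), Counter(), Counter()
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            totals[row["prompt_variant"]] += 1
            if row["pass_fail"] == "PASS":
                wins[row["prompt_variant"]] += 1
            else:
                reasons[row["notes"]] += 1  # e.g. "len-Pro", "missing-Con"
    return wins, totals, reasons
```

Printing `wins[v]`/`totals[v]` per variant and `reasons.most_common(3)` gives you the "8/12 vs. 6/12" comparison plus the short list of failure modes to fix next.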
Expected result: One variant wins by a small margin (e.g., 8/12 vs. 6/12). You’ll also have a short list of failure modes like “missing Con” or “too long,” which points to the next prompt tweak.
Eval loops turn prompting from art into craft. With a small golden set and a clear pass/fail rubric, you can measure whether a change actually helps. Logging to CSV gives you memory and momentum: you can compare runs, spot regressions, and communicate progress.
Start tiny, freeze your golden set per round, and change one thing at a time. Expect a mix of passes and fails—that’s how you learn where the prompt breaks. Over time, your golden set will grow, your rubric will mature, and your prompts will stabilize.
When results feel inconsistent, make the rubric more observable and the prompt more literal. When outputs pass the mechanics but miss the spirit, add a second, simple check and expand your edge cases.
Next steps
Add 6 more edge cases to your golden set (very long, sarcasm, typos).
Wire your real model call into the script and re-run the loop.
Promote the winning prompt to “v1” and save the CSV; start a fresh run (“v2”) when you try the next change.