Learn inference-time self-improvement for LLMs without fine-tuning. Use Self-Refine loops to draft, critique, and revise; add CRITIC verification and Reflexion memory. Covers rubrics with LLM-as-a-Judge, evolving prompts with OPRO or APE, and safe decoding.
Promise: by the end, you’ll know how to make an LLM critique, revise, and actually get better within the prompt loop—no weight updates, no fine-tuning. You’ll wire up inner-loop self-improvement for a single answer, add outer-loop reflection that carries lessons across attempts, and—when needed—plug in judges, tools, and prompt evolution so the system keeps sharpening itself.
When we say "teach the model to improve itself," we're talking about inference-time learning: mechanisms that steer the same frozen model to produce stronger outputs on each try. The ingredients are simple:
Critique → Revise loops that turn a draft into a better answer (the Self-Refine pattern). (arXiv, OpenReview)
Reflection memory so the model learns “what worked” and avoids repeating mistakes on the next attempt (the Reflexion pattern). (arXiv, neurips.cc)
Judges and tools that verify claims, score alternatives, or spot errors (LLM-as-a-Judge, CRITIC). (arXiv)
Principled self-critique with a “constitution” (values or rules) when safety or coverage matters. (arXiv)
Prompt evolution that automatically improves the instructions themselves (APE/OPRO, PromptBreeder). (arXiv)
Rewindable decoding that lets a model backtrack mid-generation when it notices trouble (RAIN). (OpenReview)
These methods are well-studied and, importantly, require no new training. They rely on the model’s latent knowledge, plus structure and feedback you provide at run time.
Think of self-improvement as two nested loops around the same model:
Inner loop (one question): Draft → Critique → Revise until acceptance criteria (quality, citations, tests passing) are met. This is Self-Refine; it can be augmented with tools (search, code runner) for fact-checking or debugging (CRITIC). (arXiv)
Outer loop (across questions): Reflect → Store a lesson → Apply later. The model writes a short “lesson learned” with triggers (“if the task looks like X, avoid Y, try Z”) and the system prepends those snippets next time (Reflexion). (arXiv)
A judge—or a small eval set—decides when the loop should stop and which variant wins. Judges can be another LLM, a rubric, a unit test, or a tool-verified check. (arXiv)
Below is the smallest useful inner loop. It turns a draft answer into a revised answer using the model’s own critique.
Step 1 — Draft. “Answer {{QUESTION}}. Produce a concise answer and nothing else.”
Step 2 — Critique. “Here is your answer. List concrete flaws that matter for {{QUALITY_CRITERIA}}. Make each flaw actionable, cite any claims that need support, and suggest fixes. Do not rewrite the answer.”
Step 3 — Revise. “Rewrite the answer to address the critique. Keep strengths, fix weaknesses, satisfy {{QUALITY_CRITERIA}}. Show final answer only.”
This exact loop—Self-Refine—has repeatedly improved outputs across tasks without training new weights. You can repeat it 1–3 times; beyond that, returns diminish unless you add tools or tests. (arXiv)
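The three steps above can be sketched as a loop. In this illustration, `model` is any callable that maps a prompt to text; the `stub_model` below is a placeholder that stands in for a real LLM call so the control flow is visible:

```python
def self_refine(model, question, criteria, rounds=2):
    """Draft once, then alternate critique and revise (Self-Refine pattern)."""
    draft = model(f"Answer {question}. Produce a concise answer and nothing else.")
    for _ in range(rounds):
        critique = model(
            f"Here is your answer:\n{draft}\n"
            f"List concrete flaws that matter for {criteria}. Do not rewrite the answer."
        )
        draft = model(
            f"Rewrite the answer to address the critique:\n{critique}\n"
            f"Original:\n{draft}\nSatisfy {criteria}. Show final answer only."
        )
    return draft

# Stub model for illustration only: routes on the prompt so each stage is visible.
def stub_model(prompt):
    if "Rewrite" in prompt:
        return "revised"
    return "critique" if "flaws" in prompt else "draft"

print(self_refine(stub_model, "What is TLS?", "accuracy"))  # → revised
```

Swapping `stub_model` for a real completion call gives you the working inner loop; everything else in this article layers onto this skeleton.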
Use when you need better writing, clearer logic, or format correctness and you can specify a rubric.
How: draft, generate an actionable critique (not a rewrite), then revise.
Why it works: it forces the model to separate evaluation from generation, which reduces anchoring on weak first drafts. (arXiv)
Use when truth or execution matters (facts, math, code, safety).
How: after the draft, instruct the model to call external tools—search, calculator, interpreter—to verify target properties (truth, functional output), then revise with evidence.
Why: the loop improves because verification is grounded in external signals. (arXiv)
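A minimal sketch of the CRITIC idea, using a deterministic arithmetic checker as the "tool" (a real system would call search, a code runner, or tests, and would send the evidence back to the model rather than patching the text directly):

```python
import re

def verify_arithmetic(draft):
    """CRITIC-style tool check: recompute 'a op b = c' claims found in the draft."""
    issues = []
    for a, op, b, claimed in re.findall(r"(\d+)\s*([+\-*])\s*(\d+)\s*=\s*(\d+)", draft):
        actual = {"+": int(a) + int(b), "-": int(a) - int(b), "*": int(a) * int(b)}[op]
        if actual != int(claimed):
            issues.append(f"{a} {op} {b} = {actual}, not {claimed}")
    return issues

def revise_with_evidence(draft, issues):
    # Illustration only: patch the wrong numbers in place. In practice you would
    # feed draft + issues back through the revise prompt instead.
    for issue in issues:
        lhs, correct = issue.split(" = ")[0], issue.split(" = ")[1].split(",")[0]
        draft = re.sub(re.escape(lhs) + r"\s*=\s*\d+", f"{lhs} = {correct}", draft)
    return draft

draft = "The total is 17 + 5 = 23."
issues = verify_arithmetic(draft)           # ['17 + 5 = 22, not 23']
print(revise_with_evidence(draft, issues))  # The total is 17 + 5 = 22.
```

The key property: the correction comes from an external signal (the recomputation), not from the model's own opinion of its draft.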
Use when tasks repeat and you want carry-over learning without training.
How: after each attempt, have the model summarize a lesson with a trigger + fix; store it; prepend relevant lessons on similar future tasks.
Why: models can condition on these short “experience snippets” to avoid repeating mistakes. (arXiv)
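One way to sketch the reflection store: keep lessons as trigger + fix pairs and retrieve by keyword overlap with the new task. The class and matching heuristic here are illustrative, not part of the Reflexion paper:

```python
class ReflectionStore:
    """Outer-loop memory: lessons with triggers, retrieved by keyword overlap."""
    def __init__(self):
        self.lessons = []  # each lesson: {"trigger": ..., "fix": ...}

    def add(self, trigger, fix):
        self.lessons.append({"trigger": trigger.lower(), "fix": fix})

    def relevant(self, task, k=3):
        words = set(task.lower().split())
        scored = [(len(words & set(l["trigger"].split())), l) for l in self.lessons]
        return [l["fix"] for s, l in sorted(scored, key=lambda x: -x[0]) if s > 0][:k]

store = ReflectionStore()
store.add("question asks for top-k list", "Rank items and state the ranking criterion.")
store.add("task mentions citations", "Give a full source title, not just a domain.")

print(store.relevant("List the top-k web frameworks"))
# → ['Rank items and state the ranking criterion.']
```

The retrieved fixes get prepended to the next prompt; a production version would use embedding similarity instead of word overlap, but the control flow is the same.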
Use when safety, tone, or values must be respected.
How: give a short “constitution” (rules/principles). Ask the model to critique its draft against the constitution, then revise accordingly.
Why: you replace ad-hoc preferences with explicit rules. (Anthropic later used these critiques for training, but the critique-and-revise loop works at inference time, too.) (arXiv)
Use when you need selection among candidates or a stopping rule.
How: write a rubric and have a judge model score candidates blind to author identity; pick top-k, optionally tournament-style.
Caveat: judge reliability varies by task and model; meta-evaluations show fine-tuned judges aren’t drop-in replacements for frontier judges. Use calibrated rubrics and spot-check. (arXiv)
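A pairwise judging harness with order randomization (a common mitigation for position bias) can be sketched as follows; `stub_judge` is a placeholder for a real judge model that returns 1 or 2 for whichever candidate it prefers:

```python
import random

def judge_pick(judge, task, cand_a, cand_b, trials=5, seed=0):
    """Pairwise LLM-as-Judge: randomize presentation order each trial,
    tally wins, and return the overall winner."""
    rng = random.Random(seed)
    wins = {"A": 0, "B": 0}
    for _ in range(trials):
        first, second = ("A", "B") if rng.random() < 0.5 else ("B", "A")
        texts = {"A": cand_a, "B": cand_b}
        verdict = judge(task, texts[first], texts[second])  # 1 = first, 2 = second
        wins[first if verdict == 1 else second] += 1
    return "A" if wins["A"] >= wins["B"] else "B"

# Stub judge for illustration: prefers the longer answer regardless of order.
def stub_judge(task, first, second):
    return 1 if len(first) >= len(second) else 2

print(judge_pick(stub_judge, "Explain TLS", "short", "a much longer answer"))  # → B
```

Because the order flips across trials, a judge that always picks the first slot would score near 50/50 here, which is exactly the failure this harness exposes.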
Use when your prompt template is the bottleneck.
APE / OPRO: the model proposes new instruction prompts; a scorer evaluates; the loop keeps the winners. Think “LLM as optimizer.” (arXiv)
PromptBreeder: evolve a population of prompts (and the mutation strategies themselves). Useful when diversity helps. (arXiv)
Use when early tokens derail the answer.
How: the model self-evaluates mid-generation and “rewinds” to a better branch if it detects violations (e.g., wrong plan, unsafe path).
Why: prevents committing to a bad trajectory. It’s inference-time only. (OpenReview)
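The rewind idea can be sketched schematically (this is the control flow, not the actual RAIN algorithm, which operates over token trees with learned self-evaluation): generate in segments, checkpoint after each good segment, and back up when the self-check flags a violation.

```python
def rewind_decode(generate_segment, violates, max_rewinds=3):
    """Sketch of rewind-on-violation decoding: commit good segments,
    drop and retry bad ones."""
    text, attempt = "", 0
    while True:
        segment = generate_segment(text, attempt)
        if violates(text + segment):
            attempt += 1            # rewind: discard the segment, try another branch
            if attempt > max_rewinds:
                break
            continue
        text += segment             # checkpoint: commit the good segment
        attempt = 0
        if segment == "":           # empty segment signals end of generation
            break
    return text

# Stub generator: the first try of the second segment is "unsafe", the retry is fine.
def stub_gen(prefix, attempt):
    if prefix == "":
        return "Plan: "
    if prefix == "Plan: ":
        return "unsafe step" if attempt == 0 else "safe step"
    return ""

print(rewind_decode(stub_gen, lambda t: "unsafe" in t))  # → Plan: safe step
```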
Below is an implementation recipe you can drop into any stack that can call an LLM and optional tools.
Draft (current answer)
Critique (actionable issues)
Evidence (optional: tool outputs/citations/tests)
Score (judge rating)
Reflections (short lessons with triggers)
```python
def self_improve(task, judge, tools=None, reflections=None, max_iters=3, target=0.85):
    reflections = reflections or []                   # avoid a mutable default argument
    prompt = build_prompt(task, reflections)          # include relevant lessons
    draft = llm(prompt)                               # 1) Draft
    score = 0.0
    for t in range(max_iters):
        critique = llm(critique_prompt(task, draft))  # 2) Critique (no rewriting)
        evidence = run_tools_if_any(draft, tools)     # 3) Verify (optional)
        draft = llm(revise_prompt(task, draft, critique, evidence))  # 4) Revise
        score = judge_score(judge, task, draft)       # 5) Score with rubric
        if score >= target:
            break
    reflection = llm(reflect_prompt(task, draft, score))  # 6) Write lesson
    store_reflection(reflection)                      # 7) Persist
    return draft, score, reflection
```
Critique prompts should demand specific, fixable issues tied to your rubric (format, correctness, citations, safety). That move is what made Self-Refine work reliably. (arXiv)
Judges: pin them to a rubric and absolute scale; randomize candidate order; use spot human checks on a slice (papers show judge quality matters). (arXiv)
Tools: connect whatever gives ground truth—search, code, tests. CRITIC shows this lifts reliability further. (arXiv)
Critique (task-agnostic). One step that asks for flaws only, bound to a rubric.
You wrote the answer below. Identify concrete flaws that matter for: {{RUBRIC}}. For each flaw, write: what’s wrong, why it matters, how to fix. Don’t rewrite the answer. Answer with a numbered list of flaws (max {{N}}).
Revise with evidence.
Rewrite the answer to fix each numbered flaw. Keep strengths. If a flaw cites missing evidence, incorporate the supporting evidence or citation directly into the rewrite. Show the final answer only.
Reflection memory (outer loop).
Write a 3-line "lesson learned" from this task.
Line 1: TRIGGER (if the task looks like X). Line 2: AVOID (the mistake made here). Line 3: TRY (the fix that worked).
How many inner-loop iterations? 1–3. More can overfit to the judge or introduce verbosity; add tools/tests instead if you need more quality. CRITIC’s tool checks often beat extra iterations. (arXiv)
Judge risk. LLM-as-Judge is powerful but imperfect; recent meta-studies show fine-tuned judges can lag behind frontier models and display biases. Calibrate with a seeded gold set; avoid letting the same model judge and compete. (arXiv)
Not a silver bullet for reasoning. Evidence shows models sometimes fail to self-correct hard reasoning without external signals; that’s your cue to add tools, tests, or decomposition. (arXiv)
Inference cost vs. wins. These loops add tokens and latency. In practice, a single critique-revise pass with a judge is a strong default; add a second pass only when scores justify it.
If you notice that every answer needs similar fixes, your instructions are the bottleneck. Use an automatic prompt optimizer:
OPRO / APE: let the model propose prompt variants, score them on a small validation set (or with a judge), and keep the winners; repeat for a few rounds. This reliably beats hand-crafted prompts on many tasks. (arXiv)
PromptBreeder: evolve a population of prompts and the mutation strategies themselves—handy when diversity matters or tasks are heterogeneous. (arXiv)
Implementation tip: reuse your judge from the inner loop as the optimizer’s scoring function; DeepMind’s OPRO repo shows the wiring. (GitHub)
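A toy version of the optimize loop, assuming a proposer and a scorer you supply (here both are trivial stand-ins: the scorer just counts rubric terms, where a real one would run an eval set or judge):

```python
def optimize_prompt(propose, score, seed_prompt, rounds=3, pool=4):
    """OPRO/APE-style loop: propose variants of the best prompt so far;
    the scorer decides which survive."""
    best, best_score = seed_prompt, score(seed_prompt)
    for _ in range(rounds):
        candidates = [propose(best, i) for i in range(pool)]
        for cand in candidates:
            s = score(cand)
            if s > best_score:
                best, best_score = cand, s  # greedy: keep the winner
    return best, best_score

# Toy scorer for illustration: rewards prompts that mention the rubric terms.
def toy_score(prompt):
    p = prompt.lower()
    return sum(term in p for term in ("concise", "cite", "steps"))

# Toy proposer: a real one would ask the LLM for variants of `prompt`.
def toy_propose(prompt, i):
    additions = [" Be concise.", " Cite sources.", " Show steps.", " Use emoji."]
    return prompt + additions[i % len(additions)]

best, s = optimize_prompt(toy_propose, toy_score, "Answer the question.")
print(best, s)
```

With an LLM as the proposer and your inner-loop judge as `score`, this is the whole OPRO wiring in miniature.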
The critique is vague. Add examples of good flaws in your critique prompt (“Missing citation for [X]; Definition of [Y] contradicts [Z]”). That nudges specificity (a known lever in Self-Refine). (arXiv)
It keeps making the same mistake. Your reflections may be too fuzzy. Make triggers concrete (e.g., “TRIGGER: question asks for top-k list”) and keep them short, like lint rules. (arXiv)
Judge over-optimizes verbosity. Penalize length in the rubric and prefer evidence-backed criteria: citations present, tests pass, or tools confirm. CRITIC is your friend here. (arXiv)
Mid-generation derailments. Try RAIN-style decoding with checkpoints and a quick self-evaluation every N tokens; rewind on violations. (OpenReview)
Goal: wire a single-question Self-Refine + Judge loop with optional web verification.
Pick a task, e.g., “Explain the difference between TLS and SSL for a non-technical audience, with one trustworthy citation.”
Run Draft → Critique → Revise once, using the critique and revise prompts above.
Judge the draft vs. revision on a 0–1 scale using this rubric: accuracy, clarity, one trustworthy citation, ≤150 words. Select the winner.
If your model can browse or call tools, add one verification step (search for the cited page title) before revising (CRITIC). The revised output should upgrade the citation from vague to specific.
Expected outcome: the revised answer is shorter, clearer, and contains a specific citation; your judge assigns a higher score to the revision. (If not, inspect the critique—likely too generic.)
Self-Refine shows repeated gains across diverse tasks via iterative self-feedback (no training). (arXiv, OpenReview)
Reflexion demonstrates inference-time “learning” with reflective memory and improves agent performance without weight updates. (arXiv, neurips.cc)
CRITIC improves correctness by verifying drafts with tools, then revising from evidence. (arXiv)
LLM-as-Judge is useful but imperfect; empirical studies and surveys map its limits—calibration matters. (arXiv)
OPRO / APE / PromptBreeder automate instruction search and often beat human prompts without model changes. (arXiv)
RAIN provides inference-time alignment by self-evaluation and rewinding, not training. (OpenReview)
Limits matter: “LLMs cannot self-correct reasoning yet” highlights where pure self-critique stalls—use tools/decomposition there. (arXiv)
You don’t need to fine-tune to get sustained gains. A disciplined Draft → Critique → Revise loop (Self-Refine) lifts quality on the spot. Add tools for verifiable properties (CRITIC) and a reflection memory to carry lessons forward (Reflexion). When your instructions—not the answers—are the problem, let the model optimize the prompt (OPRO/APE/PromptBreeder). Use judges sparingly and calibrate them; ground them in rubrics and evidence.
Treat these as control flows around a frozen model. They’re cheap to adopt, easy to A/B, and, with a small eval set, you can prove the uplift before changing anything heavier in your stack.
Next steps
Wrap your top task with the inner loop and log scores, tokens, and latency. Promote if win-rate ≥ your pre-set bar.
Add a tiny reflection store (JSON lines) and a retriever that picks 1–3 lessons by trigger.
If you hit a ceiling, run a weekend OPRO job to evolve your instructions using your judge as the scorer.
Self-Refine: Madaan et al., Self-Refine: Iterative Refinement with Self-Feedback. (arXiv, OpenReview, ACM Digital Library)
Reflexion: Shinn et al., Reflexion: Language Agents with Verbal Reinforcement Learning (NeurIPS). (arXiv, neurips.cc)
CRITIC: Gou et al., CRITIC: LLMs Can Self-Correct with Tool-Interactive Critiquing. (arXiv)
Constitutional AI (critique & revise at inference time): Bai et al. (arXiv)
LLM-as-Judge (empirical & survey): Huang et al.; Li et al. (survey). (arXiv)
Rewindable decoding / inference-time alignment: Li et al., RAIN. (OpenReview)
Limits of self-correction for reasoning: Huang et al., Large Language Models Cannot Self-Correct Reasoning Yet. (arXiv)