
Auto-Prompting: APE, ProTeGi, and Multi-Branch Optimization

Learn automatic prompt optimization with APE and ProTeGi. Generate strong task instructions, refine them through critic, edit, and re-score loops, and prototype multi-branch prompts that route tricky inputs. Includes scoring, artifacts, and auditing.

September 6, 2025
90 min read
Promptise Team
Advanced
Prompt Engineering · Prompt Optimization · Evaluation & Metrics · Error Analysis · Routing

Promise: you’ll learn to generate, refine, and route prompts automatically—first by searching for good instructions (APE), then by editing them with textual “gradients” (ProTeGi), and finally by branching the prompt to cover distinct failure patterns (AMPO-style). You’ll leave with copy-ready scaffolds, a minimal scoring loop, and a tiny lab you can run in a notebook or directly in chat.


Why this matters (and what we’ll build)

Every strong prompt starts out weak. Hand-tuning gets you partway, but systematic gains come from a loop: propose → score → edit → specialize. Automatic Prompt Engineer (APE) shows that LLMs can write candidate instructions and we can pick winners with a simple score. ProTeGi shows that you can improve a prompt by asking the model to critique errors in natural language (a “textual gradient”) and then apply that critique as an edit, guided by a small dataset and a metric. Multi-branch methods (e.g., AMPO) recognize that many tasks contain multiple patterns—so the right end state is not one prompt, but a small set of branches with a lightweight router. Together, these tools lift you out of trial-and-error and into an optimization mindset.


Lay of the land (terms in plain words)

  • Small labeled set or proxy metric. You need either a handful of gold examples with exact answers (classification/extraction), or a proxy score such as BLEU/ROUGE or an LLM-rater you trust. Keep a dev set for optimization and a tiny hold-out for reality checks.

  • Score function m(prompt, dataset) returns a number: accuracy, F1, or an averaged judge score.

  • Budget. Count model calls and tokens. Auto-prompting burns budget if you don’t cap beam width, steps, and candidate evaluation.

  • Overfitting guard. Evaluate the final winner on your hold-out and watch for collapses.

We’ll use a toy but realistic micro-task—classify support emails as billing / technical / account—because it’s easy to score exactly and exposes common failure modes (ambiguous wording, multi-label confusion, out-of-scope mail).
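
As a concrete starting point, the micro-task data and the dev/hold-out split might look like this. The example messages and the `split_dev_holdout` helper are illustrative, not a canonical dataset:

```python
import random

# Hypothetical micro-dataset: (message, gold_label) pairs for the toy task.
EXAMPLES = [
    ("I was double charged on my card", "billing"),
    ("App crashes on login", "technical"),
    ("Change my email address", "account"),
    ("Refund for duplicate payment, and the app also freezes", "billing"),
    ("Reset my password please", "account"),
    ("Error 500 when uploading files", "technical"),
    ("Cancel my subscription", "billing"),
    ("What's the weather like?", "unknown"),
]

def split_dev_holdout(examples, holdout_frac=0.25, seed=0):
    """Shuffle once, then carve off a small hold-out for final reality checks."""
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    k = max(1, int(len(shuffled) * holdout_frac))
    return shuffled[k:], shuffled[:k]  # (dev, holdout)

dev, holdout = split_dev_holdout(EXAMPLES)
```

Optimize only against `dev`; touch `holdout` once, at the end, to catch overfitting.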


The move, part 1: APE (generate and select instructions)

Mental model: treat the instruction as a small “program.” Ask an LLM to propose many candidate instructions from a few input→output demonstrations. Score each candidate on your dev set. Keep the best. Optionally, iterate by paraphrasing top candidates and re-scoring (a Monte-Carlo style search). In practice, two short prompts do most of the work: a proposer and a scorer.

What tends to work

  • Seed with 5–12 gold examples that expose edge cases.

  • Generate ~20–50 candidate instructions in one or two waves.

  • Score with exact match/F1 when possible; otherwise, a rubric-based LLM judge (but confirm on a hold-out).

APE-style proposer (drop-in): Use this to synthesize candidate instructions from I/O demos.

System (once): You generate clear, model-executable task instructions.
User: Given these input→label pairs, infer the task and propose one instruction that would make a black-box model reproduce the labels. Make it specific and testable.
Data (few lines):
Input: "I was double charged on my card" → Label: billing
Input: "App crashes on login" → Label: technical
Input: "Change my email address" → Label: account
(…4–8 more lines)
Format:
INSTRUCTION: <one precise instruction>
RATIONALE: <why this instruction should work>
Output only INSTRUCTION and RATIONALE.

Run this N times (diverse seeds / temperatures) to get a pool like:

  • “Classify each message into billing/technical/account; prefer the most specific category; if none, say unknown.”

  • “Output one label from {billing, technical, account}; treat payment disputes as billing; app errors as technical; profile changes as account.”
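
A minimal sketch of that candidate-generation wave, assuming an `llm_call(prompt, temperature)` helper that you supply (the parsing convention matches the INSTRUCTION/RATIONALE format above):

```python
def generate_candidates(proposer_prompt, llm_call, n=40, temperatures=(0.7, 1.0)):
    """Sample the proposer repeatedly at varied temperatures,
    parse out the INSTRUCTION line, and drop exact duplicates."""
    pool, seen = [], set()
    for i in range(n):
        temp = temperatures[i % len(temperatures)]
        raw = llm_call(proposer_prompt, temperature=temp)
        for line in raw.splitlines():
            if line.startswith("INSTRUCTION:"):
                instr = line[len("INSTRUCTION:"):].strip()
                if instr and instr not in seen:
                    seen.add(instr)
                    pool.append(instr)
                break  # only the first INSTRUCTION line per sample
    return pool
```

Exact-string dedup is crude; the diversity filter discussed under Troubleshooting is the stronger version.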

Scoring loop (pseudo-code you can copy):

python

def score_instruction(instr, dev_examples, llm_call):
    correct = 0
    for x, gold in dev_examples:
        y = llm_call(f"{instr}\nMessage: {x}\nLabel:")
        correct += (normalize(y) == gold)
    return correct / len(dev_examples)

pool = generate_candidates(proposer_prompt, demos, n=40)
scores = [(p, score_instruction(p, dev, llm)) for p in pool]
top = sorted(scores, key=lambda t: t[1], reverse=True)[:5]
# (Optional) Paraphrase top instructions → re-score → pick winner

What you get: an initial high-performing instruction with minimal hand editing. APE consistently beats naive prompts on zero-shot tasks when you can measure outcomes, because you’ve turned ideation into a search.


The move, part 2: ProTeGi (iterate with textual “gradients”)

Mental model: when your prompt makes mistakes, ask the model to explain how the prompt caused them (that explanation is a textual gradient), then ask the model to apply the gradient by editing the prompt in the opposite direction. Wrap that in a tiny beam search and use a bandit-style selector to decide which edited prompts deserve more budget. You’ve imported the spirit of gradient descent into text.

Three prompts power the loop

  1. Critic (Δ): summarizes where the current prompt fails on a mini-batch and names the policy defects.

  2. Editor (δ): rewrites the prompt to fix those defects—not to overfit the batch.

  3. Selector: evaluates the edited candidates and advances the best few (beam).

Δ — batch critic (pasteable): Feed K dev errors (input, gold, model_output) and the current instruction; get crisp failure themes.

You are a prompt critic. Given the current instruction and mistakes, write 3–5 policy defects that caused these errors.
Instruction: {{CURRENT_INSTRUCTION}}
Mistakes (K=4):
1. Message: "..."; Gold: billing; Output: account
2. ...
Output JSON: {"defects":[{"name":"...", "evidence":"(1,3)", "fix":"<policy change>"}, ...]}

δ — prompt editor (pasteable): Apply the critic’s defects to produce small, auditable edits; emit multiple candidates.

You are a prompt editor. Edit the instruction to fix the listed defects without increasing length by >15%. Produce up to 3 candidates.
Current instruction: {{CURRENT_INSTRUCTION}}
Defects JSON: {{CRITIC_JSON}}
Output:
CANDIDATE_1: <revised instruction>
CHANGELOG_1: <one-line why>
(repeat for each candidate)

Selector (scoring): reuse the APE scorer; keep the top b candidates; repeat r rounds or until gains stall. Small beams (b=3–5) and r=2–4 rounds are usually enough.

Why this works: the critic focuses on generalizable flaws (“disambiguate multi-topic emails; choose a single best label”), and the editor encodes those fixes into the policy. The beam + bandit selection keeps cost down while still exploring multiple edits. ProTeGi reported up to 31% relative improvement over the starting prompt on small classification tasks under modest budgets; your mileage will vary, but even 5–10 points is common when the initial instruction is vague.
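
One round of the critic → editor → selector loop can be sketched as follows. The names `critic_fn`, `editor_fn`, and `score_fn` are placeholders for helpers wrapping the Δ prompt, the δ prompt, and the APE-style scorer:

```python
def protegi_round(instruction, dev, llm_call, critic_fn, editor_fn, score_fn, beam=3):
    """One critic -> editor -> selector round; keeps the current prompt
    in the pool so a bad edit can never lose ground."""
    preds = [(x, gold, llm_call(f"{instruction}\nMessage: {x}\nLabel:"))
             for x, gold in dev]
    mistakes = [(x, g, y) for x, g, y in preds if y != g][:4]  # K=4 mini-batch
    defects = critic_fn(instruction, mistakes)     # textual gradient (delta)
    candidates = editor_fn(instruction, defects)   # edited prompts
    scored = [(c, score_fn(c)) for c in [instruction] + candidates]
    return sorted(scored, key=lambda t: t[1], reverse=True)[:beam]
```

Call it r times, feeding the top survivor of each round into the next.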


The move, part 3: Multi-Branch Optimization (split by pattern)

Some tasks hide multiple patterns: short transactional emails, multi-issue complaints, out-of-scope chatter. Forcing one prompt to cover all patterns invites contradictions (“always output one label” vs. “sometimes output unknown”). AMPO-style methods embrace this by discovering error patterns, proposing branch-specific edits, and pruning weak branches. The end result is a compact tree: a 1-line router plus 2–4 specialized branches.

Branching in practice (prompt-only version)

  • Pattern recognition: repurpose the ProTeGi critic to cluster defects into patterns (e.g., multi-topic, payment detail missing, account vs. billing confusion).

  • Branch drafting: for each pattern, derive a branch instruction (“If the email contains both a payment issue and an app error, output the dominant issue after checking the first sentence for intent.”).

  • Router: a tiny first step that assigns an input to a branch: “Choose ONE of {general, multi-topic, OOS} and then apply its instruction.”

  • Pruning: keep branches that beat the monolith by ≥X points on their slice; merge or drop the rest.

Router + branches (scaffold):

[Router] Decide the pattern for the message: {single-issue, multi-topic, out-of-scope}. Then follow the corresponding branch instruction exactly.
[Branch: single-issue] <specialized instruction emphasizing specificity and tie-breakers>
[Branch: multi-topic] <rules for picking a dominant label; ignore salutations; prefer explicit monetary terms>
[Branch: out-of-scope] <emit "unknown" with one-line reason>

AMPO’s core idea is iterative: summarize error reasons → adjust branches → prune. Even without code, two cycles of this procedure often recover the wins: fewer contradictions, cleaner policies, and higher accuracy on messy inputs.


Show, don’t tell: one compact run-through

Start with your baseline instruction from APE. Score: 74% on the 40-item dev set.

  1. ProTeGi round (K=4 mistakes): critic flags “Multi-topic mail chooses the first mentioned issue,” “Ambiguous account terms,” “No unknown policy.” Editor emits three candidates. You re-score; best candidate hits 81%.

  2. Branch once: critic groups residual errors into two patterns: multi-topic and out-of-scope. You add a 1-line router and two small branches, each with 2–3 rules. Re-score by pattern: single-issue 88%, multi-topic 76%, OOS 92%. Weighted total: 85%. Hold-out check: 84% (no collapse).

No heroics, just a steady propose → edit → branch march.


Deepening the craft (budgets, selection, and judges)

How big should the search be? Early wins arrive fast. Try 40–80 APE proposals → pick 5; then 2–3 ProTeGi rounds with beam=3–5. If your metric is exact match, you can afford wider beams. If you rely on an LLM judge, cap it.

Selectors and bandits. ProTeGi’s bandit framing saves tokens: evaluate candidates on subsets of the dev set and spend extra budget only on promising contenders. If you prefer simplicity, evaluate on the full dev set but cut beam width.
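
A simple successive-halving selector in that bandit spirit: score everyone on a small random slice of dev, keep the top half, double the slice, repeat. This is a sketch, assuming a `score_fn(candidate, subset)` you provide:

```python
import random

def successive_halving(candidates, dev, score_fn, start_frac=0.25, seed=0):
    """Spend most of the evaluation budget on promising candidates:
    cheap subset scores eliminate the weak half each round."""
    rng = random.Random(seed)
    pool = list(candidates)
    n = max(1, int(len(dev) * start_frac))
    while len(pool) > 1:
        subset = rng.sample(dev, min(n, len(dev)))
        pool = sorted(pool, key=lambda c: score_fn(c, subset), reverse=True)
        pool = pool[:max(1, len(pool) // 2)]  # halve the pool
        n *= 2                                # double the subset next round
    return pool[0]
```

With 40 candidates this uses roughly a third of the calls a full-dev evaluation of everyone would.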

Judges vs. exact metrics. Exact metrics (EM/F1) are gold. If you must use an LLM judge, freeze a strict rubric and log rationales. Then sanity-check winners on a small human-labeled slice.

When not to branch. If patterns are not separable (errors are random) or routing is flaky, branches create overhead without gains. In that case, do another ProTeGi round or revisit the instruction’s conflict resolution rules.


In practice: copy-ready prompts & snippets

A. APE proposer (zero-shot instruction search) One turn generates one candidate; run it 20–50× with varied seeds.

You are an instruction synthesizer. Infer a precise, testable instruction that makes a black-box model reproduce the labels below.
INPUT→LABEL pairs:
{{FEW_SHOT_PAIRS}}
Requirements:
- Name the label set explicitly.
- Specify tie-breakers and unknown handling.
- Avoid examples or prose; write one concise instruction.
Return:
INSTRUCTION: <single paragraph>
RATIONALE: <one sentence>

B. ProTeGi critic (textual gradient over a mini-batch)

Role: Prompt critic.
Task: From the mistakes below, infer defects in the current instruction.
CURRENT_INSTRUCTION: {{INSTR}}
MISTAKES (each has message, gold, model_output):
{{K_ERRORS}}
Write 3–5 defects with evidence indices and a concrete policy fix for each.
Output JSON: {"defects":[{"name":"...", "evidence":"(1,3)", "fix":"..."}, ...]}

C. ProTeGi editor (apply gradient to produce candidates)

Role: Prompt editor. Apply the "defects" to revise the instruction with minimal changes.
Constraints: <=15% longer; keep label set intact; no examples.
Input:
- CURRENT_INSTRUCTION: {{INSTR}}
- DEFECTS_JSON: {{CRITIC_JSON}}
Output up to 3 candidates:
CANDIDATE_i: <revised instruction>
CHANGELOG_i: <one-line justification>

D. Branch router + pruning checklist

[Decide a pattern for the message: {single-issue, multi-topic, out-of-scope}.]
Rule for multi-topic: pick the dominant issue using (monetary terms > authentication > UI nuisance).
Rule for out-of-scope: if no evidence matches any label, output "unknown" and a 6–12 word reason.
[Apply the branch-specific instruction.]

Prune branches that don’t beat the monolith on their slice by ≥X points. Keep the tree ≤3 leaves.
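
The pruning rule reduces to a few lines once you have per-slice accuracies for the branches and the monolith. The threshold and dictionary shapes here are illustrative:

```python
def prune_branches(branch_scores, monolith_scores, min_gain=0.03, max_leaves=3):
    """branch_scores / monolith_scores: {pattern: accuracy on that slice}.
    Keep a branch only if it beats the monolith on its own slice by >= min_gain;
    cap the tree at max_leaves, preferring the largest gains."""
    gains = {p: branch_scores[p] - monolith_scores.get(p, 0.0) for p in branch_scores}
    keepers = [p for p, g in gains.items() if g >= min_gain]
    keepers.sort(key=lambda p: gains[p], reverse=True)
    return keepers[:max_leaves]
```

Patterns that fail the cut fall back to the monolithic prompt rather than getting a weak branch.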


Troubleshooting (what breaks and what to try)

Prompt bloat. Edits grow the instruction into a wall of text. Add a length budget to the editor and occasionally run a “compress without losing policy” pass.

Metric mismatch. Your judge rewards verbosity or hedging. Tighten the format (e.g., “Output one label from {…} and nothing else”) and add a post-processor.
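
A post-processor for this case can be as small as the sketch below; the label set matches the toy task, and the salvage heuristic is an assumption, not a universal parser:

```python
LABELS = {"billing", "technical", "account", "unknown"}

def normalize_label(raw, labels=LABELS, default="unknown"):
    """Strip model chatter down to one canonical label; anything else -> default."""
    text = raw.strip().lower().rstrip(".")
    if text in labels:
        return text
    # salvage outputs like "Label: billing" when exactly one label appears
    for label in labels:
        if label in text.split():
            return label
    return default
```

Scoring `normalize_label(output)` instead of the raw string stops verbosity from masquerading as error.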

Search degeneracy. Candidates converge to near-duplicates. Increase proposer temperature, add a paraphrase wave, or add a diversity filter (reject >0.9 semantic similarity).
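
A greedy diversity filter is easy to bolt onto the candidate pool. Token Jaccard overlap is used here as a cheap lexical stand-in for the semantic similarity check; swap in embedding cosine similarity if you have it:

```python
def jaccard(a, b):
    """Cheap lexical proxy for semantic similarity between two instructions."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(1, len(ta | tb))

def diversity_filter(candidates, threshold=0.9):
    """Greedy dedup: drop any candidate too similar to one already kept."""
    kept = []
    for c in candidates:
        if all(jaccard(c, k) < threshold for k in kept):
            kept.append(c)
    return kept
```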

Overfitting the dev set. Dev goes up, hold-out stalls. Reduce beam width, use subsets during selection, and re-introduce “unknown”/abstain rules.

Router flakiness. If the first line (router) often misroutes, make it predict-then-verify: ask the router to output the pattern and a 1-sentence reason; drop to the monolithic prompt when confidence words (e.g., “maybe”) appear.
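
The predict-then-verify fallback is mostly string inspection. The `pattern -- reason` output format and the hedge-word list are assumed conventions you would pin down in the router prompt:

```python
HEDGE_WORDS = {"maybe", "possibly", "unsure", "unclear", "perhaps"}

def route_or_fallback(router_output,
                      patterns=("single-issue", "multi-topic", "out-of-scope")):
    """Parse 'pattern -- reason'; fall back to the monolithic prompt when
    the reason hedges or the pattern is unrecognized."""
    parts = router_output.split("--", 1)
    pattern = parts[0].strip().lower()
    reason = parts[1].lower() if len(parts) > 1 else ""
    if pattern not in patterns or HEDGE_WORDS & set(reason.split()):
        return "monolith"
    return pattern
```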


Mini Lab (15 minutes)

Goal: run a tiny APE → ProTeGi → Branch loop on 12 labeled emails.

  1. Prepare data: 12 emails with gold labels (billing/technical/account/unknown). Split 8/4 (dev/hold-out).

  2. APE: generate 40 candidate instructions; score on the 8 dev items; keep top 3.

  3. ProTeGi: take the best instruction and run one critic+editor round on the 4 errors from the dev set; score 3 edited candidates; keep the best.

  4. Branch: from the remaining 2–3 errors, write one router line and one branch instruction that fixes the top pattern; re-score.

  5. Report: dev accuracy, hold-out accuracy, final instruction(s).

Expected outcome (plausible): baseline 62–72% → after APE 74–80% → after ProTeGi 78–84% → after one branch 82–88% (hold-out within 1–2 points of dev).


Summary & Conclusion

Automatic prompting is less a trick than a habit: treat instructions as artifacts that are proposed, scored, and edited—and sometimes split when the task demands it. APE gives you breadth quickly; ProTeGi turns mistakes into momentum with textual gradients and light beam/bandit selection; multi-branch optimization acknowledges that one prompt cannot rule them all.

The recurring pattern is simple: keep a small, honest metric; cap the budget; write prompts that produce artifacts (defects JSON, changelogs, router decisions) so you can audit the evolution. When you reach diminishing returns, branch once, prune hard, and rerun a final sanity check on a hold-out.

Next steps

  • Wire your scoring loop to real logs (opt-in!) and let ProTeGi critique fresh mistakes weekly.

  • Replace the judge with exact metrics wherever possible; where not, freeze a rubric and sample multiple judges.

  • Try AMPO-style branching on one more task (e.g., summarization with length control) and compare monolith vs. 2-branch trees.


References (official papers)

  • Automatic Prompt Engineer (APE). Large Language Models Are Human-Level Prompt Engineers. ICLR 2023. Zhou et al.

  • ProTeGi. Automatic Prompt Optimization with “Gradient Descent” and Beam Search. EMNLP 2023; arXiv:2305.03495. Pryzant et al., 2023.

  • AMPO (Multi-Branch). AMPO: Automatic Multi-Branched Prompt Optimization. EMNLP 2024 (ACL Anthology). Yang et al.

Additional context & surveys (optional):

  • A Systematic Survey of Automatic Prompt Optimization Techniques. arXiv:2502.16923, 2025.

  • A Survey of Automatic Prompt Engineering. arXiv:2502.11560, 2025.

These works underpin the methods and design choices described here—and are worth skimming before you scale your loop.
