
© 2026 Promptise by Manser Ventures. All rights reserved.


LLM Agent Orchestration

This in-depth guide teaches you how to transform a raw language model into a reliable, auditable system that can reason step-by-step, use tools safely, keep state, verify outputs, and stop on time. By the end, you’ll understand orchestration as a control system: the model proposes, the controller disposes.

September 12, 2025
45 min read
Promptise Team
Advanced
Prompt Engineering · AI Agents · ReAct · Verification · System Design · AI Safety · AI Ops · Orchestration

Promise. This guide gives you a mental model and a practical playbook for controlling a large language model end to end, so it uses tools when needed, keeps state, verifies its own work, and knows when to stop. We’ll frame orchestration as a control system with clear layers and interfaces, and we’ll carry a ReAct-style agent through as a running example. The focus is the concepts, contracts, and habits you need to build systems that behave.

Why this matters. Models write convincing text; the world cares about steps, units, sources, and deadlines. Orchestration is how you turn free-form dialogue into a deterministic conversation: the model proposes; your runtime disposes—checking formats, running tools with gloves on, and landing the plane.


Lay of the Land: Three Planes, One Authority

Think of an LLM system as three planes under a single tower:

  • Control plane (you): policies, schemas, schedules, budgets, and the controller loop/graph. It decides what’s allowed and when to stop.

  • Data plane (the model): plans, drafts, critiques—the proposals the model makes.

  • Tool plane (the world): calculators, retrieval, APIs, files—everything the model may ask to use.

Authority never leaves the control plane. The model may propose an action or answer; the controller checks shape and rules, executes tools safely, verifies results, and decides the next state.



The Layer Cake (what’s in an orchestrated system)

You can ship with three core layers—Policy, Protocol, Runtime—but production agents profit from a fuller stack. Here’s the set that maps cleanly to real systems.


1) Policy (Etiquette)

What it is. Behavioral rules the model follows: a short “thought” (state your next intent, not your inner monologue), one “action” at a time, observations are controller-provided only, and a clean “final answer.”

What it controls. Output discipline: no sprawling prose between steps; no guessing observations; no tool freelancing.

Why it works. Etiquette keeps the conversation tidy and parseable, which unlocks everything else.

2) Protocol (Contract)

What it is. The message-level spec the model must follow each turn. It defines fields (step, thought, action, final), allowed values (which tools exist), and stop conditions.

What it controls. Determinism. If you can’t parse it, you can’t trust it.

Practical note. Version your protocol. If you change field names or shapes, bump a version tag and keep traces replayable.
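As a sketch of what such a mechanical contract can look like in practice (the field names, version tag, and allow-list below are illustrative choices, not a fixed standard), a strict parser rejects anything it cannot trust:

```python
import json

PROTOCOL_VERSION = "turn/v1"                       # bump on any field/shape change
ALLOWED_ACTIONS = {"calculator", "doc_store.get"}  # illustrative allow-list

def parse_turn(raw: str) -> dict:
    """Parse one model turn; raise ValueError on any protocol violation."""
    turn = json.loads(raw)
    if turn.get("version") != PROTOCOL_VERSION:
        raise ValueError("unknown protocol version")
    if not isinstance(turn.get("step"), int):
        raise ValueError("step must be an integer")
    has_action, has_final = "action" in turn, "final" in turn
    if has_action == has_final:  # exactly one of the two, never both or neither
        raise ValueError("turn must contain exactly one of action/final")
    if has_action and turn["action"].get("name") not in ALLOWED_ACTIONS:
        raise ValueError("action not on the allow-list")
    return turn

ok = parse_turn('{"version": "turn/v1", "step": 1,'
                ' "action": {"name": "calculator", "input": "1299 * 0.077"}}')
```

Anything that fails to parse is handled by the controller as a policy violation, never “interpreted.”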

3) Runtime (Controller)

What it is. The loop (or small state machine) that runs the show. Each turn it sends a control frame (task, step index, budgets, allowed actions, last observation, output schema), receives one model turn, checks it, executes a single action if present, updates a ledger, applies gates, and decides continue / verify / stop.

What it controls. Time, budgets, allowed actions, and termination.
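The whole loop fits in a short sketch. Here `controller_model`, the tool registry, and the toy calculator are stand-in assumptions; the point is who holds authority at each line:

```python
def run(controller_model, tools, max_steps=6):
    """Minimal controller loop: the model proposes, this loop disposes.
    `controller_model` (frame -> turn dict) and the `tools` registry
    (name -> callable) are illustrative stand-ins."""
    ledger, observation = [], None
    for step in range(1, max_steps + 1):
        frame = {"step": step, "observation": observation,
                 "allowed": sorted(tools), "ledger": list(ledger)}
        turn = controller_model(frame)                 # exactly one model turn
        if "final" in turn:                            # model wants to land
            return {"answer": turn["final"], "steps": step, "ledger": ledger}
        action = turn["action"]
        if action["name"] not in tools:                # allow-list gate
            raise RuntimeError("policy violation: unknown action")
        observation = tools[action["name"]](action["input"])
        ledger.append({"step": step, "fact": observation})
    raise RuntimeError("max steps reached without a final answer")

def stub_model(frame):
    """Toy model: one calculator call, then a final answer."""
    if frame["observation"] is None:
        return {"action": {"name": "calculator", "input": "1299 * 0.077"}}
    return {"final": f"VAT is {frame['observation']}"}

# Toy calculator; a real tool would parse the expression, not eval it.
tools = {"calculator": lambda expr: round(eval(expr, {"__builtins__": {}}), 2)}
result = run(stub_model, tools)
```

Note that termination lives entirely in the loop: the model can only propose to land; the controller decides whether it has.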


4) Skills & Tools (Capabilities)

Skills are LLM-only subroutines (plan, draft, critique, revise) invoked with fixed prompts. Tools are external functions (retrieval, calculators, APIs) with typed inputs/outputs and side-effect policies.

What this layer controls. How work gets done under supervision. Treat both as named actions; from the controller’s perspective, they’re symmetrical.

5) Memory (Ledger + Summaries)

What it is. A small, append-only ledger of normalized facts {value, unit, source, step} and short summaries you feed back to the model so it doesn’t forget.

What it controls. State that is auditable, not a blob of past tokens.
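A minimal sketch of such a ledger, assuming the {value, unit, source, step} fact shape described above:

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Fact:
    """One normalized ledger entry: value, unit, source, step."""
    value: float
    unit: str
    source: str
    step: int

@dataclass
class Ledger:
    """Append-only: facts are added, never mutated or removed."""
    facts: list = field(default_factory=list)

    def append(self, fact: Fact) -> None:
        self.facts.append(fact)

    def summary(self) -> str:
        """The short recap fed back to the model in the control frame."""
        return "; ".join(f"{f.value} {f.unit} [{f.source}]" for f in self.facts)

ledger = Ledger()
ledger.append(Fact(7.7, "percent", "policy/vat.json#2024-01-01", 2))
ledger.append(Fact(100.02, "CHF", "calculator", 3))
```

Because every fact carries its source and step, verification and citations later become lookups, not guesswork.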

6) Verification (QA)

What it is. Cheap, deterministic checks (math recomputation, unit sanity, citation presence) and, when needed, a second-pass verifier prompt/model that evaluates the answer against a rubric.

What it controls. Trust. Nothing “ships” without passing verification.
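A hedged sketch of those deterministic checks, assuming an illustrative ledger-entry shape with `label` fields for lookup (your real schema will differ):

```python
def verify(answer: str, ledger: list) -> list:
    """Deterministic verification gates; returns a list of failure reasons
    (empty list means pass)."""
    failures = []
    # 1) Recompute the arithmetic from ledger facts instead of trusting prose.
    rate = next(f["value"] for f in ledger if f.get("unit") == "percent")
    base = next(f["value"] for f in ledger if f.get("label") == "base_price")
    vat = next(f["value"] for f in ledger if f.get("label") == "vat_amount")
    if round(base * rate / 100, 2) != vat:
        failures.append("arithmetic does not recompute")
    # 2) At least one citation in the answer must come from the ledger.
    if not any(f["source"] in answer for f in ledger):
        failures.append("no ledger citation in answer")
    return failures

ledger = [
    {"value": 7.7, "unit": "percent", "source": "policy/vat.json#2024-01-01"},
    {"value": 1299.0, "label": "base_price", "unit": "CHF", "source": "task"},
    {"value": 100.02, "label": "vat_amount", "unit": "CHF", "source": "calculator"},
]
issues = verify("VAT at 7.7% is CHF 100.02. [policy/vat.json#2024-01-01]", ledger)
```

The checks are cheap precisely because they lean on the ledger; the model’s prose is never the source of truth.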

7) Safety & Compliance

Allow-lists for domains, schema validation on inputs/outputs, timeouts, rate limits, and redaction in logs.

8) Telemetry & Tracing

Per-turn trace events (tool inputs/outputs, token counts, cost estimates, gate results, verifier status).

9) Evaluation & Tests

Scenario suites and golden traces that must still pass after you change prompts, models, or tools.

What it controls. The risk of regressions.

10) Lifecycle & Ops

Versioning for etiquette/protocol/registry, model routing (use a stronger model only when stakes or uncertainty warrant), and escalation to humans.

💡 Insight: Most “hallucinations” are not mystical; they’re protocol violations (unparsable output, invented tools) or controller drift (no step cap, no progress test). Fix the rails; don’t romanticize the model.


How the Layers Connect (one turn, end to end)

  1. Controller → Model: Send a control frame: task; current state; step count; remaining token/time budgets; allowed actions for this state (skills and tools by name); last observation; a reminder of etiquette; and a summary of the ledger.

  2. Model → Controller: Return exactly one turn: a brief thought plus either one action (skill/tool) or a final answer.

  3. Controller: Validate the turn against the protocol; if it’s an action, execute it safely; convert outputs into a compact observation; update the ledger; apply gates (schemas, allow-lists, unit checks); then decide the next state. If it’s a final answer, run verification; if it passes, stop; otherwise continue with a hint.

  4. Telemetry: Append a trace event; reduce budgets; increment step.

Stop conditions are evaluated in code, not left to the model: goal met and verified, maximum steps reached, or no progress (e.g., repeated thoughts with no new facts).
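Those stop conditions reduce to a few lines of code evaluated every turn. The condition names below follow this guide; the no-progress patience threshold is an illustrative choice:

```python
def decide_stop(verified: bool, step: int, max_steps: int,
                no_progress: int, patience: int = 2):
    """Evaluate stop conditions in code, in priority order."""
    if verified:
        return "answered"          # goal met and verification passed
    if step >= max_steps:
        return "maxSteps"          # hard step cap
    if no_progress >= patience:
        return "noProgress"        # repeated thoughts, no new facts
    return None                    # keep going

reason = decide_stop(verified=False, step=6, max_steps=6, no_progress=0)
```

Because the function returns a named reason, every termination shows up in traces with an explanation attached.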



ReAct as a Concrete Example (no code, full control)

ReAct stands for Reason + Act: the model emits a short Thought and one Action, receives an Observation, and repeats until it produces a Final Answer. Here’s what a ReAct agent looks like when fully orchestrated.


Etiquette for ReAct (behaviors you teach)

  • Keep Thought to one or two sentences—state intent, not inner monologue.

  • Choose one Action from the allow-listed skills/tools.

  • Wait for Observation; never invent it.

  • Produce a Final Answer only when the goal is satisfied; include citations if your domain requires them.

Protocol for ReAct (what your controller expects)

Each turn must contain:

  • step: increasing integer.

  • thought: brief next intent.

  • action: exactly one {name, kind(skill|tool), input} or final_answer: {text, citations[]}.

Your controller treats any deviation as a policy violation and stops (or retries once with a terse reminder).
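The retry-once behavior can be sketched like this; `model` and `validate` are stand-ins for your own client and protocol checker:

```python
def get_valid_turn(model, frame, validate):
    """Request one turn; on a violation, retry once with a terse reminder,
    then stop. `model` (frame -> turn) and `validate` (turn -> error or
    None) are illustrative stand-ins."""
    turn = model(frame)
    error = validate(turn)
    if error is None:
        return turn
    retry_frame = dict(frame, reminder=(
        f"Protocol violation: {error}. Reply with exactly one "
        "action or final_answer and nothing else."))
    turn = model(retry_frame)
    if validate(turn) is not None:
        raise RuntimeError("policy violation after one retry; stopping")
    return turn

def validate(turn):
    if ("action" in turn) == ("final_answer" in turn):
        return "need exactly one of action/final_answer"
    return None

calls = []
def flaky_model(frame):
    """Toy model: violates once, conforms after seeing the reminder."""
    calls.append(frame)
    if "reminder" in frame:
        return {"step": 2, "thought": "retrying", "final_answer": {"text": "done"}}
    return {"step": 2, "thought": "oops"}   # neither action nor final_answer

turn = get_valid_turn(flaky_model, {"step": 2}, validate)
```

One retry is usually enough; a second violation is a signal to stop and trace, not to keep coaxing.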

Runtime for ReAct (what the loop enforces)

  • States. A tiny state machine works well: PLAN → GATHER → DRAFT → VERIFY → FINALIZE.

    • PLAN allows skill.plan only.

    • GATHER allows external tools like web.search, doc_store.get, calculator.

    • DRAFT allows skill.draft (LLM-only) plus cheap tools.

    • VERIFY bans new tools; permits skill.explain_errors if the verifier flags issues.

    • FINALIZE only emits the final answer.

  • Budgets. Cap steps (e.g., 4–6), thoughts (e.g., ≤ 240 characters), and wall clock time.

  • Progress test. If a turn adds no new fact to the ledger and repeats intent, increment a no-progress counter and terminate after a small threshold.
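The state machine and its per-state allow-lists are naturally expressed as data. The action names below are illustrative:

```python
# The PLAN -> GATHER -> DRAFT -> VERIFY -> FINALIZE machine expressed as
# data: a per-state allow-list plus the legal transitions.
ALLOWED = {
    "PLAN":     {"skill.plan"},
    "GATHER":   {"web.search", "doc_store.get", "calculator"},
    "DRAFT":    {"skill.draft", "calculator"},
    "VERIFY":   {"skill.explain_errors"},   # no new external tools here
    "FINALIZE": set(),                      # only the final answer
}
NEXT = {"PLAN": "GATHER", "GATHER": "DRAFT",
        "DRAFT": "VERIFY", "VERIFY": "FINALIZE"}

def is_allowed(state: str, action: str) -> bool:
    return action in ALLOWED[state]

def advance(state: str) -> str:
    return NEXT.get(state, state)   # FINALIZE is terminal

state = advance("PLAN")
```

Keeping this as data (rather than branching logic) makes it easy to version, diff, and replay old traces against a changed machine.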

A full ReAct trace (conceptual)

Task. “Compute the 7.7% VAT for a CHF 1,299 item in Switzerland and give the total. Cite the VAT rate.”

  • Step 1 — PLAN
    Thought: “I’ll retrieve the VAT policy, then compute tax, then total.”
    Action: skill.plan
    Observation (from controller): a three-step plan (stored; not reprinted to the model unless summarized).

  • Step 2 — GATHER
    Thought: “Fetch the VAT rate from our policy store.”
    Action: tool: doc_store.get, input: vat.ch.standard
    Observation: “Standard VAT rate: 7.7% (Switzerland). Source: policy/vat.json#2024-01-01”
    Ledger update: {fact: 7.7%, unit: percent, source: policy/vat.json#2024-01-01, step: 2}

  • Step 3 — GATHER
    Thought: “Compute 1299 * 0.077 to get VAT.”
    Action: tool: calculator, input: 1299 * 0.077
    Observation: 100.023 → rounded by the controller to 100.02 (document the rounding policy)
    Ledger update: {fact: 100.02, unit: CHF, source: calculator, step: 3}

  • Step 4 — DRAFT
    Thought: “Draft a concise answer with citation.”
    Action: skill.draft
    Observation: a short paragraph: “VAT at 7.7% on CHF 1,299 is CHF 100.02; total CHF 1,399.02. [policy/vat.json#2024-01-01]”

  • Step 5 — FINALIZE
    Thought: “All criteria met; finalize.”
    Final Answer: text + citation.
    Verification: the controller recomputes the math and checks that the citation exists in the ledger. Pass → stop.

What made this safe. The model never touched the world directly. It proposed actions; the controller executed them under allow-lists and schemas, normalized results into a ledger, and enforced stop conditions.


Designing Etiquette (so the model plays nice)

You don’t need long poetry. Give the model three things in plain language:

  1. Phases. Thought (brief intent), Action (one skill/tool with valid input), Final Answer (only when done).

  2. Rules. Use only allow-listed actions; don’t invent observations; keep thoughts short; stop when goal satisfied or budgets hit.

  3. Output shape. The fields you expect each turn; nothing else.
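Rendered as the short reminder that rides along in every control frame, such an etiquette stays well under a page. The exact wording below is an assumption; keep yours terse:

```python
# Illustrative etiquette text plus a minimal frame builder; field names
# are assumptions, not a standard.
ETIQUETTE = """\
Phases: Thought (brief intent) -> one Action -> Final Answer when done.
Rules: use only the listed actions; never invent observations; keep
thoughts under 240 characters; stop when the goal is met or budgets hit.
Output: JSON with fields step, thought, and exactly one of action/final."""

def build_frame(task, step, allowed, observation=None):
    """Assemble the control frame sent to the model each turn."""
    return {"task": task, "step": step, "allowed": sorted(allowed),
            "observation": observation, "etiquette": ETIQUETTE}

frame = build_frame("compute VAT", 1, {"skill.plan"})
```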

💡 Insight: Brevity is a feature. Long “thoughts” cost tokens and leak your control. Ask for intent, not chain-of-thought.


Defining the Contract (so you can trust replies)

Protocols fail when they’re vague. Make yours mechanical:

  • Fields. Fix the names and types: step (int), thought (string), action (object with {name, kind, input}) or final (object with {text, citations[]}).

  • Allowed values. Provide the current list of actions (skills + tools) in the control frame. Anything else is a violation.

  • Stop conditions. Encode them: answered, maxSteps, noProgress, policyViolation.

  • Version. Tag every turn and frame with a version string; you’ll thank yourself when you upgrade prompts.

⚠️ Pitfall: “We’ll just parse whatever the model writes.” That way lies brittle regexes and edge cases. If you can’t parse it strictly, you don’t control it.


Building the Controller (what you enforce every turn)

Control frame contents (what you send): task, state, step number, token/time budgets, ledger summary, last observation (compact), list of allowed actions (skills/tools with short descriptions), etiquette reminders, and the output schema.

Gates (what you check):

  • Shape. Turn must match the protocol (one action or a final answer).

  • Allow-list. Action name must be in the current state’s allowed list.

  • Schemas. Tool inputs/outputs must match expectations; reject on mismatch.

  • Safety. Side-effects require explicit allowance; everything has timeouts and size caps.

  • Progress. The ledger should grow or the plan should advance; otherwise count toward no progress.

  • Verification. Before stopping, ensure numbers recompute and citations exist.

Decisions (what you own): continue with the next step, send a verifier hint back to the model, escalate to a stronger model, or stop.
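A sketch of the gate pipeline in the order above. The signature is an illustrative assumption, with `schema_ok` standing in for real schema validation and `ledger_grew` for the progress test:

```python
def run_gates(turn: dict, allowed: set, schema_ok, ledger_grew: bool):
    """Apply gates in order; return the first failing gate's name, or None."""
    if ("action" in turn) == ("final" in turn):
        return "shape"                    # exactly one of action/final
    if "action" in turn:
        if turn["action"]["name"] not in allowed:
            return "allow-list"           # unknown or forbidden action
        if not schema_ok(turn["action"]["input"]):
            return "schema"               # reject malformed tool input
        if not ledger_grew:
            return "progress"             # counts toward no-progress
    return None

fail = run_gates({"action": {"name": "shell.exec", "input": "rm -rf /"}},
                 allowed={"calculator"}, schema_ok=lambda _: True,
                 ledger_grew=True)
```

Ordering matters: shape failures are cheapest to detect and make every later gate meaningless, so they go first.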


Memory That Stays Small (and useful)

Skip giant context windows stuffed with past turns. Keep two artifacts:

  • A ledger of normalized facts (values, units, sources, step). This is your backbone for verification and citations.

  • Summaries you craft for the control frame: a two–four sentence “state of the world” the model can reliably consume.

This pattern keeps prompts short, improves determinism, and makes audits easy.


Verification: your last line of defense, not an afterthought

Verification isn’t a vibe; it’s a gate:

  • Programmatic checks. Recompute arithmetic; verify units; enforce structural constraints on answers.

  • Citation policy. If a claim requires a source, insist on at least one citation drawn from the ledger (not fresh web links the model imagines).

  • Verifier pass. For higher stakes, run a tiny rubric (“Are all numerical claims supported by ledger entries?”) and require “OK” to land.

If verification fails, the controller either adds a succinct hint and lets the model repair once, or escalates.


Safety: practical controls that matter

  • Allow-lists. Only permit tools and network domains you name. Default-deny everything else.

  • Timeouts and rate limits. Per tool, per user, per run. Kill loops before they kill your budget.

  • Schema validation. Inputs and outputs must match schemas before execution or propagation.

  • Redaction in traces. Hash IDs and emails by default; log full payloads only with a need and a policy.
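A minimal sketch of default-deny execution with a size cap and trace redaction. The `domain` field, byte limit, and email pattern are illustrative assumptions:

```python
import re

MAX_OUTPUT_BYTES = 4096  # size cap; the exact limit is a policy choice

def redact(text: str) -> str:
    """Mask emails before anything lands in a trace."""
    return re.sub(r"[\w.+-]+@[\w-]+\.[\w.]+", "<email>", text)

def safe_execute(tool, payload, allowed_domains=frozenset()):
    """Run one tool call under a default-deny domain policy and a size cap."""
    domain = payload.get("domain")
    if domain is not None and domain not in allowed_domains:
        raise PermissionError(f"domain not on allow-list: {domain}")
    out = str(tool(payload))
    if len(out.encode()) > MAX_OUTPUT_BYTES:
        out = out[:MAX_OUTPUT_BYTES] + "…[truncated]"
    return redact(out)

obs = safe_execute(lambda p: "contact admin@example.com for access",
                   {"domain": "example.com"},
                   allowed_domains={"example.com"})
```

A real runtime would add per-tool timeouts and rate limits around the same wrapper; the shape stays the same.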


Telemetry & Tracing: without it, you’re flying blind

Log every turn: step, state, selected action, tool latency, token counts, estimated cost, gate results, verifier status. Keep a compact trace per run; it becomes your debugger, audit trail, and evaluation seed.
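One trace event per turn can be as simple as a dict serialized as a JSON line; the field names below mirror the list above and are easy to ship to whatever log sink you already use:

```python
import json
import time

def trace_event(run_id, step, state, action, latency_ms, tokens, gate, verifier):
    """One per-turn trace record; field names mirror the guide's list."""
    return {"run_id": run_id, "ts": time.time(), "step": step, "state": state,
            "action": action, "latency_ms": latency_ms, "tokens": tokens,
            "gate": gate, "verifier": verifier}

trace = [trace_event("run-42", 3, "GATHER", "calculator",
                     latency_ms=12, tokens={"in": 850, "out": 40},
                     gate="pass", verifier=None)]
line = json.dumps(trace[-1])   # one JSON line per turn
```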



Evaluation & Tests: how you avoid regressions

From your traces, curate a dozen golden cases:

  • Unparsable output → controller stops with policy violation.

  • Hallucinated tool → stop.

  • Tool returns out-of-schema data → gate failure.

  • Same thought twice, no new facts → no-progress stop.

  • Max steps enforced even when the model “refuses to stop.”

  • Verification fails once → repair attempt → stop if still failing.

Run these whenever you change etiquette, protocol, tools, or models.
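Three of these golden cases, written as plain assertions against a stub controller; `run_turn` is a hypothetical stand-in for your own decision function:

```python
def run_turn(turn, allowed, no_progress, patience=2, max_steps=6):
    """Stub controller decision used only to exercise the golden cases."""
    if "action" in turn and turn["action"]["name"] not in allowed:
        return "stop:policyViolation"      # hallucinated tool
    if no_progress >= patience:
        return "stop:noProgress"           # same thought twice, no new facts
    if turn.get("step", 0) >= max_steps:
        return "stop:maxSteps"             # enforced even if the model won't stop
    return "continue"

golden = [  # (turn, allowed actions, no-progress count, expected decision)
    ({"step": 2, "action": {"name": "made_up.tool", "input": ""}},
     {"calculator"}, 0, "stop:policyViolation"),
    ({"step": 3, "action": {"name": "calculator", "input": "1+1"}},
     {"calculator"}, 2, "stop:noProgress"),
    ({"step": 6, "action": {"name": "calculator", "input": "1+1"}},
     {"calculator"}, 0, "stop:maxSteps"),
]
results = [run_turn(turn, allowed, n) for turn, allowed, n, _ in golden]
```

Keep the expected decisions in the fixture itself, so a prompt or model change that flips one shows up as a diff, not a surprise.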


Lifecycle & Operations: run it like a real system

  • Version the big three. Etiquette, protocol, tool registry.

  • Route by stakes. Cheap model until uncertainty rises; then escalate.

  • Feature-flag risky changes. Roll out to a fraction of traffic and watch the traces.

  • Replayers. Be able to replay old traces under the new controller to see what breaks.


When to Use an Agent (and When Not To)

If the sequence of work is fixed (“turn CSV into chart,” “call one endpoint and format”), an agent is overkill. Use a single tool call with a templated prompt and a verifier. Agents shine when the order of operations is uncertain and you need interleaved reasoning + tools.


Troubleshooting (symptom → cause → corrective move)

  • The model writes prose instead of structured turns. Etiquette not binding or protocol unclear → Restate the output shape plainly in the control frame; reject non-conforming output and retry once with a terse reminder.

  • It calls tools you didn’t expose. The model generalized → Include an explicit allow-list of actions in each frame; treat unknown actions as policy violations and stop.

  • It never lands. No clean goal test or step cap → Define a concrete goal condition (e.g., “answer includes total and at least one ledger citation”) and set maxSteps to 4–6; enforce no-progress.

  • Numbers look off. Free-form observations or missing unit norms → Normalize units when adding to the ledger; verify arithmetic programmatically; round deterministically.

  • It’s slow/expensive. Verbose thoughts or unnecessary tools → Cap thought length, prefer skills over tools when possible, and cache deterministic tool outputs.


Mini Lab (no code, 10 minutes)

Goal. Practice the orchestration mindset by running a ReAct task on paper like a controller would.

  1. Pick a task. “Find the author’s official homepage and list one recent article title with the publication year.”

  2. Define allowed actions. Skills: plan, draft. Tools: web.search, open.url, extract.title.

  3. Set stop conditions. Must have a title + year + URL, and a single citation from allow-listed domains. Max 5 steps.

  4. Run the turns. For each step, write a one-sentence Thought and choose one Action. As the controller, you invent Observations only from tools you “execute” mentally or with a browser. Update a tiny ledger after each observation.

  5. Verify. Check the year exists on the page you opened and matches the title; ensure the URL is on your allow-list.

  6. Land. Write a concise final answer with the citation. If anything is missing, add a hint and allow one repair step.

You just did orchestration: etiquette kept the turns tidy; protocol told you what to expect; the controller enforced budgets and stop rules; the ledger and verifier kept you honest.


Summary & Conclusion

Orchestration is not “tool calls with flair.” It’s a control system for conversations. You set Etiquette so the model behaves, a tight Protocol so you can parse its replies, and a Runtime that owns time, budgets, and stop conditions. Surround that engine with skills & tools you expose deliberately, a small ledger that captures facts, verification that gates trust, safety around the edges, telemetry that makes behavior visible, tests that prevent regressions, and ops practices that let you change things without breaking users.

Carry one sentence with you: the model proposes; the controller disposes. That separation of powers is how you get reliability without giving up flexibility.

Next steps

  • Write your etiquette in one page: phases, rules, output shape. Keep it short and strict.

  • Define your protocol fields and control frame contents; version both.

  • Start with a ReAct state machine (PLAN → GATHER → DRAFT → VERIFY → FINALIZE). Add a ledger and a simple verifier. Expand tools only when you can log and test them.
