Learn structured reasoning with Tree of Thoughts. Build search loops with beam width, scoring rubrics, and budget caps. See when ToT outperforms self-consistency, explore graph-style variants, and apply prompts with reliability tactics for better results.
Promise: by the end of this guide you’ll know how to implement Tree-of-Thoughts (ToT) with beam width, scoring, and budget caps; how to compare ToT vs. self-consistency for puzzles and planning; and how to sketch graph-style adaptations (Graph of Thoughts and planning-based variants) so you can choose the right search shape for your reasoning tasks.
Think of a logic puzzle as a maze. Classic Chain-of-Thought (CoT) walks forward, narrating one path. Sometimes it gets lucky; often it dead-ends. ToT sketches several partial paths (“thoughts”) at each step, scores them, prunes weak ones, and backtracks when needed — more like a chess player than a daydreamer. In the original paper, ToT transformed success on search-y tasks (e.g., the Game of 24) by exploring branches rather than committing to one guess; the authors reported a jump from single-digit accuracy with plain CoT to strong double digits with ToT’s search and backtracking. (arXiv)
Graph-style variants generalize the shape from trees to arbitrary graphs: you can merge states, revisit them with new information, or run multiple improvement operations in loops — a better fit for tasks where ideas recombine rather than just branch. The Graph of Thoughts work formalized this and showed both quality and cost benefits on certain tasks by orchestrating these graph operations. (arXiv, ojs.aaai.org)
Let’s standardize a few terms you’ll use in code and prompts:
State: a compact representation of “where we are” in solving the task (e.g., chosen moves so far, remaining constraints).
Thought: a candidate next step (one or a few sentences or an action) that advances a state.
Expansion: prompting the model to propose multiple thoughts from a state.
Scoring: estimating which thoughts look promising (LLM-as-judge, heuristics, or even external checks).
Search policy: how you traverse states (depth-first, breadth/beam, best-first).
Budget: caps that keep you honest — max nodes, max depth, token and time budgets, and early stopping rules.
Self-consistency is different: you sample many full CoT trajectories (using higher temperature for diversity), then vote on answers; it often boosts accuracy on math/commonsense because multiple paths converge on the same end. It’s cheap to orchestrate, but it doesn’t guide the middle; you can still sample lots of bad paths. (arXiv)
A ToT loop alternates generation and selection:
From each frontier state, ask the model for k candidate thoughts that obey your interface (short, structured).
Score each thought (rubric, heuristic, or quick simulation), producing a value in a consistent range.
Advance top-scoring thoughts into new states; prune the rest.
Stop when a goal test passes or the budget is exhausted.
Two parameters govern the feel:
Beam width (B): how many states you keep at each depth. Larger B increases recall but cost grows roughly linearly in B.
Scoring temperature / judge strictness: too soft → drift; too hard → prune promising oddballs.
In the ToT paper, simple BFS/beam plus LLM self-evaluation made a huge difference on tasks like 24-game and mini-crosswords because early mistakes are reversible: you can backtrack and try a different branch. (arXiv)
Below is a minimal, copy-pasteable prompt + pseudocode bundle you can adapt. It separates expansion from scoring and runs a beam search with budget caps.
Purpose: produce 3–5 short candidate moves from the current state; keep them crisp so scoring is cheap.
```
System: You are a deliberate problem-solver. Work in short, verifiable steps.

User:
TASK: {{TASK_DESCRIPTION}}
CURRENT STATE: {{STATE_JSON}}
CONSTRAINTS:
- Keep each THOUGHT ≤ {{N}} sentences.
- Each THOUGHT must be actionable and self-contained.
FORMAT: Return JSON with `thoughts: string[]` of length {{K}}. Only JSON.
```
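On the harness side, the expansion reply should be parsed defensively so one malformed completion doesn't crash the loop. A minimal sketch (the `parse_thoughts` helper and its truncation to `k` are our own convention, not part of the prompt contract above):

```python
import json

def parse_thoughts(raw_reply: str, k: int) -> list[str]:
    """Parse the expansion prompt's JSON reply and keep at most k thoughts.

    Returns an empty list on malformed JSON so the search loop can skip
    this expansion instead of crashing.
    """
    try:
        data = json.loads(raw_reply)
        thoughts = data.get("thoughts", [])
    except (json.JSONDecodeError, AttributeError):
        return []
    return [t for t in thoughts if isinstance(t, str)][:k]
```

Skipping a bad expansion costs one node of budget; raising an exception costs the whole run.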
Purpose: judge a single thought on a 0–10 rubric tied to the task.
```
System: You are a strict evaluator. Score future promise, not eloquence.

User:
TASK: {{TASK_DESCRIPTION}}
STATE: {{STATE_JSON}}
CANDIDATE THOUGHT: "{{THOUGHT}}"
RUBRIC (0–10):
- Valid: obeys constraints; no contradictions. (0–3)
- Progress: reduces search space or satisfies a subgoal. (0–4)
- Reversibility: keeps options open; avoids dead-ends. (0–3)
FORMAT: {"score": <0-10 integer>, "one_sentence_rationale": "<≤20 words>"}
Only JSON.
```
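The judge's reply deserves the same defensive parsing; a sketch that clamps to the 0–10 rubric and treats unparseable judgments as zero promise (our convention, not the paper's):

```python
import json

def parse_judge_score(raw_reply: str) -> int:
    """Parse the judge's JSON reply; clamp to the 0-10 rubric range.

    Returns 0 on malformed JSON, treating an unparseable judgment as
    'no evidence of promise' rather than crashing the search.
    """
    try:
        score = int(json.loads(raw_reply)["score"])
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        return 0
    return max(0, min(10, score))
```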
Purpose: expand, score, and keep the top B states per depth with budget caps.
```python
from collections import namedtuple
from heapq import nlargest

Node = namedtuple("Node", "state score depth path")  # path = [(state, thought, score), ...]

def tot_beam(task, init_state, beam_width=4, max_depth=6, max_nodes=200):
    frontier = [Node(init_state, 0, 0, [])]
    visited = 0
    best = None
    while frontier and visited < max_nodes:
        # Goal test
        solved = [n for n in frontier if goal_test(n.state)]
        if solved:
            best = max(solved, key=lambda n: n.score)
            break
        # Expand one depth level
        candidates = []
        for node in frontier:
            thoughts = expand_thoughts(task, node.state, k=4)   # prompt 1
            for t in thoughts:
                s = score_thought(task, node.state, t)          # prompt 2
                new_state = transition(node.state, t)
                candidates.append(Node(new_state, node.score + s,
                                       node.depth + 1,
                                       node.path + [(node.state, t, s)]))
                visited += 1
                if visited >= max_nodes:
                    break
            if visited >= max_nodes:
                break
        # Keep top B by score, cap depth
        frontier = nlargest(beam_width,
                            [n for n in candidates if n.depth <= max_depth],
                            key=lambda n: n.score)
    return best  # or top-k if not solved
```
Implementation notes. goal_test and transition are task-specific; keep states small (JSON) and deterministic. For scoring, LLM-as-judge is simplest, but if you can cheaply simulate or check (e.g., validate arithmetic, verify a constraint), combine both and trust the check over the model.
Beam width is your recall knob. Start at B=3–5. Bigger beams help in deceptive landscapes but multiply costs.
Scoring should be fast, monotone, and comparable across depths. If scores drift upward with depth, normalize (e.g., average per step) so shallow nodes aren’t unfairly favored.
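One simple normalization is a per-step average (a sketch of our own convention, not a prescribed formula):

```python
def normalized_score(cumulative_score: float, depth: int) -> float:
    """Average score per step so accumulated sums don't favor deep nodes."""
    return cumulative_score / max(depth, 1)
```

Under this scheme a depth-2 node with total 14 (average 7.0) outranks a depth-5 node with total 25 (average 5.0), even though the raw sum says otherwise.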
Budgets: cap nodes visited and maximum depth; add early stopping when the top-k beams’ scores stagnate. Keep a “best so far” even when not solved.
Use higher temperature (0.7–1.0) for expansion to generate diverse thoughts; use lower temperature (0–0.3) for scoring so the judge is consistent. If expansions collapse into near duplicates, add anti-duplication (“propose distinct strategies”) or a diversity penalty when selecting top beams.
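A cheap diversity penalty can be built from word n-gram overlap; a minimal sketch (the Jaccard measure and the penalty weight are our own choices):

```python
def ngrams(text: str, n: int = 3) -> set:
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def ngram_overlap(a: str, b: str, n: int = 3) -> float:
    """Jaccard overlap of word n-grams; 0.0 = disjoint, 1.0 = identical."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga or not gb:
        return 0.0
    return len(ga & gb) / len(ga | gb)

def diversity_penalized_score(score: float, thought: str,
                              kept: list[str], weight: float = 2.0) -> float:
    """Penalize a candidate by its worst overlap with already-kept thoughts."""
    if not kept:
        return score
    return score - weight * max(ngram_overlap(thought, k) for k in kept)
```

Apply the penalty greedily while filling the beam: select the top candidate, then re-score the rest against what you have already kept.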
Puzzles with branching decisions (24-game, crosswords, path planning, scheduling): ToT shines because it can recover from early wrong turns via backtracking and prioritize promising branches. This is exactly the regime ToT targeted and where its reported gains came from. (arXiv)
Single-trajectory algebra/commonsense where the end answer is unique and short: self-consistency often suffices. You buy diversity by sampling multiple CoT runs, then vote on the final answer; it’s simpler to implement and was shown to provide large uplifts on GSM8K/others. (arXiv)
A quick rule of thumb: if intermediate choices matter, prefer guided search (ToT); if the task is mostly straight-line derivation, try self-consistency first and escalate only if needed.
Real problems don’t always branch neatly. You might want to merge equivalent states, revisit a state with a different operation, or refine a partial solution in cycles. Graph of Thoughts (GoT) treats “thoughts” as nodes in a general graph and defines operations (generate, transform, evaluate, merge) that can be sequenced and looped. The authors report higher quality (e.g., on sorting) and lower cost than ToT when the workflow benefits from reusable substructures and feedback loops. (arXiv)
A cousin approach, Reasoning via Planning (RAP), frames the process explicitly as planning with a world model: one LLM role proposes actions; another evaluates next-state value and balances exploration vs. exploitation to grow a reasoning tree efficiently. If your task naturally exposes state transitions and rewards, RAP-like planning policies can outperform naive BFS/beam. (arXiv)
Finally, if you need stronger search, there’s a line of work that borrows AlphaZero-style tree search for LLMs (value/prior estimates + Monte-Carlo tree search). It’s heavier but can help when depth grows and a naive beam drowns. (arXiv)
Below are copy-ready snippets you can mix and match.
Use this when a straight-line solution exists; escalate to ToT if variance is high.
Instruction: Solve the task. Think briefly, then give ONLY the final answer.
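The voting side of self-consistency is a one-liner; a minimal sketch over extracted final answers:

```python
from collections import Counter

def self_consistency_vote(final_answers: list[str]) -> str:
    """Majority vote over sampled final answers (ties: first to reach the count)."""
    return Counter(final_answers).most_common(1)[0][0]
```

Normalize answers (strip whitespace, canonicalize numbers) before voting, or superficially different strings will split the count.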
Ground scoring in a visible rubric so it's auditable and you can tune weights.
```
Judge Instruction: Score 0–10 using this rubric:
1) Valid (hard fail if illegal).
2) Progress (reduces uncertainty).
3) Reversibility.
Return: {"score": int, "rationale": "<≤20 words>"}
```
Arithmetic: evaluate candidates; discard division by zero; keep ones that approach target.
Constraints: run a quick validator against the state (e.g., “no time overlap”, “budget ≥ 0”).
Deduping: hash a canonical representation of states; skip if seen.
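One way to build that canonical hash, assuming JSON-serializable states (sketch, not a fixed API):

```python
import hashlib
import json

def state_key(state: dict) -> str:
    """Hash a canonical JSON form of the state: key order doesn't matter."""
    canonical = json.dumps(state, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Keep a `seen: set[str]` of these keys in the loop and skip any candidate whose key is already present.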
```
Stop if:
- goal_test == True
- visited_nodes >= MAX_NODES
- depth >= MAX_DEPTH
- top_beam_score hasn't improved by ≥ δ in the last M levels
```
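The stagnation rule can be sketched as a window check over the per-level best scores (our own helper, with δ and M as parameters):

```python
def stagnated(best_scores: list[float], delta: float, window: int) -> bool:
    """True if the top beam score improved by less than delta over `window` levels.

    best_scores holds the best beam score observed at each depth level so far.
    """
    if len(best_scores) <= window:
        return False
    return best_scores[-1] - best_scores[-1 - window] < delta
```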
Keep telemetry: beams per depth, acceptance rate, judge variance. It’s invaluable when tuning.
Judge drift: the evaluator grows lenient or inconsistent across steps. Fix: lower its temperature, shorten inputs, and anchor the rubric with examples of high/low scores (few-shot judging is OK because it doesn’t leak task answers).
Verbosity tax: thoughts get long, scoring costs spike. Fix: enforce short formats and hard token caps in prompts; prefer bullet-like “micro-moves.”
Myopic pruning: early good-looking but wrong branches dominate. Fix: add a diversity bonus (e.g., penalize n-gram overlap) or keep a small exploration beam that admits one quirky candidate per level.
Loops and revisits: the search revisits equivalent states. Fix: canonicalize states and add a visited set; for graphs, explicitly merge nodes by key.
Unstable scores: when the same thought gets different scores on re-ask. Fix: score twice and average, or ensemble two judges (LLM + heuristic check) and trust the check.
We’ll use a constrained itinerary micro-task: “Given 5 candidate meetings, schedule 3 within 2 hours, minimizing walking time, with one fixed at 10:30.” There are many partial paths and a goal test (“3 meetings scheduled, constraints satisfied”).
State: JSON with chosen meetings, elapsed time, total walk.
Thought: propose the next meeting or a swap.
Judge: validates constraints, scores progress and reversibility.
Beam: B=4, max_depth=5, max_nodes=120.
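A goal test for this micro-task might look as follows; the state fields (`chosen`, `elapsed_minutes`) are hypothetical names for illustration, not a fixed schema:

```python
def goal_test(state: dict) -> bool:
    """Goal: exactly 3 meetings chosen, within the 2-hour window,
    and the fixed 10:30 meeting included."""
    return (len(state["chosen"]) == 3
            and state["elapsed_minutes"] <= 120
            and any(m["start"] == "10:30" for m in state["chosen"]))
```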
Run the harness above; you should see the beam try a couple of arrangements, backtrack around the fixed 10:30 slot, then settle on a schedule that meets the constraints. If self-consistency on the same task just samples full schedules without guided pruning, you’ll likely watch it burn budget on invalid lineups before converging — a textbook case where ToT’s guided expansion pays off. (This pattern mirrors why ToT worked so well on search-like tasks in the original work.) (arXiv)
Goal: implement a toy ToT for the 24-game and compare to self-consistency.
Setup. Numbers: [3, 3, 8, 8]. Operators: + - * /. Goal: exactly 24.
State = multiset of remaining numbers and current expression.
Thought expansion prompt = propose up to 3 legal operations that combine two numbers into one, returning the new number and expression; keep each thought to one operation.
Judge = rule out illegal ops (e.g., division by zero), score higher when the new number is closer to 24 or when it forms the last step to 24; otherwise penalize.
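For the 24-game, expansion doesn't even need an LLM: legal one-step operations can be enumerated deterministically. A sketch (helper name and return shape are our own):

```python
from itertools import combinations

def legal_ops(numbers: list[float]) -> list[tuple[list[float], str]]:
    """Enumerate legal one-step ops: pick two numbers, combine into one.

    Skips division by zero; subtraction and division are tried in both
    orders. Returns (new multiset, expression) pairs.
    """
    moves = []
    for i, j in combinations(range(len(numbers)), 2):
        a, b = numbers[i], numbers[j]
        rest = [numbers[k] for k in range(len(numbers)) if k not in (i, j)]
        candidates = [(a + b, f"{a}+{b}"), (a * b, f"{a}*{b}"),
                      (a - b, f"{a}-{b}"), (b - a, f"{b}-{a}")]
        if b != 0:
            candidates.append((a / b, f"{a}/{b}"))
        if a != 0:
            candidates.append((b / a, f"{b}/{a}"))
        for val, expr in candidates:
            moves.append((rest + [val], expr))
    return moves
```

Feeding these moves to the LLM judge (or a pure distance-to-24 heuristic) gives you the expansion/scoring split without generation cost.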
Expected output (shape, not exact values):
```json
{
  "chosen_solution": "8 / (3 - 8/3)",
  "value": 24,
  "top_path_scores": [28, 27, 26],
  "nodes_visited": 97,
  "depth": 3,
  "beam_width": 4,
  "notes": "Found at depth 3 after backtracking once."
}
```
(Your exact expression may differ; the important part is that ToT consistently finds a valid 24 under a modest budget, while self-consistency with, say, 20 samples sometimes misses or costs more samples to land on the answer — echoing the original 24-game results motivating ToT. If you don’t hit 24, log the best value seen and raise beam_width from 3→5.)
If you notice these smells, consider a graph rather than a tree:
Reusable sub-solutions (you keep reinventing the same partial plan).
Multiple improvement modes (e.g., “refine,” “swap,” “sort,” “summarize,” each best applied at different times).
Feedback loops (“polish → evaluate → polish again”) where iterations help.
GoT formalizes this as a Graph of Operations you orchestrate: generate nodes, transform them, evaluate, and merge duplicates — which is often cheaper than re-expanding equivalent branches many times. Reported results show noticeable quality gains and >30% cost reductions on certain tasks vs. a pure ToT loop. (arXiv)
Cache aggressively. Hash (STATE, prompt) to avoid re-scoring identical thoughts.
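A minimal memoization sketch for the scoring call (the cache key and `score_fn` signature are our own convention):

```python
_score_cache: dict = {}

def cached_score(state_json: str, thought: str, score_fn) -> float:
    """Memoize scoring on (state, thought) so re-expanded duplicates are free."""
    key = (state_json, thought)
    if key not in _score_cache:
        _score_cache[key] = score_fn(state_json, thought)
    return _score_cache[key]
```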
Guardrails. Treat the judge’s rationale as non-authoritative; when a cheap verifier exists, use it.
Logging. Keep full paths and scores; it’s your audit trail and training data for future fine-tuning.
Evaluation. Build a tiny golden set (10–50 items). Compare: self-consistency (N samples) vs. ToT (B, budget) vs. graph loop. Track solve rate, tokens, latency.
ToT turns reasoning into guided search: expand several options, score them, keep the few that matter, and course-correct when an early step goes wrong. It’s especially effective when the task’s intermediate choices change what’s possible later, and you can define a goal test and a judge that rewards progress. Self-consistency remains a strong baseline where the path is mostly straight — it’s easy to implement and wins often on math and commonsense. (arXiv)
When your problem has reusable parts, multiple improvement operations, or loops, step up to graph-structured orchestration (GoT) or planning-style approaches (RAP). They let you revisit and merge states and balance exploration vs. exploitation more explicitly — sometimes improving quality while cutting cost. (arXiv)
The craft here is not mystical; it’s search engineering. Define states crisply, keep thoughts short, judge consistently, and respect budgets. With that discipline, ToT & friends become reliable tools rather than fragile demos.
Instrument your harness. Add counters for nodes, depth, and judge variance; export runs to a CSV for post-mortems.
Tune one knob at a time. Start with B=3, depth=5, N=100 nodes; plot solve-rate vs. cost.
Prototype a GoT loop. Pick one task with natural “refine/merge” operations and orchestrate them as a small graph.
Graph of Thoughts (GoT): arXiv preprint and AAAI proceedings; official repo. (arXiv, ojs.aaai.org, GitHub)
Self-Consistency for CoT: arXiv preprint and PDF; OpenReview. (arXiv, OpenReview)
Reasoning via Planning (RAP): arXiv preprint and overview. (arXiv, Semantic Scholar)
AlphaZero-like Tree-Search for LLMs: arXiv preprint (depth-aware search guidance). (arXiv)
(Dates and claims are based on the cited sources as of September 7, 2025.)