Explore advanced reasoning with Chain-of-Thought. Learn self-consistency to boost accuracy by sampling and voting, and least-to-most prompting to break problems into steps. Covers when CoT helps, cost and latency tradeoffs, and how to validate gains.
Promise: by the end, you’ll know when Chain-of-Thought (CoT) helps (and when it hurts), how to set temperature/top-p and sample counts for self-consistency voting, and how to apply least-to-most decomposition to generalize beyond your examples. You’ll also stand up a tiny GSM8K-style eval to confirm the uplift, not just hope for it.
Chain-of-Thought means asking the model to show brief intermediate steps before the final answer. Early results showed large accuracy gains on multi-step reasoning tasks, especially math word problems and symbolic manipulation. The catch: CoT is not magic—it trades tokens and latency for (often) better reliability, and it’s sensitive to decoding choices. (arXiv)
Self-consistency is a decoding strategy: instead of trusting a single chain, you sample multiple distinct chains (by turning up randomness), then vote on the final answers those chains reach. On datasets like GSM8K, this simple move delivered double-digit absolute accuracy gains over greedy CoT. It works because many problems admit multiple valid ways to reason toward the same unique answer. (arXiv)
Least-to-Most is a prompting pattern that explicitly decomposes a hard problem into an ordered series of simpler sub-problems, then solves them in sequence, feeding each result into the next. It was introduced to tackle the failure mode where models can mimic easy exemplars yet stumble on harder instances that require deeper, compositional generalization. (arXiv)
Why now: these three ideas—CoT, self-consistency, least-to-most—are the foundation of today’s “think first, then answer” techniques. If you’re already comfortable with few-shot prompts, you’re ready to make the jump.
CoT usually helps when the task has latent structure: multi-step arithmetic, logic puzzles, causal chains (“if X then Y”), or compositional transformations (e.g., “reverse words, then sort by length”). The model benefits from “externalizing” intermediate state.
It can hurt when:
The task is pure recall or simple extraction—extra steps just add verbosity and room for drift.
The model starts to rationalize after it has already chosen an answer, producing confident but wrong “explanations.”
Your context window is tight; verbose chains crowd out essential evidence.
A pragmatic rule: gate CoT. Ask for steps only if the model signals uncertainty or the task category is known to be multi-step. For everything else, request a concise rationale (one to three bullets) or no rationale at all.
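A minimal sketch of such a gate, assuming you already have a task category from your own router or classifier (the `MULTISTEP` set and `build_prompt` helper here are illustrative, not a fixed API):

```python
# CoT gate: request steps only for categories known to be multi-step.
# `category` is assumed to come from your own upstream classifier.

MULTISTEP = {"math", "logic", "compositional"}

def build_prompt(question: str, category: str) -> str:
    if category in MULTISTEP:
        policy = ("If reasoning is required, provide at most 3 numbered steps. "
                  "Then print: FINAL: <answer>")
    else:
        policy = "Answer directly. Print only: FINAL: <answer>"
    return f"{policy}\n\nQ: {question}"

print(build_prompt("3 apples cost $2. Cost of 12?", "math"))
```

Everything else in the pipeline (sampling, voting) stays the same; only the prompt contract changes per category.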
The core mental model is simple: don’t trust one path—sample many, then vote. Concretely:
Prompt for short, structured CoT. Keep steps minimal, then print a single, parseable final answer.
Turn up diversity. Use a moderate-to-high temperature (0.7–1.0) or top_p (~0.9–0.95).
Sample k chains (5–20 is a good starting band).
Extract the final answers (numbers, options, or short spans).
Majority-vote the final answers; optional tie-breakers: choose the answer with the shortest chain (often a proxy for confidence) or run a single deterministic re-check pass that verifies the voted answer against the question.
Why it works: the model’s probability mass over reasoning paths can be multi-modal—there are multiple plausible chains that converge on the correct end state. Sampling explores those modes; voting marginalizes over them. On math and commonsense benchmarks, this outperformed greedy CoT by wide margins (e.g., +17.9 points on GSM8K in the original paper). (arXiv)
Cost/latency trade-off: self-consistency is roughly k× tokens and (if sequential) k× latency. You can parallelize the k samples and you can early-stop once the same final answer appears, say, 3 times—a big win on easy items.
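The early-stopping loop above can be sketched as follows, with a stand-in `sample_fn` where your model call would go (the canned iterator in the demo is just a stub):

```python
import re
from collections import Counter

def extract_final(text: str):
    """Pull the answer out of a 'FINAL: <answer>' line, or None."""
    m = re.search(r"FINAL:\s*(.+)", text)
    return m.group(1).strip() if m else None

def self_consistent_answer(sample_fn, k=10, stop_at=3):
    """Sample up to k chains; stop early once one answer has `stop_at` votes.

    `sample_fn()` is assumed to return one completion string ending in
    'FINAL: <answer>' — plug in your own model call here."""
    votes = Counter()
    for _ in range(k):
        ans = extract_final(sample_fn())
        if ans is None:
            continue  # unparseable chain: skip, don't crash the voter
        votes[ans] += 1
        if votes[ans] >= stop_at:
            break     # early stop: clear majority on an easy item
    return votes.most_common(1)[0][0] if votes else None

# Demo with canned completions standing in for the model:
canned = iter(["Steps...\nFINAL: 8", "FINAL: 8", "FINAL: 9", "FINAL: 8"])
print(self_consistent_answer(lambda: next(canned), k=10, stop_at=3))  # → 8
```

In the demo only four samples are drawn before the vote threshold is hit, even though the budget was ten — that is the early-stopping win on easy items.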
Below is an end-to-end pattern you can drop into your stack. It shows the prompt contract (clear output schema), sampling setup (diversity on, multiple candidates), and voting.
What it does: asks for a brief, bounded chain, then a parseable answer, and prepares for self-consistency extraction.
```
System (policy):
You are a careful reasoning assistant. Use brief, numbered steps only when needed.
Always end with: FINAL: <short answer>

User (task):
Q: A shop sells 3 apples for $2. How much for 12 apples?
Constraints:
- If reasoning is required, provide at most 3 numbered steps.
- Then print: "FINAL: <number>" with no extra text.
```
A single run with temperature=0 will likely return FINAL: 8. With self-consistency (e.g., temperature=0.9, n=10), you’ll collect several correct chains that agree on 8, and vote that in.
Chain-of-thought learns “patterns of steps” from your few-shot demos. When the test item is harder than the demos, CoT can imitate the style but still fail. Least-to-Most solves this by planning small sub-problems first (least), solving them, and building up to the target (most).
Two-phase prompt shape:
Decomposition plan. “List the minimal sub-problems required to solve Q, in order.”
Progressive solving. For i from 1 to N: “Solve sub-problem i. Use previous results as inputs. If any result is missing or inconsistent, revise the plan.”
This explicit curriculum pressure is what unlocked strong generalization in the original paper (e.g., near-perfect accuracy on SCAN under splits where standard CoT collapsed). (arXiv, openreview.net)
When to prefer least-to-most over plain CoT: problems that compose rules (execute A, then B, unless condition C), require intermediate artifacts (tables, maps, partial programs), or where the solution is sensitive to ordering.
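The two-phase shape can be driven by a small loop. This is a sketch, assuming an `ask(prompt)` function that returns the model's text for one deterministic call (the canned replies in the demo are stubs):

```python
def least_to_most(question, ask):
    """Two-phase least-to-most driver (sketch).

    `ask(prompt)` is assumed to return the model's reply as a string."""
    # Phase 1: decomposition plan — sub-problems only, no solutions yet.
    plan_text = ask(
        f"Q: {question}\n"
        "List the minimal ordered sub-problems as (1), (2), ... No solutions."
    )
    subproblems = [ln.strip() for ln in plan_text.splitlines() if ln.strip()]

    # Phase 2: progressive solving — feed each result into the next step.
    results = []
    for i, sub in enumerate(subproblems, 1):
        context = "\n".join(f"VAR[{j+1}]={r}" for j, r in enumerate(results))
        reply = ask(
            f"Q: {question}\nKnown:\n{context}\n"
            f"Solve sub-problem ({i}): {sub}\nReply with the result only."
        )
        results.append(reply.strip())
    return results[-1] if results else None

# Demo with canned replies standing in for the model:
replies = iter(["(1) unit price\n(2) total for 12", "2/3", "8"])
print(least_to_most("3 apples cost $2. Cost of 12?", lambda p: next(replies)))
```

The named `VAR[i]` results are what make the intermediate state parseable and checkable later.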
Self-consistency (prompt contract). One-liner: request a short chain and a parseable final answer to enable voting.
```
System:
Be concise. If steps are needed, provide them as 1–3 bullets.
Always end with: FINAL: <answer>

User:
Q: {{QUESTION}}
Format:
- (Optional) Steps: up to 3 bullets.
- FINAL: <short answer only>
```
Self-consistency (pseudocode). One-liner: sample k chains with diversity; majority-vote the final answers.
```python
# Pseudocode — adapt the client calls to your SDK.
import re
from collections import Counter

# parameters
k = 10             # number of samples
temperature = 0.9  # encourage diverse chains
top_p = 0.95

# 1) collect candidates
cands = client.responses.create(
    model=MODEL,
    input=[PROMPT],
    n=k,
    temperature=temperature,
    top_p=top_p,
)

# 2) extract parseable answers like: "FINAL: 8"
answers = []
for c in cands:
    text = c.output_text
    m = re.search(r"FINAL:\s*(.+)", text)
    if m:
        answers.append(m.group(1).strip())

# 3) majority vote
winner, votes = Counter(answers).most_common(1)[0]

# 4) optional verification pass (deterministic)
verify = client.responses.create(
    model=MODEL,
    input=[
        f"Question: {QUESTION}\nProposed answer: {winner}\n"
        "Verify in one bullet why this is correct or say 'Not sure'. "
        "Then print ONLY: FINAL: <answer>"
    ],
    temperature=0.0,
)
```
Least-to-Most (prompt contract). One-liner: first plan sub-problems, then solve them in order.
```
System:
You solve complex problems by breaking them into minimal sub-problems
and solving them in sequence. Keep each sub-problem and solution brief.

User:
Q: {{QUESTION}}
Step 1 — Plan: List the minimal ordered sub-problems as (1), (2), ...
(no solutions yet).
Step 2 — Solve: For i from 1..N:
- Solve sub-problem (i) in 1–2 sentences.
- State any intermediate result as VAR[i]=...
At the end, print: FINAL: <short answer>
```
💡 Insight: least-to-most pairs beautifully with self-consistency. You can sample multiple decompositions, then vote on their final answers, or even vote on the plans first (choose the most common plan) and then solve it deterministically.
Temperature vs. top-p. Both increase randomness. For self-consistency, you want diversity without nonsense. A practical default is temperature=0.7–1.0 with top_p=0.9–0.95. If the model starts rambling, cap chain length (“≤3 bullets”) and raise top-p while lowering temperature slightly to keep variety while avoiding off-topic detours. Gains from self-consistency depend on this diversity lever. (arXiv)
k (number of samples). Diminishing returns kick in quickly. On math-style tasks, 5–10 often captures most of the uplift; 20–30 helps on the hardest tails. Use early stopping once you hit, e.g., 3 votes for the same answer.
Chain budget. Constrain rationale size: “max 3 bullets / ≤60 tokens.” This keeps cost stable and discourages confabulation.
Stop sequences & schema. Always require a single final field (e.g., FINAL:) so your voter is robust.
All chains agree—but are wrong. You sampled variants of the same mistaken premise. Add a self-check (“state the key assumption in one short phrase before solving”) and diversity in plans (least-to-most) rather than just in token choice.
No majority emerges. Answers scatter. Either the question is ambiguous, or diversity is too high. Reduce temperature, add acceptance criteria (“output must be an integer”) and normalize (strip punctuation, unit conversions).
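Normalization before voting can be a small canonicalizer. A sketch, with rules you would extend for your own domain (currency, units, trailing punctuation here are just examples):

```python
import re

def normalize_answer(raw: str) -> str:
    """Canonicalize answers before voting so '$8.0' and '8 dollars'
    count as the same vote (illustrative rules; extend per domain)."""
    s = raw.strip().lower()
    s = re.sub(r"[,$]", "", s)                     # strip currency symbols/commas
    s = re.sub(r"\b(dollars?|usd)\b", "", s).strip()
    try:
        n = float(s)
        return str(int(n)) if n.is_integer() else str(n)
    except ValueError:
        return re.sub(r"[.!?]+$", "", s)           # drop trailing punctuation

print(normalize_answer("$8.0"), normalize_answer("8 dollars"))  # → 8 8
```

Run the voter on normalized strings, and keep the raw text around for the verification pass.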
Latent arithmetic slips. Extract the final number and re-compute deterministically in a verifier pass; or require the model to compute with named variables (VAR[1], VAR[2]) so you can parse and check.
Token bloat. Your chains are too long. Tighten the contract (“max 3 bullets,” “no restating the question”), or switch to rationale-lite: ask for a 1-line outline rather than full prose.
Goal: confirm self-consistency + least-to-most uplift on a tiny, synthetic GSM8K-style set.
Three runs per item:
deterministic (no CoT; temperature=0),
greedy CoT (one chain; temperature=0),
self-consistency CoT (temperature=0.9, n=10).
Add least-to-most for the three hardest items (by your judgment). Use the two-phase prompt. Optionally pair it with self-consistency (n=10).
Score pass/fail by exact numeric match; log total tokens.
Expected pattern: deterministic < greedy-CoT << self-consistency; the biggest deltas occur on the multi-step items. Least-to-most should rescue at least one “harder-than-demo” case.
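A scoring harness for the mini-lab can be a few lines. Sketch only: `run_fn` stands in for each of the three strategies and is assumed to return the answer string plus tokens used; the canned lambdas in the demo are toy stand-ins, not real model calls:

```python
def score(run_fn, items):
    """Score one decoding strategy on a tiny eval set.

    `run_fn(question)` is assumed to return (answer_str, tokens_used);
    `items` are (question, gold_answer) pairs."""
    correct, tokens = 0, 0
    for q, gold in items:
        ans, used = run_fn(q)
        tokens += used
        correct += (ans.strip() == str(gold))  # exact numeric match
    return correct / len(items), tokens

# Toy demo with canned runners standing in for two of the three modes:
items = [("3 apples for $2; 12 apples?", 8), ("5 + 7 * 2?", 19)]
greedy = lambda q: ("8" if "apples" in q else "24", 40)   # slips on order of ops
sc     = lambda q: ("8" if "apples" in q else "19", 400)  # voted answer
print(score(greedy, items), score(sc, items))
```

Reporting the (accuracy, tokens) pair per strategy gives you the uplift-vs-cost view directly.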
Cost accounting. If your average CoT response is ~80 tokens and you sample k=10, that’s ~800 extra completion tokens per item (plus input tokens). Keep a budget and add early-stopping.
Latency. Parallelize the k samples. If your infra can’t, use a budgeted policy: k=5 by default; escalate to k=15 only on “hard” items (e.g., when the model’s first two samples disagree).
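The escalation policy above — cheap by default, expensive only on disagreement — can be sketched like this, assuming a `sample_fn()` that returns one already-parsed answer per call (the canned iterators are stubs):

```python
from collections import Counter

def budgeted_vote(sample_fn, max_k=15):
    """Budgeted self-consistency: draw two probe samples; escalate to a
    full vote of `max_k` samples only when the probes disagree (sketch)."""
    answers = [sample_fn(), sample_fn()]
    if answers[0] == answers[1]:
        return answers[0], 2                          # cheap path: consensus
    answers += [sample_fn() for _ in range(max_k - 2)]  # hard item: escalate
    winner, _ = Counter(answers).most_common(1)[0]
    return winner, max_k

# Easy item: two samples agree, no escalation.
easy = iter(["8", "8"])
print(budgeted_vote(lambda: next(easy)))  # → ('8', 2)
```

On an easy item you pay for 2 samples instead of 15; the full budget is spent only where the probes signal a hard question.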
Guardrails. Don’t leak chains to end-users unless they truly need them. In UX, show answers by default and gate rationales behind “show steps.” For internal logging, store both final answers and concise rationales for audits.
Evaluation discipline. Always confirm uplift with a small, stratified set of your own problems (easy/medium/hard). Report accuracy vs. tokens. Treat CoT as a tool, not a doctrine.
CoT is a lever; self-consistency is the gearbox that turns that lever into reliable motion. When diversity is tuned and outputs are easy to vote on, you get most of the gains with minimal ceremony. When the problem is tougher than your demos, least-to-most gives the model a roadmap before it takes a single step.
In practice, this trio—short, structured CoT; self-consistency sampling; and least-to-most decomposition—can lift accuracy dramatically on reasoning tasks while keeping your system understandable and auditable.
We framed CoT as controlled externalization of intermediate state, then showed how self-consistency marginalizes over multiple chains to stabilize answers and why least-to-most can generalize beyond seen exemplars by enforcing an explicit curriculum. You saw concrete prompt contracts, sampling settings that matter, and a compact mini-lab to verify gains on a toy GSM8K-style set. Most importantly, you now know when not to use CoT and how to keep costs and latency sane.
Wire the self-consistency voter into your stack with early-stopping and a verification pass.
Convert one brittle flow to least-to-most, especially if it composes multiple rules.
Build a 12–50 item golden set and track accuracy vs. tokens each time you tweak sampling or prompts.
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models — arXiv; OpenReview/PDF. (arXiv, openreview.net)
Self-Consistency Improves Chain of Thought Reasoning in Language Models — arXiv; OpenReview/PDF. (arXiv, openreview.net)
Least-to-Most Prompting Enables Complex Reasoning in Large Language Models — arXiv; OpenReview/PDF. (arXiv, openreview.net)
GSM8K: Grade School Math Word Problems — arXiv; GitHub dataset; HuggingFace dataset card. (arXiv, GitHub, Hugging Face)
(All links above point to the official papers or maintainers’ pages.)