


Self-Ask, Auto-CoT, and Decomposition Prompts

Explore structured decomposition for complex questions. Learn Self-Ask to create explicit sub-questions, Auto-CoT to generate reasoning examples, and retrieval layers to ground steps. Covers when to use modular vs. least-to-most decomposition, plus cost, caching, and reuse.

September 6, 2025
70 min read
Promptise Team
Advanced
Prompt Engineering · Reasoning · Decomposition · Retrieval-Augmented Generation (RAG) · Evaluation · Tool Use · Production

Promise: by the end of this guide, you’ll be able to make models break hard questions into solvable pieces, auto-generate diverse reasoning demonstrations (so you aren’t hand-crafting chains all day), and plug the decomposition into retrieval so answers stay fresh.


Why decomposition beats “one big leap”

Large models are great at finishing sentences; they’re less reliable at planning. When a question hides multiple hops—identify an entity, fetch a fact, compare, then decide—plain prompting asks the model to juggle all steps in one pass. Decomposition prompting changes the game: it forces the model to ask and answer intermediate sub-questions before committing to the final claim. Self-Ask formalized this pattern; Auto-CoT shows how to create diverse, high-quality reasoning examples automatically; FreshPrompt brings search into the loop so those sub-answers reflect the current world. (arXiv, aclanthology.org)


The lay of the land

Self-Ask. A prompting pattern where the model explicitly lists follow-up questions and answers them one by one before producing the final answer. It’s simple, works with plain LMs, and can be paired with tools (e.g., a web search) for the sub-steps. (arXiv)

Auto-CoT. Instead of hand-writing chains of thought, you auto-sample a diverse set of seed questions, have the model produce reasoning for each, and curate the best chains as demonstrations. Diversity prevents the prompt from overfitting to a single style and improves robustness. (arXiv, OpenReview)

Decomposition prompts (the broader idea). Beyond Self-Ask, there are modular variants (Decomposed Prompting) and step-ordering strategies (Least-to-Most) that structure work into sub-tasks or ever-harder sub-problems. We’ll contrast them so you know when to use which. (arXiv)

Freshness via retrieval. Some sub-questions demand current facts. FreshPrompt shows how to construct a prompt “brief” from search results so the model reasons over recent, relevant evidence. We’ll graft that onto Self-Ask. (aclanthology.org)


A compact demonstration

Let’s use a tiny, factual multi-hop query:

Question. Who lived longer: Alan Turing or Muhammad Ali?

Self-Ask sketch (abbreviated).

  • Sub-Q1: What is Alan Turing’s lifespan? → 1912–1954 (age 41).

  • Sub-Q2: What is Muhammad Ali’s lifespan? → 1942–2016 (age 74).

  • Decision: 74 > 41 → Muhammad Ali.

This isn’t about trivia; it’s about forcing the model to expose intermediate facts and the comparison step. That visibility is what lets you check, cache, or fix each hop.


The move: Self-Ask you can paste today

Use this when a single question likely hides two or more operations (identify → lookup → transform → compare).

What it does: instructs the model to ask/answer its own sub-questions before giving the final answer, and to separate any toolable lookups.

text

You are a careful reasoner. For the given question:
1) Decide if follow-up questions are needed. If yes, generate them.
2) Answer each follow-up briefly.
3) If a follow-up requires external knowledge, mark it with [NEEDS SEARCH] and propose a search query.
4) After sub-answers are complete, produce a short final answer with a one-line justification.

Question: {{QUESTION}}

Format:
Need follow-ups? yes|no
Follow-ups:
- Q1: ...
  A1: ...
- Q2 [NEEDS SEARCH]: ...
  Proposed query: "..."
  A2: ...
Final: {{ONE-SENTENCE ANSWER}} — {{BRIEF JUSTIFICATION}}

Why this works. The paper behind Self-Ask shows that explicitly decomposing into follow-ups narrows the “compositionality gap,” and the same scaffold makes it easy to plug a search tool into only the sub-questions that need it. (arXiv)

Production notes. Route only [NEEDS SEARCH] sub-questions to your retriever, cache their answers, and replay cached sub-answers for near-duplicate queries. Cap the maximum number of follow-ups (e.g., 6) to bound latency.
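The routing and caching described above can be sketched in a few lines of Python. This is a minimal sketch under assumptions: the `retriever` callable and the follow-up dict shape (`question`, `query`, `answer` keys) stand in for whatever your Self-Ask output parser and search stack actually produce.

```python
import hashlib
import re

SEARCH_TAG = "[NEEDS SEARCH]"
MAX_FOLLOWUPS = 6  # cap follow-ups to bound latency, as noted above

_cache: dict[str, str] = {}  # normalized query -> cached sub-answer

def cache_key(query: str) -> str:
    """Collapse whitespace and case so near-duplicate queries share a key."""
    normalized = re.sub(r"\s+", " ", query.strip().lower())
    return hashlib.sha256(normalized.encode()).hexdigest()

def route_followups(followups: list[dict], retriever) -> list[dict]:
    """Send only [NEEDS SEARCH] sub-questions to the retriever, via the cache."""
    resolved = []
    for item in followups[:MAX_FOLLOWUPS]:
        if SEARCH_TAG in item["question"]:
            key = cache_key(item["query"])
            if key not in _cache:
                _cache[key] = retriever(item["query"])
            item["answer"] = _cache[key]
        resolved.append(item)
    return resolved
```

Replaying cached sub-answers means a near-duplicate question costs zero retrieval calls after the first hit.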


Auto-CoT: stop hand-writing chains, keep the gains

Hand-crafted chains of thought can lift accuracy, but they’re laborious and brittle. Auto-CoT automates the process with four moves:

  1. Sample diverse seeds. Cluster or stratify your question pool; pick one exemplar per cluster to avoid near-duplicates.

  2. Generate chains automatically. Use zero-shot CoT (“Let’s think step by step.”) to produce a few candidate chains for each seed.

  3. Filter and de-dupe. Keep chains with correct final answers (if labels exist) or with high self-consistency across multiple samples; drop off-topic or overly vague chains.

  4. Assemble your few-shot prompt. Mix styles and difficulties to avoid overfitting.

The original work finds that diverse, automatically generated exemplars can match or beat carefully engineered demonstrations on a broad slate of reasoning tasks. You keep the uplift without the handcrafting bottleneck. (arXiv)
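Move 1 (sampling diverse seeds) can be sketched as follows. The Auto-CoT paper clusters questions over sentence embeddings; the greedy max-min selection below is a cheap lexical stand-in for illustration, not the paper's method.

```python
def jaccard(a: set, b: set) -> float:
    """Token-set similarity between two questions."""
    return len(a & b) / len(a | b) if a | b else 0.0

def select_diverse_seeds(questions: list[str], k: int) -> list[str]:
    """Greedy max-min pick: each new seed is the question least similar
    to everything chosen so far, which avoids near-duplicate exemplars."""
    tokens = [set(q.lower().split()) for q in questions]
    chosen = [0]  # start from the first question
    while len(chosen) < min(k, len(questions)):
        candidates = [i for i in range(len(questions)) if i not in chosen]
        # score each candidate by similarity to its closest chosen seed; take the lowest
        best = min(candidates,
                   key=lambda i: max(jaccard(tokens[i], tokens[j]) for j in chosen))
        chosen.append(best)
    return [questions[i] for i in chosen]
```

Swap in real embeddings (e.g., a sentence encoder plus k-means) once the pipeline works end to end.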

Copy-ready scaffold to build an Auto-CoT set

Use this pair of prompts offline to curate your demo bank.

Generator (run K times per seed):

text

Task: Solve the question with clear step-by-step reasoning and a final answer.
Question: {{Q}}
Respond as:
Reasoning:
- step 1) ...
- step 2) ...
Answer: {{FINAL}}

Judge (labels optional):

text

You are curating chain-of-thought exemplars. Given the question, a proposed reasoning chain, and (optional) gold answer:
- Is the chain on-topic and specific (not generic platitudes)?
- Are the steps locally valid?
- Does the final answer match the gold (if provided)?
Return JSON with { "keep": true|false, "why": "...", "fix_suggestion": "..." }.

Tip: Select 6–10 exemplars that vary in structure (enumerations, tables-in-text, small calculations) and difficulty. Run a quick A/B with and without each candidate; keep only those that move your evals.


Decomposition patterns compared

When you need traceable hops over open-book facts, Self-Ask is a direct fit: it exposes follow-ups and makes search insertion cheap. For task pipelines (e.g., extract → normalize → aggregate → decide), Decomposed Prompting is stronger: you design a little library of prompts, one per sub-task, and wire them together; it can even swap in symbolic code where LMs are weak. When generalization to harder problems matters (solve Grade 8 having only seen Grade 5), Least-to-Most sequences sub-problems from easy to hard so the model builds the solution surface progressively. These ideas complement Auto-CoT: you can auto-generate your exemplars for each sub-task, then compose. (arXiv)
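The "library of prompts, wired together" idea can be reduced to a tiny skeleton. This is a sketch under assumptions: the `lm` callable, the stage helpers, and the shared-state dict are illustrative choices, not the Decomposed Prompting paper's API.

```python
from typing import Callable

Stage = Callable[[dict], dict]  # each stage reads and extends a shared state dict

def lm_stage(template: str, lm: Callable[[str], str], out_key: str) -> Stage:
    """Wrap one sub-task prompt as a pipeline stage."""
    def run(state: dict) -> dict:
        return {**state, out_key: lm(template.format(**state))}
    return run

def code_stage(fn: Callable[[dict], dict]) -> Stage:
    """A symbolic step: deterministic code instead of an LM call."""
    return fn

def pipeline(stages: list[Stage], state: dict) -> dict:
    for stage in stages:
        state = stage(state)
    return state
```

The payoff is exactly the swap the paper describes: an extraction step stays an LM prompt, while the brittle arithmetic comparison becomes a `code_stage`.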


Freshness: plug retrieval into the decomposition

FreshPrompt offers a pragmatic recipe: gather recent snippets via a search engine, organize them into a structured “brief,” and place the brief before the question so the model reasons over up-to-date evidence. Marry that with Self-Ask by running retrieval only for sub-questions tagged [NEEDS SEARCH]. (aclanthology.org)

One-file pattern you can adopt

  1. Run Self-Ask; collect sub-Qs with [NEEDS SEARCH] and their proposed queries.

  2. For each, run your retriever and format snippets as cards:

text

[Card 1]
Title: {{PAGE_TITLE}}
Source: {{DOMAIN}}
Date: {{ISO_DATE}}
Snippet: {{2–3 sentences with the key fact; include entity names and numbers}}

[Card 2] ...

  3. Feed the cards back with a grounded resolution prompt:

text

You now have evidence cards. For each [NEEDS SEARCH] sub-question, quote the relevant card IDs, extract the answer, and state the confidence.

Evidence cards:
{{CARDS}}

Sub-questions:
{{LIST FROM SELF-ASK}}

Return:
- Q#: {{text}}
  Evidence: [Card #, ...]
  Extracted fact: ...
  Confidence: 1–5

  4. Collapse into the final decision using the extracted facts.

Why this helps. You reduce unnecessary calls (only searching what needs search), make provenance explicit, and can cache card sets per query template. FreshPrompt reports wins over alternative search-augmented prompting and even some commercial systems on freshness-sensitive benchmarks. (aclanthology.org)
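Rendering retriever hits into the card layout above is a one-function job. The result-dict keys (`title`, `domain`, `date`, `snippet`) are assumptions about your retriever's output, so adapt them to whatever your search API returns.

```python
def format_cards(results: list[dict]) -> str:
    """Render retriever hits in the evidence-card layout shown above."""
    blocks = []
    for i, r in enumerate(results, 1):
        blocks.append(
            f"[Card {i}]\n"
            f"Title: {r['title']}\n"
            f"Source: {r['domain']}\n"
            f"Date: {r['date']}\n"
            f"Snippet: {r['snippet']}"
        )
    return "\n\n".join(blocks)
```

Because the card string is deterministic for a given result set, it is also the natural unit to cache per query template.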


Cost, latency, and reliability trade-offs

  • Self-Ask depth vs. time. Cap the number of sub-questions and prefer breadth-first decomposition (outline the hops first, then fill) when latency is tight.

  • Auto-CoT curation. Diversity boosts robustness, but too many exemplars inflate tokens and slow responses. Aim for a compact set (≤10) and prune regularly with evals. (arXiv)

  • Decomposed pipelines. Each sub-task boundary is a chance to cache and to mix modalities (rules/code/LM). The trade-off is orchestration complexity; start with two or three modules, then grow. (arXiv)

  • Freshness loop. Retrieval dominates latency; batch queries, set timeouts, and dedupe URLs. Consider weekly refreshes of stable sub-facts and on-demand fetches for volatile ones. (aclanthology.org)


In practice: copy-ready prompts

A. Self-Ask for complex QA (no tools).

text

You will answer by first writing follow-up questions you need, answering them, and then giving the final answer.

Question: {{QUESTION}}

Need follow-ups? yes|no
Follow-ups:
- Q1: ...
  A1: ...
- Q2: ...
  A2: ...
Final: {{ANSWER}} — because {{ONE-LINE JUSTIFICATION}}.

B. Self-Ask with retrieval hooks (tools optional).

text

If a follow-up requires outside facts, mark it [NEEDS SEARCH] and provide a precise web query string. For any [NEEDS SEARCH] item, wait for "EVIDENCE CARDS:" input before giving its answer.

C. Auto-CoT builder (offline).

text

System: You create diverse, high-quality reasoning exemplars.
User: From this set of questions, select {{K}} that are maximally different in topic and structure. For each, produce a clear step-by-step solution and final answer.
Return JSON list with {question, reasoning_steps: [...], answer}.

D. Decomposed Prompting (modular).

text

# Stage 1 (Entity extraction)
Extract the entities and relations needed to answer: {{QUESTION}}.
Return JSON {entities: [...], relations: [...]}.

# Stage 2 (Evidence gathering)
Given {entities, relations}, propose up to 3 search queries per relation.

# Stage 3 (Decision)
Using the following evidence cards, answer the question with a one-sentence justification.


Troubleshooting: what goes wrong and what to try

The model invents unnecessary sub-questions. Add a budget (“no more than 5 follow-ups”) and insert a sanity check: “Remove any follow-up that doesn’t change the final answer.” In evals, penalize bloat.

Shallow decomposition misses a key hop. Seed the prompt with 1–2 meta-examples (“first identify entities, then time window, then compare”). Alternatively, run a second pass: “Propose an alternative decomposition that uses different steps.”

Auto-CoT yields bland or wrong chains. Enforce diversity at sampling time (cluster seeds) and filter aggressively. If you lack gold labels, use self-consistency (vote across multiple generations) or a verifier rubric to cull weak chains. (arXiv)

Retrieval swamps latency. Batch queries derived from multiple sub-questions, restrict domains, and de-duplicate by URL and n-gram overlap in snippets. FreshPrompt’s “card” packaging improves model focus—adopt it even if your retriever differs. (aclanthology.org)
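The URL and n-gram de-duplication suggested above might look like the sketch below. The `url`/`snippet` keys and the 0.8 overlap threshold are illustrative assumptions; tune the threshold against your own evals.

```python
from urllib.parse import urlparse

def trigrams(text: str) -> set:
    """Word trigrams of a snippet, used as a cheap near-duplicate signature."""
    toks = text.lower().split()
    return {tuple(toks[i:i + 3]) for i in range(len(toks) - 2)}

def dedupe_results(results: list[dict], threshold: float = 0.8) -> list[dict]:
    """Drop hits whose canonical URL or snippet trigrams duplicate an earlier hit."""
    seen_urls, kept, kept_grams = set(), [], []
    for r in results:
        u = urlparse(r["url"])
        canon = (u.netloc, u.path.rstrip("/"))  # ignore scheme, query, fragment
        if canon in seen_urls:
            continue
        grams = trigrams(r["snippet"])
        if any(grams and len(grams & g) / len(grams | g) > threshold
               for g in kept_grams):
            continue  # near-duplicate snippet from a different URL
        seen_urls.add(canon)
        kept.append(r)
        kept_grams.append(grams)
    return kept
```

Run this before building evidence cards so duplicate snippets never cost prompt tokens.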


Mini Lab (5–10 minutes)

Goal: feel the difference between one-shot answering and Self-Ask + retrieval hooks.

  1. Pick a multi-hop question in your domain (e.g., “Which company acquired the startup founded by {{FOUNDER}}, and what year was the acquisition?”).

  2. Run the plain prompt: “Answer in one sentence.” Save the output.

  3. Run the Self-Ask prompt with [NEEDS SEARCH] marking. For each marked sub-question, manually paste one short evidence card (title, date, 2–3 sentence snippet) from a trusted source.

  4. Compare: Did the decomposition change the answer? Are justifications clearer?

  5. Optional: Build a 6-example Auto-CoT bank from your backlog, and A/B one evaluation set with and without it.

Expected output snippets (abbreviated):

text

Need follow-ups? yes
Follow-ups:
- Q1: Who founded {{STARTUP}}?
  A1: {{FOUNDER}}
- Q2 [NEEDS SEARCH]: Which company acquired {{STARTUP}} and when?
  Proposed query: "acquired {{STARTUP}} year"
  A2: {{COMPANY}} in {{YEAR}}. [Card 1, Card 2]
Final: {{COMPANY}} — acquired {{STARTUP}} in {{YEAR}}.


Close: what you can now do

You’ve learned to force structure onto messy questions with Self-Ask, eliminate prompt bottlenecks with Auto-CoT, and keep answers current by injecting retrieval only where it matters. These techniques don’t just raise accuracy; they make the reasoning trace visible so you can cache, verify, and fix step by step. In production, start small—cap sub-questions, keep your Auto-CoT bank tight, and adopt evidence cards for anything time-sensitive.

Summary & Conclusion

Decomposition prompting turns reasoning from a monologue into a scaffolded dialogue with the task. Self-Ask exposes the hops; Auto-CoT supplies diverse, high-quality exemplars without handcrafting; FreshPrompt-style evidence keeps those hops grounded in current facts. Together, they give you accuracy, transparency, and a practical path to scale.

Next steps

  • Build a 6–10 example Auto-CoT bank for your hardest task family and run a quick eval slice. (arXiv)

  • Wrap Self-Ask in your tool stack: route only [NEEDS SEARCH] sub-questions to retrieval; cache sub-answers. (arXiv)

  • For pipelines, try a two-stage Decomposed Prompting skeleton and replace one fragile sub-task with code. (arXiv)


References (official sources)

  • Self-Ask / Compositionality Gap. Ofir Press et al., Measuring and Narrowing the Compositionality Gap in Language Models. arXiv:2210.03350; see also the project page and blog. (arXiv, ofir.io, GitHub)

  • Auto-CoT. Zhuosheng Zhang, Aston Zhang, Mu Li, Alex Smola, Automatic Chain of Thought Prompting in Large Language Models. arXiv:2210.03493; ICLR 2023 (OpenReview); code. (arXiv, OpenReview, GitHub)

  • Decomposed Prompting (modular). Tushar Khot et al., Decomposed Prompting: A Modular Approach for Solving Complex Tasks. arXiv:2210.02406. (arXiv)

  • Least-to-Most Prompting (ordered decomposition). Denny Zhou et al., Least-to-Most Prompting Enables Complex Reasoning in Large Language Models. arXiv:2205.10625; OpenReview. (arXiv, OpenReview)

  • FreshPrompt (retrieval for freshness). Tu Vu et al., FreshLLMs: Refreshing Large Language Models with Search Engine Augmentation (FreshPrompt, FreshQA, FreshEval). arXiv:2310.03214; Findings of ACL 2024. (aclanthology.org)

(All links point to arXiv, OpenReview, ACL Anthology, or the authors’ official pages.)
