
Skeleton-of-Thought & Section Budgets (speed + structure)

Learn Skeleton-of-Thought for long-form generation by planning an outline first, then expanding each section under word or token budgets. Improve density, latency, and readability with practical prompts, parallel fills, budget tricks, and checks for coverage and accuracy.

September 6, 2025
75 min read
Promptise Team
Advanced
Prompt Engineering · Structure & Outlining · Section Budgets · Latency Optimization · Parallelization · API Orchestration · Evaluation & QA

Promise. You’ll learn a fast, production-friendly way to make long answers cleaner and cheaper: first have the model produce a numbered outline (“skeleton”), then fill each section under explicit word/token budgets. This two-phase move often reduces latency and improves organization, and it lets you parallelize the “fill” phase in your API layer. We’ll end by wiring a tiny coverage eval so you can verify that the final answer actually hits every planned section.

Why now. “Skeleton-of-Thought” (SoT) is no longer a novelty. The core idea—plan structure, then expand in parallel—has shown notable speedups across multiple models without touching weights or serving stacks, and in many cases improves answer quality. In the original study, SoT achieved up to ~2.4× average speedups across models, with concrete examples like Claude responses dropping 22s→12s (1.83×) and Vicuna-33B 43s→16s (2.69×). Because SoT treats models as black boxes, you can adopt it with either parallel API calls or batched decoding on open-source models. (ar5iv)


Lay of the land

Skeleton-of-Thought (SoT). Instead of one long, sequential generation, you ask the model to output a short, numbered outline first. Then you expand each numbered point separately. In API settings, each point expansion is its own request—you can run them concurrently and assemble the final answer in order. (ar5iv)

Section budgets. A “section budget” is a soft cap on tokens or words you assign to each outline item. Budgets keep density under control: enough room for essential claims, not enough for tangents. Recent work on token-budget-aware reasoning shows why this matters: specifying budgets can compress reasoning cost while preserving quality—especially when the budget is tuned to task complexity. (aclanthology.org)

When SoT shines. Long, structured outputs: reports, proposals, multi-part explanations, multi-criteria comparisons, and tutorials. The original paper notes it’s less suitable for short answers or step-by-step chain-dependent reasoning (tight arithmetic proofs, code tracing), where sequential thought is the point. Use a router to turn SoT on only when it helps. (arXiv, ar5iv)


The move (mental model)

Picture a good editor at a whiteboard. First: boxes and arrows—the skeleton. Second: writers take one box each and draft in parallel. Finally: the editor assembles, trims to fit the page, and checks that every box was filled and nothing new was added. That’s SoT in miniature.

The technical translation is simple:

  1. Skeleton pass. Prompt for a numbered outline with terse labels only.

  2. Budgeted fills. For each numbered point n, send a dedicated expansion prompt that (a) repeats the whole skeleton for context, (b) asks to expand only point n, and (c) enforces a small, explicit budget.

  3. Assembly + checks. Reorder expansions by index, stitch them with headings, then run two checks: “coverage of planned sections” and “no extra claims.”

Because the heavy decoding happens in the fills, doing those concurrently cuts wall-clock latency without changing the model. That’s the key SoT insight. (ar5iv)
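To see why concurrency pays off, a back-of-envelope sketch helps. The timings below are purely illustrative assumptions, not measurements: one short planning call, then N fills that run either end-to-end or concurrently (where you only wait for the slowest fill).

```python
# Illustrative timings (assumed, not measured) for sequential vs. parallel fills.
skeleton_s = 2.0                         # one short planning call
fill_s = [6.0, 5.5, 7.0, 6.5, 5.0]       # hypothetical per-section decode times

sequential = skeleton_s + sum(fill_s)    # one long generation, roughly
parallel = skeleton_s + max(fill_s)      # N concurrent calls: wait for the slowest

print(f"sequential ~= {sequential:.1f}s, parallel ~= {parallel:.1f}s")
```

With these made-up numbers, five fills drop from ~32s of wall clock to ~9s; the win grows with the number of independent sections.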


Show, don’t tell (one compact demo)

Below are copy-ready prompts. They’re minimal on purpose; swap {{…}} with your variables.

1) Skeleton prompt (numbered outline). What it does: asks for a short, numbered plan—no content yet.

```text
You are a planning assistant. Produce a numbered skeleton (1., 2., 3., …) for {{TASK}}.

Requirements:
- 5–8 points.
- Each point ≤ 12 words; no sub-bullets; no full sentences.
- Names only (labels), not the content itself.

Return JSON: {"skeleton": ["1. …", "2. …", "..."]}
```

2) Budgeted fill prompt (per section). What it does: expands only one numbered point under a hard budget.

```text
You are drafting section {{INDEX}} of an answer to {{TASK}}.

Skeleton (for context; do not change numbering):
{{SKELETON_JSON}}

Write ONLY section {{INDEX}} as a short paragraph.
- Hard budget: ≤ {{BUDGET_TOKENS}} tokens (≈ {{BUDGET_WORDS}} words).
- No new sections, no references to other sections.
- Factual, concrete, and self-contained.

Return: "{{INDEX}}. {{SECTION_TITLE}} — {{TEXT}}"
```

3) Assembly rule (post-processing). What it does: sorts by {{INDEX}}, prefixes each with its title, joins with a blank line, then runs a quick two-question check:

  • Did we get one fill for every skeleton item?

  • Did any fill introduce claims not implied by its label?

That’s the smallest SoT you can ship and measure.
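The assembly step and the first check ("one fill for every skeleton item") are purely mechanical and need no model call. Here is a minimal sketch; the second check (claims beyond the label) is semantic and still needs a verifier prompt:

```python
def assemble(skeleton, fills):
    """Sort fills by their leading index, join in skeleton order,
    and report missing or unplanned indices."""
    by_index = {int(f.split(".", 1)[0]): f for f in fills}
    expected = range(1, len(skeleton) + 1)
    missing = [i for i in expected if i not in by_index]
    extra = sorted(i for i in by_index if i not in expected)
    draft = "\n\n".join(by_index[i] for i in expected if i in by_index)
    return draft, missing, extra
```

Feed it the skeleton list and the raw fill strings; a non-empty `missing` or `extra` means the run should be retried or flagged.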


Deepen: budgets, adherence, and parallelism details

Budgets that stick. Models don’t count tokens perfectly, but they obey examples and visible counters. Two practical nudges:

  • Ask for a word budget when possible (“≤ 80 words”). It’s easier for the model to gauge.

  • Include a tiny self-report footer in the format spec, e.g., "(~{{WORDS_USED}} words)". Discard it in assembly, but keep it to help adherence.

Budgeting is not only about cost—dense writing improves readability. A small body of work shows budget-aware prompting can compress reasoning length with limited accuracy loss when budgets are right-sized to complexity; for harder queries, learnable or heuristic budgets work better than a one-size-fits-all cap. (aclanthology.org)

Parallelize the fill phase. For API models, fire N independent calls (one per section) and wait for the slowest. For local models, batch the N prompts into a single forward with left-padded sequences; you amortize the decode cost across the batch. The original SoT paper explains both paths and why decode time—not prefill—dominates long generations. (ar5iv)
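When firing N concurrent API calls, most providers impose rate limits, so an unbounded `asyncio.gather` can backfire. A small sketch of bounded concurrency (the `limit` value is an assumption you should tune to your provider):

```python
import asyncio

async def bounded_gather(coros, limit=8):
    """Run fill calls concurrently, but cap the number in flight
    to respect provider rate limits. Results keep input order."""
    sem = asyncio.Semaphore(limit)

    async def run(coro):
        async with sem:
            return await coro

    return await asyncio.gather(*(run(c) for c in coros))
```

`asyncio.gather` preserves input order, so the fills come back already aligned with the skeleton indices.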

Router or not? SoT-R (SoT with a simple router) only triggers skeletonization for suitable questions. Your first router can be a rule: enable SoT when the requested output has ≥4 sections or ≥400 words. A learned router can look at task type, estimated length, and whether sections are semantically independent. The reference shows SoT-R improving both speed and outcome quality across categories. (ar5iv)
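The rule-based router above, plus the keyword guard from the troubleshooting section, fits in a few lines. This is a first-cut heuristic sketch, not SoT-R's learned router; the thresholds and keyword list are assumptions to tune:

```python
import re

def should_use_sot(task: str, expected_words: int, expected_sections: int) -> bool:
    """First-cut router: skeletonize only long, multi-section tasks,
    and skip SoT when the task signals stepwise reasoning."""
    # Guard: stepwise-reasoning cues mean sequential CoT/PAL fits better.
    if re.search(r"\b(prove|derive|trace|debug)\b", task, re.IGNORECASE):
        return False
    return expected_sections >= 4 or expected_words >= 400
```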

When not to use it. For tightly coupled step-by-step reasoning, math proofs, or short answers, standard CoT or PAL-style execution beats SoT. The SoT authors explicitly call out these limits. (arXiv)


In practice: copy-paste prompts and a minimal orchestrator

A. Production skeleton prompt (final cut). Use when you want strong control over outline style.

```text
System: You plan structures; you never write full content.

User: Plan a skeleton for {{TASK}} with 6–9 items.
Rules:
- Strict numbering: 1., 2., 3., …
- ≤ 10 words per item, imperative labels (e.g., "State context", not "Context").
- No sub-bullets; no expansions.
Reply as JSON: {"skeleton": ["1. ...", "2. ...", "..."]}
```

B. Production fill prompt (final cut). Use when you want crisp, budgeted paragraphs.

```text
System: You write only one section with tight focus.

User: Expand ONLY section {{INDEX}} of the plan for {{TASK}}.
Skeleton: {{SKELETON_JSON}}
Constraints:
- Title + 1 paragraph.
- ≤ {{BUDGET_WORDS}} words; end with "(~X words)" where X is your count.
- No forward references; no new sections; no claims beyond the label.
Return exactly: "{{INDEX}}. {{TITLE}} — {{PARAGRAPH}} (~X words)"
```

C. Tiny orchestrator (conceptual Python). This shows parallel fills; adapt to your stack.

```python
import asyncio, json, httpx

async def call(model, messages):
    # Wrap your provider here.
    async with httpx.AsyncClient(timeout=60) as client:
        r = await client.post(
            "https://provider.chat.completions",
            json={"model": model, "messages": messages},
        )
        return r.json()["choices"][0]["message"]["content"]

async def sot_answer(task, model="gpt-4o-mini"):
    # 1) Skeleton
    sk_prompt = [
        {"role": "system", "content": "You plan structures; you never write full content."},
        {"role": "user", "content": f'Plan a skeleton for {task} with 6–9 items...\n'
                                    'Reply as JSON {"skeleton": [...]}'},
    ]
    sk = json.loads(await call(model, sk_prompt))["skeleton"]

    # 2) Parallel fills
    async def fill(idx, item):
        fill_prompt = [
            {"role": "system", "content": "You write only one section with tight focus."},
            {"role": "user", "content": (
                f"Expand ONLY section {idx} of {task}.\n"
                f"Skeleton:\n{json.dumps(sk)}\n"
                'Constraints: ≤ 80 words; end with "(~X words)".\n'
                f'Return exactly: "{idx}. {item[3:]} — ... (~X words)"'
            )},
        ]
        return await call(model, fill_prompt)

    fills = await asyncio.gather(*(fill(i + 1, item) for i, item in enumerate(sk)))

    # 3) Assembly: sort by leading index, join with blank lines
    return "\n\n".join(sorted(fills, key=lambda s: int(s.split(".")[0])))
```

This is intentionally bare: no retries, no schema validation. In production, add schema guards (e.g., JSON Mode) and a watchdog that reissues any section that violates the budget or format.
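A minimal version of that watchdog can be purely regex-based. The format pattern below assumes the fill prompt's `"{{INDEX}}. {{TITLE}} — {{TEXT}} (~X words)"` return contract and a 25% budget tolerance; both are choices to adjust:

```python
import asyncio, re

async def fill_with_watchdog(call_fill, idx, budget_words, max_retries=2):
    """Reissue a section fill when it breaks the return format
    or exceeds the word budget by more than 25%."""
    text = ""
    for _ in range(max_retries + 1):
        text = await call_fill(idx)
        m = re.search(r"\(~(\d+) words\)\s*$", text)
        if text.startswith(f"{idx}.") and m and int(m.group(1)) <= budget_words * 1.25:
            return text
    return text  # still non-compliant after retries: flag downstream
```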


Troubleshooting (what breaks and how to fix it)

Runaway fills. Some models ignore “only section n.” Fix with a stricter format (quoted return line), and include the full skeleton plus a loud reminder that other sections will be expanded by other workers. If a model still drifts, cut temperature and lower the budget to force concision. (ar5iv)

Uneven lengths across sections. If one section hogs the budget, tell the model to "include only facts not covered by other sections" and enable a post-assembly rebalance pass that trims any section longer than 120% of the target.
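Detecting which sections need that trim pass is a one-liner; the 120% cap is the same assumption as above:

```python
def needs_trim(sections, target_words, cap=1.2):
    """Return 1-based indices of sections whose word count
    exceeds the per-section target by more than the cap."""
    return [i for i, s in enumerate(sections, 1) if len(s.split()) > target_words * cap]
```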

Cross-section contradictions. Because sections are drafted independently, they can disagree. After assembly, run a short consistency scan prompt: “List any contradictions among sections 1–N; be specific.” If contradictions are found, re-prompt the offending section(s) with the conflict highlighted and a hard budget.

Router regrets. If your router triggers SoT for tasks that needed stepwise reasoning, expect quality drops. Add a simple guard: if the task mentions “prove,” “derive,” “trace,” or “debug step-by-step,” skip SoT and use CoT/PAL. (arXiv)


A tiny coverage eval (fast, useful)

You don’t need a lab full of judges to know if SoT worked. Add one quick, model-graded check after assembly:

Coverage check prompt. What it does: verifies that every skeleton item was expanded and nothing extra was invented.

```text
You are verifying structure and scope.

Inputs:
- Skeleton (numbered): {{SKELETON_TEXT}}
- Draft (assembled): {{DRAFT_TEXT}}

Answer ONLY in JSON with keys:
- "covered": list of indices (e.g., [1,2,3,4,5]) that are fully addressed.
- "missing": list of indices that are absent or only partially addressed.
- "extraneous": true/false if the draft introduces unplanned sections.
- "notes": ≤ 40 words on the largest gap.

Be strict about scope: flag content that drifts beyond each label.
```

Store covered/missing and fail the run if anything is missing or extraneous=true. For latency, use a small, fast model for this verifier.
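The pass/fail gate on the verifier's JSON is trivial to code. This sketch assumes the verifier honors the key names from the prompt above:

```python
import json

def passes_coverage(report_json: str, n_sections: int) -> bool:
    """Gate the run on the verifier's JSON: every planned section
    covered, nothing missing, nothing extraneous."""
    report = json.loads(report_json)
    covered_ok = sorted(report.get("covered", [])) == list(range(1, n_sections + 1))
    return covered_ok and not report.get("missing") and not report.get("extraneous", False)
```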

Optional budget audit. Parse each section’s (~X words) self-report and compare to your target. If X exceeds target by ≥25%, re-issue that section with a stronger budget line (“≤ 60 words; delete nonessential qualifiers.”). Budget-aware approaches like this routinely cut tokens with minimal quality loss when tuned. (aclanthology.org)
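Parsing the self-report for that audit is a small regex, assuming the "(~X words)" footer format from the fill prompt:

```python
import re

def over_budget(section_text, target_words, slack=0.25):
    """Parse the trailing '(~X words)' self-report; True if the
    reported count exceeds the target by the slack (default 25%) or more."""
    m = re.search(r"\(~(\d+) words\)\s*$", section_text)
    if not m:
        return None  # missing self-report: treat as a format violation upstream
    return int(m.group(1)) >= target_words * (1 + slack)
```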


Mini lab (5 minutes)

Pick a topic you know well: “A stakeholder-friendly post-mortem of a failed launch.” Use:

  1. Skeleton prompt above (target 6–7 items).

  2. Fill prompt with 70 words per section.

  3. Run the coverage check.

Expected: a clean seven-part memo where each section is a crisp paragraph, no forward references, and the coverage JSON returns all indices in covered, none in missing, and extraneous=false.

If sections feel samey, lower budgets to 50–60 words and add a constraint per section (e.g., “include exactly one metric” for Impact; “list two decisions” for Next Steps). That single line often unlocks clarity.


Close

Skeleton-first generation is not just a speed hack; it’s a way to govern long outputs. By separating planning from prose and explicitly budgeting each unit of work, you make outputs easier to read, cheaper to generate, and faster to serve. The orchestration is straightforward: one planning call, N parallel fills, one assembly pass, two tiny checks.

There are edges. If a task’s reasoning is interdependent or inherently sequential, SoT can force unnatural seams. For those, stick with chain-of-thought or program-of-thought styles. But for the vast middle of long-form writing—explanations, reports, reviews—SoT plus section budgets is a sweet spot.

The most important lesson: treat structure and density as first-class control knobs. When you make them explicit in prompts and code, quality follows—and latency falls in line.

Next steps

  • Add a lightweight router: enable SoT only for multi-section, ≥400-word tasks; otherwise fall back to CoT/PAL. (ar5iv)

  • Introduce a house style: headings, sentence length targets, and one “must-include fact” per section that your verifier checks.

  • Track two curves in production: wall-clock latency vs. number of sections, and total tokens vs. per-section budget; tune the knee points.


References & further reading

  • Skeleton-of-Thought: Prompting LLMs for Efficient Parallel Generation. arXiv (ICLR’24), with method details, templates, and speedup analyses; includes the SoT-R router extension. (arXiv, ar5iv)

  • Microsoft Research (paper page + blog). Accessible summaries, diagrams, and motivation for SoT as a black-box, parallel-friendly method. (Microsoft)

  • ENLSP @ NeurIPS 2023. Workshop listing and PDF for Skeleton-of-Thought: Large Language Models Can Do Parallel Decoding. (ENLSP NeurIPS Workshop 2023, neurips.cc)

  • Token-Budget-Aware LLM Reasoning. Findings of ACL 2025; shows how specifying and learning budgets can compress reasoning while preserving performance. Useful backdrop for section-budget design. (aclanthology.org)
