
Prompt Compression & Lost-in-the-Middle Mitigation

Learn how to make long prompts shorter and smarter. Understand why models forget the middle of the context, how to front-load constraints and tail-load examples, and apply a shrink-to-fit checklist inspired by LLMLingua and lost-in-the-middle research.

September 6, 2025
50 min read
Promptise Team
Advanced
Prompt Engineering · Long-Context Optimization · Compression Techniques · Reliability & Evaluation

If you’ve ever watched a model ignore the one crucial detail buried somewhere in a long prompt, you’ve met position bias. Models tend to remember the beginning and the end more than the middle. In this guide you’ll learn two things you can use today without extra tools: (1) how to compress prompts while keeping the parts that matter, and (2) how to order what remains so the model actually uses it. By the end you’ll have a reusable “front-load constraints, tail-load examples” skeleton and a shrink-to-fit checklist you can apply to any task.

Why now: empirical work shows recall drops when relevant info sits mid-context; performance rebounds when it appears near the head or tail. In parallel, prompt-compression methods like LLMLingua and its variants show you can slash tokens while holding quality steady—if you preserve semantics and structure. (arXiv, aclanthology.org)


Lay of the Land (plain definitions)

  • Position bias / “lost in the middle.” For long contexts, models often weigh information at the start and end more than the middle. This isn’t just folklore; it shows up in retrieval and QA tests across models. (arXiv, aclanthology.org)

  • Prompt compression. Any technique that reduces tokens while keeping task-critical meaning. In research, LLMLingua compresses with a coarse-to-fine selector and budget controller; LLMLingua-2 distills a small model to select tokens that preserve downstream utility. We’ll echo the principles of these methods with prompt-only moves. (arXiv)

  • Signal vs. scaffolding. Signal is the minimum information needed to do the task: the ask, constraints, key facts, and 1–2 instructive examples. Scaffolding is everything else: politeness, repeated definitions, redundant prose, and sprawling demonstrations.

💡 Insight: Most “long prompt pain” is not window size; it’s signal density and placement.


The Move (core mental model)

Think in two passes:

  • Compress to preserve meaning. Boil the prompt down to the ask, constraints, and indispensable facts—prefer extractive edits (keep spans) over abstractive paraphrases when accuracy matters.

  • Re-order for attention. Park the task and constraints up front; put the shortest, most relevant examples at the end; keep the middle as thin as possible (indexes, glossaries, or links/aliases).

Research backs both halves: models use head/tail best; compression that preserves salient spans can maintain accuracy at 2–5× smaller budgets. (arXiv)
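As a minimal sketch, the second pass (re-ordering for attention) can be expressed as a prompt assembler that bookends the critical parts. The function below is a hypothetical illustration following this guide's skeleton, not any library's API:

```python
def build_bookended_prompt(ask, constraints, facts, examples, reference_index=None):
    """Assemble a prompt with the ask/constraints/facts at the head,
    tiny examples at the tail, and only a thin index in the middle."""
    parts = [
        "TASK (Do this exactly): " + ask,
        "CONSTRAINTS (Binding):\n" + "\n".join(f"- {c}" for c in constraints),
        "FACTS (Atomic, extractive):\n"
        + "\n".join(f"F{i}: {f}" for i, f in enumerate(facts, 1)),
    ]
    if reference_index:  # thin middle: short aliases only, no long prose
        parts.append("REFERENCE INDEX:\n" + "\n".join(f"- {r}" for r in reference_index))
    parts.append("EXAMPLES (Tail, tiny):\nGood: {good}\nBad: {bad}".format(**examples))
    return "\n\n".join(parts)
```

Swapping the order of `parts` is all it takes to experiment with placement while holding content constant.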


Show, Don’t Tell (compact demo)

Initial (bloated) prompt, ~260 tokens. You’re a policy analyst… [polite preamble], then three paragraphs on reporting standards; the question comes in the third paragraph; examples appear mid-prompt; constraints (“bullet points, cite sources, <200 words”) appear near the end.

After (compressed & re-ordered), ~120 tokens.

TASK (Head): Summarize the proposed policy’s impact on freelance reporters. Output ≤180 words, bullet points, cite 2 sources by title.

CONSTRAINTS (Head):

  • No new claims; quote if exact figures.

  • If unsure, say “Not confirmed.”

  • Audience: city council; neutral tone.

FACTS (Head): F1: Draft requires permits for “commercial recording” in council buildings. F2: Freelancers may apply individually; fee waiver possible. F3: Penalty for non-compliance: fine up to $500.

EXAMPLES (Tail):

  • Good: “Impact: likely admin burden; fee waivers mitigate cost.”

  • Bad: “This is clearly unconstitutional.” (opinionated; no cite)

The ask/constraints/facts sit at the head; tiny examples anchor the tail; nothing critical lives in the middle.


Deepen: What to compress (and what not to)

Keep (high-value signal).

  • The ask (what to produce, for whom, and in what format).

  • Constraints that affect correctness (budgets, refusal rules, citation style).

  • Atomic facts with numbers, names, or quotes you’ll grade the model against.

  • One positive and one negative micro-example if the task is fuzzy.

Trim (low-value scaffolding).

  • Politeness, role-play flourishes, repeated definitions.

  • Verbose chain-of-thought (keep the instruction, not the sample reasoning, unless your evaluation truly needs it).

  • Redundant examples that don’t add new failure modes.

Prefer extractive compression when stakes are high. Research-style methods achieve speedups by selecting truly informative spans rather than paraphrasing everything. Mimic this with “keep-only” edits before you paraphrase. (arXiv)
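A crude way to approximate "keep-only" edits mechanically is to retain sentences that carry numbers, quotes, or capitalized entities. This heuristic sketch is an assumption-laden proxy for the selection methods above, not a reimplementation of them:

```python
import re

def keep_high_signal_sentences(text):
    """Extractive filter: keep sentences likely to contain gradable facts.

    Heuristic only: keeps sentences with digits, quoted spans, or a
    capitalized word past the opener (rough proxy for named entities).
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    kept = []
    for s in sentences:
        has_number = bool(re.search(r"\d", s))
        has_quote = '"' in s
        has_entity = bool(re.search(r"\b[A-Z][A-Za-z]*", s[1:]))
        if has_number or has_quote or has_entity:
            kept.append(s)
    return kept
```

Anything the filter drops should be reviewed by eye before you trust it; the point is to make the keep/trim decision explicit, not automatic.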


The Skeleton You Can Reuse

Front-load constraints, tail-load examples. Paste this, then swap the brackets.

Purpose: A two-phase scaffold that preserves meaning and places high-value tokens at head and tail. How to use: Fill {{ASK}}, {{CONSTRAINTS}}, and {{FACTS}}; keep 1–2 micro-examples at the end.

```text
You are helping with {{TASK_DOMAIN}}.

TASK (Do this exactly):
{{ASK}}               # what to produce, audience, success criteria

CONSTRAINTS (Binding):
- {{FORMAT}}          # schema, word/section budgets
- {{VERIFICATION}}    # "No new facts; cite or mark Not confirmed"
- {{STYLE/TONE}}      # only if correctness depends on it
- {{REFUSAL_RULES}}   # e.g., if sources stale, ask to refresh

FACTS (Atomic, extractive):
F1: {{fact with numbers/names/quotes}}
F2: {{...}}
F3: {{...}}

REFERENCE INDEX (thin middle, optional):
- Abbrev: {{short aliases for repeated entities}}
- Sections: [S1]=..., [S2]=...   # if you must keep long context

EXAMPLES (Tail, tiny):
Good: {{1–2 lines showing correct style}}
Bad: {{1–2 lines showing a common mistake}}
```

Why this layout? Because relevance peaks near the head and tail. You’re “bookending” the critical bits and starving the middle. (arXiv)


The Shrink-to-Fit Checklist

When a prompt is over budget, work top-down. Each step typically saves 10–40%:

  • Name the output. Replace throat-clearing with a crisp “TASK: …” header.

  • Compress constraints into bullets. Convert prose like “please ensure that you…” to - Must: ….

  • Make facts atomic. Turn paragraphs into F1/F2/F3 lines. Numbers and named entities survive; adjectives don’t need to.

  • Alias repetition. Give long names short tags ([ACME-Board]) and reuse the tags.

  • Cull examples. Keep exactly one Good and one Bad. Move both to the tail.

  • Strip role-play. Unless it impacts correctness, delete “You are a world-class…” lines.

  • Use skeletal syntax. Prefer Key: Value over sentences.

  • Enforce a hard budget. If you must, ask the model to compress background extractively to N tokens before the main run (see “Two-Pass Compression”).

  • Bookend criticals. Duplicate one indispensable fact at both head and tail (cheap insurance against the middle dip).

  • Re-ask if unsure. Add a policy: “If a required fact is missing or ambiguous, ask a one-line clarifying question.”

LLMLingua-style results suggest that sticking to extractive spans and a budget controller preserves accuracy even under aggressive cuts; our checklist operationalizes those ideas for manual prompting. (arXiv)
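Two checklist steps (aliasing repetition and enforcing a hard budget) can be mechanized. In this sketch the ~4 characters-per-token ratio is a rough rule-of-thumb assumption, not a real tokenizer:

```python
def alias_repetition(text, aliases):
    """Replace long repeated names with short bracketed tags.

    `aliases` maps long form -> tag, e.g.
    {"ACME Board of Directors": "[ACME-Board]"}.
    Prepends one legend line so the model can expand the tags.
    """
    legend = []
    for long_form, tag in aliases.items():
        if text.count(long_form) > 1:  # only worth aliasing when repeated
            text = text.replace(long_form, tag)
            legend.append(f"{tag} = {long_form}")
    if legend:
        text = "Abbrev: " + "; ".join(legend) + "\n" + text
    return text

def over_budget(text, token_budget, chars_per_token=4):
    """Crude budget check: ~4 chars per token (assumption, not a tokenizer)."""
    return len(text) / chars_per_token > token_budget
```

For real budgets, swap `over_budget` for your model's actual tokenizer; the checklist logic stays the same.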


Two-Pass Compression (no external tools)

When context is huge (docs, transcripts), run a prompt-only two-pass routine:

Pass A — Extractive pre-compression prompt. What it does: selects only the spans you’ll grade against.

Use this to squeeze provided context to a fixed token target before the main task.

```text
You will reduce CONTEXT to its task-critical spans.
GOAL: Keep only facts needed to answer {{QUESTION}} for {{AUDIENCE}}.
RULES:
- Extract spans verbatim; do not paraphrase numbers or names.
- Mark each kept span with an ID [K1], [K2], ...
- Target ≤ {{TOKEN_BUDGET}} tokens total.
OUTPUT:
- "Kept": list of [K*] spans (verbatim).
- "Discarded themes": 1–3 short phrases describing what you dropped.
```

Pass B — Main task with bookended layout. Feed the “Kept” spans into FACTS, put the ask/constraints at the head, and your tiny examples at the tail.

Why this works: you mimic LLMLingua’s coarse-to-fine selection and LongLLMLingua’s emphasis on key-information density, without any extra model. (arXiv)
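The two passes chain together with any chat API. In this sketch, `call_llm` is a stand-in you would replace with your client of choice, and the templates condense the Pass A/Pass B prompts above:

```python
PASS_A = """You will reduce CONTEXT to its task-critical spans.
GOAL: Keep only facts needed to answer {question} for {audience}.
RULES:
- Extract spans verbatim; do not paraphrase numbers or names.
- Mark each kept span with an ID [K1], [K2], ...
- Target <= {budget} tokens total.
CONTEXT:
{context}"""

PASS_B = """TASK: Answer {question} for {audience}.
CONSTRAINTS:
- Use only the FACTS below; cite [K#] for each claim.
FACTS:
{kept}
EXAMPLES:
Good: "X because [K1]." """

def two_pass_answer(call_llm, question, audience, context, budget=200):
    """Pass A extracts verbatim spans; Pass B answers from them, bookended."""
    kept = call_llm(PASS_A.format(question=question, audience=audience,
                                  budget=budget, context=context))
    return call_llm(PASS_B.format(question=question, audience=audience, kept=kept))
```

Only the "Kept" spans cross from Pass A to Pass B, so the main call never sees the bloated middle.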


Variations, Boundaries, and When Not to Use

  • Legal/medical/financial outputs: favor extractive compression; avoid paraphrases that could distort meaning.

  • Creative writing: keep constraints short; spend budget on style exemplars at the tail.

  • Reasoning-heavy tasks (math, code): bookend the spec and tests; examples can be zero or one.

  • Very long contexts (>50k tokens): ordering still matters, but model-level fixes (e.g., long-context training or architectural tweaks) dominate; prompting can’t fully erase the middle dip. Emerging methods try to modify attention paths but are out of scope here. (arXiv)

⚠️ Pitfall: Over-compressing with abstractive paraphrase can invent facts. Corrective move: run an extractive compression first; then, if you must paraphrase for style, do it after the output is generated, not before.


In Practice (copy-paste prompts)

A. One-shot “diet” for any prompt Use this on your own prompt to slim it before sending it to the main model.

```text
You will compress my prompt while preserving task-critical meaning.
STEPS:
1) Identify TASK, CONSTRAINTS, FACTS, and EXAMPLES in my text.
2) Rewrite into the skeleton below. Keep facts extractive; delete fluff.
3) Ensure the result ≤ {{N}} tokens.
SKELETON:
TASK:
CONSTRAINTS:
FACTS:
EXAMPLES (1 good, 1 bad):
TEXT:
{{PASTE_ORIGINAL_PROMPT_HERE}}
```

B. Bookended QA for retrieval-like tasks Use this when you already have snippets or notes.

```text
TASK: Answer {{QUESTION}} for {{AUDIENCE}} in {{FORMAT/BUDGET}}.
CONSTRAINTS:
- Use only the facts below; no new claims.
- Cite [F#] for each claim. If unsure, write "Not confirmed."
FACTS:
F1: {{...}}
F2: {{...}}
F3: {{...}}
EXAMPLES:
Good: "X because [F1][F3]."
Bad: "X because everyone knows..."   # no citation
```

C. Self-check for placement Use this to validate your ordering before you run expensive calls.

```text
Review the prompt below.
List the 5 most important facts as you see them, in order of where they appear (head/middle/tail).
If any important fact is in the middle, propose a re-ordering to move it to head or tail without increasing tokens.
PROMPT:
{{YOUR_PROMPT}}
```


Troubleshooting (failure modes & fixes)

  • Model misses a number you put in FACTS. Duplicate that single fact at the tail (bookend). Add “Cite [F#] next to each figure.” The redundancy is cheap; the recall bump is real. (arXiv)

  • Output is correct but verbose. Your constraints aren’t binding. Add a hard word/token budget and a schema (e.g., JSON or bullet count).

  • It hallucinates missing context. Your refusal rules are weak. Add: “If a required fact is missing or ambiguous, ask a 1-line clarifying question; otherwise mark ‘Not confirmed.’”

  • Examples hijack the style. Move examples to the tail and minimize them; ensure the Good example matches your target tone precisely.

  • Compression wrecks accuracy. Switch to extractive first; if still fragile, increase the budget slightly or split the task (skeleton-of-thought: outline first, then fill).
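One fix from the list above, checking that every cited [F#] actually exists, is easy to automate after the model answers. A hypothetical helper:

```python
import re

def unknown_fact_citations(answer, num_facts):
    """Return fact IDs cited in `answer` that don't exist in F1..F{num_facts}."""
    cited = {int(m) for m in re.findall(r"\[F(\d+)\]", answer)}
    valid = set(range(1, num_facts + 1))
    return sorted(cited - valid)
```

A nonempty result means the model invented a citation, which usually signals a missing or ambiguous fact in the head.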


Mini Lab (5–7 minutes)

Scenario. You must answer: “What’s the warranty for the ACME Pro Model X?” You’re handed a 300-token policy blob. Here’s a synthetic excerpt (shortened):

The ACME Pro Model X is covered by a Limited Warranty. The purchaser has 24 months of coverage from the date of retail purchase. This covers defects in materials and workmanship. Batteries are covered for 12 months. Cosmetic damage and misuse are excluded. For repairs, customers must present a receipt and a device with an intact serial number. Authorized service centers only. Shipping to the center is at the customer’s cost; return shipping is covered by ACME. Refunds are not offered under this policy; replacements or repairs only.

Your task.

  • Use the Shrink-to-Fit Checklist to turn the blob into FACTS (≤60 tokens).

  • Write a bookended QA prompt with your TASK, CONSTRAINTS, FACTS, and a 1-line Good example at the tail.

  • (Optional) Ask the model to answer; verify each claim cites [F#].

Expected compression (one possible solution).

```text
FACTS:
F1: Warranty: 24 months from retail purchase; defects in materials/workmanship.
F2: Battery: 12 months.
F3: Exclusions: cosmetic damage, misuse.
F4: Service: receipt + intact serial; authorized centers only.
F5: Shipping: customer pays to center; ACME pays return.
F6: Remedy: repair/replacement only; no refunds.
```

Notice it’s extractive (no paraphrased numbers), atomic, and short enough to sit at the head and—if needed—duplicate the single most important line at the tail.
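Duplicating that single most important line at the tail (the bookend move) is one line of assembly; a hypothetical helper:

```python
def bookend_critical_fact(prompt, fact_line):
    """Append a verbatim copy of the most important fact after the examples,
    so it sits at both head and tail (cheap insurance against the middle dip)."""
    if fact_line not in prompt:
        raise ValueError("fact must already appear in the prompt head")
    return prompt + "\n\nREMINDER (verbatim): " + fact_line
```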


Why this works (and what the literature says)

  • Head/tail placement beats middle. The “lost-in-the-middle” effect shows higher accuracy when relevant info lives near the beginning or end; placing key facts there (or bookending them) aligns your prompt with model attention. (arXiv, aclanthology.org)

  • Compression that preserves semantic core holds up. LLMLingua and LLMLingua-2 demonstrate that careful selection—often extractive, budget-aware, and guided by a learned proxy—can give 2–5× (sometimes 20×) reductions with small quality loss. Our manual two-pass mirrors their coarse-to-fine idea. (arXiv)

  • Long-context improvements still respect placement. Even with architectures and plug-ins aimed at mitigating position bias, careful ordering remains a practical, model-agnostic win. (arXiv)


Summary & Conclusion

Compression without drift, then placement without mercy—that’s the recipe. Start by extracting the minimum set of facts and constraints that determine correctness. Recast them into a compact skeleton with the ask and constraints at the head and a micro-example at the tail. Keep the middle thin: indexes, aliases, or nothing at all. If recall still slips, bookend a single critical fact.

This approach doesn’t fight the model; it rides its tendencies. You’ll spend fewer tokens, ship faster prompts, and most importantly, avoid the quiet failures where the right detail sat in the wrong place.

If you remember one thing: front-load what must be obeyed, tail-load what must be imitated, and make everything else earn its tokens.

Next steps

  • Take a long prompt you already use; run the Shrink-to-Fit Checklist and measure tokens saved and error rate on five cases.

  • Introduce the Two-Pass Compression for any task involving pasted documents; log how often the model cites [F#] correctly.

  • For a risky workflow, try duplicating one critical fact at tail; compare miss-rates before/after on ten trials.


References

  • Liu, N.F. et al. Lost in the Middle: How Language Models Use Long Contexts. arXiv / TACL (2023–2024). (arXiv, aclanthology.org)

  • Jiang, H. et al. LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models. arXiv / EMNLP 2023. (arXiv, llmlingua.com)

  • Pan, Z. et al. LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression. arXiv (2024). (arXiv, llmlingua.com)

  • Jiang, H. et al. LongLLMLingua: Accelerating and Enhancing LLMs in Long Context Scenarios via Prompt Compression. arXiv (2023). (arXiv)

  • Zhang, Z. et al. How Language Models Use Long Contexts Better via Plug-and-Play Multi-Scale Selective Attention (“Found in the Middle”). arXiv (2024). (arXiv)

