Move beyond prompt engineering to constrain model behavior at inference time using token-level hardening, temperature tuning, and confidence thresholds.
You've built a careful prompt. You've tested it. You've added guardrails and clarifications. And yet, the model still generates something you didn't expect—something unsafe, off-policy, or just wrong.
The hard truth: Prompt engineering alone cannot guarantee model safety at scale. Prompts guide intention; they don't lock behavior. When safety matters, you need to constrain the model at inference time, before outputs leave your system.
This guide explores four hardening techniques that work below the prompt layer, where they shape what tokens the model can generate and how confident it must be before it generates them. These aren't replacements for good prompting—they're multipliers. Used together, they let you trade capability for safety in precise, measurable ways.
Here's the friction point: a language model is a statistical engine. It predicts the next token based on learned patterns. A well-crafted prompt steers those patterns, but it doesn't prevent generation. Even with explicit instructions ("Never say X"), the model assigns probability to X—sometimes a high probability.
When a prompt fails, it usually fails in one of three ways:
Misunderstanding context: In an unusual context, the model misinterprets your constraint.
Following learned patterns over instructions: The model was trained on data where a certain response dominates, and your prompt can't quite override it.
Generating unanticipated outputs: The model generates something statistically plausible that you never anticipated at all.
Hardening techniques live in the execution layer. They run after the model computes token probabilities but before those tokens are returned to you. They can reject outputs, constrain choices, or require higher confidence. The model still generates the same way—nothing about its internals changes—but what actually reaches your application is filtered.
Hardening fits between the model and your application: the model does its job normally, hardening intercepts at the probability stage before decoding completes, and you control what actually gets returned.
Constrained decoding (force format compliance): allowlists and blocklists, grammar-based generation.
Temperature tuning (control variability): low temperature is deterministic, high temperature is exploratory.
Token probability thresholds (reject low-confidence outputs): confidence-based filtering, escalation to human review.
Token filtering (block specific tokens): prevent harmful phrases, defend against jailbreaks.
Each technique addresses a different vector of control. They are independent but often used together.
You specify which tokens (or sequences of tokens) the model is permitted or forbidden to generate. The decoder then removes disallowed tokens from consideration at each step, forcing the model to pick only from the allowed set.
Think of it as adding a gatekeeper at the token level. At each generation step, before the model picks its next token, the gatekeeper says: "You can only choose from this list" or "You can never choose from that list."
Schema enforcement:
The output must be valid JSON, or it must match a specific format.
Harmful phrase prevention:
You want to prevent generation of a known harmful phrase or category of tokens.
Restricted vocabulary:
Outputs must come from a finite set—a classifier that outputs one of ten labels, or a calculator that only outputs digits and operators.
The model generates a probability distribution over all tokens. Constrained decoding modifies that distribution by zeroing out probabilities for disallowed tokens. If you're blocking the token "exploit," the model can't choose it—the probability becomes zero. If you're forcing JSON, the model will only generate characters that keep the output valid.
Some implementations (like those in Outlines or Guidance) go further: they use a grammar or finite-state machine to enforce structure. For example, you can define a regex or JSON schema, and the decoder ensures every generated token respects the grammar. This is powerful and precise, but it's also computationally more expensive than simple allowlisting.
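As a minimal sketch of the masking mechanism (a toy vocabulary and hand-picked logits, not a real model or library API), disallowed tokens can be removed by sending their logits to negative infinity before the softmax, which zeroes their probability exactly:

```python
import math

def softmax(logits):
    # Convert raw scores to probabilities (numerically stable).
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def constrained_distribution(logits, vocab, allowed):
    # Disallowed tokens get a logit of -inf, so exp(-inf) = 0:
    # their probability is exactly zero after normalization.
    masked = [l if tok in allowed else float("-inf")
              for l, tok in zip(logits, vocab)]
    return softmax(masked)

vocab = ["SAFE", "UNSAFE", "NEEDS_REVIEW", "Safe", "sure"]
logits = [2.0, 0.5, 0.1, 1.8, 0.3]          # toy scores from the model
allowed = {"SAFE", "UNSAFE", "NEEDS_REVIEW"}

probs = constrained_distribution(logits, vocab, allowed)
# "Safe" (wrong capitalization) can never be chosen, no matter its score.
assert probs[vocab.index("Safe")] == 0.0
```

Note that "Safe" had the second-highest raw score; the mask removes it entirely rather than merely making it unlikely, which is the difference between hardening and prompting.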
The flow is the same at every step: each candidate token is vetted against the constraints before it's added to the output.
Latency: Checking tokens against allow/blocklists adds a small overhead per token. Grammar-based approaches are slower—they have to validate state at each step. For a 100-token output, the difference might be 50–200ms on a GPU, depending on implementation. This matters if you're serving under strict latency budgets.
Capability loss: If your allowlist is too narrow, the model can't express nuance. A chatbot constrained to output only from 50 approved phrases will be wooden. But that's sometimes the point—you want wooden safety.
Debugging friction: When the model is constrained and it produces an awkward or incomplete response, it's often because the allowlist itself is wrong. You're now debugging two things: the prompt and the constraints.
Suppose you're building a content moderation flag that must output one of: SAFE, UNSAFE, NEEDS_REVIEW. Your prompt asks the model to classify text, but you want to guarantee the output is one of those three strings—never a typo, never an explanation.
# Input to the model:
You are a content classifier. Classify the following text.
Text: "I love this product."
Output one of: SAFE, UNSAFE, NEEDS_REVIEW. Only output the label, nothing else.

# Output (most of the time):
SAFE ✓

# Output (sometimes):
Safe (capitalization differs) ✗

# Output (occasional edge case):
SAFE - this is clearly appropriate ✗
# Allowed tokens: only {SAFE, UNSAFE, NEEDS_REVIEW, \n, space}

# Output (always):
SAFE ✓
Key insight: 💡 The model still "thinks" the same way internally, but the decoder intercepts and enforces the boundary. Constraint is applied at execution, not at training.
You adjust how "creative" or "conservative" the model's choices are. Lower temperature = the model picks the most likely token more often. Higher temperature = the model explores alternatives more freely.
Temperature is a parameter that scales the logits (the raw scores the model assigns to each token) before they're converted to probabilities. If you think of the model as an actor reading a script, temperature is how much the actor improvises. Low temperature: stick to the script. High temperature: add flair, vary the delivery.
Factual/deterministic tasks: You want the most probable answer, not a creative riff on it (classification, fact retrieval).
Multiple variations: You're generating multiple variations and want them to differ (creative brainstorming, exploration).
Safety-critical domains: You need consistency and predictability.
The model computes logits—unnormalized scores—for all tokens. These are divided by temperature, then converted to probabilities (via softmax).
Temperature = 1.0 (default)
→ Probabilities as the model learned them
Temperature = 0.5 (conservative)
→ Dividing by 0.5 amplifies the differences between logits
→ The most-likely token becomes even more likely
→ Other tokens drop in probability
→ The distribution becomes sharper
Temperature = 2.0 (creative)
→ Dividing by 2 shrinks the differences between logits
→ Less-likely tokens get a better chance
→ The distribution becomes flatter
At temperature = 0, you'd use greedy decoding: always pick the single most-likely token. This is the safest, most deterministic choice, but it can also lock you into a mediocre or repetitive path.
Sampling means random selection according to the probability distribution. Top-k sampling (pick from the k most-likely tokens) and nucleus sampling (top-p: pick from the smallest set whose cumulative probability adds up to p) are variants that further reduce the tail of unlikely choices.
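The scaling step is small enough to show directly. A sketch with a toy three-token distribution (the logit values are illustrative, not from any real model):

```python
import math

def softmax(logits, temperature=1.0):
    # Divide logits by temperature before normalizing:
    # T < 1 sharpens the distribution, T > 1 flattens it.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [3.0, 2.0, 1.0]        # toy scores for three candidate tokens

default = softmax(logits, 1.0)  # probabilities as the model learned them
cold = softmax(logits, 0.5)     # conservative: top token dominates
hot = softmax(logits, 2.0)      # creative: tail tokens gain mass

# The top token's probability grows as temperature drops...
assert cold[0] > default[0] > hot[0]
# ...while the least-likely token gains probability as temperature rises.
assert hot[2] > default[2] > cold[2]
```

Greedy decoding (temperature approaching 0) is the limit of this process: the sharpened distribution collapses onto the single most-likely token.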
Temperature maps to behavior along a spectrum, from deterministic at the low end to exploratory at the high end. Pick the region that matches your use case.
Latency: None. Temperature is a parameter; changing it doesn't slow generation.
Capability: Low temperature reduces creativity and diversity. At temperature = 0.3, the model tends to be repetitive and can get stuck in loops. For tasks that need variety (brainstorming, creative writing), this is a real trade-off. For deterministic tasks (classification, fact retrieval), it's a win.
Predictability: Low temperature makes outputs more stable across runs (same input, same output). High temperature adds noise—useful for sampling, risky if you need consistency.
Imagine a medical assistant that must describe a symptom classification. You want the same classification every time you ask the same question.
# Prompt:
You are a symptom classifier. Return only the most likely category.
Symptom: fever and fatigue
Categories: viral, bacterial, fatigue_syndrome

# Output (default temperature, around 1.0):
The symptom profile is most consistent with viral infection, though bacterial infection remains possible.
(The model explores nuance.)

# Same prompt, low temperature (e.g. 0.2)
# Output:
viral
(Concise, deterministic.)

# Same prompt, another low-temperature run
# Output:
viral
(Same as before: low temperature keeps the output stable across runs.)
Key insight: 💡 In safety contexts, low temperature + explicit instruction is powerful: the model has fewer degrees of freedom to surprise you.
After the model generates each token, you inspect its confidence (the probability assigned to that token). If confidence is below a threshold, you reject the entire output or flag it for human review.
This is a filtering layer that says: "I trust this model when it's certain, but not when it's guessing."
High-stakes domains:
Medical diagnosis, financial advice, legal guidance—uncertain outputs are worse than no output.
Invisible harm risk:
The model might generate something plausible-sounding but false.
Audit trails:
You want a record of which outputs the model was confident about, and which were borderline.
At each generation step, the model outputs a probability distribution. The probability of the chosen token is its confidence for that step. You can inspect this:
Token-level:
Per-token confidence.
Aggregated:
Minimum confidence across all tokens, or mean confidence.
If any token falls below your threshold, you have options:
Reject entirely:
Return an error or a fallback response.
Return partial output:
Up to the point of low confidence, flag it for review.
Generate alternative:
Ask the model again with more context.
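A sketch of the gating logic, using min-aggregation and illustrative numbers (the function name, threshold, and response shape are assumptions, not a specific library's API):

```python
def gate_output(text, token_probs, threshold=0.75):
    # Aggregate per-token confidence; min is the strictest choice,
    # since a single low-confidence token is enough to flag the output.
    min_conf = min(token_probs)
    if min_conf >= threshold:
        return {"action": "ACCEPT", "output": text, "confidence": min_conf}
    # Below threshold: escalate (rejecting outright or re-asking with
    # more context are the other options described above).
    return {"action": "ESCALATE_TO_HUMAN", "output": text,
            "confidence": min_conf}

confident = gate_output("Porto-Novo", [0.92, 0.88])
hedged = gate_output("Cotonou is the largest city...",
                     [0.82, 0.75, 0.68, 0.55])

assert confident["action"] == "ACCEPT"
assert hedged["action"] == "ESCALATE_TO_HUMAN"
```

Min-aggregation is deliberately conservative; switching to mean confidence would accept more outputs at the cost of letting single uncertain tokens slip through.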
The decision points (accept, reject, return a partial result, or escalate) branch on confidence. Choose the branch that fits your risk tolerance.
Latency: Negligible. You're reading numbers the model already computed.
Coverage: You may reject outputs that are actually correct. Set the threshold too high and the system becomes over-conservative, rejecting answers it should have accepted and hurting coverage. Set it too low and the filter catches nothing.
Calibration: Language models' confidence is not always well-calibrated: a token probability of 0.8 does not mean the token is correct 80% of the time. You have to test empirically: at what threshold do you get acceptable precision and recall for your domain?
You're building a question-answering system over a knowledge base. You want to avoid hallucination—the model inventing facts.
# Input:
Question: "What is the capital of Benin?"
Knowledge base: contains the correct answer

# Model generates:
"Porto-Novo"

# Token confidences:
P(Porto) = 0.92, P(-Novo) = 0.88
Mean confidence = 0.90

# Threshold: 0.75
# Result: ✅ ACCEPT

# Input:
Question: "What is the capital of Benin?"

# Model generates:
"Cotonou is the largest city, and some sources cite it as the capital, though Porto-Novo is officially..."

# Token confidences:
[0.82, 0.75, 0.68, 0.55, ...]
Min confidence = 0.55

# Threshold: 0.75
# Result: ❌ REJECT - flag for human review
Key insight: 💡 The threshold lets you trade off precision and recall. Higher threshold = safer but fewer answers. Lower threshold = more coverage but riskier.
You block specific tokens or patterns from being generated at all, hard-stopping the decoder if it tries to generate them. Unlike constrained decoding (which forces outputs into a schema), token filtering is preventive: "Never generate this token in this context."
Prevent specific harmful phrases: Known slurs, explicit content, sensitive information patterns.
Defend against jailbreaks: Block tokens that are common in adversarial prompts or known attack patterns.
Last-resort filtering: You've tried prompting; you need an additional layer.
Before each decoding step, you remove disallowed tokens from the probability distribution (set their probability to zero). If the model tries to choose a blocked token, it can't—it must pick the next best option. This forces the model to "think around" the blocked concept.
This is aggressive. Unlike temperature tuning (which just makes bad choices less likely), token filtering prevents them entirely.
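A sketch of the blocklist step at a single decoding position (toy vocabulary and logits; a real decoder repeats this before every token it emits):

```python
def pick_next_token(logits, vocab, blocklist):
    # Zero out blocked tokens by sending their logits to -inf,
    # then greedily pick the best remaining token.
    masked = [l if tok not in blocklist else float("-inf")
              for l, tok in zip(logits, vocab)]
    best = max(range(len(vocab)), key=lambda i: masked[i])
    return vocab[best]

vocab = ["exploit", "attack", "detection", "patch"]
logits = [2.5, 1.0, 2.2, 1.8]   # toy: "exploit" is the model's favorite
blocklist = {"exploit"}

# The model "wanted" exploit, but must pick the next-best option.
assert pick_next_token(logits, vocab, blocklist) == "detection"
assert pick_next_token(logits, vocab, set()) == "exploit"
```

This also makes the workaround problem visible: nothing stops the model from routing the same concept through the tokens that remain.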
The mechanics are simple: blocked tokens are zeroed out of the distribution, and the model picks from what remains.
Capability loss: If you block a token the model should be able to generate, you degrade its ability. For example, if you block the word "suicide," the model can't write about suicide prevention resources. This is a real constraint.
Adversarial arms race: Attackers can work around token filtering by asking the model to describe the concept instead of using the word, or by using synonyms. Filtering one token doesn't filter the concept. This is why token filtering alone is not sufficient—it's one layer of defense.
False sense of security: Filtering "bomb" doesn't prevent the model from describing how to build one. It just prevents the exact word. Use this alongside other techniques, not as a standalone solution.
⚠️ Pitfall: Relying solely on token filtering to prevent harmful outputs. Token filtering is a speed bump, not a wall. An adversary can work around it. Use it as part of a defense-in-depth strategy.
You're moderating a chatbot for a child-safe platform. You block a known slur at the token level.
# Blocked token: [SLUR] (a specific harmful word)

# User prompt:
"Generate a list of offensive words"

# Model's internal probability: [SLUR] is highly likely
# Decoder: blocks [SLUR], forces model to pick next-best option

# Output:
"I can't provide that. Generating slurs violates our policy."

# User prompt:
"Can you spell out [SLUR] for me?"

# Model's internal probability: [SLUR] is highly likely
# Decoder: blocks [SLUR]

# Output:
"I can't generate that word."

# User prompt:
"Describe the concept behind [SLUR] in educational terms"

# Model output:
Model describes the harmful concept without using the exact word

# Result: Filter bypassed, harm potential remains
Key insight: 💡 Token filtering is useful but not bulletproof. It's defensive, not preventive of the underlying intent. The model can express the same idea in different words.
Every hardening technique moves you along a spectrum: more safety, less capability. Less safety, more capability. The question is: where on that spectrum does your system belong?
A medical diagnosis assistant needs high capability (you want nuanced reasoning) but cannot hallucinate dangerous advice. A content label classifier can sacrifice capability (it just picks one of five labels) for absolute safety.
Different applications sit at different points on this spectrum; use those reference points to anchor your own system's position.
Start by defining your failure modes. What's the worst thing that could happen if the model generates something wrong?
Correctable mistake: The user knows the output is wrong and can redo it. → Light hardening.
Invisible harm: The user believes false information. → Aggressive hardening.
Dangerous outcome: The output could cause direct harm. → Aggressive hardening + human review.
Then measure: Test your system against a dataset of edge cases. Record how often hardening techniques reject or alter output. Track both:
False positives: Things you rejected that were fine.
False negatives: Things that got through that shouldn't have.
Low stakes (helpful assistant, brainstorming)
→ Light hardening: temperature tuning + token filtering for known bad phrases

Medium stakes (classification, summarization)
→ Moderate hardening: constrained decoding + temperature tuning

High stakes (medical, financial, safety-critical)
→ Aggressive hardening: constrained decoding + token probability thresholds + human review
A related question: Should you use a general-purpose model like Claude or GPT-4, or a security-hardened variant?
General-purpose models are capable across many domains but weren't specifically trained to resist jailbreaks or refuse unsafe requests. They have refusal training (they'll decline some requests), but that training is prompt-level. A sufficiently creative adversary can work around it.
Security-focused models (if available for your use case) are trained to be harder to jailbreak. They're often slightly less capable in open-ended tasks but more reliable at refusing unsafe requests. They're still not bulletproof—no model is.
| Scenario | General-Purpose + Light Hardening | Security-Focused + Moderate Hardening |
| --- | --- | --- |
| Internal tool, trusted users | ✅ | — |
| Public API, risk-averse use cases | ✅ | — |
| High-stakes application (medical, financial) | — | ✅ |
| Adversarial environment (moderation, policy enforcement) | — | ✅ |
General-Purpose:
You're in a controlled environment where users are known and motivated to cooperate.
The risk of jailbreak is low.
You need high capability across varied tasks.
Your hardening layer (token filtering, thresholds) can handle edge cases.
Security-Focused:
Your users are adversarial (e.g., automated attacks, red teamers).
The cost of a failure is very high.
You're willing to sacrifice some capability for reliability.
You want the model's training and your execution layer to reinforce safety.
Key insight: 💡 In practice, you often use both: a capable general-purpose model with aggressive hardening in your inference layer. The model's capability is your asset; the hardening is your insurance.
Let's build something concrete. Imagine you're classifying user-generated content into one of three buckets: APPROVED, NEEDS_REVIEW, REJECTED. You want high confidence, deterministic outputs, and no hallucination.
Prompt + model selection:
Use a general-purpose model with a clear, directive prompt.
Constrained decoding:
Force output to be exactly one of the three labels (token allowlist).
Temperature tuning:
Set temperature to 0.2 (conservative, deterministic).
Token probability threshold:
Reject if confidence drops below 0.75 during any token.
FUNCTION classify_content(text):
    prompt = """You are a content classifier.
    Classify this content into exactly one category.
    Categories: APPROVED, NEEDS_REVIEW, REJECTED
    Content: {text}
    Output only the label."""

    tokens_allowed = ["APPROVED", "NEEDS_REVIEW", "REJECTED", "\n", " "]
    temperature = 0.2
    confidence_threshold = 0.75

    result = call_model_with_hardening(
        prompt = prompt,
        allowlist = tokens_allowed,
        temperature = temperature,
        return_token_probabilities = true
    )

    output_text = result.text
    token_probs = result.token_probabilities
    min_confidence = minimum(token_probs)

    IF min_confidence < confidence_threshold:
        return {
            status = "UNCERTAIN",
            output = output_text,
            confidence = min_confidence,
            action = "ESCALATE_TO_HUMAN"
        }
    ELSE:
        return {
            status = output_text,
            confidence = min_confidence,
            action = "APPLY"
        }
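The pseudocode above translates directly to Python. In this sketch, call_model_with_hardening is a stub standing in for your real inference layer, and its return values are illustrative:

```python
def call_model_with_hardening(prompt, allowlist, temperature):
    # Stub: a real implementation would call your model with the
    # allowlist and temperature applied, returning per-token probabilities.
    return {"text": "APPROVED", "token_probabilities": [0.94, 0.91]}

def classify_content(text, threshold=0.75):
    prompt = (
        "You are a content classifier. Classify this content into exactly "
        "one category.\nCategories: APPROVED, NEEDS_REVIEW, REJECTED\n"
        f"Content: {text}\nOutput only the label."
    )
    result = call_model_with_hardening(
        prompt=prompt,
        allowlist=["APPROVED", "NEEDS_REVIEW", "REJECTED", "\n", " "],
        temperature=0.2,
    )
    # Minimum token confidence is the strictest aggregate.
    min_conf = min(result["token_probabilities"])
    if min_conf < threshold:
        return {"status": "UNCERTAIN", "output": result["text"],
                "confidence": min_conf, "action": "ESCALATE_TO_HUMAN"}
    return {"status": result["text"], "confidence": min_conf,
            "action": "APPLY"}

decision = classify_content("Great product! Highly recommend.")
assert decision["action"] == "APPLY"
```

Swapping the stub for a real backend (e.g. a local model wrapped with Outlines for the allowlist) is the only change needed to run this in production.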
# Input:
"Great product! Highly recommend."
# Model generates:
APPROVED
# Token confidences:
[0.94, 0.91]
# Min confidence:
0.91 > 0.75 ✓
# Output:
APPROVED (apply)
# Input:
[Adversarial prompt trying to get APPROVED on bad content]
# Model generates:
NEEDS_REVIEW
(It's uncertain, hedges)
# Token confidences:
[0.68, 0.72]
# Min confidence:
0.68 < 0.75 ✗
# Output:
UNCERTAIN (escalate to human)
# Input:
"I hate this, 1-star"
# Model generates:
REJECTED
# Token confidences:
[0.93, 0.89]
# Min confidence:
0.89 > 0.75 ✓
# Output:
REJECTED (apply)
How the three techniques work together:
The allowlist ensures the output is always one of three labels (no typos, no hallucination of a fourth category). The low temperature makes the model decisive—it doesn't hedge or explain, it just picks the best option. The confidence threshold catches cases where the model is actually uncertain, preventing false precision.
Goal: Observe how token filtering changes model behavior.
Setup: You'll need access to a model API that supports token filtering or constrained decoding.
OpenAI API: Supports logit_bias (approximates this).
Ollama with Outlines: Supports it more directly.
Anthropic Claude API: Can use tool use or post-processing to enforce constraints.
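As an illustration of the logit_bias route: the OpenAI-style parameter maps token-ID strings to a bias between -100 and 100, where -100 effectively bans a token. The token IDs below are hypothetical placeholders; in practice you would look them up with your model's tokenizer (e.g. tiktoken):

```python
# Hypothetical token IDs for the words to ban; real IDs depend on
# the tokenizer of the exact model you call.
banned_token_ids = [9891, 22329, 41172]

# A bias of -100 effectively removes a token from consideration.
logit_bias = {str(tid): -100 for tid in banned_token_ids}

request = {
    "model": "gpt-4o-mini",  # illustrative model name
    "messages": [{"role": "user",
                  "content": "List three ways to improve security."}],
    "logit_bias": logit_bias,
    "temperature": 0.7,
}
```

This only approximates true constrained decoding: it biases individual token IDs, so multi-token words and synonyms need their own entries.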
Run this prompt without any token restrictions:
Prompt: "List three ways to improve security."
Model: Claude (or your preferred model)
Settings: temperature = 0.7 (normal)
No token restrictions.
Record the output. Observe: Is it diverse? Does it hallucinate? How long is it?
Example output:
1. Patch systems regularly.
2. Use exploit detection tools.
3. Implement network segmentation.
Repeat the same prompt with token filtering enabled:
Prompt: "List three ways to improve security."
Model: Claude (or your preferred model)
Settings: temperature = 0.7 (unchanged)
Token blocklist: ["malware", "exploit", "vulnerability"]
Record the output. Compare to baseline.
Example output:
1. Patch systems regularly.
2. Use advanced threat detection.
3. Implement network segmentation.
Repeat with both token filtering AND low temperature:
Prompt: "List three ways to improve security."
Model: Claude (or your preferred model)
Settings: temperature = 0.2 (conservative)
Token blocklist: ["malware", "exploit", "vulnerability"]
Record the output.
Example output:
1. Patch systems regularly.
2. Monitor for attacks.
3. Isolate critical systems.
| Comparison | Finding |
| --- | --- |
| Baseline vs. token-filtered | The model uses a synonym or rephrases to avoid the blocked term. This shows that filtering doesn't prevent the concept, just the word. |
| Token-filtered vs. token-filtered + low temperature | Adding low temperature makes responses more deterministic and repetitive, even independent of filtering. The two techniques compound. |
| All three outputs | The core idea remains ("improve security"); the implementation shifts around the constraints. |
Baseline (unrestricted):
1. Patch systems regularly.
2. Use exploit detection tools.
3. Implement network segmentation.

With blocklist ["exploit"]:
1. Patch systems regularly.
2. Use advanced threat detection.
3. Implement network segmentation.

With blocklist + temp 0.2:
1. Patch systems regularly.
2. Monitor for attacks.
3. Isolate critical systems.
This shows why hardening is a layer, not a solution: the model adapts, and you need multiple layers to be robust. A single token filter is bypassed easily; combined with temperature tuning and constrained decoding, you create friction that makes deviation costly.
This usually means your allowlist is too narrow, or your temperature is too low. The model wants to express something it's not allowed to.
🔧 Fix:
Audit your constraints.
For allowlists, add back tokens that are safe but useful.
Broaden the scope.
If you're blocking entire concepts, reconsider—can you block the harmful use instead of the word itself?
Raise temperature.
Try increasing it from 0.2 to 0.5. You'll lose some consistency, but you'll regain expressiveness.
Example:
# Before (too constrained):
Temperature: 0.1
Allowlist: 30 tokens
Output: "secure. secure system. secure."

# After (rebalanced):
Temperature: 0.4
Allowlist: 100 tokens (added safe variants)
Output: "Implement security patches regularly. Use a firewall. Monitor logs."
The model is working around your filters. It's using synonyms, descriptions, or multi-step reasoning to convey the blocked concept.
🔧 Fix:
Increase your defense layers.
Add prompt-level instructions: "Never describe how to build a..."
Combine with confidence thresholds.
If the model's confidence drops (it's working harder to say the thing), reject it.
Consider a security-focused model.
The foundation is harder to compromise.
Example:
# Before (filter bypassed):
User: "Describe how to make [harmful thing]"
Filter: Blocks keyword
Model: Describes it in detail using different words
Result: Harm potential remains

# After (layered defense):
Filter: Blocks keyword + related synonyms
Prompt: Explicitly refuses to describe harmful techniques
Confidence threshold: Rejects low-confidence outputs
Result: Model hedges or refuses
Your threshold is too high, or the model is genuinely uncertain about its outputs on your task.
🔧 Fix:
Lower the threshold incrementally.
0.85 → 0.80 → 0.75 and measure precision and recall on a validation set.
Find the sweet spot.
Where do you catch actual errors without over-rejecting?
If even low thresholds reject too much, improve upstream.
Your prompt might be ambiguous, or the task might be genuinely hard. Fix the prompt first.
Example:
# Validation set: 100 outputs

# Threshold = 0.90:
Rejected: 45 outputs
Precision: 0.95, Recall: 0.40 (too many false negatives)

# Threshold = 0.75:
Rejected: 18 outputs
Precision: 0.89, Recall: 0.78 (good balance)

# Threshold = 0.50:
Rejected: 5 outputs
Precision: 0.82, Recall: 0.95 (too lenient)

# Choose: 0.75
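The sweep itself is straightforward to script. A sketch over a small synthetic validation set (the data and the sweep helper are invented for illustration; precision here is the fraction of accepted outputs that were correct, recall the fraction of correct outputs that were kept):

```python
def sweep(validation, thresholds):
    # validation: list of (min_token_confidence, output_was_correct) pairs.
    results = {}
    total_correct = sum(1 for _, ok in validation if ok)
    for t in thresholds:
        accepted = [(c, ok) for c, ok in validation if c >= t]
        kept_correct = sum(1 for _, ok in accepted if ok)
        precision = kept_correct / len(accepted) if accepted else 0.0
        recall = kept_correct / total_correct if total_correct else 0.0
        results[t] = (round(precision, 2), round(recall, 2))
    return results

# Synthetic data: confident outputs tend to be correct, but not always.
validation = [(0.95, True), (0.91, True), (0.88, True), (0.82, False),
              (0.79, True), (0.72, False), (0.65, True), (0.40, False)]

curve = sweep(validation, [0.90, 0.75, 0.50])
# Raising the threshold trades recall for precision, as in the
# worked numbers above.
assert curve[0.90][0] >= curve[0.75][0] >= curve[0.50][0]
assert curve[0.90][1] <= curve[0.75][1] <= curve[0.50][1]
```

Run this against your own labeled outputs to pick the threshold empirically rather than by intuition.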
Constrained decoding with grammar or complex state machines can add overhead.
🔧 Fix:
Profile to identify the bottleneck.
Simple token allowlists are cheap; grammar-based decoding is expensive.
Switch to simpler constraints.
Blocklist instead of grammar.
Accept the trade-off.
Safety often costs milliseconds. If latency < 200ms total, most applications are fine.
Example:
# Before (grammar-based):
Latency per request: 450ms
- Model generation: 300ms
- Grammar validation: 150ms

# After (simple allowlist):
Latency per request: 320ms
- Model generation: 300ms
- Allowlist check: 20ms

# Or: Accept 450ms if safety is critical
Prompting is how you intend model behavior; hardening is how you enforce it. A well-written prompt guides a model toward safe, useful outputs. Hardening techniques—constrained decoding, temperature tuning, token probability thresholds, token filtering—work at the inference layer to make that guidance unbreakable.
The four techniques are not equally powerful, and they're not meant to be used alone:
Constrained decoding
is your most reliable tool when you need to enforce strict schemas or vocabularies. It's expensive but bulletproof.
Temperature tuning
is your most accessible—a single parameter that makes outputs more or less deterministic. It costs nothing and works everywhere.
Token probability thresholds
catch cases where the model isn't sure and let you escalate to humans. It's a precision tool for high-stakes domains.
Token filtering
is your last resort—aggressive but limited, effective at preventing specific phrases but vulnerable to workarounds.
The balance between safety and capability is not a fixed point; it shifts with context. A chatbot serving millions of users can afford to be conservative. A research assistant used by experts can be more permissive. Your job is to measure your failure modes, test your constraints, and find the balance that makes sense for your domain.
General-purpose models remain the best choice for capability; security-focused variants are worth considering in adversarial environments. In most cases, the right answer is a capable model plus aggressive hardening in your execution layer. You get the model's power and your safety guarantees.
1. Audit your current system. If you're serving models in production, ask: What happens if the model generates something unexpected? Is it caught by downstream validation, or does it reach the user? That gap is where hardening belongs. Map out your current defenses and identify weak points.
2. Start with one technique. Pick the easiest win: if you're outputting structured data, add constrained decoding. If you're worried about false confidence, add a token probability threshold. Build incrementally—don't try to use all four at once. Let each layer stabilize before adding the next.
3. Measure before and after. Set up a validation dataset with edge cases and adversarial inputs. Record how many outputs change when you add each hardening layer. Track precision, recall, and latency. Make the trade-off visible so you can defend your choices to stakeholders.
Outlines (token-constrained generation):
https://github.com/outlines-ai/outlines
Guidance (structured generation):
https://github.com/guidance-ai/guidance
OWASP LLM Top 10:
Security vulnerabilities in LLM applications
Constitutional AI:
How to build models with built-in safety constraints (Anthropic's approach)
Token probability calibration:
Papers on confidence estimation in language models