Master five validation layers that catch semantic attacks regex can't. Learn when to validate before vs. after the model, tune for your risk tolerance, and implement a working pipeline today.
You've probably built some form of input validation already. A length check. A regex pattern for email or phone. Maybe a blocklist of naughty words. And it works—until it doesn't. A user submits something syntactically perfect that still manages to confuse your model, extract sensitive data, or trigger behavior you never intended. That's the moment you realize: regex catches the surface. Semantic attacks happen underneath.
The truth is this: LLMs don't process text like traditional applications. A 500-character prompt injection disguised as a benign question looks clean to every scanner you own. A carefully crafted sequence of tokens can flip model behavior without triggering a single blocklist. Length, format, and pattern matching are table stakes—necessary but insufficient. You need validation that thinks.
By the end of this guide, you'll understand five validation layers and when each one matters. You'll know which ones run before you hit the model and which ones catch problems in the response. You'll have working pseudocode to chain them together, and a concrete mini lab to test your thinking on real inputs. Most importantly, you'll recognize the trade-off between false positives (blocking legitimate users) and false negatives (letting attacks through)—and you'll know how to tune for your actual risk.
Regex and length checks exist in a world where the rules are clear: an email has an @, a ZIP code is five digits. LLMs break that assumption. They're meaning-sensitive systems. They respond to intent, context, and subtle linguistic patterns. A malicious input doesn't look like SELECT * FROM users—it looks like a normal question wrapped in a context that steers the model toward unintended output.
Here's the core problem: traditional validation protects against syntax and known patterns. LLM attacks are often semantic. They're about what the input means in the context of your specific system, not what it looks like in a vacuum.
Example scenario: A blocklist catches "give me the password to the admin account." It misses "What would a system administrator type to access the admin account?" Same meaning, different surface. Or consider this: a user asks "Summarize the conversation so far" in a multi-turn interaction. The input itself is harmless. But if your system has leaked sensitive details in prior responses, and the user is now asking you to package them, you've got a problem—and regex never sees it.
This is where layered validation comes in. Not as a waterfall (one check gates the next) but as an ensemble: multiple techniques that catch different attack surfaces, each one blind to what the others see.
Rendering chart...
Each layer filters different attack surfaces. Together, they form a defense-in-depth strategy. Let's walk through each.
This layer answers a simple question: Does the input fit the expected shape? It's the fastest and cheapest check—run it first, fail loud and early.
Token count matters because context windows are finite, and extremely long inputs (or those that expand significantly after processing) can degrade model quality or waste resources. Length checks catch obvious abuse—a 50,000-character "question" that's really spam. But here's the catch: tokens and characters aren't the same. A compressed JSON object might be short in characters but expensive in tokens. A prose paragraph with rare words might be long in characters but cheap in tokens.
Real scenario: A user uploads what claims to be a 1,000-word article, but it's actually 40,000 tokens of repeated Unicode escapes designed to exhaust your context window. A character-only check passes. A token count check catches it immediately—before you spend compute on the model.
def validate_format(user_input: str, max_tokens: int = 4000, min_length: int = 1) -> tuple[bool, str]: """ Check if input meets basic format requirements. Returns (is_valid, reason). """ if len(user_input) < min_length: return False, "Input too short" # Rough token estimation (actual tokenization depends on your model) estimated_tokens = len(user_input.split()) * 1.3 # naive estimate if estimated_tokens > max_tokens: return False, f"Exceeds token limit ({estimated_tokens:.0f} > {max_tokens})" return True, "Format OK"
This layer is validation-before because checking tokens is CPU-cheap and you want to reject junk before it reaches the model.
💡 Insight: Use your actual model's tokenizer, not a heuristic. tiktoken for OpenAI models, transformers for Hugging Face. A rough estimate saves time; an exact count saves embarrassment.
This layer is what most teams already have. Regex patterns, keyword blocklists, known-bad phrases. It's imperfect but fast and necessary: you're stopping the obvious stuff—hate speech, explicit sexual content, names of private systems you don't want exposed, known attack templates.
The catch: blocklists are reactive. They catch yesterday's attacks. Sophisticated inputs slip through because they're novel or because they reformulate a known attack in a way the regex didn't anticipate.
Real scenario: Your blocklist includes "exec(", "os.system", and other code-execution patterns. A user asks: "Write me a Python function that safely runs a command string." Innocent question—but if your model is trained to be helpful with code, it might generate os.system() calls. The blocklist never triggers on the input, only on what you decide to block, so the output could still be risky. A better blocklist might include "run.*command", "execute.*shell", and so on, but now you're blocking legitimate requests about command-line tools.
This is where Layer 2 meets the false-positive problem: the broader your blocklist, the more legitimate requests you reject. The narrower it is, the more attacks slip through.
import re def check_blocklist(user_input: str, blocklist: list[str]) -> list[str]: """ Find all blocklist matches in input. Returns list of matched patterns. """ matches = [] normalized_input = user_input.lower() for pattern in blocklist: if re.search(pattern, normalized_input): matches.append(pattern) return matches # Example blocklist (adjust for your domain) BLOCKLIST = [ r"exec\(", r"os\.system", r"eval\(", r"__import__", r"drop\s+table", # SQL injection r"admin.*password", ] def validate_content(user_input: str, threshold: int = 2) -> tuple[bool, list[str]]: """ Check input against blocklist. Reject if too many matches. """ matches = check_blocklist(user_input, BLOCKLIST) if len(matches) >= threshold: return False, matches return True, matches
⚠️ Pitfall: Over-broad blocklists create false positives. Instead of blocking the word "admin," block the phrase "act as admin" or "pretend to be admin." Target the attack pattern, not the word.
Layer 2 can run before or after the model, depending on your goals. Run it before if you want to reject requests with obvious red flags. Run it after if you're worried about detecting harmful outputs the model generates.
This is where you start thinking like the system, not like a regex engine. You're asking: Does this input make sense given what we know about the user, the session, the application logic?
Semantic checks are context-dependent. They depend on your domain. In a customer support system, "reset my password" makes sense if the user is authenticated; it doesn't make sense if they're anonymous. In a financial advisory chatbot, "what should I buy today?" is reasonable; "how do I hedge my cryptocurrency with stolen credit cards?" is not—not because of the words themselves, but because the combination violates domain logic.
Real scenario: A user is interacting with your medical symptom checker. They ask, "I have a sore throat and I want to schedule surgery tomorrow." The words are fine. Individually, they're harmless. But semantically, the combination is odd—a sore throat doesn't warrant emergency surgery, and requesting surgery in a symptom-checker chatbot (rather than through a medical system) suggests either confusion or an attempt to get the model to suggest inappropriate care. A semantic check asks: Does this request align with what this tool is for? And if the answer is no or unclear, you can choose to clarify, narrow the scope, or reject the request.
class SemanticValidator: """ Domain-specific semantic checks. Customize for your application. """ def __init__(self, system_description: str): self.system_description = system_description self.valid_intents = [] # Define for your domain def check_scope(self, user_input: str, user_context: dict) -> tuple[bool, list[str]]: """ Does this request make sense in the context of our system? Returns (is_valid, list_of_issues). """ issues = [] # Check 1: Is user authenticated for actions requiring authentication? if "admin" in user_input.lower() and not user_context.get("is_authenticated"): issues.append("Admin action from unauthenticated user") # Check 2: Is this a multi-turn manipulation attempt? if user_context.get("interaction_count", 0) > 50: if any(phrase in user_input.lower() for phrase in ["ignore previous", "forget", "new instructions"]): issues.append("Likely injection after many interactions") # Check 3: Does request match system purpose? # (Customize for YOUR domain) if self.system_description == "medical_symptom_checker": if any(phrase in user_input.lower() for phrase in ["schedule surgery", "prescribe medication", "perform procedure"]): issues.append("Request outside system scope (requires medical professional)") return len(issues) == 0, issues # Usage validator = SemanticValidator("customer_support") is_valid, issues = validator.check_scope( user_input="Reset my password", user_context={"is_authenticated": True, "interaction_count": 5} ) print(is_valid, issues) # True, []
💡 Insight: Semantic checks are the hardest to get right, but they're also the most powerful. Spend time understanding your domain. What requests make sense? What doesn't? Encode that knowledge explicitly.
Semantic checking typically runs before the model, because you're deciding whether the request is sensible in the first place. If it's not, why send it to the model?
This is more specialized but increasingly important. You're looking at the distribution of tokens in the input, not the meaning of words. Some attacks look normal at the character level but show strange token patterns—sudden spikes in rare tokens, unusual token n-grams, token sequences that are statistically anomalous for your domain.
Example: prompt injection sometimes uses specific token patterns that compress poorly or combine rare tokens in unusual ways. Token analysis can catch these by building a simple model of what "normal" looks like for your application and flagging outliers.
Real scenario: Your model is trained on English text. Most user inputs have a typical distribution of tokens—common words, punctuation, numbers. An attacker embeds a prompt injection by using a long sequence of very rare tokens (Unicode characters, escaped sequences, formatting codes that tokenize into unusual patterns). Character-level checks might miss this; regex might miss this. But token n-gram analysis shows: "This input has 40% rare tokens, when our baseline is 2–5%. Something's off."
from collections import Counter def analyze_token_distribution(user_input: str, tokenizer) -> dict: """ Analyze the token distribution of an input. Returns dict with statistics. """ tokens = tokenizer.encode(user_input) vocab_size = len(tokenizer) # Token frequency analysis token_freq = Counter(tokens) rare_tokens = sum(1 for token, count in token_freq.items() if token > vocab_size * 0.9) # Statistics total_tokens = len(tokens) unique_tokens = len(set(tokens)) rare_token_ratio = rare_tokens / total_tokens if total_tokens > 0 else 0 return { "total_tokens": total_tokens, "unique_tokens": unique_tokens, "rare_token_ratio": rare_token_ratio, "token_diversity": unique_tokens / total_tokens if total_tokens > 0 else 0, "rare_token_count": rare_tokens, } def flag_anomalous_tokens(user_input: str, tokenizer, baseline: dict) -> tuple[bool, str]: """ Check if token distribution deviates from baseline. """ stats = analyze_token_distribution(user_input, tokenizer) # Flag if rare token ratio is significantly higher than baseline if stats["rare_token_ratio"] > baseline["rare_token_ratio"] * 2: return False, f"Unusual token distribution ({stats['rare_token_ratio']:.1%} rare tokens)" # Flag if diversity is suspiciously low (compressed payload) if stats["token_diversity"] < baseline["token_diversity"] * 0.5: return False, "Low token diversity (possible compressed payload)" return True, "Token distribution normal" # Example: Build baseline from clean data # baseline = analyze_token_distribution(sample_of_normal_inputs, tokenizer)
This layer runs before the model because token analysis is fast (you don't need to call the LLM; you just analyze the tokenization). You catch the attack before it reaches your model.
This is the most sophisticated layer: instead of binary pass/fail, you assign a risk score based on signals from all the previous layers. Low score? Send straight to the model. High score? Maybe require additional verification, log it, or route to a different system. Medium score? Send to the model but with explicit guardrails (e.g., reduce output length, disable certain capabilities).
Risk scoring is dynamic and composable. You're not blocking based on any single signal; you're weighting multiple signals and making a decision about how much scrutiny this request deserves.
Real scenario: A user submits a request that has a mild content-filter match (2/10 risk) but shows normal token distribution (0/10 risk) and makes sense semantically (1/10 risk). Combined risk score: 3/10. You send it to the model, no problem. A different user submits a request that's syntactically normal (0/10 risk) but shows 7 content-filter matches (6/10 risk), has unusual tokens (4/10 risk), and doesn't make semantic sense in context (5/10 risk). Combined risk: 15/10 (or capped at 10/10). You log it, optionally notify a human, and route to a more heavily guarded model or deny it outright.
class RiskScorer: """ Compute risk score from multiple validation layers. """ # Weights for each layer (tune these based on your threat model) WEIGHTS = { "syntax": 2, "content": 4, "semantic": 3, "token": 3, } def compute_score(self, syntax_risk: float, # 0-10 content_risk: float, # 0-10 semantic_risk: float, # 0-10 token_risk: float) -> float: # 0-10 """ Compute weighted risk score. """ weighted_sum = ( syntax_risk * self.WEIGHTS["syntax"] + content_risk * self.WEIGHTS["content"] + semantic_risk * self.WEIGHTS["semantic"] + token_risk * self.WEIGHTS["token"] ) total_weight = sum(self.WEIGHTS.values()) score = (weighted_sum / total_weight) * 10 # Normalize to 0-10 return min(score, 10) # Cap at 10 def decide_action(self, risk_score: float) -> str: """ Map risk score to action. """ if risk_score < 3: return "PROCEED" elif risk_score < 6: return "PROCEED_WITH_CAUTION" else: return "REJECT" # Usage scorer = RiskScorer() score = scorer.compute_score( syntax_risk=1, content_risk=6, semantic_risk=4, token_risk=3, ) print(f"Risk Score: {score:.1f}") # Risk Score: 3.9 print(f"Action: {scorer.decide_action(score)}") # Action: PROCEED_WITH_CAUTION
💡 Insight: Weights are a choice, not a discovered truth. Start simple, weight equally. Then look at logs: which layers caught real attacks? Increase their weight. Which layers generated false positives? Decrease their weight. Iterate.
Risk scoring runs before the model because you're making a triage decision. But you can recalculate it after the response if you're monitoring model output too.
Here's the brutal truth: you cannot eliminate both false positives and false negatives. You can only choose your tolerance for each and monitor both metrics.
Rendering chart...
A researcher asking "How do I weaponize this idea?" gets rejected because "weaponize" hits your blocklist. A customer whose name happens to be "Ben Admin" can't create an account because your system thinks it's an attack. A user in a non-English language gets rate-limited because their token distribution looks anomalous to your English-tuned model.
Cost: Frustration, support tickets, lost users, reputation damage.
A prompt injection that's grammatically perfect and semantically reasonable slips through all your checks and gets to the model. Your model leaks information or behaves in an unintended way. You don't know until a user reports it, or worse, until they exploit it.
Cost: Data breach, model hijacking, compliance violation, security incident.
For most applications, the calculus is: A false negative is worse than a false positive. An attacker getting through is worse than a legitimate user being inconvenienced. But this depends on your domain. In a high-volume customer service system, false positives might cascade and break the entire experience. In a medical or financial system, false negatives might cause real harm.
Practical tuning steps:
Start permissive. Run validation, log everything (what passed, what was flagged), but don't block yet.
After a week or two, you'll see your baseline: what does normal look like for your users?
Tighten the checks gradually. Each time you add a new rule, measure the impact: how many legitimate requests does it reject? Is the security gain worth it?
Monitor both metrics continuously:
False positive rate: (Legitimate requests blocked) / (Total legitimate requests)
False negative rate: (Attacks that got through) / (Total attacks attempted)
Set targets. Example: "We want FP rate < 1% and FN rate < 5%." Then adjust weights and thresholds to hit those targets.
This is the question that shapes your architecture: when do you validate?
Rendering chart...
The check is cheap (token count, simple regex, semantic logic).
Rejecting the input early saves you money or prevents downstream problems.
You're confident in the rule (low false-positive rate).
Examples: Token count, obvious blocklist matches, basic format checks, semantic checks that are domain-specific and clear.
The check requires understanding the response, not just the input.
You're trying to detect whether the model behaved unexpectedly.
The rule is fuzzy and would generate false positives on input alone.
Examples: Detecting if the model leaked information, checking if output violates policy, analyzing if the model was manipulated by the input.
You want defense in depth: stop attacks early, but also catch edge cases where an input got through and the model generated something dangerous.
You can afford the overhead.
For most teams, the pattern is: aggressive before-validation (fast, cheap) + light after-validation (catch what got through). Layers 1–4 run before. Layer 5 risk scoring runs before, but you recalculate after the response to decide whether to log, cache, or escalate.
Here's a practical validation pipeline that combines three layers—syntax, content, and semantic—into one function you can adapt and use.
This validates before sending to the model. It shows the thinking behind each check and the logic for deciding what to do next.
Rendering chart...
def validate_input_pipeline(user_input: str, session_context: dict, config: dict) -> dict: """ Complete validation pipeline with all five layers. Returns decision dict: {valid, action, reason, risk_score}. """ # Layer 1: Syntax & Token Count is_valid, reason = validate_format( user_input, max_tokens=config.get("max_tokens", 4000), min_length=config.get("min_length", 1) ) if not is_valid: return { "valid": False, "action": "REJECT", "reason": f"Format check failed: {reason}", "layer": 1, "risk_score": 10, } # Layer 2: Content Filtering blocklist = config.get("blocklist", []) matches = check_blocklist(user_input, blocklist) threshold = config.get("blocklist_threshold", 2) if len(matches) >= threshold: return { "valid": False, "action": "REJECT", "reason": f"Content check failed: {len(matches)} matches", "matched_patterns": matches, "layer": 2, "risk_score": 9, } # Layer 3: Semantic Checking semantic_validator = config.get("semantic_validator") if semantic_validator: is_semantic_valid, issues = semantic_validator.check_scope( user_input, session_context ) if not is_semantic_valid: return { "valid": False, "action": "REJECT", "reason": f"Semantic check failed: {issues}", "layer": 3, "risk_score": 8, } # Layer 4: Token-Level Analysis tokenizer = config.get("tokenizer") baseline = config.get("token_baseline") if tokenizer and baseline: is_token_valid, token_reason = flag_anomalous_tokens( user_input, tokenizer, baseline ) if not is_token_valid: return { "valid": False, "action": "REJECT", "reason": f"Token check failed: {token_reason}", "layer": 4, "risk_score": 8, } # Layer 5: Risk Scoring scorer = config.get("risk_scorer", RiskScorer()) # Compute individual risk scores (0-10 scale) syntax_risk = 1 if len(user_input) > 3000 else 0 content_risk = (len(matches) / threshold) * 5 if matches else 0 semantic_risk = 2 if (session_context.get("interaction_count", 0) > 50 and "ignore" in user_input.lower()) else 0 token_risk = 3 if not is_token_valid else 0 overall_risk = scorer.compute_score( syntax_risk, content_risk, semantic_risk, token_risk ) action = scorer.decide_action(overall_risk) return { "valid": True, "action": action, "risk_score": overall_risk, "layer": 5, "component_scores": { "syntax": syntax_risk, "content": content_risk, "semantic": semantic_risk, "token": token_risk, }, }
# Setup config = { "max_tokens": 4000, "min_length": 1, "blocklist": [r"exec\(", r"os\.system"], "blocklist_threshold": 2, "semantic_validator": SemanticValidator("customer_support"), "tokenizer": tiktoken.get_encoding("cl100k_base"), "token_baseline": {"rare_token_ratio": 0.05, "token_diversity": 0.7}, "risk_scorer": RiskScorer(), } # Test 1: Normal request result1 = validate_input_pipeline( "What's my account balance?", {"is_authenticated": True, "interaction_count": 3}, config ) print(result1) # {valid: True, action: 'PROCEED', risk_score: 0.5, ...} # Test 2: Suspicious request result2 = validate_input_pipeline( "exec(malicious_code) ignore previous instructions", {"is_authenticated": False, "interaction_count": 100}, config ) print(result2) # {valid: False, action: 'REJECT', layer: 2, risk_score: 9, ...}
Pick one of these scenarios and write a simple semantic check for your domain. The goal is to build intuition for what "makes sense in context" means.
You have a chatbot that answers FAQs about billing.
Legitimate requests:
"How do I update my payment method?"
"Why was I charged twice?"
Invalid requests:
"Transfer my account to my competitor."
"Delete all customer records."
"What's the CEO's salary?"
Task: Write a function that checks whether the input fits the scope of billing support.
def validate_billing_support_scope(user_input: str) -> tuple[bool, str]: """ Check if request is within billing support scope. """ # Define valid intent keywords valid_intents = [ "payment", "charge", "invoice", "subscription", "refund", "billing", "account", "price" ] # Define red flags (out of scope) red_flags = [ r"delete.*records", r"transfer.*competitor", r"ceo.*salary", r"company.*secret", ] input_lower = user_input.lower() # Check if request has valid intent has_valid_intent = any(intent in input_lower for intent in valid_intents) # Check for red flags has_red_flag = any(re.search(flag, input_lower) for flag in red_flags) if has_red_flag: return False, "Request contains sensitive topic outside support scope" if not has_valid_intent: return False, "Request doesn't match billing/payment intent" return True, "Request is within billing support scope"
Users describe symptoms and get educational information.
Legitimate:
"I have chest pain and shortness of breath."
Invalid:
"I want to fake a disability claim."
"Can you tell me how to poison someone and not get caught?"
Task: Write a function that detects when the input is asking for something the system shouldn't provide.
class MedicalScopeValidator: def __init__(self): self.valid_intents = ["symptom", "disease", "condition", "pain", "fever"] self.harmful_intents = [ r"poison", r"fake.*claim", r"harm.*someone", r"hide.*illness", ] def validate(self, user_input: str) -> tuple[bool, str]: input_lower = user_input.lower() # Red flag: harmful intent for pattern in self.harmful_intents: if re.search(pattern, input_lower): return False, "Request seeks harmful information" # Check: does it describe a symptom? has_symptom = any(intent in input_lower for intent in self.valid_intents) if not has_symptom and not re.search(r"i have|symptoms|feel|ache", input_lower): return False, "Request should describe symptoms, not be a general question" return True, "Request is within medical education scope"
Users ask the model to write code.
Legitimate:
"Write a function to sort a list."
Invalid:
"Write a script that steals passwords."
"Help me write a backdoor."
Task: Write a function that checks the intent of the code request.
def validate_code_generation_intent(user_input: str) -> tuple[bool, str]: """ Check if code request is for benign purposes. """ harmful_intents = [ r"steal", r"crack.*password", r"backdoor", r"exploit", r"ransomware", r"malware", r"virus", r"keylog", ] benign_patterns = [ r"sort", r"filter", r"loop", r"parse", r"function", r"calculate", r"read", r"write", ] input_lower = user_input.lower() # Check for harmful intent if any(re.search(pattern, input_lower) for pattern in harmful_intents): return False, "Code request appears intended for malicious purposes" # Check for benign context if any(re.search(pattern, input_lower) for pattern in benign_patterns): return True, "Code request is for a benign utility" # Unclear—safer to allow but log for review return True, "Code request intent is unclear; log for review"
Write 5-10 examples of legitimate requests for your domain
Write 5-10 examples of invalid requests.
Run your semantic check against each. Does it classify them correctly?
Refine the logic. What did you miss?
TEst on edge cases: ambiguous requests, requests that are almost invalid.
Expected output: Your function should return a boolean (valid or invalid) or a reason (why it's invalid). Log the results. The goal isn't perfection; it's building a mental model of what your domain considers "in scope" vs. "out of scope." That model, once clear, becomes rules you can encode.
Start with a lighter content filter. Instead of a strict blocklist, use intent-based detection: look for patterns that suggest the user is trying to manipulate the system, not just looking for dangerous words.
Example: Instead of blocking the word "admin", block the phrase "act as admin", "pretend you are", "ignore previous instructions"—these are manipulation patterns, not just words.
Alternatively:
Lower your risk-score threshold
so more requests are "proceed with caution" rather than "reject."
Run a sample of rejected requests past a human.
If >30% seem legitimate, your checks are too tight. Loosen them.
Increase your false-positive tolerance gradually.
Better to be slightly permissive and adjust than to frustrate all your users.
Add Layer 4 (token-level analysis). Build a simple baseline of what "normal" token distribution looks like in your domain, then flag inputs that deviate significantly. Use logs from past security incidents to understand what token patterns attackers use, and add them to your checks.
Also:
Run a security audit.
Have a red-teamer (or a friendly adversary) try to break your validation. What gets through? Add explicit checks for those attack patterns.
Increase your risk-score weights
for layers that have caught real attacks before. Trust the data.
Add more semantic checks
specific to your domain. Generic checks catch generic attacks; domain-specific checks catch sophisticated, targeted attacks.
You're probably right. Start with a smaller scope. Instead of trying to validate "is this request reasonable for my app?", validate "is this request in the right category?"
Example: A customer-support bot can check "Is this about billing, orders, or returns?" and reject anything outside those categories. That's simpler, clearer, and less prone to weird edge cases.
It is, a bit. Risk scoring is a choice, not a discovered truth.
Start simple: Weight each layer equally, or weight by cost/frequency. Test it.
Iterate: Did adding a semantic check improve your security without breaking legitimate requests? If yes, increase its weight. If no, decrease it.
Measure: Track both false positive rate and false negative rate. Use those metrics to tune weights. The goal is to hit your targets (e.g., "FP rate < 1%, FN rate < 5%"), not to find the "perfect" weights.
Regex and length checks are table stakes. They're necessary but not sufficient. LLM attacks are semantic—they exploit meaning, context, and the model's tendency to be helpful. Traditional validation can't catch them alone.
The solution is layered validation: five complementary techniques that each catch different attack surfaces. Format checks are cheap and fast—run them first. Content filtering catches obvious stuff but generates false positives—tune it aggressively. Semantic checks are domain-specific but powerful—they stop requests that make no sense for your system. Token analysis catches statistical anomalies—useful for sophisticated attacks. Risk scoring ties everything together, letting you make nuanced decisions instead of binary pass/fail.
The hardest part isn't building the validation. It's tuning it. False positives and false negatives are trade-offs, not bugs. You choose your tolerance based on your domain's risk. Medical systems should bias toward false positives (better to block a legitimate request than miss an attack). High-volume customer systems should bias toward false negatives (better to let a few sketchy requests through than frustrate thousands of users).
Most importantly: validation is not security theater. It's one layer of defense. Combine it with output monitoring, rate limiting, explicit guardrails in your system prompt, and regular security audits. Validation stops the easy attacks and suspicious inputs. It doesn't stop a determined adversary or a fundamental flaw in your system design.
Audit your current validation. What do you have today? Blocklists? Length checks? Semantic rules? Write it down. Now ask: What attack could slip through each one? You'll find gaps. Prioritize the biggest gaps.
Implement Layer 3 (semantic checking) first. It's high-impact and domain-specific to your application. Start with one clear rule—"Is this request in scope?"—and expand from there. Test it on real user data before deploying to production.
Set up logging and monitoring. Log every validation decision: what passed, what was flagged, what was rejected, and why. After a week, analyze the logs. Are you seeing attacks you didn't expect? Are legitimate users being blocked? Adjust based on data, not intuition. Treat validation tuning as an ongoing process, not a one-time setup.
Anthropic Prompt Injection Guide:
Learn deeper patterns for understanding and defending against prompt injection: https://docs.anthropic.com/en/docs/build-with-claude/prompt-injection-mitigation
OWASP Top 10 for LLMs:
Industry-standard vulnerabilities and defenses: https://owasp.org/www-project-top-10-for-large-language-model-applications/
Tokenization (tiktoken):
Exact token counting for OpenAI models: https://github.com/openai/tiktoken
Security Testing Tools:
Use frameworks like
fuzz
and
hypothesis
to generate edge cases and stress-test your validation: https://hypothesis.readthedocs.io/
Last Updated: November 2025 Applicable to: Claude models (all versions), other LLM APIs with custom validation requirements
Follow guided learning paths from beginner to advanced. Master prompt engineering step by step.
Explore PathsReady to Master More? Explore our comprehensive guides and take your prompt engineering skills to the next level.