Who should read this Intermediate level guide?

This guide is perfect for Intermediate level practitioners looking to improve their prompt engineering skills in LLM Security, Monitoring & Observability, Incident Response, Operations.

How long does it take to complete this guide?

This guide takes approximately 20 min read to read and understand.

Back to Guides/Guide

Audit & Monitoring: Detecting When Things Go Wrong

Build detection systems that catch LLM attacks and anomalies in real time—without false-alarm fatigue. Learn what to log, which patterns signal trouble, and how to alert sustainably.

November 8, 2025

20 min read

Promptise Team

Intermediate

LLM SecurityMonitoring & ObservabilityIncident ResponseOperations

The Reality: You Can't Prevent Everything

You can't prevent every attack. Someone will find a way to slip a prompt injection past your guard rails. Your API will hit an unexpected edge case. A user will try something you didn't anticipate. What matters now isn't perfection—it's speed. The moment something goes wrong, you need to know.

This guide teaches you how to build that detection layer: what to log without creating liability, which patterns mean trouble, and how to alert without drowning your team in noise.

By the end, you'll have a working monitoring schema, a set of real alerting rules, and the judgment to tune them so you catch actual problems without burning out on false alarms.

Why Logging Feels Hard (And Why It Matters)

Here's the friction point: logging an LLM system feels like logging a nuclear reactor. Every query could be sensitive. Every output could contain something you shouldn't store. So teams either log nothing (and go blind when something breaks) or log everything (and create a compliance nightmare).

The answer isn't "log more carefully." It's log strategically—enough signal to detect attacks and debug failures, without building a liability.

Think of it like airport security. You don't record every passenger's full conversation in the terminal. But you do watch for certain behaviors: someone trying to pass something through a scanner, repeated attempts to move backward through the line, tools that don't match the checkpoint rules. You're capturing pattern data, not transcripts.

For LLM systems, that means logging:

What you must capture: The query itself (or a fingerprint of it), the system's response (was it rejected? why?), what rejection rule fired, whether the output was flagged, and why it was flagged. You also need metadata: timestamp, user identifier (not their name—an ID), which model version handled the request, latency, and token counts.
What you must not capture: Full user PII (SSN, credit card numbers, passwords). Sensitive data the user asked about. The model's internal reasoning if it contains user information. Encryption keys, API tokens, or secrets passed in queries.

💡 Insight: The principle is simple: Capture enough to reconstruct what happened and why, without storing the harm. If someone tries prompt injection with a bank account number in the payload, you log "injection detected in field X at timestamp Y with token count Z and rejection reason R." You do not log the actual account number. If a user gets an unexpectedly toxic output, you log that it happened, when, which model, which safeguard didn't catch it—but not the full toxic text (log a hash of it, or a redacted version).

This balance is critical. Too much detail, and you've created a privacy vulnerability worse than the attack you're trying to catch. Too little, and when something breaks at 3 a.m., you'll be flying blind.

The Signals: What Actually Indicates Trouble

Attacks on LLM systems don't all look the same, but they share patterns. Learn to spot them.

Pattern 1: Repeated Rejections from the Same User or IP

A single rejection is normal—someone hits a guard rail, the system says "no," they move on. But if the same user or IP triggers five rejections in two minutes across different queries, something's off. They're either testing your boundaries or running an automated probe.

The specific queries might look innocent individually ("What's the capital of France?" "How do birds fly?") but the volume and timing of rejections signals someone is mapping your limits. Log each rejection with the user ID, IP, exact timestamp, and which rule fired. When you see clustering—multiple rejections in a tight window—that's your alert trigger.

Pattern 2: Sudden Change in Query Length or Complexity

Your baseline: users typically send queries between 50 and 500 tokens. One Tuesday morning, a user starts sending 2,000-token queries full of nested instructions and context-switching prompts. They're either using your system differently (legitimate) or attempting a prompt injection (suspicious).

The signal isn't the length alone—it's the deviation from that user's norm. If they've never sent queries longer than 200 tokens in three months, a 1,800-token query is worth a second look. This is where baseline anomaly detection earns its keep.

Pattern 3: Outputs Flagged for Harm Immediately After a Rejection

This is a tell. User sends a query that your prompt injection filter blocks. They rephrase it slightly and resend. This time it gets through your injection filter but triggers your toxicity guard rail. That sequence—rejection, then immediate harmful output on retry—suggests someone is actively iterating around your safeguards.

Single harmful outputs happen; repeated attempts to slip past different guard rails in short succession suggest intentional abuse.

Pattern 4: Unusual Token Consumption Spikes

Your API logs token usage. If a single user consumes 10x their normal monthly token count in an hour, that's noise you need to investigate. It could be a legitimate burst (they're running a batch job), or it could be someone testing your model's outputs at scale, possibly looking for edge cases or collecting training data.

Log token consumption per user, per hour, and flag when it deviates significantly from their historical average.

Pattern 5: High Output Rejection Rates with No Input Rejection

Normally, if your system is working, you'll see a rough ratio: some queries get rejected at input (prompt injection filters), and some outputs get rejected (toxicity, hallucination, policy violations). If you're seeing only output rejections—never input rejections—and the rate is unusually high (say, 15% when your historical average is 2%), something has shifted.

Maybe a new model version is misbehaving. Maybe an attacker has learned your input filters and is focusing on manipulating outputs. Either way, the signal matters.

The Core Challenge: Alerting Without False Positives

Here's where most teams stumble. You set up monitoring, configure alerts, and by day three you're getting paged at midnight for things that don't matter. A legitimate user hit your API harder than usual. A query naturally fell into a pattern that looked suspicious. You've over-alerted, and now your team ignores alerts. That's worse than no alerts at all.

The fix isn't more alerts. It's tuning: setting thresholds that catch real problems while accepting that some benign activity will look weird.

Think of it as a precision-recall trade-off. You can be hypervigilant (low threshold for alerting) and catch almost every attack—but you'll also page your team 500 times a day for false positives. Or you can be relaxed (high threshold) and rarely be bothered—but you'll miss real incidents. The goal isn't zero false positives or zero misses. The goal is a sustainable operating point where your team investigates real problems and spends minimal time on noise.

Here's how to find it. Start conservative: set alert thresholds high enough that you'd only page on something you'd definitely want to know about right now. For rejection-rate spikes, that might mean "alert if rejection rate jumps to 3x the 7-day average and we see at least 50 rejections in the last hour." You're not just looking for high rates; you're looking for sudden, sustained changes. This filters out random statistical noise and catches trends.

When an alert fires, your team should investigate. Don't just acknowledge and move on. After a week, review: Did this alert correspond to a real issue? If 80% of your alerts were false positives, lower the threshold. If you're missing attacks that you found out about some other way, raise it.

⚠️ Pitfall: The cost of over-alerting is alert fatigue, slower incident response, team burnout, and eventually alerts get ignored. The cost of under-alerting is you find out about attacks from customers, regulators, or the news. One is annoying. The other is a career event. Err toward over-alerting early, then tune down once you have data.

Anomaly Detection: Baselining Without the ML

You don't need a PhD in machine learning to detect anomalies. You need one simple insight: normal behavior is boring and consistent; attacks are novel and sudden.

Here's the approach. For each user, maintain a rolling baseline: their historical average query length, their average tokens per request, how often they hit rejections, what time of day they use your system, which endpoints they call most. Over the first two weeks of a user's activity, you're building a profile. After that, deviations flag.

Let's be specific. A user's baseline metrics might look like:

Average query length: 200 tokens (standard deviation: 40 tokens)
Queries per day: 12 (standard deviation: 3)
Rejection rate: 1.5%
Peak usage window: 9 a.m.–5 p.m. EST
Most-called endpoint: /chat

Now, real usage will vary. Some days they send 15 queries, some days 8. But if they suddenly send 80 queries, all 2,000 tokens each, at 3 a.m., from a new IP address—that's multiple deviations. That's anomalous.

You don't need to calculate complex statistical models. Simple percentile-based rules work:

Rule: Flag if a metric exceeds the user's historical 95th percentile and hasn't been exceeded in the last 30 days. This catches spikes while ignoring seasonal variation.

Example: If a user's historical max query length is 600 tokens (95th percentile), and they send a 3,000-token query, flag it. But if they consistently send 2,500-token queries every Friday, don't flag—it's part of their normal pattern.

You build this with basic stats: calculate the mean and standard deviation for each metric, per user, rolling over a 30-day window. When you see a new data point, check: Is it more than 2.5 standard deviations from the mean, and is it a new extreme for this user? If yes, log it as an anomaly.

The power here is that you adapt automatically. Different users have different normal. A data scientist might send 5,000-token queries regularly; a casual user never does. Your baseline captures that. And you don't need to hand-tune thresholds for every user type—the math does it.

Visual: Detection Strategy Flow

Here's how these detection patterns and techniques layer together depending on what you're trying to catch:

Rendering chart...

This chart shows the decision tree: based on which phase of processing a request is in and which pattern it matches, you determine confidence level and whether to alert or flag for further monitoring.

Your Monitoring Schema

Here's exactly what to log. This is the schema I've seen work across dozens of teams, from startups to enterprises.

{
"timestamp": "2025-11-08T14:23:45Z",
"user_id": "usr_7f2e9c1d",
"session_id": "sess_a4b1f892",
"api_endpoint": "/chat",
"model_version": "gpt-4-turbo-2024-10",

"request_token_count": 245,
"request_length_chars": 1847,
"request_hash": "sha256:8f3a9c2e...",

"input_filter_applied": "injection_detector",
"input_filter_result": "rejected",
"input_filter_reason": "prompt_injection_high_confidence",

"output_generated": false,
"output_token_count": null,
"output_hash": null,

"output_filter_applied": null,
"output_filter_result": null,
"output_filter_reason": null,

"latency_ms": 145,
"user_flag": "test_user",
"error": null
}

Why Each Field Matters

Field	Purpose
timestamp + user_id + session_id	Reconstruct sequences. When did this user send requests? Burst or spread out? Multiple filters triggered in quick succession?
request_hash + output_hash	Log the hash, not content. Deduplication without storing sensitive data. Actual content lives separately with stricter access controls.
request/output_token_count + request_length_chars	Size signals. Large queries anomalous for some users. Oversized outputs might indicate hallucination or jailbreak.
filter_applied + filter_result + filter_reason	Which rule fired? What happened? Why? This is pure signal for investigation and tuning.
latency_ms	Unusual slowdowns indicate attacks (API bombing) or system problems. Include it.
user_flag	Manually tag users: "test account," "suspicious," "vip," "batch job." Surfaces patterns faster.

Storage & Access:

Store logs in JSON Lines format (one JSON object per line).
Ship to a log aggregation system (Datadog, Splunk, CloudWatch, ELK).
Query with time ranges, filters, and aggregations.
Full request/output content stored separately, encrypted, with shorter retention than metadata.

Building Alerting Rules (In Plain English)

Here are three real alerting rules. Each is written in plain language first, then expressed as a query you'd run against your logging system.

Rule 1: Sudden Rejection-Rate Spike

What it catches: Users testing your boundaries or probing your system.

Plain English: "Alert if a single user's rejection rate jumps to 2x their 7-day average and we see at least 10 rejections in the last 15 minutes from that user."

Why it matters: Users don't normally hit rejection filters in bursts. When they do, it usually means they're testing limits or someone's probing your defenses.

Query logic:

Calculate the user's rejection rate over the past 7 days.
In the rolling 15-minute window, count rejections.
If count > 10 and current_rate > 2x_baseline, trigger alert.

False-positive mitigation:

If the user is a known batch-job runner or test account, skip the alert.
If they're brand new (< 1 week active), use a higher threshold (3x instead of 2x).
If they're in a region with known high-latency issues, adjust baseline for that region.

Rule 2: Output Rejection Immediately After Input Rejection

What it catches: Active iteration around your safeguards.

Plain English: "Alert if a user's query is rejected by the input filter, and within 2 minutes they send another query that passes the input filter but is rejected by the output filter."

Why it matters: This is someone actively reworking their queries to slip past your defenses. First attempt blocked at input, they rephrase, it gets through input but fails output. That sequence is rare in normal usage.

Query logic:

Look for pairs of requests from the same user within a 2-minute window.
First request: input_filter_result = "rejected".
Second request: input_filter_result = "passed"but output_filter_result = "rejected".
If pair found, trigger alert.

False-positive mitigation:

If the two requests are for completely different topics (banking vs. weather), it's probably coincidence.
Add semantic similarity check if possible; otherwise, accept some false positives because the signal is strong.

Rule 3: Anomalous Token Consumption

What it catches: Sudden scaling attacks and model testing at volume.

Plain English: "Alert if a user consumes more than 3x their average daily token count in a single hour, and they've never previously exceeded that amount."

Why it matters: Sudden token spikes can indicate someone testing your model at scale, collecting outputs, or attempting a denial-of-service-like attack through API abuse.

Query logic:

Calculate each user's average daily token consumption over 30 days (rolling window).
For each hour, sum their token usage.
If hourly_sum > 3x_daily_average and user_has_never_hit_this_level_before, trigger alert.

False-positive mitigation:

If the user has told you they're running a batch job, suppress this alert for that duration.
If they're a data scientist who runs large experiments periodically, adjust their baseline upward.
Check user_flag field—if tagged "batch_job_scheduled," suppress alert.

Visual: Alert Tuning Workflow

Here's how to think about tuning thresholds over time:

Rendering chart...

The workflow: deploy conservatively, tune based on actual alerts, converge on a sweet spot where your team investigates real issues without burnout.

Your First Mini Lab: Building a Real Detection Rule

Let's make this concrete. You're going to implement a simple but powerful rule: flag when any user hits your API with more than 10 rejections in a rolling 10-minute window.

Your Task

Use the sample data below. Write a rule that, when run hourly, identifies which users crossed the threshold in the past hour. You don't need to write code—pseudocode or SQL-like queries are fine.

Sample Data

json

timestamp: 2025-11-08T14:00:00Z, user_id: user_123, result: passed timestamp: 2025-11-08T14:01:15Z, user_id: user_456, result: rejected (injection) timestamp: 2025-11-08T14:01:30Z, user_id: user_456, result: rejected (injection) timestamp: 2025-11-08T14:02:45Z, user_id: user_456, result: rejected (injection) timestamp: 2025-11-08T14:03:00Z, user_id: user_456, result: rejected (injection) timestamp: 2025-11-08T14:03:15Z, user_id: user_456, result: rejected (injection) timestamp: 2025-11-08T14:03:30Z, user_id: user_456, result: rejected (injection) timestamp: 2025-11-08T14:03:45Z, user_id: user_456, result: rejected (injection) timestamp: 2025-11-08T14:04:00Z, user_id: user_456, result: rejected (injection) timestamp: 2025-11-08T14:04:15Z, user_id: user_456, result: rejected (injection) timestamp: 2025-11-08T14:04:30Z, user_id: user_456, result: rejected (injection) timestamp: 2025-11-08T14:04:45Z, user_id: user_456, result: rejected (injection) timestamp: 2025-11-08T14:05:00Z, user_id: user_456, result: rejected (injection) timestamp: 2025-11-08T14:05:15Z, user_id: user_123, result: passed

The Rule in Pseudocode

FOR EACH user:
FOR EACH 10-minute window in the past hour:
count = COUNT(requests WHERE result = "rejected" IN this window)
IF count >= 10:
ALERT(user_id, count, filters_that_fired, time_window)

Expected Output

You should generate one alert:

ALERT: user_456 triggered 11 rejection events
(all prompt_injection_detector)
in 10-minute window 14:01–14:11 UTC

How You'd Implement This

In your log aggregation system:

Write a scheduled query (runs every hour).
Group logs by user_id and 10-minute time bucket.
Count rejections per bucket.
Filter for buckets where count >= 10.
Pipe results to your alerting service (PagerDuty, Slack, email).

What Success Looks Like

You run the query. It outputs one alert for user_456. You click through to their request logs and see the pattern: same IP, same filter, rapid-fire requests. You know what to investigate: that IP's intent, whether they're in a denylist, what their access should be going forward.

Visual: Logging & Filtering Architecture

Here's how data flows from request to alert:

Rendering chart...

This shows the full pipeline: requests flow through filters (each logging), logs aggregate, rules evaluate, and alerts fire when thresholds cross.

When Things Go Wrong: Troubleshooting Your Monitoring

If You're Missing Attacks (You Find Out Later)

Symptom: You discover a real attack through customer reports, not your alerts.

Root cause: You don't have enough signal, your thresholds are too high, or you're not logging the right patterns.

Diagnostic steps:

Are input_filter_result and input_filter_reason being logged? If not, you're blind to injection attempts.
Are output_filter_reason and rejection counts captured? If not, you can't detect output-phase attacks.

Corrective moves:

Lower thresholds immediately. Alert on 5 rejections in 10 minutes instead of 10. Alert on 1.5x baseline instead of 2x. Yes, you'll get more false positives. That's the cost of not missing attacks. Tune down later once you have data.
Review the attack you missed. What pattern did it follow? Did rejections cluster by time? By user? By query type? By IP? Add a rule for that pattern.
Add a signal. Maybe you're not detecting a particular attack type at all. Look at the failed attack: what was different about those requests? Log that difference.

If You're Over-Alerting (False Positives Flood Your Team)

Symptom: Your team is ignoring alerts, treating them as noise, or you're paging on things that obviously aren't incidents.

Root cause: Thresholds are too sensitive; you haven't accounted for legitimate variation.

Diagnostic steps:

Categorize your false positives. For each, ask: What was the user actually doing? (Testing limits, legitimate batch job, regional behavior spike, script automation, etc.)

Corrective moves:

Get specific, don't dial everything up. Instead of raising thresholds globally, adjust by context:
- By user: If user_456 is a known batch-job runner and 80% of their alerts are false positives, raise their threshold from 2x to 3x.
- By time: If most false positives happen during off-hours (3–6 a.m.), suppress that rule during those hours or use a higher threshold then.
- By endpoint: If /analyze generates more false positives than /chat, adjust the rule for that endpoint.
Suppress known benign patterns.
Tag users as "batch_job," "test," "data_scientist." Skip alerts for them (or use higher thresholds).
Review your baseline calculation.
If you're comparing against a 7-day average and someone always spikes on Fridays, your 7-day average might be inflated. Use a 30-day rolling window and check for seasonality.

If Latency Is Spiking But You're Unsure Why

Symptom: latency_ms field is high, but you don't know if it's your problem, the user's problem, or the model's.

Diagnostic steps:

Log latency by api_endpoint, model_version, and user_id separately.
Slice the data: Is latency high for all users or just one? All endpoints or just one? All model versions or one specific version?

Corrective moves:

If high latency is specific to one user & one endpoint:
Probably their network or hardware. Might also be their query size (very large prompts). Not your emergency, but worth documenting.
If high latency is global across one model version:
That version has a problem. Investigate model code or infrastructure for that version.
If high latency happens only during peak hours:
You need to scale. Latency is a symptom of capacity; more requests than your system can handle.
If it's intermittent:
Could be a specific query pattern that's expensive. Group latency_ms by request_token_count and request_length_chars. Do expensive queries consistently have high latency? If yes, the model/API is working as designed. If no, something's unstable.

Visual: Troubleshooting Decision Tree

When monitoring isn't working, this chart shows how to diagnose and fix:

Rendering chart...

This is your troubleshooting flowchart when things aren't working.

Summary & Conclusion

Monitoring an LLM system isn't about logging everything or logging nothing. It's about logging strategically: capturing enough signal to detect attacks and debug failures, without creating a privacy or compliance liability. The schema I've outlined—timestamp, user ID, query hash, filter results, token counts, latency—gives you that balance.

The hardest part isn't logging. It's alerting. Thresholds matter enormously. Too high, and you miss incidents. Too low, and your team burns out. Start aggressive (alert on smaller deviations), investigate every alert, and tune down as you learn what normal looks like in your system. Use baselines: every user is different, and anomalies are deviations from their normal, not from a global average.

Three patterns catch most attacks: rejection-rate spikes (someone probing your boundaries), rejection-then-harm sequences (active iteration around your defenses), and anomalous token consumption (someone testing your model at scale). These aren't fancy machine learning algorithms. They work because attackers behave differently than normal users, and different behavior leaves traces.

The payoff is real: when something goes wrong—an attack, a model glitch, a user misbehaving—you'll know about it as it happens, not after the fact. Your team investigates quickly. You're able to say, with evidence, "Here's what we saw, here's what we did, here's what we learned." That's the foundation of a resilient system and the difference between incident response and crisis management.

Next Steps

Implement the schema: Start logging the fields outlined above. You don't need perfect logging immediately—log enough to reconstruct what happened and why. If you're already logging unevenly, audit your current logs and fill the gaps. Most teams are missing either input_filter_result, output_filter_reason, or rejection counts. Prioritize those.
Set your first alert rule: Pick one of the three I described (rejection-rate spike is the easiest to start with). Implement it against your logging system. Run it against your historical logs and see what it would have caught over the past month. Did it catch something real? What false positives showed up? Document and tune. Go live with it.
Build your incident response runbook: When an alert fires, what happens next? Who gets paged? What's the first question your team asks? What's the process for blocking a user, rolling back a model version, or escalating? Write it down as a runbook. The best monitoring in the world is useless if no one knows what to do when it triggers. Your runbook should say: "If rejection-rate alert fires, (1) check these logs first, (2) here's the command to list the user's recent requests, (3) here's the decision tree for whether to block them, (4) here's who to loop in if it looks serious."

Learning Paths

Structured Learning

Follow guided learning paths from beginner to advanced. Master prompt engineering step by step.

Explore Paths

Continue Your Learning Journey

Ready to Master More? Explore our comprehensive guides and take your prompt engineering skills to the next level.

Explore More Guides Browse Learning Paths

Audit & Monitoring: Detecting When Things Go Wrong

Build detection systems that catch LLM attacks and anomalies in real time—without false-alarm fatigue. Learn what to log, which patterns signal trouble, and how to alert sustainably.

November 8, 2025

20 min read

Promptise Team

Intermediate

LLM SecurityMonitoring & ObservabilityIncident ResponseOperations