A practical guide for engineers and technical leaders building LLM systems in or for the EU. It cuts through the regulatory language to explain the EU AI Act's risk classification system, compliance requirements, and timeline, and what you actually need to do—complete with a realistic compliance scenario and an actionable checklist.
You've probably noticed the EU AI Act floating through your Slack channels and risk meetings. Maybe someone sent you a 100-page regulatory document and said "figure this out." You're not alone. The Act is real, it's coming into force in phases, and it will affect you even if your servers sit nowhere near Brussels—because the EU's regulatory reach is enormous, and most LLM deployments touch European users or data eventually.
The good news: underneath the legal language, the Act is trying to solve one problem. It wants to make sure that AI systems that could meaningfully harm people are built with transparency, tested rigorously, monitored closely, and kept under human control when it matters. That's not radical. It's actually closer to how you should be building LLMs anyway.
This guide cuts through the regulation to show you what it actually means for your work: what gets classified how, which requirements apply to you (and which don't), what compliance looks like in practice, and what you should start doing today. By the end, you'll know whether you're in scope, what you'd need to do if you are, and how to think about it without drowning in jargon.
The EU AI Act is a regulatory framework for artificial intelligence that entered into force in August 2024, with obligations phasing in through 2027. Think of it as a risk-based taxonomy: the EU sorted all possible AI uses into buckets—from "banned outright" to "basically no rules"—and assigned compliance requirements to each bucket.
The core principle is proportional regulation. If your AI system could cause serious harm (like deciding who gets a job or a loan), the Act says you have to prove you've thought about that, tested for it, documented what you know, and kept humans involved in critical decisions. If your system just translates text, the rules are lighter. If your system does something explicitly harmful (like mass surveillance or social credit scoring), it's banned entirely.
The Act applies to organizations that place AI systems on the EU market or make them available to EU users—regardless of where you're based. If you offer an LLM as a service to European customers, or if European users interact with your system, you're in scope. Geography doesn't protect you; deployment does.
The EU AI Act sorts AI systems into four categories, and your compliance obligations depend entirely on where your system lands. Understanding this taxonomy is the entire game.
Prohibited AI: Certain AI uses are banned outright. These are practices the EU considers so inherently harmful that no amount of safety engineering makes them acceptable. For LLMs, this includes using language models to manipulate people emotionally (e.g., a system designed to radicalize users at scale), real-time facial recognition in public spaces by law enforcement without strict judicial oversight, and social credit systems that assign scores to citizens based on AI assessment of behavior. These are rare for commercial LLM builders to stumble into, but it's worth knowing they exist. If you're even considering any of these, stop and talk to legal counsel instead of continuing.
High-Risk AI: This is the category that matters for most LLM work. High-risk systems are those where failure or misuse could cause significant harm to people's fundamental rights or safety. High-risk systems include AI used in hiring decisions (resume screening, candidate ranking), credit and lending decisions (loan approvals), law enforcement deployments (predictive policing, investigative search), immigration and border control decisions, and educational assessment systems that determine student outcomes. High-risk AI also includes biometric identification systems and systems that infer emotional states—though note that emotion recognition in workplaces and schools is prohibited outright, with narrow medical and safety exceptions.
For high-risk AI, the Act is strict: you must document your system thoroughly, validate it works as intended, identify risks and mitigate them, keep humans in the loop for critical decisions, maintain detailed logs, monitor performance after deployment, and report serious incidents to authorities.
General-Purpose AI (GPAI): This category emerged during the regulatory process as a response to large language models. General-purpose AI is a system that can be used for a wide range of tasks with little adaptation after initial release. ChatGPT is general-purpose AI. So is Claude. So is Llama. The EU recognized that one LLM can be deployed in dozens of ways—some safe, some high-risk—so it created a middle category with lighter-touch requirements.
Providers of general-purpose AI must comply with transparency requirements (document how your model was trained, what data you used, what capabilities and limitations it has), publish a summary of the training content, and make technical documentation available to regulators and downstream deployers. You must also put a copyright-compliance policy in place and implement safeguards against generating illegal content. But you don't need to validate the model for every possible use case, because you're not responsible for every use case—your customers are.
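In practice the transparency package is mostly structured documentation. As a rough illustration—the field names below are assumptions, not an official EU schema, so check current guidance for the required format—a minimal model-card skeleton might look like:

```python
# Hypothetical model-card skeleton for a GPAI provider. All values are
# illustrative; the EU's expected format may differ from this sketch.
model_card = {
    "model_name": "example-llm-7b",  # assumed, illustrative name
    "training_data": {
        "sources": ["licensed web corpus", "public code repositories"],
        "cutoff_date": "2024-06",
        "training_content_summary": "published separately",
    },
    "capabilities": ["text generation", "summarization", "translation"],
    "known_limitations": [
        "hallucinates citations",
        "weaker on low-resource languages",
    ],
    "safeguards": [
        "refusal training for illegal content",
        "copyright-compliance policy",
        "red-team evaluation before release",
    ],
}

# Sanity-check that the transparency sections exist before publishing.
required = {"training_data", "capabilities", "known_limitations", "safeguards"}
missing = required - model_card.keys()
print("missing sections:", sorted(missing))  # missing sections: []
```

The point isn't the format—it's that every transparency item named above has a concrete home in your documentation, so gaps are visible before a regulator finds them.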
Lower-Risk AI: Systems that don't fall into the above categories face minimal to no specific EU AI Act requirements. A spam filter? Recommendation algorithm that doesn't make high-stakes decisions? Chatbot that answers FAQ questions? These are typically lower-risk. The rules are light: maybe transparency for certain uses, but not the full machinery.
This is where the confusion often starts. An LLM isn't automatically in one category. The same model can be high-risk in one deployment and general-purpose (or lower-risk) in another. Location and use case matter.
Here's how to think about it:
If you built the LLM itself (not fine-tuned it, built it from scratch), you're a general-purpose AI provider. You document the model's training, capabilities, limitations, and safeguards. You register with the EU. You're done with baseline compliance—the downstream users are responsible for complying with high-risk requirements if they use your LLM in a high-risk way.
If you took an LLM and deployed it for a specific high-risk use, you're now operating a high-risk AI system. The base model provider's documentation helps you, but you're responsible for the full compliance stack: risk assessment, testing, documentation of how your application works, human oversight, logging, monitoring.
Consider a concrete example. OpenAI provides ChatGPT (general-purpose AI). OpenAI complies with GPAI transparency and safeguard requirements. A company takes ChatGPT and builds a hiring support system that screens resumes and flags candidates for interview. That company is now operating a high-risk AI system. They inherit some of OpenAI's documentation, but they have to add their own: how they're using the model, what testing they've done, what safeguards exist to keep humans involved in the final hiring decision, what risks they've identified (like bias against certain demographics), and how they've mitigated them.
Another example: a bank uses an LLM to power a customer service chatbot that answers questions about account status and basic product info. No financial decisions are made by the AI. This is likely lower-risk, not high-risk. The bank still has a transparency duty—users must be told they're interacting with an AI—but it's not subject to the full high-risk machinery.
The distinction hinges on whether the AI system makes or significantly influences a high-stakes decision about someone, or contributes to a decision that could affect fundamental rights. If yes, high-risk. If no, probably not.
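That hinge can be sketched as a first-pass triage function. This is a hypothetical helper, not a legal test—the domain list is illustrative, and a real classification needs counsel:

```python
# Illustrative first-pass triage, assuming the simplified categories
# described above. Not legal advice; the domain list is not exhaustive.
HIGH_STAKES_DOMAINS = {
    "employment", "credit", "insurance", "education",
    "law_enforcement", "immigration",
}

def triage_deployment(domain, influences_decision, is_base_model_provider):
    """Rough first-pass bucket for an LLM deployment."""
    if is_base_model_provider:
        return "general-purpose"   # GPAI provider obligations apply
    if domain in HIGH_STAKES_DOMAINS and influences_decision:
        return "high-risk"         # full compliance stack applies
    return "lower-risk"            # minimal obligations

print(triage_deployment("employment", True, False))         # high-risk
print(triage_deployment("customer_support", False, False))  # lower-risk
```

A real inventory would run something like this over every deployment, then hand the "high-risk" and "unsure" buckets to legal for validation.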
If you've landed in the high-risk category, the Act requires five concrete things. Let's walk through each in practical terms.
1. Documentation and Transparency
You must document what your system does, how it was built, what it's trained on, what it can fail at, and what oversight exists. This isn't a legal file you lock away. It's a manifest you'd hand to a customer or regulator that says: "Here's how this works, here are its actual limits, and here's what we do about them."
Practically, this means: a technical description of your LLM application (not the underlying model, but how you've deployed it, what data it accesses, what it outputs, what humans do with those outputs). A description of the training data sources (even if it's "fine-tuned on customer support conversations from 2023–2024"). Known limitations and failure modes (e.g., "The system can hallucinate citations; humans review all outputs before they reach users"). Documentation of how you've addressed risks (e.g., "We removed training data from sensitive demographics to reduce bias, and we audit outputs quarterly for fairness"). A user guide explaining what the system is and isn't intended for.
You don't need to open-source your weights or training data. You need to prove you know what's in the system and what it does.
2. Testing and Validation
You must validate that your system works as intended and doesn't cause unreasonable harms. This doesn't mean perfection—it means systematic testing and documented evidence that you've thought about failure modes and verified your safeguards work.
Practically: before you deploy, run tests. For a hiring LLM, that might mean: Does the system screen candidates equitably across demographic groups? What's the false-positive and false-negative rate? Does it flag edge cases (candidates with nonstandard backgrounds) or does it silently miss them? Are there prompts that cause it to hallucinate or refuse appropriately? Document the test results. Note failures and edge cases; don't hide them. Explain how you've handled them (retraining, adding human review steps, warnings to users). Keep these records.
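The subgroup measurement described above is straightforward to compute. A minimal sketch, assuming each held-out case records a group label, the model's call, and the expert's decision (field names are assumptions for the example):

```python
# Illustrative sketch of subgroup validation on a held-out set.
from collections import defaultdict

def subgroup_accuracy(records):
    """records: dicts with 'group', 'predicted', 'actual' keys."""
    hits, totals = defaultdict(int), defaultdict(int)
    for r in records:
        totals[r["group"]] += 1
        hits[r["group"]] += int(r["predicted"] == r["actual"])
    return {g: hits[g] / totals[g] for g in totals}

held_out = [
    {"group": "traditional",    "predicted": 1, "actual": 1},
    {"group": "traditional",    "predicted": 0, "actual": 0},
    {"group": "nontraditional", "predicted": 1, "actual": 0},
    {"group": "nontraditional", "predicted": 1, "actual": 1},
]
print(subgroup_accuracy(held_out))
# {'traditional': 1.0, 'nontraditional': 0.5}
```

The number itself matters less than the habit: break every headline metric down by the subgroups your risk assessment names, and keep the per-group results in your test records.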
3. Risk Management
You must identify plausible harms and show how you've mitigated them. This is systematic risk-thinking, not checklist-ticking.
For an LLM in hiring, plausible risks include: bias (the system disproportionately screens out candidates from certain demographics), hallucination (the system invents qualifications or inaccuracies), and over-reliance (recruiters trust the system too much and skip human review). Your risk management plan doesn't say "these harms are impossible." It says: "We've identified these risks. Here's how likely they are. Here's what we're doing to reduce likelihood and impact." Maybe you route all flagged candidates through human review (reduces reliance risk). Maybe you audit outputs by demographic quarterly (catches bias early). Maybe you add system prompts that tell the LLM to note uncertainty (reduces hallucination harm).
Document this thinking. Regulators want to see that you've asked hard questions and made deliberate choices, not that you've wished problems away.
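One lightweight way to make that thinking auditable is to keep the risk register as structured data rather than prose buried in a wiki. A sketch, with illustrative entries borrowed from the hiring example:

```python
# A minimal risk register as plain data. The format is an assumption;
# the point is forcing likelihood, impact, and mitigation to be explicit.
from dataclasses import dataclass

@dataclass
class Risk:
    name: str
    likelihood: str  # "low" | "medium" | "high"
    impact: str      # "low" | "medium" | "high"
    mitigation: str

register = [
    Risk("demographic bias", "medium", "high",
         "quarterly bias audit; alert on subgroup accuracy gaps"),
    Risk("hallucinated qualifications", "medium", "medium",
         "recruiters always read the original resume"),
    Risk("over-reliance on AI scores", "high", "medium",
         "recruiter training; monthly override audit"),
]

# Surface the entries that need attention first.
priority = [r.name for r in register if "high" in (r.likelihood, r.impact)]
print(priority)  # ['demographic bias', 'over-reliance on AI scores']
```

A register like this versions cleanly, diffs cleanly, and makes it obvious in review when a risk has no mitigation written down.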
4. Human Oversight
This is the element that catches builders off-guard. "Human oversight" doesn't mean a human checks everything (that defeats the purpose of using AI). It means humans remain in the loop for decisions that matter, and they have the tools and training to catch failures and override the AI when needed.
For high-risk LLM systems, human oversight is mandatory. Practically, this means: Define what decisions the AI informs. In hiring, the AI might score candidates; humans make the hiring decision. In lending, the AI might flag risk factors; a human reviews those factors and makes the credit decision. Ensure humans have the information they need to evaluate the AI's reasoning—not just a score, but enough detail to override it if it seems wrong. Train humans on what the AI can and can't do. Audit cases where humans overrode the AI to catch patterns (e.g., "Humans override the AI 40% of the time for candidates over age 55; maybe there's bias in the model").
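The override-audit idea at the end can be automated. A sketch, assuming each logged case records an input segment, the AI's recommendation, and the human's final call (the 30% flag threshold is an assumption to tune against your own baseline):

```python
# Illustrative override audit: flag input segments where humans
# disagree with the AI unusually often.
from collections import defaultdict

def override_rates(cases):
    """cases: (segment, ai_recommendation, human_decision) tuples."""
    overridden, total = defaultdict(int), defaultdict(int)
    for segment, ai, human in cases:
        total[segment] += 1
        overridden[segment] += int(ai != human)
    return {s: overridden[s] / total[s] for s in total}

cases = [
    ("under_40", "interview", "interview"),
    ("under_40", "reject",    "reject"),
    ("over_55",  "reject",    "interview"),  # human overrode the AI
    ("over_55",  "reject",    "interview"),  # human overrode the AI
]
rates = override_rates(cases)
flagged = [s for s, r in rates.items() if r > 0.30]  # assumed threshold
print(flagged)  # ['over_55'] -- a pattern worth investigating for bias
```

High override rates in one segment don't prove the model is biased—but they're exactly the signal that tells you where to look first.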
5. Logging and Monitoring
You must keep records of how your system operates and behaves in the real world. This isn't just for compliance—it's how you catch creeping problems.
Practically: log inputs, outputs, and human decisions. For an LLM processing hiring applications, log the resume input, the AI's ranking, the human's decision, and whether the AI's recommendation matched the human's choice. Retain logs for at least the Act's minimum period (six months), and longer where sector rules or your own audit cycle require it. Set up monitoring dashboards: Is output quality degrading? Are certain demographic groups being systematically screened out? Are humans overriding the system more often (a sign the model may be drifting or failing in new ways)? Establish an incident reporting process: if a serious failure occurs (the system outputs something harmful, or a false positive has real consequences), log it, investigate it, report it to authorities if required, and adjust the system.
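A minimal sketch of one such log record, assuming JSON-lines storage and an illustrative convention that a score of 7 or above means the AI recommends an interview (both are assumptions of this example, not requirements from the Act):

```python
# Illustrative structured logging for a hiring-screen deployment.
import json
import os
import tempfile
from datetime import datetime, timezone

def log_case(resume_id, ai_score, ai_summary, human_decision, path):
    """Append one decision record as a JSON line."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "resume_id": resume_id,
        "ai_score": ai_score,              # model's 1-10 relevance score
        "ai_summary": ai_summary,
        "human_decision": human_decision,  # "interview" or "reject"
        # Assumed convention: score >= 7 means the AI recommends interview.
        "ai_human_agree": (ai_score >= 7) == (human_decision == "interview"),
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
    return record

log_path = os.path.join(tempfile.gettempdir(), "ai_decisions.jsonl")
rec = log_case("r-1042", 8, "Strong backend experience", "interview", log_path)
print(rec["ai_human_agree"])  # True
```

Recording agreement at write time makes the weekly dashboard queries cheap: override rate is just the share of records where `ai_human_agree` is false.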
The EU AI Act is being phased in. Understanding the timeline helps you prioritize and plan without panicking.
The first wave of requirements came into force in February 2025: the prohibited practices are now binding. If your system violates these, you're breaking the law now.
General-purpose AI transparency obligations apply from August 2025, with enforcement powers ramping up after that. If you're a provider of a large language model or foundation model, start preparing your model cards, training documentation, and safeguard descriptions now. There is some runway before penalties bite, but you don't want to scramble.
High-risk AI requirements (documentation, testing, risk management, human oversight, logging) apply from August 2026 for most listed use cases, with AI embedded in regulated products following in 2027 and enforcement ramping up gradually. This gives you time, but not infinite time. Starting in 2025 is prudent because building compliance into your system is easier than retrofitting it later.
Lower-risk systems generally fall outside scope, but transparency may apply in specific cases.
What you should do now: If you operate or build an LLM deployed in the EU, inventory your use cases. Decide which are high-risk, which are general-purpose. For high-risk deployments, begin documentation and risk assessment. For GPAI, prepare model cards and training documentation. For both, set up monitoring and logging infrastructure if you don't have it. Legal counsel should help you finalize this categorization—it's not always obvious.
You're wondering: does this actually apply to me? Here's how to think through it.
First, location: Where are your users or data? If any material number of your LLM users are in the EU, or if your system processes EU resident data, you're in scope. This includes EU customers, EU visitors to your website or app, or EU data subjects whose information your LLM accesses. You can't opt out by saying "we're not based in the EU." The rule is where the system operates and who it affects.
Second, the use case: What does your LLM do? If it makes or significantly influences decisions about employment, credit, insurance, education, law enforcement, immigration, or fundamental rights, it's likely high-risk. If it's a consumer-facing chatbot that answers questions, provides information, or assists with writing (without making binding decisions), it's lower-risk. If it's a foundation model or general LLM you're offering to third parties, you're a GPAI provider.
Third, the jurisdiction: Are you subject to EU law? If you place systems on the EU market (sell them to EU customers, make them available to EU users, use their outputs in the EU, or are incorporated in the EU), yes. If you operate entirely outside the EU and have zero EU users or data, then no—but this is increasingly rare for any internet-connected system.
If you answer "yes" to scope but "unsure" on use case, the safest assumption is high-risk. You can downgrade later with legal guidance; you don't want to discover mid-audit that you're high-risk and have no documentation.
If you're genuinely uncertain, ask these questions: (1) Could the AI's output directly or significantly influence a major life decision for a person? (2) Could failure of the AI cause material harm to someone's rights, safety, or well-being? (3) Are you a provider of the model itself, or an implementer using someone else's model? (4) Do you have users or data in the EU? If you answer "yes" to questions 1 and 4, or if you're unclear on any of them, you should assume high-risk and consult with legal counsel. Better cautious than surprised.
Let's walk through what compliance actually looks like for a concrete case.
Imagine a mid-market HR tech company builds a resume-screening LLM. Candidates upload resumes. The LLM summarizes qualifications, flags relevant experience, and scores candidates on relevance to the job posting. Recruiters see the score and summary and decide whether to interview.
This is a high-risk system because it significantly influences hiring decisions, which affect fundamental rights (employment, livelihood) for EU residents. Here's what compliance requires:
Documentation: The company documents the system (we deployed an LLM for resume screening; it's based on Claude-3.5-Sonnet, fine-tuned on historical job postings and hired/not-hired labels from 2023–2024). They document training data sources (company hiring records, publicly available job descriptions, anonymized resume data from partner companies). They document how it works (the LLM receives a resume and job posting, generates a summary and relevance score 1–10, returns this to the recruiter). They document known limitations: the system can hallucinate qualifications; it may undervalue nontraditional backgrounds; it has less training data for niche roles. They document safeguards: all outputs are reviewed by a human recruiter before any decision; the system is audited quarterly for demographic bias; recruiters are trained that the score is a suggestion, not a decision.
Testing: Before launch, they tested the system on a held-out set of resumes (resumes they didn't train on). They measured: How often does the LLM's recommendation match the hiring decision of domain experts? (Accuracy: 78%.) How does accuracy vary by job level, industry, or demographic? (For mid-level technical roles, 85%; for entry-level roles, 71%; minimal variance by gender, but 8% lower accuracy for candidates with nontraditional education paths.) What failure modes show up? (The LLM sometimes misses leadership experience described in narrative form; it's more confident in flagging technical skills than soft skills.) They document this. They note the 71% accuracy for entry-level and decide to add a human-in-the-loop rule: for entry-level candidates, recruiters review the summary before it's shared with hiring managers.
Risk Management: They identify risks: (1) Bias—the system might systematically downrank candidates from certain demographics. Mitigation: quarterly bias audit across demographics; alert mechanism if any group has accuracy >2% lower than baseline. (2) Hallucination—the LLM might invent qualifications. Mitigation: recruiters always read original resume; output is a summary tool, not a decision. (3) Over-reliance—recruiters might trust the score too much. Mitigation: training program on how the LLM works and its limits; best practice guide for recruiters; periodic audit of hiring decisions to spot patterns (do recruiters screen out candidates the AI flagged, even when resumes were strong?).
Human Oversight: The company ensures recruiters have the original resume alongside the AI summary. The score is advisory; recruiters make the call. No candidate is rejected by the AI alone; all rejections are human-decided. Recruiters are trained on the system. Periodically (monthly), the company audits a sample of decisions: Did recruiters follow the AI recommendation? When they didn't, was it justified? What patterns emerge?
Logging and Monitoring: The company logs every resume processed, every AI output, every recruiter decision. They track: How many candidates are flagged? What's the distribution by job level and demographic? How often do recruiters override the AI? For what reasons? Are there patterns? They monitor weekly dashboards: Is the system's accuracy holding steady, or degrading? Have any groups become worse-represented in recommendations? They establish an incident protocol: if the LLM produces a particularly egregious output (invents major qualifications, or shows clear bias in a case), the team logs it, investigates, and decides if system retraining is needed.
Result: The company is high-risk compliant. They've documented what they're doing, tested it, mitigated known risks, kept humans involved, and set up monitoring. They have evidence if regulators ask. More importantly, they've built a better system: they've forced themselves to think about failure modes, they've caught bias early, and they've kept recruiters in control.
If you're building a high-risk LLM system, here's a checklist of what you need in place today to move toward compliance:
Documentation: Write a system description (one page: what it does, who uses it, what data it accesses, what it outputs). Write a training data summary (sources, rough size, date range, any PII or sensitive data included). Write a limitations document (this is hard; force yourself to list five things your LLM gets wrong or struggles with). Describe your safeguards (how you prevent the bad stuff from happening). Put these in version control or a shared drive. This is your baseline.
Testing: Run a test set on your LLM. Measure accuracy on your core metric. Measure accuracy broken down by important subgroups (if hiring, by seniority or background; if lending, by loan size or geography). Test edge cases (what does the LLM do with a resume that's two pages instead of one? What if the job posting is vague?). Log these results. Don't hide bad results; note them and explain how you handle them.
Risk Assessment: Spend two hours brainstorming: What are five bad things that could happen if this LLM failed or was misused? For each, note: How likely? How bad? What could you do to reduce likelihood or impact? Write this down. Share it with your team. This is your risk register.
Human Oversight: If your system informs high-stakes decisions, ensure humans have the information needed to override the AI. For an LLM that ranks candidates, provide summaries and original data alongside the ranking, not just the ranking. Train your humans (write a one-page guide on what the LLM is good and bad at). Audit decisions: Are humans overriding the AI more often on certain types of inputs? If yes, investigate why.
Monitoring: Set up a dashboard that tracks: (1) Volume (how many inputs are you processing?). (2) Accuracy or quality (is the LLM's performance holding steady?). (3) Subgroup parity (are certain demographics or groups getting different treatment?). (4) Human override rate (how often do humans disagree with the AI?). Check this weekly. If you notice a trend (accuracy down 5%, override rate up, bias metric spiking), investigate.
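Those dashboard checks reduce to a few comparisons against your baseline. A sketch with assumed thresholds matching the ones in the text (a 5-point accuracy drop and a 2-point subgroup gap; the override threshold is an illustrative addition):

```python
# Illustrative weekly dashboard checks. Thresholds are assumptions;
# tune them against your own baseline before relying on the alerts.
def weekly_checks(metrics, baseline):
    """Return alert strings; an empty list means nothing to investigate."""
    alerts = []
    if metrics["accuracy"] < baseline["accuracy"] - 0.05:
        alerts.append("accuracy dropped more than 5 points")
    if metrics["override_rate"] > baseline["override_rate"] + 0.10:
        alerts.append("human override rate climbing")
    acc = metrics["subgroup_accuracy"].values()
    if max(acc) - min(acc) > 0.02:
        alerts.append("subgroup accuracy gap above 2 points")
    return alerts

baseline = {"accuracy": 0.78, "override_rate": 0.12}
this_week = {
    "accuracy": 0.71,
    "override_rate": 0.14,
    "subgroup_accuracy": {"traditional": 0.80, "nontraditional": 0.72},
}
for alert in weekly_checks(this_week, baseline):
    print("ALERT:", alert)
```

Each alert should route to a human who investigates, not to an automated rollback: the point of monitoring under the Act is informed oversight, not silent self-correction.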
Logging: Ensure you're storing inputs, outputs, and human decisions for at least the Act's minimum retention period (six months)—and longer where sector rules or your own audit cycle demand it. Design logs so you can query them: "Show me all cases processed for this demographic in January" or "Which inputs caused the system to output this specific failure?" This is easier if you think about it upfront.
Start here. This is not perfect compliance, but it's a solid foundation. It's also how you should be building LLMs for high-stakes use anyway.
Take 20 minutes and do this exercise for one LLM system you work on.
Step 1: Describe your system. What does it do? Who uses it? What decisions does it inform?
Step 2: Identify your users and geography. Do any EU residents use this system or have their data processed by it? If you're unsure, assume yes.
Step 3: Categorize the use case. Does it inform employment, credit, education, law enforcement, or insurance decisions? Does it have the potential to affect fundamental rights? If yes, it's likely high-risk. If no, it's probably lower-risk or GPAI.
Step 4: For a high-risk system, audit yourself:
Documentation: Do you have a written description of how the system works? Do you know your training data sources? Can you list three limitations? Yes/no for each.
Testing: Have you validated the system works as intended? Have you measured performance on important subgroups? Yes/no for each.
Risk Management: Have you identified plausible harms? Have you documented how you mitigate them? Yes/no for each.
Human Oversight: Are humans in the loop for critical decisions? Do they have the information needed to override? Yes/no for each.
Monitoring: Are you logging inputs, outputs, and decisions? Are you tracking performance over time? Yes/no for each.
Step 5: Identify gaps. For any "no," write one sentence: what would it take to move to "yes"? Pick the three gaps that would have the biggest compliance impact and sketch a plan to close them in the next quarter.
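The yes/no audit above is easy to capture as data so the gap list falls out mechanically. A sketch with illustrative answers (the questions mirror the five requirement areas; your own list should be longer):

```python
# Illustrative self-audit: answer yes/no, get the gap list for free.
audit = {
    "documentation: written system description exists": True,
    "documentation: training data sources are known": True,
    "testing: system validated against intended behavior": False,
    "testing: performance measured on key subgroups": False,
    "risk: plausible harms identified and mitigations written down": True,
    "oversight: humans in the loop for critical decisions": True,
    "monitoring: inputs, outputs, and decisions are logged": False,
}

gaps = [question for question, done in audit.items() if not done]
print(f"{len(gaps)} gaps to close:")
for g in gaps:
    print(" -", g)
```

Run this per system and you have a portfolio-wide gap report in an afternoon—exactly the artifact to bring to your legal and product teams.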
This is your compliance roadmap. Share it with your legal and product teams. You now have a baseline and a direction.
"I'm unsure if we're in scope." Ask: Do we have material EU users or do we process EU resident data? If yes, assume in scope. If no, consider that many companies think they're not EU-exposed until they realize they are (a web app's analytics show 20% of traffic from the EU; a B2B service is used by European subsidiaries of customers). If you're genuinely uncertain, a 30-minute call with EU legal counsel is cheaper than discovering mid-audit that you are in scope and have no documentation. Lean cautious.
"Compliance seems overwhelming." It's not one task; it's five. Break it into chunks. Start with documentation (write down what you're doing; doesn't take long). Then testing (measure your system's actual performance; use existing test sets). Then risk assessment (brainstorm bad outcomes; write them down). Then set up logging and monitoring (this is often a technical project, but it's finite). Then work with legal on human oversight mechanisms. You don't do it all at once. Pick one for this quarter; pick another for next quarter. In a year, you're compliant.
"We're a GPAI provider and we're not sure what to do." Prepare a model card. Document your training approach (data sources, scale, date), your model's known capabilities and limitations, your safeguards (how you prevent illegal outputs, copyright violations, jailbreaks), and a safety assessment. Submit to the EU registry when it's ready. This is lighter than high-risk compliance, but it's non-optional if you're deploying a general-purpose model in the EU.
"Our legal team says this is too vague." It is vague. Regulations often are. What you do is document your good-faith interpretation: "We believe this system falls into the high-risk category because [reason]. We're complying by [approach]. If our interpretation is incorrect, we're open to guidance." Regulators are generally more forgiving of good-faith efforts than sneaky non-compliance. But don't guess alone—bring in counsel.
"The timeline is tight." Yes. The phases are staggered, but 2027 is not infinitely far away. Start now. Even if full compliance isn't until 2027, you want documentation, testing, and risk management in place by 2026 at the latest. Starting in late 2025 is prudent.
The EU AI Act is not a trap door. It's a regulatory framework that asks you to do explicitly what you should be doing anyway if you're building LLMs for consequential use: understand what your system does, test that it works, identify and mitigate risks, keep humans in the loop, and monitor for problems.
The Act is complex because AI risk is complex. But its core logic is simple: systems that could hurt people badly should be built carefully and kept under scrutiny. For LLM builders, this means most deployments that inform high-stakes decisions (hiring, lending, education, law enforcement) fall into the high-risk category, and high-risk systems require documentation, validation, risk management, human oversight, and ongoing monitoring.
You're not in scope if you're operating entirely outside the EU with zero EU users or data. If that's not you—if you have any material EU exposure—assume you're in scope, start categorizing your systems, and begin building compliance into your practices. The good news is that this doesn't mean overhaul. It means documenting what you're doing, testing it, catching edge cases early, and keeping humans involved in decisions that matter. That's sound engineering. It's also compliance.
Start with one system. Audit it against the requirements. Identify the biggest gaps. Close them. Move to the next system. In a year, you'll have a compliant portfolio. More importantly, you'll have better systems: more transparent, more tested, and more reliable.
Inventory your LLM systems. List every LLM or LLM-powered system your organization operates. For each, note: What does it do? Who uses it? Is it deployed in the EU or does it process EU data? This takes a few hours and is foundational. You can't plan compliance without knowing what you're building.
Categorize each system. Using the risk classification framework and the scope decision tree, decide whether each system is prohibited (rare), high-risk, general-purpose AI, or lower-risk. Legal counsel should validate your categorization, but you can do a first pass yourself. This clarifies priorities and requirements.
For high-risk systems, build a 90-day compliance roadmap. Pick one high-risk system. Set aside time this month to document it (write down what it does, how it was trained, what its limitations are). In month two, run tests and measure performance on subgroups. In month three, assess risks and plan monitoring infrastructure. At the end of quarter one, you have one compliant system. Rinse and repeat for the next system.
LEGAL DISCLAIMER
This guide explains the EU AI Act in plain terms for technical practitioners and LLM builders. It is not legal advice. The EU AI Act is complex, its interpretation is still evolving, and regulatory guidance continues to develop. This guide may contain inaccuracies or may oversimplify provisions. Before making compliance decisions—especially categorizing your system, implementing safeguards, or committing resources—consult with legal counsel familiar with EU AI law and your specific jurisdiction. This guide is educational only and does not constitute legal guidance or establish a lawyer-client relationship. Reliance on this guide without independent legal review is at your own risk.