Learn multimodal prompting by putting images first, designing a clear extraction schema, and adding guardrails like locale, units, and uncertainty. Practice with a receipt photo, compare against prose, and pick up practical tips for reliable JSON outputs.
Promise: By the end of this short guide you’ll know how to prompt vision-capable LLMs so they pull the right facts from an image—like totals from a receipt—without wandering off. You’ll learn a simple, reusable pattern: put images first, then tell the model exactly what to extract.
What this is: Image + text prompting means giving a model both pixels (an image) and instructions (text). The model “sees” the photo and follows your prompt to produce an answer.
Why it matters: Vision models can transcribe receipts, whiteboards, slides, or menus—but they also love to narrate. If you don’t constrain them, you get pretty descriptions instead of clean data. The fix is to control two things: input order and extraction schema.
Two quick definitions: Multimodal = a model that accepts more than one input type (e.g., images and text). Schema = a simple list of fields and formats you expect in the output.
Think in three moves:
Pixels first. Attach the image(s) before your text. Many models weigh early content heavily; leading with the photo helps them ground on what you want analyzed.
Schema second. Name the exact fields to extract and the format you want back (often JSON). When fields are unclear or missing, say how to mark them (e.g., null).
Rules third. Add small guardrails: language/locale, units, and an uncertainty policy (“If unreadable, return null and explain in notes.”).
Compact example
You upload a receipt image first, then send this user message:
Task: Extract structured data from the image above. Return JSON only: { "merchant": string | null, "date_iso": "YYYY-MM-DD" | null, "currency": "USD|EUR|..." | null, "subtotal": number | null, "tax": number | null, "total": number | null, "notes": string (short reason when a value is null) } Rules: - Use what’s printed on the receipt. Do not infer missing values. - If multiple totals, choose the one explicitly labeled "Total" or "Amount Due". - Dates: convert to ISO (YYYY-MM-DD). If ambiguous, use null and explain in notes.
💡 Insight: Models behave better when you name the field, show the type, and set an uncertainty path. That trio reduces hallucinations dramatically.
Imagine a slightly blurry café receipt. You want: merchant, date, currency, and totals—nothing else.
Step 1 – Prepare the input Crop extra background if possible. Keep the file size reasonable. If you have multiple pages, order them logically (page 1 first).
Step 2 – System prompt (one-time) Give the model its role and constraints once for the session. (A system prompt sets overall behavior.)
You are a careful vision assistant. When an image is provided, analyze it first. Extract only the requested fields, never invent values, and prefer precision over verbosity. If text is unreadable or missing, return null for that field and add a brief note in "notes". Output must be valid JSON that matches the schema exactly—no extra keys or commentary.
Step 3 – Put the image first Upload/attach the receipt image before your text instruction. If your tool uses a messages API, place the image content block before the text content block.
Step 4 – User instruction with schema and rules Paste a short task description, the JSON structure, and 2–3 crisp rules (see the example above).
Step 5 – Run and check Validate the JSON. If it fails, tighten language (“Return JSON only”), or add an uncertainty rule.
Step 6 – Iterate lightly If the model grabs the wrong total, add a clarifier: “Choose the line labeled ‘Total’ or ‘Amount Due’; ignore tip suggestions.”
💡 Insight: Treat the model like a junior analyst: show the evidence first, then give a short checklist. Short beats clever.
1) Starter system prompt Use this once per session to steer behavior.
SYSTEM — Vision Extraction Assistant Role: Extract structured facts from images with high precision. Principles: - Look at the image first; do not speculate beyond visible text/marks. - Follow the user’s schema exactly. If a field is missing/unclear, return null and explain briefly in "notes". - Prefer numbers as numbers, dates in ISO (YYYY-MM-DD), and standardized currency codes when visible. - Output JSON only. No prose.
2) Vague vs. precise (what not to do vs. what to do)
Vague:
What’s on this receipt?
Precise:
Extract the following from the image above and return JSON only: { "merchant": string | null, "date_iso": "YYYY-MM-DD" | null, "currency": "USD|EUR|..." | null, "subtotal": number | null, "tax": number | null, "total": number | null, "notes": string } Rules: - Use printed labels; no guesses. If the currency symbol is absent, set currency to null. - If multiple dates exist, pick the printed transaction date; otherwise null.
3) Receipt extractor (drop-in user prompt)
Task: From the image above, extract the fields below. Return JSON only, no markdown. Schema: { "merchant": string | null, "date_iso": "YYYY-MM-DD" | null, "currency": "USD|EUR|..." | null, "subtotal": number | null, "tax": number | null, "total": number | null, "payment_method": "card|cash|other|null", "notes": string } Rules: - Choose the value explicitly labeled "Total" or "Amount Due". - If numbers include currency symbols or commas, normalize to a number (e.g., 12.50). - If unreadable, set the field to null and explain briefly in "notes".
4) Bonus: Whiteboard photo to action items
Summarize the image above into: { "topics": string[] (3–6 short phrases), "decisions": string[] (0–5), "action_items": [{ "owner": string | null, "task": string, "due": "YYYY-MM-DD" | null }] } Rules: - Prefer exact phrases visible on the board. - If owner/due is missing, use null. - Output JSON only.
Most modern providers support image-first messages. If you’re using a service like Amazon Bedrock with a vision-enabled model:
Order matters: In your message content, list the image(s) before the text instruction. If multiple images, order them top-to-bottom as they should be read.
Ask for JSON: Say “Return JSON only matching this schema.” If the platform offers a JSON mode or tool, enable it.
Uncertainty path: Explicitly allow null plus a short notes field. This reduces made-up values.
Locale hints: If you expect dd.mm.yyyy dates or a specific currency, say so.
One shot at a time: Avoid mixing “describe” and “extract” in the same prompt. Do extraction first; ask for a description in a second turn if needed.
⚠️ Pitfall: “Temperature 0” doesn’t fix misreads. Clarity and schema discipline matter more than sampling settings.
If the model describes instead of extracting, emphasize “Return JSON only. No prose.” and include the schema block again.
If totals are misread or duplicated, state the selection rule (“Use the line labeled ‘Total’ / ‘Amount Due’”), and allow null when ambiguous.
If dates/currency are wrong, set explicit locale expectations (“Use European date format if visible; otherwise normalize to ISO”). Ask the model to include a short notes reason on uncertainty.
If the image has multiple pages, send them in order and say: “If values repeat across pages, prefer the last page’s ‘Total’.”
Trade-off to remember: tighter schemas reduce hallucination but can drop useful context. Keep a small notes field to capture brief explanations without opening the floodgates to prose.
Your task: Take a photo of a printed receipt (or use a sample image). Attach it first. Then paste the prompt below.
Extract from the image above and return JSON only: { "merchant": string | null, "date_iso": "YYYY-MM-DD" | null, "currency": "USD|EUR|..." | null, "subtotal": number | null, "tax": number | null, "total": number | null, "notes": string } Rules: - Use only visible text; no guesses. If a field is missing or unreadable, set it to null and explain in notes. - Dates: convert to YYYY-MM-DD. If multiple, choose the transaction date; otherwise null. - Totals: choose the line labeled "Total" or "Amount Due".
Expected output (example):
{
"merchant": "Sunrise Cafe",
"date_iso": "2025-03-14",
"currency": "EUR",
"subtotal": 11.5,
"tax": 0.92,
"total": 12.42,
"notes": "Currency symbol € visible next to total. Subtotal/Tax labels present."
}If your output doesn’t validate, adjust the rules, re-run, and note what changed. That observation is the learning.
The core idea is simple: show the image first, then give a tight schema and a few rules. This grounds the model in what matters and prevents narrative drift. When fields are unclear, a null plus a brief note beats a confident guess.
You trade a bit of flexibility for reliability. That’s worth it for receipts, IDs, menus, and slides where structure wins. Most issues—wrong totals, ambiguous dates, wall-of-text replies—vanish when you constrain format and set an uncertainty path.
Keep your prompts short, consistent, and explicit. Validate outputs. Iterate only when a failure mode appears, and fix it with one precise rule rather than a longer essay.
Next steps:
Try the lab with two different receipts (one clear, one crumpled) and compare notes.
Extend the schema with an items array (name, qty, price) and see how often it succeeds.
Run the same prompt on a whiteboard photo and adapt the rules for names/dates.
Follow guided learning paths from beginner to advanced. Master prompt engineering step by step.
Explore PathsReady to Master More? Explore our comprehensive guides and take your prompt engineering skills to the next level.