AI Trip-Planning Reality Check
Citations vs hallucinations: how Wendir's agents flag what they don't know
Every AI claim in Wendir ships with the agent's name, a citation link, and a confidence pill — HIGH, VERIFY, or LOW. Here's why we built it that way, what each level actually means, and how a user is supposed to act on the difference.
Quick answer: Wendir ships every AI output with three things — the agent's name, a citation link to the source, and a confidence pill (HIGH / VERIFY / LOW). HIGH means cross-checked against a verified source (Google Places, currency API, venue site). VERIFY means LLM-generated but not yet cross-checked. LOW means the agent tried to verify and couldn't. Three pills, three actions: trust HIGH for planning, use VERIFY as a starting point and check externally before booking, treat LOW as a warning. This is how the architecture operationalises EU AI Act Article 50 (effective 2 August 2026).
The single most important architectural decision in Wendir wasn't which LLM to use, which language to write the agents in, or how many of them to have. It was this:
Every AI output ships with the agent's name, a citation link, and a confidence pill.
That sentence sounds like product copy. It's not. It's the structural difference between a system that's safe to use for planning a real trip and one that produces confident-looking text that may or may not be true.
This piece explains how Wendir's verification layer actually works, what the confidence levels mean, and how you're supposed to act on the difference. If you've read why most AI trip planners are wrappers, this is the implementation side of the architectural argument made there.
The problem in one paragraph
LLMs hallucinate. This isn't a tuning bug — it's the architecture. A language model predicts the most plausible continuation of text. Sometimes the most plausible continuation is true; sometimes it's a coherent fabrication. The model doesn't know which is which, and crucially, the user doesn't either, because the hallucination looks identical to the truth.
In conversational use, this is a minor annoyance. In trip planning, it gets you to a restaurant that closed last year — with a deposit on the table.
How does Wendir's confidence model work?
Every output from a Wendir agent comes with one of three pills:
HIGH (green)
The agent produced this output and cross-checked it against a verified source.
For places (live in Scout and Local): Google Places API agreement on the venue's existence, hours, and rough category — plus the venue's official site if one exists. For prices: a recent currency-rate API, or the venue's official site. Transit and weather verification ship alongside the Phase 2 agents (Concierge, Reshuffler) that depend on them; same HIGH/VERIFY/LOW model, applied as those agents come online.
A HIGH pill comes with a citation link the user can tap to see the source. The source is named — "Google Places, verified 2 minutes ago" — not just "checked." If the source disagrees with the LLM's initial output, the HIGH pill is the cross-checked answer, not the LLM's guess.
How to act on HIGH: trust it for planning. Verify directly only for high-stakes decisions (non-refundable bookings, allergy-related restaurants, anything with serious downside if wrong).
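To make the cross-check concrete, here's a minimal sketch of the kind of logic involved, assuming a thin hypothetical wrapper (fetchPlace) around the Google Places API. None of these names are Wendir's actual code.

```ts
// Illustrative sketch only. fetchPlace stands in for a thin wrapper around
// the Google Places API; none of these names are Wendir's implementation.
interface PlaceRecord {
  exists: boolean;
  hours: string;
  category: string;
}

async function crossCheckPlace(
  claim: { name: string; hours: string },
  fetchPlace: (name: string) => Promise<PlaceRecord | null>,
): Promise<"HIGH" | "VERIFY" | "LOW"> {
  const record = await fetchPlace(claim.name);
  if (record === null) return "VERIFY"; // source unavailable: honest yellow, not a guess
  if (!record.exists) return "LOW";     // contradiction: the venue isn't there
  // Hours agreement separates HIGH from LOW here; a fuller check would also
  // compare the rough category and the venue's official site.
  return record.hours === claim.hours ? "HIGH" : "LOW";
}
```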
VERIFY (yellow)
The agent produced this output via the LLM and hasn't yet been able to cross-check it.
This could be because:
- The claim is about something not in any verifiable database (a recommendation, a vibe assessment, an opinion).
- The verification sources are temporarily unavailable.
- The claim is about a real entity but our verification layer doesn't yet cover that category.
A VERIFY pill is honest: the LLM thinks this; we haven't checked. The user is told not to treat it as established fact.
How to act on VERIFY: use it as a starting point. Before booking, do a quick external check — Google the place, look at recent reviews, hit the official site. The system is telling you what it doesn't know.
LOW (red)
The agent tried to verify and couldn't, and there are additional red flags: the model itself hedged ("I may be wrong about this"), the claim contradicts what the verification source said, or the verified data is stale.
LOW pills are explicit warnings.
How to act on LOW: treat as unreliable. Verify externally before any decision. Sometimes the right action is to discard the suggestion entirely.
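In any implementation of this model, the pill has to be data attached to the claim, not decoration painted on afterwards. A minimal sketch of what that shape could look like; the field names are illustrative assumptions, not Wendir's actual schema.

```ts
// Sketch of a pill-tagged claim as a data structure. Field names are
// illustrative assumptions, not Wendir's actual schema.
type Confidence = "HIGH" | "VERIFY" | "LOW";

interface Citation {
  sourceName: string; // named, e.g. "Google Places", rather than just "checked"
  url: string;        // the tappable link behind the pill
  verifiedAt: Date;   // drives the "verified 2 minutes ago" label
}

interface AgentClaim {
  agent: string;               // e.g. "Scout", "Local", "Moneybags"
  text: string;                // the claim shown to the user
  confidence: Confidence;
  citation?: Citation;         // present on HIGH; absent when nothing was cross-checked
  uncertaintyNotes?: string[]; // populated on LOW: why verification fell through
}
```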
Why three levels and not five (or fifty)
A continuous confidence score sounds more scientific but is worse in practice. "79% confident" doesn't tell a user what to do; it makes them mentally bin the number into "high enough" or "not high enough." Better to bin it for them, into the smallest number of buckets that maps to distinct actions:
- HIGH → trust for planning.
- VERIFY → use as a starting point, check before committing.
- LOW → don't trust; verify externally first.
Three actions, three pills. Simpler is better. The internal scoring inside the agent layer is more granular than this; the user-facing simplification is the right interface.
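As a sketch of that binning, assuming a granular internal score in [0, 1] and a three-way verification outcome (the names and the 0.5 threshold are hypothetical):

```ts
// Hypothetical binning of a granular internal score plus a verification
// outcome into the three user-facing pills. The 0.5 threshold is illustrative.
type VerificationOutcome = "confirmed" | "unavailable" | "contradicted";

function toPill(
  internalScore: number, // the agent layer's granular score, assumed in [0, 1]
  outcome: VerificationOutcome,
): "HIGH" | "VERIFY" | "LOW" {
  if (outcome === "confirmed") return "HIGH";   // cross-checked and agreed
  if (outcome === "contradicted") return "LOW"; // source disagrees with the LLM
  // No verification available: fall back on the model's own uncertainty.
  return internalScore >= 0.5 ? "VERIFY" : "LOW";
}
```

Whatever the internals look like, the user-facing pill is a function of two things: what the model thinks and what the world says.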
What gets verified
Each Wendir agent has its own verification surface. The matrix below splits live agents (shipping in closed beta) from roadmap agents (Phase 2 and Phase 3) so the distinction is honest:
| Agent | Job | Verification source | Confidence floor | Phase |
|---|---|---|---|---|
| Scout | Extracts places from links | Google Places API | HIGH if place exists + hours match; else VERIFY | Live (closed beta) |
| Local | Verifies a place against real-world data | Google Places + venue site if available | HIGH on multi-source agreement; LOW on contradiction | Live (closed beta) |
| Moneybags | Expense math + settlement | Deterministic algorithm (no LLM in the loop) | Always HIGH — refuses to output if numbers don't reconcile | Live (closed beta) |
| Booker | Booking intents | Designed for human confirmation tap-through; no autonomous booking | (Designed for Phase 3) | Phase 3 — not yet live |
| Concierge | On-the-ground queries | Designed for real-time hours/location via APIs | (Designed for Phase 2) | Phase 2 — not yet live |
| Reshuffler | Re-plans on disruption | Designed for FlightAware / weather APIs as triggers | (Designed for Phase 2) | Phase 2 — not yet live |
| Memorykeeper | Post-trip highlight reel | Curation tool — no factual claims to verify | N/A by design | Phase 3 — not yet live |
The architecture is the same across all seven; what differs is which ones have shipped. For the closed-beta MVP loop (create → invite → propose → vote → plan → settle), the live three — Scout, Local, Moneybags — are what carry the verification work. Booker, Concierge, Reshuffler, and Memorykeeper will ship behind the same verification model as they come online.
Every agent declares its verification floor. Some have hard floors (Moneybags can't ship a LOW; if the math doesn't add up, the agent refuses to output rather than guess). Others ship VERIFY by default and graduate to HIGH only when verification succeeds.
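The Moneybags floor is worth seeing in miniature because it's the degenerate case: no LLM, pure arithmetic, refuse rather than guess. A minimal sketch, with names and shapes that are assumptions rather than Wendir's code:

```ts
// Sketch of a hard verification floor in the spirit of Moneybags: deterministic
// reconciliation with no LLM in the loop. All names and shapes are assumptions.
interface Settlement {
  from: string;
  to: string;
  amountCents: number; // integer cents: no floating-point drift in money math
}

function settleOrRefuse(
  paidCents: Map<string, number>, // what each person actually paid
  owedCents: Map<string, number>, // each person's fair share
  settlements: Settlement[],      // proposed transfers to even things out
): Settlement[] {
  // balance > 0: the group owes this person; balance < 0: they owe the group.
  const balance = new Map<string, number>();
  for (const [person, paid] of paidCents) {
    balance.set(person, paid - (owedCents.get(person) ?? 0));
  }
  for (const s of settlements) {
    balance.set(s.from, (balance.get(s.from) ?? 0) + s.amountCents);
    balance.set(s.to, (balance.get(s.to) ?? 0) - s.amountCents);
  }
  for (const [person, b] of balance) {
    // Hard floor: refuse to output rather than ship numbers that don't reconcile.
    if (b !== 0) throw new Error(`refusing to settle: ${person} is off by ${b} cents`);
  }
  return settlements; // reconciled: this output can always carry a HIGH pill
}
```

The design choice this encodes: money math should be a proof, not a prediction, so failure is a refusal rather than a lower pill.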
What it looks like in the UI
When you look at a Scout-proposed place in the Wendir beta, you don't just see "Fushimi Inari — open 24 hours."
You see something like:
Fushimi Inari — open 24 hours 🟢 HIGH · Scout · verified against Google Places, 3 min ago
The agent name, the confidence pill, the source timestamp. The interaction patterns — tapping the pill to open the source, tapping the agent name to see what it did, tapping the timestamp to see refresh time — are the designed UX; some are live in closed beta and some land progressively as we ship the supporting agent layer.
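A sketch of how that line could be composed; the formatting mirrors the example above, and the function and field names are assumptions:

```ts
// Sketch only: composing the claim line shown above. Names are illustrative.
function renderClaim(claim: {
  text: string;
  agent: string;
  confidence: "HIGH" | "VERIFY" | "LOW";
  citation?: { sourceName: string; verifiedAt: Date };
}): string {
  const pill = { HIGH: "🟢 HIGH", VERIFY: "🟡 VERIFY", LOW: "🔴 LOW" }[claim.confidence];
  const source = claim.citation
    ? ` · verified against ${claim.citation.sourceName}, ` +
      `${Math.round((Date.now() - claim.citation.verifiedAt.getTime()) / 60000)} min ago`
    : "";
  return `${claim.text} ${pill} · ${claim.agent}${source}`;
}

// renderClaim({ text: "Fushimi Inari — open 24 hours", agent: "Scout",
//   confidence: "HIGH",
//   citation: { sourceName: "Google Places", verifiedAt: new Date(Date.now() - 3 * 60_000) } })
// returns "Fushimi Inari — open 24 hours 🟢 HIGH · Scout · verified against Google Places, 3 min ago"
```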
This is the operationalisation of the EU AI Act Article 50 requirement that takes effect from 2 August 2026 — AI-generated output must be clearly disclosed as AI-generated. We're not building it because of the regulation. The regulation happens to match the right architecture for an AI system you'd actually trust with a real trip.
What we don't do
A few honest negatives:
- We don't claim every output is verifiable. Some recommendations are LLM-generated and clearly marked as VERIFY. We don't pretend otherwise.
- We don't run a real-time fact-check on every word of LLM prose. The confidence applies to specific claims — places, hours, prices, transit — not to every sentence of qualitative description. A neighbourhood being "good for evening strolls" is opinion; no pill.
- We don't gate output on verification. If a verification source is slow, the user gets the LLM's output marked VERIFY rather than nothing. Honest signal beats no signal.
- We don't auto-correct LLM mistakes silently. If the LLM said one thing and the verification source said another, both are shown. The system says "Scout thought X; Local verified Y; difference flagged" rather than just showing Y. The user gets to see the disagreement.
That last one matters. Hiding the LLM's mistake makes the system look better and the user worse off — they can't develop an intuition for when the LLM is unreliable. Showing the disagreement is honest and over time calibrates trust.
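In code terms, the non-negotiable is that the disagreement survives as data all the way to the user. A sketch, with all shapes assumed:

```ts
// Sketch of surfacing a disagreement instead of silently auto-correcting.
// The shapes are assumptions; the point is that both values reach the user.
type Resolution =
  | { kind: "verified"; value: string }                           // shown as HIGH
  | { kind: "unverified"; value: string }                         // shipped as VERIFY, not withheld
  | { kind: "disagreement"; llmClaim: string; verified: string }; // both shown, flagged

function reconcile(llmClaim: string, verified: string | null): Resolution {
  if (verified === null) return { kind: "unverified", value: llmClaim };
  if (verified === llmClaim) return { kind: "verified", value: llmClaim };
  // Never return just `verified` here: hiding the LLM's mistake would stop
  // users from calibrating when the model is unreliable.
  return { kind: "disagreement", llmClaim, verified };
}
```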
How a user is supposed to act on confidence
Three rules, in order:
- HIGH outputs are planning-grade. Build your day around them. Verify directly only when stakes are unusual (anaphylaxis, non-refundable, group-veto-territory).
- VERIFY outputs are starting points. Before any commitment — booking, reservation, deposit — do a 30-second external check. This is the work the system is honest about not doing for you.
- LOW outputs are warnings, not commands to ignore. The agent is saying "I tried and it's not solid." Sometimes the right move is to ask the agent to try a different source. Often it's to discard the suggestion.
The goal isn't to make the user do more work than a wrapper. It's to make the work distribution honest. A wrapper makes the user implicitly do all the verification with no signal of when. Wendir tells the user which 20% of the output actually needs their attention.
The counter-take
Some people prefer the wrapper's approach: clean prose, no clutter, just answer the question. The "pills and citations everywhere" model is, for some users, visual noise.
Fair. The compromise we ship: pills are visible by default but the citation links are one tap away, not in your face. (A "hide pills" preference is on the roadmap for users who don't want the chrome, but it's not in the current beta — we'd rather get the default right first.)
The underlying architecture isn't optional either way. Whether or not you see the pills, the system still runs the verification — the difference is just whether you see the result of it. The user-facing interface is something we expect to make configurable. The verification layer underneath isn't.
The shortest version
If you only remember three things:
- LLM hallucination is architectural, not a tuning bug. It cannot be "fixed" inside a single LLM. The fix is a verification layer that knows when to trust the model.
- Wendir ships every AI claim with HIGH / VERIFY / LOW confidence and a citation link. Three actions, three pills. The user knows which 20% of the output needs their attention.
- The work distribution is honest, not hidden. A wrapper makes you verify everything implicitly. Wendir tells you where to look.
Where this fits
This is the third and final piece in the AI Reality Check lane. The series:
- We asked ChatGPT, Gemini, and Claude to plan the same group trip — the experimental evidence.
- Why most AI trip planners are wrappers — the structural argument.
- This piece — the verification layer in practice.
The Manual companion:
- How to plan a group trip without becoming the unpaid PM — the system this architecture supports.
- How to split travel expenses — the place where deterministic math beats LLM math.
Closed beta, iOS-first. Waitlist.
Written by the Wendir team. Last updated: 15 May 2026.
Common questions
What is an AI hallucination?
A hallucination is a confident, plausible-sounding claim from a language model that turns out to be wrong. For travel: a restaurant said to be open at 9am when it's actually closed; a price quoted that's three years out of date; an address that doesn't exist. The model isn't lying — it's predicting the most plausible next text, which is sometimes true and sometimes not.
How does Wendir handle hallucinations?
Three ways. (1) Every AI output ships with a confidence pill — HIGH, VERIFY, or LOW — so the user knows what's checked. (2) Where possible, agent output is cross-checked against a real source (Google Places, currency APIs) and citations are linked. (3) When the system genuinely doesn't know, it says so explicitly rather than producing a confident-sounding guess. The user can challenge any claim with one tap.
What does each confidence level mean?
HIGH: cross-checked against a verified source (Google Places, official venue site, real-time API). VERIFY: the LLM produced this but it hasn't been cross-checked — treat as a suggestion, not a fact. LOW: the agent is uncertain even after attempted verification — proceed with extra care. Don't book a non-refundable thing on a VERIFY without a quick external check.
Why not just always cross-check everything?
Some things have no good source (the vibe of a neighbourhood, a recommendation for a hidden bar, a creative suggestion). Forcing verification on those produces worse output — empty results where a sensible LLM guess would have helped. The honest answer is to mark which is which, not pretend everything can be verified.
What's the EU AI Act Article 50 thing about?
From August 2026, the EU AI Act Article 50 requires AI-generated output to be clearly disclosed as AI-generated. Wendir's 'Generated by AI' labels, confidence pills, and citations are designed to meet this requirement ahead of the deadline — and it's the right architecture regardless of regulation.