AI Trip-Planning Reality Check
Why most "AI trip planners" are wrappers — and what a real agent system looks like
A wave of "AI trip planners" launched in the last 18 months. Most are a single LLM behind a chat UI. Here's the structural difference between a wrapper and a system, why it matters for trip planning specifically, and how to tell which kind you're using.
Quick answer: most "AI trip planners" launched in 2024-2025 are a single LLM behind a chat UI with travel-themed prompting — same architecture, different branding. Five tells of a wrapper: prose output (not structured state), no real-world verification, no concept of multiple users, no expense math, no persistence between sessions. A real agent system has specialised agents per job, a tool layer for ground truth (Google Places, currency, transit), multi-user state, deterministic algorithms (debt simplification), and citations + confidence on every output.
In the last 18 months, dozens of "AI trip planners" have launched — Mindtrip, Wonderplan, Layla, Roam Around, Vacay, Curiosio, GuideGeek, and ChatGPT itself, which most people use this way without thinking of it as a "trip planner." (Naming them as examples of the category, not as a code-level audit of any specific product — each may have features that go beyond the wrapper baseline.)
If you used one and felt like you'd just had a slightly more enthusiastic conversation with a search engine, you weren't wrong. The category-default architecture — what most of these tools ship as v1 — is the same shape: a single LLM behind a chat UI, with prompt engineering to bias the output toward travel content. The differences are mostly in branding, the UI shell, and which underlying model they wrap.
This piece is the structural argument for why that shape isn't the right one for group trip planning, what a real agent system looks like, and how to tell them apart.
If you want the experimental evidence — three frontier LLMs asked to plan the same trip — see the LLM comparison piece. This piece is the structural critique that complements it.
What does an LLM-wrapper AI trip planner actually look like under the hood?
The minimal viable "AI trip planner" is about 200 lines of code:
- A web UI with a chat box.
- A system prompt that says "you are a friendly trip planner. When asked about a destination, suggest activities, restaurants, and a rough day-by-day plan…"
- A call to GPT-4 or Claude or Gemini.
- The response, rendered as Markdown.
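That weekend build is small enough to sketch. The model call below is stubbed (a real wrapper would hit a chat-completion API); everything around it is, in effect, the entire product. Names are illustrative, not any specific product's code:

```python
# Illustrative sketch of the wrapper architecture described above.
# The model call is stubbed; a real wrapper would hit an LLM API here.

SYSTEM_PROMPT = (
    "You are a friendly trip planner. When asked about a destination, "
    "suggest activities, restaurants, and a rough day-by-day plan."
)

def call_llm(system_prompt: str, user_message: str) -> str:
    """Stand-in for a chat-completion call (GPT-4, Claude, Gemini)."""
    return f"Day 1: a plausible-looking plan for {user_message}..."

def plan_trip(user_message: str) -> str:
    # The entire "product": one prompt, one call, Markdown back to the UI.
    return call_llm(SYSTEM_PROMPT, user_message)
```

Note what's absent: no tool calls, no state, no users, no verification. That absence is the rest of this piece.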
You can build this in a weekend. Many people have. The output is impressive on first contact — the LLM has read enough travel content to produce coherent, plausible-looking itineraries. The user feels like they got something for free, and the wrapper takes the credit.
The problems are not in the LLM. They're in everything the wrapper doesn't have:
- No connection to a real-world database (Google Places, transit APIs, weather, opening hours).
- No state that survives the conversation.
- No model of "multiple users with separate preferences."
- No algorithmic component (math, sequencing, settlement).
- No verification of any specific claim the LLM makes.
The LLM is genuinely doing its best. It's predicting the most plausible continuation of "what's a good day plan for Kyoto," and producing one. Plausible and correct are different things, and the wrapper has nothing in place to distinguish them.
How can I tell if an AI trip planner is just a wrapper?
How do you spot a wrapper without reading the source code?
1. The output is prose, not structured data
A wrapper produces text. "Day 1: Start your morning at Kiyomizu-dera, then walk down to…" This reads well but isn't queryable. You can't filter it, sort it, share a subset, or update one stop without the LLM rewriting the whole thing.
A real system produces structured state — a list of stops with times, durations, transit links, costs, who proposed it, who voted yes, who's allergic to what. The prose can be generated from the state on demand. The state is the durable thing.
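A minimal sketch of what "structured state" means here, using hypothetical field names rather than any product's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class Stop:
    name: str
    start: str                 # e.g. "09:00"
    duration_min: int
    cost: float
    proposed_by: str
    votes_yes: set[str] = field(default_factory=set)

day1 = [
    Stop("Kiyomizu-dera", "09:00", 90, 400.0, "sara", {"sara", "david"}),
    Stop("Nishiki Market", "11:30", 60, 0.0, "david", {"sara"}),
]

# Queryable: filter, sort, or update one stop without rewriting the rest.
free_stops = [s.name for s in day1 if s.cost == 0]
day1[1].votes_yes.add("david")
```

The prose itinerary can be rendered from `day1` on demand; editing one stop never requires regenerating the others.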
2. It can't verify a single specific claim
Try asking: "Is Restaurant X open on Tuesdays at 7pm?"
A wrapper will produce a confident answer. "Yes, X is open Tuesday 7pm." The answer is wrong a meaningful fraction of the time — anyone who's used an LLM for operational facts has hit this — and the LLM has no way of telling you which times it's wrong.
A real system either: (a) hits a real-world API (Google Places, OpenTable) for the answer and cites the source, or (b) explicitly says "I don't know — verify this directly." The first is a real system. The second is a wrapper that's at least honest about its limits.
If you can't get any verification, citation, or confidence score on the LLM's specific claims, you're using a wrapper.
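The verify-or-admit behaviour described above can be sketched as follows. The `hours_api` hook stands in for a real lookup such as Google Places; the names and the `Claim` shape are illustrative assumptions:

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Claim:
    text: str
    source: Optional[str]   # where the answer came from, or None
    confidence: str         # "HIGH" when verified, "VERIFY" when not

def check_open(venue: str, weekday: str,
               hours_api: Optional[Callable[[str], set]] = None) -> Claim:
    """Answer an opening-hours question only from ground truth."""
    if hours_api is not None:
        open_days = hours_api(venue)   # e.g. a Google Places lookup
        verdict = "is" if weekday in open_days else "is not"
        return Claim(f"{venue} {verdict} open on {weekday}",
                     source="places-api", confidence="HIGH")
    # No tool wired up: admit it instead of guessing.
    return Claim(f"Unverified - check {venue}'s hours directly",
                 source=None, confidence="VERIFY")
```

The key design point: the honest "VERIFY" path exists at all. A wrapper only has the confident path, with no source behind it.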
3. No concept of multiple users
Ask: "Sara doesn't eat seafood and David doesn't drink. Plan around both."
A wrapper will produce a plan that gestures at the constraint. "For Sara, the vegetarian curry. For David, the mocktails." But there is no concept of Sara and David as users in the system — they're just names in this prompt, gone the moment the conversation ends.
A real system has user objects with persistent preferences, dietary tags, and veto rights. The next time someone proposes a restaurant, the system already knows what Sara won't eat.
4. It can't settle expenses
Ask: "We spent A$3,000 across four people, here are the receipts: Sara paid X, David paid Y… how do we settle?"
A wrapper will produce a plausible-looking calculation that is often wrong and has no way to verify itself. The right answer requires running debt simplification — an algorithm, not a language task.
A real system runs the algorithm. The output is deterministic, verifiable, and the same every time.
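Debt simplification is small enough to sketch. This is a greedy net-balance version, one common approach rather than any specific product's implementation: net each person against the equal share, then match debtors to creditors.

```python
def settle(paid: dict[str, float]) -> list[tuple[str, str, float]]:
    """Greedy debt simplification: net each person against the equal
    share, then match debtors to creditors until everyone is at zero.
    Works in integer cents so the arithmetic is exact."""
    cents = {p: round(amt * 100) for p, amt in paid.items()}
    share, rem = divmod(sum(cents.values()), len(cents))
    people = sorted(cents)
    # The first `rem` people absorb one leftover cent each so the
    # balances sum to exactly zero.
    balance = {p: cents[p] - share - (1 if i < rem else 0)
               for i, p in enumerate(people)}
    debtors = sorted((p for p in balance if balance[p] < 0),
                     key=lambda p: balance[p])
    creditors = sorted((p for p in balance if balance[p] > 0),
                       key=lambda p: -balance[p])
    transfers, i, j = [], 0, 0
    while i < len(debtors) and j < len(creditors):
        d, c = debtors[i], creditors[j]
        amt = min(-balance[d], balance[c])
        transfers.append((d, c, amt / 100))
        balance[d] += amt
        balance[c] -= amt
        if balance[d] == 0:
            i += 1
        if balance[c] == 0:
            j += 1
    return transfers
```

On a concrete split (A paid $200, B paid $300, C paid $0, three-way equal split), the balances are A +$33.33, B +$133.33, C -$166.67, so C pays B $133.33 and pays A $33.33. Run it twice, get the same answer twice.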
5. The conversation has no persistence
Close the tab. Come back tomorrow. Type "where were we?"
A wrapper has nothing. The whole conversation history is the only context, and beyond a certain size that gets truncated. There's no concept of "this trip" as a durable object — there's just a chat that may or may not still be in scope.
A real system has trips, ideas, votes, expenses, day plans — each one a persistent object you can return to, update, and share. The conversation is one surface; the state is another.
What a real agent system has
The architecture that does work for group trip planning has five components most wrappers don't:
Specialised agents
Different jobs need different tools. Verifying a restaurant's opening hours against Google Places is not the same kind of task as generating prose. Computing expense settlement is not the same as suggesting activities. Treating them all as "things to ask the LLM" is why every wrapper fails on the non-LLM tasks.
A real system has separate components — call them agents, services, or just functions — that each do one thing. They hand off to each other through a defined interface.
Wendir is built around seven specialised roles — Scout, Local, Moneybags (live in closed beta), Booker, Concierge, Reshuffler, Memorykeeper (Phase 2 and 3, shipping after MVP clears beta). Some are LLM-backed. Some aren't. The user sees one workspace; the system underneath is many tools.
See the agent runtime spec for our take on how each one cites its sources.
A tool layer
The agents call out to real APIs for ground truth:
- Google Places for venue verification, hours, reviews.
- Currency APIs for FX rates locked at expense-time.
- Transit APIs for travel time between stops.
- Weather APIs for forecast-driven re-plans.
Wrappers don't have this. The LLM is allowed to confidently claim a venue is open when it has no way to know.
Multi-user state
Users are first-class objects: names, dietary tags, veto rights, payment defaults, presence, preferences accumulated over time. When the scout proposes a place, the system already knows the local thinks the neighbourhood is wrong, the treasurer has flagged that the budget is tight, and the food-decider has a non-negotiable about shellfish.
Wrappers don't have users. They have a single conversation, and whatever ambient information about humans appears in that conversation's text.
Algorithmic operations
Some things are math, not language:
- Expense settlement uses debt simplification. Deterministic. Runs in code. Always produces the right answer.
- Schedule routing checks geography and opening hours against the proposed itinerary. Catches the "Arashiyama at 9am, Higashiyama at 11am" error before the user does.
- Voting + consensus is structured state — votes are recorded as discrete events, the 80% rule is enforced by the system, not "interpreted from chat."
Wrappers describe how these things should work and ask the user to do them.
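The voting point is worth making concrete: votes recorded as discrete events, with the threshold enforced in code rather than inferred from chat. A sketch with hypothetical names, not any product's schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Vote:
    user: str
    proposal_id: str
    approve: bool

def consensus(votes: list[Vote], proposal_id: str, group_size: int,
              threshold: float = 0.8) -> bool:
    """Enforce the threshold from recorded vote events, not parsed chat.
    A user's latest vote wins; non-voters count against the threshold."""
    latest: dict[str, bool] = {}
    for v in votes:
        if v.proposal_id == proposal_id:
            latest[v.user] = v.approve
    yes = sum(latest.values())
    return yes / group_size >= threshold
```

Whether abstainers count against the threshold is a policy choice; the point is that it's a choice the system makes explicitly, not something an LLM guesses from a transcript.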
Citations and confidence on every output
The single most important difference. Every claim the system makes is one of:
- Verified — the system hit a real source (Google Places, currency API, etc.) and quotes the source on the output. Confidence: HIGH.
- Cross-referenced — multiple sources agree. Confidence: MEDIUM-HIGH.
- LLM-generated, unverified — the system produced this from the LLM and is honest about not having a source. Confidence: VERIFY.
- LLM-generated, contradicted — the LLM said one thing, a verified source said another. The system flags the conflict.
A user looking at a Wendir output sees the agent's name, a citation link, and a confidence pill. A user looking at a wrapper sees prose with the same level of confidence whether the LLM hallucinated or got lucky.
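The four tiers map naturally onto a tagged output type. A sketch in Python with hypothetical names, not Wendir's actual schema:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class Confidence(Enum):
    HIGH = "verified"                  # one real source confirmed it
    MEDIUM_HIGH = "cross-referenced"   # multiple sources agree
    VERIFY = "llm-unverified"          # LLM output with no source
    CONFLICT = "contradicted"          # LLM and a verified source disagree

@dataclass
class AgentOutput:
    agent: str                 # which specialised agent produced this
    claim: str
    citation: Optional[str]    # link/reference when verified, else None
    confidence: Confidence

out = AgentOutput(agent="scout", claim="Open Tue 11:00-21:00",
                  citation="google-places", confidence=Confidence.HIGH)
```

Because every output carries the tag, the UI can render the agent name, citation link, and confidence pill from the same object, and an unverified claim can never be displayed as a verified one by accident.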
This is what we mean when we say "specialisation, verification, and persistence are what's missing." The wrapper architecture can't add these without becoming something else.
How to evaluate any AI trip planner in 5 minutes
If you're shopping for one, run this checklist:
- Ask it to verify a specific opening-hours claim. "Is X open on Mondays?" If it answers without a source, it's a wrapper.
- Ask it to settle a specific expense list. "A paid $200, B paid $300, C paid $0, equal split between three — what does C owe?" If the answer is prose rather than a deterministic number, it's a wrapper.
- Close the session, return tomorrow. If "the trip" doesn't survive as a durable object, it's a wrapper.
- Try to add a second user with separate preferences. If the tool's data model doesn't actually have users, it's a wrapper.
- Look at the output for citations or confidence indicators. If every claim is presented with the same confidence, regardless of whether it's verifiable, it's a wrapper.
Hit any single one of these and you have your answer.
The counter-take
There are legitimate wrappers — products built on top of LLMs that acknowledge they're a wrapper and are honest about what they are. A first-pass-ideation tool that says "here are five suggestions, verify before booking" is fine. A wrapper marketed as an end-to-end trip planner is the problem.
The complaint isn't with wrappers as a category — it's with wrappers positioned as systems. A wrapper that's transparent about being a wrapper is more honest than a system that hides its mechanics. The honesty matters more than the architecture.
What we built (briefly)
Wendir is a real agent system — specialised agents, a tool layer (Google Places, FX, transit), multi-user state, algorithmic settlement, citations and confidence on every output. Closed beta, iOS-first. Waitlist.
This isn't a marketing argument — it's the technical reason we expect to do something the wrapper category can't, which is to be reliable enough for the trip you actually take.
If we end up wrong about that, the wrappers win and we shut up about it. The check is straightforward: try both, run the 5-minute evaluation above on each, see which one survives.
The shortest version
If you only remember three things:
- Most "AI trip planners" launched in 2024-2025 are a single LLM in a chat UI with a travel system prompt. Same architecture, different branding.
- Five tells of a wrapper: prose output, no verification, no users, no expense math, no persistence.
- A real agent system has specialised agents, a tool layer for ground truth, multi-user state, algorithmic operations, and citation + confidence on every output. The architecture is the difference, not the model.
Where this fits
This is the second piece in the AI Reality Check lane. The series:
- We asked ChatGPT, Gemini, and Claude to plan the same group trip — the experimental evidence.
- This piece — the structural argument.
- Citations vs hallucinations: how Wendir's agents flag what they don't know — how the verification layer actually works.
And the Manual pieces the architecture supports:
- How to plan a group trip without becoming the unpaid PM
- How to split travel expenses
- The 80% consensus rule
Written by the Wendir team. Last updated: 15 May 2026.
Common questions
What's the difference between an AI trip planner and a chatbot?
Most AI trip planners launched in 2024-2025 are a chatbot with a travel-themed prompt and a wrapper UI. A real agent system has multiple specialised components — text generation, real-world verification, multi-user state, algorithmic operations — that hand off to each other. The wrapper feels like an LLM; the system feels like a workspace.
How do I tell if an AI trip planner is a wrapper?
Five tells: (1) the output is prose, not structured; (2) it can't verify a single specific claim (restaurant open? hours?); (3) it has no concept of multiple users with separate preferences; (4) it can't actually settle expenses; (5) the conversation has no persistence beyond the session. Hit any one of these and you're looking at a wrapper.
Are wrappers always bad?
No — a wrapper is fine for first-pass ideation, destination overviews, or quick translations. The problem is positioning a wrapper as an end-to-end trip planner. The wrapper does the easy 20% (text generation) and leaves the hard 80% (verification, coordination, settlement) to the user.
What does a real agent system actually have?
Specialised agents with distinct jobs (one for finding places, one for verifying them, one for expense math, etc.); a tool layer that connects to real-world APIs (Google Places, currency rates, transit) for ground truth; multi-user state that survives sessions; explicit handoffs between agents; and citations + confidence on every output so the user knows what's verified and what's an LLM guess.
Doesn't Wendir also use LLMs?
Yes — Gemini for the text generation parts, where LLMs are actually good. The difference is that the LLM is one component in a workflow, not the whole thing. Place verification hits Google Places, not the LLM. Expense math is an algorithm, not a prompt. Voting is structured state, not a parsed chat. The LLM does what it's good at; everything else uses the right tool.