AI Trip-Planning Reality Check
What ChatGPT, Gemini, and Claude get wrong when you ask them to plan a group trip
Three frontier LLMs, one prompt: "plan a 3-day group trip to Kyoto for 4 friends with mixed budgets and dietary needs." What follows is the kind of output each produces, what they all structurally fail at, and the architectural reason a single chatbot can't plan a group trip well — drawn from documented LLM behaviours and our own repeated testing while building Wendir.
Quick answer: when asked to plan a group trip, ChatGPT, Gemini, and Claude all hallucinated specific restaurants (some closed years ago), got opening hours wrong, and produced itineraries with impossible geography. None of them ran the expense math, ran a vote, or tracked who proposed what. The structural problem is that a chatbot is the wrong primitive for a multi-week, multi-stakeholder coordination workflow — you need specialised agents, a verification layer over real-world APIs, multi-user state, and algorithmic operations (not just text generation).
LLMs are extraordinary one-shot text generators. They are also, increasingly, what people reach for when they want to plan a trip.
The prompt below is the kind of request that goes into one of these tools every minute of every day:
"Plan a 3-day group trip to Kyoto for 4 friends. Budget: A$800 per person excluding flights. One vegetarian, one with a shellfish allergy. We'd like to see the famous sights but also avoid the worst crowds. Suggest a day-by-day plan and tell us roughly what it will cost."
A note on what follows: the descriptions of each model below are illustrative composites drawn from repeated runs of this and similar prompts against the current frontier models during Wendir's development, plus published research on LLM hallucination in travel contexts. They are not transcripts of a single verbatim experiment. Specific model versions change every few months; the failure patterns described here have remained stable across versions and are what matters for the structural argument. If you want to verify, run the prompt yourself — the patterns repeat.
What follows is what an LLM produces alone — without a verification layer, without multi-user state, without expense math. Compare against our Kyoto deep-read for what the same prompt looks like when a real workflow is involved.
ChatGPT
Typically the fastest of the three — a complete itinerary in under 30 seconds.
What it got right:
- Three distinct themed days (Higashiyama, Arashiyama, central).
- Named the right anchor sights (Kiyomizu-dera, bamboo grove, Fushimi Inari).
- Acknowledged the dietary constraints and named a vegetarian restaurant by name.
What it got wrong:
- The vegetarian restaurant it named had closed in 2024. ChatGPT had no way to know.
- The bamboo grove was scheduled for "11:00 AM" — peak crowd hour. No warning.
- "Roughly what it will cost" produced a single bottom-line number with no breakdown, no per-day estimate, no per-person split. When asked to split four ways with one vegetarian and one shellfish-allergic person, it just divided by four.
- No way to vote, no expense tracking, no allergy-card guidance. The dietary part was a one-sentence acknowledgement, not a system.
- Confidently asserted that Fushimi Inari "opens at 8:30 AM" — it's open 24 hours.
Verdict: a serviceable starting point for a solo traveller. A misleading starting point for a group trip with real constraints.
Gemini
Typically slightly slower than ChatGPT and more cautious, adding hedges like "you may want to verify" and "hours can change." That honesty is welcome, but it reduces usefulness for a user looking for a clean answer.
What it got right:
- Better awareness of crowd timing — suggested Fushimi at sunrise, which is correct.
- Used a table for the day-by-day plan, which is more extractable.
- Acknowledged the cost would vary based on dining choices and gave a sensible range (A$650-A$1,000pp).
What it got wrong:
- The vegetarian restaurant it named had also closed. A different one. Same problem.
- Suggested "a kaiseki dinner one night" without checking that kaiseki dinners in Kyoto require 1-2 week advance booking. A group reading this and trying to book three days out would fail.
- Day 2's geography was incoherent: Arashiyama in the morning, central Kyoto for lunch, eastern temples in the afternoon. Anyone who's been to Kyoto knows this is 90 minutes of transit you don't have time for.
- Allergy handling: "ask about shellfish at the restaurant." This is not actionable for an anaphylaxis-grade allergy in a country where most stocks contain bonito (a fish, not shellfish, but the conflation is the kind of thing that gets people in trouble).
- No multi-currency awareness. It took the A$ budget at face value; exchange rates never entered the picture.
Verdict: better-hedged than ChatGPT but the underlying problems are the same — hallucinated specifics, no operational layer, dietary advice that would not survive contact with reality.
Claude
Most consistent in style with what a thoughtful human travel planner would write — qualitative, opinionated, well-paced. Typically takes about the same time as ChatGPT.
What it got right:
- Acknowledged it didn't have real-time information and recommended cross-checking opening hours.
- Suggested Fushimi at dawn explicitly.
- Best dietary handling of the three — actually named shojin-ryori (Buddhist temple cuisine) as a vegetarian-safe category and gave the structural reason, rather than just naming one restaurant.
- Recommended booking the nice dinner from home — the same advice we give in the Kyoto deep-read.
What it got wrong:
- Still no expense splitting. Still no consensus mechanism.
- Suggested a tea ceremony as a Day 3 activity. Tea ceremonies in Kyoto run from ¥3,000 to ¥15,000 per person depending on venue — that's a 5× swing. The "rough cost" Claude gave was the midpoint, with no awareness of the variance.
- Confidently named several restaurants. Two of them existed and were still open. One had moved. One we couldn't verify either way — which is a different kind of problem (the LLM doesn't know what it doesn't know).
- No way to coordinate four humans. Same structural gap.
Verdict: the best of the three for trip ideation. Still not even close to what a group trip planning workflow needs.
Why can't a single LLM plan a group trip?
The interesting thing isn't where the three LLMs differed. It's where they failed identically.
1. They can't verify ground truth
Every model, across every run, hallucinated at least one specific factual claim: whether a restaurant was open, its hours, its prices. This isn't a quality-of-prompt problem — it's the architecture. An LLM doesn't know anything; it predicts the most plausible continuation of text. Plausible and true are different things.
For some tasks, that's fine. For trip planning, plausible-but-wrong gets you to a restaurant that closed last year.
2. They have no model of multiple users
A group trip isn't a search query — it's a coordination problem. Four humans with different preferences, different budgets, different schedules, different vetoes. A single chatbot has no native concept of "Sara voted yes and David voted no." It produces an itinerary, not the resolution of a multi-stakeholder decision.
You can prompt-engineer around this — "each person votes on each suggestion" — but you're now manually emulating in text what a real workflow would track natively.
3. They have no persistence
Trip planning takes weeks. A chat is a session. Even with conversation history, you can't ask "what did we vote on last Tuesday" because there is no Tuesday; there is only this conversation, this context window, and the slow rot of older messages dropping out.
A workflow has state — votes, bookings, expense logs, day plans — that an LLM can read from and write to, but cannot itself be.
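The difference between a session and state is concrete. Here's a hedged sketch, using an in-memory SQLite table with hypothetical column names, of what "what did we vote on last Tuesday" looks like when it's a query rather than a context window:

```python
import sqlite3

# Illustrative only: trip state as durable, queryable rows instead of
# chat history. Table and column names are made up for this sketch.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE votes (member TEXT, suggestion TEXT, choice TEXT, voted_on TEXT)"
)
db.execute("INSERT INTO votes VALUES ('Sara', 'kaiseki dinner', 'yes', '2026-05-12')")
db.execute("INSERT INTO votes VALUES ('David', 'kaiseki dinner', 'no', '2026-05-12')")

# "What did we vote on last Tuesday?" is answerable because Tuesday exists
# as data — no context window involved.
rows = db.execute(
    "SELECT member, suggestion, choice FROM votes WHERE voted_on = ?",
    ("2026-05-12",),
).fetchall()
print(rows)
```

Nothing here ages out of a context window; three weeks later the same query returns the same rows.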
4. They can't split the bill
The most concrete failure: not one of the three produced an actual settlement plan. Not because the prompt didn't ask (it did, implicitly — "tell us roughly what it will cost") — but because expense settlement isn't a text-generation task. It's an algorithm: track who paid what, compute net positions, run debt simplification to produce n−1 transfers.
LLMs can describe debt simplification. They can't run it on your actual receipts.
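Debt simplification is a small, well-known algorithm. A minimal equal-split sketch (illustrative, not Wendir's settlement engine): net each person's position against their share, then greedily match the largest debtor to the largest creditor, producing at most n−1 transfers.

```python
def settle(paid: dict[str, float]) -> list[tuple[str, str, float]]:
    """Equal-split settlement: returns (debtor, creditor, amount) transfers."""
    share = sum(paid.values()) / len(paid)
    net = {p: round(amt - share, 2) for p, amt in paid.items()}
    # Largest debtor first (most negative net), largest creditor first.
    debtors = sorted((p for p in net if net[p] < 0), key=lambda p: net[p])
    creditors = sorted((p for p in net if net[p] > 0), key=lambda p: -net[p])
    transfers, i, j = [], 0, 0
    while i < len(debtors) and j < len(creditors):
        d, c = debtors[i], creditors[j]
        amt = round(min(-net[d], net[c]), 2)
        transfers.append((d, c, amt))
        net[d] += amt
        net[c] -= amt
        if net[d] >= -0.005:  # debtor settled (tolerate float rounding)
            i += 1
        if net[c] <= 0.005:   # creditor repaid
            j += 1
    return transfers

# Four friends, uneven payments over the trip:
print(settle({"Sara": 520.0, "David": 180.0, "Mei": 300.0, "Tom": 200.0}))
# -> [('David', 'Sara', 120.0), ('Tom', 'Sara', 100.0)]
```

Two transfers instead of up to six pairwise IOUs, and Mei (who paid exactly her share) is left out of the settlement entirely. This is the kind of output the prompt implicitly asked for and no chatbot produced.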
What this means for "AI trip planners"
There's a generation of products marketed as "AI trip planners" that are, structurally, a chatbot interface on top of one of these three models. They have the same limitations, dressed in nicer UI. Some hide it better; the structural failure modes are identical.
The actual unlock isn't a smarter LLM. It's:
- Specialised agents instead of one chatbot. Different jobs need different tools. Verifying a restaurant against Google Places is not the same task as generating prose; treating them as one task is why every "AI trip planner" fails the verification step.
- A verification layer over the LLM's output. When the LLM says "the restaurant is open at 9am," something has to check that against a real source before the user trusts it. This is what citations are for. (See citations vs hallucinations for our take.)
- Multi-user state. The system has to know who's in the trip, who voted, who paid. Not "the LLM tracks it in the conversation" — actually stored, queryable, durable.
- Algorithmic operations. Expense settlement, debt simplification, schedule routing — these are math, not text. They have to actually run, not be described.
- A human in the loop for the calls that matter. Some decisions (anaphylaxis-grade allergies, hard budget caps, who-fronts-the-deposit) shouldn't have an LLM in the final approval path. Not because the LLM is dumb — because the cost of being wrong is asymmetric.
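The verification layer in the second bullet can be sketched as a gate between the LLM's claims and the user. Everything below is illustrative: `lookup` stands in for a real places API client (a Google Places wrapper, say), and the field names are assumptions, not a real API's schema.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Claim:
    """One factual assertion extracted from LLM output."""
    place: str
    field: str   # e.g. "opening_hours"
    value: str

def verify(claim: Claim, lookup: Callable[[str], Optional[dict]]) -> str:
    record = lookup(claim.place)
    if record is None:
        return "unverifiable"  # the LLM doesn't know what it doesn't know
    if record.get("permanently_closed"):
        return "refuted: permanently closed"
    actual = record.get(claim.field)
    if actual is None:
        return "unverifiable"
    return "confirmed" if str(actual) == claim.value else f"refuted: source says {actual}"

# A stub standing in for a live places API:
stub = {"Fushimi Inari": {"opening_hours": "24 hours", "permanently_closed": False}}
claim = Claim("Fushimi Inari", "opening_hours", "8:30 AM")
print(verify(claim, stub.get))  # -> refuted: source says 24 hours
```

Note the three-way outcome: confirmed, refuted, or unverifiable. Surfacing "unverifiable" honestly is the whole point of a citation layer; a bare chatbot collapses all three into confident prose.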
This is the system Wendir is built around. Seven specialised agents (Scout, Local, Moneybags, Booker, Concierge, Reshuffler, Memorykeeper) each doing one thing, handing off to each other, with citation and confidence on every output. It's not "ChatGPT with extra steps." It's a workflow that happens to use LLMs for the parts they're actually good at — text generation, summarisation, idea synthesis — and uses other tools for everything else.
The counter-take
LLMs are still useful for trip planning. They are excellent for:
- First-pass ideation. "What are five neighbourhoods to consider for a 4-person trip to Mexico City?" — useful. They've read enough that the answer is reasonable.
- Translation and language help during the trip. Real-time, low-stakes, easy to verify.
- Drafting messages to other group members. "Help me write a message to the group asking if everyone's OK with shifting Day 2." Plain text-gen, exactly where LLMs are strong.
- Solo travel for low-stakes destinations. One person, a city you'd be fine winging in — LLMs are net positive.
The argument isn't "LLMs are bad." It's "LLMs are the wrong primitive for a multi-person, multi-week, multi-stakeholder coordination problem." Different tool, different job.
The shortest version
Three things, if you only remember three:
- All three frontier LLMs hallucinated restaurants, hours, and prices. They can't verify ground truth — that's an architectural limitation, not a tuning problem.
- None of them split the bill, ran a vote, or tracked who voted what. A chatbot is the wrong primitive for a multi-stakeholder workflow.
- The actual unlock is specialisation + verification + state + a human in the loop. Not "a better LLM." A better system.
Where this fits
This is the first piece in the AI Reality Check lane. The next two pieces go deeper:
- Why most "AI trip planners" are wrappers — and what a real agent system looks like — coming next. The structural argument in more depth.
- Citations vs hallucinations: how Wendir's agents flag what they don't know — the verification layer, specifically.
If you want the operating system this critique implies, Wendir's seven-agent architecture is the system we built around exactly these failure modes. Closed beta, iOS-first. Waitlist.
Or use the LLM you like best for the parts it's good at, and use a proper workflow for everything else. Either works. The trap is using a chatbot for the whole job.
More
- The main planning manual — the system this is the AI-critique companion to.
- Kyoto in 3 days for 4 people — the planning doc the LLMs were trying to produce.
- How to split travel expenses — the math no LLM ran for us.
Written by the Wendir team. Last updated: 15 May 2026. Behaviour patterns described above are drawn from repeated runs of similar prompts against the current frontier models during Wendir's development. Specific model versions move quickly; the structural failure modes (hallucinated specifics, no expense math, no multi-user state, no persistence) have remained stable across versions and are the point of the piece. Run the prompt yourself to verify the patterns repeat. The frontier models referenced: ChatGPT (OpenAI), Gemini (Google), Claude (Anthropic).
Common questions
Can ChatGPT plan a group trip?
ChatGPT can produce a serviceable solo itinerary in seconds. For a group trip it produces the same itinerary while gesturing at "different preferences" — but it can't run a vote, track expenses, verify opening hours against the real world, or coordinate four humans. The output looks like a plan; the operational layer is missing.
Which AI is best for trip planning?
For a quick destination overview or a solo trip, any of ChatGPT, Gemini, or Claude works. For a group trip with real coordination — votes, expenses, dietary constraints, schedule conflicts — none of them work standalone. The structural reason: a chatbot is a one-shot text generator. A group trip is a multi-week, multi-person, stateful workflow.
What did all three LLMs get wrong about Kyoto?
All three suggested places that were closed, mispriced (or refused to price at all), gave outdated opening hours, and produced itineraries with logistically impossible gaps (e.g. Arashiyama in the morning, Higashiyama for lunch — a 50-minute crossing the LLM had no awareness of).
Is Wendir just a wrapper around an LLM?
No. The LLMs (Gemini for most text generation) are the surface layer — but the system underneath is multiple specialised agents handing off to each other, a real Google Places verification layer, a real-time consensus engine, and a multi-currency settlement algorithm. The chat is the tip; the iceberg is the workflow it lives inside.
Why doesn't a single LLM work for a group trip?
Three structural reasons: (1) one chatbot can't coordinate multiple humans — it has no native model of users, votes, or consensus; (2) LLMs hallucinate facts (hours, prices, closures) that travel decisions are downstream of; (3) trip planning is stateful and weeks-long, but most LLM interfaces are stateless chats. Specialisation, verification, and persistence are what's missing.