daita@system:~$ cat ./llms_get_lost_in_multi_turn_conversation.md

Your LLM Forgets What You Said Two Messages Ago

Created: 2026-03-16 | Size: 8609 bytes

TL;DR

Every LLM you use, from Llama 3.1-8B to Gemini 2.5 Pro, performs dramatically worse in multi-turn conversations compared to single-turn prompts. A large-scale study of 15 models across 200,000+ simulated conversations found an average 39% performance drop when information is spread across turns instead of given upfront. The kicker: models don't lose capability; they lose reliability. Unreliability more than doubles (112% increase), meaning the same model on the same task can nail it one run and completely botch it the next. Starting a new conversation is genuinely more effective than continuing a broken one.

The Benchmark-Reality Gap Strikes Again

We keep evaluating LLMs the wrong way. Nearly all major benchmarks (HumanEval, GSM8K, Spider) test models with a single, fully-specified prompt. One shot, all the information upfront. But that's not how anyone actually uses these systems.

Real conversations are underspecified. Users drip-feed requirements across turns. They clarify. They change their minds. They assume context carries forward. And when LLMs face this reality, they fall apart, not because they're dumb, but because they're unreliable.

This is the central finding of Laban et al.'s study, and it should change how you think about LLM benchmarks and their disconnect from production performance.

The Experiment: 200K+ Conversations, 15 Models

The researchers built a sharded simulation framework. Take a single-turn benchmark instruction, split the information into smaller "shards," and reveal one shard per conversation turn. A user simulator (GPT-4o-mini) plays the human side. A classification system tracks how the assistant responds at each turn - clarifying, hedging, refusing, or attempting an answer.
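The loop described above can be sketched in a few lines. This is a minimal illustration with hypothetical helper names (`assistant`, `user_sim`, `classify` stand in for any chat-completion client and classifier); the paper's actual framework differs in detail.

```python
# Sketch of the sharded-simulation loop: reveal one shard per turn,
# let a user simulator phrase it, and classify each assistant reply.
# All callables here are placeholders, not the paper's implementation.

def run_sharded_conversation(shards, assistant, user_sim, classify):
    """Reveal one shard per turn and record how the assistant responds."""
    history = []
    log = []
    for shard in shards:
        # The user simulator rephrases the raw shard as a natural turn.
        user_turn = user_sim(shard, history)
        history.append({"role": "user", "content": user_turn})

        reply = assistant(history)
        history.append({"role": "assistant", "content": reply})

        # Each reply is labeled: clarify, hedge, refuse, or answer attempt.
        log.append({"shard": shard, "reply": reply, "label": classify(reply)})
    return log
```

The key property is that the assistant never sees the full instruction at once; it only accumulates shards through the growing `history`.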

Five simulation types test different information delivery strategies:

| Type | Description | Purpose |
|------|-------------|---------|
| Full | Single turn, complete instruction | Baseline |
| Concat | Single turn, shards concatenated | Controls for rephrasing |
| Sharded | Multi-turn, one shard per turn | The real test |
| Recap | Sharded + final turn restating all shards | Agent-style remediation |
| Snowball | Each turn adds new shard + repeats all previous | Maximum context redundancy |
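One way to see the difference between these five types is how each one assembles user turns from the same shards. The sketch below is illustrative only; the paper's exact prompt templates are not reproduced here.

```python
# Illustrative construction of the five simulation types from one shard
# list. Templates are hypothetical; only the structure matches the study.

def build_turns(shards, mode):
    """Return the list of user messages sent for a given simulation type."""
    if mode == "full":
        # FULL uses the original instruction; joining shards approximates it.
        return [" ".join(shards)]
    if mode == "concat":
        # CONCAT: all shards delivered in one turn, as a bullet list.
        return ["\n".join(f"- {s}" for s in shards)]
    if mode == "sharded":
        # SHARDED: one shard per turn.
        return list(shards)
    if mode == "recap":
        # RECAP: sharded, plus a final turn restating everything.
        return list(shards) + ["To recap: " + " ".join(shards)]
    if mode == "snowball":
        # SNOWBALL: each turn repeats every shard revealed so far.
        return [" ".join(shards[: i + 1]) for i in range(len(shards))]
    raise ValueError(f"unknown mode: {mode}")
```

Note that Full and Concat each produce a single message, while Sharded, Recap, and Snowball spread the same information over multiple turns.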

Six task categories span programming and natural language: code generation (HumanEval, LiveCodeBench), text-to-SQL (Spider), function calling (BFCL), grade-school math (GSM8K), data-to-text (ToTTo), and summarization (Summary of a Haystack). 600 instructions, 15 models, 10 simulations per combination. Total cost: approximately $5,000.

Every Model Gets Lost

The results are unambiguous. Every model degrades on every task in multi-turn settings:

Concat performance sits at 95.1% of Full, confirming the information isn't lost in the sharding process; it's the multi-turn format itself that kills performance.

The big surprise: model size and capability don't help. Claude 3.7 Sonnet, Gemini 2.5 Pro, and GPT-4.1 all suffer 30-40% degradation, right alongside smaller models. Throwing more compute at the problem doesn't fix it.

It's Not Aptitude, It's Reliability

This is the study's most important decomposition. Performance breaks into two components:

  • Aptitude (90th percentile score): How well can the model do when things go right?
  • Unreliability (gap between 90th and 10th percentile): How much does performance vary across runs?

In single-turn settings, stronger models are both more capable and more reliable. In multi-turn? Aptitude drops only 16%. Unreliability increases 112% - more than doubling. The same model, same instruction, same information - but performance swings 50 percentage points between best and worst runs.
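The decomposition itself is just two percentile estimates over repeated runs of the same instruction. A dependency-free sketch, assuming scores on a 0-100 scale and a simple nearest-rank percentile (the paper does not specify its interpolation method):

```python
# Aptitude/unreliability decomposition over per-run scores.
# Aptitude = 90th percentile; unreliability = P90 - P10.

def aptitude_and_unreliability(scores):
    """Return (aptitude, unreliability) for a list of run scores."""
    s = sorted(scores)
    n = len(s)

    def percentile(p):
        # Nearest-rank percentile: simple and dependency-free.
        k = max(0, min(n - 1, round(p / 100 * (n - 1))))
        return s[k]

    p90, p10 = percentile(90), percentile(10)
    return p90, p90 - p10
```

With ten runs scoring 10 through 100, aptitude is 90 and unreliability is 70, i.e. the same model swings 70 points between its best and worst attempts.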

Four root causes drive this unreliability:

  1. Premature answer attempts - the model guesses before it has enough information, then bakes wrong assumptions into subsequent responses
  2. Answer anchoring - once the model produces an incorrect answer, it doubles down rather than correcting course
  3. Loss of middle turns - models over-weight the first and last messages, losing information from middle turns
  4. Verbosity spiraling - longer responses introduce more assumptions, which compound errors across turns

Two Turns Is All It Takes

The researchers tested conversations ranging from 2 to 8 shards. The result: both GPT-4o and GPT-4o-mini get lost starting at two turns. There's no graceful degradation curve; the cliff is immediate. The only reliable configuration is cramming everything into a single turn.

Reasoning Models Don't Help Either

If you hoped that reasoning models (o3, DeepSeek-R1) might handle multi-turn conversations better through chain-of-thought, bad news: they don't. They generate 33% longer responses on average, which actually makes things worse. More text means more assumptions, and assumptions compound into errors across turns.

What Actually Helps (Not Much)

The study tested several remediation strategies:

| Strategy | Recovery | Notes |
|----------|----------|-------|
| Snowball (repeat all context each turn) | 15-20% of the gap | Best agent-like approach, still far from single-turn |
| Recap (restate everything in final turn) | Partial | Helps but doesn't fix accumulated errors |
| Lower temperature (T=0.0) | Negligible | Unreliability remains ~30 even at zero temperature |
| Start a new conversation | Best option | Consolidate requirements and re-prompt from scratch |

The practical takeaway is almost absurdly simple: put everything in one message. If you're building systems on top of LLMs, minimize the number of turns. If you're a user who's three turns into a conversation that's going sideways, start over.

Implications for Agent Builders

This research has direct consequences for anyone building AI agent systems with reliability requirements:

  • Agent loops are multi-turn conversations: every tool call and response adds a turn, and each turn increases unreliability
  • Context engineering matters more than model selection: the best strategy is getting all relevant information into the model's context at once, not spreading it across interactions
  • Retry-from-scratch beats repair: if an agent gets stuck, resetting the conversation with accumulated context is more effective than trying to course-correct in-place
  • Temperature tuning is not a fix: you can't engineer away multi-turn unreliability through sampling parameters alone
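The retry-from-scratch idea above can be sketched concretely: collapse the stuck conversation's user turns into one consolidated prompt and open a fresh conversation. This is a hypothetical illustration (in practice you would extract requirements more carefully, possibly with another LLM call), not the paper's implementation.

```python
# "Retry from scratch": rebuild all user requirements from a stuck
# conversation into a single turn, then re-prompt a fresh conversation.
# `assistant` is a placeholder for any chat-completion client.

def consolidate_and_restart(history, assistant):
    """Collapse user turns into one prompt and start a new conversation."""
    requirements = [m["content"] for m in history if m["role"] == "user"]
    fresh_prompt = (
        "Complete the following task. All requirements:\n"
        + "\n".join(f"- {r}" for r in requirements)
    )
    # New conversation: no prior assistant turns, so no anchored wrong
    # answers or premature assumptions carry over.
    return assistant([{"role": "user", "content": fresh_prompt}])
```

Dropping the assistant's earlier turns is the point: it discards any incorrect answer the model might otherwise anchor on.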

The authors challenge LLM builders to target unreliability scores below 15 in multi-turn settings at T=1.0. Currently, no model comes close.

The Uncomfortable Truth

We've been evaluating LLMs under conditions that don't match how they're used. Single-turn benchmarks paint an optimistic picture: 90%+ on coding tasks, math, SQL generation. But the moment you introduce the natural dynamics of real conversation (underspecification, gradual information disclosure, multi-turn reasoning), performance craters by 39% on average.

This isn't a model problem to be solved by the next generation of LLMs. It's a structural problem with how transformer attention handles sequential, distributed information. Until architectures fundamentally change, the smartest strategy is designing your systems to work around it: consolidate context, minimize turns, and treat multi-turn conversations as inherently unreliable.


References

  1. LLMs Get Lost In Multi-Turn Conversation - Laban et al., original paper
  2. Lost in the Middle: How Language Models Use Long Contexts - Liu et al., TACL 2024. Related work on positional attention bias
  3. AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation - Wu et al., 2023. Multi-agent framework affected by these findings
  4. Your LLM Scores 88% on Code Benchmarks. In Production, It Hits 30%. - Daita blog
  5. Quantifying the Efficacy of Agent Skills - Daita blog

daita@system:~$ _