# Your LLM Scores 88% on Code Benchmarks. In Production, It Hits 30%.
## TL;DR
A research team at Concordia University tested seven state-of-the-art LLMs on real-world Python class generation and found that models scoring 84-89% on synthetic benchmarks drop to 25-34% on actual production code. The gap is not caused by memorization issues or missing documentation; it stems from a fundamental inability to handle object-oriented semantics like attribute resolution and type consistency. Zero syntax errors across all conditions, but AttributeError and TypeError account for two-thirds of real-world failures. If you are using LLM benchmark scores to set expectations for code generation tools, you are working with inflated numbers.
## The Benchmark Illusion
We have written before about how benchmarks can paint an incomplete picture of AI reliability. The pattern keeps showing up: a model aces the test, then stumbles in the field.
Rahman et al. (2025) make this concrete for code generation. They built RealClassEval, a benchmark of 400 Python classes extracted from real open-source repositories, and ran seven models through it: Qwen 2.5 Coder, GPT-4.1, GPT-5, GPT-OSS, Codestral, DeepSeek-V3, and Llama-4 Maverick.
The results are stark:
| Dataset | Pass Rate |
|---|---|
| ClassEval (synthetic) | 84–89% |
| RealClassEval (real-world) | 25–34% |
| Gap | 53–62 percentage points |
This is not a marginal difference. It is a different reality entirely.
## It Is Not About Memorization
One natural objection: maybe the models just memorized the synthetic benchmark answers. The study controls for this elegantly.
RealClassEval is split into two partitions:
- Pre-cutoff: classes from CodeSearchNet, almost certainly seen during training
- Post-cutoff: classes from repositories created after March 31, 2025, guaranteed unseen
The result? Six of seven models show no significant difference between seen and unseen data. All effect sizes are negligible. The models are not failing because the code is unfamiliar. They are failing because they do not understand how real-world classes work.
## Syntax Is Solved. Semantics Is Not.
The error analysis is where this study gets genuinely interesting. Across all experimental conditions, SyntaxError accounts for exactly 0% of failures. Modern LLMs generate syntactically valid Python every single time.
But look at what actually breaks:
| Error Type | Share of Failures |
|---|---|
| AttributeError | 43.84% |
| TypeError | 21.65% |
| AssertionError | 18.51% |
| Other | 16.00% |
AttributeError: the model tried to access an attribute that does not exist on an object. TypeError: the model passed the wrong type to a function or operation. Together, these two account for nearly two-thirds of all real-world failures.
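In plain Python terms, those two error classes look like this. This is a toy snippet to illustrate the failure categories, not code from the study's test harness:

```python
class Order:
    def __init__(self, items):
        self.items = items

order = Order(items=["book", "pen"])

try:
    order.total  # attribute was never defined on Order
except AttributeError as e:
    print(type(e).__name__)  # AttributeError

try:
    "items: " + order.items  # str + list: wrong type for the operation
except TypeError as e:
    print(type(e).__name__)  # TypeError
```

Both snippets are syntactically flawless, which is exactly the point: the grammar is fine, the object model is wrong.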
Compare this to synthetic benchmarks, where 72% of errors are AssertionErrors (logic mistakes caught by simple assertions). The failure modes are fundamentally different. Synthetic benchmarks test whether the model can implement correct logic for self-contained functions. Real-world code tests whether the model understands inheritance hierarchies, dynamic attribute resolution, and complex type systems.
LLMs have completely mastered the grammar of Python. They have not mastered the semantics of object-oriented programming.
## What This Looks Like in Practice
Consider a real-world class that inherits from a base and overrides behavior:
```python
class OrderProcessor(BaseProcessor):
    def process(self, order):
        # LLM generates this:
        self.validate(order)
        result = self.compute_total(order.items)
        self.status_manager.update(order.id, "completed")
        return result
```
This looks perfectly reasonable. But BaseProcessor.validate() was renamed to check_constraints() two versions ago. The status_manager attribute is initialized in a sibling class, not in BaseProcessor. And compute_total expects a LineItemCollection, not a raw list.
Three AttributeError and TypeError failures in five lines of syntactically flawless code. The model generated Python that reads like it should work, but does not understand the object graph it is operating within.
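Reconstructed as runnable Python, the mismatch looks like this. All stubs below are hypothetical, invented to mirror the failures described above; they are not code from the study:

```python
import types

class LineItemCollection:
    """Wrapper type that compute_total expects (assumed for illustration)."""
    def __init__(self, items):
        self._items = list(items)

    def total(self):
        return sum(item["price"] for item in self._items)


class BaseProcessor:
    # validate() was renamed; only check_constraints() exists now.
    def check_constraints(self, order):
        return True

    def compute_total(self, items):
        if not isinstance(items, LineItemCollection):
            # Passing a raw list violates the type contract.
            raise TypeError("compute_total expects a LineItemCollection")
        return items.total()


class OrderProcessor(BaseProcessor):
    def process(self, order):          # the LLM-generated body from above
        self.validate(order)           # AttributeError: renamed upstream
        result = self.compute_total(order.items)           # TypeError: raw list
        self.status_manager.update(order.id, "completed")  # AttributeError
        return result


order = types.SimpleNamespace(id=7, items=[{"price": 10.0}])
try:
    OrderProcessor().process(order)
except AttributeError as e:
    print(type(e).__name__)  # AttributeError: fails on the very first line
```

Execution never even reaches the TypeError: the first attribute lookup already breaks, which is consistent with AttributeError dominating the study's failure distribution.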
## Documentation Helps Less Than You Think
The study ran an ablation on docstring completeness: full docstrings, partial docstrings, and no docstrings at all.
Full documentation improves pass rates by 1–3%. Only 2 of 7 models (Codestral and DeepSeek-V3) showed statistically significant improvements, and even those had negligible effect sizes.
This challenges the common assumption that better prompts and richer context will meaningfully close the performance gap. The models are not failing because they lack specifications. They are failing because they cannot translate specifications into correct object-oriented implementations.
## RAG Works, But Only in the Sweet Spot
Retrieval-Augmented Generation tells a more nuanced story. The study tested RAG by retrieving similar class implementations from the pre-cutoff dataset and injecting them into the prompt.
| Documentation Level | RAG Improvement | Significant Models |
|---|---|---|
| Full docstrings | 1.6% | 0 of 7 |
| Partial docstrings | 4.3–6.9% | 5 of 7 |
| No docstrings | 3.6% | 0 of 7 |
The authors call this the "information gap hypothesis":
RAG is most effective when the model has enough structure to interpret retrieved examples but not enough detail to implement correctly on its own. With full docs, retrieved examples are redundant. With no docs, there is nothing to anchor the retrieved context against.
There is a catch, though. RAG does not just improve things; it substitutes one failure mode for another. Retrieved examples reduce AttributeErrors and logic failures, but increase ImportErrors (+250%) and KeyErrors (+45%). The model copies dependencies from retrieved examples that do not exist in the target class.
This is a practical concern for anyone building RAG-augmented code generation pipelines. You need post-processing to validate that the generated imports and data structure accesses actually exist in the target context.
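One sketch of that post-processing step, using only Python's standard ast and importlib machinery (this validator is our suggestion, not something from the paper): scan the generated code for imports whose top-level package does not resolve in the target environment.

```python
import ast
import importlib.util

def unresolved_imports(source: str) -> list[str]:
    """Return imported module names in `source` whose top-level
    package cannot be found in the current environment."""
    missing = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            names = [node.module]
        else:
            continue
        for name in names:
            # Check only the root package; find_spec returns None if absent.
            if importlib.util.find_spec(name.split(".")[0]) is None:
                missing.append(name)
    return missing

generated = "import os\nimport totally_made_up_pkg_xyz\nfrom json import loads"
print(unresolved_imports(generated))  # ['totally_made_up_pkg_xyz']
```

A similar static pass over attribute accesses against the target class's actual API would catch the copied-in KeyError and AttributeError cases.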
## What This Means for Practitioners
If you are evaluating LLM-based coding tools, relying on benchmark scores will mislead you. Here is what the data actually supports:
Set realistic expectations. For class-level Python generation, expect 25–34% correctness on real-world code. Every generated class needs human review and testing. This is not a failure of the tools; it is the current state of the art.
Invest in documentation, but do not expect miracles. Full docstrings provide marginal improvements that compound across large codebases. Worth doing, but not a solution to the fundamental gap.
Deploy RAG adaptively. Enable retrieval when documentation is incomplete. Disable it when you already have full specs; you are paying compute costs for no benefit. And always validate that retrieved dependencies exist in the target context.
Watch for OO-specific failures. If you are generating code that involves inheritance, dynamic attributes, or complex type hierarchies, expect higher failure rates. These are the specific capabilities that current models lack.
Encode domain knowledge as structured context. One emerging approach is agent skills: markdown files that encode project-specific conventions, class hierarchies, and architectural patterns so the model does not have to infer them from scratch. This is distinct from RAG. Instead of retrieving similar code examples, you provide the semantic context the model is missing: which attributes exist, which types are expected, and how the inheritance graph works.
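A skill file of this kind might look like the following sketch, reusing the hypothetical OrderProcessor example from earlier; every name here is illustrative:

```markdown
# Skill: order-processing conventions

## Class hierarchy
- `OrderProcessor` extends `BaseProcessor`; validation lives in
  `BaseProcessor.check_constraints()` (the old `validate()` is gone).

## Type contracts
- `compute_total()` takes a `LineItemCollection`, never a raw `list`.

## Attribute ownership
- `status_manager` is not initialized in `BaseProcessor`; do not
  reference it from subclasses that do not set it up themselves.
```

Each bullet preempts exactly one of the AttributeError/TypeError failure modes the study measured.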
## Why This Matters to Us
At Daita, we help teams ship AI-assisted software that actually works in production. This study validates what we see in the field: the gap between demo and deployment is not a tooling problem; it is a context problem.
LLMs do not fail because they cannot write Python. They fail because they do not understand your Python: your class hierarchies, your naming conventions, your type contracts. Benchmark scores measure generic capability. Production success depends on how well the model understands the specific codebase it is operating in.
This is why our work focuses on measuring and closing the gap between what AI tools promise and what they deliver. If you cannot quantify the gap, you cannot fix it.
The gap between how AI helps software development in practice versus benchmark headlines continues to widen in the research literature. This study makes the case quantitatively: synthetic benchmarks are measuring something, but it is not what matters for production code.
## References
- Beyond Synthetic Benchmarks: Evaluating LLM Performance on Real-World Class-Level Code Generation - Rahman, Khatoonabadi, Shihab (Concordia University, 2025). Original paper.
- Your AI Agent Aces the Benchmark. It Still Can't Be Trusted. - Daita blog
- How AI Actually Helps (and Sometimes Hinders) Software Development - Daita blog
- Agent Skills: The Paradigm Shift Hiding in Plain Text - Daita blog
- Quantifying the Efficacy of Agent Skills: An Empirical Analysis of SkillsBench - Daita blog