Quantifying the Efficacy of Agent Skills: An Empirical Analysis of SkillsBench
Created: 2026-02-26
TL;DR
SkillsBench tested 7 frontier AI agents across 84 tasks and found that human-curated Agent Skills boost pass rates by 16.2 percentage points on average, but models that try to generate their own skills actually perform worse (-1.3pp). The sweet spot is 2–3 focused skills of moderate length; more documentation hurts rather than helps. Gains are largest in domains where models lack procedural knowledge (Healthcare +51.9pp, Manufacturing +41.9pp) and smallest where they already have strong priors (Software Engineering +4.5pp). The takeaway: invest in curated skill libraries, not bigger models.
The evolution of Large Language Models (LLMs) from text generators to autonomous agents has introduced a new paradigm in software development and automated workflows. However, a fundamental tension remains: while foundation models possess broad capabilities, they frequently lack the specialized, procedural knowledge required for complex, domain-specific tasks. Fine-tuning is often prohibitively expensive and sacrifices generalizability.
Enter Agent Skills: structured packages of procedural knowledge, code templates, and resources that augment LLM agents at inference time. Despite their rapid adoption in the AI community, the field has lacked a standardized methodology for measuring their actual impact.
A recent paper, "SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks," addresses this critical gap. By introducing a rigorous evaluation framework, the researchers provide empirical data on how, when, and why Agent Skills improve LLM performance.
The SkillsBench Framework
SkillsBench is a comprehensive benchmark designed to treat Agent Skills as first-class evaluation artifacts. The benchmark comprises 86 tasks across 11 diverse domains, of which 84 are evaluated (2 excluded due to GPU requirements and verifier timeouts), utilizing deterministic verifiers to ensure reproducibility and prevent data leakage.
To isolate the impact of procedural augmentation, each task was evaluated under three distinct conditions:
- No Skills (Baseline): The agent relies solely on its pre-trained knowledge.
- Curated Skills: The agent is provided with human-authored, high-quality procedural guidance.
- Self-Generated Skills: The agent is prompted to author its own procedural knowledge before attempting the task.
The benchmark tested 7 frontier agent-model configurations (including Claude Code, Gemini CLI, and Codex CLI) across 7,308 trajectories.
Defining Agent Skills
To understand the impact of Skills, it is necessary to distinguish them from other runtime augmentation paradigms. As outlined in the study, Skills uniquely combine modular packaging with procedural guidance and executable resources.
Table 1: Comparison of Runtime Augmentation Paradigms
| Feature | Prompts | RAG | Tools | Agent Skills |
|---|---|---|---|---|
| Modular/reusable | × | ✓ | ✓ | ✓ |
| Procedural guidance | Limited | × | × | ✓ |
| Executable resources | × | × | ✓ | ✓ |
| Cross-model portable | ✓ | ✓ | ✓ | ✓ |
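To make "modular packaging plus executable resources" concrete, here is one way a skill package might be laid out on disk: a manifest of procedural steps alongside a runnable script. The `SKILL.md` manifest name, frontmatter fields, and directory layout are assumptions for illustration; the benchmark does not prescribe this exact format:

```python
from pathlib import Path
import textwrap

def write_skill(root: Path, name: str) -> Path:
    """Create a minimal skill package: a manifest with procedural
    guidance plus an executable resource (layout is illustrative)."""
    skill_dir = root / name
    (skill_dir / "scripts").mkdir(parents=True, exist_ok=True)
    (skill_dir / "SKILL.md").write_text(textwrap.dedent("""\
        ---
        name: csv-report
        description: Summarize a CSV file into a short report.
        ---
        1. Validate the header row before parsing.
        2. Run scripts/summarize.py on the input file.
        3. Sanity-check the reported total against the row count.
        """))
    (skill_dir / "scripts" / "summarize.py").write_text(
        "import csv, sys\n"
        "rows = list(csv.reader(open(sys.argv[1])))\n"
        "print(f'{len(rows) - 1} data rows')\n"
    )
    return skill_dir
```

The combination shown here is what Table 1 attributes uniquely to Skills: reusable packaging, step-by-step procedural guidance, and an executable resource in one unit.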
The tasks were stratified by difficulty based on estimated human completion time:
Table 2: Task Difficulty Stratification
| Difficulty | Count (%) | Estimated Human Time |
|---|---|---|
| Core | 17 (19.8%) | < 60 min |
| Extended | 43 (50.0%) | 1–4 hours |
| Extreme | 26 (30.2%) | > 4 hours |
Key Findings: The Data Behind Agent Skills
The empirical results from SkillsBench yield several critical insights into the current state of agent augmentation.
1. Curated Skills Drive Substantial, Yet Variable, Improvements
On average, the introduction of curated Skills improved the task pass rate by 16.2 percentage points (pp). The highest overall performance was achieved by the Gemini CLI paired with Gemini 3 Flash, which reached a 48.7% pass rate when equipped with Skills.
2. The Failure of Self-Generated Skills
A pivotal finding of the study is that models cannot reliably author the procedural knowledge they need. When prompted to generate their own skills, agents performed an average of 1.3pp worse than the baseline. This suggests that, at least for now, effective procedural augmentation depends on human-curated domain expertise.
Table 3: Pass Rates (%) Across Skills Conditions
| Harness | Model | No Skills | With Curated Skills | Normalized Gain | Self-Generated |
|---|---|---|---|---|---|
| Gemini CLI | Gemini 3 Flash | 31.3 | 48.7 | 25.3 | – |
| Claude Code | Opus 4.5 | 22.0 | 45.3 | 29.9 | 21.6 |
| Codex | GPT-5.2 | 30.6 | 44.7 | 20.3 | 25.0 |
| Claude Code | Opus 4.6 | 30.6 | 44.5 | 20.0 | 32.0 |
| Gemini CLI | Gemini 3 Pro | 27.6 | 41.2 | 18.8 | – |
| Claude Code | Sonnet 4.5 | 17.3 | 31.8 | 17.5 | 15.2 |
| Claude Code | Haiku 4.5 | 11.0 | 27.7 | 18.8 | 11.0 |
| Mean | – | 24.3 | 40.6 | 21.5 | 21.0 |
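The "Normalized Gain" column in Table 3 is consistent with the standard normalized-gain formula: the raw improvement expressed as a share of the remaining headroom above the baseline. The formula below is inferred from the table values, not quoted from the paper:

```python
def normalized_gain(no_skills: float, with_skills: float) -> float:
    """Normalized gain in %: (with - without) / (100 - without) * 100.
    Measures how much of the remaining headroom the skills recovered."""
    return 100 * (with_skills - no_skills) / (100 - no_skills)

# Spot-checks against Table 3 (all values in %):
print(round(normalized_gain(31.3, 48.7), 1))  # Gemini 3 Flash → 25.3
print(round(normalized_gain(22.0, 45.3), 1))  # Opus 4.5 → 29.9
print(round(normalized_gain(11.0, 27.7), 1))  # Haiku 4.5 → 18.8
```

This normalization explains why Opus 4.5 tops the gain column despite not having the highest absolute pass rate: it recovers the largest fraction of its available headroom.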
3. Domain-Specific Efficacy
The benefits of Agent Skills are not universally distributed. Skills are highly effective at closing "procedural gaps" (concrete steps, constraints, and sanity checks) but offer diminishing returns in domains where models already possess strong conceptual priors.
For instance, Healthcare and Manufacturing saw massive performance uplifts, whereas Software Engineering and Mathematics experienced minimal gains.
Table 4: Skills Efficacy by Domain
| Domain | No Skills | With Skills | Delta (pp) |
|---|---|---|---|
| Healthcare | 34.2% | 86.1% | +51.9 |
| Manufacturing | 1.0% | 42.9% | +41.9 |
| Cybersecurity | 20.8% | 44.0% | +23.2 |
| Natural Science | 23.1% | 44.9% | +21.9 |
| Energy | 29.5% | 47.5% | +17.9 |
| Office & White Collar | 24.7% | 42.5% | +17.8 |
| Finance | 12.5% | 27.6% | +15.1 |
| Media & Content Production | 23.8% | 37.6% | +13.9 |
| Robotics | 20.0% | 27.0% | +7.0 |
| Mathematics | 41.3% | 47.3% | +6.0 |
| Software Engineering | 34.4% | 38.9% | +4.5 |
4. Optimal Skill Design: Less is More
More documentation does not equate to better performance. The study reveals that providing an excessive number of skills, or overly comprehensive documentation, creates cognitive overhead that degrades agent efficacy.
The optimal configuration consists of 2 to 3 focused skills of moderate length. The most exhaustive documentation tier ("Comprehensive" in Table 6) actually produced a negative delta relative to the no-skills baseline.
Table 5: Pass Rates by Number of Skills Provided
| Number of Skills | No Skills | With Skills | Delta (pp) |
|---|---|---|---|
| 1 skill | 24.4% | 42.2% | +17.8 |
| 2–3 skills | 23.4% | 42.0% | +18.6 |
| 4+ skills | 26.9% | 32.7% | +5.9 |
Table 6: Pass Rates by Skills Complexity Level (ordered by length)
| Complexity | Avg. Token Count | Pass Rate | Delta (pp) |
|---|---|---|---|
| Standard | 773 | 37.1% | +10.1 |
| Compact | 845 | 37.6% | +17.1 |
| Detailed | 1165 | 42.7% | +18.8 |
| Comprehensive | ~1400+ | 39.9% | –2.9 |
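Taken together, Tables 5 and 6 suggest a simple lint for skill bundles: cap the skill count and flag overly long documentation. A rough sketch, with thresholds loosely derived from the tables and a crude word-based token estimate (both are assumptions, not values from the paper):

```python
def check_skill_budget(skills: list[str], max_skills: int = 3,
                       max_tokens: int = 1200) -> list[str]:
    """Heuristic lint based on the SkillsBench findings: warn on
    bundles with more than ~3 skills or overly long documentation.
    Token count is approximated as word count * 1.3 (rough assumption)."""
    warnings = []
    if len(skills) > max_skills:
        warnings.append(
            f"{len(skills)} skills provided; benchmark gains dropped "
            f"sharply beyond {max_skills}"
        )
    for i, text in enumerate(skills):
        approx_tokens = int(len(text.split()) * 1.3)
        if approx_tokens > max_tokens:
            warnings.append(
                f"skill {i} is ~{approx_tokens} tokens; comprehensive "
                f"documentation showed negative deltas"
            )
    return warnings
```

The specific cutoffs should be tuned per domain; the point is that skill libraries benefit from an upper bound, not just a lower one.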
Strategic Implications for AI Engineering
The findings from SkillsBench carry significant implications for organizations deploying autonomous agents:
- Cost Efficiency Through Augmentation: The data indicates that smaller, more cost-effective models equipped with high-quality Skills can match or exceed the performance of larger, more expensive models operating without them. This shifts the Pareto frontier of cost-to-performance, making agent deployment highly scalable.
- The Importance of the Harness: Skill efficacy is heavily mediated by the agent harness. Some environments reliably invoke provided skills, while others ignore them. Engineering teams must evaluate not just the model, but the integration layer connecting the model to its procedural context.
- Investment in Curation: Because models fail to self-generate effective procedural guidance, organizations must invest in human-curated, domain-specific skill libraries. These libraries should prioritize concise, actionable steps over exhaustive theoretical documentation.
Conclusion
The SkillsBench research establishes a necessary empirical foundation for the future of agent augmentation. It shows that while Agent Skills are no panacea, they are a potent tool when applied correctly. By focusing on human-curated, concise, domain-specific procedural knowledge, engineering teams can significantly improve the reliability and capability of autonomous AI agents.
References
Li, X., Chen, W., Liu, Y., et al. (2026). SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks. arXiv:2602.12670