Quantifying the Efficacy of Agent Skills: An Empirical Analysis of SkillsBench
Created: 2026-02-26
TL;DR
SkillsBench tested 7 frontier AI agents across 84 tasks and found that human-curated Agent Skills boost pass rates by 16.2 percentage points on average, but models that try to generate their own skills actually perform worse (-1.3pp). The sweet spot is 2–3 focused skills of moderate length; more documentation hurts rather than helps. Gains are largest in domains where models lack procedural knowledge (Healthcare +51.9pp, Manufacturing +41.9pp) and smallest where they already have strong priors (Software Engineering +4.5pp). The takeaway: invest in curated skill libraries, not bigger models.
The evolution of Large Language Models (LLMs) from text generators to autonomous agents has introduced a new paradigm in software development and automated workflows. However, a fundamental tension remains: while foundation models possess broad capabilities, they frequently lack the specialized, procedural knowledge required for complex, domain-specific tasks. Fine-tuning is often prohibitively expensive and sacrifices generalizability.
Enter Agent Skills: structured packages of procedural knowledge, code templates, and resources that augment LLM agents at inference time. Despite their rapid adoption in the AI community, the field has lacked a standardized methodology for measuring their actual impact.
A recent paper, "SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks," addresses this critical gap. By introducing a rigorous evaluation framework, the researchers provide empirical data on how, when, and why Agent Skills improve LLM performance.
The SkillsBench Framework
SkillsBench is a comprehensive benchmark designed to treat Agent Skills as first-class evaluation artifacts. The benchmark comprises 86 tasks across 11 diverse domains, of which 84 are evaluated (2 excluded due to GPU requirements and verifier timeouts), utilizing deterministic verifiers to ensure reproducibility and prevent data leakage.
To isolate the impact of procedural augmentation, each task was evaluated under three distinct conditions:
- No Skills (Baseline): The agent relies solely on its pre-trained knowledge.
- Curated Skills: The agent is provided with human-authored, high-quality procedural guidance.
- Self-Generated Skills: The agent is prompted to author its own procedural knowledge before attempting the task.
The benchmark tested 7 frontier agent-model configurations (including Claude Code, Gemini CLI, and Codex CLI) across 7,308 trajectories.
Defining Agent Skills
To understand the impact of Skills, it is necessary to distinguish them from other runtime augmentation paradigms. As outlined in the study, Skills uniquely combine modular packaging with procedural guidance and executable resources.
Table 1: Comparison of Runtime Augmentation Paradigms
| Feature | Prompts | RAG | Tools | Agent Skills |
|---|---|---|---|---|
| Modular/reusable | × | ✓ | ✓ | ✓ |
| Procedural guidance | Limited | × | × | ✓ |
| Executable resources | × | × | ✓ | ✓ |
| Cross-model portable | ✓ | ✓ | ✓ | ✓ |
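To make "modular packaging plus executable resources" concrete, here is one way a skill package might be laid out on disk: a manifest of procedural steps alongside a runnable script. The `SKILL.md` manifest name, frontmatter fields, and directory layout are assumptions for illustration; the benchmark does not prescribe this exact format:

```python
from pathlib import Path
import textwrap

def write_skill(root: Path, name: str) -> Path:
    """Create a minimal skill package: a manifest with procedural
    guidance plus an executable resource (layout is illustrative)."""
    skill_dir = root / name
    (skill_dir / "scripts").mkdir(parents=True, exist_ok=True)
    (skill_dir / "SKILL.md").write_text(textwrap.dedent("""\
        ---
        name: csv-report
        description: Summarize a CSV file into a short report.
        ---
        1. Validate the header row before parsing.
        2. Run scripts/summarize.py on the input file.
        3. Sanity-check the reported total against the row count.
        """))
    (skill_dir / "scripts" / "summarize.py").write_text(
        "import csv, sys\n"
        "rows = list(csv.reader(open(sys.argv[1])))\n"
        "print(f'{len(rows) - 1} data rows')\n"
    )
    return skill_dir
```

The combination shown here is what Table 1 attributes uniquely to Skills: reusable packaging, step-by-step procedural guidance, and an executable resource in one unit.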
The tasks were stratified by difficulty based on estimated human completion time:
Table 2: Task Difficulty Stratification
| Difficulty | Count (%) | Estimated Human Time |
|---|---|---|
| Core | 17 (19.8%) | < 60 min |
| Extended | 43 (50.0%) | 1–4 hours |
| Extreme | 26 (30.2%) | > 4 hours |
Key Findings: The Data Behind Agent Skills
The empirical results from SkillsBench yield several critical insights into the current state of agent augmentation.
1. Curated Skills Drive Substantial, Yet Variable, Improvements
On average, the introduction of curated Skills improved the task pass rate by 16.2 percentage points (pp). The highest overall performance was achieved by the Gemini CLI paired with Gemini 3 Flash, which reached a 48.7% pass rate when equipped with Skills.
2. The Failure of Self-Generated Skills
A pivotal finding of the study is that models cannot reliably author the procedural knowledge they need. When prompted to generate their own skills, agents performed an average of 1.3pp worse than the baseline. This suggests that, at least for now, effective procedural augmentation depends on human-curated domain expertise.
Table 3: Pass Rates (%) Across Skills Conditions
| Harness | Model | No Skills | With Curated Skills | Normalized Gain | Self-Generated |
|---|---|---|---|---|---|
| Gemini CLI | Gemini 3 Flash | 31.3 | 48.7 | 25.3 | – |
| Claude Code | Opus 4.5 | 22.0 | 45.3 | 29.9 | 21.6 |
| Codex | GPT-5.2 | 30.6 | 44.7 | 20.3 | 25.0 |
| Claude Code | Opus 4.6 | 30.6 | 44.5 | 20.0 | 32.0 |
| Gemini CLI | Gemini 3 Pro | 27.6 | 41.2 | 18.8 | – |
| Claude Code | Sonnet 4.5 | 17.3 | 31.8 | 17.5 | 15.2 |
| Claude Code | Haiku 4.5 | 11.0 | 27.7 | 18.8 | 11.0 |
| Mean | – | 24.3 | 40.6 | 21.5 | 21.0 |
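The "Normalized Gain" column in Table 3 is consistent with the standard normalized-gain formula: the raw improvement expressed as a share of the remaining headroom above the baseline. The formula below is inferred from the table values, not quoted from the paper:

```python
def normalized_gain(no_skills: float, with_skills: float) -> float:
    """Normalized gain in %: (with - without) / (100 - without) * 100.
    Measures how much of the remaining headroom the skills recovered."""
    return 100 * (with_skills - no_skills) / (100 - no_skills)

# Spot-checks against Table 3 (all values in %):
print(round(normalized_gain(31.3, 48.7), 1))  # Gemini 3 Flash → 25.3
print(round(normalized_gain(22.0, 45.3), 1))  # Opus 4.5 → 29.9
print(round(normalized_gain(11.0, 27.7), 1))  # Haiku 4.5 → 18.8
```

This normalization explains why Opus 4.5 tops the gain column despite not having the highest absolute pass rate: it recovers the largest fraction of its available headroom.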
3. Domain-Specific Efficacy
The benefits of Agent Skills are not universally distributed. Skills are highly effective at closing "procedural gaps" (concrete steps, constraints, and sanity checks) but offer diminishing returns in domains where models already possess strong conceptual priors.
For instance, Healthcare and Manufacturing saw massive performance uplifts, whereas Software Engineering and Mathematics experienced minimal gains.
Table 4: Skills Efficacy by Domain
| Domain | No Skills | With Skills | Delta (pp) |
|---|---|---|---|
| Healthcare | 34.2% | 86.1% | +51.9 |
| Manufacturing | 1.0% | 42.9% | +41.9 |
| Cybersecurity | 20.8% | 44.0% | +23.2 |
| Natural Science | 23.1% | 44.9% | +21.9 |
| Energy | 29.5% | 47.5% | +17.9 |
| Office & White Collar | 24.7% | 42.5% | +17.8 |
| Finance | 12.5% | 27.6% | +15.1 |
| Media & Content Production | 23.8% | 37.6% | +13.9 |
| Robotics | 20.0% | 27.0% | +7.0 |
| Mathematics | 41.3% | 47.3% | +6.0 |
| Software Engineering | 34.4% | 38.9% | +4.5 |
4. Optimal Skill Design: Less is More
More documentation does not equate to better performance. The study reveals that providing an excessive number of skills, or overly comprehensive documentation, creates cognitive overhead that degrades agent efficacy.
The optimal configuration consists of 2 to 3 focused skills of moderate length. The most exhaustive documentation tier ("Comprehensive" in Table 6) actually produced a negative delta relative to the no-skills baseline.
Table 5: Pass Rates by Number of Skills Provided
| Number of Skills | No Skills | With Skills | Delta (pp) |
|---|---|---|---|
| 1 skill | 24.4% | 42.2% | +17.8 |
| 2–3 skills | 23.4% | 42.0% | +18.6 |
| 4+ skills | 26.9% | 32.7% | +5.9 |
Table 6: Pass Rates by Skills Complexity Level (ordered by length)
| Complexity | Avg. Token Count | Pass Rate | Delta (pp) |
|---|---|---|---|
| Standard | 773 | 37.1% | +10.1 |
| Compact | 845 | 37.6% | +17.1 |
| Detailed | 1165 | 42.7% | +18.8 |
| Comprehensive | ~1400+ | 39.9% | –2.9 |
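Taken together, Tables 5 and 6 suggest a simple lint for skill bundles: cap the skill count and flag overly long documentation. A rough sketch, with thresholds loosely derived from the tables and a crude word-based token estimate (both are assumptions, not values from the paper):

```python
def check_skill_budget(skills: list[str], max_skills: int = 3,
                       max_tokens: int = 1200) -> list[str]:
    """Heuristic lint based on the SkillsBench findings: warn on
    bundles with more than ~3 skills or overly long documentation.
    Token count is approximated as word count * 1.3 (rough assumption)."""
    warnings = []
    if len(skills) > max_skills:
        warnings.append(
            f"{len(skills)} skills provided; benchmark gains dropped "
            f"sharply beyond {max_skills}"
        )
    for i, text in enumerate(skills):
        approx_tokens = int(len(text.split()) * 1.3)
        if approx_tokens > max_tokens:
            warnings.append(
                f"skill {i} is ~{approx_tokens} tokens; comprehensive "
                f"documentation showed negative deltas"
            )
    return warnings
```

The specific cutoffs should be tuned per domain; the point is that skill libraries benefit from an upper bound, not just a lower one.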
Strategic Implications for AI Engineering
The findings from SkillsBench carry significant implications for organizations deploying autonomous agents:
- Cost Efficiency Through Augmentation: The data indicates that smaller, more cost-effective models equipped with high-quality Skills can match or exceed the performance of larger, more expensive models operating without them. This shifts the Pareto frontier of cost-to-performance, making agent deployment highly scalable.
- The Importance of the Harness: Skill efficacy is heavily mediated by the agent harness. Some environments reliably invoke provided skills, while others ignore them. Engineering teams must evaluate not just the model, but the integration layer connecting the model to its procedural context.
- Investment in Curation: Because models fail to self-generate effective procedural guidance, organizations must invest in human-curated, domain-specific skill libraries. These libraries should prioritize concise, actionable steps over exhaustive theoretical documentation.
Conclusion
The SkillsBench research establishes a necessary empirical foundation for the future of agent augmentation. It shows that while Agent Skills are no panacea, they are a potent tool when applied correctly. By focusing on human-curated, concise, domain-specific procedural knowledge, engineering teams can significantly improve the reliability and capability of autonomous AI agents.
References
Li, X., Chen, W., Liu, Y., et al. (2026). SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks. arXiv:2602.12670