daita@system:~$ cat ./quantifying_efficacy_of_agent_skills.md

Quantifying the Efficacy of Agent Skills: An Empirical Analysis of SkillsBench

Created: 2026-02-26 | Size: 10023 bytes

TL;DR

SkillsBench tested 7 frontier AI agents across 84 tasks and found that human-curated Agent Skills boost pass rates by 16.2 percentage points on average, but models that try to generate their own skills actually perform worse (-1.3pp). The sweet spot is 2–3 focused skills of moderate length; more documentation hurts rather than helps. Gains are largest in domains where models lack procedural knowledge (Healthcare +51.9pp, Manufacturing +41.9pp) and smallest where they already have strong priors (Software Engineering +4.5pp). The takeaway: invest in curated skill libraries, not bigger models.


The evolution of Large Language Models (LLMs) from text generators to autonomous agents has introduced a new paradigm in software development and automated workflows. However, a fundamental tension remains: while foundation models possess broad capabilities, they frequently lack the specialized, procedural knowledge required for complex, domain-specific tasks. Fine-tuning is often prohibitively expensive and sacrifices generalizability.

Enter Agent Skills: structured packages of procedural knowledge, code templates, and resources that augment LLM agents at inference time. Despite their rapid adoption in the AI community, there has been no standardized methodology for measuring their actual impact.

A recent paper, "SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks," addresses this critical gap. By introducing a rigorous evaluation framework, the researchers provide empirical data on how, when, and why Agent Skills improve LLM performance.


The SkillsBench Framework

SkillsBench is a comprehensive benchmark designed to treat Agent Skills as first-class evaluation artifacts. It comprises 86 tasks across 11 diverse domains, of which 84 are evaluated (2 are excluded due to GPU requirements and verifier timeouts). Every task is scored by a deterministic verifier, ensuring reproducibility and preventing data leakage.
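The paper's verifiers are not specified in this excerpt; as a hedged illustration, a deterministic verifier can be as simple as a pure function that checks the agent's output artifact against a fixed expected condition, so repeated scoring of the same trajectory can never disagree. The `verify_csv_totals` task and the artifacts below are hypothetical, not drawn from the benchmark.

```python
import csv
import io

def verify_csv_totals(artifact_text: str) -> bool:
    """Deterministic verifier sketch: pass iff the agent's output CSV
    contains exactly one 'total' row whose value equals the sum of the
    line items. A pure function of the artifact, so re-running it on the
    same trajectory always yields the same score."""
    rows = list(csv.reader(io.StringIO(artifact_text)))
    items = [float(v) for label, v in rows if label != "total"]
    totals = [float(v) for label, v in rows if label == "total"]
    return len(totals) == 1 and abs(totals[0] - sum(items)) < 1e-9

# Hypothetical artifacts an agent might produce for this task:
good = "widgets,2.5\ngadgets,7.5\ntotal,10.0\n"
bad = "widgets,2.5\ngadgets,7.5\ntotal,9.0\n"
```

Because the check depends only on the artifact, not on model sampling or wall-clock state, it also cannot leak reference answers into the agent's context.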

To isolate the impact of procedural augmentation, each task was evaluated under three distinct conditions:

  1. No Skills (Baseline): The agent relies solely on its pre-trained knowledge.
  2. Curated Skills: The agent is provided with human-authored, high-quality procedural guidance.
  3. Self-Generated Skills: The agent is prompted to author its own procedural knowledge before attempting the task.
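The three-condition design above can be sketched as a simple evaluation loop. The `run_agent` and `verify` callables are hypothetical stand-ins for the harness and the per-task verifier, not the paper's actual interfaces; in the self-generated condition, the agent first authors its own guidance and then attempts the task with that guidance attached.

```python
from typing import Callable, Optional

def evaluate_task(
    run_agent: Callable[[str, Optional[str]], str],  # (task_prompt, skills) -> artifact
    verify: Callable[[str], bool],                   # deterministic verifier
    task_prompt: str,
    curated_skills: str,
    n_trials: int = 5,
) -> dict:
    """Run one task under the three SkillsBench-style conditions and
    return the pass rate (fraction of passing trials) for each."""
    conditions = {
        "no_skills": None,
        "curated": curated_skills,
        # Self-generated: the agent writes its own guidance once,
        # which is then reused across the task trials.
        "self_generated": run_agent("Write procedural guidance for: " + task_prompt, None),
    }
    return {
        name: sum(verify(run_agent(task_prompt, skills)) for _ in range(n_trials)) / n_trials
        for name, skills in conditions.items()
    }

# Toy demo: an "agent" that succeeds only when handed non-empty guidance,
# and produces nothing useful when asked to write its own.
demo = evaluate_task(
    run_agent=lambda prompt, skills: "ok" if skills else "",
    verify=lambda artifact: artifact == "ok",
    task_prompt="toy task",
    curated_skills="step 1: do the thing",
)
```

In this toy setup the curated condition passes every trial while the other two fail, loosely mirroring the paper's headline result; real harnesses would of course differ.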

The benchmark tested 7 frontier harness-model configurations, pairing the Claude Code, Gemini CLI, and Codex CLI harnesses with various models, across 7,308 trajectories.

Defining Agent Skills

To understand the impact of Skills, it is necessary to distinguish them from other runtime augmentation paradigms. As outlined in the study, Skills uniquely combine modular packaging with procedural guidance and executable resources.

Table 1: Comparison of Runtime Augmentation Paradigms

Feature                 Prompts    RAG    Tools    Agent Skills
Modular/reusable        ×          ✓      ✓        ✓
Procedural guidance     Limited    ×      ×        ✓
Executable resources    ×          ×      ✓        ✓
Cross-model portable    ✓          ✓      ✓        ✓

The tasks themselves were stratified by difficulty based on estimated human completion time:

Table 2: Task Difficulty Stratification

Difficulty    Count (%)     Estimated Human Time
Core          17 (19.8%)    < 60 min
Extended      43 (50.0%)    1–4 hours
Extreme       26 (30.2%)    > 4 hours
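The stratification counts are internally consistent with the 86-task total: 17 + 43 + 26 = 86, and each reported percentage is the count over 86. A quick check:

```python
# Task counts per difficulty tier, as reported in Table 2.
counts = {"Core": 17, "Extended": 43, "Extreme": 26}
total = sum(counts.values())  # the benchmark's 86 tasks
shares = {tier: round(100 * n / total, 1) for tier, n in counts.items()}
# shares reproduces the table's percentages: 19.8 / 50.0 / 30.2
```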

Key Findings: The Data Behind Agent Skills

The empirical results from SkillsBench yield several critical insights into the current state of agent augmentation.

1. Curated Skills Drive Substantial, Yet Variable, Improvements

On average, the introduction of curated Skills improved the task pass rate by 16.2 percentage points (pp). The highest overall performance was achieved by the Gemini CLI paired with Gemini 3 Flash, which reached a 48.7% pass rate when equipped with Skills.

2. The Failure of Self-Generated Skills

A pivotal finding of the study is that models cannot reliably author the procedural knowledge they lack. When prompted to generate their own skills before attempting a task, agents performed 1.3pp worse on average than the no-skills baseline. This suggests that effective procedural augmentation currently requires human-curated domain expertise: an agent can only write down what it already knows, not the procedures it is missing.

Table 3: Pass Rates (%) Across Skills Conditions

Harness        Model             No Skills    With Curated Skills    Normalized Gain    Self-Generated
Gemini CLI     Gemini 3 Flash    31.3         48.7                   25.3               n/a
Claude Code    Opus 4.5          22.0         45.3                   29.9               21.6
Codex          GPT-5.2           30.6         44.7                   20.3               25.0
Claude Code    Opus 4.6          30.6         44.5                   20.0               32.0
Gemini CLI     Gemini 3 Pro      27.6         41.2                   18.8               n/a
Claude Code    Sonnet 4.5        17.3         31.8                   17.5               15.2
Claude Code    Haiku 4.5         11.0         27.7                   18.8               11.0
Mean                             24.3         40.6                   21.5               21.0
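The "Normalized Gain" column is consistent with the absolute gain expressed as a percentage of the headroom above baseline, i.e. 100 × (with − without) / (100 − without). This formula is inferred from the table values, not stated in the excerpt.

```python
def normalized_gain(no_skills: float, with_skills: float) -> float:
    """Absolute pass-rate gain as a percentage of the headroom above
    the no-skills baseline, rounded to one decimal place."""
    return round(100 * (with_skills - no_skills) / (100 - no_skills), 1)

# Reproduces Table 3, e.g. Gemini 3 Flash, Opus 4.5, and Haiku 4.5:
# normalized_gain(31.3, 48.7) -> 25.3
# normalized_gain(22.0, 45.3) -> 29.9
# normalized_gain(11.0, 27.7) -> 18.8
```

Under this reading, Haiku 4.5's modest 16.7pp raw gain (11.0 to 27.7) normalizes to 18.8 because its baseline leaves so much headroom.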

3. Domain-Specific Efficacy

The benefits of Agent Skills are not universally distributed. Skills are highly effective at closing "procedural gaps" (concrete steps, constraints, and sanity checks) but offer diminishing returns in domains where models already possess strong conceptual priors.

For instance, Healthcare and Manufacturing saw massive performance uplifts, whereas Software Engineering and Mathematics experienced minimal gains.

Table 4: Skills Efficacy by Domain

Domain                        With Skills    No Skills    Delta
Healthcare                    86.1%          34.2%        +51.9
Manufacturing                 42.9%          1.0%         +41.9
Cybersecurity                 44.0%          20.8%        +23.2
Natural Science               44.9%          23.1%        +21.9
Energy                        47.5%          29.5%        +17.9
Office & White Collar         42.5%          24.7%        +17.8
Finance                       27.6%          12.5%        +15.1
Media & Content Production    37.6%          23.8%        +13.9
Robotics                      27.0%          20.0%        +7.0
Mathematics                   47.3%          41.3%        +6.0
Software Engineering          38.9%          34.4%        +4.5

4. Optimal Skill Design: Less is More

More documentation does not equate to better performance. The study finds that providing an excessive number of skills, or overly comprehensive documentation, creates overhead that degrades agent efficacy.

The optimal configuration is 2 to 3 focused skills of moderate length. In the complexity breakdown, detailed but bounded documentation (about 1,165 tokens on average) outperformed comprehensive documentation (roughly 1,400+ tokens), which actually produced a negative delta relative to baseline.

Table 5: Pass Rates by Number of Skills Provided

Number of Skills    With Skills    No Skills    Delta
1 skill             42.2%          24.4%        +17.8
2–3 skills          42.0%          23.4%        +18.6
4+ skills           32.7%          26.9%        +5.9

Table 6: Pass Rates by Skills Complexity Level

Complexity       Pass Rate    Delta    Avg Token Count
Detailed         42.7%        +18.8    1165
Compact          37.6%        +17.1    845
Standard         37.1%        +10.1    773
Comprehensive    39.9%        –2.9     ~1400+
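As a concrete illustration of the "few, focused, moderate-length" recommendation: in the commonly used Agent Skills format, a skill is a directory containing a SKILL.md file with a short metadata header followed by procedural steps. The example below is a hypothetical sketch in that spirit, not one of the benchmark's skills.

```markdown
---
name: csv-financial-reconciliation
description: Reconcile line items against reported totals in CSV exports.
---

# CSV Financial Reconciliation

1. Parse the export with a CSV library; never split on commas by hand.
2. Sum line items per account and compare against the reported total.
3. Flag any discrepancy above 0.01 in absolute value; do not round first.
4. Write results to `reconciliation_report.csv` with columns:
   account, expected, actual, delta.
```

Note what is absent: no background theory, no exhaustive edge-case catalog. The findings above suggest that concrete steps, constraints, and sanity checks are exactly the "procedural gap" skills are best at closing.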

Strategic Implications for AI Engineering

The findings from SkillsBench carry significant implications for organizations deploying autonomous agents:

  1. Cost Efficiency Through Augmentation: The data indicates that smaller, more cost-effective models equipped with high-quality Skills can match or exceed larger models operating without them; in Table 3, Haiku 4.5 with curated Skills (27.7%) outperforms Opus 4.5 without them (22.0%). This shifts the cost-to-performance Pareto frontier in favor of augmentation over scale.
  2. The Importance of the Harness: Skill efficacy is heavily mediated by the agent harness. Some environments reliably invoke provided skills, while others ignore them. Engineering teams must evaluate not just the model, but the integration layer connecting the model to its procedural context.
  3. Investment in Curation: Because models fail to self-generate effective procedural guidance, organizations must invest in human-curated, domain-specific skill libraries. These libraries should prioritize concise, actionable steps over exhaustive theoretical documentation.

Conclusion

The SkillsBench research establishes a necessary empirical foundation for the future of agent augmentation. It shows that while Agent Skills are no panacea, they are a potent tool when applied correctly. By investing in human-curated, concise, and domain-specific procedural knowledge, engineering teams can significantly improve the reliability and capability of autonomous AI agents.


References

Li, X., Chen, W., Liu, Y., et al. (2026). SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks. arXiv:2602.12670

daita@system:~$ _