daita@system:~$ cat ./agent_skills_paradigm_shift.md

Agent Skills: The Paradigm Shift Hiding in Plain Text

Created: 2026-03-18 | Size: 23518 bytes

TL;DR

Agent Skills are structured markdown files that give AI agents procedural knowledge at inference time. Anthropic's Claude Code pioneered the pattern in October 2025; Block's Goose adopted it two months later, building explicit compatibility with Claude's skill directories. The rapid adoption signals a fundamental shift in how we augment AI agents. Instead of building more tools (MCP servers), the industry is learning to encode human expertise in plain text. The simplicity is the point.

Something interesting happened in AI tooling over the past few months. Anthropic shipped a feature called Skills for Claude Code in October 2025. Two months later, Block's Goose adopted the same pattern: same file format, same directory conventions, and explicit compatibility with Claude's skill directories. The mechanism both teams converged on is a markdown file.

Not a protocol. Not an API. Not a binary. A markdown file with some YAML frontmatter and, optionally, a couple of helper scripts sitting next to it.

They call them Skills.

And they may be a bigger deal than MCP.


The Problem: Capable but Clueless

Large Language Models are extraordinary generalists. They can write code in dozens of languages, explain quantum physics, and draft legal contracts. But ask one to deploy your specific application using your team's specific process, and it fumbles. It doesn't know that your staging environment requires a VPN connection first. It doesn't know that your database migrations need to run in a particular order. It doesn't know your company's code review checklist.

This is the procedural knowledge gap. LLMs have broad conceptual knowledge but lack the specific, step-by-step expertise that makes someone effective in a particular domain or organization. It's also why tools like GSD exist, imposing structure and context discipline on top of agents that would otherwise drift.

The industry has tried several approaches to close this gap:

  • Fine-tuning: expensive, sacrifices generalizability, and needs to be redone every time the model updates.
  • RAG (Retrieval-Augmented Generation): good for facts, poor for procedures. Retrieving a paragraph about deployments is not the same as knowing how to deploy.
  • MCP (Model Context Protocol): gives agents capabilities (run shell commands, call APIs, query databases) but not judgment about how to use them.

MCP is powerful. It lets an AI agent actually do things. But giving someone a toolbox doesn't make them a carpenter.

Skills are the missing piece: they encode the expertise.


How Claude Skills Work

Anthropic introduced Skills in October 2025. The design is deliberately minimal.

A Skill is a directory containing a SKILL.md file, a markdown document with YAML frontmatter describing what the skill does, followed by the actual instructions:

.claude/skills/
└── code-review/
    └── SKILL.md

The contents of SKILL.md:
---
name: code-review
description: Comprehensive code review checklist for pull requests
---

# Code Review Checklist

When reviewing code, check each of these areas:

## Functionality

- Code does what the PR description claims
- Edge cases are handled
- Error handling is appropriate

## Security

- No credentials or secrets in code
- User input is validated
- SQL queries are parameterized

The frontmatter description is the key innovation. At session start, the Claude harness scans all available skill files and reads only the short description from each one. This is extraordinarily token-efficient: each skill costs a few dozen tokens of overhead. The full content is loaded only when the user's request matches a skill's purpose.
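To make the lazy-loading mechanics concrete, here is a minimal sketch of how such a scanner could work. This is illustrative, not Anthropic's actual implementation: at startup only the YAML frontmatter of each SKILL.md is parsed, and the body is read from disk only when a skill is actually invoked.

```python
# Hypothetical skill scanner: index frontmatter cheaply, load bodies on demand.
from pathlib import Path


def read_frontmatter(path: Path) -> dict:
    """Parse the YAML frontmatter block without reading the skill body."""
    meta = {}
    with path.open() as f:
        if f.readline().strip() != "---":
            return meta  # no frontmatter present
        for line in f:
            line = line.strip()
            if line == "---":
                break  # end of frontmatter; the body is never read here
            if ":" in line:
                key, _, value = line.partition(":")
                meta[key.strip()] = value.strip()
    return meta


def scan_skills(root: Path) -> dict:
    """Index every skill by name -> {description, path}. Metadata only."""
    index = {}
    for skill_md in root.glob("*/SKILL.md"):
        meta = read_frontmatter(skill_md)
        if "name" in meta:
            index[meta["name"]] = {
                "description": meta.get("description", ""),
                "path": skill_md,
            }
    return index


def load_skill(index: dict, name: str) -> str:
    """Load the full instructions only when a request matches the skill."""
    return index[name]["path"].read_text()
```

The startup cost is one short read per skill, which is why each skill adds only a few dozen tokens of overhead to the session.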

Here's a more realistic example, a deployment skill for a specific team's process:

.claude/skills/
└── deploy/
    ├── SKILL.md
    ├── pre-flight.sh
    └── rollback.sh

The contents of SKILL.md:
---
name: deploy
description: Deploy the application to staging or production using our release process
---

# Deployment Skill

## Pre-flight checklist
Before deploying, always run `bash .claude/skills/deploy/pre-flight.sh` to verify:
- VPN connection to staging network is active
- No pending migrations from other branches
- Feature flags for this release are configured in LaunchDarkly

## Deployment steps
1. Run migrations: `make db-migrate ENV=<target>`
2. Deploy: `make deploy ENV=<target> VERSION=<git-sha>`
3. Verify health: `make health-check ENV=<target>` — wait for all pods green
4. Smoke test: `make smoke ENV=<target>`

## If something goes wrong
Run `bash .claude/skills/deploy/rollback.sh <previous-sha>` immediately.
Open an incident in PagerDuty and notify #deployments in Slack.

## Never do these
- Do not deploy on Fridays after 4pm
- Do not skip migrations even if the diff looks clean
- Do not deploy directly to production without staging first

The model doesn't need to understand your entire infrastructure. It needs to know the things a junior engineer gets wrong on their first deployment. That's what the skill encodes: not general knowledge, but your specific gotchas.

Skills can include supporting files, such as Python scripts, templates, and configuration files, that the model can reference and execute through its existing tools. As Simon Willison put it in his analysis of the feature:

Skills are Markdown with a tiny bit of YAML metadata and some optional scripts in whatever you can make executable in the environment. They feel a lot closer to the spirit of LLMs - throw in some text and let the model figure it out.


How Goose Skills Work

Block's Goose arrived at a strikingly similar design. A Goose Skill is a directory with a SKILL.md file containing YAML frontmatter and instructions:

.agents/skills/
└── deployment/
    ├── SKILL.md
    ├── deploy.sh
    └── templates/
        └── config.template.json

Goose adds one interesting layer: multiple discovery paths. Skills can live in:

  1. ~/.claude/skills/ - global, shared with Claude (cross-tool compatibility)
  2. ~/.config/agents/skills/ - global, portable across AI coding agents
  3. ~/.config/goose/skills/ - global, Goose-specific
  4. ./.agents/skills/ - project-level, portable
  5. ./.goose/skills/ - project-level, Goose-specific

This hierarchy reveals an important design decision: Goose explicitly supports Claude's skill directory format. The ecosystem is already building toward interoperability.
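The hierarchy above can be sketched as a simple precedence resolver. Note the override rule is an assumption on my part: I'm guessing that later, more specific paths shadow earlier ones (project-level beats global), which is a common convention but not something the format specifies.

```python
# Sketch of multi-path skill discovery. The precedence rule (later paths
# shadow earlier ones) is an illustrative assumption, not Goose's spec.
from pathlib import Path

SEARCH_PATHS = [
    Path.home() / ".claude/skills",         # global, shared with Claude
    Path.home() / ".config/agents/skills",  # global, cross-agent
    Path.home() / ".config/goose/skills",   # global, Goose-specific
    Path(".agents/skills"),                 # project-level, portable
    Path(".goose/skills"),                  # project-level, Goose-specific
]


def discover_skills(paths=SEARCH_PATHS) -> dict:
    """Map skill name -> SKILL.md path. Later paths override earlier ones,
    so a project-level skill shadows a global skill of the same name."""
    found = {}
    for root in paths:
        if not root.is_dir():
            continue
        for skill_md in sorted(root.glob("*/SKILL.md")):
            found[skill_md.parent.name] = skill_md  # directory name = skill name
    return found
```

Under this rule a team can ship a project-specific `deploy` skill that quietly replaces a developer's personal global one.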

Goose loads skills through its Summon extension, which handles both skill loading and subagent delegation. When a session starts, Goose adds discovered skills to its instructions and automatically loads the relevant ones when a request matches.


The Adoption

Take a step back and look at what happened. Anthropic shipped Skills in October 2025. Within two months, Goose adopted the same pattern. Not a fork, but a deliberate reimplementation sharing:

  • The same file format (SKILL.md with YAML frontmatter)
  • The same directory convention (skills in a named subdirectory)
  • The same lazy-loading strategy (scan descriptions at startup, load content on demand)
  • The same support for auxiliary files (scripts, templates alongside the markdown)
  • The same cross-model portability (nothing in the format is model-specific)

Goose didn't just copy the idea: they built explicit compatibility with Claude's .claude/skills/ directory, so the same skill files work across both tools without modification. That's not imitation. That's validation.

The speed of adoption tells us something important. When another team sees a design and decides to adopt it verbatim rather than invent their own, the design is probably right. When you have an agent that can read files and execute commands, the simplest way to give it domain expertise is to write that expertise down in a file it can read.

As Willison noted, there's nothing preventing any other model from using these same skill files. You can point Gemini CLI, Codex CLI, or any coding agent at a skills folder and say "read the SKILL.md and follow it." It just works. The format is the interface, and the format is plain text.


Skills vs. MCP: Complementary Layers

The temptation is to frame Skills as a replacement for MCP. Goose's own blog addressed this head-on in their post "Did Skills Kill MCP?":

Saying skills killed MCP is about as accurate as saying GitHub Actions killed Bash. [...] Bash still runs the commands. GitHub Actions still defines the workflow. Same system, different layers, no murders involved.

The analogy is apt. MCP provides capabilities: the ability to run shell commands, call APIs, manage files. Skills provide expertise: the knowledge of when and how to use those capabilities effectively.

| Feature | Prompts | RAG | MCP Tools | Agent Skills |
|---|---|---|---|---|
| Modular/reusable | × | ✓ | ✓ | ✓ |
| Procedural guidance | Limited | × | × | ✓ |
| Executable resources | × | × | ✓ | ✓ |
| Cross-model portable | ✓ | ✓ | ✓ | ✓ |

Skills occupy a unique niche: they're the only augmentation paradigm that combines modular reusability with procedural guidance and optional executable resources, all while remaining portable across models and tools.

MCP gives agents abilities. Skills teach agents how to use those abilities well.


The Precursor People Built by Hand

Before Skills existed as a formal feature, people were already building the pattern manually. A recent Reddit post describes what the author calls a "System Prompt Notebook", a structured document that serves as permanent external memory for an AI:

If you hired a genius employee who has severe amnesia, you wouldn't spend an hour every morning re-teaching them their entire job. Instead, you would write an employee handbook.

The post walks through creating a document with role definitions, rules, examples, and activation commands, essentially a handcrafted Skill without the infrastructure. Upload it at session start, reference it throughout, refresh when the model drifts.

This is exactly what Skills formalize and automate. The difference is:

  • Discoverability: Skills are auto-detected, not manually uploaded
  • Lazy loading: Only loaded when relevant, saving tokens
  • Shareability: A folder you can drop into any project or publish to GitHub
  • Composability: Multiple skills can coexist without conflicting

The System Prompt Notebook was the right instinct. Skills are the right implementation.


The Invisible Ramp

Every engineering team has a knowledge distribution problem. A handful of senior engineers hold the procedural knowledge that keeps the system running: the deployment gotchas, the reason migrations run in that specific order, the undocumented third step that prevents the race condition. This knowledge lives in their heads, not in documentation.

When those engineers leave, the knowledge goes with them. The team enters a months-long ramp where everyone rediscovers the same hard lessons. Post-mortems get written. Wikis get updated. Then the next engineer joins and ignores the wiki because it's out of date.

Skills are a forcing function for externalizing that knowledge in a form that's actually used. Not a Confluence page nobody reads. Not a wiki that decays. A file that a working AI agent reads before every relevant task.

The operational model shifts: instead of training new engineers on your deployment process, you write a deployment skill once, and every agent, human or AI, follows it automatically. The knowledge stays even when the people leave.

This is why the "curate, don't generate" finding from SkillsBench matters beyond benchmarks. When a senior engineer writes a skill, they're not just optimizing for AI performance. They're making their expertise durable.


The Data: Why Curation Matters

Recent empirical research from the SkillsBench benchmark, which we covered in detail previously, quantified what many practitioners suspected: Skills work, but how you build them matters enormously.

Key findings across 84 tasks and 7 frontier agents:

  • Human-curated Skills boost pass rates by +16.2 percentage points on average
  • Self-generated Skills (where the model writes its own) actually degrade performance by 1.3pp on average
  • The optimal configuration is 2–3 focused skills of moderate length
  • More documentation ≠ better performance: exhaustive skills hurt

The domain variance is striking. Healthcare (+51.9pp) and Manufacturing (+41.9pp) see massive gains because models lack procedural knowledge in those domains. Software Engineering (+4.5pp) sees minimal gains because models already have strong coding priors.

The implication is clear: invest in curated skill libraries, not bigger models. A smaller model with good Skills can match or exceed a larger model without them.


What Bad Skills Look Like

SkillsBench found that self-generated skills degrade performance by 1.3pp on average. That number understates the problem in practice. In domains where procedural correctness matters, a bad skill doesn't just fail to help; it actively misleads.

The failure modes are predictable.

Too long. Performance peaks at moderate length and declines with exhaustive documentation. The intuition: more content creates more opportunities for irrelevant guidance to crowd out relevant guidance. A skill that covers everything covers nothing well.

# Bad: kitchen-sink skill

## Overview
Our deployment process involves multiple stages and was designed in 2022 when we
migrated from Heroku to AWS. There are several historical reasons for the ordering
of steps that relate to a database incident in Q3...

[300 more lines of background, edge cases, and deprecated procedures]

# Good: focused skill

## Deploy to staging
1. Run `make db-migrate ENV=staging`
2. Run `make deploy ENV=staging VERSION=$(git rev-parse HEAD)`
3. Verify: `make health-check ENV=staging`

If step 3 fails: `make rollback ENV=staging`

Too generic. A skill that says "write good tests" teaches the model nothing it doesn't already know. Skills are valuable precisely because they contain your knowledge: your naming conventions, your team's definition of done, your specific gotchas. Generic guidance belongs in the system prompt. Skills are for the specific.

Model-generated. When you ask an AI to write its own skill, it produces confident-sounding text describing general best practices, not your specific procedures. It can't know what it doesn't know. SkillsBench validated this empirically: -1.3pp on average, and worse in specialized domains where the model's priors are weakest. Only someone who has actually done the work knows what to put in a skill.

Too many. The optimal configuration is 2–3 focused skills. Beyond that, the lazy-loading heuristic starts to break down: multiple partially relevant skills load simultaneously, creating conflicting guidance. Prune aggressively.

The test for a good skill: give it to a capable engineer who has never worked in your codebase and ask them to follow it. If they succeed without asking questions, the skill is good. If they have questions, add the answers. If it takes more than 10 minutes to read, split it.
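Some of these checks can be automated before a human ever reads the skill. Here is a minimal lint sketch applying the heuristics above; the word-count threshold is my own illustrative guess at "a 10-minute read", not part of any spec.

```python
# Quick sanity checks for a SKILL.md file. Thresholds are illustrative.
from pathlib import Path

MAX_WORDS = 1500  # rough proxy for a 10-minute read; tune for your team


def lint_skill(path: Path) -> list:
    """Return a list of problems; an empty list means the skill passes."""
    problems = []
    text = path.read_text()
    if not text.startswith("---"):
        problems.append("missing YAML frontmatter")
    else:
        header = text.split("---", 2)[1]
        if "name:" not in header:
            problems.append("frontmatter lacks 'name'")
        if "description:" not in header:
            problems.append("frontmatter lacks 'description'")
    if len(text.split()) > MAX_WORDS:
        problems.append(f"over {MAX_WORDS} words; consider splitting")
    return problems
```

A check like this can run in CI alongside the codebase the skills describe, so a kitchen-sink skill gets flagged the same way an oversized function would.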


What This Means

The emergence of Skills signals a maturation in the AI tooling ecosystem. We've moved through three phases:

  1. Prompt Engineering (2023): write better instructions in the chat window
  2. Tool Integration (2024): give models access to APIs, databases, and filesystems via MCP
  3. Knowledge Encoding (2025-2026): package domain expertise as portable, reusable artifacts

Each phase builds on the last. You still need good prompts. You still need tool access. And as agents move into production CI/CD pipelines (agentic continuous delivery), the need for reliable, codified expertise becomes even more critical. But the frontier of improvement has moved to encoding the procedural knowledge that makes an expert an expert.

The practical implications for engineering teams:

  • Build skill libraries alongside your codebase. Your deployment procedures, code review standards, and onboarding checklists are all candidates.
  • Curate, don't generate. The data shows that human-authored skills vastly outperform model-generated ones. This is a domain where human expertise genuinely matters.
  • Keep them focused. 2–3 targeted skills outperform kitchen-sink documentation. Write skills like you'd write good documentation: clear, concise, and actionable.
  • Make them portable. Use .agents/skills/ or .claude/skills/ directories that work across tools. Avoid vendor lock-in in your expertise encoding.

The most exciting thing about Skills is how boring they are. No new protocol to learn. No server to deploy. No SDK to install. Just markdown files that describe how to do things well.

The simplicity is the point.


Is the Skills Model Here to Stay?

AI tooling moves fast. Last year's breakthrough is this year's legacy. So the honest question is: are Skills a durable pattern, or just the current hype cycle?

There are strong reasons to think this one sticks.

Skills solve a problem that doesn't go away. Models will keep getting smarter, but they will never know your deployment process, your team's naming conventions, or your compliance requirements out of the box. That knowledge is inherently local and organizational. No amount of pre-training fixes it. Fine-tuning can approximate it, but at a cost that resets with every model generation. Skills solve this permanently: write once, use across every model upgrade.

The format is too simple to displace. Protocols get superseded. APIs get deprecated. SDKs get abandoned. But a markdown file in a folder? That's been a stable interface since the 1990s. There's nothing to version, nothing to maintain, nothing that breaks when a dependency updates. The absence of complexity is itself a form of durability. Any agent that can read a file can consume a Skill, and that's not going to change.

The adoption keeps expanding. Claude pioneered it. Goose adopted it within weeks. And the pattern didn't stop there.

OpenAI's Codex CLI uses .codex/skills/ directories with the same SKILL.md format: YAML frontmatter plus procedural instructions, with optional scripts and reference files alongside. Their own repo includes skills like a PR babysitter that monitors CI, handles review comments, and auto-fixes failures. Same structure, same philosophy.

Google's Gemini CLI took a slightly different but parallel path. Instead of SKILL.md files, it uses GEMINI.md context files with a hierarchical discovery system (global, workspace, and just-in-time loading when a directory is accessed) and .toml-based custom commands for reusable workflows. The primitives differ, but the intent is identical: give the agent procedural knowledge without burning context tokens upfront.

When three major AI labs and an open-source project all build the same pattern within months of each other, you're looking at a de facto standard, not a fad. The .agents/skills/ convention is becoming what .github/workflows/ became for CI, a directory everyone recognizes.

The economics favor it. SkillsBench showed that a smaller model with curated Skills can match a larger model without them. That's a powerful economic argument. Instead of spending more on compute per token, organizations can invest in knowledge curation, a one-time cost with compounding returns. As models become commoditized and pricing compresses, the differentiator shifts from which model to what knowledge you feed it. Skills are the vehicle for that knowledge.

But there are open questions. Skill discoverability is still primitive: agents scan a directory and match on short descriptions. There's no versioning, no dependency management, no testing framework for skills. What happens when you have 200 skills and the agent picks the wrong one? What about skill conflicts? These are solvable problems, and the ecosystem will likely produce answers, but they're not solved yet.

The registry gap is the most obvious missing piece. GitHub Actions has the Marketplace. npm has the registry. Terraform has the Module Registry. Skills have... GitHub search. You can find public skill repositories: OpenAI's Codex repo includes skills for PR babysitting and CI management, and individual practitioners are publishing skill collections, but there's no canonical place to discover, rate, or install them.

This will change. The pattern is too useful and the format too simple for the ecosystem not to produce registries. When it does, "which skills should I use" becomes as important a question as "which npm packages should I install", which means quality signaling, versioning, and community curation become critical infrastructure. The skills that get widely adopted will become de facto standards for how AI agents operate in specific domains. The marketplace is the next step.

There's also the question of whether models eventually internalize the patterns that Skills currently encode. If future models can reliably generate their own procedural knowledge (something SkillsBench shows they can't do today), the value proposition shifts. But even then, Skills would likely remain as a verification and consistency layer. You don't stop writing runbooks just because your engineers are experienced.

The most telling signal is what's not happening. Nobody is building a competitor to Skills. Nobody is proposing an alternative format or a richer protocol. The industry is just... adopting markdown files in directories. When a solution is so obvious that competition doesn't emerge, that's usually because the solution is correct.

Skills aren't exciting. They're not technically impressive. They don't require a PhD to understand or a startup to implement. They're just text files that make AI agents better at their jobs.

That's exactly why they'll last.


References

  • Willison, S. (2025). Claude Skills are awesome, maybe a bigger deal than MCP. simonwillison.net
  • Block. (2025). Did Skills Kill MCP? block.github.io/goose
  • u/Lumpy-Ad-173. (2025). Build An External AI Memory (Context) File. Reddit
  • Li, X., Chen, W., Liu, Y., et al. (2026). SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks. arXiv:2602.12670

daita@system:~$ _