Anthropic recently released their Agent Skills specification, a framework for teaching AI agents new capabilities through modular, reusable skill packages. It’s the backbone of Claude Code’s extensibility. I successfully replicated this architecture in DSPy, and in this post, I’ll break down how it works and what I learned.

Code is hosted here: https://github.com/ivanvza/dspy-skills

The Problem with Monolithic Prompts

When you build an LLM agent, you typically stuff everything into a system prompt:

You are a helpful assistant that can:
- Search the web
- Read files
- Run shell commands
- Query databases
- Generate reports
- ... 50 more things

This approach has problems:

  • Context explosion: Every capability eats tokens, even when unused
  • Maintenance nightmare: Adding features means editing one massive prompt
  • No composability: Can’t share capabilities across projects
  • Brittle: One bad instruction can break unrelated features

Anthropic’s solution? Progressive disclosure through skills.

What Are Agent Skills?

A skill is a self-contained directory that teaches an agent how to do something specific:

network-check/
├── SKILL.md              # Metadata + instructions
├── scripts/
│   ├── ping_host.py
│   └── check_port.py
└── references/
    └── troubleshooting.md

The magic is in the SKILL.md file:

---
name: network-check
description: Network connectivity testing toolkit for checking host reachability, port availability, and DNS resolution. Use when diagnosing network issues or verifying service availability.
allowed-tools: Bash(ping:*) Bash(nmap:*)
---
 
# Network Connectivity Check
 
When the user needs to diagnose network issues, follow this decision tree:
 
1. **Basic connectivity**: Run `ping_host.py` first
2. **Port check**: If host responds, use `check_port.py` for specific services
3. **DNS issues**: If hostname doesn't resolve, check DNS configuration
 
## Available Scripts
 
- `ping_host.py <host>` - Check if host is reachable
- `check_port.py <host> <port>` - Test specific port connectivity

The key insight: the agent doesn’t load the full instructions until it needs them.
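Progressive disclosure starts with reading only that frontmatter. A minimal sketch of the discovery-side parsing (the helper name is my own; it assumes flat `key: value` frontmatter, so no YAML library is needed):

```python
from pathlib import Path

def parse_frontmatter(skill_md: Path) -> dict[str, str]:
    """Read only the frontmatter block between the two leading '---' fences."""
    text = skill_md.read_text(encoding="utf-8")
    if not text.startswith("---"):
        raise ValueError(f"{skill_md}: missing frontmatter")
    # split("---", 2) -> ['', frontmatter, body]; the body stays unread
    _, frontmatter, _body = text.split("---", 2)
    metadata: dict[str, str] = {}
    for line in frontmatter.strip().splitlines():
        key, _, value = line.partition(":")
        if key.strip():
            metadata[key.strip()] = value.strip()
    return metadata
```

At discovery time only `name` and `description` are kept; everything after the second `---` is ignored until activation.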

Progressive Disclosure Architecture

This is what makes skills efficient:

flowchart LR
    subgraph "Phase 1: Discovery (~100 tokens/skill)"
        A[Scan Directories] --> B[Parse YAML Frontmatter]
        B --> C[Store Name + Description]
    end

    subgraph "Phase 2: Activation (on-demand)"
        D[User Request Matches] --> E[Load Full SKILL.md]
        E --> F[Instructions in Context]
    end

    subgraph "Phase 3: Execution (on-demand)"
        G[Run Scripts] --> H[Read References]
    end

    C --> D
    F --> G

Phase 1: At startup, only metadata is loaded. 10 skills = ~1000 tokens, not 50,000.

Phase 2: When a user asks something that matches a skill’s description, the agent activates it and loads full instructions.

Phase 3: Scripts execute without loading their source code into context. References are read only when needed.

This mirrors how humans work. You don’t hold every procedure manual in working memory—you know what you can do, and look up how when needed.

The DSPy Implementation

I built dspy-skills, a full implementation of the Agent Skills spec for DSPy’s ReAct agents. Here’s the architecture:

flowchart TB
    subgraph "User Application"
        A[SkillsReActAgent]
    end

    subgraph "Skill Management"
        B[SkillManager]
        C[Parser]
        D[Validator]
    end

    subgraph "Meta-Tools"
        E[list_skills]
        F[activate_skill]
        G[run_skill_script]
        H[read_skill_resource]
    end

    subgraph "Security"
        I[ScriptExecutor]
    end

    A --> B
    B --> C
    B --> D
    A --> E
    A --> F
    A --> G
    A --> H
    G --> I

The Core Classes

SkillManager handles the lifecycle:

class SkillManager:
    def __init__(self, skill_directories: list[Path]):
        self.skills: dict[str, LoadedSkill] = {}
        self.skill_directories = skill_directories
 
    def discover(self) -> list[LoadedSkill]:
        """Scan directories for SKILL.md files, load only metadata."""
        for directory in self.skill_directories:
            for skill_path in directory.glob("*/SKILL.md"):
                metadata = parse_frontmatter(skill_path)
                self.skills[metadata["name"]] = LoadedSkill(
                    name=metadata["name"],
                    description=metadata["description"],
                    path=skill_path.parent,
                    state=SkillState.DISCOVERED,
                    instructions=None  # Not loaded yet!
                )
        return list(self.skills.values())
 
    def activate(self, name: str) -> str:
        """Load full instructions for a skill."""
        if name not in self.skills:
            raise SkillNotFoundError(name)
        skill = self.skills[name]
        skill.instructions = read_instructions(skill.path / "SKILL.md")
        skill.state = SkillState.ACTIVATED
        return skill.instructions

LoadedSkill tracks state:

@dataclass
class LoadedSkill:
    name: str
    description: str
    path: Path
    state: SkillState  # DISCOVERED or ACTIVATED
    instructions: Optional[str]
 
    @property
    def scripts_dir(self) -> Path:
        return self.path / "scripts"
 
    @property
    def references_dir(self) -> Path:
        return self.path / "references"
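The `SkillState` enum itself is just two states. Here is the lifecycle in miniature, reimplemented standalone against a throwaway skill directory (the real classes live in dspy-skills; this is only a sketch of the state transitions):

```python
import tempfile
from enum import Enum, auto
from pathlib import Path

class SkillState(Enum):
    DISCOVERED = auto()
    ACTIVATED = auto()

# Throwaway skill on disk
skill_dir = Path(tempfile.mkdtemp()) / "network-check"
skill_dir.mkdir()
(skill_dir / "SKILL.md").write_text(
    "---\nname: network-check\ndescription: Ping and port checks\n---\n\n# Instructions\n"
)

# Phase 1: discovery keeps only the frontmatter
_, frontmatter, body = (skill_dir / "SKILL.md").read_text().split("---", 2)
state, instructions = SkillState.DISCOVERED, None  # body deliberately not loaded

# Phase 2: activation loads the instruction body
instructions = body.strip()
state = SkillState.ACTIVATED
print(state.name, instructions)  # ACTIVATED # Instructions
```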

Meta-Tools: How the Agent Interacts

I give the agent four tools to interact with skills:

| Tool | Purpose | When Used |
| --- | --- | --- |
| `list_skills()` | Show available skills | Agent needs to know what it can do |
| `activate_skill(name)` | Load full instructions | Task matches a skill's description |
| `run_skill_script(skill, script, args)` | Execute bundled code | Following skill instructions |
| `read_skill_resource(skill, type, file)` | Read reference docs | Needs additional context |
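`list_skills` is the simplest of the four. A sketch of how it can be built over the same manager (assuming `SkillManager` exposes the `skills` dict shown earlier):

```python
from typing import Callable

def create_list_skills_tool(manager) -> Callable[[], str]:
    def list_skills() -> str:
        """List every discovered skill with its one-line description."""
        if not manager.skills:
            return "No skills available."
        lines = ["Available skills:"]
        for skill in manager.skills.values():
            lines.append(f"- {skill.name}: {skill.description}")
        return "\n".join(lines)
    return list_skills
```

Because only `name` and `description` are ever touched, this tool stays cheap no matter how many skills are installed.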

Here’s how activate_skill works:

def create_activate_skill_tool(manager: SkillManager) -> Callable:
    def activate_skill(skill_name: str) -> str:
        """Activate a skill to get its full instructions.
 
        Args:
            skill_name: Name of the skill to activate
 
        Returns:
            The skill's full instructions and available resources
        """
        try:
            instructions = manager.activate(skill_name)
            skill = manager.get_skill(skill_name)
 
            # Build response with instructions and available resources
            response = f"# {skill_name} - Activated\n\n"
            response += instructions + "\n\n"
 
            # List available scripts
            if skill.scripts_dir.exists():
                scripts = list(skill.scripts_dir.glob("*"))
                if scripts:
                    response += "## Available Scripts\n"
                    for s in scripts:
                        response += f"- {s.name}\n"
 
            return response
 
        except SkillNotFoundError:
            return f"Error: Skill '{skill_name}' not found"
 
    return activate_skill

Secure Script Execution

Skills can include executable scripts. This is powerful but dangerous. The ScriptExecutor enforces security:

class ScriptExecutor:
    ALLOWED_INTERPRETERS = {"python3", "bash", "sh", "node"}
    DEFAULT_TIMEOUT = 30
 
    def __init__(self, config: ScriptConfig):
        self.timeout = config.timeout
        self.sandbox_enabled = config.sandbox
 
    def run(self, script_path: Path, args: list[str]) -> ExecutionResult:
        # 1. Validate script exists and is within skill directory
        if not self._is_safe_path(script_path):
            raise SecurityError("Path traversal detected")
 
        # 2. Determine interpreter from shebang or extension
        interpreter = self._get_interpreter(script_path)
        if interpreter not in self.ALLOWED_INTERPRETERS:
            raise SecurityError(f"Interpreter not allowed: {interpreter}")
 
        # 3. Build command
        cmd = [interpreter, str(script_path)] + args
 
        # 4. Optional sandboxing (firejail on Linux)
        if self.sandbox_enabled and shutil.which("firejail"):
            cmd = ["firejail", "--quiet", "--private"] + cmd
 
        # 5. Execute with timeout
        result = subprocess.run(
            cmd,
            capture_output=True,
            timeout=self.timeout,
            env=self._sanitized_env()
        )
 
        return ExecutionResult(
            stdout=result.stdout.decode(),
            stderr=result.stderr.decode(),
            exit_code=result.returncode
        )

Key security measures:

  • Path validation: Scripts must be inside their skill directory
  • Interpreter allowlist: Only python3, bash, sh, node
  • Timeout enforcement: Default 30 seconds, prevents runaway processes
  • Optional sandboxing: Firejail isolation on Linux
  • Environment sanitization: Only PATH and PYTHONPATH passed through
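The two helpers the executor relies on are small; a sketch of what they might look like (illustrative module-level versions of `_is_safe_path` and `_sanitized_env`, not the code from dspy-skills):

```python
import os
from pathlib import Path

def is_safe_path(script_path: Path, skill_dir: Path) -> bool:
    """Reject paths that escape the skill directory via '..' or symlinks."""
    try:
        # resolve() normalizes '..' and follows symlinks before comparing
        script_path.resolve().relative_to(skill_dir.resolve())
        return True
    except ValueError:
        return False

def sanitized_env() -> dict[str, str]:
    """Pass through only PATH and PYTHONPATH, dropping everything else."""
    allowed = {"PATH", "PYTHONPATH"}
    return {k: v for k, v in os.environ.items() if k in allowed}
```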

Putting It Together

Here’s how you use it:

import dspy
from dspy_skills import SkillsReActAgent, SkillsConfig
 
# Configure DSPy with your LLM
dspy.configure(lm=dspy.LM("anthropic/claude-sonnet-4-20250514"))
 
# Load skills from directories
config = SkillsConfig(
    skill_directories=["./skills", "~/.skills"],
    scripts={"enabled": True, "sandbox": True, "timeout": 30}
)
 
# Create the agent
agent = SkillsReActAgent(
    signature="request: str -> response: str",
    config=config
)
 
# Use it
result = agent(request="Check if api.example.com port 443 is accessible")

Real-World Example: System Process Check

Here’s an actual trace from the agent when asked “Check if there is anything odd running on the system”:

sequenceDiagram
    participant User
    participant Agent
    participant Skills as Skill Manager
    participant Script as ScriptExecutor

    User->>Agent: "Check if anything odd is running"

    Note over Agent: Thought: I need system-level commands
    Agent->>Skills: activate_skill("system_monitor")
    Skills-->>Agent: ❌ Error: Skill not found

    Note over Agent: Thought: Let me check available skills
    Agent->>Skills: list_skills()
    Skills-->>Agent: file-utils, json-tools, system-info...

    Note over Agent: Thought: system-info can list processes
    Agent->>Skills: activate_skill("system-info")
    Skills-->>Agent: ✅ Full instructions loaded

    Note over Agent: Thought: Use process_list.py
    Agent->>Script: run_skill_script("system-info", "process_list.py")
    Script-->>Agent: 535 processes listed

    Note over Agent: Thought: Nothing unusual detected
    Agent->>User: Process list shows normal activity

What’s happening here:

  1. Failed guess: The agent tried system_monitor (wrong name) and got an error
  2. Recovery: Instead of failing, it used list_skills() to discover available options
  3. Correct activation: Found system-info and activated it
  4. Script execution: Ran the bundled process_list.py script
  5. Completion: Analyzed results and responded

This demonstrates the resilience of the skill discovery pattern—the agent self-corrects when it guesses wrong.

A quick run of the above flow:

uv run main.py
============================================================
Skills Agent Chat
Loaded 9 skills: file-utils, json-tools, web-fingerprint, network-recon, network-check, skill-creator, system-info, pentest-commands, web-scraper
Type 'quit' or 'exit' to stop, 'skills' to list skills
============================================================
 
> Can you do a quick check to see if there is anything odd running on my system.
 
== Acting Step ===
  next_thought: To check if there is anything odd running on the system, I need to execute a command that lists currently running processes or checks for any unusual activity. Since I'll be using system-level commands, I'll start by activating the necessary skills that allow for system checks.
  next_tool_name: activate_skill
  next_tool_args: {'skill_name': 'system_monitor'}
 
== Acting Step ===
  next_thought: Since the 'system_monitor' skill is not available, I will list all available skills to find one that can help with checking for unusual activities on the system.
  next_tool_name: list_skills
  next_tool_args: {}
 
== Acting Step ===
  next_thought: The 'system-info' skill available can provide essential information regarding system processes and performance. I will activate this skill to gather information about running processes and check for any unusual activity on the system.
  next_tool_name: activate_skill
  next_tool_args: {'skill_name': 'system-info'}
 
== Acting Step ===
  next_thought: I have activated the 'system-info' skill, which can provide detailed information about running processes on the system. Next, I will use the `process_list.py` script to get a list of currently running processes and check for any unusual activities.
  next_tool_name: run_skill_script
  next_tool_args: {'skill_name': 'system-info', 'script_name': 'process_list.py', 'arguments': ''}
 
== Acting Step ===
  next_thought: I have successfully listed all running processes on the system. The output shows a total of 535 processes running, including various system and user processes. There doesn't appear to be anything unusual based on the list provided, but I will conclude the task without further investigation since the user did not specify any suspicious activity.
  next_tool_name: finish
  next_tool_args: {}
 
== Acting Step ===
  reasoning: I successfully utilized the `system-info` skill to execute a process listing on the user's system. The command returned a total of 535 running processes, which included various system and user-specific processes. No unusual or suspicious entries were immediately discernible from the provided list of processes. Therefore, based on the available data, there doesn't seem to be any odd activity on the system.
  response: I have checked your system and found a total of 535 processes currently running. There doesn't seem to be anything unusual in the list of processes. If you have specific concerns or notice any symptoms of unusual activity, please let me know!
 
 
============================================================
 
I have checked your system and found a total of 535 processes currently running. There doesn't seem to be anything unusual in the list of processes. If you have specific concerns or notice any symptoms of unusual activity, please let me know!
 
============================================================

The Raw ReAct Trace

Here’s the actual thought/action/observation loop:

| Step | Thought | Action | Result |
| --- | --- | --- | --- |
| 0 | "I need system-level commands for checking processes" | `activate_skill("system_monitor")` | ❌ Skill not found |
| 1 | "Let me list available skills" | `list_skills()` | 9 skills available |
| 2 | "system-info can check processes" | `activate_skill("system-info")` | ✅ Instructions loaded |
| 3 | "I'll use process_list.py" | `run_skill_script("system-info", "process_list.py")` | 535 processes listed |
| 4 | "Nothing unusual in the output" | `finish()` | Done |

5 iterations. The agent recovered from an error, found the right skill, executed a script, and delivered results. No human intervention required.

Lessons Learned

1. Descriptions Are Everything

The skill description is how the agent decides what to activate. Bad descriptions = wrong skill selection.

Bad: "Network utilities"

Good: "Network connectivity testing toolkit for checking host reachability, port availability, and DNS resolution. Use when diagnosing network issues or verifying service availability."

Include:

  • What the skill does
  • When to use it
  • Key capabilities as keywords

2. Progressive Disclosure Actually Works

I tested with 20+ skills. Without progressive disclosure, context usage was ~40K tokens at startup. With it, ~2K tokens. The agent activates 1-2 skills per task on average.
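The back-of-envelope arithmetic behind those numbers, using the rough per-skill figures above (the 100-token and 2,000-token values are estimates, not measurements):

```python
n_skills = 20
metadata_tokens = 100       # name + description per skill
full_skill_tokens = 2_000   # a complete SKILL.md

eager_startup = n_skills * full_skill_tokens   # load everything up front
lazy_startup = n_skills * metadata_tokens      # progressive disclosure

activated_per_task = 2                         # typical activations per task
lazy_per_task = lazy_startup + activated_per_task * full_skill_tokens

print(eager_startup, lazy_startup, lazy_per_task)  # 40000 2000 6000
```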

3. Scripts > Generated Code

It’s tempting to let the agent generate code on the fly. Don’t. Bundled scripts are:

  • Tested and reliable
  • Secure (no arbitrary code execution)
  • Consistent (same behavior every time)
  • Auditable (you can review them)

4. Tool Restrictions Reduce Errors

The allowed-tools field in SKILL.md restricts what tools the agent can use when a skill is active:

allowed-tools: Bash(python:*) Bash(nmap:*) Read

This prevents the agent from going off-script. If a network skill is active, it can’t suddenly start editing files.
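dspy-skills has to enforce that field itself. A sketch of one way to match calls against the `Tool(cmd:pattern)` syntax, using `fnmatch` for the wildcard part (this is my own interpretation of the format, not Anthropic's reference matcher):

```python
import re
from fnmatch import fnmatch

def tool_call_allowed(allowed_tools: str, tool: str, command: str = "") -> bool:
    """Check a tool invocation against a space-separated allowed-tools field.

    A bare entry like 'Read' matches on tool name alone; an entry like
    'Bash(ping:*)' matches Bash calls whose first word is 'ping'.
    """
    for entry in allowed_tools.split():
        m = re.fullmatch(r"(\w+)\((\w+):(.*)\)", entry)
        if m is None:
            if entry == tool:  # bare tool name, e.g. Read
                return True
            continue
        entry_tool, cmd_pattern, args_pattern = m.groups()
        first, _, rest = command.partition(" ")
        if entry_tool == tool and fnmatch(first, cmd_pattern) and fnmatch(rest, args_pattern):
            return True
    return False
```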

Comparison to Anthropic’s Implementation

| Feature | Anthropic (Claude Code) | DSPy Implementation |
| --- | --- | --- |
| Skill format | SKILL.md + YAML frontmatter | Identical |
| Progressive disclosure | Yes | Yes |
| Script execution | Yes (sandboxed) | Yes (firejail optional) |
| Tool restrictions | `allowed-tools` field | `allowed-tools` field |
| Forked context | Yes (`context: fork`) | Not yet implemented |
| Model override | Yes (`model` field) | Not yet implemented |
| Hooks | PreToolUse, PostToolUse, Stop | Not yet implemented |

We’ve implemented the core specification. The advanced features (forked contexts, hooks) are on the roadmap.


Resources: