The Specific Situation

You wrote a skill that generates database migration files. It works when you invoke it with /generate-migration. But when a teammate says “create a migration for the users table,” nothing happens. When another teammate says “help me with database stuff,” the skill fires unexpectedly and generates an unwanted migration file. You shipped it without testing the trigger boundaries, and now the team does not trust the skill. Here is the testing workflow that prevents this.

Technical Foundation

Claude Code skills have three testable layers: trigger matching (does the description match the right requests?), execution (does the body produce the correct output?), and lifecycle (does the skill survive compaction and work across long sessions?).

Skills load their body once on invocation. The content enters the conversation as a single message and stays for the session. After auto-compaction, each skill retains at most 5,000 tokens from a shared 25,000-token budget. Testing must verify behavior both immediately after invocation and after compaction has occurred.

Claude watches skill directories for changes, so you can edit SKILL.md during a session and see updates immediately. This makes iterative testing fast.

The Working SKILL.md (Test Harness Skill)

Create a skill that validates other skills:

---
name: test-skill
description: >
  Test another skill by running trigger matching and execution checks.
  Use when the user says "test my skill" or "validate skill".
disable-model-invocation: true
argument-hint: "[skill-name]"
allowed-tools: Read Grep Glob
---

# Skill Test Harness

Test the skill specified by $ARGUMENTS.

## Test Sequence

1. **Locate the skill**: Find SKILL.md at
   `.claude/skills/$ARGUMENTS/SKILL.md` or
   `~/.claude/skills/$ARGUMENTS/SKILL.md`

2. **Validate frontmatter**: Read the file and check:
   - YAML parses without errors
   - `description` is present and under 1,536 characters
   - `name` uses only lowercase, numbers, hyphens (max 64 chars)
   - `allowed-tools` patterns use valid syntax if present

3. **Check trigger phrases**: List 5 natural language requests that
   SHOULD trigger this skill and 5 that SHOULD NOT. Ask the user
   to confirm the boundary is correct.

4. **Verify file references**: If the body references supporting
   files (templates/, scripts/, references/), confirm each file
   exists.

5. **Report**: Summarize pass/fail for each check.

The 5-Step Testing Workflow

Step 1: Trigger Boundary Testing

Ask Claude “What skills are available?” to confirm your skill appears in the list. Then test with five phrases:

# Should trigger:
"generate a migration for the users table"
"create database migration"
"write a migration file"

# Should NOT trigger:
"help me with the database"
"what tables exist?"

If the wrong phrases trigger the skill, refine the description field. Add specific keywords that match intended use. Remove vague terms.

Step 2: Argument Passing

Test $ARGUMENTS substitution by invoking with different inputs:

/generate-migration users           # Single arg
/generate-migration users status    # Multiple args
/generate-migration                 # No args (should handle gracefully)
/generate-migration "user sessions" # Quoted multi-word arg

Verify that $0, $1, and $ARGUMENTS resolve correctly in the skill body.

Step 3: Permission Validation

If the skill uses allowed-tools, verify Claude does not prompt for those tools during execution. Test that tools NOT in the allowed list still require permission. Check that deny rules in /permissions override allowed-tools.

Step 4: Compaction Survival

Run a long session with 20+ messages after invoking the skill. Then test the skill’s instructions again. If behavior degrades, re-invoke with /skill-name and verify it recovers. Skills get at most 5,000 tokens after compaction from a shared 25,000-token budget.

Step 5: Edge Cases

Common Problems and Fixes

Skill triggers on unrelated requests: The description is too broad. Replace “helps with database tasks” with “generates SQL migration files for schema changes.”

Skill works manually but not automatically: Check that disable-model-invocation is not set to true. Also verify the description contains words the user would naturally say.

Scripts fail in the skill: Verify execute permissions with ls -la scripts/. Ensure allowed-tools includes the correct pattern like Bash(bash ${CLAUDE_SKILL_DIR}/scripts/*).

Skill works in first session but fails after restart: The skill directory might have been moved or renamed. Run find ~/.claude/skills .claude/skills -name "SKILL.md" to confirm paths.

Production Gotchas

There is no built-in eval framework for skills in the open-source Claude Code. Anthropic’s official skill-creator plugin includes an eval system with test cases, grading agents, and benchmark comparisons. If you need rigorous testing, consider using it as a template.

The live change detection watches for file modifications, not just saves. Some editors write to a temp file and rename (atomic save), which the watcher may miss on certain filesystems. If edits do not take effect, try deleting and recreating the file.

When testing in CI environments, skills in .claude/skills/ are discovered normally. But ~/.claude/skills/ personal skills may not exist in CI containers. Test project-level skills separately from personal ones.

Checklist