Claude Code for Promptfoo — Workflow Guide
The Setup
You are using Promptfoo for systematic LLM prompt evaluation — testing prompts against multiple models with assertion-based grading. Promptfoo helps you catch prompt regressions, compare model performance, and document prompt quality. Claude Code can write Promptfoo configurations, but it creates one-off scripts instead of structured evaluation configs.
What Claude Code Gets Wrong By Default
-
Tests prompts manually with one-off scripts. Claude writes Node.js scripts that call the API and print results. Promptfoo provides a structured YAML config with test cases, assertions, and comparison views.
-
Evaluates against a single model. Claude tests with one provider. Promptfoo runs the same prompts against multiple models simultaneously for side-by-side comparison.
-
Skips assertion-based grading. Claude visually inspects output. Promptfoo supports assertions:
contains,icontains,javascript,llm-rubric,similarfor automated quality checks. -
Does not track prompt versions. Claude overwrites prompts without versioning. Promptfoo evaluations are reproducible — the YAML config serves as version-controlled prompt documentation.
The CLAUDE.md Configuration
# Promptfoo LLM Evaluation
## Eval Framework
- Tool: Promptfoo (prompt testing and evaluation)
- Config: promptfooconfig.yaml at project root
- CLI: npx promptfoo eval, npx promptfoo view
## Promptfoo Rules
- Config in promptfooconfig.yaml (YAML format)
- Providers: list of models to test against
- Prompts: template strings with {{variable}} placeholders
- Tests: array of test cases with vars and assertions
- Assertions: contains, icontains, javascript, llm-rubric, similar
- Run: npx promptfoo eval (executes all tests)
- View: npx promptfoo view (opens comparison UI)
- Cache: results cached by default for reproducibility
## Conventions
- promptfooconfig.yaml committed to version control
- Test cases cover edge cases and expected behaviors
- Use llm-rubric for subjective quality evaluation
- Compare models: claude, gpt-4, llama in providers list
- Variable datasets for comprehensive testing
- Share results with npx promptfoo share
- CI integration: npx promptfoo eval --no-cache in pipeline
Workflow Example
You want to evaluate a customer support prompt across models. Prompt Claude Code:
“Create a Promptfoo config that tests a customer support agent prompt across Claude, GPT-4, and Llama. Include 5 test cases covering complaint handling, refund requests, and technical support. Add assertions for tone (polite), accuracy (mentions policy), and response length.”
Claude Code should create promptfooconfig.yaml with three providers, the system prompt as a template, five test cases with vars for different customer messages, and assertions using contains for policy references, javascript for length checks, and llm-rubric for tone evaluation.
Common Pitfalls
-
Missing variable syntax in prompts. Claude uses
${variable}JavaScript template literals. Promptfoo uses{{variable}}Mustache-style syntax in prompt templates. Wrong syntax means variables are not substituted and tests pass with empty values. -
Over-relying on exact match assertions. Claude uses
equalsassertions for LLM output. LLM responses vary between runs. Usecontains,icontains, orsimilarfor fuzzy matching, andllm-rubricfor semantic evaluation. -
Cache confusion during development. Claude modifies prompts but gets old results. Promptfoo caches by default. Use
npx promptfoo eval --no-cachewhen iterating on prompts, or clear cache withnpx promptfoo cache clear.