Skill v1.0.1
currentAutomated scan100/1001 files
version: "1.0.1" name: eval-harness description: Formal evaluation framework for Claude Code sessions implementing eval-driven development (EDD) principles. when_to_use: Use when the user wants formal evals, pass/fail regression checks, or eval-driven development for AI behavior.
Eval Harness Skill
A formal evaluation framework for Claude Code sessions, implementing eval-driven development (EDD) principles.
Philosophy
Eval-Driven Development treats evals as the "unit tests of AI development":
- Define expected behavior BEFORE implementation
- Run evals continuously during development
- Track regressions with each change
- Use pass@k metrics for reliability measurement
Eval Types
Capability Evals
Test if Claude can do something it couldn't before:
[CAPABILITY EVAL: feature-name]Task: Description of what Claude should accomplishSuccess Criteria:-[ ] Criterion 1-[ ] Criterion 2-[ ] Criterion 3Expected Output: Description of expected result
Regression Evals
Ensure changes don't break existing functionality:
[REGRESSION EVAL: feature-name]Baseline: SHA or checkpoint nameTests:-existing-test-1: PASS/FAIL-existing-test-2: PASS/FAIL-existing-test-3: PASS/FAILResult: X/Y passed (previously Y/Y)
Grader Types
1. Code-Based Grader
Deterministic checks using code:
# Check if file contains expected patterngrep -q "export function handleAuth" src/auth.ts && echo "PASS" || echo "FAIL"# Check if tests passnpm test -- --testPathPattern="auth" && echo "PASS" || echo "FAIL"# Check if build succeedsnpm run build && echo "PASS" || echo "FAIL"
2. Model-Based Grader
Use Claude to evaluate open-ended outputs:
[MODEL GRADER PROMPT]Evaluate the following code change:1.Does it solve the stated problem?2.Is it well-structured?3.Are edge cases handled?4.Is error handling appropriate?Score: 1-5 (1=poor, 5=excellent)Reasoning: [explanation]
3. Human Grader
Flag for manual review:
[HUMAN REVIEW REQUIRED]Change: Description of what changedReason: Why human review is neededRisk Level: LOW/MEDIUM/HIGH
Metrics
pass@k
"At least one success in k attempts"
- pass@1: First attempt success rate
- pass@3: Success within 3 attempts
- Typical target: pass@3 > 90%
pass^k
"All k trials succeed"
- Higher bar for reliability
- pass^3: 3 consecutive successes
- Use for critical paths
Eval Workflow
1. Define (Before Coding)
## EVAL DEFINITION: feature-xyz### Capability Evals1.Can create new user account2.Can validate email format3.Can hash password securely### Regression Evals1.Existing login still works2.Session management unchanged3.Logout flow intact### Success Metrics-pass@3 > 90% for capability evals-pass^3 = 100% for regression evals
2. Implement
Write code to pass the defined evals.
3. Evaluate
# Run capability evals[Run each capability eval, record PASS/FAIL]# Run regression evalsnpm test -- --testPathPattern="existing"# Generate report
4. Report
EVAL REPORT: feature-xyz========================Capability Evals:create-user: PASS (pass@1)validate-email: PASS (pass@2)hash-password: PASS (pass@1)Overall: 3/3 passedRegression Evals:login-flow: PASSsession-mgmt: PASSlogout-flow: PASSOverall: 3/3 passedMetrics:pass@1: 67% (2/3)pass@3: 100% (3/3)Status: READY FOR REVIEW
Integration Patterns
Pre-Implementation
/eval define feature-name
Creates eval definition file at .claude/evals/feature-name.md
During Implementation
/eval check feature-name
Runs current evals and reports status
Post-Implementation
/eval report feature-name
Generates full eval report
Eval Storage
Store evals in project:
.claude/evals/feature-xyz.md # Eval definitionfeature-xyz.log # Eval run historybaseline.json # Regression baselines
Best Practices
- Define evals BEFORE coding - Forces clear thinking about success criteria
- Run evals frequently - Catch regressions early
- Track pass@k over time - Monitor reliability trends
- Use code graders when possible - Deterministic > probabilistic
- Human review for security - Never fully automate security checks
- Keep evals fast - Slow evals don't get run
- Version evals with code - Evals are first-class artifacts
Example: Adding Authentication
## EVAL: add-authentication### Phase 1: Define (10 min)Capability Evals:-[ ] User can register with email/password-[ ] User can login with valid credentials-[ ] Invalid credentials rejected with proper error-[ ] Sessions persist across page reloads-[ ] Logout clears sessionRegression Evals:-[ ] Public routes still accessible-[ ] API responses unchanged-[ ] Database schema compatible### Phase 2: Implement (varies)[Write code]### Phase 3: EvaluateRun: /eval check add-authentication### Phase 4: ReportEVAL REPORT: add-authentication==============================Capability: 5/5 passed (pass@3: 100%)Regression: 3/3 passed (pass^3: 100%)Status: SHIP IT