Skill v1.0.1

currentAutomated scan100/100

cloudnative-co/claude-code-starter-kit/eval-harness

1 files

──Details

PublishedMay 30, 2026 at 08:44 AM

Content Hashsha256:1405630d33ad7533...

Git SHA746ce69ab220

Bump Typepatch

Compare with v1.0.0

──Files

Files (1 file, 5.2 KB)

SKILL.md5.2 KBactive

SKILL.md · 229 lines · 5.2 KB

version: "1.0.1" name: eval-harness description: Formal evaluation framework for Claude Code sessions implementing eval-driven development (EDD) principles. when_to_use: Use when the user wants formal evals, pass/fail regression checks, or eval-driven development for AI behavior.

Eval Harness Skill

A formal evaluation framework for Claude Code sessions, implementing eval-driven development (EDD) principles.

Philosophy

Eval-Driven Development treats evals as the "unit tests of AI development":

Define expected behavior BEFORE implementation
Run evals continuously during development
Track regressions with each change
Use pass@k metrics for reliability measurement

Eval Types

Capability Evals

Test if Claude can do something it couldn't before:

markdown

[CAPABILITY EVAL: feature-name]
Task: Description of what Claude should accomplish
Success Criteria:
  -[ ] Criterion 1
  -[ ] Criterion 2
  -[ ] Criterion 3
Expected Output: Description of expected result

Regression Evals

Ensure changes don't break existing functionality:

markdown

[REGRESSION EVAL: feature-name]
Baseline: SHA or checkpoint name
Tests:
  -existing-test-1: PASS/FAIL
  -existing-test-2: PASS/FAIL
  -existing-test-3: PASS/FAIL
Result: X/Y passed (previously Y/Y)

Grader Types

1. Code-Based Grader

Deterministic checks using code:

bash

# Check if file contains expected pattern
grep -q "export function handleAuth" src/auth.ts && echo "PASS" || echo "FAIL"
 
# Check if tests pass
npm test -- --testPathPattern="auth" && echo "PASS" || echo "FAIL"
 
# Check if build succeeds
npm run build && echo "PASS" || echo "FAIL"

2. Model-Based Grader

Use Claude to evaluate open-ended outputs:

markdown

[MODEL GRADER PROMPT]
Evaluate the following code change:
1.Does it solve the stated problem?
2.Is it well-structured?
3.Are edge cases handled?
4.Is error handling appropriate?
 
Score: 1-5 (1=poor, 5=excellent)
Reasoning: [explanation]

3. Human Grader

Flag for manual review:

markdown

[HUMAN REVIEW REQUIRED]
Change: Description of what changed
Reason: Why human review is needed
Risk Level: LOW/MEDIUM/HIGH

Metrics

pass@k

"At least one success in k attempts"

pass@1: First attempt success rate
pass@3: Success within 3 attempts
Typical target: pass@3 > 90%

pass^k

"All k trials succeed"

Higher bar for reliability
pass^3: 3 consecutive successes
Use for critical paths

Eval Workflow

1. Define (Before Coding)

markdown

## EVAL DEFINITION: feature-xyz
 
### Capability Evals
1.Can create new user account
2.Can validate email format
3.Can hash password securely
 
### Regression Evals
1.Existing login still works
2.Session management unchanged
3.Logout flow intact
 
### Success Metrics
-pass@3 > 90% for capability evals
-pass^3 = 100% for regression evals

2. Implement

Write code to pass the defined evals.

3. Evaluate

bash

# Run capability evals
[Run each capability eval, record PASS/FAIL]
 
# Run regression evals
npm test -- --testPathPattern="existing"
 
# Generate report

4. Report

markdown

EVAL REPORT: feature-xyz
========================
 
Capability Evals:
  create-user:     PASS (pass@1)
  validate-email:  PASS (pass@2)
  hash-password:   PASS (pass@1)
  Overall:         3/3 passed
 
Regression Evals:
  login-flow:      PASS
  session-mgmt:    PASS
  logout-flow:     PASS
  Overall:         3/3 passed
 
Metrics:
  pass@1: 67% (2/3)
  pass@3: 100% (3/3)
 
Status: READY FOR REVIEW

Integration Patterns

Pre-Implementation

/eval define feature-name

Creates eval definition file at .claude/evals/feature-name.md

During Implementation

/eval check feature-name

Runs current evals and reports status

Post-Implementation

/eval report feature-name

Generates full eval report

Eval Storage

Store evals in project:

.claude/
  evals/
    feature-xyz.md      # Eval definition
    feature-xyz.log     # Eval run history
    baseline.json       # Regression baselines

Best Practices

Define evals BEFORE coding - Forces clear thinking about success criteria
Run evals frequently - Catch regressions early
Track pass@k over time - Monitor reliability trends
Use code graders when possible - Deterministic > probabilistic
Human review for security - Never fully automate security checks
Keep evals fast - Slow evals don't get run
Version evals with code - Evals are first-class artifacts

Example: Adding Authentication

markdown

## EVAL: add-authentication
 
### Phase 1: Define (10 min)
Capability Evals:
-[ ] User can register with email/password
-[ ] User can login with valid credentials
-[ ] Invalid credentials rejected with proper error
-[ ] Sessions persist across page reloads
-[ ] Logout clears session
 
Regression Evals:
-[ ] Public routes still accessible
-[ ] API responses unchanged
-[ ] Database schema compatible
 
### Phase 2: Implement (varies)
[Write code]
 
### Phase 3: Evaluate
Run: /eval check add-authentication
 
### Phase 4: Report
EVAL REPORT: add-authentication
==============================
Capability: 5/5 passed (pass@3: 100%)
Regression: 3/3 passed (pass^3: 100%)
Status: SHIP IT

← v1.0.0 All versions