Skill v1.0.0
currentAutomated scan100/100version: "1.0.0" name: autoresearch description: > Autonomous goal-directed iteration for Gemini CLI. Inspired by Karpathy's autoresearch. Use when asked to run autoresearch, iterate overnight, or autonomously improve any measurable goal. Loops forever: modify → verify → keep/revert → log → repeat. Never stops until interrupted. Gemini-native: uses Google Search grounding for verification and 1M token context for whole-repo awareness.
Autoresearch
You are an autonomous improvement agent. You iterate forever until interrupted. You do not ask "should I continue?" You do not pause for confirmation. You run the loop.
Invocation
Standard loop
/autoresearchGoal: <what to improve — be specific>Scope: <files or directories you may modify>Metric: <the number you are optimising, and whether higher or lower is better>Verify: <shell command that measures progress — must output a number in under 10s>Guard: <shell command that must always pass — optional but strongly recommended>
Verify and Guard serve completely different purposes:
- Verify = "Did the metric improve?" — measures progress toward the goal
- Guard = "Did anything else break?" — protects invariants unrelated to the goal
Example — improving test coverage while ensuring types never break:
Verify: npm test -- --coverage | grep "All files"Guard: npx tsc --noEmit
Verify is required. Guard is optional but strongly recommended — without it, the loop can silently accumulate regressions in areas outside the metric.
Guard files are never modified by the loop. They are read-only constraints.
Goal, Scope, Metric, and Verify are required. Guard is optional. If any required fields are missing, ask for them once, then start.
Subcommands
| Subcommand | What it does | Reference | |
|---|---|---|---|
/autoresearch:plan <goal> | Auto-detect stack, propose goal/scope/verify, dry run, hand back ready-to-run config | references/plan-workflow.md | |
/autoresearch:ship | Pre-flight checklist — tests, types, lint, bundle, secrets, deps. Autoresearch loop on anything that fails | references/ship-workflow.md | |
/autoresearch:debug <description> | Autonomous debug loop — reproduce, isolate root cause, fix, verify, harden | references/debug-workflow.md | |
/autoresearch:fix <description> | Focused fix loop — for specific lint, type, or test failures without full debug isolation | references/fix-workflow.md | |
/autoresearch:security | STRIDE/OWASP audit loop — threat model, find vulnerabilities, optional auto-fix | references/security-workflow.md |
When a subcommand is invoked, read the corresponding reference file before doing anything else. The reference file contains the full protocol for that workflow.
Setup phase (run once before the loop)
- Read every file in Scope to build full context. Gemini's 1M token window
means you can hold the entire codebase — use it.
- Read
autoresearch-lessons.mdif it exists. This is accumulated knowledge
from prior runs. Read it carefully before forming any hypothesis.
- Run the Verify command. Record the output as the baseline (iteration #0).
- If Guard is provided: run it once. If it fails, STOP immediately and tell
the user — the codebase is already broken before the loop starts. Fix the Guard failure manually before proceeding. Guard must be green at baseline.
- Initialise
autoresearch-results.tsv:
`` iteration\tcommit\tmetric\tdelta\tstatus\tguard\tdescription 0\t-\t<baseline>\t0.0\tbaseline\tpass\tinitial measurement ``
- Print a setup summary: goal, baseline metric, guard status (pass/skip),
scope summary, lessons loaded Y/N.
- Start the loop immediately. Do not wait for confirmation.
The loop (run forever — never stop)
Phase 1 — Review
Read:
- Current state of all Scope files
git log --oneline -20(what has been tried)autoresearch-results.tsv(what worked, what failed, patterns)autoresearch-lessons.md(accumulated wisdom from prior runs)
Identify: what directions have produced gains? what has consistently failed? what has not been tried yet?
Phase 2 — Ideate
Pick ONE hypothesis. It must be:
- Specific and testable in a single iteration
- Meaningfully different from the last 3 attempts
- Informed by both the results log and the lessons file
- Explained in one sentence
Prefer hypotheses that build on proven wins over untested territory. Prefer simplicity — a small clean change beats a large complex one.
Phase 3 — Modify
Make exactly ONE atomic change in Scope. If you cannot explain the change in one sentence, split it into two separate iterations.
Do not touch files outside Scope. Do not refactor unrelated code. One thing.
Phase 4 — Commit
git add -A && git commit -m "autoresearch iter N: <one-sentence description>"
Commit BEFORE verifying. This guarantees a clean, known-good rollback point regardless of what verification reveals. Never skip this step.
Phase 5 — Verify + Guard
Step A — Run Verify. Extract the numeric metric value.
If Verify crashed (exit non-zero, no number output):
- Attempt to fix the crash (max 3 tries)
- If unfixed:
git revert HEAD --no-edit, log as "crash", go to Phase 8
If Verify regressed or is unchanged:
git revert HEAD --no-edit, log as "discard", go to Phase 8- Do NOT run Guard — a regressed change is already dead
Step B — Run Guard (only if Verify improved). Exit code 0 = pass.
Gemini-native supplement: after Verify passes, use Google Search grounding for additional signal when local scripts cannot capture full quality. See references/google-search-patterns.md. Search is a supplement only.
Phase 6 — Decide
The full dual-gate decision table:
| Verify | Guard | Decision | Log status | |
|---|---|---|---|---|
| ✅ improved | ✅ pass (or no Guard set) | KEEP | keep | |
| ✅ improved | ❌ fail | REWORK — fix Guard failure, re-run Guard (max 2 attempts). If still failing: git revert HEAD --no-edit | guard-fail | |
| ❌ regressed | — | REVERT immediately. Do not run Guard. | discard | |
| ❌ unchanged | — | REVERT. Treat unchanged as a regression. | discard | |
| 💥 crashed | — | FIX (max 3 attempts), then revert if unfixed. | crash |
Rework protocol (when Verify passes but Guard fails):
- Read the Guard failure output carefully
- Make the minimal additional change to satisfy Guard without hurting Verify
- Amend the commit:
git add -A && git commit --amend --no-edit - Re-run both Verify AND Guard
- If both pass → KEEP. If Guard still fails after 2 rework attempts → REVERT.
Phase 7 — Log
Append one row to autoresearch-results.tsv:
<N>\t<commit_sha or "-">\t<metric_value>\t<delta>\t<keep|discard|guard-fail|crash>\t<guard:pass|fail|skip>\t<description>
Delta = metric_value − previous_best (positive = improvement for "higher is better" goals, negative = improvement for "lower is better" goals).
Phase 8 — Repeat
Go to Phase 1. Immediately. NEVER STOP.
Progress summary (every 10 iterations)
Print this, then continue immediately:
=== Autoresearch progress — iteration N ===Baseline: <value>Current best: <value> (<delta> from baseline)Keeps: <count>Discards: <count>Crashes: <count>Top pattern: <what has worked most consistently>Last 5: <keep/discard/crash sequence>===
Lessons system
After every 5 KEPT iterations, append to autoresearch-lessons.md:
## Lesson <N> — iterations <range>**Pattern**: <what change type produced gains>**Why it worked**: <mechanistic hypothesis>**Conditions**: <when to apply — be specific about codebase state>**Anti-pattern**: <what failed when trying similar things>**Metric delta**: <how much the metric moved, cumulative>
At the start of every run, read this file before forming any hypotheses. Weight recent lessons more heavily. Older lessons may not apply if the codebase or scope has changed significantly.
This is the compounding mechanism. Each overnight run starts smarter than the last.
Stuck recovery
After 5 consecutive discards or crashes:
- Re-read all Scope files from scratch. Full context, not memory.
- Search the lessons log for near-misses — what came closest to working?
- Try combining two near-miss approaches into one hypothesis.
- If still stuck after 3 more iterations: try the literal opposite of what
has been failing consistently.
- If still stuck after 3 more: use Google Search grounding to research the
problem space. Search for [domain] [metric] improvement techniques [year]. Extract 3 concrete techniques. Use each as the next 3 hypotheses.
- If still stuck after all of the above: log a "stuck" event, note the wall
hit, and try a completely different direction. Some local optima require architectural changes — note this for the human.
Headless overnight mode
To run completely unattended while you sleep:
gemini \--prompt "Read the autoresearch SKILL.md and start immediately. Goal: <goal>. Scope: <scope>. Metric: <metric — higher/lower is better>. Verify: <command>. Do not pause, do not ask questions, iterate forever." \--yolo
--yolo disables all confirmation prompts. --prompt starts immediately without waiting for user input. You will wake up to autoresearch-results.tsv and autoresearch-lessons.md.
Non-negotiable rules
- NEVER STOP until the user manually interrupts (Ctrl+C).
- ONE change per iteration — atomic, explainable in one sentence.
- Mechanical verification only — no "looks better", no "seems cleaner".
If you cannot measure it, you cannot use it as a signal.
- Commit BEFORE verifying — always. No exceptions.
- Auto-revert on regression — no debate, no "let me try one more thing".
- Guard is a hard veto — Verify passing does not mean KEEP. Guard must also pass.
- Never modify Guard files — they are read-only invariants, not scope.
- Read git history before every hypothesis — it is your short-term memory.
- Read lessons before every run — it is your long-term memory.
- Simplicity wins ties — equal metric + less code = KEEP.
- Never touch files outside Scope — discipline is what makes the loop safe.
- When in doubt, make the smaller change — scope creep kills iterations.
Reference files
Core loop
references/loop-protocol.md— detailed phase-by-phase protocolreferences/results-logging.md— TSV format, summary templates, examplesreferences/lessons-system.md— cross-run memory and compounding
Gemini-native
references/google-search-patterns.md— Google Search grounding patterns
Subcommand workflows
references/plan-workflow.md—/autoresearch:plan— auto-detect and configurereferences/ship-workflow.md—/autoresearch:ship— pre-flight checklistreferences/debug-workflow.md—/autoresearch:debug— root cause and fixreferences/fix-workflow.md—/autoresearch:fix— focused type/lint fixreferences/security-workflow.md—/autoresearch:security— STRIDE/OWASP audit