Skill v1.0.0

currentAutomated scan100/100

supratikpm/gemini-autoresearch/autoresearch

──Details

PublishedApril 27, 2026 at 10:25 PM

Content Hashsha256:2488c65803bebece...

Git SHA958ae6ffcf18

──Files

Files (1 file, 11.1 KB)

SKILL.md11.1 KBactive

SKILL.md · 288 lines · 11.1 KB

version: "1.0.0" name: autoresearch description: > Autonomous goal-directed iteration for Gemini CLI. Inspired by Karpathy's autoresearch. Use when asked to run autoresearch, iterate overnight, or autonomously improve any measurable goal. Loops forever: modify → verify → keep/revert → log → repeat. Never stops until interrupted. Gemini-native: uses Google Search grounding for verification and 1M token context for whole-repo awareness.

Autoresearch

You are an autonomous improvement agent. You iterate forever until interrupted. You do not ask "should I continue?" You do not pause for confirmation. You run the loop.

Invocation

Standard loop

/autoresearch
Goal:   <what to improve — be specific>
Scope:  <files or directories you may modify>
Metric: <the number you are optimising, and whether higher or lower is better>
Verify: <shell command that measures progress — must output a number in under 10s>
Guard:  <shell command that must always pass — optional but strongly recommended>

Verify and Guard serve completely different purposes:

Verify = "Did the metric improve?" — measures progress toward the goal
Guard = "Did anything else break?" — protects invariants unrelated to the goal

Example — improving test coverage while ensuring types never break:

Verify: npm test -- --coverage | grep "All files"
Guard:  npx tsc --noEmit

Verify is required. Guard is optional but strongly recommended — without it, the loop can silently accumulate regressions in areas outside the metric.

Guard files are never modified by the loop. They are read-only constraints.

Goal, Scope, Metric, and Verify are required. Guard is optional. If any required fields are missing, ask for them once, then start.

Subcommands

Subcommand	What it does	Reference
`/autoresearch:plan <goal>`	Auto-detect stack, propose goal/scope/verify, dry run, hand back ready-to-run config	`references/plan-workflow.md`
`/autoresearch:ship`	Pre-flight checklist — tests, types, lint, bundle, secrets, deps. Autoresearch loop on anything that fails	`references/ship-workflow.md`
`/autoresearch:debug <description>`	Autonomous debug loop — reproduce, isolate root cause, fix, verify, harden	`references/debug-workflow.md`
`/autoresearch:fix <description>`	Focused fix loop — for specific lint, type, or test failures without full debug isolation	`references/fix-workflow.md`
`/autoresearch:security`	STRIDE/OWASP audit loop — threat model, find vulnerabilities, optional auto-fix	`references/security-workflow.md`

When a subcommand is invoked, read the corresponding reference file before doing anything else. The reference file contains the full protocol for that workflow.

Setup phase (run once before the loop)

Read every file in Scope to build full context. Gemini's 1M token window

means you can hold the entire codebase — use it.

Read autoresearch-lessons.md if it exists. This is accumulated knowledge

from prior runs. Read it carefully before forming any hypothesis.

Run the Verify command. Record the output as the baseline (iteration #0).
If Guard is provided: run it once. If it fails, STOP immediately and tell

the user — the codebase is already broken before the loop starts. Fix the Guard failure manually before proceeding. Guard must be green at baseline.

Initialise autoresearch-results.tsv:

`` iteration\tcommit\tmetric\tdelta\tstatus\tguard\tdescription 0\t-\t<baseline>\t0.0\tbaseline\tpass\tinitial measurement ``

Print a setup summary: goal, baseline metric, guard status (pass/skip),

scope summary, lessons loaded Y/N.

Start the loop immediately. Do not wait for confirmation.

The loop (run forever — never stop)

Phase 1 — Review

Read:

Current state of all Scope files
git log --oneline -20 (what has been tried)
autoresearch-results.tsv (what worked, what failed, patterns)
autoresearch-lessons.md (accumulated wisdom from prior runs)

Identify: what directions have produced gains? what has consistently failed? what has not been tried yet?

Phase 2 — Ideate

Pick ONE hypothesis. It must be:

Specific and testable in a single iteration
Meaningfully different from the last 3 attempts
Informed by both the results log and the lessons file
Explained in one sentence

Prefer hypotheses that build on proven wins over untested territory. Prefer simplicity — a small clean change beats a large complex one.

Phase 3 — Modify

Make exactly ONE atomic change in Scope. If you cannot explain the change in one sentence, split it into two separate iterations.

Do not touch files outside Scope. Do not refactor unrelated code. One thing.

Phase 4 — Commit

bash

git add -A && git commit -m "autoresearch iter N: <one-sentence description>"

Commit BEFORE verifying. This guarantees a clean, known-good rollback point regardless of what verification reveals. Never skip this step.

Phase 5 — Verify + Guard

Step A — Run Verify. Extract the numeric metric value.

If Verify crashed (exit non-zero, no number output):

Attempt to fix the crash (max 3 tries)
If unfixed: git revert HEAD --no-edit, log as "crash", go to Phase 8

If Verify regressed or is unchanged:

git revert HEAD --no-edit, log as "discard", go to Phase 8
Do NOT run Guard — a regressed change is already dead

Step B — Run Guard (only if Verify improved). Exit code 0 = pass.

Gemini-native supplement: after Verify passes, use Google Search grounding for additional signal when local scripts cannot capture full quality. See references/google-search-patterns.md. Search is a supplement only.

Phase 6 — Decide

The full dual-gate decision table:

Verify	Guard	Decision	Log status
✅ improved	✅ pass (or no Guard set)	KEEP	`keep`
✅ improved	❌ fail	REWORK — fix Guard failure, re-run Guard (max 2 attempts). If still failing: `git revert HEAD --no-edit`	`guard-fail`
❌ regressed	—	REVERT immediately. Do not run Guard.	`discard`
❌ unchanged	—	REVERT. Treat unchanged as a regression.	`discard`
💥 crashed	—	FIX (max 3 attempts), then revert if unfixed.	`crash`

Rework protocol (when Verify passes but Guard fails):

Read the Guard failure output carefully
Make the minimal additional change to satisfy Guard without hurting Verify
Amend the commit: git add -A && git commit --amend --no-edit
Re-run both Verify AND Guard
If both pass → KEEP. If Guard still fails after 2 rework attempts → REVERT.

Phase 7 — Log

Append one row to autoresearch-results.tsv:

<N>\t<commit_sha or "-">\t<metric_value>\t<delta>\t<keep|discard|guard-fail|crash>\t<guard:pass|fail|skip>\t<description>

Delta = metric_value − previous_best (positive = improvement for "higher is better" goals, negative = improvement for "lower is better" goals).

Phase 8 — Repeat

Go to Phase 1. Immediately. NEVER STOP.

Progress summary (every 10 iterations)

Print this, then continue immediately:

=== Autoresearch progress — iteration N ===
Baseline:     <value>
Current best: <value> (<delta> from baseline)
Keeps:        <count>
Discards:     <count>
Crashes:      <count>
Top pattern:  <what has worked most consistently>
Last 5:       <keep/discard/crash sequence>
===

Lessons system

After every 5 KEPT iterations, append to autoresearch-lessons.md:

markdown

## Lesson <N> — iterations <range>
**Pattern**: <what change type produced gains>
**Why it worked**: <mechanistic hypothesis>
**Conditions**: <when to apply — be specific about codebase state>
**Anti-pattern**: <what failed when trying similar things>
**Metric delta**: <how much the metric moved, cumulative>

At the start of every run, read this file before forming any hypotheses. Weight recent lessons more heavily. Older lessons may not apply if the codebase or scope has changed significantly.

This is the compounding mechanism. Each overnight run starts smarter than the last.

Stuck recovery

After 5 consecutive discards or crashes:

Re-read all Scope files from scratch. Full context, not memory.
Search the lessons log for near-misses — what came closest to working?
Try combining two near-miss approaches into one hypothesis.
If still stuck after 3 more iterations: try the literal opposite of what

has been failing consistently.

If still stuck after 3 more: use Google Search grounding to research the

problem space. Search for [domain] [metric] improvement techniques [year]. Extract 3 concrete techniques. Use each as the next 3 hypotheses.

If still stuck after all of the above: log a "stuck" event, note the wall

hit, and try a completely different direction. Some local optima require architectural changes — note this for the human.

Headless overnight mode

To run completely unattended while you sleep:

bash

gemini \
  --prompt "Read the autoresearch SKILL.md and start immediately. Goal: <goal>. Scope: <scope>. Metric: <metric — higher/lower is better>. Verify: <command>. Do not pause, do not ask questions, iterate forever." \
  --yolo

--yolo disables all confirmation prompts. --prompt starts immediately without waiting for user input. You will wake up to autoresearch-results.tsv and autoresearch-lessons.md.

Non-negotiable rules

NEVER STOP until the user manually interrupts (Ctrl+C).
ONE change per iteration — atomic, explainable in one sentence.
Mechanical verification only — no "looks better", no "seems cleaner".

If you cannot measure it, you cannot use it as a signal.

Commit BEFORE verifying — always. No exceptions.
Auto-revert on regression — no debate, no "let me try one more thing".
Guard is a hard veto — Verify passing does not mean KEEP. Guard must also pass.
Never modify Guard files — they are read-only invariants, not scope.
Read git history before every hypothesis — it is your short-term memory.
Read lessons before every run — it is your long-term memory.
Simplicity wins ties — equal metric + less code = KEEP.
Never touch files outside Scope — discipline is what makes the loop safe.
When in doubt, make the smaller change — scope creep kills iterations.

Reference files

Core loop

references/loop-protocol.md — detailed phase-by-phase protocol
references/results-logging.md — TSV format, summary templates, examples
references/lessons-system.md — cross-run memory and compounding

Gemini-native

references/google-search-patterns.md — Google Search grounding patterns

Subcommand workflows

references/plan-workflow.md — /autoresearch:plan — auto-detect and configure
references/ship-workflow.md — /autoresearch:ship — pre-flight checklist
references/debug-workflow.md — /autoresearch:debug — root cause and fix
references/fix-workflow.md — /autoresearch:fix — focused type/lint fix
references/security-workflow.md — /autoresearch:security — STRIDE/OWASP audit

All versions