Reading time: 5 minutes
Greetings from above,
Why did my AI skill go to therapy? Because it had too many unresolved issues. (44% of them, to be exact.)
ALEX'S STORY: I built a cold outreach skill in Claude. Feed it a prospect's name and company. Get a personalized email. Send it. Book meetings.
I tested it 50 times. It passed my quality checks 56% of the time. The rest? Generic subject lines. Openers written for nobody in particular. CTAs like "Let me know if you'd like to chat." One email opened with "In today's fast-paced business environment."
So I tweaked the prompt. Changed a word. Added a rule. Tested it once. Decided it felt better. Moved on. Two weeks later, same problems.
I was optimizing by vibes. Vibes don't scale.
Then I found Andrej Karpathy's autoresearch method (github.com/karpathy/autoresearch). He built it for optimizing ML training code: an AI agent modifies code, tests it, keeps improvements, discards failures. All on autopilot. It blew up. 42,000 GitHub stars.
Fortune called it "The Karpathy Loop." He ran it on code he'd already hand-optimized for months. The agent found 20 improvements he missed.
The principle is what grabbed me: try a small change, measure, keep or revert. Repeat. I adapted it for AI skills and prompts.
My cold outreach skill went from 56% to 94%. Five rounds. Zero hand-holding.
Today I'm giving you the exact system so you can run it on your own stuff. You’ll learn:
• How to auto-improve any AI skill without touching it yourself
• The scoring method that turns "vibes" into measurable quality
• The full autoresearch skill you can set up yourself
Let's build your competitive advantage!

How Jennifer Aniston’s LolaVie brand grew sales 40% with CTV ads
The DTC beauty category is crowded. To break through, Jennifer Aniston’s brand LolaVie worked with Roku Ads Manager to easily set up, test, and optimize CTV ad creatives. The campaign drove a significant lift in sales and new-customer growth, helping LolaVie stand out in a crowded category.
🔗 HOW IT WORKS (30 SECOND VERSION) 🔗
Think of a recipe that turns out great 7 out of 10 times. Instead of rewriting the whole thing, you change one ingredient. Cook it 5 times. Better? Keep. Worse? Revert. Repeat.
That's the entire method. Applied to your AI skills:
1. Run the skill and score the output against a yes/no checklist
2. Diagnose which checks fail most and why
3. Change ONE thing in the prompt to fix the most common failure
4. Test again. Score improved? Keep. Score dropped? Revert.
5. Repeat until 90%+ pass rate. Walk away. Come back to a skill that works.
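The whole loop fits in a few lines of code. A minimal Python sketch, where `run_skill`, `score`, and `mutate` are stand-ins for your actual skill call, your checklist scorer, and whatever makes the one targeted change (all three are hypothetical placeholders, not a real API):

```python
def optimize(prompt, run_skill, score, mutate, rounds=10, target=0.90):
    """Keep-or-revert loop: try one change, measure, keep only improvements."""
    best = score(run_skill(prompt))      # baseline pass rate (0.0 to 1.0)
    for _ in range(rounds):
        candidate = mutate(prompt)       # ONE targeted change to the prompt
        rate = score(run_skill(candidate))
        if rate > best:                  # improved: keep, new baseline
            prompt, best = candidate, rate
        # same or worse: do nothing, which is the revert
        if best >= target:
            break
    return prompt, best
```

Notice there's no explicit "revert" step: keeping the old `prompt` variable when the candidate doesn't beat the baseline *is* the revert.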
The key: Your scoring checklist uses binary yes/no questions.
Not "rate this 1-10" (too vague).
Things like: "Does the subject line include the prospect's company name?" "Is the email under 75 words?" "Does it end with a specific question?" 3-6 questions. That's it.
You're not removing human judgment. You're concentrating it.
Instead of evaluating every single output by feel, you define what "good" means once. The agent handles the rest.
How scoring works:
Pass rate = checks passed ÷ total checks, summed across all runs.
5 questions, 5 runs = 25 total checks. If 20 pass, that's 80%.
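Same arithmetic as a tiny sketch, with each run recorded as a list of booleans (one per checklist question):

```python
def pass_rate(results):
    """results: one list of booleans per run, one boolean per check."""
    checks = [passed for run in results for passed in run]
    return sum(checks) / len(checks)

# 5 runs x 5 checks = 25 total; 4 clean runs + 1 failed run = 20 passes = 80%
runs = [[True] * 5] * 4 + [[False] * 5]
```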
⚙️ THE AUTO-RESEARCH SKILL ⚙️
💡 This is the full skill for Claude Code and Cowork users. Drop it in your skills folder and point it at any skill that needs fixing. It runs the entire loop: baseline measurement, targeted changes, automatic revert on bad changes, live dashboard, full changelog.
File Setup: (2 minutes)
1. Create a folder called autoresearch in your skills directory
2. Create SKILL.md inside it and paste everything below
3. Tell Claude: "run autoresearch on my [skill name] skill"
The file setup takes 2 minutes. The real work is defining your 3-6 scoring questions.
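The same setup in shell form — a sketch assuming your personal Claude Code skills folder lives at `~/.claude/skills/` (swap in your project's `.claude/skills/` directory if you keep skills per-project):

```shell
# Create the skill folder and an empty SKILL.md to paste into
mkdir -p ~/.claude/skills/autoresearch
touch ~/.claude/skills/autoresearch/SKILL.md
```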
---
name: autoresearch
description: Autonomously optimize any Claude skill by
running it repeatedly, scoring outputs against binary
evals, mutating the prompt, and keeping improvements.
Based on Karpathy's autoresearch methodology.
---
# Autoresearch for Skills
Most skills work about 70% of the time. The other 30%
you get garbage. The fix isn't to rewrite the skill
from scratch. It's to let an agent run it dozens of
times, score every output, and tighten the prompt until
that 30% disappears.
## The Core Job
Take any existing skill, define what 'good output'
looks like as binary yes/no checks, then run an
autonomous loop that:
1. Generates outputs from the skill using test inputs
2. Scores every output against the eval criteria
3. Mutates the skill prompt to fix failures
4. Keeps mutations that improve the score, discards
the rest
5. Repeats until the score ceiling is hit or the user
stops it
Output: An improved SKILL.md + results.tsv log +
changelog.md of every mutation attempted + a live HTML
dashboard you can watch in your browser.
## Before Starting: Gather Context
STOP. Do not run any experiments until all fields below
are confirmed with the user. Ask for any missing fields
before proceeding.
1. Target skill: Which skill to optimize? (exact path
to SKILL.md)
2. Test inputs: What 3-5 different prompts/scenarios
should we test with? (variety matters to avoid
overfitting)
3. Eval criteria: What 3-6 binary yes/no checks define
a good output?
4. Runs per experiment: How many times to run the skill
per mutation? Default: 5
5. Budget cap: Optional. Max number of experiment
cycles. Default: no cap (runs until you stop it)
## Step 1: Read the Skill
Before changing anything, read and understand the
target skill completely. Read the full SKILL.md.
Read any reference files. Identify the core job,
process steps, and output format. Note any existing
quality checks.
## Step 2: Build the Eval Suite
Convert the user's eval criteria into structured
tests. Every check must be binary. Pass or fail.
Format each eval as:
EVAL [number]: [Short name]
Question: [Yes/no question about the output]
Pass condition: [What 'yes' looks like]
Fail condition: [What triggers a 'no']
Rules for good evals:
- Binary only. Yes or no. No scales.
- Specific enough to be consistent across runs.
- Not so narrow the skill games the eval.
- 3-6 evals is the sweet spot.
Max score = [number of evals] x [runs per experiment]
## Step 3: Generate the Live Dashboard
Before running experiments, create a live HTML
dashboard at autoresearch-[skill-name]/dashboard.html
and open it in the browser.
The dashboard must:
- Auto-refresh every 10 seconds (reads results.json)
- Show score progression line chart
- Show colored bar per experiment (green = keep,
red = discard, blue = baseline)
- Show per-eval breakdown
- Show current status
Use Chart.js from CDN. Single self-contained HTML
file with inline CSS and JavaScript.
## Step 4: Establish Baseline
Run the skill AS-IS before changing anything. This is
experiment #0.
1. Create working directory: autoresearch-[skill-name]/
2. Back up original SKILL.md as SKILL.md.baseline
3. Run the skill [N] times using test inputs
4. Score every output against every eval
5. Record baseline score
IMPORTANT: Confirm baseline score with user before
proceeding. If already 90%+, ask if they want to
continue.
## Step 5: Run the Experiment Loop
This is the core loop. Once started, run autonomously.
LOOP:
1. ANALYZE: Which evals fail most? Read the actual
failing outputs. Identify the pattern.
2. HYPOTHESIZE: Pick ONE thing to change.
Good mutations:
- Add a specific instruction for common failure
- Reword ambiguous instruction to be more explicit
- Add an anti-pattern ('Do NOT do X')
- Move buried instruction higher (priority =
position)
- Add/improve example showing correct behavior
- Remove instruction causing over-optimization
Bad mutations:
- Rewriting entire skill from scratch
- Adding 10 new rules at once
- Vague instructions like 'make it better'
3. CHANGE: Edit SKILL.md with ONE targeted mutation.
4. TEST: Run skill [N] times with same test inputs.
5. SCORE: Run every output through every eval.
6. DECIDE:
- Score improved: KEEP. New baseline.
- Score same: DISCARD. Revert.
- Score worse: DISCARD. Revert.
7. LOG: Record in results.tsv and changelog.md.
8. REPEAT.
Stop conditions:
- User manually stops it
- Budget cap hit
- 90%+ pass rate for 3 consecutive experiments
## Step 6: Write the Changelog
After each experiment, append to changelog.md:
## Experiment [N] - [keep/discard]
Score: [X]/[max] ([percent]%)
Change: [One sentence describing the change]
Reasoning: [Why this was expected to help]
Result: [Which evals improved/declined]
## Step 7: Deliver Results
When the loop stops, present:
1. Score summary: Baseline -> Final (% improvement)
2. Total experiments run
3. Keep rate (kept vs discarded)
4. Top 3 changes that helped most
5. Remaining failure patterns
6. The improved SKILL.md (already saved)
7. Location of results.tsv and changelog.md
## Output Files
autoresearch-[skill-name]/
dashboard.html (live browser dashboard)
results.json (data powering dashboard)
results.tsv (score log for every experiment)
changelog.md (detailed mutation log)
SKILL.md.baseline (original before optimization)
📝 DON'T USE CLAUDE CODE? USE THIS PROMPT INSTEAD 📝
💡 What this does: If you don't use Claude Code or Cowork, this prompt turns any AI chat into a manual autoresearch agent. Same loop, same method. You just paste it into ChatGPT, Claude, Gemini, or whatever you use.
#CONTEXT:
You are an Autoresearch Agent specializing in iterative
prompt optimization. Based on Andrej Karpathy's
autoresearch methodology: try a small change, measure
the result, keep or revert. Repeat until quality
ceiling is hit.
#ROLE:
You are the Skill Optimizer. Your job is to take an
existing prompt or skill, run it against test inputs,
score the outputs with a binary checklist, and make
targeted improvements until the pass rate hits 90%+.
#PROCESS:
1. BASELINE: Run the prompt with the test input.
Score the output against ALL checklist questions.
Record the baseline pass rate.
2. DIAGNOSE: Identify the checklist question(s)
failing most often. Analyze WHY.
3. CHANGE: Make ONE targeted change to the prompt.
Options:
- Add a specific rule or constraint
- Add a banned words/phrases list
- Add a worked example of good output
- Adjust tone/length/format instructions
- Reorder priority of instructions
- Remove a rule causing tradeoff damage
4. TEST: Run the modified prompt with the SAME
test input. Score against ALL checklist questions.
5. EVALUATE:
- Overall score IMPROVED: Keep the change.
- Overall score DROPPED or SAME: Revert.
- One check improved, another degraded: Revert
(no net-negative tradeoffs).
6. REPEAT until:
- Pass rate hits 90%+ three consecutive times, OR
- 10 rounds with no improvement.
#INFORMATION ABOUT ME:
- My Prompt to optimize: [PASTE YOUR FULL PROMPT]
- My Test Input: [THE INPUT YOU'D NORMALLY FEED IT]
- My Scoring Checklist (3-6 yes/no questions):
1. [e.g., 'Does the subject line include the prospect's company name?']
2. [e.g., 'Is the email under 75 words?']
3. [e.g., 'Does it end with a specific question?']
4. [optional]
5. [optional]
#OUTPUT FORMAT:
After each round:
## ROUND [N] REPORT
- Failing check(s): [Which questions failed]
- Root cause: [Why the output fails]
- Change made: [Exact edit to the prompt]
- New score: [X/Y checks passing = Z%]
- Decision: KEEP / REVERT
## FINAL DELIVERY
When done:
1. The improved prompt (full text, ready to use)
2. Before/after score comparison
3. Changelog of every change and outcome
4. The original prompt (backup)
Customize these variables:
My Prompt: Paste the full prompt you want to optimize. Any format, any platform.
My Test Input: A realistic example from your actual workflow. Not a toy input.
Scoring Checklist: 3-6 yes/no questions. Be specific. "Does the subject line include the company name?" beats "Is the subject line good?"
🔧 WHEN TO USE THIS 🔧
• Any skill or prompt that works sometimes and gives you garbage the rest of the time
• Cold outreach, sales copy, blog drafts, social hooks, email sequences, any repeatable prompt
• When you've been tweaking a prompt for weeks and it still isn't consistent
• When you want to measure improvement instead of guessing at it
📋 SUMMARY 📋
• The autoresearch method (inspired by Karpathy) auto-improves any AI skill through a test-change-score loop.
• You define "good" with 3-6 yes/no questions. The agent handles everything else.
• One change at a time. Automatic revert on bad changes. Full changelog of everything tried.
• Works on outreach, sales copy, social hooks, email sequences, blog drafts, anything you can score.
📦 WRAP UP 📦
What you learned today:
1. The autoresearch loop - an AI agent tests, scores, and improves your skills while you do nothing.
2. Binary scoring - yes/no checklists beat vague ratings. 3-6 questions is all you need.
3. Two ways to run it - full skill for Claude Code/Cowork, standalone prompt for any AI chat.
This transforms "I think my prompt is better" into "I know it's 38 points better and here's the proof."
No more weekend prompt tweaking sessions.
You now have a system that improves your systems.
What did you think about today's edition?
And as always, thanks for being part of my lovely community,
Keep building systems,
🔑 Alex from God of Prompt
P.S. Which skill should I run this on live and show you the full before/after? Cold outreach? Blog drafts? Social hooks? Hit reply. I'll pick the most popular one and do a full teardown next week.