Autonomous Experiment Loops: From ML Research to Product Development

Brainstorm notes — March 2026

The Pattern

Karpathy’s autoresearch distills a universal optimization loop:

Change one variable → Deploy → Measure one metric → Keep or discard → Repeat forever

In autoresearch: edit train.py → train for 5 min → check val_bpb → keep/revert → repeat. An agent runs ~100 experiments overnight with zero human involvement.
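A minimal sketch of that loop in Python, to make the keep/discard step concrete. Everything here is a stand-in: `run_loop`, `toy_propose`, and `toy_evaluate` are hypothetical names, and the quadratic "loss" replaces a real 5-minute training run. Lower is better, as with val_bpb.

```python
import random
from typing import Callable

Config = dict[str, float]


def run_loop(
    baseline: Config,
    propose: Callable[[Config], Config],
    evaluate: Callable[[Config], float],
    n_experiments: int,
) -> tuple[Config, float, int]:
    """Greedy keep-or-discard loop. Lower scores are better."""
    best, best_score = baseline, evaluate(baseline)
    kept = 0
    for _ in range(n_experiments):
        candidate = propose(best)      # change one variable
        score = evaluate(candidate)    # "deploy" and measure one metric
        if score < best_score:         # keep
            best, best_score = candidate, score
            kept += 1
        # otherwise discard: `best` is left untouched (the "revert")
    return best, best_score, kept


# Toy stand-ins (hypothetical): a quadratic "loss" instead of a training run.
rng = random.Random(0)


def toy_evaluate(cfg: Config) -> float:
    return (cfg["lr"] - 0.3) ** 2 + (cfg["dropout"] - 0.1) ** 2


def toy_propose(cfg: Config) -> Config:
    key = rng.choice(sorted(cfg))      # pick exactly one variable to change
    changed = dict(cfg)                # copy, so a discard needs no undo
    changed[key] += rng.uniform(-0.05, 0.05)
    return changed


start = {"lr": 0.9, "dropout": 0.5}
best, best_score, kept = run_loop(start, toy_propose, toy_evaluate, n_experiments=100)
print(f"baseline={toy_evaluate(start):.4f} best={best_score:.4f} kept={kept}/100")
```

Because a candidate is only kept when it beats the current best, the score can never get worse across iterations. That monotonic property is what lets the loop run unattended.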

The Generalization

This loop applies to anything with a modifiable variable and a numeric score. Marketing, product, and growth teams already run it manually (A/B testing, campaign optimization). The insight is to automate the full loop and remove the human bottleneck.

But applying it to real-world channels (cold email, ads, landing pages) has fundamental limits:

Constraint	ML Experiments	Real-World Channels
Feedback speed	5 minutes	72 hours to 90 days
Signal quality	Deterministic metric	Noisy, needs large samples
Cost of failure	Wasted GPU minutes	Burned real prospects
Reversibility	`git revert`	Can’t unsend an email

Real-world loops are roughly 1,000 to 25,000 times slower (72 hours is 864 five-minute cycles; 90 days is 25,920) and carry real consequences. The pattern still holds, but the speed advantage disappears.
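A quick check of the slowdown implied by the feedback speeds in the table. The 8-hour "overnight" window is an assumption of this sketch, not a figure from the notes.

```python
from datetime import timedelta

ML_CYCLE = timedelta(minutes=5)
FAST_CHANNEL = timedelta(hours=72)   # best-case real-world feedback
SLOW_CHANNEL = timedelta(days=90)    # worst-case real-world feedback
OVERNIGHT = timedelta(hours=8)       # assumed length of an unattended run


def slowdown(real_cycle: timedelta, ml_cycle: timedelta = ML_CYCLE) -> float:
    """How many ML cycles fit into one real-world feedback cycle."""
    return real_cycle / ml_cycle


def experiments_per_window(cycle: timedelta, window: timedelta = OVERNIGHT) -> int:
    """Sequential experiments that complete within the window."""
    return window // cycle


print(slowdown(FAST_CHANNEL))                # 864.0
print(slowdown(SLOW_CHANNEL))                # 25920.0
print(experiments_per_window(ML_CYCLE))      # 96
print(experiments_per_window(FAST_CHANNEL))  # 0
```

With a 5-minute cycle, about 96 sequential experiments fit in the window, which matches the ~100 overnight figure. With a 72-hour cycle, not even one completes.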

The Real Unlock: Synthetic Benchmarking

The way to recover the speed advantage is the same thing LLM labs discovered: don’t wait for real-world signal. Generate synthetic signal.

LLM labs can’t wait for millions of users to rate outputs, so they use synthetic data, synthetic evaluations, and synthetic preferences. By the time the model ships, most of the learning has already happened offline.

Apply the same idea to product development:

Build benchmarks — synthetic users, synthetic scenarios, expected outcomes
Run the app against benchmarks — agents exercise the product like real users would
Measure, fix, iterate — the full loop completes in seconds, not days
Deploy to real users — the app is already battle-tested

The feedback cycle drops from 72 hours to seconds. Now you can run 100 experiments overnight — not on a toy model, but on your actual product.
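The benchmark steps above could be sketched like this. `Scenario`, `run_benchmark`, and `toy_checkout` are hypothetical names; the "app" is a toy function standing in for a real product driven by agents.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Scenario:
    name: str
    user_input: dict      # what the synthetic user does
    expected: object      # the expected outcome


def run_benchmark(app: Callable[[dict], object], scenarios: list[Scenario]) -> dict[str, bool]:
    """Run the app against every scenario; a crash counts as a failure."""
    results = {}
    for s in scenarios:
        try:
            results[s.name] = app(s.user_input) == s.expected
        except Exception:
            results[s.name] = False
    return results


def pass_rate(results: dict[str, bool]) -> float:
    return sum(results.values()) / len(results)


# Toy app under test (hypothetical): it crashes when "qty" is missing.
def toy_checkout(order: dict) -> float:
    subtotal = order["price"] * order["qty"]
    discount = 0.1 if order.get("coupon") == "SAVE10" else 0.0
    return round(subtotal * (1 - discount), 2)


scenarios = [
    Scenario("plain_order", {"price": 20.0, "qty": 2}, 40.0),
    Scenario("coupon_order", {"price": 20.0, "qty": 2, "coupon": "SAVE10"}, 36.0),
    Scenario("missing_qty", {"price": 5.0}, 5.0),  # a user who omits quantity
]

results = run_benchmark(toy_checkout, scenarios)
print(results, f"pass_rate={pass_rate(results):.2f}")
```

The third scenario fails because the toy app raises on a missing quantity. That is the kind of issue the loop is meant to surface before a real user hits it.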

Every benchmark also becomes a permanent regression test. On everything the suite covers, the app can only get better, never silently regress. The moat isn’t experiment history; it’s a growing suite of synthetic evaluations that encode what “good” looks like for your specific product.
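One way to enforce the "never silently regress" rule is a ratchet on per-benchmark results. This is a sketch under assumptions: results are stored as a name-to-pass/fail mapping, and `find_regressions` and `accept_change` are hypothetical helpers.

```python
def find_regressions(previous: dict[str, bool], current: dict[str, bool]) -> list[str]:
    """Benchmarks that passed before and now fail or have gone missing."""
    return sorted(
        name for name, passed in previous.items()
        if passed and not current.get(name, False)
    )


def accept_change(previous: dict[str, bool], current: dict[str, bool]) -> bool:
    """Keep a change only if nothing regressed and at least one more benchmark passes."""
    improved = sum(current.values()) > sum(previous.values())
    return not find_regressions(previous, current) and improved


before = {"signup": True, "export": False, "billing": True}
fixes_export = {"signup": True, "export": True, "billing": True}
breaks_signup = {"signup": False, "export": True, "billing": True}

print(accept_change(before, fixes_export))      # True
print(find_regressions(before, breaks_signup))  # ['signup']
print(accept_change(before, breaks_signup))     # False
```

Comparing per-benchmark results rather than a single aggregate matters here: a change that fixes one benchmark and breaks another leaves the pass count unchanged but is still rejected.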

The Stack Inverts

OLD: Build for humans → deploy → hope users find bugs → fix → repeat slowly
NEW: Build for agents → benchmark → agents find issues → fix → repeat fast → then add humans

Humans become the last mile, not the testing ground. By the time a real user touches the app, it’s already been through thousands of synthetic runs.

The Open Problem: Agent Motivation

Benchmarks with pre-defined tasks work, but they’re scripted. The agent has no reason of its own to use the app; it’s just executing a checklist. This creates blind spots:

Scripted tasks test what you thought to test
Real usage involves emergent behavior, creative misuse, unexpected workflows
Pre-defined tasks can’t discover unknown unknowns

The question: how do you give agents a reason to use an app?

The Virtual Company Model

The answer: don’t just create test agents. Create virtual companies.

Each virtual company is a simulated business with:

A business model and goals
Revenue targets and constraints
Employees (agents) with roles and responsibilities
Real operational needs that require using other companies’ products

┌─────────────────────┐         ┌─────────────────────┐
│ Virtual Co. A       │         │ Virtual Co. B       │
│ (E-commerce)        │         │ (SaaS Analytics)    │
│                     │ uses    │                     │
│ Needs analytics ────┼────────►│ Provides dashboards │
│ to track sales      │         │ and reports         │
│                     │◄────────┼─ Needs customers    │
│ Is a customer       │  uses   │ to demo product     │
└─────────────────────┘         └─────────────────────┘
           │                               │
           │ uses                          │ uses
           ▼                               ▼
┌─────────────────────┐         ┌─────────────────────┐
│ Virtual Co. C       │         │ Your App            │
│ (Marketing Agency)  │         │ (under test)        │
│                     │ uses    │                     │
│ Runs campaigns ─────┼────────►│ The product you     │
│ for A and B         │         │ are benchmarking    │
└─────────────────────┘         └─────────────────────┘
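The diagram can be written down as plain data. A minimal sketch: the `VirtualCompany` fields mirror the attribute list above, the "uses" edges are read off the arrows, and the role names, goals, and revenue targets are placeholder values invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class VirtualCompany:
    name: str
    business_model: str
    goals: list[str]
    quarterly_revenue_target: float                 # placeholder units
    roles: list[str] = field(default_factory=list)  # one agent per role
    uses: list[str] = field(default_factory=list)   # products it depends on


APP_UNDER_TEST = "Your App"

companies = [
    VirtualCompany("Virtual Co. A", "E-commerce", ["track sales"], 100_000.0,
                   roles=["store manager"],
                   uses=["Virtual Co. B", "Virtual Co. C"]),
    VirtualCompany("Virtual Co. B", "SaaS Analytics", ["win demo customers"], 50_000.0,
                   roles=["account executive"],
                   uses=["Virtual Co. A", APP_UNDER_TEST]),
    VirtualCompany("Virtual Co. C", "Marketing Agency", ["run campaigns for A and B"], 30_000.0,
                   roles=["campaign manager"],
                   uses=[APP_UNDER_TEST]),
]


def customers_of(product: str, companies: list[VirtualCompany]) -> list[str]:
    """Names of the companies whose operations depend on `product`."""
    return [c.name for c in companies if product in c.uses]


print(customers_of(APP_UNDER_TEST, companies))   # ['Virtual Co. B', 'Virtual Co. C']
print(customers_of("Virtual Co. B", companies))  # ['Virtual Co. A']
```

Keeping the relationships as data means a new company or dependency is a one-line change, and the set of agents exercising the app under test falls out of the graph instead of being scripted.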

Why this works:

Agents have purpose. They’re not running test scripts — they’re trying to hit quarterly targets, onboard new clients, solve operational problems. The app usage is a means to an end, not the end itself.
Emergent behavior. When Company A’s sales spike, Company B’s analytics dashboard gets hammered. When Company C runs a campaign, your app gets a burst of sign-ups. You didn’t script these scenarios — they emerged from the simulation.
Realistic failure modes. An agent trying to close a deal before end-of-quarter will push your app’s edge cases in ways a scripted test never would.
Cross-product interaction. Your app doesn’t exist in a vacuum. Virtual companies use multiple products together, exposing integration issues and workflow gaps.
The benchmark writes itself. Instead of hand-crafting test scenarios, you define company goals and let the agents figure out how to achieve them. New test coverage emerges automatically as companies evolve.

The Autoresearch Loop, Applied

Setup (fixed, like prepare.py):
  - Virtual company definitions (business model, goals, constraints)
  - Agent roles and responsibilities
  - Inter-company relationships
  - Scoring: did companies achieve their goals? How efficiently?

Sandbox (modified each iteration, like train.py):
  - Your app's code

Loop (like program.md):
  1. Agent modifies the app
  2. Virtual companies run for a simulated period
  3. Measure: goal completion rate, task success, error rate, time-to-completion
  4. Keep or discard the change
  5. Repeat

The metric isn’t val_bpb. It’s “did the virtual companies accomplish their goals using your product?” Lower friction, higher success rate, fewer errors = better product.
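The loop keeps or discards on one number, but step 3 lists four measurements, so they have to be folded into a single score. A sketch of one way to do that: the weights, the `time_budget` normalization, and the names `RunMetrics` and `product_score` are all assumptions, not part of the original design.

```python
from dataclasses import dataclass


@dataclass
class RunMetrics:
    goal_completion_rate: float     # 0..1, higher is better
    task_success_rate: float        # 0..1, higher is better
    error_rate: float               # 0..1, lower is better
    mean_time_to_completion: float  # simulated hours, lower is better


def product_score(m: RunMetrics, time_budget: float = 8.0) -> float:
    """Fold the four measurements into one score in [0, 1]; higher is better.

    The weights are illustrative placeholders and would need tuning.
    """
    time_penalty = min(m.mean_time_to_completion / time_budget, 1.0)
    return (
        0.4 * m.goal_completion_rate
        + 0.3 * m.task_success_rate
        + 0.2 * (1.0 - m.error_rate)
        + 0.1 * (1.0 - time_penalty)
    )


baseline = RunMetrics(0.60, 0.70, 0.10, 4.0)
candidate = RunMetrics(0.75, 0.80, 0.05, 3.0)

keep = product_score(candidate) > product_score(baseline)
print(f"baseline={product_score(baseline):.3f} "
      f"candidate={product_score(candidate):.3f} keep={keep}")
```

A weighted sum is the simplest choice. It also lets a large gain on one axis hide a loss on another, so a stricter variant might require that no individual metric gets worse.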

Key Limitation

As LLM labs have found, synthetic signal can diverge from reality. Virtual companies might develop usage patterns that real businesses never would. Periodic calibration against real user data is essential: use real usage to tune the simulation, not to replace it.
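One possible calibration check compares how often each feature is used in real logs versus simulated runs. The sketch below uses total variation distance between the two usage distributions. The event names and the 0.2 recalibration threshold are made up for illustration.

```python
from collections import Counter


def usage_distribution(events: list[str]) -> dict[str, float]:
    """Share of events per feature."""
    if not events:
        raise ValueError("need at least one event")
    counts = Counter(events)
    total = sum(counts.values())
    return {feature: n / total for feature, n in counts.items()}


def total_variation(p: dict[str, float], q: dict[str, float]) -> float:
    """0.0 means identical usage mix; 1.0 means no overlap at all."""
    features = set(p) | set(q)
    return 0.5 * sum(abs(p.get(f, 0.0) - q.get(f, 0.0)) for f in features)


# Hypothetical event logs: real users mostly search; the simulation hammers an API.
real_events = ["search"] * 6 + ["export"] * 3 + ["settings"] * 1
synthetic_events = ["search"] * 2 + ["export"] * 2 + ["bulk_api"] * 6

drift = total_variation(usage_distribution(real_events),
                        usage_distribution(synthetic_events))
RECALIBRATE_ABOVE = 0.2  # assumed threshold, to be tuned per product
print(f"drift={drift:.2f} recalibrate={drift > RECALIBRATE_ABOVE}")
```

In this toy data the simulated companies spend most of their time on a feature real users never touch, so the drift is large and the simulation's goals or constraints would need adjusting.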

Summary

Layer	What It Provides
Autoresearch pattern	The loop: change → measure → keep/discard
Synthetic benchmarks	Speed: seconds instead of days
Virtual companies	Purpose: agents with real goals, not scripts
Cross-company simulation	Emergence: unknown unknowns surface organically
Calibration against real data	Accuracy: keeps the synthetic benchmarks honest

The new software development cycle: Build for agents → simulate virtual companies → benchmark → iterate at machine speed → then ship to humans.