L0 to L5: where AI agents land on the autonomy spectrum
A survey of how AI agents are shifting from human-approved to human-supervised — autonomy levels, product positioning, and what it means for agent runtimes.
You're three hours into a coding session with an AI agent. It's asked you to approve 47 file reads, 12 shell commands, and 8 file writes. By the fifteenth permission prompt, you're clicking "yes" without reading. By the thirtieth, you've turned off confirmations entirely.
This is the dirty secret of human-in-the-loop (HITL): it doesn't scale. The safety model that's supposed to keep humans in control instead trains them to rubber-stamp. When every action requires approval, no action gets real scrutiny. The signal drowns in noise.
The industry is waking up to this. A new model is emerging — call it human-on-the-loop (HOTL), agent-in-the-loop, or supervisory autonomy. Instead of approving every action, humans set constraints upfront and monitor for exceptions. The agent acts. The human watches. And intervenes only when something goes wrong.
Three models of human-agent collaboration
There are three distinct models for how humans and agents share control. Each makes different tradeoffs.
| | Human-in-the-loop | Human-on-the-loop | Full autonomy |
|---|---|---|---|
| Who decides | Human approves each action | Agent acts, human monitors | Agent acts within constraints |
| Human role | Gatekeeper | Supervisor | Architect |
| Escalation | Every action | Exceptions only | Policy violations only |
| Latency | High (blocked on human) | Low (async review) | None |
| Failure mode | Rubber-stamping | Missed exceptions | Uncaught errors |
| Best for | High-risk, low-volume | Medium-risk, medium-volume | Low-risk, high-volume |
HITL works when stakes are high and volume is low — a doctor reviewing a diagnosis, a lawyer checking a contract. It breaks when an agent makes 200 decisions per minute and expects a human to meaningfully evaluate each one.
HOTL inverts the relationship. The agent is the actor; the human is the circuit breaker. This works when most actions are safe and the dangerous ones are identifiable — which, in practice, describes most software engineering tasks.
Full autonomy works when the cost of an error is low enough to accept. Running tests, formatting code, searching documentation — these don't need human oversight at all.
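The three-way split above can be sketched as a simple dispatch rule. This is an illustration only; the risk labels and volume thresholds are hypothetical, not drawn from any particular product:

```python
def oversight_model(risk: str, decisions_per_hour: int) -> str:
    """Pick an oversight model from action risk and decision volume.

    Thresholds are illustrative: HITL only pays off when a human can
    genuinely read every request, i.e. high stakes and low volume.
    """
    if risk == "high" and decisions_per_hour < 10:
        return "HITL"        # human approves each action
    if risk == "low":
        return "autonomous"  # no oversight needed
    return "HOTL"            # agent acts, human monitors exceptions

# Doctor reviewing a diagnosis: high risk, low volume -> HITL
assert oversight_model("high", 5) == "HITL"
# Running tests: low risk -> full autonomy
assert oversight_model("low", 500) == "autonomous"
# 200 high-risk decisions an hour: a human can't meaningfully gate these
assert oversight_model("high", 200) == "HOTL"
```

The point of the sketch is the middle branch: once volume outgrows human attention, the honest choices are HOTL or tighter upfront constraints, not more prompts.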
The autonomy spectrum
Anthropic's research on measuring agent autonomy proposes a framework with levels from L0 (no AI involvement) to L5 (full AI autonomy). The key insight: most products don't sit at a single level. They vary by task.
Product Positioning on the Autonomy Spectrum
L0 — No AI: Human does everything. The baseline.
L1 — AI as tool: Human initiates, AI assists. Autocomplete, code suggestions. GitHub Copilot lives here by default.
L2 — AI as collaborator: AI proposes multi-step actions, human approves. Most chat-based coding tools default here — Claude Code's standard mode, Cursor's inline edits.
L3 — AI as autonomous agent: AI executes independently, human reviews results. Devin's default mode, Claude Code in auto-accept mode. The agent runs; you check the diff.
L4 — AI as trusted agent: AI acts and self-monitors, human intervenes on exceptions. OpenAI Operator approaches this for web tasks — it proceeds until it hits something it flags as risky.
L5 — Full AI autonomy: AI operates without human oversight. No production system publicly claims this level today.
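Because most products span a range rather than sitting at one level, the spectrum is more useful as a per-task profile than a single label. A minimal sketch (the task names and level assignments below are illustrative):

```python
from enum import IntEnum

class AutonomyLevel(IntEnum):
    L0 = 0  # no AI involvement
    L1 = 1  # AI as tool: human initiates, AI assists
    L2 = 2  # AI as collaborator: AI proposes, human approves
    L3 = 3  # AI as autonomous agent: AI executes, human reviews results
    L4 = 4  # AI as trusted agent: AI self-monitors, human handles exceptions
    L5 = 5  # full autonomy: no human oversight

# One product, different levels per task type (hypothetical profile):
profile = {
    "code_suggestion": AutonomyLevel.L1,
    "multi_file_edit": AutonomyLevel.L2,
    "run_tests": AutonomyLevel.L3,
}

# The product's "level" is really a range, L1 through L3 here.
assert min(profile.values()) == AutonomyLevel.L1
assert max(profile.values()) == AutonomyLevel.L3
```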
Anthropic's data tells a striking story: Claude asks for clarification twice as often as humans interrupt it. 80% of tool calls include safeguards. And 73% of deployments have some form of human-in-the-loop. The industry defaults to caution — perhaps too much.
Where products land today
| Product | Default level | Escalation mechanism | Permission model |
|---|---|---|---|
| GitHub Copilot | L1 | None (inline suggestions) | Accept/reject per suggestion |
| ChatGPT | L1–L2 | User-initiated | Conversational approval |
| Cursor | L2 | Diff review | Accept/reject per edit |
| Claude Code | L2–L3 | Permission prompts, allowlists | Per-tool, configurable |
| OpenAI Operator | L3–L4 | Mandatory confirmation for high-risk | Action-category based |
| Devin | L3 | Async review | Session-level trust |
The pattern: products are moving rightward on the spectrum. Copilot started at L1 and added agent mode (L2–L3). Claude Code defaults to L2 but supports auto-accept (L3). Devin launched at L3. Each generation assumes more autonomy.
The oversight model tradeoffs
The three models optimize for different things, and no single model wins across all dimensions.
Oversight Model Tradeoffs
Speed: HITL is bottlenecked by human response time. HOTL removes the bottleneck for routine actions. Full autonomy removes it entirely.
Safety: HITL theoretically catches every mistake — but rubber-stamping defeats this. HOTL catches the important mistakes because humans focus attention on flagged exceptions. Full autonomy relies entirely on upfront constraints.
Scalability: HITL requires one human per agent session. HOTL lets one human supervise multiple agents. Full autonomy scales to unlimited agents.
User trust: HITL feels safest because you see everything. HOTL requires trusting the agent's judgment about what to escalate. Full autonomy requires trusting the constraints you set.
Cost: Every human approval has a cost — context switch time, latency, cognitive load. HOTL reduces this to exception handling. Full autonomy eliminates it.
The Karpathy loop
Andrej Karpathy's autoresearch experiment demonstrated what happens when you push toward L4–L5 in a controlled domain. He set up a research loop: an agent generates hypotheses, designs experiments, runs them, and analyzes results — 700 experiments over two days, yielding an 11% speed improvement on an open-source LLM training codebase.
The key: Karpathy didn't approve each experiment. He set the objective, defined the search space, and let the agent iterate. Human-on-the-loop at its purest — the human defines what to optimize, the agent figures out how.
Shopify's CEO Tobi Lütke pushed the same philosophy company-wide: before requesting more headcount, teams must demonstrate why the task can't be done with AI agents. The result isn't full autonomy — it's forcing teams to find the right autonomy level for each task.
The lesson: high autonomy works when the feedback loop is tight (experiments have measurable outcomes), the blast radius is contained (changes are local and reversible), and the human defines success criteria upfront.
Martin Fowler's harness engineering
Martin Fowler's "Humans and Agents in Software Engineering Loops" offers the clearest framework for thinking about this shift. He distinguishes two loops:
The "why" loop — humans. Why are we building this? What problem does it solve? What are the constraints? This is the domain of product sense, business judgment, and ethical reasoning.
The "how" loop — agents. How do we implement this? What code changes are needed? What's the most efficient approach? This is the domain of execution, optimization, and automation.
Fowler's insight: the right question isn't "should agents be autonomous?" but "what harness do we build around them?" A harness defines:
- Boundaries: What the agent can and can't do
- Escalation triggers: When the agent must pause and ask
- Observation points: What the human can see at any time
- Rollback capability: How to undo what the agent did
This reframes the human role from gatekeeper to harness engineer. You're not approving individual actions — you're designing the system that constrains and enables agent behavior. It's the difference between a driving instructor who grabs the wheel every 30 seconds and one who sets up the course and intervenes only when the student is about to hit something.
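Fowler's four harness elements can be sketched as a small policy object. The names and structure here are my own, a minimal illustration rather than anything from his article or a real runtime:

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Harness:
    """Boundaries, escalation triggers, observation, rollback."""
    allowed_actions: set[str]                           # boundaries
    escalation_triggers: list[Callable[[dict], bool]]   # when to pause and ask
    audit_log: list[dict] = field(default_factory=list)        # observation
    undo_stack: list[Callable[[], None]] = field(default_factory=list)  # rollback

    def check(self, action: dict) -> str:
        self.audit_log.append(action)  # every action is observable
        if action["name"] not in self.allowed_actions:
            return "deny"
        if any(trigger(action) for trigger in self.escalation_triggers):
            return "escalate"          # pause and ask the human
        return "allow"

harness = Harness(
    allowed_actions={"read_file", "write_file"},
    escalation_triggers=[lambda a: a.get("path", "").startswith("/etc")],
)
assert harness.check({"name": "read_file", "path": "src/main.rs"}) == "allow"
assert harness.check({"name": "write_file", "path": "/etc/hosts"}) == "escalate"
assert harness.check({"name": "delete_file", "path": "src"}) == "deny"
```

Note what the human does here: they write the allowlist and the triggers once, then watch the audit log. No action requires a per-event click unless a trigger fires.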
Regulation is catching up
The EU AI Act, taking full effect in August 2026, mandates human oversight for high-risk AI systems. Article 14 requires that high-risk systems "can be effectively overseen by natural persons" and that humans can "intervene on the functioning of the high-risk AI system or interrupt the system."
This doesn't mandate HITL — it mandates effective oversight. A human clicking "approve" 200 times per hour isn't effective oversight. A human monitoring a dashboard of agent actions, with the ability to halt and roll back, arguably is. The regulation is model-agnostic on the HITL/HOTL question, but the spirit of the law favors HOTL: meaningful oversight, not performative approval.
For developers building agent systems, this means risk-based autonomy isn't a nice-to-have — it's a compliance requirement. The agent runtime needs to support different oversight levels for different risk categories. A hardcoded "approve everything" or "approve nothing" model won't meet the bar.
What this means for agent runtimes
Most agent frameworks force a single autonomy model. You're either running with approvals on or approvals off. The HITL/HOTL choice is binary and global.
This is the wrong abstraction. Different capabilities carry different risks. File reads are safe; file deletes are dangerous. Local search is harmless; sending HTTP requests to external services has consequences. The autonomy level should vary by capability, not by session.
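Per-capability autonomy can be expressed as a policy table instead of a global flag. A sketch, with capability names and level labels that are illustrative rather than any particular runtime's API:

```python
# Autonomy level by capability, not by session.
POLICY = {
    "file.read":   "autonomous",  # safe: no approval needed
    "file.write":  "monitored",   # HOTL: act, log, human can intervene
    "file.delete": "confirm",     # HITL: require explicit approval
    "http.get":    "monitored",
    "http.post":   "confirm",     # state-mutating external call
}

def decide(capability: str) -> str:
    # Unknown capabilities fall back to the most restrictive level.
    return POLICY.get(capability, "confirm")

assert decide("file.read") == "autonomous"
assert decide("file.delete") == "confirm"
assert decide("shell.exec") == "confirm"  # not listed -> restrictive default
```

The restrictive default matters: a new capability should have to earn autonomy, not inherit it.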
CrabTalk's composable command architecture makes this natural. Each command is a separate binary with its own trust profile:
- Local search runs at L3 — full autonomy within the local filesystem. No human approval needed for searching, reading, or indexing files.
- Gateway (external API calls) runs at L2–L3 with HOTL — the agent makes calls, but the gateway can require confirmation for state-mutating operations or calls to new endpoints.
- Skills and MCP servers run at configurable levels — the developer decides at install time whether a skill runs autonomously or requires confirmation.
The runtime doesn't pick one level. It lets the developer set the autonomy level per-command, per-capability, per-risk-category. This is harness engineering as a first-class runtime concept.
The composable model also solves the observation problem. Because each command is a separate process, you can monitor, log, and audit each capability independently. The gateway's HTTP traffic is observable without instrumenting the entire agent. The daemon's state is inspectable without pausing execution. Oversight is architectural, not bolted on.
The path forward
The industry is converging on a few principles:
- Default to HOTL, not HITL. Human approval for every action creates a false sense of safety. Human monitoring with exception-based intervention is both safer and more scalable.
- Risk-based autonomy. Match the oversight level to the risk level of each action. Don't treat file reads and database drops the same way.
- Invest in observability, not gates. The value isn't in blocking the agent — it's in seeing what the agent is doing and being able to intervene quickly when needed.
- Design for rollback. The best safety net isn't preventing mistakes — it's making mistakes reversible. Git commits, database transactions, staging environments.
- Comply by design. The EU AI Act and similar regulation will require demonstrable human oversight. Build it into the runtime, not as an afterthought.
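The rollback principle above — make mistakes reversible rather than impossible — can be sketched as an undo stack, the in-memory analog of the git commits and database transactions mentioned in the list. The class and method names are hypothetical:

```python
class ReversibleSession:
    """Record an inverse operation for every agent action."""

    def __init__(self) -> None:
        self._undo = []

    def apply(self, do, undo) -> None:
        do()
        self._undo.append(undo)  # push the inverse onto the stack

    def rollback(self) -> None:
        while self._undo:
            self._undo.pop()()   # undo in reverse order

state = {}
session = ReversibleSession()
session.apply(lambda: state.update(x=1), lambda: state.pop("x"))
session.apply(lambda: state.update(y=2), lambda: state.pop("y"))
assert state == {"x": 1, "y": 2}

session.rollback()  # every agent change is undone, newest first
assert state == {}
```

In a real runtime the "inverse" is a git reset, a transaction rollback, or tearing down a staging environment — the shape of the mechanism is the same.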
Human-in-the-loop served us well when agents were simple and tasks were few. As agents become more capable and tasks more numerous, the model has to evolve. The question isn't whether to give agents more autonomy — it's how to do it safely.
The answer isn't less human involvement. It's better human involvement — at the right level, at the right time, on the right things.
Sources
- Anthropic: Measuring Agent Autonomy — L0–L5 framework, Claude safeguard statistics
- Martin Fowler: Humans and Agents in Software Engineering Loops — Why/how loop distinction, harness engineering
- Fortune: Karpathy's Autoresearch Loop — 700 experiments, 11% speedup
- SiliconANGLE: Human-in-the-Loop Has Hit the Wall — Permission fatigue analysis
- EU AI Act: Article 14 — Human oversight requirements for high-risk systems
- OpenAI: Operator System Card — Mandatory confirmation for high-risk web actions
- McKinsey/QuantumBlack: The State of AI — Enterprise AI agent adoption patterns