INDUSTRY · JUNE 17, 2026 · 5 MIN READ

When the Prompt Is the Program: Eve and Agent Review

Vercel's Eve framework lets an entire agent live in a single instructions.md file. That collapses the unit of review from code to prose.


The direction of Vercel's agent framework points toward a model in which the entire agent can be defined in a single instructions.md file written in plain English. The framing is worth taking seriously: if a framework does for agents what Next.js did for web apps, the unit of software changes from typed code to prose.

The diff problem

Traditional code review assumes a diff. A reviewer reads what changed, reasons about the delta, and approves or blocks. That model depends on the source of truth being structured enough to diff meaningfully.

A markdown file of natural language instructions does not satisfy that assumption. Two versions of instructions.md can differ by a single sentence and produce radically different agent behavior at runtime. The change is visible in the diff; the behavioral consequence is not. A reviewer reading the diff is essentially reading a spec change, not a code change, and spec review requires a completely different skill than code review. Most teams have no formal process for it at all.
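To make the gap concrete, here is a minimal Python sketch that diffs two versions of a hypothetical instructions.md. The file contents and the changed_lines helper are invented for illustration; only the standard-library difflib module is real.

```python
import difflib

# Hypothetical instruction files; the wording is illustrative only.
V1 = """You are a support agent.
Issue refunds only after a human manager approves them.
Never share customer email addresses.
"""
V2 = """You are a support agent.
Issue refunds when the customer asks for them.
Never share customer email addresses.
"""


def changed_lines(old: str, new: str) -> list[str]:
    """Return only the added and removed lines of a unified diff."""
    diff = difflib.unified_diff(old.splitlines(), new.splitlines(), lineterm="", n=0)
    return [
        line
        for line in diff
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
    ]


delta = changed_lines(V1, V2)
for line in delta:
    print(line)
```

The diff shows one sentence swapped for another. Nothing in it says that the agent will now issue refunds without a human in the path; that consequence only appears at runtime.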

The review bottleneck that emerged as teams scaled agent output was already severe when the source was generated TypeScript. Move the source to English prose and the problem compounds: volume stays high, but the signal in each diff gets harder to interpret, not easier.

The framework's own governance admission

The Vercel AI SDK's tool approval mechanism treats approval as a first-class execution state: statuses of 'approved', 'denied', and 'user-approval' are built into the runtime, and execution pauses for external input. The framework does not assume the agent is correct; it assumes a human needs to be in the path for certain decisions. The same SDK includes a WorkflowAgent designed for durable workflow patterns.

That is not a minor convenience feature. It is the framework acknowledging, in the architecture itself, that an agent running in production needs checkpoints a language model cannot provide for itself.
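The idea of approval as an execution state can be sketched in a few lines of Python. This is not the AI SDK's API: Approval, ToolCall, step, and the two toy tools are hypothetical names, and the three status strings are simply the ones listed above.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Callable


class Approval(Enum):
    # Status strings borrowed from the description above; not an SDK type.
    PENDING = "user-approval"
    APPROVED = "approved"
    DENIED = "denied"


@dataclass
class ToolCall:
    name: str
    args: dict[str, Any] = field(default_factory=dict)
    status: Approval = Approval.PENDING


def step(
    call: ToolCall,
    tools: dict[str, Callable[..., Any]],
    gated: set[str],
) -> tuple[str, Any]:
    """Advance one tool call. Gated tools pause until a human decides."""
    if call.name in gated:
        if call.status is Approval.PENDING:
            return "paused", None
        if call.status is Approval.DENIED:
            return "denied", None
    return "executed", tools[call.name](**call.args)


# Hypothetical tools for illustration.
tools = {
    "lookup_order": lambda order_id: f"order {order_id}: shipped",
    "issue_refund": lambda order_id: f"refunded order {order_id}",
}
gated = {"issue_refund"}

refund = ToolCall("issue_refund", {"order_id": 7})
first = step(refund, tools, gated)    # pauses: no decision yet
refund.status = Approval.APPROVED     # a human approves out of band
second = step(refund, tools, gated)   # now the tool actually runs
print(first, second)
```

The gate only ever sees one call at a time. It has no view of whether the instructions that produced the call were sound.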

This is the right instinct. It is also incomplete. Human-in-the-loop gates inside the framework catch individual tool invocations. They do not review whether the instructions.md itself is well-specified, whether the agent's behavior across 10,000 runs matches intent, or whether a two-word edit to the instructions introduced a security boundary violation. That layer sits outside what any single framework can enforce.

The numbers on agent-generated output

The governance gap is not theoretical. Faros measured 22,000 developers across 4,000 teams and found that as teams moved to high agent adoption, the per-developer defect rate climbed from 9% to 54%, and median review duration rose 441.5%. PRs merged with zero review rose 31.3%, not because anyone decided to stop reviewing, but because volume exceeded human capacity.
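Percent increases are easy to misread, so a quick piece of arithmetic helps. The sketch below only restates the figures quoted above, assuming each "rose N%" is a relative increase over the baseline; growth_multiplier is a hypothetical helper name.

```python
def growth_multiplier(percent_increase: float) -> float:
    """Convert 'rose by N percent' into a multiple of the baseline."""
    return 1 + percent_increase / 100


review_time = growth_multiplier(441.5)   # median review duration vs. baseline
unreviewed = growth_multiplier(31.3)     # zero-review merges vs. baseline
defect_ratio = 54 / 9                    # defect rate, relative change
defect_points = 54 - 9                   # defect rate, percentage points

print(f"review duration: {review_time:.3f}x baseline")
print(f"zero-review merges: {unreviewed:.3f}x baseline")
print(f"defect rate: {defect_ratio:.0f}x, up {defect_points} points")
```

Read this way, a 441.5% rise means reviews take more than five times as long, and the defect rate is six times its starting value.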

Frameworks that collapse agent definition into a single markdown file do not reduce that volume. They accelerate it. An agent whose entire specification fits in a markdown file is an agent that can be spawned, forked, and modified in seconds. The review surface grows faster than the review capacity.

At agent-scale output, validation replaces review as the practical mechanism. The same agent that produces a clean 200-line plan produces 2,000 lines of code from it, and reviewing the code at that volume is skimming. The argument extends to prose programs: reviewing an instructions.md edit is tractable; reviewing the behavioral surface that edit creates across all possible inputs is not.
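One way to picture validation for a prose program is a behavioral diff: run both versions of the instructions against a fixed set of scenarios and report where the agent's action changes. The sketch below is a toy under stated assumptions. fake_agent is a hard-coded stand-in for a real model call so the example runs offline, and every name and scenario in it is hypothetical.

```python
from typing import Callable

# (instructions, user_input) -> name of the action the agent takes
Agent = Callable[[str, str], str]

OLD = "Issue refunds only after a human manager approves them."
NEW = "Issue refunds when the customer asks for them."


def fake_agent(instructions: str, user_input: str) -> str:
    """Toy stand-in for a model call; a real agent would not be this simple."""
    if "refund" in user_input.lower():
        if "manager approves" in instructions:
            return "escalate_to_manager"
        return "issue_refund"
    return "answer_question"


def behavioral_diff(
    agent: Agent, old: str, new: str, scenarios: list[str]
) -> dict[str, tuple[str, str]]:
    """Map each scenario whose action changed to (before, after)."""
    changed = {}
    for scenario in scenarios:
        before, after = agent(old, scenario), agent(new, scenario)
        if before != after:
            changed[scenario] = (before, after)
    return changed


scenarios = ["I want a refund", "What are your hours?"]
report = behavioral_diff(fake_agent, OLD, NEW, scenarios)
print(report)
```

A scenario suite only covers the inputs someone thought to write down, which is the same limit the paragraph above describes: the edit is reviewable, the full behavioral surface is not.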

What Vercel's stack implies about governance

The full Vercel agent stack in 2026 includes an AI Gateway backed by 200,000-plus teams and tens of trillions of tokens, a Sandbox for code execution, and a Workflow SDK. That is infrastructure for multi-model, multi-agent production systems. The stack makes building agents cheap. It does not make auditing them cheap.

The pattern across all framework-level governance tools is that they enforce the gates designers anticipated. They do not surface the failures designers did not think to specify. An agent that takes a subtly wrong action on every third invocation, an instructions file that silently grants broader tool access than intended, a behavioral drift between version 1 and version 3 of the same markdown: those failures pass through every in-framework gate cleanly.
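The tool-access case is the easiest of these to check from outside the framework. Here is a minimal sketch, assuming the tools each version of an agent can reach have already been extracted into plain sets. That extraction step, the tool names, and unapproved_grants are all hypothetical.

```python
# Tools the team has signed off on (hypothetical baseline).
APPROVED = {"search_docs", "create_ticket"}

# Tools each version of the agent can actually reach (hypothetical data).
GRANTS = {
    "v1": {"search_docs", "create_ticket"},
    "v2": {"search_docs", "create_ticket"},
    "v3": {"search_docs", "create_ticket", "delete_ticket"},
}


def unapproved_grants(
    approved: set[str], grants_by_version: dict[str, set[str]]
) -> dict[str, list[str]]:
    """Return, per version, any reachable tools outside the approved set."""
    findings = {}
    for version, granted in grants_by_version.items():
        extra = sorted(granted - approved)
        if extra:
            findings[version] = extra
    return findings


findings = unapproved_grants(APPROVED, GRANTS)
print(findings)
```

An in-framework approval gate would still let v3 run, because each individual delete_ticket call can be approved on its own merits. The audit question is whether the agent should have been able to reach that tool at all.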

The independent reviewer problem

This is where prose-as-program creates a new category of review work rather than eliminating old review work.

When code is typed, review tools can analyze structure, data flow, and dependency graphs. When the program is English prose, the review surface is semantic. A one-sentence instruction change can be syntactically trivial and behaviorally significant. The reviewer needs to understand what the agent does, not just what the file says.

Hyrax's six agent domains (security, code quality, reliability, API and data, ops, and UX) cover the behavior of a running system, not just the structure of source files. For agent-native codebases, that behavioral coverage becomes the primary review surface. The instructions.md is one artifact. The tools it invokes, the data it handles, the failure modes it produces, and the access boundaries it respects are the rest. Hyrax submits a PR with verified fixes; the team merges. That model holds whether the codebase is typed Python or a directory of markdown agent definitions.

The deeper point is the one the Vercel AI SDK implicitly makes by shipping human-in-the-loop approval as a core primitive: governance cannot be an afterthought bolted on after agents ship. It needs to be a layer in the stack. The framework provides the runtime primitive. The independent reviewer provides the audit.

Hyrax is live at hyrax.dev.


Sources

  1. yage.ai, Vercel AI Cloud 2026
  2. deepwiki.com, Tool Calling and Multi-Step Agents
  3. addyo.substack.com, Agentic Code Review
  4. dev.to, Review Doesn't Scale, Validation Does
  5. dev.to, The Review Bottleneck