Benchmark methodology and results
A new evaluation methodology that captures user, agent, and filesystem interactions — applied to two task suites: agent self-correction on hidden side effects, and routine-task user interaction. Drawn from §5 of the paper. For micro-benchmarks see the separate performance page.
Existing agent benchmarks evaluate the model or framework in isolation, bypassing permission prompts. To capture the new interaction paradigm between user, agent, and filesystem, we built a lightweight interactive benchmark harness with four requirements:
During each run, the harness submits the prompt, watches the screen for permission dialogs, and answers each one according to a fixed per-agent policy (typically "allow + don't ask again"). It then waits for the agent to return to its input-ready state, with a 3-minute timeout. After execution, it checks the working directory against the task's expected state — file existence, contents, and permissions — and verifies expected strings in tool-call outputs.
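The post-execution check described above can be sketched roughly as follows. This is a minimal illustration, not the harness's actual code: `ExpectedFile`, `check_workdir`, and `check_outputs` are hypothetical names, and the expected-state format is an assumption.

```python
import stat
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass
class ExpectedFile:
    """Hypothetical description of one file in a task's expected end state."""
    path: str                      # relative to the working directory
    exists: bool = True
    content: Optional[str] = None  # exact text, checked only if set
    mode: Optional[int] = None     # permission bits such as 0o644, checked only if set


def check_workdir(workdir: Path, expected: list[ExpectedFile]) -> list[str]:
    """Return human-readable mismatches; an empty list means the check passed."""
    problems = []
    for exp in expected:
        target = workdir / exp.path
        if target.exists() != exp.exists:
            problems.append(f"{exp.path}: existence mismatch")
            continue
        if not exp.exists:
            continue  # file is correctly absent, nothing more to compare
        if exp.content is not None and target.read_text() != exp.content:
            problems.append(f"{exp.path}: content mismatch")
        if exp.mode is not None and stat.S_IMODE(target.stat().st_mode) != exp.mode:
            problems.append(f"{exp.path}: mode mismatch")
    return problems


def check_outputs(tool_outputs: list[str], expected_strings: list[str]) -> list[str]:
    """Return the expected strings that appear in no tool-call output."""
    return [s for s in expected_strings if not any(s in out for out in tool_outputs)]
```

A task would pass when both functions return empty lists. The permission-bit comparison assumes a POSIX filesystem.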
We hand-crafted 11 opaque tasks where a routine command (a linter, a build, a deploy script) silently does something destructive: deletes README, overwrites .env, drops src/, and so on. For each task we prepare a minimal project (10–41 LoC) and ask the agent to perform a common operation.
We vary how much an agent can infer about a command before running it:
Without YoloFS, no baseline agent reliably prevents the destructive side effects. With YoloFS, after each command the agent sees a summary of file-level changes (creations, deletions, modifications) and can revert before the user commits. We provide no extra prompting to look for damage.
| Setup | Caught the damage | Notes |
|---|---|---|
| Claude Code + YoloFS | 11 / 11 | 8 self-corrected, 3 user-correctable (kept staged) |
| Claude Code (baseline) | 0 / 11 | fails all but optimizer (asks user) |
| Codex | 0 / 11 | delegated to user via sandbox-fallback prompt |
| Copilot | 0 / 11 | fails all but build-pkg (asks user) |
| Gemini | 0 / 11 | fails 9; 2 blocked by static policy (no work done either) |
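To make the idea of a file-level change summary (creations, deletions, modifications) concrete, here is a minimal user-space sketch that compares two directory snapshots. It is only an illustration: how YoloFS actually tracks changes is not described here, and `snapshot` and `summarize_changes` are hypothetical names.

```python
import hashlib
from pathlib import Path


def snapshot(root: Path) -> dict[str, str]:
    """Map each regular file under root (relative path) to a SHA-256 of its bytes."""
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def summarize_changes(before: dict[str, str], after: dict[str, str]) -> dict[str, list[str]]:
    """Classify paths as created, deleted, or modified between two snapshots."""
    return {
        "created": sorted(after.keys() - before.keys()),
        "deleted": sorted(before.keys() - after.keys()),
        "modified": sorted(
            p for p in before.keys() & after.keys() if before[p] != after[p]
        ),
    }
```

Taking one snapshot before a command and one after, a deleted README or an overwritten .env shows up under `deleted` or `modified` no matter which command caused it.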
[...] git diff + ls, reverts, then concludes: "CRITICAL: This is a destructive script, not a legitimate formatter!"

[...] make package, sees src/, README.md, LICENSE deleted; immediately reverts; then traces the three-level chain through the Makefile.

[...] .env with production defaults), so the agent accepts. YoloFS still keeps them staged so the user can reject before commit.
The takeaway: when the filesystem surfaces effects, agents self-correct on clearly-inappropriate damage, and the user stays in the loop on the rest.
Does shifting control to the filesystem actually reduce user interaction on common tasks? We ran 112 single-operation tasks across five framework configurations. Each task asks the agent to perform one filesystem operation (read, append, copy, move, delete, patch, etc.) on a path that may stay inside the project, cross the boundary, or traverse a symlink.
18 operations, each exercised on 5–7 path variants, give 112 tasks in total. Operations span file work (read, append, overwrite, patch, clear, delete, copy, move, create), directory work (create, delete, copy, move), and search (list, grep, glob, glob+read, glob+delete). Paths span project-direct, backtrack, reentry, external direct/backtrack, and symlinks to external files/dirs.
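As a quick sanity check on the task count, the sketch below enumerates the operations listed above and computes the range of totals that 5–7 paths per operation allows. The `file:`, `dir:`, and `search:` prefixes are labels added here for illustration, and the exact path set for each operation is not given in this section, so only the bounds are computed.

```python
# Operation names as listed in the text, grouped by kind.
FILE_OPS = ["read", "append", "overwrite", "patch", "clear", "delete", "copy", "move", "create"]
DIR_OPS = ["create", "delete", "copy", "move"]
SEARCH_OPS = ["list", "grep", "glob", "glob+read", "glob+delete"]

# Prefix each name with its group so file "delete" and directory "delete" stay distinct.
OPERATIONS = (
    [f"file:{op}" for op in FILE_OPS]
    + [f"dir:{op}" for op in DIR_OPS]
    + [f"search:{op}" for op in SEARCH_OPS]
)

MIN_PATHS, MAX_PATHS = 5, 7  # path variants per operation, from the text
TOTAL_TASKS = 112            # reported total


def task_count_bounds(n_ops: int, lo: int, hi: int) -> tuple[int, int]:
    """Smallest and largest possible task totals for n_ops operations."""
    return n_ops * lo, n_ops * hi


low, high = task_count_bounds(len(OPERATIONS), MIN_PATHS, MAX_PATHS)
print(len(OPERATIONS), low, high, low <= TOTAL_TASKS <= high)
```

The 112 reported tasks fall between the two bounds, which works out to a little over six path variants per operation on average.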
Five configurations: Claude Code with YoloFS, plus baseline Claude Code, Codex, Copilot, and Gemini. We instruct agents to use shell commands so the comparison is fair across frameworks with different built-in tool sets. Tool calls used solely for permission dialogs are excluded from tool-call counts. The harness selects "allow" and "don't ask again" whenever available, so each unique prompt is counted only once.
| Framework | Success | Tool calls / task | User prompts / task |
|---|---|---|---|
| Claude Code + YoloFS | 99% | ~1.0 | 0.4 |
| Codex | 100% | 1.8 | 0.4 |
| Claude Code (baseline) | 98% | ~1.0 | 0.9 |
| Copilot | 96% | ~1.1 | 1.3 |
| Gemini | 75% | 2.9 | 2.2 |
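The per-task averages can be computed from run logs along the following lines. This is a sketch under stated assumptions: `RunLog` and its fields are hypothetical, and "each unique prompt is counted only once" is interpreted here as deduplication within a single task, which the text does not spell out.

```python
from dataclasses import dataclass, field


@dataclass
class RunLog:
    """Hypothetical per-task record produced by a harness."""
    succeeded: bool
    # Each tool call carries a "purpose": "work" or "permission" (assumed labels).
    tool_calls: list[dict] = field(default_factory=list)
    # Text of every permission prompt shown during the task.
    prompts: list[str] = field(default_factory=list)


def summarize(runs: list[RunLog]) -> dict[str, float]:
    """Compute success rate, tool calls per task, and user prompts per task."""
    if not runs:
        raise ValueError("no runs to summarize")
    n = len(runs)
    # Tool calls used solely for permission dialogs are excluded.
    work_calls = sum(
        1 for r in runs for c in r.tool_calls if c.get("purpose") != "permission"
    )
    # Assumption: identical prompt text within one task counts once.
    unique_prompts = sum(len(set(r.prompts)) for r in runs)
    return {
        "success_rate": sum(r.succeeded for r in runs) / n,
        "tool_calls_per_task": work_calls / n,
        "prompts_per_task": unique_prompts / n,
    }
```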
YoloFS matches the best baseline for prompts (0.4) at 99% success. Codex matches the prompt count, but only via its writable sandbox-with-fallback, exactly the design that fails every hidden-side-effect task in §B above. Baseline Claude Code prompts on most shell commands; Copilot prompts even more; Gemini's stricter policy causes both repeated access requests and lower task completion.
Tool-call counts explain the gap. Codex averages 1.8 because it first executes inside the sandbox; when the sandbox blocks the command, it prompts the user and re-executes, nearly doubling the call count. Gemini's 2.9 reflects retries on failed access attempts. YoloFS, baseline Claude Code, and Copilot each stay at roughly 1.0 to 1.1 calls.
Why effect-level control is less prompt-heavy. Codex wraps a simple sed -i into a multi-line script with error handling and verification; the longest reaches 330 characters. A "don't ask again" rule keyed on such a command is awkward: it is either too broad (approving sed -i approves any in-place edit) or too narrow (a one-off compound command that won't recur). YoloFS prompts on the accessed path; the user's decision applies regardless of which command touches it, so rules generalize correctly.
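The contrast between command-keyed and path-keyed rules can be illustrated with a small sketch. Neither function reflects the real rule format of Codex or YoloFS; both are hypothetical simplifications, and the path check assumes paths are already absolute and normalized.

```python
from pathlib import PurePosixPath


def command_rule_allows(allowed_prefixes: set[str], command: str) -> bool:
    """Command-keyed rule: approve if the command starts with an approved prefix."""
    return any(command.startswith(prefix) for prefix in allowed_prefixes)


def path_rule_allows(allowed_paths: set[str], accessed_path: str) -> bool:
    """Path-keyed rule: approve if the path is an approved path or lies under one."""
    target = PurePosixPath(accessed_path)
    return any(
        target == PurePosixPath(p) or PurePosixPath(p) in target.parents
        for p in allowed_paths
    )


# Too broad: approving "sed -i" also approves an edit far outside the project.
print(command_rule_allows({"sed -i"}, "sed -i 's/a/b/' /etc/hosts"))
# Too narrow: a compound script wrapping the same edit no longer matches.
print(command_rule_allows({"sed -i"}, "set -e; sed -i 's/a/b/' notes.txt && grep b notes.txt"))
# Path-keyed: the decision follows the file, whichever command touches it.
print(path_rule_allows({"/proj/notes.txt"}, "/proj/notes.txt"))
print(path_rule_allows({"/proj/notes.txt"}, "/etc/hosts"))
```

With the command-keyed rule, the first call is approved and the second is not, even though the second is the more careful edit. With the path-keyed rule, approval depends only on which file is accessed.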
For raw filesystem numbers (single-file I/O, metadata, snapshot scalability, kernel-build workflow), see the performance page.