How do you enforce a real definition-of-done?

NickyDigital · 2026-06-06

Agents marking build tasks "done" with no shipped code, and not escalating the blocker.

I'm running several "autonomous" companies on Paperclip (opencodelocal agents, Manifest-routed). Hit a pattern that seems system-wide across all of them, and I'd love to know how others handle it.

The problem: my agents close Implement/Build issues as "done" with confident comments ("registration form fully functional," "component implemented and validated"), but the actual deliverables are markdown specs or code attached to the issue, not code that is committed to a repo and builds.

In one company the board showed 33 issues "done" and a happy CEO report, while the real codebase was empty. I only found out by SSHing in and inspecting the filesystem myself.

The part that worries me more: the root cause was a missing capability — no GitHub repo / deploy target wired for the team to build into. The agents silently worked around it (dumping code as issue attachments) instead of escalating.

Zero approvals raised, zero "blocked" issues, no alert. As the founder I had no way to know without digging. The "significant events → founder" reporting never fired because the agents believe they succeeded.

So: procedurally the orgs are healthy (no loops, clean delegation), but there's no output-quality gate. Agents optimize "close the issue," not "ship working software." Green dashboard, zero product.

What we're going to try:

Definition-of-Done in AGENTS.md: for any Implement/Build task, "done" requires code committed to the repo and building. Specs/attachments stay inreview, never done.

Escalation rule — if a task can't complete due to a missing capability (repo, credential, deploy target, API key), the agent must raise an approval to the founder, not work around it.

A real alert channel (webhook/ntfy/email) so escalations + significant events actually reach me.

Answers

NickyDigital · 2026-06-08

Quick update — your reframe was right, and most of it mapped onto primitives we already had. Closing the loop on what we did, in case it helps anyone else.

Verified the execution-policy primitive is real and runtime-enforced (not just docs). In our deployment there's a dedicated service enforcing it, with tests and an audit table of every review/approval decision, and blocked is a real lifecycle state.

Confirmed the behavior you described: executor → done gets intercepted to inreview, reassigned to the reviewer, and the runtime excludes the original executor from reviewing their own work.

There's also an always-on comment-required backstop — a run that posts no comment gets re-woken once, then recorded as failed. That alone kills a chunk of silent completions.

On #4 (alerts) — can confirm roll-your-own works. We already run ntfy + a small poller over GET …/approvals?status=pending and …/issues?status=blocked.

On its own it's empty noise; the value only appears once #1–#3 make a stuck task surface as blocked/pending instead of a false "done." Agreed it deserves a first-class story.
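A minimal sketch of that kind of poller, in Python. The two endpoint URLs are passed in by the caller because the paths are elided above; the response shape (a JSON list of objects with an `id` and an optional `title`), the function names, and the notification title are all assumptions for illustration. Publishing to ntfy is a plain HTTP POST of the message body to the topic URL.

```python
import json
import urllib.request
from typing import Callable, Iterable


def fetch_json(url: str) -> list[dict]:
    """GET a URL and decode a JSON list (assumed response shape)."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def new_alerts(items: Iterable[dict], kind: str, seen: set[str]) -> list[str]:
    """Return one message per item not alerted on before, and remember it."""
    messages = []
    for item in items:
        key = f"{kind}:{item['id']}"
        if key in seen:
            continue  # already notified on an earlier poll
        seen.add(key)
        messages.append(f"[{kind}] {item.get('title', item['id'])}")
    return messages


def send_ntfy(topic_url: str, message: str) -> None:
    """Publish a plain-text message to an ntfy topic via HTTP POST."""
    req = urllib.request.Request(
        topic_url,
        data=message.encode("utf-8"),
        method="POST",
        headers={"Title": "Paperclip escalation"},  # hypothetical title
    )
    urllib.request.urlopen(req, timeout=10).close()


def poll_once(
    approvals_url: str,
    blocked_url: str,
    seen: set[str],
    fetch: Callable[[str], list[dict]] = fetch_json,
    notify: Callable[[str], None] = print,
) -> list[str]:
    """Check both endpoints once and notify on anything new."""
    sent = []
    sources = (("approval-pending", approvals_url), ("blocked", blocked_url))
    for kind, url in sources:
        for message in new_alerts(fetch(url), kind, seen):
            notify(message)
            sent.append(message)
    return sent
```

In a real loop you would call `poll_once` on a timer with `notify=lambda m: send_ntfy(topic_url, m)`. The `seen` set is what keeps a long-pending approval from paging you on every poll.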

What we shipped:

Capability-gate + Definition-of-Done in every agent's instructions (the belt): before a Build task, confirm repo + creds + deploy target exist; if any is missing, no workaround artifact — move to blocked, name exactly what's missing, raise a board approval, stop.

"Done" = committed SHA on the expected branch + build passes + tests pass, pasted as a comment.

Specs/attachments are never "done."
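For anyone who wants the capability gate as a preflight script rather than only as instructions, here is a rough sketch in Python. The capability names, the `GateResult` type, and the comment wording are hypothetical; how you actually detect a repo, credentials, or a deploy target depends on your setup, so the function just takes the results of those checks as booleans.

```python
from dataclasses import dataclass, field

# Capabilities a Build task needs before any work starts (assumed names).
REQUIRED_CAPABILITIES = ("repo", "credentials", "deploy_target")


@dataclass
class GateResult:
    status: str                                   # "ready" or "blocked"
    missing: list[str] = field(default_factory=list)
    comment: str = ""


def capability_gate(available: dict[str, bool]) -> GateResult:
    """Decide whether a Build task may start, naming exactly what is missing."""
    missing = [c for c in REQUIRED_CAPABILITIES if not available.get(c, False)]
    if not missing:
        return GateResult(status="ready")
    comment = (
        "Blocked: missing " + ", ".join(missing)
        + ". Raising a board approval; no workaround artifact produced."
    )
    return GateResult(status="blocked", missing=missing, comment=comment)
```

A capability that is absent from the input counts as missing, so a check that was forgotten fails closed instead of letting the task start.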

Aron Prins · 2026-06-07

This is the single most important failure mode in autonomous operation, and your instinct is right: a green dashboard over an empty repo isn't a discipline problem, it's a missing enforcement layer. The good news is most of what you're proposing in (1)–(4) maps onto primitives that already exist — and the parts that don't, I'll be honest about so you don't build on sand.

First, the reframe that matters: stop trying to fix this in AGENTS.md. Your four-point plan leans on instructions ("done requires committed code," "you must escalate"). Instructions are belt-and-suspenders, not a gate — an agent that believes it succeeded will sail straight through prose it has technically satisfied. The fix is structural. Three pieces:

1. Put an Execution Policy on every Implement/Build issue. This is the definition-of-done gate, and it's runtime-enforced.

Execution policies are the thing you're missing. From the docs: "Instead of trusting an agent to remember to hand work off for review, the runtime enforces review and approval stages automatically — the moment an executor tries to close the issue, the runtime intercepts the transition and routes the work to the right reviewer or approver." (Execution Policy (/docs/power-features/execution-policy))

Concretely: an executor that tries to move a Build issue to done cannot. The runtime forces it to inreview and hands it to your reviewer. The executor literally does not have the ability to mark its own build shipped. That's exactly the "specs/attachments stay inreview, never done" behavior you want — and it's enforced by the engine, not by hope.
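To make the interception concrete, here is a toy model of the transition rule in Python. It is not Paperclip's code or API: the `Issue` type, the function name, and the `in_progress` status are made up for illustration, and only the behavior (executor's "done" becomes inreview, reassigned to a reviewer who is not the executor) comes from the description above.

```python
from dataclasses import dataclass


@dataclass
class Issue:
    status: str
    assignee: str
    executor: str


def request_transition(issue: Issue, actor: str, target: str, reviewer: str) -> Issue:
    """Apply a status change, intercepting an executor's attempt to close."""
    if target == "done" and actor == issue.executor:
        if reviewer == issue.executor:
            raise ValueError("executor cannot review their own work")
        # Intercept: the issue goes to review instead of closing.
        return Issue(status="inreview", assignee=reviewer, executor=issue.executor)
    if target == "done":
        # Only the assigned reviewer may close, and only from review.
        if issue.status != "inreview" or actor != issue.assignee:
            raise PermissionError("only the assigned reviewer can close from inreview")
        return Issue(status="done", assignee=issue.assignee, executor=issue.executor)
    # Any other transition passes through unchanged in this toy model.
    return Issue(status=target, assignee=issue.assignee, executor=issue.executor)
```

The point of the model is that "done" is reachable only through a second actor, so a confident closing comment from the executor cannot change the outcome.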

2. Make the reviewer a verification agent whose only job is to prove the work is real.

A review/approval stage participant can be another agent. So add a dedicated Verifier/QA agent as the required reviewer on Build issues, and make its task brutally concrete — not "does this look done?" but:

- git -C <repo> log --oneline -5 shows the commit that claims to implement this issue
- the build actually builds (npm run build / cargo build / whatever; paste the exit code and tail of output)
- tests pass (paste the run)
- approve only if green; request changes otherwise.

When it requests changes, the runtime routes back to the same executor automatically, and loops until the verifier approves. Every decision is recorded with actor + comment + run-id, so you get an audit trail. One important nuance so you don't over-engineer: there is no built-in CI/command stage type today — the policy stages are review/approval with agent or user participants. The "automated check" is implemented as the verifier agent running the build, which is the supported shape. (The pattern is sketched in Designing Execution Policies (/guides/operations/execution-policy-design); just know the concrete mechanism is the agent reviewer, not a native test-runner field.)
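The verifier's checks could be wrapped in a small script the agent runs and pastes from. This is a sketch under assumptions, not a built-in feature: the function and type names are hypothetical, the build and test commands are whatever your project uses, and it swaps the `git log` eyeball check for `git merge-base --is-ancestor`, which exits 0 only when the claimed commit is reachable from the expected branch.

```python
import subprocess
from dataclasses import dataclass


@dataclass
class CheckResult:
    name: str
    exit_code: int
    tail: str  # last lines of combined output, for pasting into the review


def run_check(name: str, cmd: list[str], cwd: str, tail_lines: int = 20) -> CheckResult:
    """Run one command and keep its exit code plus the tail of its output."""
    proc = subprocess.run(cmd, cwd=cwd, capture_output=True, text=True)
    lines = (proc.stdout + proc.stderr).splitlines()
    return CheckResult(name, proc.returncode, "\n".join(lines[-tail_lines:]))


def build_checks(sha: str, branch: str, build_cmd: list[str], test_cmd: list[str]):
    """Ordered checks: the commit is on the branch, the build passes, tests pass."""
    return [
        ("commit-on-branch", ["git", "merge-base", "--is-ancestor", sha, branch]),
        ("build", build_cmd),
        ("tests", test_cmd),
    ]


def verify(repo: str, checks) -> tuple[str, list[CheckResult]]:
    """Run checks in order; stop at the first failure and request changes."""
    results = []
    for name, cmd in checks:
        result = run_check(name, cmd, repo)
        results.append(result)
        if result.exit_code != 0:
            return "request_changes", results
    return "approve", results
```

For a Node project the call might look like `verify(repo, build_checks(sha, "main", ["npm", "run", "build"], ["npm", "test"]))`, with the returned exit codes and tails pasted into the review comment as the evidence.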

The deeper point: acceptance has to live in an artifact a reviewer can check, not in narrative. If a Build issue has no way to prove "it builds," that's the first thing the verifier writes, not the last. An issue that can't be verified is an issue that can't be autonomously "done."

NickyDigital · 2026-06-07

@aronprins thank you for this in-depth reply. Let me chew through it and get back to you with questions! I will likely take you up on your generous offer to help with build-issue setup.