Experiment report: building a software-dev company in Paperclip

Abdul · 2026-05-18

Hey everyone — I spent the last few days experimenting deeply with Paperclip AI to build an autonomous software-development company.

My goal was simple: give the company a project brief or repo, then let the agents plan, build, review, ship, and report with minimal human involvement.

Here’s what I tried:

First, I downloaded a public company from the Paperclip community and also downloaded the Paperclip docs. Then I asked Claude to review the docs + company setup and create a “Paperclip advisor” skill that could guide me on best practices.

After that, I started a fresh Claude session, activated the advisor skill, provided my Superpowers repo/project, and asked Claude to build a Paperclip company for it using the best practices from the docs and reference company.

The setup used Claude Opus as the adapter.

What I built:

A 7-agent software company: CEO, PM, Architect, Engineer, Reviewer, DevOps, and Knowledge Steward

Workflows for Spec → Plan → Build → Review → Ship → Retro

Operator playbooks, nudge scripts, acceptance checks, and verification scripts

Answers

Aron Prins · 2026-05-20

Right reframing — and the answer changes meaningfully now that you've said "real SaaS product, sustained across cycles." Solo Dev Shop is too thin for that, but you're also nowhere near needing the 40-agent shapes like fullstack-forge. The sweet spot is around 5 roles, and there's an existing reference template that lands almost exactly there.

Start from agentsys-engineering ([on github](https://github.com/paperclipai/companies/tree/main/agentsys-engineering)). Its roster is:

CEO
├── CTO (architecture, technical direction)
├── Staff Engineer (the actual building)
├── QA & Release Lead (acceptance + ship)
└── Research / Perf Analyst (investigation + metrics)

Five roles, no overlapping lanes, every role owns a distinct output type. Compared to your previous 7-agent setup the key cuts are: no separate PM (CEO does intake/prioritization at this scale), no separate Lead Engineer + Engineer split (one Staff Engineer covers it), no DevOps role (QA & Release Lead owns ship for now), no Knowledge Steward (Research/Perf Analyst owns memory).

Why this shape works for SaaS specifically:

One builder, one boundary-keeper. The single Staff Engineer is the only agent that writes code. The QA & Release Lead is the only agent that decides "shipped." That removes the "Reviewer + Engineer fighting over the artifact" failure mode you saw.

CTO ≠ Architect. Subtle but load-bearing: Architects often write specs that Engineers then re-spec their own way. A CTO sets direction (this stack, that database, these tradeoffs) but doesn't write the per-feature spec. Feature specs live in the issue, written by the CEO at intake.

No PM at this scale. When you have one builder, a PM is mostly issue-shuffling overhead. The CEO can hold the backlog directly until throughput demands a PM, which is the bottleneck signal that earns the next hire.

For sustained SaaS work specifically, layer in (on top of the template):

Lock the cadence to a Routine, not to agent prose. Your "Spec → Plan → Build → Review → Ship → Retro" drift came from putting cadence in AGENTS.md. Don't. Use a weekly Routine assigned to the CEO that creates the next cycle's parent issue with a fixed structure (spec subtask → plan subtask → build subtask → review subtask → release subtask → retro subtask). The cadence then lives in the issue graph as blockedBy links, which agents can't drift around because the next subtask is literally not assignable yet.

Persistent memory ≠ a Hermes agent yet. For your earlier scope (CLI tools) you don't need Hermes. For sustained SaaS, where decision history matters (why did we pick Postgres? what did we try and reject for auth?), eventually yes. For now, a decisions/ directory in the repo with the CTO appending one file per architectural call gets you 80% of the value without a sixth agent.

Acceptance lives in tests, not in Reviewer judgment. The role-drift you saw is downstream of a Reviewer that grades narrative. Make the QA & Release Lead's task literally: run the test suite, paste the output, decide pass/fail. No prose grading. If a feature has no acceptance test, that's the first thing the QA & Release Lead writes, not the last.
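To make the blockedBy idea concrete, here is a rough Python model of one cycle: a parent issue plus six subtasks, each blocked by the one before it. This is only a sketch of the shape. The `Issue` class, its field names, and the `build_cycle` / `assignable` helpers are made up for illustration and are not Paperclip's issue schema or API.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Phase order taken from the thread; everything else is hypothetical.
PHASES = ["spec", "plan", "build", "review", "release", "retro"]


@dataclass
class Issue:
    id: str
    title: str
    parent_id: Optional[str] = None
    blocked_by: List[str] = field(default_factory=list)
    done: bool = False


def build_cycle(cycle_name: str) -> List[Issue]:
    """Return a parent issue followed by one subtask per phase, chained in order."""
    parent = Issue(id=cycle_name, title=f"Cycle {cycle_name}")
    issues = [parent]
    previous_id = None
    for phase in PHASES:
        sub = Issue(
            id=f"{cycle_name}-{phase}",
            title=f"{phase} subtask",
            parent_id=parent.id,
            # The first phase has no blocker; every later phase waits on the prior one.
            blocked_by=[previous_id] if previous_id else [],
        )
        issues.append(sub)
        previous_id = sub.id
    return issues


def assignable(issues: List[Issue]) -> List[str]:
    """IDs of open subtasks whose blockers are all done."""
    by_id = {issue.id: issue for issue in issues}
    return [
        issue.id
        for issue in issues
        if issue.parent_id is not None
        and not issue.done
        and all(by_id[blocker].done for blocker in issue.blocked_by)
    ]
```

With this shape, exactly one subtask is open for assignment at any time, so skipping from spec straight to build is not something an agent can talk its way into.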

On the broader "doesn't drift, doesn't skip ownership boundaries, doesn't claim done without evidence" ask — that's not a roster problem. That's a workflow primitives problem. Three primitives to lean on hard:

Execution workspaces: each agent constrained to one repo dir. (/docs/projects-workflow/workspaces)

parentId + goalId discipline on every subtask: the goal becomes the boundary. (Agents API heartbeat protocol, step 9: /docs/api/agents)

Approval gates at exactly two points: spec acceptance and merge. Everything between is autonomous. That's the documented "small dev shop" pattern in The Five Approval Patterns (/guides/operations/approval-patterns).
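If you want an operator-side sanity check on the "one repo dir" constraint, something like the following could flag file paths that escape a workspace. It is a minimal sketch and says nothing about how Paperclip itself enforces execution workspaces; the `inside_workspace` helper and the example paths are hypothetical.

```python
from pathlib import Path


def inside_workspace(workspace: str, candidate: str) -> bool:
    """True if candidate resolves to the workspace root or somewhere below it."""
    root = Path(workspace).resolve()
    # Joining with an absolute candidate yields the candidate itself, so
    # absolute paths are checked too. resolve() also collapses "..".
    target = (root / candidate).resolve()
    return target == root or root in target.parents
```

For example, `inside_workspace("/tmp/repo", "../other/file.txt")` returns `False`, which is the kind of write you would want surfaced in an audit.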

Abdul · 2026-05-18

Thanks @aronprins — this is very helpful, and I think your point about “debugging by adding if branches” is exactly what happened in my setup.

One clarification: my real goal is not to use Paperclip for small CLI tools or tiny tasks like “build a page.” Those were only validation tasks. My actual target is a real SaaS product, where I want the agent company to handle larger operational work: features, backend logic, frontend flows, integrations, tests, releases, docs, and ongoing iteration.

So I’m trying to understand the right Paperclip shape for a larger software project, not just a small dev shop that ships one-off utilities.

I verified the setup:

Paperclip CLI/server version: 2026.512.0

Runtime: Node v22.22.1, local trusted deployment

All 7 agents used claudelocal

CEO / PM / Architect / Reviewer used claude-opus-4-6

Lead Engineer / DevOps / Knowledge Steward used claude-sonnet-4-6

Knowledge Steward was Claude Sonnet, not Hermes

Aron Prins · 2026-05-18

This is a really useful write-up — thanks for posting it instead of just abandoning the experiment. Your own diagnosis is essentially right and matches what we've seen across other ambitious setups: when an agent stack misbehaves, adding instructions to fix it is almost always the wrong move. It's the AI-company version of debugging by adding if branches. A few concrete patterns that tend to land better than rule-stacking:

1. Cut the roster before you cut the rules. Seven agents is a lot for a small dev shop, especially if Engineer + Reviewer + DevOps are all spawning subtasks in each other's lanes. The org-design guide How Delegation Mirrors Human Org Design (/guides/concepts/delegation-mirrors-human-org-design) makes the point that an org with too many overlapping roles produces exactly the symptom you saw — agents inventing work adjacent to the brief because nobody owns the boundary. For dev work specifically, a CEO + a PM + a single Engineer + a Reviewer can ship most small features. Add specialists when the bottleneck is real, not anticipatory.

2. Move scope policing from prompts into executable checks. The "agents created issues outside the intended project" failure mode is best caught by Paperclip's primitives, not by instructions. Two practical levers:

Set the execution workspace on each agent so the working directory is constrained to one repo. See Execution Workspaces (/docs/projects-workflow/workspaces).

Use goal/project assignment on issues. Every subtask should carry a parentId and goalId (see Agents API → heartbeat protocol (/docs/api/agents) step 9: "Always set parentId and goalId on subtasks"). When that discipline is in place, the goal is the boundary and out-of-scope tickets become visible/auditable rather than buried.
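A quick way to see whether that discipline holds is to audit exported subtasks for missing or foreign IDs. The sketch below assumes issues are available as plain dicts with `id`, `parentId`, and `goalId` keys; that dict shape and the `audit_subtasks` helper are assumptions for illustration, not a documented Paperclip export format.

```python
def audit_subtasks(issues, goal_id):
    """Return (issue id, problem) pairs for subtasks that break the discipline."""
    problems = []
    for issue in issues:
        if not issue.get("parentId"):
            problems.append((issue["id"], "missing parentId"))
        if not issue.get("goalId"):
            problems.append((issue["id"], "missing goalId"))
        elif issue["goalId"] != goal_id:
            # Present but pointing at a different goal: out-of-scope work.
            problems.append((issue["id"], "goalId outside intended goal"))
    return problems
```

An empty result means every subtask is attached to a parent and to the intended goal; anything else is a list of tickets to look at.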

3. Replace "did you finish?" with a verifiable artifact. The false-"done" pattern almost always traces to a Reviewer that grades narrative instead of artifacts. For your CLI tools experiment that's easy: the acceptance test is whether pytest passes and the binary produces the expected output for a known input. Make the Reviewer's task explicitly: "run X, paste stdout, decide pass/fail." If you can't write a check, that's a signal the task is too vague to be done autonomously.
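As a sketch of what "run X, paste stdout, decide pass/fail" could look like as a script rather than a prompt, the helper below runs a command and derives the verdict from the exit code alone. The `run_check` and `pytest_command` names and the returned dict layout are hypothetical; the only real behavior relied on is that pytest exits with 0 when all tests pass.

```python
import subprocess
import sys


def run_check(command, timeout=600):
    """Run a command and return a verdict plus the raw output as evidence."""
    result = subprocess.run(command, capture_output=True, text=True, timeout=timeout)
    return {
        "command": " ".join(command),
        # The verdict comes from the exit code, never from anyone's summary.
        "verdict": "pass" if result.returncode == 0 else "fail",
        "exit_code": result.returncode,
        "stdout": result.stdout,
        "stderr": result.stderr,
    }


def pytest_command():
    """Command line for a quiet pytest run using the current interpreter."""
    return [sys.executable, "-m", "pytest", "-q"]
```

The Reviewer's report then becomes the output of `run_check(pytest_command())` pasted into the issue, with no room for a narrative "looks done to me."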

4. Drop heartbeat polling on everyone except a single coordinator. The "agent recovers from blocked/stalled" friction is usually downstream of multiple agents waking themselves and re-deciding the world. Per Why Agents Do Nothing by Default (/guides/concepts/why-agents-do-nothing-by-default), the steady state is heartbeats off and wake-on-assignment. Most stall loops disappear when only the CEO has a heartbeat and reaches down to others on demand.

5. The "operator at gates" pattern you arrived at is actually the documented one. See The Five Approval Patterns (/guides/operations/approval-patterns) — the right level of human involvement for a small dev shop is "operator approves spec → autonomous build → operator approves merge". The dream of zero-human-in-the-loop dev is a 2027 problem, not a today problem.
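To illustrate the gate pattern, here is a tiny state machine where only the spec and merge steps need an operator's sign-off and everything in between advances on its own. The phase list and the `advance` helper are assumptions made for this example; Paperclip's actual approval mechanism is described in the approval-patterns guide and may look quite different.

```python
# Hypothetical phase list; only the two gates come from the thread.
PHASES = ["spec", "plan", "build", "review", "merge"]
GATED = {"spec", "merge"}


def advance(phase, operator_approved=False):
    """Return the next phase, or raise if a gated phase lacks approval."""
    if phase in GATED and not operator_approved:
        raise PermissionError(f"{phase} needs operator approval before continuing")
    index = PHASES.index(phase)
    return PHASES[index + 1] if index + 1 < len(PHASES) else "done"
```

Calling `advance("build")` moves on without anyone in the loop, while `advance("merge")` raises until an operator passes `operator_approved=True`.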

On the broader "this got too complex" feeling — that's the right instinct. Browse [paperclip.community/companies](https://paperclip.community/companies) for reference orgs that intentionally stay small; the Solo Dev Shop shape in particular is closer to a minimal viable dev company than a 7-role setup. Worth a look as a baseline to shrink toward.

Two questions back at you that would help me give you sharper advice:

Which Paperclip version were you on during the experiment, and was the "Knowledge Steward" a Hermes agent or just another Claude one?

When agents drifted outside the project, were they creating new issues in other projects of the same company, or actually creating files in unrelated directories?