Review productivity for Issue / Review silent active run for Agent
Francisco · 2026-05-18
I got disconnected from the internet, and apparently the whole system did not understand the situation and started creating hundreds of review tickets, each one to check the previous one.
It is burning tokens like there's no tomorrow. It all started with the update. Any fix?
Answers
Aron Prins · 2026-05-25
Both verified, both filed cleanly — thanks Francisco.
#6596 (recovery storm): triaged. The two-part fix (a connectivity gate at the head of the recovery chain, plus excluding recovery-origin runs from the silent-run watchdog) is the right shape; the gate is the higher-leverage of the two since it covers both Mechanism 1 and Mechanism 2 in one place. Your suggestion #4 (per-window circuit breaker) is the right belt-and-suspenders for the case where the gate has a false negative. It is worth landing as a separate, smaller PR after the gate so each fix can be evaluated independently.
#6597 (Postgres PID liveness): also right. The fix surface is pg_isready + listening-port verification rather than trusting the recorded PID alone. Smaller blast radius than #6596, but worth landing because the manual recovery (deleting postmaster.pid by hand after a reboot) is exactly the kind of step that doesn't happen at 3am.
#6608 (gemini session ID) is the same root pattern as #6596 (the state layer treating infra failure as agent failure) but a separate, smaller fix. I've replied on that thread directly with the suggested additions. If you have a minute, dropping a one-line meta-comment on one of the three issues cross-referencing the other two would help the team see whether this deserves a tracking epic rather than three independent patches. You've already done the diagnostic work; that comment is essentially free.
Re "I cant post links" — that's almost certainly the new-account link-posting throttle on community.paperclip. It auto-relaxes after a few approved posts; you've cleared the threshold by now, so the next reply should let you embed full URLs. If you're still blocked after the next post, send me the post URL and I'll lift the throttle on your account manually.
One ask whenever the gate-fix PR lands: re-running your reproduction on a pre-release build would be the cleanest validation we could get for this class of bug. Your repro is the most reliable one I have for the storm pattern, and verifying against your real outage shape rather than a synthetic one is worth a lot. No pressure if your time is elsewhere — but if you're up for it, mention it on #6596 and the team will tag you when the PR is ready for testing.
Francisco · 2026-05-24
⏺ Both issues have been created:
#6596 — Recovery storm: I cant post links
#6597 — Postgres
Aron Prins · 2026-05-20
Francisco, this is a good bug report. Your Claude agent traced the code paths accurately. I verified them against current master and the claims hold up: scanSilentActiveRuns is in server/src/services/recovery/service.ts; ACTIVE_RUN_OUTPUT_SUSPICION_THRESHOLD_MS = 60 * 60 * 1000 is hardcoded, as you said; the scheduler chain in server/src/index.ts runs every 30s with no reachability check; and a grep across the recovery service confirms there is no ECONNREFUSED / EAI_AGAIN classification anywhere. Your "Mechanism 1 self-amplifies, Mechanism 2 is rate-capped" framing matches the source: Mechanism 2's per-issue creation cap is in productivity-review.ts, and Mechanism 1 has no equivalent.
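To make the missing classification concrete, here is a minimal TypeScript sketch that separates infrastructure failures from agent failures. The function name, the return labels, and the error codes beyond ECONNREFUSED and EAI_AGAIN are assumptions for illustration; this is not Paperclip's actual code.

```typescript
// Hypothetical helper, not part of Paperclip. Codes are standard Node.js
// system error codes; the exact set to treat as "infra" is an assumption.
const INFRA_ERROR_CODES = new Set<string>([
  "ECONNREFUSED",
  "EAI_AGAIN",
  "ENOTFOUND",
  "ETIMEDOUT",
  "ECONNRESET",
  "ENETUNREACH",
  "EHOSTUNREACH",
]);

type FailureKind = "infra" | "agent";

// Walks the error and its `cause` chain (fetch failures in Node usually carry
// the system error on `cause`) looking for a network-level error code.
function classifyRunFailure(err: unknown): FailureKind {
  let current: unknown = err;
  for (let depth = 0; depth < 5; depth++) {
    if (current === null || typeof current !== "object") break;
    const code = (current as { code?: unknown }).code;
    if (typeof code === "string" && INFRA_ERROR_CODES.has(code)) return "infra";
    current = (current as { cause?: unknown }).cause;
  }
  // Anything without a recognised network code is treated as an agent failure.
  return "agent";
}
```

A recovery step could then skip review-issue creation whenever the failure kind is "infra", since a review cannot fix an unreachable network.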
The diagnosis is right and the fix shape is right. The likely shape of the fix, in two PRs:
Connectivity gate at the head of the recovery chain (cheap outbound probe; skip the issue-creating reconcilers on miss).
Exclude recovery-origin runs from the silent-run watchdog (or dedupe on (agent, sourceIssue) rather than runId) so the loop can't self-amplify even if the gate has a false negative.
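The two layers above can be sketched roughly as follows. All type and function names here are hypothetical and the probe result is passed in as a boolean, so this shows the shape of the idea, not the recovery service's real structure.

```typescript
// Hypothetical types and names; this is not Paperclip's recovery service.
type Reconciler = {
  name: string;
  createsIssues: boolean; // true for reconcilers that open review issues
  run: () => void;
};

// Connectivity gate: when the outbound probe missed, skip only the
// issue-creating reconcilers and let the rest of the chain continue.
function runRecoveryTick(hostReachable: boolean, chain: Reconciler[]): string[] {
  const ran: string[] = [];
  for (const reconciler of chain) {
    if (!hostReachable && reconciler.createsIssues) continue;
    reconciler.run();
    ran.push(reconciler.name);
  }
  return ran;
}

type SilentRun = {
  runId: string;
  agentId: string;
  sourceIssueId: string;
  origin: "normal" | "recovery";
};

// Second layer: never review a recovery-origin run, and create at most one
// review per (agent, sourceIssue) pair instead of one per runId.
class SilentRunReviewFilter {
  private readonly reviewed = new Set<string>();

  shouldCreateReview(run: SilentRun): boolean {
    if (run.origin === "recovery") return false;
    const key = `${run.agentId}::${run.sourceIssueId}`;
    if (this.reviewed.has(key)) return false;
    this.reviewed.add(key);
    return true;
  }
}
```

Keeping the two layers separate means a false negative in the probe still cannot produce a review of a review.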
Your suggestion #4 (a circuit breaker on N silent-run evaluations per window) is a good belt-and-suspenders layer regardless.
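A per-window circuit breaker of that kind could look something like the sketch below. The class name, the limits, and passing the clock in as a parameter are assumptions made for illustration.

```typescript
// Hypothetical sliding-window breaker; not Paperclip code.
class SilentRunCircuitBreaker {
  private readonly maxEvaluations: number;
  private readonly windowMs: number;
  private timestamps: number[] = [];

  constructor(maxEvaluations: number, windowMs: number) {
    this.maxEvaluations = maxEvaluations;
    this.windowMs = windowMs;
  }

  // Returns true if another silent-run evaluation may proceed at `nowMs`,
  // false once the cap for the current window has been reached.
  allow(nowMs: number): boolean {
    const cutoff = nowMs - this.windowMs;
    // Drop evaluations that have aged out of the window.
    this.timestamps = this.timestamps.filter((t) => t > cutoff);
    if (this.timestamps.length >= this.maxEvaluations) return false;
    this.timestamps.push(nowMs);
    return true;
  }
}

// Example: at most 3 evaluations per rolling minute.
const exampleBreaker = new SilentRunCircuitBreaker(3, 60_000);
```

Because rejected calls are not recorded, the breaker reopens on its own as old evaluations leave the window.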
Stop-the-bleed during an outage today (until the fix lands):
Set HEARTBEAT_SCHEDULER_ENABLED=false for the duration of the outage and restart the server. Yes, it's coarse (it disables orphan reaping and retry promotion too), but it stops the storm cold. For unattended overnight outages this is what I'd put behind a shell alias.
If you can wrap that as a systemd-level watchdog (toggle the env var off when an outbound probe fails for >N minutes, toggle it back on when reachability returns), you've effectively implemented the connectivity gate at the process boundary rather than inside the recovery service. Not pretty, but it works without waiting for the fix.
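The toggle logic for such a watchdog might be sketched like this. It only decides when to flip; actually changing the env var and restarting the server (for example through systemd) is left out, and every name below is hypothetical.

```typescript
// Hypothetical decision logic for an external watchdog; not Paperclip code.
type WatchdogAction = "disable-scheduler" | "enable-scheduler" | "none";

class OutageWatchdog {
  private readonly graceMs: number;
  private firstFailureAtMs: number | null = null;
  private schedulerEnabled = true;

  constructor(graceMs: number) {
    this.graceMs = graceMs;
  }

  // Feed one probe result per tick. "disable-scheduler" is returned once the
  // probe has been failing for longer than graceMs; "enable-scheduler" is
  // returned on the first successful probe after that.
  observe(probeOk: boolean, nowMs: number): WatchdogAction {
    if (probeOk) {
      this.firstFailureAtMs = null;
      if (!this.schedulerEnabled) {
        this.schedulerEnabled = true;
        return "enable-scheduler";
      }
      return "none";
    }
    const outageStart = this.firstFailureAtMs ?? nowMs;
    this.firstFailureAtMs = outageStart;
    if (this.schedulerEnabled && nowMs - outageStart > this.graceMs) {
      this.schedulerEnabled = false;
      return "disable-scheduler";
    }
    return "none";
  }
}
```

The grace period keeps a brief blip from triggering a restart, and each action is emitted once per outage, so the server is restarted at most twice.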
On the side note about embedded Postgres and postmaster.pid PID recycling: that is also a real and reproducible issue. It is worth a separate bug report so it doesn't get buried inside this one; the fix surface is different (pg_isready + listening-port verification vs. trusting the recorded PID).
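As a rough sketch of that fix surface: read the port from postmaster.pid and ask pg_isready whether anything answers there, instead of checking whether the recorded PID exists. The function names are hypothetical, and runPgIsReady is an injected stand-in that is assumed to run `pg_isready -p <port>` and return its exit status.

```typescript
// Hypothetical helpers; not Paperclip code.
type PostmasterLock = { pid: number; port: number };
type Liveness = "live" | "stale" | "unknown";

// postmaster.pid stores the PID on line 1 and the port on line 4.
function parsePostmasterPid(contents: string): PostmasterLock | null {
  const lines = contents.split("\n");
  const pid = parseInt(lines[0] ?? "", 10);
  const port = parseInt(lines[3] ?? "", 10);
  if (isNaN(pid) || isNaN(port)) return null;
  return { pid, port };
}

// pg_isready exit status: 0 = accepting connections, 1 = rejecting
// (for example during startup), 2 = no response, 3 = no attempt made.
function livenessFromPgIsReady(exitCode: number | null): Liveness {
  if (exitCode === 0 || exitCode === 1) return "live";
  if (exitCode === 2) return "stale";
  return "unknown"; // bad parameters, or pg_isready could not be run
}

// Decides liveness from what is listening on the port; the recorded PID is
// parsed but deliberately not trusted, since the OS may have recycled it.
function checkPostmasterLock(
  contents: string,
  runPgIsReady: (port: number) => number | null,
): Liveness {
  const lock = parsePostmasterPid(contents);
  if (lock === null) return "unknown";
  return livenessFromPgIsReady(runPgIsReady(lock.port));
}
```

The sketch only reports a verdict; whether to remove a stale postmaster.pid automatically is a separate decision and is not made here.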
Could you file the recovery-storm report as a GitHub issue on paperclipai/paperclip with the diagnostic text essentially as-is? Your code-path references make it triageable in minutes. Thanks for the thorough write-up; this is exactly the kind of report that lands fixes quickly.
Francisco · 2026-05-18
Which Paperclip version are you on? ("the update": which one, roughly the date or paperclip --version)
paperclipai 2026.513.0
Is the Reviewer creating new issues per review (so issue count is growing), or re-running on the same issues (run count growing, issue count flat)?
Creating new
Are these subtasks parented to one original issue, or detached?
Parented
Adapter type on the reviewer: Claude Local? Codex? Something else?
Gemini CLI Local
Hi Aron, thanks for the reply. Here are the answers. I also posted a review of the issue from a Claude agent I have running (from outside Paperclip).
This started when I updated Paperclip to the latest version.
Thanks
Fran
Francisco · 2026-05-18
I ran a Claude agent to review the error; here's the diagnosis:
⏺ Here's a complete, self-contained write-up in English you can forward to the Paperclip developers. It's accurate to the installed build (paperclipai 2026.513.0, @paperclipai/server) with file/line references.
---
Bug report: heartbeat recovery has no connectivity awareness — network outages cause a self-amplifying storm of paid "review" runs
Environment
paperclipai 2026.513.0, package @paperclipai/server (global Homebrew install)
Embedded PostgreSQL, single instance (instances/default)
Heartbeat scheduler enabled (default; HEARTBEAT_SCHEDULER_ENABLED !== "false")
Reproduced organically during real internet outages (including unattended/overnight)
Summary
When the host loses outbound internet (so adapter/LLM API calls fail or hang), the heartbeat recovery pipeline keeps running on its normal 30 s cadence and interprets the resulting symptoms as "agents are stuck and need review." It then creates auto-assigned review issues and wakes agents to work them. Those agents cannot reach the network either, so they produce the same symptoms, which spawn more review work. The result is thousands of issues (predominantly titled Review silent active run for <agent>) and a large amount of wasted token spend, with zero useful output, until connectivity returns or the operator intervenes.
There are two distinct recovery mechanisms with the same blind spot. Both are described below.
Where the pipeline runs
Aron Prins · 2026-05-18
Sorry — that's a bad failure mode and burning tokens during a network blip is exactly the kind of thing we want to kill cleanly. Two things first, then I need a couple of details from you to actually root-cause it.
Stop the bleed right now (if it's still happening):
Pause the review agent: POST /api/agents/{agentId}/pause (or the Pause action from the agent page). Pause is the hard stop: no timer ticks, no wake-on-assignment, no wake-on-comment. See the Heartbeats & Routines (/docs/projects-workflow/routines) doc, "Pause vs. heartbeat off": they're different, and pause is what you want here.
If multiple agents are caught in the review-of-review loop, pause all participants, not just the one that looks loudest. The loop usually has at least two agents handing the ball back and forth.
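If several agents need pausing at once, a small script along these lines could do it. Only the POST /api/agents/{agentId}/pause path comes from the text above; the base URL, the injected send function, and the omission of any auth headers are assumptions.

```typescript
// Hypothetical helper script; not an official Paperclip client.
type PauseRequest = { method: "POST"; url: string };
type SendFn = (req: PauseRequest) => Promise<{ ok: boolean; status: number }>;

// Builds the pause request for one agent, tolerating a trailing slash on the
// base URL and escaping the agent id.
function buildPauseRequest(baseUrl: string, agentId: string): PauseRequest {
  const root = baseUrl.replace(/\/+$/, "");
  return {
    method: "POST",
    url: `${root}/api/agents/${encodeURIComponent(agentId)}/pause`,
  };
}

// Pauses every participant in the loop, one after another, and returns the
// ids whose pause call did not succeed so they can be retried by hand.
async function pauseAllParticipants(
  baseUrl: string,
  agentIds: string[],
  send: SendFn,
): Promise<string[]> {
  const failed: string[] = [];
  for (const agentId of agentIds) {
    const res = await send(buildPauseRequest(baseUrl, agentId));
    if (!res.ok) failed.push(agentId);
  }
  return failed;
}
```

In practice send would wrap whatever HTTP client and credentials your instance uses; it is injected here so the sketch makes no claim about authentication.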
To diagnose:
Which Paperclip version are you on? ("the update": which one, roughly the date or paperclip --version)
Is the Reviewer creating new issues per review (so issue count is growing), or re-running on the same issues (run count growing, issue count flat)?
Are these subtasks parented to one original issue, or detached?
Adapter type on the reviewer: Claude Local? Codex? Something else?
Two structural notes while you gather that:
The heartbeat protocol's 409 Conflict rule ("Never retry a 409" (/docs/api/agents)) is specifically there to prevent agents stomping on each other's work. If the reviewer is bypassing checkout/409 and PATCH'ing to in_progress directly, you'd see exactly the runaway you're describing. Worth checking whether the reviewer is going through the documented checkout flow.
On the network-blip side: the idempotencyKey field on the wakeup API is designed to coalesce duplicate wakes triggered during a flaky connection. If your review agent is generating wakes without one, every reconnect can stack.
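Both notes can be illustrated with a short client-side sketch. The "never retry a 409" rule and the idempotencyKey field are from the docs cited above; the key format, the choice to retry 429 and 5xx, and the in-memory coalescer (a stand-in for whatever the server does with duplicate keys) are assumptions.

```typescript
// Hypothetical client-side helpers; not Paperclip's API client.

// Derives the same key for the same triggering event, so a wake that is
// re-sent after a reconnect carries an identical idempotencyKey.
function wakeIdempotencyKey(agentId: string, triggerEventId: string): string {
  return `wake:${agentId}:${triggerEventId}`;
}

// Retry policy: a 409 Conflict is never retried, because it means another
// agent already holds the work. Retrying 429 and 5xx is an assumption.
function shouldRetry(status: number, attempt: number, maxAttempts: number): boolean {
  if (status === 409) return false;
  if (attempt >= maxAttempts) return false;
  return status === 429 || status >= 500;
}

// Stand-in for server-side coalescing: only the first wake per key counts.
class WakeCoalescer {
  private readonly seenKeys = new Set<string>();

  accept(idempotencyKey: string): boolean {
    if (this.seenKeys.has(idempotencyKey)) return false;
    this.seenKeys.add(idempotencyKey);
    return true;
  }
}
```

With a stable key, three resends of one wake during a flaky connection collapse into a single accepted wake instead of stacking.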
Drop the version + answers to the four questions above and I'll match it against recent behavior changes. If we confirm a regression, the right place to file it is the upstream repo so the team can land a fix — happy to help you write up a clean repro once we have the data.