Bug: `gemini_local` adapter does not clear stale session ID after successful fallback — causes permanent per-heartbeat failure

Francisco · 2026-05-24

Bug: geminilocal adapter does not clear stale session ID after successful fallback — causes permanent per-heartbeat failure loop

Environment

\- paperclipai 2026.513.0 / 2026.517.0, @paperclipai/adapter-gemini-local

\- Adapter type: geminilocal

\- Reproduced on macOS (arm64), Gemini CLI installed globally

\- Any agent with a heartbeat interval long enough for Gemini CLI to expire its local session files (typically > \~24–48 h)

Summary

The geminilocal adapter correctly detects when a stored session ID is no longer valid and retries the run with a fresh session

(runAttempt(null)). However, after the retry succeeds, the stale session ID is not cleared from agentruntimestate. On every subsequent

heartbeat, the adapter loads the same invalid ID, fails on the first attempt, falls back, and succeeds — but the root cause is never

Answers

Francisco · 2026-05-26

Hi there, thanks a lot for your attention... I think I'll just wait for the fix.. thanks again. Happy to help

Aron Prins · 2026-05-25

Francisco — third clean trace from you this month and they keep landing in the right place. Verified #6608 against the adapter shape on master and the call-site logic checks out: the retry branch returns the success result but doesn't propagate the new session ID through to agentruntimestate, so the stale UUID keeps getting reloaded on the next heartbeat. The lastrunstatus='failed' part is the worst piece of the bug — it makes monitoring lie about agents that are actually completing work, which is exactly the surface that triggers the autonomous self-repair behavior you saw with the duplicate CTO.

Your suggested fix priority is right: (1) clear sessionid = NULL on the fallback path as the minimum safe correction, (2) persist the new session ID from runAttempt(null) as the preferred fix. The reason to prefer (2) over (1) is that Gemini's session resume is cheap and useful when it works — dropping to NULL every recovery means every subsequent heartbeat is a cold start on context.

Three things worth adding to #6608 yourself, while it's still fresh:

lastrunstatus correction is part of the fix surface. Even when fallback succeeds, the persisted status should reflect the successful retry, not the initial attempt's failure. Otherwise dashboards and scanSilentActiveRuns (yes, the same path from #6596) get the wrong signal. Worth calling out as a sub-item in the "Suggested fix" section so it doesn't get dropped from the patch scope. The secondary-effect story is important and underweighted in the report. A geminilocal CEO creating a duplicate CTO as autonomous recovery is the kind of compounding failure mode that's easy to dismiss as "agent did a weird thing" until you trace it back here. Consider moving that paragraph above the "Impact" header — it's the strongest case for prioritizing this above a "log-noise" classification. #6608 + #6596 + #6597 are all instances of the same root pattern: Paperclip's recovery/state-persistence layer treats infrastructure failures as agent failures, then takes expensive corrective action. If you add a one-line meta-comment on one of the three issues cross-referencing the other two and flagging the shared root cause, that gives the team the signal to decide whether it warrants a tracking epic rather than three independent fixes.

Workaround (only run this if you're comfortable with direct database writes — you can permanently lose run history if you target the wrong rows):

sql -- ⚠️ Read-only check first: confirm what you're about to update. SELECT agentid, sessionid, lastrunstatus, lasterror FROM agentruntimestate WHERE agentid = '<agent-uuid>' AND lastrunstatus = 'failed';

-- Only run the UPDATE if the SELECT shows exactly the row(s) you intend to clear. UPDATE agentruntimestate SET sessionid = NULL, lasterror = NULL, lastrunstatus = 'completed', updatedat = NOW() WHERE agentid = '<agent-uuid>' AND lastrunstatus = 'failed';

A few guardrails worth knowing before you run this:

The WHERE lastrunstatus = 'failed' clause prevents clobbering a legitimately-in-progress agent if heartbeat timing changes between your SELECT and UPDATE. Pause the agent (POST /api/agents/{agentId}/pause) before running the UPDATE so no heartbeat fires mid-edit. Take a snapshot of the row first (pgdump a single row, or just keep the SELECT output) so you can undo if needed. Don't run this in bulk across agents unless you've confirmed each one is in the same stale-session state. The lastrunstatus = 'failed' filter is a guard, not a guarantee.

If any of those steps are unfamiliar territory, sit tight for the upstream fix rather than touching the DB — the bug is annoying but not destructive, and a manual edit at the wrong moment is.