#16751 · @altendky · opened Mar 9, 2026 at 1:01 PM UTC · last updated Mar 21, 2026 at 8:48 PM UTC
fix(session): fix root causes and reconstruction of tool_use/tool_result mismatch (#16749)
Summary
This PR provides a three-layered defense-in-depth fix for a widespread and critical "tool_use/tool_result mismatch" error that corrupts user sessions. It addresses root causes in stream processing and adds a reconstruction-time safety net, supported by real-world evidence and comprehensive new tests.
Description
Issue for this PR
Closes #16749
Related: #10616, #8377, #2720, #1662, #5750, #2214, #8312, #8010
Type of change
- [x] Bug fix
- [ ] New feature
- [ ] Refactor / code improvement
- [ ] Documentation
What does this PR do?
Fixes the root causes of the widespread "tool_use ids were found without tool_result blocks immediately after" error, which corrupts sessions and makes them unrecoverable, and adds a reconstruction-time safety net for data that is already corrupted.
The fix is three layers of defense-in-depth, each catching what the previous one misses:
Layer 1 — processor.ts: Tool-error race condition (line 211)
The tool-error handler only processed errors for tools in "running" status. Due to the AI SDK's merged-stream event ordering, tool-error can arrive before tool-call, when the tool is still "pending". The error was silently dropped, leaving the tool in "pending" state to be cleaned up later as "Tool execution aborted" with empty input: {}.
Fix: Accept tool-error for both "running" and "pending" status. Uses Date.now() as start time for pending tools (which don't have a time.start field).
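The status check described above can be sketched roughly as follows. This is a simplified illustration, not the code in processor.ts: the `ToolState` type, the `applyToolError` helper, and the assumption that the error event carries the tool input are all hypothetical.

```typescript
// Hypothetical, simplified tool states; the real part state in processor.ts is richer.
type ToolState =
  | { status: "pending"; input: Record<string, unknown> }
  | { status: "running"; input: Record<string, unknown>; time: { start: number } }
  | {
      status: "error"
      input: Record<string, unknown>
      error: string
      time: { start: number; end: number }
    }

// Returns the errored state, or undefined when the event does not apply.
// `input` is assumed (for this sketch) to arrive with the tool-error event.
function applyToolError(
  state: ToolState,
  error: string,
  input: Record<string, unknown>,
  now: number = Date.now(),
): ToolState | undefined {
  // Accept the error for both "running" and "pending" tools.
  if (state.status !== "running" && state.status !== "pending") return undefined
  // Pending tools have no time.start yet, so fall back to the current time.
  const start = state.status === "running" ? state.time.start : now
  return { status: "error", input, error, time: { start, end: now } }
}
```

With this shape, a `tool-error` that arrives while the tool is still "pending" is recorded with its input instead of being dropped.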
Layer 2 — processor.ts: Recovery step-finish before retry (line 374)
When a stream error interrupts processing before finish-step is reached, or the finish-step handler itself throws, the step boundary is never written. The retry loop's continue creates a new stream whose events are appended to the same DB message without a step-finish/step-start boundary. Both steps' content merges into one message, and toModelMessages() produces a single assistant block with interleaved tool_use/text that the Anthropic API rejects.
Fix: Before the retry loop's continue, scan parts backward for an unclosed step (step-start without a matching step-finish). If found, write a recovery step-finish with reason: "error" and zero tokens/cost. Wrapped in try/catch so recovery failures don't block the retry.
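A minimal sketch of the backward scan and recovery write, under stated assumptions: the `StepPart` shape, the token fields, and the `write` callback are hypothetical stand-ins for the real part types and DB write in processor.ts.

```typescript
// Hypothetical, simplified part shapes for illustration only.
type StepFinish = {
  type: "step-finish"
  reason: string
  cost: number
  tokens: { input: number; output: number }
}
type StepPart =
  | { type: "step-start" }
  | StepFinish
  | { type: "text"; text: string }
  | { type: "tool"; tool: string }

// Scan backward: the most recent step marker decides whether a step is open.
function hasUnclosedStep(parts: StepPart[]): boolean {
  for (const part of [...parts].reverse()) {
    if (part.type === "step-finish") return false
    if (part.type === "step-start") return true
  }
  return false
}

// Recovery boundary with reason "error" and zero tokens/cost.
function recoveryStepFinish(): StepFinish {
  return { type: "step-finish", reason: "error", cost: 0, tokens: { input: 0, output: 0 } }
}

// Called just before the retry loop continues; `write` stands in for the DB write.
function closeStepBeforeRetry(parts: StepPart[], write: (part: StepPart) => void): void {
  try {
    if (hasUnclosedStep(parts)) write(recoveryStepFinish())
  } catch {
    // Best-effort: a failed recovery write must not block the retry.
  }
}
```

Scanning from the end means only the latest step matters, so earlier, properly closed steps in the same message do not trigger a recovery write.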
Layer 3 — message-v2.ts: Synthetic step-start injection (line 623)
A reconstruction-time safety net that handles already-corrupted DB data regardless of how step boundaries were lost.
Fix: In toModelMessages(), track whether we've seen a tool part in the current step (sawTool flag). If text or reasoning appears after a tool part without an intervening step-start, inject a synthetic { type: "step-start" } to force the AI SDK to split content into separate assistant+tool blocks.
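The sawTool tracking can be illustrated with a standalone pass over a part list. This is a sketch, not the actual change inside toModelMessages(): the `ReplayPart` type and the `injectStepStarts` helper are hypothetical, and the real code presumably injects the boundary inline while building messages.

```typescript
// Hypothetical, simplified part shapes for illustration only.
type ReplayPart =
  | { type: "step-start" }
  | { type: "step-finish" }
  | { type: "text"; text: string }
  | { type: "reasoning"; text: string }
  | { type: "tool"; tool: string }

// Insert a synthetic step-start wherever text or reasoning follows a tool part
// within the same step, so the two groups end up in separate blocks.
function injectStepStarts(parts: ReplayPart[]): ReplayPart[] {
  const out: ReplayPart[] = []
  let sawTool = false
  for (const part of parts) {
    if (part.type === "step-start") {
      sawTool = false // a real boundary resets the flag
    } else if (part.type === "tool") {
      sawTool = true
    } else if ((part.type === "text" || part.type === "reasoning") && sawTool) {
      out.push({ type: "step-start" }) // synthetic boundary
      sawTool = false
    }
    out.push(part)
  }
  return out
}
```

Applied to the merged pattern [step-start, text, tool, text, tool, step-finish], the pass yields [step-start, text, tool, step-start, text, tool, step-finish]; parts that already have proper boundaries pass through unchanged.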
How layers interact
| Layer | Where it acts | What it prevents |
|-------|--------------|-----------------|
| Layer 1 (tool-error race) | Stream event handling | Silent error drops that leave tools in wrong state |
| Layer 2 (recovery step-finish) | Retry loop, before continue | DB corruption at write time — ensures step boundaries are preserved |
| Layer 3 (synthetic step-start) | Message reconstruction | Handles already-corrupted DB data + any future edge cases the above layers miss |
Real-world evidence
Session ses_32fb35486ffeeJAHmplKU1gB2t, message msg_cd05ba534001gICo48Lsy1NHWp:
part_id | type | tool | status | error
-------------------------------+-------------+-------+--------+------------------------
prt_cd05bb9ac001... | step-start | | |
prt_cd05bb9ad001... | text | | |
prt_cd05bb9f0001... | tool | write | error | Tool execution aborted
← 96 SECOND GAP
prt_cd05d3273001... | text | | |
prt_cd05d35a8001... | tool | write | completed |
prt_cd05f3c5d001... | step-finish | | |
- The errored tool has `input: {}`: `tool-error` was dropped because status was `"pending"` (Layer 1 root cause)
- No `step-finish`/`step-start` boundary between the two groups (Layer 2 root cause)
- The 96-second gap is the retry delay
How did you verify your code works?
- All 20 message-v2 tests pass, 0 failures
- New test constructs the exact corrupted DB pattern (two merged steps with `[step-start, text, tool(error), text, tool(completed)]`) and asserts the structural invariant: no `text` or `reasoning` part appears after a `tool-call` part in the same assistant `ModelMessage`
- Before the fix: `Content types in this message: [text, tool-call, text, tool-call]`
- After the fix: passes (content split into separate blocks)
- 6 pre-existing compaction test failures unrelated to this change
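
The structural invariant asserted by the new test can be expressed as a small predicate. This is a sketch of the idea only, not the test in message-v2.test.ts: the `AssistantContent` type and the `hasContentAfterToolCall` helper are hypothetical and ignore all fields other than the content type.

```typescript
// Hypothetical, minimal view of an assistant message's content parts.
type AssistantContent = { type: "text" | "reasoning" | "tool-call" }
type AssistantMessage = { role: "assistant"; content: AssistantContent[] }

// True when text or reasoning appears after a tool-call in the same message,
// which is the interleaving the API rejects.
function hasContentAfterToolCall(message: AssistantMessage): boolean {
  let sawToolCall = false
  for (const part of message.content) {
    if (part.type === "tool-call") sawToolCall = true
    else if (sawToolCall) return true
  }
  return false
}
```

The pre-fix content `[text, tool-call, text, tool-call]` violates the invariant, while each of the split blocks `[text, tool-call]` satisfies it.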
Files changed
| File | Change |
|------|--------|
| packages/opencode/src/session/processor.ts | Tool-error race fix (accept "pending") + recovery step-finish before retry |
| packages/opencode/src/session/message-v2.ts | Synthetic step-start injection in toModelMessages() |
| packages/opencode/test/session/message-v2.test.ts | Test reproducing corrupted DB interleaving pattern |
Checklist
- [x] I have tested my changes locally
- [x] I have not included unrelated changes in this PR
Linked Issues
#16749 Missing step-finish/step-start parts after retryable stream errors cause tool_use/tool_result mismatch
Comments
No comments.
Changed Files
packages/opencode/src/session/message-v2.ts (+17 −1)
packages/opencode/test/session/message-v2.test.ts (+98 −0)