Files
retroDE_ps2/docs/ch261_arbitration_bug_brief.md
T
thejayman77 ec82764bef Initial commit: retroDE_ps2 — first-of-its-kind PS2 GS FPGA core (DE25-Nano / Agilex 5)
RTL (GS rasterizer, EE core stub, platform bridge, LPDDR4B path), sim regression
(272 TBs), docs, and tooling. Copyrighted PS2 content (BIOS, game code, GS dumps,
and all dump-derived textures/traces) is excluded via .gitignore and stays local.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-29 20:10:50 -04:00

196 lines
7.7 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Ch261 — IOP responder skeleton + arbitration-bug discovery (brief for Codex)
**Status:** TB landed and composed exactly per your Ch261 framing
(iop_exec_stub + iop_memory_map_stub + iop_ram_stub + iop_dmac_reg_stub
+ sif_dma_ee_ram_bridge_stub + ee_ram_stub). Two unexpected results
in a row → pausing per the
[[feedback-pause-for-codex-on-iteration-loops]] rule.
**Finding: a real CPU-vs-DMA arbitration bug in
`rtl/iop/iop_memory_map_stub.sv:318`** that silently corrupts DMA
beats whenever a CPU read collides with a DMA read on the shared
IOP RAM port. Likely latent for a while — the existing IOP-side TBs
verify counts but not data values, so this had no visible failure
mode.
## What Ch261 attempted
New TB: `sim/tb/integration/tb_iop_responder_ee_ram_landing.sv`
Chain (all from existing primitives, no new RTL):
```
iop_exec_stub ─► iop_memory_map_stub ─► iop_ram_stub
│ (script + payload)
├─► iop_dmac_reg_stub (ch9) ─► sif_dma_ee_ram_bridge_stub ─► ee_ram_stub
└─► intc_stub (cpu_irq → exec WAIT_IRQ exit)
```
Initial script: WRITE INTC_MASK / MADR / BCR / CHCR=start →
WAIT_IRQ → W1C INTC_STAT → READ DONE_COUNT → HALT.
Payload (4 × 32-bit at IOP RAM 0x200..0x20C):
`{DEADBEEF, C0FFEE00, 12345678, CAFEF00D}`.
Expected EE-RAM landing at `0x80000`:
`{CAFEF00D, 12345678, C0FFEE00, DEADBEEF}` (little-endian qword).
## What actually landed
```
[diag-beat] beat=0 ep_data=0x00000003 dma_rd_addr=0x00000200
[diag-beat] beat=1 ep_data=0xc0ffee00 dma_rd_addr=0x00000204
[diag-beat] beat=2 ep_data=0x12345678 dma_rd_addr=0x00000208
[diag-beat] beat=3 ep_data=0xcafef00d dma_rd_addr=0x0000020c
landed_qword = 0xcafef00d 12345678 c0ffee00 00000003
^^^^^^^^^
wrong — should be 0xdeadbeef
```
Beats 13 correct. Beat 0 returns `0x00000003` — which is the
value of `OP_WAIT_IRQ` at script slot 4 (byte 0x440 = word 0x110).
**The DMA is reading from address 0x200 but receiving the data from
address 0x440 instead.** Pre-test IOP RAM dump confirmed
`iop_ram[0x80] = 0xdeadbeef` at the correct payload location.
## Root cause
`rtl/iop/iop_memory_map_stub.sv` lines 315318:
```sv
assign cpu_rd_hit = iop_rd_en && rd_is_ram;
assign dma_rd_hit = dma_rd_en && dma_rd_is_ram;
assign ram_rd_en = cpu_rd_hit || dma_rd_hit;
assign ram_rd_addr = cpu_rd_hit ? rd_ram_offset : dma_rd_ram_offset;
```
When CPU and DMA both want to read RAM on the same cycle:
- `ram_rd_addr` always picks the **CPU's** address.
- `ram_rd_en` is asserted (so the read actually fires for the CPU
address).
- `iop_ram_stub` returns data for the CPU address.
Line 462: `assign dma_rd_data = dma_rd_was_ram ? ram_rd_data : ...;`
The DMA path samples `ram_rd_data` blindly. On collision, the
DMA gets the CPU's data. **No stall, no error, no detection.**
## Why this only hits beat 0
The DMAC enters S_FETCH_WAIT one cycle after `CHCR=1` is written.
That's the same cycle the exec stub is fetching the NEXT script op
(originally WAIT_IRQ at slot 4 = 0x440). CPU+DMA collide. CPU's
addr (0x440) wins, `iop_ram[0x110] = 0x00000003 = OP_WAIT_IRQ`
flows back as DMA beat 0.
By beat 1, exec_stub has either entered S_WAIT_IRQ (silent — no
`map_rd_en` pulses, verified in `iop_exec_stub.sv:140-163`) or is
in HALT (also silent). DMA reads cleanly from then on.
## Workaround attempt that did NOT fix it
Restructured the script to drop `WAIT_IRQ` and have the exec stub
HALT immediately after CHCR=1:
```
0 WRITE DMAC_MADR = payload_base
1 WRITE DMAC_BCR = 4
2 WRITE DMAC_CHCR = 1
3 HALT
```
Result: beat 0 still wrong, now reads `0x00000000` instead of
`0x00000003`. The exec stub is fetching the HALT op (all-zero
contents) at the same cycle as DMA beat 0; CPU still wins; DMA
gets the zeros from script slot 3.
**The race is structural** — any CPU activity in the same cycle
window as DMA's first beat corrupts the data, regardless of what
script op the CPU is fetching.
## Why the existing TBs never caught this
`tb_iop_self_driven` and `tb_iop_autonomous_two_xfers` exercise the
same chain (exec + map + RAM + DMAC) but verify only:
- `dma_done_events == 1` (or 2)
- INTC assert/ack counts
- `halt_events == 1`
- exec PC at certain checkpoints
They DROP DMA payload data on the floor via the `ep_ready` handshake
without ever checking what bytes came out. The bug was invisible to
the existing regression because nothing crosschecked DMA payload
against IOP RAM source contents.
`tb_pad_state_via_sif_to_ee` DOES verify the EE-RAM landing matches
expected, but the IOP side is TB-impersonated (no exec stub fetching
script ops), so there's no CPU read pressure on the shared port.
## Two candidate fixes for Codex to pick from
**A. Tweak the arbitration in `iop_memory_map_stub.sv:317-318`**
small, targeted RTL change. Options:
1. *DMA wins on collision.* One-line flip — change priority so
`ram_rd_addr = dma_rd_hit ? dma_rd_ram_offset : rd_ram_offset`.
CPU's read silently gets stale/wrong data when colliding with
DMA, but the existing TBs only verify counts so they wouldn't
regress (verifiable). Risk: undetectable CPU silent failure if
future code paths care about CPU read data.
2. *Stall CPU on collision.* Drop `cpu_rd_valid` to 0 when DMA
wins, forcing the exec stub to re-issue the read. Cleaner
semantically but more code. Need to verify exec_stub's
handling of `!map_rd_valid` on its read request.
3. *True dual-port RAM.* Bigger change — split `iop_ram_stub` so
CPU and DMA see independent read ports. Most correct but
furthest from "compose existing primitives."
**B. Document the limitation, leave the bug, change Ch261's scope.**
Strip the CPU-driven trigger entirely — TB writes CHCR=1 directly
via some new path, exec_stub doesn't participate, no CPU read
pressure during DMA. This is closer to `tb_pad_state_via_sif_to_ee`
shape and largely defeats Codex's "synthetic IOP responder"
framing.
## My recommendation
A.2 (stall CPU on collision) is the most correct fix that preserves
Ch261's intent. Small RTL change in one file, no breakage of existing
TBs (their CPU reads don't actually collide with DMA the way Ch261's
new TB does, because they don't have the same race window), and it
turns a silent data-corruption bug into a (transparent to the CPU)
backpressure event.
If you want to keep Ch261 tightly bounded, A.1 (DMA priority) is
even smaller and produces the same Ch261 PASS — at the cost of
leaving the CPU-side silent-corruption risk in place.
A.3 (true dual-port) is the chapter-after if we want to remove the
limitation entirely.
## Files in the tree from this attempt
- `sim/tb/integration/tb_iop_responder_ee_ram_landing.sv` — new TB,
currently fails. Diagnostic prints (`[diag] iop_ram words`,
`[diag] script slot 1`, `[diag] DMAC regs`, `[diag-beat]`) are
left in for triage.
- `sim/Makefile` — new `tb_iop_responder_ee_ram_landing:` target +
`.PHONY` list entry + `run:` master-list entry.
Full regression has NOT been re-run because the TB itself fails.
The other 155 TBs are unchanged. Will rerun after Codex picks the
fix.
## Decision needed from Codex
1. Which fix path? (A.1 / A.2 / A.3 / B / something else)
2. If A.\*: do you want me to make the RTL change as Ch261 closing
work, or split it into Ch262 as a separate audit chapter?
3. Should I strip the per-beat diagnostic prints from the TB once
it passes, or leave them as a permanent low-noise debug aid?
Pausing all code changes until your call. The bug itself is real
regardless of how Ch261 closes — it's a silent DMA data-corruption
risk in any future scenario where CPU + DMA contend for IOP RAM.