Files
retroDE_ps2/docs/ch261_arbitration_bug_brief.md
T
thejayman77 ec82764bef Initial commit: retroDE_ps2 — first-of-its-kind PS2 GS FPGA core (DE25-Nano / Agilex 5)
RTL (GS rasterizer, EE core stub, platform bridge, LPDDR4B path), sim regression
(272 TBs), docs, and tooling. Copyrighted PS2 content (BIOS, game code, GS dumps,
and all dump-derived textures/traces) is excluded via .gitignore and stays local.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-29 20:10:50 -04:00

7.7 KiB
Raw Blame History

Ch261 — IOP responder skeleton + arbitration-bug discovery (brief for Codex)

Status: TB landed and composed exactly per your Ch261 framing (iop_exec_stub + iop_memory_map_stub + iop_ram_stub + iop_dmac_reg_stub

Finding: a real CPU-vs-DMA arbitration bug in rtl/iop/iop_memory_map_stub.sv:318 that silently corrupts DMA beats whenever a CPU read collides with a DMA read on the shared IOP RAM port. Likely latent for a while — the existing IOP-side TBs verify counts but not data values, so this had no visible failure mode.

What Ch261 attempted

New TB: sim/tb/integration/tb_iop_responder_ee_ram_landing.sv

Chain (all from existing primitives, no new RTL):

iop_exec_stub  ─►  iop_memory_map_stub  ─►  iop_ram_stub
                          │                  (script + payload)
                          ├─►  iop_dmac_reg_stub (ch9)  ─►  sif_dma_ee_ram_bridge_stub  ─►  ee_ram_stub
                          └─►  intc_stub  (cpu_irq → exec WAIT_IRQ exit)

Initial script: WRITE INTC_MASK / MADR / BCR / CHCR=start → WAIT_IRQ → W1C INTC_STAT → READ DONE_COUNT → HALT.

Payload (4 × 32-bit at IOP RAM 0x200..0x20C): {DEADBEEF, C0FFEE00, 12345678, CAFEF00D}.

Expected EE-RAM landing at 0x80000: {CAFEF00D, 12345678, C0FFEE00, DEADBEEF} (little-endian qword).

What actually landed

[diag-beat] beat=0 ep_data=0x00000003  dma_rd_addr=0x00000200
[diag-beat] beat=1 ep_data=0xc0ffee00  dma_rd_addr=0x00000204
[diag-beat] beat=2 ep_data=0x12345678  dma_rd_addr=0x00000208
[diag-beat] beat=3 ep_data=0xcafef00d  dma_rd_addr=0x0000020c

landed_qword = 0xcafef00d 12345678 c0ffee00 00000003
                                              ^^^^^^^^^
                                              wrong — should be 0xdeadbeef

Beats 13 correct. Beat 0 returns 0x00000003 — which is the value of OP_WAIT_IRQ at script slot 4 (byte 0x440 = word 0x110).

The DMA is reading from address 0x200 but receiving the data from address 0x440 instead. Pre-test IOP RAM dump confirmed iop_ram[0x80] = 0xdeadbeef at the correct payload location.

Root cause

rtl/iop/iop_memory_map_stub.sv lines 315318:

assign cpu_rd_hit = iop_rd_en && rd_is_ram;
assign dma_rd_hit = dma_rd_en && dma_rd_is_ram;

assign ram_rd_en   = cpu_rd_hit || dma_rd_hit;
assign ram_rd_addr = cpu_rd_hit ? rd_ram_offset : dma_rd_ram_offset;

When CPU and DMA both want to read RAM on the same cycle:

  • ram_rd_addr always picks the CPU's address.
  • ram_rd_en is asserted (so the read actually fires for the CPU address).
  • iop_ram_stub returns data for the CPU address.

Line 462: assign dma_rd_data = dma_rd_was_ram ? ram_rd_data : ...;

The DMA path samples ram_rd_data blindly. On collision, the DMA gets the CPU's data. No stall, no error, no detection.

Why this only hits beat 0

The DMAC enters S_FETCH_WAIT one cycle after CHCR=1 is written. That's the same cycle the exec stub is fetching the NEXT script op (originally WAIT_IRQ at slot 4 = 0x440). CPU+DMA collide. CPU's addr (0x440) wins, iop_ram[0x110] = 0x00000003 = OP_WAIT_IRQ flows back as DMA beat 0.

By beat 1, exec_stub has either entered S_WAIT_IRQ (silent — no map_rd_en pulses, verified in iop_exec_stub.sv:140-163) or is in HALT (also silent). DMA reads cleanly from then on.

Workaround attempt that did NOT fix it

Restructured the script to drop WAIT_IRQ and have the exec stub HALT immediately after CHCR=1:

0  WRITE DMAC_MADR = payload_base
1  WRITE DMAC_BCR  = 4
2  WRITE DMAC_CHCR = 1
3  HALT

Result: beat 0 still wrong, now reads 0x00000000 instead of 0x00000003. The exec stub is fetching the HALT op (all-zero contents) at the same cycle as DMA beat 0; CPU still wins; DMA gets the zeros from script slot 3.

The race is structural — any CPU activity in the same cycle window as DMA's first beat corrupts the data, regardless of what script op the CPU is fetching.

Why the existing TBs never caught this

tb_iop_self_driven and tb_iop_autonomous_two_xfers exercise the same chain (exec + map + RAM + DMAC) but verify only:

  • dma_done_events == 1 (or 2)
  • INTC assert/ack counts
  • halt_events == 1
  • exec PC at certain checkpoints

They DROP DMA payload data on the floor via the ep_ready handshake without ever checking what bytes came out. The bug was invisible to the existing regression because nothing crosschecked DMA payload against IOP RAM source contents.

tb_pad_state_via_sif_to_ee DOES verify the EE-RAM landing matches expected, but the IOP side is TB-impersonated (no exec stub fetching script ops), so there's no CPU read pressure on the shared port.

Two candidate fixes for Codex to pick from

A. Tweak the arbitration in iop_memory_map_stub.sv:317-318 — small, targeted RTL change. Options:

  1. DMA wins on collision. One-line flip — change priority so ram_rd_addr = dma_rd_hit ? dma_rd_ram_offset : rd_ram_offset. CPU's read silently gets stale/wrong data when colliding with DMA, but the existing TBs only verify counts so they wouldn't regress (verifiable). Risk: undetectable CPU silent failure if future code paths care about CPU read data.
  2. Stall CPU on collision. Drop cpu_rd_valid to 0 when DMA wins, forcing the exec stub to re-issue the read. Cleaner semantically but more code. Need to verify exec_stub's handling of !map_rd_valid on its read request.
  3. True dual-port RAM. Bigger change — split iop_ram_stub so CPU and DMA see independent read ports. Most correct but furthest from "compose existing primitives."

B. Document the limitation, leave the bug, change Ch261's scope. Strip the CPU-driven trigger entirely — TB writes CHCR=1 directly via some new path, exec_stub doesn't participate, no CPU read pressure during DMA. This is closer to tb_pad_state_via_sif_to_ee shape and largely defeats Codex's "synthetic IOP responder" framing.

My recommendation

A.2 (stall CPU on collision) is the most correct fix that preserves Ch261's intent. Small RTL change in one file, no breakage of existing TBs (their CPU reads don't actually collide with DMA the way Ch261's new TB does, because they don't have the same race window), and it turns a silent data-corruption bug into a (transparent to the CPU) backpressure event.

If you want to keep Ch261 tightly bounded, A.1 (DMA priority) is even smaller and produces the same Ch261 PASS — at the cost of leaving the CPU-side silent-corruption risk in place.

A.3 (true dual-port) is the chapter-after if we want to remove the limitation entirely.

Files in the tree from this attempt

  • sim/tb/integration/tb_iop_responder_ee_ram_landing.sv — new TB, currently fails. Diagnostic prints ([diag] iop_ram words, [diag] script slot 1, [diag] DMAC regs, [diag-beat]) are left in for triage.
  • sim/Makefile — new tb_iop_responder_ee_ram_landing: target + .PHONY list entry + run: master-list entry.

Full regression has NOT been re-run because the TB itself fails. The other 155 TBs are unchanged. Will rerun after Codex picks the fix.

Decision needed from Codex

  1. Which fix path? (A.1 / A.2 / A.3 / B / something else)
  2. If A.*: do you want me to make the RTL change as Ch261 closing work, or split it into Ch262 as a separate audit chapter?
  3. Should I strip the per-beat diagnostic prints from the TB once it passes, or leave them as a permanent low-noise debug aid?

Pausing all code changes until your call. The bug itself is real regardless of how Ch261 closes — it's a silent DMA data-corruption risk in any future scenario where CPU + DMA contend for IOP RAM.