retroDE_ps2/docs/ch278_closeout.md
thejayman77 ec82764bef Initial commit: retroDE_ps2 — first-of-its-kind PS2 GS FPGA core (DE25-Nano / Agilex 5)
RTL (GS rasterizer, EE core stub, platform bridge, LPDDR4B path), sim regression
(272 TBs), docs, and tooling. Copyrighted PS2 content (BIOS, game code, GS dumps,
and all dump-derived textures/traces) is excluded via .gitignore and stays local.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-29 20:10:50 -04:00


Ch278 closeout — MMI2/PCPYLD (narrow, one sub-instruction only); next blocker is LQ

Status: Closed. Verdict from re-running qbert.elf: elf_first_unsupported_opcode (pc=0x00112C88 instr=0x78A90000). The new blocker is LQ (Load Quadword, opcode 0x1E, R5900 EE), the 128-bit load symmetric to Ch271's SQ. qbert retired the PCPYLD and trapped on the next instruction, which is the matching 128-bit load.

Numbers

| Chapter | Blocker | qbert retire_count |
|---|---|---|
| Post-Ch276 (DSLL) | BNEL at 0x00112C7C | 27,016 |
| Post-Ch277 (BNEL) | PCPYLD at 0x00112C84 | 27,017 |
| Post-Ch278 (PCPYLD) | LQ at 0x00112C88 | 27,018 |

1-retire delta — PCPYLD retired, LQ trapped before retiring. Same compact "one opcode at a time" cadence; qbert's stdlib byte-walker is showing us each MIPS-III/MMI feature it touches in textbook order.

What landed

RTL — 4 surgical edits in ee_core_stub.sv

  1. Opcode/sub-instruction constants:
     ```systemverilog
     localparam OP_MMI       = 6'h1C;
     localparam FUNC_MMI2    = 6'h09;
     localparam MMI2_PCPYLD  = 5'h0E;
     ```
  2. Narrow decode: is_pcpyld = is_mmi && (func == FUNC_MMI2) && (shamt == MMI2_PCPYLD). Three-way AND on opcode + funct + sa fields — any OTHER op=0x1C instruction continues to fall through to strict-trap.
  3. Added to is_rtype_alu group so the existing R-type writeback path handles it.
  4. rtype_alu_wb: else if (is_pcpyld) rtype_alu_wb = rt_val. Architecturally PCPYLD writes rd[63:0] = rt[63:0] and rd[127:64] = rs[63:0]; in our 32-bit model the only observable effect is rd[31:0] = rt[31:0].
  5. is_nop_class allow: added && !is_pcpyld to the catch-all so other MMI sub-instructions still trap. Critical per Codex's caution — do NOT NOP-class the whole MMI opcode.
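The narrow decode and writeback can be modelled in a few lines of Python. This is a behavioural sketch, not the RTL. The helper names (`fields`, `is_pcpyld`, `execute`, `StrictTrap`) are hypothetical, and registers are 32-bit as in our model. It shows that only op=0x1C, funct=0x09, sa=0x0E is accepted. Every other MMI encoding still reaches the strict-trap path.

```python
OP_MMI = 0x1C
FUNC_MMI2 = 0x09
MMI2_PCPYLD = 0x0E


class StrictTrap(Exception):
    """Raised for any encoding this model does not implement (hypothetical)."""


def fields(instr: int) -> dict:
    """Split a 32-bit R-type word into its MIPS fields."""
    return {
        "op": (instr >> 26) & 0x3F,
        "rs": (instr >> 21) & 0x1F,
        "rt": (instr >> 16) & 0x1F,
        "rd": (instr >> 11) & 0x1F,
        "sa": (instr >> 6) & 0x1F,
        "funct": instr & 0x3F,
    }


def is_pcpyld(instr: int) -> bool:
    f = fields(instr)
    # Three-way AND: opcode, funct and sa must all match.
    return f["op"] == OP_MMI and f["funct"] == FUNC_MMI2 and f["sa"] == MMI2_PCPYLD


def execute(instr: int, regs: list) -> None:
    """Execute one instruction against a 32-entry, 32-bit register file."""
    if is_pcpyld(instr):
        f = fields(instr)
        if f["rd"] != 0:  # $zero stays hard-wired to 0
            regs[f["rd"]] = regs[f["rt"]] & 0xFFFFFFFF  # rd[31:0] <= rt[31:0]
        return
    # Any other op=0x1C (MMI) encoding, and anything else unimplemented, traps.
    raise StrictTrap(f"unsupported instr=0x{instr:08X}")
```

Only the exact PCPYLD triple is exempted. Changing just the sa field of qbert's word is enough to send it back to the trap path.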

Focused TB — tb_ee_core_pcpyld.sv

Two cases:

  1. Exact qbert encoding: pcpyld $t2, $t1, $t1 (rs=rt=$t1 in the actual qbert instruction — see process note below). Built via enc_rtype and asserted to equal 0x71295389. With $t1 = 0xBBBBBBBB, verifies $t2 = 0xBBBBBBBB.
  2. Distinct rs/rt sentinels (the rd<-rt proof): pcpyld $t3, $a0, $a1 with $a0 = 0xDEADBEEF, $a1 = 0xCAFEF00D. Verifies $t3 = 0xCAFEF00D (rt) and explicitly NOT 0xDEADBEEF (rs). Locks in the architectural rd-takes-from-rt semantics for the low 32 bits.

Result: retired=21 halt=1 trap=0 pc=0xbfc00148 errors=0 PASS.
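Case 2 is the one that matters, and a 128-bit model makes the reason concrete. The sketch below implements the architectural PCPYLD on 128-bit values and then truncates to the 32 bits our register file keeps. `pcpyld_128` and `low32` are illustrative names, and the sentinels are plain Python ints. rs lands in bits 127:64, so with rs == rt (case 1) a model that wrongly used rs would still pass. Distinct sentinels expose it.

```python
MASK64 = (1 << 64) - 1


def pcpyld_128(rs: int, rt: int) -> int:
    """Architectural PCPYLD: rd[127:64] = rs[63:0], rd[63:0] = rt[63:0]."""
    return ((rs & MASK64) << 64) | (rt & MASK64)


def low32(x: int) -> int:
    """What our 32-bit register model can observe."""
    return x & 0xFFFFFFFF


# Case 1: qbert's encoding has rs = rt = $t1, so rs and rt are indistinguishable.
t1 = 0xBBBBBBBB
assert low32(pcpyld_128(t1, t1)) == 0xBBBBBBBB

# Case 2: distinct sentinels; rs only reaches the (unmodelled) upper doubleword.
a0, a1 = 0xDEADBEEF, 0xCAFEF00D
t3 = pcpyld_128(a0, a1)
assert low32(t3) == 0xCAFEF00D and low32(t3) != 0xDEADBEEF
assert low32(t3 >> 64) == 0xDEADBEEF
```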

Makefile + regression

  • tb_ee_core_pcpyld target.
  • Added to both regression lists.
  • Regression: 165 → 166.

Process note — decode mistake caught by encoder assertion

My initial decode of qbert's 0x71295389 claimed pcpyld $t2, $a1, $t1, reading the rs field as $a1=5. That was wrong: bits 25:21 of 0x71295389 are 01001 = 9 = $t1. The actual instruction is pcpyld $t2, $t1, $t1 (rs=rt=$t1).
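Extracting the fields with shifts and masks makes this kind of misread easy to check by hand. Here is a minimal Python sketch. The `decode` helper is illustrative and not part of the repo, and the name table is the standard o32 register naming.

```python
# Standard MIPS o32 register names, indexed by register number.
REG_NAMES = ["zero", "at", "v0", "v1", "a0", "a1", "a2", "a3",
             "t0", "t1", "t2", "t3", "t4", "t5", "t6", "t7",
             "s0", "s1", "s2", "s3", "s4", "s5", "s6", "s7",
             "t8", "t9", "k0", "k1", "gp", "sp", "fp", "ra"]


def decode(instr: int) -> dict:
    """Return the R-type fields of a 32-bit instruction word."""
    return {
        "op": (instr >> 26) & 0x3F,   # bits 31:26
        "rs": (instr >> 21) & 0x1F,   # bits 25:21
        "rt": (instr >> 16) & 0x1F,   # bits 20:16
        "rd": (instr >> 11) & 0x1F,   # bits 15:11
        "sa": (instr >> 6) & 0x1F,    # bits 10:6
        "funct": instr & 0x3F,        # bits 5:0
    }


f = decode(0x71295389)
named = {k: (hex(v) if k in ("op", "sa", "funct") else "$" + REG_NAMES[v])
         for k, v in f.items()}
print(named)
# {'op': '0x1c', 'rs': '$t1', 'rt': '$t1', 'rd': '$t2', 'sa': '0xe', 'funct': '0x9'}
```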

The error was caught by the TB's enc_rtype assertion: the first run produced 0x70A95389 instead of the expected 0x71295389, and the inline $error exposed the difference. The encoder-output assertion pattern (enc_rtype(...) === 0x...) has now been exercised in Ch272 (DADDU decode was clean), Ch276 (DSLL decode was clean) and Ch278 (PCPYLD decode was not, and the assertion caught it). Always including the assertion is paying off.
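The same check can be reproduced outside the simulator. `enc_rtype` below is a hypothetical Python stand-in for the TB's SystemVerilog encoder, with the argument order following the MIPS R-type field layout. The TB's real signature is not shown in this note. Re-encoding the misdecoded fields gives a different word, which is exactly what the `===` assertion flags.

```python
def enc_rtype(op: int, rs: int, rt: int, rd: int, sa: int, funct: int) -> int:
    """Hypothetical R-type encoder: op | rs | rt | rd | sa | funct."""
    for value, width in ((op, 6), (rs, 5), (rt, 5), (rd, 5), (sa, 5), (funct, 6)):
        assert 0 <= value < (1 << width), "field out of range"
    return (op << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (sa << 6) | funct


QBERT_WORD = 0x71295389
T1, T2, A1 = 9, 10, 5  # MIPS register numbers

first_try = enc_rtype(0x1C, A1, T1, T2, 0x0E, 0x09)   # initial misdecode: rs = $a1
corrected = enc_rtype(0x1C, T1, T1, T2, 0x0E, 0x09)   # rs = rt = $t1

print(f"misdecode -> 0x{first_try:08X}, match={first_try == QBERT_WORD}")
print(f"corrected -> 0x{corrected:08X}, match={corrected == QBERT_WORD}")
# misdecode -> 0x70A95389, match=False
# corrected -> 0x71295389, match=True
```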

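The corrected encoding pcpyld $t2, $t1, $t1 does not change the implementation. Architecturally rd[63:0] takes rt[63:0], and rs only feeds rd[127:64], which our 32-bit model cannot represent. The misread touched only the rs field. Because rs = rt = $t1 in this particular qbert instruction, the qbert encoding alone cannot distinguish rd<-rt from rd<-rs, which is why TB case 2 uses distinct sentinels. So Codex's "rd <= rt_val" implementation is correct regardless.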

qbert disassembly check (Ch279 framing)

The trap at PC 0x00112C88 is one word past PCPYLD (0x00112C84 + 4):

```
0x00112C84: 0x71295389  pcpyld $t2, $t1, $t1
0x00112C88: 0x78A90000  lq     $t1, 0($a1)        <-- next blocker
```

LQ is the 128-bit load: rt[127:0] = mem[base+imm][127:0]. In our 32-bit register model, $rt[31:0] = mem[base+imm][31:0] (low 32 bits only; upper 96 unrepresentable). This is the symmetric counterpart to Ch271 SQ.
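The degenerate form can be checked with a small Python model of little-endian memory. The EE is little-endian, so bits 31:0 of a quadword are the bytes at EA..EA+3. The helper names are hypothetical, and the base value 0x20 is chosen only for illustration. The sketch decodes qbert's 0x78A90000 as an I-type word and assembles the full 128-bit value LQ would load. It then shows that the low 32 bits equal an ordinary word read at the same EA.

```python
def decode_itype(instr: int):
    """Split an I-type word into op, base (rs), rt and the sign-extended 16-bit offset."""
    imm = instr & 0xFFFF
    if imm & 0x8000:
        imm -= 0x10000
    return (instr >> 26) & 0x3F, (instr >> 21) & 0x1F, (instr >> 16) & 0x1F, imm


def read_le(mem: bytearray, addr: int, nbytes: int) -> int:
    """Little-endian read of nbytes starting at addr."""
    return int.from_bytes(mem[addr:addr + nbytes], "little")


op, base, rt, imm = decode_itype(0x78A90000)
assert (op, base, rt, imm) == (0x1E, 5, 9, 0)   # lq $t1, 0($a1)

mem = bytearray(range(64))          # distinct byte values 0x00..0x3F
a1 = 0x20                           # illustrative 16-byte-aligned base
ea = a1 + imm
quad = read_le(mem, ea, 16)         # architectural rt[127:0]
assert quad & 0xFFFFFFFF == read_le(mem, ea, 4)   # only this low word is observable
print(hex(quad & 0xFFFFFFFF))       # 0x23222120
```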

Recommendation for Codex's Ch279 — LQ

Symmetric to SQ. Two possible implementation shapes:

(A) Minimal: single 32-bit read at EA, writeback to $rt.

  • 16-byte alignment required (ea[3:0] == 0); misaligned → AdEL (the load-side address error; AdES is the store-side code).
  • Reuse the existing S_MEM_REQ → S_MEM_WAIT → writeback FSM that LW uses. The EE is little-endian, so the single word read at EA is exactly bits [31:0] of the quadword.
  • Upper 96 bits of $rt aren't modelled in our regfile, so there's nothing to do with the high beats.
  • Documented approximation: same as SQ — only the architectural low 32 bits are observable.
  • ~4 RTL edits.
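Here is a behavioural sketch of option (A), assuming a word-addressed RAM model and the 16-byte alignment rule proposed above. `AddressError`, `lq_option_a` and the word-array memory are illustrative names, not existing RTL or TB symbols. The exception class stands in for whichever address-error path the RTL raises.

```python
class AddressError(Exception):
    """Stands in for the core's address-error exception path (hypothetical)."""


def lq_option_a(ram_words: list, base_val: int, imm: int) -> int:
    """Option (A): one 32-bit read at EA; returns the value written back to rt[31:0].

    ram_words is a list of 32-bit words; word i holds byte addresses 4*i..4*i+3.
    """
    ea = (base_val + imm) & 0xFFFFFFFF
    if ea & 0xF:                      # ea[3:0] != 0 -> misaligned quadword access
        raise AddressError(f"LQ misaligned ea=0x{ea:08X}")
    return ram_words[ea >> 2]         # the same single-beat read LW would issue
```

For example, with `ram = [0x10000000 + i for i in range(16)]`, `lq_option_a(ram, 0x10, 0)` reads word index 4. A base of 0x14 raises instead of reading.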

(B) Symmetric: 4-beat read FSM reading 32 bits per beat.

  • Mirrors Ch271's SQ structure exactly.
  • All 4 reads issued; the implementation discards beats 1-3 (since we have no GPR storage for them).
  • ~8 RTL edits.
  • Slightly more uniform with SQ but no observable behavior difference from (A).

My read: (A), because the upper 96 bits are unrepresentable. A 4-beat read costs sim cycles for zero benefit. We can revisit if/when 128-bit GPRs are added.
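The claim that (B) buys nothing observable can be checked with a toy model that counts memory beats. Both functions are hypothetical sketches over a word-array RAM, with the alignment check omitted for brevity. (B) issues four reads and keeps only beat 0, so its written-back value always equals (A)'s, but it spends four beats instead of one.

```python
def lq_a(ram: list, ea: int, beats: list) -> int:
    """Option (A): a single read at EA."""
    beats.append(ea)
    return ram[ea >> 2]


def lq_b(ram: list, ea: int, beats: list) -> int:
    """Option (B): four reads mirroring SQ's 4-beat structure."""
    values = []
    for beat in range(4):
        addr = ea + 4 * beat
        beats.append(addr)
        values.append(ram[addr >> 2])
    return values[0]                  # beats 1-3 have no GPR storage; discard them


ram = [0xA0000000 | i for i in range(32)]   # 128 bytes of distinct words
for ea in range(0, 128, 16):
    beats_a, beats_b = [], []
    assert lq_a(ram, ea, beats_a) == lq_b(ram, ea, beats_b)
    assert len(beats_a) == 1 and len(beats_b) == 4
print("A and B agree on every aligned EA; B costs 4 beats vs 1")
```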

Implementation outline for (A):

  1. localparam OP_LQ = 6'h1E.
  2. is_lq decode signal.
  3. Add 16-byte alignment check: extend is_align_fault with is_quad_load_access && (ea[3:0] != 0) (or just extend is_quad_access to cover both SQ and LQ).
  4. Add LQ to the FSM transition: else if (is_lq) state <= S_MEM_REQ. Reuse the existing S_MEM_WAIT writeback path.
  5. Hook LQ into the LW/LB/LBU writeback case as a "word load with 16-byte aligned EA".
  6. Add !is_lq to is_nop_class allow-list.
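Steps 1 to 3 reduce to a few boolean predicates, sketched below in Python as a model of that logic only. `OP_SQ = 0x1F` is the R5900 SQ opcode. `quad_decode` and its return flags are hypothetical stand-ins for the RTL signals, not copies of them. It shows the "extend is_quad_access to cover both SQ and LQ" variant, where one flag drives the shared alignment check.

```python
OP_LQ = 0x1E
OP_SQ = 0x1F


def quad_decode(instr: int, base_val: int):
    """Return (is_lq, is_sq, is_quad_access, align_fault) for one instruction word."""
    op = (instr >> 26) & 0x3F
    imm = instr & 0xFFFF
    imm = imm - 0x10000 if imm & 0x8000 else imm    # sign-extend the offset
    ea = (base_val + imm) & 0xFFFFFFFF
    is_lq = op == OP_LQ
    is_sq = op == OP_SQ
    is_quad_access = is_lq or is_sq                 # one flag covers both directions
    align_fault = is_quad_access and (ea & 0xF) != 0
    return is_lq, is_sq, is_quad_access, align_fault
```

With qbert's `lq $t1, 0($a1)` (0x78A90000) and $a1 = 0x100, this returns `(True, False, True, False)`. The same word with $a1 = 0x104 sets `align_fault`.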

Focused TB mirrors tb_ee_core_sq shape: pre-poke RAM with distinct non-zero values, execute lq $rt, 0($base), verify $rt = low 32 bits of mem[base]. Cross-check that an LW at the same EA returns the same value (proving LQ degenerates to LW in our model for the observable lane).
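A Python mirror of that TB plan can help when choosing sentinel values before writing the SystemVerilog. It assumes a hypothetical word-array RAM and hypothetical `lw`/`lq_low32` helpers. Every word gets a distinct non-zero value, so reading the wrong lane (for example bits 63:32 instead of 31:0) would fail the check.

```python
def lw(ram: list, ea: int) -> int:
    assert ea & 0x3 == 0, "LW needs a word-aligned EA"
    return ram[ea >> 2]


def lq_low32(ram: list, ea: int) -> int:
    assert ea & 0xF == 0, "LQ needs a 16-byte-aligned EA in this model"
    return ram[ea >> 2]               # lane 0 of the quadword


# Pre-poke RAM with distinct non-zero words: 0x11110000, 0x11110001, ...
ram = [0x11110000 + i for i in range(16)]
base = 0x20                           # quadword starting at word index 8
rt = lq_low32(ram, base)
assert rt == ram[8] == lw(ram, base)                   # LQ degenerates to LW
assert all(rt != ram[8 + lane] for lane in (1, 2, 3))  # not an upper lane
print(hex(rt))                        # 0x11110008
```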

Files changed

  • rtl/ee/ee_core_stub.sv — 4 surgical edits.
  • sim/tb/integration/tb_ee_core_pcpyld.sv — new focused TB.
  • sim/Makefile — target + both regression lists.

Regression

In flight; expected 166/166.

Pattern review

Eight qbert chapters now. The pattern continues to compress. RTL edits per chapter (qbert track):

| Chapter | RTL edits | Mechanism |
|---|---|---|
| Ch271 SQ | 5 | NEW 4-beat write |
| Ch272 DADDU | 4 | NEW ALU-low-32 |
| Ch273 SYSCALL HLE | 2 | NEW gated dispatcher |
| Ch274 BEQL | 6 | NEW branch+squash |
| Ch275 SD | 7 | REUSE SQ counter |
| Ch276 DSLL | 4 | REUSE DADDU |
| Ch277 BNEL | 6 | REUSE BEQL squash (generalized) |
| Ch278 PCPYLD | 4 | NEW MMI narrow-decode |

Ch279 LQ should be ~4 edits (reuse LW path + new alignment).