retroDE_ps2/docs/ch283_closeout.md
thejayman77 ec82764bef Initial commit: retroDE_ps2 — first-of-its-kind PS2 GS FPGA core (DE25-Nano / Agilex 5)
RTL (GS rasterizer, EE core stub, platform bridge, LPDDR4B path), sim regression
(272 TBs), docs, and tooling. Copyrighted PS2 content (BIOS, game code, GS dumps,
and all dump-derived textures/traces) is excluded via .gitignore and stays local.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-29 20:10:50 -04:00

Ch283 closeout — 128-bit GPR shadow + PCPYUD (the upper-half MMI op)

Status: Closed.

Verdict from re-running qbert.elf: elf_first_unsupported_opcode (pc=0x00113378 instr=0xdfbf0000). Opcode 0x37 is LD (Load Doubleword), and the word decodes to ld $ra, 0($sp) (rs = 29, rt = 31, offset = 0). This is the end-of-function return-address restore pattern, hit after the byte-walker PCPYUD path completes and the function returns. qbert retire_count: 27,024 → 27,067 (+43).

Ch283 introduced the architectural seam Codex framed as the right middle path between "fake PCPYUD as zero" (silent divergence) and "widen the whole EE core to 128 bits" (multi-chapter cross-cutting work): a parallel 128-bit GPR shadow (gpr128) that LQ/SQ/SD and every MMI op now flow through, while the legacy 32-bit regfile remains the canonical scalar surface.
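The field split of the blocking word can be checked by hand. The sketch below is a plain Python helper (decode_itype is a hypothetical name, not part of the repo tooling) that applies the standard MIPS I-type layout to the reported instruction word.

```python
def decode_itype(instr: int) -> dict:
    """Split a 32-bit MIPS I-type word into opcode, rs, rt and signed offset."""
    imm = instr & 0xFFFF
    if imm & 0x8000:              # sign-extend the 16-bit offset
        imm -= 0x10000
    return {
        "opcode": (instr >> 26) & 0x3F,
        "rs": (instr >> 21) & 0x1F,   # base register
        "rt": (instr >> 16) & 0x1F,   # register being loaded
        "imm": imm,
    }


fields = decode_itype(0xDFBF0000)
print(fields)   # {'opcode': 55, 'rs': 29, 'rt': 31, 'imm': 0}
```

Opcode 55 is 0x37 (LD), register 29 is $sp and register 31 is $ra.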

What landed (architectural summary)

The EE core now has two parallel GPR storages:

| storage | width | who writes it | who reads it |
|---|---|---|---|
| regfile [0:31] | 32 | every scalar op (unchanged) | scalar decode, branches, ALU operands |
| gpr128 [0:31] | 128 | every scalar op (via mirror, zero-extended); MMI ops; LQ | MMI ops needing upper bits; SQ/SD per-beat sources |

Invariant: gpr128[i][31:0] === regfile[i] always. Scalar writes zero-extend into gpr128[i][127:32]; MMI/LQ writes can land non-zero bits there. In this model, scalar ops clear the upper bits of their destination; Codex framed it as "define upper bits conservatively," and zero is the conservative answer.
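A minimal behavioral sketch of the two storages and the invariant, assuming nothing beyond what the table and the invariant state. GprModel and its method names are hypothetical; this is a Python stand-in for reasoning about the rule, not the RTL in ee_core_stub.sv.

```python
MASK32 = (1 << 32) - 1
MASK128 = (1 << 128) - 1


class GprModel:
    """Behavioral stand-in for regfile + gpr128 (hypothetical class, not RTL)."""

    def __init__(self):
        self.regfile = [0] * 32   # 32-bit scalar view
        self.gpr128 = [0] * 32    # 128-bit shadow, reset to zero

    def write_scalar(self, idx, value):
        # Mirrors `regfile[X] <= Y` paired with `gpr128[X] <= {96'd0, Y}`.
        low = value & MASK32
        self.regfile[idx] = low
        self.gpr128[idx] = low

    def write_wide(self, idx, value128):
        # MMI/LQ-style write: the full 128 bits land in the shadow,
        # and the low 32 bits land in the scalar regfile.
        value128 &= MASK128
        self.gpr128[idx] = value128
        self.regfile[idx] = value128 & MASK32

    def read128(self, idx):
        # Same $0 -> 0 guard the read helpers apply.
        return 0 if idx == 0 else self.gpr128[idx]

    def invariant_holds(self):
        return all((wide & MASK32) == narrow
                   for wide, narrow in zip(self.gpr128, self.regfile))


gprs = GprModel()
gprs.write_wide(10, (0xAABBCCDD << 64) | 0x11223344)   # upper bits set
assert gprs.invariant_holds() and gprs.read128(10) >> 64 == 0xAABBCCDD
gprs.write_scalar(10, 0x55667788)                      # scalar write zero-extends
assert gprs.invariant_holds() and gprs.read128(10) == 0x55667788
```

Routing every write through one of the two methods is what keeps the invariant true by construction, which is the same reason the RTL pairs each scalar write site with a mirror.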

RTL — surgical edits in ee_core_stub.sv

  1. Declaration + reset: logic [127:0] gpr128 [0:31]; next to regfile. Reset clears all 32 to 128'd0.
  2. Read helpers: rs128_val / rt128_val next to rs_val / rt_val, both with the $0 → 0 guard.
  3. Scalar-write mirrors: every existing regfile[X] <= Y now has a paired gpr128[X] <= {96'd0, Y}. Sites touched: SYSCALL HLE (3), I-type ALU writeback, R-type ALU writeback, MFHI/MFLO, JAL/JALR link, MFC0, Ch215 jmp_buf restore (12) + final $v0, LW/LB/LBU/LH/LHU load returns. Load path was refactored to compute load_wb once and write both stores.
  4. MMI 128-bit writeback: new rtype_alu128_wb combinational block computes the full 128-bit MMI result for PCPYLD/PSUBB/PNOR/PAND/PCPYUD. The R-type writeback site picks between the full 128-bit value (when is_mmi_wb) and the zero-extended scalar value (every other R-type op). The existing 32-bit rtype_alu_wb still lands the correct low 32 into regfile.
  5. LQ 4-beat FSM: is_lq now takes a dedicated dispatch arm that initializes sq_beat <= 0 and re-uses S_MEM_REQ/S_MEM_WAIT four times. Beat N's map_rd_addr = ea + N*4. Each beat captures map_rd_data into the matching 32-bit lane of gpr128[rt]. Last beat mirrors gpr128[rt][31:0] to regfile[rt] and retires once. Replaces the Ch279 single-beat LW-style approximation.
  6. SQ/SD per-beat source upgrade: beats now pull from gpr128[rt][lane] instead of "low 32 then zero". SQ emits all four lanes, SD emits the low two.
  7. PCPYUD decode + arms: localparam MMI3_PCPYUD = 5'h0E, is_pcpyud decode (MMI3 / sa 0x0E), added to is_rtype_alu and is_nop_class exclusion. Low-32 arm in rtype_alu_wb uses rt128_val[95:64] (= low 32 of $rt's upper doubleword); full 128-bit arm in rtype_alu128_wb is {rs128[127:64], rt128[127:64]}.
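The beat arithmetic in the LQ and SQ/SD items can be sketched as a small behavioral model. This assumes "matching lane" means beat N fills bits [32N+31:32N], that is, the lowest address lands in the lowest lane. lq_beats and store_beats are hypothetical helper names, and the FSM states and handshakes are deliberately left out.

```python
MASK32 = 0xFFFFFFFF


def lq_beats(ea, read32):
    """Assemble a 128-bit value from four 32-bit reads at ea + N*4."""
    value = 0
    for beat in range(4):
        word = read32(ea + beat * 4) & MASK32
        value |= word << (32 * beat)      # beat N fills lane N (assumed order)
    return value


def store_beats(value128, ea, n_beats):
    """Return (address, word) pairs: n_beats=4 models SQ, n_beats=2 models SD."""
    return [(ea + beat * 4, (value128 >> (32 * beat)) & MASK32)
            for beat in range(n_beats)]


# Hypothetical memory contents for illustration only.
mem = {0x100: 0x33323130, 0x104: 0x37363534, 0x108: 0x00000000, 0x10C: 0xAABBCCDD}
loaded = lq_beats(0x100, mem.__getitem__)
sq = store_beats(loaded, 0x200, 4)   # all four lanes
sd = store_beats(loaded, 0x200, 2)   # low two lanes only
```

With lanes sourced from the full 128-bit value, an LQ followed by an SQ reproduces all four words, which is the behavior the "low 32 then zero" approximation could not provide.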

Focused TB — tb_ee_core_pcpyud.sv

Three cases:

  1. Exact qbert encoding asserted == 0x704923A9. pcpyud $a0, $v0, $t1 with $v0 and $t1 set by scalar LUI+ORI (upper halves architecturally 0). PCPYUD's low-32 result = 0 — exactly what qbert sees on every byte-walker iteration.
  2. PCPYLD-then-PCPYUD round-trip. pcpyld $t2, $t0, $t1 puts $t0[31:0] = 0xAABBCCDD into gpr128[$t2][95:64]. pcpyud $t3, $t2, $t2 then extracts $t2's upper-D into both halves of $t3. Verified: regfile[$t3] == 0xAABBCCDD and peeked gpr128[$t3][127:64] == 0x00000000_AABBCCDD. Proves the gpr128 shadow is actually carrying upper bits.
  3. PCPYUD with rt=$0. Exercises the rs-upper-D path alone. $t5 low = 0, gpr128[$t5][127:64] inherits $t2's upper-D.

Result: retired=23 halt=1 trap=0 pc=0xbfc00150 errors=0 PASS.
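The first two TB cases can be reproduced with a few lines of integer arithmetic. This is a sketch, not the testbench: the function names are hypothetical, the PCPYUD lane order is copied from the concatenation quoted in this closeout, the PCPYLD lower half (taken from rt's low doubleword) is an assumption about the standard definition, and the scalar operand values other than 0xAABBCCDD are made up for illustration.

```python
MASK32 = (1 << 32) - 1
MASK64 = (1 << 64) - 1


def encode_mmi3(rs, rt, rd, sa):
    """Build an MMI3-class word: opcode 0x1C, function 0x29, op selected by sa."""
    return (0x1C << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (sa << 6) | 0x29


def pcpyld(rs128, rt128):
    # Upper doubleword from rs low doubleword (as the TB case describes);
    # lower doubleword from rt low doubleword (assumed).
    return ((rs128 & MASK64) << 64) | (rt128 & MASK64)


def pcpyud(rs128, rt128):
    # Lane order as written in this closeout: {rs128[127:64], rt128[127:64]}.
    return ((rs128 >> 64) << 64) | (rt128 >> 64)


# Case 1: pcpyud $a0, $v0, $t1 with rs=$v0 (2), rt=$t1 (9), rd=$a0 (4).
encoding = encode_mmi3(rs=2, rt=9, rd=4, sa=0x0E)
scalar_v0, scalar_t1 = 0x00123456, 0x0000BEEF     # upper halves are zero
case1_low32 = pcpyud(scalar_v0, scalar_t1) & MASK32

# Case 2: PCPYLD-then-PCPYUD round trip.
t0, t1 = 0xAABBCCDD, 0x11223344
t2 = pcpyld(t0, t1)
t3 = pcpyud(t2, t2)
print(hex(encoding), hex(case1_low32), hex(t3 & MASK32), hex(t3 >> 64))
```

Both cases are insensitive to which source feeds which half (the uppers are either both zero or both taken from $t2), so they hold regardless of the lane-order assumption.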

Makefile + regression

  • tb_ee_core_pcpyud target with build + run rules.
  • Added to both the PHONY target list (line 407) and the run: master list (line 2510) — per the dual-list rule.
  • Regression: 170 → 171.

qbert progression

| Chapter | Blocker | qbert retire_count |
|---|---|---|
| Post-Ch281 (PNOR) | PAND at 0x00112C98 | 27,022 |
| Post-Ch282 (PAND) | PCPYUD at 0x00112CA0 | 27,024 |
| Post-Ch283 (PCPYUD) | LD at 0x00113378 | 27,067 |

+43 retires past Ch282. qbert finished the byte-walker MMI sequence (LQ → PSUBB → PNOR → PAND → PCPYUD → reduce/branch), returned from that branch, did a chunk of follow-on work, then hit ld $ra, 0($sp), the end-of-function return-address restore. LD is the read side of SD and is now the Ch284 candidate.

Side-effect check: the new full-128-bit LQ feeds real upper-half data into PCPYUD. The fact that qbert advanced through the PCPYUD site and 43 more instructions means the byte-walker's downstream logic accepts the actual data (not zero), and made a real branch decision based on it. Snapshot at halt:

  • $a0 = 0x33323130 — ASCII "0123", which strongly suggests qbert is mid-string processing (the byte-walker did its job).
  • $v1 = 0x0012c2c6, $a1 = 0x0011c326, $a2/$a3 = 0x0012c2c0.
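The ASCII reading of $a0 follows from viewing the register in little-endian byte order, lowest byte first. A small sketch (reg_as_ascii is a hypothetical helper) that renders a 32-bit register value that way, with non-printable bytes shown as dots:

```python
def reg_as_ascii(value: int) -> str:
    """Render a 32-bit register as 4 characters, lowest-addressed byte first."""
    raw = (value & 0xFFFFFFFF).to_bytes(4, "little")
    return "".join(chr(b) if 0x20 <= b < 0x7F else "." for b in raw)


print(reg_as_ascii(0x33323130))   # prints 0123
```

Applied to the pointer-shaped values in the same snapshot, the helper returns only dots, which is consistent with those registers holding addresses rather than text.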

This is the first chapter where the qbert run produces visible content-shaped state (ASCII bytes in registers) rather than just opcode-blocker telemetry.

Pattern review (13 chapters)

| Ch | Blocker | Edits | Pattern |
|---|---|---|---|
| 271 | SQ | 5 | NEW 4-beat write |
| 272 | DADDU | 4 | NEW ALU-low-32 |
| 273 | SYSCALL HLE | 2 | NEW gated dispatcher |
| 274 | BEQL | 6 | NEW branch+squash |
| 275 | SD | 7 | REUSE SQ counter |
| 276 | DSLL | 4 | REUSE DADDU |
| 277 | BNEL | 6 | REUSE BEQL squash |
| 278 | PCPYLD | 4 | NEW MMI narrow-decode |
| 279 | LQ | 5 | REUSE LW path |
| 280 | PSUBB | 5 | REUSE MMI narrow (byte-SIMD new) |
| 281 | PNOR | 5 | REUSE MMI narrow + NOR arm |
| 282 | PAND | 5 | REUSE MMI narrow + AND arm |
| 283 | PCPYUD + gpr128 | architectural | NEW 128-bit shadow |

Ch283 breaks the surgical-one-opcode cadence because it has to: this is the first blocker that the "low-32-only" approximation could not absorb. The MMI narrow-decode pattern from Ch278 still works (PCPYUD adds the same 3-way is_mmi+func+sa decode), but the writeback now needs full-128 storage, which retroactively forced LQ/SQ/SD/PCPYLD/PSUBB/PNOR/PAND to also flow through gpr128.

That's a one-time investment. Future MMI ops that need upper bits (PCPYH, PINTEH, PCEQB, PMADDH, etc.) can ride the existing seam: read rs128_val/rt128_val, write rtype_alu128_wb. No more architectural work to add upper-half ops.

Files changed

  • rtl/ee/ee_core_stub.sv: declarations + 36 scalar-write mirrors + MMI 128-bit writeback + PCPYUD decode + LQ 4-beat FSM + SQ/SD per-beat sources.
  • sim/tb/integration/tb_ee_core_pcpyud.sv — new focused TB.
  • sim/Makefile — target + both regression lists.

Regression

171/171 PASS (was 170/170 in Ch282).