Ch283 closeout — 128-bit GPR shadow + PCPYUD (the upper-half MMI op)
Status: Closed. Verdict from re-running qbert.elf:
elf_first_unsupported_opcode (pc=0x00113378 instr=0xdfbf0000):
opcode 0x37 = LD (Load Doubleword), encoding `ld $ra, 0($sp)`
(rs = 29 = $sp, rt = 31 = $ra).
This is the end-of-function return-address restore pattern, hit
after the byte-walker PCPYUD path completes and the function
returns. qbert retire_count: 27,024 → 27,067 (+43).

Ch283 introduced the architectural seam Codex framed as the right
middle path between "fake PCPYUD as zero" (silent divergence) and
"widen the whole EE core to 128 bits" (multi-chapter cross-cutting
work): a parallel 128-bit GPR shadow (`gpr128`) that LQ/SQ/SD and
every MMI op now flow through, while the legacy 32-bit `regfile`
remains the canonical scalar surface.
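
As a quick cross-check of the verdict, the trapped word can be split into the standard MIPS I-type fields. This is only a sketch: the helper name `decode_itype` and the register-name table are illustrative and do not come from the repo.

```python
# Conventional MIPS register names by number (the naming this note uses).
REG_NAMES = [
    "zero", "at", "v0", "v1", "a0", "a1", "a2", "a3",
    "t0", "t1", "t2", "t3", "t4", "t5", "t6", "t7",
    "s0", "s1", "s2", "s3", "s4", "s5", "s6", "s7",
    "t8", "t9", "k0", "k1", "gp", "sp", "fp", "ra",
]


def decode_itype(instr):
    """Split a 32-bit MIPS I-type word into opcode, rs, rt, and signed imm."""
    imm = instr & 0xFFFF
    if imm & 0x8000:          # sign-extend the 16-bit immediate
        imm -= 0x10000
    return {
        "opcode": (instr >> 26) & 0x3F,
        "rs": (instr >> 21) & 0x1F,   # base register for loads/stores
        "rt": (instr >> 16) & 0x1F,   # destination register for loads
        "imm": imm,
    }


fields = decode_itype(0xDFBF0000)
print(
    "opcode=0x%02x  ld $%s, %d($%s)"
    % (fields["opcode"], REG_NAMES[fields["rt"]], fields["imm"], REG_NAMES[fields["rs"]])
)
```

The base register field is 29 and the target field is 31, which is the usual shape of a return-address reload from the stack frame.
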
What landed (architectural summary)
The EE core now has two parallel GPR storages:
| storage | width | who writes it | who reads it |
|---|---|---|---|
| `regfile [0:31]` | 32 | every scalar op (unchanged) | scalar decode, branches, ALU operands |
| `gpr128 [0:31]` | 128 | every scalar op (via mirror, zero-extended); MMI ops; LQ | MMI ops needing upper bits; SQ/SD per-beat sources |

Invariant: `gpr128[i][31:0] === regfile[i]` always. Scalar writes
zero-extend into `gpr128[i][127:32]`; MMI/LQ writes can land non-zero
bits there. The rule adopted here is that scalar ops clear the upper
bits of their destination: Codex framed it as "define upper bits
conservatively," and zero is the conservative answer.
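
The mirror invariant can be illustrated with a small behavioral model. This is a hedged sketch, not the RTL: the class `GprModel` and its method names are hypothetical, and only the read-side `$0 → 0` guard described in this note is modeled.

```python
MASK32 = (1 << 32) - 1
MASK128 = (1 << 128) - 1


class GprModel:
    """Two parallel stores: a 32-bit scalar file and a 128-bit shadow."""

    def __init__(self):
        self.regfile = [0] * 32   # canonical scalar surface
        self.gpr128 = [0] * 32    # shadow carrying upper bits

    def write_scalar(self, idx, value):
        """Scalar writeback: low 32 to regfile, zero-extended into the shadow."""
        low = value & MASK32
        self.regfile[idx] = low
        self.gpr128[idx] = low            # bits 127:32 become zero

    def write_wide(self, idx, value):
        """MMI/LQ writeback: full 128 bits to the shadow, low 32 mirrored."""
        wide = value & MASK128
        self.gpr128[idx] = wide
        self.regfile[idx] = wide & MASK32

    def read128(self, idx):
        """128-bit read helper with the $0 -> 0 guard."""
        return 0 if idx == 0 else self.gpr128[idx]

    def invariant_holds(self):
        """Check gpr128[i][31:0] == regfile[i] for every register."""
        return all((self.gpr128[i] & MASK32) == self.regfile[i] for i in range(32))


model = GprModel()
model.write_wide(10, (0xAABBCCDD << 64) | 0x11223344)   # wide write keeps upper bits
model.write_scalar(10, 0x55)                            # scalar write clears them
print(hex(model.gpr128[10]), model.invariant_holds())
```

Because both write paths update both stores in the same step, the invariant holds by construction rather than needing a separate reconciliation pass.
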
RTL — surgical edits in ee_core_stub.sv
- Declaration + reset: `logic [127:0] gpr128 [0:31];` next to `regfile`. Reset clears all 32 to `128'd0`.
- Read helpers: `rs128_val`/`rt128_val` next to `rs_val`/`rt_val`, both with the `$0 → 0` guard.
- Scalar-write mirrors: every existing `regfile[X] <= Y` now has a paired `gpr128[X] <= {96'd0, Y}`. Sites touched: SYSCALL HLE (3), I-type ALU writeback, R-type ALU writeback, MFHI/MFLO, JAL/JALR link, MFC0, Ch215 jmp_buf restore (12) + final $v0, LW/LB/LBU/LH/LHU load returns. Load path was refactored to compute `load_wb` once and write both stores.
- MMI 128-bit writeback: new `rtype_alu128_wb` combinational block computes the full 128-bit MMI result for PCPYLD/PSUBB/PNOR/PAND/PCPYUD. The R-type writeback site picks between the full 128-bit value (when `is_mmi_wb`) and the zero-extended scalar value (every other R-type op). The existing 32-bit `rtype_alu_wb` still lands the correct low 32 into `regfile`.
- LQ 4-beat FSM: `is_lq` now takes a dedicated dispatch arm that initializes `sq_beat <= 0` and re-uses S_MEM_REQ/S_MEM_WAIT four times. Beat N's `map_rd_addr = ea + N*4`. Each beat captures `map_rd_data` into the matching 32-bit lane of `gpr128[rt]`. Last beat mirrors `gpr128[rt][31:0]` to `regfile[rt]` and retires once. Replaces the Ch279 single-beat LW-style approximation.
- SQ/SD per-beat source upgrade: beats now pull from `gpr128[rt][lane]` instead of "low 32 then zero": SQ emits all four lanes, SD emits the low two.
- PCPYUD decode + arms: `localparam MMI3_PCPYUD = 5'h0E`, `is_pcpyud` decode (MMI3 / sa 0x0E), added to `is_rtype_alu` and `is_nop_class` exclusion. Low-32 arm in `rtype_alu_wb` uses `rt128_val[95:64]` (= low 32 of $rt's upper doubleword); full 128-bit arm in `rtype_alu128_wb` is `{rs128[127:64], rt128[127:64]}`.

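
The beat arithmetic for LQ, SQ, and SD described in the list above can be sketched in software. The function names are hypothetical, and the model assumes lane N occupies bits `[32*N+31 : 32*N]` of the 128-bit value, which this note implies but does not state outright.

```python
MASK32 = 0xFFFFFFFF


def lq_beats(mem_read32, ea):
    """Assemble a 128-bit value from four 32-bit reads at ea + N*4."""
    value = 0
    for beat in range(4):
        word = mem_read32(ea + beat * 4) & MASK32
        value |= word << (32 * beat)      # assumed lane order: beat N -> lane N
    return value


def store_beats(value128, ea, n_beats):
    """Return the (address, word) pairs a store emits.

    n_beats is 4 for SQ (all lanes) and 2 for SD (the low two lanes).
    """
    return [
        (ea + beat * 4, (value128 >> (32 * beat)) & MASK32)
        for beat in range(n_beats)
    ]


# Toy word-addressed memory for the demonstration.
memory = {0x100: 0x33323130, 0x104: 0x37363534, 0x108: 0x00000001, 0x10C: 0x00000002}
loaded = lq_beats(memory.__getitem__, 0x100)
print(hex(loaded))
print(store_beats(loaded, 0x200, 2))   # SD: low two lanes only
print(store_beats(loaded, 0x200, 4))   # SQ: all four lanes
```

Sharing one beat counter between the load and store paths is what lets SD reuse the SQ sequencing with a smaller beat count.
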
Focused TB — tb_ee_core_pcpyud.sv
Three cases:
- Exact qbert encoding asserted == 0x704923A9. `pcpyud $a0, $v0, $t1` with $v0 and $t1 set by scalar LUI+ORI (upper halves architecturally 0). PCPYUD's low-32 result = 0, exactly what qbert sees on every byte-walker iteration.
- PCPYLD-then-PCPYUD round-trip. `pcpyld $t2, $t0, $t1` puts $t0[31:0] = 0xAABBCCDD into `gpr128[$t2][95:64]`. `pcpyud $t3, $t2, $t2` then extracts $t2's upper-D into both halves of $t3. Verified: `regfile[$t3] == 0xAABBCCDD` and peeked `gpr128[$t3][127:64] == 0x00000000_AABBCCDD`. Proves the gpr128 shadow is actually carrying upper bits.
- PCPYUD with rt=$0. Exercises the rs-upper-D path alone. $t5 low = 0, `gpr128[$t5][127:64]` inherits $t2's upper-D.

Result: retired=23 halt=1 trap=0 pc=0xbfc00150 errors=0 PASS.
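
The three cases can be mirrored in a few lines of reference code. This is a sketch under stated assumptions: `encode_mmi3`, `pcpyld`, and `pcpyud` are hypothetical helpers, the opcode (0x1C) and function (0x29) fields are read off the asserted encoding 0x704923A9, and `pcpyud` follows the operand order this note gives for `rtype_alu128_wb` (`{rs128[127:64], rt128[127:64]}`).

```python
MASK32 = 0xFFFFFFFF
MASK64 = (1 << 64) - 1


def encode_mmi3(rs, rt, rd, sa):
    """Encode an MMI3-class word: opcode 0x1C, funct 0x29, sub-op in sa."""
    return (0x1C << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (sa << 6) | 0x29


def pcpyld(rs128, rt128):
    """Upper doubleword from rs's low doubleword, lower from rt's low doubleword."""
    return ((rs128 & MASK64) << 64) | (rt128 & MASK64)


def pcpyud(rs128, rt128):
    """Operand order as written in this note: {rs128[127:64], rt128[127:64]}."""
    return ((rs128 >> 64) << 64) | (rt128 >> 64)


# Case 1: exact qbert encoding, pcpyud $a0, $v0, $t1 (rd=4, rs=2, rt=9).
word = encode_mmi3(rs=2, rt=9, rd=4, sa=0x0E)
scalar_result = pcpyud(0x12345678, 0x9ABCDEF0)   # scalar operands: upper halves are 0

# Case 2: PCPYLD then PCPYUD round-trip.
t2 = pcpyld(0xAABBCCDD, 0x11223344)
t3 = pcpyud(t2, t2)

# Case 3: rt = $0, so only the rs upper doubleword survives.
t5 = pcpyud(t2, 0)

print(hex(word), hex(scalar_result), hex(t3 & MASK32), hex(t3 >> 64), hex(t5))
```

Case 2 is symmetric in rs and rt, so only case 3 distinguishes which operand feeds which half of the destination.
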
Makefile + regression
- `tb_ee_core_pcpyud` target with build + run rules.
- Added to both the PHONY target list (line 407) and the `run:` master list (line 2510), per the dual-list rule.
- Regression: 170 → 171.

qbert progression
| Chapter | Blocker | qbert retire_count |
|---|---|---|
| Post-Ch281 (PNOR) | PAND at 0x00112C98 | 27,022 |
| Post-Ch282 (PAND) | PCPYUD at 0x00112CA0 | 27,024 |
| Post-Ch283 (PCPYUD) | LD at 0x00113378 | 27,067 |

+43 retires past Ch282. qbert finished the byte-walker MMI sequence
(LQ → PSUBB → PNOR → PAND → PCPYUD → reduce/branch), returned from
that branch, did a chunk of follow-on work, then hit `ld $ra, 0($sp)`
(instr 0xdfbf0000: rs = 29 = $sp, rt = 31 = $ra), the end-of-function
return-address restore. LD is the read-side of SD and is now the
Ch284 candidate.

Side-effect check: the new full-128-bit LQ feeds real upper-half data into PCPYUD. The fact that qbert advanced through the PCPYUD site and 43 more instructions means the byte-walker's downstream logic accepted the actual data (not zero) and made a real branch decision based on it. Snapshot at halt:

- `$a0 = 0x33323130`: ASCII "0123" when read as little-endian bytes, which strongly suggests qbert is mid-string processing (the byte-walker did its job).
- `$v1 = 0x0012c2c6`, `$a1 = 0x0011c326`, `$a2/$a3 = 0x0012c2c0`.

This is the first chapter where the qbert run produces visible content-shaped state (ASCII bytes in registers) rather than just opcode-blocker telemetry.
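
The ASCII reading of `$a0` is easy to reproduce. The helper below is illustrative only (`reg_as_ascii` is not a repo function) and assumes the register is interpreted as little-endian bytes, lowest address first.

```python
def reg_as_ascii(value, width=4):
    """Interpret a register value as `width` little-endian ASCII bytes."""
    return value.to_bytes(width, "little").decode("ascii")


print(reg_as_ascii(0x33323130))
```
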
Pattern review (13 chapters)
| Ch | Blocker | Edits | Pattern |
|---|---|---|---|
| 271 | SQ | 5 | NEW 4-beat write |
| 272 | DADDU | 4 | NEW ALU-low-32 |
| 273 | SYSCALL HLE | 2 | NEW gated dispatcher |
| 274 | BEQL | 6 | NEW branch+squash |
| 275 | SD | 7 | REUSE SQ counter |
| 276 | DSLL | 4 | REUSE DADDU |
| 277 | BNEL | 6 | REUSE BEQL squash |
| 278 | PCPYLD | 4 | NEW MMI narrow-decode |
| 279 | LQ | 5 | REUSE LW path |
| 280 | PSUBB | 5 | REUSE MMI narrow (byte-SIMD new) |
| 281 | PNOR | 5 | REUSE MMI narrow + NOR arm |
| 282 | PAND | 5 | REUSE MMI narrow + AND arm |
| 283 | PCPYUD + gpr128 | architectural | NEW 128-bit shadow |
Ch283 breaks the surgical-one-opcode cadence because it has to: this
is the first chapter that the "low-32-only" approximation could not
keep absorbing. The MMI narrow-decode pattern from Ch278 still works
(PCPYUD adds the same 3-way is_mmi+func+sa decode), but the
writeback now needs full-128 storage, which retroactively forced
LQ/SQ/SD/PCPYLD/PSUBB/PNOR/PAND to also flow through gpr128.
That's a one-time investment. Future MMI ops that need upper bits
(PCPYH, PINTEH, PCEQB, PMADDH, etc.) can ride the existing seam:
read rs128_val/rt128_val, write rtype_alu128_wb. No more
architectural work to add upper-half ops.
Files changed
- `rtl/ee/ee_core_stub.sv`: declarations + 36 scalar-write mirrors + MMI 128-bit writeback + PCPYUD decode + LQ 4-beat FSM + SQ/SD per-beat sources.
- `sim/tb/integration/tb_ee_core_pcpyud.sv`: new focused TB.
- `sim/Makefile`: target + both regression lists.

Regression
171/171 PASS (was 170/170 in Ch282).