# Ch283 closeout: 128-bit GPR shadow + PCPYUD (the upper-half MMI op)

**Status:** Closed.

**Verdict from re-running qbert.elf:** `elf_first_unsupported_opcode (pc=0x00113378 instr=0xdfbf0000)`. Opcode 0x37 is **LD (Load Doubleword)**, and the word decodes as `ld $ra, 0($sp)` (rs = 29, rt = 31, offset 0). This is the end-of-function return-address restore pattern, hit *after* the byte-walker PCPYUD path completes and the function returns. qbert retire_count: 27,024 → **27,067** (+43).

Ch283 introduced the architectural seam Codex framed as the right middle path between "fake PCPYUD as zero" (silent divergence) and "widen the whole EE core to 128 bits" (multi-chapter cross-cutting work): a parallel **128-bit GPR shadow** (`gpr128`) that LQ/SQ/SD and every MMI op now flow through, while the legacy 32-bit `regfile` remains the canonical scalar surface.

## What landed (architectural summary)

The EE core now has two parallel GPR storages:

| | width | who writes it | who reads it |
|---|---|---|---|
| `regfile [0:31]` | 32 | every scalar op (unchanged) | scalar decode, branches, ALU operands |
| `gpr128 [0:31]` | 128 | every scalar op (via mirror, zero-extended); MMI ops; LQ | MMI ops needing upper bits; SQ/SD per-beat sources |

**Invariant:** `gpr128[i][31:0] === regfile[i]` always. Scalar writes zero-extend into `gpr128[i][127:32]`; MMI/LQ writes can land non-zero bits there. The working rule in this core is that scalar ops clear the upper bits of their destination: Codex framed it as "define upper bits conservatively," and zero is the conservative answer.

## RTL: surgical edits in `ee_core_stub.sv`

1. **Declaration + reset:** `logic [127:0] gpr128 [0:31];` next to `regfile`. Reset clears all 32 entries to 128'd0.
2. **Read helpers:** `rs128_val` / `rt128_val` next to `rs_val` / `rt_val`, both with the `$0 → 0` guard.
3. **Scalar-write mirrors:** every existing `regfile[X] <= Y` now has a paired `gpr128[X] <= {96'd0, Y}`.
Sites touched: SYSCALL HLE (3), I-type ALU writeback, R-type ALU writeback, MFHI/MFLO, JAL/JALR link, MFC0, Ch215 jmp_buf restore (12) + final $v0, and the LW/LB/LBU/LH/LHU load returns. The load path was refactored to compute `load_wb` once and write both storages.
4. **MMI 128-bit writeback:** a new `rtype_alu128_wb` combinational block computes the full 128-bit MMI result for PCPYLD/PSUBB/PNOR/PAND/PCPYUD. The R-type writeback site picks between the full 128-bit value (when `is_mmi_wb`) and the zero-extended scalar value (every other R-type op). The existing 32-bit `rtype_alu_wb` still lands the correct low 32 bits into `regfile`.
5. **LQ 4-beat FSM:** `is_lq` now takes a dedicated dispatch arm that initializes `sq_beat <= 0` and reuses S_MEM_REQ/S_MEM_WAIT four times. Beat N's `map_rd_addr = ea + N*4`. Each beat captures `map_rd_data` into the matching 32-bit lane of `gpr128[rt]`. The last beat mirrors `gpr128[rt][31:0]` to `regfile[rt]` and retires once. This replaces the Ch279 single-beat LW-style approximation.
6. **SQ/SD per-beat source upgrade:** beats now pull from `gpr128[rt][lane]` instead of "low 32 then zero": SQ emits all four lanes, SD emits the low two.
7. **PCPYUD decode + arms:** `localparam MMI3_PCPYUD = 5'h0E`, `is_pcpyud` decode (MMI3 / sa 0x0E), added to `is_rtype_alu` and to the `is_nop_class` exclusion. The low-32 arm in `rtype_alu_wb` uses `rt128_val[95:64]` (the low 32 bits of $rt's upper doubleword); the full 128-bit arm in `rtype_alu128_wb` is `{rs128[127:64], rt128[127:64]}`.

## Focused TB: `tb_ee_core_pcpyud.sv`

Three cases:

1. **Exact qbert encoding**, asserted `== 0x704923A9`. `pcpyud $a0, $v0, $t1` with $v0 and $t1 set by scalar LUI+ORI (upper halves architecturally 0). PCPYUD's low-32 result = 0, exactly what qbert sees on every byte-walker iteration.
2. **PCPYLD-then-PCPYUD round-trip.** `pcpyld $t2, $t0, $t1` puts $t0[31:0] = 0xAABBCCDD into `gpr128[$t2][95:64]`. `pcpyud $t3, $t2, $t2` then extracts $t2's upper-D into both halves of $t3.
Verified: `regfile[$t3] == 0xAABBCCDD` *and* peeked `gpr128[$t3][127:64] == 0x00000000_AABBCCDD`. This proves the gpr128 shadow is actually carrying upper bits.
3. **PCPYUD with rt=$0.** Exercises the rs-upper-D path alone. $t5 low = 0, and `gpr128[$t5][127:64]` inherits $t2's upper-D.

Result: `retired=23 halt=1 trap=0 pc=0xbfc00150 errors=0 PASS`.

## Makefile + regression

- `tb_ee_core_pcpyud` target with build + run rules.
- Added to both the PHONY target list (line 407) and the `run:` master list (line 2510), per the dual-list rule.
- Regression: 170 → **171**.

## qbert progression

| Chapter | Blocker | qbert retire_count |
|---------|---------|---------------------|
| Post-Ch281 (PNOR) | PAND at 0x00112C98 | 27,022 |
| Post-Ch282 (PAND) | PCPYUD at 0x00112CA0 | 27,024 |
| **Post-Ch283 (PCPYUD)** | **LD at 0x00113378** | **27,067** |

+43 retires past Ch282. qbert finished the byte-walker MMI sequence (`LQ → PSUBB → PNOR → PAND → PCPYUD → reduce/branch`), returned from that branch, did a chunk of follow-on work, then hit `ld $ra, 0($sp)` (instruction word 0xdfbf0000: rs = 29, rt = 31), the end-of-function return-address restore. LD is the read side of SD and is now the Ch284 candidate.

Side-effect check: the new full-128-bit LQ feeds real upper-half data into PCPYUD. Because qbert advanced through the PCPYUD site and 43 more instructions, the byte-walker's downstream logic accepted the actual data (not zero) and made a real branch decision based on it.

Snapshot at halt:

- `$a0 = 0x33323130`: read as little-endian bytes this is ASCII `"0123"`, which strongly suggests qbert is mid-string processing (the byte-walker did its job).
- `$v1 = 0x0012c2c6`, `$a1 = 0x0011c326`, `$a2/$a3 = 0x0012c2c0`.

This is the first chapter where the qbert run produces visible *content-shaped* state (ASCII bytes in registers) rather than just opcode-blocker telemetry.
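The blocker word and the `$a0` snapshot can be checked by hand. The following is a minimal Python sketch, assuming the standard MIPS I-type field layout (opcode in bits 31:26, rs in 25:21, rt in 20:16, signed 16-bit offset in 15:0). The helper names `decode_itype` and `reg_as_ascii` are illustrative and do not exist in the repo.

```python
def decode_itype(word: int) -> dict:
    """Split a 32-bit MIPS I-type instruction word into its fields."""
    imm = word & 0xFFFF
    if imm & 0x8000:              # sign-extend the 16-bit offset
        imm -= 0x10000
    return {
        "opcode": (word >> 26) & 0x3F,
        "rs": (word >> 21) & 0x1F,
        "rt": (word >> 16) & 0x1F,
        "imm": imm,
    }


def reg_as_ascii(value: int) -> str:
    """Read a 32-bit register value as four little-endian ASCII bytes."""
    return value.to_bytes(4, "little").decode("ascii")


print(decode_itype(0xDFBF0000))   # {'opcode': 55, 'rs': 29, 'rt': 31, 'imm': 0}
print(reg_as_ascii(0x33323130))   # 0123
```

Opcode 55 is 0x37 (LD), register 29 is `$sp` and register 31 is `$ra`, so the word is a load of `$ra` from offset 0 of the stack pointer.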
## Pattern review (13 chapters)

| Ch | Blocker | Edits | Pattern |
|-----|--------------|-------|---------|
| 271 | SQ | 5 | NEW 4-beat write |
| 272 | DADDU | 4 | NEW ALU-low-32 |
| 273 | SYSCALL HLE | 2 | NEW gated dispatcher |
| 274 | BEQL | 6 | NEW branch+squash |
| 275 | SD | 7 | REUSE SQ counter |
| 276 | DSLL | 4 | REUSE DADDU |
| 277 | BNEL | 6 | REUSE BEQL squash |
| 278 | PCPYLD | 4 | NEW MMI narrow-decode |
| 279 | LQ | 5 | REUSE LW path |
| 280 | PSUBB | 5 | REUSE MMI narrow (byte-SIMD new) |
| 281 | PNOR | 5 | REUSE MMI narrow + NOR arm |
| 282 | PAND | 5 | REUSE MMI narrow + AND arm |
| **283** | **PCPYUD + gpr128** | **architectural** | **NEW 128-bit shadow** |

Ch283 breaks the surgical-one-opcode cadence because it has to: this is the first chapter that the "low-32-only" approximation could not absorb. The MMI narrow-decode pattern from Ch278 still works (PCPYUD adds the same 3-way is_mmi+func+sa decode), but the *writeback* now needs full-128 storage, which retroactively forced LQ/SQ/SD/PCPYLD/PSUBB/PNOR/PAND to also flow through `gpr128`.

That's a one-time investment. Future MMI ops that need upper bits (PCPYH, PINTEH, PCEQB, PMADDH, etc.) can ride the existing seam: read `rs128_val`/`rt128_val`, write `rtype_alu128_wb`. Adding upper-half ops should need no further architectural work.

## Files changed

- `rtl/ee/ee_core_stub.sv`: declarations + 36 scalar-write mirrors + MMI 128-bit writeback + PCPYUD decode + LQ 4-beat FSM + SQ/SD per-beat sources.
- `sim/tb/integration/tb_ee_core_pcpyud.sv`: new focused TB.
- `sim/Makefile`: target + both regression lists.

## Regression

**171/171 PASS** (was 170/170 in Ch282).
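
For readers who want to reason about the two-storage scheme from the architectural summary without opening the RTL, it can be sketched as a small behavioural model. This is a Python illustration only, not the SystemVerilog: the class `GprShadow` and its method names are hypothetical, the lane order (beat N fills bits `[32N+31:32N]`) is an assumption, and writes are left unguarded because the closeout only describes the `$0` guard on reads.

```python
MASK32 = (1 << 32) - 1
MASK128 = (1 << 128) - 1


class GprShadow:
    """Behavioural model of regfile (32-bit) plus the gpr128 shadow."""

    def __init__(self) -> None:
        self.regfile = [0] * 32   # canonical scalar surface
        self.gpr128 = [0] * 32    # 128-bit shadow, reset to zero

    def read128(self, idx: int) -> int:
        """Read helper with the $0 -> 0 guard."""
        return 0 if idx == 0 else self.gpr128[idx]

    def write_scalar(self, idx: int, value: int) -> None:
        """Scalar writeback: mirror into the shadow, zero-extended."""
        self.regfile[idx] = value & MASK32
        self.gpr128[idx] = value & MASK32      # upper 96 bits cleared

    def write_wide(self, idx: int, value: int) -> None:
        """MMI/LQ writeback: full 128 bits, low 32 mirrored to regfile."""
        self.gpr128[idx] = value & MASK128
        self.regfile[idx] = value & MASK32

    def load_quad(self, idx: int, beats: list) -> None:
        """LQ: four 32-bit beats, beat N assumed to fill lane N."""
        value = 0
        for n, word in enumerate(beats):
            value |= (word & MASK32) << (32 * n)
        self.write_wide(idx, value)

    def store_lanes(self, idx: int, count: int) -> list:
        """Per-beat store sources: count=4 for SQ, count=2 for SD."""
        wide = self.read128(idx)
        return [(wide >> (32 * n)) & MASK32 for n in range(count)]

    def invariant_holds(self) -> bool:
        """Check gpr128[i][31:0] == regfile[i] for every register."""
        return all((w & MASK32) == r
                   for w, r in zip(self.gpr128, self.regfile))


g = GprShadow()
g.load_quad(8, [0x11111111, 0x22222222, 0x33333333, 0x44444444])
print([hex(w) for w in g.store_lanes(8, 2)])   # SD view: low two lanes
g.write_scalar(8, 0xAABBCCDD)                  # scalar write clears the upper bits
print(hex(g.read128(8)), g.invariant_holds())
```

The point of the model is that every write path goes through one of two functions, so the low-32 invariant holds by construction rather than by auditing each write site.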