retroDE_ps2/docs/ch283_closeout.md

# Ch283 closeout — 128-bit GPR shadow + PCPYUD (the upper-half MMI op)

**Status:** Closed. **Verdict from re-running qbert.elf:**
`elf_first_unsupported_opcode (pc=0x00113378 instr=0xdfbf0000)`:
opcode 0x37 = **LD (Load Doubleword)**, encoding `ld $ra, 0($sp)`
(rs = 29, rt = 31). This is the end-of-function return-address
restore pattern, hit *after* the byte-walker PCPYUD path completes
and the function returns. qbert retire_count: 27,024 → **27,067**
(+43).

Ch283 introduced the architectural seam Codex framed as the right
middle path between "fake PCPYUD as zero" (silent divergence) and
"widen the whole EE core to 128 bits" (multi-chapter cross-cutting
work): a parallel **128-bit GPR shadow** (`gpr128`) that LQ/SQ/SD and
every MMI op now flow through, while the legacy 32-bit `regfile`
remains the canonical scalar surface.

## What landed (architectural summary)

The EE core now has two parallel GPR storages:

| | width | who writes it | who reads it |
|---|---|---|---|
| `regfile [0:31]` | 32 | every scalar op (unchanged) | scalar decode, branches, ALU operands |
| `gpr128 [0:31]` | 128 | every scalar op (via mirror — zero-extended); MMI ops; LQ | MMI ops needing upper bits; SQ/SD per-beat sources |

**Invariant:** `gpr128[i][31:0] === regfile[i]` always. Scalar writes
zero-extend into `gpr128[i][127:32]`; only MMI/LQ writes can land
non-zero bits there. Clearing the upper bits of the destination on
every scalar write is the rule this chapter adopts: Codex framed it
as "define upper bits conservatively," and zero is the conservative
answer.
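
A minimal Python sketch of the two-store rule, handy for reasoning about the invariant outside the simulator. The class and method names (`GprModel`, `write_scalar`, `write128`) are illustrative and do not appear in the RTL; only the widths, the zero-extension, and the `$0 → 0` read guard come from this note.

```python
MASK32 = (1 << 32) - 1
MASK128 = (1 << 128) - 1


class GprModel:
    """Toy model of the paired stores: 32-bit regfile plus 128-bit gpr128 shadow."""

    def __init__(self):
        # Reset clears both stores.
        self.regfile = [0] * 32
        self.gpr128 = [0] * 32

    def write_scalar(self, idx, value):
        # Scalar writeback: the shadow receives the zero-extended 32-bit value.
        value &= MASK32
        self.regfile[idx] = value
        self.gpr128[idx] = value  # upper 96 bits become zero

    def write128(self, idx, value):
        # MMI / LQ writeback: all 128 bits land in the shadow,
        # and the low 32 bits are mirrored into the scalar store.
        value &= MASK128
        self.gpr128[idx] = value
        self.regfile[idx] = value & MASK32

    def read32(self, idx):
        return 0 if idx == 0 else self.regfile[idx]

    def read128(self, idx):
        return 0 if idx == 0 else self.gpr128[idx]

    def invariant_holds(self):
        return all((self.gpr128[i] & MASK32) == self.regfile[i] for i in range(32))


m = GprModel()
m.write128(10, 0x00000000_AABBCCDD_11111111_22222222)
assert m.read32(10) == 0x22222222 and m.invariant_holds()
m.write_scalar(10, 0x1234)  # a scalar write wipes the upper 96 bits
assert m.read128(10) == 0x1234 and m.invariant_holds()
```

Because both write paths update both stores, the invariant holds by construction; a checker like `invariant_holds` is only useful for catching a write site that was missed.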

## RTL — surgical edits in `ee_core_stub.sv`

1. **Declaration + reset** — `logic [127:0] gpr128 [0:31];` next to
   `regfile`. Reset clears all 32 to 128'd0.
2. **Read helpers** — `rs128_val` / `rt128_val` next to `rs_val` /
   `rt_val`, both with the `$0 → 0` guard.
3. **Scalar-write mirrors** — every existing `regfile[X] <= Y` now
   has a paired `gpr128[X] <= {96'd0, Y}`. Sites touched: SYSCALL HLE
   (3), I-type ALU writeback, R-type ALU writeback, MFHI/MFLO,
   JAL/JALR link, MFC0, Ch215 jmp_buf restore (12) + final $v0,
   LW/LB/LBU/LH/LHU load returns. Load path was refactored to compute
   `load_wb` once and write both stores.
4. **MMI 128-bit writeback** — new `rtype_alu128_wb` combinational
   block computes the full 128-bit MMI result for PCPYLD/PSUBB/PNOR/
   PAND/PCPYUD. The R-type writeback site picks between the full
   128-bit value (when `is_mmi_wb`) and the zero-extended scalar
   value (every other R-type op). The existing 32-bit `rtype_alu_wb`
   still lands the correct low 32 into `regfile`.
5. **LQ 4-beat FSM** — `is_lq` now takes a dedicated dispatch arm
   that initializes `sq_beat <= 0` and re-uses S_MEM_REQ/S_MEM_WAIT
   four times. Beat N's `map_rd_addr = ea + N*4`. Each beat captures
   `map_rd_data` into the matching 32-bit lane of `gpr128[rt]`. Last
   beat mirrors `gpr128[rt][31:0]` to `regfile[rt]` and retires once.
   Replaces the Ch279 single-beat LW-style approximation.
6. **SQ/SD per-beat source upgrade** — beats now pull from
   `gpr128[rt][lane]` instead of "low 32 then zero": SQ emits all
   four lanes, SD emits the low two.
7. **PCPYUD decode + arms** — `localparam MMI3_PCPYUD = 5'h0E`,
   `is_pcpyud` decode (MMI3 / sa 0x0E), added to `is_rtype_alu` and
   `is_nop_class` exclusion. Low-32 arm in `rtype_alu_wb` uses
   `rt128_val[95:64]` (= low 32 of $rt's upper doubleword); full
   128-bit arm in `rtype_alu128_wb` is `{rs128[127:64],
   rt128[127:64]}`.
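
The beat sequencing in items 5 and 6 can be sketched in Python as follows. Memory is modelled as a hypothetical dict from word address to 32-bit value, and the function names are illustrative. The lane order (beat N fills bits `[32N+31:32N]`) is an assumption consistent with "beat N reads `ea + N*4`" on a little-endian core; it is not spelled out above.

```python
MASK32 = (1 << 32) - 1
MASK64 = (1 << 64) - 1


def lq_beats(mem, ea):
    """Assemble a 128-bit value from four 32-bit reads at ea, ea+4, ea+8, ea+12."""
    value = 0
    for beat in range(4):
        word = mem.get(ea + beat * 4, 0) & MASK32
        value |= word << (32 * beat)  # beat N fills lane N
    return value


def store_beats(mem, ea, value128, n_beats):
    """Emit n_beats 32-bit writes from the low lanes of value128 (SQ: 4, SD: 2)."""
    for beat in range(n_beats):
        mem[ea + beat * 4] = (value128 >> (32 * beat)) & MASK32


mem = {}
src = 0x33333333_22222222_11111111_00000000
store_beats(mem, 0x1000, src, 4)        # SQ: all four lanes
assert lq_beats(mem, 0x1000) == src     # LQ reads them back intact

store_beats(mem, 0x2000, src, 2)        # SD: low two lanes only
assert lq_beats(mem, 0x2000) == src & MASK64
```

The SQ-then-LQ round trip is the property the old "low 32 then zero" source could not satisfy, since lanes 1 to 3 would have been written as zero.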

## Focused TB — `tb_ee_core_pcpyud.sv`

Three cases:

1. **Exact qbert encoding**, asserted `== 0x704923A9`: `pcpyud $a0,
   $v0, $t1` with $v0 and $t1 set by scalar LUI+ORI (upper halves
   architecturally 0). PCPYUD's low-32 result = 0, exactly what
   qbert sees on every byte-walker iteration.
2. **PCPYLD-then-PCPYUD round-trip.** `pcpyld $t2, $t0, $t1` puts
   $t0[31:0] = 0xAABBCCDD into `gpr128[$t2][95:64]`. `pcpyud $t3,
   $t2, $t2` then extracts $t2's upper-D into both halves of $t3.
   Verified: `regfile[$t3] == 0xAABBCCDD` *and* peeked
   `gpr128[$t3][127:64] == 0x00000000_AABBCCDD`. Proves the gpr128
   shadow is actually carrying upper bits.
3. **PCPYUD with rt=$0.** Exercises the rs-upper-D path alone. $t5
   low = 0, gpr128[$t5][127:64] inherits $t2's upper-D.

Result: `retired=23 halt=1 trap=0 pc=0xbfc00150 errors=0 PASS`.
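
The three cases can be replayed against a small Python model. The field decode follows the standard MIPS R-type layout. `pcpyud` here follows the half assignment exactly as this closeout describes it (`{rs128[127:64], rt128[127:64]}` plus the outcome of case 3), so treat it as a mirror of the RTL as documented rather than an independent reading of the EE instruction manual. Function names and the `$t1` value in case 2 are made up for illustration.

```python
MASK64 = (1 << 64) - 1


def decode_rtype(instr):
    """Split a 32-bit MIPS word into R-type fields."""
    return {
        "opcode": (instr >> 26) & 0x3F,
        "rs": (instr >> 21) & 0x1F,
        "rt": (instr >> 16) & 0x1F,
        "rd": (instr >> 11) & 0x1F,
        "sa": (instr >> 6) & 0x1F,
        "funct": instr & 0x3F,
    }


def pcpyld(rs128, rt128):
    # Upper doubleword <- rs low doubleword, lower doubleword <- rt low doubleword.
    return ((rs128 & MASK64) << 64) | (rt128 & MASK64)


def pcpyud(rs128, rt128):
    # As documented for this RTL: {rs128[127:64], rt128[127:64]}.
    return ((rs128 >> 64) << 64) | (rt128 >> 64)


# Case 1: the qbert word is MMI (0x1C) / MMI3 (0x29) / sa 0x0E with rd=$a0, rs=$v0, rt=$t1.
f = decode_rtype(0x704923A9)
assert (f["opcode"], f["funct"], f["sa"]) == (0x1C, 0x29, 0x0E)
assert (f["rd"], f["rs"], f["rt"]) == (4, 2, 9)
assert pcpyud(0x12345678, 0x9ABCDEF0) == 0  # scalar-built operands: upper halves are 0

# Case 2: PCPYLD lifts $t0's low word into the upper doubleword, PCPYUD brings it back.
t2 = pcpyld(0xAABBCCDD, 0x11223344)
t3 = pcpyud(t2, t2)
assert t3 & 0xFFFFFFFF == 0xAABBCCDD
assert t3 >> 64 == 0x00000000_AABBCCDD

# Case 3: rt = $0, so the low half is 0 and the upper half comes from rs.
t5 = pcpyud(t2, 0)
assert t5 & 0xFFFFFFFF == 0 and t5 >> 64 == 0xAABBCCDD
```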

## Makefile + regression

- `tb_ee_core_pcpyud` target with build + run rules.
- Added to both the PHONY target list (line 407) and the `run:`
  master list (line 2510) — per the dual-list rule.
- Regression: 170 → **171**.

## qbert progression

| Chapter | Blocker | qbert retire_count |
|---------|---------|---------------------|
| Post-Ch281 (PNOR)   | PAND at 0x00112C98   | 27,022 |
| Post-Ch282 (PAND)   | PCPYUD at 0x00112CA0 | 27,024 |
| **Post-Ch283 (PCPYUD)** | **LD at 0x00113378** | **27,067** |

+43 retires past Ch282. qbert finished the byte-walker MMI sequence
(`LQ → PSUBB → PNOR → PAND → PCPYUD → reduce/branch`), returned from
that function, did a chunk of follow-on work, then hit `ld $ra,
0($sp)`, the end-of-function return-address restore. LD is the
read-side of SD and is now the Ch284 candidate.
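
For cross-checking a blocker word by hand, a short Python sketch that splits it into the standard MIPS I-type fields. The helper name is illustrative.

```python
def decode_itype(instr):
    """Split a 32-bit MIPS word into I-type fields: opcode, rs (base), rt, signed imm."""
    imm = instr & 0xFFFF
    if imm & 0x8000:  # sign-extend the 16-bit offset
        imm -= 0x10000
    return (instr >> 26) & 0x3F, (instr >> 21) & 0x1F, (instr >> 16) & 0x1F, imm


opcode, base, rt, offset = decode_itype(0xDFBF0000)
print(f"opcode=0x{opcode:02x} base=${base} rt=${rt} offset={offset}")
assert opcode == 0x37  # LD
assert rt == 31        # $ra is the load target
```

The same helper works for the other load/store blockers in the progression table, since they share the I-type layout.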

Side-effect check: the new full-128-bit LQ feeds real upper-half
data into PCPYUD. qbert advanced through the PCPYUD site and 43 more
instructions, which indicates that the byte-walker's downstream
logic accepted the actual data (not zero) and made a real branch
decision based on it. Snapshot at halt:

- `$a0 = 0x33323130`: ASCII `"0123"` when read in little-endian byte
  order, which strongly suggests qbert is mid-string processing (the
  byte-walker did its job).
- `$v1 = 0x0012c2c6`, `$a1 = 0x0011c326`, `$a2/$a3 = 0x0012c2c0`.

This is the first chapter where the qbert run produces visible
*content-shaped* state (ASCII bytes in registers) rather than just
opcode-blocker telemetry.
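
Reading a register snapshot as text takes a couple of lines of Python. The helper name is illustrative, and little-endian byte order is assumed, matching the EE.

```python
def reg_as_ascii(value32):
    """Render a 32-bit register value as text in little-endian (memory) byte order."""
    return value32.to_bytes(4, "little").decode("ascii", errors="replace")


assert reg_as_ascii(0x33323130) == "0123"
```

Bytes outside the ASCII range are replaced rather than raising, so the helper can be run over pointer-shaped values such as `$v1` without special-casing them.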

## Pattern review (13 chapters)

| Ch  | Blocker      | Edits | Pattern |
|-----|--------------|-------|---------|
| 271 | SQ           | 5     | NEW 4-beat write |
| 272 | DADDU        | 4     | NEW ALU-low-32 |
| 273 | SYSCALL HLE  | 2     | NEW gated dispatcher |
| 274 | BEQL         | 6     | NEW branch+squash |
| 275 | SD           | 7     | REUSE SQ counter |
| 276 | DSLL         | 4     | REUSE DADDU |
| 277 | BNEL         | 6     | REUSE BEQL squash |
| 278 | PCPYLD       | 4     | NEW MMI narrow-decode |
| 279 | LQ           | 5     | REUSE LW path |
| 280 | PSUBB        | 5     | REUSE MMI narrow (byte-SIMD new) |
| 281 | PNOR         | 5     | REUSE MMI narrow + NOR arm |
| 282 | PAND         | 5     | REUSE MMI narrow + AND arm |
| **283** | **PCPYUD + gpr128**  | **architectural** | **NEW 128-bit shadow** |

Ch283 breaks the surgical-one-opcode cadence because it has to: this
is the first blocker that the "low-32-only" approximation could not
absorb. The MMI narrow-decode pattern from Ch278 still works
(PCPYUD adds the same 3-way `is_mmi` + func + sa decode), but the
*writeback* now needs full-128 storage, which retroactively forced
LQ/SQ/SD/PCPYLD/PSUBB/PNOR/PAND to also flow through `gpr128`.

That's a one-time investment. Future MMI ops that need upper bits
(PCPYH, PINTEH, PCEQB, PMADDH, etc.) can ride the existing seam:
read `rs128_val`/`rt128_val`, write `rtype_alu128_wb`. Adding
further upper-half ops should need no more architectural work.
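
As an illustration of what riding the seam could look like for one of those ops, here is a Python sketch of PCEQB (per-byte compare-equal, each result byte 0xFF or 0x00). It is a behavioural model only: the function names and the split into a 128-bit result plus a low-32 mirror are assumptions that echo `rtype_alu128_wb` / `rtype_alu_wb`, not code from the repository.

```python
MASK32 = (1 << 32) - 1


def pceqb(rs128, rt128):
    """Compare 16 byte lanes; a lane is 0xFF where the bytes match, else 0x00."""
    result = 0
    for lane in range(16):
        a = (rs128 >> (8 * lane)) & 0xFF
        b = (rt128 >> (8 * lane)) & 0xFF
        if a == b:
            result |= 0xFF << (8 * lane)
    return result


def writeback(result128):
    """Return (value for gpr128[rd], value for regfile[rd]), keeping the low-32 invariant."""
    return result128, result128 & MASK32


rs = (0x41424344 << 96) | 0x00000030
rt = (0x41420044 << 96) | 0x00000031
full, low = writeback(pceqb(rs, rt))
assert low == 0xFFFFFF00          # only byte lane 0 differs in the low word
assert full >> 96 == 0xFFFF00FF   # byte lane 13 differs in the top word
```

The op itself only sees the two 128-bit operands and returns one 128-bit result; the low-32 mirror falls out of the writeback step, which is the part Ch283 already paid for.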

## Files changed

- `rtl/ee/ee_core_stub.sv` — declarations + 36 scalar-write mirrors
  + MMI 128-bit writeback + PCPYUD decode + LQ 4-beat FSM + SQ/SD
  per-beat sources.
- `sim/tb/integration/tb_ee_core_pcpyud.sv` — new focused TB.
- `sim/Makefile` — target + both regression lists.

## Regression

**171/171 PASS** (was 170/170 in Ch282).