retroDE_ps2/docs/ch280_closeout.md

# Ch280 closeout — PSUBB byte-wise SIMD; next blocker is PNOR

**Status:** Closed. **Verdict from re-running qbert.elf:**
`elf_first_unsupported_opcode (pc=0x00112C94 instr=0x70091CE9)` —
opcode `0x1C` (MMI) + funct `0x29` (MMI3) + sa `0x13` = **PNOR**
(Parallel Not-OR). qbert's byte-walker advanced past PSUBB on the
first try.

## Numbers

| Chapter | Blocker | qbert retire_count |
|---------|---------|---------------------|
| Post-Ch278 (PCPYLD) | LQ at 0x00112C88 | 27,018 |
| Post-Ch279 (LQ) | PSUBB at 0x00112C90 | 27,020 |
| **Post-Ch280 (PSUBB)** | **PNOR at 0x00112C94** | **27,021** |

The 1-retire delta is PSUBB itself retiring; PNOR at 0x00112C94 is the very next instruction.
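As a quick sanity check on the table, the retire-count deltas can be compared against the PC deltas between successive blockers. This is a minimal sketch that assumes straight-line execution between the trap PCs (no branches taken in that window), which the numbers above are consistent with; the function name `retire_vs_pc_deltas` is hypothetical.

```python
INSTR_BYTES = 4  # fixed-width 32-bit MIPS instructions

# (label, trap PC, qbert retire_count) copied from the table above
HISTORY = [
    ("Post-Ch278", 0x00112C88, 27018),
    ("Post-Ch279", 0x00112C90, 27020),
    ("Post-Ch280", 0x00112C94, 27021),
]

def retire_vs_pc_deltas(history):
    """Return (label, pc_steps, retire_delta) for each consecutive pair."""
    rows = []
    for (_, pc0, r0), (label, pc1, r1) in zip(history, history[1:]):
        pc_steps = (pc1 - pc0) // INSTR_BYTES  # instructions between trap PCs
        rows.append((label, pc_steps, r1 - r0))
    return rows

for label, pc_steps, retire_delta in retire_vs_pc_deltas(HISTORY):
    print(label, pc_steps, retire_delta, pc_steps == retire_delta)
```

Both rows agree (2 and 2, then 1 and 1), so each chapter moved the trap forward by exactly the instructions it unblocked.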

## What landed

### RTL — 5 surgical edits in `ee_core_stub.sv`

1. **Constants**: `FUNC_MMI0 = 6'h08` and `MMI0_PSUBB = 5'h09`.
2. **Decode**: `is_psubb = is_mmi && (func == FUNC_MMI0) &&
   (shamt == MMI0_PSUBB)`. Three-way AND keeps the decode narrow
   — any other op=0x1C/funct=0x08 sub-instruction (PADDW, PADDH,
   PADDB, ...) continues to strict-trap.
3. **`is_rtype_alu` group**: added `is_psubb`.
4. **`rtype_alu_wb` arm**: 4 independent byte subtracts:
   ```sv
   else if (is_psubb) begin
       rtype_alu_wb[ 7: 0] = rs_val[ 7: 0] - rt_val[ 7: 0];
       rtype_alu_wb[15: 8] = rs_val[15: 8] - rt_val[15: 8];
       rtype_alu_wb[23:16] = rs_val[23:16] - rt_val[23:16];
       rtype_alu_wb[31:24] = rs_val[31:24] - rt_val[31:24];
   end
   ```
   Each lane wraps modulo 256 on its own; no borrow crosses a byte boundary.
5. **`is_nop_class` allow**: `!is_psubb` added.

5 LOC of real change.
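The narrow decode in step 2 can be mirrored in a small Python reference model. This is only a sketch of the matching logic, not the RTL: `OP_MMI`, `with_shamt`, and the two `QBERT_*` constants are names introduced here for illustration, while the field values come from this document.

```python
OP_MMI = 0x1C       # primary opcode shared by all MMI instructions
FUNC_MMI0 = 0x08    # funct value selecting the MMI0 sub-group
MMI0_PSUBB = 0x09   # sa value selecting PSUBB within MMI0

def is_psubb(instr: int) -> bool:
    """Three-way match on opcode, funct and sa, like the RTL is_psubb."""
    opcode = (instr >> 26) & 0x3F
    func = instr & 0x3F
    shamt = (instr >> 6) & 0x1F
    return opcode == OP_MMI and func == FUNC_MMI0 and shamt == MMI0_PSUBB

def with_shamt(instr: int, shamt: int) -> int:
    """Hypothetical helper: return instr with its sa field (bits 10:6) replaced."""
    return (instr & ~(0x1F << 6)) | ((shamt & 0x1F) << 6)

QBERT_PSUBB = 0x712A1248  # PSUBB word asserted in the focused TB
QBERT_PNOR = 0x70091CE9   # the next blocker (MMI3, not MMI0)

print(is_psubb(QBERT_PSUBB))                    # True
print(is_psubb(with_shamt(QBERT_PSUBB, 0x08)))  # False: same MMI0 group, other sa
print(is_psubb(QBERT_PNOR))                     # False: different funct
```

Only the exact (opcode, funct, sa) triple matches, which is why neighbouring MMI0 encodings still fall through to the strict trap.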

### Focused TB — `tb_ee_core_psubb.sv`

Three cases:

1. **Distinct lanes (qbert encoding shape)**: `$t1 = 0x10203040`,
   `$t2 = 0x01020304` → `$v0 = 0x0F1E2D3C`. Encoder-output
   asserted to equal `0x712A1248` (qbert's literal instruction).
2. **All-wrap**: `$t3 = 0`, `$t4 = 0x01020304` → `$t5 = 0xFFFEFDFC`.
   Shows all 4 byte lanes wrapping independently (0xFF, 0xFE, 0xFD, 0xFC).
3. **No cross-byte borrow**: `$t6 = 0x12345600`, `$t7 = 0x00000001`
   → `$t8 = 0x123456FF`. The low byte borrows (0x00 - 0x01 =
   0xFF) but **must not propagate into byte 1**. Byte 1 stays
   at 0x56 (= 0x56 - 0x00). This is the critical SIMD property.

Result: `retired=28 halt=1 trap=0 pc=0xbfc00164 errors=0 PASS`.
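The three vectors can be cross-checked against a Python golden model of the low-32-bit PSUBB behaviour. This is a sketch under the document's 32-bit model (4 byte lanes; the architectural instruction operates on a full 128-bit register), and `psubb32` is a name made up for this example.

```python
def psubb32(rs_val: int, rt_val: int) -> int:
    """Subtract rt from rs per byte lane, each lane wrapping modulo 256."""
    result = 0
    for lane in range(4):
        shift = 8 * lane
        a = (rs_val >> shift) & 0xFF
        b = (rt_val >> shift) & 0xFF
        result |= ((a - b) & 0xFF) << shift  # mask drops the borrow
    return result

# (rs, rt, expected) taken from the three focused-TB cases
VECTORS = [
    (0x10203040, 0x01020304, 0x0F1E2D3C),  # distinct lanes
    (0x00000000, 0x01020304, 0xFFFEFDFC),  # all-wrap
    (0x12345600, 0x00000001, 0x123456FF),  # no cross-byte borrow
]

for rs_val, rt_val, expected in VECTORS:
    got = psubb32(rs_val, rt_val)
    print(f"{rs_val:08X} - {rt_val:08X} -> {got:08X}", got == expected)

# Contrast: a plain 32-bit subtract lets the borrow ripple into byte 1.
print(f"{(0x12345600 - 0x00000001) & 0xFFFFFFFF:08X}")  # 123455FF
```

The last line shows what case 3 guards against: a scalar subtract yields `0x123455FF`, whereas the lane-wise result keeps byte 1 at 0x56.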

### Makefile + regression

- `tb_ee_core_psubb` target.
- Added to both regression lists.
- Regression: 167 → **168**.

## Recommendation for Codex's Ch281 — PNOR

`0x70091CE9` at PC `0x00112C94`:
- opcode 0x1C (MMI)
- funct 0x29 (MMI3 sub-group)
- sa 0x13 (PNOR within MMI3)
- rs=$zero, rt=$t1, rd=$v1
- → `pnor $v1, $0, $t1`
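The field breakdown above can be reproduced with a few shifts and masks. The sketch below assumes the standard MIPS R-type layout (opcode, rs, rt, rd, sa, funct); `decode_fields` and `encode_rtype` are illustrative names, not helpers from the repo.

```python
def decode_fields(instr: int) -> dict:
    """Split a 32-bit MIPS R-type word into its six fields."""
    return {
        "opcode": (instr >> 26) & 0x3F,
        "rs": (instr >> 21) & 0x1F,
        "rt": (instr >> 16) & 0x1F,
        "rd": (instr >> 11) & 0x1F,
        "sa": (instr >> 6) & 0x1F,
        "funct": instr & 0x3F,
    }

def encode_rtype(opcode: int, rs: int, rt: int, rd: int, sa: int, funct: int) -> int:
    """Inverse of decode_fields: pack the six fields back into one word."""
    return (opcode << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (sa << 6) | funct

fields = decode_fields(0x70091CE9)
print({name: hex(value) for name, value in fields.items()})
# {'opcode': '0x1c', 'rs': '0x0', 'rt': '0x9', 'rd': '0x3', 'sa': '0x13', 'funct': '0x29'}
print(hex(encode_rtype(**fields)))  # 0x70091ce9
```

Register 9 and register 3 correspond to `$t1` and `$v1` in the naming this document uses, and the round trip back to `0x70091CE9` is the same check the focused TB would make on its encoder output.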

Architectural: 128-bit `rd = ~(rs | rt)`. For our 32-bit model:
`$rd[31:0] = ~($rs[31:0] | $rt[31:0])` — **bit-identical to the
existing standard NOR** (SPECIAL funct 0x27). The only difference
between PNOR and NOR is the architectural width.

With `rs = $zero`, PNOR reduces to a bitwise NOT, the parallel counterpart
of the MIPS `not` pseudo-instruction (itself a NOR against `$zero`):
`pnor $rd, $0, $rt` ≡ `not $rd, $rt` on the low 32 bits.

Implementation outline (mirrors Ch278 PCPYLD + Ch280 PSUBB):

1. `localparam FUNC_MMI3 = 6'h29`.
2. `localparam MMI3_PNOR = 5'h13`.
3. `is_pnor = is_mmi && (func == FUNC_MMI3) && (shamt == MMI3_PNOR)`.
4. Add to `is_rtype_alu`.
5. **Reuse the existing NOR writeback arm**:
   ```sv
   else if (is_nor || is_pnor) rtype_alu_wb = ~(rs_val | rt_val);
   ```
6. Add `!is_pnor` to `is_nop_class` allow-list.

~4 LOC.

Focused TB:
- Exact qbert encoding asserted == `0x70091CE9`.
- NOT-of-zero: `pnor $rd, $0, $0` → `$rd = 0xFFFFFFFF`.
- NOT-of-pattern: `pnor $rd, $0, $rt` with `$rt = 0xAAAAAAAA` → `$rd = 0x55555555`.
- General NOR: `pnor $rd, $rs, $rt` with `$rs = 0xF0F0F0F0`, `$rt = 0x0F0F0F0F` → `$rd = 0`.
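The proposed value checks can be pre-validated with a Python model of the low-32-bit result. This is a sketch under the 32-bit model described above; `nor32` and `pnor32` are hypothetical names, and aliasing one to the other simply reflects the shared writeback arm proposed in step 5.

```python
MASK32 = 0xFFFFFFFF

def nor32(rs_val: int, rt_val: int) -> int:
    """Low 32 bits of ~(rs | rt)."""
    return ~(rs_val | rt_val) & MASK32

# In the 32-bit model PNOR and NOR share one writeback expression.
pnor32 = nor32

# (rs, rt, expected) for the three value checks listed above
VECTORS = [
    (0x00000000, 0x00000000, 0xFFFFFFFF),  # NOT-of-zero
    (0x00000000, 0xAAAAAAAA, 0x55555555),  # NOT-of-pattern
    (0xF0F0F0F0, 0x0F0F0F0F, 0x00000000),  # general NOR
]

for rs_val, rt_val, expected in VECTORS:
    got = pnor32(rs_val, rt_val)
    print(f"pnor({rs_val:08X}, {rt_val:08X}) = {got:08X}", got == expected)

# With rs = 0 the operation is a plain bitwise NOT of rt.
x = 0x12345678
print(pnor32(0, x) == (x ^ MASK32))  # True
```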

**Likely follow-ons after PNOR**: byte-walker reductions like
**PMFHL** (move from HI/LO), or another mask op like **PAND**
(MMI2 sa=0x12) / **POR** (MMI3 sa=0x12). Codex may want to
consider folding the bitwise MMI family (PAND/POR/PXOR/PNOR) into
one chapter since they're all reuses of existing ALU arms.
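If the bitwise family is folded into one chapter, the decode could be modelled as a single lookup keyed on (funct, sa). In this sketch the PAND, POR and PNOR encodings come from this document, while the MMI2 funct value `0x09` and the PXOR slot (MMI2 sa=0x13) are taken from the R5900 opcode map and should be verified before use. `BITWISE_MMI` and `exec_bitwise_mmi` are hypothetical names.

```python
OP_MMI = 0x1C
FUNC_MMI2 = 0x09  # assumption: not stated in this document
FUNC_MMI3 = 0x29
MASK32 = 0xFFFFFFFF

# (funct, sa) -> (mnemonic, low-32-bit operation)
BITWISE_MMI = {
    (FUNC_MMI2, 0x12): ("PAND", lambda a, b: a & b),
    (FUNC_MMI2, 0x13): ("PXOR", lambda a, b: a ^ b),  # assumption: sa per R5900 map
    (FUNC_MMI3, 0x12): ("POR", lambda a, b: a | b),
    (FUNC_MMI3, 0x13): ("PNOR", lambda a, b: ~(a | b)),
}

def exec_bitwise_mmi(instr: int, rs_val: int, rt_val: int):
    """Return (mnemonic, result) for a bitwise MMI op, or None to strict-trap."""
    if (instr >> 26) & 0x3F != OP_MMI:
        return None
    entry = BITWISE_MMI.get((instr & 0x3F, (instr >> 6) & 0x1F))
    if entry is None:
        return None  # anything outside the table keeps trapping
    name, op = entry
    return name, op(rs_val, rt_val) & MASK32

print(exec_bitwise_mmi(0x70091CE9, 0x00000000, 0xAAAAAAAA))  # ('PNOR', 1431655765)
print(exec_bitwise_mmi(0x712A1248, 1, 2))                    # None (PSUBB is not bitwise)
```

Keeping the table explicit preserves the narrow-decode property: any (funct, sa) pair not listed still returns `None`, the model's stand-in for a strict trap.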

## Files changed

- `rtl/ee/ee_core_stub.sv` — 5 surgical edits.
- `sim/tb/integration/tb_ee_core_psubb.sv` — new focused TB.
- `sim/Makefile` — target + both regression lists.

## Regression

In flight; expected **168/168**.

## Pattern review (10 qbert chapters)

| Ch | Blocker | Edits | Pattern |
|----|---------|-------|---------|
| 271 | SQ (first blocker) | 5 | NEW 4-beat write |
| 272 | DADDU | 4 | NEW ALU-low-32 |
| 273 | SYSCALL HLE | 2 | NEW gated dispatcher |
| 274 | BEQL | 6 | NEW branch+squash |
| 275 | SD | 7 | REUSE SQ counter |
| 276 | DSLL | 4 | REUSE DADDU |
| 277 | BNEL | 6 | REUSE BEQL squash |
| 278 | PCPYLD | 4 | NEW MMI narrow-decode |
| 279 | LQ | 5 | REUSE LW path |
| **280** | **PSUBB** | **5** | **REUSE MMI narrow (byte-SIMD)** |

10 chapters in, qbert at 27,021 retires, regression at 168.
SIMD byte-walker pattern is locking in: LQ → PSUBB → PNOR
(likely → PMFHL → branch). Each chapter is now ~4-5 LOC + a
TB; cadence holds at sub-half-day per chapter.