# Ch277 closeout — BNEL squash-on-not-taken; qbert hits MMI (PCPYLD) one instruction later

**Status:** Closed. **Verdict from re-running qbert.elf:**
`elf_first_unsupported_opcode (pc=0x00112C84 instr=0x71295389)`:
opcode `0x1C` (R5900 EE **MMI**) + funct `0x09` (MMI2 sub-group)
+ sa `0x0E` = **PCPYLD** (Parallel Copy Lower Doubleword). qbert
executed the BNEL correctly: the branch was not taken, so the
delay slot was squashed and the PC went `0x00112C7C` →
`0x00112C84` (+8 bytes, skipping the slot at `0x00112C80`). It
then trapped on the very next instruction, the MMI/PCPYLD.

## Numbers

| Chapter | Blocker | qbert retire_count |
|---------|---------|---------------------|
| Post-Ch274 (BEQL) | SD at 0x00112DAC | 26,985 |
| Post-Ch275 (SD) | DSLL at 0x00112C54 | 27,006 |
| Post-Ch276 (DSLL) | BNEL at 0x00112C7C | 27,016 |
| **Post-Ch277 (BNEL)** | **PCPYLD at 0x00112C84** | **27,017** |

A 1-retire delta: BNEL itself retired (the squash path), its
squashed delay slot did not, and PCPYLD trapped before retiring.

## What landed

### RTL — surgical edits in `ee_core_stub.sv`

1. `localparam OP_BNEL = 6'h15` alongside `OP_BNE`/`OP_BEQL`.
2. `is_bnel` decode signal.
3. Added `is_bnel` to the `is_branch` group.
4. Extended `branch_taken` with `(is_bnel && (rs_val != rt_val))`.
5. **Generalized the squash signal**: renamed `is_beql_squash`
   to `is_branch_likely_squash`. A branch-likely squashes its
   delay slot on the NOT-TAKEN condition, so the signal now covers
   BEQL (squash on `rs != rt`) and BNEL (squash on `rs == rt`):

   ```sv
   assign is_branch_likely_squash =
           (is_beql && (rs_val != rt_val))   // Ch274 — BEQL not-taken
        || (is_bnel && (rs_val == rt_val));  // Ch277 — BNEL not-taken
   ```

   `retire_advance` updated to reference the new name. Adding
   BLEZL/BGTZL/REGIMM-likely later is now a one-line OR-extension.
6. Added `!is_bnel` to the `is_nop_class` allow-list.

About 6 LOC of real change. Pure pattern-reuse from Ch274.
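
The squash rule above can be sanity-checked against a small host-side reference model. This is a sketch in Python, not the RTL: `pcs_after_branch` is a hypothetical helper, the operand values are illustrative, and the branch target `0x00112C00` is made up for the example. It returns the PCs fetched after the branch, so a squashed delay slot shows up as a single `pc + 8` entry.

```python
LIKELY = {"BEQL", "BNEL"}
KNOWN = {"BEQ", "BNE"} | LIKELY


def pcs_after_branch(op, rs_val, rt_val, pc, target):
    """Return the list of PCs fetched after a branch at `pc`.

    Plain branches always execute the delay slot at pc + 4.
    Branch-likely variants squash it when the branch is NOT taken.
    """
    if op not in KNOWN:
        raise ValueError(f"unknown branch op: {op}")
    equal = rs_val == rt_val
    taken = equal if op in ("BEQ", "BEQL") else not equal
    squash = op in LIKELY and not taken
    if squash:
        return [pc + 8]                       # delay slot skipped
    return [pc + 4, target if taken else pc + 8]


# Assumed operands: equal values make BNEL not-taken.
pcs = pcs_after_branch("BNEL", 3, 3, 0x00112C7C, 0x00112C00)
print([f"0x{p:08X}" for p in pcs])            # ['0x00112C84']
```

With the same operands, `"BNE"` yields two entries (`pc + 4`, then `pc + 8`), which is the difference the focused TB's third case is designed to expose.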

### Focused TB — `tb_ee_core_bnel.sv`

Three cases mirroring `tb_ee_core_beql`:

1. **BNEL TAKEN** (`$t0 = 5`, `$t1 = 7`, differ → taken): branch
   reaches target; delay slot executes (writes a sentinel into
   `$t5`). Cross-check: `$t6 = 0xCAFE` at target.
2. **BNEL NOT-TAKEN** (`$t2 = $t3 = 3`, equal → squash): delay
   slot squashed. Inline BNE chain verifies `$t5` stays at
   `0xBEEF0000` (the OR-INTO probe didn't execute). `$t7 = 0x2222`
   at PC+8.
3. **BNE NOT-TAKEN cross-check** (same operands): plain BNE's
   delay slot DOES execute → `$t5 = 0xBABE0CAB`. Proves BNEL
   differs.

Result: `retired=21 halt=1 trap=0 pc=0xbfc00158 errors=0 PASS`.

### Makefile + regression

- `tb_ee_core_bnel` target.
- Added to both PHONY list and `run:` master.
- Regression: 164 → **165**.

## Recommendation for Codex's Ch278 — PCPYLD (MMI2)

**`pcpyld $t2, $t1, $t1`** at PC `0x00112C84`, instr `0x71295389`.

Decoded:
- opcode `0x1C` (MMI prefix)
- funct `0x09` (MMI2 sub-group selector)
- sa `0x0E` (PCPYLD sub-instruction)
- rs `9` (`$t1`), rt `9` (`$t1`), rd `10` (`$t2`)

With rs = rt, a real EE would copy the lower 64 bits of `$t1`
into both halves of `$t2`.
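
As a cross-check of the field split, the instruction word can be decoded with a few shifts and masks. The sketch below is a host-side Python helper, not part of the RTL; `decode_fields` and `is_pcpyld` are hypothetical names, and the only assumption is the standard MIPS R-type field layout.

```python
def decode_fields(instr: int) -> dict:
    """Split a 32-bit MIPS instruction word into its R-type fields."""
    return {
        "opcode": (instr >> 26) & 0x3F,   # bits 31:26
        "rs":     (instr >> 21) & 0x1F,   # bits 25:21
        "rt":     (instr >> 16) & 0x1F,   # bits 20:16
        "rd":     (instr >> 11) & 0x1F,   # bits 15:11
        "sa":     (instr >> 6) & 0x1F,    # bits 10:6
        "funct":  instr & 0x3F,           # bits 5:0
    }


def is_pcpyld(instr: int) -> bool:
    """True for MMI (0x1C) + MMI2 (funct 0x09) + PCPYLD (sa 0x0E)."""
    f = decode_fields(instr)
    return f["opcode"] == 0x1C and f["funct"] == 0x09 and f["sa"] == 0x0E


fields = decode_fields(0x71295389)
for name, value in fields.items():
    print(f"{name:6s} = 0x{value:02X}")
print("is_pcpyld:", is_pcpyld(0x71295389))
```

The same three-field match (opcode, funct, sa) is what an `is_pcpyld` decode signal would need to test.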

PCPYLD architectural semantics (R5900 EE, 128-bit MMI):
```
rd[127:64] = rs[63:0]    // upper 64 of rd = lower 64 of rs
rd[63:0]   = rt[63:0]    // lower 64 of rd = lower 64 of rt
```

For our **32-bit register model**:
- We can't represent `rd[127:64]` (no upper bits).
- `rd[63:0] = rt[63:0]` collapses to `$rd[31:0] = $rt[31:0]`
  (lower 32 bits).
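
To make the approximation concrete, here is a minimal Python reference model of both views. It is a sketch under stated assumptions: `pcpyld_128` follows the architectural semantics quoted above, `pcpyld_32` is a hypothetical stand-in for the 32-bit register model, and the operand values are arbitrary test patterns.

```python
MASK64 = (1 << 64) - 1
MASK32 = (1 << 32) - 1


def pcpyld_128(rs: int, rt: int) -> int:
    """Architectural result: rd[127:64] = rs[63:0], rd[63:0] = rt[63:0]."""
    return ((rs & MASK64) << 64) | (rt & MASK64)


def pcpyld_32(rt: int) -> int:
    """32-bit-model approximation: only rt's low word reaches rd."""
    return rt & MASK32


# Arbitrary 128-bit patterns, chosen so each half is recognizable.
rs = 0x1111_2222_3333_4444_5555_6666_7777_8888
rt = 0xAAAA_BBBB_CCCC_DDDD_EEEE_FFFF_0123_4567

full = pcpyld_128(rs, rt)
print(f"128-bit rd = 0x{full:032X}")
print(f"32-bit  rd = 0x{pcpyld_32(rt):08X}")

# The approximation agrees with the real result on the low 32 bits only.
assert full & MASK32 == pcpyld_32(rt)
```

Any later instruction that reads bits above 31 of `$rd` would diverge from real hardware under this model, which is why the approximation should be called out in the RTL.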

**Minimal Ch278 scope**:
1. Decode the MMI2/PCPYLD path: opcode `0x1C` + funct `0x09` +
   sa `0x0E` → set `is_pcpyld`.
2. Add to `is_rtype_alu` group.
3. In `rtype_alu_wb`: `else if (is_pcpyld) rtype_alu_wb = rt_val;`
   (low 32 bits of $rt → $rd).
4. Add `!is_pcpyld` to `is_nop_class` allow-list.

Document the approximation explicitly in the RTL: upper bits of
$rd (which would carry $rs's lower 64 in a real EE) are not
modelled. For qbert's specific call pattern at this PC, the
data being shuffled is likely 128-bit packed bytes for a
strlen-style byte-walker (`$a0 = 0x80808080` is the classic
"detect high bit per byte" mask); the **low 32 bits** are the
relevant observable.

**Important Codex caution**: do NOT NOP-class the entire MMI
opcode (`0x1C`). MMI has ~80 sub-instructions (MMI0/MMI1/MMI2/
MMI3 sub-tables); some are real data movement (PCPYLD, PCPYUD,
PCPYH), some are arithmetic (PADDB, PSUBB, PMULTW), some are
SIMD compares (PCEQB, PCEQH). Each needs its own decode arm or
careful approximation. The qbert track is fine with one
sub-instruction per chapter — same incremental cadence we've
maintained throughout.
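
One way to picture the "decode arm per sub-instruction" policy is a classifier that names only what has been implemented and traps on everything else under opcode `0x1C`. The Python below is an illustrative sketch, not the RTL decode: `classify_mmi`, `UnsupportedOpcode`, and the `IMPLEMENTED_MMI2` table are hypothetical names.

```python
MMI_OPCODE = 0x1C
MMI2_FUNCT = 0x09

# Grows by one entry per chapter; anything absent must trap, not NOP.
IMPLEMENTED_MMI2 = {0x0E: "PCPYLD"}


class UnsupportedOpcode(Exception):
    """Raised for MMI encodings that have no decode arm yet."""


def classify_mmi(instr: int) -> str:
    opcode = (instr >> 26) & 0x3F
    if opcode != MMI_OPCODE:
        raise ValueError("not an MMI instruction")
    funct = instr & 0x3F
    sa = (instr >> 6) & 0x1F
    if funct == MMI2_FUNCT and sa in IMPLEMENTED_MMI2:
        return IMPLEMENTED_MMI2[sa]
    raise UnsupportedOpcode(
        f"MMI funct=0x{funct:02X} sa=0x{sa:02X} instr=0x{instr:08X}"
    )


print(classify_mmi(0x71295389))
```

Trapping on unknown sub-instructions keeps the `elf_first_unsupported_opcode` signal meaningful: the next blocker is reported at its PC instead of silently corrupting state.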

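**Likely follow-ons** after PCPYLD: any other MMI op qbert's
byte-walker uses. Common candidates given the `0x80808080`
sentinel: **PCEQB** (parallel compare equal byte, MMI1
sub-group) and **PMFHL** (parallel move from HI/LO, encoded
directly in the MMI funct field). Neither lives in MMI2, so each
will need its own sub-group decode.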

## Files changed

- `rtl/ee/ee_core_stub.sv` — 6 surgical edits.
- `sim/tb/integration/tb_ee_core_bnel.sv` — new focused TB.
- `sim/Makefile` — target + both regression lists.

## Regression

In flight; expected **165/165**.

## Pattern review

Seven qbert chapters (Ch271–Ch277). The qbert-driven track keeps
producing one chapter per blocker at sub-half-day cadence:

| Chapter | Instruction landed | retire_count after |
|---------|--------------------|--------------------|
| Ch271 | SQ | 26,958 (up from the initial 12) |
| Ch272 | DADDU | 26,960 |
| Ch273 | SYSCALL HLE | 26,980 |
| Ch274 | BEQL | 26,985 |
| Ch275 | SD | 27,006 |
| Ch276 | DSLL | 27,016 |
| **Ch277** | **BNEL** | **27,017** |
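
The per-chapter gain is easier to see as deltas. The short Python sketch below recomputes them from the retire counts in the table; the list name `retire_after` is just an illustrative label.

```python
# retire_count after each chapter, copied from the table above.
retire_after = [
    ("Ch271 SQ", 26_958),
    ("Ch272 DADDU", 26_960),
    ("Ch273 SYSCALL HLE", 26_980),
    ("Ch274 BEQL", 26_985),
    ("Ch275 SD", 27_006),
    ("Ch276 DSLL", 27_016),
    ("Ch277 BNEL", 27_017),
]

# Pair each chapter with its predecessor and subtract.
deltas = [
    (name, count - prev)
    for (_, prev), (name, count) in zip(retire_after, retire_after[1:])
]

for name, delta in deltas:
    print(f"{name}: +{delta}")
```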

The MMI surface (PCPYLD and likely siblings) will broaden the
opcode count quickly — that's expected when a real program
starts using SIMD-style operations for stdlib-class work.