# Ch277 closeout: BNEL squash-on-not-taken; qbert hits MMI (PCPYLD) one instruction later

**Status:** Closed.

**Verdict from re-running qbert.elf:** `elf_first_unsupported_opcode (pc=0x00112C84 instr=0x71295389)`: opcode `0x1C` (R5900 EE **MMI**) + funct `0x09` (MMI2 sub-group) + sa `0x0E` = **PCPYLD** (Parallel Copy Lower Doubleword).

qbert ran the BNEL correctly (squashed not-taken: PC went 0xC7C → 0xC84 = +8 bytes, confirming the squash path), then trapped on the very next instruction, an MMI/PCPYLD.

## Numbers

| Chapter | Blocker | qbert retire_count |
|---------|---------|---------------------|
| Post-Ch274 (BEQL) | SD at 0x00112DAC | 26,985 |
| Post-Ch275 (SD) | DSLL at 0x00112C54 | 27,006 |
| Post-Ch276 (DSLL) | BNEL at 0x00112C7C | 27,016 |
| **Post-Ch277 (BNEL)** | **PCPYLD at 0x00112C84** | **27,017** |

The delta is a single retire: BNEL itself retired (via the squash path), then PCPYLD trapped before retiring.

## What landed

### RTL: surgical edits in `ee_core_stub.sv`

1. `localparam OP_BNEL = 6'h15` alongside `OP_BNE`/`OP_BEQL`.
2. `is_bnel` decode signal.
3. Added `is_bnel` to the `is_branch` group.
4. Extended `branch_taken` with `(is_bnel && (rs_val != rt_val))`.
5. **Generalized the squash signal**: renamed `is_beql_squash` to `is_branch_likely_squash`. A branch-likely squashes its delay slot on the NOT-TAKEN condition, so the signal now covers BEQL (squash on `rs != rt`) and BNEL (squash on `rs == rt`):

   ```sv
   assign is_branch_likely_squash =
       (is_beql && (rs_val != rt_val))     // Ch274: BEQL not-taken
    || (is_bnel && (rs_val == rt_val));    // Ch277: BNEL not-taken
   ```

   `retire_advance` updated to reference the new name. Adding BLEZL/BGTZL/REGIMM-likely later is now a one-line OR-extension.
6. Added `!is_bnel` to the `is_nop_class` allow-list.

About 6 LOC of real change. Pure pattern reuse from Ch274.

### Focused TB: `tb_ee_core_bnel.sv`

Three cases mirroring `tb_ee_core_beql`:

1. **BNEL TAKEN** (`$t0 = 5`, `$t1 = 7`, differ → taken): branch reaches target; delay slot executes (writes a sentinel into `$t5`). Cross-check: `$t6 = 0xCAFE` at target.
2. **BNEL NOT-TAKEN** (`$t2 = $t3 = 3`, equal → squash): delay slot squashed. Inline BNE chain verifies `$t5` stays at `0xBEEF0000` (the OR-INTO probe didn't execute). `$t7 = 0x2222` at PC+8.
3. **BNE NOT-TAKEN cross-check** (same operands): plain BNE's delay slot DOES execute → `$t5 = 0xBABE0CAB`. Proves BNEL differs.

Result: `retired=21 halt=1 trap=0 pc=0xbfc00158 errors=0 PASS`.

### Makefile + regression

- `tb_ee_core_bnel` target.
- Added to both PHONY list and `run:` master.
- Regression: 164 → **165**.

## Recommendation for Codex's Ch278: PCPYLD (MMI2)

**`pcpyld $t2, $t1, $t1`** at PC `0x00112C84`, instr `0x71295389`. Decoded:

- opcode `0x1C` (MMI prefix)
- funct `0x09` (MMI2 sub-group selector)
- sa `0x0E` (PCPYLD sub-instruction)
- rs `9` (`$t1`), rt `9` (`$t1`), rd `10` (`$t2`)

PCPYLD architectural semantics (R5900 EE, 128-bit MMI):

```
rd[127:64] = rs[63:0]   // upper 64 of rd = lower 64 of rs
rd[63:0]   = rt[63:0]   // lower 64 of rd = lower 64 of rt
```

For our **32-bit register model**:

- We can't represent `rd[127:64]` (no upper bits).
- `rd[63:0] = rt[63:0]` collapses to `$rd[31:0] = $rt[31:0]` (lower 32 bits).

**Minimal Ch278 scope**:

1. Decode the MMI2/PCPYLD path: opcode `0x1C` + funct `0x09` + sa `0x0E` → set `is_pcpyld`.
2. Add to `is_rtype_alu` group.
3. In `rtype_alu_wb`: `else if (is_pcpyld) rtype_alu_wb = rt_val;` (low 32 bits of $rt → $rd).
4. Add `!is_pcpyld` to `is_nop_class` allow-list.

Document the approximation explicitly in the RTL: upper bits of $rd (which would carry $rs's lower 64 in a real EE) are not modelled. For qbert's specific call pattern at this PC, the data being shuffled is likely 128-bit packed bytes for a strlen-style byte-walker (`$a0 = 0x80808080` is the classic "detect high bit per byte" mask); the **low 32 bits** are the relevant observable.
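As a cross-check for the decode and the 32-bit approximation above, here is a minimal Python reference model. It is a sketch, not part of the RTL or the testbench: the function names (`decode_fields`, `is_pcpyld`, `pcpyld_128`, `pcpyld_32`) are hypothetical, and it assumes the standard MIPS R-type field layout.

```python
# Hypothetical golden-model sketch; none of these names exist in the repo.
MASK32 = (1 << 32) - 1
MASK64 = (1 << 64) - 1


def decode_fields(instr: int) -> dict:
    """Split a 32-bit instruction word into standard MIPS R-type fields."""
    return {
        "opcode": (instr >> 26) & 0x3F,
        "rs": (instr >> 21) & 0x1F,
        "rt": (instr >> 16) & 0x1F,
        "rd": (instr >> 11) & 0x1F,
        "sa": (instr >> 6) & 0x1F,
        "funct": instr & 0x3F,
    }


def is_pcpyld(instr: int) -> bool:
    """Match only MMI (0x1C) + MMI2 (funct 0x09) + PCPYLD (sa 0x0E)."""
    f = decode_fields(instr)
    return f["opcode"] == 0x1C and f["funct"] == 0x09 and f["sa"] == 0x0E


def pcpyld_128(rs_val: int, rt_val: int) -> int:
    """Architectural result: rd[127:64] = rs[63:0], rd[63:0] = rt[63:0]."""
    return ((rs_val & MASK64) << 64) | (rt_val & MASK64)


def pcpyld_32(rt_val: int) -> int:
    """32-bit register model: only the low 32 bits of rt reach rd."""
    return rt_val & MASK32


if __name__ == "__main__":
    word = 0x71295389  # trap word from the qbert verdict
    print(decode_fields(word), is_pcpyld(word))
```

Running `decode_fields` on the trap word lets the register fields be checked directly against the log, and comparing `pcpyld_128(...) & MASK32` with `pcpyld_32(...)` shows that the approximation agrees with the architectural result on the low 32 bits.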
**Important Codex caution**: do NOT NOP-class the entire MMI opcode (`0x1C`). MMI has ~80 sub-instructions (MMI0/MMI1/MMI2/MMI3 sub-tables); some are real data movement (PCPYLD, PCPYUD, PCPYH), some are arithmetic (PADDB, PSUBB, PMULTW), some are SIMD compares (PCEQB, PCEQH). Each needs its own decode arm or careful approximation. The qbert track is fine with one sub-instruction per chapter, the same incremental cadence we've maintained throughout.

**Likely follow-ons** after PCPYLD: any other MMI op qbert's byte-walker uses. Common candidates given the `0x80808080` sentinel: **PCEQB** (parallel compare equal byte) and **PMFHL** (parallel move from HI/LO).

## Files changed

- `rtl/ee/ee_core_stub.sv`: 6 surgical edits.
- `sim/tb/integration/tb_ee_core_bnel.sv`: new focused TB.
- `sim/Makefile`: target + both regression lists.

## Regression

In flight; expected **165/165**.

## Pattern review

Seven qbert chapters (Ch271–Ch277). The qbert-driven track keeps producing one chapter per blocker at sub-half-day cadence:

| Chapter | Blocker | retire_count |
|---------|---------|--------------|
| Ch271 SQ | (init) | 12 → 26,958 |
| Ch272 DADDU | | → 26,960 |
| Ch273 SYSCALL HLE | | → 26,980 |
| Ch274 BEQL | | → 26,985 |
| Ch275 SD | | → 27,006 |
| Ch276 DSLL | | → 27,016 |
| **Ch277 BNEL** | | **→ 27,017** |

The MMI surface (PCPYLD and likely siblings) will broaden the opcode count quickly. That's expected when a real program starts using SIMD-style operations for stdlib-class work.
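The caution against NOP-classing the whole MMI opcode, together with the one-sub-instruction-per-chapter cadence, could be modelled roughly as below. This is an illustrative Python sketch only: `classify_mmi` and the `SUPPORTED` table are hypothetical names, and the sub-group funct values (`0x08`, `0x28`, `0x09`, `0x29` for MMI0 to MMI3) are my reading of the R5900 encoding and should be checked against the EE core manual before being relied on.

```python
# Hypothetical dispatch sketch; the supported set is illustrative only.
MMI_OPCODE = 0x1C

# Assumed funct values selecting the four MMI sub-tables.
MMI_SUBGROUPS = {0x08: "MMI0", 0x28: "MMI1", 0x09: "MMI2", 0x29: "MMI3"}

# (sub-group, sa) pairs with a dedicated decode arm.
# Grows by one entry per chapter; everything else keeps trapping.
SUPPORTED = {("MMI2", 0x0E): "PCPYLD"}


def classify_mmi(instr: int) -> str:
    """Return the op name for a supported MMI op, else a trap category."""
    opcode = (instr >> 26) & 0x3F
    if opcode != MMI_OPCODE:
        return "not_mmi"
    funct = instr & 0x3F
    sa = (instr >> 6) & 0x1F
    group = MMI_SUBGROUPS.get(funct)
    if group is None:
        # Ops encoded directly in funct (no sub-table) are not handled here.
        return "unsupported"
    # Unknown sub-instructions stay unsupported instead of becoming NOPs.
    return SUPPORTED.get((group, sa), "unsupported")


if __name__ == "__main__":
    print(classify_mmi(0x71295389))
```

The point of the explicit table is that an unrecognised sub-instruction still surfaces as the next blocker rather than retiring silently with wrong data.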