# Ch274 closeout: BEQL with squash-on-not-taken; qbert lands in a function prologue, next blocker is SD

**Status:** Closed.

**Verdict from re-running qbert.elf:** `elf_first_unsupported_opcode (pc=0x00112DAC instr=0xFFBF0020)`, which is **SD** (Store Doubleword, MIPS-III).

qbert passed the C++ constructor walker's BEQL correctly, JAL'd into a function at PC `0x00112DAC`, and trapped on the very first instruction of that function: the canonical `sd $ra, 0x20($sp)` register-save prologue.

## Numbers

| Chapter | Blocker | qbert retire_count |
|---------|---------|---------------------|
| Post-Ch271 (SQ) | DADDU at 0x00100068 | 26,958 |
| Post-Ch272 (DADDU) | SYSCALL at 0x00100070 | 26,960 |
| Post-Ch273 (SYSCALL HLE) | BEQL at 0x001000C0 | 26,980 |
| **Post-Ch274 (BEQL)** | **SD at 0x00112DAC** | **26,985** |

The 5-retire delta covers: BEQL squash → `addiu $v0, $v1, 4` → `lw $a0, 0($v0)` → `addiu $a1, $v0, 4` → `jal 0x00112DAC` → first instruction of the called function (SD, traps).

The ~78 KB PC jump to `0x00112DAC` confirms the BEQL squash worked: qbert's `$a0` was NOT clobbered to 0 by the squashed delay slot, the LW loaded the real constructor pointer, and the JAL dispatched correctly.

## What landed

### RTL: surgical edits in `ee_core_stub.sv`

1. **Opcode**: `localparam OP_BEQL = 6'h14` alongside `OP_BEQ`.
2. **Decode**: `is_beql` signal + `assign is_beql = (opcode == OP_BEQL)`.
3. **Branch logic**: BEQL added to the `is_branch` group and to `branch_taken` (same `(rs_val == rt_val)` condition as BEQ).
4. **New signal `is_beql_squash`**: `is_beql && (rs_val != rt_val)`. This is the load-bearing case.
5. **`retire_advance`**: when `is_beql_squash` is true, `next_pc <= pc + 32'd8` (skip the delay slot directly); `new_branch_pending` stays low so no stale target leaks. The existing BEQ/BNE/jump path is unchanged.
6. **Decoder allow-list**: added `!is_beql` to the `is_nop_class` catch-all so BEQL doesn't get strict-trap'd.

About 6 LOC of real change.
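The squash rule in edits 3 to 5 can be summarised with a small behavioural model. This is a Python sketch, not the RTL in `ee_core_stub.sv`: the function name `branch_outcome` and the target address used in the usage note are hypothetical, and it assumes a 32-bit compare of `rs_val` and `rt_val` as the edit list describes.

```python
MASK32 = 0xFFFFFFFF


def branch_outcome(op, pc, rs_val, rt_val, target):
    """Model BEQ/BEQL located at address `pc`.

    Returns (delay_slot_executes, next_pc), where next_pc is the PC that
    follows the branch and its delay slot (if the slot runs at all).
    """
    if op not in ("beq", "beql"):
        raise ValueError(f"unsupported op: {op}")

    equal = (rs_val & MASK32) == (rt_val & MASK32)

    if equal:
        # Taken: both BEQ and BEQL execute the delay slot, then jump.
        return True, target & MASK32

    if op == "beql":
        # Not taken, branch-likely: squash the delay slot and resume at pc + 8.
        return False, (pc + 8) & MASK32

    # Not taken, plain BEQ: the delay slot still executes, then fall through.
    return True, (pc + 8) & MASK32
```

For the qbert case, `branch_outcome("beql", 0x001000C0, 1, 2, 0x00100200)` returns `(False, 0x001000C8)`: the slot at `0x001000C4` is skipped and execution resumes at `0x001000C8`, which is consistent with the JAL sitting at `0x001000D4` three instructions later.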
### Focused TB: `tb_ee_core_beql.sv`

Three cases per Codex's spec:

1. **BEQL taken** (`$t0 == $t1`): branch reaches target; delay slot DOES execute (writes a sentinel into `$t5`). Cross-checked by `$t6 = 0xCAFE` at the target.
2. **BEQL not-taken** (`$t2 != $t3`): delay slot squashed. `$t7 = 0x2222` at PC+8 proves we landed correctly past the squash. **Inline BNE chain verifies `$t5` was NOT clobbered by the squashed delay slot** (`$t5` stays at its pre-BEQL `0xBEEF0000` value).
3. **BEQ not-taken cross-check** (same operands): plain BEQ's delay slot DOES execute, so `$t5` gets `0xCAB` ORed into the low 16 bits (`$t5 = 0xBABE0CAB`). Proves BEQL's squash differs from BEQ's no-squash behavior.

Encoding gotcha caught during TB authoring: my initial delay slots used `ori $t5, $0, ...` (clobbers `$t5` regardless of prior value) instead of `ori $t5, $t5, ...` (ORs into `$t5`, preserving high bits). The first build FAILED the Case-3 check with `$t5=0x00000CAB` instead of `0xBABE0CAB`. Fixed by changing the rs field to RT5 so the delay slot ORs into the existing value, which makes the "delay-fired" and "delay-squashed" cases distinguishable by the high half-word.

Result: `retired=21 halt=1 trap=0 pc=0xbfc00158 errors=0 PASS`.

### Makefile + regression

- `tb_ee_core_beql` target.
- Added to both PHONY list and `run:` master.
- Regression: 161 → **162**.

## qbert disassembly around the new blocker (PC 0x00112DAC)

The JAL at `0x001000D4` calls into a function at `0x00112DAC`. That function's prologue is:

```
0x00112DAC: 0xFFBF0020   sd $ra, 0x20($sp)   <-- TRAP (opcode 0x3F, MIPS-III SD)
```

**SD** (Store Doubleword) is the MIPS-III 64-bit cousin of SW. PS2 ELFs use it everywhere in function prologues to save 64-bit register values (`$ra`, `$s*`) onto the stack.

## Recommendation for Codex's Ch275

**Implement SD as a 2-beat 32-bit-stripe write FSM**, mirroring Ch271's SQ pattern but smaller:

- **Decode**: opcode `6'h3F` → `is_sd`.
- **Alignment**: SD requires 8-byte alignment (`ea[2:0] == 0`). Misaligned → AdES path (same as existing SW alignment).
- **FSM**: reuse the `sq_beat` counter (or add `sd_beat`); 2 beats this time. Beat 0 writes `rt_val` (low 32 bits of `$rt`) at EA; beat 1 writes 0 at EA+4 (upper 32 bits of `$rt` not modelled, the same approximation we made for SQ beats 1-3).
- **For `sd $ra,...`**: real PS2 callees later `LD` to restore 64-bit `$ra`. Our model's upper 32 bits are always 0, so the round-trip works as long as the function doesn't do 64-bit math on `$ra` itself (rare).

Focused TB shape (mirrors `tb_ee_core_sq`):

- Pre-poke RAM target with non-zero junk.
- Execute `sd $rt, 0(base)` with `$rt` non-zero in low 32 bits.
- LW + BNE chain verifies `mem[base+0] = rt_val_low` and `mem[base+4] = 0`.
- Direct hierarchical RAM peek for belt-and-braces.

This is structurally identical to Ch271 with `4 → 2` beats and `16 → 8` byte alignment. Should be ~30 minutes of work.

Likely follow-on after SD: **LD** (Load Doubleword, opcode 0x37). When the called function eventually returns, it'll `LD $ra, 0x20($sp)` to restore the saved register; our model needs the corresponding 2-beat read path. Codex may want to fold SD+LD into one chapter since they're symmetric.

## Files changed

- `rtl/ee/ee_core_stub.sv`: 6 surgical edits.
- `sim/tb/integration/tb_ee_core_beql.sv`: new focused TB.
- `sim/Makefile`: target + both regression lists.

## Regression

In flight at the moment of writing; expected **162/162** (was 161, +1 for `tb_ee_core_beql`).

## Process notes

- **Cross-check via BEQ in the same TB.** Codex specifically asked for the BEQ cross-check, and it caught a real difference: Case 3 (BEQ not-taken) writes `$t5` low bits while Case 2 (BEQL not-taken) does NOT. Without the cross-check, a regression where BEQL accidentally behaved like BEQ would silently pass on the "PC landed at PC+8" check alone.
- **OR-INTO vs OR-FROM-ZERO encoding bugs are easy to make.** My first TB pass had `ori $rt, $0, imm` (overwriting), which loses info about whether the delay slot fired. Always use `ori $rt, $rt, imm` (or a similar accumulating op) in delay-slot probes so "did it fire?" is observable by a bitwise comparison rather than a value comparison.
- **The pattern continues to compress.** Ch271 SQ took 5 edits + a TB. Ch272 DADDU took 4 + a TB. Ch273 SYSCALL HLE took 2 + a TB (plus a runner update). Ch274 BEQL is 6 + a TB. Each is a 1-day chapter at most. The qbert progression is now `12 → 26,958 → 26,960 → 26,980 → 26,985 retires`, so the runner is doing its job.
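
The OR-into versus OR-from-zero point can be illustrated with a few lines of Python. This is a sketch of the probe arithmetic only, not the TB itself: `ori` and `probe` are hypothetical helper names, and the constants are the ones quoted in the Focused TB section (`0xBABE0000` and `0xBEEF0000` as pre-branch values of `$t5`, `0x0CAB` as the immediate).

```python
MASK32 = 0xFFFFFFFF


def ori(rs_val, imm16):
    """MIPS ORI: OR a zero-extended 16-bit immediate into a 32-bit value."""
    return (rs_val | (imm16 & 0xFFFF)) & MASK32


def probe(t5_before, slot_fires, accumulate, imm16=0x0CAB):
    """Return $t5 after a delay-slot probe.

    accumulate=True  models `ori $t5, $t5, imm` (ORs into the old value).
    accumulate=False models `ori $t5, $0, imm`  (overwrites the old value).
    """
    if not slot_fires:
        # Squashed delay slot: $t5 keeps its pre-branch value.
        return t5_before & MASK32
    source = t5_before if accumulate else 0
    return ori(source, imm16)
```

With `accumulate=True`, a fired slot turns `0xBABE0000` into `0xBABE0CAB`, so the high half-word still identifies the case and the low half-word records that the slot ran. With `accumulate=False` the same probe yields `0x00000CAB`, which is the failing value seen in the first TB build. A squashed slot leaves `0xBEEF0000` untouched in either encoding.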