# Ch279 closeout: LQ as single-beat low-word load; next blocker is PSUBB (MMI0)

**Status:** Closed.

**Verdict from re-running qbert.elf:** `elf_first_unsupported_opcode (pc=0x00112C90 instr=0x712A1248)`: opcode `0x1C` (MMI) + funct `0x08` (MMI0 sub-table) + sa `0x09` = **PSUBB** (Parallel Subtract Byte). qbert ran LQ plus one more instruction, then trapped on the byte-wise SIMD subtract that sits at the heart of its stdlib byte-walker.

## Numbers

| Chapter | Blocker | qbert retire_count |
|---------|---------|---------------------|
| Post-Ch277 (BNEL) | PCPYLD at 0x00112C84 | 27,017 |
| Post-Ch278 (PCPYLD) | LQ at 0x00112C88 | 27,018 |
| **Post-Ch279 (LQ)** | **PSUBB at 0x00112C90** | **27,020** |

2-retire delta: LQ at 0x00112C88 plus the instruction at 0x00112C8C (probably another register move) retired before PSUBB trapped. The chain qbert is running here is the canonical SIMD byte-walker: load a 128-bit chunk, do a byte-wise compare/subtract against a sentinel, mask, test.

## What landed

### RTL: 5 surgical edits in `ee_core_stub.sv`

1. `localparam OP_LQ = 6'h1E` alongside `OP_LW`.
2. `is_lq` decode signal.
3. **Alignment**: extended `is_quad_access = is_sq || is_lq` so the existing 16-byte alignment fault `ea[3:0] != 0` covers LQ too. A misaligned LQ trips the AdEL path (it is a load, so the existing `is_align_store` group correctly does not include it; the exception code is AdEL, not AdES).
4. **FSM transition**: added `|| is_lq` to the LW/LB/LBU/LH/LHU loads list. The existing `S_MEM_REQ → S_MEM_WAIT` path handles the 32-bit read; `S_MEM_WAIT`'s default writeback `regfile[rt_idx] <= map_rd_data` fires for LQ because none of `is_lb`/`is_lbu`/`is_lh`/`is_lhu` match (the if-else chain falls through to the default LW arm).
5. `!is_lq` added to the `is_nop_class` catch-all.

5 surgical edits total. The "reuse LW path" decision keeps the chapter small.

### Focused TB: `tb_ee_core_lq.sv`

Cases:

1. **Exact qbert encoding shape**: `lq $t1, 0($a1)` built via `enc_i(OP_LQ, RA1, RT1, 0)` and asserted to equal `0x78A90000`.
   (We use this assertion to lock the encoding even though qbert actually executes `lq $t1, 0($v0)`, which uses a different base register: same opcode shape, different register index.)
2. **Value check**: pre-poke phys 0x400..0x40F with 4 distinct patterns (`0xAABBCCDD / 0x11112222 / 0x33334444 / 0x55556666`) so that a buggy implementation reading the wrong lane would fail. Verify `$t1 = 0xAABBCCDD` (the low 32 bits of the qword).
3. **LW cross-check**: LW at the same EA reads the same value. This confirms LQ is decoded as a "single-beat low-word load" consistent with the existing LW path.
4. **No-modify check**: a post-halt hierarchical RAM peek confirms all 4 lanes still hold the pre-poked values (LQ doesn't write).

Result: `retired=13 halt=1 trap=0 pc=0xbfc00128 errors=0 PASS`.

### Makefile + regression

- `tb_ee_core_lq` target.
- Added to both regression lists.
- Regression: 166 → **167**.

## Recommendation for Codex's Ch280: PSUBB

PSUBB at PC `0x00112C90`, instr `0x712A1248`:

- opcode 0x1C (MMI)
- funct 0x08 (MMI0 sub-table)
- sa 0x09 (PSUBB within MMI0)
- rs=$t1, rt=$t2, rd=$v0
- → `psubb $v0, $t1, $t2`

Architectural: `rd[7+8i:8i] = rs[7+8i:8i] - rt[7+8i:8i]` for i ∈ [0..15], i.e. 16 parallel byte subtractions with no carry/borrow between byte lanes. For our 32-bit model: 4 parallel byte subtractions on the low 32 bits.

Implementation outline (mirrors Ch278 PCPYLD's narrow-decode):

1. `localparam FUNC_MMI0 = 6'h08`.
2. `localparam MMI0_PSUBB = 5'h09`.
3. `is_psubb = is_mmi && (func == FUNC_MMI0) && (shamt == MMI0_PSUBB)`.
4. Add to `is_rtype_alu` group.
5. New writeback arm:

   ```sv
   else if (is_psubb) begin
       // One independent 8-bit subtract per byte lane of the low word.
       rtype_alu_wb[ 7: 0] = rs_val[ 7: 0] - rt_val[ 7: 0];
       rtype_alu_wb[15: 8] = rs_val[15: 8] - rt_val[15: 8];
       rtype_alu_wb[23:16] = rs_val[23:16] - rt_val[23:16];
       rtype_alu_wb[31:24] = rs_val[31:24] - rt_val[31:24];
   end
   ```

   (Each byte subtract is naturally modulo-256, with no carry between lanes; that is the SIMD semantic.)
6. Add `!is_psubb` to the `is_nop_class` catch-all.
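
The lane-wise writeback in step 5 can be cross-checked against a small software model. The following is a minimal sketch in Python, not part of the repo: `psubb32` and `sub32` are hypothetical helper names, and the model covers only the low 32 bits, matching the 32-bit simplification described above. It contrasts the lane-wise subtract with a plain word-wide subtract, where a borrow ripples across byte boundaries.

```python
def psubb32(rs: int, rt: int) -> int:
    """Lane-wise byte subtract on the low 32 bits (hypothetical reference model)."""
    out = 0
    for lane in range(4):
        shift = 8 * lane
        a = (rs >> shift) & 0xFF          # byte lane of rs
        b = (rt >> shift) & 0xFF          # byte lane of rt
        out |= ((a - b) & 0xFF) << shift  # modulo-256, borrow is discarded
    return out


def sub32(rs: int, rt: int) -> int:
    """Plain word-wide subtract, for contrast: borrows ripple across bytes."""
    return (rs - rt) & 0xFFFFFFFF


if __name__ == "__main__":
    # A borrow out of byte 0 stays inside byte 0 in the lane-wise model,
    # but propagates into byte 1 in the word-wide subtract.
    print(hex(psubb32(0x00000100, 0x00000001)))  # 0x1ff
    print(hex(sub32(0x00000100, 0x00000001)))    # 0xff
```

Input pairs where the two functions disagree are the ones that actually exercise lane isolation, so they are the most useful stimulus when comparing the RTL against a model like this.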

Focused TB:

- Identity check: `psubb $rd, $rs, $0` → `$rd = $rs` (each byte minus 0).
- Lane-isolation check: `psubb $rd, $rs, $rt` with `$rs = 0x10203040`, `$rt = 0x01010101` → `$rd = 0x0F1F2F3F` (each byte subtracts independently).
- Wrap check: `psubb $rd, $rs, $rt` with `$rs = 0x00010203`, `$rt = 0x01010101` → `$rd = 0xFF000102` (the top byte computes `0x00 - 0x01` and wraps to `0xFF` modulo 256).
- Caveat on the two vectors above: a plain 32-bit subtract produces the same results, so on their own they do not distinguish lane-wise from word-wide subtraction. A vector that borrows in a lower lane does, e.g. `$rs = 0x00000100`, `$rt = 0x00000001` (lane-wise `0x000001FF`, word-wide `0x000000FF`).
- Exact qbert encoding assertion against `0x712A1248`.

~4 LOC change.

**Likely follow-ons** in this byte-walker context: **PCEQB** (parallel compare equal byte) and **PMFHL/LH** (parallel move from HI/LO low halves). The string-walker pattern is:

1. LQ a chunk of memory.
2. PSUBB or PCEQB against a sentinel.
3. PMFHL or some other reduction.
4. Branch.

## Files changed

- `rtl/ee/ee_core_stub.sv`: 5 surgical edits.
- `sim/tb/integration/tb_ee_core_lq.sv`: new focused TB.
- `sim/Makefile`: target + both regression lists.

## Regression

In flight; expected **167/167**.

## Pattern review

Nine qbert chapters so far. The MMI sub-decode pattern from Ch278 is about to be reused (PSUBB shares the same shape: MMI prefix + funct + sa selector). Anticipated: PSUBB in 4 edits, a mirror of PCPYLD.

| Chapter | Blocker | Edits | Pattern |
|---------|---------|-------|---------|
| Ch271 SQ | SQ | 5 | NEW 4-beat write |
| Ch272 DADDU | DADDU | 4 | NEW ALU-low-32 |
| Ch273 SYSCALL HLE | SYSCALL #60 | 2 | NEW gated dispatcher |
| Ch274 BEQL | BEQL | 6 | NEW branch+squash |
| Ch275 SD | SD | 7 | REUSE SQ counter |
| Ch276 DSLL | DSLL | 4 | REUSE DADDU |
| Ch277 BNEL | BNEL | 6 | REUSE BEQL squash |
| Ch278 PCPYLD | PCPYLD | 4 | NEW MMI narrow-decode |
| **Ch279 LQ** | **LQ** | **5** | **REUSE LW path** |

The run-and-pick-next-blocker loop is producing one chapter in under half a day. The qbert track is on rails.
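
The MMI narrow-decode shape (MMI opcode, then funct, then sa selector) comes down to plain field extraction on the 32-bit instruction word. As a minimal sketch in Python, assuming the standard MIPS field layout: `Fields`, `decode`, `is_psubb`, and this Python `enc_i` are hypothetical stand-ins for illustration, with `enc_i` only mirroring the name of the TB helper rather than reproducing it.

```python
from typing import NamedTuple


class Fields(NamedTuple):
    opcode: int  # bits 31:26
    rs: int      # bits 25:21
    rt: int      # bits 20:16
    rd: int      # bits 15:11
    sa: int      # bits 10:6
    funct: int   # bits 5:0


def decode(instr: int) -> Fields:
    """Split a 32-bit instruction word into its R-type fields."""
    return Fields(
        opcode=(instr >> 26) & 0x3F,
        rs=(instr >> 21) & 0x1F,
        rt=(instr >> 16) & 0x1F,
        rd=(instr >> 11) & 0x1F,
        sa=(instr >> 6) & 0x1F,
        funct=instr & 0x3F,
    )


def is_psubb(instr: int) -> bool:
    """Narrow decode: MMI opcode 0x1C, MMI0 funct 0x08, sa selector 0x09."""
    f = decode(instr)
    return f.opcode == 0x1C and f.funct == 0x08 and f.sa == 0x09


def enc_i(opcode: int, rs: int, rt: int, imm: int) -> int:
    """Assemble an I-type word (hypothetical stand-in for the TB helper)."""
    return ((opcode & 0x3F) << 26) | ((rs & 0x1F) << 21) | ((rt & 0x1F) << 16) | (imm & 0xFFFF)


if __name__ == "__main__":
    print(decode(0x712A1248))            # rs=9 ($t1), rt=10 ($t2), rd=2 ($v0)
    print(hex(enc_i(0x1E, 5, 9, 0)))     # lq $t1, 0($a1) -> 0x78a90000
```

Running the trapping word and the LQ encoding through a helper like this is a quick way to sanity-check a blocker's fields before touching the RTL decode.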