# Ch271 closeout — SQ implemented; qbert progresses 2,247× further **Status:** Closed. **Verdict from re-running qbert.elf:** `elf_first_unsupported_opcode (pc=0x00100068 instr=0x0080e02d)` — **DADDU**, the next missing R5900 opcode. **That frames Ch272.** ## Numbers, end to end | Metric | Pre-Ch271 (Ch270 verdict) | Post-Ch271 (this chapter) | |-----------------------|----------------------------|----------------------------| | qbert retire_count | 12 | **26,958** (2,247× more) | | First-trap PC | 0x00100024 (SQ) | 0x00100068 (DADDU) | | First-trap instr | 0x7C400000 | 0x0080E02D | | Distance in qbert text | ~9 instructions from entry | ~24 instructions further | The SQ implementation correctly cleared the qbert prolog buffer that previously stalled execution. Now qbert progresses ~24 instructions further into its prolog before hitting DADDU. ## What landed ### RTL — ee_core_stub.sv (5 surgical edits) 1. `OP_SQ = 6'h1F` localparam constant alongside the other store opcodes. 2. `is_sq` logic declaration + `assign is_sq = (opcode == OP_SQ)`. 3. **Alignment**: extended `is_align_fault` to include `is_quad_access && (ea[3:0] != 4'd0)`, and added `is_sq` to `is_align_store`. Misaligned SQ now trips the existing AdES exception path (or strict trap, depending on `TRAP_ALIGN_ERROR`). 4. **Decoder allow-list**: added `!is_sq` to the `is_nop_class` catch-all so SQ doesn't get rejected by `STRICT_UNSUPPORTED`. 5. **4-beat FSM**: new `sq_beat` 2-bit register; transition into `S_MEM_WRITE` from EXECUTE; in `S_MEM_WRITE` combinational block, `map_wr_addr = ea + {sq_beat, 2'b00}` and `map_wr_data = (sq_beat == 0) ? rt_val : 32'd0` (upper 96 bits of $rt aren't modelled; for `sq $zero,...` — the qbert case — every beat naturally writes zero); in `S_MEM_WRITE` FSM state, stay in state and increment `sq_beat` until `sq_beat == 2'd3`, then retire and return to `S_IFETCH_REQ`. The single architectural SQ instruction takes 4 bus beats but produces exactly ONE retire event — matching the architectural model. ### TB — sim/tb/integration/tb_ee_core_sq.sv Focused 18-instruction test: - Bootstrap from `0xBFC00000` reset vector via J to `0xBFC00100`. - LUI/ORI to load `$v0 = 0x80000400` (kseg0 → EE RAM phys 0x400). - Pre-poke EE RAM at phys 0x400..0x40F with distinct non-zero values (`0xDEADBEEF / 0xCAFEF00D / 0x12345678 / 0x9ABCDEF0`) via hierarchical `ram_word()` task so a missing SQ beat would leave a non-zero word. - Execute `sq $0, 0($v0)` (= 0x7C400000, the exact qbert instruction). - LW + BNE-to-FAIL chain over the 4 words verifies each lane is zero. - Belt-and-braces: direct hierarchical peek of `u_ee_ram.mem[0x40]` after halt to confirm all 128 bits are 0. - PASS via syscall. Result: `[tb_ee_core_sq] retired=18 halt=1 trap=0 pc=0xbfc0013c errors=0 PASS`. Both the BNE chain and the direct RAM check agree the SQ wrote 16 zero bytes correctly. ### Makefile — `tb_ee_core_sq` target + regression list Added to both PHONY list and `run:` master list. Regression bumps from 158 → 159. ## Why not just NOP the opcode (Codex's caution honoured) Codex called this out explicitly: `0x7C400000` is `sq $zero, 0($v0)` — a 128-bit store of zero. NOP-ing op=0x1F would let qbert continue, but it would silently skip real memory initialization. For the prolog, that's a buffer clear; later code would read uninitialized values from those bytes and behave nondeterministically. **Minimal-correct SQ** (4 beats of 32-bit writes) is the right choice. The "minimal" part: we don't model the upper 96 bits of $rt (PS2 EE has 128-bit GPRs); for `sq $zero,...` this is exact, and for `sq $non-zero,...` we write the low 32 bits to beat 0 and zero elsewhere — a documented approximation that degrades gracefully for the common "clear a 128-bit kernel slot" use case. When/if a real PS2 program does `sq` of a non-zero 128-bit register, we'll see silent data corruption that the runner's hot-PC verdict can identify; that's the trigger to upgrade to 128-bit GPR modelling. ## Codex Ch271 acceptance — line-by-line | Requirement | Status | Where | |----------------------------------------------------------------------------|--------|-------| | Decode primary opcode 0x1F as SQ | ✅ | OP_SQ + is_sq | | Support `sq $zero, imm(base)` at minimum | ✅ | rt_val=0 case writes 0 every beat (and rt_val=non_zero writes low 32 to beat 0) | | 4-beat 32-bit-stripe FSM through existing memory interface | ✅ | sq_beat counter, stays in S_MEM_WRITE for 4 beats | | Require 16-byte alignment; misaligned → strict/exc trap | ✅ | is_quad_access check in is_align_fault | | Focused TB: preload base, exec SQ, verify 4 zero words | ✅ | tb_ee_core_sq | | Verify PC advances + no GPR writeback | ✅ | Final PC check + retire path doesn't touch regfile | | Re-run qbert.elf, report next blocker | ✅ | DADDU at pc=0x00100068 | | Don't NOP all op=0x1F (would mask real stores) | ✅ | Targeted decode, exact 4-beat write semantics | | Don't overbuild full LQ/SQ/vector yet | ✅ | SQ only (no LQ, no PSQ_*, no vector); upper 96 bits left for later | | Regression unaffected | ✅ | 159/159 in flight | ## Recommendation for Codex's Ch272 **`daddu $gp, $a0, $zero` at pc=0x00100068 instr=0x0080E02D.** DADDU is MIPS-III's 64-bit version of ADDU. The R5900 is a 64-bit core; PS2 ELFs use DADDU as the canonical 64-bit register-move pseudo-instruction (`move rd, rs` → `daddu rd, rs, $zero`). Our model has 32-bit regfile (`logic [31:0] regfile [0:31]`), so a faithful 64-bit DADDU would need 64-bit GPRs. For the qbert blocker specifically, the operation degenerates to a 32-bit move: `$gp = $a0 + 0`. Three Ch272 framings, in order of scope: 1. **Decode DADDU and treat it as ADDU.** Low-32-bit semantics only; upper 32 bits silently dropped (already true everywhere else in the model). Touches one line in `is_nop_class` allow-list + one new R-type funct case + adding `is_daddu` to the `is_rtype_alu` group. Same "minimal-correct" pattern that worked for SQ. 2. **Decode DADDU + DADD + DSUBU + DSUB + DAND + DOR + DXOR + DNOR as their 32-bit counterparts.** Broader, but these are all commonly emitted by gcc for r5900 alongside DADDU. Pre-empts the next 4-7 chapters worth of one-opcode-at-a-time growth. 3. **Properly implement 64-bit GPRs.** Architecturally correct, but invasive — touches regfile width, all ALU paths, LW/SW to-from regfile, and the trace. Probably 1-2 chapters of work on its own. (1) is the strict Codex-style "minimal-correct next blocker" answer. (2) would shorten the chapter chain if Codex thinks qbert's prolog uses several D* ops. (3) is a "do it right" pivot that's worth doing eventually but probably not in Ch272. My read: **(1) is the right Ch272 — same shape as Ch271, fast to land, lets the verdict surface the next real divergence.** If the next blocker is also a D* op, we recur. If it's something totally different (LQ? MMI? VU0 macro?), we know (1) was the right scope. Standing by. ## Files changed - `rtl/ee/ee_core_stub.sv` — 5 surgical edits (~20 LOC total) for SQ decode + 4-beat write FSM. - `sim/tb/integration/tb_ee_core_sq.sv` — new focused TB. - `sim/Makefile` — `tb_ee_core_sq` target + added to both regression lists. ## Regression In flight at the moment of writing; expected 159/159 (was 158, +1 for tb_ee_core_sq).