# Ch271 closeout — SQ implemented; qbert progresses 2,247× further

**Status:** Closed. **Verdict from re-running qbert.elf:**
`elf_first_unsupported_opcode (pc=0x00100068 instr=0x0080e02d)`
— **DADDU**, the next missing R5900 opcode. **That frames Ch272.**

## Numbers, end to end

| Metric                | Pre-Ch271 (Ch270 verdict) | Post-Ch271 (this chapter) |
|-----------------------|----------------------------|----------------------------|
| qbert retire_count    | 12                         | **26,958** (2,247× more)   |
| First-trap PC         | 0x00100024 (SQ)            | 0x00100068 (DADDU)         |
| First-trap instr      | 0x7C400000                 | 0x0080E02D                 |
| Distance in qbert text | ~9 instructions from entry | 17 instructions further (0x44 bytes) |

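With SQ implemented, qbert gets past the prolog buffer clear
that previously trapped. It now runs 17 instructions (0x44
bytes) further into its prolog text before hitting DADDU; the
much larger jump in retire count suggests most of the new
retires come from repeated execution, such as a clear loop.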
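As a sanity check on the two trap words in the table, here is a minimal Python sketch that splits each into the standard MIPS instruction fields. The helper name `decode` is hypothetical and not part of the project; the field layout is the usual MIPS I/R-type encoding.

```python
def decode(instr):
    """Split a 32-bit MIPS instruction word into its standard fields."""
    return {
        "opcode": (instr >> 26) & 0x3F,  # primary opcode, bits 31..26
        "rs":     (instr >> 21) & 0x1F,  # base / first source register
        "rt":     (instr >> 16) & 0x1F,  # second source / store data register
        "rd":     (instr >> 11) & 0x1F,  # destination register (R-type)
        "funct":  instr & 0x3F,          # function code (R-type)
        "imm":    instr & 0xFFFF,        # 16-bit immediate (I-type)
    }


sq = decode(0x7C400000)     # pre-Ch271 first-trap word
daddu = decode(0x0080E02D)  # post-Ch271 first-trap word

print(f"SQ:    opcode={sq['opcode']:#04x} base=${sq['rs']} rt=${sq['rt']} imm={sq['imm']}")
print(f"DADDU: opcode={daddu['opcode']:#04x} funct={daddu['funct']:#04x} "
      f"rd=${daddu['rd']} rs=${daddu['rs']} rt=${daddu['rt']}")
```

Register 2 is `$v0`, 4 is `$a0`, and 28 is `$gp`, so the fields agree with the `sq $zero, 0($v0)` and `daddu $gp, $a0, $zero` readings used in this chapter.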

## What landed

### RTL — ee_core_stub.sv (5 surgical edits)

1. `OP_SQ = 6'h1F` localparam constant alongside the other store
   opcodes.
2. `is_sq` logic declaration + `assign is_sq = (opcode == OP_SQ)`.
3. **Alignment**: extended `is_align_fault` to include
   `is_quad_access && (ea[3:0] != 4'd0)`, and added `is_sq` to
   `is_align_store`. Misaligned SQ now trips the existing
   AdES exception path (or strict trap, depending on
   `TRAP_ALIGN_ERROR`).
4. **Decoder allow-list**: added `!is_sq` to the `is_nop_class`
   catch-all so SQ doesn't get rejected by `STRICT_UNSUPPORTED`.
5. **4-beat FSM**: new `sq_beat` 2-bit register, with a transition
   into `S_MEM_WRITE` from EXECUTE.
   - In the `S_MEM_WRITE` combinational block,
     `map_wr_addr = ea + {sq_beat, 2'b00}` and
     `map_wr_data = (sq_beat == 0) ? rt_val : 32'd0`. The upper 96
     bits of $rt aren't modelled; for `sq $zero,...` (the qbert
     case) every beat naturally writes zero.
   - In the `S_MEM_WRITE` FSM state, stay in state and increment
     `sq_beat` until `sq_beat == 2'd3`, then retire and return to
     `S_IFETCH_REQ`.

The single architectural SQ instruction takes 4 bus beats but
produces exactly ONE retire event — matching the architectural
model.
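The beat sequencing described above can be sketched as a small Python behavioural model. This is not the RTL: `sq_beats` and `execute_sq` are hypothetical names, and RAM is modelled as a plain dict keyed by byte address. It only mirrors the address/data formulas and the alignment rule listed in the edits.

```python
def sq_beats(ea, rt_val):
    """Return the four (address, data) writes for one SQ.

    Mirrors map_wr_addr = ea + {sq_beat, 2'b00} and
    map_wr_data = (sq_beat == 0) ? rt_val : 0.
    Raises ValueError where the RTL would take the alignment fault.
    """
    if ea & 0xF:  # ea[3:0] != 0 means the address is not 16-byte aligned
        raise ValueError(f"misaligned SQ at {ea:#010x}")
    writes = []
    for beat in range(4):
        addr = ea + (beat << 2)                       # 4-byte stride per beat
        data = (rt_val & 0xFFFFFFFF) if beat == 0 else 0
        writes.append((addr, data))
    return writes


def execute_sq(ram, ea, rt_val):
    """Apply all four beats to `ram` and return the number of retire events."""
    for addr, data in sq_beats(ea, rt_val):
        ram[addr] = data
    return 1  # four bus beats, one architectural retire


ram = {}
retired = execute_sq(ram, 0x400, 0)
print(retired, {hex(a): v for a, v in sorted(ram.items())})
```

The model returns a single retire event regardless of beat count, which is the property the FSM is meant to preserve.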

### TB — sim/tb/integration/tb_ee_core_sq.sv

Focused 18-instruction test:
- Bootstrap from `0xBFC00000` reset vector via J to
  `0xBFC00100`.
- LUI/ORI to load `$v0 = 0x80000400` (kseg0 → EE RAM phys
  0x400).
- Pre-poke EE RAM at phys 0x400..0x40F with distinct non-zero
  values (`0xDEADBEEF / 0xCAFEF00D / 0x12345678 / 0x9ABCDEF0`)
  via hierarchical `ram_word()` task so a missing SQ beat would
  leave a non-zero word.
- Execute `sq $0, 0($v0)` (= 0x7C400000, the exact qbert
  instruction).
- LW + BNE-to-FAIL chain over the 4 words verifies each lane is
  zero.
- Belt-and-braces: direct hierarchical peek of
  `u_ee_ram.mem[0x40]` after halt to confirm all 128 bits are 0.
- PASS via syscall.

Result: `[tb_ee_core_sq] retired=18 halt=1 trap=0 pc=0xbfc0013c
errors=0 PASS`. Both the BNE chain and the direct RAM check
agree the SQ wrote 16 zero bytes correctly.
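The address arithmetic and the reason for the non-zero pre-poke can be illustrated with a short Python sketch. The helper names are hypothetical. The kseg0 mapping is the standard MIPS one (strip the top three address bits), and the `row` computation assumes the RAM array is 128 bits wide, which is what the `mem[0x40]` peek implies.

```python
def lui_ori(upper, lower):
    """Value left in a register by LUI upper followed by ORI lower."""
    return ((upper << 16) | lower) & 0xFFFFFFFF


def kseg0_to_phys(vaddr):
    """kseg0 is unmapped: the physical address is the low 29 bits."""
    assert 0x80000000 <= vaddr < 0xA0000000, "not a kseg0 address"
    return vaddr & 0x1FFFFFFF


POKE = [0xDEADBEEF, 0xCAFEF00D, 0x12345678, 0x9ABCDEF0]

v0 = lui_ori(0x8000, 0x0400)
phys = kseg0_to_phys(v0)
row = phys >> 4  # assumption: one RAM row per 16 bytes


def nonzero_after(beats_written):
    """Pre-poke four words, zero `beats_written` of them, list survivors."""
    ram = {phys + 4 * i: POKE[i] for i in range(4)}
    for beat in range(beats_written):
        ram[phys + 4 * beat] = 0
    return [addr for addr in sorted(ram) if ram[addr] != 0]


print(hex(v0), hex(phys), hex(row))
print(nonzero_after(4))                     # full SQ: nothing survives
print([hex(a) for a in nonzero_after(3)])   # dropped last beat is visible
```

Because every pre-poked word is non-zero, any dropped beat leaves a detectable value behind; a RAM that started at zero would pass even with a broken FSM.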

### Makefile — `tb_ee_core_sq` target + regression list

Added to both the PHONY list and the `run:` master list. The
regression count goes from 158 to 159.

## Why not just NOP the opcode (Codex's caution honoured)

Codex called this out explicitly: `0x7C400000` is `sq $zero,
0($v0)` — a 128-bit store of zero. NOP-ing op=0x1F would let
qbert continue, but it would silently skip real memory
initialization. For the prolog, that's a buffer clear; later
code would read uninitialized values from those bytes and
behave nondeterministically.

**Minimal-correct SQ** (4 beats of 32-bit writes) is the right
choice. The "minimal" part: we don't model the upper 96 bits of
$rt (PS2 EE has 128-bit GPRs); for `sq $zero,...` this is
exact, and for `sq $non-zero,...` we write the low 32 bits to
beat 0 and zero elsewhere — a documented approximation that
degrades gracefully for the common "clear a 128-bit kernel
slot" use case. If a real PS2 program ever does `sq` of a
register whose upper 96 bits are non-zero, the stored data will
be silently wrong; we expect the resulting divergence to surface
in the runner's hot-PC verdict, and that is the trigger to
upgrade to 128-bit GPR modelling.
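To make the scope of the approximation concrete, the following Python sketch compares what the current model stores with what a full 128-bit store would write. All three function names are hypothetical, and the word order (lowest 32 bits at the lowest address) assumes the EE's little-endian layout.

```python
def sq_model(rt_low32):
    """Words the current model stores: low 32 bits in beat 0, then zeros."""
    return [rt_low32 & 0xFFFFFFFF, 0, 0, 0]


def sq_faithful(rt128):
    """Words a full 128-bit SQ would store, lowest word first."""
    return [(rt128 >> (32 * i)) & 0xFFFFFFFF for i in range(4)]


def is_exact(rt128):
    """True when the approximation stores exactly the faithful result."""
    return sq_model(rt128) == sq_faithful(rt128)


print(is_exact(0))            # sq $zero: exact
print(is_exact(0xDEADBEEF))   # fits in 32 bits: exact
print(is_exact(1 << 32))      # any bit above bit 31 set: diverges
```

The approximation is exact precisely when the upper 96 bits of the register are zero, which is the condition that `sq $zero,...` trivially meets.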

## Codex Ch271 acceptance — line-by-line

| Requirement                                                                | Status | Where |
|----------------------------------------------------------------------------|--------|-------|
| Decode primary opcode 0x1F as SQ                                            | ✅     | OP_SQ + is_sq |
| Support `sq $zero, imm(base)` at minimum                                    | ✅     | rt_val=0 case writes 0 every beat (and rt_val=non_zero writes low 32 to beat 0) |
| 4-beat 32-bit-stripe FSM through existing memory interface                  | ✅     | sq_beat counter, stays in S_MEM_WRITE for 4 beats |
| Require 16-byte alignment; misaligned → strict/exc trap                     | ✅     | is_quad_access check in is_align_fault |
| Focused TB: preload base, exec SQ, verify 4 zero words                      | ✅     | tb_ee_core_sq |
| Verify PC advances + no GPR writeback                                       | ✅     | Final PC check + retire path doesn't touch regfile |
| Re-run qbert.elf, report next blocker                                       | ✅     | DADDU at pc=0x00100068 |
| Don't NOP all op=0x1F (would mask real stores)                              | ✅     | Targeted decode, exact 4-beat write semantics |
| Don't overbuild full LQ/SQ/vector yet                                       | ✅     | SQ only (no LQ, no PSQ_*, no vector); upper 96 bits left for later |
| Regression unaffected                                                       | ⏳     | Run in flight; 159/159 expected |

## Recommendation for Codex's Ch272

**`daddu $gp, $a0, $zero` at pc=0x00100068 instr=0x0080E02D.**

DADDU is MIPS-III's 64-bit version of ADDU. The R5900 is a
64-bit core; PS2 ELFs use DADDU as the canonical 64-bit
register-move pseudo-instruction (`move rd, rs` →
`daddu rd, rs, $zero`).

Our model has a 32-bit regfile (`logic [31:0] regfile [0:31]`),
so a faithful 64-bit DADDU would need 64-bit GPRs. For the
qbert blocker specifically, the operation degenerates to a
32-bit move: `$gp = $a0 + 0`.
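The reason a 32-bit treatment is safe for the low word can be checked with a small Python sketch: the low 32 bits of a 64-bit sum depend only on the low 32 bits of the operands. The function names are hypothetical, and this says nothing about the upper 32 bits, which the model does not hold.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def daddu64(rs, rt):
    """64-bit unsigned add with wraparound, as DADDU computes it."""
    return (rs + rt) & MASK64


def addu32(rs, rt):
    """32-bit add with wraparound, as the 32-bit model would compute it."""
    return (rs + rt) & MASK32


def low32_agrees(rs64, rt64):
    """Low word of the 64-bit result equals the 32-bit add of the low words."""
    return (daddu64(rs64, rt64) & MASK32) == addu32(rs64 & MASK32, rt64 & MASK32)


a0 = 0x0012_3456
print(hex(addu32(a0, 0)))                      # the qbert case: a plain move
print(low32_agrees(0x1_FFFF_FFFF, 1))          # carry out of bit 31 is dropped in both
print(hex(daddu64(0xFFFF_FFFF, 1)), hex(addu32(0xFFFF_FFFF, 1)))
```

The last line shows where the two differ: the 64-bit result carries into bit 32, which a 32-bit regfile cannot represent.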

Three Ch272 framings, in order of scope:

1. **Decode DADDU and treat it as ADDU.** Low-32-bit semantics
   only; upper 32 bits silently dropped (already true everywhere
   else in the model). Touches one line in `is_nop_class`
   allow-list + one new R-type funct case + adding `is_daddu` to
   the `is_rtype_alu` group. Same "minimal-correct" pattern that
   worked for SQ.
2. **Decode DADDU + DADD + DSUBU + DSUB + DADDIU + DADDI as their
   32-bit counterparts.** Broader, but these are all commonly
   emitted by gcc for r5900 alongside DADDU. (AND/OR/XOR/NOR have
   no separate D* forms; they already operate on the full
   register.) Pre-empts several chapters' worth of
   one-opcode-at-a-time growth.
3. **Properly implement 64-bit GPRs.** Architecturally correct,
   but invasive: it touches regfile width, all ALU paths, the
   LW/SW paths to and from the regfile, and the trace. Probably
   1-2 chapters of work on its own.

(1) is the strict Codex-style "minimal-correct next blocker"
answer. (2) would shorten the chapter chain if Codex thinks
qbert's prolog uses several D* ops. (3) is a "do it right" pivot
that's worth doing eventually but probably not in Ch272.

My read: **(1) is the right Ch272 — same shape as Ch271, fast
to land, lets the verdict surface the next real divergence.**
If the next blocker is also a D* op, we recur. If it's something
totally different (LQ? MMI? VU0 macro?), we know (1) was the
right scope.

Standing by.

## Files changed

- `rtl/ee/ee_core_stub.sv` — 5 surgical edits (~20 LOC total) for
  SQ decode + 4-beat write FSM.
- `sim/tb/integration/tb_ee_core_sq.sv` — new focused TB.
- `sim/Makefile` — `tb_ee_core_sq` target + added to both
  regression lists.

## Regression

In flight at the moment of writing; expected 159/159 (was 158, +1
for tb_ee_core_sq).