RTL (GS rasterizer, EE core stub, platform bridge, LPDDR4B path), sim regression (272 TBs), docs, and tooling. Copyrighted PS2 content (BIOS, game code, GS dumps, and all dump-derived textures/traces) is excluded via .gitignore and stays local. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
7.8 KiB
Ch271 closeout — SQ implemented; qbert progresses 2,247× further
Status: Closed. Verdict from re-running qbert.elf:
elf_first_unsupported_opcode (pc=0x00100068 instr=0x0080e02d)
— DADDU, the next missing R5900 opcode. That frames Ch272.
Numbers, end to end
| Metric | Pre-Ch271 (Ch270 verdict) | Post-Ch271 (this chapter) |
|---|---|---|
| qbert retire_count | 12 | 26,958 (2,247× more) |
| First-trap PC | 0x00100024 (SQ) | 0x00100068 (DADDU) |
| First-trap instr | 0x7C400000 | 0x0080E02D |
| Distance in qbert text | ~9 instructions from entry | ~24 instructions further |
The SQ implementation correctly cleared the qbert prolog buffer that previously stalled execution. Now qbert progresses ~24 instructions further into its prolog before hitting DADDU.
What landed
RTL — ee_core_stub.sv (5 surgical edits)
OP_SQ = 6'h1Flocalparam constant alongside the other store opcodes.is_sqlogic declaration +assign is_sq = (opcode == OP_SQ).- Alignment: extended
is_align_faultto includeis_quad_access && (ea[3:0] != 4'd0), and addedis_sqtois_align_store. Misaligned SQ now trips the existing AdES exception path (or strict trap, depending onTRAP_ALIGN_ERROR). - Decoder allow-list: added
!is_sqto theis_nop_classcatch-all so SQ doesn't get rejected bySTRICT_UNSUPPORTED. - 4-beat FSM: new
sq_beat2-bit register; transition intoS_MEM_WRITEfrom EXECUTE; inS_MEM_WRITEcombinational block,map_wr_addr = ea + {sq_beat, 2'b00}andmap_wr_data = (sq_beat == 0) ? rt_val : 32'd0(upper 96 bits of $rt aren't modelled; forsq $zero,...— the qbert case — every beat naturally writes zero); inS_MEM_WRITEFSM state, stay in state and incrementsq_beatuntilsq_beat == 2'd3, then retire and return toS_IFETCH_REQ.
The single architectural SQ instruction takes 4 bus beats but produces exactly ONE retire event — matching the architectural model.
TB — sim/tb/integration/tb_ee_core_sq.sv
Focused 18-instruction test:
- Bootstrap from
0xBFC00000reset vector via J to0xBFC00100. - LUI/ORI to load
$v0 = 0x80000400(kseg0 → EE RAM phys 0x400). - Pre-poke EE RAM at phys 0x400..0x40F with distinct non-zero
values (
0xDEADBEEF / 0xCAFEF00D / 0x12345678 / 0x9ABCDEF0) via hierarchicalram_word()task so a missing SQ beat would leave a non-zero word. - Execute
sq $0, 0($v0)(= 0x7C400000, the exact qbert instruction). - LW + BNE-to-FAIL chain over the 4 words verifies each lane is zero.
- Belt-and-braces: direct hierarchical peek of
u_ee_ram.mem[0x40]after halt to confirm all 128 bits are 0. - PASS via syscall.
Result: [tb_ee_core_sq] retired=18 halt=1 trap=0 pc=0xbfc0013c errors=0 PASS. Both the BNE chain and the direct RAM check
agree the SQ wrote 16 zero bytes correctly.
Makefile — tb_ee_core_sq target + regression list
Added to both PHONY list and run: master list. Regression
bumps from 158 → 159.
Why not just NOP the opcode (Codex's caution honoured)
Codex called this out explicitly: 0x7C400000 is sq $zero, 0($v0) — a 128-bit store of zero. NOP-ing op=0x1F would let
qbert continue, but it would silently skip real memory
initialization. For the prolog, that's a buffer clear; later
code would read uninitialized values from those bytes and
behave nondeterministically.
Minimal-correct SQ (4 beats of 32-bit writes) is the right
choice. The "minimal" part: we don't model the upper 96 bits of
$rt (PS2 EE has 128-bit GPRs); for sq $zero,... this is
exact, and for sq $non-zero,... we write the low 32 bits to
beat 0 and zero elsewhere — a documented approximation that
degrades gracefully for the common "clear a 128-bit kernel
slot" use case. When/if a real PS2 program does sq of a
non-zero 128-bit register, we'll see silent data corruption
that the runner's hot-PC verdict can identify; that's the
trigger to upgrade to 128-bit GPR modelling.
Codex Ch271 acceptance — line-by-line
| Requirement | Status | Where |
|---|---|---|
| Decode primary opcode 0x1F as SQ | ✅ | OP_SQ + is_sq |
Support sq $zero, imm(base) at minimum |
✅ | rt_val=0 case writes 0 every beat (and rt_val=non_zero writes low 32 to beat 0) |
| 4-beat 32-bit-stripe FSM through existing memory interface | ✅ | sq_beat counter, stays in S_MEM_WRITE for 4 beats |
| Require 16-byte alignment; misaligned → strict/exc trap | ✅ | is_quad_access check in is_align_fault |
| Focused TB: preload base, exec SQ, verify 4 zero words | ✅ | tb_ee_core_sq |
| Verify PC advances + no GPR writeback | ✅ | Final PC check + retire path doesn't touch regfile |
| Re-run qbert.elf, report next blocker | ✅ | DADDU at pc=0x00100068 |
| Don't NOP all op=0x1F (would mask real stores) | ✅ | Targeted decode, exact 4-beat write semantics |
| Don't overbuild full LQ/SQ/vector yet | ✅ | SQ only (no LQ, no PSQ_*, no vector); upper 96 bits left for later |
| Regression unaffected | ✅ | 159/159 in flight |
Recommendation for Codex's Ch272
daddu $gp, $a0, $zero at pc=0x00100068 instr=0x0080E02D.
DADDU is MIPS-III's 64-bit version of ADDU. The R5900 is a
64-bit core; PS2 ELFs use DADDU as the canonical 64-bit
register-move pseudo-instruction (move rd, rs →
daddu rd, rs, $zero).
Our model has 32-bit regfile (logic [31:0] regfile [0:31]),
so a faithful 64-bit DADDU would need 64-bit GPRs. For the
qbert blocker specifically, the operation degenerates to a
32-bit move: $gp = $a0 + 0.
Three Ch272 framings, in order of scope:
- Decode DADDU and treat it as ADDU. Low-32-bit semantics
only; upper 32 bits silently dropped (already true everywhere
else in the model). Touches one line in
is_nop_classallow-list + one new R-type funct case + addingis_dadduto theis_rtype_alugroup. Same "minimal-correct" pattern that worked for SQ. - Decode DADDU + DADD + DSUBU + DSUB + DAND + DOR + DXOR + DNOR as their 32-bit counterparts. Broader, but these are all commonly emitted by gcc for r5900 alongside DADDU. Pre-empts the next 4-7 chapters worth of one-opcode-at-a-time growth.
- Properly implement 64-bit GPRs. Architecturally correct, but invasive — touches regfile width, all ALU paths, LW/SW to-from regfile, and the trace. Probably 1-2 chapters of work on its own.
(1) is the strict Codex-style "minimal-correct next blocker" answer. (2) would shorten the chapter chain if Codex thinks qbert's prolog uses several D* ops. (3) is a "do it right" pivot that's worth doing eventually but probably not in Ch272.
My read: (1) is the right Ch272 — same shape as Ch271, fast to land, lets the verdict surface the next real divergence. If the next blocker is also a D* op, we recur. If it's something totally different (LQ? MMI? VU0 macro?), we know (1) was the right scope.
Standing by.
Files changed
rtl/ee/ee_core_stub.sv— 5 surgical edits (~20 LOC total) for SQ decode + 4-beat write FSM.sim/tb/integration/tb_ee_core_sq.sv— new focused TB.sim/Makefile—tb_ee_core_sqtarget + added to both regression lists.
Regression
In flight at the moment of writing; expected 159/159 (was 158, +1 for tb_ee_core_sq).