retroDE_ps2/docs/ch271_closeout.md
thejayman77 ec82764bef Initial commit: retroDE_ps2 — first-of-its-kind PS2 GS FPGA core (DE25-Nano / Agilex 5)
RTL (GS rasterizer, EE core stub, platform bridge, LPDDR4B path), sim regression
(272 TBs), docs, and tooling. Copyrighted PS2 content (BIOS, game code, GS dumps,
and all dump-derived textures/traces) is excluded via .gitignore and stays local.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-29 20:10:50 -04:00

Ch271 closeout — SQ implemented; qbert progresses 2,247× further

Status: Closed. Verdict from re-running qbert.elf: elf_first_unsupported_opcode (pc=0x00100068 instr=0x0080e02d), which is DADDU, the next missing R5900 opcode. That frames Ch272.

Numbers, end to end

| Metric | Pre-Ch271 (Ch270 verdict) | Post-Ch271 (this chapter) |
| --- | --- | --- |
| qbert retire_count | 12 | 26,958 (2,247× more) |
| First-trap PC | 0x00100024 (SQ) | 0x00100068 (DADDU) |
| First-trap instr | 0x7C400000 | 0x0080E02D |
| Distance in qbert text | ~9 instructions from entry | ~24 instructions further |

The SQ implementation now performs the prolog's buffer clear; previously execution stopped there because the opcode was unsupported. qbert now progresses ~24 instructions further into its prolog before hitting DADDU.
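As a quick cross-check of the two trap words in the table, the sketch below splits each one into the standard MIPS instruction fields. The `decode` helper is hypothetical and written for this note; it is not part of the runner or the RTL.

```python
def decode(instr: int) -> dict:
    """Split a 32-bit MIPS instruction word into its standard fields."""
    return {
        "opcode": (instr >> 26) & 0x3F,  # primary opcode, bits 31:26
        "rs": (instr >> 21) & 0x1F,      # base / first source register
        "rt": (instr >> 16) & 0x1F,      # second source (store data for SQ)
        "rd": (instr >> 11) & 0x1F,      # destination (R-type only)
        "funct": instr & 0x3F,           # R-type function code
        "imm": instr & 0xFFFF,           # I-type immediate / offset
    }


sq_fields = decode(0x7C400000)     # Ch270 blocker
daddu_fields = decode(0x0080E02D)  # Ch271 blocker

print(sq_fields)
print(daddu_fields)
```

The first word decodes to primary opcode 0x1F with rs=2 ($v0), rt=0 ($zero), and offset 0, i.e. sq $zero, 0($v0). The second decodes to SPECIAL (opcode 0) with funct 0x2D, rs=4 ($a0), rt=0, and rd=28 ($gp), i.e. daddu $gp, $a0, $zero.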

What landed

RTL — ee_core_stub.sv (5 surgical edits)

  1. OP_SQ = 6'h1F localparam constant alongside the other store opcodes.
  2. is_sq logic declaration + assign is_sq = (opcode == OP_SQ).
  3. Alignment: extended is_align_fault to include is_quad_access && (ea[3:0] != 4'd0), and added is_sq to is_align_store. Misaligned SQ now trips the existing AdES exception path (or strict trap, depending on TRAP_ALIGN_ERROR).
  4. Decoder allow-list: added !is_sq to the is_nop_class catch-all so SQ doesn't get rejected by STRICT_UNSUPPORTED.
  5. 4-beat FSM: new 2-bit sq_beat register, with a transition into S_MEM_WRITE from EXECUTE. In the S_MEM_WRITE combinational block, map_wr_addr = ea + {sq_beat, 2'b00} and map_wr_data = (sq_beat == 0) ? rt_val : 32'd0. The upper 96 bits of $rt aren't modelled; for sq $zero,... (the qbert case) every beat naturally writes zero. In the S_MEM_WRITE FSM state, the core stays in the state and increments sq_beat until sq_beat == 2'd3, then retires and returns to S_IFETCH_REQ.

The single architectural SQ instruction takes 4 bus beats but produces exactly ONE retire event — matching the architectural model.
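The beat sequencing described above can be sketched as a small behavioural model. This is Python rather than the actual SystemVerilog, and the function names and the dict-based memory are assumptions made for illustration only; the address and data expressions mirror map_wr_addr and map_wr_data as quoted in edit 5.

```python
def sq_beats(ea: int, rt_val: int) -> list:
    """Return the (address, data) pairs the four write beats would issue."""
    if ea & 0xF:
        # Mirrors the is_align_fault extension: SQ needs 16-byte alignment.
        raise ValueError("misaligned SQ: AdES / strict-trap path")
    writes = []
    for sq_beat in range(4):
        addr = (ea + (sq_beat << 2)) & 0xFFFFFFFF             # ea + {sq_beat, 2'b00}
        data = (rt_val & 0xFFFFFFFF) if sq_beat == 0 else 0   # low 32 bits, then zeros
        writes.append((addr, data))
    return writes


def execute_sq(mem: dict, ea: int, rt_val: int) -> int:
    """Apply all four beats to a word-addressed dict; return the retire count."""
    for addr, data in sq_beats(ea, rt_val):
        mem[addr] = data
    return 1  # four bus beats, one architectural retire event


# Same preload pattern as the focused TB, then sq $zero, 0(0x400).
mem = {0x400: 0xDEADBEEF, 0x404: 0xCAFEF00D, 0x408: 0x12345678, 0x40C: 0x9ABCDEF0}
retired = execute_sq(mem, 0x400, 0)
print(retired, [hex(mem[a]) for a in sorted(mem)])
```

Keeping the beat generator separate from the memory update makes the one-retire-per-instruction property explicit: the loop runs four times, but the caller sees a single completed instruction.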

TB — sim/tb/integration/tb_ee_core_sq.sv

Focused 18-instruction test:

  • Bootstrap from 0xBFC00000 reset vector via J to 0xBFC00100.
  • LUI/ORI to load $v0 = 0x80000400 (kseg0 → EE RAM phys 0x400).
  • Pre-poke EE RAM at phys 0x400..0x40F with distinct non-zero values (0xDEADBEEF / 0xCAFEF00D / 0x12345678 / 0x9ABCDEF0) via hierarchical ram_word() task so a missing SQ beat would leave a non-zero word.
  • Execute sq $0, 0($v0) (= 0x7C400000, the exact qbert instruction).
  • LW + BNE-to-FAIL chain over the 4 words verifies each lane is zero.
  • Belt-and-braces: direct hierarchical peek of u_ee_ram.mem[0x40] after halt to confirm all 128 bits are 0.
  • PASS via syscall.

Result: [tb_ee_core_sq] retired=18 halt=1 trap=0 pc=0xbfc0013c errors=0 PASS. Both the BNE chain and the direct RAM check agree the SQ wrote 16 zero bytes correctly.
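The address arithmetic behind the TB can be checked in a few lines. This sketch assumes the conventional MIPS kseg0 translation (strip the top three address bits) and assumes that u_ee_ram.mem is organised as 128-bit rows, which is inferred from the mem[0x40] index rather than stated in the TB description; the helper names are hypothetical.

```python
KSEG0_PHYS_MASK = 0x1FFFFFFF  # kseg0 (0x80000000..0x9FFFFFFF) maps to phys by masking


def lui_ori(upper16: int, lower16: int) -> int:
    """Value left in a register by LUI upper16 followed by ORI lower16."""
    return ((upper16 << 16) | lower16) & 0xFFFFFFFF


v0 = lui_ori(0x8000, 0x0400)   # the TB's base pointer
phys = v0 & KSEG0_PHYS_MASK    # physical EE RAM byte address
row = phys >> 4                # assumed: one RAM row per 16 bytes (128 bits)

print(hex(v0), hex(phys), hex(row))
```

Under those assumptions the base pointer 0x80000400 lands on physical 0x400, and the whole 16-byte SQ target occupies the single row mem[0x40] that the direct peek inspects.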

Makefile — tb_ee_core_sq target + regression list

Added to both the PHONY list and the run: master list. The regression grows from 158 to 159 testbenches.

Why not just NOP the opcode (Codex's caution honoured)

Codex called this out explicitly: 0x7C400000 is sq $zero, 0($v0) — a 128-bit store of zero. NOP-ing op=0x1F would let qbert continue, but it would silently skip real memory initialization. For the prolog, that's a buffer clear; later code would read uninitialized values from those bytes and behave nondeterministically.

Minimal-correct SQ (4 beats of 32-bit writes) is the right choice. The "minimal" part: we don't model the upper 96 bits of $rt (PS2 EE has 128-bit GPRs); for sq $zero,... this is exact, and for sq $non-zero,... we write the low 32 bits to beat 0 and zero elsewhere — a documented approximation that degrades gracefully for the common "clear a 128-bit kernel slot" use case. When/if a real PS2 program does sq of a non-zero 128-bit register, we'll see silent data corruption that the runner's hot-PC verdict can identify; that's the trigger to upgrade to 128-bit GPR modelling.
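To make the scope of the approximation concrete, the sketch below compares what the model leaves in memory with what a full 128-bit store would leave. The 128-bit register value is hypothetical, since the model's regfile is only 32 bits wide, and the function names are illustrative.

```python
MASK32 = (1 << 32) - 1
MASK128 = (1 << 128) - 1


def stored_by_model(rt128: int) -> int:
    """128-bit memory image after the model's four beats (low word + zeros)."""
    return rt128 & MASK32


def sq_is_exact(rt128: int) -> bool:
    """True when the model's store matches a full 128-bit store of rt128."""
    return stored_by_model(rt128) == (rt128 & MASK128)


for value in (0, 0xDEADBEEF, 1 << 32, 1 << 127):
    print(hex(value), sq_is_exact(value))
```

The store is exact whenever the upper 96 bits of the source are zero, which covers sq $zero and any value that fits in 32 bits; anything wider is truncated.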

Codex Ch271 acceptance — line-by-line

| Requirement | Status | Where |
| --- | --- | --- |
| Decode primary opcode 0x1F as SQ | | OP_SQ + is_sq |
| Support sq $zero, imm(base) at minimum | | rt_val=0 case writes 0 every beat (and rt_val=non_zero writes low 32 to beat 0) |
| 4-beat 32-bit-stripe FSM through existing memory interface | | sq_beat counter, stays in S_MEM_WRITE for 4 beats |
| Require 16-byte alignment; misaligned → strict/exc trap | | is_quad_access check in is_align_fault |
| Focused TB: preload base, exec SQ, verify 4 zero words | | tb_ee_core_sq |
| Verify PC advances + no GPR writeback | | Final PC check + retire path doesn't touch regfile |
| Re-run qbert.elf, report next blocker | | DADDU at pc=0x00100068 |
| Don't NOP all op=0x1F (would mask real stores) | | Targeted decode, exact 4-beat write semantics |
| Don't overbuild full LQ/SQ/vector yet | | SQ only (no LQ, no PSQ_*, no vector); upper 96 bits left for later |
| Regression unaffected | | 159/159 in flight |

Recommendation for Codex's Ch272

daddu $gp, $a0, $zero at pc=0x00100068 instr=0x0080E02D.

DADDU is MIPS-III's 64-bit version of ADDU. The R5900 is a 64-bit core; PS2 ELFs use DADDU as the canonical 64-bit register-move pseudo-instruction (move rd, rs → daddu rd, rs, $zero).

Our model has 32-bit regfile (logic [31:0] regfile [0:31]), so a faithful 64-bit DADDU would need 64-bit GPRs. For the qbert blocker specifically, the operation degenerates to a 32-bit move: $gp = $a0 + 0.
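A minimal sketch of why the low-32-bit treatment is safe for this blocker: modular addition means the low 32 bits of a 64-bit DADDU always equal a 32-bit add of the low halves, so only the upper half is lost. The function names and the sample $a0 value are illustrative assumptions, not taken from the RTL or from qbert.

```python
MASK32 = (1 << 32) - 1
MASK64 = (1 << 64) - 1


def daddu64(rs: int, rt: int) -> int:
    """Architectural DADDU: 64-bit add, no overflow trap."""
    return (rs + rt) & MASK64


def addu32_model(rs: int, rt: int) -> int:
    """What a 32-bit-regfile model computes: add of the low halves only."""
    return ((rs & MASK32) + (rt & MASK32)) & MASK32


a0 = 0x0000_0001_8000_0000  # hypothetical 64-bit $a0 with a non-zero upper half
full = daddu64(a0, 0)       # daddu $gp, $a0, $zero on real hardware
model = addu32_model(a0, 0)

print(hex(full), hex(model), (full & MASK32) == model)
```

The low halves always agree; the results differ as a whole only when the true 64-bit result has non-zero upper bits, which is the same information the 32-bit regfile already drops everywhere else.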

Three Ch272 framings, in order of scope:

  1. Decode DADDU and treat it as ADDU. Low-32-bit semantics only; upper 32 bits silently dropped (already true everywhere else in the model). Touches one line in is_nop_class allow-list + one new R-type funct case + adding is_daddu to the is_rtype_alu group. Same "minimal-correct" pattern that worked for SQ.
  2. Decode DADDU + DADD + DSUBU + DSUB as their 32-bit counterparts. (MIPS has no DAND/DOR/DXOR/DNOR: AND, OR, XOR, and NOR already operate on the full register width, so they need no separate decode.) Broader, but these are all commonly emitted by gcc for r5900 alongside DADDU. Pre-empts several chapters' worth of one-opcode-at-a-time growth.
  3. Properly implement 64-bit GPRs. Architecturally correct, but invasive — touches regfile width, all ALU paths, LW/SW to-from regfile, and the trace. Probably 1-2 chapters of work on its own.
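For framings (1) and (2), the decode change amounts to aliasing SPECIAL funct codes onto their 32-bit counterparts. The sketch below uses the standard MIPS funct encodings for the doubleword add/subtract group; the table and helper names are hypothetical and do not correspond to identifiers in ee_core_stub.sv. One caveat to keep in mind: aliasing DADD/DSUB onto ADD/SUB would also move the overflow-trap boundary from 64 bits to 32 bits.

```python
# SPECIAL (opcode 0) funct codes: doubleword op -> 32-bit counterpart.
D_TO_32 = {
    0x2C: 0x20,  # DADD  -> ADD
    0x2D: 0x21,  # DADDU -> ADDU
    0x2E: 0x22,  # DSUB  -> SUB
    0x2F: 0x23,  # DSUBU -> SUBU
}


def alias_funct(funct: int, framing: int = 1) -> int:
    """Return the funct code a 32-bit model would execute under framing 1 or 2."""
    if framing == 1:
        return 0x21 if funct == 0x2D else funct  # DADDU only
    return D_TO_32.get(funct, funct)             # whole add/sub group


qbert_funct = 0x0080E02D & 0x3F  # funct field of the Ch271 blocker
print(hex(qbert_funct), hex(alias_funct(qbert_funct)))
```

Under framing (1) only DADDU is redirected and every other funct passes through unchanged, so the next missing opcode still surfaces in the verdict.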

(1) is the strict Codex-style "minimal-correct next blocker" answer. (2) would shorten the chapter chain if Codex thinks qbert's prolog uses several D* ops. (3) is a "do it right" pivot that's worth doing eventually but probably not in Ch272.

My read: (1) is the right Ch272 — same shape as Ch271, fast to land, lets the verdict surface the next real divergence. If the next blocker is also a D* op, we recur. If it's something totally different (LQ? MMI? VU0 macro?), we know (1) was the right scope.

Standing by.

Files changed

  • rtl/ee/ee_core_stub.sv: 5 surgical edits (~20 LOC total) for SQ decode + 4-beat write FSM.
  • sim/tb/integration/tb_ee_core_sq.sv: new focused TB.
  • sim/Makefile: tb_ee_core_sq target + added to both regression lists.

Regression

In flight at the moment of writing; expected 159/159 (was 158, +1 for tb_ee_core_sq).