retroDE_ps2/docs/ch274_closeout.md
thejayman77 ec82764bef Initial commit: retroDE_ps2 — first-of-its-kind PS2 GS FPGA core (DE25-Nano / Agilex 5)
RTL (GS rasterizer, EE core stub, platform bridge, LPDDR4B path), sim regression
(272 TBs), docs, and tooling. Copyrighted PS2 content (BIOS, game code, GS dumps,
and all dump-derived textures/traces) is excluded via .gitignore and stays local.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-29 20:10:50 -04:00

Ch274 closeout — BEQL with squash-on-not-taken; qbert lands in a function prologue, next blocker is SD

Status: Closed. Verdict from re-running qbert.elf: elf_first_unsupported_opcode (pc=0x00112DAC instr=0xFFBF0020) → SD (Store Doubleword, MIPS-III). qbert passed the C++ constructor walker's BEQL correctly, JAL'd into a function at PC 0x00112DAC, and trapped on the very first instruction of that function: the canonical sd $ra, 0x20($sp) register-save prologue.

Numbers

| Chapter | Blocker | qbert retire_count |
|---|---|---|
| Post-Ch271 (SQ) | DADDU at 0x00100068 | 26,958 |
| Post-Ch272 (DADDU) | SYSCALL at 0x00100070 | 26,960 |
| Post-Ch273 (SYSCALL HLE) | BEQL at 0x001000C0 | 26,980 |
| Post-Ch274 (BEQL) | SD at 0x00112DAC | 26,985 |

The 5-retire delta covers: BEQL squash → addiu $v0, $v1, 4 → lw $a0, 0($v0) → addiu $a1, $v0, 4 → jal 0x00112DAC → first instruction of the called function (SD, traps). The ~77 KB PC jump to 0x00112DAC (0x12CD8 bytes past the JAL at 0x001000D4) confirms the BEQL squash worked: qbert's $a0 was NOT clobbered to 0 by the squashed delay slot, the LW loaded the real constructor pointer, and the JAL dispatched correctly.

What landed

RTL — surgical edits in ee_core_stub.sv

  1. Opcode: localparam OP_BEQL = 6'h14 alongside OP_BEQ.
  2. Decode: is_beql signal + assign is_beql = (opcode == OP_BEQL).
  3. Branch logic: BEQL added to is_branch group and to branch_taken (same (rs_val == rt_val) condition as BEQ).
  4. New signal is_beql_squash: is_beql && (rs_val != rt_val) — the load-bearing case.
  5. retire_advance: when is_beql_squash is true, next_pc <= pc + 32'd8 (skip the delay slot directly); new_branch_pending stays low so no stale target leaks. Existing BEQ/BNE/jump path unchanged.
  6. Decoder allow-list: added !is_beql to the is_nop_class catch-all so BEQL doesn't get strict-trap'd.

About 6 LOC of real change.
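
As a cross-check on the edits above, here is a minimal behavioral sketch in Python of the next-PC decision. It is not the RTL: `resolve_branch` and `BranchOutcome` are hypothetical names introduced for illustration, and the model assumes only what the list states (same equality condition as BEQ, pc + 8 on a not-taken BEQL, no pending target on a squash).

```python
from dataclasses import dataclass
from typing import Optional

MASK32 = 0xFFFFFFFF


@dataclass
class BranchOutcome:
    next_pc: int                   # PC of the next instruction to retire
    delay_slot_runs: bool          # True if the instruction at pc+4 executes
    pending_target: Optional[int]  # target applied after the delay slot, if any


def resolve_branch(pc, rs_val, rt_val, target, likely):
    """Model BEQ (likely=False) and BEQL (likely=True) at address pc."""
    taken = (rs_val & MASK32) == (rt_val & MASK32)
    if taken:
        # BEQ and BEQL agree: the delay slot runs, then control moves to target.
        return BranchOutcome((pc + 4) & MASK32, True, target & MASK32)
    if likely:
        # The is_beql_squash case: skip the delay slot, leave nothing pending.
        return BranchOutcome((pc + 8) & MASK32, False, None)
    # Plain BEQ not taken: the delay slot still executes, then fall through.
    return BranchOutcome((pc + 4) & MASK32, True, None)


# The qbert case: BEQL at 0x001000C0 with unequal operands.
# The target value here is a placeholder, not taken from the ELF.
squash = resolve_branch(0x001000C0, 1, 2, 0x00100200, likely=True)
print(hex(squash.next_pc), squash.delay_slot_runs)
```

The only difference between the two opcodes shows up in the not-taken branch of the function, which is why the focused TB needs a BEQ cross-check to tell them apart.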

Focused TB — tb_ee_core_beql.sv

Three cases per Codex's spec:

  1. BEQL taken ($t0 == $t1): branch reaches target; delay slot DOES execute (writes a sentinel into $t5). Cross-checked by $t6 = 0xCAFE at the target.
  2. BEQL not-taken ($t2 != $t3): delay slot squashed. $t7 = 0x2222 at PC+8 proves we landed correctly past the squash. Inline BNE chain verifies $t5 was NOT clobbered by the squashed delay slot ($t5 stays at its pre-BEQL 0xBEEF0000 value).
  3. BEQ not-taken cross-check (same operands): plain BEQ's delay slot DOES execute, so $t5 gets 0xCAB ORed into the low 16 bits ($t5 = 0xBABE0CAB). Proves BEQL's squash differs from BEQ's no-squash behavior.

Encoding gotcha caught during TB authoring: my initial delay slots used ori $t5, $0, ... (clobbers $t5 regardless of prior value) instead of ori $t5, $t5, ... (ORs into $t5, preserving high bits). The first build FAILED the Case-3 check with $t5=0x00000CAB instead of 0xBABE0CAB. Fixed by changing the rs field to RT5 so the delay slot ORs into the existing value — making both "delay-fired" and "delay-squashed" cases distinguishable by the high half-word.
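
The two encodings can be compared with a small Python sketch. The helper names (`encode_ori`, `exec_ori`, `probe`) are hypothetical, and the 0xBABE0000 seed is inferred from the expected 0xBABE0CAB result rather than copied from the TB source.

```python
OP_ORI = 0x0D   # MIPS ORI opcode
ZERO = 0        # $0
T5 = 13         # $t5


def encode_ori(rt, rs, imm):
    """Encode `ori rt, rs, imm` as a 32-bit I-type word."""
    return (OP_ORI << 26) | (rs << 21) | (rt << 16) | (imm & 0xFFFF)


def exec_ori(regs, instr):
    """Apply an ORI word to a 32-entry register list (immediate is zero-extended)."""
    rs = (instr >> 21) & 0x1F
    rt = (instr >> 16) & 0x1F
    imm = instr & 0xFFFF
    if rt != 0:  # $0 is hard-wired to zero
        regs[rt] = (regs[rs] | imm) & 0xFFFFFFFF


def probe(instr, seed=0xBABE0000):
    """Run one delay-slot probe against a seeded $t5 and return the new $t5."""
    regs = [0] * 32
    regs[T5] = seed
    exec_ori(regs, instr)
    return regs[T5]


overwriting = encode_ori(T5, ZERO, 0x0CAB)   # ori $t5, $0,  0xCAB
accumulating = encode_ori(T5, T5, 0x0CAB)    # ori $t5, $t5, 0xCAB

print(hex(overwriting), hex(probe(overwriting)))    # 0x340d0cab 0xcab
print(hex(accumulating), hex(probe(accumulating)))  # 0x35ad0cab 0xbabe0cab
```

The two instruction words differ only in the rs field (bits 25:21), which is easy to miss when hand-assembling TB programs.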

Result: retired=21 halt=1 trap=0 pc=0xbfc00158 errors=0 PASS.

Makefile + regression

  • tb_ee_core_beql target.
  • Added to both PHONY list and run: master.
  • Regression: 161 → 162.

qbert disassembly around the new blocker (PC 0x00112DAC)

The JAL at 0x001000D4 calls into a function at 0x00112DAC. That function's prologue is:

0x00112DAC: 0xFFBF0020   sd $ra, 0x20($sp)   <-- TRAP (opcode 0x3F, MIPS-III SD)

SD (Store Doubleword) is the MIPS-III 64-bit cousin of SW. PS2 ELFs use it everywhere in function prologues to save 64-bit register values ($ra, $s*) onto the stack.
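
The trapped word can be checked by hand with a short Python sketch of I-type field extraction. `decode_itype` is a hypothetical helper, not part of the runner.

```python
def decode_itype(instr):
    """Split a 32-bit MIPS I-type word into opcode, rs, rt, and signed immediate."""
    imm = instr & 0xFFFF
    if imm & 0x8000:          # sign-extend the 16-bit offset
        imm -= 0x10000
    return {
        "opcode": (instr >> 26) & 0x3F,
        "rs": (instr >> 21) & 0x1F,   # base register
        "rt": (instr >> 16) & 0x1F,   # register being stored
        "imm": imm,
    }


fields = decode_itype(0xFFBF0020)
print(fields)  # {'opcode': 63, 'rs': 29, 'rt': 31, 'imm': 32}
```

Opcode 63 is 0x3F (SD), register 29 is $sp, register 31 is $ra, and the offset is 0x20, matching the disassembly line.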

Recommendation for Codex's Ch275

Implement SD as a 2-beat 32-bit-stripe write FSM, mirroring Ch271's SQ pattern but smaller:

  • Decode: opcode 6'h3F → is_sd.
  • Alignment: SD requires 8-byte alignment (ea[2:0] == 0). Misaligned → AdES path (same as existing SW alignment).
  • FSM: reuse the sq_beat counter (or add sd_beat); 2 beats this time. Beat 0 writes rt_val (low 32 bits of $rt) at EA; beat 1 writes 0 at EA+4 (upper 32 bits of $rt not modelled — same approximation we made for SQ beats 1-3).
  • For sd $ra,...: real PS2 callees later LD to restore 64-bit $ra. Our model's upper 32 bits are always 0, so the round-trip works as long as the function doesn't do 64-bit math on $ra itself (rare).
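
A behavioral sketch of the proposed write sequence, in Python rather than RTL, may help pin down the expected memory image before the FSM is written. All names (`AddressError`, `sd_beats`, `run_sd`) are hypothetical, memory is modelled as a dict of 32-bit words keyed by byte address, and the zero upper word reflects the approximation described above, not real SD behavior.

```python
MASK32 = 0xFFFFFFFF


class AddressError(Exception):
    """Stands in for the AdES exception path."""


def sign_extend16(value):
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def sd_beats(ea, rt_val):
    """Return the (address, word) pairs for the two 32-bit write beats."""
    if ea & 0x7:
        raise AddressError(f"SD to misaligned address {ea:#010x}")
    return [
        (ea, rt_val & MASK32),        # beat 0: low 32 bits of $rt
        ((ea + 4) & MASK32, 0),       # beat 1: upper half not modelled, write 0
    ]


def run_sd(mem, base, offset, rt_val):
    """Model `sd $rt, offset(base)` against a word-addressed dict."""
    ea = (base + sign_extend16(offset)) & MASK32
    for addr, word in sd_beats(ea, rt_val):
        mem[addr] = word


# Pre-poke the target with junk, as in the focused TB shape.
# The stack base and return-address values are placeholders.
mem = {0x2020: 0xDEADBEEF, 0x2024: 0xDEADBEEF}
run_sd(mem, base=0x2000, offset=0x20, rt_val=0x001000DC)
print(hex(mem[0x2020]), hex(mem[0x2024]))  # 0x1000dc 0x0
```

Pre-poking both words with junk matters because the second beat writes 0; a beat that never fires would otherwise be indistinguishable from freshly zeroed RAM.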

Focused TB shape (mirrors tb_ee_core_sq):

  • Pre-poke RAM target with non-zero junk.
  • Execute sd $rt, 0(base) with $rt non-zero in low 32 bits.
  • LW + BNE chain verifies mem[base+0] = rt_val_low and mem[base+4] = 0.
  • Direct hierarchical RAM peek for belt-and-braces.

This is structurally identical to Ch271 with 4 → 2 beats and 16 → 8 byte alignment. Should be ~30 minutes of work.

Likely follow-on after SD: LD (Load Doubleword, opcode 0x37). When the called function eventually returns, it'll LD $ra, 0x20($sp) to restore the saved register; our model needs the corresponding 2-beat read path. Codex may want to fold SD+LD into one chapter since they're symmetric.

Files changed

  • rtl/ee/ee_core_stub.sv — 6 surgical edits.
  • sim/tb/integration/tb_ee_core_beql.sv — new focused TB.
  • sim/Makefile — target + both regression lists.

Regression

In flight at the moment of writing; expected 162/162 (was 161, +1 for tb_ee_core_beql).

Process notes

  • Cross-check via BEQ in the same TB. Codex specifically asked for the BEQ cross-check, and it caught a real difference: Case 3 (BEQ not-taken) writes $t5 low bits while Case 2 (BEQL not-taken) does NOT. Without the cross-check, a regression where BEQL accidentally behaved like BEQ would silently pass on the "PC landed at PC+8" check alone.
  • OR-INTO vs OR-FROM-ZERO encoding bugs are easy to make. My first TB pass had ori $rt, $0, imm (overwriting), which loses info about whether the delay slot fired. Always use ori $rt, $rt, imm (or similar accumulating op) in delay-slot probes so "did it fire?" is observable by a bitwise comparison rather than a value comparison.
  • The pattern continues to compress. Ch271 SQ took 5 edits + a TB. Ch272 DADDU took 4 + a TB. Ch273 SYSCALL HLE took 2 + a TB (plus a runner update). Ch274 BEQL is 6 + a TB. Each is a 1-day chapter at most. The qbert progression is now 12 → 26,958 → 26,960 → 26,980 → 26,985 retires — the runner is doing its job.