thejayman77 ec82764bef Initial commit: retroDE_ps2 — first-of-its-kind PS2 GS FPGA core (DE25-Nano / Agilex 5)
RTL (GS rasterizer, EE core stub, platform bridge, LPDDR4B path), sim regression
(272 TBs), docs, and tooling. Copyrighted PS2 content (BIOS, game code, GS dumps,
and all dump-derived textures/traces) is excluded via .gitignore and stays local.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-29 20:10:50 -04:00


# Ch274 closeout — BEQL with squash-on-not-taken; qbert lands in a function prologue, next blocker is SD
**Status:** Closed. **Verdict from re-running qbert.elf:**
`elf_first_unsupported_opcode (pc=0x00112DAC instr=0xFFBF0020)`,
i.e. **SD** (Store Doubleword, MIPS-III). qbert passed the BEQL
in the C++ constructor walker correctly, JAL'd into a function
at PC `0x00112DAC`, and trapped on the very first instruction of
that function: the canonical `sd $ra, 0x20($sp)` register-save
prologue.
## Numbers
| Chapter | Blocker | qbert retire_count |
|---------|---------|---------------------|
| Post-Ch271 (SQ) | DADDU at 0x00100068 | 26,958 |
| Post-Ch272 (DADDU) | SYSCALL at 0x00100070 | 26,960 |
| Post-Ch273 (SYSCALL HLE) | BEQL at 0x001000C0 | 26,980 |
| **Post-Ch274 (BEQL)** | **SD at 0x00112DAC** | **26,985** |

The 5-retire delta covers: BEQL squash → `addiu $v0, $v1, 4` →
`lw $a0, 0($v0)` → `addiu $a1, $v0, 4` → `jal 0x00112DAC` →
first instruction of the called function (SD, traps). The
~77 KB PC jump (`0x001000D4` → `0x00112DAC`, 0x12CD8 bytes)
confirms the BEQL squash worked: qbert's `$a0` was NOT clobbered
to 0 by the squashed delay slot, the LW loaded the real
constructor pointer, and the JAL dispatched correctly.
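The arithmetic behind those numbers can be re-derived in a few lines. This is only a sanity-check sketch: the retire counts are copied from the table, and the JAL address `0x001000D4` is the one quoted in the disassembly section.

```python
# Retire counts copied from the table above (post-chapter values).
retires = {
    "Ch271 SQ": 26_958,
    "Ch272 DADDU": 26_960,
    "Ch273 SYSCALL HLE": 26_980,
    "Ch274 BEQL": 26_985,
}

counts = list(retires.values())
# Retires gained by each chapter relative to the previous one.
deltas = [b - a for a, b in zip(counts, counts[1:])]
print(deltas)  # [2, 20, 5]

JAL_PC = 0x001000D4     # address of the JAL into the called function
CALLEE_PC = 0x00112DAC  # function entry where SD traps
jump_bytes = CALLEE_PC - JAL_PC
print(hex(jump_bytes), jump_bytes)  # 0x12cd8 77016
```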
## What landed
### RTL — surgical edits in `ee_core_stub.sv`
1. **Opcode**: `localparam OP_BEQL = 6'h14` alongside `OP_BEQ`.
2. **Decode**: `is_beql` signal + `assign is_beql = (opcode == OP_BEQL)`.
3. **Branch logic**: BEQL added to `is_branch` group and to
`branch_taken` (same `(rs_val == rt_val)` condition as BEQ).
4. **New signal `is_beql_squash`**:
`is_beql && (rs_val != rt_val)` — the load-bearing case.
5. **`retire_advance`**: when `is_beql_squash` is true,
`next_pc <= pc + 32'd8` (skip the delay slot directly);
`new_branch_pending` stays low so no stale target leaks.
Existing BEQ/BNE/jump path unchanged.
6. **Decoder allow-list**: added `!is_beql` to the `is_nop_class`
catch-all so BEQL doesn't get strict-trap'd.
About 6 LOC of real change.
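As a cross-reference for the edits above, here is a small Python behavioural model of the next-PC decision. It is a sketch of the architectural BEQ/BEQL semantics, not a transcription of `ee_core_stub.sv`: `branch_step` and its return shape are hypothetical names, and the branch offset in the usage lines is made up.

```python
OP_BEQ = 0x04
OP_BEQL = 0x14  # matches the OP_BEQL = 6'h14 localparam


def sign_extend16(imm):
    """Interpret a 16-bit immediate as a signed integer."""
    imm &= 0xFFFF
    return imm - 0x10000 if imm & 0x8000 else imm


def branch_step(opcode, rs_val, rt_val, pc, imm16):
    """Return (next_fetch_pc, delay_slot_runs, pending_target_or_None)."""
    equal = (rs_val & 0xFFFFFFFF) == (rt_val & 0xFFFFFFFF)
    # Branch targets are relative to the delay-slot address (pc + 4).
    target = (pc + 4 + (sign_extend16(imm16) << 2)) & 0xFFFFFFFF
    if opcode == OP_BEQL and not equal:
        # Squash: skip the delay slot, leave no branch pending.
        return (pc + 8) & 0xFFFFFFFF, False, None
    # BEQ (either outcome) and taken BEQL: the delay slot at pc+4 runs.
    return (pc + 4) & 0xFFFFFFFF, True, (target if equal else None)


# BEQL at qbert's 0x001000C0 with unequal operands (offset is illustrative).
next_pc, slot_runs, pending = branch_step(OP_BEQL, 1, 2, 0x001000C0, 0x0010)
print(hex(next_pc), slot_runs, pending)  # 0x1000c8 False None
```

The not-taken BEQL row is the only one that differs from plain BEQ, which is why the focused TB needs a BEQ cross-check to tell the two apart.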
### Focused TB — `tb_ee_core_beql.sv`
Three cases per Codex's spec:
1. **BEQL taken** (`$t0 == $t1`): branch reaches target;
delay slot DOES execute (writes a sentinel into `$t5`).
Cross-checked by `$t6 = 0xCAFE` at the target.
2. **BEQL not-taken** (`$t2 != $t3`): delay slot squashed.
`$t7 = 0x2222` at PC+8 proves we landed correctly past the
squash. **Inline BNE chain verifies `$t5` was NOT clobbered
by the squashed delay slot** (`$t5` stays at its pre-BEQL
`0xBEEF0000` value).
3. **BEQ not-taken cross-check** (same operands): plain BEQ's
delay slot DOES execute, so `$t5` gets `0xCAB` ORed into the
low 16 bits (`$t5 = 0xBABE0CAB`). Proves BEQL's squash
differs from BEQ's no-squash behavior.
Encoding gotcha caught during TB authoring: my initial delay
slots used `ori $t5, $0, ...` (clobbers `$t5` regardless of
prior value) instead of `ori $t5, $t5, ...` (ORs into `$t5`,
preserving high bits). The first build FAILED the Case-3 check
with `$t5=0x00000CAB` instead of `0xBABE0CAB`. Fixed by changing
the rs field to RT5 so the delay slot ORs into the existing
value — making both "delay-fired" and "delay-squashed" cases
distinguishable by the high half-word.
Result: `retired=21 halt=1 trap=0 pc=0xbfc00158 errors=0 PASS`.
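The probe logic and the encoding gotcha can be modelled in Python. This is a hedged sketch: `run_probe` and `encode_ori` are hypothetical helpers, and the Case-3 pre-branch value `0xBABE0000` is inferred from the reported `0xBABE0CAB` result rather than read from the TB source.

```python
MASK32 = 0xFFFFFFFF
T5 = 13  # $t5 is GPR 13 in the standard MIPS register numbering


def encode_ori(rt, rs, imm16):
    """Encode `ori rt, rs, imm16` (opcode 0x0D, I-type)."""
    return (0x0D << 26) | (rs << 21) | (rt << 16) | (imm16 & 0xFFFF)


def run_probe(likely, taken, t5_before, accumulate=True):
    """Return $t5 after a branch whose delay slot is `ori $t5, <rs>, 0xCAB`."""
    if likely and not taken:
        return t5_before  # BEQL not-taken: the delay slot is squashed
    rs_val = t5_before if accumulate else 0  # rs = $t5 versus rs = $0
    return (rs_val | 0xCAB) & MASK32


# Only the rs field differs between the buggy and the fixed delay slot.
print(hex(encode_ori(T5, 0, 0xCAB)))   # 0x340d0cab  (ori $t5, $0, 0xCAB)
print(hex(encode_ori(T5, T5, 0xCAB)))  # 0x35ad0cab  (ori $t5, $t5, 0xCAB)

case2 = run_probe(likely=True, taken=False, t5_before=0xBEEF0000)
case3 = run_probe(likely=False, taken=False, t5_before=0xBABE0000)
buggy = run_probe(likely=False, taken=False, t5_before=0xBABE0000,
                  accumulate=False)
print(hex(case2), hex(case3), hex(buggy))  # 0xbeef0000 0xbabe0cab 0xcab
```

With the accumulating form, the high half-word survives in both outcomes, so "slot fired" and "slot squashed" differ only in the low bits.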
### Makefile + regression
- `tb_ee_core_beql` target.
- Added to both PHONY list and `run:` master.
- Regression: 161 → **162**.
## qbert disassembly around the new blocker (PC 0x00112DAC)
The JAL at `0x001000D4` calls into a function at `0x00112DAC`.
That function's prologue is:
```
0x00112DAC: 0xFFBF0020 sd $ra, 0x20($sp) <-- TRAP (opcode 0x3F, MIPS-III SD)
```
**SD** (Store Doubleword) is the MIPS-III 64-bit cousin of SW.
PS2 ELFs use it everywhere in function prologues to save
64-bit register values (`$ra`, `$s*`) onto the stack.
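To double-check the disassembly, the trapped word can be split into its I-type fields. `decode_itype` is a hypothetical helper written for this note; the field layout is the standard MIPS I-type format.

```python
def decode_itype(word):
    """Split a 32-bit MIPS I-type instruction word into its fields."""
    return {
        "opcode": (word >> 26) & 0x3F,
        "rs": (word >> 21) & 0x1F,   # base register
        "rt": (word >> 16) & 0x1F,   # register being stored
        "imm": word & 0xFFFF,        # 16-bit offset
    }


fields = decode_itype(0xFFBF0020)
print({k: hex(v) for k, v in fields.items()})
# {'opcode': '0x3f', 'rs': '0x1d', 'rt': '0x1f', 'imm': '0x20'}
```

Opcode `0x3f` is SD, register 29 (`0x1d`) is `$sp`, and register 31 (`0x1f`) is `$ra`, which matches `sd $ra, 0x20($sp)`.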
## Recommendation for Codex's Ch275
**Implement SD as a 2-beat 32-bit-stripe write FSM**, mirroring
Ch271's SQ pattern but smaller:
- **Decode**: opcode `6'h3F` → `is_sd`.
- **Alignment**: SD requires 8-byte alignment (`ea[2:0] == 0`).
Misaligned → AdES path (same as existing SW alignment).
- **FSM**: reuse the `sq_beat` counter (or add `sd_beat`); 2
beats this time. Beat 0 writes `rt_val` (low 32 bits of $rt)
at EA; beat 1 writes 0 at EA+4 (upper 32 bits of $rt not
modelled — same approximation we made for SQ beats 1-3).
- **For `sd $ra,...`**: real PS2 callees later use `LD` to restore
the 64-bit `$ra`. Our model's upper 32 bits are always 0, so
the round-trip works as long as the function doesn't do
64-bit math on `$ra` itself (rare).
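A Python reference model of the proposed two-beat write may help pin down the expected memory contents before touching RTL. It is a sketch under assumptions: memory is a dict keyed by byte address holding 32-bit words, and `sd_two_beat` and `AddressError` are hypothetical names standing in for the FSM and the AdES path.

```python
class AddressError(Exception):
    """Stand-in for the AdES (address error on store) path."""


def sd_two_beat(mem, ea, rt_val):
    """Model SD as two 32-bit stripe writes; returns the beats issued."""
    if ea & 0x7:
        raise AddressError(f"SD to misaligned address {ea:#010x}")
    beats = [
        (ea, rt_val & 0xFFFFFFFF),  # beat 0: low 32 bits of $rt
        (ea + 4, 0),                # beat 1: upper 32 bits not modelled
    ]
    for addr, word in beats:
        mem[addr] = word
    return beats


# Pre-poke the target with junk so a missing beat is visible.
mem = {0x1000: 0xDEADBEEF, 0x1004: 0xDEADBEEF}
sd_two_beat(mem, 0x1000, 0x001000DC)  # illustrative $rt value
print(hex(mem[0x1000]), hex(mem[0x1004]))  # 0x1000dc 0x0
```

Pre-poking junk matters because a model that silently skipped beat 1 would otherwise pass against zero-initialised RAM.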
Focused TB shape (mirrors `tb_ee_core_sq`):
- Pre-poke RAM target with non-zero junk.
- Execute `sd $rt, 0(base)` with `$rt` non-zero in low 32 bits.
- LW + BNE chain verifies `mem[base+0] = rt_val_low` and
`mem[base+4] = 0`.
- Direct hierarchical RAM peek for belt-and-braces.
This is structurally identical to Ch271 with `4 → 2` beats
and `16 → 8` byte alignment. Should be ~30 minutes of work.
Likely follow-on after SD: **LD** (Load Doubleword, opcode
0x37). When the called function eventually returns, it'll
`LD $ra, 0x20($sp)` to restore the saved register; our
model needs the corresponding 2-beat read path. Codex may
want to fold SD+LD into one chapter since they're symmetric.
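The SD/LD symmetry, and the limit of the zero-upper-half approximation, can be sketched as a round trip. Everything here is illustrative: `sd_model` and `ld_two_beat` are hypothetical helpers over a dict-based memory, and the stack pointer and `$ra` values are invented.

```python
def sd_model(mem, ea, rt_val):
    """Two-beat SD as proposed above: low word at EA, zero at EA+4."""
    assert ea & 0x7 == 0, "SD requires 8-byte alignment"
    mem[ea] = rt_val & 0xFFFFFFFF
    mem[ea + 4] = 0


def ld_two_beat(mem, ea):
    """Two-beat LD: read the low word at EA and the high word at EA+4."""
    assert ea & 0x7 == 0, "LD requires 8-byte alignment"
    lo = mem[ea]
    hi = mem[ea + 4]
    return (hi << 32) | lo


mem = {}
sp = 0x01FFFF00  # illustrative stack pointer
ra = 0x001000DC  # illustrative return address (fits in 32 bits)

sd_model(mem, sp + 0x20, ra)
print(ld_two_beat(mem, sp + 0x20) == ra)  # True

# A value with live upper bits does not survive the approximation.
sd_model(mem, sp + 0x28, 0x1_0000_0001)
print(hex(ld_two_beat(mem, sp + 0x28)))  # 0x1
```

Return addresses in this address range fit in 32 bits, so the `$ra` save/restore case round-trips; the second store shows what is lost if a callee ever saves a register with a non-zero upper half.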
## Files changed
- `rtl/ee/ee_core_stub.sv` — 6 surgical edits.
- `sim/tb/integration/tb_ee_core_beql.sv` — new focused TB.
- `sim/Makefile` — target + both regression lists.
## Regression
In flight at the moment of writing; expected **162/162** (was
161, +1 for `tb_ee_core_beql`).
## Process notes
- **Cross-check via BEQ in the same TB.** Codex specifically
asked for the BEQ cross-check, and it caught a real
difference: Case 3 (BEQ not-taken) writes `$t5` low bits
while Case 2 (BEQL not-taken) does NOT. Without the cross-
check, a regression where BEQL accidentally behaved like
BEQ would silently pass on the "PC landed at PC+8" check
alone.
- **OR-INTO vs OR-FROM-ZERO encoding bugs are easy to make.**
My first TB pass had `ori $rt, $0, imm` (overwriting),
which loses info about whether the delay slot fired. Always
use `ori $rt, $rt, imm` (or similar accumulating op) in
delay-slot probes so "did it fire?" is observable by a
bitwise comparison rather than a value comparison.
- **The pattern continues to compress.** Ch271 SQ took 5
edits + a TB. Ch272 DADDU took 4 + a TB. Ch273 SYSCALL HLE
took 2 + a TB (plus a runner update). Ch274 BEQL is 6 + a
TB. Each is a 1-day chapter at most. The qbert progression
is now `12 → 26,958 → 26,960 → 26,980 → 26,985 retires`;
the runner is doing its job.