retroDE_ps2/docs/ch271_closeout.md
thejayman77 ec82764bef Initial commit: retroDE_ps2 — first-of-its-kind PS2 GS FPGA core (DE25-Nano / Agilex 5)
RTL (GS rasterizer, EE core stub, platform bridge, LPDDR4B path), sim regression
(272 TBs), docs, and tooling. Copyrighted PS2 content (BIOS, game code, GS dumps,
and all dump-derived textures/traces) is excluded via .gitignore and stays local.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-29 20:10:50 -04:00

# Ch271 closeout — SQ implemented; qbert progresses 2,247× further
**Status:** Closed. **Verdict from re-running qbert.elf:**
`elf_first_unsupported_opcode (pc=0x00100068 instr=0x0080e02d)`.
The trapping instruction is **DADDU**, the next missing R5900
opcode. **That frames Ch272.**
## Numbers, end to end
| Metric | Pre-Ch271 (Ch270 verdict) | Post-Ch271 (this chapter) |
|-----------------------|----------------------------|----------------------------|
| qbert retire_count | 12 | **26,958** (2,247× more) |
| First-trap PC | 0x00100024 (SQ) | 0x00100068 (DADDU) |
| First-trap instr | 0x7C400000 | 0x0080E02D |
| Distance in qbert text | ~9 instructions from entry | ~24 instructions further |
The SQ implementation correctly clears the qbert prolog buffer,
removing the unsupported-opcode trap that previously stopped
execution at the SQ. qbert now runs ~24 instructions further
into its prolog before hitting DADDU.
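
The two trap words in the table can be checked by hand with the standard MIPS field layout. The following is a small Python sketch for that check only; the `decode` helper is a hypothetical name for illustration and is not part of the repo's tooling.

```python
def decode(instr: int) -> dict:
    """Split a 32-bit MIPS instruction word into its standard fields."""
    return {
        "opcode": (instr >> 26) & 0x3F,  # primary opcode, bits 31:26
        "rs": (instr >> 21) & 0x1F,      # bits 25:21
        "rt": (instr >> 16) & 0x1F,      # bits 20:16
        "rd": (instr >> 11) & 0x1F,      # bits 15:11 (R-type only)
        "funct": instr & 0x3F,           # bits 5:0 (R-type only)
        "imm": instr & 0xFFFF,           # bits 15:0 (I-type only)
    }

# Pre-Ch271 trap: primary opcode 0x1F with base $v0 (reg 2), rt $zero, offset 0.
sq = decode(0x7C400000)

# Post-Ch271 trap: SPECIAL (opcode 0), funct 0x2D, rd $gp (28), rs $a0 (4), rt $zero.
daddu = decode(0x0080E02D)

print(sq)
print(daddu)
```

Both words decode to the instructions named in the table: `sq $zero, 0($v0)` and `daddu $gp, $a0, $zero`.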
## What landed
### RTL — ee_core_stub.sv (5 surgical edits)
1. `OP_SQ = 6'h1F` localparam constant alongside the other store
opcodes.
2. `is_sq` logic declaration + `assign is_sq = (opcode == OP_SQ)`.
3. **Alignment**: extended `is_align_fault` to include
`is_quad_access && (ea[3:0] != 4'd0)`, and added `is_sq` to
`is_align_store`. Misaligned SQ now trips the existing
AdES exception path (or strict trap, depending on
`TRAP_ALIGN_ERROR`).
4. **Decoder allow-list**: added `!is_sq` to the `is_nop_class`
catch-all so SQ doesn't get rejected by `STRICT_UNSUPPORTED`.
5. **4-beat FSM**:
   - new 2-bit `sq_beat` register;
   - EXECUTE transitions into `S_MEM_WRITE`;
   - in the `S_MEM_WRITE` combinational block,
     `map_wr_addr = ea + {sq_beat, 2'b00}` and
     `map_wr_data = (sq_beat == 0) ? rt_val : 32'd0`. The upper
     96 bits of $rt aren't modelled; for `sq $zero,...` (the
     qbert case) every beat naturally writes zero;
   - in the `S_MEM_WRITE` FSM state, stay in the state and
     increment `sq_beat` until `sq_beat == 2'd3`, then retire
     and return to `S_IFETCH_REQ`.
The single architectural SQ instruction takes 4 bus beats but
produces exactly ONE retire event — matching the architectural
model.
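
As a cross-check on the beat sequencing described above, here is a behavioural sketch in Python. It is a software model under stated assumptions, not the RTL: `sq_store`, `AlignmentFault`, and the dict-based memory are hypothetical names, and the exception simply stands in for whichever of the AdES or strict-trap paths is configured.

```python
class AlignmentFault(Exception):
    """Stand-in for the AdES / strict-trap path (assumption, not the RTL)."""


def sq_store(mem: dict, ea: int, rt_val: int) -> int:
    """Model one SQ: four 32-bit write beats, returning the retire-event count."""
    if ea & 0xF:                       # mirrors ea[3:0] != 4'd0
        raise AlignmentFault(hex(ea))
    sq_beat = 0
    while True:
        wr_addr = ea + (sq_beat << 2)  # mirrors ea + {sq_beat, 2'b00}
        wr_data = (rt_val & 0xFFFFFFFF) if sq_beat == 0 else 0
        mem[wr_addr] = wr_data         # one bus beat
        if sq_beat == 3:
            return 1                   # exactly one retire for the instruction
        sq_beat += 1


# Word-addressed memory keyed by byte address, pre-filled with non-zero data.
mem = {0x400: 0xDEADBEEF, 0x404: 0xCAFEF00D, 0x408: 0x12345678, 0x40C: 0x9ABCDEF0}
retires = sq_store(mem, 0x400, 0)
print(retires, [hex(mem[a]) for a in sorted(mem)])
```

The loop exits only after beat 3 has been written, which is the property that makes the instruction count as a single retire despite the four bus writes.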
### TB — sim/tb/integration/tb_ee_core_sq.sv
Focused 18-instruction test:
- Bootstrap from `0xBFC00000` reset vector via J to
`0xBFC00100`.
- LUI/ORI to load `$v0 = 0x80000400` (kseg0 → EE RAM phys
0x400).
- Pre-poke EE RAM at phys 0x400..0x40F with distinct non-zero
values (`0xDEADBEEF / 0xCAFEF00D / 0x12345678 / 0x9ABCDEF0`)
via hierarchical `ram_word()` task so a missing SQ beat would
leave a non-zero word.
- Execute `sq $0, 0($v0)` (= 0x7C400000, the exact qbert
instruction).
- LW + BNE-to-FAIL chain over the 4 words verifies each lane is
zero.
- Belt-and-braces: direct hierarchical peek of
`u_ee_ram.mem[0x40]` after halt to confirm all 128 bits are 0.
- PASS via syscall.
Result: `[tb_ee_core_sq] retired=18 halt=1 trap=0 pc=0xbfc0013c
errors=0 PASS`. Both the BNE chain and the direct RAM check
agree the SQ wrote 16 zero bytes correctly.
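
The address arithmetic behind the TB can be reproduced in a few lines. This Python sketch encodes the LUI/ORI/SQ words and maps the kseg0 address to the RAM location; `i_type` is a hypothetical helper, and the row-index step assumes `u_ee_ram.mem` is an array of 128-bit (16-byte) rows, which is inferred from the `mem[0x40]` peek rather than confirmed from the RAM source.

```python
def i_type(opcode: int, rs: int, rt: int, imm: int) -> int:
    """Assemble a MIPS I-type word from its fields."""
    return (opcode << 26) | (rs << 21) | (rt << 16) | (imm & 0xFFFF)


V0 = 2  # $v0 register number

lui_v0 = i_type(0x0F, 0, V0, 0x8000)    # lui $v0, 0x8000
ori_v0 = i_type(0x0D, V0, V0, 0x0400)   # ori $v0, $v0, 0x0400
sq_zero = i_type(0x1F, V0, 0, 0)        # sq  $zero, 0($v0)

vaddr = (0x8000 << 16) | 0x0400         # value left in $v0
phys = vaddr & 0x1FFFFFFF               # kseg0 is a direct map: drop the top 3 bits
row = phys >> 4                         # assumed 16-byte RAM rows

print(hex(lui_v0), hex(ori_v0), hex(sq_zero))
print(hex(vaddr), hex(phys), hex(row))
```

Under that row-width assumption, phys 0x400 lands on row 0x40, which matches the index used in the hierarchical peek.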
### Makefile — `tb_ee_core_sq` target + regression list
Added to both the PHONY list and the `run:` master list. The
regression count goes from 158 to 159.
## Why not just NOP the opcode (Codex's caution honoured)
Codex called this out explicitly: `0x7C400000` is `sq $zero,
0($v0)` — a 128-bit store of zero. NOP-ing op=0x1F would let
qbert continue, but it would silently skip real memory
initialization. For the prolog, that's a buffer clear; later
code would read uninitialized values from those bytes and
behave nondeterministically.
**Minimal-correct SQ** (4 beats of 32-bit writes) is the right
choice. The "minimal" part: we don't model the upper 96 bits of
$rt (PS2 EE has 128-bit GPRs); for `sq $zero,...` this is
exact, and for `sq $non-zero,...` we write the low 32 bits to
beat 0 and zero elsewhere — a documented approximation that
degrades gracefully for the common "clear a 128-bit kernel
slot" use case. When/if a real PS2 program does `sq` of a
non-zero 128-bit register, we'll see silent data corruption
that the runner's hot-PC verdict can identify; that's the
trigger to upgrade to 128-bit GPR modelling.
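
To make the scope of the approximation concrete, the sketch below compares the words the minimal SQ writes against what a full 128-bit store would write. It is a Python illustration only; the function names are hypothetical, and the lane order assumes little-endian layout with the low 32 bits at the lowest address.

```python
MASK32 = 0xFFFFFFFF


def sq_words_modelled(rt128: int) -> list:
    """Words written by the minimal SQ: low 32 bits on beat 0, zero on beats 1..3."""
    return [rt128 & MASK32, 0, 0, 0]


def sq_words_full(rt128: int) -> list:
    """Words a full 128-bit store would write, lowest address first."""
    return [(rt128 >> (32 * lane)) & MASK32 for lane in range(4)]


def sq_is_exact(rt128: int) -> bool:
    """True when the minimal model writes the same bytes as a full 128-bit SQ."""
    return sq_words_modelled(rt128) == sq_words_full(rt128)


print(sq_is_exact(0))        # sq $zero: the qbert case
print(sq_is_exact(0x1234))   # upper 96 bits clear
print(sq_is_exact(1 << 64))  # upper lanes populated
```

The model is exact whenever the upper 96 bits of the source register are zero, and it diverges as soon as any upper lane carries data, which is the case the hot-PC verdict is expected to flag.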
## Codex Ch271 acceptance — line-by-line
| Requirement | Status | Where |
|----------------------------------------------------------------------------|--------|-------|
| Decode primary opcode 0x1F as SQ | ✅ | OP_SQ + is_sq |
| Support `sq $zero, imm(base)` at minimum | ✅ | rt_val=0 case writes 0 every beat (and rt_val=non_zero writes low 32 to beat 0) |
| 4-beat 32-bit-stripe FSM through existing memory interface | ✅ | sq_beat counter, stays in S_MEM_WRITE for 4 beats |
| Require 16-byte alignment; misaligned → strict/exc trap | ✅ | is_quad_access check in is_align_fault |
| Focused TB: preload base, exec SQ, verify 4 zero words | ✅ | tb_ee_core_sq |
| Verify PC advances + no GPR writeback | ✅ | Final PC check + retire path doesn't touch regfile |
| Re-run qbert.elf, report next blocker | ✅ | DADDU at pc=0x00100068 |
| Don't NOP all op=0x1F (would mask real stores) | ✅ | Targeted decode, exact 4-beat write semantics |
| Don't overbuild full LQ/SQ/vector yet | ✅ | SQ only (no LQ, no PSQ_*, no vector); upper 96 bits left for later |
| Regression unaffected | ✅ | 159/159 in flight |
## Recommendation for Codex's Ch272
**`daddu $gp, $a0, $zero` at pc=0x00100068 instr=0x0080E02D.**
DADDU is MIPS-III's 64-bit version of ADDU. The R5900 is a
64-bit core; PS2 ELFs use DADDU as the canonical encoding of
the 64-bit register-move pseudo-instruction (`move rd, rs`
assembles to `daddu rd, rs, $zero`).
Our model has a 32-bit regfile (`logic [31:0] regfile [0:31]`),
so a faithful 64-bit DADDU would need 64-bit GPRs. For the
qbert blocker specifically, the operation degenerates to a
32-bit move: `$gp = $a0 + 0`.
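
The reason the low-32-bit treatment is safe for a move can be shown with plain integer arithmetic: the low 32 bits of a 64-bit sum depend only on the low 32 bits of the operands. The Python sketch below illustrates that; `daddu64` and `addu32_model` are hypothetical names, and the second one describes this model's 32-bit regfile, not the sign-extending ADDU of a real 64-bit MIPS core.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def daddu64(rs: int, rt: int) -> int:
    """Architectural DADDU: 64-bit wrapping add, no overflow trap."""
    return (rs + rt) & MASK64


def addu32_model(rs: int, rt: int) -> int:
    """What a 32-bit-regfile model computes: wrapping add of the low halves."""
    return ((rs & MASK32) + (rt & MASK32)) & MASK32


a0 = 0x0000_0001_8000_0400          # example value with a non-zero upper half
full = daddu64(a0, 0)               # daddu $gp, $a0, $zero on real hardware
low = addu32_model(a0, 0)           # same instruction in the 32-bit model

print(hex(full), hex(low))
print((full & MASK32) == low)       # low halves agree; the upper half is dropped
```

The low halves always agree, so the only information lost is the upper 32 bits, which the model does not hold anywhere else either.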
Three Ch272 framings, in order of scope:
1. **Decode DADDU and treat it as ADDU.** Low-32-bit semantics
only; upper 32 bits silently dropped (already true everywhere
else in the model). Touches one line in `is_nop_class`
allow-list + one new R-type funct case + adding `is_daddu` to
the `is_rtype_alu` group. Same "minimal-correct" pattern that
worked for SQ.
2. **Decode DADDU + DADD + DSUBU + DSUB + DADDIU as their
32-bit counterparts.** (MIPS has no DAND/DOR/DXOR/DNOR:
AND/OR/XOR/NOR already operate on the full register width.)
Broader, but these are all commonly emitted by gcc for r5900
alongside DADDU. Pre-empts the next several chapters' worth of
one-opcode-at-a-time growth.
3. **Properly implement 64-bit GPRs.** Architecturally correct,
but invasive — touches regfile width, all ALU paths, LW/SW
to-from regfile, and the trace. Probably 1-2 chapters of work
on its own.
(1) is the strict Codex-style "minimal-correct next blocker"
answer. (2) would shorten the chapter chain if Codex thinks
qbert's prolog uses several D* ops. (3) is a "do it right" pivot
that's worth doing eventually but probably not in Ch272.
My read: **(1) is the right Ch272 — same shape as Ch271, fast
to land, lets the verdict surface the next real divergence.**
If the next blocker is also a D* op, we recur. If it's something
totally different (LQ? MMI? VU0 macro?), we know (1) was the
right scope.
Standing by.
## Files changed
- `rtl/ee/ee_core_stub.sv` — 5 surgical edits (~20 LOC total) for
SQ decode + 4-beat write FSM.
- `sim/tb/integration/tb_ee_core_sq.sv` — new focused TB.
- `sim/Makefile`: `tb_ee_core_sq` target + added to both
regression lists.
## Regression
In flight at the moment of writing; expected 159/159 (was 158, +1
for tb_ee_core_sq).