Ch277 closeout — BNEL squash-on-not-taken; qbert hits MMI (PCPYLD) one instruction later
Status: Closed. Verdict from re-running qbert.elf:
`elf_first_unsupported_opcode (pc=0x00112C84 instr=0x71295389)`
opcode `0x1C` (R5900 EE MMI) + funct `0x09` (MMI2 sub-group) + sa `0x0E` = PCPYLD (Parallel Copy Lower Doubleword).
qbert ran the BNEL correctly (squashed not-taken: PC went 0xC7C → 0xC84 = +8 bytes, confirming the squash path), then trapped on the very next instruction, an MMI/PCPYLD.
Numbers
| Chapter | Blocker | qbert retire_count |
|---|---|---|
| Post-Ch274 (BEQL) | SD at 0x00112DAC | 26,985 |
| Post-Ch275 (SD) | DSLL at 0x00112C54 | 27,006 |
| Post-Ch276 (DSLL) | BNEL at 0x00112C7C | 27,016 |
| Post-Ch277 (BNEL) | PCPYLD at 0x00112C84 | 27,017 |
1-retire delta — BNEL itself retired (the squash path), then PCPYLD trapped before retiring.
What landed
RTL — surgical edits in ee_core_stub.sv
- `localparam OP_BNEL = 6'h15` alongside `OP_BNE`/`OP_BEQL`.
- `is_bnel` decode signal.
- Added `is_bnel` to the `is_branch` group.
- Extended `branch_taken` with `(is_bnel && (rs_val != rt_val))`.
- Generalized the squash signal: renamed `is_beql_squash` to `is_branch_likely_squash`. A branch-likely squashes its delay slot on the not-taken condition, so the signal now covers BEQL (squash on `rs != rt`) and BNEL (squash on `rs == rt`):

      assign is_branch_likely_squash =
             (is_beql && (rs_val != rt_val))   // Ch274: BEQL not-taken
          || (is_bnel && (rs_val == rt_val));  // Ch277: BNEL not-taken

  `retire_advance` updated to reference the new name. Adding BLEZL/BGTZL/REGIMM-likely later is now a one-line OR-extension.
- Added `!is_bnel` to the `is_nop_class` allow-list.
About 6 LOC of real change. Pure pattern-reuse from Ch274.
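As a cross-check on the squash polarity, here is a minimal Python reference model of the two predicates. It is a sketch, not the RTL: the function names mirror the signals above, and the opcode constants for BEQ/BNE/BEQL are the standard MIPS values, which this note does not quote from `ee_core_stub.sv`.

```python
# Standard MIPS primary opcodes (OP_BNEL = 0x15 matches the localparam above;
# the other three values are assumed from the MIPS encoding, not from the RTL).
OP_BEQ, OP_BNE, OP_BEQL, OP_BNEL = 0x04, 0x05, 0x14, 0x15


def branch_taken(opcode: int, rs_val: int, rt_val: int) -> bool:
    """True when the branch condition holds."""
    if opcode in (OP_BEQ, OP_BEQL):
        return rs_val == rt_val
    if opcode in (OP_BNE, OP_BNEL):
        return rs_val != rt_val
    raise ValueError(f"not a modelled branch opcode: {opcode:#x}")


def is_branch_likely_squash(opcode: int, rs_val: int, rt_val: int) -> bool:
    """True when a branch-likely is not taken, so its delay slot is squashed."""
    is_beql = opcode == OP_BEQL
    is_bnel = opcode == OP_BNEL
    return (is_beql and rs_val != rt_val) or (is_bnel and rs_val == rt_val)
```

For any likely opcode the two predicates are complements of each other; for plain BEQ/BNE the squash predicate is always false.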
Focused TB — tb_ee_core_bnel.sv
Three cases mirroring tb_ee_core_beql:
- BNEL TAKEN (`$t0 = 5`, `$t1 = 7`, differ → taken): branch reaches target; delay slot executes (writes a sentinel into `$t5`). Cross-check: `$t6 = 0xCAFE` at target.
- BNEL NOT-TAKEN (`$t2 = $t3 = 3`, equal → squash): delay slot squashed. Inline BNE chain verifies `$t5` stays at `0xBEEF0000` (the OR-INTO probe didn't execute). `$t7 = 0x2222` at PC+8.
- BNE NOT-TAKEN cross-check (same operands): plain BNE's delay slot DOES execute → `$t5 = 0xBABE0CAB`. Proves BNEL differs.
Result: retired=21 halt=1 trap=0 pc=0xbfc00158 errors=0 PASS.
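The three cases reduce to which PCs actually execute around the branch. A small Python sketch of that rule follows; `executed_pcs` is a hypothetical helper, not something in the TB, and the branch target used in the test is made up for illustration.

```python
def executed_pcs(pc: int, target: int, taken: bool, likely: bool) -> list[int]:
    """PCs that execute, from the branch up to the first instruction after it.

    taken                -> branch, delay slot, target
    not taken, likely    -> branch, PC+8 (delay slot squashed)
    not taken, ordinary  -> branch, delay slot, PC+8
    """
    if taken:
        return [pc, pc + 4, target]
    if likely:
        return [pc, pc + 8]
    return [pc, pc + 4, pc + 8]
```

With `pc=0x00112C7C`, `taken=False`, `likely=True`, the sequence is the BNEL followed directly by `0x00112C84`, which is the +8 step seen in the qbert run.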
Makefile + regression
- `tb_ee_core_bnel` target.
- Added to both PHONY list and `run:` master.
- Regression: 164 → 165.
Recommendation for Codex's Ch278 — PCPYLD (MMI2)
`pcpyld $t2, $t1, $t1` at PC 0x00112C84, instr 0x71295389.
Decoded:
- opcode `0x1C` (MMI prefix)
- funct `0x09` (MMI2 sub-group selector)
- sa `0x0E` (PCPYLD sub-instruction)
- rs `9` ($t1), rt `9` ($t1), rd `10` ($t2)
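The field list can be re-derived from the raw word with a few shifts and masks. This is a sketch using the standard MIPS R-type field positions; `decode_fields` is a hypothetical helper name, not part of the repo tooling.

```python
def decode_fields(instr: int) -> dict[str, int]:
    """Split a 32-bit MIPS instruction word into its R-type-shaped fields."""
    return {
        "opcode": (instr >> 26) & 0x3F,  # bits 31:26
        "rs": (instr >> 21) & 0x1F,      # bits 25:21
        "rt": (instr >> 16) & 0x1F,      # bits 20:16
        "rd": (instr >> 11) & 0x1F,      # bits 15:11
        "sa": (instr >> 6) & 0x1F,       # bits 10:6
        "funct": instr & 0x3F,           # bits 5:0
    }


if __name__ == "__main__":
    for name, value in decode_fields(0x71295389).items():
        print(f"{name:6s} = {value:#04x}")
```

Running it on the trapped word is a quick way to confirm the opcode/funct/sa triple and the register indices before writing the decode arm.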
PCPYLD architectural semantics (R5900 EE, 128-bit MMI):
rd[127:64] = rs[63:0] // upper 64 of rd = lower 64 of rs
rd[63:0] = rt[63:0] // lower 64 of rd = lower 64 of rt
For our 32-bit register model:
- We can't represent `rd[127:64]` (no upper bits).
- `rd[63:0] = rt[63:0]` collapses to `$rd[31:0] = $rt[31:0]` (lower 32 bits).
Minimal Ch278 scope:
- Decode the MMI2/PCPYLD path: opcode `0x1C` + funct `0x09` + sa `0x0E` → set `is_pcpyld`.
- Add to `is_rtype_alu` group.
- In `rtype_alu_wb`: `else if (is_pcpyld) rtype_alu_wb = rt_val;` (low 32 bits of $rt → $rd).
- Add `!is_pcpyld` to `is_nop_class` allow-list.
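The scope above can be prototyped as a Python reference model before touching the RTL. This is only a sketch of the intended behaviour: `is_pcpyld` mirrors the proposed signal name, while `pcpyld_writeback` is a hypothetical stand-in for the `rtype_alu_wb` arm.

```python
OP_MMI = 0x1C       # MMI prefix opcode
FUNCT_MMI2 = 0x09   # MMI2 sub-group selector
SA_PCPYLD = 0x0E    # PCPYLD within MMI2


def is_pcpyld(instr: int) -> bool:
    """All three fields must match; opcode 0x1C alone is not enough."""
    return (
        (instr >> 26) & 0x3F == OP_MMI
        and instr & 0x3F == FUNCT_MMI2
        and (instr >> 6) & 0x1F == SA_PCPYLD
    )


def pcpyld_writeback(instr: int, rt_val: int):
    """Return (rd index, value) under the 32-bit model, or None if not PCPYLD."""
    if not is_pcpyld(instr):
        return None
    rd = (instr >> 11) & 0x1F
    return rd, rt_val & 0xFFFFFFFF  # low 32 bits of $rt -> $rd
```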
Document the approximation explicitly in the RTL: upper bits of
$rd (which would carry $rs's lower 64 in a real EE) are not
modelled. For qbert's specific call pattern at this PC, the
data being shuffled is likely 128-bit packed bytes for a
strlen-style byte-walker ($a0 = 0x80808080 is the classic
"detect high bit per byte" mask); the low 32 bits are the
relevant observable.
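For context, the classic word-at-a-time zero-byte test that uses this mask looks like the sketch below. Whether qbert's routine is exactly this test is an assumption; the code only illustrates the well-known idiom on a 32-bit word, and `has_zero_byte` is a hypothetical name.

```python
ONES = 0x01010101
HIGHS = 0x80808080  # the "high bit per byte" mask


def has_zero_byte(word: int) -> bool:
    """True if any of the four bytes of a 32-bit word is 0x00."""
    word &= 0xFFFFFFFF
    # Subtracting 1 from each byte borrows into bit 7 only for a zero byte
    # (or a byte that already had bit 7 set, which ~word filters out).
    return ((word - ONES) & ~word & HIGHS) != 0
```

A 128-bit MMI version would apply the same idea to sixteen bytes at once, which is consistent with PCPYLD being used to widen a 64-bit constant into both halves of a quadword.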
Important Codex caution: do NOT NOP-class the entire MMI
opcode (0x1C). MMI has ~80 sub-instructions (MMI0/MMI1/MMI2/
MMI3 sub-tables); some are real data movement (PCPYLD, PCPYUD,
PCPYH), some are arithmetic (PADDB, PSUBB, PMULTW), some are
SIMD compares (PCEQB, PCEQH). Each needs its own decode arm or
careful approximation. The qbert track is fine with one
sub-instruction per chapter — same incremental cadence we've
maintained throughout.
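One way to keep that discipline is to dispatch on the (sub-table, sa) pair and trap on anything not explicitly listed. The Python sketch below models that policy; the funct values for the four sub-tables are the standard R5900 encodings, while `IMPLEMENTED`, `UnsupportedOpcode`, and `decode_mmi` are hypothetical names for illustration only.

```python
OP_MMI = 0x1C
# funct field selects the sub-table under the MMI opcode.
MMI_SUBTABLES = {0x08: "MMI0", 0x28: "MMI1", 0x09: "MMI2", 0x29: "MMI3"}
# One entry per chapter: (sub-table, sa) -> mnemonic.
IMPLEMENTED = {("MMI2", 0x0E): "PCPYLD"}


class UnsupportedOpcode(Exception):
    """Raised instead of silently treating an MMI op as a NOP."""


def decode_mmi(instr: int) -> str:
    if (instr >> 26) & 0x3F != OP_MMI:
        raise ValueError("not an MMI instruction")
    funct = instr & 0x3F
    sa = (instr >> 6) & 0x1F
    table = MMI_SUBTABLES.get(funct)  # None for non-sub-table functs
    name = IMPLEMENTED.get((table, sa))
    if name is None:
        raise UnsupportedOpcode(f"MMI funct={funct:#04x} sa={sa:#04x}")
    return name
```

Each new chapter then adds one dictionary entry (one decode arm in the RTL), and every unhandled sub-instruction still surfaces as a trap.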
Likely follow-ons after PCPYLD: any other MMI op qbert's
byte-walker uses. Common candidates given the 0x80808080
sentinel: PCEQB (parallel compare equal byte, in the MMI1
sub-table) and PMFHL (parallel move from HI/LO, which has its
own funct under the MMI opcode).
Files changed
- `rtl/ee/ee_core_stub.sv`: 6 surgical edits.
- `sim/tb/integration/tb_ee_core_bnel.sv`: new focused TB.
- `sim/Makefile`: target + both regression lists.
Regression
In flight; expected 165/165.
Pattern review
Seven qbert chapters (Ch271–Ch277). The qbert-driven track keeps producing one chapter per blocker at sub-half-day cadence:
| Chapter | Blocker | retire_count |
|---|---|---|
| Ch271 | SQ | 12 (init) → 26,958 |
| Ch272 | DADDU | → 26,960 |
| Ch273 | SYSCALL HLE | → 26,980 |
| Ch274 | BEQL | → 26,985 |
| Ch275 | SD | → 27,006 |
| Ch276 | DSLL | → 27,016 |
| Ch277 | BNEL | → 27,017 |
The MMI surface (PCPYLD and likely siblings) will broaden the opcode count quickly — that's expected when a real program starts using SIMD-style operations for stdlib-class work.