retroDE_ps2/docs/ch277_closeout.md
thejayman77 ec82764bef Initial commit: retroDE_ps2 — first-of-its-kind PS2 GS FPGA core (DE25-Nano / Agilex 5)
RTL (GS rasterizer, EE core stub, platform bridge, LPDDR4B path), sim regression
(272 TBs), docs, and tooling. Copyrighted PS2 content (BIOS, game code, GS dumps,
and all dump-derived textures/traces) is excluded via .gitignore and stays local.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-29 20:10:50 -04:00


Ch277 closeout — BNEL squash-on-not-taken; qbert hits MMI (PCPYLD) one instruction later

Status: Closed. Verdict from re-running qbert.elf: elf_first_unsupported_opcode (pc=0x00112C84 instr=0x71295389) — opcode 0x1C (R5900 EE MMI) + funct 0x09 (MMI2 sub-group)

  • sa 0x0E = PCPYLD (Parallel Copy Lower Doubleword). qbert ran the BNEL correctly (squashed not-taken — PC went 0xC7C → 0xC84 = +8 bytes, confirming the squash path), then trapped on the very next instruction, an MMI/PCPYLD.

Numbers

| Chapter | Blocker | qbert retire_count |
|---|---|---|
| Post-Ch274 (BEQL) | SD at 0x00112DAC | 26,985 |
| Post-Ch275 (SD) | DSLL at 0x00112C54 | 27,006 |
| Post-Ch276 (DSLL) | BNEL at 0x00112C7C | 27,016 |
| Post-Ch277 (BNEL) | PCPYLD at 0x00112C84 | 27,017 |

1-retire delta — BNEL itself retired (the squash path), then PCPYLD trapped before retiring.

What landed

RTL — surgical edits in ee_core_stub.sv

  1. localparam OP_BNEL = 6'h15 alongside OP_BNE/OP_BEQL.

  2. is_bnel decode signal.

  3. Added is_bnel to the is_branch group.

  4. Extended branch_taken with (is_bnel && (rs_val != rt_val)).

  5. Generalized the squash signal: renamed is_beql_squash to is_branch_likely_squash. A branch-likely squashes its delay slot on the not-taken condition, so the signal now covers BEQL (squash on rs != rt) and BNEL (squash on rs == rt):

    assign is_branch_likely_squash =
            (is_beql && (rs_val != rt_val))   // Ch274 — BEQL not-taken
         || (is_bnel && (rs_val == rt_val));  // Ch277 — BNEL not-taken
    

    retire_advance updated to reference the new name. Adding BLEZL/BGTZL/REGIMM-likely later is now a one-line OR-extension.

  6. Added !is_bnel to the is_nop_class allow-list.

About 6 LOC of real change. Pure pattern-reuse from Ch274.
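
As a cross-check on the two predicates, here is a minimal Python sketch of the same logic. It is a software model under stated assumptions, not the RTL: the function name `branch_likely_signals` is hypothetical, `OP_BNEL = 0x15` comes from the edit list above, and `OP_BEQL = 0x14` is the standard MIPS encoding rather than something quoted from ee_core_stub.sv.

```python
# Software model of the branch-likely predicates (illustrative only).
OP_BEQL = 0x14  # standard MIPS encoding, assumed to match the RTL localparam
OP_BNEL = 0x15  # matches "localparam OP_BNEL = 6'h15" above


def branch_likely_signals(opcode, rs_val, rt_val):
    """Return (branch_taken, is_branch_likely_squash) for BEQL/BNEL."""
    is_beql = opcode == OP_BEQL
    is_bnel = opcode == OP_BNEL
    branch_taken = (is_beql and rs_val == rt_val) or (is_bnel and rs_val != rt_val)
    # The squash term is the logical complement of taken, per opcode.
    squash = (is_beql and rs_val != rt_val) or (is_bnel and rs_val == rt_val)
    return branch_taken, squash
```

For either opcode exactly one of the two outputs is true, which is the property that makes adding further branch-likely opcodes a single extra OR term.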

Focused TB — tb_ee_core_bnel.sv

Three cases mirroring tb_ee_core_beql:

  1. BNEL TAKEN ($t0 = 5, $t1 = 7, differ → taken): branch reaches target; delay slot executes (writes a sentinel into $t5). Cross-check: $t6 = 0xCAFE at target.
  2. BNEL NOT-TAKEN ($t2 = $t3 = 3, equal → squash): delay slot squashed. Inline BNE chain verifies $t5 stays at 0xBEEF0000 (the OR-INTO probe didn't execute). $t7 = 0x2222 at PC+8.
  3. BNE NOT-TAKEN cross-check (same operands): plain BNE's delay slot DOES execute → $t5 = 0xBABE0CAB. Proves BNEL differs.

Result: retired=21 halt=1 trap=0 pc=0xbfc00158 errors=0 PASS.
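
The control-flow difference the three cases exercise can be sketched in a few lines of Python. This is an illustrative model only; `step_branch` and its parameters are hypothetical names, and the branch offset used in the usage note is made up for the example.

```python
# Illustrative control-flow model of plain vs. likely branches.
def step_branch(pc, offset_words, taken, likely):
    """Return (delay_slot_executes, pc_after_branch_and_slot)."""
    target = (pc + 4 + (offset_words << 2)) & 0xFFFFFFFF
    fallthrough = (pc + 8) & 0xFFFFFFFF
    if taken:
        return True, target        # slot executes, then control reaches the target
    if likely:
        return False, fallthrough  # branch-likely not taken: slot is squashed
    return True, fallthrough       # plain branch not taken: slot still executes
```

With the qbert BNEL at 0x00112C7C not taken, `step_branch(0x00112C7C, 4, False, True)` gives a squashed slot and a next PC of 0x00112C84, which is the +8 step reported in the status line.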

Makefile + regression

  • tb_ee_core_bnel target.
  • Added to both the PHONY list and the master run: list.
  • Regression: 164 → 165.

Recommendation for Codex's Ch278 — PCPYLD (MMI2)

pcpyld $t2, $t1, $t1 at PC 0x00112C84, instr 0x71295389.

Decoded:

  • opcode 0x1C (MMI prefix)
  • funct 0x09 (MMI2 sub-group selector)
  • sa 0x0E (PCPYLD sub-instruction)
  • rs 9 ($t1), rt 9 ($t1), rd 10 ($t2)
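
The field split can be reproduced with a small Python helper, which is handy for checking a hand decode against the raw word. The helper name `decode_fields` is hypothetical; the bit positions are the standard MIPS R-type layout.

```python
# Standard MIPS R-type field extraction (helper name is illustrative).
def decode_fields(instr):
    """Split a 32-bit instruction word into its R-type fields."""
    return {
        "opcode": (instr >> 26) & 0x3F,
        "rs": (instr >> 21) & 0x1F,
        "rt": (instr >> 16) & 0x1F,
        "rd": (instr >> 11) & 0x1F,
        "sa": (instr >> 6) & 0x1F,
        "funct": instr & 0x3F,
    }


fields = decode_fields(0x71295389)
# fields: opcode=0x1C, rs=9, rt=9, rd=10, sa=0x0E, funct=0x09
```

In the standard register naming, 9 is $t1 and 10 is $t2.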

PCPYLD architectural semantics (R5900 EE, 128-bit MMI):

rd[127:64] = rs[63:0]    // upper 64 of rd = lower 64 of rs
rd[63:0]   = rt[63:0]    // lower 64 of rd = lower 64 of rt

For our 32-bit register model:

  • We can't represent rd[127:64] (no upper bits).
  • rd[63:0] = rt[63:0] collapses to $rd[31:0] = $rt[31:0] (lower 32 bits).
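
To make the approximation concrete, the sketch below puts the full 128-bit semantics next to the collapsed 32-bit model. Both function names are hypothetical, and registers are modelled as plain Python integers.

```python
# 128-bit PCPYLD semantics next to the 32-bit approximation (names are illustrative).
MASK64 = (1 << 64) - 1
MASK32 = (1 << 32) - 1


def pcpyld_128(rs, rt):
    """Full semantics: rd[127:64] = rs[63:0], rd[63:0] = rt[63:0]."""
    return ((rs & MASK64) << 64) | (rt & MASK64)


def pcpyld_32bit_model(rs, rt):
    """Approximation for a 32-bit register file: only rt[31:0] reaches rd."""
    return rt & MASK32
```

The two agree on the low 32 bits for any inputs; everything above bit 31, including the whole rs contribution, is lost in the approximation.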

Minimal Ch278 scope:

  1. Decode the MMI2/PCPYLD path: opcode 0x1C + funct 0x09 + sa 0x0E → set is_pcpyld.
  2. Add to is_rtype_alu group.
  3. In rtype_alu_wb: else if (is_pcpyld) rtype_alu_wb = rt_val; (low 32 bits of $rt → $rd).
  4. Add !is_pcpyld to is_nop_class allow-list.

Document the approximation explicitly in the RTL: upper bits of $rd (which would carry $rs's lower 64 in a real EE) are not modelled. For qbert's specific call pattern at this PC, the data being shuffled is likely 128-bit packed bytes for a strlen-style byte-walker ($a0 = 0x80808080 is the classic "detect high bit per byte" mask); the low 32 bits are the relevant observable.
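
For context on the 0x80808080 mask, the widely used zero-byte test that strlen-style loops rely on looks like the sketch below. This is the generic idiom, shown on a 32-bit word; whether qbert's routine uses exactly this form has not been verified, and `has_zero_byte32` is a hypothetical name.

```python
# Generic "does this word contain a zero byte" idiom (not taken from qbert's code).
def has_zero_byte32(v):
    """True if any of the four bytes of the 32-bit word v is 0x00."""
    v &= 0xFFFFFFFF
    # Subtracting 1 from each byte borrows into bit 7 only for a zero byte
    # (or a byte that already had bit 7 set, which the ~v term filters out).
    return (((v - 0x01010101) & ~v & 0x80808080) & 0xFFFFFFFF) != 0
```

A word of four non-zero ASCII characters such as 0x41424344 yields False, while 0x41004344 yields True.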

Important Codex caution: do NOT NOP-class the entire MMI opcode (0x1C). MMI has ~80 sub-instructions (MMI0/MMI1/MMI2/MMI3 sub-tables); some are real data movement (PCPYLD, PCPYUD, PCPYH), some are arithmetic (PADDB, PSUBB, PMULTW), some are SIMD compares (PCEQB, PCEQH). Each needs its own decode arm or careful approximation. The qbert track is fine with one sub-instruction per chapter, the same incremental cadence we've maintained throughout.
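
One way to picture the per-sub-instruction decode is a classifier that accepts only the explicitly supported (sub-table, sa) pairs and reports everything else under opcode 0x1C as unsupported. The sketch below is a Python model under stated assumptions, not the RTL: `classify_mmi` and `SUPPORTED` are hypothetical names, and the sub-table funct values are the standard R5900 encodings.

```python
# Allow-list classifier for MMI instructions (illustrative model).
OP_MMI = 0x1C
# funct values selecting the four sub-tables (standard R5900 encodings).
MMI_SUBTABLES = {0x08: "MMI0", 0x28: "MMI1", 0x09: "MMI2", 0x29: "MMI3"}
# Only explicitly implemented sub-instructions are listed here.
SUPPORTED = {("MMI2", 0x0E): "PCPYLD"}


def classify_mmi(instr):
    """Return (status, detail); unknown MMI encodings are never treated as NOPs."""
    if ((instr >> 26) & 0x3F) != OP_MMI:
        return ("not_mmi", None)
    funct = instr & 0x3F
    sa = (instr >> 6) & 0x1F
    table = MMI_SUBTABLES.get(funct)
    if table is None:
        return ("unsupported", f"MMI funct=0x{funct:02X}")
    name = SUPPORTED.get((table, sa))
    if name is None:
        return ("unsupported", f"{table} sa=0x{sa:02X}")
    return ("supported", name)
```

Each later chapter would then add one entry to the allow-list, so an unimplemented MMI op still surfaces as a first-unsupported-opcode trap instead of silently retiring.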

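Likely follow-ons after PCPYLD: any other MMI op qbert's byte-walker uses, not necessarily in the MMI2 sub-table. Common candidates given the 0x80808080 sentinel: PCEQB (parallel compare equal byte, in the MMI1 sub-table) and PMFHL (parallel move from HI/LO, encoded directly under the MMI opcode).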

Files changed

  • rtl/ee/ee_core_stub.sv — 6 surgical edits.
  • sim/tb/integration/tb_ee_core_bnel.sv — new focused TB.
  • sim/Makefile — target + both regression lists.

Regression

In flight; expected 165/165.

Pattern review

Seven qbert chapters (Ch271-Ch277). The qbert-driven track keeps producing one chapter per blocker at sub-half-day cadence:

| Chapter | Blocker | retire_count |
|---|---|---|
| Ch271 | SQ (init) | 12 → 26,958 |
| Ch272 | DADDU | → 26,960 |
| Ch273 | SYSCALL HLE | → 26,980 |
| Ch274 | BEQL | → 26,985 |
| Ch275 | SD | → 27,006 |
| Ch276 | DSLL | → 27,016 |
| Ch277 | BNEL | → 27,017 |

The MMI surface (PCPYLD and likely siblings) will broaden the opcode count quickly — that's expected when a real program starts using SIMD-style operations for stdlib-class work.