retroDE_ps2/docs/decisions/0008-gs-tiled-vram-feasibility-baseline.md
thejayman77 ec82764bef Initial commit: retroDE_ps2 — first-of-its-kind PS2 GS FPGA core (DE25-Nano / Agilex 5)
RTL (GS rasterizer, EE core stub, platform bridge, LPDDR4B path), sim regression
(272 TBs), docs, and tooling. Copyrighted PS2 content (BIOS, game code, GS dumps,
and all dump-derived textures/traces) is excluded via .gitignore and stays local.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-29 20:10:50 -04:00


0008 — GS Tiled-VRAM Feasibility Baseline + Test #2 Spec

Status: Accepted (baseline); Test #2 = spec only, not implemented
Date: 2026-05-28
Chapter: Phase-3 hardware de-risk (LPDDR4B bandwidth spike) → GS architecture pivot
Supersedes: nothing. Companion to 0006-vram-roadmap.md and 0007-ee-core-reality-checkpoint.md.
Authors: lead architect, with Codex co-review and a parallel outside-perspective review.


1. Executive Summary

The single missing number that gated the whole "is a faithful-enough PS2 GS physically possible on this board?" question has been measured on real silicon, and it clears the first gate cleanly.

A standalone HPS-coexistent diagnostic core (de25_lpddr4_bw, an ao486-cloned shell + a saturating AXI4 traffic generator on the FPGA-side LPDDR4B EMIF) sustained, over a 256 MiB sequential stream at the EMIF user clock (310 MHz exactly):

| Phase | Cycles (emif_clk) | Sustained |
|-------|-------------------|-----------|
| write | 9,786,835         | 8.50 GB/s |
| read  | 9,913,927         | 8.39 GB/s |

  • ~86% of the 256-bit fabric port on write (≈27.4 of 32 bytes/cycle); ~85% on read (≈27.1 bytes/cycle).
  • ~79% of the ~10.7 GB/s LPDDR4 PHY peak. (Both ceilings, one consistent result.)
  • Read ≈ write at MAX_OUTSTANDING=16 → the bus is bandwidth-bound, not latency-bound. Nothing to sweep; the number is trustworthy as-is.
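
The sustained figures above follow directly from the raw cycle counts. A minimal sketch of that off-chip conversion, assuming decimal GB (10^9 bytes) and the clock, stream size, and port width stated in this section; the function names are illustrative, not taken from the repo tooling:

```python
# Convert raw emif_clk cycle counts from Test #1 into sustained bandwidth.
EMIF_CLK_HZ = 310_000_000          # EMIF user clock, from the text
STREAM_BYTES = 256 * 1024 * 1024   # 256 MiB sequential stream
PORT_BYTES_PER_CYCLE = 32          # 256-bit AXI4 data port


def sustained_gb_per_s(cycles: int) -> float:
    """Bytes moved divided by elapsed seconds, in decimal GB/s."""
    seconds = cycles / EMIF_CLK_HZ
    return STREAM_BYTES / seconds / 1e9


def bytes_per_cycle(cycles: int) -> float:
    """Average payload bytes transferred per emif_clk cycle."""
    return STREAM_BYTES / cycles


for phase, cycles in (("write", 9_786_835), ("read", 9_913_927)):
    bpc = bytes_per_cycle(cycles)
    print(f"{phase}: {sustained_gb_per_s(cycles):.2f} GB/s, "
          f"{bpc:.1f} B/cycle, "
          f"{100 * bpc / PORT_BYTES_PER_CYCLE:.0f}% of port")
```

Keeping the conversion outside the FPGA means a different achieved clock only changes one constant here, not the RTL.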

Verdict on the bandwidth gate: GREEN. Before measuring, the working assumption was "probably impossible." 8.4 GB/s sustained changes the tone to "feasible if the GS is architected around tiling from day one." The board is not killed by LPDDR bandwidth. Full-4 MB-VRAM-in-M20K remains off the table (0006); the tiled-VRAM path is no longer fantasy.

The gate still standing is texture + locality, not raw sequential bandwidth. Test #1 measured framebuffer-shaped sequential traffic. It cannot see random texture reads, CLUT indirection, or the tile-reload churn that primitive disorder and alpha-overdraw produce. Test #2 — a tiled-raster microbenchmark driven by real game traffic — is the measurement that finally answers "faithful-enough GS on this board: yes or no." This document specs it; it does not implement it.


2. Test #1 — the measured baseline (authoritative)

  • Memory under test: FPGA-side LPDDR4B, 1 GB, 32-bit, 2666 MT/s (DE25-Nano Rev B), via the same EMIF_Qsys hard-IP ao486 ships. User port: 256-bit AXI4 @ 310 MHz (IOPLL ×62/10 off the 50 MHz reference — exact).
  • Theoretical ceilings: ~9.92 GB/s if you count the 256-bit (32-byte) port at 310 MHz; ~10.7 GB/s from the DRAM PHY (32-bit × 2666 MT/s). These are the same physical limit viewed two ways. (Historical note: an earlier "78 GB/s" was a bits-treated-as-bytes error — do not resurrect it.)
  • Method: saturating sequential write phase then read phase over 256 MiB, 4 KiB AXI-legal bursts (128 beats × 32 B; see note), up to 16 in flight, raw emif_clk cycle counts exposed — GB/s computed off-chip so no Fmax assumption is baked in.
  • Conclusion: sequential tile-stream bandwidth is viable; no need to sweep outstanding-count; result internally consistent against both ceilings.
  • Caveat (explicit): does not model random texture reads, CLUT, Z, or alpha-blend / framebuffer RMW behavior. That is Test #2.
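
The two ceilings quoted above can be reproduced from the figures in this section. The sketch below is plain arithmetic under those stated numbers (50 MHz reference, ×62/10 IOPLL ratio, 256-bit port, 32-bit DRAM at 2666 MT/s); the variable names are illustrative:

```python
# Reproduce the clock and the two theoretical ceilings from the stated figures.
REF_CLK_HZ = 50_000_000
emif_clk_hz = REF_CLK_HZ * 62 // 10          # IOPLL x62/10 -> 310 MHz exactly

port_peak_bytes_per_s = (256 // 8) * emif_clk_hz        # fabric-port view
phy_peak_bytes_per_s = (32 // 8) * 2666 * 1_000_000     # DRAM-PHY view

measured_write = 8.50e9   # sustained write, from the Test #1 table
measured_read = 8.39e9    # sustained read, from the Test #1 table

print(f"emif_clk  = {emif_clk_hz / 1e6:.0f} MHz")
print(f"port peak = {port_peak_bytes_per_s / 1e9:.2f} GB/s")
print(f"PHY peak  = {phy_peak_bytes_per_s / 1e9:.2f} GB/s")
for name, bw in (("write", measured_write), ("read", measured_read)):
    print(f"{name}: {100 * bw / port_peak_bytes_per_s:.0f}% of port, "
          f"{100 * bw / phy_peak_bytes_per_s:.0f}% of PHY")
```

The port ceiling is the lower of the two, so it is the one the traffic generator can actually approach.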

Bring-up footnote (durable lesson): the first board run flagged a bresp error because the bursts were 8 KiB (AWLEN=255), which violates the AXI4 4 KiB-boundary rule; the EMIF NAK'd them with SLVERR. Fixed to 4 KiB bursts (AWLEN=127). Any future AXI master in this family — including the GS tiled-VRAM DMA — must cap bursts at 4 KiB. See reference-lpddr4-bw-spike.
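
The burst rule in the footnote is easy to check in software before it reaches a board. A minimal sketch, assuming INCR bursts, a beat-aligned start address, and the 32-byte beat of the 256-bit port; the helper names are hypothetical and not part of the repo:

```python
# Check AXI4 INCR bursts against the 4 KiB address-boundary rule.
AXI4_BOUNDARY = 4096        # a burst must not cross a 4 KiB boundary
AXI4_MAX_INCR_BEATS = 256   # AXI4 INCR bursts carry at most 256 beats


def burst_bytes(awlen: int, beat_bytes: int) -> int:
    """AWLEN encodes (beats - 1), so the payload is (AWLEN + 1) beats."""
    return (awlen + 1) * beat_bytes


def crosses_4k(addr: int, awlen: int, beat_bytes: int) -> bool:
    """True if the first and last byte land in different 4 KiB pages."""
    last = addr + burst_bytes(awlen, beat_bytes) - 1
    return addr // AXI4_BOUNDARY != last // AXI4_BOUNDARY


def max_legal_awlen(addr: int, beat_bytes: int) -> int:
    """Largest AWLEN that stays inside the current 4 KiB page."""
    room = AXI4_BOUNDARY - (addr % AXI4_BOUNDARY)
    beats = min(room // beat_bytes, AXI4_MAX_INCR_BEATS)
    return beats - 1


print(crosses_4k(0x0000, 255, 32))   # the original 8 KiB burst
print(crosses_4k(0x0000, 127, 32))   # the fixed 4 KiB burst, page-aligned
print(max_legal_awlen(0x0800, 32))   # a start address halfway into a page
```

Note that a 4 KiB burst is only legal when it starts on a 4 KiB boundary; an unaligned start needs a shorter first burst, which is what `max_legal_awlen` returns.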


3. What LPDDR4 actually carries per frame in a tiled design

Tiling moves framebuffer/Z read-modify-write on-chip (M20K), so the three things crossing the DDR boundary per frame are wildly unequal:

  1. Framebuffer writeout (tile flush to DDR): trivial. 640×480×4 B ≈ 1.2 MB/frame → ~74 MB/s @ 60 fps, under 1% of 8.4 GB/s. Ignore it.
  2. Texture fetch: the dominant unknown. Textures are too large to sit in M20K beside the framebuffer tile, so they stream from DDR. Locality-driven.
  3. Tile reload from primitive disorder. When primitives don't arrive in tile order, a tile gets evicted and re-fetched. Also locality-driven.

Items 2 and 3 are why a synthetic test would lie and the emulation traces are the only honest source of truth: both depend on real access patterns and working-set shape, not on raw throughput.
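
For scale, the framebuffer-writeout figure in item 1 is the only one of the three that can be computed without traces. A minimal sketch of that arithmetic, assuming one full-frame flush per frame at the resolution, pixel size, and frame rate given in item 1; the function name is illustrative:

```python
# Framebuffer writeout traffic for one full-frame tile flush per frame.
def framebuffer_writeout_bytes_per_s(width: int, height: int,
                                     bytes_per_pixel: int, fps: int) -> int:
    return width * height * bytes_per_pixel * fps


MEASURED_READ_BYTES_PER_S = 8.39e9   # lower of the two Test #1 results

fb = framebuffer_writeout_bytes_per_s(640, 480, 4, 60)
share = fb / MEASURED_READ_BYTES_PER_S

print(f"{fb / 1e6:.1f} MB/s, {100 * share:.2f}% of measured bandwidth")
```

No equivalent closed form exists for items 2 and 3, because their cost depends on access order rather than on pixel count.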


4. The PS2-specific tilt: palettized textures

PS2 textures were overwhelmingly palettized — 4-bit and 8-bit indexed through CLUT, not 32-bit RGBA. That is a quarter to an eighth the per-texel DRAM traffic of the naive 32-bit assumption. Budgeting texture bandwidth as if every texel were 32-bit would massively overestimate the wall.

Prerequisite measurement before any Test #2 RTL: a texture-format histogram — what fraction of texel fetches are 4-bit / 8-bit / 16-bit / 32-bit, plus the overdraw factor on busy scenes. That histogram sizes the entire texture-bandwidth question before a line of RTL is written.
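
Once a histogram exists, turning it into an average per-texel cost is a weighted mean. The sketch below shows the calculation only; it assumes the histogram is a mapping from bit-depth label to texel-fetch count, ignores CLUT loads and any cache reuse, and the example counts in the usage lines are placeholders chosen to show the extremes, not measurements:

```python
# Weighted mean DRAM bytes per texel from a texel-fetch histogram.
BITS_PER_TEXEL = {"4-bit": 4, "8-bit": 8, "16-bit": 16, "32-bit": 32}


def mean_bytes_per_texel(fetch_counts: dict) -> float:
    """fetch_counts maps a bit-depth label to a number of texel fetches."""
    total = sum(fetch_counts.values())
    if total == 0:
        raise ValueError("empty histogram")
    bits = sum(BITS_PER_TEXEL[label] * n for label, n in fetch_counts.items())
    return bits / total / 8


def relative_to_32bit(fetch_counts: dict) -> float:
    """Traffic as a fraction of the naive all-32-bit assumption."""
    return mean_bytes_per_texel(fetch_counts) / 4.0


# Placeholder inputs showing the two bounds, not real data.
print(relative_to_32bit({"32-bit": 1}))   # all 32-bit texels
print(relative_to_32bit({"4-bit": 1}))    # all 4-bit texels
```

A real histogram will land somewhere between those bounds, which is exactly the number this prerequisite is meant to pin down.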

REALITY CHECK (2026-05-28, post-review): an outside review assumed this histogram could be "extracted from the trace library / the 301 chapters." It cannot — no real-game GS trace corpus exists in-repo. A full-disk search confirmed every GS/texture artifact here is synthetic (hand-authored testbench sprites, bake.py test cards), and the two live-emulator capture harnesses (DobieStation, PCSX2) are parked/blocked (sim/golden/README.md, third_party/*/NOTES.md). The 301 chapters are EE-opcode/BIOS work, not GS captures. So this number must be captured fresh, not extracted. Building Test #2 against an assumed distribution would be the GS-side repeat of the Ch215 oracle-confusion trap. The realistic source is PCSX2 GS dumps (.gs/.gs.xz — a built-in PCSX2 GS-debugger feature that records real GIF + privileged-register traffic, incl. TEX0/CLUT, replayable offline); a prebuilt PCSX2 binary sidesteps the in-repo CMake-deps block. The prerequisite to the prerequisite is therefore acquiring real GS dumps (needs PCSX2 + games the owner owns), then a software .gs parser (no RTL).
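
As a first cut at the software-only step, the texture format of each texture setup can be read from the PSM field of the GS TEX0 register. The sketch below assumes the 64-bit TEX0 values have already been pulled out of a dump by a separate parser (not shown, and not specified here), and it assumes the commonly documented layout with PSM in bits 20 to 25; both the layout and the format codes should be checked against the GS User's Manual before relying on them. It counts TEX0 writes, which is only a proxy for the texel-fetch histogram §4 asks for:

```python
# Tally texture pixel-storage formats from already-extracted TEX0 values.
from collections import Counter

# Assumed PSM codes for texture formats (verify against the GS manual).
PSM_NAMES = {
    0x00: "PSMCT32", 0x01: "PSMCT24", 0x02: "PSMCT16", 0x0A: "PSMCT16S",
    0x13: "PSMT8", 0x14: "PSMT4", 0x1B: "PSMT8H",
    0x24: "PSMT4HL", 0x2C: "PSMT4HH",
}


def tex0_psm(tex0: int) -> int:
    """Extract the 6-bit PSM field, assumed to sit at bits 20..25."""
    return (tex0 >> 20) & 0x3F


def tex0_format_counts(tex0_values) -> Counter:
    """Count TEX0 writes per format name; unknown codes are kept visible."""
    counts = Counter()
    for value in tex0_values:
        psm = tex0_psm(value)
        counts[PSM_NAMES.get(psm, f"unknown(0x{psm:02X})")] += 1
    return counts


# Hypothetical register values built by hand for illustration only.
sample = [0x13 << 20, 0x13 << 20, 0x14 << 20, 0x00 << 20]
print(tex0_format_counts(sample))
```

Weighting each TEX0 write by the number of textured pixels drawn under it would move this proxy closer to a true per-fetch distribution.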


5. Test #2 — Tiled-Raster Microbenchmark Spec

Goal: measure sustained DDR bandwidth and tile-reload rate under real PS2 workloads, in a tiled rasterizer fragment (tile color/Z resident in M20K, RMW on-chip, tile + texture streamed to/from LPDDR4B).

5.1 Two trace-data prerequisites (do these FIRST — they scope the build)

  1. Texture-format histogram (§4): texel-fetch distribution by bit-depth + overdraw factor, from real game traffic.
  2. Worst-case stimulus selection (§7): identify the single most alpha-blended / overdraw-punishing in-game scene in the captured GS dumps (no trace library exists yet; see the §4 reality check). The design must clear peak, not mean.

5.2 Workload knobs (sweep matrix)

  • Tile size: 32×32 and 64×32 pixels (start).
  • Color format: PSMCT32 first; later PSMCT16 / PSMT8.
  • Z buffer: on / off.
  • Alpha blend: on / off.
  • Texture mode: solid color · small cached texture · streaming texture · CLUT texture.
  • Primitive mix: fullscreen sprites · many small sprites · triangles.
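
The knobs above multiply quickly, so it helps to enumerate the matrix before committing to run time. A minimal sketch using a Cartesian product; the string labels are shorthand invented here for the options listed in this subsection:

```python
# Enumerate the Test #2 sweep matrix as a list of configuration dicts.
from itertools import product

TILE_SIZES = [(32, 32), (64, 32)]
COLOR_FORMATS_FIRST_PASS = ["PSMCT32"]
COLOR_FORMATS_LATER = ["PSMCT32", "PSMCT16", "PSMT8"]
Z_BUFFER = [True, False]
ALPHA_BLEND = [True, False]
TEXTURE_MODES = ["solid", "small_cached", "streaming", "clut"]
PRIMITIVE_MIXES = ["fullscreen_sprites", "many_small_sprites", "triangles"]

KEYS = ("tile", "color", "z", "alpha", "texture", "primitives")


def sweep(color_formats):
    """Return every combination of the workload knobs."""
    axes = (TILE_SIZES, color_formats, Z_BUFFER, ALPHA_BLEND,
            TEXTURE_MODES, PRIMITIVE_MIXES)
    return [dict(zip(KEYS, combo)) for combo in product(*axes)]


first_pass = sweep(COLOR_FORMATS_FIRST_PASS)
full = sweep(COLOR_FORMATS_LATER)
print(len(first_pass), len(full))
```

Starting with PSMCT32 only keeps the first pass to a third of the full matrix.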

5.3 Metrics (per configuration)

  • tiles/sec, pixels/sec
  • bytes/pixel external (the locality number that matters)
  • LPDDR4 read GB/s and write GB/s (reuse the Test #1 counter approach)
  • M20K footprint (tile color + Z + any texture cache)
  • tile-reload rate (evictions/frame under the real primitive order)
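
The tile-reload metric can be prototyped in software long before RTL exists. The sketch below models the on-chip tile store as a small LRU cache and counts how often a previously evicted tile has to be fetched again. The LRU policy, the tile-ID stream format, and the function name are all assumptions for illustration; the real design may bin or evict differently:

```python
# Count tile reloads for a stream of tile touches under an LRU tile store.
from collections import OrderedDict


def count_tile_reloads(tile_touches, resident_tiles: int):
    """Return (first_loads, reloads) for the given touch order."""
    if resident_tiles < 1:
        raise ValueError("need at least one resident tile")
    cache = OrderedDict()   # tile id -> None, ordered oldest to newest
    seen = set()            # tiles that have been loaded at least once
    first_loads = reloads = 0
    for tile in tile_touches:
        if tile in cache:
            cache.move_to_end(tile)        # hit: refresh recency
            continue
        if tile in seen:
            reloads += 1                   # evicted earlier, fetched again
        else:
            seen.add(tile)
            first_loads += 1
        cache[tile] = None
        if len(cache) > resident_tiles:
            cache.popitem(last=False)      # evict least recently used
    return first_loads, reloads


tile_ordered = [(0, 0), (0, 0), (1, 0), (1, 0)]
interleaved = [(0, 0), (1, 0), (0, 0), (1, 0)]
print(count_tile_reloads(tile_ordered, 1))
print(count_tile_reloads(interleaved, 1))
```

The same four touches cost nothing extra in tile order and two reloads when interleaved, which is the locality effect this metric exists to expose.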

5.4 Stimulus

Driven by representative GS primitive + texture traffic pulled from real-game GS captures, specifically the worst-case scene from §5.1(2). Not a boot screen or menu: those are bandwidth-trivial and will hand back a gorgeous green result that collapses in-game.


6. Permanent architecture this implies

The GS that survives this board almost certainly is:

  • On-chip tile color + Z buffers (M20K), RMW resolved on-chip.
  • LPDDR4B as backing VRAM (no full 4 MB VRAM in M20K — consistent with 0006).
  • Texture cache or texture-tile streamer feeding the rasterizer from DDR.
  • Scanout either from the tiled framebuffer cache or a resolved linear buffer.

DSP budget is not the constraint (the shipped raster demo used 4/376 DSP). Bandwidth and on-chip working-set are.
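
To put a rough number on the on-chip working set, the sketch below estimates how many M20K blocks one resident tile needs. It assumes 20 Kb (20,480-bit) blocks, a 32-bit Z format, and a single buffer per tile; it counts raw bits only, so it is a lower bound that ignores block width/depth configuration, double-buffering, and any texture cache:

```python
# Lower-bound M20K block count for one resident tile (color + Z).
import math

M20K_BITS = 20 * 1024   # 20 Kb per block


def tile_bits(width: int, height: int,
              color_bits: int = 32, z_bits: int = 32) -> int:
    """Storage bits for one tile; pass z_bits=0 for the Z-off case."""
    return width * height * (color_bits + z_bits)


def m20k_blocks_lower_bound(width: int, height: int,
                            color_bits: int = 32, z_bits: int = 32) -> int:
    return math.ceil(tile_bits(width, height, color_bits, z_bits) / M20K_BITS)


for w, h in ((32, 32), (64, 32)):
    print(f"{w}x{h}: at least {m20k_blocks_lower_bound(w, h)} M20K blocks")
```

Multiplying by the number of tiles kept resident gives the figure to compare against the device's M20K budget.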


7. Methodological guardrails

  • Traces are truth. Texture/locality numbers cannot be synthesized honestly; pull them from real game traffic.
  • Test the peak, not the mean. The torture case is alpha-blended overdraw (smoke, fog, transparency, particles) — simultaneously worst for tile RMW and often texture-heavy. Find the worst frame and make that the stimulus.
  • Don't over-trust the green. Test #1 green ≠ faithful-GS feasible. Only Test #2 under real game traffic produces the integer that answers the question.
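
The "peak, not mean" guardrail can be made mechanical once per-frame traffic estimates exist. A minimal sketch, assuming a list of external bytes per frame produced by some earlier analysis step and a 60 fps target; the sample values are hypothetical and only show how far the two statistics can diverge:

```python
# Pick the worst frame and compare peak-driven and mean-driven bandwidth needs.
from statistics import mean


def worst_frame(per_frame_bytes):
    """Return (frame_index, bytes) for the most expensive frame."""
    index = max(range(len(per_frame_bytes)), key=per_frame_bytes.__getitem__)
    return index, per_frame_bytes[index]


def required_gb_per_s(frame_bytes: float, fps: int = 60) -> float:
    """Bandwidth needed to sustain this frame cost at the given rate."""
    return frame_bytes * fps / 1e9


# Hypothetical per-frame byte counts, not measurements.
frames = [10_000_000, 12_000_000, 95_000_000, 11_000_000]
index, peak = worst_frame(frames)
print(f"worst frame {index}: {required_gb_per_s(peak):.2f} GB/s needed")
print(f"mean frame:    {required_gb_per_s(mean(frames)):.2f} GB/s needed")
```

The frame index returned here is what would be handed to Test #2 as its stimulus.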

8. Status / Next

  • Bandwidth gate: GREEN (this doc, §1–2). New feasibility baseline.
  • Strategic pivot endorsed by Codex + outside review: the next serious work moves from qbert opcode-growth (Track A oracle, 0007) toward GS tiled-VRAM architecture feasibility — because that path now has a plausible physical foundation.
  • Immediate next step (no RTL): ACQUIRE real GS traffic first — the trace corpus does not exist (see §4 reality check). Capture PCSX2 GS dumps from real games (owner-supplied, prebuilt PCSX2), then write a software .gs parser to produce the texture-format histogram + locate the worst-case alpha-overdraw frame (§5.1). Only then is the Test #2 stimulus honest.
  • Then: build the Test #2 microbenchmark to this spec; its sustained number under real game traffic is the final yes/no on faithful-enough GS on this board.
  • Chapter numbering note: "Ch306" is already this repo's EE-core reality checkpoint (0007). This GS line is a later chapter (Ch307+); the label, not the substance, is what differs from Codex's framing.