0008 — GS Tiled-VRAM Feasibility Baseline + Test #2 Spec
Status: Accepted (baseline); Test #2 = spec only, not implemented
Date: 2026-05-28
Chapter: Phase-3 hardware de-risk (LPDDR4B bandwidth spike) → GS architecture pivot
Supersedes: nothing. Companion to 0006-vram-roadmap.md and 0007-ee-core-reality-checkpoint.md.
Authors: lead architect, with Codex co-review and a parallel outside-perspective review.
1. Executive Summary
The single missing number that gated the whole "is a faithful-enough PS2 GS physically possible on this board?" question has been measured on real silicon, and it clears the first gate cleanly.
A standalone HPS-coexistent diagnostic core (de25_lpddr4_bw, an ao486-cloned
shell + a saturating AXI4 traffic generator on the FPGA-side LPDDR4B EMIF)
sustained, over a 256 MiB sequential stream at the EMIF user clock (310 MHz
exactly):
| phase | cycles | sustained |
|---|---|---|
| write | 9,786,835 | 8.50 GB/s |
| read | 9,913,927 | 8.39 GB/s |
- ~86% of the 256-bit fabric port (≈27.4 of 32 bytes/cycle).
- ~79% of the ~10.7 GB/s LPDDR4 PHY peak. (Both ceilings, one consistent result.)
- Read ≈ write at MAX_OUTSTANDING=16 → the bus is bandwidth-bound, not latency-bound. Nothing to sweep; the number is trustworthy as-is.
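The table and the two percentages can be re-derived from the raw cycle counts. The following is a minimal sketch of that off-chip arithmetic; the function and constant names are illustrative, not taken from the repo's tooling.

```python
# Re-derive sustained bandwidth from raw emif_clk cycle counts.
# All names here are illustrative; the numbers come from the table above.
STREAM_BYTES = 256 * 1024 * 1024   # 256 MiB sequential stream
EMIF_CLK_HZ = 310_000_000          # EMIF user clock
PORT_BYTES_PER_CYCLE = 32          # 256-bit AXI4 user port
PHY_PEAK_BPS = 4 * 2666e6          # 32-bit DRAM interface at 2666 MT/s


def sustained(cycles: int) -> dict:
    """Convert a cycle count for the full stream into bandwidth figures."""
    seconds = cycles / EMIF_CLK_HZ
    bytes_per_s = STREAM_BYTES / seconds
    bytes_per_cycle = STREAM_BYTES / cycles
    return {
        "gb_per_s": bytes_per_s / 1e9,
        "bytes_per_cycle": bytes_per_cycle,
        "port_fraction": bytes_per_cycle / PORT_BYTES_PER_CYCLE,
        "phy_fraction": bytes_per_s / PHY_PEAK_BPS,
    }


write = sustained(9_786_835)
read = sustained(9_913_927)
print(f"write {write['gb_per_s']:.2f} GB/s, read {read['gb_per_s']:.2f} GB/s")
```

Because only cycle counts leave the chip, a different user clock changes a single constant here rather than anything in the RTL.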
Verdict on the bandwidth gate: GREEN. Before measuring, the working assumption was "probably impossible." 8.4 GB/s sustained changes the tone to "feasible if the GS is architected around tiling from day one." The board is not killed by LPDDR bandwidth. Full-4 MB-VRAM-in-M20K remains off the table (0006); the tiled-VRAM path is no longer fantasy.
The gate still standing is texture + locality, not raw sequential bandwidth. Test #1 measured framebuffer-shaped sequential traffic. It cannot see random texture reads, CLUT indirection, or the tile-reload churn that primitive disorder and alpha-overdraw produce. Test #2 — a tiled-raster microbenchmark driven by real game traffic — is the measurement that finally answers "faithful-enough GS on this board: yes or no." This document specs it; it does not implement it.
2. Test #1 — the measured baseline (authoritative)
- Memory under test: FPGA-side LPDDR4B, 1 GB, 32-bit, 2666 MT/s (DE25-Nano Rev B), via the same EMIF_Qsys hard-IP ao486 ships. User port: 256-bit AXI4 @ 310 MHz (IOPLL ×62/10 off the 50 MHz reference, exact).
- Theoretical ceilings: ~9.92 GB/s if you count the 256-bit (32-byte) port at 310 MHz; ~10.7 GB/s from the DRAM PHY (32-bit × 2666 MT/s). These are two views of the same link, and the 256-bit port is the tighter bound. (Historical note: an earlier "78 GB/s" was a bits-treated-as-bytes error; do not resurrect it.)
- Method: saturating sequential write phase then read phase over 256 MiB, 4 KiB AXI-legal bursts (128 beats × 32 B; see note), up to 16 in flight, raw emif_clk cycle counts exposed — GB/s computed off-chip so no Fmax assumption is baked in.
- Conclusion: sequential tile-stream bandwidth is viable; no need to sweep outstanding-count; result internally consistent against both ceilings.
- Caveat (explicit): does not model random texture reads, CLUT, Z, or alpha-blend / framebuffer RMW behavior. That is Test #2.
Bring-up footnote (durable lesson): the first board run flagged a bresp error because the bursts were 8 KiB (AWLEN=255, i.e. 256 beats × 32 B). AXI4 forbids a burst from crossing a 4 KiB address boundary, and an 8 KiB burst always does; the EMIF NAK'd them with SLVERR. Fixed to 4 KiB bursts (AWLEN=127). Any future AXI master in this family, including the GS tiled-VRAM DMA, must cap bursts at 4 KiB and keep each burst inside one 4 KiB-aligned window. See reference-lpddr4-bw-spike.
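The boundary rule is easy to check in software before it bites on hardware. Below is a minimal sketch of a burst splitter and a legality check for a 256-bit port; `split_bursts` and `is_legal` are hypothetical helper names, and the sketch assumes beat-aligned INCR transfers only.

```python
AXI_BOUNDARY = 4096   # AXI4: a burst must not cross a 4 KiB address boundary
BEAT_BYTES = 32       # 256-bit data bus
MAX_BEATS = 256       # AXI4 INCR burst limit (AWLEN is 8 bits)


def is_legal(addr: int, awlen: int) -> bool:
    """True if a burst of (awlen + 1) beats at addr stays in one 4 KiB window."""
    nbytes = (awlen + 1) * BEAT_BYTES
    same_window = addr // AXI_BOUNDARY == (addr + nbytes - 1) // AXI_BOUNDARY
    return awlen < MAX_BEATS and same_window


def split_bursts(addr: int, length: int):
    """Yield (start_addr, awlen) pairs covering [addr, addr + length)."""
    if addr % BEAT_BYTES or length % BEAT_BYTES:
        raise ValueError("sketch assumes beat-aligned transfers")
    end = addr + length
    while addr < end:
        to_boundary = AXI_BOUNDARY - (addr % AXI_BOUNDARY)
        chunk = min(end - addr, to_boundary, MAX_BEATS * BEAT_BYTES)
        yield addr, chunk // BEAT_BYTES - 1   # AWLEN = beats - 1
        addr += chunk


print(is_legal(0, 255))               # the original 8 KiB burst
print(list(split_bursts(0, 8192)))    # the same range as two 4 KiB bursts
```

Note that an unaligned start address forces a short first burst even when the total length is under 4 KiB, which is why the cap alone is not sufficient.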
3. What LPDDR4 actually carries per frame in a tiled design
Tiling moves framebuffer/Z read-modify-write on-chip (M20K), so the three things crossing the DDR boundary per frame are wildly unequal:
1. Framebuffer writeout (tile flush to DDR): trivial. 640×480×4 B ≈ 1.23 MB/frame → ~74 MB/s @ 60 fps. Noise against 8.4 GB/s. Ignore it.
2. Texture fetch: the dominant unknown. Textures are too large to sit in M20K beside the framebuffer tile, so they stream from DDR. Locality-driven.
3. Tile reload from primitive disorder. When primitives don't arrive in tile order, a tile gets evicted and re-fetched. Also locality-driven.
Items 2 and 3 are why a synthetic test would lie and the emulation traces are the only honest source of truth: both depend on real access patterns and working-set shape, not on raw throughput.
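To make the tile-reload mechanism concrete, here is a toy model that counts DDR tile fetches for a stream of tile ids. It is only an illustration of why ordering matters, not a stand-in for measurement: the LRU policy, the number of resident tiles, and the shuffled stream are all assumptions, and a shuffled stream is exactly the kind of synthetic input that cannot replace real traffic.

```python
from collections import OrderedDict
import random


def tile_fetches(tile_sequence, resident_tiles: int) -> int:
    """Count tile fetches from DDR, assuming an LRU set of on-chip tile buffers."""
    cache = OrderedDict()
    fetches = 0
    for tile in tile_sequence:
        if tile in cache:
            cache.move_to_end(tile)      # hit: tile is already on-chip
            continue
        fetches += 1                     # miss: tile must come from DDR
        cache[tile] = True
        if len(cache) > resident_tiles:
            cache.popitem(last=False)    # evict the least recently used tile
    return fetches


# 640x480 at 32x32 pixels per tile is 20 x 15 = 300 tiles.
# Assume (arbitrarily) 8 primitives touch each tile.
ordered = [t for t in range(300) for _ in range(8)]
disordered = ordered[:]
random.Random(0).shuffle(disordered)

in_order_fetches = tile_fetches(ordered, resident_tiles=4)
disordered_fetches = tile_fetches(disordered, resident_tiles=4)
print(in_order_fetches, disordered_fetches)
```

With tile-ordered input every tile is fetched once; with the same primitives in arbitrary order the fetch count climbs toward one fetch per primitive. Where a real game lands between those extremes is what only captured traffic can say.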
4. The PS2-specific tilt: palettized textures
PS2 textures were overwhelmingly palettized — 4-bit and 8-bit indexed through CLUT, not 32-bit RGBA. That is a quarter to an eighth the per-texel DRAM traffic of the naive 32-bit assumption. Budgeting texture bandwidth as if every texel were 32-bit would massively overestimate the wall.
Prerequisite measurement before any Test #2 RTL: a texture-format histogram — what fraction of texel fetches are 4-bit / 8-bit / 16-bit / 32-bit, plus the overdraw factor on busy scenes. That histogram sizes the entire texture-bandwidth question before a line of RTL is written.
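Once the histogram exists, turning it into a traffic estimate is a weighted average. The sketch below shows that arithmetic only. The `PLACEHOLDER` distribution is invented to exercise the code and is not a measurement, and the bandwidth function deliberately assumes one texel fetch per shaded pixel with no cache reuse, no filtering, and no CLUT loads.

```python
# Texel sizes for the four bit-depths named above.
BITS_PER_TEXEL = {"PSMT4": 4, "PSMT8": 8, "PSMCT16": 16, "PSMCT32": 32}


def mean_bytes_per_texel(histogram: dict) -> float:
    """Weighted mean texel size for a {format: fetch_count} histogram."""
    total = sum(histogram.values())
    bits = sum(BITS_PER_TEXEL[fmt] * n for fmt, n in histogram.items())
    return bits / (8 * total)


def naive_texture_bytes_per_s(histogram, width=640, height=480, fps=60,
                              overdraw=1.0) -> float:
    """Crude estimate: one texel fetch per shaded pixel, nothing cached."""
    pixels_per_s = width * height * fps * overdraw
    return pixels_per_s * mean_bytes_per_texel(histogram)


# PLACEHOLDER ONLY: an invented distribution, not data from any game.
PLACEHOLDER = {"PSMT4": 40, "PSMT8": 40, "PSMCT16": 10, "PSMCT32": 10}

print(mean_bytes_per_texel(PLACEHOLDER))          # compare with 4.0 for all-32-bit
print(mean_bytes_per_texel({"PSMCT32": 1}))
```

The point of the exercise is the ratio between the two printed values: whatever the real distribution turns out to be, it scales the naive 32-bit budget by that factor.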
REALITY CHECK (2026-05-28, post-review): an outside review assumed this histogram could be "extracted from the trace library / the 301 chapters." It cannot: no real-game GS trace corpus exists in-repo. A full-disk search confirmed every GS/texture artifact here is synthetic (hand-authored testbench sprites, bake.py test cards), and the two live-emulator capture harnesses (DobieStation, PCSX2) are parked/blocked (sim/golden/README.md, third_party/*/NOTES.md). The 301 chapters are EE-opcode/BIOS work, not GS captures. So this number must be captured fresh, not extracted. Building Test #2 against an assumed distribution would be the GS-side repeat of the Ch215 oracle-confusion trap. The realistic source is PCSX2 GS dumps (.gs / .gs.xz, a built-in PCSX2 GS-debugger feature that records real GIF + privileged-register traffic, incl. TEX0/CLUT, replayable offline); a prebuilt PCSX2 binary sidesteps the in-repo CMake-deps block. The prerequisite to the prerequisite is therefore acquiring real GS dumps (needs PCSX2 + games the owner owns), then a software .gs parser (no RTL).
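The core of that software parser is small once TEX0 values are in hand. The sketch below covers only that last step: it takes already-extracted 64-bit TEX0 register values and tallies their pixel storage mode. Reading the dump container itself is not shown, since its layout is not described here. The PSM field position (bits 20 to 25) and the encodings are taken from publicly available GS register documentation as best recalled and should be verified against the GS manual or the PCSX2 source before anything depends on them.

```python
from collections import Counter

# ASSUMPTION: PSM encodings as commonly documented for the GS; verify before use.
PSM_NAMES = {
    0x00: "PSMCT32",
    0x01: "PSMCT24",
    0x02: "PSMCT16",
    0x0A: "PSMCT16S",
    0x13: "PSMT8",
    0x14: "PSMT4",
    0x1B: "PSMT8H",
    0x24: "PSMT4HL",
    0x2C: "PSMT4HH",
}


def tex0_psm(tex0: int) -> int:
    """Extract the 6-bit PSM field (assumed to sit at bits 20..25 of TEX0)."""
    return (tex0 >> 20) & 0x3F


def format_histogram(tex0_values) -> Counter:
    """Tally texture formats over an iterable of 64-bit TEX0 values."""
    counts = Counter()
    for value in tex0_values:
        counts[PSM_NAMES.get(tex0_psm(value), "other")] += 1
    return counts


# Hand-built values for illustration only; not taken from any dump.
sample = [0x13 << 20, 0x14 << 20, (0x13 << 20) | 0x5]
print(format_histogram(sample))
```

This counts TEX0 writes, not texel fetches. Getting to the per-fetch distribution the prerequisite asks for means weighting each texture state by the pixels drawn while it is bound, which needs primitive coverage from the same dump.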
5. Test #2 — Tiled-Raster Microbenchmark Spec
Goal: measure sustained DDR bandwidth and tile-reload rate under real PS2 workloads, in a tiled rasterizer fragment (tile color/Z resident in M20K, RMW on-chip, tile + texture streamed to/from LPDDR4B).
5.1 Two trace-data prerequisites (do these FIRST — they scope the build)
1. Texture-format histogram (§4): texel-fetch distribution by bit-depth + overdraw factor, from real game traffic.
2. Worst-case stimulus selection (§7): identify the single most alpha-blended / overdraw-punishing in-game scene in the captured GS dumps (§4; no trace library exists in-repo yet). The design must clear peak, not mean.
5.2 Workload knobs (sweep matrix)
- Tile size: 32×32 and 64×32 pixels (start).
- Color format: PSMCT32 first; later PSMCT16 / PSMT8.
- Z buffer: on / off.
- Alpha blend: on / off.
- Texture mode: solid color · small cached texture · streaming texture · CLUT texture.
- Primitive mix: fullscreen sprites · many small sprites · triangles.
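Enumerating the knobs makes the size of the matrix explicit. This is a sketch only: the knob keys and value labels are hypothetical spellings of the list above, and it expands the full cross product even though some combinations may not be worth running.

```python
from itertools import product

# Hypothetical encoding of the workload knobs listed above.
KNOBS = {
    "tile_size": [(32, 32), (64, 32)],
    "color_format": ["PSMCT32", "PSMCT16", "PSMT8"],
    "z_buffer": [True, False],
    "alpha_blend": [True, False],
    "texture_mode": ["solid", "small_cached", "streaming", "clut"],
    "primitive_mix": ["fullscreen_sprites", "small_sprites", "triangles"],
}


def sweep(knobs=KNOBS):
    """Yield one dict per configuration in the full cross product."""
    names = list(knobs)
    for values in product(*knobs.values()):
        yield dict(zip(names, values))


configs = list(sweep())
first_pass = [c for c in configs if c["color_format"] == "PSMCT32"]
print(len(configs), len(first_pass))
```

The full product is 288 configurations, 96 of them in the PSMCT32-first pass, so pruning obviously redundant points (for example, texture modes that cannot differ for a solid-color fill) is worth doing before any board time is spent.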
5.3 Metrics (per configuration)
- tiles/sec, pixels/sec
- bytes/pixel external (the locality number that matters)
- LPDDR4 read GB/s and write GB/s (reuse the Test #1 counter approach)
- M20K footprint (tile color + Z + any texture cache)
- tile-reload rate (evictions/frame under the real primitive order)
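The derived metrics follow from a handful of raw counters. Below is a sketch of that reduction, assuming the benchmark exposes counters like the ones in `RunCounters`; every field name is hypothetical. The M20K figure is a raw-capacity lower bound (20 Kb per block) with 32-bit color and 32-bit Z assumed as defaults; the real block count depends on how the memories map onto the supported width and depth configurations.

```python
from dataclasses import dataclass
import math

M20K_BITS = 20 * 1024   # raw capacity of one M20K block (20 Kb)


@dataclass
class RunCounters:
    """Raw counters for one configuration (all field names are hypothetical)."""
    emif_cycles: int
    ddr_read_bytes: int
    ddr_write_bytes: int
    pixels: int
    tiles: int
    tile_evictions: int
    frames: int


def metrics(c: RunCounters, clk_hz: float = 310e6) -> dict:
    """Reduce raw counters to the per-configuration metrics listed above."""
    seconds = c.emif_cycles / clk_hz
    return {
        "tiles_per_s": c.tiles / seconds,
        "pixels_per_s": c.pixels / seconds,
        "bytes_per_pixel_external":
            (c.ddr_read_bytes + c.ddr_write_bytes) / c.pixels,
        "read_gb_per_s": c.ddr_read_bytes / seconds / 1e9,
        "write_gb_per_s": c.ddr_write_bytes / seconds / 1e9,
        "evictions_per_frame": c.tile_evictions / c.frames,
    }


def m20k_blocks_lower_bound(tile_w: int, tile_h: int,
                            color_bits: int = 32, z_bits: int = 32) -> int:
    """Minimum M20K blocks for one tile's color + Z by raw bit capacity."""
    return math.ceil(tile_w * tile_h * (color_bits + z_bits) / M20K_BITS)


print(m20k_blocks_lower_bound(32, 32), m20k_blocks_lower_bound(64, 32))
```

Bytes per pixel external is the figure to compare across configurations: it is independent of clock rate and exposes locality directly, whereas GB/s mixes locality with how hard the rasterizer happens to be pushing.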
5.4 Stimulus
Driven by representative GS primitive + texture traffic pulled from captured real-game GS dumps (§4), specifically the worst-case scene from §5.1(2). Not a boot screen or menu: those are bandwidth-trivial and will hand back a gorgeous green result that collapses in-game.
6. Permanent architecture this implies
The GS that survives this board almost certainly is:
- On-chip tile color + Z buffers (M20K), RMW resolved on-chip.
- LPDDR4B as backing VRAM (no full 4 MB VRAM in M20K — consistent with 0006).
- Texture cache or texture-tile streamer feeding the rasterizer from DDR.
- Scanout either from the tiled framebuffer cache or a resolved linear buffer.
DSP budget is not the constraint (the shipped raster demo used 4/376 DSP). Bandwidth and on-chip working-set are.
7. Methodological guardrails
- Traces are truth. Texture/locality numbers cannot be synthesized honestly; pull them from real game traffic.
- Test the peak, not the mean. The torture case is alpha-blended overdraw (smoke, fog, transparency, particles) — simultaneously worst for tile RMW and often texture-heavy. Find the worst frame and make that the stimulus.
- Don't over-trust the green. Test #1 green ≠ faithful-GS feasible. Only Test #2 under real game traffic produces the number that answers the question.
8. Status / Next
- Bandwidth gate: GREEN (this doc, §1–2). New feasibility baseline.
- Strategic pivot endorsed by Codex + outside review: the next serious work moves from qbert opcode-growth (Track A oracle, 0007) toward GS tiled-VRAM architecture feasibility — because that path now has a plausible physical foundation.
- Immediate next step (no RTL): ACQUIRE real GS traffic first; the trace corpus does not exist (see §4 reality check). Capture PCSX2 GS dumps from real games (owner-supplied, prebuilt PCSX2), then write a software .gs parser to produce the texture-format histogram + locate the worst-case alpha-overdraw frame (§5.1). Only then is the Test #2 stimulus honest.
- Then: build the Test #2 microbenchmark to this spec; its sustained number under real game traffic is the final yes/no on faithful-enough GS on this board.
- Chapter numbering note: "Ch306" is already this repo's EE-core reality checkpoint (0007). This GS line is a later chapter (Ch307+); the label, not the substance, is what differs from Codex's framing.