# 0008 — GS Tiled-VRAM Feasibility Baseline + Test #2 Spec

Status: Accepted (baseline); Test #2 = spec only, not implemented
Date: 2026-05-28
Chapter: Phase-3 hardware de-risk (LPDDR4B bandwidth spike) → GS architecture pivot
Supersedes: nothing. Companion to 0006-vram-roadmap.md and 0007-ee-core-reality-checkpoint.md.
Authors: lead architect, with Codex co-review and a parallel outside-perspective review.

---

## 1. Executive Summary

The single missing number that gated the whole "is a faithful-enough PS2 GS
physically possible on this board?" question has been **measured on real
silicon**, and it clears the first gate cleanly.

A standalone HPS-coexistent diagnostic core (`de25_lpddr4_bw`, an ao486-cloned
shell + a saturating AXI4 traffic generator on the FPGA-side LPDDR4B EMIF)
sustained, over a 256 MiB sequential stream at the EMIF user clock (310 MHz
exactly):

| phase | cycles | sustained |
|------|------|------|
| write | 9,786,835 | **8.50 GB/s** |
| read | 9,913,927 | **8.39 GB/s** |

- **~86%** of the 256-bit fabric port (≈27.4 of 32 bytes/cycle).
- **~79%** of the ~10.7 GB/s LPDDR4 PHY peak. (Both ceilings, one consistent result.)
- **Read ≈ write** at `MAX_OUTSTANDING=16` → the bus is **bandwidth-bound, not
  latency-bound**. Nothing to sweep; the number is trustworthy as-is.

**Verdict on the bandwidth gate: GREEN.** Before measuring, the working
assumption was "probably impossible." 8.4 GB/s sustained changes the tone to
**"feasible *if* the GS is architected around tiling from day one."** The board
is not killed by LPDDR bandwidth. Full-4 MB-VRAM-in-M20K remains off the table
(0006); the tiled-VRAM path is no longer fantasy.

**The gate still standing is texture + locality**, not raw sequential
bandwidth. Test #1 measured framebuffer-shaped sequential traffic. It cannot
see random texture reads, CLUT indirection, or the tile-reload churn that
primitive disorder and alpha-overdraw produce.
**Test #2** — a tiled-raster microbenchmark driven by *real game traffic* — is
the measurement that finally answers "faithful-enough GS on this board: yes or
no." This document specs it; it does not implement it.

---

## 2. Test #1 — the measured baseline (authoritative)

- **Memory under test:** FPGA-side LPDDR4B, 1 GB, 32-bit, 2666 MT/s (DE25-Nano
  Rev B), via the same `EMIF_Qsys` hard-IP that ao486 ships. User port:
  **256-bit AXI4 @ 310 MHz** (IOPLL ×62/10 off the 50 MHz reference — exact).
- **Theoretical ceilings:** ~9.92 GB/s if you count the 256-bit (32-byte) port
  at 310 MHz; ~10.7 GB/s from the DRAM PHY (32-bit × 2666 MT/s). These are the
  same physical limit viewed two ways. (Historical note: an earlier "78 GB/s"
  was a bits-treated-as-bytes error — do not resurrect it.)
- **Method:** saturating sequential write phase then read phase over 256 MiB,
  4 KiB AXI-legal bursts (128 beats × 32 B; see note), up to 16 in flight, raw
  `emif_clk` cycle counts exposed — GB/s computed off-chip so no Fmax
  assumption is baked in.
- **Conclusion:** sequential tile-stream bandwidth is **viable**; no need to
  sweep outstanding-count; result internally consistent against both ceilings.
- **Caveat (explicit):** does **not** model random texture reads, CLUT, Z, or
  alpha-blend / framebuffer RMW behavior. That is Test #2.

> Bring-up footnote (durable lesson): the first board run flagged a BRESP
> error because the bursts were 8 KiB (AWLEN=255). A burst that long
> necessarily crosses a 4 KiB address boundary, which violates the AXI4
> 4 KiB-boundary rule; the EMIF NAK'd them with SLVERR. Fixed to 4 KiB bursts
> (AWLEN=127). **Any future AXI master in this family — including the GS
> tiled-VRAM DMA — must cap bursts at 4 KiB.** See [[reference-lpddr4-bw-spike]].

---

## 3. What LPDDR4 actually carries per frame in a tiled design

Tiling moves framebuffer/Z **read-modify-write** on-chip (M20K), so the three
things crossing the DDR boundary per frame are wildly unequal:

1. **Framebuffer writeout (tile flush to DDR): trivial.** 640×480×4 B ≈
   1.2 MB/frame → ~74 MB/s @ 60 fps. Noise against 8.4 GB/s. Ignore it.
2. **Texture fetch: the dominant unknown.** Textures are too large to sit in
   M20K beside the framebuffer tile, so they stream from DDR. Locality-driven.
3. **Tile reload from primitive disorder.** When primitives don't arrive in
   tile order, a tile gets evicted and re-fetched. Also locality-driven.

Items 2 and 3 are why a synthetic test would *lie* and the emulation traces
are the only honest source of truth: both depend on real access patterns and
working-set shape, not on raw throughput.

---

## 4. The PS2-specific tilt: palettized textures

PS2 textures were overwhelmingly **palettized — 4-bit and 8-bit indexed through
CLUT**, not 32-bit RGBA. That is a **quarter to an eighth** the per-texel DRAM
traffic of the naive 32-bit assumption (8-bit indices are a quarter, 4-bit
indices an eighth). Budgeting texture bandwidth as if every texel were 32-bit
would massively overestimate the wall.

**Prerequisite measurement before any Test #2 RTL:** a **texture-format
histogram** — what fraction of texel fetches are 4-bit / 8-bit / 16-bit /
32-bit, plus the **overdraw factor** on busy scenes. That histogram sizes the
entire texture-bandwidth question before a line of RTL is written.

> **REALITY CHECK (2026-05-28, post-review):** an outside review assumed this
> histogram could be "extracted from the trace library / the 301 chapters."
> **It cannot — no real-game GS trace corpus exists in-repo.** A full-disk
> search confirmed every GS/texture artifact here is synthetic (hand-authored
> testbench sprites, `bake.py` test cards), and the two live-emulator capture
> harnesses (DobieStation, PCSX2) are parked/blocked (`sim/golden/README.md`,
> `third_party/*/NOTES.md`). The 301 chapters are EE-opcode/BIOS work, not GS
> captures. So this number must be **captured fresh, not extracted.** Building
> Test #2 against an assumed distribution would be the GS-side repeat of the
> Ch215 oracle-confusion trap.
> The realistic source is **PCSX2 GS dumps**
> (`.gs`/`.gs.xz` — a built-in PCSX2 GS-debugger feature that records real
> GIF + privileged-register traffic, incl. TEX0/CLUT, replayable offline);
> a prebuilt PCSX2 binary sidesteps the in-repo CMake-deps block. The
> prerequisite to the prerequisite is therefore **acquiring real GS dumps**
> (needs PCSX2 + games the owner owns), then a software `.gs` parser (no RTL).

---

## 5. Test #2 — Tiled-Raster Microbenchmark Spec

Goal: measure **sustained DDR bandwidth and tile-reload rate under real PS2
workloads**, in a tiled rasterizer fragment (tile color/Z resident in M20K, RMW
on-chip, tile + texture streamed to/from LPDDR4B).

### 5.1 Two trace-data prerequisites (do these FIRST — they scope the build)

1. **Texture-format histogram** (§4): texel-fetch distribution by bit-depth +
   overdraw factor, from real game traffic.
2. **Worst-case stimulus selection** (§7): identify the single most
   alpha-blended / overdraw-punishing in-game scene in the trace library (which
   must first be captured; see the §4 reality check) — the design must clear
   *peak*, not mean.

### 5.2 Workload knobs (sweep matrix)

- **Tile size:** 32×32 and 64×32 pixels (start).
- **Color format:** PSMCT32 first; later PSMCT16 / PSMT8.
- **Z buffer:** on / off.
- **Alpha blend:** on / off.
- **Texture mode:** solid color · small cached texture · streaming texture · CLUT texture.
- **Primitive mix:** fullscreen sprites · many small sprites · triangles.

### 5.3 Metrics (per configuration)

- tiles/sec, pixels/sec
- **bytes/pixel external** (the locality number that matters)
- LPDDR4 read GB/s and write GB/s (reuse the Test #1 counter approach)
- M20K footprint (tile color + Z + any texture cache)
- tile-reload rate (evictions/frame under the real primitive order)

### 5.4 Stimulus

Driven by **representative GS primitive + texture traffic pulled from real-game
GS captures** (the corpus that §4 says must be acquired first) — specifically
the worst-case scene from §5.1(2).
**Not** a boot screen or menu: those are bandwidth-trivial and will hand back a
gorgeous green result that collapses in-game.

---

## 6. Permanent architecture this implies

The GS that survives this board almost certainly is:

- **On-chip tile color + Z buffers** (M20K), RMW resolved on-chip.
- **LPDDR4B as backing VRAM** (no full 4 MB VRAM in M20K — consistent with 0006).
- **Texture cache or texture-tile streamer** feeding the rasterizer from DDR.
- **Scanout** either from the tiled framebuffer cache or a resolved linear buffer.

DSP budget is not the constraint (the shipped raster demo used 4/376 DSP).
Bandwidth and on-chip working-set are.

---

## 7. Methodological guardrails

- **Traces are truth.** Texture/locality numbers cannot be synthesized
  honestly; pull them from real game traffic.
- **Test the peak, not the mean.** The torture case is alpha-blended overdraw
  (smoke, fog, transparency, particles) — simultaneously worst for tile RMW and
  often texture-heavy. Find the worst frame and make *that* the stimulus.
- **Don't over-trust the green.** Test #1 green ≠ faithful-GS feasible. Only
  Test #2 under real game traffic produces the number that answers the
  question.

---

## 8. Status / Next

- **Bandwidth gate: GREEN** (this doc, §1–2). New feasibility baseline.
- **Strategic pivot endorsed by Codex + outside review:** the next serious work
  moves from qbert opcode-growth (Track A oracle, 0007) toward **GS tiled-VRAM
  architecture feasibility** — because that path now has a plausible physical
  foundation.
- **Immediate next step (no RTL): ACQUIRE real GS traffic first** — the trace
  corpus does not exist (see §4 reality check). Capture PCSX2 GS dumps from
  real games (owner-supplied, prebuilt PCSX2), then write a software `.gs`
  parser to produce the texture-format histogram + locate the worst-case
  alpha-overdraw frame (§5.1). Only then is the Test #2 stimulus honest.
- **Then:** build the Test #2 microbenchmark to this spec; its sustained number
  under real game traffic is the final yes/no on faithful-enough GS on this
  board.
- **Chapter numbering note:** "Ch306" is already this repo's EE-core reality
  checkpoint (0007). This GS line is a later chapter (Ch307+); the label, not
  the substance, is what differs from Codex's framing.
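
The histogram step named in the immediate next step (§4, §5.1) can be sketched in software before any `.gs` parser exists. This is a rough sketch under stated assumptions, not the project's tool. It assumes a hypothetical upstream parser that yields `(TEX0 value, texel-fetch count)` pairs. It assumes the PSM field sits in TEX0 bits 20..25 and uses the PSM codes as commonly documented for the GS; both should be checked against the register reference before use. The function names and the overdraw definition (fragments drawn per framebuffer pixel) are illustrative and do not come from this repo.

```python
from collections import Counter

# ASSUMPTION: GS pixel-storage-mode (PSM) codes mapped to histogram buckets in
# bits per texel. PSMCT24 is bucketed as 32 because it occupies 32-bit words.
PSM_BITS = {
    0x00: 32,  # PSMCT32
    0x01: 32,  # PSMCT24 (24 used bits in a 32-bit word)
    0x02: 16,  # PSMCT16
    0x0A: 16,  # PSMCT16S
    0x13: 8,   # PSMT8
    0x14: 4,   # PSMT4
}


def tex0_psm(tex0: int) -> int:
    """Extract the 6-bit PSM field, assumed to sit at bits 20..25 of TEX0."""
    return (tex0 >> 20) & 0x3F


def format_histogram(samples):
    """Return {bucket: fraction of texel fetches}.

    `samples` is an iterable of (tex0_value, texel_fetch_count) pairs, the
    hypothetical output of a not-yet-written .gs parser. Unknown PSM codes
    land in the "other" bucket instead of being silently dropped.
    """
    fetches = Counter()
    for tex0, count in samples:
        fetches[PSM_BITS.get(tex0_psm(tex0), "other")] += count
    total = sum(fetches.values())
    if total == 0:
        return {}
    return {bucket: n / total for bucket, n in fetches.items()}


def mean_bits_per_texel(hist):
    """Fetch-weighted mean bits per texel over the known buckets."""
    known = {bits: frac for bits, frac in hist.items() if bits != "other"}
    weight = sum(known.values())
    if weight == 0:
        return 0.0
    return sum(bits * frac for bits, frac in known.items()) / weight


def overdraw_factor(fragments_drawn: int, width: int = 640, height: int = 480):
    """ASSUMED definition: fragments drawn per framebuffer pixel in one frame."""
    return fragments_drawn / (width * height)


if __name__ == "__main__":
    # Made-up numbers purely to exercise the code; not measured data.
    samples = [(0x14 << 20, 600), (0x13 << 20, 300), (0x00 << 20, 100)]
    hist = format_histogram(samples)
    mean_bits = mean_bits_per_texel(hist)
    print("histogram:", hist)
    print("mean bits/texel:", mean_bits, "vs naive 32-bit:", mean_bits / 32)
```

Weighting by fetch count rather than by texture count is deliberate: §4 asks what fraction of texel *fetches* hit each bit depth, so a small texture sampled heavily should outweigh a large one touched once.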