retroDE_ps2/docs/decisions/0008-gs-tiled-vram-feasibility-baseline.md
thejayman77 ec82764bef Initial commit: retroDE_ps2 — first-of-its-kind PS2 GS FPGA core (DE25-Nano / Agilex 5)
RTL (GS rasterizer, EE core stub, platform bridge, LPDDR4B path), sim regression
(272 TBs), docs, and tooling. Copyrighted PS2 content (BIOS, game code, GS dumps,
and all dump-derived textures/traces) is excluded via .gitignore and stays local.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-29 20:10:50 -04:00

# 0008 — GS Tiled-VRAM Feasibility Baseline + Test #2 Spec
Status: Accepted (baseline); Test #2 = spec only, not implemented
Date: 2026-05-28
Chapter: Phase-3 hardware de-risk (LPDDR4B bandwidth spike) → GS architecture pivot
Supersedes: nothing. Companion to 0006-vram-roadmap.md and 0007-ee-core-reality-checkpoint.md.
Authors: lead architect, with Codex co-review and a parallel outside-perspective review.
---
## 1. Executive Summary
The single missing number that gated the whole "is a faithful-enough PS2 GS
physically possible on this board?" question has been **measured on real
silicon**, and it clears the first gate cleanly.
A standalone HPS-coexistent diagnostic core (`de25_lpddr4_bw`, an ao486-cloned
shell + a saturating AXI4 traffic generator on the FPGA-side LPDDR4B EMIF)
sustained, over a 256 MiB sequential stream at the EMIF user clock (310 MHz
exactly):
| phase | cycles | sustained |
|------|------|------|
| write | 9,786,835 | **8.50 GB/s** |
| read | 9,913,927 | **8.39 GB/s** |
- **~85–86%** of the 256-bit fabric port (≈27.1 read / ≈27.4 write of 32 bytes/cycle).
- **~79%** of the ~10.7 GB/s LPDDR4 PHY peak. (Both ceilings, one consistent result.)
- **Read ≈ write** at `MAX_OUTSTANDING=16` → the bus is **bandwidth-bound, not
latency-bound**. Nothing to sweep; the number is trustworthy as-is.
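
The GB/s figures in the table follow directly from the raw cycle counts. A minimal sketch of that off-chip conversion, assuming decimal gigabytes (10^9 bytes); the constant and function names are illustrative and not taken from the project tooling:

```python
# Convert raw emif_clk cycle counts into sustained bandwidth.
STREAM_BYTES = 256 * 1024 * 1024   # 256 MiB sequential stream
EMIF_CLK_HZ = 310_000_000          # EMIF user clock
PORT_BYTES_PER_CYCLE = 32          # 256-bit AXI4 data port


def sustained(cycles: int) -> dict:
    """Return bytes/cycle, GB/s and port utilisation for one phase."""
    bytes_per_cycle = STREAM_BYTES / cycles
    return {
        "bytes_per_cycle": bytes_per_cycle,
        "gb_per_s": bytes_per_cycle * EMIF_CLK_HZ / 1e9,
        "port_utilisation": bytes_per_cycle / PORT_BYTES_PER_CYCLE,
    }


write = sustained(9_786_835)
read = sustained(9_913_927)

for name, r in (("write", write), ("read", read)):
    print(f"{name}: {r['gb_per_s']:.2f} GB/s, "
          f"{r['bytes_per_cycle']:.1f} B/cycle, "
          f"{r['port_utilisation']:.1%} of port")
```

Keeping the clock frequency as an input, rather than baking it into the hardware counter, is what lets the same cycle counts be re-evaluated if the EMIF user clock ever changes.
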
**Verdict on the bandwidth gate: GREEN.** Before measuring, the working
assumption was "probably impossible." 8.4 GB/s sustained changes the tone to
**"feasible *if* the GS is architected around tiling from day one."** The board
is not killed by LPDDR bandwidth. Full-4 MB-VRAM-in-M20K remains off the table
(0006); the tiled-VRAM path is no longer fantasy.
**The gate still standing is texture + locality**, not raw sequential
bandwidth. Test #1 measured framebuffer-shaped sequential traffic. It cannot
see random texture reads, CLUT indirection, or the tile-reload churn that
primitive disorder and alpha-overdraw produce. **Test #2** — a tiled-raster
microbenchmark driven by *real game traffic* — is the measurement that finally
answers "faithful-enough GS on this board: yes or no." This document specs it;
it does not implement it.
---
## 2. Test #1 — the measured baseline (authoritative)
- **Memory under test:** FPGA-side LPDDR4B, 1 GB, 32-bit, 2666 MT/s (DE25-Nano
Rev B), via the same `EMIF_Qsys` hard-IP ao486 ships. User port: **256-bit
AXI4 @ 310 MHz** (IOPLL ×62/10 off the 50 MHz reference — exact).
- **Theoretical ceilings:** ~9.92 GB/s if you count the 256-bit (32-byte) port
at 310 MHz; ~10.7 GB/s from the DRAM PHY (32-bit × 2666 MT/s). These are the
same physical limit viewed two ways. (Historical note: an earlier "78 GB/s"
was a bits-treated-as-bytes error — do not resurrect it.)
- **Method:** saturating sequential write phase then read phase over 256 MiB,
4 KiB AXI-legal bursts (128 beats × 32 B; see the bring-up footnote below), up to 16 in flight, raw
`emif_clk` cycle counts exposed; GB/s is computed off-chip so no Fmax assumption
is baked in.
- **Conclusion:** sequential tile-stream bandwidth is **viable**; no need to
sweep outstanding-count; result internally consistent against both ceilings.
- **Caveat (explicit):** does **not** model random texture reads, CLUT, Z, or
alpha-blend / framebuffer RMW behavior. That is Test #2.
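
The two ceilings quoted above can be reproduced from the interface parameters alone. A short sketch, assuming decimal GB and using the measured figures from §1; variable names are illustrative:

```python
# Two views of the same physical limit.
REF_CLK_MHZ = 50
EMIF_CLK_MHZ = REF_CLK_MHZ * 62 / 10           # IOPLL x62/10 -> 310 MHz

PORT_BYTES = 256 // 8                          # 256-bit user port, bytes per beat
DRAM_BUS_BYTES = 32 // 8                       # 32-bit LPDDR4 interface
DRAM_MT_PER_S = 2666                           # mega-transfers per second

port_ceiling = PORT_BYTES * EMIF_CLK_MHZ * 1e6 / 1e9      # GB/s at the fabric port
phy_ceiling = DRAM_BUS_BYTES * DRAM_MT_PER_S * 1e6 / 1e9  # GB/s at the DRAM pins

for label, measured in (("write", 8.50), ("read", 8.39)):
    print(f"{label}: {measured / port_ceiling:.1%} of port ceiling, "
          f"{measured / phy_ceiling:.1%} of PHY peak")
```
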
> Bring-up footnote (durable lesson): the first board run flagged a `bresp`
> error because the bursts were 8 KiB (AWLEN=255). An 8 KiB burst necessarily
> crosses a 4 KiB address boundary, which AXI4 forbids, so the EMIF rejected them with SLVERR. Fixed to 4 KiB bursts
> (AWLEN=127). **Any future AXI master in this family, including the GS
> tiled-VRAM DMA, must cap bursts at 4 KiB and keep each burst inside one 4 KiB-aligned window.** See [[reference-lpddr4-bw-spike]].
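
The burst rule in the footnote can be checked in software before it bites in hardware. A minimal sketch, assuming beat-aligned INCR bursts on a 32-byte-wide port; `burst_is_legal` and `split_into_bursts` are hypothetical helper names, not part of the existing RTL or tooling:

```python
AXI4_MAX_BEATS = 256    # AxLEN is 8 bits wide for INCR bursts
AXI_BOUNDARY = 4096     # a burst must not cross a 4 KiB address boundary


def burst_is_legal(addr: int, axlen: int, beat_bytes: int = 32) -> bool:
    """True if a burst of (axlen + 1) beats starting at addr is AXI4-legal."""
    beats = axlen + 1
    if not 1 <= beats <= AXI4_MAX_BEATS:
        return False
    last_byte = addr + beats * beat_bytes - 1
    return addr // AXI_BOUNDARY == last_byte // AXI_BOUNDARY


def split_into_bursts(addr: int, length: int, beat_bytes: int = 32):
    """Yield (addr, axlen) pairs covering [addr, addr + length) legally."""
    assert addr % beat_bytes == 0 and length % beat_bytes == 0
    end = addr + length
    while addr < end:
        boundary = (addr // AXI_BOUNDARY + 1) * AXI_BOUNDARY
        limit = addr + AXI4_MAX_BEATS * beat_bytes
        chunk = min(end, boundary, limit) - addr
        yield addr, chunk // beat_bytes - 1
        addr += chunk


print(burst_is_legal(0x0000, 255))            # the original 8 KiB burst
print(burst_is_legal(0x0000, 127))            # the fixed 4 KiB burst
print(list(split_into_bursts(0x0000, 8192)))  # 8 KiB transfer as two bursts
```

Note that a 4 KiB burst is only legal when it starts on a 4 KiB-aligned address; the splitter handles unaligned starts by issuing a shorter first burst up to the boundary.
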
---
## 3. What LPDDR4 actually carries per frame in a tiled design
Tiling moves framebuffer/Z **read-modify-write** on-chip (M20K), so the three
things crossing the DDR boundary per frame are wildly unequal:
1. **Framebuffer writeout (tile flush to DDR): trivial.** 640×480×4 B ≈ 1.23
MB/frame → ~74 MB/s @ 60 fps. Noise against 8.4 GB/s. Ignore it.
2. **Texture fetch: the dominant unknown.** Textures are too large to sit in
M20K beside the framebuffer tile, so they stream from DDR. Locality-driven.
3. **Tile reload from primitive disorder.** When primitives don't arrive in
tile order, a tile gets evicted and re-fetched. Also locality-driven.
Items 2 and 3 are why a synthetic test would *lie* and the emulation traces are
the only honest source of truth: both depend on real access patterns and
working-set shape, not on raw throughput.
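
The framebuffer-writeout arithmetic in item 1, and the headroom it leaves for items 2 and 3, can be sketched as follows. The per-pixel budget is a derived illustration rather than a measured figure; it assumes the full 8.4 GB/s is available and ignores refresh, scanout, and access-pattern inefficiency:

```python
MEASURED_BYTES_PER_S = 8.4e9    # sustained sequential figure from Test #1


def writeout_bytes_per_s(width: int, height: int, bytes_per_pixel: int, fps: int) -> int:
    """DDR traffic needed to flush one full frame of tiles per displayed frame."""
    return width * height * bytes_per_pixel * fps


fb = writeout_bytes_per_s(640, 480, 4, 60)
fb_share = fb / MEASURED_BYTES_PER_S

# Optimistic upper bound on external bytes per displayed pixel per frame.
budget_per_pixel = MEASURED_BYTES_PER_S / (640 * 480 * 60)

print(f"writeout: {fb / 1e6:.1f} MB/s ({fb_share:.2%} of measured bandwidth)")
print(f"budget:   {budget_per_pixel:.0f} external bytes per displayed pixel per frame")
```

Texture fetch and tile reload have to fit inside that per-pixel budget, which is why the bytes/pixel external metric in §5.3 is the one to watch.
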
---
## 4. The PS2-specific tilt: palettized textures
PS2 textures were overwhelmingly **palettized: 4-bit and 8-bit indices resolved through a
CLUT**, not 32-bit RGBA. That is a **quarter (8-bit) to an eighth (4-bit)** of the per-texel DRAM
traffic of the naive 32-bit assumption. Budgeting texture bandwidth as if every
texel were 32-bit would massively overestimate the wall.
**Prerequisite measurement before any Test #2 RTL:** a **texture-format
histogram** — what fraction of texel fetches are 4-bit / 8-bit / 16-bit /
32-bit, plus the **overdraw factor** on busy scenes. That histogram sizes the
entire texture-bandwidth question before a line of RTL is written.
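
Once the histogram exists, it reduces to a single weighted average. A sketch of that reduction follows; the histogram values are **placeholders invented for illustration**, not measurements, and the calculation ignores CLUT load traffic:

```python
# Bits fetched from DRAM per texel, by texture storage format.
BITS_PER_TEXEL = {"PSMT4": 4, "PSMT8": 8, "PSMCT16": 16, "PSMCT32": 32}


def mean_bytes_per_texel(histogram: dict) -> float:
    """Fetch-count-weighted average texel size in bytes."""
    total = sum(histogram.values())
    bits = sum(BITS_PER_TEXEL[fmt] * count for fmt, count in histogram.items())
    return bits / (8 * total)


# PLACEHOLDER distribution: replace with counts from real GS dumps.
placeholder = {"PSMT4": 40, "PSMT8": 40, "PSMCT16": 10, "PSMCT32": 10}

avg = mean_bytes_per_texel(placeholder)
print(f"{avg:.2f} B/texel, {avg / 4:.0%} of the naive 32-bit budget")
```
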
> **REALITY CHECK (2026-05-28, post-review):** an outside review assumed this
> histogram could be "extracted from the trace library / the 301 chapters."
> **It cannot — no real-game GS trace corpus exists in-repo.** A full-disk
> search confirmed every GS/texture artifact here is synthetic (hand-authored
> testbench sprites, `bake.py` test cards), and the two live-emulator capture
> harnesses (DobieStation, PCSX2) are parked/blocked (`sim/golden/README.md`,
> `third_party/*/NOTES.md`). The 301 chapters are EE-opcode/BIOS work, not GS
> captures. So this number must be **captured fresh, not extracted.** Building
> Test #2 against an assumed distribution would be the GS-side repeat of the
> Ch215 oracle-confusion trap. The realistic source is **PCSX2 GS dumps**
> (`.gs`/`.gs.xz` — a built-in PCSX2 GS-debugger feature that records real
> GIF + privileged-register traffic, incl. TEX0/CLUT, replayable offline);
> a prebuilt PCSX2 binary sidesteps the in-repo CMake-deps block. The
> prerequisite to the prerequisite is therefore **acquiring real GS dumps**
> (needs PCSX2 + games the owner owns), then a software `.gs` parser (no RTL).
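
As a rough illustration of what the software-side parser has to produce, the sketch below tallies texture formats from a list of 64-bit TEX0 register values. It deliberately does **not** parse the `.gs` container; it assumes some earlier, hypothetical stage has already extracted the TEX0 writes. The PSM field position (bits 20 to 25) and the format codes reflect the commonly documented GS register layout and should be verified against the GS documentation before use. It also counts TEX0 *writes*, which is only a proxy for texel *fetches*; weighting by pixels drawn per primitive is left out here.

```python
from collections import Counter

# GS pixel storage mode codes (texture-relevant subset); verify before relying on them.
PSM_NAMES = {
    0x00: "PSMCT32", 0x01: "PSMCT24", 0x02: "PSMCT16", 0x0A: "PSMCT16S",
    0x13: "PSMT8", 0x14: "PSMT4", 0x1B: "PSMT8H", 0x24: "PSMT4HL", 0x2C: "PSMT4HH",
}


def tex0_psm(tex0: int) -> int:
    """Extract the 6-bit PSM field, assumed to sit at bits 20..25 of TEX0."""
    return (tex0 >> 20) & 0x3F


def format_histogram(tex0_writes) -> Counter:
    """Count texture formats across a sequence of TEX0 register values."""
    return Counter(
        PSM_NAMES.get(tex0_psm(v), f"unknown_0x{tex0_psm(v):02X}")
        for v in tex0_writes
    )


# Synthetic values for illustration only: two PSMT8, one PSMT4, one PSMCT32.
sample = [0x13 << 20, 0x13 << 20, 0x14 << 20, 0x00 << 20]
print(format_histogram(sample))
```
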
---
## 5. Test #2 — Tiled-Raster Microbenchmark Spec
Goal: measure **sustained DDR bandwidth and tile-reload rate under real PS2
workloads**, in a tiled rasterizer fragment (tile color/Z resident in M20K, RMW
on-chip, tile + texture streamed to/from LPDDR4B).
### 5.1 Two trace-data prerequisites (do these FIRST — they scope the build)
1. **Texture-format histogram** (§4): texel-fetch distribution by bit-depth +
overdraw factor, from real game traffic.
2. **Worst-case stimulus selection** (§7): identify the single most
alpha-blended / overdraw-punishing in-game scene in the captured GS dumps (which must be acquired first; see the §4 reality check). The
design must clear *peak*, not mean.
### 5.2 Workload knobs (sweep matrix)
- **Tile size:** 32×32 and 64×32 pixels (start).
- **Color format:** PSMCT32 first; later PSMCT16 / PSMT8.
- **Z buffer:** on / off.
- **Alpha blend:** on / off.
- **Texture mode:** solid color · small cached texture · streaming texture ·
CLUT texture.
- **Primitive mix:** fullscreen sprites · many small sprites · triangles.
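
Taken as a full cross product, the knobs above define the sweep matrix. A small sketch that enumerates it; the dictionary keys and value labels are illustrative names, not identifiers from the codebase:

```python
from itertools import product

KNOBS = {
    "tile_size": [(32, 32), (64, 32)],
    "color_format": ["PSMCT32", "PSMCT16", "PSMT8"],
    "z_buffer": [True, False],
    "alpha_blend": [True, False],
    "texture_mode": ["solid", "small_cached", "streaming", "clut"],
    "primitive_mix": ["fullscreen_sprites", "small_sprites", "triangles"],
}


def sweep(knobs: dict):
    """Yield one configuration dict per point in the cross product."""
    names = list(knobs)
    for values in product(*knobs.values()):
        yield dict(zip(names, values))


configs = list(sweep(KNOBS))
first_pass = [c for c in configs if c["color_format"] == "PSMCT32"]

print(len(configs), "configurations in the full matrix")
print(len(first_pass), "configurations in the PSMCT32-first pass")
```

Running the PSMCT32 slice first cuts the initial sweep to a third of the full matrix, which matches the "PSMCT32 first" ordering above.
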
### 5.3 Metrics (per configuration)
- tiles/sec, pixels/sec
- **bytes/pixel external** (the locality number that matters)
- LPDDR4 read GB/s and write GB/s (reuse the Test #1 counter approach)
- M20K footprint (tile color + Z + any texture cache)
- tile-reload rate (evictions/frame under the real primitive order)
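
Most of these metrics fall out of a handful of raw counters, in the same spirit as Test #1. A sketch of the off-chip reduction, assuming a hypothetical counter set (`RunCounters`) that the microbenchmark would expose; none of these field names exist yet. M20K footprint is a build-time figure and is not derived here.

```python
from dataclasses import dataclass


@dataclass
class RunCounters:
    """Hypothetical raw counters read back after one benchmark run."""
    cycles: int
    read_bytes: int
    write_bytes: int
    pixels: int
    tiles: int
    tile_reloads: int
    frames: int


def metrics(c: RunCounters, clk_hz: int = 310_000_000) -> dict:
    seconds = c.cycles / clk_hz
    return {
        "tiles_per_s": c.tiles / seconds,
        "pixels_per_s": c.pixels / seconds,
        "bytes_per_pixel_external": (c.read_bytes + c.write_bytes) / c.pixels,
        "read_gb_per_s": c.read_bytes / seconds / 1e9,
        "write_gb_per_s": c.write_bytes / seconds / 1e9,
        "tile_reloads_per_frame": c.tile_reloads / c.frames,
    }


# Made-up counter values, purely to exercise the arithmetic.
example = RunCounters(cycles=310_000_000, read_bytes=2_000_000_000,
                      write_bytes=500_000_000, pixels=100_000_000,
                      tiles=97_656, tile_reloads=600, frames=60)
m = metrics(example)
for key, value in m.items():
    print(f"{key}: {value:,.2f}")
```
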
### 5.4 Stimulus
Driven by **representative GS primitive + texture traffic pulled from captured
real-game GS dumps** (§4), specifically the worst-case scene from §5.1(2). **Not** a
boot screen or menu: those are bandwidth-trivial and will hand back a gorgeous
green result that collapses in-game.
---
## 6. Permanent architecture this implies
The GS that survives this board almost certainly is:
- **On-chip tile color + Z buffers** (M20K), RMW resolved on-chip.
- **LPDDR4B as backing VRAM** (no full 4 MB VRAM in M20K — consistent with 0006).
- **Texture cache or texture-tile streamer** feeding the rasterizer from DDR.
- **Scanout** either from the tiled framebuffer cache or a resolved linear buffer.
DSP budget is not the constraint (the shipped raster demo used 4/376 DSP).
Bandwidth and on-chip working-set are.
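
To put a first number on the on-chip working set, the sketch below estimates M20K usage for one resident tile. It is a raw-capacity lower bound only: it assumes 20,480 bits per M20K, 32-bit color and 32-bit Z, and ignores width/depth packing constraints, double buffering, and any texture cache, all of which push the real count up.

```python
from math import ceil

M20K_BITS = 20 * 1024   # raw capacity of one M20K block, in bits


def m20k_lower_bound(tile_w: int, tile_h: int,
                     color_bits: int = 32, z_bits: int = 32) -> int:
    """Minimum M20K blocks for one tile's color + Z, by raw bit count."""
    pixels = tile_w * tile_h
    color_blocks = ceil(pixels * color_bits / M20K_BITS)
    z_blocks = ceil(pixels * z_bits / M20K_BITS)
    return color_blocks + z_blocks


for size in ((32, 32), (64, 32)):
    print(size, "->", m20k_lower_bound(*size), "M20K blocks (lower bound)")

# For contrast: the full 4 MiB VRAM held entirely on-chip.
full_vram_blocks = ceil(4 * 1024 * 1024 * 8 / M20K_BITS)
print("full VRAM ->", full_vram_blocks, "M20K blocks (lower bound)")
```
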
---
## 7. Methodological guardrails
- **Traces are truth.** Texture/locality numbers cannot be synthesized honestly;
pull them from real game traffic.
- **Test the peak, not the mean.** The torture case is alpha-blended overdraw
(smoke, fog, transparency, particles) — simultaneously worst for tile RMW and
often texture-heavy. Find the worst frame and make *that* the stimulus.
- **Don't over-trust the green.** Test #1 green ≠ faithful-GS feasible. Only
Test #2 under real game traffic produces the integer that answers the question.
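
The "peak, not mean" rule translates into a simple selection over per-frame statistics. A sketch under the assumption that the dump parser emits one record per frame; `FrameStats` and its fields are hypothetical, and the sample numbers are invented to show how the mean can hide the torture frame:

```python
from dataclasses import dataclass


@dataclass
class FrameStats:
    """Hypothetical per-frame summary produced by the GS dump parser."""
    index: int
    pixels_drawn: int           # every fragment written, overdraw included
    alpha_blended_pixels: int   # subset drawn with blending enabled


def overdraw_factor(f: FrameStats, width: int = 640, height: int = 480) -> float:
    return f.pixels_drawn / (width * height)


def worst_frame(frames) -> FrameStats:
    """Pick the frame with the most alpha-blended pixels, then most overdraw."""
    return max(frames, key=lambda f: (f.alpha_blended_pixels, f.pixels_drawn))


frames = [
    FrameStats(0, 307_200, 0),          # menu-like: no overdraw, no blending
    FrameStats(1, 921_600, 400_000),    # particle-heavy
    FrameStats(2, 614_400, 100_000),
]

peak = worst_frame(frames)
mean_overdraw = sum(overdraw_factor(f) for f in frames) / len(frames)
print(f"stimulus frame: {peak.index}, overdraw {overdraw_factor(peak):.1f}x "
      f"(mean across frames: {mean_overdraw:.1f}x)")
```
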
---
## 8. Status / Next
- **Bandwidth gate: GREEN** (this doc, §1–2). New feasibility baseline.
- **Strategic pivot endorsed by Codex + outside review:** the next serious work
moves from qbert opcode-growth (Track A oracle, 0007) toward **GS tiled-VRAM
architecture feasibility** — because that path now has a plausible physical
foundation.
- **Immediate next step (no RTL): ACQUIRE real GS traffic first** — the trace
corpus does not exist (see §4 reality check). Capture PCSX2 GS dumps from
real games (owner-supplied, prebuilt PCSX2), then write a software `.gs`
parser to produce the texture-format histogram + locate the worst-case
alpha-overdraw frame (§5.1). Only then is the Test #2 stimulus honest.
- **Then:** build the Test #2 microbenchmark to this spec; its sustained number
under real game traffic is the final yes/no on faithful-enough GS on this board.
- **Chapter numbering note:** "Ch306" is already this repo's EE-core reality
checkpoint (0007). This GS line is a later chapter (Ch307+); the label, not
the substance, is what differs from Codex's framing.