# 0008: GS Tiled-VRAM Feasibility Baseline + Test #2 Spec

Status: Accepted (baseline); Test #2 = spec only, not implemented
Date: 2026-05-28
Chapter: Phase-3 hardware de-risk (LPDDR4B bandwidth spike) → GS architecture pivot
Supersedes: nothing. Companion to 0006-vram-roadmap.md and 0007-ee-core-reality-checkpoint.md.
Authors: lead architect, with Codex co-review and a parallel outside-perspective review.

---

## 1. Executive Summary

The single missing number that gated the whole "is a faithful-enough PS2 GS
physically possible on this board?" question has been **measured on real
silicon**, and it clears the first gate cleanly.

A standalone HPS-coexistent diagnostic core (`de25_lpddr4_bw`, an ao486-cloned
shell + a saturating AXI4 traffic generator on the FPGA-side LPDDR4B EMIF)
sustained, over a 256 MiB sequential stream at the EMIF user clock (310 MHz
exactly):

| phase | cycles | sustained |
|------|------|------|
| write | 9,786,835 | **8.50 GB/s** |
| read | 9,913,927 | **8.39 GB/s** |

- **~86%** of the 256-bit fabric port on writes (≈27.4 of 32 bytes/cycle);
  reads land at ~85%.
- **~79%** of the ~10.7 GB/s LPDDR4 PHY peak. (Both ceilings, one consistent result.)
- **Read ≈ write** at `MAX_OUTSTANDING=16` → the bus is **bandwidth-bound, not
  latency-bound**. Nothing to sweep; the number is trustworthy as-is.

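The table and the two percentages follow directly from the raw cycle counts. A minimal sketch of that off-chip arithmetic, using only the figures quoted in this document (the function and constant names are illustrative, not taken from the repo's tooling):

```python
# Recompute the sustained figures from the raw emif_clk cycle counts.
STREAM_BYTES = 256 * 1024 * 1024      # 256 MiB sequential stream
EMIF_CLK_HZ = 310_000_000             # EMIF user clock
PORT_BYTES_PER_CYCLE = 32             # 256-bit AXI4 data port
PHY_PEAK_BYTES_PER_S = 4 * 2666e6     # 32-bit bus x 2666 MT/s


def sustained(cycles: int) -> dict:
    """Turn one phase's cycle count into throughput and ceiling fractions."""
    bytes_per_cycle = STREAM_BYTES / cycles
    bytes_per_s = bytes_per_cycle * EMIF_CLK_HZ
    return {
        "bytes_per_cycle": bytes_per_cycle,
        "gb_per_s": bytes_per_s / 1e9,
        "port_fraction": bytes_per_cycle / PORT_BYTES_PER_CYCLE,
        "phy_fraction": bytes_per_s / PHY_PEAK_BYTES_PER_S,
    }


write = sustained(9_786_835)
read = sustained(9_913_927)

for name, r in (("write", write), ("read", read)):
    print(
        f"{name}: {r['gb_per_s']:.2f} GB/s, "
        f"{r['bytes_per_cycle']:.1f} B/cycle, "
        f"{r['port_fraction']:.0%} of port, "
        f"{r['phy_fraction']:.0%} of PHY peak"
    )
```

Because the inputs are cycle counts and not a precomputed rate, the same function stays valid if the EMIF user clock changes: only `EMIF_CLK_HZ` moves.
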
**Verdict on the bandwidth gate: GREEN.** Before measuring, the working
assumption was "probably impossible." 8.4 GB/s sustained changes the tone to
**"feasible *if* the GS is architected around tiling from day one."** The board
is not killed by LPDDR bandwidth. Holding the full 4 MB of VRAM in M20K remains
off the table (0006); the tiled-VRAM path is no longer fantasy.

**The gate still standing is texture + locality**, not raw sequential
bandwidth. Test #1 measured framebuffer-shaped sequential traffic. It cannot
see random texture reads, CLUT indirection, or the tile-reload churn that
primitive disorder and alpha-overdraw produce. **Test #2**, a tiled-raster
microbenchmark driven by *real game traffic*, is the measurement that finally
answers "faithful-enough GS on this board: yes or no." This document specs it;
it does not implement it.

---

## 2. Test #1: the measured baseline (authoritative)

- **Memory under test:** FPGA-side LPDDR4B, 1 GB, 32-bit, 2666 MT/s (DE25-Nano
  Rev B), via the same `EMIF_Qsys` hard-IP ao486 ships. User port: **256-bit
  AXI4 @ 310 MHz** (IOPLL ×62/10 off the 50 MHz reference, exact).
- **Theoretical ceilings:** ~9.92 GB/s if you count the 256-bit (32-byte) port
  at 310 MHz; ~10.7 GB/s from the DRAM PHY (32-bit × 2666 MT/s). These are two
  views of the same link, and the fabric port is the tighter of the two.
  (Historical note: an earlier "78 GB/s" was a bits-treated-as-bytes error; do
  not resurrect it.)
- **Method:** saturating sequential write phase then read phase over 256 MiB,
  4 KiB AXI-legal bursts (128 beats × 32 B; see note), up to 16 in flight, raw
  `emif_clk` cycle counts exposed. GB/s is computed off-chip so no Fmax
  assumption is baked in.
- **Conclusion:** sequential tile-stream bandwidth is **viable**; no need to
  sweep outstanding-count; result internally consistent against both ceilings.
- **Caveat (explicit):** does **not** model random texture reads, CLUT, Z, or
  alpha-blend / framebuffer RMW behavior. That is Test #2.

> Bring-up footnote (durable lesson): the first board run flagged a `bresp`
> error because the bursts were 8 KiB (AWLEN=255), which violates the AXI4
> 4 KiB-boundary rule; the EMIF rejected them with SLVERR. Fixed to 4 KiB bursts
> (AWLEN=127). **Any future AXI master in this family, including the GS
> tiled-VRAM DMA, must cap bursts at 4 KiB and never let a burst cross a 4 KiB
> boundary.** See [[reference-lpddr4-bw-spike]].

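The burst rule in the footnote can be checked in software before it ever reaches RTL. The sketch below is a hypothetical helper, not code from the spike: it assumes beat-aligned transfers on the 32-byte port and splits a transfer so that no burst crosses a 4 KiB boundary or exceeds the AXI4 limit of 256 beats.

```python
BOUNDARY = 4096      # AXI4: a burst must not cross a 4 KiB address boundary
BEAT_BYTES = 32      # 256-bit data port
MAX_BEATS = 256      # AXI4 INCR limit (AWLEN is 8 bits, beats = AWLEN + 1)


def crosses_boundary(addr: int, awlen: int) -> bool:
    """True if a burst starting at addr with this AWLEN spans two 4 KiB pages."""
    last_byte = addr + (awlen + 1) * BEAT_BYTES - 1
    return addr // BOUNDARY != last_byte // BOUNDARY


def split_bursts(addr: int, nbytes: int):
    """Yield (address, AWLEN) pairs that cover [addr, addr + nbytes) legally."""
    if addr % BEAT_BYTES or nbytes % BEAT_BYTES:
        raise ValueError("sketch assumes beat-aligned address and length")
    end = addr + nbytes
    while addr < end:
        to_boundary = BOUNDARY - (addr % BOUNDARY)
        chunk = min(end - addr, to_boundary, MAX_BEATS * BEAT_BYTES)
        yield addr, chunk // BEAT_BYTES - 1
        addr += chunk


# The original failing burst and its fix, as described in the footnote.
print(crosses_boundary(0x0000, 255))    # 8 KiB burst
print(crosses_boundary(0x0000, 127))    # 4 KiB burst, 4 KiB-aligned
print(list(split_bursts(0x0F80, 4096))) # unaligned start gets split in two
```

Note that a 4 KiB burst is only legal when it starts on a 4 KiB-aligned address; the last call shows an unaligned start being cut at the boundary.
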
---

## 3. What LPDDR4 actually carries per frame in a tiled design

Tiling moves framebuffer/Z **read-modify-write** on-chip (M20K), so the three
things crossing the DDR boundary per frame are wildly unequal:

1. **Framebuffer writeout (tile flush to DDR): trivial.** 640×480×4 B ≈ 1.2
   MB/frame → ~74 MB/s @ 60 fps, under 1% of 8.4 GB/s. Ignore it.
2. **Texture fetch: the dominant unknown.** Textures are too large to sit in
   M20K beside the framebuffer tile, so they stream from DDR. Locality-driven.
3. **Tile reload from primitive disorder.** When primitives don't arrive in
   tile order, a tile gets evicted and re-fetched. Also locality-driven.

Items 2 and 3 are why a synthetic test would *lie* and real emulation traces
are the only honest source of truth: both depend on real access patterns and
working-set shape, not on raw throughput.

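The "trivial" verdict on framebuffer writeout is plain arithmetic. As a rough sketch, assuming the rounded 8.4 GB/s Test #1 figure can be treated as one shared budget for reads and writes (a simplification), the writeout share and the DDR bytes left per displayed pixel work out as follows:

```python
WIDTH, HEIGHT = 640, 480
BYTES_PER_PIXEL = 4            # 32-bit color
FPS = 60
SUSTAINED_BYTES_PER_S = 8.4e9  # rounded Test #1 result, treated as a shared budget

frame_bytes = WIDTH * HEIGHT * BYTES_PER_PIXEL
writeout_bytes_per_s = frame_bytes * FPS
writeout_share = writeout_bytes_per_s / SUSTAINED_BYTES_PER_S

# Whatever is not spent on the final flush is what texture fetch and
# tile reloads have to live within, expressed per displayed pixel.
budget_per_frame = SUSTAINED_BYTES_PER_S / FPS
leftover_per_pixel = (budget_per_frame - frame_bytes) / (WIDTH * HEIGHT)

print(f"frame: {frame_bytes / 1e6:.2f} MB")
print(f"writeout: {writeout_bytes_per_s / 1e6:.1f} MB/s ({writeout_share:.2%} of budget)")
print(f"left for textures + reloads: {leftover_per_pixel:.0f} B per displayed pixel per frame")
```

The leftover figure is an upper bound under ideal streaming; items 2 and 3 decide how much of it a real scene actually consumes.
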
---

## 4. The PS2-specific tilt: palettized textures

PS2 textures were overwhelmingly **palettized (4-bit and 8-bit indexed through
CLUT)**, not 32-bit RGBA. That is **a quarter (8-bit) to an eighth (4-bit)** of
the per-texel DRAM traffic of the naive 32-bit assumption. Budgeting texture
bandwidth as if every texel were 32-bit would massively overestimate the wall.

**Prerequisite measurement before any Test #2 RTL:** a **texture-format
histogram** (what fraction of texel fetches are 4-bit / 8-bit / 16-bit /
32-bit), plus the **overdraw factor** on busy scenes. That histogram sizes the
entire texture-bandwidth question before a line of RTL is written.

> **REALITY CHECK (2026-05-28, post-review):** an outside review assumed this
> histogram could be "extracted from the trace library / the 301 chapters."
> **It cannot: no real-game GS trace corpus exists in-repo.** A full-disk
> search confirmed every GS/texture artifact here is synthetic (hand-authored
> testbench sprites, `bake.py` test cards), and the two live-emulator capture
> harnesses (DobieStation, PCSX2) are parked/blocked (`sim/golden/README.md`,
> `third_party/*/NOTES.md`). The 301 chapters are EE-opcode/BIOS work, not GS
> captures. So this number must be **captured fresh, not extracted.** Building
> Test #2 against an assumed distribution would be the GS-side repeat of the
> Ch215 oracle-confusion trap. The realistic source is **PCSX2 GS dumps**
> (`.gs`/`.gs.xz`, a built-in PCSX2 GS-debugger feature that records real
> GIF + privileged-register traffic, incl. TEX0/CLUT, replayable offline);
> a prebuilt PCSX2 binary sidesteps the in-repo CMake-deps block. The
> prerequisite to the prerequisite is therefore **acquiring real GS dumps**
> (needs PCSX2 + games the owner owns), then a software `.gs` parser (no RTL).

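Once a histogram exists, the per-texel arithmetic in this section reduces to a weighted average. The sketch below shows only that reduction. The fetch counts in the last example are **placeholders invented for illustration, not measurements**, and CLUT reads are deliberately left out of the model.

```python
# Bits fetched from DRAM per texel for four GS pixel storage formats.
BITS_PER_TEXEL = {"PSMT4": 4, "PSMT8": 8, "PSMCT16": 16, "PSMCT32": 32}


def texel_traffic_ratio(fetch_counts: dict) -> float:
    """Texel DRAM traffic relative to assuming every fetch is 32-bit.

    fetch_counts maps a format name to a number of texel fetches.
    CLUT (palette) reads are not modeled here.
    """
    total_fetches = sum(fetch_counts.values())
    if total_fetches == 0:
        raise ValueError("empty histogram")
    total_bits = sum(BITS_PER_TEXEL[fmt] * n for fmt, n in fetch_counts.items())
    return total_bits / (32 * total_fetches)


# Pure-format cases reproduce the "quarter to an eighth" statement.
print(texel_traffic_ratio({"PSMT8": 100}))
print(texel_traffic_ratio({"PSMT4": 100}))

# PLACEHOLDER mix, made up to exercise the function. Replace with the
# histogram produced from real GS dumps.
placeholder_mix = {"PSMT4": 50, "PSMT8": 30, "PSMCT16": 10, "PSMCT32": 10}
print(texel_traffic_ratio(placeholder_mix))
```

Multiplying the resulting ratio by the measured overdraw factor gives the first honest estimate of texture bandwidth, which is why both numbers are listed as prerequisites.
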
---

## 5. Test #2: Tiled-Raster Microbenchmark Spec

Goal: measure **sustained DDR bandwidth and tile-reload rate under real PS2
workloads**, in a tiled rasterizer fragment (tile color/Z resident in M20K, RMW
on-chip, tile + texture streamed to/from LPDDR4B).

### 5.1 Two trace-data prerequisites (do these FIRST; they scope the build)

1. **Texture-format histogram** (§4): texel-fetch distribution by bit-depth +
   overdraw factor, from real game traffic.
2. **Worst-case stimulus selection** (§7): identify the single most
   alpha-blended / overdraw-punishing in-game scene in the captured traces. The
   design must clear *peak*, not mean.

### 5.2 Workload knobs (sweep matrix)

- **Tile size:** 32×32 and 64×32 pixels (start).
- **Color format:** PSMCT32 first; later PSMCT16 / PSMT8.
- **Z buffer:** on / off.
- **Alpha blend:** on / off.
- **Texture mode:** solid color · small cached texture · streaming texture ·
  CLUT texture.
- **Primitive mix:** fullscreen sprites · many small sprites · triangles.

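To get a feel for the size of this sweep, the knobs above can be enumerated as a Cartesian product. This is a sketch with hypothetical knob names and value labels; only the option lists themselves come from the spec.

```python
from itertools import product

# Initial sweep: PSMCT32 only, per the "first" qualifier on color format.
KNOBS = {
    "tile": [(32, 32), (64, 32)],
    "color_format": ["PSMCT32"],
    "z_buffer": [False, True],
    "alpha_blend": [False, True],
    "texture_mode": ["solid", "small_cached", "streaming", "clut"],
    "primitive_mix": ["fullscreen_sprites", "small_sprites", "triangles"],
}


def configs(knobs: dict):
    """Yield one dict per point in the sweep matrix."""
    names = list(knobs)
    for values in product(*(knobs[name] for name in names)):
        yield dict(zip(names, values))


initial = list(configs(KNOBS))
print(len(initial))

# Adding the later color formats triples the matrix.
later = dict(KNOBS, color_format=["PSMCT32", "PSMCT16", "PSMT8"])
print(len(list(configs(later))))
```

The initial matrix has 96 points and the extended one 288, so each run needs to be short or the sweep needs pruning before it goes on hardware.
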
### 5.3 Metrics (per configuration)

- tiles/sec, pixels/sec
- **bytes/pixel external** (the locality number that matters)
- LPDDR4 read GB/s and write GB/s (reuse the Test #1 counter approach)
- M20K footprint (tile color + Z + any texture cache)
- tile-reload rate (evictions/frame under the real primitive order)

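Two of these metrics, tile-reload rate and external bytes per pixel, can be prototyped in software long before RTL. The toy model below is an assumption-laden sketch: it treats the stimulus as a flat stream of pixel coordinates, keeps a configurable number of tiles resident with LRU replacement, and charges one fetch plus one writeback per tile load. None of those choices is fixed by this spec.

```python
from collections import OrderedDict


def count_tile_loads(pixels, tile_w=32, tile_h=32, resident_tiles=1):
    """Number of tile fills needed to draw the pixel stream in order (LRU)."""
    resident = OrderedDict()
    loads = 0
    for x, y in pixels:
        tile = (x // tile_w, y // tile_h)
        if tile in resident:
            resident.move_to_end(tile)        # mark as most recently used
            continue
        loads += 1
        resident[tile] = True
        if len(resident) > resident_tiles:
            resident.popitem(last=False)      # evict least recently used
    return loads


def external_bytes_per_pixel(pixels, tile_w=32, tile_h=32,
                             bytes_per_pixel=4, resident_tiles=1):
    """Tile traffic only: each load costs one fetch and one later writeback."""
    loads = count_tile_loads(pixels, tile_w, tile_h, resident_tiles)
    return loads * tile_w * tile_h * bytes_per_pixel * 2 / len(pixels)


# Same eight pixels, two orders: tile-sorted versus ping-ponging tiles.
ordered = [(0, 0)] * 4 + [(32, 0)] * 4
disordered = [(0, 0), (32, 0)] * 4

print(count_tile_loads(ordered), count_tile_loads(disordered))
print(count_tile_loads(disordered, resident_tiles=2))
```

Identical pixel work produces four times the tile loads when the order ping-pongs between two tiles, which is exactly the effect a synthetic sequential test cannot see.
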
### 5.4 Stimulus

Driven by **representative GS primitive + texture traffic pulled from captured
real-game traces**, specifically the worst-case scene from §5.1(2). **Not** a
boot screen or menu: those are bandwidth-trivial and will hand back a gorgeous
green result that collapses in-game.

---

## 6. Permanent architecture this implies

The GS that survives this board almost certainly is:

- **On-chip tile color + Z buffers** (M20K), RMW resolved on-chip.
- **LPDDR4B as backing VRAM** (no full 4 MB VRAM in M20K, consistent with 0006).
- **Texture cache or texture-tile streamer** feeding the rasterizer from DDR.
- **Scanout** either from the tiled framebuffer cache or a resolved linear buffer.

DSP budget is not the constraint (the shipped raster demo used 4/376 DSP).
Bandwidth and on-chip working-set are.

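The on-chip working set for the two starting tile sizes can be bounded with a capacity estimate. This sketch assumes 20 Kb (20,480-bit) M20K blocks, a 32-bit Z format, and pure bit-capacity packing. Real block counts depend on the width and depth configurations the fitter picks and can only be higher, so treat these as lower bounds.

```python
import math

M20K_BITS = 20 * 1024   # capacity of one M20K block in bits


def m20k_blocks(tile_w, tile_h, color_bits=32, z_bits=32, double_buffered=False):
    """Lower-bound M20K count for one tile's color and Z storage."""
    def blocks_for(bits_per_pixel):
        return math.ceil(tile_w * tile_h * bits_per_pixel / M20K_BITS)

    blocks = blocks_for(color_bits)
    if z_bits:                      # z_bits=0 models "Z buffer: off"
        blocks += blocks_for(z_bits)
    return blocks * (2 if double_buffered else 1)


for tile in ((32, 32), (64, 32)):
    print(tile, m20k_blocks(*tile), m20k_blocks(*tile, double_buffered=True))
```

Any texture cache sits on top of these figures, which is why M20K footprint is tracked as its own metric in the Test #2 spec.
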
---

## 7. Methodological guardrails

- **Traces are truth.** Texture/locality numbers cannot be synthesized honestly;
  pull them from real game traffic.
- **Test the peak, not the mean.** The torture case is alpha-blended overdraw
  (smoke, fog, transparency, particles), which is simultaneously worst for tile
  RMW and often texture-heavy. Find the worst frame and make *that* the stimulus.
- **Don't over-trust the green.** Test #1 green ≠ faithful-GS feasible. Only
  Test #2 under real game traffic produces the number that answers the question.

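"Peak, not mean" is easy to state and easy to get wrong in a script. The sketch below assumes a hypothetical per-frame summary (`FrameStats`) that a future `.gs` parser might emit; the field names and the sample values are invented for illustration and are not measurements.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class FrameStats:
    index: int
    pixels_drawn: int      # all rasterized pixels, including overdraw
    blended_pixels: int    # subset drawn with alpha blending enabled


def overdraw(frame: FrameStats, screen_pixels: int = 640 * 480) -> float:
    """Average number of times each screen pixel was drawn."""
    return frame.pixels_drawn / screen_pixels


def worst_frame(frames):
    """Frame with the most blended pixels; total pixels drawn breaks ties."""
    return max(frames, key=lambda f: (f.blended_pixels, f.pixels_drawn))


# PLACEHOLDER data: a quiet menu, ordinary gameplay, one particle-heavy frame.
frames = [
    FrameStats(0, 307_200, 0),
    FrameStats(1, 614_400, 61_440),
    FrameStats(2, 1_843_200, 1_228_800),
]

peak = worst_frame(frames)
print(peak.index, overdraw(peak), mean(overdraw(f) for f in frames))
```

On this made-up data the peak overdraw is twice the mean, which is the gap a mean-based stimulus would hide.
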
---

## 8. Status / Next

- **Bandwidth gate: GREEN** (this doc, §1–2). New feasibility baseline.
- **Strategic pivot endorsed by Codex + outside review:** the next serious work
  moves from qbert opcode-growth (Track A oracle, 0007) toward **GS tiled-VRAM
  architecture feasibility**, because that path now has a plausible physical
  foundation.
- **Immediate next step (no RTL): ACQUIRE real GS traffic first.** The trace
  corpus does not exist (see §4 reality check). Capture PCSX2 GS dumps from
  real games (owner-supplied, prebuilt PCSX2), then write a software `.gs`
  parser to produce the texture-format histogram + locate the worst-case
  alpha-overdraw frame (§5.1). Only then is the Test #2 stimulus honest.
- **Then:** build the Test #2 microbenchmark to this spec; its sustained number
  under real game traffic is the final yes/no on faithful-enough GS on this board.
- **Chapter numbering note:** "Ch306" is already this repo's EE-core reality
  checkpoint (0007). This GS line is a later chapter (Ch307+); the label, not
  the substance, is what differs from Codex's framing.