retroDE_ps2/docs/decisions/0006-vram-roadmap.md
thejayman77 ec82764bef Initial commit: retroDE_ps2 — first-of-its-kind PS2 GS FPGA core (DE25-Nano / Agilex 5)
RTL (GS rasterizer, EE core stub, platform bridge, LPDDR4B path), sim regression
(272 TBs), docs, and tooling. Copyrighted PS2 content (BIOS, game code, GS dumps,
and all dump-derived textures/traces) is excluded via .gitignore and stays local.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-29 20:10:50 -04:00


# Decision 0006: VRAM Roadmap
Status: `In progress` — Ch251.4 near-term rescue applied, longer-term work
queued.
## Context
The Ch251 hardware demo build (`de25_nano_psmct32_raster_demo_top`) failed the
Quartus Fitter on Agilex 5 with **516 / 358 M20K** (144%). The Fitter resource
report attributed ~410 M20Ks to two replicated `vram_bram_stub` banks:
```
u_demo|u_vram|mem_rtl_0 Logical Size: 4194304 bits M20K blocks: 204.800
u_demo|u_vram|mem_rtl_1 Logical Size: 4194304 bits M20K blocks: 204.800
```
Root cause: `vram_bram_stub` exposes **1 write + 2 independent read ports**.
An M20K block has at most two physical ports in total. To honour 1W + 2R,
Quartus replicates the entire storage so that each read port gets its own
simple-dual-port BRAM (one write port, one read port), with the write fanned
out to both copies. True dual-port would not have rescued this: TDP still
gives only 2 physical ports, not 3.
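
The replication cost can be reproduced with back-of-envelope arithmetic. The sketch below is a capacity-only estimate in Python, not part of the RTL or the project tooling. `M20K_BITS` and `m20k_blocks` are illustrative names, and it assumes 20,480 usable bits per M20K and one full storage copy per read port, which matches the 204.800 figure in the Fitter report above. Real packing depends on the width/depth configuration the Fitter picks.

```python
M20K_BITS = 20 * 1024  # 20,480 bits per M20K block (capacity only)


def m20k_blocks(vram_bytes: int, read_ports: int = 1) -> float:
    """Capacity-only M20K estimate: one full storage copy per read port."""
    if vram_bytes <= 0 or read_ports < 1:
        raise ValueError("vram_bytes must be positive and read_ports >= 1")
    return read_ports * vram_bytes * 8 / M20K_BITS


if __name__ == "__main__":
    vram_bytes = 512 * 1024  # 4,194,304 bits, as in the Fitter report
    print(f"1W+1R: {m20k_blocks(vram_bytes, 1):.1f} M20K")  # 204.8
    print(f"1W+2R: {m20k_blocks(vram_bytes, 2):.1f} M20K")  # 409.6
```
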
The two read ports serve distinct clients:
- **read** — PCRTC scanout (every pixel)
- **read2** — PSMT4 RMW old-byte read on the rasterizer write path
The Ch251 build draws PSMCT32 sprites only. The PSMT4 RMW pipe is wired but
never fires (`is_t4_emit` stays low), so read2 is dead weight on hardware.
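
For readers unfamiliar with why PSMT4 needs the old byte at all: a 4-bit texel occupies half a byte, so a byte-granular write has to preserve the neighbouring nibble. The Python sketch below models that merge. It is an illustration only, not the RTL; `merge_psmt4_texel` is a hypothetical name, and which pixel maps to the high or low nibble is left to the caller as an assumption.

```python
def merge_psmt4_texel(old_byte: int, texel: int, high_nibble: bool) -> int:
    """Return the byte to write back after replacing one 4-bit texel.

    old_byte is the value fetched through the second read port (read2).
    """
    if not 0 <= old_byte <= 0xFF:
        raise ValueError("old_byte must fit in 8 bits")
    if not 0 <= texel <= 0xF:
        raise ValueError("texel must fit in 4 bits")
    if high_nibble:
        return (old_byte & 0x0F) | (texel << 4)  # keep the low nibble
    return (old_byte & 0xF0) | texel             # keep the high nibble


if __name__ == "__main__":
    print(hex(merge_psmt4_texel(0xAB, 0x5, high_nibble=False)))  # 0xa5
    print(hex(merge_psmt4_texel(0xAB, 0x5, high_nibble=True)))   # 0x5b
```

A PSMCT32 write replaces a whole 32-bit word, so it never needs this read, which is why read2 can be dropped in a PSMCT32-only build.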
## Decision (Near-Term — Ch251.4)
Add a parameter `ENABLE_READ2` to `vram_bram_stub`:
- Default `1` keeps every simulation TB and every PSMT4-exercising path
byte-identical.
- Hardware top (`de25_nano_psmct32_raster_demo_top`) overrides to `0`. When
disabled, the read2 `always_ff` branch contains **no reference** to `mem`,
so Quartus infers a single 1W+1R simple-dual-port BRAM (~205 M20Ks at
512 KiB) instead of two replicas (~410 M20Ks).
This is a **scoped hardware-demo build profile**, not a general fix. It is
correct only as long as the hardware build is PSMCT32 (or any non-PSMT4
format). Any future hardware build that exercises PSMT4 RMW must either
re-enable read2 (and accept the M20K cost) or first land the long-term
architecture below.
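
A rough projection of what the override buys, using only the figures quoted in the Context section. This is a Python estimate, not a Fitter result. It assumes that every non-VRAM block keeps its M20K usage unchanged between builds, and all names (`projected_total`, `fits`, `OTHER_M20K`) are illustrative.

```python
M20K_BITS = 20 * 1024   # capacity of one M20K block, in bits
DEVICE_M20K = 358       # device budget quoted in the Context section
FAILED_TOTAL = 516      # Fitter total for the failed 1W+2R build
VRAM_BYTES = 512 * 1024


def vram_m20k(vram_bytes: int, read_ports: int) -> float:
    """Capacity-only VRAM cost: one full copy per read port."""
    return read_ports * vram_bytes * 8 / M20K_BITS


# Assumption: the rest of the design is unaffected by the VRAM shape.
OTHER_M20K = FAILED_TOTAL - vram_m20k(VRAM_BYTES, 2)  # roughly 106


def projected_total(vram_bytes: int, enable_read2: bool) -> float:
    read_ports = 2 if enable_read2 else 1
    return OTHER_M20K + vram_m20k(vram_bytes, read_ports)


def fits(vram_bytes: int, enable_read2: bool) -> bool:
    return projected_total(vram_bytes, enable_read2) <= DEVICE_M20K


if __name__ == "__main__":
    for size, read2 in [(VRAM_BYTES, False), (VRAM_BYTES, True),
                        (2 * VRAM_BYTES, False)]:
        total = projected_total(size, read2)
        print(f"{size // 1024} KiB, ENABLE_READ2={int(read2)}: "
              f"~{round(total)} M20K, fits={fits(size, read2)}")
```

Under these assumptions the 512 KiB single-copy profile lands at roughly 311 M20Ks, while either re-enabling read2 or doubling VRAM to 1 MiB pushes the estimate back to about 516.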
## Decision (Long-Term)
Before the GS path expands beyond PSMCT32 on hardware (PSMT4 RMW, broader
format coverage, or a larger framebuffer), replace the replicated-multi-read
VRAM with one of:
1. **Arbitrated TDP VRAM scheduler** — one TDP backing memory. Port A serves
PCRTC reads with priority; port B serves the writer / RMW path. PSMT4 RMW
becomes multi-cycle and may stall raster writes. This is the most correct
long-term FPGA shape.
2. **Line-buffer scanout** — PCRTC reads short bursts into a small line
FIFO/line-buffer once per scanline, freeing the VRAM ports for writes for
the rest of the line. More complex but closer to a scalable video
architecture.
3. **Bank/tile partitioning** — split VRAM by banks so different clients
typically hit different banks. Still needs conflict handling. Useful as a
later optimization, not as the first replacement.
Eventually, larger memory surfaces (the full 4 MiB of true PS2 VRAM, or the
32 MiB main RAM) will need SDRAM/HPS/DDR-backed storage with tiled BRAM
caches; the all-M20K convenience model does not scale.
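
To make the "multi-cycle RMW may stall raster writes" cost of option 1 concrete, here is a toy Python timeline of the writer-side port (port B). It is a sketch under stated assumptions, not a design: the op names `w32` and `t4` are hypothetical, each port access is assumed to take exactly one cycle, and the old-byte read is assumed not to overlap with any other access. A pipelined implementation could do better.

```python
from typing import Iterable, List, Tuple

# Hypothetical op kinds and the port B accesses each one needs.
#   "w32": PSMCT32 full-word write, one access.
#   "t4" : PSMT4 nibble write, old-byte read followed by merged write.
PHASES = {
    "w32": ("write",),
    "t4": ("read_old", "write_merged"),
}


def schedule_port_b(ops: Iterable[str]) -> List[Tuple[int, str, str]]:
    """Return (cycle, op, phase) tuples for a single shared writer port."""
    timeline: List[Tuple[int, str, str]] = []
    cycle = 0
    for op in ops:
        if op not in PHASES:
            raise ValueError(f"unknown op kind: {op!r}")
        for phase in PHASES[op]:
            timeline.append((cycle, op, phase))
            cycle += 1  # one port access per cycle, no overlap
    return timeline


if __name__ == "__main__":
    for entry in schedule_port_b(["w32", "t4", "w32"]):
        print(entry)
```

In this model the second `w32` lands on cycle 3 rather than cycle 2: the one-cycle slip is the stall the option description refers to. Port A (PCRTC scanout) is not modelled because it has its own physical port.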
## Triggers — when to revisit (Ch252)
Re-open this decision and land one of the long-term options above when
**any** of the following becomes true on a hardware build:
1. **PSMT4 RMW returns to the rasterizer write path on hardware.** Any
GS draw flow that consults `is_t4_emit` needs the second VRAM read
port live, which re-introduces the replication cost.
2. **More than one VRAM read client during scanout.** The current
profile is one read client (PCRTC). A second simultaneous read
consumer — texture cache fetch, CLUT sampler from VRAM, secondary
display window, anything that races PCRTC for read bandwidth —
recreates the 1W+nR shape that forced Quartus replication in the
first place.
3. **VRAM_BYTES grows beyond the current 512 KiB profile.** 512 KiB
already costs ~205 M20Ks per replica at Agilex 5 packing. Any
expansion (larger framebuffer, multi-format scratch space, texture
storage) at the current replicated shape exceeds the device budget.
A simulation/elaboration tripwire in `vram_bram_stub.sv` fires
(`$display` + `$fatal`) when `ENABLE_READ2 = 1` **and**
`BYTES >= 262_144` (256 KiB). 256 KiB is not magical: it is the
threshold at or above which replicated VRAM becomes a board-level
architectural decision rather than a casual parameter flip. The
tripwire is a loud canary in lint / sim; the **real protection is the
board-top parameter profile**.
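
The tripwire condition is small enough to state as a predicate. The Python below is a behavioural model of the rule as described in this record, not the SystemVerilog source; `tripwire_fires` and `TRIPWIRE_BYTES` are illustrative names.

```python
TRIPWIRE_BYTES = 262_144  # 256 KiB, inclusive threshold


def tripwire_fires(enable_read2: bool, vram_bytes: int) -> bool:
    """True when the described check would stop simulation/elaboration."""
    return bool(enable_read2) and vram_bytes >= TRIPWIRE_BYTES


if __name__ == "__main__":
    kib = 1024
    cases = [
        (True, 256 * kib - 1),  # just under the threshold
        (True, 256 * kib),      # exactly at the threshold
        (False, 512 * kib),     # hardware demo profile (read2 stripped)
        (True, 512 * kib),      # replicated 512 KiB VRAM
    ]
    for enable_read2, size in cases:
        print(enable_read2, size, tripwire_fires(enable_read2, size))
```

Because the comparison is inclusive, exactly 256 KiB with read2 enabled already trips, while the hardware demo profile (read2 disabled at 512 KiB) does not.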
## Consequences
- Ch251 ships on hardware with the read2-strip build profile. The
bring-up runbook documents the override so anyone reading it later sees
the explicit trade-off.
- Simulation regressions stay byte-identical (default `ENABLE_READ2 = 1`).
- Any chapter that re-enables PSMT4 on hardware **must** land an arbitrated
/ line-buffered VRAM design first. Surfacing this as a decision record
keeps it from quietly slipping when scope expands.
- The Ch251 failure was a warning shot about VRAM strategy, not a fundamental
blocker on the PS2 core. Actual 512 KiB framebuffer storage is ~205 M20Ks;
the over-budget portion was the second full copy.