thejayman77 ec82764bef Initial commit: retroDE_ps2 — first-of-its-kind PS2 GS FPGA core (DE25-Nano / Agilex 5)
RTL (GS rasterizer, EE core stub, platform bridge, LPDDR4B path), sim regression
(272 TBs), docs, and tooling. Copyrighted PS2 content (BIOS, game code, GS dumps,
and all dump-derived textures/traces) is excluded via .gitignore and stays local.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-29 20:10:50 -04:00


Decision 0006: VRAM Roadmap

Status: In progress — Ch251.4 near-term rescue applied, longer-term work queued.

Context

The Ch251 hardware demo build (de25_nano_psmct32_raster_demo_top) failed the Quartus Fitter on Agilex 5: it required 516 M20K blocks against 358 available (144%). The Fitter resource report attributed ~410 M20Ks to two replicated vram_bram_stub banks:

u_demo|u_vram|mem_rtl_0   Logical Size: 4194304 bits   M20K blocks: 204.800
u_demo|u_vram|mem_rtl_1   Logical Size: 4194304 bits   M20K blocks: 204.800

Root cause: vram_bram_stub exposes 1 write + 2 independent read ports. An M20K block has at most two physical ports; in simple-dual-port mode that is one write port and one read port. To honour 1W + 2R, Quartus replicates the entire storage so that each read port gets its own simple-dual-port BRAM, with the write fanned out to both copies. True dual-port would not have rescued this: TDP still gives only 2 physical ports, not 3.
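The numbers in the Fitter report can be reproduced with a few lines of arithmetic. The sketch below assumes the standard M20K capacity of 20,480 bits and one full storage copy per independent read port; the helper names are illustrative and do not come from the repository.

```python
M20K_BITS = 20 * 1024  # one M20K block holds 20,480 bits


def m20k_blocks(logical_bits: int) -> float:
    """Fractional block count, in the same form the Fitter report prints."""
    return logical_bits / M20K_BITS


def replicas_needed(read_ports: int) -> int:
    """Storage copies for 1 write + N independent reads, assuming each copy
    is a simple-dual-port RAM (1W + 1R) that receives every write."""
    return max(1, read_ports)


vram_bits = 512 * 1024 * 8                 # 512 KiB = 4,194,304 bits
per_copy = m20k_blocks(vram_bits)          # blocks for one copy
total = per_copy * replicas_needed(2)      # 1W + 2R means two copies
overuse = 516 / 358                        # requested / available

print(f"per copy: {per_copy:.1f}, replicated: {total:.1f}, use: {overuse:.0%}")
```

One copy comes out at 204.8 blocks, which matches each mem_rtl_* line in the report, and doubling it gives the ~410 figure.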

The two read ports serve distinct clients:

  • read — PCRTC scanout (every pixel)
  • read2 — PSMT4 RMW old-byte read on the rasterizer write path

The Ch251 build draws PSMCT32 sprites only. The PSMT4 RMW pipe is wired but never fires (is_t4_emit stays low), so read2 is dead weight on hardware.

Decision (Near-Term — Ch251.4)

Add a parameter ENABLE_READ2 to vram_bram_stub:

  • Default 1 keeps every simulation TB and every PSMT4-exercising path byte-identical.
  • Hardware top (de25_nano_psmct32_raster_demo_top) overrides to 0. When disabled, the read2 always_ff branch contains no reference to mem, so Quartus infers a single 1W+1R simple-dual-port BRAM (~205 M20Ks at 512 KiB) instead of two replicas (~410 M20Ks).

This is a scoped hardware-demo build profile, not a general fix. It is correct only as long as the hardware build is PSMCT32 (or any non-PSMT4 format). Any future hardware build that exercises PSMT4 RMW must either re-enable read2 (and accept the M20K cost) or first land the long-term architecture below.
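As a rough sanity check on the rescue, the budget after stripping read2 can be estimated from the failing build's totals. This is only an estimate: it assumes the rest of the design keeps the same M20K usage and that the Fitter packs the remaining copy exactly as before. The real figure comes from the Fitter report of the rebuilt design.

```python
M20K_BITS = 20 * 1024   # bits per M20K block
BUDGET = 358            # available M20Ks, from the Fitter failure message
FAILED_TOTAL = 516      # M20Ks requested by the failing build
VRAM_BYTES = 512 * 1024


def vram_blocks(vram_bytes: int, enable_read2: bool) -> float:
    """Estimated M20Ks for VRAM: two copies with read2, one without."""
    copies = 2 if enable_read2 else 1
    return copies * vram_bytes * 8 / M20K_BITS


# Everything in the failing build that was not replicated VRAM.
other_logic = FAILED_TOTAL - vram_blocks(VRAM_BYTES, True)
# Same design with the read2 port stripped (ENABLE_READ2 = 0).
after_strip = other_logic + vram_blocks(VRAM_BYTES, False)

print(f"estimated: {after_strip:.1f} / {BUDGET} ({after_strip / BUDGET:.0%})")
```

Under those assumptions the stripped build lands under the 358-block limit with some margin, which is consistent with treating the second copy as the over-budget portion.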

Decision (Long-Term)

Before the GS path expands beyond PSMCT32 on hardware (PSMT4 RMW, broader format coverage, or a larger framebuffer), replace the replicated-multi-read VRAM with one of:

  1. Arbitrated TDP VRAM scheduler — one TDP backing memory. Port A serves PCRTC reads with priority; port B serves the writer / RMW path. PSMT4 RMW becomes multi-cycle and may stall raster writes. This is the most correct long-term FPGA shape.

  2. Line-buffer scanout — PCRTC reads short bursts into a small line FIFO/line-buffer once per scanline, freeing the VRAM ports for writes for the rest of the line. More complex but closer to a scalable video architecture.

  3. Bank/tile partitioning — split VRAM by banks so different clients typically hit different banks. Still needs conflict handling. Useful as a later optimization, not as the first replacement.
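To make option 1 concrete, here is a small behavioural model of an arbitrated two-port VRAM, written in Python rather than RTL. It is a sketch under stated assumptions, not the planned design: the Request type, the run function, and the nibble layout are all hypothetical. Port A is reserved for scanout reads, port B performs one access per cycle, and a 4-bit read-modify-write therefore occupies port B for two cycles (read old byte, then write the merged byte).

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    kind: str            # "write" (full byte) or "rmw4" (4-bit read-modify-write)
    addr: int
    data: int
    high: bool = False   # for "rmw4": True selects the upper nibble


def run(mem, scan_addrs, requests):
    """Simulate cycle by cycle; return (scanout bytes, cycles port B was busy)."""
    scan, queue = deque(scan_addrs), deque(requests)
    scanned, port_b_cycles, old_byte = [], 0, None
    while scan or queue:
        # Port A: scanout read, never blocked by the writer. It samples
        # memory before port B's write in the same cycle (old-data read).
        if scan:
            scanned.append(mem[scan.popleft()])
        # Port B: exactly one access per cycle.
        if queue:
            port_b_cycles += 1
            req = queue[0]
            if req.kind == "write":
                mem[req.addr] = req.data & 0xFF
                queue.popleft()
            elif old_byte is None:
                old_byte = mem[req.addr]          # RMW cycle 1: read old byte
            else:
                shift = 4 if req.high else 0
                mask = 0x0F << shift
                merged = (old_byte & ~mask & 0xFF) | ((req.data & 0x0F) << shift)
                mem[req.addr] = merged            # RMW cycle 2: write merged byte
                old_byte = None
                queue.popleft()
    return scanned, port_b_cycles


mem = bytearray(16)
mem[3] = 0xAB
reqs = [Request("write", 0, 0x11), Request("rmw4", 3, 0x5), Request("write", 1, 0x22)]
scanned, busy = run(mem, [3, 3, 3, 3], reqs)
```

Three requests keep port B busy for four cycles because the read-modify-write takes two. That extra cycle is the raster-write stall the option describes, while scanout on port A proceeds every cycle.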

Eventually, larger memory surfaces (a few MiB or more of true PS2 VRAM, which is 4 MiB in full, or the 32 MiB main RAM) will need SDRAM/HPS/DDR-backed storage with tiled BRAM caches; the all-M20K convenience model does not scale.
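The scaling problem is easy to quantify. The sketch below assumes 20,480 bits per M20K, a single unreplicated copy, and the 358-block figure from the Fitter message as the budget; the 4 MiB size is the PS2's GS VRAM capacity.

```python
M20K_BITS = 20 * 1024   # bits per M20K block
BUDGET = 358            # available M20Ks, from the Fitter failure message
MIB = 1024 * 1024


def m20k_blocks(nbytes: int) -> float:
    """M20Ks for one unreplicated copy of nbytes of storage."""
    return nbytes * 8 / M20K_BITS


surfaces = [
    ("512 KiB demo VRAM", MIB // 2),
    ("4 MiB GS VRAM", 4 * MIB),
    ("32 MiB main RAM", 32 * MIB),
]
for name, size in surfaces:
    blocks = m20k_blocks(size)
    print(f"{name}: {blocks:.1f} M20K ({blocks / BUDGET:.1f}x budget)")
```

Only the 512 KiB profile fits; the larger surfaces exceed the on-chip block count several times over even before any replication.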

Triggers — when to revisit (Ch252)

Re-open this decision and land one of the long-term options above when any of the following becomes true on a hardware build:

  1. PSMT4 RMW returns to the rasterizer write path on hardware. Any GS draw flow that drives is_t4_emit high needs the second VRAM read port live, which re-introduces the replication cost.

  2. More than one VRAM read client during scanout. The current profile is one read client (PCRTC). A second simultaneous read consumer — texture cache fetch, CLUT sampler from VRAM, secondary display window, anything that races PCRTC for read bandwidth — recreates the 1W+nR shape that forced Quartus replication in the first place.

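  3. VRAM_BYTES grows beyond the current 512 KiB profile. 512 KiB already costs ~205 M20Ks per replica at Agilex 5 packing, so the replicated shape (~410 M20Ks) is over the 358-block budget today, and even a single copy at 1 MiB (~410 M20Ks) would not fit. Any expansion (larger framebuffer, multi-format scratch space, texture storage) therefore exceeds the device budget, with or without replication.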

A simulation/elaboration tripwire in vram_bram_stub.sv fires ($display + $fatal) when ENABLE_READ2 = 1 and BYTES >= 262_144 (256 KiB). The 256 KiB value is not magical; it marks the size at or above which replicated VRAM becomes a board-level architectural decision rather than a casual parameter flip. The tripwire is a loud canary in lint / sim; the real protection is the board-top parameter profile.
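The tripwire condition is small enough to restate as a predicate, which can be handy in build scripts or parameter-sweep tooling. This Python mirror is illustrative only; the function and constant names are assumptions, and the authoritative check is the one in vram_bram_stub.sv.

```python
TRIPWIRE_BYTES = 262_144  # 256 KiB threshold stated in this record


def tripwire_fires(enable_read2: int, vram_bytes: int) -> bool:
    """True when the configuration would hit the $fatal described above:
    read2 enabled and VRAM at or above the threshold."""
    return enable_read2 == 1 and vram_bytes >= TRIPWIRE_BYTES


# Hardware demo profile: read2 stripped at 512 KiB.
hw_profile_ok = not tripwire_fires(0, 512 * 1024)
# Re-enabling read2 at the same size trips the check.
regression = tripwire_fires(1, 512 * 1024)
```

The comparison is inclusive, so a 256 KiB VRAM with read2 enabled already trips it.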

Consequences

  • Ch251 ships on hardware with the read2-strip build profile. The bring-up runbook documents the override so anyone reading it later sees the explicit trade-off.
  • Simulation regressions stay byte-identical (default ENABLE_READ2 = 1).
  • Any chapter that re-enables PSMT4 on hardware must land an arbitrated / line-buffered VRAM design first. Surfacing this as a decision record keeps it from quietly slipping when scope expands.
  • The Ch251 failure was a warning shot about VRAM strategy, not a fundamental blocker on the PS2 core. Actual 512 KiB framebuffer storage is ~205 M20Ks; the over-budget portion was the second full copy.