# Decision 0006: VRAM Roadmap

Status: `In progress`. Ch251.4 near-term rescue applied; longer-term work
queued.

## Context

The Ch251 hardware demo build (`de25_nano_psmct32_raster_demo_top`) failed the
Quartus Fitter on Agilex 5: it needs **516 M20Ks against 358 available**
(144%). The Fitter resource report attributed ~410 M20Ks to two replicated
`vram_bram_stub` banks:

```
u_demo|u_vram|mem_rtl_0 Logical Size: 4194304 bits M20K blocks: 204.800
u_demo|u_vram|mem_rtl_1 Logical Size: 4194304 bits M20K blocks: 204.800
```

Root cause: `vram_bram_stub` exposes **1 write + 2 independent read ports**.
An M20K block has at most two physical ports (one write and one read in
simple-dual-port mode). To honour 1W + 2R, Quartus replicates the entire
storage so each read port gets its own simple-dual-port BRAM, with the write
fanned out to both copies. True dual-port would not have rescued this: TDP
still gives only 2 physical ports, not 3.

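The replica cost follows directly from the Fitter figures. Below is a minimal Python sketch of that arithmetic, assuming an M20K holds 20,480 bits (the capacity implied by the report's 4194304 bits per 204.8 blocks) and that a 1W+nR memory is built from n simple-dual-port copies. The function names are illustrative and do not exist in the repo.

```python
M20K_BITS = 20 * 1024  # bits per M20K block; consistent with 4194304 / 204.8


def m20k_blocks(logical_bits: int) -> float:
    """Fractional M20K count for one copy, in the style the Fitter reports."""
    return logical_bits / M20K_BITS


def replicated_blocks(logical_bits: int, read_ports: int) -> float:
    """1W+nR built from simple-dual-port BRAM: one full copy per read port."""
    return m20k_blocks(logical_bits) * read_ports


vram_bits = 512 * 1024 * 8  # 512 KiB VRAM = 4194304 bits

print(m20k_blocks(vram_bits))           # cost of a single copy
print(replicated_blocks(vram_bits, 2))  # cost with read + read2
print(round(516 / 358 * 100))           # demo-build M20K utilisation, percent
```

The second figure is the ~410 M20Ks the report attributes to `mem_rtl_0` and `mem_rtl_1` together.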
The two read ports serve distinct clients:

- **read**: PCRTC scanout (every pixel)
- **read2**: PSMT4 RMW old-byte read on the rasterizer write path

The Ch251 build draws PSMCT32 sprites only. The PSMT4 RMW pipe is wired but
never fires (`is_t4_emit` stays low), so read2 is dead weight on hardware.

## Decision (Near-Term: Ch251.4)

Add a parameter `ENABLE_READ2` to `vram_bram_stub`:

- Default `1` keeps every simulation TB and every PSMT4-exercising path
  byte-identical.
- Hardware top (`de25_nano_psmct32_raster_demo_top`) overrides to `0`. When
  disabled, the read2 `always_ff` branch contains **no reference** to `mem`,
  so Quartus infers a single 1W+1R simple-dual-port BRAM (~205 M20Ks at
  512 KiB) instead of two replicas (~410 M20Ks).

This is a **scoped hardware-demo build profile**, not a general fix. It is
correct only as long as the hardware build is PSMCT32 (or any non-PSMT4
format). Any future hardware build that exercises PSMT4 RMW must either
re-enable read2 (and accept the M20K cost) or first land the long-term
architecture below.

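As a back-of-envelope check on the near-term profile, the sketch below subtracts one VRAM replica from the failing build's total. It assumes the non-VRAM M20K usage does not change when read2 is stripped, which only a fresh Fitter run can confirm. The names are illustrative.

```python
M20K_AVAILABLE = 358      # device budget from the Fitter report
M20K_FAILING_BUILD = 516  # total demand with both VRAM replicas present
M20K_PER_REPLICA = 204.8  # one 512 KiB copy


def estimated_blocks(enable_read2: bool) -> float:
    """Estimated total M20K demand, assuming all other usage stays constant."""
    replicas_removed = 0 if enable_read2 else 1
    return M20K_FAILING_BUILD - replicas_removed * M20K_PER_REPLICA


for enable_read2 in (True, False):
    total = estimated_blocks(enable_read2)
    fits = total <= M20K_AVAILABLE
    print(f"ENABLE_READ2={int(enable_read2)}: ~{total:.0f} M20Ks, fits={fits}")
```

Under that assumption the stripped build lands near 311 of 358 M20Ks. That fits, but leaves limited headroom for a larger framebuffer.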
## Decision (Long-Term)

Before the GS path expands beyond PSMCT32 on hardware (PSMT4 RMW, broader
format coverage, or a larger framebuffer), replace the replicated-multi-read
VRAM with one of:

1. **Arbitrated TDP VRAM scheduler**: one TDP backing memory. Port A serves
   PCRTC reads with priority; port B serves the writer / RMW path. PSMT4 RMW
   becomes multi-cycle and may stall raster writes. This is the most correct
   long-term FPGA shape.

2. **Line-buffer scanout**: PCRTC reads short bursts into a small line
   FIFO/line-buffer once per scanline, freeing the VRAM ports for writes for
   the rest of the line. More complex but closer to a scalable video
   architecture.

3. **Bank/tile partitioning**: split VRAM by banks so different clients
   typically hit different banks. Still needs conflict handling. Useful as a
   later optimization, not as the first replacement.

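To make the cost of option 1 concrete, here is a toy Python model of port B's schedule when PSMT4 writes turn into read-then-write pairs. It is a sketch under simplifying assumptions: BRAM read latency is ignored, port A is assumed to absorb all PCRTC traffic, and `RasterWrite` and `port_b_schedule` are hypothetical names rather than RTL modules.

```python
from dataclasses import dataclass


@dataclass
class RasterWrite:
    needs_rmw: bool  # True for PSMT4-style sub-byte writes, False for PSMCT32


def port_b_schedule(writes: list[RasterWrite]) -> list[str]:
    """Per-cycle activity on port B; port A is left free for PCRTC reads."""
    cycles = []
    for i, w in enumerate(writes):
        if w.needs_rmw:
            cycles.append(f"read_old[{i}]")  # fetch the byte to merge into
        cycles.append(f"write[{i}]")         # commit the (merged) data
    return cycles


writes = [RasterWrite(False), RasterWrite(True), RasterWrite(False)]
schedule = port_b_schedule(writes)
stall_cycles = len(schedule) - len(writes)

print(schedule)
print("stall cycles:", stall_cycles)
```

Each RMW write adds at least one cycle in which the rasterizer cannot issue a new write. This is the stall the option 1 description warns about.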
Eventually larger memory surfaces (≥ a few MiB of true PS2 VRAM, or the
32 MiB main RAM) will need SDRAM/HPS/DDR-backed storage with tiled BRAM
caches; the all-M20K convenience model does not scale.

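The scaling problem can be seen with the same block arithmetic. The sketch below assumes 20,480 bits per M20K and the 358-block budget from the Fitter report, and it uses 4 MiB for full GS VRAM. The helper name is illustrative.

```python
M20K_BITS = 20 * 1024  # bits per M20K block
M20K_AVAILABLE = 358   # device budget from the Fitter report
MIB = 1024 * 1024


def m20k_needed(size_bytes: int) -> float:
    """M20K blocks for a single, unreplicated copy of the given size."""
    return size_bytes * 8 / M20K_BITS


surfaces = [
    ("512 KiB demo VRAM", MIB // 2),
    ("4 MiB GS VRAM", 4 * MIB),
    ("32 MiB main RAM", 32 * MIB),
]

for label, size in surfaces:
    blocks = m20k_needed(size)
    ratio = blocks / M20K_AVAILABLE
    print(f"{label}: {blocks:.1f} M20Ks ({ratio:.1f}x the device budget)")
```

Even a single unreplicated copy of the larger surfaces exceeds the on-chip block count several times over.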
## Triggers: when to revisit (Ch252)

Re-open this decision and land one of the long-term options above when
**any** of the following becomes true on a hardware build:

1. **PSMT4 RMW returns to the rasterizer write path on hardware.** Any
   GS draw flow that consults `is_t4_emit` needs the second VRAM read
   port live, which re-introduces the replication cost.

2. **More than one VRAM read client during scanout.** The current
   profile is one read client (PCRTC). A second simultaneous read
   consumer (texture cache fetch, CLUT sampler from VRAM, secondary
   display window, anything that races PCRTC for read bandwidth)
   recreates the 1W+nR shape that forced Quartus replication in the
   first place.

3. **VRAM_BYTES grows beyond the current 512 KiB profile.** 512 KiB
   already costs ~205 M20Ks per replica at Agilex 5 packing. Any
   expansion (larger framebuffer, multi-format scratch space, texture
   storage) at the current replicated shape exceeds the device budget.

A simulation/elaboration tripwire in `vram_bram_stub.sv` fires
(`$display` + `$fatal`) when `ENABLE_READ2 = 1` **and**
`BYTES >= 262_144` (256 KiB). 256 KiB is not magical; it is the
threshold above which replicated VRAM becomes a board-level
architectural decision rather than a casual parameter flip. The
tripwire is a loud canary in lint / sim; the **real protection is the
board-top parameter profile**.

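The tripwire condition can be mirrored in a small Python predicate, for example in a build-configuration check. This is a sketch of the rule as described above, not the SystemVerilog in `vram_bram_stub.sv`. The exception and function names are hypothetical, and the 128 KiB case is an arbitrary illustrative size.

```python
READ2_TRIPWIRE_BYTES = 262_144  # 256 KiB threshold from the decision record


class VramProfileError(Exception):
    """Raised where the RTL tripwire would call $fatal."""


def check_vram_profile(vram_bytes: int, enable_read2: bool) -> None:
    """Reject replicated (read2-enabled) VRAM at or above the threshold."""
    if enable_read2 and vram_bytes >= READ2_TRIPWIRE_BYTES:
        raise VramProfileError(
            f"ENABLE_READ2=1 with BYTES={vram_bytes} replicates VRAM; "
            "treat this as a board-level architectural decision"
        )


# Hardware demo profile: 512 KiB with read2 stripped passes.
check_vram_profile(512 * 1024, enable_read2=False)

# Hypothetical small configuration: read2 enabled below the threshold passes.
check_vram_profile(128 * 1024, enable_read2=True)

# 512 KiB with read2 enabled trips the wire.
try:
    check_vram_profile(512 * 1024, enable_read2=True)
except VramProfileError as err:
    print("tripwire:", err)
```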
## Consequences

- Ch251 ships on hardware with the read2-strip build profile. The
  bring-up runbook documents the override so anyone reading it later sees
  the explicit trade-off.
- Simulation regressions stay byte-identical (default `ENABLE_READ2 = 1`).
- Any chapter that re-enables PSMT4 on hardware **must** land an arbitrated
  / line-buffered VRAM design first. Surfacing this as a decision record
  keeps it from quietly slipping when scope expands.
- The Ch251 failure was a warning shot about VRAM strategy, not a fundamental
  blocker on the PS2 core. Actual 512 KiB framebuffer storage is ~205 M20Ks;
  the over-budget portion was the second full copy.