retroDE_ps2/docs/decisions/0009-combined-tex-alpha-depth-schedule.md
thejayman77 ec82764bef Initial commit: retroDE_ps2 — first-of-its-kind PS2 GS FPGA core (DE25-Nano / Agilex 5)
RTL (GS rasterizer, EE core stub, platform bridge, LPDDR4B path), sim regression
(272 TBs), docs, and tooling. Copyrighted PS2 content (BIOS, game code, GS dumps,
and all dump-derived textures/traces) is excluded via .gitignore and stays local.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-29 20:10:50 -04:00


0009 — Combined textured + alpha + depth: per-pixel memory-op schedule

Status: proven in sim (Ch302), board-pending. Local BRAM probe; NOT yet tiled VRAM.

Why this exists

Before designing tiled/LPDDR-backed VRAM, we need the exact per-pixel read/write schedule demanded by a primitive that is simultaneously textured, alpha-blended, and depth-tested. Until Ch302 those three GS features were mutually exclusive: each was the sole read2 consumer for its primitive. Ch302 lifts that restriction (behind the default-off COMBINED_TAZ param) with an explicit walker-stalling multi-beat FSM in gs_stub, so the schedule is observable and asserted.

Speed was explicitly NOT a goal; the correct, observable schedule is.

The per-pixel schedule (single read2 port, single write port)

Z-test is issued FIRST so a hidden pixel costs one read and nothing else:

| Beat | read2 (1-cyc registered) | compute | write port |
|------|--------------------------|---------|------------|
| 0 CB_Z | issue stored-Z read (z_rd_en) | | |
| 1 CB_ZW | (issue texel read iff Z passes) | Z-test (GEQUAL): frag_z vs stored_z. FAIL → stop (no texel/dest read, no write; advance) | |
| 2 CB_T | issue dest-color read (fb_rd_en) | latch texel as Cs + As (=texel α) | |
| 3 CB_FB | | blend Cv=((Cs-Cd)·As)>>7+Cd | write color (blended) → FB |
| 4 CB_ZWR | | | write Z → Z-buffer (skip if ZMSK); then advance walker |
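As a worked illustration of the beat-3 blend, read as Cv = (((Cs - Cd)·As) >> 7) + Cd with As = 0x80 meaning 1.0, here is a minimal Python sketch for one 8-bit channel. The function name, the floor behaviour of the shift on negative products, and the final clamp to 0..255 are assumptions made for this example, not taken from the RTL.

```python
def blend_channel(cs: int, cd: int, a_s: int) -> int:
    """Blend one 8-bit channel: Cv = (((Cs - Cd) * As) >> 7) + Cd.

    Assumptions (not from the RTL): the shift floors a negative product
    (Python's >> is arithmetic) and the result is clamped to 0..255.
    """
    cv = (((cs - cd) * a_s) >> 7) + cd
    return max(0, min(255, cv))


def blend_rgb(texel, dest, a_s):
    """Apply the same blend to each of the R, G, B channels."""
    return tuple(blend_channel(s, d, a_s) for s, d in zip(texel, dest))


# Hypothetical values: a red texel at half alpha (0x40) over a green dest.
print(blend_rgb((255, 0, 0), (0, 255, 0), 0x40))  # (127, 127, 0)
```

With As = 0x80 the result is the texel colour, and with As = 0 it is the destination colour, so any value in between leaves both texel and destination contributions in the output pixel.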

The three reads land on the single read2 port in separate cycles, so the existing read2 priority mux + its mutual-exclusion $error asserts are untouched (one consumer per cycle). The two writes serialize on the single write port (color beat 3, Z beat 4). The walker does not advance to the next candidate pixel until BOTH writes complete.
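The schedule above can be sketched as a small behavioural model that lists, per pixel, which port is used in which beat. This is an illustrative Python model only, not the gs_stub FSM; the function names and the tuple layout are hypothetical.

```python
from typing import List, Tuple

# (beat, port, target): one memory operation issued in a given beat.
Op = Tuple[int, str, str]


def pixel_schedule(frag_z: int, stored_z: int, zmsk: bool = False) -> List[Op]:
    """Return the memory ops one candidate pixel issues, in beat order."""
    ops: List[Op] = [(0, "read2", "stored_z")]    # beat 0: Z read comes first
    if not (frag_z >= stored_z):                  # beat 1: GEQUAL test fails
        return ops                                # hidden pixel: nothing else
    ops.append((1, "read2", "texel"))             # beat 1: texel read iff Z passes
    ops.append((2, "read2", "dest_color"))        # beat 2: dest-color read
    ops.append((3, "write", "color"))             # beat 3: blended color write
    if not zmsk:
        ops.append((4, "write", "z"))             # beat 4: Z write unless masked
    return ops


def one_consumer_per_beat(ops: List[Op]) -> bool:
    """True if no port is used more than once in any single beat."""
    slots = [(beat, port) for beat, port, _ in ops]
    return len(slots) == len(set(slots))


visible = pixel_schedule(frag_z=20, stored_z=10)
hidden = pixel_schedule(frag_z=5, stored_z=10)
print(len(visible), len(hidden))  # 5 1
```

Because every operation occupies its own (beat, port) slot, the model never needs more than one read2 consumer or one writer in a cycle, which is the property the existing asserts rely on.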

The concrete requirement for tiled VRAM

  • hidden pixel: 1 read, 0 writes (stored-Z only).
  • visible pixel: 3 reads + 2 writes — stored-Z, texel, dest-color reads; color + Z writes.

So tile-local memory must serve up to 3 reads + 2 writes per pixel. The options this makes concrete (no longer hand-wavy):

  • a 2-read-port tile RAM (e.g. texel + Z in parallel, dest folded in) + a write path, OR
  • a 3-phase read schedule on fewer ports (what this probe does, serialized), trading throughput for ports, OR
  • tile-local banking that absorbs the dest read-modify-write locally.

Z-first ordering means the texel/dest bandwidth is only spent on visible pixels — a real saving the tiled design should preserve.
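To put a number on that saving, the following Python sketch counts memory operations using the per-pixel costs above (1 read for a hidden pixel, 3 reads + 2 writes for a visible one). The "unconditional" baseline, which issues all three reads for every candidate pixel, is a hypothetical comparison point and not something the RTL implements; the 35/7 split is the pass/fail count reported in the verification section.

```python
def reads_z_first(n_pass: int, n_fail: int) -> int:
    """Reads when the Z read gates the texel and dest-color reads."""
    return 3 * n_pass + 1 * n_fail


def reads_unconditional(n_pass: int, n_fail: int) -> int:
    """Hypothetical baseline: all three reads issued for every pixel."""
    return 3 * (n_pass + n_fail)


def writes(n_pass: int) -> int:
    """Color + Z write for each visible pixel (ZMSK assumed off)."""
    return 2 * n_pass


n_pass, n_fail = 35, 7
print(reads_z_first(n_pass, n_fail))        # 112
print(reads_unconditional(n_pass, n_fail))  # 126
print(writes(n_pass))                       # 70
```

Each depth-failing pixel saves two reads relative to the baseline, so the benefit grows with the fraction of hidden pixels in a primitive.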

Verification (tb_top_psmct32_combined_demo)

A green Z-writing background + one TME+ABE+ZTE triangle whose interpolated Z crosses the background Z (top half passes, bottom fails). A memory-op tracer records, per pixel, the read enables + write addresses and asserts the SEQUENCE (not just final pixels):

  • depth-FAIL: z-read=1, texel-read=0, dest-read=0, color-write=0, Z-write=0 → pixel stays background green.
  • depth-PASS: z-read=1, texel-read=1, dest-read=1, color-write=1, Z-write=1 → blend(texel, green); texel RGB and green dest both present.

Result: 35 PASS / 7 FAIL / 160 outside, errors=0. With COMBINED_TAZ=0 (the default), all prior demos remain byte-identical.
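The per-pixel sequence check described above might be modelled as follows. This is a Python sketch of the idea only; the actual tracer lives in the testbench, and the class, constant, and function names here are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class PixelOps:
    """Per-pixel counts of each memory operation seen by the tracer."""
    z_read: int
    texel_read: int
    dest_read: int
    color_write: int
    z_write: int


EXPECTED_PASS = PixelOps(1, 1, 1, 1, 1)  # depth-PASS: full 3-read, 2-write schedule
EXPECTED_FAIL = PixelOps(1, 0, 0, 0, 0)  # depth-FAIL: stored-Z read only


def check_trace(trace: Iterable[Tuple[bool, PixelOps]]) -> List[str]:
    """Compare each (depth_pass, ops) record with the expected sequence."""
    errors = []
    for i, (depth_pass, ops) in enumerate(trace):
        want = EXPECTED_PASS if depth_pass else EXPECTED_FAIL
        if ops != want:
            errors.append(f"pixel {i}: got {ops}, want {want}")
    return errors


# A trace shaped like the reported result: 35 passing and 7 failing pixels.
trace = [(True, EXPECTED_PASS)] * 35 + [(False, EXPECTED_FAIL)] * 7
print(len(check_trace(trace)))  # 0
```

Checking the operation counts per pixel, rather than only the final framebuffer contents, is what catches a hidden pixel that wrongly spends a texel or dest-color read.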

Out of scope (deliberately)

Perspective (affine only — perspective proven separately, Ch301), alpha-test / texture-alpha discard, non-PSMCT32 dest, and throughput (multi-beat is fine here).