# 0009 - Combined textured + alpha + depth: per-pixel memory-op schedule
Status: proven in sim (Ch302), board-pending. Local BRAM probe; NOT yet tiled VRAM.
## Why this exists
Before designing tiled, LPDDR-backed VRAM, we need the exact per-pixel read/write
schedule demanded by a primitive that is simultaneously textured, alpha-blended,
and depth-tested. Until Ch302 those three GS features were mutually exclusive:
each was the sole read2 consumer for its primitive. Ch302 lifts that restriction,
behind the default-off `COMBINED_TAZ` param, by adding an explicit walker-stalling
multi-beat FSM to `gs_stub`, so the schedule is observable and asserted.

Speed was explicitly NOT a goal; a correct, observable schedule is.
## The per-pixel schedule (single read2 port, single write port)
The Z-test is issued FIRST, so a hidden pixel costs one read and nothing else:

| Beat | read2 port (1-cycle registered read) | Compute | Write port |
|---|---|---|---|
| 0 `CB_Z` | issue stored-Z read (`z_rd_en`) | - | - |
| 1 `CB_ZW` | issue texel read (only if Z passes) | Z-test (GEQUAL): `frag_z` vs `stored_z`. FAIL → stop (no texel/dest read, no write; advance walker) | - |
| 2 `CB_T` | issue dest-color read (`fb_rd_en`) | latch texel as Cs; As = texel α | - |
| 3 `CB_FB` | - | blend: Cv = (((Cs − Cd) · As) >> 7) + Cd | write blended color → FB |
| 4 `CB_ZWR` | - | - | write Z → Z-buffer (skipped if ZMSK); then advance walker |

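
The beat-3 blend is plain fixed-point arithmetic. Below is a minimal Python sketch for one 8-bit channel. It assumes the GS convention that alpha 0x80 means 1.0, and it assumes the result is clamped to 0..255 (this note does not say whether the probe clamps or wraps). `gs_blend` is a hypothetical name, not a function in the repo:

```python
def gs_blend(cs: int, cd: int, a_s: int) -> int:
    """Blend one 8-bit channel: Cv = (((Cs - Cd) * As) >> 7) + Cd.

    Assumes As = 0x80 means 1.0 (so As=0x80 gives Cs, As=0 gives Cd).
    The final clamp to 0..255 is an assumption of this sketch.
    """
    diff = cs - cd                   # signed difference, may be negative
    v = ((diff * a_s) >> 7) + cd     # Python >> floors, like an arithmetic shift
    return max(0, min(255, v))       # assumed clamp, not confirmed for the probe
```

For example, a red texel (255, 0, 0) with As = 0x40 over the green background (0, 255, 0) gives (127, 127, 0). The negative green difference floors to -128 before Cd is added back, which is the same result an arithmetic right shift on a signed difference would produce.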
The three reads land on the single read2 port in separate cycles, so the
existing read2 priority mux and its mutual-exclusion `$error` asserts are
untouched (one consumer per cycle). The two writes serialize on the single
write port (color in beat 3, Z in beat 4). The walker does not advance to the
next candidate pixel until BOTH writes complete.
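
The table can be read as a tiny per-pixel state machine. The Python sketch below is a hypothetical behavioral model, not the `gs_stub` RTL. Each beat owns at most one read2 consumer and one write target, which is how the single-port constraint shows up, and the beat list ends exactly where the walker may advance. It assumes the `CB_ZWR` beat still occurs when ZMSK is set, just without a write:

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Beat:
    state: str                   # FSM state name from the table (CB_*)
    read: Optional[str] = None   # read2 consumer this cycle, if any
    write: Optional[str] = None  # write-port target this cycle, if any


def pixel_schedule(frag_z: int, stored_z: int, zmsk: bool = False) -> List[Beat]:
    """Beats one pixel occupies before the walker may advance (behavioral model)."""
    beats = [Beat("CB_Z", read="z")]            # beat 0: issue stored-Z read
    if frag_z < stored_z:                       # beat 1: GEQUAL fails
        beats.append(Beat("CB_ZW"))             # no texel/dest read, no write
        return beats
    beats.append(Beat("CB_ZW", read="texel"))   # beat 1: Z passed, issue texel read
    beats.append(Beat("CB_T", read="dest"))     # beat 2: latch Cs/As, issue dest read
    beats.append(Beat("CB_FB", write="color"))  # beat 3: blend, write color
    beats.append(Beat("CB_ZWR", write=None if zmsk else "z"))  # beat 4: Z write
    return beats


def ops(beats: List[Beat]) -> Tuple[List[str], List[str]]:
    """Collapse a beat list into (reads, writes) in issue order."""
    reads = [b.read for b in beats if b.read]
    writes = [b.write for b in beats if b.write]
    return reads, writes
```

In this model, `ops(pixel_schedule(frag_z=0x100, stored_z=0x200))` yields `(['z'], [])` after two beats for a hidden pixel, while a passing fragment yields `(['z', 'texel', 'dest'], ['color', 'z'])` over five beats.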
## The concrete requirement for tiled VRAM
- hidden pixel: 1 read, 0 writes (stored Z only).
- visible pixel: 3 reads + 2 writes (stored-Z, texel, and dest-color reads; color and Z writes).

So tile-local memory must serve up to 3 reads + 2 writes per pixel, which makes the options concrete rather than hand-wavy:
- a 2-read-port tile RAM (e.g. texel + Z in parallel, dest folded in) + a write path, OR
- a 3-phase read schedule on fewer ports (what this probe does, serialized), trading throughput for ports, OR
- tile-local banking that absorbs the dest read-modify-write locally.

Z-first ordering means texel and dest bandwidth is spent only on visible pixels, a real saving that the tiled design should preserve.
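
A quick count makes that saving concrete. This sketch assumes ZMSK is off and compares against a hypothetical ordering that fetches texel and dest before the Z result is known, with writes still suppressed on FAIL. The 35/7 split is borrowed from the verification demo purely as example input; `tile_mem_ops` is an illustrative name:

```python
def tile_mem_ops(visible: int, hidden: int, z_first: bool = True) -> int:
    """Total tile-memory operations (reads + writes) for a batch of pixels.

    Visible pixel: 3 reads + 2 writes = 5 ops (ZMSK assumed off).
    Hidden pixel: 1 op with Z-first; 3 ops (Z, texel, dest reads) in the
    hypothetical order where texel/dest are fetched before the Z result.
    """
    per_hidden = 1 if z_first else 3
    return 5 * visible + per_hidden * hidden


z_first = tile_mem_ops(visible=35, hidden=7)
texel_first = tile_mem_ops(visible=35, hidden=7, z_first=False)
print(z_first, texel_first)  # 182 196
```

Each hidden pixel saves 2 reads, so the saving scales directly with the number of depth-failing pixels.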
## Verification (`tb_top_psmct32_combined_demo`)
The test draws a green Z-writing background plus one TME+ABE+ZTE triangle whose interpolated Z crosses the background Z (the top half passes the depth test, the bottom fails). A memory-op tracer records the read enables and write addresses per pixel, and asserts the SEQUENCE, not just the final pixels:
- depth-FAIL: z-read=1, texel-read=0, dest-read=0, color-write=0, Z-write=0 → pixel stays background green.
- depth-PASS: z-read=1, texel-read=1, dest-read=1, color-write=1, Z-write=1 → blend(texel, green), with both the texel RGB and the green dest present in the result.

Result: 35 PASS / 7 FAIL / 160 outside, errors=0. With `COMBINED_TAZ`=0 (the default), all prior demos stay byte-identical.
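
The tracer's check can be mirrored in a few lines. Below is a hypothetical Python analogue; the real tracer lives in the testbench, and these names are illustrative. Each covered pixel's ops are recorded in issue order and compared with the expected sequence for its reference depth outcome. A pixel with the right final color but, for instance, a texel read on a depth-FAIL still counts as an error:

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

Pixel = Tuple[int, int]

# Expected op sequence per depth outcome, in issue order.
EXPECTED = {
    "FAIL": ("z_read",),
    "PASS": ("z_read", "texel_read", "dest_read", "color_write", "z_write"),
}


def check_traces(pixels: Iterable[Pixel],
                 traces: Dict[Pixel, Tuple[str, ...]],
                 expect_pass: Dict[Pixel, bool]) -> Counter:
    """Tally PASS / FAIL / outside pixels and count sequence mismatches."""
    tally = Counter()
    for p in pixels:
        if p not in traces:
            tally["outside"] += 1      # walker never visited this pixel
            continue
        kind = "PASS" if expect_pass[p] else "FAIL"
        tally[kind] += 1
        if traces[p] != EXPECTED[kind]:
            tally["errors"] += 1       # wrong sequence, regardless of final color
    return tally
```

Comparing ordered tuples rather than a set of enables is what makes this a sequence check. A depth-PASS pixel that wrote color before reading dest would be flagged even though every enable fired.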
## Out of scope (deliberately)
Perspective (this probe is affine only; perspective was proven separately in Ch301), alpha-test / texture-alpha discard, non-PSMCT32 dest formats, and throughput (multi-beat is fine here).