Files
thejayman77 ec82764bef Initial commit: retroDE_ps2 — first-of-its-kind PS2 GS FPGA core (DE25-Nano / Agilex 5)
RTL (GS rasterizer, EE core stub, platform bridge, LPDDR4B path), sim regression
(272 TBs), docs, and tooling. Copyrighted PS2 content (BIOS, game code, GS dumps,
and all dump-derived textures/traces) is excluded via .gitignore and stays local.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-29 20:10:50 -04:00

7219 lines
414 KiB
Systemverilog
Raw Permalink Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
// retroDE_ps2 — gs_stub
//
// GS shell with two architecturally distinct write ports (Ch75), a
// primitive-shape observer (Ch76Ch81), a minimal interior
// rasterizer (Ch84), a D3D-style top-left fill rule on triangle
// edges (Ch85), per-pixel Gouraud color interpolation across the
// triangle interior (Ch86), a 2-entry primitive queue feeding
// the SCAN FSM (Ch87 — absorbs up to 3 back-to-back closes
// before raster_overflow), and a 3-stage pixel pipeline
// (Ch88 — S0 coord gen / S1 edge test + bary / S2 color +
// fb_addr + emit, with R_DRAIN to flush before next pop). The
// TEX0_1 (GIF reg# 0x06) is latched (Ch98) with sub-field
// decoder outputs for CBP/CPSM/CSM/CSA/CLD; CSA flows into
// gs_pcrtc_stub.clut_csa for the indexed-color CLUT lookup
// path. Ch99..Ch102 added a 1-cycle `tex0_1_wr_q` pulse on
// every TEX0_1 commit; clut_loader_stub watches it and (when
// CSM=CSM2, CPSM ∈ {PSMCT32, PSMCT16}, AND the CLD-mode policy
// permits — CLD=0 never, =1 always full, =2 if CBP changed,
// =3 if CBP/CPSM/CSA any-changed, =4 always but only 16-entry
// CSA window, ∈ {5..7} reserved no-op) walks the entries
// copying CLUT bytes from VRAM into clut_stub via vram_stub's
// second read port. PSMCT16 entries are RGB5A1-unpacked to
// PSMCT32 ABGR with 5→8 bit-replicate so clut_stub stays a
// single PSMCT32 staging area regardless of source format.
// raster_pixel_emit channel is a streaming write port for
// vram_stub (Ch89 — first persistence layer; PSMCT32 lane only,
// linear byte addressing); gs_pcrtc_stub (Ch90) is the dual
// scanout engine that reads vram_stub back as r/g/b for video,
// configured by PMODE / DISPFB1 (Ch91) and DISPLAY1 (Ch92, Ch93)
// latches written through the privileged CPU MMIO port — no TB
// sideband. DISPLAY1.DX/DY/DW/DH gate the display window inside
// the active area; DISPLAY1.MAGH/MAGV (Ch93) scale each VRAM
// column/line across (MAGH+1)/(MAGV+1) VCK pulses/lines.
// DISPFB1.PSM picks the scanout pixel format (Ch94 / Ch96 /
// Ch103 — PSMCT32, PSMCT16, PSMT8, and PSMT4 supported; other
// PSMs force scanout off). PSMT8 / PSMT4 surface the
// index/nibble as grayscale by default, and as CLUT-decoded
// RGB when `clut_enable=1` with a programmed clut_stub (Ch97). The raster channel emits
// PSMCT32 OR PSMCT16 (Ch95 — pack ABGR → RGB5A1 in S2;
// raster_pixel_be_q gates which bytes vram_stub commits).
// FRAME_1.PSM at the time of close picks the format. Ch105 adds
// PSMT8 raster-emit: when ras_bpp_shift==0 (FRAME_1.PSM=0x13), the
// rasterizer takes the natural ABGR's R channel (low 8 bits) as
// the PSMT8 index — the same lane real PS2 hardware writes when
// the destination FB is PSMT8 — places it in the LOW byte of the
// emit lane, and sets raster_pixel_be_q=4'b0001 so vram_stub
// commits exactly the 1 byte at fb_addr without stomping
// neighbouring pixels in the same 32-bit word. Ch106 closes the
// indexed-write gap with PSMT4: 2 pixels per byte, byte address =
// pixel_index >> 1, low/high nibble keyed by pixel_index[0]. The
// natural ABGR's low 4 bits become the PSMT4 index; emit places
// it in the targeted nibble position with be=4'b0001 and a
// per-bit mask (raster_pixel_mask_q = 0x0F or 0xF0) so vram_stub
// merges only that nibble — the OTHER nibble of the same byte is
// preserved. Back-to-back same-byte emits (e.g. PSMT4 pixels at
// x=0 and x=1) chain through NBA semantics without bypass logic.
// Ch110 — host→local image-transfer (TRX) register set.
// BITBLTBUF (0x50), TRXPOS (0x51), TRXREG (0x52), TRXDIR (0x53)
// are latched as 64-bit raw quadwords; `trxdir_wr_q` pulses for
// one cycle when TRXDIR commits, arming gif_image_xfer_stub for
// a host→local upload. The transfer engine consumes IMAGE-mode
// qwords (FLG=2) coming out of gif_packed_stub and writes the
// pixels into vram_stub at the BITBLTBUF/TRXPOS/TRXREG-described
// destination — driver-shaped palette/texture upload, no TB
// sideband on the framebuffer side.
// What this stub still does NOT do:
// model sub-pixel edge adjustment; expose per-pixel PSM at the
// raster channel. ALL FOUR common GS PSMs (CT32 / CT16 / T8 / T4)
// have OPTIONAL canonical-swizzle paths gated by `PSMCT32_SWIZZLE`
// (Ch122), `PSMCT16_SWIZZLE` (Ch128), `PSMT8_SWIZZLE` (Ch134),
// and `PSMT4_SWIZZLE` (Ch140) respectively — all default OFF, so
// fb_addr is linear by default and any TB that doesn't set those
// parameters sees the legacy linear framebuffer layout. (PSMT4's
// swizzle module also outputs a `nibble_hi` selector that
// replaces `s2_pixel_index[0]` in the existing Ch106 nibble RMW
// machinery when its gate is on.) R/G/B/A
// interpolation is integer truncation (not real-GS fixed-point
// rounding), Q is passed through from the closing vertex's RGBAQ
// (no float interp), and the queue is shallow enough that
// sustained STRIP/FAN streams beyond ~3 simultaneous closes still
// trigger raster_overflow.
//
// Contract refs:
// docs/stub_module_plan.md (Wave 1, item 6)
// docs/contracts/gif_gs.md (authoritative for port + event taxonomy)
//
// Write ports (separate namespaces — do not conflate):
// - reg_wr_* : privileged-block writes, 16-bit offset within
// 0x12000000. CPU MMIO path. BGCOLOR (offset 0x00E0)
// latches into bg_{r,g,b}; other offsets emit EV_MODE.
// - gif_reg_* : GIF A+D / REGLIST writes, 8-bit reg# per PCSX2's
// GSRegs.h. gif_packed_stub drives this with
// REAL_AD_REG_MAP=1. Decodes PRIM/RGBAQ/XYZF2/XYZ2/
// FRAME_1/ZBUF_1 into per-register 64-bit latches;
// unknown reg numbers emit EV_MODE.
//
// Primitive observer (Ch76Ch87 — recognition + Gouraud raster + 2-entry primitive queue):
// - PRIM[2:0] selects vertex threshold and discrete-vs-strip mode:
// POINT=1 (discrete), LINE=2 (discrete), LINE_STRIP=2 (strip),
// TRIANGLE=3 (discrete), TRI_STRIP=3 (strip), TRI_FAN=3 (strip),
// SPRITE=2 (discrete), reserved=no draw.
// - XYZ2 (reg# 0x05) and XYZF2 (reg# 0x04) are vertex commits.
// - Discrete primitives: count vertices to threshold, fire a draw,
// reset counter to 0. One primitive per N vertices.
// - Strip / fan primitives (Ch77): count vertices to anchor
// threshold, fire a draw, then SATURATE the counter at the
// threshold. Every subsequent vertex commit fires another draw,
// so the cadence becomes "first N vertices anchor + one primitive
// per additional vertex." TRI_FAN's pivot-vertex semantics are
// fully modeled at the recognition layer (Ch78Ch81): saturated
// fan extensions PIN the v0 slot to the pivot vertex's identity
// (raw, color, decoded fields, format flag). The Ch84 raster
// engine consumes those pinned slots directly; Ch85 added the
// D3D-style top-left fill rule on triangle edges, and Ch86
// added per-pixel Gouraud R/G/B/A interpolation across the
// interior. What stays out of scope from the real GS raster
// rules: sub-pixel edge adjustment (4-bit fractional X/Y is
// discarded), real-PS2 fixed-point rounding (R/G/B/A use
// integer truncation toward zero). ALL FOUR common GS PSMs
// (CT32 / CT16 / T8 / T4) raster emits are linear by DEFAULT
// but support the canonical PS2 GS page/block swizzle behind
// parameter gates: `PSMCT32_SWIZZLE=1` (Ch122) routes through
// `gs_swizzle_psmct32_stub`; `PSMCT16_SWIZZLE=1` (Ch128)
// through `gs_swizzle_psmct16_stub`; `PSMT8_SWIZZLE=1`
// (Ch134) through `gs_swizzle_psmt8_stub` (page=128×64 px,
// bw_pg=FBW>>1); `PSMT4_SWIZZLE=1` (Ch140) through
// `gs_swizzle_psmt4_stub` (page=128×128 px, bw_pg=FBW>>1).
// PSMT4's swizzle module additionally outputs a `nibble_hi`
// selector that replaces `s2_pixel_index[0]` in the existing
// Ch106 nibble RMW machinery when its gate is on. The four
// parameters are independent. Default 0 on all four keeps
// every existing PSMCT32 / PSMCT16 / PSMT8 / PSMT4 raster TB
// on the legacy linear formula.
// - When a draw fires, prim_complete pulses one cycle,
// prim_complete_count increments, and EV_PRIM_DRAW preempts the
// cycle's EV_WRITE in the trace (the xyz2_q / xyzf2_q latch still
// updates).
// - Ch78: a rolling vertex-identity window (v_curr / v_prev /
// v_prev_prev / v_pivot) is exposed via prim_v0_q / prim_v1_q /
// prim_v2_q so a TB can verify *which* vertices form each closed
// primitive — distinguishes TRI_STRIP rolling triangles
// {v_n-2, v_n-1, v_n} from TRI_FAN pivot triangles
// {v_pivot, v_n-1, v_n}. Snapshot is registered in lockstep with
// the EV_PRIM_DRAW pulse; held between draws.
// - Ch79: prim_color_q snapshots rgbaq_q at close so each closed
// primitive carries its draw-time color (the closing-vertex
// color, equivalent to prim_color_v_close in Ch80 terms).
// - Ch80: per-vertex Gouraud color — a parallel rolling color
// window (c_curr / c_prev / c_prev_prev / c_pivot) sampled at
// each vertex commit. Snapshot outputs prim_color_v0_q /
// prim_color_v1_q / prim_color_v2_q carry the per-vertex color
// tuple aligned with prim_v0_q / v1_q / v2_q. Distinguishes
// flat-shaded (all slots equal) from Gouraud (slots differ)
// primitives at trace level.
// - Ch81: structured-field decode. A parallel format-flag
// rolling window (xyzf2_curr / xyzf2_prev / xyzf2_prev_prev /
// xyzf2_pivot) records whether each vertex came from XYZ2 or
// XYZF2. Decoded snapshot outputs prim_v0_decoded_q /
// prim_v1_decoded_q / prim_v2_decoded_q (vertex_t) and
// prim_v0_color_decoded_q / v1 / v2 (color_t) carry per-channel
// fields (x/y/z/fog/is_xyzf2 + r/g/b/a/q) so a downstream
// rasterizer or pixel-emit path doesn't have to re-derive them
// from raw 64-bit payloads. Cleared on PRIM write same as the
// vertex/color windows.
// - Ch82: minimal pixel emit. One pixel per closed primitive
// (the closing vertex), addressed by frame_1_q.FBP / FBW and
// the decoded screen X/Y. pixel_emit pulses on the same edge
// as prim_complete; pixel_x_q / pixel_y_q / pixel_color_q /
// pixel_fb_addr_q snapshot the resolved fb destination.
// - Ch83: PSM-aware byte-width. fb_addr math uses a per-PSM
// bpp_shift (PSMCT32 / PSMZ32 / host-word formats → shift 2;
// PSMCT16 / PSMZ16 → shift 1; PSMT8 → shift 0). PSMT4 stays
// out of scope (host-word fallback).
// - Ch84: minimal interior rasterizer. SPRITE → axis-aligned
// rectangle fill; TRI / TRI_STRIP / TRI_FAN → edge-function
// half-plane test inside the bounding box. POINT / LINE /
// LINE_STRIP keep Ch82 single-closing-pixel behavior.
// Separate raster_pixel_emit strobe channel — existing
// pixel_emit (one-per-close) is unchanged. Flat-shaded;
// Gouraud across the interior is a future chapter.
// - Ch85: D3D-style top-left fill rule on triangle edges. At
// IDLE→SCAN, signed area picks winding (auto-swap to CCW if
// CW); per-edge top-or-left flag → 0/1 bias such that
// `inside = all (e[i] + bias[i] <= 0)`. Top edges
// (dy==0 & dx>0) and left edges (dy>0) are inclusive; right
// and bottom edges are exclusive. Adjacent triangles sharing
// an edge no longer double-paint. Degenerate triangles
// (signed area == 0) skip SCAN and latch raster_degenerate.
// - Ch86: per-pixel Gouraud color interpolation for triangles.
// Three per-vertex colors (ras_c0/c1/c2) latched at SCAN
// init with the same v1↔v2 swap mirror as the vertex coords.
// Inside a TRI, raster_pixel_color_q is computed as
// λ0*c0 + λ1*c1 + λ2*c2 per channel, where λ_i are the
// barycentric weights derived from the unbiased edge
// functions and divided by the post-swap signed area. R/G/B/A
// are integer-truncated; Q passes through from the closing
// vertex (it's an IEEE float and doesn't interp as integer).
// SPRITE keeps flat shading.
// - Ch87: 2-entry raster command FIFO. Captures the per-prim
// context at every close cycle and feeds the FSM one
// primitive per scan. Effective concurrency = 1 in-flight
// + 2 queued = absorbs up to 3 back-to-back closes without
// raster_overflow. raster_overflow now latches when the
// FIFO is FULL on a new push attempt; degenerate triangles
// are filtered at enqueue (set raster_degenerate, not
// pushed). Pop happens at IDLE→SCAN AND at end-of-scan when
// the FIFO has more work, so back-to-back scans run
// contiguously without an IDLE bubble.
// - Ch88: 3-stage pixel pipeline + R_DRAIN. The SCAN body is
// split into S0 (current x/y from the bbox walker), S1 (edge
// functions + top-left bias decision), and S2 (Gouraud
// interp + fb_addr math + emit). Throughput is one candidate
// pixel per cycle; latency from pop_ok to first emit is 3
// pipeline stages. After S0 generates the bbox corner the
// FSM transitions R_SCAN→R_DRAIN so S1/S2 can drain without
// S0 producing more valids. The next pop fires only when
// drain_done = (state==R_DRAIN) & !s1_valid_q & !s2_valid_q,
// so back-to-back scans don't lose pipeline-tail pixels.
// - PRIM write resets the vertex counter AND the rolling window so
// a fresh primitive context starts clean.
//
// Trace payload schema (this stub):
// GS MODE arg0=offset arg1=value arg2=- arg3=-
// GS BGCOLOR arg0=r arg1=g arg2=b arg3=-
// GS WRITE arg0=reg# arg1=data arg2=selector arg3=-
// (selector: 1=PRIM 2=RGBAQ 3=XYZF2 4=XYZ2 5=FRAME_1 6=ZBUF_1
// 7=TEX0_1 (Ch98); 0 for unknown reg#)
// GS PRIM_DRAW arg0=prim_type arg1=threshold arg2=count arg3=vertex
// (Ch78: prim_v0_q / prim_v1_q / prim_v2_q outputs
// carry the per-primitive vertex tuple alongside the
// event — sample on the same cycle as ev_valid.
// Ch79: prim_color_q output carries the draw-time
// rgbaq_q value, latched on the same edge.
// Ch80: prim_color_v0_q / v1_q / v2_q outputs
// carry the per-vertex Gouraud color tuple.)
`timescale 1ns/1ps
module gs_stub
import trace_pkg::*;
#(
parameter logic [15:0] BGCOLOR_OFFSET = 16'h00E0,
// Ch91 — privileged display registers. PMODE selects which
// CRTCs are active and how their outputs blend; DISPFB1 holds
// the framebuffer base/width/PSM that the PCRTC scanout
// engine reads from. gs_pcrtc_stub consumes these latches as
// its DISPFB-equivalent inputs (replacing Ch90's TB sideband
// ports). DISPFB2 / DISPLAY1 / DISPLAY2 / SMODE / SYNC are
// out of scope until a dual-display / interlace chapter.
parameter logic [15:0] PMODE_OFFSET = 16'h0000,
parameter logic [15:0] DISPFB1_OFFSET = 16'h0070,
// Ch92 — DISPLAY1 (display window/crop) at offset 0x0080.
// gs_pcrtc_stub honors DX/DY (window origin) and DW/DH
// (window extents-1) to gate scanout to a sub-rect of the
// active area. MAGH/MAGV (scaling) deferred — pcrtc still
// takes H/V totals from module parameters at instantiation.
parameter logic [15:0] DISPLAY1_OFFSET = 16'h0080,
// Ch75: the Ch74-era PMODE/SMODE2/FRAME_1/ZBUF_1 offset
// parameters were wrong — they conflated GIF A+D register
// numbers with privileged CRTC/MMIO offsets. PMODE/SMODE2 are
// privileged-only (CPU MMIO); FRAME_1/ZBUF_1 belong to the GIF
// A+D namespace, not the privileged one. Removed; replaced with
// a separate GIF-A+D port below that uses real PS2 reg# from
// PCSX2's GSRegs.h.
// Ch122 — when set, PSMCT32 raster emit (the per-pixel write
// path through `raster_pixel_fb_addr_q`) computes the VRAM
// byte address via the canonical PS2 GS page/block swizzle
// (gs_swizzle_psmct32_stub) instead of the legacy linear
// formula `(FBW*64*y + x) * 4`. Other PSMs are not affected
// by this parameter — PSMCT16 has its own gate
// (PSMCT16_SWIZZLE, Ch128), PSMT8 has PSMT8_SWIZZLE (Ch134),
// PSMT4 has PSMT4_SWIZZLE (Ch140 — see below).
// Default 0 keeps every existing PSMCT32 raster TB on the
// original linear addressing.
parameter bit PSMCT32_SWIZZLE = 1'b0,
// Ch128 — when set, PSMCT16 raster emit computes the VRAM
// byte address via the canonical PS2 GS page/block/column
// swizzle (gs_swizzle_psmct16_stub) instead of the legacy
// linear formula `(FBW*64*y + x) * 2`. Mirrors the Ch122
// PSMCT32 raster gate at the same emit stage. PSMCT32 / PSMT8
// are governed by their own gates (PSMCT32_SWIZZLE /
// PSMT8_SWIZZLE / PSMT4_SWIZZLE). Default 0 keeps
// every existing PSMCT16 raster TB (Ch95 raster_psmct16, etc.)
// on the legacy linear path. Closes the third PSMCT16
// integration point — together with Ch126 (read-side
// pcrtc) and Ch127 (image-xfer write-side), all three
// major PSMCT16 paths can be byte-consistent under the
// canonical swizzle when their gates are flipped on.
parameter bit PSMCT16_SWIZZLE = 1'b0,
// Ch134 — when set, PSMT8 raster emit computes the VRAM byte
// address via the canonical PS2 GS page/block/column swizzle
// (gs_swizzle_psmt8_stub) instead of the legacy linear
// formula `(FBW*64*y + x) * 1`. PSMT8 pages are 128 px wide
// (vs 64 px for direct-color PSMs) so the swizzle internally
// divides FBW by 2 — PCSX2 asserts FBW must be even for
// PSMT8 at GSLocalMemory.h:553. Mirrors the Ch122/Ch128
// raster gates at the same emit stage. PSMCT32 / PSMCT16 /
// PSMT4 are governed by their own gates. Default 0 keeps
// every existing PSMT8 raster TB (Ch105 raster_psmt8, Ch107
// PSMT4-via-CT16-CLUT palette path, etc.) on the legacy
// linear addressing. Closes the third PSMT8 integration
// point — together with Ch132 (read-side pcrtc) and Ch133
// (image-xfer write-side), all three major PSMT8 paths can
// be byte-consistent under the canonical swizzle when their
// gates are flipped on.
parameter bit PSMT8_SWIZZLE = 1'b0,
// Ch140 — when set, PSMT4 raster emit computes the VRAM byte
// address via the canonical PS2 GS page/block/column swizzle
// (gs_swizzle_psmt4_stub) instead of the legacy linear
// formula `(FBW*64*y + x) >> 1`. PSMT4 is 4 bits/pixel so the
// swizzle module also outputs a `nibble_hi` selector that
// picks which nibble of the byte at the swizzled address
// holds this pixel — the linear formula's s2_pixel_index[0]
// selector is wrong under the swizzled layout because the
// canonical PCSX2 columnTable4 reorders nibbles within a
// block. The existing Ch106 nibble RMW machinery
// (write_be=4'b0001 + write_mask 0x0F or 0xF0) layers on top
// of the swizzled byte address: the mask + low/high data
// placement key on the swizzle's nibble_hi when this gate is
// on, instead of on s2_pixel_index[0]. PSMT4 pages are 128
// px wide AND 128 px tall (different from PSMT8's 128×64);
// the swizzle internally uses bw_pg = ras_fbw>>1 — PCSX2
// asserts FBW must be even for PSMT4 at GSLocalMemory.h:560.
// Mirrors the Ch122/Ch128/Ch134 raster gates at the same emit
// stage. PSMCT32 / PSMCT16 / PSMT8 are governed by their own
// gates. Default 0 keeps every existing PSMT4 raster TB
// (Ch106 raster_psmt4, Ch107 PSMT4 e2e, Ch104 round-trip,
// etc.) on the legacy linear addressing. Closes the third
// PSMT4 integration point — together with Ch138 (read-side
// pcrtc) and Ch139 (image-xfer write-side), all three major
// PSMT4 paths can be byte-consistent under the canonical
// swizzle when their gates are flipped on, completing the
// four-PSM byte-accuracy foundation for ALL three integration
// points.
parameter bit PSMT4_SWIZZLE = 1'b0,
// Brick 1 (texturing) — VRAM texel read latency, in clk cycles,
// for the texture read port (tex_rd_*). vram_stub's second read
// port (read2) is COMBINATIONAL, but gs_stub presents the texel
// address from a registered pipeline stage and consumes the data
// one cycle later, so the effective round-trip seen by the
// sampler is 1 cycle. Kept as a parameter so a future registered/
// BRAM-backed VRAM (vram_bram_stub, RD_LATENCY 1+) can bump it
// without touching the datapath.
parameter int TEX_RD_LATENCY = 1,
// Ch296 — when set, the FIFO pop (textured-scan start) is held off
// while `clut_load_busy` is high, so the textured texel fetch never
// collides with a VRAM->CLUT load on the shared read2 port. Only a top
// with a CLUT loader sets this (and drives clut_load_busy); default 0
// ignores clut_load_busy entirely (it may float), keeping every other
// instance byte-identical to pre-Ch296.
parameter bit CLUT_STALL = 1'b0,
// Ch295 — texel-read port timing model. Selects WHICH pipeline
// stage drives the texel address presented on `tex_rd_addr`, so
// the registered `tex_rd_data` lands in the SAME cycle the
// existing single S1->S2 texel register samples it:
//
// 0 (default) : COMBINATIONAL read port (vram_stub.read2). The
// address is generated from the S1 coords
// (s1_x_q/s1_y_q); read2 returns mem[addr] in the
// SAME (S1) cycle, so `s1_tex_color` is valid in S1
// and the single S1->S2 register aligns it with the
// S2 pixel. This is the Brick-1 contract every
// existing TB + the vram_stub demo top verified.
//
// 1 : REGISTERED 1-cycle read port (vram_bram_stub.read2,
// the BRAM board VRAM). The address is generated one
// stage EARLIER, from the S0 coords (ras_cur_x/
// ras_cur_y), and presented on the S0 cycle. The BRAM
// registers the read and returns mem[addr] one cycle
// later — exactly at the S1 cycle — so `s1_tex_color`
// is again valid in S1 and the SAME single S1->S2
// register (unchanged) aligns it with the S2 pixel.
// No emit-stage timing changes; only the address
// source + read-enable move up one stage.
//
// The S0 coord/valid used for addressing is exactly what latches
// into s1_x_q/s1_valid_q at the next edge (see the S0->S1 latch),
// so the texel sampled for S0 pixel P is the texel for the S1
// instance of pixel P — the invariant the single register relies on.
parameter bit TEX_RD_REGISTERED = 1'b0,
// Brick 2a — ALPHA blending (transparency). When a FLAT (non-
// textured) PSMCT32 SPRITE closes with PRIM.ABE=1 AND the ALPHA_1
// register selects the canonical source-over config
// (A=0/Cs, B=1/Cd, C=0/As, D=1/Cd), the raster emit becomes a
// read-MODIFY-write against the destination framebuffer:
// Cv = ((Cs - Cd) * As) >> 7 + Cd per R/G/B, clamp [0,255]
// The dest pixel Cd is READ back from VRAM at the SAME byte address
// that is about to be written, through a dedicated dest-fb read
// port (fb_rd_*). gs_stub does NOT own VRAM — the wrapper arbitrates
// fb_rd_* onto the spare second read port (read2), mutually
// exclusive with the texel-fetch port (a FLAT blend never textures)
// and with the PSMT4 RMW path (blend is PSMCT32). See the wrapper
// read2 arbitration note.
//
// When PRIM.ABE=0 (or the primitive is textured, or PSM != PSMCT32,
// or ALPHA_1 selects a non-source-over config), the emit path is
// BYTE-IDENTICAL to the pre-Brick-2a opaque write — fb_rd_en stays
// low and no extra emit-pipeline latency is introduced.
//
// FB_RD_REGISTERED selects the dest-read timing model, mirroring
// TEX_RD_REGISTERED:
// 0 (default) : COMBINATIONAL read port (vram_stub.read2). fb_rd_addr
// is presented in the S2 stage; fb_rd_data is valid the
// SAME cycle and latched into the S3 blend stage.
// 1 : REGISTERED 1-cycle read port (vram_bram_stub.read2).
// fb_rd_addr is presented in the S2 stage; the BRAM
// returns fb_rd_data one cycle later — exactly at the
// S3 stage — so S3 blends against the LIVE fb_rd_data.
// Either way the blended pixel commits one cycle later than an opaque
// pixel would (the added S3 stage); since ABE is constant within a
// primitive, no pixel mixes the two latencies.
parameter bit FB_RD_REGISTERED = 1'b0,
// Brick 2b — Z-buffer (depth test) dest-Z read timing model,
// mirroring FB_RD_REGISTERED. The Z read shares the per-pixel
// S2-address / S3-use shape:
// 0 (default) : COMBINATIONAL read port (vram_stub.read2). z_rd_addr
// presented in the S2 stage; z_rd_data valid the SAME
// cycle and latched into the S3 compare stage.
// 1 : REGISTERED 1-cycle read port (vram_bram_stub.read2).
// z_rd_addr presented in S2; z_rd_data returns one
// cycle later (S3) and is compared live.
// The stored-Z read uses the SAME read2 port the alpha dest-fb read
// and the texel fetch use — mutually exclusive by feature (a flat
// Z-tested sprite never textures and never alpha-blends in v1; see
// new_zte_active). A passing Z pixel commits BOTH the framebuffer
// color AND the Z value through the single VRAM write port, so a
// Z-tested primitive runs the pixel pipeline at HALF rate (one
// candidate pixel every 2 cycles): beat 0 writes the fb color (on
// pass) and beat 1 writes the Z value (on pass, ZMSK=0). When ZTE
// is inactive for the primitive, the pipeline runs full rate and
// the emit path is byte-identical to pre-Brick-2b.
parameter bit Z_RD_REGISTERED = 1'b0,
// Ch301 — PERSPECTIVE-CORRECT textured triangles. When 1, a TME TRIANGLE
// whose texture coords were supplied via the ST register (S=u/w, T=v/w) plus
// RGBAQ.Q (=1/w) interpolates those three perspective attributes affinely
// (extra shared-divider gradient steps) and recovers per-pixel integer texel
// (u,v) = (S/Q, T/Q) through the pipelined gs_reciprocal_stub LUT + multiply
// (gs_persp_uv) — NO per-pixel divider. UV-supplied triangles + SPRITEs keep
// the existing AFFINE path. Default 0 generate-guards ALL perspective logic
// out, so every existing TB / the flat production build is byte-identical
// (zero DSP/ALM/M20K cost). Only a perspective demo build sets it 1.
parameter bit PERSPECTIVE_CORRECT = 1'b0,
// Ch351 — perspective reciprocal LUT width (forwarded to gs_persp_uv). Default 8 = byte-identical to
// Ch342/348. Far-W draws (high PERSP_FRAC, small Q) set 11 for finer 1/w. Inert when PERSPECTIVE_CORRECT=0.
parameter int PERSP_RECIP_IDX_BITS = 8,
// Ch352 — triangle-setup divider settle cycles. The gradient solve uses ONE shared combinational signed
// divide (grad_num_q[grad_step]/grad_det) advanced one grad_step per cycle. At 30 MHz that divide cone is
// ~100 ns (66 ns setup) and only matters for perspective/gradient-heavy draws (SH3). GRAD_DIV_CYCLES holds
// each grad_step for N cycles so the divide settles, capturing the FIFO only on the last cycle (paired with a
// multicycle SDC constraint). The gradient VALUES are unchanged — bit-exact, only the per-triangle setup is
// slower (once per primitive, free). Default 1 = byte-identical to today; board fits set 4.
parameter int GRAD_DIV_CYCLES = 1,
// Ch352 — use the SEQUENTIAL gs_grad_divider for the triangle-setup divide instead of the combinational `/`.
// Default 0 = combinational + grad_settle (byte-identical to today; sim regression unchanged). Board fits set
// 1: registered iterations, no combinational cone -> closes timing with NO SDC exception (the combinational
// divider's cone could not be multicycle'd or false-pathed without the Quartus fitter grinding for hours).
parameter bit GRAD_SEQ_DIVIDER = 1'b0,
// COMBINED architectural probe — when 1, a TRIANGLE that is
// textured (PSMCT32 DECAL) AND alpha-blended (source-over) AND
// depth-tested (GEQUAL/GREATER, PSMZ32) at once runs through a
// dedicated walker-stalling 5-beat per-pixel FSM (comb_*). This
// exists to PROVE the per-pixel read/write SCHEDULE for a fully
// featured primitive (one read2 consumer per beat: Z, then texel,
// then dest-fb; then two writes: color, then Z). Speed is NOT a
// goal — the walker HOLDS at each candidate pixel until the FSM
// completes. Default 0 generate-guards ALL combined logic out and
// ties ras_combined constant 0, so every existing TB / production
// build is BYTE-IDENTICAL (zero added cost). Only a combined-mode
// probe demo sets it 1.
parameter bit COMBINED_TAZ = 1'b0,
// Ch344 — TEXTURED + source-over ALPHA SPRITE path. When 1, a SPRITE that is
// textured (PSMCT32 TEX0) AND alpha-blended (PRIM.ABE=1, ALPHA_1 source-over)
// renders through an ISOLATED half-rate per-pixel schedule (NOT the combined-TAZ
// compositor, which stays triangle/tile/Z-shaped and load-bearing): the existing
// texel fetch (S1, read2) and the alpha dest-read (S2, read2) are placed on
// SEPARATE half-rate beats so read2 keeps exactly one consumer per cycle, then
// gs_alpha_blend composites Cs=texel(*vertex color if TFX=0 MODULATE), As=texel
// alpha, over the dest. Default 0 generate-guards ALL of it out and ties
// new_tex_abe_active to 0, so a textured ABE sprite still falls back to opaque
// DECAL exactly as pre-Ch344 — every existing TB / production build byte-identical.
parameter bit SPRITE_TEX_ALPHA = 1'b0,
// Ch347 — also admit PSMT8 (CLUT-indexed) textures into the textured-alpha SPRITE path. Default 0
// keeps the proven Ch344 PSMCT32 sprite path byte-identical; when 1, a PSMT8 (TEX0.PSM=0x13) source-
// over sprite takes the same half-rate texel+dest+blend path, with gs_texture_unit doing the
// index->CLUT->ABGR decode upstream of the s1_tex_color capture (the CLUT must be loaded via a CLD!=0
// TEX0 before the draw, same precondition as the proven indexed-DECAL path). PSMT4 stays out for v1.
parameter bit SPRITE_TEX_ALPHA_CLUT = 1'b0,
// ------------------------------------------------------------------
// Ch304 TILE-LOCAL render mode — first architectural step toward
// tiled VRAM (docs/decisions/0008). When 1 (AND COMBINED_TAZ=1, AND
// the in-flight primitive is the combined TME+ABE+ZTE triangle), the
// combined per-pixel render targets an ON-CHIP 16x16 color+Z tile
// scratchpad instead of VRAM: the raster sequence becomes
// R_IDLE -> T_CLEAR -> R_SCAN(tile-local) -> R_DRAIN -> T_FLUSH -> R_IDLE.
// - T_CLEAR : sweep all 256 tile entries, color<=TILE_CLEAR_COLOR,
// Z<=TILE_CLEAR_Z (one entry / cycle, no FB emit).
// - R_SCAN : the EXISTING combined FSM runs, but its Z read, dest
// read, color write and Z write are rerouted to the two
// on-chip tile RAMs (u_tile_z / u_tile_color). The
// TEXTURE fetch is UNCHANGED on the VRAM read2 path.
// The combined raster_pixel_emit color/Z writes are
// SUPPRESSED (they land in tile RAM instead). z_rd_en /
// fb_rd_en stay LOW (the Z/dest reads are tile-local).
// - T_FLUSH : sweep all 256 tile entries, emit each tile color word
// to its linear PSMCT32 FB address via raster_pixel_emit.
// raster_active stays high across CLEAR+SCAN+DRAIN+FLUSH so external
// waits see no gap. The tile is at screen origin, so the tile-local
// index of pixel (x,y) is {y[3:0], x[3:0]}.
//
// Default 0 generate-guards ALL tile logic out and ties tile_active
// constant 0, so every existing TB / production build is BYTE-
// IDENTICAL — including a COMBINED_TAZ=1, TILE_LOCAL=0 build, which
// takes the unchanged VRAM combined path.
parameter bit TILE_LOCAL = 1'b0,
// Tile clear values (ABGR color / Z word). Defaults per task spec.
parameter logic [31:0] TILE_CLEAR_COLOR = 32'hFF00_8000, // green ABGR
parameter logic [31:0] TILE_CLEAR_Z = 32'h0000_4000,
// Ch305 TILE GRID — render the popped combined primitive across a small
// grid of 16x16 tiles (TILE_COLS x TILE_ROWS), re-rendering (re-clipping)
// the SAME primitive against each tile. The primitive is popped + its
// affine gradients solved ONCE; the grid loop only re-clips the walker
// bbox + re-targets the FLUSH framebuffer offset per tile.
//
// With BOTH = 1 the behavior is BYTE-IDENTICAL to the Ch303 single-tile
// path: tile_ox=tile_oy=0, the clip = (primitive bbox) ∩ [0..15] = the
// primitive bbox (the Ch303 prim is already 16-aligned at the origin),
// the FLUSH offset is 0, and there is exactly ONE tile iteration.
parameter int TILE_COLS = 1,
parameter int TILE_ROWS = 1,
// Ch323 TEST HOOK (default 0): force every tile's spill_valid set at reset, so the FIRST
// batch RELOADS instead of booting from clear. Used ONLY by the negative regression test to
// prove the clean-Z bootstrap matters (forcing valid + garbage backing => depth rejects all).
parameter bit SPILL_FORCE_VALID = 1'b0,
// Ch305+ MULTI-PRIMITIVE TILED SCENE — when set, each tile re-renders a
// LIST of TILE_PRIM_COUNT primitives (all buffered in the FIFO) in order,
// so later primitives depth-test / alpha-blend against earlier ones within
// the tile (no inter-primitive clear). Default OFF → byte-identical: every
// multiprim branch below is guarded by `if (TILE_MULTIPRIM)` and the normal
// streaming pop / grid path is reached UNCHANGED.
parameter bit TILE_MULTIPRIM = 1'b0,
parameter int TILE_PRIM_COUNT = 1,
// Ch331 — feeder "end-of-list" grid trigger. When 1, the grid starts on the LATCHED
// prim_list_flush_i (render whatever 1..FIFO_DEPTH prims are buffered) instead of the
// fixed fifo_count>=TILE_PRIM_COUNT threshold — so a runtime list of ANY size (incl.
// <TILE_PRIM_COUNT or >TILE_PRIM_COUNT, up to FIFO_DEPTH) renders in ONE pass without the
// threshold splitting it across framebuffer-clearing passes. Default 0 = legacy threshold.
parameter bit MP_FLUSH_ONLY = 1'b0,
// Ch336 — FRAMEBUFFER ACCUMULATION across >FIFO_DEPTH scenes. When 1 (feeder mode), the grid
// also fires when the FIFO is FULL (mid-list batch, not just end-of-list), and only the FIRST
// batch of a scene clears+flushes the full tile; later batches SPARSE-flush only the pixels they
// drew, so a multi-batch scene accumulates into one framebuffer instead of each batch wiping the
// last. v1 = COLOR accumulation with PER-BATCH Z (cross-batch Z ordering is a follow-on). Default
// 0 = legacy single-pass-per-grid (byte-identical).
parameter bit TILE_ACCUM_ENABLE = 1'b0,
// Ch338 — PERSISTENT cross-batch Z. With TILE_ACCUM_ENABLE the on-chip tile_z is cleared every
// batch, so a >FIFO_DEPTH scene whose overlapping prims SPLIT across FIFO batches loses Z ordering
// (each batch starts from a fresh Z → a later batch's FAR prim overwrites an earlier batch's NEAR
// prim). When 1, tile_z is widened to the WHOLE grid (NTILES*256, indexed by tile_id) and cleared
// ONLY on the first (full-flush) batch of a scene; later batches read/update the resident Z, so a
// tile re-rendered in a later batch depth-tests against ALL prior batches. Color still rides the
// Ch336 sparse flush (Z-fail pixels aren't written → not flushed → the accumulated FB survives), so
// ONLY Z needs to persist. Default 0 = legacy per-batch Z (tile_z stays 256-deep, byte-identical).
// On-chip only (no VRAM Z traffic) — sized for the small feeder FB; large FBs use the Ch323
// LPDDR spill/reload path instead.
parameter bit TILE_Z_PERSIST = 1'b0,
// Ch315 — PRIMITIVE/BIN CAPACITY. TILE_FIFO_DEPTH sizes the primitive FIFO
// (the full per-primitive attribute storage — the ~40 fifo_* arrays) AND, in
// the bin-buffer renderer, the per-tile bin depth (bin_prim[NTILES][DEPTH] of
// prim INDICES). So one knob scales both the prim-list capacity (N) and the
// per-tile bin capacity (M = N here: a tile's bin can hold every queued prim,
// so the single capacity limit is the FIFO). MUST be a power of two (the
// wptr/rptr wrap relies on $clog2(DEPTH) bits — see the FIFO_DEPTH comment).
// Default 4 → byte-identical to every pre-Ch315 build. The dominant cost is
// the fifo_* attribute storage (~hundreds of register bits/slot); the bins
// add only NTILES*FIFO_CNT_W index bits per depth (negligible) — so register
// bins stay reasonable far beyond the prim FIFO's practical limit.
parameter int TILE_FIFO_DEPTH = 4,
// Bin-buffer renderer. Default OFF -> BYTE-IDENTICAL: the TP_BIN phase is
// never entered (mp_grid_start goes straight to TP_CLEAR as before), the
// render uses the original mp_next_nonempty re-test path, and every bin_*
// register is dead (synth-pruned). When 1 (with TILE_MULTIPRIM=1), a
// pre-render classification phase materialises a per-tile ordered list of
// primitive indices (the BIN), and each tile's render walks ONLY its bin
// list instead of re-scanning every primitive at render time. The overlap
// test and clip math are IDENTICAL to mp_next_nonempty, so the rendered
// image is bit-for-bit the same as the Ch305 re-test path.
parameter bit BIN_BUFFER_ENABLE = 1'b0,
// GS SCISSOR_1 rectangular clip. Default OFF → byte-identical: the
// eff_sc* bounds below collapse to 0 / 0xFFF so intersecting them into
// every walker-bbox clip site is a provable no-op. When ENABLEd, the
// inclusive screen rect [SCAX0..SCAX1]x[SCAY0..SCAY1] is folded into the
// walker bounding box — exact for a rectangular scissor, no per-pixel test.
parameter bit SCISSOR_ENABLE = 1'b0,
// Ch294 — GS texture WRAP MODES (REPEAT + CLAMP). Default OFF →
// byte-identical: the per-primitive wrap snapshot (ras_wms/wmt/tw/th) is
// forwarded to gs_texture_unit, but with TEX_WRAP_ENABLE=0 the sampler
// collapses u_eff/v_eff to u/v as a compile-time constant, so the wrap
// plumbing is unused dead state and every existing demo/TB is identical.
parameter bit TEX_WRAP_ENABLE = 1'b0,
// Ch305+ TILE COLOR PSMCT16 — when set, the on-chip tile COLOR RAM
// stores 16-bit PSMCT16 (RGB5A1) instead of 32-bit PSMCT32, halving
// its footprint. The RMW packs blended ABGR8888 -> pix16 on write,
// unpacks pix16 -> ABGR8888 on the blend dest-read, and the FLUSH
// emits PSMCT16-formatted framebuffer writes. The Z tile RAM is
// UNCHANGED (stays 32-bit). Default OFF -> TILE_COLOR_W folds to 32
// and every pack/unpack/PSMCT16-flush branch is a compile-time-
// constant dead arm, so every existing build is BYTE-IDENTICAL.
parameter bit TILE_COLOR_PSMCT16 = 1'b0,
// Brick-2c — generic GS ALPHA blend selector. When 0 (default) the
// combined blender u_comb_blend is hardwired to source-over and the
// per-primitive selector snapshots (fifo_alpha_*/ras_alpha_*) are
// loaded but UNREAD by the blender → every existing build is
// BYTE-IDENTICAL. When 1, u_comb_blend uses the generic A/B/C/D + FIX
// datapath driven by the snapshotted per-primitive ALPHA_1 selectors.
parameter bit ALPHA_MODES_ENABLE = 1'b0,
// Ch310 — BILINEAR (4-tap) texture filtering in the COMBINED tiled
// renderer, PSMCT32-only. When 1 AND the per-primitive TEX1.MMAG=1
// (LINEAR magnification) AND the texture psm==PSMCT32, the combined
// per-pixel FSM stalls in a new CB_TWAIT beat while the (already-proven)
// 4-tap gs_texture_unit bilinear sampler runs its ~9-cycle multi-beat
// read, then latches the filtered ABGR. Default OFF → CB_TWAIT is never
// entered (bili_now is a compile-time-constant 0), the per-primitive
// filter snapshot (fifo_filter_lin/ras_filter_lin) and the fractional UV
// (s1_u_frac/s1_v_frac) are loaded/wired but UNREAD by the nearest texel
// path, and u_tex is built with BILINEAR_ENABLE=0 → g_nearest. So every
// existing build is BYTE-IDENTICAL.
parameter bit BILINEAR_ENABLE = 1'b0,
// Ch314 — BILINEAR for PALETTIZED textures. When 1 (with BILINEAR_ENABLE=1)
// the combined tile path ALSO accepts a PSMT8 or PSMT4 DECAL texture as a
// filtered primitive: close_combined's texture-PSM gate widens to admit
// them, and the shared gs_texture_unit gets PALETTE_BILINEAR=1 so each of
// the 4 taps fetches an index, CLUTs it to a color, then interpolates the
// COLORS (CLUT-before-interp). Default OFF → close_combined still requires a
// PSMCT32 texture and the sampler keeps indexed textures nearest, so every
// existing build is BYTE-IDENTICAL.
parameter bit PALETTE_BILINEAR = 1'b0,
// Ch323 — TILE COLOR+Z SPILL/RELOAD to LPDDR. Default OFF → byte-identical: the
// TP_RELOAD / TP_ZFLUSH phases below are never entered, no Z-flush stream is driven,
// and the synthesised state set stays TP_OFF..TP_BIN. When ENABLEd (TILE_LOCAL build),
// the tile FSM gains a RELOAD phase (load color+Z from a staging RAM into the tile RAMs
// before RENDER) and a ZFLUSH phase (emit the tile Z RAM to an LPDDR Z-backing region
// after the color FLUSH), so tile-local color/Z state can leave the chip and return.
parameter bit TILE_SPILL_ENABLE = 1'b0
) (
input logic clk,
input logic rst_n,
// ------------------------------------------------------------------
// Privileged CRTC/MMIO write port. Reaches gs_stub directly from
// CPU MMIO at 0x12000000 — NOT through GIF. BGCOLOR lives here.
// ------------------------------------------------------------------
input logic reg_wr_en,
input logic [15:0] reg_wr_addr,
input logic [63:0] reg_wr_data,
// ------------------------------------------------------------------
// Ch75 — GIF A+D / REGLIST register port. 8-bit register number
// (PS2-real, per PCSX2 GSRegs.h). gif_packed_stub drives this
// with REAL_AD_REG_MAP=1. Architecturally distinct from the
// privileged port: a GIF-side write to reg# 0x4C goes to the
// GIF-context FRAME_1 latch; a privileged write to offset
// 0x04C0 would be a different register. Today only the GIF-
// context lane is meaningfully populated.
// ------------------------------------------------------------------
input logic gif_reg_wr_en,
input logic [7:0] gif_reg_num,
input logic [63:0] gif_reg_data,
// Ch331 — runtime command-list "end of list" flush (feeder mode). A 1-cycle pulse
// says "the current list is fully staged"; gs_stub LATCHES it (mp_flush_pending) and
// starts the grid when the renderer is genuinely ready (all_grad_done / idle), rendering
// whatever 1..FIFO_DEPTH prims are buffered. Only consulted when MP_FLUSH_ONLY=1; tied 0
// (left open) by every non-feeder consumer, so default builds are byte-identical.
input logic prim_list_flush_i,
// Pixel source for platform_video_stub
output logic [7:0] bg_r,
output logic [7:0] bg_g,
output logic [7:0] bg_b,
// Ch91/Ch92: privileged display register latches.
// gs_pcrtc_stub consumes these directly — there's no separate
// sideband path. PMODE.EN1 (bit 0) gates scanout. DISPFB1
// sets {FBP[8:0], FBW[14:9], PSM[19:15], DBX[42:32], DBY[53:43]}.
// DISPLAY1 (Ch92) sets {DX[11:0], DY[22:12], MAGH[26:23],
// MAGV[28:27], DW[43:32], DH[54:44]} — the display window
// origin/extents within the active area. All reset to 0:
// EN1=0 keeps scanout disabled; DISPLAY1=0 means a 1×1 window
// at (0,0), so pcrtc effectively shows nothing until a real
// driver-style configuration sequence runs.
output logic [63:0] pmode_q,
output logic [63:0] dispfb1_q,
output logic [63:0] display1_q,
// Ch75: GIF-context register latches (real PS2 namespace). TBs
// can verify reg# routing by reading these directly.
output logic [63:0] prim_q,
output logic [63:0] rgbaq_q,
output logic [63:0] xyz2_q,
output logic [63:0] xyzf2_q,
output logic [63:0] frame_1_q,
output logic [63:0] zbuf_1_q,
// Ch98: TEX0_1 latch + decoded sub-fields. Real PS2 TEX0
// carries texture-base + CLUT plumbing fields. Only the
// CLUT-side fields are routed downstream at this scope:
// CBP[50:37] — CLUT buffer pointer (14 bits, source for
// VRAM→CLUT load; not yet acted on)
// CPSM[54:51] — CLUT pixel storage mode (4 bits;
// PSMCT32 = 0 honored, others deferred)
// CSM[55] — CLUT storage mode (1 bit; CSM2=1 is the
// scope here, CSM1 swizzle deferred)
// CSA[60:56] — CLUT entry offset in 16-entry units
// (5 bits; matches pcrtc.clut_csa)
// CLD[63:61] — CLUT load control (3 bits; latched but
// not yet triggering a load)
// The texture-side fields (TBP0/TBW/PSM/TW/TH/TCC/TFX) are
// also in tex0_1_q if a TB needs them, but no decode/output
// is wired yet.
output logic [63:0] tex0_1_q,
output logic [13:0] tex0_1_cbp_q,
output logic [3:0] tex0_1_cpsm_q,
output logic tex0_1_csm_q,
output logic [4:0] tex0_1_csa_q,
output logic [2:0] tex0_1_cld_q,
// Ch99: 1-cycle pulse on the cycle a TEX0_1 GIF write commits.
// clut_loader_stub samples the just-written tex0_1_* fields
// and starts a VRAM→CLUT copy when CLD != 0 and CPSM=PSMCT32.
output logic tex0_1_wr_q,
// Ch110 — host→local image-transfer (TRX) register set. Real
// PS2 driver path for VRAM uploads:
// BITBLTBUF (0x50) — SBP/SBW/SPSM (source) + DBP/DBW/DPSM (dest)
// TRXPOS (0x51) — SSAX/SSAY/DSAX/DSAY/DIR
// TRXREG (0x52) — RRW/RRH (transfer rect width/height in pixels)
// TRXDIR (0x53) — XDIR (0=host→local, 1=local→host,
// 2=local→local, 3=deactivate)
// The transfer is triggered by the TRXDIR write (`trxdir_wr_q`
// pulses at that posedge). gif_image_xfer_stub samples the
// already-latched BITBLTBUF/TRXPOS/TRXREG fields and consumes
// the IMAGE qword stream coming out of gif_packed_stub when
// a host→local upload is armed.
output logic [63:0] bitbltbuf_q,
output logic [63:0] trxpos_q,
output logic [63:0] trxreg_q,
output logic [63:0] trxdir_q,
output logic trxdir_wr_q,
// Ch76/Ch77: primitive observer. `prim_complete` pulses for one
// cycle when an XYZ2/XYZF2 commit closes a primitive under the
// latched PRIM[2:0]; `prim_complete_count` is the running total.
// Discrete primitives (POINT/LINE/TRI/SPRITE) draw once per N
// vertices. Strip / fan primitives (LINE_STRIP/TRI_STRIP/TRI_FAN)
// anchor on the first N vertices, then draw on every additional
// vertex commit (Ch77 saturate cadence). Reserved (PRIM=7) never
// closes a primitive. Ch84 wires this strobe into the interior
// rasterizer for SPRITE / TRI / TRI_STRIP / TRI_FAN; POINT /
// LINE / LINE_STRIP still emit only the Ch82 closing pixel. See
// body for the threshold table and saturate logic.
output logic prim_complete,
output logic [31:0] prim_complete_count,
// Ch78: per-primitive vertex-identity snapshot, latched in
// lockstep with the EV_PRIM_DRAW that closes a primitive (one
// cycle after the closing vertex commits, same cycle as
// ev_valid). Layout depends on the latched PRIM[2:0]:
// POINT (1-vert) : v0=closing
// LINE/LINE_STRIP/SPRITE (2) : v0=previous, v1=closing
// TRI/TRI_STRIP (3) : v0=v_n-2, v1=v_n-1, v2=closing
// TRI_FAN anchor draw (3) : v0=v1 (pivot), v1=v2, v2=v3
// TRI_FAN saturated draw (3) : v0=v_pivot, v1=v_n-1, v2=closing
// Consumers know the valid-slot count from prim_q[2:0]. Slots
// not used by the current primitive type read 0. PRIM-write
// resets the underlying rolling window so a fresh primitive
// context starts clean.
output logic [63:0] prim_v0_q,
output logic [63:0] prim_v1_q,
output logic [63:0] prim_v2_q,
// Ch79: per-primitive color snapshot — the value of rgbaq_q at
// the closing-vertex commit. Equivalent to prim_color_v2_q for
// 3-vertex prims, prim_color_v1_q for 2-vertex, prim_color_v0_q
// for POINT. Kept as a convenience output for consumers that
// only care about flat shading. Reads 0 if no RGBAQ has been
// written since reset; rgbaq_q is NOT cleared on PRIM write —
// color carries forward across PRIM context switches like real
// GS — and only resets on rst_n.
output logic [63:0] prim_color_q,
// Ch80: per-vertex Gouraud color snapshot. Each slot holds the
// value of rgbaq_q at the moment the corresponding vertex
// committed (sampled into a parallel rolling color window
// c_curr_q / c_prev_q / c_prev_prev_q / c_pivot_q on each
// vertex commit). Slot layout mirrors prim_v0_q / v1_q / v2_q:
// POINT (1-vert) : v0=rgbaq_q at commit
// LINE/LINE_STRIP/SPRITE (2) : v0=c_curr_q (prev), v1=rgbaq_q (close)
// TRI/TRI_STRIP (3) : v0=c_prev_q, v1=c_curr_q, v2=rgbaq_q
// TRI_FAN saturated draw (3) : v0=c_pivot_q, v1=c_curr_q, v2=rgbaq_q
// For a flat-shaded primitive (RGBAQ written once before the
// strip), all per-vertex slots equal each other and equal
// prim_color_q. For Gouraud (RGBAQ rewritten between vertex
// commits) they may differ, capturing the per-vertex color
// identity real game streams need. Slots not used by the
// current primitive type read 0.
output logic [63:0] prim_color_v0_q,
output logic [63:0] prim_color_v1_q,
output logic [63:0] prim_color_v2_q,
// Ch81: structured-field decoded snapshots, latched alongside
// the raw 64-bit slots. vertex_t carries x / y / z / fog /
// is_xyzf2 already unpacked; color_t carries r / g / b / a / q.
// gs_stub tracks the per-vertex source format (XYZ2 vs XYZF2)
// in a parallel rolling flag window so each slot's vertex_t can
// be decoded with the right Z/fog interpretation. Slots not
// used by the current primitive type read 0. Consumers downstream
// (rasterizer / pixel emit) can pick up these decoded fields
// instead of re-deriving them from the raw 64-bit payloads.
output trace_pkg::vertex_t prim_v0_decoded_q,
output trace_pkg::vertex_t prim_v1_decoded_q,
output trace_pkg::vertex_t prim_v2_decoded_q,
output trace_pkg::color_t prim_v0_color_decoded_q,
output trace_pkg::color_t prim_v1_color_decoded_q,
output trace_pkg::color_t prim_v2_color_decoded_q,
// Ch82 / Ch83 — minimal pixel emit. One pixel per closed
// primitive, sourced from the closing vertex's screen X/Y plus
// the latched FRAME_1 register. pixel_emit pulses for one cycle
// on the same edge prim_complete pulses (separate strobe channel
// — does not multiplex with the main ev_valid trace stream).
// Ch83 added a PSM-aware byte-width lookup so PSMCT16, PSMT8,
// PSMZ16 (etc.) compute the correct fb_addr.
//
// NOTE: this `pixel_emit` channel is the legacy single-pixel-
// per-primitive observer. Sub-byte PSM4 addressing is treated
// as host-word here (bpp_shift=2) because pixel_emit does not
// commit to vram_stub — it is a debug strobe only. The raster
// channel (`raster_pixel_emit`) is the real write port; Ch106
// gave it full PSMT4 support with `pixel_index >> 1` byte
// addressing and a per-bit nibble mask. Do not conflate the
// two channels.
// fb_addr = FBP * 2048 + (Y * FBW * 64 + X) * bytes_per_pixel
// FRAME_1 bit layout (per PCSX2 GSRegs.h):
// [8:0] FBP framebuffer base / 2048
// [21:16] FBW framebuffer width / 64 (in pixels)
// [29:24] PSM pixel storage format
// [63:32] FBMSK write mask (not used by emit)
output logic pixel_emit,
output logic [31:0] pixel_emit_count,
output logic [11:0] pixel_x_q, // closing vertex integer X (12 of 12.4)
output logic [11:0] pixel_y_q, // closing vertex integer Y (12 of 12.4)
output logic [63:0] pixel_color_q, // RGBAQ at emit (= prim_color_q)
output logic [8:0] pixel_fbp_q, // FRAME_1 FBP field
output logic [5:0] pixel_fbw_q, // FRAME_1 FBW field
output logic [5:0] pixel_psm_q, // FRAME_1 PSM field
output logic [31:0] pixel_fb_addr_q, // computed VRAM byte offset (Ch83 PSM-aware bpp)
// Ch84/Ch85/Ch86/Ch87/Ch88 — interior rasterizer. SEPARATE
// strobe channel from pixel_emit; the existing per-primitive-
// close emit is unchanged (still pulses once per closed
// primitive). The raster channel pulses once *per interior
// pixel* of TRIANGLE / TRI_STRIP / TRI_FAN (edge-function +
// top-left fill rule, Ch85) and SPRITE (axis-aligned rectangle
// fill). POINT / LINE / LINE_STRIP do NOT use the raster
// channel — they emit only the Ch82 closing-vertex pixel.
// For TRI/STRIP/FAN, color is per-pixel Gouraud-interpolated
// (Ch86) from the three per-vertex RGBAQ snapshots; SPRITE
// stays flat (== prim_color_q at close). Coordinates use the
// integer (12.4 top-12) screen X/Y; sub-pixel discarded.
// Closed primitives feed a 2-entry FIFO (Ch87) and emerge
// through a 3-stage pixel pipeline (Ch88) at one candidate
// pixel per cycle; raster_overflow latches only when a push
// arrives while the FIFO is full and no concurrent pop frees
// a slot.
//
// FSM is IDLE / SCAN (one pixel per cycle inside the box).
// raster_active is high during SCAN. raster_overflow latches if
// a new primitive closes while a previous one is still
// scanning — the new one is dropped, documented limitation.
// The TB design keeps bounding boxes small (≤ 4×4) to stay
// ahead of back-to-back vertex commits.
output logic raster_pixel_emit,
output logic [31:0] raster_pixel_emit_count,
output logic [11:0] raster_pixel_x_q,
output logic [11:0] raster_pixel_y_q,
output logic [63:0] raster_pixel_color_q,
output logic [31:0] raster_pixel_fb_addr_q,
// Ch95: per-byte write enable + PSM tag for the raster
// emit. PSMCT32 raster: be=4'b1111. PSMCT16 raster: be is
// 4'b0011 or 4'b1100 depending on the addressed halfword
// within the enclosing 32-bit word, and the lower 32 bits
// of raster_pixel_color_q carry the duplicated 16-bit
// RGB5A1 packing so vram_stub commits the right halfword
// without stomping its neighbor.
output logic [3:0] raster_pixel_be_q,
// Ch106: per-bit merge mask for sub-byte writes (PSMT4
// nibbles). 32'hFFFFFFFF for PSMCT32/16/PSMT8 (full byte
// writes — NO behavior change vs Ch95). 0x0000_000F (low
// nibble) or 0x0000_00F0 (high nibble) for PSMT4. vram_stub
// commits each enabled byte as
// mem[i] <= (mem[i] & ~mask[i]) | (data[i] & mask[i])
output logic [31:0] raster_pixel_mask_q,
output logic [5:0] raster_pixel_psm_q,
// Ch323 — tile Z-FLUSH stream (TILE_SPILL_ENABLE only). One emit per tile pixel during
// TP_ZFLUSH, on a SEPARATE channel from the color flush (raster_pixel_*) so the color
// flush path/counts are byte-identical. z_flush_addr_o is the Z-BACKING-relative byte
// offset (pixel_index*4, 32-bit Z); the de25 Z-writer adds the LPDDR Z-backing base.
// Tied 0 at the default (TP_ZFLUSH never entered).
output logic z_flush_emit_o,
output logic [31:0] z_flush_addr_o,
output logic [31:0] z_flush_data_o,
// Ch323 — DEDICATED tile COLOR-FLUSH spill stream (TILE_SPILL_ENABLE only). One emit per
// tile pixel during TP_FLUSH ONLY — a SEPARATE channel from the generic raster_pixel_emit /
// flush_emit_o (which ALSO fires for every RENDER-phase pixel write and so over-feeds the
// spill writer). Mirror of the Z-flush channel: tile_color_flush_addr_o is the spill-region-
// relative byte offset (pixel_index*4); the de25 color-writer adds the LPDDR base.
output logic tile_color_flush_emit_o,
output logic [31:0] tile_color_flush_addr_o,
output logic [31:0] tile_color_flush_data_o,
// Ch323 — tile RELOAD staging interface (TILE_SPILL_ENABLE only). reload_start_o pulses to
// arm the de25 staging engine's fill from LPDDR (color FB + Z-backing for this tile);
// tile_reload_ready_i = staging warm; then tile_reload_raddr_o sweeps 0..255 and the engine
// returns tile_reload_color_i/z_i (1-cyc). Inert at the default (TP_RELOAD never entered).
output logic reload_start_o,
output logic [7:0] tile_reload_raddr_o,
// Ch324 — current tile's byte offset into the raster LPDDR framebuffer (same formula as the
// flush side). gs_tile_reload latches it at the reload arm so it gathers exactly this tile.
// At TILE_COLS=TILE_ROWS=1 the tile is at the origin → 0 (byte-identical to Ch323).
output logic [29:0] reload_base_o,
input logic tile_reload_ready_i,
input logic [TILE_COLOR_W-1:0] tile_reload_color_i,
input logic [31:0] tile_reload_z_i,
// Ch323 diag — current tile phase (TP_OFF/CLEAR/RENDER/FLUSH/BIN/RELOAD/ZFLUSH). The
// de25 edge-detects entries into each phase as bring-up event counters. Inert otherwise.
output logic [2:0] tile_phase_o,
output logic raster_active,
// Ch337 — SCENE-LEVEL busy: high while ANY work for the in-flight scene remains, INCLUDING the
// inter-batch gaps that raster_active dips through on a >FIFO_DEPTH (multi-batch) scene. The
// control FSM gates its retrigger 'ready' on this (NOT raster_active) so a >8 scene can't be
// retriggered mid-render. = raster_active || tile still flushing || prims still queued || an
// end-of-list flush still pending. For a <=FIFO single-batch scene it tracks raster_active
// closely (FIFO/flush drain within the one batch), so the Ch330-335 path is unaffected.
output logic raster_scene_busy,
output logic raster_overflow,
// Ch172 — backpressure handshake to the upstream GIF parser.
// High when the raster command FIFO is full (raster_fifo_full ==
// count == FIFO_DEPTH). The wrapper routes this to
// gif_packed_stub.raster_fifo_full so the GIF deasserts in_ready
// BEFORE accepting the next qword; the DMAC then pauses cleanly
// and resumes when raster pops the next FIFO entry. With this
// path, no primitive can ever silently drop, and FIFO_DEPTH can
// be set independently of "how many sprites we might draw."
output logic raster_fifo_full,
// Ch85: latches when a TRI/TRI_STRIP/TRI_FAN closes with zero
// signed area (3 colinear vertices). The rasterizer skips SCAN
// for that primitive and stays IDLE — no raster pixels emit.
// pixel_emit (Ch82 channel) and prim_complete still pulse on
// the close cycle as usual; only the interior raster is
// suppressed. SPRITE never sets this flag (rectangle fill
// doesn't depend on signed area).
output logic raster_degenerate,
// ------------------------------------------------------------------
// Texturing (brick 1) — VRAM texel read port.
//
// gs_stub does NOT instantiate vram_stub; the wrapper/TB owns that.
// For textured SPRITEs the rasterizer fetches a texel per inside
// pixel through this read port. The TB wires it to vram_stub's
// SECOND read port (read2_addr/read2_data), muxed with
// clut_loader_stub's read2 access so CLUT-loading (which happens at
// TEX0 commit, before the raster scan) and texel-fetch (during the
// scan) don't collide.
//
// tex_rd_en / tex_rd_addr drive the read port; tex_rd_data returns
// the 32-bit word TEX_RD_LATENCY cycles later (default 1, matching
// vram_stub's combinational read viewed through gs_stub's own S2
// register stage — see the latency note at the sampler instance).
//
// Scope: PSMCT32 DECAL + (Ch296) PSMT8 indexed DECAL textured
// SPRITEs/TRIANGLEs. When PRIM.TME=0 (or PSM is neither PSMCT32 nor
// PSMT8) the texturing path is bypassed and emit is byte-identical to
// the pre-texturing Gouraud/flat path.
// ------------------------------------------------------------------
output logic tex_rd_en,
output logic [31:0] tex_rd_addr,
input logic [31:0] tex_rd_data,
// ------------------------------------------------------------------
// Ch296 — CLUT lookup port for PSMT8 indexed texturing. The texel
// fetch yields an 8-bit index; `clut_rd_idx` drives clut_stub's
// second (combinational) read port and `clut_rd_data` returns the
// PSMCT32 entry. The wrapper/TB instantiates clut_stub (+ the
// clut_loader that fills it at TEX0 commit) and wires these two
// signals. Tie clut_rd_data=0 for PSMCT32-only tops (the index path
// is never selected there).
// ------------------------------------------------------------------
output logic [7:0] clut_rd_idx,
input logic [31:0] clut_rd_data,
// Ch296 — CLUT-load-busy backpressure. The CLUT load (clut_loader_stub,
// fired at TEX0 commit) and the textured raster's texel fetch share the
// single VRAM read2 port. They must be mutually exclusive in time. The
// load runs ~256 cycles and can outlast the few GIF cycles between the
// TEX0 commit and the textured primitive reaching the raster head, so
// hold off the FIFO pop (don't START the textured scan) while the load
// is in flight. Tie LOW for tops with no CLUT loader (no effect). This
// gates only the pop, so in-flight primitives drain normally.
input logic clut_load_busy,
// ------------------------------------------------------------------
// Brick 2a — ALPHA blend dest-framebuffer read port.
//
// For a FLAT alpha-blended SPRITE pixel, gs_stub reads the existing
// destination pixel back from VRAM at the write address (fb_rd_addr)
// and blends the source color against it. The wrapper wires this to
// vram_stub/vram_bram_stub's SECOND read port (read2), arbitrated
// with the texel-fetch port — a FLAT blend never textures, so the
// two enables are mutually exclusive (documented in the wrapper).
//
// fb_rd_en / fb_rd_addr drive the read; fb_rd_data returns the
// 32-bit PSMCT32 word (the dest pixel) — combinationally
// (FB_RD_REGISTERED=0) or one cycle later (FB_RD_REGISTERED=1).
// ------------------------------------------------------------------
output logic fb_rd_en,
output logic [31:0] fb_rd_addr,
input logic [31:0] fb_rd_data,
// ------------------------------------------------------------------
// Brick 2b — Z-buffer (depth test) stored-Z read port.
//
// For a Z-tested FLAT PSMCT32 SPRITE pixel, gs_stub reads the stored
// Z value back from VRAM at the Z-buffer address for (x,y) and
// compares the fragment Z against it per TEST_1.ZTST. The wrapper
// wires this to the SAME spare second read port (read2) the texel
// fetch / alpha dest-fb read use — mutually exclusive by feature.
//
// z_rd_en / z_rd_addr drive the read; z_rd_data returns the 32-bit
// stored Z (PSMZ32) — combinationally (Z_RD_REGISTERED=0) or one
// cycle later (Z_RD_REGISTERED=1).
// ------------------------------------------------------------------
output logic z_rd_en,
output logic [31:0] z_rd_addr,
input logic [31:0] z_rd_data,
// ------------------------------------------------------------------
// COMBINED probe — schedule observability. These expose the
// walker-stalling 5-beat FSM state so a TB can assert the exact
// per-beat read/write sequence. Driven only when COMBINED_TAZ=1 and
// the in-flight primitive is the combined T+A+Z triangle; held 0
// otherwise (so param=0 / non-combined builds see constant 0).
// comb_active : a combined-mode scan is running this primitive
// comb_beat : current beat 0..4 of the per-pixel FSM
// comb_pix_inside : the held candidate pixel is inside the triangle
// comb_ztest_pass : depth test passed for the held pixel (valid
// from beat 1 onward)
// The per-beat read enables (tex_rd_en / fb_rd_en / z_rd_en) and the
// write strobe (raster_pixel_emit) are already observable on their
// own ports.
output logic comb_active,
output logic [2:0] comb_beat,
output logic comb_pix_inside,
output logic comb_ztest_pass,
// ------------------------------------------------------------------
// Ch304 TILE-LOCAL render mode — schedule observability. These expose
// the tile-phase FSM and the on-chip tile RAM ports so a TB can prove
// clear-init, hidden-pixel-skips-texture/color, visible-pixel reads
// Z+tex+color & writes color+Z, and flush emits FB. Held 0 when
// TILE_LOCAL=0 (constant tie-off) so non-tiled builds see 0.
// tile_active : the in-flight primitive is rendering tile-local
// tile_phase : 0=OFF 1=CLEAR 2=RENDER 3=FLUSH 4=BIN(BIN_BUFFER_ENABLE)
// tile_color_we : color tile write strobe this cycle
// tile_color_waddr: color tile write index
// tile_color_raddr: color tile read index (registered read)
// tile_z_we : Z tile write strobe this cycle
// tile_z_waddr : Z tile write index
// tile_z_raddr : Z tile read index (registered read)
output logic tile_active,
output logic [2:0] tile_phase,
output logic tile_color_we,
output logic [7:0] tile_color_waddr,
output logic [7:0] tile_color_raddr,
output logic tile_z_we,
output logic [7:0] tile_z_waddr,
output logic [7:0] tile_z_raddr,
// Trace
output logic ev_valid,
output subsys_e ev_subsys,
output event_e ev_event,
output logic [63:0] ev_arg0,
output logic [63:0] ev_arg1,
output logic [63:0] ev_arg2,
output logic [63:0] ev_arg3,
output logic [31:0] ev_flags
);
// Default at reset: mid-grey so a missing BGCOLOR write is visually
// distinct from a black screen caused by disabled video.
localparam logic [7:0] DEFAULT_R = 8'h40;
localparam logic [7:0] DEFAULT_G = 8'h40;
localparam logic [7:0] DEFAULT_B = 8'h40;
// ------------------------------------------------------------------
// BGCOLOR latch
// Real BGCOLOR is a 24-bit RGB field in a 64-bit register. We take the
// low 24 bits of reg_wr_data as {R[23:16], G[15:8], B[7:0]} — matches
// Sony GS layout (see docs/supplemental/sony_official_docs.md).
// ------------------------------------------------------------------
logic is_bgcolor_wr;
logic is_pmode_wr;
logic is_dispfb1_wr;
logic is_display1_wr;
assign is_bgcolor_wr = reg_wr_en && (reg_wr_addr == BGCOLOR_OFFSET);
assign is_pmode_wr = reg_wr_en && (reg_wr_addr == PMODE_OFFSET);
assign is_dispfb1_wr = reg_wr_en && (reg_wr_addr == DISPFB1_OFFSET);
assign is_display1_wr = reg_wr_en && (reg_wr_addr == DISPLAY1_OFFSET);
always_ff @(posedge clk) begin
if (!rst_n) begin
bg_r <= DEFAULT_R;
bg_g <= DEFAULT_G;
bg_b <= DEFAULT_B;
end else if (is_bgcolor_wr) begin
bg_r <= reg_wr_data[23:16];
bg_g <= reg_wr_data[15:8];
bg_b <= reg_wr_data[7:0];
end
end
// ------------------------------------------------------------------
// Ch91 / Ch92 / Ch93 / Ch94 — PMODE / DISPFB1 / DISPLAY1 latches.
//
// PMODE bit 0 = EN1 (display 1 enable). DISPFB1 fields are
// extracted by gs_pcrtc_stub:
// FBP : [8:0] base in 2048-byte units
// FBW : [14:9] width in 64-pixel units
// PSM : [19:15] pixel storage mode (PSMCT32 + PSMCT16 honored, Ch94)
// DBX : [42:32] display X origin (consumed by pcrtc, Ch91-audit)
// DBY : [53:43] display Y origin (consumed by pcrtc, Ch91-audit)
// DISPLAY1 fields (Ch92/Ch93) — DX/DY/DW/DH (window inside
// active area), MAGH/MAGV (per-axis magnification, Ch93).
// All three regs reset to 0 — scanout starts disabled,
// exactly the pre-Ch91 default behavior, so legacy TBs that
// don't write PMODE see no scanout activity.
// ------------------------------------------------------------------
always_ff @(posedge clk) begin
if (!rst_n) begin
pmode_q <= 64'd0;
dispfb1_q <= 64'd0;
display1_q <= 64'd0;
end else begin
if (is_pmode_wr) pmode_q <= reg_wr_data;
if (is_dispfb1_wr) dispfb1_q <= reg_wr_data;
if (is_display1_wr) display1_q <= reg_wr_data;
end
end
// Ch75/Ch98: GIF-context register latches keyed on real PS2
// reg# per PCSX2 GSRegs.h. Only a small primitive-oriented
// subset is recognized today — anything else lands as
// EV_MODE in the trace.
// 0x00 PRIM
// 0x01 RGBAQ
// 0x04 XYZF2
// 0x05 XYZ2
// 0x06 TEX0_1 (Ch98)
// 0x4C FRAME_1
// 0x4E ZBUF_1 <- not 0x4F (that's ZBUF_2)
localparam logic [7:0] GIF_REG_PRIM = 8'h00;
localparam logic [7:0] GIF_REG_RGBAQ = 8'h01;
// Brick 1 — texture-coordinate registers. ST (0x02) carries the
// float S/T coords (used for TRI/perspective); UV (0x03) carries
// the integer U/V used for SPRITEs (and non-perspective prims).
// v1 SPRITE texturing only consumes UV. ST is decoded/latched too
// so a future TRI texturing chapter has the latch in place, but
// ST is NOT yet wired into any sampler path.
localparam logic [7:0] GIF_REG_ST = 8'h02;
localparam logic [7:0] GIF_REG_UV = 8'h03;
localparam logic [7:0] GIF_REG_XYZF2 = 8'h04;
localparam logic [7:0] GIF_REG_XYZ2 = 8'h05;
localparam logic [7:0] GIF_REG_TEX0_1 = 8'h06;
// Brick 2a — ALPHA_1 (GIF reg# 0x42). Blend-function selector:
// A = [1:0] (0=Cs, 1=Cd, 2=0) first color operand
// B = [3:2] (0=Cs, 1=Cd, 2=0) second color operand
// C = [5:4] (0=As, 1=Ad, 2=FIX) blend coefficient
// D = [7:6] (0=Cs, 1=Cd, 2=0) additive operand
// FIX = [39:32] fixed alpha (used when C=2)
// GS formula: Cv = ((A - B) * C) >> 7 + D. The brick-2a source-over
// config is A=0 B=1 C=0 D=1 -> ((Cs - Cd) * As) >> 7 + Cd.
localparam logic [7:0] GIF_REG_ALPHA_1 = 8'h42;
localparam logic [7:0] GIF_REG_SCISSOR_1 = 8'h40;
// Brick 2b — TEST_1 (GIF reg# 0x47). Per-context pixel test
// selector. Brick 2b consumes only the Z-test sub-fields:
// ZTE = bit 16 (Z-test enable)
// ZTST = bits [18:17] (0=NEVER, 1=ALWAYS, 2=GEQUAL, 3=GREATER)
// The alpha-test / dest-alpha-test fields (ATE/ATST/AREF/AFAIL,
// DATE/DATM) are latched in test_1_q but NOT decoded/acted on by
// brick 2b — out of scope. ZTST semantics (PS2): LARGER Z = nearer
// the viewer; GEQUAL passes when fragment_Z >= stored_Z.
localparam logic [7:0] GIF_REG_TEST_1 = 8'h47;
// Ch294 — CLAMP_1 (GIF reg# 0x48). Texture wrap-mode selector.
// WMS = [1:0] (0=REPEAT 1=CLAMP 2/3=REGION_*) S/u wrap
// WMT = [3:2] (same) T/v wrap
// Higher REGION_* fields (MINU/MAXU/MINV/MAXV) latched but not decoded
// (REGION_* not modelled this rung). Consumed only when TEX_WRAP_ENABLE=1.
localparam logic [7:0] GIF_REG_CLAMP_1 = 8'h48;
// Ch310 — TEX1_1 (GIF reg# 0x14). Texture sampling-mode selector. We
// consume ONLY the MMAG bit (magnification filter):
// MMAG = bit 5 (0=NEAREST, 1=LINEAR)
// (LCM/MXL/MMIN/MTBA/L/K are latched but not decoded — only magnification
// bilinear is modelled this rung.) NOTE: GIF reg# 0x14 is distinct from
// the PSMT4 PSM value 6'h14 seen in the bpp tables — same numeric, totally
// unrelated. Consumed only when BILINEAR_ENABLE=1.
localparam logic [7:0] GIF_REG_TEX1_1 = 8'h14;
localparam logic [7:0] GIF_REG_FRAME_1 = 8'h4C;
localparam logic [7:0] GIF_REG_ZBUF_1 = 8'h4E;
localparam logic [7:0] GIF_REG_BITBLTBUF = 8'h50;
localparam logic [7:0] GIF_REG_TRXPOS = 8'h51;
localparam logic [7:0] GIF_REG_TRXREG = 8'h52;
localparam logic [7:0] GIF_REG_TRXDIR = 8'h53;
// Brick 1 — texture-coordinate latches (internal; not yet ported
// out). uv_q holds the most-recent UV write (U=[13:0], V=[27:14],
// both 10.4 fixed-point — 4 fractional bits). st_q holds ST.
logic [63:0] st_q;
logic [63:0] uv_q;
// Brick 2a — ALPHA_1 latch (internal; not ported out).
logic [63:0] alpha_1_q;
// Brick 2b — TEST_1 latch (internal; not ported out). Holds the
// pixel-test config; only ZTE/ZTST are consumed (see below).
logic [63:0] test_1_q;
// SCISSOR_1 latch (internal). Inclusive screen rect; consumed only when
// SCISSOR_ENABLE=1 (see eff_sc* below). Reset to FULL-RANGE so an enabled
// build with no SCISSOR_1 write yet is still a no-op clip.
logic [63:0] scissor_1_q;
// CLAMP_1 latch (internal). Reset to 0 → WMS=WMT=0=REPEAT default.
logic [63:0] clamp_1_q;
// Ch310 — TEX1_1 latch (internal). Reset to 0 → MMAG=0 = NEAREST default,
// so an enabled build with no TEX1_1 write yet still samples nearest.
logic [63:0] tex1_1_q;
logic gif_is_known_reg;
assign gif_is_known_reg =
(gif_reg_num == GIF_REG_PRIM ) ||
(gif_reg_num == GIF_REG_RGBAQ ) ||
(gif_reg_num == GIF_REG_ST ) ||
(gif_reg_num == GIF_REG_UV ) ||
(gif_reg_num == GIF_REG_XYZF2 ) ||
(gif_reg_num == GIF_REG_XYZ2 ) ||
(gif_reg_num == GIF_REG_TEX0_1 ) ||
(gif_reg_num == GIF_REG_ALPHA_1 ) ||
(gif_reg_num == GIF_REG_SCISSOR_1) ||
(gif_reg_num == GIF_REG_TEST_1 ) ||
(gif_reg_num == GIF_REG_CLAMP_1 ) ||
(gif_reg_num == GIF_REG_TEX1_1 ) ||
(gif_reg_num == GIF_REG_FRAME_1 ) ||
(gif_reg_num == GIF_REG_ZBUF_1 ) ||
(gif_reg_num == GIF_REG_BITBLTBUF) ||
(gif_reg_num == GIF_REG_TRXPOS ) ||
(gif_reg_num == GIF_REG_TRXREG ) ||
(gif_reg_num == GIF_REG_TRXDIR );
always_ff @(posedge clk) begin
if (!rst_n) begin
prim_q <= 64'd0;
rgbaq_q <= 64'd0;
st_q <= 64'd0;
uv_q <= 64'd0;
alpha_1_q <= 64'd0;
test_1_q <= 64'd0;
// SCAX0=0, SCAX1=0x7FF, SCAY0=0, SCAY1=0x7FF → full-range no-op.
scissor_1_q <= 64'h07FF_0000_07FF_0000;
clamp_1_q <= 64'd0; // WMS=WMT=0 → REPEAT
tex1_1_q <= 64'd0; // MMAG=0 → NEAREST
xyz2_q <= 64'd0;
xyzf2_q <= 64'd0;
tex0_1_q <= 64'd0;
frame_1_q <= 64'd0;
zbuf_1_q <= 64'd0;
bitbltbuf_q <= 64'd0;
trxpos_q <= 64'd0;
trxreg_q <= 64'd0;
trxdir_q <= 64'd0;
end else if (gif_reg_wr_en) begin
unique case (gif_reg_num)
GIF_REG_PRIM: prim_q <= gif_reg_data;
GIF_REG_RGBAQ: rgbaq_q <= gif_reg_data;
GIF_REG_ST: st_q <= gif_reg_data;
GIF_REG_UV: uv_q <= gif_reg_data;
GIF_REG_XYZF2: xyzf2_q <= gif_reg_data;
GIF_REG_XYZ2: xyz2_q <= gif_reg_data;
GIF_REG_TEX0_1: tex0_1_q <= gif_reg_data;
GIF_REG_ALPHA_1: alpha_1_q <= gif_reg_data;
GIF_REG_SCISSOR_1: scissor_1_q <= gif_reg_data;
GIF_REG_TEST_1: test_1_q <= gif_reg_data;
GIF_REG_CLAMP_1: clamp_1_q <= gif_reg_data;
GIF_REG_TEX1_1: tex1_1_q <= gif_reg_data;
GIF_REG_FRAME_1: frame_1_q <= gif_reg_data;
GIF_REG_ZBUF_1: zbuf_1_q <= gif_reg_data;
GIF_REG_BITBLTBUF: bitbltbuf_q <= gif_reg_data;
GIF_REG_TRXPOS: trxpos_q <= gif_reg_data;
GIF_REG_TRXREG: trxreg_q <= gif_reg_data;
GIF_REG_TRXDIR: trxdir_q <= gif_reg_data;
default: ; // unknown reg — trace only
endcase
end
end
// Ch301 perspective — "saw an ST (0x02) write since the last PRIM"
// per-primitive flag. A triangle takes the PERSPECTIVE path iff its
// texcoords were supplied via ST (float S/T/Q), NOT via UV (0x03).
// Mirror of vert_count / pivot_seen_q lifecycle: cleared on PRIM
// write, latched on any ST write. When PERSPECTIVE_CORRECT=0 this is
// a constant 0 (the generate else-branch ties it low), so every
// downstream use (fifo_persp / ras_persp) is 0 and the param=0 build
// is byte-identical to the affine-only design.
logic saw_st_q;
generate
if (PERSPECTIVE_CORRECT) begin : g_saw_st
always_ff @(posedge clk) begin
if (!rst_n)
saw_st_q <= 1'b0;
else if (gif_is_prim_wr)
saw_st_q <= 1'b0;
else if (gif_reg_wr_en && (gif_reg_num == GIF_REG_ST))
saw_st_q <= 1'b1;
end
end else begin : g_no_saw_st
assign saw_st_q = 1'b0;
end
endgenerate
// Ch98 — TEX0_1 sub-field decoders (CLUT plumbing). All
// combinational from tex0_1_q so they always reflect the
// most recent TEX0_1 write.
assign tex0_1_cbp_q = tex0_1_q[50:37];
assign tex0_1_cpsm_q = tex0_1_q[54:51];
assign tex0_1_csm_q = tex0_1_q[55];
assign tex0_1_csa_q = tex0_1_q[60:56];
assign tex0_1_cld_q = tex0_1_q[63:61];
// Ch310 — TEX1_1.MMAG (magnification filter). bit 5: 0=NEAREST, 1=LINEAR.
// Combinational from tex1_1_q; snapshotted per-primitive as fifo_filter_lin.
logic tex1_mmag;
assign tex1_mmag = tex1_1_q[5];
// Brick 1 — TEX0_1 TEXTURE-side sub-field decoders (per PS2 GS
// TEX0 layout, PCSX2 GSRegs.h; the CLUT-side fields above start at
// CBP=[50:37], which fixes the texture-side field positions):
// TBP0 [13:0] texture base pointer (in 64-byte/256-bit words)
// TBW [19:14] texture buffer width (texels-per-row / 64)
// PSM [25:20] pixel storage mode (6 bits)
// TW [29:26] log2(texture width)
// TH [33:30] log2(texture height)
// TCC [34] / TFX [36:35] — not consumed by v1 DECAL
logic [13:0] tex0_tbp0;
logic [5:0] tex0_tbw;
logic [5:0] tex0_psm;
logic [3:0] tex0_tw;
logic [3:0] tex0_th;
assign tex0_tbp0 = tex0_1_q[13:0];
assign tex0_tbw = tex0_1_q[19:14];
assign tex0_psm = tex0_1_q[25:20];
assign tex0_tw = tex0_1_q[29:26];
assign tex0_th = tex0_1_q[33:30];
// Ch294 — CLAMP_1 decoded wrap-mode fields (WMS[1:0], WMT[3:2]).
logic [1:0] clamp_wms, clamp_wmt;
assign clamp_wms = clamp_1_q[1:0];
assign clamp_wmt = clamp_1_q[3:2];
// PRIM.TME — texture-mapping enable (bit 4 of the PRIM register,
// per PS2 GS PRIM layout: [2:0] PRIM type, [3] IIP, [4] TME,
// [5] FGE, [6] ABE, ...). When 0, the textured emit path is
// bypassed entirely.
logic prim_tme;
assign prim_tme = prim_q[4];
// Brick 2a — PRIM.ABE (alpha-blend enable, bit 6 of PRIM). When 0,
// the blend path is bypassed and the emit is opaque (byte-identical
// to pre-Brick-2a). ALPHA_1 sub-field decoders are combinational
// from alpha_1_q so they always reflect the latest ALPHA_1 write.
logic prim_abe;
logic [1:0] alpha_a, alpha_b, alpha_c, alpha_d;
logic [7:0] alpha_fix;
logic alpha_is_source_over;
assign prim_abe = prim_q[6];
assign alpha_a = alpha_1_q[1:0];
assign alpha_b = alpha_1_q[3:2];
assign alpha_c = alpha_1_q[5:4];
assign alpha_d = alpha_1_q[7:6];
// Brick-2c — ALPHA_1.FIX (fixed alpha coefficient, used when C==2).
assign alpha_fix = alpha_1_q[39:32];
// Source-over: Cv = ((Cs - Cd) * As) >> 7 + Cd i.e. A=0 B=1 C=0 D=1.
// brick-2a supports ONLY this config; any other selector falls back
// to an opaque write (documented limitation).
assign alpha_is_source_over = (alpha_a == 2'd0) && (alpha_b == 2'd1)
&& (alpha_c == 2'd0) && (alpha_d == 2'd1);
// Brick 2b — TEST_1 Z-test sub-field decoders + ZBUF_1 decoders.
// Combinational from test_1_q / zbuf_1_q so they always reflect the
// most-recent write.
// TEST_1.ZTE = bit 16 (Z-test enable)
// TEST_1.ZTST = bits [18:17] (0=NEVER 1=ALWAYS 2=GEQUAL 3=GREATER)
// ZBUF_1.ZBP = bits [8:0] (Z buffer base, *2048-word page units)
// ZBUF_1.PSM = bits [27:24] (0=PSMZ32, 1=PSMZ24, 2=PSMZ16)
// ZBUF_1.ZMSK = bit 0 (1 => do NOT update Z)
logic test_zte;
logic [1:0] test_ztst;
logic [8:0] zbuf_zbp;
logic [3:0] zbuf_psm;
logic zbuf_zmsk;
assign test_zte = test_1_q[16];
assign test_ztst = test_1_q[18:17];
assign zbuf_zbp = zbuf_1_q[8:0];
assign zbuf_psm = zbuf_1_q[27:24];
assign zbuf_zmsk = zbuf_1_q[0];
// SCISSOR_1 inclusive screen rect, decoded from the SIMPLE reg
// (part-selects on scissor_1_q — no parenthesized-expr bit-select).
// SCAX0=[10:0] SCAX1=[26:16] SCAY0=[42:32] SCAY1=[58:48]
logic [10:0] scissor_x0, scissor_x1, scissor_y0, scissor_y1;
assign scissor_x0 = scissor_1_q[10:0];
assign scissor_x1 = scissor_1_q[26:16];
assign scissor_y0 = scissor_1_q[42:32];
assign scissor_y1 = scissor_1_q[58:48];
// Effective scissor bounds folded into the 12-bit walker clip. When
// SCISSOR_ENABLE=0 (compile-time constant) these are 0 / 0xFFF, so
// max(x,0)=x and min(x,0xFFF)=x → intersecting them is a no-op and the
// synthesizer folds the whole term away. BYTE-IDENTICAL at the default.
logic [11:0] eff_scx0, eff_scx1, eff_scy0, eff_scy1;
assign eff_scx0 = SCISSOR_ENABLE ? {1'b0, scissor_x0} : 12'd0;
assign eff_scx1 = SCISSOR_ENABLE ? {1'b0, scissor_x1} : 12'hFFF;
assign eff_scy0 = SCISSOR_ENABLE ? {1'b0, scissor_y0} : 12'd0;
assign eff_scy1 = SCISSOR_ENABLE ? {1'b0, scissor_y1} : 12'hFFF;
// Brick 2b — Z-test ENABLED for a primitive only when ALL hold
// (otherwise byte-identical to the pre-2b opaque write):
// - TEST_1.ZTE = 1
// - ZBUF_1.PSM = PSMZ32 (0x0) — 32-bit Z only in v1
// - FRAME PSM = PSMCT32 (0x00) — flat color sprite only
// The mutual-exclusivity-by-feature argument (vs texturing /
// alpha blend, which also use read2) is enforced at close
// (new_zte_active below): a Z-tested sprite is FLAT, non-blended.
logic z_format_ok;
assign z_format_ok = test_zte && (zbuf_psm == 4'h0);
// Ch99 — 1-cycle TEX0_1 write pulse. The pulse is REGISTERED
// (latched at the same posedge that commits tex0_1_q), so it
// becomes visible to consumers ONE CYCLE AFTER the TEX0_1
// commit. By that time tex0_1_q has already updated, and the
// combinational tex0_1_*_q sub-field decoders reflect the
// new payload — so a downstream loader sampling on the pulse
// edge sees the just-written values without a race.
always_ff @(posedge clk) begin
if (!rst_n) begin
tex0_1_wr_q <= 1'b0;
end else begin
tex0_1_wr_q <= gif_reg_wr_en && (gif_reg_num == GIF_REG_TEX0_1);
end
end
// Ch110 — 1-cycle TRXDIR write pulse. Triggers
// gif_image_xfer_stub: when the just-committed trxdir_q[1:0]
// is 2'd0 (host→local), the xfer engine snapshots
// BITBLTBUF/TRXPOS/TRXREG and starts consuming IMAGE-mode
// qwords coming out of gif_packed_stub.
always_ff @(posedge clk) begin
if (!rst_n) begin
trxdir_wr_q <= 1'b0;
end else begin
trxdir_wr_q <= gif_reg_wr_en && (gif_reg_num == GIF_REG_TRXDIR);
end
end
// ------------------------------------------------------------------
// Ch76/Ch77 — primitive observer
//
// PRIM[2:0] selects primitive type per real PS2 GS:
// 0 POINT → discrete, 1 vertex per draw
// 1 LINE → discrete, 2 vertices per draw
// 2 LINE_STRIP → strip, anchor=2; then 1 draw per additional vertex
// 3 TRIANGLE → discrete, 3 vertices per draw
// 4 TRI_STRIP → strip, anchor=3; then 1 draw per additional vertex
// 5 TRI_FAN → strip, anchor=3; then 1 draw per additional vertex,
// with v0 of every saturated draw PINNED to
// the pivot vertex (Ch78 / Ch80 / Ch81 —
// raw, color, and decoded fields all carry
// pivot identity). Cadence count matches
// TRI_STRIP; identity does NOT.
// 6 SPRITE → discrete, 2 vertices per draw
// 7 reserved → no draw, no vertex accounting
//
// A vertex commit is any XYZ2 or XYZF2 write on the gif_reg_* port.
// A PRIM write resets the vertex counter so a fresh primitive type
// starts cleanly. Discrete-primitive draws reset the counter to 0;
// strip/fan draws SATURATE the counter at the primary threshold so
// every subsequent vertex also closes a primitive (Ch77 anchor
// logic). prim_complete pulses one cycle on each draw;
// EV_PRIM_DRAW preempts that cycle's EV_WRITE in the trace. The
// xyz2_q / xyzf2_q latches always update on a vertex commit,
// independent of whether a primitive closes.
// ------------------------------------------------------------------
logic [2:0] vert_count;
logic [2:0] prim_type;
logic [2:0] vert_threshold;
logic prim_is_strip;
logic gif_is_vertex;
logic gif_is_prim_wr;
assign prim_type = prim_q[2:0];
always_comb begin
unique case (prim_type)
3'd0: vert_threshold = 3'd1; // POINT
3'd1: vert_threshold = 3'd2; // LINE
3'd2: vert_threshold = 3'd2; // LINE_STRIP anchor
3'd3: vert_threshold = 3'd3; // TRIANGLE
3'd4: vert_threshold = 3'd3; // TRI_STRIP anchor
3'd5: vert_threshold = 3'd3; // TRI_FAN anchor
3'd6: vert_threshold = 3'd2; // SPRITE
default: vert_threshold = 3'd0; // reserved
endcase
end
assign prim_is_strip = (prim_type == 3'd2) || // LINE_STRIP
(prim_type == 3'd4) || // TRI_STRIP
(prim_type == 3'd5); // TRI_FAN
assign gif_is_vertex = gif_reg_wr_en &&
((gif_reg_num == GIF_REG_XYZ2) ||
(gif_reg_num == GIF_REG_XYZF2));
assign gif_is_prim_wr = gif_reg_wr_en && (gif_reg_num == GIF_REG_PRIM);
// A strip is "saturated" once vert_count has reached the primary
// threshold — every subsequent vertex closes another primitive.
logic strip_saturated;
logic prim_complete_now;
assign strip_saturated = prim_is_strip && (vert_count == vert_threshold);
assign prim_complete_now = gif_is_vertex && (vert_threshold != 3'd0) &&
(strip_saturated ||
((vert_count + 3'd1) == vert_threshold));
always_ff @(posedge clk) begin
if (!rst_n) begin
vert_count <= 3'd0;
prim_complete <= 1'b0;
prim_complete_count <= 32'd0;
end else begin
prim_complete <= 1'b0;
if (gif_is_prim_wr) begin
vert_count <= 3'd0;
end else if (gif_is_vertex && (vert_threshold != 3'd0)) begin
if (prim_complete_now) begin
// Strip stays saturated at the primary threshold;
// discrete resets to 0. (If already saturated,
// vert_threshold == vert_count so this is a no-op.)
vert_count <= prim_is_strip ? vert_threshold : 3'd0;
prim_complete <= 1'b1;
prim_complete_count <= prim_complete_count + 32'd1;
end else begin
vert_count <= vert_count + 3'd1;
end
end
end
end
// ------------------------------------------------------------------
// Ch78 — vertex-identity rolling window
//
// v_curr, v_prev, v_prev_prev: rolling window of the most recent
// three vertex commits under the current PRIM context. Updated on
// every vertex commit (whether closing or not). On PRIM write,
// the window clears so a fresh primitive type starts with no
// residual vertex bleed.
//
// v_pivot: the first vertex committed since the most recent PRIM
// write. Used for TRI_FAN's pivot-vertex semantics so a saturated
// fan extension can identify {pivot, prev, curr} instead of the
// strip-style rolling {v_n-2, v_n-1, v_n}. pivot_seen latches the
// first vertex once and ignores subsequent commits.
// ------------------------------------------------------------------
logic [63:0] v_curr_q;
logic [63:0] v_prev_q;
logic [63:0] v_prev_prev_q;
logic [63:0] v_pivot_q;
logic pivot_seen_q;
// Ch81: parallel format-flag rolling window — tracks whether each
// vertex in the rolling window came from XYZ2 (0) or XYZF2 (1).
// Updates lockstep with v_*_q. Cleared on PRIM write so a fresh
// primitive context has no residual format flags.
logic xyzf2_curr_q;
logic xyzf2_prev_q;
logic xyzf2_prev_prev_q;
logic xyzf2_pivot_q;
logic gif_vertex_is_xyzf2;
assign gif_vertex_is_xyzf2 = gif_is_vertex && (gif_reg_num == GIF_REG_XYZF2);
always_ff @(posedge clk) begin
if (!rst_n) begin
v_curr_q <= 64'd0;
v_prev_q <= 64'd0;
v_prev_prev_q <= 64'd0;
v_pivot_q <= 64'd0;
pivot_seen_q <= 1'b0;
xyzf2_curr_q <= 1'b0;
xyzf2_prev_q <= 1'b0;
xyzf2_prev_prev_q <= 1'b0;
xyzf2_pivot_q <= 1'b0;
end else if (gif_is_prim_wr) begin
v_curr_q <= 64'd0;
v_prev_q <= 64'd0;
v_prev_prev_q <= 64'd0;
v_pivot_q <= 64'd0;
pivot_seen_q <= 1'b0;
xyzf2_curr_q <= 1'b0;
xyzf2_prev_q <= 1'b0;
xyzf2_prev_prev_q <= 1'b0;
xyzf2_pivot_q <= 1'b0;
end else if (gif_is_vertex && (vert_threshold != 3'd0)) begin
v_prev_prev_q <= v_prev_q;
v_prev_q <= v_curr_q;
v_curr_q <= gif_reg_data;
xyzf2_prev_prev_q <= xyzf2_prev_q;
xyzf2_prev_q <= xyzf2_curr_q;
xyzf2_curr_q <= gif_vertex_is_xyzf2;
if (!pivot_seen_q) begin
v_pivot_q <= gif_reg_data;
xyzf2_pivot_q <= gif_vertex_is_xyzf2;
pivot_seen_q <= 1'b1;
end
end
end
// Ch80 — per-vertex Gouraud color rolling window. Mirrors the
// vertex window above but samples rgbaq_q (the current GS
// color register, settled to its draw-time value because
// gif_packed_stub serializes A+D entries one per cycle, so any
// RGBAQ write that "belongs to" this vertex landed on the
// previous cycle and rgbaq_q has already updated). Cleared on
// rst_n AND on PRIM write so per-vertex color identity stays
// tied to the current primitive context.
logic [63:0] c_curr_q;
logic [63:0] c_prev_q;
logic [63:0] c_prev_prev_q;
logic [63:0] c_pivot_q;
always_ff @(posedge clk) begin
if (!rst_n) begin
c_curr_q <= 64'd0;
c_prev_q <= 64'd0;
c_prev_prev_q <= 64'd0;
c_pivot_q <= 64'd0;
end else if (gif_is_prim_wr) begin
c_curr_q <= 64'd0;
c_prev_q <= 64'd0;
c_prev_prev_q <= 64'd0;
c_pivot_q <= 64'd0;
end else if (gif_is_vertex && (vert_threshold != 3'd0)) begin
c_prev_prev_q <= c_prev_q;
c_prev_q <= c_curr_q;
c_curr_q <= rgbaq_q;
// Gate on the SAME pivot_seen_q the vertex block reads —
// both update on the cycle of the first vertex commit.
if (!pivot_seen_q) begin
c_pivot_q <= rgbaq_q;
end
end
end
// Brick 1 — per-vertex texture-coordinate (UV) rolling window.
// Exact mirror of the c_*_q Gouraud-color window above, but
// samples uv_q (the current GS UV register). For a textured
// SPRITE the two endpoints' UVs define the linear texel ramp:
// uv0 = uvc_curr_q (first endpoint), uv1 = uv_q (closing).
// UV writes are A+D entries serialized one-per-cycle by
// gif_packed_stub, so uv_q has already settled to the value that
// "belongs to" this vertex by the time the vertex commits — same
// timing argument as c_curr_q. Cleared on rst_n and PRIM write.
logic [63:0] uvc_curr_q;
logic [63:0] uvc_prev_q;
logic [63:0] uvc_prev_prev_q;
logic [63:0] uvc_pivot_q;
always_ff @(posedge clk) begin
if (!rst_n) begin
uvc_curr_q <= 64'd0;
uvc_prev_q <= 64'd0;
uvc_prev_prev_q <= 64'd0;
uvc_pivot_q <= 64'd0;
end else if (gif_is_prim_wr) begin
uvc_curr_q <= 64'd0;
uvc_prev_q <= 64'd0;
uvc_prev_prev_q <= 64'd0;
uvc_pivot_q <= 64'd0;
end else if (gif_is_vertex && (vert_threshold != 3'd0)) begin
uvc_prev_prev_q <= uvc_prev_q;
uvc_prev_q <= uvc_curr_q;
uvc_curr_q <= uv_q;
if (!pivot_seen_q) begin
uvc_pivot_q <= uv_q;
end
end
end
// Ch301 perspective — per-vertex S/T (ST register, 0x02) rolling
// window. EXACT parallel of the uvc_*_q UV window above, but samples
// st_q (the GS ST register) so the perspective triangle path has the
// float S_fp/T_fp at each of {v0,v1,v2}. The matching per-vertex Q
// (1/w) is NOT a new window: it is the existing RGBAQ.Q field already
// captured in the c_*_q colour window (extract bits [55:32]). Same
// reset / PRIM-clear / vertex-shift timing argument as uvc_*_q.
// GUARDED: when PERSPECTIVE_CORRECT=0 these are constant 0.
logic [63:0] stc_curr_q;
logic [63:0] stc_prev_q;
logic [63:0] stc_prev_prev_q;
logic [63:0] stc_pivot_q;
generate
if (PERSPECTIVE_CORRECT) begin : g_stc_window
always_ff @(posedge clk) begin
if (!rst_n) begin
stc_curr_q <= 64'd0;
stc_prev_q <= 64'd0;
stc_prev_prev_q <= 64'd0;
stc_pivot_q <= 64'd0;
end else if (gif_is_prim_wr) begin
stc_curr_q <= 64'd0;
stc_prev_q <= 64'd0;
stc_prev_prev_q <= 64'd0;
stc_pivot_q <= 64'd0;
end else if (gif_is_vertex && (vert_threshold != 3'd0)) begin
stc_prev_prev_q <= stc_prev_q;
stc_prev_q <= stc_curr_q;
stc_curr_q <= st_q;
if (!pivot_seen_q) begin
stc_pivot_q <= st_q;
end
end
end
end else begin : g_no_stc_window
assign stc_curr_q = 64'd0;
assign stc_prev_q = 64'd0;
assign stc_prev_prev_q = 64'd0;
assign stc_pivot_q = 64'd0;
end
endgenerate
// Snapshot: latched on the same edge prim_complete pulses, so it
// aligns one-for-one with EV_PRIM_DRAW in the trace block (which
// also fires its registered ev_valid pulse on that same edge).
// Held until the next prim_complete; idle cycles do not disturb
// the snapshot. Ch79: prim_color_q snapshots rgbaq_q on the same
// edge — RGBAQ writes are A+D entries on a different cycle from
// any XYZ2/XYZF2 commit (gif_packed_stub serializes A+D to one
// accept per cycle), so rgbaq_q is already settled to its
// "draw-time" value when prim_complete_now fires.
// Ch81: closing-vertex format flag is "the reg# being committed
// RIGHT NOW", not a registered value — the rolling-flag window
// hasn't shifted yet on this cycle. (Same parity as how the
// vertex slot uses gif_reg_data directly for the closing slot.)
logic fan_sat_path;
assign fan_sat_path = prim_is_strip && (prim_type == 3'd5) && strip_saturated;
always_ff @(posedge clk) begin
if (!rst_n) begin
prim_v0_q <= 64'd0;
prim_v1_q <= 64'd0;
prim_v2_q <= 64'd0;
prim_color_q <= 64'd0;
prim_color_v0_q <= 64'd0;
prim_color_v1_q <= 64'd0;
prim_color_v2_q <= 64'd0;
prim_v0_decoded_q <= '0;
prim_v1_decoded_q <= '0;
prim_v2_decoded_q <= '0;
prim_v0_color_decoded_q <= '0;
prim_v1_color_decoded_q <= '0;
prim_v2_color_decoded_q <= '0;
end else if (prim_complete_now) begin
prim_color_q <= rgbaq_q;
unique case (vert_threshold)
3'd1: begin // POINT
prim_v0_q <= gif_reg_data;
prim_v1_q <= 64'd0;
prim_v2_q <= 64'd0;
prim_color_v0_q <= rgbaq_q;
prim_color_v1_q <= 64'd0;
prim_color_v2_q <= 64'd0;
prim_v0_decoded_q <= trace_pkg::decode_vertex(gif_reg_data, gif_vertex_is_xyzf2);
prim_v1_decoded_q <= '0;
prim_v2_decoded_q <= '0;
prim_v0_color_decoded_q <= trace_pkg::decode_color(rgbaq_q);
prim_v1_color_decoded_q <= '0;
prim_v2_color_decoded_q <= '0;
end
3'd2: begin // LINE / LINE_STRIP / SPRITE
prim_v0_q <= v_curr_q;
prim_v1_q <= gif_reg_data;
prim_v2_q <= 64'd0;
prim_color_v0_q <= c_curr_q;
prim_color_v1_q <= rgbaq_q;
prim_color_v2_q <= 64'd0;
prim_v0_decoded_q <= trace_pkg::decode_vertex(v_curr_q, xyzf2_curr_q);
prim_v1_decoded_q <= trace_pkg::decode_vertex(gif_reg_data, gif_vertex_is_xyzf2);
prim_v2_decoded_q <= '0;
prim_v0_color_decoded_q <= trace_pkg::decode_color(c_curr_q);
prim_v1_color_decoded_q <= trace_pkg::decode_color(rgbaq_q);
prim_v2_color_decoded_q <= '0;
end
3'd3: begin // TRI / TRI_STRIP / TRI_FAN
// FAN saturated path uses v_pivot for v0; all
// other 3-vertex closes use the rolling v_prev.
// For a FAN's first draw, v_pivot == v_prev so
// the two are equivalent on that cycle.
prim_v0_q <= fan_sat_path ? v_pivot_q : v_prev_q;
prim_v1_q <= v_curr_q;
prim_v2_q <= gif_reg_data;
prim_color_v0_q <= fan_sat_path ? c_pivot_q : c_prev_q;
prim_color_v1_q <= c_curr_q;
prim_color_v2_q <= rgbaq_q;
prim_v0_decoded_q <= trace_pkg::decode_vertex(
fan_sat_path ? v_pivot_q : v_prev_q,
fan_sat_path ? xyzf2_pivot_q : xyzf2_prev_q);
prim_v1_decoded_q <= trace_pkg::decode_vertex(v_curr_q, xyzf2_curr_q);
prim_v2_decoded_q <= trace_pkg::decode_vertex(gif_reg_data, gif_vertex_is_xyzf2);
prim_v0_color_decoded_q <= trace_pkg::decode_color(fan_sat_path ? c_pivot_q : c_prev_q);
prim_v1_color_decoded_q <= trace_pkg::decode_color(c_curr_q);
prim_v2_color_decoded_q <= trace_pkg::decode_color(rgbaq_q);
end
default: ; // reserved — never reached (vert_threshold=0 gates prim_complete_now)
endcase
end
end
// ------------------------------------------------------------------
// Ch82 / Ch83 — minimal pixel emit
//
// One pixel per closed primitive, addressed from FRAME_1's FBP /
// FBW plus the closing vertex's integer screen coordinates.
// Ch83 added a PSM-aware bytes-per-pixel shift (bpp_shift below)
// so PSMCT16, PSMT8, PSMZ16 etc. compute correct fb_addr;
// PSMT4's true 0.5 bytes/pixel stays out of scope (host-word
// fallback). Multiplication is sim-only; synthesis target is
// unaffected because gs_stub is a stub.
// ------------------------------------------------------------------
logic [11:0] pixel_x_next;
logic [11:0] pixel_y_next;
logic [31:0] fbp_bytes;
logic [31:0] pixels_per_row;
logic [31:0] pixel_index;
logic [31:0] pixel_byte_offset;
logic [31:0] pixel_fb_addr_next;
// Ch83: PSM-aware bytes-per-pixel shift. Drives the byte-offset
// multiplier for fb_addr. Recognized formats below; unrecognized
// PSMs fall back to 4-byte addressing (host-word, matches what
// PSMT4/T8H/T4HL/T4HH would produce for their host-slot
// addresses). PSMT4's 4-bit-per-pixel sub-byte addressing is
// out of scope — treated as host-word here.
//
// PSM | bytes/pixel | shift
// ----------+-------------+------
// PSMCT32 | 4 | 2 (host-word)
// PSMCT24 | 4 | 2 (24-bit packed in 32-bit slot)
// PSMCT16 | 2 | 1
// PSMCT16S | 2 | 1
// PSMT8 | 1 | 0
// PSMT4 | 0.5 | 2 * (* host-word fallback)
// PSMT8H | 4 | 2 (host-word)
// PSMT4HL | 4 | 2 (host-word)
// PSMT4HH | 4 | 2 (host-word)
// PSMZ32 | 4 | 2
// PSMZ24 | 4 | 2
// PSMZ16 | 2 | 1
// PSMZ16S | 2 | 1
logic [1:0] bpp_shift;
always_comb begin
unique case (frame_1_q[29:24])
6'h00, 6'h01, 6'h1B, 6'h24, 6'h2C, // PSMCT32 / PSMCT24 / PSMT8H / PSMT4HL / PSMT4HH
6'h30, 6'h31: // PSMZ32 / PSMZ24
bpp_shift = 2'd2; // 4 bytes/pixel
6'h02, 6'h0A, 6'h32, 6'h3A: // PSMCT16 / PSMCT16S / PSMZ16 / PSMZ16S
bpp_shift = 2'd1; // 2 bytes/pixel
6'h13: bpp_shift = 2'd0; // PSMT8 — 1 byte/pixel
6'h14: bpp_shift = 2'd2; // PSMT4 — sub-byte unsupported, host-word fallback
default: bpp_shift = 2'd2; // unrecognized PSM — host-word fallback
endcase
end
assign pixel_x_next = gif_reg_data[15:4]; // top 12 of 12.4 X
assign pixel_y_next = gif_reg_data[31:20]; // top 12 of 12.4 Y
assign fbp_bytes = {23'd0, frame_1_q[8:0]} << 11; // FBP * 2048
assign pixels_per_row = {26'd0, frame_1_q[21:16]} << 6; // FBW * 64
assign pixel_index = ({20'd0, pixel_y_next} * pixels_per_row)
+ {20'd0, pixel_x_next};
assign pixel_byte_offset = pixel_index << bpp_shift; // Ch83: PSM-aware
assign pixel_fb_addr_next = fbp_bytes + pixel_byte_offset;
always_ff @(posedge clk) begin
if (!rst_n) begin
pixel_emit <= 1'b0;
pixel_emit_count <= 32'd0;
pixel_x_q <= 12'd0;
pixel_y_q <= 12'd0;
pixel_color_q <= 64'd0;
pixel_fbp_q <= 9'd0;
pixel_fbw_q <= 6'd0;
pixel_psm_q <= 6'd0;
pixel_fb_addr_q <= 32'd0;
end else begin
pixel_emit <= 1'b0;
if (prim_complete_now) begin
pixel_emit <= 1'b1;
pixel_emit_count <= pixel_emit_count + 32'd1;
pixel_x_q <= pixel_x_next;
pixel_y_q <= pixel_y_next;
pixel_color_q <= rgbaq_q;
pixel_fbp_q <= frame_1_q[8:0];
pixel_fbw_q <= frame_1_q[21:16];
pixel_psm_q <= frame_1_q[29:24];
pixel_fb_addr_q <= pixel_fb_addr_next;
end
end
end
// ------------------------------------------------------------------
// Ch84 / Ch85 — minimal interior rasterizer
//
// Brutally simple but real:
// - SPRITE (PRIM=6): 2-vertex axis-aligned rectangle, fill all
// pixels in the bounding box.
// - TRI / TRI_STRIP / TRI_FAN: 3-vertex triangle, edge-function
// test at each pixel in the bounding box. Ch85 applies the
// D3D-style top-left fill rule: at FSM init, if signed area
// is < 0 the FSM swaps v1 and v2 to canonicalize to CCW;
// each post-swap edge is classified as top-or-left
// (inclusive) or right/bottom (exclusive); the inside test
// is `(e[i] + bias[i]) <= 0` for all i, where bias[i]=0 for
// inclusive edges and 1 for exclusive. Adjacent triangles
// sharing an edge no longer double-paint; degenerate
// triangles (signed area == 0) skip SCAN and latch
// raster_degenerate.
// - POINT / LINE / LINE_STRIP: do NOT raster — they keep the
// Ch82 single-closing-pixel emit on the pixel_emit channel
// and the raster channel stays quiet for them.
//
// FSM (Ch87 + Ch88): R_IDLE → R_SCAN → R_DRAIN. Each closed
// TRI/SPRITE that's not degenerate latches its full per-prim
// context (vertices, bias, signed area, per-vertex colors,
// FRAME_1 fields, bounding box) into a 2-entry FIFO at the
// close cycle. The FSM dequeues the oldest entry on
// R_IDLE→R_SCAN. In R_SCAN, S0 walks the bounding box one
// pixel per cycle into the S1/S2 pipeline; S2 emits inside
// pixels via raster_pixel_emit. When S0 reaches the bbox
// corner, the FSM transitions R_SCAN→R_DRAIN to let S1 and S2
// finish their tail without S0 producing more valids. The
// next pop fires on drain_done = (R_DRAIN) & !s1_valid_q &
// !s2_valid_q if the FIFO has more work, otherwise the FSM
// returns to R_IDLE. raster_active stays high across the
// R_DRAIN→R_SCAN seam for back-to-back primitives.
// raster_overflow latches only when a push arrives while the
// FIFO is full AND no concurrent pop frees a slot — Ch87's
// audit-medium fix preserves the new primitive on a
// simultaneous full+pop+push race.
//
// Color: TRI/STRIP/FAN are per-pixel Gouraud (Ch86, computed
// in S2 from latched per-vertex RGBAQ snapshots and the
// barycentric weights produced from the unbiased edge
// functions); SPRITE stays flat (Ch86 keeps SPRITE flat since
// it has only 2 vertices). Coordinates are integer (12.4
// top-12); sub-pixel snap-to-integer.
// ------------------------------------------------------------------
// Ch88 — added R_DRAIN to let the 3-stage pipeline (S0 coord
// generation / S1 edge test + barycentric weights / S2 color
// interp + fb_addr + emit) flush in-flight pixels after the
// last bbox coord is generated. The FSM stays in R_DRAIN
// until both S1 and S2 valid bits go low, then either pops
// the next FIFO entry or returns to R_IDLE.
typedef enum logic [1:0] { R_IDLE, R_SCAN, R_DRAIN } raster_state_e;
raster_state_e raster_state;
// Latched per-primitive context for the scan
typedef enum logic [1:0] { RM_NONE, RM_SPRITE, RM_TRI } raster_mode_e;
raster_mode_e ras_mode;
logic [11:0] ras_v0_x, ras_v0_y;
logic [11:0] ras_v1_x, ras_v1_y;
logic [11:0] ras_v2_x, ras_v2_y;
logic [11:0] ras_x_min, ras_x_max;
logic [11:0] ras_y_min, ras_y_max;
logic [11:0] ras_cur_x, ras_cur_y;
logic [63:0] ras_color;
logic [8:0] ras_fbp;
logic [5:0] ras_fbw, ras_psm;
logic [1:0] ras_bpp_shift;
// Brick 1 — in-flight texture context for the active scan.
logic ras_tme;
logic [10:0] ras_u0, ras_v0t, ras_u1, ras_v1t;
logic [31:0] ras_tex_base;
logic [13:0] ras_tbw;
logic [5:0] ras_tpsm;
// Ch294 — popped per-primitive wrap-mode context (CLAMP_1 + TEX0 dims).
logic [1:0] ras_wms;
logic [1:0] ras_wmt;
logic [3:0] ras_tw;
logic [3:0] ras_th;
// Brick 1 (DDA) — popped per-pixel Q16.16 texel-coord increments.
logic signed [31:0] ras_du_dx_q;
logic signed [31:0] ras_dv_dy_q;
// Brick 2a — in-flight alpha-blend-active flag for the scan. High
// only for a FLAT PSMCT32 SPRITE that closed with PRIM.ABE=1 and a
// source-over ALPHA_1 config. Held for the whole primitive (ABE is
// constant per primitive), so the emit path is uniform.
logic ras_abe;
// Ch344 — in-flight flag: this SPRITE is the textured + source-over alpha path (isolated
// half-rate texel-read/dest-read/blend). DERIVED, not packed: ras_abe is now set for both
// flat AND textured-PSMCT32 alpha sprites (new_abe_active below, gated on SPRITE_TEX_ALPHA),
// and ras_tme distinguishes them — so no FIFO attr-word change is needed. Constant 0 when
// SPRITE_TEX_ALPHA=0 (new_abe_active never sets ras_abe for a textured sprite then).
logic ras_tex_abe;
assign ras_tex_abe = ras_abe && ras_tme;
// Ch310 — in-flight per-primitive bilinear-filter flag (TEX1_1.MMAG),
// snapshotted per-primitive, mirroring ras_abe. Loaded always; only READ
// by the combined texel path when BILINEAR_ENABLE=1 (drives u_tex.filter_lin
// + bili_now), so the default build is byte-identical.
logic ras_filter_lin;
// Brick-2c — in-flight generic GS ALPHA selectors for the active scan
// (snapshotted per-primitive, mirroring ras_abe). Loaded always; only
// READ by u_comb_blend when ALPHA_MODES_ENABLE=1, so the default build
// is byte-identical.
logic [1:0] ras_alpha_a, ras_alpha_b, ras_alpha_c, ras_alpha_d;
logic [7:0] ras_alpha_fix;
// Brick 2b — in-flight Z-test context for the active scan. ras_zte
// gates the depth-test path (high only for a FLAT PSMCT32 SPRITE
// closed with TEST_1.ZTE=1 and ZBUF PSMZ32). The fields below are
// constant across the scan (flat sprite Z + per-primitive ZBUF
// config), so the emit path is uniform per primitive.
logic ras_zte; // depth-test active for this scan
logic [1:0] ras_ztst; // 0=NEVER 1=ALWAYS 2=GEQUAL 3=GREATER
logic ras_zmsk; // 1 => do NOT update Z on pass
logic [8:0] ras_zbp; // ZBUF base in 2048-word page units
logic [31:0] ras_z_value; // flat fragment Z for the sprite
// Choose primitive type for the scan based on PRIM[2:0] at
// close. Only TRI/TRI_STRIP/TRI_FAN drive RM_TRI; SPRITE
// drives RM_SPRITE; everything else is RM_NONE (no scan).
function automatic raster_mode_e classify_prim_for_raster(input logic [2:0] pt);
unique case (pt)
3'd3, 3'd4, 3'd5: return RM_TRI; // TRI / TRI_STRIP / TRI_FAN
3'd6: return RM_SPRITE;
default: return RM_NONE;
endcase
endfunction
// Min/max helpers (unsigned 12-bit)
function automatic logic [11:0] umin12(input logic [11:0] a,
input logic [11:0] b);
return (a < b) ? a : b;
endfunction
function automatic logic [11:0] umax12(input logic [11:0] a,
input logic [11:0] b);
return (a > b) ? a : b;
endfunction
// Closing vertex's screen X/Y (live from gif_reg_data on the
// close cycle, same source the pixel-emit block uses).
logic [11:0] close_x, close_y;
assign close_x = gif_reg_data[15:4];
assign close_y = gif_reg_data[31:20];
// Brick 2b — closing vertex's fragment Z (XYZ2/XYZF2 bits [63:32]).
// For a SPRITE the Z is flat (constant across the rectangle), so the
// closing vertex's Z is THE fragment Z for every pixel of the sprite.
logic [31:0] close_z;
assign close_z = gif_reg_data[63:32];
// For 3-vertex prims the slot mapping mirrors the prim_v*_q
// snapshot logic: TRI/TRI_STRIP use {v_prev, v_curr, closing};
// TRI_FAN saturated uses {v_pivot, v_curr, closing}; FAN
// anchor draws coincide because v_pivot==v_prev there.
logic [11:0] tri_v0_x_next, tri_v0_y_next;
logic [11:0] tri_v1_x_next, tri_v1_y_next;
logic [11:0] tri_v2_x_next, tri_v2_y_next;
assign tri_v0_x_next = fan_sat_path ? v_pivot_q[15:4] : v_prev_q[15:4];
assign tri_v0_y_next = fan_sat_path ? v_pivot_q[31:20] : v_prev_q[31:20];
assign tri_v1_x_next = v_curr_q[15:4];
assign tri_v1_y_next = v_curr_q[31:20];
assign tri_v2_x_next = close_x;
assign tri_v2_y_next = close_y;
// For SPRITE the slot mapping is {v_curr=first endpoint,
// closing=second endpoint}. v0 / v1 form the diagonal corners.
logic [11:0] sp_v0_x_next, sp_v0_y_next;
logic [11:0] sp_v1_x_next, sp_v1_y_next;
assign sp_v0_x_next = v_curr_q[15:4];
assign sp_v0_y_next = v_curr_q[31:20];
assign sp_v1_x_next = close_x;
assign sp_v1_y_next = close_y;
// Brick 1 — SPRITE texel coords at the two endpoints. v0 (first
// endpoint) uses the windowed UV (uvc_curr_q); v1 (closing) uses
// the live UV register (uv_q) — same {prev, closing} mapping the
// geometry uses (sp_v0 = v_curr, sp_v1 = closing). UV fields are
// U=[13:0], V=[27:14] in 10.4 fixed-point; the integer texel
// coordinate is the top 11 bits (>>4) so the 0..2047 texel range
// fits gs_texel_addr's u/v[10:0] inputs.
logic [10:0] sp_u0_next, sp_v0t_next, sp_u1_next, sp_v1t_next;
assign sp_u0_next = uvc_curr_q[14:4]; // U integer part (10.4 → 11-bit)
assign sp_v0t_next = uvc_curr_q[28:18]; // V integer part (V starts at bit 14)
assign sp_u1_next = uv_q[14:4];
assign sp_v1t_next = uv_q[28:18];
// Brick 1 — texture base byte address from TEX0.TBP0.
//
// TBP0 scaling convention: base_bytes = TBP0 * 256 (i.e. TBP0<<8).
// This MATCHES the VRAM upload path (gif_image_xfer_stub scales
// BITBLTBUF.DBP the same way: dest_base = DBP*256, see that file's
// `dest_base_q <= {18'd0, dbp} << 8`). It deliberately does NOT
// match the framebuffer/pcrtc FBP*2048 convention — TBP0 and DBP
// are the texture/transfer base namespace, FBP is the framebuffer
// namespace, and a real flow uploads a texture via BITBLT to some
// DBP and then points TEX0.TBP0 at the same value. Choosing the
// DBP scaling here is what makes "BITBLT a texture, then TEX0 it"
// read back from exactly where it was written.
//
// TODO(upload-path reconciliation): v1's TB preloads the texture
// DIRECTLY into vram_stub at base = TBP0*256 (bypassing BITBLT/TRX)
// so this scaling is the only contract that must hold. A follow-on
// chapter must verify the round trip "BITBLT upload at DBP → TEX0
// TBP0=DBP → textured sprite samples the uploaded texels" end to
// end, including any PSM-specific block/page swizzle (this linear
// path assumes the non-swizzled layout, matching gs_texel_addr).
logic [31:0] tex_base_next;
assign tex_base_next = {18'd0, tex0_tbp0} << 8;
// Brick 1 (DDA) — per-pixel Q16.16 texel-coord step for the SPRITE
// being pushed. ONE divide each (per primitive, at the close cycle):
// du_dx = ((u1 - u0) << 16) / (x1 - x0)
// dv_dy = ((v1 - v0) << 16) / (y1 - y0)
// A zero span (single-column/row sprite) yields step 0 so the
// coord holds the first endpoint's texel. These feed fifo_du_dx /
// fifo_dv_dy at push; the per-pixel datapath uses only the popped
// step with a multiply (dda_uv above) — no per-pixel divide.
logic signed [31:0] sp_span_x, sp_span_y;
logic signed [31:0] sp_du, sp_dv;
logic signed [31:0] du_dx_next, dv_dy_next;
assign sp_span_x = $signed({20'd0, sp_v1_x_next}) - $signed({20'd0, sp_v0_x_next});
assign sp_span_y = $signed({20'd0, sp_v1_y_next}) - $signed({20'd0, sp_v0_y_next});
assign sp_du = $signed({21'd0, sp_u1_next}) - $signed({21'd0, sp_u0_next});
assign sp_dv = $signed({21'd0, sp_v1t_next}) - $signed({21'd0, sp_v0t_next});
assign du_dx_next = (sp_span_x == 32'sd0) ? 32'sd0 : ((sp_du <<< 16) / sp_span_x);
assign dv_dy_next = (sp_span_y == 32'sd0) ? 32'sd0 : ((sp_dv <<< 16) / sp_span_y);
// Edge-function evaluation for the current scan pixel inside a
// triangle. Use 32-bit signed math: e(p) = (p.x - vA.x) *
// (vB.y - vA.y) - (p.y - vA.y) * (vB.x - vA.x) for edge AB.
// Inside test (Ch85 top-left fill rule): for the post-swap CCW
// triangle, e[i] is interior-negative; an edge i is "top-or-
// left" iff (dy > 0) OR (dy == 0 AND dx > 0) — those edges are
// INCLUSIVE (e == 0 counts as inside); other edges are
// EXCLUSIVE (strict e < 0). Implemented by a per-edge bias: 0
// for top-left, 1 for right/bottom; the comparison then
// becomes (e + bias <= 0).
function automatic logic [31:0] edge_fn(input logic [11:0] px,
input logic [11:0] py,
input logic [11:0] ax,
input logic [11:0] ay,
input logic [11:0] bx,
input logic [11:0] by);
logic signed [31:0] dpx, dpy, dbx, dby, prod;
dpx = $signed({20'd0, px}) - $signed({20'd0, ax});
dpy = $signed({20'd0, py}) - $signed({20'd0, ay});
dbx = $signed({20'd0, bx}) - $signed({20'd0, ax});
dby = $signed({20'd0, by}) - $signed({20'd0, ay});
prod = dpx * dby - dpy * dbx;
return prod;
endfunction
// Top-or-left predicate for an edge (vA, vB) in CCW math
// traversal order, Y-down screen coords.
function automatic logic top_or_left(input logic [11:0] ax,
input logic [11:0] ay,
input logic [11:0] bx,
input logic [11:0] by);
logic signed [12:0] dx, dy;
dx = $signed({1'b0, bx}) - $signed({1'b0, ax});
dy = $signed({1'b0, by}) - $signed({1'b0, ay});
return (dy > 0) || ((dy == 32'sd0) && (dx > 0));
endfunction
// Signed area of the original (pre-swap) triangle, computed
// combinationally from the slot-mapped vertices. >0 = CCW,
// <0 = CW, 0 = degenerate. Drives the FSM init's swap decision
// and the degenerate skip.
logic signed [31:0] tri_sa_calc;
assign tri_sa_calc =
($signed({20'd0, tri_v1_x_next}) - $signed({20'd0, tri_v0_x_next})) *
($signed({20'd0, tri_v2_y_next}) - $signed({20'd0, tri_v0_y_next})) -
($signed({20'd0, tri_v1_y_next}) - $signed({20'd0, tri_v0_y_next})) *
($signed({20'd0, tri_v2_x_next}) - $signed({20'd0, tri_v0_x_next}));
logic tri_swap_ccw, tri_degenerate;
assign tri_swap_ccw = (tri_sa_calc < 32'sd0);
assign tri_degenerate = (tri_sa_calc == 32'sd0);
// Post-swap CCW vertices. If sa < 0 we swap v1 and v2 so the
// top-left rule below applies uniformly.
logic [11:0] tri_v0x, tri_v0y, tri_v1x, tri_v1y, tri_v2x, tri_v2y;
assign tri_v0x = tri_v0_x_next;
assign tri_v0y = tri_v0_y_next;
assign tri_v1x = tri_swap_ccw ? tri_v2_x_next : tri_v1_x_next;
assign tri_v1y = tri_swap_ccw ? tri_v2_y_next : tri_v1_y_next;
assign tri_v2x = tri_swap_ccw ? tri_v1_x_next : tri_v2_x_next;
assign tri_v2y = tri_swap_ccw ? tri_v1_y_next : tri_v2_y_next;
// Brick 3 — per-vertex fragment Z for the 3 triangle vertices.
// Source Z follows the SAME {prev, curr, closing} (or {pivot, curr,
// closing} for a saturated fan) slot mapping the X/Y use, and the
// SAME v1<->v2 swap as the geometry / colors so vertex-attribute
// alignment is preserved into the post-swap CCW slots. Z lives in
// bits [63:32] of the vertex word (XYZ2 layout, matching close_z and
// the Brick-2b sprite Z path). Affine (screen-linear) Z is correct
// for this chapter; perspective-correct Z is a future item.
logic [31:0] tri_v0_z_next, tri_v1_z_next, tri_v2_z_next;
assign tri_v0_z_next = fan_sat_path ? v_pivot_q[63:32] : v_prev_q[63:32];
assign tri_v1_z_next = v_curr_q[63:32];
assign tri_v2_z_next = close_z; // gif_reg_data[63:32]
logic [31:0] tri_v0z, tri_v1z, tri_v2z;
assign tri_v0z = tri_v0_z_next;
assign tri_v1z = tri_swap_ccw ? tri_v2_z_next : tri_v1_z_next;
assign tri_v2z = tri_swap_ccw ? tri_v1_z_next : tri_v2_z_next;
// Post-swap per-vertex Gouraud colors (same mapping the FIFO color
// fields use), exposed as wires so the Brick-3 gradient setup can
// read per-channel vertex attributes at the push cycle.
logic [63:0] tri_c0, tri_c1, tri_c2;
assign tri_c0 = fan_sat_path ? c_pivot_q : c_prev_q;
assign tri_c1 = tri_swap_ccw ? rgbaq_q : c_curr_q;
assign tri_c2 = tri_swap_ccw ? c_curr_q : rgbaq_q;
// Textured-triangle rung — post-swap per-vertex texture coordinate
// (UV). EXACT mirror of the tri_c* colour mapping above but on the
// uvc_*_q rolling window (the closing vertex's UV is the live uv_q,
// same role rgbaq_q plays for colour). U=[14:4], V=[28:18] in the
// 10.4 UV register → integer texel = >>4 (the [14:4]/[28:18] slice),
// matching the SPRITE sp_u0_next/sp_v0t_next decode. The closing UV
// is taken from uv_q (the just-written closing vertex's UV).
logic [10:0] tri_uv0_u, tri_uv0_v, tri_uv1_u, tri_uv1_v, tri_uv2_u, tri_uv2_v;
logic [63:0] tri_uvc0, tri_uvc1, tri_uvc2; // pre-slice {V,U} window words
assign tri_uvc0 = fan_sat_path ? uvc_pivot_q : uvc_prev_q;
assign tri_uvc1 = tri_swap_ccw ? uv_q : uvc_curr_q;
assign tri_uvc2 = tri_swap_ccw ? uvc_curr_q : uv_q;
assign tri_uv0_u = tri_uvc0[14:4];
assign tri_uv0_v = tri_uvc0[28:18];
assign tri_uv1_u = tri_uvc1[14:4];
assign tri_uv1_v = tri_uvc1[28:18];
assign tri_uv2_u = tri_uvc2[14:4];
assign tri_uv2_v = tri_uvc2[28:18];
// Ch301 perspective — post-swap per-vertex S/T/Q (24-bit fixed-point,
// FRAC=12). EXACT parallel of the tri_uv*/tri_c* post-swap mapping:
// S_fp = st_q[23:0] (carried in stc_*_q[23:0])
// T_fp = st_q[55:32] (carried in stc_*_q[55:32])
// Q_fp = rgbaq[55:32] (the RGBAQ.Q field already in the c_*_q
// colour window — extract [55:32], NOT a new
// window). For the closing vertex Q comes from
// the live rgbaq_q, exactly as the colour and
// UV closing values use rgbaq_q / uv_q.
// The stc window words pick {pivot|prev, curr|live, live|curr} with
// the same tri_swap_ccw v1<->v2 swap as geometry/colour/UV. Q rides
// the SAME tri_c0/c1/c2 post-swap colour words so S/T/Q stay aligned
// to the same physical vertex after the swap.
// GUARDED: when PERSPECTIVE_CORRECT=0 the stc window + tri_c* are 0,
// so all of these are 0 and never reach the (param-gated) FIFO use.
logic [63:0] tri_stc0, tri_stc1, tri_stc2; // pre-slice {T,S} window words
logic [23:0] tri_s0, tri_t0, tri_q0;
logic [23:0] tri_s1, tri_t1, tri_q1;
logic [23:0] tri_s2, tri_t2, tri_q2;
assign tri_stc0 = fan_sat_path ? stc_pivot_q : stc_prev_q;
assign tri_stc1 = tri_swap_ccw ? st_q : stc_curr_q;
assign tri_stc2 = tri_swap_ccw ? stc_curr_q : st_q;
assign tri_s0 = tri_stc0[23:0];
assign tri_t0 = tri_stc0[55:32];
assign tri_s1 = tri_stc1[23:0];
assign tri_t1 = tri_stc1[55:32];
assign tri_s2 = tri_stc2[23:0];
assign tri_t2 = tri_stc2[55:32];
// Q from the post-swap colour words (RGBAQ.Q field == bits [55:32]).
assign tri_q0 = tri_c0[55:32];
assign tri_q1 = tri_c1[55:32];
assign tri_q2 = tri_c2[55:32];
// Per-edge top-left flag (1 = top-or-left, 0 = right/bottom).
// Computed against the post-swap CCW order so bias[i] = ~tl[i].
logic tl0_init, tl1_init, tl2_init;
assign tl0_init = top_or_left(tri_v0x, tri_v0y, tri_v1x, tri_v1y);
assign tl1_init = top_or_left(tri_v1x, tri_v1y, tri_v2x, tri_v2y);
assign tl2_init = top_or_left(tri_v2x, tri_v2y, tri_v0x, tri_v0y);
// Latched per-edge bias used in SCAN. ras_bias[i] = 1 means
// the i-th edge is right/bottom (exclusive); 0 means top-left
// (inclusive). Loaded at IDLE→SCAN.
logic [2:0] ras_bias;
// Ch86 — per-vertex Gouraud color and signed area, latched at
// IDLE→SCAN with the same v1↔v2 swap semantics as the vertex /
// bias state so the colors stay aligned with the post-swap CCW
// vertex slots. ras_sa_q is positive (post-swap absolute value)
// — used as the divisor for barycentric weights. Q (texture
// divisor) is NOT interpolated; it passes through from the
// closing vertex's color as an opaque 32-bit IEEE float.
logic [63:0] ras_c0_q;
logic [63:0] ras_c1_q;
logic [63:0] ras_c2_q;
logic signed [31:0] ras_sa_q;
// Brick 3 — in-flight TRI affine interpolation context (popped from
// the FIFO at SCAN start). Q16.16 gradients + Z base.
logic ras_tri_active;
// COMBINED probe — in-flight flag for the combined T+A+Z triangle.
// Popped from fifo_combined at SCAN start. Drives the combined FSM
// and SUPPRESSES the standard per-feature emit paths for this
// primitive. ras_combined_g below ties it to a constant 0 net when
// COMBINED_TAZ=0 so every gating expression collapses to its legacy
// form and the build is byte-identical.
logic ras_combined_r; // register (only written when COMBINED_TAZ=1)
logic ras_combined; // gated view used everywhere
logic signed [31:0] ras_dr_dx, ras_dr_dy;
logic signed [31:0] ras_dg_dx, ras_dg_dy;
logic signed [31:0] ras_db_dx, ras_db_dy;
logic signed [31:0] ras_da_dx, ras_da_dy;
logic signed [31:0] ras_dz_dx, ras_dz_dy;
logic [31:0] ras_z0;
// Textured-triangle rung — in-flight affine U/V context. Base texel
// coord at post-swap v0 + the 4 Q16.16 gradients (popped from the
// FIFO at SCAN start, solved by the shared divider engine).
logic [10:0] ras_u0_base, ras_v0_base;
logic signed [31:0] ras_du_dx_t, ras_du_dy_t;
logic signed [31:0] ras_dv_dx_t, ras_dv_dy_t;
// Ch301 perspective — in-flight S/T/Q affine context (popped at SCAN
// start). ras_persp gates the perspective emit path (mirrors
// ras_tri_active / ras_tme). S/T/Q base at post-swap v0 + the 6
// Q16.16 gradients. GUARDED: when PERSPECTIVE_CORRECT=0 these stay 0
// (never written meaningfully; ras_persp constant 0).
logic ras_persp;
logic [23:0] ras_s0_base, ras_t0_base, ras_q0_base;
logic signed [31:0] ras_ds_dx, ras_ds_dy;
logic signed [31:0] ras_dt_dx, ras_dt_dy;
logic signed [31:0] ras_dq_dx, ras_dq_dy;
// COMBINED probe — gate ras_combined to the register when the param
// is enabled, else tie it to a constant 0 net (continuous assign, per
// iverilog-12 tie-off rule). At COMBINED_TAZ=0 every `ras_combined`
// reference below collapses to 0 and the legacy paths are restored
// bit-for-bit.
generate
if (COMBINED_TAZ) begin : g_ras_combined
assign ras_combined = ras_combined_r;
end else begin : g_no_ras_combined
assign ras_combined = 1'b0;
end
endgenerate
// ------------------------------------------------------------------
// Brick 3 — SYNTHESIZABLE affine attribute interpolation setup.
//
// The Ch86 per-pixel barycentric divide (interp_byte) is NOT
// synthesizable (Quartus error 272006 on the 64-bit signed divide,
// hence the `// synthesis translate_off` guard). Replace it with
// screen-space-affine incremental interpolation, exactly mirroring
// how the SPRITE UV path was made synthesizable (dda_uv):
//
// * AT SETUP (per primitive, latency-tolerant — happens at the
// FIFO-push close cycle): solve, for each attribute A, the plane
// A(x,y) = A0 + (x-x0)*dAdx + (y-y0)*dAdy
// from the 3 post-swap CCW vertices. dAdx / dAdy are computed in
// Q16.16 fixed-point with ONE divide each by the post-swap signed
// area det (= ras_sa_q, positive). A few divides per triangle at
// setup are fine — same budget as the sprite-UV DDA.
//
// * PER PIXEL (S2 stage): accumulate A0 + ((dx*dAdx + dy*dAdy)>>16)
// with pure multiply + add + arithmetic shift — NO divide, NO
// translate_off. Color channels clamp to [0,255]; Z is a full
// 32-bit value fed to the Brick-2b Z path.
//
// det (post-swap) = (x1-x0)(y2-y0) - (x2-x0)(y1-y0) evaluated on the
// post-swap CCW vertices; this equals ras_sa_q by construction (the
// pre-swap tri_sa_calc negated when tri_swap_ccw). Computed here
// directly on the post-swap vertices so the gradient solve is
// self-consistent regardless of swap.
logic signed [31:0] tri_det_post;
assign tri_det_post =
($signed({20'd0, tri_v1x}) - $signed({20'd0, tri_v0x})) *
($signed({20'd0, tri_v2y}) - $signed({20'd0, tri_v0y})) -
($signed({20'd0, tri_v2x}) - $signed({20'd0, tri_v0x})) *
($signed({20'd0, tri_v1y}) - $signed({20'd0, tri_v0y}));
// Gradient solve, Q16.16 result. a0/a1/a2 are the post-swap
// per-vertex attribute values; the (x,y) deltas are taken from the
// post-swap vertices. A zero det (degenerate) yields 0 step (the
// primitive is filtered as degenerate before push anyway, so this is
// just defensive). 64-bit intermediates: for 8-bit color the numerator
// is tiny; for Z the demo values are small (<~0x400) so num<<16 stays
// well within signed 64-bit. Full-range (>~2^31) Z would overflow the
// <<16 intermediate — documented future item (perspective-correct +
// wide-Z interpolation).
// SYNTHESIS-WIDTH NOTE. The Ch86 divide errored on Quartus (272006)
// because a 64-bit SIGNED divide sign-extends the LPM operand to 65
// bits, exceeding the 64-bit lpm_divide limit. To stay safely under
// that cap the setup divide here is bounded to a signed 56-bit
// dividend / signed 32-bit divisor (LPM_WIDTHN ~= 56 < 64). The
// attribute inputs are 32-bit (8-bit color trivially fits; demo-range
// Z fits — a vertex Z above ~2^39 would overflow the 16-bit pre-shift
// and is a documented FUTURE item alongside perspective-correct Z).
// Deltas are signed 14-bit (12-bit screen coords). A 32-bit attr diff
// (~33 bits) times a 14-bit delta (~14 bits) = ~47 bits; minus a like
// term = ~48 bits; <<16 = ~56 bits — hence the 56-bit dividend.
// -----------------------------------------------------------------
// Ch295 — divider-collapse for Quartus fit. The Brick-3 setup
// previously inlined a SEPARATE hardware divide at each of the 10
// (attribute × axis) gradient call sites — ~10 dividers, the
// dominant new LAB cost that pushed the fitter ~24% over. All 10
// divides share the SAME denominator (tri_det_post). They are now
// computed by a SINGLE divider time-shared across the 10 numerators
// by a small per-triangle setup engine (`grad_setup_*` below).
//
// These functions return ONLY the pre-shifted signed numerator
// (num <<< 16) — pure multiply / subtract / shift, NO divide. The
// shared engine performs `quo = num_q / det_q` and truncates to
// [31:0], so the arithmetic is BIT-IDENTICAL to the previous
// per-call divide+truncate (same num, same det, same
// truncation). Numerator width unchanged (signed 56-bit: see the
// SYNTHESIS-WIDTH NOTE above for the bound derivation; the <<<16 is
// applied here in the 56-bit numerator domain exactly as before).
function automatic logic signed [55:0] grad_num_dadx(
input logic signed [31:0] a0,
input logic signed [31:0] a1,
input logic signed [31:0] a2,
input logic signed [11:0] x0, input logic signed [11:0] y0,
input logic signed [11:0] x1, input logic signed [11:0] y1,
input logic signed [11:0] x2, input logic signed [11:0] y2);
logic signed [13:0] dy1, dy2;
logic signed [55:0] num;
dy1 = $signed({2'd0, y1}) - $signed({2'd0, y0});
dy2 = $signed({2'd0, y2}) - $signed({2'd0, y0});
num = (a1 - a0) * $signed({{42{dy2[13]}}, dy2})
- (a2 - a0) * $signed({{42{dy1[13]}}, dy1});
grad_num_dadx = num <<< 16;
endfunction
function automatic logic signed [55:0] grad_num_dady(
input logic signed [31:0] a0,
input logic signed [31:0] a1,
input logic signed [31:0] a2,
input logic signed [11:0] x0, input logic signed [11:0] y0,
input logic signed [11:0] x1, input logic signed [11:0] y1,
input logic signed [11:0] x2, input logic signed [11:0] y2);
logic signed [13:0] dx1, dx2;
logic signed [55:0] num;
dx1 = $signed({2'd0, x1}) - $signed({2'd0, x0});
dx2 = $signed({2'd0, x2}) - $signed({2'd0, x0});
num = (a2 - a0) * $signed({{42{dx1[13]}}, dx1})
- (a1 - a0) * $signed({{42{dx2[13]}}, dx2});
grad_num_dady = num <<< 16;
endfunction
// Ch87 — raster command FIFO. Holds primitive contexts captured
// at close cycles while the FSM is busy. Effective concurrency =
// 1 in-flight (in ras_*_q) + FIFO_DEPTH queued.
//
// Ch171 bumped FIFO_DEPTH 2 → 4 as a tactical fix for the Ch171
// 320×240 test card (4 large SPRITEs at 160×120 = 19,200 raster
// cycles each), where DMA was pushing all 4 sprites before
// raster popped any, overflowing the 2-entry FIFO and silently
// dropping the 4th sprite.
//
// Ch172 — proper backpressure landed (`raster_fifo_full` output
// → `gif_packed_stub.raster_fifo_full` → GIF deasserts in_ready
// → DMAC pauses BEFORE pushing a qword that could push_drop).
// With backpressure, FIFO_DEPTH is independent of "how many
// sprites we might draw" — it only sets the work-ahead window
// while raster is busy.
//
// Brick-3 fix — FIFO_DEPTH 2 → 4. The Ch295 divider-collapse made
// a TRI's affine gradients take ~10 cycles to solve (one shared
// divider) AFTER push, and a slot is held (grad_pending) until its
// own gradients are ready, so a TRI now occupies its FIFO slot for
// ~10 extra cycles before it can pop. The pre-Ch295 combinational
// solve let the head pop immediately, so 1-in-flight + 2-queued
// (DEPTH=2) absorbed 3 back-to-back closes; the solve latency
// consumes one of those slots' worth of slack. DEPTH=4 restores the
// original work-ahead window (the grad-solving slot is effectively
// "in flight"), so back-to-back TRI closes are queued, never
// dropped — the gate is the per-slot grad_pending on POP, not a
// global push stall. (tb_ee_core_gif_raster_queue pushes 3 closes
// with raster_fifo_full tied 0, i.e. no backpressure relief.)
// DEPTH=4 (not 3) keeps FIFO_PTR_W a power-of-two so the plain
// `+1` ring-pointer wrap stays correct (Ch171's value; DEPTH=3
// would need explicit mod-3 wrap). The added slots are plain
// registers — cheap next to the ~9 dividers Ch295 removed; no new
// wide datapath.
//
// Parallel arrays (one per latched field) avoid iverilog 12's
// packed-struct-array field-access quirk seen in Ch81.
localparam int FIFO_DEPTH = TILE_FIFO_DEPTH; // Ch315 — was hardcoded 4; now a param (power-of-2)
// ----------------------------------------------------------------------
// Ch328 (1a) — BRAM-backed per-prim ATTRIBUTE store. The ~76 HEAVY render
// attributes below cost ~1,033 regs + ~600 ALMs PER FIFO slot (Ch315), the
// capacity wall. They move to a packed-struct M20K `attr_ram` (the HOT
// bin/classification fields — mode, bbox, combined, tri_active, grad_pending —
// stay in registers). 1a = DARK dual-write only: a delayed-copy pack stage
// mirrors the register fifo_* into attr_ram on push; all reads stay on the
// registers, so the build is byte-identical. 1b switches the latch sites to
// prefetch from attr_ram and drops the heavy register arrays (the payoff).
typedef struct packed {
logic [11:0] v0x, v0y, v1x, v1y, v2x, v2y;
logic [2:0] bias;
logic signed [31:0] sa;
logic [63:0] color, c0, c1, c2;
logic [8:0] fbp; logic [5:0] fbw, psm; logic [1:0] bpp_shift;
logic tme;
logic [10:0] u0, v0, u1, v1;
logic [31:0] tex_base; logic [13:0] tbw; logic [5:0] tpsm;
logic [1:0] wms, wmt; logic [3:0] tw, th;
logic signed [31:0] du_dx, dv_dy;
logic abe, filter_lin;
logic [1:0] alpha_a, alpha_b, alpha_c, alpha_d; logic [7:0] alpha_fix;
logic zte; logic [1:0] ztst; logic zmsk; logic [8:0] zbp; logic [31:0] zval;
logic [31:0] z0;
logic [10:0] u0_base, v0_base;
logic [31:0] u1v, u2v;
logic persp;
logic [23:0] s0_base, t0_base, q0_base;
logic [63:0] stq1, stq2;
logic [23:0] q1, q2;
logic [31:0] v1z, v2z;
// NOTE (Ch328 1b): the 20 ENGINE-WRITTEN gradient outputs (dr_dx..dz_dy, du_dx_t..dv_dy_t,
// ds_dx..dq_dy) are deliberately NOT here — they stay in fifo_* sideband registers (the
// gradient engine writes them per-step; M20K-resident would force a per-step wide RMW).
} prim_attr_t; // INPUTS only — W shrinks 1693 -> 1053 (640 = 20 output fields removed)
localparam int PRIM_ATTR_W = $bits(prim_attr_t); // expected 1053 (inputs only; 1b)
// Ch328 1c — the actual RAM is a FLAT packed vector (NOT a prim_attr_t struct array). Quartus
// elaborates a struct array PER FIELD (attr_ram[i].tme, attr_ram[i].abe, ...) → ~76 tiny per-
// field arrays that never become M20K; a flat [W-1:0] word is ONE wide memory it can infer.
// prim_attr_t is retained ONLY as the layout/width spec ($bits + the documented pack order).
(* ramstyle = "M20K" *) logic [PRIM_ATTR_W-1:0] attr_ram [0:FIFO_DEPTH-1]; // heavy-attr table
// Ch328 1c — SINGLE registered read port so attr_ram infers as 1R1W (DUAL_PORT) M20K instead of a
// multi-read register file. TWO consumers use it: the multiprim grid (mp_load_prim) and the gradient
// prefetch. They are TIME-EXCLUSIVE — the grid waits for all_grad_done before rendering — so one
// address mux + one registered data reg suffices. Each drives its issue strobe + address for ONE
// cycle, then reads attr_rd_q the NEXT (the inherent +1 latency of a registered RAM read). attr_rd_q
// is the ONLY synthesis read. NOTE: the streaming pop is DELIBERATELY NOT a port consumer — its read
// stays a single-cycle attr_ram[fifo_rptr] register read to preserve the full-FIFO push+pop
// same-cycle slot-free concurrency (a multi-cycle pop drops prims); streaming is shallow and gains
// no M20K benefit. pop_rd_issue/pop_rd_addr are therefore tied off (kept for the uniform mux shape).
logic mp_rd_issue, pop_rd_issue, grad_rd_issue; // 1-cycle read requests
logic [FIFO_PTR_W-1:0] mp_rd_addr, pop_rd_addr, grad_rd_addr; // their slots
logic [FIFO_PTR_W-1:0] attr_rd_addr; // muxed read address (comb)
reg [PRIM_ATTR_W-1:0] attr_rd_q; // registered read data (sole synth read)
always_comb begin
// mp + grad are time-exclusive; pop is tied off. priority is belt-and-suspenders.
if (grad_rd_issue) attr_rd_addr = grad_rd_addr;
else if (mp_rd_issue) attr_rd_addr = mp_rd_addr;
else if (pop_rd_issue) attr_rd_addr = pop_rd_addr;
else attr_rd_addr = '0;
end
always_ff @(posedge clk) attr_rd_q <= attr_ram[attr_rd_addr];
`ifndef SYNTHESIS
// mp (multiprim grid) and grad (prefetch) must be time-exclusive: neither arbitrates, each assumes
// it owns the port the cycle it issues. all_grad_done gating the grid start guarantees this.
always_ff @(posedge clk) if (rst_n)
if (mp_rd_issue && grad_rd_issue)
$error("ch328 1c: mp+grad concurrent attr read — expected time-exclusive via all_grad_done");
`endif
`ifndef SYNTHESIS
initial if (PRIM_ATTR_W != 1053)
$error("ch328: prim_attr_t width drifted to %0d (expected 1053) — pack/unpack/checker out of sync", PRIM_ATTR_W);
`endif
// delayed-copy pack-stage state: 1 cyc after push_ok, fifo_*[pack_slot] are settled
logic pack_v;
logic [$clog2(FIFO_DEPTH)-1:0] pack_slot;
// Ch328 (1b) — sinks for the grad-engine-ONLY input fields when a rasterizer reader
// (mp_load_prim / pop) unpacks the full attr_ram word; the rasterizer doesn't consume these.
logic [31:0] mp_dump_u1v, mp_dump_u2v, mp_dump_v1z, mp_dump_v2z;
logic [63:0] mp_dump_stq1, mp_dump_stq2;
logic [23:0] mp_dump_q1, mp_dump_q2;
logic [31:0] pp_dump_u1v, pp_dump_u2v, pp_dump_v1z, pp_dump_v2z;
logic [63:0] pp_dump_stq1, pp_dump_stq2;
logic [23:0] pp_dump_q1, pp_dump_q2;
// Ch328 (1b) — PUSH-ASSEMBLED input word (same-cycle write so attr_ram is valid at push+1,
// matching the register FIFO; the delayed pack stage caused a streaming pop read-after-write
// hazard). Two full concats faithfully MIRROR the push's TRI/SPRITE branches in EXACT pack
// order; the equivalence checker validates them against the live fifo_* writes (dark). Common
// fields (color/frame/tex/wrap/filter/alpha/ztst/zmsk/zbp) are identical in both branches.
wire [PRIM_ATTR_W-1:0] tri_attr_word = {
tri_v0x, tri_v0y, tri_v1x, tri_v1y, tri_v2x, tri_v2y,
{~tl2_init, ~tl1_init, ~tl0_init}, (tri_swap_ccw ? -tri_sa_calc : tri_sa_calc),
rgbaq_q, (fan_sat_path ? c_pivot_q : c_prev_q), (tri_swap_ccw ? rgbaq_q : c_curr_q),
(tri_swap_ccw ? c_curr_q : rgbaq_q),
frame_1_q[8:0], frame_1_q[21:16], frame_1_q[29:24], bpp_shift,
close_tme_effective, 11'd0, 11'd0, 11'd0, 11'd0,
tex_base_next, {8'd0, tex0_tbw}, tex0_psm, clamp_wms, clamp_wmt, tex0_tw, tex0_th,
32'sd0, 32'sd0, 1'b0, tex1_mmag,
alpha_a, alpha_b, alpha_c, alpha_d, alpha_fix,
new_tri_zte_active, test_ztst, zbuf_zmsk, zbuf_zbp, 32'd0, tri_v0z,
tri_uv0_u, tri_uv0_v, {5'd0, tri_uv1_v, 5'd0, tri_uv1_u}, {5'd0, tri_uv2_v, 5'd0, tri_uv2_u},
(saw_st_q && close_tme_effective), tri_s0, tri_t0, tri_q0,
{16'd0, tri_t1, tri_s1}, {16'd0, tri_t2, tri_s2}, tri_q1, tri_q2, tri_v1z, tri_v2z };
wire [PRIM_ATTR_W-1:0] sp_attr_word = {
sp_v0_x_next, sp_v0_y_next, sp_v1_x_next, sp_v1_y_next, 12'd0, 12'd0,
3'b000, 32'sd0, rgbaq_q, 64'd0, 64'd0, 64'd0,
frame_1_q[8:0], frame_1_q[21:16], frame_1_q[29:24], bpp_shift,
prim_tme, sp_u0_next, sp_v0t_next, sp_u1_next, sp_v1t_next,
tex_base_next, {8'd0, tex0_tbw}, tex0_psm, clamp_wms, clamp_wmt, tex0_tw, tex0_th,
du_dx_next, dv_dy_next, new_abe_active, tex1_mmag,
alpha_a, alpha_b, alpha_c, alpha_d, alpha_fix,
new_zte_active, test_ztst, zbuf_zmsk, zbuf_zbp, close_z, 32'd0,
11'd0, 11'd0, 32'd0, 32'd0,
1'b0, 24'd0, 24'd0, 24'd0, 64'd0, 64'd0, 24'd0, 24'd0, 32'd0, 32'd0 };
wire [PRIM_ATTR_W-1:0] attr_word_next = (new_mode == RM_TRI) ? tri_attr_word : sp_attr_word;
raster_mode_e fifo_mode [0:FIFO_DEPTH-1];
logic [11:0] fifo_v0x [0:FIFO_DEPTH-1];
logic [11:0] fifo_v0y [0:FIFO_DEPTH-1];
logic [11:0] fifo_v1x [0:FIFO_DEPTH-1];
logic [11:0] fifo_v1y [0:FIFO_DEPTH-1];
logic [11:0] fifo_v2x [0:FIFO_DEPTH-1];
logic [11:0] fifo_v2y [0:FIFO_DEPTH-1];
logic [11:0] fifo_x_min [0:FIFO_DEPTH-1];
logic [11:0] fifo_x_max [0:FIFO_DEPTH-1];
logic [11:0] fifo_y_min [0:FIFO_DEPTH-1];
logic [11:0] fifo_y_max [0:FIFO_DEPTH-1];
logic [2:0] fifo_bias [0:FIFO_DEPTH-1];
logic signed [31:0] fifo_sa [0:FIFO_DEPTH-1];
logic [63:0] fifo_color [0:FIFO_DEPTH-1];
logic [63:0] fifo_c0 [0:FIFO_DEPTH-1];
logic [63:0] fifo_c1 [0:FIFO_DEPTH-1];
logic [63:0] fifo_c2 [0:FIFO_DEPTH-1];
logic [8:0] fifo_fbp [0:FIFO_DEPTH-1];
logic [5:0] fifo_fbw [0:FIFO_DEPTH-1];
logic [5:0] fifo_psm [0:FIFO_DEPTH-1];
logic [1:0] fifo_bpp_shift [0:FIFO_DEPTH-1];
// Brick 1 — per-primitive texture context (SPRITE DECAL v1).
// fifo_tme gates the textured path; fifo_uv0/uv1 are the integer
// texel coords at the two SPRITE endpoints (already >>4 from the
// UV 10.4 fixed-point); fifo_tex_base/tbw/psm describe the
// texture in VRAM. Carried alongside the geometry so the in-flight
// scan samples the right texture even if a new TEX0/PRIM arrives.
logic fifo_tme [0:FIFO_DEPTH-1];
logic [10:0] fifo_u0 [0:FIFO_DEPTH-1];
logic [10:0] fifo_v0 [0:FIFO_DEPTH-1];
logic [10:0] fifo_u1 [0:FIFO_DEPTH-1];
logic [10:0] fifo_v1 [0:FIFO_DEPTH-1];
logic [31:0] fifo_tex_base [0:FIFO_DEPTH-1];
logic [13:0] fifo_tbw [0:FIFO_DEPTH-1];
logic [5:0] fifo_tpsm [0:FIFO_DEPTH-1];
// Ch294 — per-primitive wrap-mode snapshot (CLAMP_1 WMS/WMT + TEX0 TW/TH).
logic [1:0] fifo_wms [0:FIFO_DEPTH-1];
logic [1:0] fifo_wmt [0:FIFO_DEPTH-1];
logic [3:0] fifo_tw [0:FIFO_DEPTH-1];
logic [3:0] fifo_th [0:FIFO_DEPTH-1];
// Brick 1 (DDA) — Q16.16 per-pixel texel-coord increments, computed
// once per primitive at push (single divide each).
logic signed [31:0] fifo_du_dx [0:FIFO_DEPTH-1];
logic signed [31:0] fifo_dv_dy [0:FIFO_DEPTH-1];
// Brick 2a — per-primitive alpha-blend-active flag.
logic fifo_abe [0:FIFO_DEPTH-1];
// Ch310 — per-primitive bilinear-filter flag (TEX1_1.MMAG). Loaded always
// (mirrors fifo_abe); consumed only when BILINEAR_ENABLE=1, so the default
// build is byte-identical.
logic fifo_filter_lin [0:FIFO_DEPTH-1];
// Brick 2c — per-primitive generic GS ALPHA selector snapshot
// (A/B/C/D + FIX). Loaded always; consumed by u_comb_blend only when
// ALPHA_MODES_ENABLE=1, so the default build is byte-identical.
logic [1:0] fifo_alpha_a [0:FIFO_DEPTH-1];
logic [1:0] fifo_alpha_b [0:FIFO_DEPTH-1];
logic [1:0] fifo_alpha_c [0:FIFO_DEPTH-1];
logic [1:0] fifo_alpha_d [0:FIFO_DEPTH-1];
logic [7:0] fifo_alpha_fix [0:FIFO_DEPTH-1];
// Brick 2b — per-primitive Z-test context.
logic fifo_zte [0:FIFO_DEPTH-1];
logic [1:0] fifo_ztst [0:FIFO_DEPTH-1];
logic fifo_zmsk [0:FIFO_DEPTH-1];
logic [8:0] fifo_zbp [0:FIFO_DEPTH-1];
logic [31:0] fifo_zval [0:FIFO_DEPTH-1];
// Brick 3 — per-primitive TRI affine interpolation context.
// Q16.16 per-pixel gradients for R/G/B/A/Z, computed once at push
// (one divide each). Z base value at post-swap vertex 0 (v0); the
// color bases come from ras_c0_q[..] (already captured). x0/y0 are
// ras_v0_x/ras_v0_y. fifo_tri_zte gates whether the TRI feeds the
// Brick-2b Z path. ras_tri_active gates the whole affine path.
logic fifo_tri_active [0:FIFO_DEPTH-1];
// COMBINED probe — per-primitive flag: this slot is the textured +
// alpha + depth-tested triangle that the combined FSM owns. Set only
// when COMBINED_TAZ=1 (close_combined). Always 0 at param=0.
logic fifo_combined [0:FIFO_DEPTH-1];
logic signed [31:0] fifo_dr_dx [0:FIFO_DEPTH-1];
logic signed [31:0] fifo_dr_dy [0:FIFO_DEPTH-1];
logic signed [31:0] fifo_dg_dx [0:FIFO_DEPTH-1];
logic signed [31:0] fifo_dg_dy [0:FIFO_DEPTH-1];
logic signed [31:0] fifo_db_dx [0:FIFO_DEPTH-1];
logic signed [31:0] fifo_db_dy [0:FIFO_DEPTH-1];
logic signed [31:0] fifo_da_dx [0:FIFO_DEPTH-1];
logic signed [31:0] fifo_da_dy [0:FIFO_DEPTH-1];
logic signed [31:0] fifo_dz_dx [0:FIFO_DEPTH-1];
logic signed [31:0] fifo_dz_dy [0:FIFO_DEPTH-1];
logic [31:0] fifo_z0 [0:FIFO_DEPTH-1];
// Textured-triangle rung — per-slot U/V affine context. The two
// extra interpolated attributes (U,V) ride the SAME shared-divider
// setup engine as R/G/B/A/Z (GRAD_STEPS grows to cover their dx/dy);
// the per-pixel evaluator reuses the existing right-sized 32x16
// interp_affine path. fifo_u0_base/v0_base are the integer texel
// coords at post-swap vertex 0 (the affine base, like ras_z0 for Z);
// the 4 gradient fields are written by the engine on the new steps.
// The texture descriptor (tme/tex_base/tbw/tpsm) reuses the existing
// fifo_tme/fifo_tex_base/fifo_tbw/fifo_tpsm fields (set to 0 for an
// untextured TRI, so the legacy Gouraud path is unchanged).
logic [10:0] fifo_u0_base [0:FIFO_DEPTH-1];
logic [10:0] fifo_v0_base [0:FIFO_DEPTH-1];
logic [31:0] fifo_u1v [0:FIFO_DEPTH-1]; // post-swap v1 {v,u} packed (for engine recompute)
logic [31:0] fifo_u2v [0:FIFO_DEPTH-1]; // post-swap v2 {v,u} packed
logic signed [31:0] fifo_du_dx_t [0:FIFO_DEPTH-1];
logic signed [31:0] fifo_du_dy_t [0:FIFO_DEPTH-1];
logic signed [31:0] fifo_dv_dx_t [0:FIFO_DEPTH-1];
logic signed [31:0] fifo_dv_dy_t [0:FIFO_DEPTH-1];
// Ch301 perspective — per-slot S/T/Q affine-of-perspective-attribute
// context. A TRI uses the perspective path iff fifo_persp[slot]=1
// (PERSPECTIVE_CORRECT=1 AND TME AND ST-sourced). S/T/Q are 24-bit
// fixed-point (FRAC=12). The base values are post-swap vertex 0; v1/v2
// S/T/Q are packed so the decoupled gradient engine can recompute the
// numerators at schedule time (same role fifo_u1v/u2v play for U/V).
// fifo_stq1/2 pack {16'd0, T(24), S(24)}; fifo_q1/q2 carry Q(24).
// The 6 gradient fields (dS/dx,dS/dy,dT/dx,dT/dy,dQ/dx,dQ/dy) are
// signed Q16.16, solved by the shared divider engine on steps 14..19.
// These declarations are ALWAYS present (unused FIFO fields cost
// nothing behaviorally at param=0); the param-gated USE is the
// localparam GRAD_STEPS / fifo_persp / ras_persp plumbing, so the
// param=0 build is byte-identical.
logic fifo_persp [0:FIFO_DEPTH-1];
logic [23:0] fifo_s0_base [0:FIFO_DEPTH-1];
logic [23:0] fifo_t0_base [0:FIFO_DEPTH-1];
logic [23:0] fifo_q0_base [0:FIFO_DEPTH-1];
logic [63:0] fifo_stq1 [0:FIFO_DEPTH-1]; // post-swap v1 {16'd0,T,S}
logic [63:0] fifo_stq2 [0:FIFO_DEPTH-1]; // post-swap v2 {16'd0,T,S}
logic [23:0] fifo_q1 [0:FIFO_DEPTH-1]; // post-swap v1 Q
logic [23:0] fifo_q2 [0:FIFO_DEPTH-1]; // post-swap v2 Q
logic signed [31:0] fifo_ds_dx [0:FIFO_DEPTH-1];
logic signed [31:0] fifo_ds_dy [0:FIFO_DEPTH-1];
logic signed [31:0] fifo_dt_dx [0:FIFO_DEPTH-1];
logic signed [31:0] fifo_dt_dy [0:FIFO_DEPTH-1];
logic signed [31:0] fifo_dq_dx [0:FIFO_DEPTH-1];
logic signed [31:0] fifo_dq_dy [0:FIFO_DEPTH-1];
// Brick-3 fix — per-slot post-swap v1.z / v2.z, stored so the
// gradient setup engine can recompute the Z numerators for a slot
// at engine-start time (the engine is now decoupled from push and
// self-schedules across pending slots; v0.z is already in fifo_z0).
logic [31:0] fifo_v1z [0:FIFO_DEPTH-1];
logic [31:0] fifo_v2z [0:FIFO_DEPTH-1];
// Ch295 — per-slot "gradients not yet solved" flag. Set when a TRI
// is pushed; cleared by the shared single-divider setup engine when
// it finishes writing all 10 gradient fields for that slot. pop_ok
// is blocked for a slot while this is set, so the rasterizer never
// pops a TRI whose affine gradients are still being computed.
logic fifo_grad_pending [0:FIFO_DEPTH-1];
// Ch171 — widen pointers / counter for FIFO_DEPTH=4. The
// pre-Ch171 declarations were 1-bit pointer + 2-bit counter
// tuned for FIFO_DEPTH=2; bumping depth without widening these
// truncated indices to 1 bit so the FIFO addressed only 2 of
// its 4 slots (re-introduced the "drop sprite 4" bug under a
// different shape). PTR_W / CNT_W follow FIFO_DEPTH.
localparam int FIFO_PTR_W = $clog2(FIFO_DEPTH); // 2 for DEPTH=4
localparam int FIFO_CNT_W = $clog2(FIFO_DEPTH + 1); // 3 for DEPTH=4
logic [FIFO_PTR_W-1:0] fifo_wptr;
logic [FIFO_PTR_W-1:0] fifo_rptr;
logic [FIFO_CNT_W-1:0] fifo_count;
logic fifo_empty;
logic fifo_full;
assign fifo_empty = (fifo_count == '0);
assign fifo_full = (fifo_count == FIFO_CNT_W'(FIFO_DEPTH));
// Ch295 / Brick-3 fix — single time-shared gradient divider + setup
// engine, DECOUPLED from the push.
//
// The engine self-schedules: whenever it is idle and some FIFO slot
// is flagged grad_pending, it picks the lowest such slot, recomputes
// that slot's 10 pre-shifted numerators (grad_num_q) and the common
// divisor (grad_det_q) from the slot's STORED post-swap vertex /
// colour / Z data, then iterates ONE divide per cycle over the 10
// numerators (grad_step 0..9), writing quo[31:0] into the matching
// fifo_d*_* field of grad_slot. On the last step it clears that
// slot's grad_pending and goes idle, then picks the next pending
// slot (if any). Setup is ~10 cycles per triangle — negligible
// against the thousands of cycles a TRI scan takes — and pop_ok
// holds each slot until its own gradients are ready, so the engine
// can never be clobbered by a back-to-back push and no primitive is
// dropped. The numerators are recomputed from FIFO storage (the same
// grad_num_dadx/dady functions as before), so the arithmetic is
// BIT-IDENTICAL to the pre-fix per-push solve.
// GRAD_STEPS = 14: R/G/B/A/Z (5 attrs × dx,dy = 10 steps) PLUS the
// textured-triangle rung's U,V (2 attrs × dx,dy = 4 more steps:
// 10=dU/dx 11=dU/dy 12=dV/dx 13=dV/dy). The shared divider simply
// iterates 4 more cycles per triangle — NO new divider, NO new
// per-pixel division. grad_step is 4-bit (0..15) so it already
// covers 0..13; only the localparam + the step decode grow.
// Ch301 perspective — GRAD_STEPS grows to 20 ONLY when the
// perspective path is compiled in (steps 14..19 = dS/dx,dS/dy,
// dT/dx,dT/dy,dQ/dx,dQ/dy). When PERSPECTIVE_CORRECT=0 it stays 14 so
// the engine iterates the same 14 steps and the grad_num_q /
// grad_load_num arrays are the same size as before — the param=0
// build is unchanged in size-effect and behavior.
localparam int GRAD_STEPS = PERSPECTIVE_CORRECT ? 20 : 14;
logic grad_busy;
logic [4:0] grad_step; // 0..13 (or 0..19 w/ perspective)
logic [4:0] grad_settle; // Ch352 — divide settle counter (0..GRAD_DIV_CYCLES-1)
// (* preserve *) — Ch352 (Codex): keep grad_result_q as a named 32-reg endpoint. Without it, synthesis
// optimized/renamed it away, the multicycle -to matched 0 registers, the exception applied to NOTHING, and
// Place churned timing the -66ns divider single-cycle. The SDC asserts the dest count is exactly 32.
(* preserve *) logic signed [31:0] grad_result_q; // settled divide captured here (the ONE multicycle endpoint)
logic grad_writing; // Ch352 — write phase: grad_result_q -> fifo (single-cycle)
logic [FIFO_PTR_W-1:0] grad_slot;
logic signed [31:0] grad_det_q;
logic signed [55:0] grad_num_q [0:GRAD_STEPS-1];
// Combinational "is there a slot waiting for gradients?" scan and the
// index of the lowest such slot. Used only when the engine is idle to
// pick the next triangle to solve.
logic grad_pending_any;
logic [FIFO_PTR_W-1:0] grad_pending_slot;
always_comb begin
grad_pending_any = 1'b0;
grad_pending_slot = '0;
for (int s = FIFO_DEPTH-1; s >= 0; s = s - 1) begin
if (fifo_grad_pending[s]) begin
grad_pending_any = 1'b1;
grad_pending_slot = FIFO_PTR_W'(s);
end
end
end
// Combinational numerators + divisor for the slot the engine is about
// to load (grad_pending_slot). These reuse the grad_num_dadx/dady
// functions on the slot's STORED post-swap data — the same pure
// multiply/subtract/shift expressions evaluated at push time before
// the fix. fifo_sa[slot] is the post-swap determinant (== tri_det_post
// by construction). Only ONE slot's worth is computed at a time (mux
// by grad_pending_slot), so this is one set of the existing functions,
// not FIFO_DEPTH copies — no new wide datapath.
logic signed [55:0] grad_load_num [0:GRAD_STEPS-1];
logic signed [31:0] grad_load_det;
logic [63:0] grad_ld_c0, grad_ld_c1, grad_ld_c2;
logic [11:0] grad_ld_v0x, grad_ld_v0y, grad_ld_v1x, grad_ld_v1y, grad_ld_v2x, grad_ld_v2y;
logic [31:0] grad_ld_z0, grad_ld_z1, grad_ld_z2;
// Textured-triangle rung — per-vertex U/V for the slot being loaded.
logic [10:0] grad_ld_u0, grad_ld_v0u, grad_ld_u1, grad_ld_v1u, grad_ld_u2, grad_ld_v2u;
// Ch301 perspective — per-vertex S/T/Q for the slot being loaded.
// Declared unconditionally (cheap); only ASSIGNED + used by the
// param-gated grad_load_num[14..19] block below.
logic [23:0] grad_ld_s0, grad_ld_t0, grad_ld_q0;
logic [23:0] grad_ld_s1, grad_ld_t1, grad_ld_q1;
logic [23:0] grad_ld_s2, grad_ld_t2, grad_ld_q2;
// Ch328 (1b) step C — registered M20K prefetch for the gradient engine. grad_word_q holds
// attr_ram[grad_slot] (1-cyc sync read, loaded in the prefetch cycle); the grad_ld_* unpack it
// (LHS-concat, pack order) instead of reading the input fifo_* registers. gw_d_* sink the
// fields the gradient solve doesn't use. (pop_ok holds the slot until grad_pending clears, so
// the extra prefetch cycle leaves the rendered framebuffer byte-identical.)
reg [PRIM_ATTR_W-1:0] grad_word_q;
reg grad_prefetching;
reg grad_prefetch2; // Ch328 1c — 2nd prefetch stage (the +1 registered-read-latency cycle)
// Ch328 1c — grad issues its (single-port) read combinationally during its idle+pending cycle;
// attr_rd_q holds attr_ram[grad_pending_slot] the NEXT cycle. mp/pop read-issues are wired when
// those readers are converted (tied 0 here so the shared mux + assertion are well-defined).
assign grad_rd_issue = !grad_busy && !grad_prefetching && !grad_prefetch2 && grad_pending_any;
assign grad_rd_addr = grad_pending_slot;
// Ch328 1c — mp_load_prim read-stall FSM. A load site latches its slot/bbox + sets mp_rd_state
// to RDS_ISSUE and HOLDS the tile phase; RDS_ISSUE drives attr_rd_addr (comb) so attr_rd_q is
// valid in RDS_CONSUME, where mp_load_prim consumes it and the phase finally advances.
localparam logic [1:0] RDS_IDLE = 2'd0, RDS_ISSUE = 2'd1, RDS_CONSUME = 2'd2;
reg [1:0] mp_rd_state;
reg [FIFO_PTR_W-1:0] mp_pend_slot;
reg [11:0] mp_pend_cxlo, mp_pend_cxhi, mp_pend_cylo, mp_pend_cyhi;
assign mp_rd_issue = (mp_rd_state == RDS_ISSUE);
assign mp_rd_addr = mp_pend_slot;
// Ch328 1c — streaming pop is NOT on the shared M20K port (it stays a single-cycle register read):
// a multi-cycle M20K read would break the full-FIFO push+pop same-cycle slot-free concurrency, and
// the shallow streaming path gains no capacity benefit from M20K. So pop's port request is tied off.
assign pop_rd_issue = 1'b0;
assign pop_rd_addr = '0;
logic [11:0] gw_v0x,gw_v0y,gw_v1x,gw_v1y,gw_v2x,gw_v2y; logic [2:0] gw_d_bias; logic signed [31:0] gw_sa;
logic [63:0] gw_d_color, gw_c0, gw_c1, gw_c2; logic [8:0] gw_d_fbp; logic [5:0] gw_d_fbw,gw_d_psm; logic [1:0] gw_d_bpp;
logic gw_d_tme; logic [10:0] gw_d_u0,gw_d_v0,gw_d_u1,gw_d_v1; logic [31:0] gw_d_texb; logic [13:0] gw_d_tbw; logic [5:0] gw_d_tpsm;
logic [1:0] gw_d_wms,gw_d_wmt; logic [3:0] gw_d_tw,gw_d_th; logic signed [31:0] gw_d_dudx,gw_d_dvdy; logic gw_d_abe,gw_d_flin;
logic [1:0] gw_d_aa,gw_d_ab,gw_d_ac,gw_d_ad; logic [7:0] gw_d_afix; logic gw_d_zte; logic [1:0] gw_d_ztst; logic gw_d_zmsk;
logic [8:0] gw_d_zbp; logic [31:0] gw_d_zval;
logic [31:0] gw_z0; logic [10:0] gw_u0b,gw_v0b; logic [31:0] gw_u1v,gw_u2v;
logic gw_d_persp; logic [23:0] gw_s0b,gw_t0b,gw_q0b; logic [63:0] gw_stq1,gw_stq2; logic [23:0] gw_q1,gw_q2;
logic [31:0] gw_v1z,gw_v2z;
always_comb begin
{ gw_v0x,gw_v0y,gw_v1x,gw_v1y,gw_v2x,gw_v2y,gw_d_bias,gw_sa,gw_d_color,gw_c0,gw_c1,gw_c2,
gw_d_fbp,gw_d_fbw,gw_d_psm,gw_d_bpp,gw_d_tme,gw_d_u0,gw_d_v0,gw_d_u1,gw_d_v1,gw_d_texb,gw_d_tbw,gw_d_tpsm,
gw_d_wms,gw_d_wmt,gw_d_tw,gw_d_th,gw_d_dudx,gw_d_dvdy,gw_d_abe,gw_d_flin,gw_d_aa,gw_d_ab,gw_d_ac,gw_d_ad,gw_d_afix,
gw_d_zte,gw_d_ztst,gw_d_zmsk,gw_d_zbp,gw_d_zval,gw_z0,gw_u0b,gw_v0b,gw_u1v,gw_u2v,
gw_d_persp,gw_s0b,gw_t0b,gw_q0b,gw_stq1,gw_stq2,gw_q1,gw_q2,gw_v1z,gw_v2z } = grad_word_q;
grad_ld_c0 = gw_c0;
grad_ld_c1 = gw_c1;
grad_ld_c2 = gw_c2;
grad_ld_v0x = gw_v0x;
grad_ld_v0y = gw_v0y;
grad_ld_v1x = gw_v1x;
grad_ld_v1y = gw_v1y;
grad_ld_v2x = gw_v2x;
grad_ld_v2y = gw_v2y;
grad_ld_z0 = gw_z0;
grad_ld_z1 = gw_v1z;
grad_ld_z2 = gw_v2z;
grad_ld_u0 = gw_u0b;
grad_ld_v0u = gw_v0b;
grad_ld_u1 = gw_u1v [10:0];
grad_ld_v1u = gw_u1v [26:16];
grad_ld_u2 = gw_u2v [10:0];
grad_ld_v2u = gw_u2v [26:16];
grad_load_det = gw_sa;
grad_load_num[0] = grad_num_dadx({24'd0, grad_ld_c0[7:0]}, {24'd0, grad_ld_c1[7:0]}, {24'd0, grad_ld_c2[7:0]},
grad_ld_v0x, grad_ld_v0y, grad_ld_v1x, grad_ld_v1y, grad_ld_v2x, grad_ld_v2y);
grad_load_num[1] = grad_num_dady({24'd0, grad_ld_c0[7:0]}, {24'd0, grad_ld_c1[7:0]}, {24'd0, grad_ld_c2[7:0]},
grad_ld_v0x, grad_ld_v0y, grad_ld_v1x, grad_ld_v1y, grad_ld_v2x, grad_ld_v2y);
grad_load_num[2] = grad_num_dadx({24'd0, grad_ld_c0[15:8]}, {24'd0, grad_ld_c1[15:8]}, {24'd0, grad_ld_c2[15:8]},
grad_ld_v0x, grad_ld_v0y, grad_ld_v1x, grad_ld_v1y, grad_ld_v2x, grad_ld_v2y);
grad_load_num[3] = grad_num_dady({24'd0, grad_ld_c0[15:8]}, {24'd0, grad_ld_c1[15:8]}, {24'd0, grad_ld_c2[15:8]},
grad_ld_v0x, grad_ld_v0y, grad_ld_v1x, grad_ld_v1y, grad_ld_v2x, grad_ld_v2y);
grad_load_num[4] = grad_num_dadx({24'd0, grad_ld_c0[23:16]}, {24'd0, grad_ld_c1[23:16]}, {24'd0, grad_ld_c2[23:16]},
grad_ld_v0x, grad_ld_v0y, grad_ld_v1x, grad_ld_v1y, grad_ld_v2x, grad_ld_v2y);
grad_load_num[5] = grad_num_dady({24'd0, grad_ld_c0[23:16]}, {24'd0, grad_ld_c1[23:16]}, {24'd0, grad_ld_c2[23:16]},
grad_ld_v0x, grad_ld_v0y, grad_ld_v1x, grad_ld_v1y, grad_ld_v2x, grad_ld_v2y);
grad_load_num[6] = grad_num_dadx({24'd0, grad_ld_c0[31:24]}, {24'd0, grad_ld_c1[31:24]}, {24'd0, grad_ld_c2[31:24]},
grad_ld_v0x, grad_ld_v0y, grad_ld_v1x, grad_ld_v1y, grad_ld_v2x, grad_ld_v2y);
grad_load_num[7] = grad_num_dady({24'd0, grad_ld_c0[31:24]}, {24'd0, grad_ld_c1[31:24]}, {24'd0, grad_ld_c2[31:24]},
grad_ld_v0x, grad_ld_v0y, grad_ld_v1x, grad_ld_v1y, grad_ld_v2x, grad_ld_v2y);
grad_load_num[8] = grad_num_dadx(grad_ld_z0, grad_ld_z1, grad_ld_z2,
grad_ld_v0x, grad_ld_v0y, grad_ld_v1x, grad_ld_v1y, grad_ld_v2x, grad_ld_v2y);
grad_load_num[9] = grad_num_dady(grad_ld_z0, grad_ld_z1, grad_ld_z2,
grad_ld_v0x, grad_ld_v0y, grad_ld_v1x, grad_ld_v1y, grad_ld_v2x, grad_ld_v2y);
// Textured-triangle rung — U (steps 10/11) and V (steps 12/13).
// U/V are 11-bit unsigned integer texel coords; zero-extend to
// the signed[31:0] attribute input the existing numerator
// functions take. Same shared divisor (grad_load_det) → same
// affine plane solve as the colour/Z attributes.
grad_load_num[10] = grad_num_dadx({21'd0, grad_ld_u0}, {21'd0, grad_ld_u1}, {21'd0, grad_ld_u2},
grad_ld_v0x, grad_ld_v0y, grad_ld_v1x, grad_ld_v1y, grad_ld_v2x, grad_ld_v2y);
grad_load_num[11] = grad_num_dady({21'd0, grad_ld_u0}, {21'd0, grad_ld_u1}, {21'd0, grad_ld_u2},
grad_ld_v0x, grad_ld_v0y, grad_ld_v1x, grad_ld_v1y, grad_ld_v2x, grad_ld_v2y);
grad_load_num[12] = grad_num_dadx({21'd0, grad_ld_v0u}, {21'd0, grad_ld_v1u}, {21'd0, grad_ld_v2u},
grad_ld_v0x, grad_ld_v0y, grad_ld_v1x, grad_ld_v1y, grad_ld_v2x, grad_ld_v2y);
grad_load_num[13] = grad_num_dady({21'd0, grad_ld_v0u}, {21'd0, grad_ld_v1u}, {21'd0, grad_ld_v2u},
grad_ld_v0x, grad_ld_v0y, grad_ld_v1x, grad_ld_v1y, grad_ld_v2x, grad_ld_v2y);
end
// Ch301 perspective — per-vertex S/T/Q load + the 6 extra gradient
// numerators (steps 14..19), GUARDED so the grad_load_num indices
// 14..19 (which only exist when GRAD_STEPS==20) are never referenced
// in the param=0 build. S/T/Q are 24-bit fixed-point; zero-extend to
// the signed[31:0] attribute input of the SAME grad_num_dadx/dady
// functions (identical structure to the U/V steps), sharing the same
// divisor (grad_load_det) → same affine plane solve over S_fp/T_fp/
// Q_fp. The per-pixel perspective divide (S/Q, T/Q) is done LATER by
// gs_persp_uv on the emit side (human-wired); here we only interpolate
// the three perspective attributes linearly in screen space.
generate
if (PERSPECTIVE_CORRECT) begin : g_grad_stq
always_comb begin
grad_ld_s0 = gw_s0b;
grad_ld_t0 = gw_t0b;
grad_ld_q0 = gw_q0b;
grad_ld_s1 = gw_stq1[23:0];
grad_ld_t1 = gw_stq1[47:24]; // pack is {16'd0,T[47:24],S[23:0]}
grad_ld_q1 = gw_q1;
grad_ld_s2 = gw_stq2[23:0];
grad_ld_t2 = gw_stq2[47:24];
grad_ld_q2 = gw_q2;
// S (steps 14/15)
grad_load_num[14] = grad_num_dadx({8'd0, grad_ld_s0}, {8'd0, grad_ld_s1}, {8'd0, grad_ld_s2},
grad_ld_v0x, grad_ld_v0y, grad_ld_v1x, grad_ld_v1y, grad_ld_v2x, grad_ld_v2y);
grad_load_num[15] = grad_num_dady({8'd0, grad_ld_s0}, {8'd0, grad_ld_s1}, {8'd0, grad_ld_s2},
grad_ld_v0x, grad_ld_v0y, grad_ld_v1x, grad_ld_v1y, grad_ld_v2x, grad_ld_v2y);
// T (steps 16/17)
grad_load_num[16] = grad_num_dadx({8'd0, grad_ld_t0}, {8'd0, grad_ld_t1}, {8'd0, grad_ld_t2},
grad_ld_v0x, grad_ld_v0y, grad_ld_v1x, grad_ld_v1y, grad_ld_v2x, grad_ld_v2y);
grad_load_num[17] = grad_num_dady({8'd0, grad_ld_t0}, {8'd0, grad_ld_t1}, {8'd0, grad_ld_t2},
grad_ld_v0x, grad_ld_v0y, grad_ld_v1x, grad_ld_v1y, grad_ld_v2x, grad_ld_v2y);
// Q (steps 18/19)
grad_load_num[18] = grad_num_dadx({8'd0, grad_ld_q0}, {8'd0, grad_ld_q1}, {8'd0, grad_ld_q2},
grad_ld_v0x, grad_ld_v0y, grad_ld_v1x, grad_ld_v1y, grad_ld_v2x, grad_ld_v2y);
grad_load_num[19] = grad_num_dady({8'd0, grad_ld_q0}, {8'd0, grad_ld_q1}, {8'd0, grad_ld_q2},
grad_ld_v0x, grad_ld_v0y, grad_ld_v1x, grad_ld_v1y, grad_ld_v2x, grad_ld_v2y);
end
end else begin : g_no_grad_stq
// Tie the S/T/Q load wires low so they have a single driver
// even when the perspective numerators are not compiled in.
assign grad_ld_s0 = 24'd0; assign grad_ld_t0 = 24'd0; assign grad_ld_q0 = 24'd0;
assign grad_ld_s1 = 24'd0; assign grad_ld_t1 = 24'd0; assign grad_ld_q1 = 24'd0;
assign grad_ld_s2 = 24'd0; assign grad_ld_t2 = 24'd0; assign grad_ld_q2 = 24'd0;
end
endgenerate
// Combinational divide of the currently-selected numerator by the
// captured determinant — the SINGLE divider in the triangle setup
// path. Mirrors the previous per-call expression exactly:
// (num <<< 16) / det, truncated to [31:0]. The <<<16 was already
// folded into grad_num_q by grad_num_dadx/dady.
logic signed [55:0] grad_quo;
always_comb begin
if (grad_det_q == 32'sd0)
grad_quo = 56'sd0;
else
grad_quo = grad_num_q[grad_step]
/ $signed({{24{grad_det_q[31]}}, grad_det_q});
end
logic signed [31:0] grad_result;
assign grad_result = grad_quo[31:0];
// Ch352 — SEQUENTIAL divider (GRAD_SEQ_DIVIDER=1): registered iterations, no combinational cone. The FSM
// pulses div_start with grad_num_q[grad_step]/sign-extended grad_det_q, waits div_done, and captures
// div_quo[31:0] into grad_result_q. Bit-exact to the combinational `/` above (tb_gs_grad_divider). When
// GRAD_SEQ_DIVIDER=0 the instance is generate-stripped and these tie inert (combinational path used).
logic div_start;
logic signed [55:0] div_quo;
logic div_busy_w, div_done_w;
generate
if (GRAD_SEQ_DIVIDER) begin : g_seqdiv
gs_grad_divider #(.W(56)) u_grad_div (
.clk(clk), .rst_n(rst_n), .start(div_start),
.num(grad_num_q[grad_step]),
.den($signed({{24{grad_det_q[31]}}, grad_det_q})),
.quo(div_quo), .busy(div_busy_w), .done(div_done_w)
);
end else begin : g_no_seqdiv
assign div_quo = 56'sd0; assign div_busy_w = 1'b0; assign div_done_w = 1'b0;
end
endgenerate
// Ch352 — FIFO write data. GRAD_SEQ_DIVIDER: the captured div_quo lives in grad_result_q. Else the original
// path: GRAD_DIV_CYCLES==1 writes the combinational divide directly (byte-identical), K>1 writes grad_result_q.
wire signed [31:0] grad_wr_data = GRAD_SEQ_DIVIDER ? grad_result_q
: ((GRAD_DIV_CYCLES == 1) ? grad_result : grad_result_q);
// Ch172 — expose FIFO-full so the wrapper can route it back to
// the GIF parser's in_ready gate.
// Brick-3 fix — backpressure is REAL fifo_full ONLY. The gradient
// setup engine no longer gates the PUSH side: gating push on
// grad_busy dropped back-to-back triangles that arrived during a
// prior triangle's ~10-cycle gradient solve (tb_ee_core_gif_raster_
// queue). Instead the engine is decoupled — it self-schedules across
// FIFO slots flagged grad_pending and computes each slot's gradients
// in turn; pop_ok holds a slot until its own gradients are ready.
// The queue therefore accepts every incoming primitive.
assign raster_fifo_full = fifo_full;
// ==================================================================
// COMBINED PROBE — per-pixel 5-beat textured+alpha+depth FSM.
//
// When ras_combined is set (only at COMBINED_TAZ=1, for a TRI that is
// TME PSMCT32 DECAL + ABE source-over + ZTE PSMZ32), the legacy S0
// walker is HELD at the current candidate pixel (ras_cur_x/y) and a
// dedicated 5-beat FSM runs ONE pixel at a time. The single read2
// port (arbitrated in the BRAM top) is used by at most ONE consumer
// per cycle, and the single write port emits at most one beat per
// cycle, so the existing top-level mux + mutual-exclusion $error
// asserts keep holding without any top change.
//
// Beat schedule (read2 is a 1-cycle REGISTERED read in the BRAM top:
// an enable+addr presented on beat N returns data on beat N+1):
// CB_Z (0): if pixel inside -> assert z_rd_en + z_rd_addr (held Z
// addr). No texel/dest read, no write.
// If pixel NOT inside -> advance walker immediately.
// CB_ZW (1): z_rd_data ready -> latch stored Z, compute ztest_pass.
// PASS -> assert texel request (tex in_valid) with held
// (u,v); texel returns at beat 2.
// FAIL -> NO texel/dest read, NO write; advance walker.
// CB_T (2): s1_tex_color ready -> latch as source color Cs/As.
// Assert fb_rd_en + fb_rd_addr (held FB addr); dest
// color returns at beat 3.
// CB_FB (3): dest color ready -> blend (Cs=texel, Cd=dest,
// As=texel[31:24]). WRITE #1: raster_pixel_emit color
// (PSMCT32, be=1111) to the FB addr.
// CB_ZWR (4): WRITE #2: raster_pixel_emit frag_z to the Z addr
// (be=1111, psm=0). If ZMSK, skip the write but still
// consume the beat. Then advance walker.
//
// The walker advances to the next candidate pixel only after BOTH the
// color write (beat 3) AND the Z write (beat 4) have completed (or
// after a not-inside / depth-FAIL early-out, where nothing is read or
// written for that pixel).
// ==================================================================
// Ch310 — CB_TWAIT is a NEW beat inserted between CB_ZW and CB_T, entered
// ONLY on the bilinear path (bili_now). It HOLDS the FSM while the 4-tap
// gs_texture_unit sampler runs its multi-beat read. When BILINEAR_ENABLE=0
// bili_now is a compile-time-constant 0 so CB_TWAIT is never entered and
// the encoding/behavior of the other beats is unchanged (byte-identical).
typedef enum logic [2:0] {
CB_Z, CB_ZW, CB_T, CB_FB, CB_ZWR, CB_TWAIT
} comb_beat_e;
comb_beat_e comb_state_r; // current beat (register)
// Held-candidate-pixel geometry — combinational on the FROZEN walker
// coords (ras_cur_x/ras_cur_y). The combined primitive is always a
// TRIANGLE, so the inside test is the biased edge-function test on the
// held pixel (same math as the S1 stage, evaluated at ras_cur_*).
logic [31:0] comb_e0, comb_e1, comb_e2;
logic comb_pix_inside_w;
assign comb_e0 = edge_fn(ras_cur_x, ras_cur_y, ras_v0_x, ras_v0_y, ras_v1_x, ras_v1_y);
assign comb_e1 = edge_fn(ras_cur_x, ras_cur_y, ras_v1_x, ras_v1_y, ras_v2_x, ras_v2_y);
assign comb_e2 = edge_fn(ras_cur_x, ras_cur_y, ras_v2_x, ras_v2_y, ras_v0_x, ras_v0_y);
assign comb_pix_inside_w =
($signed(comb_e0 + {31'd0, ras_bias[0]}) <= 32'sd0)
& ($signed(comb_e1 + {31'd0, ras_bias[1]}) <= 32'sd0)
& ($signed(comb_e2 + {31'd0, ras_bias[2]}) <= 32'sd0);
// Held-pixel offsets from post-swap v0 (same right-sized signed-16
// narrowing the S2 interpolators use) for Z interpolation.
logic signed [31:0] comb_dx, comb_dy;
assign comb_dx = $signed({20'd0, ras_cur_x}) - $signed({20'd0, ras_v0_x});
assign comb_dy = $signed({20'd0, ras_cur_y}) - $signed({20'd0, ras_v0_y});
// Held-pixel interpolated fragment Z (affine, same evaluator as the
// legacy TRI Z path).
logic [31:0] comb_frag_z;
assign comb_frag_z = interp_affine_z(ras_z0, ras_dz_dx, ras_dz_dy, comb_dx, comb_dy);
// Held-pixel FB and Z byte addresses — linear PSMCT32 layout (combined
// dest is PSMCT32, non-swizzled in this probe). Mirrors s2_fb_addr_
// linear / s2_z_addr but on the held walker coords.
logic [31:0] comb_pixel_index;
logic [31:0] comb_fb_addr;
logic [31:0] comb_z_addr;
assign comb_pixel_index = ({20'd0, ras_cur_y} * ({26'd0, ras_fbw} << 6))
+ {20'd0, ras_cur_x};
assign comb_fb_addr = ({23'd0, ras_fbp} << 11) + (comb_pixel_index << ras_bpp_shift);
assign comb_z_addr = ({23'd0, ras_zbp} << 11) + (comb_pixel_index << 2);
// Per-beat read-strobe contributions (driven only in combined mode).
// comb_z_req : beat 0, inside pixel only.
// comb_tex_req: beat 1, on depth PASS only (feeds the texture unit
// in_valid; texel returns beat 2).
// comb_fb_req : beat 2, on depth PASS only.
logic comb_z_req;
logic comb_tex_req;
logic comb_fb_req;
// Latched combined per-pixel state.
logic comb_ztest_pass_r; // depth test result, latched at beat 1
logic [31:0] comb_zstored_r; // stored Z captured at beat 1
logic [31:0] comb_cs_r; // source color (texel ABGR)
logic [7:0] comb_as_r; // source alpha (texel[31:24])
// Ch333 — TEX0.TFX (list-level): 0=MODULATE (texel*color), 1=DECAL (texel replaces). The
// combined-TAZ source color comb_cs_r honors it below; DECAL is the effective default for the
// Ch330-332 anchors (their TEX0 carries TFX=1). MODULATE feeds the staging RGBAQ (provoking-
// vertex color ras_c2_q) through the texel — with a unity (0x80) texture the output IS the
// RGBAQ color, so the host controls per-primitive color at runtime. Per-vertex gouraud
// gradient is a later rung (needs the interp color aligned to s1_tex_color at CB_T).
wire [1:0] comb_tfx = tex0_1_q[36:35];
wire comb_modulate = (comb_tfx == 2'b00);
function automatic logic [7:0] gs_mod8(input logic [7:0] t, input logic [7:0] c);
logic [15:0] p; begin p = ({8'd0,t} * {8'd0,c}) >> 7; gs_mod8 = (p > 16'd255) ? 8'hFF : p[7:0]; end
endfunction
function automatic logic [31:0] gs_modulate_abgr(input logic [31:0] tex, input logic [31:0] col);
gs_modulate_abgr = {tex[31:24], // A: keep texel alpha (opaque blend via unity tex)
gs_mod8(tex[23:16], col[23:16]), // B
gs_mod8(tex[15:8], col[15:8]), // G
gs_mod8(tex[7:0], col[7:0])}; // R
endfunction
// Ch304 — Z value the combined depth test compares against. In the
// legacy VRAM combined path (TILE_LOCAL=0) this is the VRAM read-back
// (z_rd_data); in tile-local mode it is the on-chip tile Z RAM read
// (tile_z_rdata, registered, valid at beat1). tile_active is constant
// 0 at TILE_LOCAL=0 so this collapses to z_rd_data (byte-identical).
// tile_z_rdata is a module-scope net assigned below near the tile RAM.
// Ch305+ tile COLOR RAM width: 16 (PSMCT16) when TILE_COLOR_PSMCT16,
// else 32 (PSMCT32). At default OFF this is 32 so the COLOR RAM +
// its data nets keep their legacy 32-bit width (byte-identical).
localparam int TILE_COLOR_W = TILE_COLOR_PSMCT16 ? 16 : 32;
logic [TILE_COLOR_W-1:0] tile_color_rdata; // registered read data, color RAM
logic [TILE_COLOR_W-1:0] tile_color_wdata;
logic [31:0] tile_z_rdata; // registered read data, Z RAM
logic [31:0] tile_z_wdata;
logic [31:0] comb_z_compare_w;
assign comb_z_compare_w = tile_active ? tile_z_rdata : z_rd_data;
// Combinational depth test on beat 1 (CB_ZW): z_rd_data is the
// 1-cycle-registered read of the address presented on beat 0
// (CB_Z) — i.e. THIS held pixel's stored Z. Compare per ras_ztst.
// GEQUAL/GREATER use frag_z vs stored_z (PS2: larger Z = nearer).
logic comb_ztest_pass_w;
always_comb begin
unique case (ras_ztst)
2'd0: comb_ztest_pass_w = 1'b0; // NEVER
2'd1: comb_ztest_pass_w = 1'b1; // ALWAYS
// Ch304 — comb_z_compare_w is z_rd_data (VRAM) at TILE_LOCAL=0
// and tile_z_rdata (on-chip) in tile mode; byte-identical else.
2'd2: comb_ztest_pass_w = (comb_frag_z >= comb_z_compare_w); // GEQUAL
2'd3: comb_ztest_pass_w = (comb_frag_z > comb_z_compare_w); // GREATER
default: comb_ztest_pass_w = 1'b0;
endcase
end
// Ch304 — dest color feeding the combined blend. In the legacy VRAM
// combined path (TILE_LOCAL=0) the dest is the VRAM read-back
// (fb_rd_data). In tile-local mode (tile_active) the dest is the
// ON-CHIP tile color RAM read (tile_color_rdata, valid one cycle after
// the beat-2 raddr was presented). comb_blend_cd is constant fb_rd_data
// at TILE_LOCAL=0 (tile_active const 0), so the combined VRAM path is
// byte-identical. (Declared below near the tile RAM; forward net.)
logic [31:0] comb_blend_cd;
// Combined blend — DEDICATED instance so the legacy u_blend datapath
// is untouched (byte-identical at param=0). Cs = sampled texel,
// Cd = dest pixel read back, As = texel alpha. Same source-over math.
logic [7:0] comb_blend_r, comb_blend_g, comb_blend_b, comb_blend_a;
generate
if (COMBINED_TAZ) begin : g_comb_blend
gs_alpha_blend #(.ALPHA_MODES(ALPHA_MODES_ENABLE)) u_comb_blend (
.cs_r (comb_cs_r[7:0]),
.cs_g (comb_cs_r[15:8]),
.cs_b (comb_cs_r[23:16]),
.as (comb_as_r),
.cd_r (comb_blend_cd[7:0]),
.cd_g (comb_blend_cd[15:8]),
.cd_b (comb_blend_cd[23:16]),
// Brick 2c — generic GS ALPHA selectors (read only when
// ALPHA_MODES_ENABLE=1; default OFF → source-over).
.a_sel(ras_alpha_a),
.b_sel(ras_alpha_b),
.c_sel(ras_alpha_c),
.d_sel(ras_alpha_d),
.ad (comb_blend_cd[31:24]),
.fix (ras_alpha_fix),
.cv_r (comb_blend_r),
.cv_g (comb_blend_g),
.cv_b (comb_blend_b),
.a_out(comb_blend_a)
);
end else begin : g_no_comb_blend
assign comb_blend_r = 8'd0;
assign comb_blend_g = 8'd0;
assign comb_blend_b = 8'd0;
assign comb_blend_a = 8'd0;
end
endgenerate
// ==================================================================
// Ch304 TILE-LOCAL render mode — phase enum + control regs declared
// EARLY so the combined-active gate (below) can reference them.
//
// tile_active = TILE_LOCAL && ras_combined : the in-flight primitive
// is the combined T+A+Z triangle AND we render it into the on-chip
// tile scratchpad. Constant 0 at TILE_LOCAL=0 (generate tie-off), so
// every signal derived from it collapses to its legacy value and the
// combined VRAM path is byte-identical.
//
// tile_phase_r overlays the existing raster_state FSM (which keeps its
// 2-bit {R_IDLE,R_SCAN,R_DRAIN} encoding untouched):
// TP_OFF : not in a tile render (or non-tile primitive)
// TP_CLEAR : sweeping the tile RAMs to the clear color/Z
// TP_RENDER : the combined FSM runs, retargeted to the tile RAMs
// TP_FLUSH : sweeping the tile color RAM out to the framebuffer
// ==================================================================
// TP_BIN : (BIN_BUFFER_ENABLE only) pre-render binning sweep — classify
// every (prim,tile) pair into the per-tile bins. Touches NO
// tile RAM, so the tile-RAM driver's case-default keeps it inert.
// tile_phase_r is widened to 3 bits to hold the 5th encoding TP_BIN. At the
// default BIN_BUFFER_ENABLE=0 TP_BIN is never assigned, so the synthesised
// state set is unchanged (TP_OFF..TP_FLUSH) — byte-identical.
localparam logic [2:0] TP_OFF = 3'd0;
localparam logic [2:0] TP_CLEAR = 3'd1;
localparam logic [2:0] TP_RENDER = 3'd2;
localparam logic [2:0] TP_FLUSH = 3'd3;
localparam logic [2:0] TP_BIN = 3'd4;
// Ch323 (TILE_SPILL_ENABLE only): RELOAD = load tile color+Z from the staging RAM
// before RENDER; ZFLUSH = emit the tile Z RAM to LPDDR after the color FLUSH. Never
// assigned at the default TILE_SPILL_ENABLE=0, so the synthesised state set is unchanged.
localparam logic [2:0] TP_RELOAD = 3'd5;
localparam logic [2:0] TP_ZFLUSH = 3'd6;
// tile_active — = ras_combined at TILE_LOCAL=1, tied 0 otherwise.
generate
if (TILE_LOCAL) begin : g_tile_active
assign tile_active = ras_combined;
end else begin : g_no_tile_active
assign tile_active = 1'b0;
end
endgenerate
// Tile-phase register + sweep counter. Both are dead (held 0) at
// TILE_LOCAL=0 because tile_active is constant 0 — the always_ff arms
// that write them are all `if (tile_active ...)`-gated.
logic [2:0] tile_phase_r; // current tile phase (3-bit: 5th value TP_BIN)
logic [8:0] tile_sweep_r; // 0..256 sweep counter (CLEAR + FLUSH)
assign tile_phase = tile_phase_r;
// ---- Ch323 tile COLOR+Z spill/reload signals (TILE_SPILL_ENABLE only) ----
// Z-flush emit (TP_ZFLUSH): one cycle after a Z read addr is presented, the registered
// tile_z_rdata is emitted to the LPDDR Z-backing via the exposed z_flush_* stream.
logic z_flush_emit_q; // pipeline strobe (mirrors flush_emit_q for color)
logic [7:0] z_flush_idx_q; // tile index latched alongside the emit
logic [11:0] z_flush_ox_q, z_flush_oy_q;
// Tile RELOAD (TP_RELOAD): on entry pulse reload_start_o (arm the de25 staging fill), wait
// tile_reload_ready_i, then sweep 256 indices presenting tile_reload_raddr_o; the staging
// engine returns tile_reload_color_i/z_i 1 cycle later, written into the tile RAMs. All 256
// are written before TP_RENDER (all-or-nothing — no primitive eval on a partial tile).
logic reload_wait; // waiting for staging fill (tile_reload_ready_i)
logic reload_wr_we; // tile-RAM write window (1 cyc behind the raddr)
logic [7:0] reload_wr_addr; // tile index being written this beat
// ==================================================================
// Ch305 TILE GRID — outer loop over TILE_COLS x TILE_ROWS tiles.
//
// N_TILES tiles, one per (col,row). The grid loop is gated entirely
// by tile_active (dead at TILE_LOCAL=0); at TILE_COLS=TILE_ROWS=1 the
// single tile is at the origin so tile_ox=tile_oy=0 and N_TILES=1 →
// exactly one iteration, byte-identical to Ch303.
//
// prim_x_min/max/y_min/max : a COPY of the popped primitive's bbox,
// preserved across the whole grid render because ras_x_min/max
// are OVERWRITTEN by the per-tile clipped bbox each tile.
// tile_col_r / tile_row_r : current tile coordinate in the grid.
// tile_ox / tile_oy : current tile origin in screen pixels
// (= col*16 / row*16). 12-bit to match the coord datapath.
// ==================================================================
localparam int N_TILES = TILE_COLS * TILE_ROWS;
logic [11:0] prim_x_min, prim_x_max, prim_y_min, prim_y_max;
// Counters are sized to cover the grid; widths are derived so a
// 1x1 grid still infers a valid (1-bit) register without warnings.
logic [11:0] tile_col_r; // 0..TILE_COLS-1
logic [11:0] tile_row_r; // 0..TILE_ROWS-1
logic [11:0] tile_ox, tile_oy;
assign tile_ox = tile_col_r << 4; // col*16
assign tile_oy = tile_row_r << 4; // row*16
// Ch323 — clean-Z BOOTSTRAP: a per-tile "this tile has been spilled to LPDDR" valid bit.
// On the FIRST render of a tile (valid=0) we must NOT reload (LPDDR holds garbage / a stale
// tile) — render from the local CLEAR instead. After the tile's color+Z FLUSH completes we
// set its bit; later passes RELOAD. Without this, the first batch reloads garbage Z and the
// GEQUAL depth test rejects every fragment (nothing renders). Reset clears all bits.
logic [N_TILES-1:0] spill_valid;
wire [31:0] tile_id_w = tile_row_r * TILE_COLS + tile_col_r;
// Per-tile clip of the primitive bbox against this tile's 16x16 span.
// clip_*_min = max(prim_*_min, tile_lo) ; clip_*_max = min(prim_*_max,
// tile_hi). iverilog-12: name the compares, no parenthesized bit-select.
logic [11:0] tile_x_lo, tile_x_hi, tile_y_lo, tile_y_hi;
assign tile_x_lo = tile_ox;
assign tile_x_hi = tile_ox + 12'd15;
assign tile_y_lo = tile_oy;
assign tile_y_hi = tile_oy + 12'd15;
logic [11:0] clip_x_min, clip_x_max, clip_y_min, clip_y_max;
// width-safe unsigned max/min via compare-then-select (no $signed).
// 3-way: intersect (prim bbox) n (tile bbox) n (scissor rect). At the
// default SCISSOR_ENABLE=0 the eff_sc* terms are 0/0xFFF → no-op.
logic [11:0] clip_x_min_pt, clip_x_max_pt, clip_y_min_pt, clip_y_max_pt;
assign clip_x_min_pt = (prim_x_min > tile_x_lo) ? prim_x_min : tile_x_lo;
assign clip_x_max_pt = (prim_x_max < tile_x_hi) ? prim_x_max : tile_x_hi;
assign clip_y_min_pt = (prim_y_min > tile_y_lo) ? prim_y_min : tile_y_lo;
assign clip_y_max_pt = (prim_y_max < tile_y_hi) ? prim_y_max : tile_y_hi;
assign clip_x_min = (clip_x_min_pt > eff_scx0) ? clip_x_min_pt : eff_scx0;
assign clip_x_max = (clip_x_max_pt < eff_scx1) ? clip_x_max_pt : eff_scx1;
assign clip_y_min = (clip_y_min_pt > eff_scy0) ? clip_y_min_pt : eff_scy0;
assign clip_y_max = (clip_y_max_pt < eff_scy1) ? clip_y_max_pt : eff_scy1;
// Empty overlap: primitive does not touch this tile. When empty we
// SKIP the RENDER (CLEAR -> FLUSH only) so the tile shows clear color.
logic tile_clip_empty;
assign tile_clip_empty = (clip_x_min > clip_x_max)
| (clip_y_min > clip_y_max);
// Last tile in the grid? (col == COLS-1 AND row == ROWS-1.) Compared
// against 12-bit literals of the param-1 so a 1x1 grid is just (0,0).
logic tile_is_last;
assign tile_is_last = (tile_col_r == 12'(TILE_COLS - 1))
& (tile_row_r == 12'(TILE_ROWS - 1));
// ==================================================================
// Ch305+ MULTI-PRIMITIVE TILED SCENE — per-tile primitive batch.
//
// In multiprim mode the streaming pop is disabled; instead a whole
// batch of TILE_PRIM_COUNT primitives (already in the FIFO with their
// gradients solved) is rendered into each tile in FIFO order. Within a
// tile the on-chip color/Z RAMs are CLEARED once (TP_CLEAR) then each
// primitive is rendered without an inter-primitive clear, so primitive
// p+1 composites over p. All regs/wires below are dead at the default
// TILE_MULTIPRIM=0 (no branch writes them).
// ==================================================================
logic [FIFO_CNT_W-1:0] prim_idx_r; // current primitive within tile (0..prim_count_r-1)
logic [FIFO_CNT_W-1:0] prim_count_r; // # primitives in the batch (latched at grid start)
logic [FIFO_PTR_W-1:0] grid_base_rptr; // fifo_rptr at grid start = slot of primitive 0
// ==================================================================
// BIN BUFFER (BIN_BUFFER_ENABLE only) — per-tile ordered primitive lists.
//
// Before any tile renders, the TP_BIN sweep classifies every (prim,tile)
// pair: if primitive p's bbox ∩ tile-t's 16x16 span ∩ scissor is non-empty,
// p is APPENDED to tile t's bin (bin_prim[t][bin_n[t]] = p; bin_n[t]++).
// The overlap test is the SAME max/min clip-empty math used by
// mp_next_nonempty, so a bin walked in slot order yields the IDENTICAL
// primitive sequence the re-test path would have produced. At
// BIN_BUFFER_ENABLE=0 these registers are never written (TP_BIN unreached)
// and the bin-walk render arms are folded out -> synth-pruned dead state.
//
// NTILES = TILE_COLS*TILE_ROWS (=4 for the 2x2 grid). t = row*COLS + col,
// so (col,row) = (t % TILE_COLS, t / TILE_COLS). A bin holds at most
// FIFO_DEPTH prims; prim_count_r <= FIFO_DEPTH so the append never overflows.
// ==================================================================
localparam int NTILES = TILE_COLS * TILE_ROWS;
// Index/counter widths. $clog2(1)=0 (a 1x1 grid at the default), so guard
// both width derivations to a minimum of 1 bit — a zero-width register/cast
// is a Quartus FATAL ("zero width constant") AND an iverilog synth error.
// BIN_T_W holds 0..NTILES (the sweep counter can equal NTILES at done).
// CUR_T_W indexes 0..NTILES-1.
localparam int CUR_T_W = (NTILES > 1) ? $clog2(NTILES) : 1;
localparam int BIN_T_W = (NTILES > 1) ? $clog2(NTILES + 1) : 1;
logic [FIFO_CNT_W-1:0] bin_prim [0:NTILES-1][0:FIFO_DEPTH-1]; // prim INDEX per slot
logic [FIFO_CNT_W-1:0] bin_n [0:NTILES-1]; // # prims in each tile's bin
// Ch315 — CAPACITY DIAGNOSTICS (sim-visible; read hierarchically by the TB, no
// new ports). raster_overflow (the existing latch) flags a FIFO-full push;
// these COUNT/peak them so a stress run quantifies capacity pressure.
// raster_overflow_count_r : # of dropped (FIFO-full) primitive pushes.
// bin_occ_max_r : peak per-tile bin occupancy seen this batch (the
// "full-ish bin" metric — how deep the busiest
// tile's bin got vs the FIFO_DEPTH ceiling).
// bin_overflow_r : DEFENSIVE bin-capacity flag — set if a bin push
// would exceed FIFO_DEPTH. With the bin sized to the
// FIFO (M=N) this can never fire (the FIFO drops the
// prim first), but it makes the bin's own ceiling
// explicit and would catch a future M<N decoupling.
logic [15:0] raster_overflow_count_r;
logic [FIFO_CNT_W-1:0] bin_occ_max_r;
logic bin_overflow_r;
// Ch329 — non-combined primitives have NO tile-local color/Z path (the tile RAMs + combined FSM
// are gated by tile_active == ras_combined). The multiprim grid is therefore COMBINED-TAZ-ONLY;
// a non-combined prim is REFUSED at binning (never added to a bin → never loaded → cannot stall
// the bin-walk) and counted here for visibility. (Full tile-local sprite support is future work.)
logic [15:0] tile_refused_count;
// Binning sweep counters: one (prim p, tile t) pair classified per cycle.
// Order chosen so that, for a FIXED tile t, prim p is visited in INCREASING
// order -> appends to a bin happen in p order -> draw order preserved.
// Loop nesting: t is the OUTER counter, p is the INNER counter (p advances
// every cycle, t advances when p wraps past prim_count_r-1). NTILES and
// prim_count_r bounds are register/param compares (not loop bounds), so no
// non-constant FOR-loop is introduced (Quartus-safe).
logic [BIN_T_W-1:0] bin_t; // current tile being classified (0..NTILES)
logic [FIFO_CNT_W-1:0] bin_p; // current prim being classified (0..prim_count_r)
logic [FIFO_CNT_W-1:0] bin_slot_r; // current slot within cur-tile's bin (render walk)
// Tile-t (binning sweep) screen bounds, derived from bin_t. col = t%COLS,
// row = t/COLS. For the 2x2 grid both are 0/1. Computed combinationally
// from bin_t so the overlap test below sees the tile currently being binned.
logic [11:0] bin_t_col, bin_t_row, bin_t_x_lo, bin_t_x_hi, bin_t_y_lo, bin_t_y_hi;
always_comb begin
bin_t_col = 12'(bin_t % TILE_COLS);
bin_t_row = 12'(bin_t / TILE_COLS);
bin_t_x_lo = bin_t_col << 4;
bin_t_x_hi = (bin_t_col << 4) + 12'd15;
bin_t_y_lo = bin_t_row << 4;
bin_t_y_hi = (bin_t_row << 4) + 12'd15;
end
// Overlap test for the pair (prim bin_p, tile bin_t): the prim's fifo bbox ∩
// tile-bin_t bbox ∩ scissor is non-empty. IDENTICAL max/min/empty math to
// mp_next_nonempty (just indexed by bin_p / the bin_t bounds). Indexed array
// reads pulled into locals first (iverilog-12 variable-index-in-always_comb).
logic [FIFO_PTR_W-1:0] bin_slot_p; // fifo slot of prim bin_p
logic [11:0] binf_fx_lo, binf_fx_hi, binf_fy_lo, binf_fy_hi;
logic [11:0] bin_cx_lo, bin_cx_hi, bin_cy_lo, bin_cy_hi;
logic bin_overlap;
always_comb begin
bin_slot_p = grid_base_rptr + FIFO_PTR_W'(bin_p);
binf_fx_lo = fifo_x_min[bin_slot_p]; binf_fx_hi = fifo_x_max[bin_slot_p];
binf_fy_lo = fifo_y_min[bin_slot_p]; binf_fy_hi = fifo_y_max[bin_slot_p];
bin_cx_lo = (binf_fx_lo > bin_t_x_lo) ? binf_fx_lo : bin_t_x_lo;
bin_cx_hi = (binf_fx_hi < bin_t_x_hi) ? binf_fx_hi : bin_t_x_hi;
bin_cy_lo = (binf_fy_lo > bin_t_y_lo) ? binf_fy_lo : bin_t_y_lo;
bin_cy_hi = (binf_fy_hi < bin_t_y_hi) ? binf_fy_hi : bin_t_y_hi;
// Fold scissor rect (no-op at default SCISSOR_ENABLE=0).
bin_cx_lo = (bin_cx_lo > eff_scx0) ? bin_cx_lo : eff_scx0;
bin_cx_hi = (bin_cx_hi < eff_scx1) ? bin_cx_hi : eff_scx1;
bin_cy_lo = (bin_cy_lo > eff_scy0) ? bin_cy_lo : eff_scy0;
bin_cy_hi = (bin_cy_hi < eff_scy1) ? bin_cy_hi : eff_scy1;
bin_overlap = !((bin_cx_lo > bin_cx_hi) || (bin_cy_lo > bin_cy_hi));
end
// Reduction over the UNPACKED fifo_grad_pending array — cannot use a
// simple |-reduce on an unpacked array, so loop it (iverilog-12 OK).
logic all_grad_done;
always_comb begin
all_grad_done = 1'b1;
for (int gi = 0; gi < FIFO_DEPTH; gi++)
if (fifo_grad_pending[gi]) all_grad_done = 1'b0;
end
// Ch331 — latched "end-of-list" flush request (feeder mode). Set by the 1-cycle
// prim_list_flush_i pulse, held until the grid actually starts (mp_grid_start), so the
// request can never be missed if the feeder's done arrives before the renderer is ready
// (gradients still solving, a prior render draining, etc.). Inert when MP_FLUSH_ONLY=0.
logic mp_grid_start;
// Ch336 — scene/batch accumulation state. scene_first_batch_r marks the first FIFO-batch of a
// scene (it CLEARS+full-flushes; later batches sparse-flush onto the accumulated FB). batch_full_r
// latches the flush mode for the in-flight grid. scene_done_w = the end-of-list flush is pending
// and nothing is left to render -> the scene is fully drained, so the next batch starts fresh.
logic scene_first_batch_r, batch_full_r;
logic mp_flush_pending;
wire scene_done_w = TILE_ACCUM_ENABLE && mp_flush_pending && (fifo_count == FIFO_CNT_W'(0))
&& (raster_state == R_IDLE) && !tile_render_busy;
always_ff @(posedge clk or negedge rst_n) begin
if (!rst_n) begin
mp_flush_pending <= 1'b0; scene_first_batch_r <= 1'b1; batch_full_r <= 1'b1;
end else if (TILE_ACCUM_ENABLE) begin
// end-of-list flush held until the scene fully drains (survives mid-list batch drains)
if (scene_done_w) mp_flush_pending <= 1'b0;
else if (prim_list_flush_i) mp_flush_pending <= 1'b1;
// first batch of a scene -> clears; reset when the scene is done
if (scene_done_w) scene_first_batch_r <= 1'b1;
else if (mp_grid_start) scene_first_batch_r <= 1'b0;
if (mp_grid_start) batch_full_r <= scene_first_batch_r; // latch this grid's flush mode
end else begin
// Ch331 legacy: flush consumed at grid start; every grid full-flushes.
if (mp_grid_start) mp_flush_pending <= 1'b0;
else if (prim_list_flush_i) mp_flush_pending <= 1'b1;
batch_full_r <= 1'b1;
end
end
// Grid-start trigger. Legacy (non-accum): end-of-list flush (MP_FLUSH_ONLY) or fixed threshold.
// Ch336 accum: ALSO fire when the FIFO is FULL — a mid-list batch — so a >FIFO_DEPTH scene renders
// in multiple batches that accumulate (first clears, rest sparse-flush) instead of stalling.
assign mp_grid_start = TILE_LOCAL && TILE_MULTIPRIM
&& (raster_state == R_IDLE)
&& (TILE_ACCUM_ENABLE ? ((fifo_full || mp_flush_pending) && (fifo_count != FIFO_CNT_W'(0)))
: MP_FLUSH_ONLY ? (mp_flush_pending && (fifo_count != FIFO_CNT_W'(0)))
: (fifo_count >= FIFO_CNT_W'(TILE_PRIM_COUNT)))
&& all_grad_done
&& !tile_render_busy;
// Ch336 — per-pixel "written" mask for the CURRENT tile (sparse flush). Cleared through TP_CLEAR,
// set on each combined render color write (CB_FB, tile_color_we in TP_RENDER). On a non-first
// batch the flush emits ONLY written pixels, so it overlays the batch's prims onto the
// accumulated framebuffer instead of overwriting it with the green clear. Pruned when
// TILE_ACCUM_ENABLE=0 (the flush gate below stops reading it). 256 = one 16x16 tile.
logic [255:0] tile_written_r;
always_ff @(posedge clk) begin
if (tile_phase_r == TP_CLEAR) tile_written_r <= 256'd0;
else if (tile_color_we && (tile_phase_r == TP_RENDER)) tile_written_r[tile_color_waddr] <= 1'b1;
end
// Per-primitive walker-bbox clip against the CURRENT tile is computed
// DIRECTLY from a given slot's fifo bbox (not via the prim_* regs) so a
// reload can clip in the same cycle it picks the slot — see the
// mp_first_*/mp_after_* wires defined just after the helper function below.
// "Skip empty primitives" helper: given a starting prim index, return the
// first index >= start whose clip against the CURRENT tile is NON-empty,
// or prim_count_r if none remain. iverilog-12: no `return` from a task —
// implemented as a function with a single result variable. The function
// reads grid_base_rptr / prim_count_r / tile bbox / fifo bbox (all stable
// within the evaluating cycle).
function automatic [FIFO_CNT_W-1:0] mp_next_nonempty (input [FIFO_CNT_W-1:0] start);
logic [FIFO_CNT_W-1:0] idx;
logic [FIFO_CNT_W-1:0] res;
logic [FIFO_PTR_W-1:0] slot;
logic [11:0] cx_lo, cx_hi, cy_lo, cy_hi;
logic [11:0] fx_lo, fx_hi, fy_lo, fy_hi;
logic found;
begin
res = prim_count_r;
found = 1'b0;
// CONSTANT loop bound (FIFO_DEPTH) so Quartus can unroll — iverilog
// tolerates a non-constant bound (prim_count_r) but Quartus rejects it
// ("non-constant loop count limit exceeded"). The original [start,
// prim_count_r) range is preserved by the in-loop guard below.
for (idx = '0; idx < FIFO_CNT_W'(FIFO_DEPTH); idx = idx + FIFO_CNT_W'(1)) begin
slot = grid_base_rptr + FIFO_PTR_W'(idx);
fx_lo = fifo_x_min[slot]; fx_hi = fifo_x_max[slot];
fy_lo = fifo_y_min[slot]; fy_hi = fifo_y_max[slot];
cx_lo = (fx_lo > tile_x_lo) ? fx_lo : tile_x_lo;
cx_hi = (fx_hi < tile_x_hi) ? fx_hi : tile_x_hi;
cy_lo = (fy_lo > tile_y_lo) ? fy_lo : tile_y_lo;
cy_hi = (fy_hi < tile_y_hi) ? fy_hi : tile_y_hi;
// Fold scissor rect into the empty-test (no-op at default).
cx_lo = (cx_lo > eff_scx0) ? cx_lo : eff_scx0;
cx_hi = (cx_hi < eff_scx1) ? cx_hi : eff_scx1;
cy_lo = (cy_lo > eff_scy0) ? cy_lo : eff_scy0;
cy_hi = (cy_hi < eff_scy1) ? cy_hi : eff_scy1;
if (!found && (idx >= start) && (idx < prim_count_r)
&& !((cx_lo > cx_hi) || (cy_lo > cy_hi))) begin
res = idx;
found = 1'b1;
end
end
mp_next_nonempty = res;
end
endfunction
// FIRST non-empty primitive of the batch for THIS tile (used at CLEAR-done),
// and its source slot + tile-clipped walker bbox.
logic [FIFO_CNT_W-1:0] mp_first_idx;
logic [FIFO_PTR_W-1:0] mp_first_slot;
logic [11:0] mp_first_cx_lo, mp_first_cx_hi, mp_first_cy_lo, mp_first_cy_hi;
// NEXT non-empty primitive strictly after prim_idx_r (used at RENDER-drain),
// and its source slot + tile-clipped walker bbox. prim_idx_r+1 may overflow
// FIFO_CNT_W only past the batch, which mp_next_nonempty's `< prim_count_r`
// bound already rejects (returns prim_count_r), so the add is width-safe.
logic [FIFO_CNT_W-1:0] mp_after_arg;
logic [FIFO_CNT_W-1:0] mp_after_idx;
logic [FIFO_PTR_W-1:0] mp_after_slot;
logic [11:0] mp_after_cx_lo, mp_after_cx_hi, mp_after_cy_lo, mp_after_cy_hi;
// Computed PROCEDURALLY (always_comb), NOT via chained continuous assigns:
// iverilog-12 mis-evaluates a variable-indexed array read (fifo_x_max[slot])
// inside a continuous assign when the index net is itself driven by an assign
// through a function — it returns 0 for the indexed read (observed: a hang
// with the loaded walker bbox stuck at [0..0]). In a procedural always_comb the
// index reads evaluate correctly. (The original assign form is legal SV and
// would synthesize fine in Quartus; this is purely a sim-modelling workaround.)
// Tile-only intersection temps (prim bbox n tile bbox), before the
// scissor is folded in below. Indexed array reads are pulled into
// locals first (iverilog-12 variable-index-in-always_comb is fine).
logic [11:0] mpf_pt_cx_lo, mpf_pt_cx_hi, mpf_pt_cy_lo, mpf_pt_cy_hi;
logic [11:0] mpa_pt_cx_lo, mpa_pt_cx_hi, mpa_pt_cy_lo, mpa_pt_cy_hi;
logic [11:0] mpf_fx_lo, mpf_fx_hi, mpf_fy_lo, mpf_fy_hi;
logic [11:0] mpa_fx_lo, mpa_fx_hi, mpa_fy_lo, mpa_fy_hi;
always_comb begin
mp_first_idx = mp_next_nonempty(FIFO_CNT_W'(0));
mp_first_slot = grid_base_rptr + FIFO_PTR_W'(mp_first_idx);
mpf_fx_lo = fifo_x_min[mp_first_slot]; mpf_fx_hi = fifo_x_max[mp_first_slot];
mpf_fy_lo = fifo_y_min[mp_first_slot]; mpf_fy_hi = fifo_y_max[mp_first_slot];
mpf_pt_cx_lo = (mpf_fx_lo > tile_x_lo) ? mpf_fx_lo : tile_x_lo;
mpf_pt_cx_hi = (mpf_fx_hi < tile_x_hi) ? mpf_fx_hi : tile_x_hi;
mpf_pt_cy_lo = (mpf_fy_lo > tile_y_lo) ? mpf_fy_lo : tile_y_lo;
mpf_pt_cy_hi = (mpf_fy_hi < tile_y_hi) ? mpf_fy_hi : tile_y_hi;
// Fold scissor rect (no-op at default SCISSOR_ENABLE=0).
mp_first_cx_lo = (mpf_pt_cx_lo > eff_scx0) ? mpf_pt_cx_lo : eff_scx0;
mp_first_cx_hi = (mpf_pt_cx_hi < eff_scx1) ? mpf_pt_cx_hi : eff_scx1;
mp_first_cy_lo = (mpf_pt_cy_lo > eff_scy0) ? mpf_pt_cy_lo : eff_scy0;
mp_first_cy_hi = (mpf_pt_cy_hi < eff_scy1) ? mpf_pt_cy_hi : eff_scy1;
mp_after_arg = prim_idx_r + FIFO_CNT_W'(1);
mp_after_idx = mp_next_nonempty(mp_after_arg);
mp_after_slot = grid_base_rptr + FIFO_PTR_W'(mp_after_idx);
mpa_fx_lo = fifo_x_min[mp_after_slot]; mpa_fx_hi = fifo_x_max[mp_after_slot];
mpa_fy_lo = fifo_y_min[mp_after_slot]; mpa_fy_hi = fifo_y_max[mp_after_slot];
mpa_pt_cx_lo = (mpa_fx_lo > tile_x_lo) ? mpa_fx_lo : tile_x_lo;
mpa_pt_cx_hi = (mpa_fx_hi < tile_x_hi) ? mpa_fx_hi : tile_x_hi;
mpa_pt_cy_lo = (mpa_fy_lo > tile_y_lo) ? mpa_fy_lo : tile_y_lo;
mpa_pt_cy_hi = (mpa_fy_hi < tile_y_hi) ? mpa_fy_hi : tile_y_hi;
// Fold scissor rect (no-op at default SCISSOR_ENABLE=0).
mp_after_cx_lo = (mpa_pt_cx_lo > eff_scx0) ? mpa_pt_cx_lo : eff_scx0;
mp_after_cx_hi = (mpa_pt_cx_hi < eff_scx1) ? mpa_pt_cx_hi : eff_scx1;
mp_after_cy_lo = (mpa_pt_cy_lo > eff_scy0) ? mpa_pt_cy_lo : eff_scy0;
mp_after_cy_hi = (mpa_pt_cy_hi < eff_scy1) ? mpa_pt_cy_hi : eff_scy1;
end
// ------------------------------------------------------------------
// BIN-WALK clip wires (BIN_BUFFER_ENABLE render path). cur_t = the tile
// currently being rendered (= tile_row_r*COLS + tile_col_r). For the CLEAR
// ->RENDER load we want bin slot 0; for a RENDER->RENDER advance we want bin
// slot bin_slot_r+1. Each gives a prim INDEX (bin_prim[cur_t][slot]); the
// tile-clipped walker bbox is the SAME max/min/scissor math as mp_first_*
// (clipped against the CURRENT tile via tile_x_lo/hi). At BIN_BUFFER_ENABLE=0
// these are unread dead wires (the bin-walk render arms are folded out).
// ------------------------------------------------------------------
logic [CUR_T_W-1:0] cur_t;
assign cur_t = CUR_T_W'(tile_row_r * TILE_COLS + tile_col_r);
// selected prim INDEX for the two walk cases.
logic [FIFO_CNT_W-1:0] binw_first_idx, binw_next_idx;
logic [FIFO_CNT_W-1:0] binw_next_slot_sel; // bin slot for the advance
logic [FIFO_PTR_W-1:0] binw_first_slot, binw_next_fslot;
logic [11:0] binw_first_cx_lo, binw_first_cx_hi, binw_first_cy_lo, binw_first_cy_hi;
logic [11:0] binw_next_cx_lo, binw_next_cx_hi, binw_next_cy_lo, binw_next_cy_hi;
// intermediates (indexed reads pulled into locals first).
logic [11:0] bwf_fx_lo, bwf_fx_hi, bwf_fy_lo, bwf_fy_hi;
logic [11:0] bwn_fx_lo, bwn_fx_hi, bwn_fy_lo, bwn_fy_hi;
logic [11:0] bwf_pt_cx_lo, bwf_pt_cx_hi, bwf_pt_cy_lo, bwf_pt_cy_hi;
logic [11:0] bwn_pt_cx_lo, bwn_pt_cx_hi, bwn_pt_cy_lo, bwn_pt_cy_hi;
always_comb begin
// bin slot 0 (first prim of cur_t's bin).
binw_first_idx = bin_prim[cur_t][0];
binw_first_slot = grid_base_rptr + FIFO_PTR_W'(binw_first_idx);
bwf_fx_lo = fifo_x_min[binw_first_slot]; bwf_fx_hi = fifo_x_max[binw_first_slot];
bwf_fy_lo = fifo_y_min[binw_first_slot]; bwf_fy_hi = fifo_y_max[binw_first_slot];
bwf_pt_cx_lo = (bwf_fx_lo > tile_x_lo) ? bwf_fx_lo : tile_x_lo;
bwf_pt_cx_hi = (bwf_fx_hi < tile_x_hi) ? bwf_fx_hi : tile_x_hi;
bwf_pt_cy_lo = (bwf_fy_lo > tile_y_lo) ? bwf_fy_lo : tile_y_lo;
bwf_pt_cy_hi = (bwf_fy_hi < tile_y_hi) ? bwf_fy_hi : tile_y_hi;
binw_first_cx_lo = (bwf_pt_cx_lo > eff_scx0) ? bwf_pt_cx_lo : eff_scx0;
binw_first_cx_hi = (bwf_pt_cx_hi < eff_scx1) ? bwf_pt_cx_hi : eff_scx1;
binw_first_cy_lo = (bwf_pt_cy_lo > eff_scy0) ? bwf_pt_cy_lo : eff_scy0;
binw_first_cy_hi = (bwf_pt_cy_hi < eff_scy1) ? bwf_pt_cy_hi : eff_scy1;
// next bin slot (bin_slot_r+1). Index into bin_prim is taken modulo
// FIFO_DEPTH (low FIFO_PTR_W bits) so it is ALWAYS in [0..FIFO_DEPTH-1]
// — the advance only CONSUMES this when bin_slot_r+1 < bin_n[cur_t] <=
// FIFO_DEPTH, so the masked value equals the true slot whenever used;
// the mask only tames the unused wrap case (Quartus const-range safety).
binw_next_slot_sel = bin_slot_r + FIFO_CNT_W'(1);
binw_next_idx = bin_prim[cur_t][binw_next_slot_sel[FIFO_PTR_W-1:0]];
binw_next_fslot = grid_base_rptr + FIFO_PTR_W'(binw_next_idx);
bwn_fx_lo = fifo_x_min[binw_next_fslot]; bwn_fx_hi = fifo_x_max[binw_next_fslot];
bwn_fy_lo = fifo_y_min[binw_next_fslot]; bwn_fy_hi = fifo_y_max[binw_next_fslot];
bwn_pt_cx_lo = (bwn_fx_lo > tile_x_lo) ? bwn_fx_lo : tile_x_lo;
bwn_pt_cx_hi = (bwn_fx_hi < tile_x_hi) ? bwn_fx_hi : tile_x_hi;
bwn_pt_cy_lo = (bwn_fy_lo > tile_y_lo) ? bwn_fy_lo : tile_y_lo;
bwn_pt_cy_hi = (bwn_fy_hi < tile_y_hi) ? bwn_fy_hi : tile_y_hi;
binw_next_cx_lo = (bwn_pt_cx_lo > eff_scx0) ? bwn_pt_cx_lo : eff_scx0;
binw_next_cx_hi = (bwn_pt_cx_hi < eff_scx1) ? bwn_pt_cx_hi : eff_scx1;
binw_next_cy_lo = (bwn_pt_cy_lo > eff_scy0) ? bwn_pt_cy_lo : eff_scy0;
binw_next_cy_hi = (bwn_pt_cy_hi < eff_scy1) ? bwn_pt_cy_hi : eff_scy1;
end
// tile-local index of the HELD combined walker pixel = {y[3:0],x[3:0]}.
// For ANY 16-aligned tile this is the tile-local RAM address (the low
// 4 bits of the screen coord), so it is offset-agnostic by construction.
logic [7:0] tile_idx_w;
assign tile_idx_w = {ras_cur_y[3:0], ras_cur_x[3:0]};
// Combined-active gate: a combined-mode scan is in flight.
// Ch304 — in tile-local mode the combined per-pixel FSM may run ONLY
// during the RENDER phase (NOT while the tile is being CLEARed, even
// though raster_state is already R_SCAN then). tile_active is constant
// 0 at TILE_LOCAL=0, so the added term collapses away (byte-identical).
logic comb_scan_active;
assign comb_scan_active = ras_combined && (raster_state == R_SCAN)
&& (!tile_active || (tile_phase_r == TP_RENDER));
// Per-beat read-strobe contributions. One consumer per beat:
// beat 0 (CB_Z) : Z read, inside pixel only.
// beat 1 (CB_ZW): texel read, depth-PASS only.
// beat 2 (CB_T) : dest-fb read, depth-PASS only (latched pass).
assign comb_z_req = comb_scan_active && (comb_state_r == CB_Z)
&& comb_pix_inside_w;
assign comb_tex_req = comb_scan_active && (comb_state_r == CB_ZW)
&& comb_ztest_pass_w;
assign comb_fb_req = comb_scan_active && (comb_state_r == CB_T)
&& comb_ztest_pass_r;
// Ch310 — is THIS combined primitive doing a bilinear PSMCT32 sample? Only
// then does the texel read take the multi-beat 4-tap path that needs the
// CB_TWAIT stall. Compile-time-constant 0 when BILINEAR_ENABLE=0, so the
// CB_TWAIT branch + the bilinear advance are pruned and the FSM is exactly
// the legacy CB_ZW→CB_T→CB_FB→CB_ZWR sequence (byte-identical).
// ras_filter_lin = the per-primitive TEX1_1.MMAG snapshot; ras_tpsm==6'h00
// = PSMCT32 (bilinear is PSMCT32-only, matching u_tex's is_ct32 gate).
logic bili_now;
assign bili_now = BILINEAR_ENABLE && comb_scan_active
&& ras_filter_lin && (ras_tpsm == 6'h00);
// Observable schedule taps.
assign comb_active = comb_scan_active;
assign comb_beat = comb_state_r;
assign comb_pix_inside = comb_scan_active ? comb_pix_inside_w : 1'b0;
assign comb_ztest_pass = comb_scan_active ? comb_ztest_pass_r : 1'b0;
// ------------------------------------------------------------------
// FLUSH read/emit pipeline. The tile color RAM read is REGISTERED, so
// a raddr presented on cycle N returns rdata on cycle N+1. During
// TP_FLUSH we present raddr=tile_sweep_r each cycle and, ONE cycle
// later, emit the framebuffer write for the index we presented last
// cycle. flush_emit_q / flush_idx_q carry the 1-cycle delay.
// ------------------------------------------------------------------
logic flush_emit_q; // emit a FB write THIS cycle
logic [7:0] flush_idx_q; // tile index whose color is emitted now
// Linear PSMCT32 FB byte address for tile-local index idx (x=idx[3:0],
// y=idx[7:4]), matching the combined comb_fb_addr formula:
// pixel_index = y*(fbw*64) + x ; addr = (fbp<<11) + (pixel_index<<2).
// Ch305 — the FLUSH FB address is OFFSET by the tile origin. flush_x/y
// are the tile-LOCAL 0..15 indices; the screen coord is tile_ox/oy +
// those. flush_ox_q / flush_oy_q latch the tile origin alongside the
// emit pipeline so the +1-cycle registered read stays aligned even if
// the tile counter advances (it does not until FLUSH ends, but latching
// keeps the address self-consistent and TB-observable). At
// COLS=ROWS=1 the offset is 0 → byte-identical to Ch303.
logic [11:0] flush_x_w, flush_y_w; // screen coords
logic [11:0] flush_ox_q, flush_oy_q; // latched tile origin for the emit
logic [31:0] flush_pixel_index_w, flush_fb_addr_w;
assign flush_x_w = flush_ox_q + {8'd0, flush_idx_q[3:0]};
assign flush_y_w = flush_oy_q + {8'd0, flush_idx_q[7:4]};
assign flush_pixel_index_w = ({20'd0, flush_y_w} * ({26'd0, ras_fbw} << 6))
+ {20'd0, flush_x_w};
assign flush_fb_addr_w = ({23'd0, ras_fbp} << 11) + (flush_pixel_index_w << 2);
// Ch324 — current tile's PSMCT32 byte offset into the raster framebuffer: SAME formula as the
// flush pixel index, but at the tile ORIGIN (tile_ox/tile_oy). gs_tile_reload adds this to its
// COLOR/Z region base to gather exactly the tile the flush wrote. 1x1 grid → tile_ox=oy=0 → 0.
assign reload_base_o = 30'((((({20'd0, tile_oy} * ({26'd0, ras_fbw} << 6)) + {20'd0, tile_ox})) << 2));
// Ch305+ PSMCT16 flush FB byte address: 2 bytes/pixel (<<1) vs 4 for
// PSMCT32 (<<2). The low bit (pixel_index<<1) supplies byte_addr[1],
// which vram_normalize_pkg keys the low/high halfword selection off.
// Dead (unread) at TILE_COLOR_PSMCT16=0.
wire [31:0] flush_fb_addr16_w = ({23'd0, ras_fbp} << 11) + (flush_pixel_index_w << 1);
// Ch323 — Z-FLUSH byte address (TILE_SPILL_ENABLE). Same pixel-index math as the color
// flush, but Z-BACKING-relative (NO fbp base — the de25 Z-writer adds the LPDDR z_base)
// and ALWAYS 4 bytes/pixel (<<2; Z is 32-bit regardless of TILE_COLOR_PSMCT16). Uses the
// z_flush_* latches (presented during TP_ZFLUSH). Dead at the default.
wire [11:0] z_flush_x_w = z_flush_ox_q + {8'd0, z_flush_idx_q[3:0]};
wire [11:0] z_flush_y_w = z_flush_oy_q + {8'd0, z_flush_idx_q[7:4]};
wire [31:0] z_flush_pixel_index_w = ({20'd0, z_flush_y_w} * ({26'd0, ras_fbw} << 6))
+ {20'd0, z_flush_x_w};
wire [31:0] z_flush_addr_w = z_flush_pixel_index_w << 2;
// Ch323 — tile RELOAD staging read address: during the TP_RELOAD sweep (after the staging
// fill is ready), present tile index 0..255 to the de25 staging engine. 0 otherwise.
assign tile_reload_raddr_o = (tile_phase_r == TP_RELOAD && !reload_wait && tile_sweep_r < 9'd256)
? tile_sweep_r[7:0] : 8'd0;
assign tile_phase_o = tile_phase_r; // Ch323 diag — current tile phase (de25 counts entries)
// ------------------------------------------------------------------
// Tile RAM write/read port drivers (combinational). One write port +
// one read port each; the phases never collide (CLEAR writes only;
// RENDER reads Z@beat1 / color@beat3 and writes color@beat3 / Z@beat4
// — read and write are different beats; FLUSH reads only).
// (tile_color_rdata / tile_z_rdata declared up near comb_z_compare_w
// so the depth-test + blend-dest muxes can reference them.)
// ------------------------------------------------------------------
// RENDER write data:
// color (beat3, CB_FB) = blended PSMCT32 word.
// Z (beat4, CB_ZWR) = fragment Z (comb_frag_z), skipped if ZMSK.
logic [31:0] tile_render_color_w;
assign tile_render_color_w = {comb_blend_a, comb_blend_b,
comb_blend_g, comb_blend_r};
// Ch305+ PSMCT16 pack of the CLEAR + RENDER ABGR8888 color into
// pix16 = {a1, b5, g5, r5}, the EXACT convention used by the S2
// raster pack (s2_pack_psm16) and the gs_pcrtc unpack: r5=[7:3],
// g5=[15:11], b5=[23:19], a1=[31]. Dead (unread) at TILE_COLOR_PSMCT16=0.
wire [15:0] tile_clear_pix16 = {TILE_CLEAR_COLOR[31], TILE_CLEAR_COLOR[23:19],
TILE_CLEAR_COLOR[15:11], TILE_CLEAR_COLOR[7:3]};
wire [15:0] tile_render_pix16 = {tile_render_color_w[31], tile_render_color_w[23:19],
tile_render_color_w[15:11], tile_render_color_w[7:3]};
// Tile color write strobe + addr + data.
always_comb begin
tile_color_we = 1'b0;
tile_color_waddr = 8'd0;
tile_color_wdata = '0;
tile_color_raddr = 8'd0;
tile_z_we = 1'b0;
tile_z_waddr = 8'd0;
tile_z_wdata = 32'd0;
tile_z_raddr = 8'd0;
if (tile_active) begin
unique case (tile_phase_r)
TP_CLEAR: begin
// Sweep all 256 entries to the clear values.
if (tile_sweep_r < 9'd256) begin
tile_color_we = 1'b1;
tile_color_waddr = tile_sweep_r[7:0];
// PSMCT16: store packed pix16; PSMCT32: full ABGR.
// Both width-correct for TILE_COLOR_W (16 vs 32).
tile_color_wdata = TILE_COLOR_PSMCT16
? TILE_COLOR_W'(tile_clear_pix16)
: TILE_COLOR_W'(TILE_CLEAR_COLOR);
// Ch338 — with persistent cross-batch Z, clear Z ONLY on the first
// (full-flush) batch of a scene; later (sparse) batches keep the resident
// Z so a tile re-rendered in a later batch depth-tests against prior ones.
// batch_full_r is the latched flush mode (1 = first batch). At
// TILE_Z_PERSIST=0 this is constant 1 → byte-identical (always clears).
tile_z_we = (!TILE_Z_PERSIST) || batch_full_r;
tile_z_waddr = tile_sweep_r[7:0];
tile_z_wdata = TILE_CLEAR_Z;
end
end
TP_RENDER: begin
if (comb_scan_active) begin
unique case (comb_state_r)
// beat0 (CB_Z): present Z read for held pixel.
CB_Z: if (comb_pix_inside_w)
tile_z_raddr = tile_idx_w;
// beat2 (CB_T): present color (dest) read.
CB_T: tile_color_raddr = tile_idx_w;
// beat3 (CB_FB): WRITE blended color to tile.
CB_FB: begin
tile_color_we = 1'b1;
tile_color_waddr = tile_idx_w;
// PSMCT16: pack blended ABGR -> pix16;
// PSMCT32: full blended ABGR word.
tile_color_wdata = TILE_COLOR_PSMCT16
? TILE_COLOR_W'(tile_render_pix16)
: TILE_COLOR_W'(tile_render_color_w);
end
// beat4 (CB_ZWR): WRITE fragment Z (unless ZMSK).
CB_ZWR: if (!ras_zmsk) begin
tile_z_we = 1'b1;
tile_z_waddr = tile_idx_w;
tile_z_wdata = comb_frag_z;
end
default: ;
endcase
end
end
TP_FLUSH: begin
// Present a color read for each index 0..255; the
// emit happens one cycle later (registered read).
if (tile_sweep_r < 9'd256)
tile_color_raddr = tile_sweep_r[7:0];
end
// Ch323 (TILE_SPILL_ENABLE) — these states are only entered when spill is
// enabled, so at the default they are unreachable and the sim is byte-identical.
TP_ZFLUSH: begin
// Present a Z read for each index 0..255; the Z emit happens one cycle
// later (registered read), mirroring TP_FLUSH for color.
if (tile_sweep_r < 9'd256)
tile_z_raddr = tile_sweep_r[7:0];
end
TP_RELOAD: begin
// Write reloaded color+Z from the staging engine into the tile RAMs. The
// staging read addr was presented LAST cycle (1-cyc latency), so its data
// (tile_reload_color_i / tile_reload_z_i) is valid this cycle. reload_wr_we
// gates the write window; reload_wr_addr is the tile index for this beat.
// SEPARATE color/Z write enables (Codex) — both asserted together so the
// reload is all-or-nothing, but independently wired.
if (reload_wr_we) begin
tile_color_we = 1'b1;
tile_color_waddr = reload_wr_addr;
tile_color_wdata = TILE_COLOR_W'(tile_reload_color_i);
tile_z_we = 1'b1;
tile_z_waddr = reload_wr_addr;
tile_z_wdata = tile_reload_z_i;
end
end
default: ;
endcase
end
end
// Ch305+ PSMCT16 dest-read unpack: tile_color_rdata holds pix16 in
// [15:0]; bit-replicate to ABGR8888 for the blend dest (gs_alpha_blend
// uses dest R/G/B + source alpha, so a8 only feeds trace consistency —
// we use {8{a1}}). Same bit-replication as gs_pcrtc_stub's CT16 unpack:
// r8={r5,r5[4:2]}, g8={g5,g5[4:2]}, b8={b5,b5[4:2]}. At
// TILE_COLOR_PSMCT16=0 tile_color_rdata is the legacy 32-bit ABGR word
// and this unpacked net is unread (the mux below selects rdata directly).
logic [4:0] tile_rd_r5, tile_rd_g5, tile_rd_b5;
logic tile_rd_a1;
logic [31:0] tile_color_rdata_abgr;
assign tile_rd_r5 = tile_color_rdata[4:0];
assign tile_rd_g5 = tile_color_rdata[9:5];
assign tile_rd_b5 = tile_color_rdata[14:10];
assign tile_rd_a1 = tile_color_rdata[15];
assign tile_color_rdata_abgr = {{8{tile_rd_a1}},
{tile_rd_b5, tile_rd_b5[4:2]},
{tile_rd_g5, tile_rd_g5[4:2]},
{tile_rd_r5, tile_rd_r5[4:2]}};
// Zero-extend tile_color_rdata to 32 bits with a width-cast (no-op at
// TILE_COLOR_W=32 -> byte-identical; high 16 are 0 at TILE_COLOR_W=16
// but that arm is never selected for the PSMCT32-mode mux below).
logic [31:0] tile_color_rdata32;
assign tile_color_rdata32 = 32'(tile_color_rdata);
// Combined blend dest mux: tile color read in tile mode, else VRAM.
// PSMCT16 tile uses the unpacked ABGR; PSMCT32 tile uses rdata as-is
// (32-bit). Both arms are 32-bit into comb_blend_cd; at
// TILE_COLOR_PSMCT16=0 this collapses to the legacy expression
// (tile_color_rdata32 === tile_color_rdata, a 32-bit net).
assign comb_blend_cd = tile_active
? (TILE_COLOR_PSMCT16 ? tile_color_rdata_abgr
: tile_color_rdata32)
: fb_rd_data;
// ------------------------------------------------------------------
// Tile RAM instances — ONE 16x16 color tile + ONE 16x16 Z tile (256
// entries each, 32-bit). Instantiated only at TILE_LOCAL=1; tied to 0
// reads otherwise so non-tiled builds infer no BRAM.
// ------------------------------------------------------------------
// Ch338 — persistent cross-batch Z: when TILE_Z_PERSIST (and the grid has >1 tile) the Z tile is
// the WHOLE grid (NTILES*256), indexed by {tile_id, local}. The COLOR tile stays one 16x16 tile —
// color rides the sparse flush into VRAM, only Z must persist on-chip. tile_id_w is the current
// tile (stable across its render), so the per-pixel Z read/write hit the right tile's region.
localparam int TZ_ID_BITS = (TILE_Z_PERSIST && N_TILES > 1) ? $clog2(N_TILES) : 0;
localparam int TZ_ADDR_W = 8 + TZ_ID_BITS;
logic [TZ_ADDR_W-1:0] tile_z_gwaddr, tile_z_graddr;
generate
if (TZ_ID_BITS > 0) begin : g_ztile_global
assign tile_z_gwaddr = {tile_id_w[TZ_ID_BITS-1:0], tile_z_waddr};
assign tile_z_graddr = {tile_id_w[TZ_ID_BITS-1:0], tile_z_raddr};
end else begin : g_ztile_local
assign tile_z_gwaddr = tile_z_waddr;
assign tile_z_graddr = tile_z_raddr;
end
endgenerate
generate
if (TILE_LOCAL) begin : g_tile_ram
// Ch305+ COLOR tile RAM width = TILE_COLOR_W (16 for PSMCT16,
// 32 for PSMCT32). gs_tile_ram already has a DATA_W param.
gs_tile_ram #(.ADDR_W(8), .DATA_W(TILE_COLOR_W)) u_tile_color (
.clk (clk),
.rst_n (rst_n),
.we (tile_color_we),
.waddr (tile_color_waddr),
.wdata (tile_color_wdata),
.raddr (tile_color_raddr),
.rdata (tile_color_rdata)
);
gs_tile_ram #(.ADDR_W(TZ_ADDR_W), .DATA_W(32)) u_tile_z (
.clk (clk),
.rst_n (rst_n),
.we (tile_z_we),
.waddr (tile_z_gwaddr),
.wdata (tile_z_wdata),
.raddr (tile_z_graddr),
.rdata (tile_z_rdata)
);
end else begin : g_no_tile_ram
assign tile_color_rdata = '0; // TILE_COLOR_W-wide tie-off
assign tile_z_rdata = 32'd0;
end
endgenerate
// ------------------------------------------------------------------
// Ch88 — 3-stage SCAN pipeline
//
// S0 (this cycle, pre-pipe) : ras_cur_x / ras_cur_y advance.
// Registered. Drives s1 latches at
// next edge.
// S1 (one cycle after S0) : edge functions + top-left bias
// + inside test. Combinational from
// s1_x_q / s1_y_q. Drives s2 latches.
// S2 (two cycles after S0) : color interp + fb_addr + emit.
// Combinational from s2_x_q / s2_y_q
// / s2_L*_q. Drives raster_pixel_*
// outputs.
//
// Throughput stays at 1 candidate pixel per cycle once the pipe
// is full. Latency is 2-3 cycles per pixel (S0 → emit). The FSM
// adds an R_DRAIN state after S0 finishes the bbox so S1 and S2
// can flush before the next FIFO entry pops; raster_active stays
// high through R_SCAN AND R_DRAIN so external waits don't see
// gaps mid-primitive.
// ------------------------------------------------------------------
// S1 stage registers
logic [11:0] s1_x_q, s1_y_q;
logic s1_valid_q;
// S2 stage registers
logic [11:0] s2_x_q, s2_y_q;
logic s2_valid_q;
logic s2_inside_q;
logic signed [31:0] s2_L0_q, s2_L1_q, s2_L2_q;
raster_mode_e s2_mode_q; // mirrored — ras_mode could change between scans
// S1 combinational: edge functions on the s1 stage's latched coords.
logic [31:0] s1_e0, s1_e1, s1_e2;
logic [31:0] s1_e0_biased, s1_e1_biased, s1_e2_biased;
logic s1_in_triangle;
logic s1_pixel_inside;
assign s1_e0 = edge_fn(s1_x_q, s1_y_q, ras_v0_x, ras_v0_y, ras_v1_x, ras_v1_y);
assign s1_e1 = edge_fn(s1_x_q, s1_y_q, ras_v1_x, ras_v1_y, ras_v2_x, ras_v2_y);
assign s1_e2 = edge_fn(s1_x_q, s1_y_q, ras_v2_x, ras_v2_y, ras_v0_x, ras_v0_y);
assign s1_e0_biased = s1_e0 + {31'd0, ras_bias[0]};
assign s1_e1_biased = s1_e1 + {31'd0, ras_bias[1]};
assign s1_e2_biased = s1_e2 + {31'd0, ras_bias[2]};
assign s1_in_triangle = ($signed(s1_e0_biased) <= 32'sd0)
& ($signed(s1_e1_biased) <= 32'sd0)
& ($signed(s1_e2_biased) <= 32'sd0);
assign s1_pixel_inside = (ras_mode == RM_SPRITE) ? 1'b1
: (ras_mode == RM_TRI) ? s1_in_triangle
: 1'b0;
// ------------------------------------------------------------------
// Brick 1 — SPRITE texture sampling (PSMCT32 DECAL).
//
// Per-pixel texel coordinate by LINEAR interpolation across the
// sprite, evaluated on the S1-stage pixel (s1_x_q, s1_y_q):
// u = u0 + du_dx * (px - x0)
// v = v0 + dv_dy * (py - y0)
// where (x0,y0)/(x1,y1) are the two SPRITE endpoints
// (ras_v0_*, ras_v1_*) and (u0,v0)/(u1,v1) the integer texel
// coords latched at those endpoints.
//
// SYNTHESIZABLE DDA (no per-pixel divide). The per-texel step
// du_dx = (u1-u0)/(x1-x0) and dv_dy = (v1-v0)/(y1-y0) are
// computed ONCE per primitive in Q16.16 fixed-point at FIFO push
// (a single divide each, latency-tolerant — it happens at the
// close cycle, long before SCAN). The popped ras_du_dx_q /
// ras_dv_dy_q are then applied per pixel with a MULTIPLY + shift
// only — fully synthesizable. For the common 1:1 texel:pixel
// SPRITE (step == 1.0 exactly), the Q16 step is exactly 0x10000
// so u = px-x0+u0 / v = py-y0+v0 to the bit, byte-identical to
// the old per-pixel divide. Degenerate spans (x1==x0) latch a
// zero step so the coord holds u0/v0. The whole path is now LIVE
// for synthesis — no `// synthesis translate_off`.
//
// The sampler is driven from the S1 stage and its returned color
// is registered into s2_tex_color_q at the S1->S2 edge, so the
// texel lands in the SAME pipeline slot as the S2 pixel it
// belongs to (tex_rd_data from the combinational read2 port is
// valid in the S1 cycle; the register absorbs the path into S2).
// ------------------------------------------------------------------
// Texturing-active gate. SPRITE feeds the linear DDA UV; a TME
// TRIANGLE (textured-triangle rung) feeds the AFFINE U/V computed
// from the shared-divider gradients (or the perspective U/V on the
// Ch301 path). Ch348 — DECAL texturing admits PSMCT32 (0x00) AND the
// indexed CLUT formats PSMT8 (0x13) / PSMT4 (0x14) for BOTH sprites
// and triangles: the index->CLUT->ABGR decode is in the shared
// gs_texture_unit, so no per-primitive-type restriction. (Affine
// PSMT8+CLUT triangles are proven by tb_gs_psmt8_clut_triangle.)
logic s1_tex_active; // texturing this pixel?
assign s1_tex_active = ((ras_mode == RM_SPRITE) ||
((ras_mode == RM_TRI) && ras_tri_active))
&& ras_tme
&& ((ras_tpsm == 6'h00) // PSMCT32 direct DECAL
|| (ras_tpsm == 6'h13) // Ch296 PSMT8 indexed DECAL
|| (ras_tpsm == 6'h14)); // Ch297 PSMT4 indexed DECAL
// DDA coordinate evaluator: coord = c0 + ((p - a) * step_q16) >>> 16,
// clamped to the 11-bit texel range. `step_q16` is the popped
// Q16.16 per-pixel increment; `p` the current screen coord; `a`
// the first endpoint's screen coord; `c0` the first endpoint's
// texel coord. Pure multiply + arithmetic shift — synthesizable.
function automatic logic [10:0] dda_uv(input logic [11:0] p,
input logic [11:0] a,
input logic [10:0] c0,
input logic signed [31:0] step_q16);
logic signed [31:0] dp;
logic signed [63:0] prod;
logic signed [31:0] coord;
logic signed [63:0] prod_sra;
dp = $signed({20'd0, p}) - $signed({20'd0, a});
prod = $signed({{32{step_q16[31]}}, step_q16}) * $signed({{32{dp[31]}}, dp});
prod_sra = prod >>> 16; // arithmetic >> 16
coord = $signed({21'd0, c0}) + prod_sra[31:0];
return coord[10:0];
endfunction
// Texel coordinate source. TEX_RD_REGISTERED selects whether the
// address is generated on the S1 stage (combinational read2) or the
// S0 stage (registered/BRAM read2 — one cycle earlier so the
// registered data still lands at S1). ras_cur_x/ras_cur_y are the S0
// walker coords that latch into s1_x_q/s1_y_q at the next edge, so
// the same coordinate is sampled either way — only the cycle it is
// presented to VRAM differs. ras_v0_*/ras_u0/ras_v0t/ras_d*_q are
// per-primitive constants (held for the whole scan) so the DDA is
// identical regardless of which stage's coords feed it.
logic [11:0] tex_coord_x, tex_coord_y;
logic tex_coord_valid;
assign tex_coord_x = TEX_RD_REGISTERED ? ras_cur_x : s1_x_q;
assign tex_coord_y = TEX_RD_REGISTERED ? ras_cur_y : s1_y_q;
assign tex_coord_valid = TEX_RD_REGISTERED ? (raster_state == R_SCAN)
: s1_valid_q;
// Texture-stage pixel offset from post-swap vertex 0, for the TRI
// affine U/V. Same offset basis (relative to ras_v0_x/y) and the
// same right-sized signed-16 narrowing the S2 colour/Z interp uses;
// here it is evaluated on the texture-stage coords so the sampled
// texel lines up with the pixel emitted at that location.
logic signed [31:0] tex_dx, tex_dy;
assign tex_dx = $signed({20'd0, tex_coord_x}) - $signed({20'd0, ras_v0_x});
assign tex_dy = $signed({20'd0, tex_coord_y}) - $signed({20'd0, ras_v0_y});
// Ch301 perspective — the perspective divide runs its OWN longer pipeline
// off the S1 stage (gs_persp_uv 4 cyc + texel 1 cyc), gated by the S1-timed
// persp_in_valid (s1_valid_q && s1_pixel_inside) and emitted against
// persp_x5/persp_y5 (= s1_x_q/s1_y_q delayed 5). So the S/T/Q interp MUST
// use the S1-stage coords (s1_x_q/s1_y_q), NOT the S0-based tex_coord (which
// the affine registered-read path uses) — else the numerators are one pixel
// ahead of their valid/emit-coord and the texture shifts by one texel.
logic signed [31:0] tex_dx_s1, tex_dy_s1;
assign tex_dx_s1 = $signed({20'd0, s1_x_q}) - $signed({20'd0, ras_v0_x});
assign tex_dy_s1 = $signed({20'd0, s1_y_q}) - $signed({20'd0, ras_v0_y});
// Triangle-interpolated affine U/V (reuses the 32x16 interp_affine_uv).
logic [10:0] s1_u_tri, s1_v_tri;
assign s1_u_tri = interp_affine_uv(ras_u0_base, ras_du_dx_t, ras_du_dy_t, tex_dx, tex_dy);
assign s1_v_tri = interp_affine_uv(ras_v0_base, ras_dv_dx_t, ras_dv_dy_t, tex_dx, tex_dy);
// Ch301 perspective — per-pixel SCREEN-LINEAR interpolated S/T/Q
// (24-bit, FRAC=12), evaluated on the SAME tex_dx/tex_dy offsets the
// affine U/V use, via the no-clamp interp_affine_wide. These are the
// perspective NUMERATORS/DENOMINATOR: the per-pixel divide
// u = S/Q, v = T/Q is done by gs_persp_uv on the EMIT side.
// Ch301 perspective: emit-side wiring TBD (human) — feed
// s1_s_persp / s1_t_persp / s1_q_persp into the gs_persp_uv
// (uq,vq,q) inputs, gate the sampler u/v mux on ras_persp, and add
// the matching 4-cycle delay of color/z/x/y/valid.
// GATED: 0 when PERSPECTIVE_CORRECT=0 (the generate ties them low so
// no perspective datapath is synthesized in the affine-only build).
logic [23:0] s1_s_persp, s1_t_persp, s1_q_persp;
generate
if (PERSPECTIVE_CORRECT) begin : g_persp_interp
assign s1_s_persp = interp_affine_wide(ras_s0_base, ras_ds_dx, ras_ds_dy, tex_dx_s1, tex_dy_s1);
assign s1_t_persp = interp_affine_wide(ras_t0_base, ras_dt_dx, ras_dt_dy, tex_dx_s1, tex_dy_s1);
assign s1_q_persp = interp_affine_wide(ras_q0_base, ras_dq_dx, ras_dq_dy, tex_dx_s1, tex_dy_s1);
end else begin : g_no_persp_interp
assign s1_s_persp = 24'd0;
assign s1_t_persp = 24'd0;
assign s1_q_persp = 24'd0;
end
endgenerate
// U/V presented to the texel-fetch path: SPRITE → linear DDA;
// TRIANGLE → affine. Selected by ras_tri_active (set only for a TRI).
logic [10:0] s1_u, s1_v;
assign s1_u = ras_tri_active ? s1_u_tri
: dda_uv(tex_coord_x, ras_v0_x, ras_u0, ras_du_dx_q);
assign s1_v = ras_tri_active ? s1_v_tri
: dda_uv(tex_coord_y, ras_v0_y, ras_v0t, ras_dv_dy_q);
// Ch310 — sub-texel FRACTION paralleling s1_u/s1_v (BILINEAR). On the
// affine TRI path it is interp_affine_uv_frac over the SAME args as
// s1_u_tri/s1_v_tri (so the 4-bit frac re-pairs with the integer texel
// exactly). For ras_persp or a non-TRI (SPRITE) primitive we pass frac=0
// — bilinear is exercised only on the affine combined TRI, and a 0 frac is
// the nearest-tap value there, which is safe. These wires are UNREAD by the
// nearest texel path, so at BILINEAR_ENABLE=0 the default build is unchanged.
logic [3:0] s1_u_frac, s1_v_frac;
assign s1_u_frac = ras_tri_active
? interp_affine_uv_frac(ras_u0_base, ras_du_dx_t, ras_du_dy_t, tex_dx, tex_dy)
: 4'd0;
assign s1_v_frac = ras_tri_active
? interp_affine_uv_frac(ras_v0_base, ras_dv_dx_t, ras_dv_dy_t, tex_dx, tex_dy)
: 4'd0;
// ------------------------------------------------------------------
// Ch302 — PERSPECTIVE-CORRECT emit-side wiring.
//
// The per-pixel divide (u = S/Q, v = T/Q) is done by gs_persp_uv
// (LATENCY=4: in_valid → out_valid). It consumes the screen-linear
// interpolated S_fp/T_fp/Q_fp (s1_s_persp/s1_t_persp/s1_q_persp),
// which are valid at the SAME S1 texture stage as the affine
// s1_u/s1_v. Its integer texel output (persp_u, persp_v) is muxed
// into the SHARED gs_texture_unit below (a perspective primitive
// never does affine simultaneously — only one read2 port exists).
//
// Pipeline alignment (combinational read2: SEL_DELAY=0, RD_LATENCY=1):
// S1+0 : gs_persp_uv.in_valid presented (S/T/Q live at S1)
// S1+4 : persp_outvalid + persp_u/persp_v valid
// S1+5 : s1_tex_color (perspective texel) valid (1 cyc read latency)
// AND persp_emit5 (= persp_outvalid delayed 1) high
// AND persp_x5/persp_y5 (= s1_x_q/s1_y_q delayed 5) aligned
//
// GATED: outputs declared here, driven only inside the
// PERSPECTIVE_CORRECT generate; tied 0 in the else so the param=0
// build is byte-identical (ras_persp is constant 0 there too).
logic [10:0] persp_u, persp_v;
logic persp_outvalid;
logic [11:0] persp_x5, persp_y5;
logic persp_emit5;
logic [5:0] persp_inflight; // in-flight count for the drain gate
logic persp_pipe_busy; // OR-reduce: any perspective work pending
generate
if (PERSPECTIVE_CORRECT) begin : g_persp_emit
logic persp_in_valid;
assign persp_in_valid = ras_persp && s1_valid_q && s1_pixel_inside;
`ifdef CH351_TRACE
// Ch351 trace: dump the per-pixel interpolated perspective attributes for one scanline. If
// s1_q_persp is CONSTANT across x (not stepping), the perspective slopes are 0/unsolved =
// under-interpolation; if it steps smoothly, the interp is fine and the divide is the suspect.
always_ff @(posedge clk) if (persp_in_valid && s1_y_q == `CH351_TRACE)
$display("CH351 x=%0d y=%0d s=%0d t=%0d q=%0d (ds_dx=%0d dq_dx=%0d q0=%0d)",
s1_x_q, s1_y_q, $signed(s1_s_persp), $signed(s1_t_persp), $signed(s1_q_persp),
ras_ds_dx, ras_dq_dx, ras_q0_base);
`endif
gs_persp_uv #(
.ATTR_W(24), .Q_W(24), .FRAC(12), .SCALE(24),
.RECIP_W(25), .TEXEL_W(11), .TEXEL_MAX(2047),
.RECIP_IDX_BITS(PERSP_RECIP_IDX_BITS)
) u_persp_uv (
.clk (clk),
.rst_n (rst_n),
.in_valid (persp_in_valid),
.uq (s1_s_persp),
.vq (s1_t_persp),
.q (s1_q_persp),
.out_valid(persp_outvalid),
.u (persp_u),
.v (persp_v)
);
// 5-deep shift register of {x,y} so persp_x5/persp_y5 hold the
// coords of the pixel whose perspective texel is valid this
// cycle (S1+5). reg[4] is the +5 tap.
logic [11:0] persp_x_sr [0:4];
logic [11:0] persp_y_sr [0:4];
always_ff @(posedge clk or negedge rst_n) begin
if (!rst_n) begin
for (int i = 0; i < 5; i++) begin
persp_x_sr[i] <= 12'd0;
persp_y_sr[i] <= 12'd0;
end
persp_emit5 <= 1'b0;
persp_inflight <= 6'd0;
end else begin
persp_x_sr[0] <= s1_x_q;
persp_y_sr[0] <= s1_y_q;
for (int i = 1; i < 5; i++) begin
persp_x_sr[i] <= persp_x_sr[i-1];
persp_y_sr[i] <= persp_y_sr[i-1];
end
// Emit strobe = persp_outvalid (S1+4) registered once
// → lands at S1+5, exactly when s1_tex_color is the
// perspective texel for persp_x5/persp_y5.
persp_emit5 <= persp_outvalid;
// In-flight count: a divide enters on in_valid, leaves
// on persp_emit5. Tracks the whole 5-deep emit pipe so
// R_DRAIN waits for the perspective pipe to empty.
case ({persp_in_valid, persp_emit5})
2'b10: persp_inflight <= persp_inflight + 6'd1;
2'b01: persp_inflight <= persp_inflight - 6'd1;
default: persp_inflight <= persp_inflight;
endcase
end
end
assign persp_x5 = persp_x_sr[4];
assign persp_y5 = persp_y_sr[4];
// Busy while any divide is in flight OR an emit is about to
// fire this cycle (persp_emit5 / persp_outvalid not yet drained).
assign persp_pipe_busy = (persp_inflight != 6'd0)
|| persp_outvalid || persp_emit5;
end else begin : g_no_persp_emit
assign persp_u = 11'd0;
assign persp_v = 11'd0;
assign persp_outvalid = 1'b0;
assign persp_x5 = 12'd0;
assign persp_y5 = 12'd0;
assign persp_emit5 = 1'b0;
assign persp_inflight = 6'd0;
assign persp_pipe_busy = 1'b0;
end
endgenerate
logic [31:0] s1_tex_color;
// Ch296 — SEL_DELAY re-pairs the PSMT8 byte selector with the
// in-flight registered read when the texel address advances each
// cycle (TEX_RD_REGISTERED=1). Combinational-read mode holds the
// address stable, so SEL_DELAY=0.
// Ch310 — bilinear sampler status (driven only when BILINEAR_ENABLE=1).
// s1_tex_outvalid pulses when the 4-tap sample completes; tex_busy is high
// across the multi-beat run. On the nearest path (BILINEAR_ENABLE=0 or
// non-PSMCT32 / MMAG=0) busy stays 0 and out_valid follows the 1-cycle read,
// so the combined FSM never enters CB_TWAIT and the build is byte-identical.
logic tex_busy;
logic s1_tex_outvalid;
gs_texture_unit #(
.PSMT4_SWIZZLE(PSMT4_SWIZZLE),
.PSMT8_SWIZZLE(PSMT8_SWIZZLE),
.PSMCT32_SWIZZLE(PSMCT32_SWIZZLE),
.TEX_WRAP_ENABLE(TEX_WRAP_ENABLE),
.BILINEAR_ENABLE(BILINEAR_ENABLE),
.PALETTE_BILINEAR(PALETTE_BILINEAR),
.RD_LATENCY(TEX_RD_LATENCY),
.SEL_DELAY (TEX_RD_REGISTERED ? TEX_RD_LATENCY : 0)
) u_tex (
.clk (clk),
.rst_n (rst_n),
// Ch302 — perspective primitive feeds the SHARED texture unit
// from gs_persp_uv (persp_u/v valid at S1+4). A perspective prim
// never simultaneously does affine, so muxing on ras_persp is
// exclusive. When ras_persp=0 (always so at PERSPECTIVE_CORRECT=0)
// this is byte-identical to the affine inputs.
// COMBINED probe: in combined mode the texel fetch is issued ONLY
// on beat 1 (comb_tex_req), so tex_rd_en fires for exactly one
// cycle per passing pixel. The u/v already track the HELD walker
// pixel (tex_coord_x/y == ras_cur_x/y under TEX_RD_REGISTERED=1
// while the walker is stalled), so the legacy s1_u/s1_v feed the
// correct affine texel address. ras_combined is constant 0 at
// COMBINED_TAZ=0, so this mux collapses to the legacy expression.
.in_valid (ras_combined ? comb_tex_req
: (ras_persp ? persp_outvalid
: (tex_coord_valid && s1_tex_active
&& !(ras_tex_abe && ta_beat_q)))), // Ch344: texel on primary beat only
.u (ras_persp ? persp_u : s1_u),
.v (ras_persp ? persp_v : s1_v),
// Ch310 — fractional UV for BILINEAR. Perspective passes 0 fracs
// (bilinear demo uses the affine TRI path). Unused at BILINEAR_ENABLE=0.
.u_frac (ras_persp ? 4'd0 : s1_u_frac),
.v_frac (ras_persp ? 4'd0 : s1_v_frac),
// Ch310 — per-primitive runtime filter select (TEX1_1.MMAG snapshot).
.filter_lin (ras_filter_lin),
.tbp0_base_bytes (ras_tex_base),
.tbw (ras_tbw),
.psm (ras_tpsm),
.wms (ras_wms),
.wmt (ras_wmt),
.tw (ras_tw),
.th (ras_th),
.tex_rd_en (tex_rd_en),
.tex_rd_addr (tex_rd_addr),
.tex_rd_data (tex_rd_data),
.clut_rd_idx (clut_rd_idx),
.clut_rd_data (clut_rd_data),
.out_valid (s1_tex_outvalid), // Ch310 — consumed by CB_TWAIT; nearest path ignores it
.tex_color (s1_tex_color),
.busy (tex_busy) // Ch310 — high across the 4-tap run; 0 on nearest
);
// S1 -> S2 texel register. s2_tex_active_q gates the substitution
// at emit; s2_tex_color_q is the sampled ABGR for the S2 pixel.
logic s2_tex_active_q;
logic [31:0] s2_tex_color_q;
// ------------------------------------------------------------------
// Brick 2a — S3 alpha-blend stage.
//
// When ras_abe is active, an S2 inside-pixel does NOT drive the
// emit outputs directly. Instead it presents the dest-read address
// (= the pixel's write address) on fb_rd_addr/fb_rd_en during the
// S2 cycle and latches a "pending blend" record into the S3 stage.
// One cycle later, S3 blends the source color (s3_cs_*) against the
// dest pixel read back from VRAM and drives the emit outputs. This
// adds exactly ONE pipeline cycle for blended primitives; opaque
// primitives (ras_abe=0) keep the original direct S2->output emit
// and are byte-identical to pre-Brick-2a.
// ------------------------------------------------------------------
logic s3_valid_q; // a blended pixel is in the S3 stage
logic [11:0] s3_x_q, s3_y_q;
logic [31:0] s3_fb_addr_q;
logic [3:0] s3_be_q;
logic [31:0] s3_mask_q;
logic [5:0] s3_psm_q;
logic [63:0] s3_cs_q; // source color (ABGR + Q), pre-blend
logic [31:0] s3_dest_q; // dest pixel (PSMCT32), comb-read latch
logic [7:0] s3_as_q; // source alpha (RGBAQ.A, 0..128 scale)
// Dest pixel feeding the blend: live for the registered read port
// (data arrives at S3), latched copy for the combinational port
// (data captured at the S2->S3 edge).
logic [31:0] s3_dest_pixel;
assign s3_dest_pixel = FB_RD_REGISTERED ? fb_rd_data : s3_dest_q;
// ------------------------------------------------------------------
// Brick 2b — S3 Z-test stage + half-rate 2-beat write sequencing.
//
// A Z-tested inside pixel presents its Z-buffer address on z_rd_*
// during S2 and latches a "pending Z-test" record into the S3 stage
// (mirrors the alpha S3 mechanism). One cycle later, S3 compares the
// fragment Z (constant flat ras_z_value) against the stored Z read
// back from VRAM, per ZTST. On PASS the pixel needs TWO writes
// through the single VRAM write port: the fb COLOR and (if ZMSK=0)
// the Z VALUE. We serialize them across two beats and freeze the
// pixel pipeline for the second beat so the next pixel's color does
// not contend for the port.
//
// z_beat: 0 = "primary" beat (pipeline advances; color write fires
// on a passing pixel and a pending Z write is queued);
// 1 = "z" beat (pipeline frozen; the queued Z write fires).
// z_beat toggles every cycle ONLY while ras_zte is active and a scan
// is in flight. When ras_zte=0 it is held at 0 and the pipeline runs
// full rate — byte-identical to pre-Brick-2b.
// ------------------------------------------------------------------
logic s3_zvalid_q; // a Z-tested pixel is in the S3 stage
logic [11:0] s3_zx_q, s3_zy_q;
logic [31:0] s3_zfb_addr_q; // framebuffer (color) byte address
logic [31:0] s3_z_addr_q; // Z-buffer byte address
logic [63:0] s3_zcolor_q; // fb color to write on pass (PSMCT32)
logic [31:0] s3_zval_q; // fragment Z to write on pass
logic [1:0] s3_ztst_q; // compare op
logic s3_zmsk_q; // 1 => do not update Z
logic [31:0] s3_zstored_q; // stored Z, captured for the compare
// Stored Z feeding the compare: ALWAYS the latched copy. Under the
// half-rate 2-beat sequencing the S2->S3 distance is 2 cycles, so a
// "live registered z_rd_data at S3" (the alpha-style trick) would be
// off by one pixel. Instead s3_zstored_q is captured on the FROZEN
// z beat that immediately follows the S2 primary beat (see the
// pipeline body). At that moment z_rd_addr has been held across the
// primary->frozen edge, so the registered read (1-cycle) AND the
// combinational read both present this pixel's stored Z — a single
// capture point is correct for both Z_RD_REGISTERED settings.
logic [31:0] s3_zstored;
assign s3_zstored = s3_zstored_q;
// Depth comparison (PS2: larger Z = nearer the viewer).
// NEVER : never pass
// ALWAYS : always pass
// GEQUAL : frag_Z >= stored_Z
// GREATER : frag_Z > stored_Z
logic z_test_pass;
always_comb begin
unique case (s3_ztst_q)
2'd0: z_test_pass = 1'b0; // NEVER
2'd1: z_test_pass = 1'b1; // ALWAYS
2'd2: z_test_pass = (s3_zval_q >= s3_zstored); // GEQUAL
2'd3: z_test_pass = (s3_zval_q > s3_zstored); // GREATER
default: z_test_pass = 1'b0;
endcase
end
// Half-rate beat toggle. Combinational "is this a Z scan" gate.
logic z_scan_active;
assign z_scan_active = ras_zte && (raster_state != R_IDLE);
logic z_beat_q; // 0 = primary, 1 = z beat
logic z_advance; // pipeline advances this cycle?
assign z_advance = !(z_scan_active && z_beat_q);
// Queued Z write produced on the primary beat (a passing pixel that
// must update Z). Consumed on the z beat.
logic zw_pending_q;
logic [31:0] zw_addr_q;
logic [31:0] zw_val_q;
logic [7:0] blend_r, blend_g, blend_b, blend_a;
gs_alpha_blend u_blend (
.cs_r (s3_cs_q[7:0]),
.cs_g (s3_cs_q[15:8]),
.cs_b (s3_cs_q[23:16]),
.as (s3_as_q),
.cd_r (s3_dest_pixel[7:0]),
.cd_g (s3_dest_pixel[15:8]),
.cd_b (s3_dest_pixel[23:16]),
.cv_r (blend_r),
.cv_g (blend_g),
.cv_b (blend_b),
.a_out(blend_a)
);
// Blended PSMCT32 ABGR word (source alpha passed through), Q upper.
logic [63:0] s3_blended_color64;
assign s3_blended_color64 = {s3_cs_q[63:32],
blend_a, blend_b, blend_g, blend_r};
// ==================================================================
// Ch344 — TEXTURED + source-over ALPHA SPRITE path (isolated; NOT the
// combined-TAZ compositor). A ras_tex_abe SPRITE runs HALF-RATE so the
// texel read (S1, read2) and the dest read (frozen beat, read2) never
// co-fire — same mutual-exclusion guarantee as Brick 2b's Z sprite:
// primary beat (ta_beat=0): texel fetch (tex_rd); latched to s2_tex_color_q.
// frozen beat (ta_beat=1): dest fetch (fb_rd) -> ta_dest_q.
// next primary beat : blend texel*vtx-color over ta_dest_q, emit.
// Source alpha = TEXEL alpha (TCC=1). Source color = texel MODULATEd by the
// per-vertex RGBAQ tint (ras_color); a unity 0x80 vertex tint = pass-through.
// ras_tex_abe is constant 0 at SPRITE_TEX_ALPHA=0, so all of this prunes.
logic ta_scan_active;
assign ta_scan_active = ras_tex_abe && (raster_state != R_IDLE);
logic ta_beat_q; // 0 = primary (texel), 1 = frozen (dest)
logic ta_advance;
assign ta_advance = !(ta_scan_active && ta_beat_q);
// This path targets the BOARD wrapper's read model: TEX_RD_REGISTERED=1 + FB_RD_REGISTERED=1 (1-cycle
// registered read2). Under that model the streaming s2_tex_color_q is captured on the advance beat —
// one cycle too early for the texel (which lands on the frozen beat) — so we capture our OWN copies:
// ta_tex_q <= s1_tex_color on the FROZEN beat (the texel for THIS S2 pixel is the registered result
// of the primary-beat texel request — exactly how the combined FSM latches it at CB_T), and
// the dest = fb_rd_data live on the primary/emit beat (registered result of the frozen-beat fb_rd).
// Texel alignment to S2 (registered read2): the texel for a pixel is the registered result of the
// texel request issued when that pixel was at S1 (primary beat) — it lands on the NEXT (frozen) beat.
// But the half-rate stall delays the pixel reaching S2 by an extra beat, so we capture s1_tex_color
// UNCONDITIONALLY on every frozen beat into ta_tex_q and keep a 1-deep delayed copy ta_tex_q1; the
// delayed copy is exactly aligned with the S2 pixel at its emit beat (verified cycle-by-cycle).
logic [31:0] ta_tex_q; // newest sampled texel (frozen-beat capture)
logic [31:0] ta_tex_q1; // 1-deep delay -> aligned with the S2 pixel being emitted
logic [31:0] ta_dest_q; // (combinational fallback only) dest captured on the frozen beat
function automatic logic [7:0] ta_mod8(input logic [7:0] t, input logic [7:0] c);
logic [15:0] p; begin p = ({8'd0,t} * {8'd0,c}) >> 7; ta_mod8 = (p > 16'd255) ? 8'hFF : p[7:0]; end
endfunction
logic [31:0] ta_cs; // MODULATEd source color (texel alpha kept)
assign ta_cs = {ta_tex_q1[31:24],
ta_mod8(ta_tex_q1[23:16], ras_color[23:16]),
ta_mod8(ta_tex_q1[15:8], ras_color[15:8]),
ta_mod8(ta_tex_q1[7:0], ras_color[7:0])};
logic [31:0] ta_cd;
assign ta_cd = FB_RD_REGISTERED ? fb_rd_data : ta_dest_q;
logic [7:0] ta_br, ta_bg, ta_bb, ta_ba;
gs_alpha_blend u_ta_blend (
.cs_r (ta_cs[7:0]), .cs_g (ta_cs[15:8]), .cs_b (ta_cs[23:16]),
.as (ta_tex_q1[31:24]),
.cd_r (ta_cd[7:0]), .cd_g (ta_cd[15:8]), .cd_b (ta_cd[23:16]),
.cv_r (ta_br), .cv_g (ta_bg), .cv_b (ta_bb), .a_out (ta_ba)
);
logic [63:0] ta_blended_color64;
assign ta_blended_color64 = {ras_color[63:32], ta_ba, ta_bb, ta_bg, ta_br};
// Dest-read enable: frozen beat only (read2 = dest). The texel read is gated to the
// primary beat (in_valid below), so the two read2 consumers are time-exclusive.
logic s2_tex_abe_dest;
assign s2_tex_abe_dest = ras_tex_abe && s2_valid_q && s2_inside_q && ta_beat_q;
// Brick 3 — SYNTHESIZABLE per-pixel affine attribute interpolation,
// evaluated in the S2 stage. The popped Q16.16 gradients (ras_d*_dx /
// ras_d*_dy) and per-vertex-0 base values are constant for the
// in-flight primitive; S2 only needs s2_x_q/s2_y_q (relative to v0)
// to accumulate. Pure multiply + add + arithmetic shift — NO divide,
// NO `// synthesis translate_off`. This REPLACES the Ch86 non-
// synthesizable barycentric divide (interp_byte). The s2_L*_q
// barycentric weights are no longer used for interpolation (kept only
// as legacy white-box probes; the coverage test still uses the edge
// functions directly).
//
// A(x,y) = base0 + ( dx*dAdx + dy*dAdy ) >> 16
// dx = s2_x - v0_x , dy = s2_y - v0_y
//
// Color channels clamp to [0,255]; Z keeps the full 32-bit value.
logic signed [31:0] s2_dx, s2_dy;
assign s2_dx = $signed({20'd0, s2_x_q}) - $signed({20'd0, ras_v0_x});
assign s2_dy = $signed({20'd0, s2_y_q}) - $signed({20'd0, ras_v0_y});
// Ch296 — per-pixel multiplier resize (Quartus area fit). The
// affine step is `dadx*dx + dady*dy`. `dadx`/`dady` are signed
// Q16.16 gradients (full 32-bit). `dx`/`dy` are PIXEL offsets from
// the reference vertex: dx = s2_x_q - ras_v0_x with both operands
// 12-bit UNSIGNED screen coords ([0,4095]), so dx,dy ∈ [-4095,+4095]
// — strictly inside signed 16-bit ([-32768,+32767]) with ample
// margin (a signed 13-bit range would already suffice). The prior
// code sign-extended BOTH operands to 64 bits, synthesizing a
// 64×64 multiplier per attribute (~8 instances). Sizing dx/dy to
// signed 16 bits makes each a 32×16 signed multiply whose product
// (≤48 bits) and the 2-term sum (≤49 bits) are LOSSLESSLY captured
// by the 64-bit `step`/`acc`/`val` accumulators — so the accumulate,
// `>>> 16`, base add, and clamp are BIT-IDENTICAL to before. The
// dxw/dyw narrowing is a guaranteed value-preserving truncation
// because the high bits are pure sign extension of the 16-bit value
// (see the bound above). tb_gs_tri_interp asserts dx/dy in range.
//
// Affine evaluator for an 8-bit color channel: clamp result to 0..255.
function automatic logic [7:0] interp_affine8(
input logic [7:0] base0,
input logic signed [31:0] dadx,
input logic signed [31:0] dady,
input logic signed [31:0] dx,
input logic signed [31:0] dy);
logic signed [15:0] dxw, dyw; // pixel offsets, true width
logic signed [47:0] prod_x, prod_y; // 32×16 signed products
logic signed [63:0] acc;
logic signed [63:0] step;
logic signed [63:0] val;
dxw = dx[15:0];
dyw = dy[15:0];
prod_x = dadx * dxw; // 32×16 signed
prod_y = dady * dyw; // 32×16 signed
step = $signed({{16{prod_x[47]}}, prod_x})
+ $signed({{16{prod_y[47]}}, prod_y});
acc = step >>> 16; // arithmetic >>16
val = $signed({56'd0, base0}) + acc;
if (val < 64'sd0) interp_affine8 = 8'd0;
else if (val > 64'sd255) interp_affine8 = 8'd255;
else interp_affine8 = val[7:0];
endfunction
// Affine evaluator for the full 32-bit fragment Z (no clamp).
function automatic logic [31:0] interp_affine_z(
input logic [31:0] base0,
input logic signed [31:0] dadx,
input logic signed [31:0] dady,
input logic signed [31:0] dx,
input logic signed [31:0] dy);
logic signed [15:0] dxw, dyw; // pixel offsets, true width
logic signed [47:0] prod_x, prod_y; // 32×16 signed products
logic signed [63:0] step;
logic signed [63:0] val;
dxw = dx[15:0];
dyw = dy[15:0];
prod_x = dadx * dxw; // 32×16 signed
prod_y = dady * dyw; // 32×16 signed
step = $signed({{16{prod_x[47]}}, prod_x})
+ $signed({{16{prod_y[47]}}, prod_y});
val = $signed({32'd0, base0}) + (step >>> 16);
interp_affine_z = val[31:0];
endfunction
// Textured-triangle rung — affine evaluator for an 11-bit texel
// coordinate (U or V). IDENTICAL right-sized 32x16 multiply
// structure as interp_affine8/_z (dadx/dady are signed Q16.16
// gradients; dx/dy are the same signed-16 pixel offsets). Clamp to
// the 0..2047 texel range so an out-of-range affine extrapolation
// (off-triangle rounding) saturates instead of wrapping. NO new
// multiplier width: prod_x/prod_y are the same 32x16 signed products
// the colour/Z evaluators use.
function automatic logic [10:0] interp_affine_uv(
input logic [10:0] base0,
input logic signed [31:0] dadx,
input logic signed [31:0] dady,
input logic signed [31:0] dx,
input logic signed [31:0] dy);
logic signed [15:0] dxw, dyw; // pixel offsets, true width
logic signed [47:0] prod_x, prod_y; // 32×16 signed products
logic signed [63:0] step;
logic signed [63:0] val;
dxw = dx[15:0];
dyw = dy[15:0];
prod_x = dadx * dxw; // 32×16 signed
prod_y = dady * dyw; // 32×16 signed
step = $signed({{16{prod_x[47]}}, prod_x})
+ $signed({{16{prod_y[47]}}, prod_y});
val = $signed({53'd0, base0}) + (step >>> 16);
if (val < 64'sd0) interp_affine_uv = 11'd0;
else if (val > 64'sd2047) interp_affine_uv = 11'd2047;
else interp_affine_uv = val[10:0];
endfunction
// Ch310 — SUB-TEXEL FRACTION sibling of interp_affine_uv (BILINEAR).
// Identical 32×16 multiply + accumulate structure; instead of the integer
// texel (val[10:0] after >>>16), it returns the 4-bit fraction that the
// >>>16 truncates away. base0 is an integer texel coord (no fraction), so
// the entire sub-texel fraction lives in the low 16 bits of `step`; the top
// 4 of those (step[15:12]) are the /16 frac the 4-tap lerp consumes.
// The fraction is forced to 0 whenever the integer part saturates to the
// 0 or 2047 texel-clamp boundary (matching interp_affine_uv's clamp), so a
// clamped edge tap blends with frac=0 = the boundary texel exactly (no
// half-step past the edge). iverilog-12: NO bit-select on a parenthesized
// expr — `step` is a named temp, then step[15:12].
function automatic logic [3:0] interp_affine_uv_frac(
input logic [10:0] base0,
input logic signed [31:0] dadx,
input logic signed [31:0] dady,
input logic signed [31:0] dx,
input logic signed [31:0] dy);
logic signed [15:0] dxw, dyw; // pixel offsets, true width
logic signed [47:0] prod_x, prod_y; // 32×16 signed products
logic signed [63:0] step;
logic signed [63:0] val;
dxw = dx[15:0];
dyw = dy[15:0];
prod_x = dadx * dxw;
prod_y = dady * dyw;
step = $signed({{16{prod_x[47]}}, prod_x})
+ $signed({{16{prod_y[47]}}, prod_y});
val = $signed({53'd0, base0}) + (step >>> 16);
// frac = step[15:12] (the truncated sub-texel bits), 0 at the clamps.
if (val < 64'sd0) interp_affine_uv_frac = 4'd0;
else if (val > 64'sd2047) interp_affine_uv_frac = 4'd0;
else interp_affine_uv_frac = step[15:12];
endfunction
// Ch301 perspective — WIDE affine evaluator for a 24-bit perspective
// attribute (S_fp / T_fp / Q_fp, FRAC=12). IDENTICAL 32x16 multiply
// structure as interp_affine_uv but returns the raw 24-bit linear
// interpolant WITHOUT the [0,2047] texel clamp — these are the
// numerator/denominator the per-pixel divide (gs_persp_uv, emit side)
// consumes, not direct texel coords. base0 is the 24-bit value at
// post-swap v0; dadx/dady the signed Q16.16 gradients; dx/dy the same
// signed-16 pixel offsets the affine U/V use. Negative / overflow are
// truncated to 24 bits (the gradient solve keeps values in range for
// an in-triangle pixel; off-triangle extrapolation is sampled but the
// emit-side divide / sampler bounds it).
function automatic logic [23:0] interp_affine_wide(
input logic [23:0] base0,
input logic signed [31:0] dadx,
input logic signed [31:0] dady,
input logic signed [31:0] dx,
input logic signed [31:0] dy);
logic signed [15:0] dxw, dyw; // pixel offsets, true width
logic signed [47:0] prod_x, prod_y; // 32×16 signed products
logic signed [63:0] step;
logic signed [63:0] val;
dxw = dx[15:0];
dyw = dy[15:0];
prod_x = dadx * dxw; // 32×16 signed
prod_y = dady * dyw; // 32×16 signed
step = $signed({{16{prod_x[47]}}, prod_x})
+ $signed({{16{prod_y[47]}}, prod_y});
val = $signed({40'd0, base0}) + (step >>> 16);
interp_affine_wide = val[23:0];
endfunction
logic [7:0] s2_interp_r, s2_interp_g, s2_interp_b, s2_interp_a;
assign s2_interp_r = interp_affine8(ras_c0_q[7:0], ras_dr_dx, ras_dr_dy, s2_dx, s2_dy);
assign s2_interp_g = interp_affine8(ras_c0_q[15:8], ras_dg_dx, ras_dg_dy, s2_dx, s2_dy);
assign s2_interp_b = interp_affine8(ras_c0_q[23:16], ras_db_dx, ras_db_dy, s2_dx, s2_dy);
assign s2_interp_a = interp_affine8(ras_c0_q[31:24], ras_da_dx, ras_da_dy, s2_dx, s2_dy);
// Ch335 — combined-path per-pixel interpolated (Gouraud) color, computed from the HELD
// candidate pixel's offset (comb_dx/comb_dy) so it aligns with s1_tex_color at CB_T. For a
// FLAT primitive the gradients are 0 so this collapses to ras_c0_q (the constant color) — i.e.
// Ch333/334 flat-color scenes are byte-identical; non-zero gradients give a smooth gradient.
wire [7:0] comb_interp_r = interp_affine8(ras_c0_q[7:0], ras_dr_dx, ras_dr_dy, comb_dx, comb_dy);
wire [7:0] comb_interp_g = interp_affine8(ras_c0_q[15:8], ras_dg_dx, ras_dg_dy, comb_dx, comb_dy);
wire [7:0] comb_interp_b = interp_affine8(ras_c0_q[23:16], ras_db_dx, ras_db_dy, comb_dx, comb_dy);
wire [31:0] comb_interp_color = {8'd0, comb_interp_b, comb_interp_g, comb_interp_r};
// Per-pixel interpolated fragment Z for a TRI (Brick 3). For a SPRITE
// the flat ras_z_value is used; the mux happens at the Z-test latch.
logic [31:0] s2_interp_z;
assign s2_interp_z = interp_affine_z(ras_z0, ras_dz_dx, ras_dz_dy, s2_dx, s2_dy);
logic [63:0] s2_interp_color;
assign s2_interp_color = {ras_c2_q[63:32], // Q passthrough
s2_interp_a, s2_interp_b, s2_interp_g, s2_interp_r};
// ----------------------------------------------------------------
// Ch95 — PSM-aware emit packing.
//
// The 32-bit ABGR produced by Gouraud (TRI) or `ras_color`
// (SPRITE) is the natural PSMCT32 lane. For 16-bit PSMs
// (PSMCT16/PSMCT16S/PSMZ16/PSMZ16S, all of which set
// ras_bpp_shift==1), pack ABGR → RGB5A1 by taking the high
// 5 bits of each color channel and the alpha MSB:
// r5 = R[7:3], g5 = G[7:3], b5 = B[7:3], a1 = A[7]
// pix16 = {a1, b5, g5, r5}
// The 16-bit value is placed in the LOW 16 bits of the
// emit lane, and `raster_pixel_be_q = 4'b0011` so vram_stub
// commits exactly the 2 bytes at write_addr (the pixel's
// byte address) without stomping anything else. write_addr
// here is the byte address from gs_stub's PSM-aware fb_addr
// math (Ch83) — for PSMCT16 it's already 2-byte aligned but
// not necessarily 4-byte aligned, and per-byte enable makes
// unaligned halfword writes safe.
// ----------------------------------------------------------------
logic [31:0] s2_natural_abgr;
logic [4:0] s2_pack_r5;
logic [4:0] s2_pack_g5;
logic [4:0] s2_pack_b5;
logic s2_pack_a1;
logic [15:0] s2_pack_psm16;
logic [7:0] s2_pack_t8_index;
logic [3:0] s2_pack_t4_nibble;
logic s2_psm_is_16bit;
logic s2_psm_is_t8;
logic s2_psm_is_t4;
logic [63:0] s2_emit_color64;
logic [3:0] s2_emit_be;
logic [31:0] s2_emit_mask;
// Brick 1 — for a textured SPRITE pixel (PSMCT32 direct or, Ch296,
// PSMT8 indexed DECAL) the sampled texel REPLACES the flat sprite
// color. Otherwise the SPRITE color is the latched flat ras_color,
// exactly as before. s2_tex_active_q is only ever set for
// SPRITE+TME+(PSMCT32|PSMT8), so when it's low the SPRITE path is
// byte-identical to pre-Brick-1.
logic [63:0] s2_sprite_color64;
assign s2_sprite_color64 = s2_tex_active_q
? {ras_color[63:32], s2_tex_color_q}
: ras_color;
// Textured-triangle rung — for a TME PSMCT32 TRIANGLE the sampled
// texel REPLACES the interpolated Gouraud color (DECAL), mirroring
// the SPRITE substitution above. s2_tex_active_q is set only for a
// TRI+TME+PSMCT32 pixel here, so an untextured Gouraud triangle keeps
// s2_interp_color exactly — the legacy Gouraud emit is byte-identical.
// ZTE/ABE are OFF for this rung (architect scope), so the texel goes
// straight to the emit color with no blend/Z interaction.
logic [63:0] s2_tri_color64;
assign s2_tri_color64 = s2_tex_active_q
? {s2_interp_color[63:32], s2_tex_color_q}
: s2_interp_color;
assign s2_natural_abgr = (s2_mode_q == RM_TRI) ? s2_tri_color64[31:0]
: s2_sprite_color64[31:0];
assign s2_pack_r5 = s2_natural_abgr[7:3];
assign s2_pack_g5 = s2_natural_abgr[15:11];
assign s2_pack_b5 = s2_natural_abgr[23:19];
assign s2_pack_a1 = s2_natural_abgr[31];
assign s2_pack_psm16 = {s2_pack_a1, s2_pack_b5, s2_pack_g5, s2_pack_r5};
assign s2_pack_t8_index = s2_natural_abgr[7:0]; // Ch105: real-GS uses R as PSMT8 index
assign s2_pack_t4_nibble = s2_natural_abgr[3:0]; // Ch106: low 4 bits of R as PSMT4 index
assign s2_psm_is_16bit = (ras_bpp_shift == 2'd1);
assign s2_psm_is_t8 = (ras_bpp_shift == 2'd0);
assign s2_psm_is_t4 = (ras_psm == 6'h14);
// Ch105: PSMT8 fb_addr is 1 byte/pixel — write_addr is the
// exact byte address. vram_stub commits mem[write_addr] from
// data[7:0] when write_be[0] is set, so a 1-byte commit at
// any byte alignment is just be=4'b0001 with the index in
// the low byte of write_data. No lane shifting needed.
//
// Ch106: PSMT4 packs 2 pixels per byte. The byte address is
// pixel_index >> 1; pixel_index[0] picks low (=0) or high
// (=1) nibble inside that byte. Emit be=4'b0001 with the
// 4-bit index in the LOW nibble of write_data and use
// write_mask = 0x0000_000F (low) / 0x0000_00F0 (high) so
// vram_stub merges only the targeted nibble — the OTHER
// nibble in the same byte is preserved. Back-to-back same-
// byte writes (e.g. PSMT4 pixels x=0 and x=1) chain through
// NBA semantics: the second write samples mem[addr] AFTER
// the prior commit, so both nibbles end up in the byte.
// Ch140 — nibble selector for PSMT4 emit. Linear path uses
// s2_pixel_index[0] (Ch106). Swizzled path (PSMT4_SWIZZLE=1)
// uses the swizzle module's nibble_hi output — required
// because the canonical PCSX2 columnTable4 reorders nibbles
// within a block, so s2_pixel_index[0] is wrong. Declared
// here at module scope (referenced by the always_comb below).
// swizzle4_raster_nibble_hi is generated further down where
// the PSMT4 raster swizzle module is instantiated; this is
// forward-referenced at elaboration time which iverilog
// accepts for net assignments.
logic psmt4_raster_nibble_select;
assign psmt4_raster_nibble_select = PSMT4_SWIZZLE ? swizzle4_raster_nibble_hi
: s2_pixel_index[0];
always_comb begin
if (s2_psm_is_t4) begin
// PSMT4: 4 bits per pixel. Index in low nibble of
// write_data, write_mask gates which nibble of the
// byte gets merged. The "high nibble" position is
// realised by placing the index at [7:4] in the
// commit data — the mask is 0x0F or 0xF0 to match.
// Selector keys on linear x[0] when PSMT4_SWIZZLE=0,
// or on the swizzle's nibble_hi when the gate is on.
s2_emit_color64 = {(s2_mode_q == RM_TRI) ? s2_interp_color[63:32]
: ras_color[63:32],
24'd0,
psmt4_raster_nibble_select ? {s2_pack_t4_nibble, 4'd0}
: {4'd0, s2_pack_t4_nibble}};
s2_emit_be = 4'b0001;
s2_emit_mask = psmt4_raster_nibble_select ? 32'h0000_00F0
: 32'h0000_000F;
end else if (s2_psm_is_t8) begin
// PSMT8: 1 byte per pixel. Index in the LOW byte of
// write_data; vram_stub commits exactly that byte at
// write_addr. Q (upper 32) preserved for trace-side
// consistency.
s2_emit_color64 = {(s2_mode_q == RM_TRI) ? s2_interp_color[63:32]
: ras_color[63:32],
24'd0, s2_pack_t8_index};
s2_emit_be = 4'b0001;
s2_emit_mask = 32'hFFFF_FFFF;
end else if (s2_psm_is_16bit) begin
// 16-bit pixel goes in the LOW halfword; upper half
// doesn't matter (be gates it off). Q (upper 32) is
// preserved for trace-side consistency.
s2_emit_color64 = {(s2_mode_q == RM_TRI) ? s2_interp_color[63:32]
: ras_color[63:32],
16'd0, s2_pack_psm16};
s2_emit_be = 4'b0011;
s2_emit_mask = 32'hFFFF_FFFF;
end else begin
// PSMCT32 (4 bytes/pixel). For a textured SPRITE the
// sampled texel replaces the flat color (Brick 1); for a
// textured TRIANGLE the texel replaces the Gouraud color
// (s2_tri_color64). Both collapse to their respective
// untextured color when texturing is inactive, so the
// TME=0 paths are byte-identical.
s2_emit_color64 = (s2_mode_q == RM_TRI) ? s2_tri_color64
: s2_sprite_color64;
s2_emit_be = 4'b1111;
s2_emit_mask = 32'hFFFF_FFFF;
end
end
// fb_addr for the S2-stage pixel.
logic [31:0] s2_fbp_bytes;
logic [31:0] s2_pixels_per_row;
logic [31:0] s2_pixel_index;
logic [31:0] s2_pixel_byte_offset;
logic [31:0] s2_fb_addr;
assign s2_fbp_bytes = {23'd0, ras_fbp} << 11;
assign s2_pixels_per_row = {26'd0, ras_fbw} << 6;
assign s2_pixel_index = ({20'd0, s2_y_q} * s2_pixels_per_row)
+ {20'd0, s2_x_q};
// Ch106: PSMT4 byte offset is pixel_index >> 1 (2 pixels per
// byte). Other PSMs use pixel_index << ras_bpp_shift.
assign s2_pixel_byte_offset = s2_psm_is_t4 ? (s2_pixel_index >> 1)
: (s2_pixel_index << ras_bpp_shift);
logic [31:0] s2_fb_addr_linear;
assign s2_fb_addr_linear = s2_fbp_bytes + s2_pixel_byte_offset;
// Ch302 — perspective fb addr. The perspective DECAL primitive is
// PSMCT32, non-swizzled (PSMCT32_SWIZZLE=0 for the demo), so it uses
// the SAME linear PSMCT32 formula s2_fb_addr_linear uses, but
// evaluated on the +5-aligned perspective coords (persp_x5/persp_y5)
// and with ras_bpp_shift==2 (4 bytes/pixel for PSMCT32).
// addr = FBP_bytes + (y*pixels_per_row + x) << 2
logic [31:0] persp_pixel_index;
logic [31:0] persp_fb_addr;
assign persp_pixel_index = ({20'd0, persp_y5} * s2_pixels_per_row)
+ {20'd0, persp_x5};
assign persp_fb_addr = s2_fbp_bytes + (persp_pixel_index << ras_bpp_shift);
// Ch122 — optional PSMCT32 raster swizzle. The swizzle module
// is purely combinational; we feed it the raster's per-cycle
// (s2_x_q, s2_y_q) plus the active framebuffer's (FBP, FBW)
// snapshot. When the gate is on AND the active PSM is PSMCT32,
// use its absolute byte address; otherwise fall back to the
// legacy linear formula.
logic s2_psm_is_ct32;
logic [31:0] s2_fb_addr_swizzled;
assign s2_psm_is_ct32 = (ras_psm == 6'h00);
gs_swizzle_psmct32_stub u_raster_swizzle (
.fbp (ras_fbp),
.fbw (ras_fbw),
.x (s2_x_q[11:0]),
.y (s2_y_q[11:0]),
.addr(s2_fb_addr_swizzled)
);
// Ch128 — optional PSMCT16 raster swizzle. Same wiring shape
// as Ch122 but uses gs_swizzle_psmct16_stub. Active when
// PSMCT16_SWIZZLE=1 AND ras_psm == PSMCT16 (=0x02). The
// swizzle module bakes its own page (64×64), block grid
// (4 cols × 8 rows of 16×8 blocks), and within-block
// columnTable16 in. Default PSMCT16_SWIZZLE=0 preserves the
// legacy linear PSMCT16 raster path for Ch95 and every other
// existing PSMCT16 raster TB.
logic s2_psm_is_ct16;
logic [31:0] s2_fb_addr_swizzled16;
assign s2_psm_is_ct16 = (ras_psm == 6'h02);
gs_swizzle_psmct16_stub u_raster_swizzle16 (
.fbp (ras_fbp),
.fbw (ras_fbw),
.x (s2_x_q[11:0]),
.y (s2_y_q[11:0]),
.addr(s2_fb_addr_swizzled16)
);
// Ch134 — optional PSMT8 raster swizzle. Same wiring shape as
// Ch122/Ch128 but uses gs_swizzle_psmt8_stub. Active when
// PSMT8_SWIZZLE=1 AND ras_psm == PSMT8 (=0x13). PSMT8 pages
// are 128 px wide (vs 64 px for direct-color PSMs) so the
// swizzle internally uses bw_pg = ras_fbw>>1 — PCSX2 asserts
// FBW must be even for PSMT8. Default PSMT8_SWIZZLE=0
// preserves the legacy linear PSMT8 raster path for Ch105 +
// every TB that emits PSMT8 raster (Ch107 PSMT4-via-CT16-CLUT
// palette etc.).
logic s2_psm_is_psmt8;
logic [31:0] s2_fb_addr_swizzled8;
assign s2_psm_is_psmt8 = (ras_psm == 6'h13);
gs_swizzle_psmt8_stub u_raster_swizzle8 (
.fbp (ras_fbp),
.fbw (ras_fbw),
.x (s2_x_q[11:0]),
.y (s2_y_q[11:0]),
.addr(s2_fb_addr_swizzled8)
);
// Ch140 — optional PSMT4 raster swizzle. Same wiring shape as
// Ch122/Ch128/Ch134 but uses gs_swizzle_psmt4_stub which
// outputs both an absolute byte address AND a nibble_hi
// selector. Active when PSMT4_SWIZZLE=1 AND ras_psm == PSMT4
// (=0x14). PSMT4 pages are 128×128 px; the swizzle
// internally uses bw_pg = ras_fbw>>1 — PCSX2 asserts FBW must
// be even for PSMT4. The nibble_hi output threads into the
// PSMT4 emit-color/mask logic above (see s2_psm_is_t4 branch
// in the always_comb): when this gate is on AND s2_psm_is_t4,
// the existing Ch106 nibble RMW machinery (write_be=4'b0001,
// write_mask 0x0F or 0xF0) keys on swizzle4_raster_nibble_hi
// instead of s2_pixel_index[0]. Default PSMT4_SWIZZLE=0
// preserves the legacy linear PSMT4 raster path.
logic s2_psm_is_t4_swizzle;
logic [31:0] s2_fb_addr_swizzled4;
logic swizzle4_raster_nibble_hi;
assign s2_psm_is_t4_swizzle = (ras_psm == 6'h14);
gs_swizzle_psmt4_stub u_raster_swizzle4 (
.fbp (ras_fbp),
.fbw (ras_fbw),
.x (s2_x_q[11:0]),
.y (s2_y_q[11:0]),
.addr (s2_fb_addr_swizzled4),
.nibble_hi(swizzle4_raster_nibble_hi)
);
// Per-PSM swizzle dispatch. The four parameters are
// independent; defaults of 0 keep every PSM on the legacy
// linear path.
assign s2_fb_addr = (PSMCT32_SWIZZLE && s2_psm_is_ct32) ? s2_fb_addr_swizzled :
(PSMCT16_SWIZZLE && s2_psm_is_ct16) ? s2_fb_addr_swizzled16 :
(PSMT8_SWIZZLE && s2_psm_is_psmt8) ? s2_fb_addr_swizzled8 :
(PSMT4_SWIZZLE && s2_psm_is_t4_swizzle) ? s2_fb_addr_swizzled4 :
s2_fb_addr_linear;
// Brick 2b — Z-buffer byte address for the S2-stage pixel.
//
// PSMZ32 is 4 bytes/pixel and shares the framebuffer's row stride
// (FBW), exactly like the color framebuffer, so the Z address has
// the same linear shape as s2_fb_addr_linear but with the Z-buffer
// base (ZBP * 2048) instead of FBP * 2048:
// z_addr = ZBP*2048 + (y * FBW*64 + x) * 4
// The Z buffer is stored as plain 32-bit words; brick 2b uses the
// legacy LINEAR layout (no Z swizzle in v1 — the swizzled raster
// gates apply to the COLOR fb only). The pixel index reuse keeps
// the same per-row arithmetic as the color path.
logic [31:0] s2_zbp_bytes;
logic [31:0] s2_z_addr;
assign s2_zbp_bytes = {23'd0, ras_zbp} << 11; // ZBP * 2048
// s2_pixel_index already = y*FBW*64 + x; PSMZ32 is 4 bytes/pixel.
assign s2_z_addr = s2_zbp_bytes + (s2_pixel_index << 2);
// Brick 2a — dest-fb read port drive. Combinational from the S2
// stage: when this S2 pixel is an inside pixel of an alpha-blended
// primitive, present its write address as the dest-read address.
// The wrapper routes fb_rd_addr onto vram read2 and returns the
// dest pixel (combinationally or 1-cycle-registered per
// FB_RD_REGISTERED). fb_rd_en stays low for opaque primitives, so
// ABE=0 reads nothing and the path is byte-identical.
logic s2_blend_pixel;
// Ch344 — FLAT alpha only (!ras_tme): ras_abe is now also set for textured-alpha sprites, but
// those drive their OWN half-rate beat-gated dest read (s2_tex_abe_dest below), never this
// streaming path — otherwise the legacy fb_rd would collide with the texel read on read2.
assign s2_blend_pixel = s2_valid_q && s2_inside_q && ras_abe && !ras_tme;
// COMBINED probe: in combined mode the dest-fb read fires only on
// beat 2 (comb_fb_req) at the held-pixel FB address. s2_blend_pixel
// is 0 during a combined scan (ras_abe is 0 for the combined prim),
// so the OR is conflict-free; ras_combined is constant 0 at param=0.
// Ch304 — in tile-local mode the combined dest read is ON-CHIP
// (tile_color RAM), so the VRAM dest-read enable MUST stay LOW for the
// combined primitive. tile_active is constant 0 at TILE_LOCAL=0, so
// this collapses to the legacy expression (byte-identical).
assign fb_rd_en = tile_active ? 1'b0
: ras_combined ? comb_fb_req : (s2_blend_pixel || s2_tex_abe_dest); // Ch344: +textured-alpha dest read (frozen beat)
assign fb_rd_addr = ras_combined ? comb_fb_addr : s2_fb_addr;
// Brick 2b — stored-Z read port drive. Combinational from the S2
// stage: when this S2 pixel is an inside pixel of a Z-tested
// primitive, present the Z-buffer address on read2. The wrapper
// routes z_rd_addr onto read2 (mutually exclusive with texel /
// dest-fb by feature) and returns the stored Z (comb or 1-cycle
// registered per Z_RD_REGISTERED). z_rd_en stays low for non-Z
// primitives so the path is byte-identical.
logic s2_ztest_pixel;
assign s2_ztest_pixel = s2_valid_q && s2_inside_q && ras_zte;
// COMBINED probe: in combined mode the stored-Z read fires only on
// beat 0 (comb_z_req) at the held-pixel Z address. s2_ztest_pixel is
// 0 during a combined scan (ras_zte is 0 for the combined prim), so
// the mux is conflict-free; ras_combined is constant 0 at param=0.
// Ch304 — in tile-local mode the combined stored-Z read is ON-CHIP
// (tile_z RAM), so the VRAM Z-read enable MUST stay LOW for the
// combined primitive. tile_active const 0 at TILE_LOCAL=0 (byte-id).
assign z_rd_en = tile_active ? 1'b0
: ras_combined ? comb_z_req : s2_ztest_pixel;
assign z_rd_addr = ras_combined ? comb_z_addr : s2_z_addr;
// Scan advance — row-major within the bounding box. last_pixel
// = at (x_max, y_max), end of scan.
logic ras_at_x_end;
logic ras_at_y_end;
assign ras_at_x_end = (ras_cur_x == ras_x_max);
assign ras_at_y_end = (ras_cur_y == ras_y_max);
raster_mode_e new_mode;
assign new_mode = classify_prim_for_raster(prim_type);
// Brick 2a — is the CLOSING primitive an alpha-blended FLAT SPRITE?
// Requirements (all must hold; otherwise opaque, byte-identical):
// - PRIM.ABE = 1
// - SPRITE primitive (RM_SPRITE)
// - destination FRAME PSM is PSMCT32 (0x00)
// - NOT textured-effective. A SPRITE textures only when
// prim_tme=1 AND its TEX0 PSM is PSMCT32 or (Ch296) PSMT8
// (matching s1_tex_active). A flat blend reads ONE dest pixel
// through read2; a textured blend would need TWO reads (texel +
// dest) — out of scope, so when a SPRITE is textured we fall back
// to opaque DECAL. Keeping this PSM set in lockstep with
// s1_tex_active preserves the read2 mutual-exclusion invariant
// (texel-fetch never collides with the alpha dest-read).
// - ALPHA_1 selects the source-over config.
logic close_tme_effective;
logic new_abe_active;
assign close_tme_effective = prim_tme && ((tex0_psm == 6'h00) ||
(tex0_psm == 6'h13));
// Ch344 — new_abe_active now sets ras_abe for the FLAT alpha sprite (the original
// !close_tme_effective case) OR the textured-alpha sprite (new_tex_abe_active). This reuses
// the already-packed ras_abe so NO FIFO attr-word change is needed; ras_tme then distinguishes
// the two at the consumer (ras_tex_abe = ras_abe && ras_tme). With SPRITE_TEX_ALPHA=0,
// new_tex_abe_active is constant 0, so this collapses to the exact pre-Ch344 flat-only term.
assign new_abe_active = (prim_abe
&& (new_mode == RM_SPRITE)
&& (frame_1_q[29:24] == 6'h00) // PSMCT32 dest
&& !close_tme_effective
&& alpha_is_source_over)
|| new_tex_abe_active;
// Ch344 — is the CLOSING primitive a TEXTURED + source-over alpha SPRITE? Same SPRITE +
// PSMCT32-dest + source-over requirements as the flat case, but REQUIRES texturing (PSMCT32
// TEX0). Gated on SPRITE_TEX_ALPHA so param-off keeps a textured ABE sprite falling back to
// opaque DECAL. ORed into new_abe_active above (sets ras_abe); the legacy flat dest-read
// (s2_blend_pixel) is gated to !ras_tme so it never fires for these — the textured path uses
// its own half-rate beat-gated dest read.
logic new_tex_abe_active;
assign new_tex_abe_active = SPRITE_TEX_ALPHA
&& prim_abe
&& (new_mode == RM_SPRITE)
&& (frame_1_q[29:24] == 6'h00) // PSMCT32 dest
&& prim_tme
&& ((tex0_psm == 6'h00) // PSMCT32 texel (Ch344), or
|| (SPRITE_TEX_ALPHA_CLUT && tex0_psm == 6'h13)) // PSMT8 CLUT texel (Ch347)
&& alpha_is_source_over;
// Brick 2b — is the CLOSING primitive a Z-tested FLAT PSMCT32 SPRITE?
// Requirements (all must hold; otherwise no depth test, byte-
// identical to pre-2b):
// - TEST_1.ZTE = 1 AND ZBUF PSM = PSMZ32 (z_format_ok)
// - SPRITE primitive (RM_SPRITE)
// - destination FRAME PSM is PSMCT32 (0x00)
// - NOT textured-effective (a Z-tested flat sprite never textures;
// a textured Z-tested sprite would need TWO read2 reads/pixel —
// OUT OF SCOPE, falls back to no-Z opaque DECAL)
// - NOT alpha-blended (likewise a 2nd read2 consumer — OUT OF
// SCOPE; an ABE Z-tested sprite falls back to plain alpha blend
// without depth test)
// These exclusions make the stored-Z read the SOLE read2 consumer
// for the primitive, so read2 arbitration stays collision-free.
logic new_zte_active;
assign new_zte_active = z_format_ok
&& (new_mode == RM_SPRITE)
&& (frame_1_q[29:24] == 6'h00) // PSMCT32 dest
&& !close_tme_effective
&& !new_abe_active;
// Brick 3 — is the CLOSING primitive a Z-tested PSMCT32 TRIANGLE?
// Same requirements as the sprite Z path (ZTE=1, PSMZ32, PSMCT32
// dest), but for RM_TRI. TRI in v1 is never textured/blended (those
// FIFO fields are forced 0 on the TRI push), so the stored-Z read is
// again the sole read2 consumer and arbitration stays collision-free.
// The fragment Z is the per-pixel INTERPOLATED s2_interp_z (vs the
// sprite path's flat ras_z_value).
// Textured-triangle rung — a TME TRIANGLE drives the texel-fetch
// read2; excluding it here keeps the stored-Z read the SOLE read2
// consumer (mutual-exclusion-by-feature), exactly like the sprite Z
// path's !close_tme_effective guard. A textured triangle therefore
// has ZTE forced off (architect scope: ZTE=0/ABE=0 for this rung),
// so read2 arbitration stays collision-free.
logic new_tri_zte_active;
assign new_tri_zte_active = z_format_ok
&& (new_mode == RM_TRI)
&& (frame_1_q[29:24] == 6'h00) // PSMCT32 dest
&& !close_tme_effective;
// COMBINED probe — is the CLOSING primitive the textured + alpha +
// depth-tested TRIANGLE the combined FSM owns? ALL must hold:
// - COMBINED_TAZ enabled
// - TRIANGLE (RM_TRI)
// - PSMCT32 destination FRAME
// - TEXTURED, PSMCT32 DECAL (strictly tex0_psm==PSMCT32 — not T8;
// the combined sampler treats the texel as direct ABGR)
// - alpha-blended, source-over (ABE=1 + ALPHA_1 source-over)
// - depth-tested, PSMZ32 (z_format_ok already folds in TEST_1.ZTE)
// When set, this slot is pushed with fifo_combined=1 and the legacy
// per-feature FIFO fields (fifo_abe/fifo_zte) stay 0 so ONLY the
// combined FSM acts on the primitive. Constant 0 at COMBINED_TAZ=0.
// Ch313 — full PSMCT16 framebuffer mode. When TILE_COLOR_PSMCT16, the
// tile color RAM, dest-color reads and flush emit are ALL already PSMCT16
// (16-bit lanes, byte-enables, <<1 FB addr — see ~lines 3424/3499/6220).
// The only thing that was still forcing a PSMCT32 FRAME was THIS
// eligibility gate, so relax it to also accept a PSMCT16 FRAME
// (frame_1_q[29:24]==6'h02) when the tile machinery is in PSMCT16 mode.
// The whole render/flush/scanout path is then consistently PSMCT16 and the
// framebuffer is half-size. At TILE_COLOR_PSMCT16=0 (default) the second
// term is constant-0, so this collapses to the original PSMCT32-only gate
// (byte-identical). A PSMCT16 FRAME only makes sense with PSMCT16 tiles,
// so keying off TILE_COLOR_PSMCT16 also rules out nonsensical mixed combos.
logic close_combined;
assign close_combined = COMBINED_TAZ
&& (new_mode == RM_TRI)
&& ((frame_1_q[29:24] == 6'h00) // PSMCT32 dest
|| (TILE_COLOR_PSMCT16 && (frame_1_q[29:24] == 6'h02))) // Ch313: PSMCT16 dest
// Ch314 — texture PSM: PSMCT32 DECAL always; with
// PALETTE_BILINEAR also a PSMT8 (0x13) / PSMT4 (0x14)
// indexed DECAL (the shared sampler CLUTs each tap). The
// extra disjunct is constant-0 at default → byte-identical.
&& prim_tme && ((tex0_psm == 6'h00)
|| (PALETTE_BILINEAR && ((tex0_psm == 6'h13)
|| (tex0_psm == 6'h14))))
// Ch309 — with ALPHA_MODES_ENABLE the generic blender
// handles any ALPHA_1 (A/B/C/D/FIX) config, so accept any
// ABE primitive; at default (0) only source-over qualifies
// (byte-identical).
&& prim_abe && (alpha_is_source_over || ALPHA_MODES_ENABLE)
&& z_format_ok; // ZTE + PSMZ32
// Ch87 — combinational helpers that compute the bounding box
// for either primitive mode at the close cycle. Used both at
// direct-load (when FSM is IDLE and FIFO is empty) and at
// enqueue (when something is already in flight).
logic [11:0] enq_x_min, enq_x_max, enq_y_min, enq_y_max;
always_comb begin
if (new_mode == RM_TRI) begin
enq_x_min = umin12(umin12(tri_v0x, tri_v1x), tri_v2x);
enq_x_max = umax12(umax12(tri_v0x, tri_v1x), tri_v2x);
enq_y_min = umin12(umin12(tri_v0y, tri_v1y), tri_v2y);
enq_y_max = umax12(umax12(tri_v0y, tri_v1y), tri_v2y);
end else begin
enq_x_min = umin12(sp_v0_x_next, sp_v1_x_next);
enq_x_max = umax12(sp_v0_x_next, sp_v1_x_next);
enq_y_min = umin12(sp_v0_y_next, sp_v1_y_next);
enq_y_max = umax12(sp_v0_y_next, sp_v1_y_next);
end
end
// Push / pop event predicates, evaluated combinationally.
// Push fires on prim_complete_now for an eligible non-degenerate
// primitive (degenerate TRI sets raster_degenerate but is not
// enqueued). FIFO-full push attempts set raster_overflow and
// drop the primitive — UNLESS a pop fires the same cycle, in
// which case the freed slot accepts the new push and count
// stays at full (Ch87 Codex audit-medium fix). Pop fires when
// the FSM needs the next entry: at IDLE while FIFO has work,
// or at end-of-scan while FIFO has work.
logic prim_eligible;
logic prim_filtered_degen;
logic push_attempt;
logic push_drop;
logic push_ok;
// Ch88 pipeline-aware end-of-scan / drain-done.
// ras_at_end_of_s0 : S0 is producing the last bbox coord
// (corner pixel). After this cycle, FSM
// transitions to R_DRAIN.
// ras_drain_done : S1 and S2 valid bits are both low while
// in R_DRAIN, so the pipeline is fully
// flushed. Only at this point can the
// next FIFO entry pop into ras_*_q
// (the in-flight pixels would otherwise
// see clobbered context).
logic ras_at_end_of_s0;
logic ras_drain_done;
assign ras_at_end_of_s0 = (raster_state == R_SCAN)
& ras_at_x_end & ras_at_y_end;
// Brick 2a — the drain must also wait for the S3 blend stage to
// empty, otherwise a popped next-primitive could clobber ras_abe /
// ras_color while a blended pixel is still in S3.
// Brick 2b — the drain must ALSO wait for the Z S3 stage to empty,
// for any queued Z write to fire, and for the half-rate beat to land
// back on the primary beat — otherwise a popped next-primitive could
// clobber ras_zte / ras_z_value while a Z pixel is still in flight,
// or the queued Z write could be dropped. z_beat_q is included so the
// pop happens on a primary beat (z_advance high).
// Ch302 — a perspective primitive's emit lives in a +5-deep pipe
// BEHIND the S1/S2/S3 stages (gs_persp_uv 4-cycle divide + 1-cycle
// texel read). The drain must ALSO wait for that pipe to empty, or a
// popped next-primitive would clobber ras_persp / ras_s0_base / … and
// raster_active could drop while perspective pixels are still emitting.
// persp_pipe_busy is constant 0 at PERSPECTIVE_CORRECT=0, so this term
// is a no-op in the affine-only build (byte-identical).
// Ch304 — in tile-local mode the primitive is NOT done when the
// combined walker reaches R_DRAIN: the on-chip tile must still be
// FLUSHed to the framebuffer. tile_render_busy holds the drain (and
// therefore the pop / R_IDLE transition + raster_active) until the
// tile phase returns to TP_OFF after FLUSH completes. tile_active is
// constant 0 at TILE_LOCAL=0, so this term is a no-op (byte-identical).
logic tile_render_busy;
assign tile_render_busy = tile_active && (tile_phase_r != TP_OFF);
// Ch337 — scene-level busy (see the output-port comment). Holds through inter-batch gaps: the
// next batch's primitives sit in the FIFO (fifo_count != 0) and/or an end-of-list flush is
// pending (mp_flush_pending) even while raster_active momentarily drops between grid passes.
assign raster_scene_busy = raster_active || tile_render_busy
|| (fifo_count != FIFO_CNT_W'(0)) || mp_flush_pending;
assign ras_drain_done = (raster_state == R_DRAIN)
& !s1_valid_q & !s2_valid_q & !s3_valid_q
& !s3_zvalid_q & !zw_pending_q & !z_beat_q
& !persp_pipe_busy
& !tile_render_busy;
// Ch305 MULTIPRIM — pipeline-empty for the INTER-PRIMITIVE advance within a
// tile render. Identical to ras_drain_done WITHOUT the !tile_render_busy term
// (which is always 1 mid-grid, forcing ras_drain_done to 0). Advancing prim
// N->N+1 on bare (raster_state==R_DRAIN) would clobber ras_* and start the
// next primitive's tile_z reads while primitive N's last color/Z writes are
// still in the s1/s2/s3 + queued-Z + half-rate-z pipeline — corrupting the
// composite (broken occlusion / lost writes). This holds the advance until
// that pipeline is fully flushed. Dead at TILE_MULTIPRIM=0 (never read there).
logic comb_pipe_empty;
assign comb_pipe_empty = (raster_state == R_DRAIN)
& !s1_valid_q & !s2_valid_q & !s3_valid_q
& !s3_zvalid_q & !zw_pending_q & !z_beat_q
& !persp_pipe_busy;
logic pop_ok;
// Ch295 — also block the pop while the slot at fifo_rptr still has
// its affine gradients in flight in the setup engine. SPRITE slots
// clear grad_pending at push, so they are never held; a TRI slot is
// released the cycle the engine writes its last gradient. The engine
// finishes in ~10 cycles, far ahead of the rasterizer's readiness,
// so this never stalls the steady-state pipeline.
// Ch296 — also block the pop while a VRAM->CLUT load is in flight, so
// the textured scan's texel fetch never collides with the load on the
// shared read2 port. Gated by the CLUT_STALL parameter so only a top
// that actually instantiates a CLUT loader (and drives clut_load_busy)
// pays for / depends on this term; for every other gs_stub instance
// the parameter is 0 and `clut_load_busy` is don't-care (may float),
// exactly as before this chapter.
logic clut_stall_eff;
assign clut_stall_eff = CLUT_STALL ? clut_load_busy : 1'b0;
// Ch305+ MULTIPRIM — in multiprim mode the batch grid path OWNS the FIFO,
// so the normal streaming pop is permanently disabled. TILE_MULTIPRIM is a
// compile-time 0 by default, so this term folds away → byte-identical.
assign pop_ok = !fifo_empty && !fifo_grad_pending[fifo_rptr] &&
!clut_stall_eff &&
!(TILE_LOCAL && TILE_MULTIPRIM) &&
((raster_state == R_IDLE) || ras_drain_done);
assign prim_eligible = prim_complete_now && (new_mode != RM_NONE);
assign prim_filtered_degen = prim_eligible && (new_mode == RM_TRI) && tri_degenerate;
assign push_attempt = prim_eligible && !prim_filtered_degen;
// A simultaneous pop frees a slot, so a full-FIFO push still
// succeeds in that cycle. push_drop only triggers when full
// AND no pop is happening — those primitives are genuinely
// beyond the rasterizer's depth.
assign push_drop = push_attempt && fifo_full && !pop_ok;
assign push_ok = push_attempt && (!fifo_full || pop_ok);
// COMBINED probe — advance the stalled walker to the next candidate
// pixel and reset the per-pixel FSM to beat 0. Row-major within the
// bounding box, identical to the legacy S0 walker advance, plus the
// R_SCAN→R_DRAIN transition at the bbox corner. Called from the
// combined FSM (NBAs to module regs; no `return;` per iverilog-12).
task automatic comb_advance_walker;
if (ras_at_end_of_s0) begin
raster_state <= R_DRAIN;
end else if (ras_at_x_end) begin
ras_cur_x <= ras_x_min;
ras_cur_y <= ras_cur_y + 12'd1;
end else begin
ras_cur_x <= ras_cur_x + 12'd1;
end
comb_state_r <= CB_Z;
endtask
// Ch305+ MULTIPRIM — load all ras_* working fields from FIFO slot `slot`,
// clipping the walker bbox to the supplied (already-tile-clipped) corners.
// Mirrors the pop-block field set EXACTLY (4744+) but reads `slot` instead
// of fifo_rptr, does NOT advance fifo_rptr, and does NOT touch the prim_*
// copy or the tile/grid counters. Only ever called from `if (TILE_MULTIPRIM)`
// branches, so the default build never invokes it (NBAs, no `return;`).
task automatic mp_load_prim (input [FIFO_PTR_W-1:0] slot,
input [11:0] cx_lo, input [11:0] cx_hi,
input [11:0] cy_lo, input [11:0] cy_hi);
// Ch328 (1b) — INPUT fields unpacked from the M20K attr_ram[slot] in EXACT pack order
// (LHS-concat; attr_ram is now fresh-at-push so no read-after-write hazard). HOT fields
// (mode/tri_active/combined) + the 20 gradient OUTPUTS stay in fifo_* registers; bbox from
// the caller. mp_dump_* sink the grad-engine-only inputs the rasterizer doesn't consume.
{ ras_v0_x, ras_v0_y, ras_v1_x, ras_v1_y, ras_v2_x, ras_v2_y, ras_bias, ras_sa_q,
ras_color, ras_c0_q, ras_c1_q, ras_c2_q, ras_fbp, ras_fbw, ras_psm, ras_bpp_shift,
ras_tme, ras_u0, ras_v0t, ras_u1, ras_v1t, ras_tex_base, ras_tbw, ras_tpsm,
ras_wms, ras_wmt, ras_tw, ras_th, ras_du_dx_q, ras_dv_dy_q, ras_abe, ras_filter_lin,
ras_alpha_a, ras_alpha_b, ras_alpha_c, ras_alpha_d, ras_alpha_fix,
ras_zte, ras_ztst, ras_zmsk, ras_zbp, ras_z_value, ras_z0, ras_u0_base, ras_v0_base,
mp_dump_u1v, mp_dump_u2v, ras_persp, ras_s0_base, ras_t0_base, ras_q0_base,
mp_dump_stq1, mp_dump_stq2, mp_dump_q1, mp_dump_q2, mp_dump_v1z, mp_dump_v2z } <= attr_rd_q; // Ch328 1c: single read port (slot was issued last cycle)
ras_mode <= fifo_mode [slot]; // HOT (register)
// Clipped walker bbox (against the current tile), supplied by caller.
ras_x_min <= cx_lo;
ras_x_max <= cx_hi;
ras_y_min <= cy_lo;
ras_y_max <= cy_hi;
ras_cur_x <= cx_lo;
ras_cur_y <= cy_lo;
ras_tri_active <= fifo_tri_active [slot]; // HOT
ras_combined_r <= fifo_combined [slot]; // HOT
comb_state_r <= CB_Z;
// 20 ENGINE-WRITTEN gradient OUTPUTS — sideband registers (NOT in attr_ram).
ras_dr_dx <= fifo_dr_dx [slot]; ras_dr_dy <= fifo_dr_dy [slot];
ras_dg_dx <= fifo_dg_dx [slot]; ras_dg_dy <= fifo_dg_dy [slot];
ras_db_dx <= fifo_db_dx [slot]; ras_db_dy <= fifo_db_dy [slot];
ras_da_dx <= fifo_da_dx [slot]; ras_da_dy <= fifo_da_dy [slot];
ras_dz_dx <= fifo_dz_dx [slot]; ras_dz_dy <= fifo_dz_dy [slot];
ras_du_dx_t <= fifo_du_dx_t [slot]; ras_du_dy_t <= fifo_du_dy_t [slot];
ras_dv_dx_t <= fifo_dv_dx_t [slot]; ras_dv_dy_t <= fifo_dv_dy_t [slot];
ras_ds_dx <= fifo_ds_dx [slot]; ras_ds_dy <= fifo_ds_dy [slot];
ras_dt_dx <= fifo_dt_dx [slot]; ras_dt_dy <= fifo_dt_dy [slot];
ras_dq_dx <= fifo_dq_dx [slot]; ras_dq_dy <= fifo_dq_dy [slot];
endtask
// Ch328 (1b) — the push now writes attr_ram[fifo_wptr] DIRECTLY same-cycle (attr_word_next),
// so this block just tracks the pushed slot for the equivalence checker (chk_* lag pack_* by
// 1 more cycle, by which point both attr_ram[slot] and fifo_*[slot] are settled).
always_ff @(posedge clk) begin
if (!rst_n) begin
pack_v <= 1'b0;
end else begin
pack_v <= push_ok;
pack_slot <= fifo_wptr;
end
end
`ifndef SYNTHESIS
// Ch328 (1a) EQUIVALENCE CHECKER (Codex gate). INDEPENDENT decode: slice the stored word
// back into width-typed fields via LHS-concat (NOT the pack concat function) and compare each
// to the live fifo_* register. Runs across every regression scene. A width/order drift between
// the pack concat and this decode → mismatch. Sim-only (no synth cost). chk_* lag pack_* by 1
// more cycle so attr_ram[slot] is settled.
logic chk_v;
logic [$clog2(FIFO_DEPTH)-1:0] chk_slot;
logic [11:0] k_v0x,k_v0y,k_v1x,k_v1y,k_v2x,k_v2y; logic [2:0] k_bias; logic signed [31:0] k_sa;
logic [63:0] k_color,k_c0,k_c1,k_c2; logic [8:0] k_fbp; logic [5:0] k_fbw,k_psm; logic [1:0] k_bpp;
logic k_tme; logic [10:0] k_u0,k_v0,k_u1,k_v1; logic [31:0] k_texb; logic [13:0] k_tbw; logic [5:0] k_tpsm;
logic [1:0] k_wms,k_wmt; logic [3:0] k_tw,k_th; logic signed [31:0] k_dudx,k_dvdy; logic k_abe,k_flin;
logic [1:0] k_aa,k_ab,k_ac,k_ad; logic [7:0] k_afix; logic k_zte; logic [1:0] k_ztst; logic k_zmsk;
logic [8:0] k_zbp; logic [31:0] k_zval;
logic [31:0] k_z0;
logic [10:0] k_u0b,k_v0b; logic [31:0] k_u1v,k_u2v;
logic k_persp; logic [23:0] k_s0b,k_t0b,k_q0b; logic [63:0] k_stq1,k_stq2; logic [23:0] k_q1,k_q2;
logic [31:0] k_v1z,k_v2z;
always_ff @(posedge clk) begin
if (!rst_n) chk_v <= 1'b0;
else begin
chk_v <= pack_v;
chk_slot <= pack_slot;
if (chk_v) begin
{ k_v0x,k_v0y,k_v1x,k_v1y,k_v2x,k_v2y,k_bias,k_sa,k_color,k_c0,k_c1,k_c2,
k_fbp,k_fbw,k_psm,k_bpp,k_tme,k_u0,k_v0,k_u1,k_v1,k_texb,k_tbw,k_tpsm,
k_wms,k_wmt,k_tw,k_th,k_dudx,k_dvdy,k_abe,k_flin,k_aa,k_ab,k_ac,k_ad,k_afix,
k_zte,k_ztst,k_zmsk,k_zbp,k_zval,k_z0,k_u0b,k_v0b,k_u1v,k_u2v,
k_persp,k_s0b,k_t0b,k_q0b,k_stq1,k_stq2,k_q1,k_q2,
k_v1z,k_v2z } = attr_ram[chk_slot];
if (k_v0x!==fifo_v0x[chk_slot]||k_v0y!==fifo_v0y[chk_slot]||k_v1x!==fifo_v1x[chk_slot]||
k_v1y!==fifo_v1y[chk_slot]||k_v2x!==fifo_v2x[chk_slot]||k_v2y!==fifo_v2y[chk_slot]||
k_bias!==fifo_bias[chk_slot]||k_sa!==fifo_sa[chk_slot]||k_color!==fifo_color[chk_slot]||
k_c0!==fifo_c0[chk_slot]||k_c1!==fifo_c1[chk_slot]||k_c2!==fifo_c2[chk_slot]||
k_fbp!==fifo_fbp[chk_slot]||k_fbw!==fifo_fbw[chk_slot]||k_psm!==fifo_psm[chk_slot]||
k_bpp!==fifo_bpp_shift[chk_slot]||k_tme!==fifo_tme[chk_slot]||k_u0!==fifo_u0[chk_slot]||
k_v0!==fifo_v0[chk_slot]||k_u1!==fifo_u1[chk_slot]||k_v1!==fifo_v1[chk_slot]||
k_texb!==fifo_tex_base[chk_slot]||k_tbw!==fifo_tbw[chk_slot]||k_tpsm!==fifo_tpsm[chk_slot]||
k_wms!==fifo_wms[chk_slot]||k_wmt!==fifo_wmt[chk_slot]||k_tw!==fifo_tw[chk_slot]||
k_th!==fifo_th[chk_slot]||k_dudx!==fifo_du_dx[chk_slot]||k_dvdy!==fifo_dv_dy[chk_slot]||
k_abe!==fifo_abe[chk_slot]||k_flin!==fifo_filter_lin[chk_slot]||k_aa!==fifo_alpha_a[chk_slot]||
k_ab!==fifo_alpha_b[chk_slot]||k_ac!==fifo_alpha_c[chk_slot]||k_ad!==fifo_alpha_d[chk_slot]||
k_afix!==fifo_alpha_fix[chk_slot]||k_zte!==fifo_zte[chk_slot]||k_ztst!==fifo_ztst[chk_slot]||
k_zmsk!==fifo_zmsk[chk_slot]||k_zbp!==fifo_zbp[chk_slot]||k_zval!==fifo_zval[chk_slot]||
k_z0!==fifo_z0[chk_slot]||k_u0b!==fifo_u0_base[chk_slot]||
k_v0b!==fifo_v0_base[chk_slot]||k_u1v!==fifo_u1v[chk_slot]||k_u2v!==fifo_u2v[chk_slot]||
k_persp!==fifo_persp[chk_slot]||k_s0b!==fifo_s0_base[chk_slot]||
k_t0b!==fifo_t0_base[chk_slot]||k_q0b!==fifo_q0_base[chk_slot]||k_stq1!==fifo_stq1[chk_slot]||
k_stq2!==fifo_stq2[chk_slot]||k_q1!==fifo_q1[chk_slot]||k_q2!==fifo_q2[chk_slot]||
k_v1z!==fifo_v1z[chk_slot]||k_v2z!==fifo_v2z[chk_slot])
$error("ch328 attr_ram pack/unpack MISMATCH at slot %0d (field order/width drift)", chk_slot);
end
end
end
`endif
always_ff @(posedge clk) begin
if (!rst_n) begin
raster_state <= R_IDLE;
ras_mode <= RM_NONE;
ras_v0_x <= 12'd0; ras_v0_y <= 12'd0;
ras_v1_x <= 12'd0; ras_v1_y <= 12'd0;
ras_v2_x <= 12'd0; ras_v2_y <= 12'd0;
ras_x_min <= 12'd0; ras_x_max <= 12'd0;
ras_y_min <= 12'd0; ras_y_max <= 12'd0;
ras_cur_x <= 12'd0; ras_cur_y <= 12'd0;
ras_color <= 64'd0;
ras_fbp <= 9'd0; ras_fbw <= 6'd0; ras_psm <= 6'd0;
ras_bpp_shift <= 2'd2;
ras_tme <= 1'b0;
ras_u0 <= 11'd0; ras_v0t <= 11'd0; ras_u1 <= 11'd0; ras_v1t <= 11'd0;
ras_tex_base <= 32'd0; ras_tbw <= 14'd0; ras_tpsm <= 6'd0;
ras_wms <= 2'd0; ras_wmt <= 2'd0; ras_tw <= 4'd0; ras_th <= 4'd0;
ras_du_dx_q <= 32'sd0; ras_dv_dy_q <= 32'sd0;
ras_abe <= 1'b0;
ras_filter_lin <= 1'b0;
ras_alpha_a <= 2'd0;
ras_alpha_b <= 2'd0;
ras_alpha_c <= 2'd0;
ras_alpha_d <= 2'd0;
ras_alpha_fix <= 8'd0;
ras_zte <= 1'b0;
ras_ztst <= 2'd0;
ras_zmsk <= 1'b0;
ras_zbp <= 9'd0;
ras_z_value <= 32'd0;
ras_bias <= 3'b000;
ras_c0_q <= 64'd0;
ras_c1_q <= 64'd0;
ras_c2_q <= 64'd0;
ras_sa_q <= 32'sd0;
// Brick 3 — affine interpolation context reset.
ras_tri_active <= 1'b0;
// COMBINED probe — reset combined-mode state.
ras_combined_r <= 1'b0;
comb_state_r <= CB_Z;
comb_ztest_pass_r <= 1'b0;
comb_zstored_r <= 32'd0;
comb_cs_r <= 32'd0;
comb_as_r <= 8'd0;
// Ch304 TILE-LOCAL — reset tile-phase FSM + sweep/flush state.
tile_phase_r <= TP_OFF;
tile_sweep_r <= 9'd0;
flush_emit_q <= 1'b0;
flush_idx_q <= 8'd0;
// Ch305 TILE GRID — reset grid counters + preserved prim bbox.
tile_col_r <= 12'd0;
tile_row_r <= 12'd0;
spill_valid <= SPILL_FORCE_VALID ? '1 : '0; // Ch323 — no valid backing until spilled (test hook forces valid)
// Ch305+ MULTIPRIM — reset batch state.
prim_idx_r <= '0;
mp_rd_state <= RDS_IDLE; // Ch328 1c
prim_count_r <= '0;
grid_base_rptr <= '0;
// BIN BUFFER — clear bin counters / sweep counters (dead at
// BIN_BUFFER_ENABLE=0). bin_prim contents need no reset (only read
// through bin_n, which is zeroed). bin_n is the live state.
bin_t <= '0;
bin_p <= '0;
bin_slot_r <= '0;
for (int bt = 0; bt < NTILES; bt = bt + 1)
bin_n[bt] <= '0;
prim_x_min <= 12'd0; prim_x_max <= 12'd0;
prim_y_min <= 12'd0; prim_y_max <= 12'd0;
flush_ox_q <= 12'd0; flush_oy_q <= 12'd0;
ras_z0 <= 32'd0;
ras_dr_dx <= 32'sd0; ras_dr_dy <= 32'sd0;
ras_dg_dx <= 32'sd0; ras_dg_dy <= 32'sd0;
ras_db_dx <= 32'sd0; ras_db_dy <= 32'sd0;
ras_da_dx <= 32'sd0; ras_da_dy <= 32'sd0;
ras_dz_dx <= 32'sd0; ras_dz_dy <= 32'sd0;
// Textured-triangle rung — affine U/V context reset.
ras_u0_base <= 11'd0; ras_v0_base <= 11'd0;
ras_du_dx_t <= 32'sd0; ras_du_dy_t <= 32'sd0;
ras_dv_dx_t <= 32'sd0; ras_dv_dy_t <= 32'sd0;
// Ch301 perspective — S/T/Q affine context reset.
ras_persp <= 1'b0;
ras_s0_base <= 24'd0; ras_t0_base <= 24'd0; ras_q0_base <= 24'd0;
ras_ds_dx <= 32'sd0; ras_ds_dy <= 32'sd0;
ras_dt_dx <= 32'sd0; ras_dt_dy <= 32'sd0;
ras_dq_dx <= 32'sd0; ras_dq_dy <= 32'sd0;
// Ch88 pipeline regs
s1_x_q <= 12'd0; s1_y_q <= 12'd0;
s1_valid_q <= 1'b0;
s2_x_q <= 12'd0; s2_y_q <= 12'd0;
s2_valid_q <= 1'b0;
s2_inside_q <= 1'b0;
s2_L0_q <= 32'sd0; s2_L1_q <= 32'sd0; s2_L2_q <= 32'sd0;
s2_mode_q <= RM_NONE;
s2_tex_active_q <= 1'b0;
s2_tex_color_q <= 32'd0;
// Brick 2a S3 blend stage
s3_valid_q <= 1'b0;
s3_x_q <= 12'd0; s3_y_q <= 12'd0;
s3_fb_addr_q <= 32'd0;
s3_be_q <= 4'b1111;
s3_mask_q <= 32'hFFFF_FFFF;
s3_psm_q <= 6'd0;
s3_cs_q <= 64'd0;
s3_dest_q <= 32'd0;
s3_as_q <= 8'd0;
// Brick 2b Z-test S3 stage + beat/queue
s3_zvalid_q <= 1'b0;
s3_zx_q <= 12'd0; s3_zy_q <= 12'd0;
s3_zfb_addr_q<= 32'd0;
s3_z_addr_q <= 32'd0;
s3_zcolor_q <= 64'd0;
s3_zval_q <= 32'd0;
s3_ztst_q <= 2'd0;
s3_zmsk_q <= 1'b0;
s3_zstored_q <= 32'd0;
z_beat_q <= 1'b0;
ta_beat_q <= 1'b0; // Ch344
ta_dest_q <= 32'd0; // Ch344
ta_tex_q <= 32'd0; // Ch344
ta_tex_q1 <= 32'd0; // Ch344
zw_pending_q <= 1'b0;
zw_addr_q <= 32'd0;
zw_val_q <= 32'd0;
raster_pixel_emit <= 1'b0;
// Ch323 — Z-flush stream + pipeline latches reset.
z_flush_emit_o <= 1'b0;
z_flush_addr_o <= 32'd0;
z_flush_data_o <= 32'd0;
tile_color_flush_emit_o <= 1'b0;
tile_color_flush_addr_o <= 32'd0;
tile_color_flush_data_o <= 32'd0;
z_flush_emit_q <= 1'b0;
z_flush_idx_q <= 8'd0;
z_flush_ox_q <= 12'd0;
z_flush_oy_q <= 12'd0;
// Ch323 — tile RELOAD control reset (tile_reload_raddr_o is a continuous assign).
reload_start_o <= 1'b0;
reload_wait <= 1'b0;
reload_wr_we <= 1'b0;
reload_wr_addr <= 8'd0;
raster_pixel_emit_count <= 32'd0;
raster_pixel_x_q <= 12'd0;
raster_pixel_y_q <= 12'd0;
raster_pixel_color_q <= 64'd0;
raster_pixel_fb_addr_q <= 32'd0;
raster_pixel_be_q <= 4'b1111;
raster_pixel_mask_q <= 32'hFFFF_FFFF;
raster_pixel_psm_q <= 6'd0;
raster_active <= 1'b0;
raster_overflow <= 1'b0;
raster_overflow_count_r <= 16'd0; // Ch315
bin_occ_max_r <= '0; // Ch315
bin_overflow_r <= 1'b0; // Ch315
tile_refused_count <= 16'd0; // Ch329
raster_degenerate <= 1'b0;
fifo_wptr <= '0;
fifo_rptr <= '0;
fifo_count <= '0;
// Ch295 — gradient setup engine reset.
grad_busy <= 1'b0;
grad_prefetching <= 1'b0; // Ch328 (1b) step C
grad_prefetch2 <= 1'b0; // Ch328 1c
grad_word_q <= '0;
grad_step <= 5'd0;
grad_settle <= 5'd0;
grad_result_q <= 32'sd0;
grad_writing <= 1'b0;
div_start <= 1'b0;
grad_slot <= '0;
grad_det_q <= 32'sd0;
for (int g = 0; g < GRAD_STEPS; g = g + 1) grad_num_q[g] <= 56'sd0;
for (int i = 0; i < FIFO_DEPTH; i = i + 1) begin
fifo_v1z [i] <= 32'd0;
fifo_v2z [i] <= 32'd0;
end
for (int i = 0; i < FIFO_DEPTH; i = i + 1) begin
fifo_grad_pending [i] <= 1'b0;
fifo_mode [i] <= RM_NONE;
fifo_v0x [i] <= 12'd0; fifo_v0y[i] <= 12'd0;
fifo_v1x [i] <= 12'd0; fifo_v1y[i] <= 12'd0;
fifo_v2x [i] <= 12'd0; fifo_v2y[i] <= 12'd0;
fifo_x_min [i] <= 12'd0; fifo_x_max[i] <= 12'd0;
fifo_y_min [i] <= 12'd0; fifo_y_max[i] <= 12'd0;
fifo_bias [i] <= 3'd0;
fifo_sa [i] <= 32'sd0;
fifo_color [i] <= 64'd0;
fifo_c0 [i] <= 64'd0;
fifo_c1 [i] <= 64'd0;
fifo_c2 [i] <= 64'd0;
fifo_fbp [i] <= 9'd0;
fifo_fbw [i] <= 6'd0;
fifo_psm [i] <= 6'd0;
fifo_bpp_shift [i] <= 2'd2;
fifo_tme [i] <= 1'b0;
fifo_u0 [i] <= 11'd0;
fifo_v0 [i] <= 11'd0;
fifo_u1 [i] <= 11'd0;
fifo_v1 [i] <= 11'd0;
fifo_tex_base [i] <= 32'd0;
fifo_tbw [i] <= 14'd0;
fifo_tpsm [i] <= 6'd0;
fifo_wms [i] <= 2'd0;
fifo_wmt [i] <= 2'd0;
fifo_tw [i] <= 4'd0;
fifo_th [i] <= 4'd0;
fifo_abe [i] <= 1'b0;
fifo_filter_lin [i] <= 1'b0;
fifo_alpha_a [i] <= 2'd0;
fifo_alpha_b [i] <= 2'd0;
fifo_alpha_c [i] <= 2'd0;
fifo_alpha_d [i] <= 2'd0;
fifo_alpha_fix [i] <= 8'd0;
fifo_zte [i] <= 1'b0;
fifo_ztst [i] <= 2'd0;
fifo_zmsk [i] <= 1'b0;
fifo_zbp [i] <= 9'd0;
fifo_zval [i] <= 32'd0;
// Brick 3 — affine interpolation context init.
fifo_tri_active [i] <= 1'b0;
fifo_combined [i] <= 1'b0; // COMBINED probe
fifo_z0 [i] <= 32'd0;
fifo_dr_dx [i] <= 32'sd0; fifo_dr_dy [i] <= 32'sd0;
fifo_dg_dx [i] <= 32'sd0; fifo_dg_dy [i] <= 32'sd0;
fifo_db_dx [i] <= 32'sd0; fifo_db_dy [i] <= 32'sd0;
fifo_da_dx [i] <= 32'sd0; fifo_da_dy [i] <= 32'sd0;
fifo_dz_dx [i] <= 32'sd0; fifo_dz_dy [i] <= 32'sd0;
// Textured-triangle rung — affine U/V context init.
fifo_u0_base [i] <= 11'd0;
fifo_v0_base [i] <= 11'd0;
fifo_u1v [i] <= 32'd0;
fifo_u2v [i] <= 32'd0;
fifo_du_dx_t [i] <= 32'sd0; fifo_du_dy_t [i] <= 32'sd0;
fifo_dv_dx_t [i] <= 32'sd0; fifo_dv_dy_t [i] <= 32'sd0;
// Ch301 perspective — S/T/Q context init.
fifo_persp [i] <= 1'b0;
fifo_s0_base [i] <= 24'd0;
fifo_t0_base [i] <= 24'd0;
fifo_q0_base [i] <= 24'd0;
fifo_stq1 [i] <= 64'd0;
fifo_stq2 [i] <= 64'd0;
fifo_q1 [i] <= 24'd0;
fifo_q2 [i] <= 24'd0;
fifo_ds_dx [i] <= 32'sd0; fifo_ds_dy [i] <= 32'sd0;
fifo_dt_dx [i] <= 32'sd0; fifo_dt_dy [i] <= 32'sd0;
fifo_dq_dx [i] <= 32'sd0; fifo_dq_dy [i] <= 32'sd0;
end
end else begin
raster_pixel_emit <= 1'b0;
z_flush_emit_o <= 1'b0; // Ch323 — Z-flush output strobe (pulse), default each cycle
tile_color_flush_emit_o <= 1'b0; // Ch323 — dedicated color-flush spill strobe, default each cycle
reload_start_o <= 1'b0; // Ch323 — reload-arm pulse, default each cycle
reload_wr_we <= 1'b0; // Ch323 — reload tile-RAM write window, re-asserted in TP_RELOAD
// ------------------------------------------------------
// Push side: enqueue new primitive context (or drop /
// degenerate-flag).
// ------------------------------------------------------
if (prim_filtered_degen) begin
raster_degenerate <= 1'b1;
end
if (push_drop) begin
raster_overflow <= 1'b1;
raster_overflow_count_r <= raster_overflow_count_r + 16'd1; // Ch315 — count drops
end
if (push_ok) begin
fifo_mode [fifo_wptr] <= new_mode;
fifo_color [fifo_wptr] <= rgbaq_q;
fifo_fbp [fifo_wptr] <= frame_1_q[8:0];
fifo_fbw [fifo_wptr] <= frame_1_q[21:16];
fifo_psm [fifo_wptr] <= frame_1_q[29:24];
fifo_bpp_shift [fifo_wptr] <= bpp_shift;
fifo_x_min [fifo_wptr] <= enq_x_min;
fifo_x_max [fifo_wptr] <= enq_x_max;
fifo_y_min [fifo_wptr] <= enq_y_min;
fifo_y_max [fifo_wptr] <= enq_y_max;
// Ch328 (1b) — SAME-CYCLE attr_ram write (valid push+1, like the register FIFO).
attr_ram [fifo_wptr] <= attr_word_next;
if (new_mode == RM_TRI) begin
fifo_v0x [fifo_wptr] <= tri_v0x;
fifo_v0y [fifo_wptr] <= tri_v0y;
fifo_v1x [fifo_wptr] <= tri_v1x;
fifo_v1y [fifo_wptr] <= tri_v1y;
fifo_v2x [fifo_wptr] <= tri_v2x;
fifo_v2y [fifo_wptr] <= tri_v2y;
fifo_bias [fifo_wptr] <= {~tl2_init, ~tl1_init, ~tl0_init};
fifo_sa [fifo_wptr] <= tri_swap_ccw ? -tri_sa_calc : tri_sa_calc;
fifo_c0 [fifo_wptr] <= fan_sat_path ? c_pivot_q : c_prev_q;
fifo_c1 [fifo_wptr] <= tri_swap_ccw ? rgbaq_q : c_curr_q;
fifo_c2 [fifo_wptr] <= tri_swap_ccw ? c_curr_q : rgbaq_q;
// Textured-triangle rung — capture the texture
// descriptor for a TRI+TME primitive. fifo_tme gates
// the textured path; only PSMCT32 (0x00) DECAL is
// sampled (the per-pixel s1_tex_active gate enforces
// this, mirroring the SPRITE path). For an untextured
// TRI close_tme_effective is 0 → these collapse to the
// legacy all-zero values and the Gouraud path is
// byte-identical. The SPRITE-DDA fields (fifo_u0/v0/u1/
// v1, fifo_du_dx/dv_dy) stay 0 for a TRI: the triangle
// uses AFFINE U/V (fifo_u0_base + fifo_du_dx_t etc.),
// not the 2-endpoint linear DDA.
fifo_tme [fifo_wptr] <= close_tme_effective;
fifo_u0 [fifo_wptr] <= 11'd0;
fifo_v0 [fifo_wptr] <= 11'd0;
fifo_u1 [fifo_wptr] <= 11'd0;
fifo_v1 [fifo_wptr] <= 11'd0;
fifo_tex_base [fifo_wptr] <= tex_base_next;
fifo_tbw [fifo_wptr] <= {8'd0, tex0_tbw};
fifo_tpsm [fifo_wptr] <= tex0_psm;
// Ch294 — wrap-mode snapshot (CLAMP_1 + TEX0 dims).
fifo_wms [fifo_wptr] <= clamp_wms;
fifo_wmt [fifo_wptr] <= clamp_wmt;
fifo_tw [fifo_wptr] <= tex0_tw;
fifo_th [fifo_wptr] <= tex0_th;
fifo_du_dx [fifo_wptr] <= 32'sd0;
fifo_dv_dy [fifo_wptr] <= 32'sd0;
// Textured-triangle rung — affine U/V context. Base at
// post-swap v0; the 4 gradients are solved by the
// shared divider engine (steps 10..13) and pre-cleared
// here. The packed {V,U} for v1/v2 lets the decoupled
// engine recompute the U/V numerators at schedule time.
fifo_u0_base [fifo_wptr] <= tri_uv0_u;
fifo_v0_base [fifo_wptr] <= tri_uv0_v;
fifo_u1v [fifo_wptr] <= {5'd0, tri_uv1_v, 5'd0, tri_uv1_u};
fifo_u2v [fifo_wptr] <= {5'd0, tri_uv2_v, 5'd0, tri_uv2_u};
fifo_du_dx_t [fifo_wptr] <= 32'sd0;
fifo_du_dy_t [fifo_wptr] <= 32'sd0;
fifo_dv_dx_t [fifo_wptr] <= 32'sd0;
fifo_dv_dy_t [fifo_wptr] <= 32'sd0;
// Brick 2a — TRI alpha-blend not implemented in v1.
fifo_abe [fifo_wptr] <= 1'b0;
// Ch310 — snapshot TEX1_1.MMAG bilinear filter (mirrors fifo_abe).
fifo_filter_lin [fifo_wptr] <= tex1_mmag;
// Brick 2c — snapshot ALPHA_1 selectors (mirrors fifo_abe).
fifo_alpha_a [fifo_wptr] <= alpha_a;
fifo_alpha_b [fifo_wptr] <= alpha_b;
fifo_alpha_c [fifo_wptr] <= alpha_c;
fifo_alpha_d [fifo_wptr] <= alpha_d;
fifo_alpha_fix [fifo_wptr] <= alpha_fix;
// Brick 3 — TRI Z-test feeds the Brick-2b Z stage
// with the per-pixel INTERPOLATED fragment Z. The
// ztst/zmsk/zbp context is shared with the sprite Z
// path; only the fragment-Z source differs (interp vs
// flat). The fb_color path stays the affine Gouraud
// color (s2_emit_color64 from s2_interp_color).
fifo_zte [fifo_wptr] <= new_tri_zte_active;
fifo_ztst [fifo_wptr] <= test_ztst;
fifo_zmsk [fifo_wptr] <= zbuf_zmsk;
fifo_zbp [fifo_wptr] <= zbuf_zbp;
fifo_zval [fifo_wptr] <= 32'd0; // unused for TRI (interp)
// Brick 3 / Ch295 — affine interpolation context.
// The 10 Q16.16 gradients are NO LONGER solved here
// with 10 inlined dividers. Instead the pre-shifted
// numerators (pure mul/sub/shift) and the common
// divisor are captured into the time-shared setup
// engine (grad_* regs, handled below); a SINGLE
// divider sequences the 10 quotients into the
// fifo_d*_* fields over the following ~10 cycles.
// The gradient fields are pre-cleared and the slot is
// flagged grad_pending so pop_ok holds it until the
// engine finishes. Result is bit-identical to the
// old per-call (num<<<16)/det+truncate.
fifo_tri_active [fifo_wptr] <= 1'b1;
// COMBINED probe — flag the slot if this TRI is the
// textured+alpha+depth primitive. fifo_tme/fifo_zte
// already collapse correctly for it (fifo_zte=0 here
// because new_tri_zte_active requires !close_tme_
// effective, and the combined prim IS textured); the
// combined FSM owns the reads/writes. Constant 0 at
// COMBINED_TAZ=0.
fifo_combined [fifo_wptr] <= close_combined;
fifo_z0 [fifo_wptr] <= tri_v0z;
fifo_dr_dx [fifo_wptr] <= 32'sd0; fifo_dr_dy [fifo_wptr] <= 32'sd0;
fifo_dg_dx [fifo_wptr] <= 32'sd0; fifo_dg_dy [fifo_wptr] <= 32'sd0;
fifo_db_dx [fifo_wptr] <= 32'sd0; fifo_db_dy [fifo_wptr] <= 32'sd0;
fifo_da_dx [fifo_wptr] <= 32'sd0; fifo_da_dy [fifo_wptr] <= 32'sd0;
fifo_dz_dx [fifo_wptr] <= 32'sd0; fifo_dz_dy [fifo_wptr] <= 32'sd0;
// Brick-3 fix — store v1.z / v2.z so the decoupled
// gradient engine can recompute the Z numerators for
// this slot when it schedules it (v0.z is in fifo_z0).
fifo_v1z [fifo_wptr] <= tri_v1z;
fifo_v2z [fifo_wptr] <= tri_v2z;
// Flag the slot as awaiting gradients. The engine
// (below) picks it up on a later idle cycle — the push
// no longer touches the engine state, so a back-to-back
// TRI push cannot clobber an in-flight solve.
fifo_grad_pending [fifo_wptr] <= 1'b1;
// Ch301 perspective — capture S/T/Q context for this
// TRI. fifo_persp gates the perspective emit path: set
// only when a TME triangle's texcoords came via ST
// (saw_st_q) AND the texture is actually sampled
// (close_tme_effective). saw_st_q is constant 0 when
// PERSPECTIVE_CORRECT=0, so fifo_persp is 0 and the
// perspective gradient steps 14..19 are inert (GRAD_STEPS
// stays 14). S/T are post-swap v0 bases; v1/v2 packed
// {16'd0,T,S}; Q from the RGBAQ.Q field of each vertex.
fifo_persp [fifo_wptr] <= saw_st_q && close_tme_effective;
fifo_s0_base [fifo_wptr] <= tri_s0;
fifo_t0_base [fifo_wptr] <= tri_t0;
fifo_q0_base [fifo_wptr] <= tri_q0;
fifo_stq1 [fifo_wptr] <= {16'd0, tri_t1, tri_s1};
fifo_stq2 [fifo_wptr] <= {16'd0, tri_t2, tri_s2};
fifo_q1 [fifo_wptr] <= tri_q1;
fifo_q2 [fifo_wptr] <= tri_q2;
fifo_ds_dx [fifo_wptr] <= 32'sd0; fifo_ds_dy [fifo_wptr] <= 32'sd0;
fifo_dt_dx [fifo_wptr] <= 32'sd0; fifo_dt_dy [fifo_wptr] <= 32'sd0;
fifo_dq_dx [fifo_wptr] <= 32'sd0; fifo_dq_dy [fifo_wptr] <= 32'sd0;
end else begin // RM_SPRITE
fifo_v0x [fifo_wptr] <= sp_v0_x_next;
fifo_v0y [fifo_wptr] <= sp_v0_y_next;
fifo_v1x [fifo_wptr] <= sp_v1_x_next;
fifo_v1y [fifo_wptr] <= sp_v1_y_next;
fifo_v2x [fifo_wptr] <= 12'd0;
fifo_v2y [fifo_wptr] <= 12'd0;
fifo_bias[fifo_wptr] <= 3'b000;
fifo_sa [fifo_wptr] <= 32'sd0;
fifo_c0 [fifo_wptr] <= 64'd0;
fifo_c1 [fifo_wptr] <= 64'd0;
fifo_c2 [fifo_wptr] <= 64'd0;
// Brick 1 — capture SPRITE texture context. Only
// PSMCT32 (0x00) actually textures in v1; other
// PSMs fall back to flat below even with TME=1.
fifo_tme [fifo_wptr] <= prim_tme;
fifo_u0 [fifo_wptr] <= sp_u0_next;
fifo_v0 [fifo_wptr] <= sp_v0t_next;
fifo_u1 [fifo_wptr] <= sp_u1_next;
fifo_v1 [fifo_wptr] <= sp_v1t_next;
fifo_tex_base [fifo_wptr] <= tex_base_next;
fifo_tbw [fifo_wptr] <= {8'd0, tex0_tbw};
fifo_tpsm [fifo_wptr] <= tex0_psm;
// Ch294 — wrap-mode snapshot (CLAMP_1 + TEX0 dims).
fifo_wms [fifo_wptr] <= clamp_wms;
fifo_wmt [fifo_wptr] <= clamp_wmt;
fifo_tw [fifo_wptr] <= tex0_tw;
fifo_th [fifo_wptr] <= tex0_th;
fifo_du_dx [fifo_wptr] <= du_dx_next;
fifo_dv_dy [fifo_wptr] <= dv_dy_next;
// Brick 2a — capture alpha-blend-active for this SPRITE.
fifo_abe [fifo_wptr] <= new_abe_active;
// Ch310 — snapshot TEX1_1.MMAG bilinear filter (mirrors fifo_abe).
fifo_filter_lin [fifo_wptr] <= tex1_mmag;
// Brick 2c — snapshot ALPHA_1 selectors (mirrors fifo_abe).
fifo_alpha_a [fifo_wptr] <= alpha_a;
fifo_alpha_b [fifo_wptr] <= alpha_b;
fifo_alpha_c [fifo_wptr] <= alpha_c;
fifo_alpha_d [fifo_wptr] <= alpha_d;
fifo_alpha_fix [fifo_wptr] <= alpha_fix;
// Brick 2b — capture Z-test context for this SPRITE.
// close_z is the flat fragment Z (XYZ2 bits[63:32]).
fifo_zte [fifo_wptr] <= new_zte_active;
fifo_ztst [fifo_wptr] <= test_ztst;
fifo_zmsk [fifo_wptr] <= zbuf_zmsk;
fifo_zbp [fifo_wptr] <= zbuf_zbp;
fifo_zval [fifo_wptr] <= close_z;
// Brick 3 — SPRITE uses flat color/Z, no affine ctx.
fifo_tri_active [fifo_wptr] <= 1'b0;
fifo_combined [fifo_wptr] <= 1'b0; // COMBINED probe — SPRITE never combined
fifo_z0 [fifo_wptr] <= 32'd0;
fifo_dr_dx [fifo_wptr] <= 32'sd0; fifo_dr_dy [fifo_wptr] <= 32'sd0;
fifo_dg_dx [fifo_wptr] <= 32'sd0; fifo_dg_dy [fifo_wptr] <= 32'sd0;
fifo_db_dx [fifo_wptr] <= 32'sd0; fifo_db_dy [fifo_wptr] <= 32'sd0;
fifo_da_dx [fifo_wptr] <= 32'sd0; fifo_da_dy [fifo_wptr] <= 32'sd0;
fifo_dz_dx [fifo_wptr] <= 32'sd0; fifo_dz_dy [fifo_wptr] <= 32'sd0;
fifo_v1z [fifo_wptr] <= 32'd0;
fifo_v2z [fifo_wptr] <= 32'd0;
// Textured-triangle rung — SPRITE uses the linear DDA
// UV path (fifo_u0/v0 + fifo_du_dx/dv_dy above), not
// the affine U/V context; clear it.
fifo_u0_base [fifo_wptr] <= 11'd0;
fifo_v0_base [fifo_wptr] <= 11'd0;
fifo_u1v [fifo_wptr] <= 32'd0;
fifo_u2v [fifo_wptr] <= 32'd0;
fifo_du_dx_t [fifo_wptr] <= 32'sd0;
fifo_du_dy_t [fifo_wptr] <= 32'sd0;
fifo_dv_dx_t [fifo_wptr] <= 32'sd0;
fifo_dv_dy_t [fifo_wptr] <= 32'sd0;
// Ch295 — SPRITE has no affine gradients, so its slot
// is never gradient-pending (pop can proceed at once).
fifo_grad_pending [fifo_wptr] <= 1'b0;
// Ch301 perspective — SPRITE never uses the perspective
// S/T/Q path; clear it.
fifo_persp [fifo_wptr] <= 1'b0;
fifo_s0_base [fifo_wptr] <= 24'd0;
fifo_t0_base [fifo_wptr] <= 24'd0;
fifo_q0_base [fifo_wptr] <= 24'd0;
fifo_stq1 [fifo_wptr] <= 64'd0;
fifo_stq2 [fifo_wptr] <= 64'd0;
fifo_q1 [fifo_wptr] <= 24'd0;
fifo_q2 [fifo_wptr] <= 24'd0;
fifo_ds_dx [fifo_wptr] <= 32'sd0; fifo_ds_dy [fifo_wptr] <= 32'sd0;
fifo_dt_dx [fifo_wptr] <= 32'sd0; fifo_dt_dy [fifo_wptr] <= 32'sd0;
fifo_dq_dx [fifo_wptr] <= 32'sd0; fifo_dq_dy [fifo_wptr] <= 32'sd0;
end
// 1-bit ptr in a 2-entry ring just toggles.
// Ch171 — wrap via +1 mod FIFO_DEPTH (was `~fifo_wptr`
// which only flips a 1-bit pointer; broke when
// FIFO_DEPTH grew past 2).
fifo_wptr <= fifo_wptr + FIFO_PTR_W'(1);
end
// ------------------------------------------------------
// Brick-3 fix — DECOUPLED gradient setup engine. One divide
// per cycle, sequenced across the 10 (attribute × axis)
// numerators of the slot the engine is currently solving.
// grad_result = grad_quo[31:0] is BIT-IDENTICAL to the
// pre-fix per-call (num<<<16)/det → [31:0]: same numerators
// (grad_num_dadx/dady), same divisor (fifo_sa == tri_det_post),
// same truncation. Writes the result into the matching
// fifo_d*_* field of grad_slot; on the final step clears the
// slot's grad_pending flag and goes idle.
//
// When idle, the engine scans the FIFO for any slot flagged
// grad_pending (grad_pending_any) and starts on the lowest
// such slot, LATCHING that slot's numerators / divisor from
// the combinational grad_load_* (computed from FIFO storage).
// It then runs its 10 steps undisturbed — the push side never
// touches grad_busy/grad_step/grad_slot/grad_num_q, so a
// back-to-back TRI push cannot clobber an in-flight solve, and
// the next pending slot is picked up the cycle after this one
// finishes. pop_ok holds each slot until its own grad_pending
// clears, so no slot is ever rasterized with stale gradients.
if (grad_busy) begin
// Ch352 (Codex) — at K>1 the SETTLE phase holds grad_step/grad_num_q/grad_det stable for
// GRAD_DIV_CYCLES while the combinational divide settles, then CAPTURES the settled quotient into
// grad_result_q (the SINGLE multicycle endpoint -> the SDC exception covers ~32 regs, not the 5120
// fifo_d* fields that made Place churn). The COMMIT cycle writes grad_wr_data into the selected fifo
// and advances. At GRAD_DIV_CYCLES==1 (sim) the settle phase folds away (param compare) and COMMIT
// runs every cycle writing the combinational divide directly — byte-identical to pre-Ch352.
if (!grad_writing && (GRAD_SEQ_DIVIDER || (GRAD_DIV_CYCLES != 1))) begin
if (GRAD_SEQ_DIVIDER) begin
// Ch352 sequential divider: kicked on entry; wait div_done, capture the bit-exact quotient.
div_start <= 1'b0; // deassert the kick pulse
if (div_done_w) begin
grad_result_q <= div_quo[31:0];
grad_writing <= 1'b1;
end
end else if (grad_settle != 5'(GRAD_DIV_CYCLES-1)) begin
grad_settle <= grad_settle + 5'd1;
end else begin
grad_result_q <= grad_quo[31:0]; // settled combinational divide
grad_settle <= 5'd0;
grad_writing <= 1'b1;
end
end else begin
grad_writing <= 1'b0; // commit (K=1: every cycle; K>1: the write phase)
case (grad_step)
5'd0: fifo_dr_dx [grad_slot] <= grad_wr_data;
5'd1: fifo_dr_dy [grad_slot] <= grad_wr_data;
5'd2: fifo_dg_dx [grad_slot] <= grad_wr_data;
5'd3: fifo_dg_dy [grad_slot] <= grad_wr_data;
5'd4: fifo_db_dx [grad_slot] <= grad_wr_data;
5'd5: fifo_db_dy [grad_slot] <= grad_wr_data;
5'd6: fifo_da_dx [grad_slot] <= grad_wr_data;
5'd7: fifo_da_dy [grad_slot] <= grad_wr_data;
5'd8: fifo_dz_dx [grad_slot] <= grad_wr_data;
5'd9: fifo_dz_dy [grad_slot] <= grad_wr_data;
5'd10: fifo_du_dx_t [grad_slot] <= grad_wr_data;
5'd11: fifo_du_dy_t [grad_slot] <= grad_wr_data;
5'd12: fifo_dv_dx_t [grad_slot] <= grad_wr_data;
5'd13: fifo_dv_dy_t [grad_slot] <= grad_wr_data;
// Ch301 perspective — steps 14..19 only reached when
// GRAD_STEPS==20 (PERSPECTIVE_CORRECT=1). At param=0 the
// engine terminates at step 13 (GRAD_STEPS-1) below, so
// these never fire and the fifo_d{s,t,q}_* fields hold
// their pre-cleared 0 — param=0 byte-identical.
5'd14: fifo_ds_dx [grad_slot] <= grad_wr_data;
5'd15: fifo_ds_dy [grad_slot] <= grad_wr_data;
5'd16: fifo_dt_dx [grad_slot] <= grad_wr_data;
5'd17: fifo_dt_dy [grad_slot] <= grad_wr_data;
5'd18: fifo_dq_dx [grad_slot] <= grad_wr_data;
5'd19: fifo_dq_dy [grad_slot] <= grad_wr_data;
default: ;
endcase
if (grad_step == 5'(GRAD_STEPS-1)) begin // last step
grad_busy <= 1'b0;
fifo_grad_pending [grad_slot] <= 1'b0;
end else begin
grad_step <= grad_step + 5'd1;
if (GRAD_SEQ_DIVIDER) div_start <= 1'b1; // kick the next step's sequential divide
end
end
end else if (grad_prefetch2) begin
// Ch328 1c — stage 3: grad_word_q is settled (latched last cycle from attr_rd_q), so
// grad_load_num/grad_load_det (combinational off grad_word_q) are valid. Latch the
// numerators and start the solve next cycle.
grad_prefetch2 <= 1'b0;
grad_busy <= 1'b1;
grad_step <= 5'd0;
grad_settle <= 5'd0;
grad_writing <= 1'b0;
if (GRAD_SEQ_DIVIDER) div_start <= 1'b1; // kick the first step's sequential divide
grad_det_q <= grad_load_det;
for (int g = 0; g < GRAD_STEPS; g = g + 1)
grad_num_q[g] <= grad_load_num[g];
end else if (grad_prefetching) begin
// Ch328 1c — stage 2: the single read port issued attr_ram[grad_slot] last cycle, so
// attr_rd_q is valid now → capture it into grad_word_q.
grad_prefetching <= 1'b0;
grad_prefetch2 <= 1'b1;
grad_word_q <= attr_rd_q;
end else if (grad_pending_any) begin
// Ch328 1c — stage 1: idle + work waiting. grad_rd_issue (combinational, above) drives
// attr_rd_addr = grad_pending_slot THIS cycle; attr_rd_q holds it next cycle. pop_ok
// holds the slot until grad_pending clears, so the extra latency stays render-neutral.
grad_prefetching <= 1'b1;
grad_slot <= grad_pending_slot;
end
// ------------------------------------------------------
// Pop side: dequeue into ras_*_q when FSM is ready.
// ------------------------------------------------------
if (pop_ok) begin
// Streaming pop stays a SINGLE-cycle register read (Ch328 1c did NOT move it to the
// M20K port): the full-FIFO push+pop concurrency relies on the slot freeing the SAME
// cycle pop_ok asserts, which a multi-cycle M20K read would break — and streaming is
// the shallow, non-capacity path that gains nothing from M20K. M20K is for the
// multiprim grid (mp+grad), where deep per-tile capacity is the goal.
// INPUT fields from attr_ram[fifo_rptr]; HOT + 20 gradient OUTPUTS from fifo_*.
{ ras_v0_x, ras_v0_y, ras_v1_x, ras_v1_y, ras_v2_x, ras_v2_y, ras_bias, ras_sa_q,
ras_color, ras_c0_q, ras_c1_q, ras_c2_q, ras_fbp, ras_fbw, ras_psm, ras_bpp_shift,
ras_tme, ras_u0, ras_v0t, ras_u1, ras_v1t, ras_tex_base, ras_tbw, ras_tpsm,
ras_wms, ras_wmt, ras_tw, ras_th, ras_du_dx_q, ras_dv_dy_q, ras_abe, ras_filter_lin,
ras_alpha_a, ras_alpha_b, ras_alpha_c, ras_alpha_d, ras_alpha_fix,
ras_zte, ras_ztst, ras_zmsk, ras_zbp, ras_z_value, ras_z0, ras_u0_base, ras_v0_base,
pp_dump_u1v, pp_dump_u2v, ras_persp, ras_s0_base, ras_t0_base, ras_q0_base,
pp_dump_stq1, pp_dump_stq2, pp_dump_q1, pp_dump_q2, pp_dump_v1z, pp_dump_v2z } <= attr_ram[fifo_rptr];
ras_mode <= fifo_mode [fifo_rptr]; // HOT (register)
ras_x_min <= fifo_x_min [fifo_rptr];
ras_x_max <= fifo_x_max [fifo_rptr];
ras_y_min <= fifo_y_min [fifo_rptr];
ras_y_max <= fifo_y_max [fifo_rptr];
ras_cur_x <= fifo_x_min [fifo_rptr];
ras_cur_y <= fifo_y_min [fifo_rptr];
prim_x_min <= fifo_x_min [fifo_rptr];
prim_x_max <= fifo_x_max [fifo_rptr];
prim_y_min <= fifo_y_min [fifo_rptr];
prim_y_max <= fifo_y_max [fifo_rptr];
tile_col_r <= 12'd0;
tile_row_r <= 12'd0;
ras_tri_active <= fifo_tri_active [fifo_rptr]; // HOT
ras_combined_r <= fifo_combined [fifo_rptr]; // HOT
comb_state_r <= CB_Z;
// 20 ENGINE-WRITTEN gradient OUTPUTS — sideband registers (NOT in attr_ram).
ras_dr_dx <= fifo_dr_dx [fifo_rptr]; ras_dr_dy <= fifo_dr_dy [fifo_rptr];
ras_dg_dx <= fifo_dg_dx [fifo_rptr]; ras_dg_dy <= fifo_dg_dy [fifo_rptr];
ras_db_dx <= fifo_db_dx [fifo_rptr]; ras_db_dy <= fifo_db_dy [fifo_rptr];
ras_da_dx <= fifo_da_dx [fifo_rptr]; ras_da_dy <= fifo_da_dy [fifo_rptr];
ras_dz_dx <= fifo_dz_dx [fifo_rptr]; ras_dz_dy <= fifo_dz_dy [fifo_rptr];
ras_du_dx_t <= fifo_du_dx_t [fifo_rptr]; ras_du_dy_t <= fifo_du_dy_t [fifo_rptr];
ras_dv_dx_t <= fifo_dv_dx_t [fifo_rptr]; ras_dv_dy_t <= fifo_dv_dy_t [fifo_rptr];
ras_ds_dx <= fifo_ds_dx [fifo_rptr]; ras_ds_dy <= fifo_ds_dy [fifo_rptr];
ras_dt_dx <= fifo_dt_dx [fifo_rptr]; ras_dt_dy <= fifo_dt_dy [fifo_rptr];
ras_dq_dx <= fifo_dq_dx [fifo_rptr]; ras_dq_dy <= fifo_dq_dy [fifo_rptr];
// Ch171 — same wrap fix as wptr above.
fifo_rptr <= fifo_rptr + FIFO_PTR_W'(1);
raster_state <= R_SCAN;
raster_active <= 1'b1;
// Ch304 — TILE-LOCAL: if the popped primitive is the
// combined T+A+Z triangle (fifo_combined), begin with the
// CLEAR phase before the combined render runs. Otherwise
// the tile machine stays OFF. TILE_LOCAL gates the whole
// block; at TILE_LOCAL=0 tile_phase_r stays TP_OFF and the
// sweep counter stays 0 (dead logic).
if (TILE_LOCAL) begin
if (fifo_combined[fifo_rptr]) begin
tile_phase_r <= TP_CLEAR;
tile_sweep_r <= 9'd0;
end else begin
tile_phase_r <= TP_OFF;
end
end
end else if (mp_grid_start) begin
// Ch305+ MULTIPRIM — start a batch grid render. Latch the
// batch (slot 0 = current fifo_rptr, count = whole FIFO),
// reset the per-tile primitive index, start the grid at tile
// (0,0), and enter the CLEAR phase. The per-primitive geometry
// load happens at CLEAR-done (prim 0) and each RENDER advance.
// Dead at the default TILE_MULTIPRIM=0 (mp_grid_start const 0).
grid_base_rptr <= fifo_rptr;
prim_count_r <= fifo_count;
prim_idx_r <= '0;
tile_col_r <= 12'd0;
tile_row_r <= 12'd0;
ras_combined_r <= 1'b1;
// Ch316 — load the batch's FRAME-derived FLUSH params NOW (from the
// oldest FIFO entry = fifo_rptr) so the per-tile flush address is
// valid for EVERY tile, including EMPTY tiles that PRECEDE the first
// non-empty tile (which load no primitive of their own). The flush
// row stride is flush_y * (ras_fbw<<6); a leading-empty tile would
// otherwise flush with the RESET ras_fbw=0 → stride 0 → all rows
// collapse onto row 0 → the tile's screen region keeps the FB-init
// value (black). All prims in a batch share FRAME, so this is exactly
// what the per-prim load (mp_load_prim) sets at render — hence
// byte-identical for any batch whose first tile is non-empty.
ras_fbp <= fifo_fbp [fifo_rptr];
ras_fbw <= fifo_fbw [fifo_rptr];
ras_psm <= fifo_psm [fifo_rptr];
ras_bpp_shift <= fifo_bpp_shift [fifo_rptr];
raster_active <= 1'b1;
raster_state <= R_SCAN;
tile_sweep_r <= 9'd0;
if (BIN_BUFFER_ENABLE) begin
// BIN BUFFER — run the classification sweep BEFORE rendering.
// Clear all bins, reset the (t,p) sweep counters, enter TP_BIN.
// When the sweep completes it enters TP_CLEAR of tile (0,0).
bin_t <= '0;
bin_p <= '0;
bin_slot_r <= '0;
for (int bt = 0; bt < NTILES; bt = bt + 1)
bin_n[bt] <= '0;
tile_phase_r <= TP_BIN;
end else begin
tile_phase_r <= TP_CLEAR;
end
end else if (ras_drain_done) begin
// Pipeline drained, no more work — go idle.
raster_state <= R_IDLE;
raster_active <= 1'b0;
ras_mode <= RM_NONE;
end
// FIFO occupancy: net change = push_ok - pop_ok.
case ({push_ok, pop_ok})
2'b10: fifo_count <= fifo_count + FIFO_CNT_W'(1);
2'b01: fifo_count <= fifo_count - FIFO_CNT_W'(1);
default: ; // 00 or 11 (concurrent push+pop) — no net change
endcase
// ------------------------------------------------------
// Ch88 — pipeline stages
//
// S0 (this cycle, while in R_SCAN): drive the pipe with
// ras_cur_x/y; advance for next cycle. On reaching the
// bbox corner, transition to R_DRAIN (S0 stops
// producing valid pixels but S1/S2 still propagate).
//
// S1 (one cycle after S0): latch s1_x_q / s1_y_q and
// s1_valid_q. S1's combinational `s1_pixel_inside` is
// already evaluated for the next-cycle latch into S2.
//
// S2 (two cycles after S0): latch s2_*_q. Emit if
// s2_valid_q && s2_inside_q.
// ------------------------------------------------------
// Brick 2b — half-rate beat toggle. While a Z-tested scan is
// in flight, z_beat_q alternates every cycle so each pixel
// gets a "primary" beat (z_beat=0; pipeline advances, color
// write) and a "z" beat (z_beat=1; pipeline frozen, Z write).
// When ras_zte=0, z_advance is always 1 and z_beat_q stays 0,
// so the pipeline runs full rate — byte-identical to pre-2b.
if (z_scan_active) z_beat_q <= ~z_beat_q;
else z_beat_q <= 1'b0;
// Ch344 — textured-alpha SPRITE half-rate beat (mirrors z_beat_q). Mutually exclusive
// with z_scan_active (a textured-alpha sprite has ras_zte=0), so at most one half-rate
// scan is active and `z_advance && ta_advance` collapses to whichever is in flight.
if (ta_scan_active) ta_beat_q <= ~ta_beat_q;
else ta_beat_q <= 1'b0;
// Ch344 — capture the dest pixel on the FROZEN beat (ta_beat=1), when fb_rd fired for
// this S2 pixel (s2_tex_abe_dest). Combinational read2 returns it the same cycle; the
// blend+emit consumes it on the following primary beat. Mirrors Brick 2b's frozen-beat
// s3_zstored capture. (Registered read2 would return it 1 cyc later, still before the
// primary beat — the frozen beat holds fb_rd_addr, so a single capture point is correct.)
// Texel capture: UNCONDITIONAL on every frozen beat (a texel request was issued on the
// preceding primary beat, so s1_tex_color is freshly valid here) + 1-deep delay for S2 align.
if (ta_scan_active && ta_beat_q) begin
ta_tex_q <= s1_tex_color;
ta_tex_q1 <= ta_tex_q;
end
// Dest capture (combinational-read fallback; registered uses fb_rd_data live on the emit beat).
if (ta_scan_active && ta_beat_q && s2_valid_q && s2_inside_q && ras_tex_abe)
ta_dest_q <= fb_rd_data;
// Pipeline advance (S0 walker + S0->S1->S2->S3 latches) is
// GATED by z_advance. On a frozen z beat all pipeline regs
// hold and only the queued Z write fires (below).
if (z_advance && ta_advance) begin // Ch344: freeze on the textured-alpha frozen beat too
// S0 → S1 latch.
// COMBINED probe: while a combined scan owns the walker,
// the standard pixel pipeline is kept EMPTY (s1_valid_q
// forced 0) so no legacy emit / S3 capture fires for the
// combined primitive — the combined FSM (below) drives all
// reads/writes. !ras_combined is constant 1 at param=0, so
// this is byte-identical to the legacy latch then.
s1_x_q <= ras_cur_x;
s1_y_q <= ras_cur_y;
s1_valid_q <= (raster_state == R_SCAN) && !ras_combined;
// S1 → S2 latch
s2_x_q <= s1_x_q;
s2_y_q <= s1_y_q;
s2_valid_q <= s1_valid_q;
s2_inside_q <= s1_pixel_inside;
s2_L0_q <= -$signed(s1_e1);
s2_L1_q <= -$signed(s1_e2);
s2_L2_q <= -$signed(s1_e0);
s2_mode_q <= ras_mode;
// Brick 1 — carry the sampled texel + texturing-active
// flag forward in lockstep with the S2 pixel it belongs
// to. With TEX_RD_LATENCY=1 against a combinational read2
// port, s1_tex_color is already the texel for this S1
// pixel, so a single register aligns it with S2 emit.
s2_tex_active_q <= s1_tex_active;
s2_tex_color_q <= s1_tex_color;
// S2 emit / S2->S3 blend or Z-test latch.
//
// Opaque non-Z pixel (ras_abe=0, ras_zte=0): drive the
// emit outputs directly, exactly as pre-Brick-2a/2b.
//
// Blended pixel (ras_abe=1): capture into the alpha S3
// stage (no Z scan can also be alpha — mutually exclusive
// by new_zte_active).
//
// Z-tested pixel (ras_zte=1): capture into the Z S3 stage.
// z_rd_* (above) presented the Z-buffer address this S2
// cycle. For the combinational read port latch the stored
// Z now; for the registered port the live z_rd_data is
// compared at S3.
s3_valid_q <= 1'b0;
s3_zvalid_q <= 1'b0;
if (s2_valid_q && s2_inside_q) begin
if (ras_tex_abe) begin
// Ch344 — textured-alpha SPRITE: emit the blended pixel on this primary beat.
// The texel is live in s2_tex_color_q; the dest was captured into ta_dest_q on
// the prior frozen beat. ta_blended_color64 = source-over(texel*vtx, ta_dest_q).
raster_pixel_emit <= 1'b1;
raster_pixel_emit_count <= raster_pixel_emit_count + 32'd1;
raster_pixel_x_q <= s2_x_q;
raster_pixel_y_q <= s2_y_q;
raster_pixel_color_q <= ta_blended_color64;
raster_pixel_fb_addr_q <= s2_fb_addr;
raster_pixel_be_q <= 4'b1111;
raster_pixel_mask_q <= 32'hFFFF_FFFF;
raster_pixel_psm_q <= ras_psm;
end else if (ras_abe) begin
s3_valid_q <= 1'b1;
s3_x_q <= s2_x_q;
s3_y_q <= s2_y_q;
s3_fb_addr_q <= s2_fb_addr;
s3_be_q <= s2_emit_be;
s3_mask_q <= s2_emit_mask;
s3_psm_q <= ras_psm;
// Flat SPRITE source color is ras_color (no texturing
// when blending in v1). Source alpha = RGBAQ.A.
s3_cs_q <= ras_color;
s3_as_q <= ras_color[31:24];
s3_dest_q <= fb_rd_data; // used when FB_RD_REGISTERED=0
end else if (ras_zte) begin
s3_zvalid_q <= 1'b1;
s3_zx_q <= s2_x_q;
s3_zy_q <= s2_y_q;
s3_zfb_addr_q <= s2_fb_addr;
s3_z_addr_q <= s2_z_addr;
s3_zcolor_q <= s2_emit_color64; // flat (sprite) or Gouraud (TRI) PSMCT32 color
// Brick 3 — TRI uses the per-pixel INTERPOLATED Z
// (s2_interp_z); SPRITE uses the flat ras_z_value.
s3_zval_q <= ras_tri_active ? s2_interp_z : ras_z_value;
s3_ztst_q <= ras_ztst;
s3_zmsk_q <= ras_zmsk;
// Combinational read port: z_rd_data is valid THIS
// S2 primary beat (mem[s2_z_addr] same cycle), so
// capture P's stored Z now. The registered read is
// 1 cycle late and is captured on the following
// frozen z beat instead (below).
if (!Z_RD_REGISTERED)
s3_zstored_q <= z_rd_data;
end else if (!ras_persp && !ras_combined) begin
// COMBINED probe: suppress the legacy opaque DECAL
// emit for the combined primitive (the combined FSM
// drives the blended color + Z writes instead). This
// branch is already unreachable in combined mode
// (s2_valid_q is held 0), but the explicit guard
// documents the suppression. Constant 1 at param=0.
// Ch302 — SUPPRESS the affine S2 opaque emit for a
// perspective primitive. The S2 pixel carries the
// STALE affine texel (the texture unit is muxed to
// the perspective u/v), so it must NOT drive the
// raster_pixel_* port. The perspective emit (below)
// drives them instead, from the +5 stage. When
// ras_persp=0 (always at PERSPECTIVE_CORRECT=0) this
// is byte-identical to the legacy opaque emit.
raster_pixel_emit <= 1'b1;
raster_pixel_emit_count <= raster_pixel_emit_count + 32'd1;
raster_pixel_x_q <= s2_x_q;
raster_pixel_y_q <= s2_y_q;
raster_pixel_color_q <= s2_emit_color64;
raster_pixel_fb_addr_q <= s2_fb_addr;
raster_pixel_be_q <= s2_emit_be;
raster_pixel_mask_q <= s2_emit_mask;
raster_pixel_psm_q <= ras_psm;
end
end
// Ch302 — PERSPECTIVE emit. Fires at S1+5 (persp_emit5),
// when s1_tex_color is the perspective DECAL texel for the
// pixel at (persp_x5, persp_y5). PSMCT32 full-word write
// (be=1111, mask=all). The DECAL texel color is the sampled
// ABGR zero-extended to 64 bits (upper 32 = Q, don't-care
// for the PSMCT32 fb write — mirrors the affine TRI DECAL
// packing {.., s2_tex_color_q} whose upper half is also
// unused by the write). Ordered AFTER the affine opaque
// emit so it wins the raster_pixel_* port on a perspective
// prim; the affine branch is suppressed via !ras_persp
// above, so there is never a same-cycle double-drive.
// ras_persp is constant 0 at PERSPECTIVE_CORRECT=0, so this
// block is dead in the affine-only build.
if (ras_persp && persp_emit5) begin
raster_pixel_emit <= 1'b1;
raster_pixel_emit_count <= raster_pixel_emit_count + 32'd1;
raster_pixel_x_q <= persp_x5;
raster_pixel_y_q <= persp_y5;
raster_pixel_color_q <= {32'd0, s1_tex_color};
raster_pixel_fb_addr_q <= persp_fb_addr;
raster_pixel_be_q <= 4'b1111;
raster_pixel_mask_q <= 32'hFFFF_FFFF;
raster_pixel_psm_q <= ras_psm;
end
// S3 emit (Brick 2a) — drive the emit outputs from the
// blended color. PSMCT32 full-word write (be=1111, mask=all).
if (s3_valid_q) begin
raster_pixel_emit <= 1'b1;
raster_pixel_emit_count <= raster_pixel_emit_count + 32'd1;
raster_pixel_x_q <= s3_x_q;
raster_pixel_y_q <= s3_y_q;
raster_pixel_color_q <= s3_blended_color64;
raster_pixel_fb_addr_q <= s3_fb_addr_q;
raster_pixel_be_q <= s3_be_q;
raster_pixel_mask_q <= s3_mask_q;
raster_pixel_psm_q <= s3_psm_q;
end
// S3 Z-test (Brick 2b) — PRIMARY-beat write.
//
// This always runs on z_advance (the primary beat for a
// Z scan, OR every cycle if somehow not a Z scan — but
// s3_zvalid_q is only ever set under ras_zte). On a
// passing pixel: write the COLOR to the framebuffer this
// beat (full-word PSMCT32), and queue the Z write for the
// following frozen z beat (unless ZMSK=1, which suppresses
// the Z update). On a failing pixel: write NOTHING (the
// pixel is occluded) and queue nothing.
zw_pending_q <= 1'b0;
if (s3_zvalid_q) begin
if (z_test_pass) begin
// Framebuffer color write (PSMCT32, full word).
raster_pixel_emit <= 1'b1;
raster_pixel_emit_count <= raster_pixel_emit_count + 32'd1;
raster_pixel_x_q <= s3_zx_q;
raster_pixel_y_q <= s3_zy_q;
raster_pixel_color_q <= s3_zcolor_q;
raster_pixel_fb_addr_q <= s3_zfb_addr_q;
raster_pixel_be_q <= 4'b1111;
raster_pixel_mask_q <= 32'hFFFF_FFFF;
raster_pixel_psm_q <= 6'h00; // PSMCT32 color fb
// Queue the Z-buffer write for the z beat unless
// ZMSK masks Z updates. The Z value is stored as a
// plain 32-bit word (PSMZ32 == 32-bit, bit-identical
// to a PSMCT32 word write at the Z address).
if (!s3_zmsk_q) begin
zw_pending_q <= 1'b1;
zw_addr_q <= s3_z_addr_q;
zw_val_q <= s3_zval_q;
end
end
// FAIL: no fb write, no Z write — pixel discarded.
end
// S0 advance + R_SCAN → R_DRAIN transition.
// COMBINED probe: the legacy 1-pixel/cycle advance is
// gated OFF while a combined scan owns the walker — the
// combined FSM (below) advances ras_cur_x/y only after a
// pixel's full 5-beat sequence (or early-out) completes.
// !ras_combined is constant 1 at param=0 (byte-identical).
if ((raster_state == R_SCAN) && !ras_combined) begin
if (ras_at_end_of_s0) begin
raster_state <= R_DRAIN;
end else if (ras_at_x_end) begin
ras_cur_x <= ras_x_min;
ras_cur_y <= ras_cur_y + 12'd1;
end else begin
ras_cur_x <= ras_cur_x + 12'd1;
end
end
// ======================================================
// COMBINED PROBE — per-pixel 5-beat FSM.
//
// Drives the SAME raster_pixel_* write port and the SAME
// tex/fb/z read ports as the legacy paths, but ONE beat
// at a time so the read2 mux + mutual-exclusion asserts in
// the BRAM top never see two consumers in one cycle. The
// read enables themselves (comb_z_req / comb_tex_req /
// comb_fb_req feeding z_rd_en / tex in_valid / fb_rd_en)
// are driven combinationally above; this block sequences
// the beat register, latches returned data, issues the two
// writes, and advances the walker. Guarded so it is dead
// logic at COMBINED_TAZ=0 (comb_scan_active const 0).
// ======================================================
if (comb_scan_active) begin
unique case (comb_state_r)
// Beat 0 — Z read issued this cycle (comb_z_req).
CB_Z: begin
if (comb_pix_inside_w) begin
comb_state_r <= CB_ZW;
end else begin
// Outside the triangle: NO reads, NO writes
// for this pixel — advance immediately.
comb_advance_walker;
end
end
// Beat 1 — stored Z ready (z_rd_data). Compute +
// latch the depth test. On PASS issue the texel
// read (comb_tex_req high this cycle). On FAIL no
// texel/dest read, no write — advance.
CB_ZW: begin
comb_ztest_pass_r <= comb_ztest_pass_w;
// Ch304 — capture the Z actually compared
// (tile Z in tile mode, VRAM Z otherwise).
// comb_z_compare_w == z_rd_data at TILE_LOCAL=0.
comb_zstored_r <= comb_z_compare_w;
if (comb_ztest_pass_w) begin
// Ch310 — on the bilinear path the texel read
// issued THIS cycle (comb_tex_req pulsed
// u_tex.in_valid; the 4-tap sampler started and
// tex_busy is now high). STALL in CB_TWAIT until
// it completes, then run the legacy CB_T beat.
// On the nearest path (bili_now=0, always so at
// BILINEAR_ENABLE=0) go straight to CB_T as
// before — byte-identical.
if (bili_now) comb_state_r <= CB_TWAIT;
else comb_state_r <= CB_T;
end else begin
comb_advance_walker;
end
end
// Ch310 — Beat 1.5 (BILINEAR ONLY) — hold while the
// 4-tap sampler runs its ~9-cycle multi-beat read. The
// sampler runs full-rate; this FSM only steps on
// z_advance beats, so we wait on the LEVEL !tex_busy
// (which stays low once the sampler returns to BS_IDLE)
// rather than the single-cycle out_valid pulse, so a
// frozen z beat can never miss the completion. in_valid
// is NOT re-pulsed here: comb_tex_req is gated on
// (comb_state_r==CB_ZW), and CB_TWAIT!=CB_ZW. When the
// sampler is done, fall through to CB_T — which latches
// s1_tex_color (HELD stable since out_valid per the
// gs_texture_unit hold register) and issues the dest-fb
// read exactly as the nearest path. This state is dead
// logic at BILINEAR_ENABLE=0 (never entered).
CB_TWAIT: begin
if (!tex_busy) comb_state_r <= CB_T;
// else: hold (do not advance, do not re-pulse).
end
// Beat 2 — texel ready (s1_tex_color). Latch as the
// source color Cs + As; issue the dest-fb read
// (comb_fb_req high this cycle). On the bilinear path
// s1_tex_color is the HELD blended ABGR captured above.
CB_T: begin
// Ch333/335 — MODULATE (TFX=0): texel * interpolated vertex color (Gouraud;
// flat collapses to the constant); DECAL (TFX=1): texel replaces (unchanged).
// Alpha keeps the texel value either way.
comb_cs_r <= comb_modulate ? gs_modulate_abgr(s1_tex_color, comb_interp_color)
: s1_tex_color;
comb_as_r <= s1_tex_color[31:24];
comb_state_r <= CB_FB;
end
// Beat 3 — dest color ready (fb_rd_data). WRITE #1:
// the blended PSMCT32 color to the FB address.
CB_FB: begin
// Ch304 — in tile-local mode the blended color
// write lands in the on-chip color tile (driven
// combinationally by the tile-RAM driver this
// same beat), so SUPPRESS the VRAM framebuffer
// emit here. tile_active is constant 0 at
// TILE_LOCAL=0, so this is the legacy emit then.
if (!tile_active) begin
raster_pixel_emit <= 1'b1;
raster_pixel_emit_count <= raster_pixel_emit_count + 32'd1;
raster_pixel_x_q <= ras_cur_x;
raster_pixel_y_q <= ras_cur_y;
// Upper 32 bits unused for a PSMCT32 (32-bit word)
// write. comb_cs_r is only [31:0], so the original
// comb_cs_r[63:32] was an out-of-range select (iverilog
// X-filled it; Quartus rejects it) — zero-fill instead.
raster_pixel_color_q <= {32'd0,
comb_blend_a, comb_blend_b,
comb_blend_g, comb_blend_r};
raster_pixel_fb_addr_q <= comb_fb_addr;
raster_pixel_be_q <= 4'b1111;
raster_pixel_mask_q <= 32'hFFFF_FFFF;
raster_pixel_psm_q <= 6'h00; // PSMCT32
end
comb_state_r <= CB_ZWR;
end
// Beat 4 — WRITE #2: the fragment Z to the Z addr
// (PSMCT32 full word at the Z address). If ZMSK,
// skip the write but STILL consume the beat. Then
// advance the walker (both writes are now done).
CB_ZWR: begin
// Ch304 — in tile-local mode the fragment-Z
// write lands in the on-chip Z tile (driven by
// the tile-RAM driver this same beat, also ZMSK-
// gated), so SUPPRESS the VRAM Z emit here.
// tile_active const 0 at TILE_LOCAL=0 (legacy).
if (!ras_zmsk && !tile_active) begin
raster_pixel_emit <= 1'b1;
raster_pixel_emit_count <= raster_pixel_emit_count + 32'd1;
raster_pixel_x_q <= ras_cur_x;
raster_pixel_y_q <= ras_cur_y;
raster_pixel_color_q <= {32'd0, comb_frag_z};
raster_pixel_fb_addr_q <= comb_z_addr;
raster_pixel_be_q <= 4'b1111;
raster_pixel_mask_q <= 32'hFFFF_FFFF;
raster_pixel_psm_q <= 6'h00; // 32-bit word
end
comb_advance_walker;
end
default: comb_state_r <= CB_Z;
endcase
end
end else begin
// ----------------------------------------------------
// Frozen z beat (Brick 2b): pipeline held. Fire the
// queued Z-buffer write through the single VRAM write
// port (emitted as a PSMCT32 full-word write at the Z
// address). No new color contends because S2/S3 are
// frozen this cycle.
// ----------------------------------------------------
if (zw_pending_q) begin
raster_pixel_emit <= 1'b1;
raster_pixel_emit_count <= raster_pixel_emit_count + 32'd1;
raster_pixel_x_q <= s3_zx_q;
raster_pixel_y_q <= s3_zy_q;
raster_pixel_color_q <= {32'd0, zw_val_q};
raster_pixel_fb_addr_q <= zw_addr_q;
raster_pixel_be_q <= 4'b1111;
raster_pixel_mask_q <= 32'hFFFF_FFFF;
raster_pixel_psm_q <= 6'h00; // 32-bit word write
end
zw_pending_q <= 1'b0;
// Registered read port: z_rd_addr was held across the
// primary->frozen edge, so z_rd_data on THIS frozen beat
// is the 1-cycle-late read of the S3-resident pixel's
// Z-buffer address — i.e. this pixel's stored Z. Capture
// it for the compare on the next primary beat. (The
// combinational port already captured at the S2 latch.)
if (Z_RD_REGISTERED && s3_zvalid_q)
s3_zstored_q <= z_rd_data;
end
// ==========================================================
// Ch304 TILE-LOCAL — phase FSM + CLEAR/FLUSH sweeps.
//
// Placed OUTSIDE the z_advance/frozen split so the CLEAR and
// FLUSH sweeps run at FULL rate (one tile RAM op / clock) and
// the registered tile-RAM read aligns to the very next clock —
// the combined prim is ZTE so z_beat toggles, and a sweep gated
// on z_advance would lose the read/emit alignment across frozen
// beats. The combined RENDER (above) still steps on z_advance;
// CLEAR/FLUSH touch only the tile RAMs, not the pixel pipeline.
// Entirely gated by tile_active → dead at TILE_LOCAL=0.
//
// TP_CLEAR : the combinational driver writes the clear
// color/Z at index tile_sweep_r this clock; here
// we advance the sweep. After all 256 entries
// (sweep 255 → next 256), move to TP_RENDER.
// TP_RENDER : the combined FSM (above) runs and, at the bbox
// corner, comb_advance_walker sets raster_state
// <= R_DRAIN. Observing R_DRAIN here closes the
// render → kick off TP_FLUSH.
// TP_FLUSH : the combinational driver presents color
// raddr=tile_sweep_r; ONE clock later we emit the
// FB write (flush_emit_q / flush_idx_q). After all
// 256 indices presented + the last emit drained,
// return to TP_OFF → ras_drain_done can fire and
// the FSM idles / pops the next primitive.
// ==========================================================
flush_emit_q <= 1'b0;
z_flush_emit_q <= 1'b0; // Ch323 — Z-flush pipeline strobe (pulse), default each cycle
if (tile_active) begin
if (mp_rd_state != RDS_IDLE) begin
// Ch328 1c — attr read IN FLIGHT: hold the tile phase until attr_rd_q is valid.
// RDS_ISSUE presented attr_rd_addr=mp_pend_slot this cycle (comb); attr_rd_q is
// valid in RDS_CONSUME, where mp_load_prim latches it and the phase advances.
if (mp_rd_state == RDS_ISSUE) begin
mp_rd_state <= RDS_CONSUME;
end else begin // RDS_CONSUME
`ifndef SYNTHESIS
// Ch329 — the binning gate must guarantee only COMBINED prims reach the grid
// render (non-combined have no tile-local path). If one leaks, the walk would
// stall (tile_active=0) — catch it here instead.
if (!fifo_combined[mp_pend_slot])
$error("ch329: multiprim grid loaded NON-combined prim (slot %0d) — binning gate failed", mp_pend_slot);
`endif
mp_load_prim(mp_pend_slot, mp_pend_cxlo, mp_pend_cxhi,
mp_pend_cylo, mp_pend_cyhi);
raster_state <= R_SCAN;
comb_state_r <= CB_Z;
// tile_phase is already TP_RENDER (set at the issuing site, so the clear/flush
// combinational drivers don't keep writing during the stall).
mp_rd_state <= RDS_IDLE;
end
end else
unique case (tile_phase_r)
// ==================================================
// TP_BIN (BIN_BUFFER_ENABLE only) — classification sweep.
// One (prim bin_p, tile bin_t) pair per cycle. If the pair
// overlaps, APPEND prim bin_p to tile bin_t's bin in p order.
// p (inner) advances every cycle; when p reaches the last
// prim of the batch, reset p to 0 and advance t (outer).
// When t passes the last tile, all pairs are classified ->
// enter TP_CLEAR of tile (0,0). Touches NO tile RAM, so the
// combinational tile-RAM driver's case-default leaves it inert.
// Schedule: prim_count_r * NTILES cycles (4 prims x 4 tiles
// = 16 cycles for a full 2x2 batch), then 1 cycle to TP_CLEAR.
// ==================================================
TP_BIN: if (BIN_BUFFER_ENABLE) begin
// Ch329 — count each non-combined prim ONCE (on the first tile pass); it is
// refused below. Non-combined prims have no tile-local path (combined-only grid).
if (bin_t == BIN_T_W'(0) && !fifo_combined[bin_p])
tile_refused_count <= tile_refused_count + 16'd1;
// Ch329 — only COMBINED prims are binned (they alone have a tile-local color/Z
// path). A non-combined prim is skipped → never loaded by the bin-walk → cannot
// stall the grid (the old behavior: tile_active=0 froze the walk).
if (bin_overlap && fifo_combined[bin_p]) begin
// Ch315 — bin-capacity guard. With the bin sized to the
// FIFO (M=N) bin_n can never reach FIFO_DEPTH (the FIFO
// drops the prim first), so the else-arm is defensive and
// would only fire under a future M<N decoupling.
if (bin_n[bin_t] < FIFO_CNT_W'(FIFO_DEPTH)) begin
bin_prim[bin_t][bin_n[bin_t]] <= bin_p;
bin_n[bin_t] <= bin_n[bin_t] + FIFO_CNT_W'(1);
// peak per-tile occupancy (post-append count).
if ((bin_n[bin_t] + FIFO_CNT_W'(1)) > bin_occ_max_r)
bin_occ_max_r <= bin_n[bin_t] + FIFO_CNT_W'(1);
end else begin
bin_overflow_r <= 1'b1;
end
end
if (bin_p >= prim_count_r - FIFO_CNT_W'(1)) begin
// last prim for this tile -> advance to next tile.
bin_p <= '0;
if (bin_t >= BIN_T_W'(NTILES - 1)) begin
// All (p,t) pairs classified -> start rendering.
tile_phase_r <= TP_CLEAR;
tile_sweep_r <= 9'd0;
end else begin
bin_t <= bin_t + BIN_T_W'(1);
end
end else begin
bin_p <= bin_p + FIFO_CNT_W'(1);
end
end
TP_CLEAR: begin
if (tile_sweep_r == 9'd255) begin
tile_sweep_r <= 9'd0;
if (BIN_BUFFER_ENABLE) begin
// BIN BUFFER — walk cur_t's PRECOMPUTED bin
// instead of re-scanning. Empty bin (bin_n==0)
// -> FLUSH (tile = clear). Else load bin slot 0's
// prim (bin_prim[cur_t][0]) with its tile-clipped
// bbox and render. Bin order == mp_next_nonempty
// order, so this is image-equivalent to the
// re-test path. (Reached only at TILE_MULTIPRIM=1,
// the only mode that ever enters TP_BIN.)
if (bin_n[cur_t] == FIFO_CNT_W'(0)) begin
bin_slot_r <= '0;
prim_idx_r <= '0;
tile_phase_r <= TP_FLUSH;
end else begin
bin_slot_r <= '0;
prim_idx_r <= binw_first_idx;
// Ch328 1c — issue single-port read; stall FSM consumes + → TP_RENDER.
mp_rd_state <= RDS_ISSUE;
mp_pend_slot <= binw_first_slot;
mp_pend_cxlo <= binw_first_cx_lo; mp_pend_cxhi <= binw_first_cx_hi;
mp_pend_cylo <= binw_first_cy_lo; mp_pend_cyhi <= binw_first_cy_hi;
tile_phase_r <= TP_RENDER;
end
end else if (TILE_MULTIPRIM) begin
// Ch305+ MULTIPRIM — the tile RAMs are now
// cleared. Load the FIRST non-empty primitive of
// the batch (skipping primitives whose clip
// against this tile is empty) and start
// rendering it. If NO primitive overlaps this
// tile, go straight to FLUSH (tile = clear).
// prim_idx_r is set to the chosen index so the
// RENDER-advance below continues from there.
if (mp_first_idx >= prim_count_r) begin
// No primitive overlaps this tile.
prim_idx_r <= '0;
tile_phase_r <= TP_FLUSH;
end else begin
prim_idx_r <= mp_first_idx;
// Ch328 1c — issue single-port read; stall FSM consumes + → TP_RENDER.
mp_rd_state <= RDS_ISSUE;
mp_pend_slot <= mp_first_slot;
mp_pend_cxlo <= mp_first_cx_lo; mp_pend_cxhi <= mp_first_cx_hi;
mp_pend_cylo <= mp_first_cy_lo; mp_pend_cyhi <= mp_first_cy_hi;
tile_phase_r <= TP_RENDER;
end
end else begin
// Ch305 — CLIP the primitive bbox to THIS tile and
// load it into the walker bbox. tile_idx_w (low 4
// bits) gives the tile-local RAM address for ANY
// 16-aligned tile, so no other retargeting needed.
// If the clip is EMPTY (prim doesn't overlap this
// tile) SKIP the RENDER → go straight to FLUSH so
// the tile shows the clear color. Re-arm R_SCAN so
// the combined walker runs again for this tile (it
// is left in R_DRAIN by the previous tile's render;
// at COLS=ROWS=1 raster_state is already R_SCAN, so
// this is a no-op → byte-identical).
ras_x_min <= clip_x_min;
ras_x_max <= clip_x_max;
ras_y_min <= clip_y_min;
ras_y_max <= clip_y_max;
ras_cur_x <= clip_x_min;
ras_cur_y <= clip_y_min;
if (tile_clip_empty) begin
tile_phase_r <= TP_FLUSH;
end else if (TILE_SPILL_ENABLE && spill_valid[tile_id_w]) begin
// Ch323 — RELOAD this tile's color+Z from LPDDR before RENDER, but
// ONLY if it has been spilled before (valid backing). Arm the
// staging fill, then TP_RELOAD sweeps it into the tile RAMs → RENDER.
tile_phase_r <= TP_RELOAD;
reload_start_o <= 1'b1; // pulse: arm de25 staging fill
reload_wait <= 1'b1;
tile_sweep_r <= 9'd0;
end else begin
// No valid backing yet (first pass) OR spill disabled: render from
// the local CLEAR (clear Z/color already written by TP_CLEAR). This
// is the clean-Z bootstrap — never reload garbage on the first batch.
tile_phase_r <= TP_RENDER;
raster_state <= R_SCAN;
comb_state_r <= CB_Z;
end
end
end else begin
tile_sweep_r <= tile_sweep_r + 9'd1;
end
end
TP_RENDER: begin
if (BIN_BUFFER_ENABLE) begin
// BIN BUFFER — same pipeline-drain gate as multiprim,
// but advance through cur_t's PRECOMPUTED bin: the
// next slot is bin_slot_r+1. If it is past bin_n[cur_t]
// the tile's bin is exhausted -> FLUSH. Else step the
// slot, load that bin entry's prim (tile-clipped), and
// STAY in TP_RENDER (no inter-prim clear). The walk
// order is the bin's append order == draw order ==
// mp_next_nonempty order, so image-equivalent.
if (comb_pipe_empty) begin
if (bin_slot_r + FIFO_CNT_W'(1) >= bin_n[cur_t]) begin
tile_phase_r <= TP_FLUSH;
tile_sweep_r <= 9'd0;
end else begin
bin_slot_r <= bin_slot_r + FIFO_CNT_W'(1);
prim_idx_r <= binw_next_idx;
// Ch328 1c — issue single-port read; stall FSM consumes (STAY TP_RENDER).
mp_rd_state <= RDS_ISSUE;
mp_pend_slot <= binw_next_fslot;
mp_pend_cxlo <= binw_next_cx_lo; mp_pend_cxhi <= binw_next_cx_hi;
mp_pend_cylo <= binw_next_cy_lo; mp_pend_cyhi <= binw_next_cy_hi;
end
end
end else if (TILE_MULTIPRIM) begin
// Ch305+ MULTIPRIM — wait for the per-pixel pipeline to
// FULLY flush (comb_pipe_empty), not just the walker
// reaching R_DRAIN, before touching ras_* / starting the
// next primitive. Otherwise primitive N's in-flight
// color/Z writes are lost and N+1's tile_z reads see
// stale data (broken occlusion). prim_idx_r just
// finished compositing; advance to the next NON-EMPTY
// primitive (mp_after_idx), or FLUSH if the batch is
// exhausted for this tile.
if (comb_pipe_empty) begin
if (mp_after_idx >= prim_count_r) begin
tile_phase_r <= TP_FLUSH;
tile_sweep_r <= 9'd0;
end else begin
prim_idx_r <= mp_after_idx;
// Ch328 1c — issue single-port read; stall FSM consumes (STAY TP_RENDER).
mp_rd_state <= RDS_ISSUE;
mp_pend_slot <= mp_after_slot;
mp_pend_cxlo <= mp_after_cx_lo; mp_pend_cxhi <= mp_after_cx_hi;
mp_pend_cylo <= mp_after_cy_lo; mp_pend_cyhi <= mp_after_cy_hi;
end
end
end else begin
// Ch304 single-primitive path — unchanged (byte-identical).
if (raster_state == R_DRAIN) begin
tile_phase_r <= TP_FLUSH;
tile_sweep_r <= 9'd0;
end
end
end
TP_FLUSH: begin
if (tile_sweep_r < 9'd256) begin
// Ch336 — sparse flush: a non-first batch (batch_full_r=0) writes ONLY the
// pixels it drew, preserving the accumulated FB. Always-full when not
// accumulating, or on a scene's first batch -> byte-identical legacy flush.
flush_emit_q <= (!TILE_ACCUM_ENABLE) || batch_full_r
|| tile_written_r[tile_sweep_r[7:0]];
flush_idx_q <= tile_sweep_r[7:0];
// Latch the tile origin alongside the emit so the
// registered-read +1-cycle FB write lands at the
// right screen offset. 0 at COLS=ROWS=1.
flush_ox_q <= tile_ox;
flush_oy_q <= tile_oy;
tile_sweep_r <= tile_sweep_r + 9'd1;
end else begin
// All 256 raddrs presented; the final emit (for
// index 255) drains THIS clock (flush_emit_q set
// last clock). Ch305 — ADVANCE to the next tile:
// if this was the last tile, the grid render is
// done → TP_OFF; otherwise step the tile counter
// (col++, wrap to row++) and re-CLEAR. At
// COLS=ROWS=1 tile_is_last is always true → TP_OFF
// after one tile (byte-identical to Ch303).
// Ch323 — when spilling, flush THIS tile's Z to LPDDR first
// (TP_ZFLUSH) and advance from there. Default (no spill) falls
// straight through to the unchanged advance below.
if (TILE_SPILL_ENABLE) begin
tile_phase_r <= TP_ZFLUSH;
tile_sweep_r <= 9'd0;
end else
if (tile_is_last) begin
if (TILE_MULTIPRIM) begin
// Ch305+ MULTIPRIM — last tile of the grid
// done. The batch IS the whole FIFO and the
// feed finished before grid start, so DRAIN
// the entire FIFO and end the grid, mirroring
// the normal ras_drain_done idle tidy-up.
// NOTE: streaming-multiprim (partial drain
// with count arithmetic) is future work.
fifo_rptr <= fifo_wptr;
fifo_count <= '0;
for (int di = 0; di < FIFO_DEPTH; di = di + 1)
fifo_grad_pending[di] <= 1'b0;
tile_phase_r <= TP_OFF;
raster_active <= 1'b0;
raster_state <= R_IDLE;
ras_mode <= RM_NONE;
end else begin
tile_phase_r <= TP_OFF;
end
end else begin
tile_phase_r <= TP_CLEAR;
tile_sweep_r <= 9'd0;
if (tile_col_r == 12'(TILE_COLS - 1)) begin
tile_col_r <= 12'd0;
tile_row_r <= tile_row_r + 12'd1;
end else begin
tile_col_r <= tile_col_r + 12'd1;
end
end
end
end
// Ch323 (TILE_SPILL_ENABLE) — RELOAD this tile's color+Z from the staging
// engine (filled from LPDDR) into the tile RAMs BEFORE rendering. Wait for
// the staging fill (tile_reload_ready_i), then sweep 0..255 presenting
// tile_reload_raddr_o; the returned color/Z (1-cyc later) is written via
// reload_wr_we (the comb arm above). ALL 256 written before TP_RENDER.
TP_RELOAD: begin
if (reload_wait) begin
// staging fill in flight — hold until warm, no tile-RAM writes yet.
if (tile_reload_ready_i) begin
reload_wait <= 1'b0;
tile_sweep_r <= 9'd0;
end
end else if (tile_sweep_r < 9'd256) begin
// present raddr=sweep this cycle; write idx=sweep NEXT cycle when its
// staging data is valid (reload_wr_we/addr registered 1-cyc behind).
reload_wr_we <= 1'b1;
reload_wr_addr <= tile_sweep_r[7:0];
tile_sweep_r <= tile_sweep_r + 9'd1;
end else begin
// sweep done; the final write (idx 255) drains THIS cycle via the
// reload_wr_we set last cycle. All 256 now resident → RENDER.
tile_phase_r <= TP_RENDER;
raster_state <= R_SCAN;
comb_state_r <= CB_Z;
end
end
// Ch323 (TILE_SPILL_ENABLE) — emit this tile's Z RAM to the LPDDR
// Z-backing, then run the SAME tile-advance as TP_FLUSH. Separate phase
// from color flush (Codex): color flush completes first, then Z flush;
// distinct counters/errors. Never entered at the default.
TP_ZFLUSH: begin
if (tile_sweep_r < 9'd256) begin
z_flush_emit_q <= 1'b1;
z_flush_idx_q <= tile_sweep_r[7:0];
z_flush_ox_q <= tile_ox;
z_flush_oy_q <= tile_oy;
tile_sweep_r <= tile_sweep_r + 9'd1;
end else begin
// 256 Z reads presented; the final Z emit (index 255) drains
// THIS clock. The tile's color+Z are now spilled to LPDDR, so its
// backing is VALID — later passes of this tile may reload it.
spill_valid[tile_id_w] <= 1'b1;
// Now advance exactly like TP_FLUSH does.
if (tile_is_last) begin
if (TILE_MULTIPRIM) begin
fifo_rptr <= fifo_wptr;
fifo_count <= '0;
for (int di = 0; di < FIFO_DEPTH; di = di + 1)
fifo_grad_pending[di] <= 1'b0;
tile_phase_r <= TP_OFF;
raster_active <= 1'b0;
raster_state <= R_IDLE;
ras_mode <= RM_NONE;
end else begin
tile_phase_r <= TP_OFF;
end
end else begin
tile_phase_r <= TP_CLEAR;
tile_sweep_r <= 9'd0;
if (tile_col_r == 12'(TILE_COLS - 1)) begin
tile_col_r <= 12'd0;
tile_row_r <= tile_row_r + 12'd1;
end else begin
tile_col_r <= tile_col_r + 12'd1;
end
end
end
end
default: ; // TP_OFF — idle.
endcase
end
// Ch304 — TILE-LOCAL flush emit. One clock after a FLUSH raddr
// was presented, the registered tile color rdata is valid; emit
// it as a PSMCT32 framebuffer write at the linear address for
// (x,y)=(flush_idx[3:0], flush_idx[7:4]). During CLEAR/RENDER no
// flush_emit_q is set, and the combined FSM suppresses its own
// emit in tile mode, so there is never a same-clock double-drive
// of raster_pixel_*. tile_active const 0 at TILE_LOCAL=0 (dead).
if (tile_active && flush_emit_q) begin
raster_pixel_emit <= 1'b1;
raster_pixel_emit_count <= raster_pixel_emit_count + 32'd1;
raster_pixel_x_q <= {8'd0, flush_idx_q[3:0]};
raster_pixel_y_q <= {8'd0, flush_idx_q[7:4]};
if (TILE_COLOR_PSMCT16) begin
// Ch305+ PSMCT16 flush. tile_color_rdata[15:0] is already
// the packed pix16; emit it exactly as the S2 PSMCT16
// raster path does (s2_psm_is_16bit arm ~line 3973):
// * pix16 in the LOW halfword of the color word,
// * be = 4'b0011, mask = 0xFFFF_FFFF, psm = 0x02,
// * fb_addr = PSMCT16 byte address (2 bytes/pixel).
// vram_normalize_pkg::normalize_write then takes
// payload[15:0] and commits it to the low or high
// halfword per byte_addr[1] (flush_fb_addr16_w supplies
// that bit), so a PSMCT16 scanout reads back this pix16.
raster_pixel_color_q <= {48'd0, tile_color_rdata[15:0]};
raster_pixel_fb_addr_q <= flush_fb_addr16_w;
raster_pixel_be_q <= 4'b0011;
raster_pixel_mask_q <= 32'hFFFF_FFFF;
raster_pixel_psm_q <= 6'h02; // PSMCT16
end else begin
// Legacy PSMCT32 flush (byte-identical).
raster_pixel_color_q <= {32'd0, tile_color_rdata32};
raster_pixel_fb_addr_q <= flush_fb_addr_w;
raster_pixel_be_q <= 4'b1111;
raster_pixel_mask_q <= 32'hFFFF_FFFF;
raster_pixel_psm_q <= 6'h00; // PSMCT32
end
end
// Ch323 — TILE Z-FLUSH emit (TILE_SPILL_ENABLE). One clock after a TP_ZFLUSH Z
// raddr was presented, the registered tile_z_rdata is valid; emit it to the
// LPDDR Z-backing via the SEPARATE z_flush_* channel (the color flush path above
// is untouched). z_flush_emit_q is only ever set in TP_ZFLUSH, so at the default
// this is dead and z_flush_emit_o stays 0.
if (TILE_SPILL_ENABLE && tile_active && z_flush_emit_q) begin
z_flush_emit_o <= 1'b1;
z_flush_addr_o <= z_flush_addr_w; // Z-backing-relative byte offset (index*4)
z_flush_data_o <= tile_z_rdata; // 32-bit Z for this tile pixel
end
// Ch323 — DEDICATED color-flush spill emit: gated EXACTLY like the Z-flush above
// (tile_active && flush_emit_q = TP_FLUSH only), so the color spill writer sees one
// emit per tile pixel ON FLUSH and NOT the RENDER-phase raster_pixel_emit traffic.
// Same pixel_index*4 addressing as Z (no FBP). flush_emit_q is set only in TP_FLUSH.
if (TILE_SPILL_ENABLE && tile_active && flush_emit_q) begin
tile_color_flush_emit_o <= 1'b1;
tile_color_flush_addr_o <= flush_pixel_index_w << 2;
tile_color_flush_data_o <= tile_color_rdata32;
end
end
end
// ------------------------------------------------------------------
// Trace emission
// ------------------------------------------------------------------
always_ff @(posedge clk) begin
if (!rst_n) begin
ev_valid <= 1'b0;
ev_subsys <= SUBSYS_GS;
ev_event <= EV_MODE;
ev_arg0 <= 64'd0;
ev_arg1 <= 64'd0;
ev_arg2 <= 64'd0;
ev_arg3 <= 64'd0;
ev_flags <= 32'd0;
end else if (reg_wr_en) begin
ev_valid <= 1'b1;
ev_subsys <= SUBSYS_GS;
ev_flags <= 32'd0;
// Privileged CRTC/MMIO writes. Only BGCOLOR is decoded
// today; the rest fall to EV_MODE.
if (reg_wr_addr == BGCOLOR_OFFSET) begin
ev_event <= EV_BGCOLOR;
ev_arg0 <= {56'd0, reg_wr_data[23:16]};
ev_arg1 <= {56'd0, reg_wr_data[15:8]};
ev_arg2 <= {56'd0, reg_wr_data[7:0]};
ev_arg3 <= 64'd0;
end else begin
ev_event <= EV_MODE;
ev_arg0 <= {48'd0, reg_wr_addr};
ev_arg1 <= reg_wr_data;
ev_arg2 <= 64'd0;
ev_arg3 <= 64'd0;
end
end else if (gif_reg_wr_en) begin
// Ch75/Ch98: GIF-A+D writes route through the new port.
// arg0 = reg#, arg1 = data, arg2 = stable per-reg
// selector (1=PRIM, 2=RGBAQ, 3=XYZF2, 4=XYZ2,
// 5=FRAME_1, 6=ZBUF_1, 7=TEX0_1; 0 for unknown).
// Known regs emit EV_WRITE, unknown ones EV_MODE.
//
// Ch76: an XYZ2/XYZF2 commit that closes a primitive
// preempts EV_WRITE in this same cycle with EV_PRIM_DRAW
// — arg0=prim_type, arg1=threshold, arg2=cumulative
// count (post-increment), arg3=closing vertex data. The
// xyz2_q/xyzf2_q latch still updates this cycle so the
// raw vertex value remains observable; the trace just
// upgrades to the more specific event.
ev_valid <= 1'b1;
ev_subsys <= SUBSYS_GS;
ev_flags <= 32'd0;
if (prim_complete_now) begin
ev_event <= EV_PRIM_DRAW;
ev_arg0 <= {61'd0, prim_type};
ev_arg1 <= {61'd0, vert_threshold};
ev_arg2 <= {32'd0, prim_complete_count + 32'd1};
ev_arg3 <= gif_reg_data;
end else begin
ev_event <= gif_is_known_reg ? EV_WRITE : EV_MODE;
ev_arg0 <= {56'd0, gif_reg_num};
ev_arg1 <= gif_reg_data;
ev_arg2 <= (gif_reg_num == GIF_REG_PRIM) ? 64'd1
: (gif_reg_num == GIF_REG_RGBAQ) ? 64'd2
: (gif_reg_num == GIF_REG_XYZF2) ? 64'd3
: (gif_reg_num == GIF_REG_XYZ2) ? 64'd4
: (gif_reg_num == GIF_REG_FRAME_1) ? 64'd5
: (gif_reg_num == GIF_REG_ZBUF_1) ? 64'd6
: (gif_reg_num == GIF_REG_TEX0_1) ? 64'd7
: (gif_reg_num == GIF_REG_ALPHA_1) ? 64'd8
: (gif_reg_num == GIF_REG_TEST_1) ? 64'd9
: 64'd0;
ev_arg3 <= 64'd0;
end
end else begin
ev_valid <= 1'b0;
end
end
endmodule : gs_stub