GIF/GS Contract
Status: Draft
Purpose
Define the graphics ingress and rendering/display boundary.
Owns
- GIF path intake and arbitration,
- GIF tag interpretation,
- GS register decode,
- GS VRAM-visible operations,
- framebuffer/zbuffer/texture-visible state handling,
- PCRTC/display output generation or a planned approximation layer.
Inputs
- DMAC channel 2 traffic,
- VIF/VU-generated graphics traffic,
- privileged GS register writes,
- reset and display configuration controls.
Outputs
- VRAM updates,
- display timing and pixel output,
- status/interrupt signals,
- packet and register trace events.
Questions to lock
- What is the first output milestone:
- GS privileged register acceptance only
- static background color
- minimal primitive draw
- gsKit-style demo target
- Is Phase 1 display based on a faithful GS/PCRTC path or a temporary adapter?
- What VRAM organization assumptions must stay stable from the beginning?
Allowed early stubs
- privileged-register-only GS stub,
- BGCOLOR/test-pattern display path,
- packet logger with no rendering.
Required debug visibility
- GIF tags,
- PATH source and arbitration result,
- GS register writes,
- VRAM write summaries,
- display mode transitions.
First meaningful milestone
- a known packet stream or direct privileged-register sequence produces a stable, visible, repeatable output and matching trace.
GS write-port contract (Ch75)
The GS model has two architecturally distinct write ports because real PS2 hardware exposes two unrelated register namespaces. Conflating them was a Ch74 mistake; Ch75 split them.
reg_wr_* — privileged GS/MMIO writes
- Source: CPU MMIO writes to the 0x12000000 privileged-register block, e.g. via platform_video_stub or a direct test-harness path.
- Address: reg_wr_addr[15:0] is the offset inside the privileged block.
- Examples: BGCOLOR at offset 0x00E0, PMODE at 0x0000, SMODE2 at 0x0020, etc.
- Currently latched: BGCOLOR only. Other offsets emit EV_MODE.
gif_reg_* — GIF A+D register-number writes
- Source: gif_packed_stub consuming a PACKED A+D entry when run with REAL_AD_REG_MAP=1 (the new default-on path for real PS2 packets; the parameter still defaults to 0 for back-compat with the project-local Ch72/Ch73 PACKED-A+D layout).
- Address: gif_reg_num[7:0] is the GIF A+D register number straight out of the PACKED entry's in_data[71:64]. Source-of-truth is PCSX2 pcsx2/GS/GSRegs.h.
- Currently decoded: PRIM=0x00, RGBAQ=0x01, XYZF2=0x04, XYZ2=0x05, FRAME_1=0x4C, ZBUF_1=0x4E (not 0x4F, which is ZBUF_2). Each has a dedicated 64-bit latch output. Other reg numbers emit EV_MODE.
Event taxonomy
The two write paths emit different events. Read this carefully — arg2
semantics differ across emitters.
- EV_BGCOLOR: emitted only by gs_stub on the privileged port when reg_wr_addr == 0x00E0. Carries the unpacked R/G/B in arg0/arg1/arg2. The privileged port has no per-register "selector" beyond this dedicated event; everything else on that port goes to EV_MODE with arg0=offset, arg1=data.
- EV_WRITE: emitted in two places with different arg2 semantics:
  - gif_packed_stub on a PACKED A+D accept (REGS nibble = 0xE). Carries the raw PACKED address bits in arg2 ({48'd0, in_data[79:64]}). Under REAL_AD_REG_MAP=1 the low 8 bits are the real GIF reg# (in_data[71:64]); under REAL_AD_REG_MAP=0 the low 16 bits are the project-local privileged-style offset. Not a stable selector: it is the address half of the wire.
  - gs_stub on the gif_reg_* port for a tracked GIF reg (PRIM/RGBAQ/XYZF2/XYZ2/FRAME_1/ZBUF_1). Carries a stable per-register selector in arg2: 1=PRIM, 2=RGBAQ, 3=XYZF2, 4=XYZ2, 5=FRAME_1, 6=ZBUF_1, 7=TEX0_1 (Ch98). arg0=reg#, arg1=data. Use this selector for trace-side filtering; it does not depend on REAL_AD_REG_MAP.
  - Ch76 caveat: a tracked vertex commit (XYZ2 or XYZF2) on the gif_reg_* port that closes a primitive does NOT emit EV_WRITE that cycle; EV_PRIM_DRAW preempts it (see below). The xyz2_q / xyzf2_q latch still updates. Trace consumers counting "vertices seen" must sum EV_WRITE (selector=3 or 4) + EV_PRIM_DRAW to get the true total.
- EV_PRIM_DRAW: Ch76 / Ch77. Fired by gs_stub once per primitive completion: when an XYZ2 or XYZF2 vertex commit on the gif_reg_* port closes a primitive under the current PRIM[2:0]. Preempts the EV_WRITE that the closing vertex would otherwise have emitted. Args: arg0=PRIM[2:0] (prim type), arg1=primary threshold, arg2=cumulative prim_complete_count post-increment, arg3=closing vertex data (the same 64 bits that latched into xyz2_q / xyzf2_q on this cycle).
  - Discrete primitives (POINT=1, LINE=2, TRIANGLE=3, SPRITE=2): one draw per N vertices; the vertex counter resets to 0 after each draw.
  - Strip / fan primitives (LINE_STRIP=2, TRI_STRIP=3, TRI_FAN=3): Ch77. Anchor on the first N vertices, then fire one draw per additional vertex commit. The vertex counter saturates at the primary threshold so every subsequent vertex closes another primitive. Ch78 adds vertex-identity tracking distinguishing TRI_STRIP rolling triangles {v_n-2, v_n-1, v_n} from TRI_FAN pivot triangles {v_pivot, v_n-1, v_n}; see the next section.
  - Reserved (PRIM=7): no draw, vertex commits do not increment the counter, latches still update.
  - A PRIM write always resets the vertex counter so a fresh primitive type starts cleanly.
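
The counter rules above can be sketched as a small Python reference model. This is only an illustration of the described behavior, not the gs_stub RTL: the class name `PrimCounter`, the `THRESHOLD` table, and the returned event strings are hypothetical, and the assumption that a vertex commit under reserved PRIM=7 still produces an ordinary EV_WRITE is inferred from "latches still update".

```python
# Hypothetical reference model of the vertex counter; not taken from the RTL.
THRESHOLD = {0: 1, 1: 2, 2: 2, 3: 3, 4: 3, 5: 3, 6: 2}  # PRIM[2:0] -> primary threshold
STRIP_FAN = {2, 4, 5}  # LINE_STRIP, TRI_STRIP, TRI_FAN


class PrimCounter:
    def __init__(self):
        self.prim = 0
        self.count = 0
        self.prim_complete_count = 0

    def write_prim(self, prim):
        """A PRIM write selects the type and always resets the vertex counter."""
        self.prim = prim & 0x7
        self.count = 0

    def commit_vertex(self):
        """Return the event a tracked XYZ2 / XYZF2 commit would produce."""
        n = THRESHOLD.get(self.prim)
        if n is None:  # PRIM=7 reserved: no draw, counter untouched (assumed EV_WRITE)
            return "EV_WRITE"
        self.count = min(self.count + 1, n)  # saturates at the threshold
        if self.count < n:
            return "EV_WRITE"
        self.prim_complete_count += 1
        if self.prim not in STRIP_FAN:
            self.count = 0  # discrete primitives start over
        return "EV_PRIM_DRAW"
```

In this model a five-vertex TRI_STRIP yields two EV_WRITE events followed by three EV_PRIM_DRAW events, so the vertex total is the sum of both event kinds, as the Ch76 caveat requires.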
Per-primitive vertex snapshot (Ch78)
Alongside EV_PRIM_DRAW, gs_stub exposes three 64-bit outputs —
prim_v0_q, prim_v1_q, prim_v2_q — that hold the vertex tuple
of the most recently closed primitive. Snapshot is registered on the
same clock edge as the ev_valid pulse and held until the next
prim_complete, so a TB can sample it at the same time it sees
gs_ev_event == EV_PRIM_DRAW.
The number of valid slots is implicit in PRIM[2:0]:
| PRIM | type | valid slots | semantics |
|---|---|---|---|
| 0 | POINT | v0 | the single vertex |
| 1 | LINE | v0, v1 | endpoints |
| 2 | LINE_STRIP | v0, v1 | each segment uses {v_n-1, v_n} |
| 3 | TRIANGLE | v0, v1, v2 | the three vertices |
| 4 | TRI_STRIP | v0, v1, v2 | rolling: {v_n-2, v_n-1, v_n} |
| 5 | TRI_FAN | v0, v1, v2 | pivot+rolling: {v_pivot, v_n-1, v_n} |
| 6 | SPRITE | v0, v1 | top-left + bottom-right |
| 7 | reserved | none | observer never closes |
The TRI_STRIP-vs-TRI_FAN distinction lives entirely in the
saturated-extension path: a TRI_STRIP advances v0 each draw with
the rolling window; a TRI_FAN pins v0 to v_pivot (the first
vertex committed since the most recent PRIM write). On the anchor
draw, v_pivot and the rolling v_prev happen to coincide, so
TRI_STRIP and TRI_FAN report the same tuple for their first
triangle.
A PRIM write clears the rolling window (v_curr / v_prev /
v_prev_prev / v_pivot / pivot_seen) so a fresh primitive
context starts with no residual vertex bleed. Slots not used by the
current primitive type read 0.
The snapshot tracks identity, not geometry — the values written are
the raw 64-bit gif_reg_data payloads of XYZ2 / XYZF2 commits, with
no decoding into screen-space coordinates. Rasterization is still
out of scope.
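
As an illustration of the strip-versus-fan distinction, the rolling window can be modeled in a few lines of Python. This is a sketch under assumptions: the class `VertexWindow` and its method names are hypothetical, and the slot selection simply follows the table above rather than the actual RTL.

```python
class VertexWindow:
    """Hypothetical model of v_curr / v_prev / v_prev_prev / v_pivot."""

    def __init__(self):
        self.clear()

    def clear(self):
        """Called on a PRIM write: no residual vertex bleed."""
        self.curr = self.prev = self.prev_prev = self.pivot = 0
        self.pivot_seen = False

    def commit(self, data):
        """Shift one raw 64-bit XYZ2 / XYZF2 payload into the window."""
        self.prev_prev, self.prev, self.curr = self.prev, self.curr, data
        if not self.pivot_seen:  # first vertex since the PRIM write
            self.pivot, self.pivot_seen = data, True

    def snapshot(self, prim):
        """(v0, v1, v2) for a primitive of type `prim` closing on the latest commit."""
        if prim == 0:                 # POINT
            return (self.curr, 0, 0)
        if prim in (1, 2, 6):         # LINE, LINE_STRIP, SPRITE
            return (self.prev, self.curr, 0)
        if prim in (3, 4):            # TRIANGLE, TRI_STRIP
            return (self.prev_prev, self.prev, self.curr)
        if prim == 5:                 # TRI_FAN pins v0 to the pivot
            return (self.pivot, self.prev, self.curr)
        return (0, 0, 0)              # reserved
```

Feeding vertices A, B, C, D gives (A, B, C) for the anchor triangle under both types, then (B, C, D) for TRI_STRIP and (A, C, D) for TRI_FAN.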
Per-primitive color snapshot (Ch79 / Ch80)
prim_color_q[63:0] is registered on the same edge as
prim_v0_q / prim_v1_q / prim_v2_q and carries the value of
rgbaq_q at the moment the primitive closed. RGBAQ writes are
separate A+D entries from XYZ2 / XYZF2 commits (gif_packed_stub
serializes A+D to one accept per cycle), so rgbaq_q is always
settled to its draw-time value when prim_complete_now fires.
prim_color_q reads 0 if no RGBAQ has been written since reset;
rgbaq_q itself is not cleared on a PRIM write — color carries
forward across PRIM context switches, matching real GS behavior —
but it does reset to 0 on rst_n.
Per-vertex Gouraud color (Ch80)
For real game streams that interleave RGBAQ writes with vertex
commits to drive Gouraud shading, gs_stub exposes three
additional outputs:
| Output | Slot semantics |
|---|---|
| prim_color_v0_q[63:0] | color of vertex 0 |
| prim_color_v1_q[63:0] | color of vertex 1 |
| prim_color_v2_q[63:0] | color of vertex 2 |
A parallel rolling color window (c_curr_q / c_prev_q /
c_prev_prev_q / c_pivot_q, internal) samples rgbaq_q on
every vertex commit, mirroring the Ch78 vertex-identity window.
The snapshot layout matches the vertex layout exactly:
| PRIM | type | _v0_q color of | _v1_q color of | _v2_q color of |
|---|---|---|---|---|
| 0 | POINT | the single vertex | 0 | 0 |
| 1 | LINE | first endpoint | closing | 0 |
| 2 | LINE_STRIP | previous vertex | closing | 0 |
| 3 | TRIANGLE | v_n-2 | v_n-1 | closing |
| 4 | TRI_STRIP | v_n-2 (rolls) | v_n-1 | closing |
| 5 | TRI_FAN, anchor | v1 (≡ pivot) | v2 | v3 |
| 5 | TRI_FAN, saturated | v_pivot (PINNED) | v_n-1 | closing |
| 6 | SPRITE | first endpoint | closing | 0 |
prim_color_q is exactly the closing-vertex color (≡
prim_color_v_close), kept as a convenience alias for consumers
that don't care about Gouraud.
For flat-shaded primitives (RGBAQ written once before the
strip), all per-vertex color slots used by the primitive equal
each other and equal prim_color_q. For Gouraud-shaded
primitives (RGBAQ rewritten between vertex commits), the slots
may differ — capturing the per-vertex color identity needed to
distinguish a strip's rolling colors from a fan's pivot color.
The color window is cleared on PRIM write (unlike rgbaq_q
itself, which carries forward). This means per-vertex color
identity stays tied to the current primitive context — a stream
that switches PRIM types mid-context starts color tracking fresh
for the new context. Slots not used by the current primitive type
read 0.
Like the vertex snapshot, this captures identity, not interpolated geometry — the stored values are the raw 64-bit RGBAQ payloads (packing R, G, B, A, and the texture-coord divisor Q together); GS-style Gouraud interpolation across the primitive interior remains out of scope.
Structured-field decode (Ch81)
gs_stub exposes pre-decoded snapshot outputs alongside the raw
64-bit slots so a downstream rasterizer or pixel-emit path doesn't
have to re-derive bit fields:
| Output | Type | Carries |
|---|---|---|
| prim_v0_decoded_q / _v1_ / _v2_ | trace_pkg::vertex_t | x / y / z / fog / is_xyzf2 per slot |
| prim_v0_color_decoded_q / _v1_ / _v2_ | trace_pkg::color_t | r / g / b / a / q per slot |
The decoded outputs latch on the same edge as the raw snapshots, so
a TB samples both atomically with EV_PRIM_DRAW.
vertex_t and the XYZ2 / XYZF2 distinction
typedef struct packed {
logic is_xyzf2; // 1 = XYZF2 source, 0 = XYZ2
logic [7:0] fog; // valid iff is_xyzf2; else 0
logic [31:0] z; // 32-bit (XYZ2) or zero-extended 24-bit (XYZF2)
logic [15:0] y; // 12.4 fixed-point screen Y
logic [15:0] x; // 12.4 fixed-point screen X
} vertex_t;
XYZ2 packs full 32-bit Z in data[63:32]. XYZF2 packs 24-bit Z in
data[55:32] and an 8-bit fog byte in data[63:56]. The is_xyzf2
flag is registered in a parallel rolling format-flag window
(xyzf2_curr_q / xyzf2_prev_q / xyzf2_prev_prev_q /
xyzf2_pivot_q) that tracks the source format of each vertex
through the rolling window — so when an XYZF2 vertex rolls into
the v_prev slot of a TRI_STRIP saturated extension, its
is_xyzf2 flag rolls with it.
Cleared on rst_n and on PRIM write, same as the vertex/color
windows.
color_t
typedef struct packed {
logic [31:0] q; // texture-coord divisor (IEEE float)
logic [7:0] a;
logic [7:0] b;
logic [7:0] g;
logic [7:0] r;
} color_t;
Direct bit-slice of the RGBAQ payload — no interpretation. Q is carried verbatim as a 32-bit IEEE float (the GS uses it for texture coordinate division during rasterization, which remains out of scope).
Decode helper functions
trace_pkg exposes decode_vertex(data, is_xyzf2) and
decode_color(data) so downstream code can re-decode raw 64-bit
values consistently with the gs_stub snapshot.
The decoded outputs are an additive contract — the raw prim_v*_q
and prim_color_v*_q outputs continue to work for consumers that
don't care about per-channel decoding.
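
For trace-side tooling, the same bit slicing can be mirrored outside the simulator. The Python sketch below is a hedged equivalent of decode_vertex / decode_color, not the trace_pkg source: the `Vertex` and `Color` tuples are hypothetical, the Z and fog positions follow the text above, and the X in data[15:0] / Y in data[31:16] placement is an assumption based on the standard GS XYZ2 / XYZF2 register layout.

```python
from typing import NamedTuple


class Vertex(NamedTuple):
    is_xyzf2: bool
    fog: int
    z: int
    y: int
    x: int


class Color(NamedTuple):
    q: int
    a: int
    b: int
    g: int
    r: int


def decode_vertex(data: int, is_xyzf2: bool) -> Vertex:
    """Slice a raw 64-bit XYZ2 / XYZF2 payload into fields."""
    x = data & 0xFFFF                 # assumed: X in data[15:0], 12.4 fixed-point
    y = (data >> 16) & 0xFFFF         # assumed: Y in data[31:16], 12.4 fixed-point
    if is_xyzf2:
        z = (data >> 32) & 0xFFFFFF   # 24-bit Z in data[55:32], zero-extended
        fog = (data >> 56) & 0xFF     # fog byte in data[63:56]
    else:
        z = (data >> 32) & 0xFFFFFFFF  # full 32-bit Z in data[63:32]
        fog = 0
    return Vertex(is_xyzf2, fog, z, y, x)


def decode_color(data: int) -> Color:
    """Direct bit-slice of RGBAQ; Q is kept as raw IEEE-float bits."""
    return Color(
        q=(data >> 32) & 0xFFFFFFFF,
        a=(data >> 24) & 0xFF,
        b=(data >> 16) & 0xFF,
        g=(data >> 8) & 0xFF,
        r=data & 0xFF,
    )
```

The same raw payload decodes differently depending on the format flag, which is why is_xyzf2 has to travel with each vertex through the rolling window.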
Minimal pixel emit (Ch82)
gs_stub exposes a per-primitive pixel emit — the smallest
possible output that ties the recognition layer to a framebuffer
destination. One pixel per closed primitive (the closing vertex,
in screen-space integer coords), addressed by the latched
frame_1_q register. No interpolation, no coverage, no
rasterization — this is the contact point for a future raster
chapter, not a substitute for one.
| Output | Width | Carries |
|---|---|---|
| pixel_emit | 1 | 1-cycle strobe; pulses on the same edge as prim_complete |
| pixel_emit_count | 32 | Running tally of emits since reset |
| pixel_x_q / pixel_y_q | 12 | Closing vertex integer screen coords (top 12 bits of 12.4 fixed-point) |
| pixel_color_q | 64 | RGBAQ at the emit moment (= prim_color_q) |
| pixel_fbp_q | 9 | FRAME_1[8:0] (framebuffer base / 2048) |
| pixel_fbw_q | 6 | FRAME_1[21:16] (framebuffer width / 64 in pixels) |
| pixel_psm_q | 6 | FRAME_1[29:24] (pixel storage format) |
| pixel_fb_addr_q | 32 | Computed VRAM byte offset (see below) |
Address arithmetic
fb_addr = FBP * 2048 + (Y * FBW * 64 + X) * bytes_per_pixel
Ch83 added PSM-aware bytes_per_pixel derived from the latched
FRAME_1[29:24] (PSM field):
| PSM (hex) | Format | bytes/pixel | Notes |
|---|---|---|---|
| 00, 01 | PSMCT32 / PSMCT24 | 4 | host-word |
| 02, 0A | PSMCT16 / PSMCT16S | 2 | |
| 13 | PSMT8 | 1 | indexed |
| 14 | PSMT4 | 4 here (host-word) | legacy pixel_emit channel only — see note below |
| 1B, 24, 2C | PSMT8H / PSMT4HL / PSMT4HH | 4 | host-word (high/low nibble of 32-bit slot) |
| 30, 31 | PSMZ32 / PSMZ24 | 4 | depth |
| 32, 3A | PSMZ16 / PSMZ16S | 2 | depth |
| other | — | 4 (host-word fallback) | unrecognized PSM |
This table describes the legacy pixel_emit channel (the
single-pixel-per-primitive debug strobe from Ch82/Ch83). That
channel does not commit to vram_stub; it only emits a trace
event. Its PSMT4 entry stays at host-word fallback — the
recognition layer never tracked sub-byte position there.
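
A short Python sketch of the legacy-channel address arithmetic may help when cross-checking traces. It only restates the formula and table above; the function name `legacy_fb_addr` and the dictionary `BYTES_PER_PIXEL` are hypothetical, and wrapping the result to 32 bits is an assumption based on the width of pixel_fb_addr_q.

```python
# PSM -> bytes/pixel for the legacy pixel_emit channel (from the table above).
BYTES_PER_PIXEL = {
    0x00: 4, 0x01: 4,            # PSMCT32 / PSMCT24
    0x02: 2, 0x0A: 2,            # PSMCT16 / PSMCT16S
    0x13: 1,                     # PSMT8
    0x14: 4,                     # PSMT4 (host-word on this channel)
    0x1B: 4, 0x24: 4, 0x2C: 4,   # PSMT8H / PSMT4HL / PSMT4HH
    0x30: 4, 0x31: 4,            # PSMZ32 / PSMZ24
    0x32: 2, 0x3A: 2,            # PSMZ16 / PSMZ16S
}


def legacy_fb_addr(frame_1: int, x_fixed: int, y_fixed: int) -> int:
    """fb_addr = FBP * 2048 + (Y * FBW * 64 + X) * bytes_per_pixel."""
    fbp = frame_1 & 0x1FF             # FRAME_1[8:0]
    fbw = (frame_1 >> 16) & 0x3F      # FRAME_1[21:16]
    psm = (frame_1 >> 24) & 0x3F      # FRAME_1[29:24]
    x = (x_fixed >> 4) & 0xFFF        # top 12 bits of 12.4 fixed-point
    y = (y_fixed >> 4) & 0xFFF
    bpp = BYTES_PER_PIXEL.get(psm, 4)  # unrecognized PSM: host-word fallback
    return (fbp * 2048 + (y * fbw * 64 + x) * bpp) & 0xFFFFFFFF
```

For example, FBP=2, FBW=1, PSMCT32 and an integer pixel at (3, 1) gives 4096 + (64 + 3) * 4 = 4364.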
The raster channel (raster_pixel_emit) does NOT use this
table. It owns its own PSM-aware emit packing in S2 with full
PSMT4 support after Ch106:
- Byte address = pixel_index >> 1 (overrides the pixel_index << ras_bpp_shift form).
- The 4-bit index from R[3:0] is placed in the targeted nibble (low/high keyed by pixel_index[0]) of write_data[7:0].
- raster_pixel_be_q = 4'b0001, raster_pixel_mask_q = 0x0F or 0xF0 so vram_stub's per-bit merge updates only that nibble.
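
The PSMT4 packing steps above can be sketched in Python as follows. The helper names `psmt4_emit` and `merge_byte` are hypothetical, and the merge expression (masked bits taken from the write data, all other bits preserved) is an assumption about how a per-bit merge mask behaves, not a copy of vram_stub.

```python
def psmt4_emit(pixel_index: int, r: int):
    """Return (byte_addr, write_data, be, mask) for one PSMT4 pixel."""
    idx = r & 0xF                      # 4-bit index from R[3:0]
    high = pixel_index & 1             # pixel_index[0] selects the nibble
    byte_addr = pixel_index >> 1       # two pixels per byte
    write_data = (idx << 4) if high else idx
    mask = 0xF0 if high else 0x0F
    return byte_addr, write_data, 0b0001, mask


def merge_byte(old: int, write_data: int, mask: int) -> int:
    """Assumed per-bit merge: update only the masked bits of one byte."""
    return (old & ~mask & 0xFF) | (write_data & mask)
```

Emitting pixel 0 with R=0x3A and then pixel 1 with R=0x05 targets the same byte and leaves 0x5A in it, since each write touches only its own nibble.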
PSMT8H / PSMT4HL / PSMT4HH still address the host 32-bit slot, not the high/low byte/nibble within it; the extracted sub-byte is rasterizer/blit-specific and out of scope here.
pixel_psm_q is still exposed verbatim so consumers can apply
their own sub-slot offset arithmetic if needed.
Carry-forward semantics
frame_1_q is part of the standard GIF-context register file and
carries forward across PRIM writes (matching real GS). A stream
that sets FRAME_1 once and then emits multiple primitives
correctly addresses all of them. A stream that never writes
FRAME_1 lands every pixel at fb_addr=0 — observable but not
useful, behaves cleanly under reset.
rgbaq_q likewise carries forward, so pixel_color_q reflects
the most recent RGBAQ write at emit time. If a Gouraud-style
stream rewrites RGBAQ between vertices, pixel_color_q captures
the closing-vertex color — same semantic as Ch79's
prim_color_q.
Strobe channel, not trace event
pixel_emit is a dedicated 1-cycle strobe alongside the snapshot
outputs, not a multiplexed event on the main ev_valid trace
stream. This avoids contention with EV_PRIM_DRAW on the close
cycle. A consumer that wants both can sample on pixel_emit
posedge and read the snapshots atomically.
Minimal interior rasterizer (Ch84)
gs_stub adds a separate per-interior-pixel emit channel
alongside the per-primitive pixel_emit of Ch82. The Ch82
strobe is unchanged (still pulses once per closed primitive); the
new channel pulses once per pixel that the rasterizer determines
is inside the closed primitive's interior.
| Output | Width | Carries |
|---|---|---|
| raster_pixel_emit | 1 | 1-cycle strobe per emitted interior pixel |
| raster_pixel_emit_count | 32 | Cumulative interior pixels emitted since reset |
| raster_pixel_x_q / _y_q | 12 | Integer screen coords of the emitted pixel |
| raster_pixel_color_q | 64 | Per-pixel color: Gouraud-interpolated R/G/B/A for TRI/TRI_STRIP/TRI_FAN (Ch86), flat (= prim_color_q) for SPRITE. Q passes through from the closing vertex. |
| raster_pixel_fb_addr_q | 32 | Computed VRAM byte offset (PSM-aware, same math as Ch82/Ch83) |
| raster_active | 1 | High while the FSM is scanning a primitive |
| raster_overflow | 1 | Latches if a new primitive closes while the 2-entry raster FIFO is full and no concurrent pop frees a slot (Ch87 + audit-medium fix). See "Raster command queue (Ch87)" below for the back-to-back-close budget. |
| raster_degenerate | 1 | Latches if a TRI/STRIP/FAN closes with zero signed area (3 colinear vertices). SCAN is skipped; SPRITE never sets this. |
Per-primitive coverage
| PRIM | Raster behavior |
|---|---|
| 0 POINT | No raster emit — Ch82 closing-pixel only |
| 1 LINE | No raster emit — Ch82 closing-pixel only |
| 2 LINE_STRIP | No raster emit — Ch82 closing-pixel only |
| 3 TRIANGLE | Bounding-box scan with edge-function half-plane test |
| 4 TRI_STRIP | Same engine as TRIANGLE, fires per closed strip triangle |
| 5 TRI_FAN | Same engine as TRIANGLE, fires per closed fan triangle |
| 6 SPRITE | Bounding-box rectangle fill (every pixel inside) |
| 7 reserved | No raster emit |
Triangle edge-function math
For each candidate pixel p and each edge (vA, vB) of the
triangle:
e(p) = (p.x - vA.x) * (vB.y - vA.y) - (p.y - vA.y) * (vB.x - vA.x)
32-bit signed math is used to avoid overflow at typical coord ranges.
Top-left fill rule (Ch85)
Adjacent triangles that share an edge would double-paint pixels on that edge under a naïve same-sign test. Ch85 applies the standard D3D-style top-left fill rule so each shared-edge pixel is owned by exactly one of the two triangles.
At the IDLE→SCAN transition the FSM:
- Computes signed_area = (v1-v0) × (v2-v0).
- If signed_area == 0 → degenerate (3 colinear vertices); raster_degenerate latches and SCAN is skipped (no raster pixels emit). The Ch82 pixel_emit and prim_complete pulses still fire; only the interior raster is suppressed.
- If signed_area < 0 → CW winding; the FSM swaps v1 and v2 so the rule applies uniformly to a CCW-ordered triangle.
- For each edge of the post-swap CCW triangle, classifies it as top-or-left (inclusive) or right/bottom (exclusive):
  - Top edge: horizontal going right (dy == 0 && dx > 0).
  - Left edge: going down in Y-down screen (dy > 0).
  - Anything else is a right or bottom edge.
The inside test in SCAN becomes:
inside = (e[i] + bias[i] <= 0) for all i in {0, 1, 2}
where bias[i] = 0 if edge i is top-or-left and bias[i] = 1
otherwise. The +1 bias converts the strict < 0 test for
right/bottom edges into a non-strict <= 0 test on the biased
value, keeping the math integer and uniform.
Result: for any two adjacent triangles sharing an edge, the edge's pixels are inclusive in exactly one triangle's bias configuration and exclusive in the other's — no double-paint.
Some shared-corner pixels may end up unpainted by either triangle. That's the standard top-left rule trade-off: non-overlap takes priority over coverage of every boundary pixel.
Per-pixel Gouraud color (Ch86)
Triangle interior pixels now use per-pixel Gouraud color
interpolation instead of flat shading. The three per-vertex
colors (the same Ch80 prim_color_v0_q / prim_color_v1_q /
prim_color_v2_q slot mapping) are latched at SCAN init with
the same v1↔v2 swap mirror as the vertex coords, so the
post-swap CCW vertex order matches the latched color order.
For each interior pixel p, barycentric weights are derived
directly from the unbiased edge functions:
L0(p) = -e1(p) // weight for v0 = signed area of (p, v1, v2)
L1(p) = -e2(p) // weight for v1
L2(p) = -e0(p) // weight for v2
— L0 + L1 + L2 == sa for all p inside the triangle
For each color channel ch ∈ {R, G, B, A}:
ch_out(p) = (L0(p)*c0.ch + L1(p)*c1.ch + L2(p)*c2.ch) / sa
Q (the texture-coord IEEE float in c2's upper 32 bits) is not interpolated — it passes through from the closing vertex's RGBAQ unchanged.
For a flat-shaded primitive (RGBAQ written once before all three
vertices, so all three vertex colors are equal), the numerator is
c0.ch * (L0 + L1 + L2) = c0.ch * sa, so the formula collapses to
c0 exactly with no rounding error. Existing flat-shaded raster TBs
(raster_basic, raster_topleft) continue to pass.
The R/G/B/A division uses integer truncation toward zero. Real PS2 GS uses fixed-point with specific rounding rules; the recognition-layer stub is intentionally simpler. SPRITE keeps flat shading (only 2 vertices, no barycentric weights defined).
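
A minimal Python sketch of the interpolation, assuming the weight mapping given above and computing sa as the sum of the three weights. The names `gouraud` and `trunc_div` are hypothetical; `trunc_div` exists only because Python's `//` floors, while the text specifies truncation toward zero.

```python
def edge(p, a, b):
    """Unbiased edge function for edge (a, b)."""
    return (p[0] - a[0]) * (b[1] - a[1]) - (p[1] - a[1]) * (b[0] - a[0])


def trunc_div(n, d):
    """Integer division truncating toward zero."""
    q = abs(n) // abs(d)
    return q if (n >= 0) == (d > 0) else -q


def gouraud(p, v, c):
    """v: three post-swap (x, y) vertices; c: three (r, g, b, a) tuples."""
    e0 = edge(p, v[0], v[1])
    e1 = edge(p, v[1], v[2])
    e2 = edge(p, v[2], v[0])
    l0, l1, l2 = -e1, -e2, -e0         # weights for v0, v1, v2
    sa = l0 + l1 + l2                  # nonzero for a non-degenerate triangle
    return tuple(
        trunc_div(l0 * c[0][k] + l1 * c[1][k] + l2 * c[2][k], sa)
        for k in range(4)
    )
```

With vertices (0,0), (4,0), (0,4) colored pure red, green, and blue, the pixel (1,1) has weights 8, 4, 4 out of 16 and truncates to (127, 63, 63, 255).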
Sprite rectangle fill
A SPRITE has two vertices forming opposite corners. The bounding
box is computed via min/max of each axis; every pixel inside
the box is emitted in row-major order.
FSM and scan timing
The FSM is IDLE → SCAN. On prim_complete_now for an eligible
primitive, the FSM latches the vertex tuple, color, FRAME_1
fields, and bounding box, then walks the box one pixel per cycle.
For each pixel: combinational inside-test → if inside, pulse
raster_pixel_emit and update the snapshot. Returns to IDLE
when (ras_cur_x, ras_cur_y) == (x_max, y_max).
Color is Gouraud-interpolated per pixel for triangles
(Ch86) and flat for sprites; see the Ch85 and Ch86 subsections
above for the fill-rule and Gouraud math. The closing-primitive
flat color (prim_color_q) is still used as the SPRITE fill
color and as a backward-compat reference for flat-shaded TRIs
(when all three vertex colors are equal, the Gouraud formula
reduces to that flat value with no rounding error).
Coordinates are integer — the 4-bit sub-pixel of 12.4 fixed-point is discarded. Sub-pixel edge adjustment is not modeled (top-left fill rule IS modeled — see Ch85 subsection above).
Raster command queue (Ch87) and raster_overflow
gs_stub has a 2-entry FIFO in front of the SCAN FSM. Every
primitive close that targets the rasterizer (RM_TRI /
RM_SPRITE) snapshots its full per-prim context (vertices,
bias, signed area, per-vertex colors, FRAME_1 fields, bounding
box) into the queue at the close cycle. The FSM dequeues the
oldest entry whenever it's idle or finishing a scan. Effective
concurrency is 1 in-flight + 2 queued = up to 3 back-to-back
closes absorbed without drop.
raster_overflow now latches when a 4th close arrives while the
FIFO is full (1 in-flight, both FIFO slots occupied). The
4th primitive is dropped. Earlier chapters' bound of "1 close
mid-scan = overflow" is replaced by Ch87's "3 closes
back-to-back = OK; 4th = overflow."
Degenerate triangles are filtered at enqueue: they set
raster_degenerate and are not pushed into the queue. SPRITE
never sets raster_degenerate. POINT/LINE/LINE_STRIP don't
raster (RM_NONE) — they don't enqueue at all and the queue
ignores them.
Pop happens at IDLE→SCAN AND at drain-done (Ch88; see below)
when the queue has more work, so back-to-back scans run
contiguously without an IDLE bubble. raster_active stays
high across the boundary.
Real PS2 game streams emit thousands of primitives back-to-back; 3-deep concurrency is enough for most TRI_STRIP / TRI_FAN patterns with small bounding boxes. Larger sprites or larger triangles increase scan length and reduce headroom — a future chapter can grow the FIFO depth.
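
The back-to-back-close budget can be illustrated with a coarse Python model. It is a deliberately simplified sketch: the class `RasterQueue` and its methods are hypothetical, cycle timing is not modeled, and the concurrent pop-and-push corner case from the audit fix is ignored.

```python
from collections import deque


class RasterQueue:
    """Hypothetical model: 1 in-flight scan + a 2-entry FIFO."""

    DEPTH = 2

    def __init__(self):
        self.fifo = deque()
        self.in_flight = None
        self.overflow = False
        self.degenerate = False

    def close(self, kind, signed_area=1):
        """A primitive close; kind is 'TRI' or 'SPRITE' for raster targets."""
        if kind not in ("TRI", "SPRITE"):
            return                      # RM_NONE: never enqueued
        if kind == "TRI" and signed_area == 0:
            self.degenerate = True      # filtered at enqueue
            return
        if len(self.fifo) == self.DEPTH:
            self.overflow = True        # this primitive is dropped
            return
        self.fifo.append(kind)

    def pop_if_idle(self):
        """The FSM takes the oldest entry when it has nothing in flight."""
        if self.in_flight is None and self.fifo:
            self.in_flight = self.fifo.popleft()

    def scan_done(self):
        """End of a scan: continue with the next entry or go idle."""
        self.in_flight = self.fifo.popleft() if self.fifo else None
```

In this model three closes are absorbed (one popped into the scanner, two waiting), and a fourth close before the scan finishes sets the overflow flag.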
Pixel pipeline (Ch88)
The SCAN body is 3 stages, throughput 1 candidate pixel per cycle:
| Stage | Source | Work |
|---|---|---|
| S0 | ras_cur_x / ras_cur_y (bbox walker) | Generate the next candidate coord; advance the bbox walker; on bbox corner, fire ras_at_end_of_s0 and transition R_SCAN→R_DRAIN. |
| S1 | s1_x_q / s1_y_q (registered) | Combinational edge functions on (s1_x, s1_y) against the three triangle edges (or trivial-true for SPRITE), top-left bias, inside test → s1_pixel_inside. Latched into s2_inside_q. |
| S2 | s2_x_q / s2_y_q / s2_L0..L2_q / s2_inside_q | Compute Gouraud interp_byte(λ_i, c_i) × 4 channels and s2_fb_addr from PSM/FBP/FBW. If s2_valid_q && s2_inside_q, drive raster_pixel_emit with the resolved fb_addr / x / y / color. |
raster_state is now a 3-state FSM:
- R_IDLE: no work; pop_ok fires on a non-empty FIFO.
- R_SCAN: S0 produces one valid coord per cycle; S1/S2 latches propagate. On bbox corner, transitions to R_DRAIN.
- R_DRAIN: S0 stops producing valids (s1_valid_q <= 0); S1 and S2 finish their in-flight pixels. When both pipeline valids are low (drain_done), the FSM either pops the next primitive (back-to-back contiguous SCAN) or returns to R_IDLE.
pop_ok = !fifo_empty && (R_IDLE || drain_done) — the
end-of-scan pop is now drain-done, three cycles after S0
produces the corner. This guarantees the pipeline-tail pixels
of the previous primitive are not overwritten by the next
primitive's pop, while still keeping raster_active high
across the seam.
Latency from pop_ok to first registered raster_pixel_emit
is 3 stages of pipeline + 1 cycle of FIFO turnaround + 1
cycle of registered emit output = 5 posedges from the close
cycle of the closing vertex (see
sim/tb/gif_gs/tb_gs_raster_pipeline.sv for the cycle-exact
contract).
- EV_MODE: fired for any accept that did not resolve to a tracked register: REGLIST entries, IMAGE/DISABLE payload qwords, NOP-nibble PACKED slots, unknown privileged offsets, unknown GIF reg numbers. Reserved for "we know we saw something, we are intentionally not modeling it yet."
- EV_GIFTAG: one per accepted GIFtag; carries flg/nreg/nloop/eop for stream-level checking.
When trace event semantics change, audit this section and the per-stub trace-schema header comments together.
VRAM persistence (Ch89)
vram_stub (rtl/gif_gs/vram_stub.sv) is the first persistence
layer the rasterizer has had. Every raster_pixel_emit pulse
writes 4 bytes of pixel data at raster_pixel_fb_addr_q into
vram_stub's linear byte array. A combinational debug read port
exposes read_data byte-addressably so testbenches can verify
storage.
Wiring:
| vram_stub port | gs_stub source |
|---|---|
| write_en | raster_pixel_emit |
| write_addr | raster_pixel_fb_addr_q |
| write_data | raster_pixel_color_q[31:0] (the lower 32 bits; Q in the upper 32 is not framebuffer data) |
| write_be | raster_pixel_be_q (Ch95): per-byte write enable; byte i (the byte at write_addr + i) is committed only when write_en && write_be[i]. Lets the same 32-bit write port serve PSMs of any byte width. |
| write_mask | raster_pixel_mask_q (Ch106): per-bit merge mask; for each enabled byte, `mem[i] <= (mem[i] & ~mask_i) \| …` |
Scope (current write-side support, after Ch105):
- PSMCT32 + PSMCT16 + PSMT8 at the raster write port. The PSM width is selected by gs_stub's bpp_shift mux off FRAME_1.PSM and surfaced as raster_pixel_psm_q; gs_stub's S2 packs the pixel into the right byte lane and drives raster_pixel_be_q so vram_stub commits exactly the right bytes:
  - PSMCT32 (PSM=0x00) → 4 bytes/pixel, be = 4'b1111, ABGR in write_data[31:0].
  - PSMCT16 (PSM=0x02) → 2 bytes/pixel, be = 4'b0011, RGB5A1 packed in write_data[15:0] (Ch95). write_addr is the halfword byte address; per-byte be makes unaligned halfword writes safe.
  - PSMT8 (PSM=0x13) → 1 byte/pixel, be = 4'b0001, the natural ABGR's R channel goes into write_data[7:0] as the PSMT8 index (Ch105). write_addr is the exact byte address; vram_stub commits mem[write_addr] ← write_data[7:0] at any byte alignment without needing data-lane shifting.
  - PSMT4 (PSM=0x14) → 0.5 bytes/pixel (2 pixels per byte), be = 4'b0001, write_mask = 0x0000_000F (low nibble) or 0x0000_00F0 (high nibble) per pixel_index[0]. The 4-bit index (low nibble of natural ABGR's R) is placed in the targeted nibble position in write_data[7:0]. vram_stub merges only the masked bits; the OTHER nibble of the same byte is preserved (Ch106). Back-to-back same-byte emits (e.g. PSMT4 pixels x=0 and x=1, both landing in byte 0) chain through NBA semantics: the second NBA samples mem[addr] AFTER the prior commit, so both nibbles end up in the byte without a bypass-forwarding net.
  - PSMCT24 / PSMCT16S / PSMZ32 / PSMZ24 / PSMZ16 / PSMZ16S / PSMT8H / PSMT4HL / PSMT4HH: bpp_shift falls through to a host-word default (4 bytes); raster emit through these PSMs is not contract-tested.
- Write-side addressing. Real PS2 VRAM is 4 MiB organized into pages × blocks × columns per PSM. By DEFAULT, both gs_stub raster emit and gif_image_xfer_stub TRXDIR uploads produce the linear-framebuffer layout PCSX2 calls "linear PSM". Optional per-PSM swizzle paths are gated by parameters on each module:
  - PSMCT32: PSMCT32_SWIZZLE parameter on gs_pcrtc_stub (Ch120 read-side), gif_image_xfer_stub (Ch121 image-xfer write-side), and gs_stub (Ch122 raster write-side).
  - PSMCT16: PSMCT16_SWIZZLE parameter on gs_pcrtc_stub (Ch126 read-side), gif_image_xfer_stub (Ch127 image-xfer write-side), and gs_stub (Ch128 raster write-side). All three integration points are live, mirroring the PSMCT32 trio.
  When on, byte addresses route through the per-PSM swizzle module (gs_swizzle_psmct32_stub / gs_swizzle_psmct16_stub); image-xfer adds dest_base_q = DBP*256 on top of the swizzle output so any DBP works, while raster emit feeds the active ras_fbp directly so the swizzle output is already the absolute address. Per-PSM parameters are independent: enabling one doesn't affect the other PSM.
  PSMT8 has its full three-path swizzle integration as of Ch134, mirroring the PSMCT32/PSMCT16 trios: standalone math primitive gs_swizzle_psmt8_stub (Ch131) wired into gs_pcrtc_stub (Ch132 read-side, PSMT8_SWIZZLE), gif_image_xfer_stub (Ch133 write-side), and gs_stub (Ch134 raster emit), with the same parameter name on all three modules.
  PSMT4 has its full three-path swizzle integration as of Ch140, mirroring the PSMCT32/PSMCT16/PSMT8 trios: standalone math primitive gs_swizzle_psmt4_stub (Ch137) wired into gs_pcrtc_stub (Ch138 read-side, PSMT4_SWIZZLE), gif_image_xfer_stub (Ch139 write-side), and gs_stub (Ch140 raster emit), with the same parameter name on all three modules. The PSMT4 paths additionally thread the swizzle module's nibble_hi output through the existing Ch106 (raster) / Ch118 (image-xfer) nibble RMW machinery (replacing s2_pixel_index[0] / x_eff[0] as the high/low nibble selector when the gate is on).
  All parameter defaults are 0, so existing TBs see the legacy linear behavior. All four common GS PSMs (CT32 + CT16 + T8 + T4) now have complete three-path swizzle integration.
- Stub-sized. Default BYTES = 65536. Real VRAM is 4 MiB; for TB purposes a small linear region is enough to verify that emitted pixels actually land at the addresses gs_stub computes.
- Scanout path is provided by gs_pcrtc_stub (Ch90; see below). The legacy platform_video_stub flood-fills BGCOLOR and is unaware of VRAM; TBs that want to verify the round trip use gs_pcrtc_stub instead.
The Ch89 white-box TB tb_gs_vram_writeback.sv exercises the
contract end-to-end: drive a 4×4 SPRITE through gs_stub, capture
the (fb_addr, color) of each raster_pixel_emit pulse, then
read each fb_addr back from vram_stub and assert byte-exact
match.
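
The byte-enable part of the write contract can be mirrored by a tiny Python model, which is handy as a golden reference when checking captured (fb_addr, color) pairs offline. This is a sketch, not vram_stub itself: the class `VramModel` is hypothetical, write_mask is left out, and silently ignoring out-of-range bytes is an assumption.

```python
class VramModel:
    """Hypothetical linear byte array with per-byte write enables."""

    def __init__(self, size: int = 65536):
        self.mem = bytearray(size)

    def write(self, addr: int, data: int, be: int = 0b1111) -> None:
        """Commit byte i of data at addr + i only when be[i] is set."""
        for i in range(4):
            if (be >> i) & 1 and addr + i < len(self.mem):
                self.mem[addr + i] = (data >> (8 * i)) & 0xFF

    def read(self, addr: int) -> int:
        """Byte-addressable debug read."""
        return self.mem[addr]
```

A PSMCT32 write uses be=0b1111, while a PSMCT16 write with be=0b0011 at an odd address touches exactly two bytes and leaves its neighbours alone.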
PCRTC scanout (Ch90)
gs_pcrtc_stub (rtl/gif_gs/gs_pcrtc_stub.sv) is the scanout
side of the GS pipeline — its dual is gs_stub (the write
side). It models a minimal PCRTC (Programmable CRT Controller):
runs its own raster timing, generates a VRAM read address from
the current (hcnt, vcnt) using the same fb_addr math as
gs_stub, reads the byte returned by vram_stub's combinational
debug port, and drives r/g/b for the active area. Together
with Ch88's pipeline + Ch89's VRAM, this closes the loop:
raster_pixel_emit → vram_stub.write → vram_stub.read → pcrtc.r/g/b
Configuration (Ch91 — privileged-block CPU MMIO):
gs_pcrtc_stub consumes three real PS2 GS privileged display
register latches directly from gs_stub:
| pcrtc input | gs_stub source | Layout |
|---|---|---|
| pmode_q[63:0] | privileged write at offset 0x0000 | bit 0 = EN1 (display 1 enable) |
| dispfb1_q[63:0] | privileged write at offset 0x0070 | FBP[8:0], FBW[14:9], PSM[19:15], DBX[42:32] (Ch91-audit), DBY[53:43] (Ch91-audit) |
| display1_q[63:0] (Ch92, Ch93) | privileged write at offset 0x0080 | DX[11:0], DY[22:12], MAGH[26:23] (Ch93: H scale = MAGH+1), MAGV[28:27] (Ch93: V scale = MAGV+1), DW[43:32] (width-1), DH[54:44] (height-1) |
The Ch90 sideband ports (scanout_enable / dispfb_fbp /
dispfb_fbw) are gone. TBs program scanout the way a real
PS2 driver would: write DISPFB1, then DISPLAY1, then PMODE.EN1=1
(Ch92). Out of reset, all three registers are 0, so EN1 is low
and pcrtc outputs 0.
scanout_enable inside pcrtc is derived combinationally from
the latches:
scanout_enable = pmode_q[0] & (PSM ∈ {0, 2, 0x13, 0x14}).
PSMCT32 (=0), PSMCT16 (=2), PSMT8 (=0x13), and PSMT4 (=0x14) are
honored at this scope; any other PSM forces scanout off rather
than mis-decoding the byte layout.
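As a software cross-check of the gating rule above, here is a minimal Python sketch. The PSM codes and the PSM[19:15] / EN1 bit positions come from this section; the function and constant names are hypothetical and do not correspond to RTL signals.

```python
# Sketch only: bit positions and PSM codes are taken from the text above;
# the names below are illustrative, not RTL identifiers.
PSMCT32, PSMCT16, PSMT8, PSMT4 = 0x00, 0x02, 0x13, 0x14
SCANOUT_PSMS = frozenset({PSMCT32, PSMCT16, PSMT8, PSMT4})


def scanout_enable(pmode_q: int, dispfb1_q: int) -> bool:
    """Return True when PMODE.EN1 is set and DISPFB1.PSM is a honored PSM."""
    en1 = pmode_q & 1                    # PMODE bit 0 = EN1
    psm = (dispfb1_q >> 15) & 0x1F       # DISPFB1.PSM lives at [19:15]
    return bool(en1) and psm in SCANOUT_PSMS
```

With all latches at their reset value of 0, `scanout_enable(0, 0)` is false because EN1 is low, which matches the out-of-reset behavior described earlier.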
DISPLAY1 (Ch92, Ch93) supplies the display window — the sub-rect inside the active area where pcrtc actually pulls pixels from VRAM — and the per-axis magnification: each VRAM column is shown for (MAGH+1) consecutive VCK pulses, each VRAM line for (MAGV+1) raster lines. Outside the window pcrtc drives r/g/b = 0 even with EN1=1. Pcrtc's H_TOTAL/V_TOTAL still come from module parameters at instantiation; only the active-area sub-rect gated by DX/DY/DW/DH is register-driven. Dual-display (PMODE.EN2 + DISPFB2 + DISPLAY2) is deferred.
Address math + display-window gating + magnification:
hmag_factor = MAGH + 1 // 1..16
vmag_factor = MAGV + 1 // 1..4
hwin_rel = hcnt - DX // pixel offset inside the window
vwin_rel = vcnt - DY
in_window = (hcnt >= DX) && (hwin_rel <= DW)
&& (vcnt >= DY) && (vwin_rel <= DH)
fbp_bytes = dispfb_fbp << 11 // FBP × 2048
pixels_per_row = dispfb_fbw << 6 // FBW × 64
vram_x_unshift = hwin_rel / hmag_factor // 4 displayed pixels = 1 VRAM column at MAGH=3
vram_y_unshift = vwin_rel / vmag_factor
effective_x = vram_x_unshift + DBX
effective_y = vram_y_unshift + DBY
pixel_index = effective_y × pixels_per_row + effective_x
bpp_shift = (PSM == PSMCT32) ? 2 :
(PSM == PSMCT16) ? 1 :
(PSM == PSMT8) ? 0 : 2
fb_addr = fbp_bytes + (pixel_index << bpp_shift)
r/g/b drive = (de && scanout_enable && in_window) ? decode(VRAM, PSM) : 0
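The address math above can be mirrored in a small Python reference model, in the spirit of a TB-side ref function. This is a sketch under assumptions: the function name and keyword arguments are hypothetical, the PSMT4 branch applies the pixel_index >> 1 override described in the decode notes, and returning None stands in for "outside the window, drive black".

```python
# Sketch of the linear (non-swizzled) scanout address math described above.
# Names are illustrative; None models "outside the display window".
PSMCT32, PSMCT16, PSMT8, PSMT4 = 0x00, 0x02, 0x13, 0x14


def scanout_addr(hcnt, vcnt, *, fbp, fbw, psm,
                 dbx=0, dby=0, dx=0, dy=0, dw=0, dh=0, magh=0, magv=0):
    hwin_rel = hcnt - dx
    vwin_rel = vcnt - dy
    # DW/DH hold width-1 / height-1, hence the inclusive comparisons.
    in_window = (hcnt >= dx and hwin_rel <= dw and
                 vcnt >= dy and vwin_rel <= dh)
    if not in_window:
        return None
    effective_x = hwin_rel // (magh + 1) + dbx
    effective_y = vwin_rel // (magv + 1) + dby
    pixel_index = effective_y * (fbw << 6) + effective_x   # FBW x 64 pixels/row
    fbp_bytes = fbp << 11                                   # FBP x 2048
    if psm == PSMT4:
        return fbp_bytes + (pixel_index >> 1)               # 2 pixels per byte
    bpp_shift = {PSMCT32: 2, PSMCT16: 1, PSMT8: 0}[psm]
    return fbp_bytes + (pixel_index << bpp_shift)
```

For the Ch93 configuration (DX=4, DY=2, DW=7, DH=7, MAGH=MAGV=1, FBW=1, PSMCT32), the first window pixel at (hcnt, vcnt) = (4, 2) maps to byte 0 and the last one at (11, 9) maps to VRAM pixel (3, 3), i.e. byte 780.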
Per-PSM color decode at vram_read_data:
- PSMCT32: r = data[7:0], g = data[15:8], b = data[23:16]. Alpha at [31:24] is dropped (no DAC channel).
- PSMCT16 (Ch94): RGB5A1 packed into the lower 16 bits as {A[15], B[14:10], G[9:5], R[4:0]}. 5→8 expansion uses bit-replicate r8 = {r5, r5[4:2]} (so 5'h1F → 8'hFF, 5'h00 → 8'h00). Alpha bit dropped at the DAC.
- PSMT8 (Ch96/Ch97): index in data[7:0]. With clut_enable=1 (Ch97), pcrtc presents clut_read_idx = idx + (CSA << 4) to the external clut_stub and decodes the returned PSMCT32 entry as r = clut_data[7:0], g = clut_data[15:8], b = clut_data[23:16]. With clut_enable=0 (Ch96 fallback), pcrtc surfaces the index as grayscale so the 8-bit storage lane is visually verifiable without programming a CLUT.
- PSMT4 (Ch103): 2 pixels per byte. byte_offset = pixel_index >> 1 (overrides the standard pixel_index << bpp_shift math). nibble = pixel_index[0] ? data[7:4] : data[3:0] picks the 4-bit pixel; the zero-extended 8-bit value {4'd0, nibble} plus (CSA << 4) is presented on clut_read_idx. With clut_enable=1, pcrtc decodes the returned PSMCT32 entry the same way as PSMT8. With clut_enable=0, the fallback replicates the nibble across the 8-bit DAC value (r = g = b = {nibble, nibble}) so 4'hF → 0xFF and 4'h5 → 0x55. CSA is the natural per-palette-window selector for PSMT4: multiple 16-entry palettes can share the 256-entry staging area, indexed by CSA.
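The per-PSM decode rules above reduce to a handful of bit manipulations. The Python sketch below restates them; all function names are hypothetical, and the mod-256 wrap of the CLUT index is an assumption based on the CSA wrap case that the Ch97 TB describes (idx 0xF8 with CSA=1 reading CLUT[0x08]).

```python
# Sketch of the scanout color decode; helper names are illustrative only.

def expand5(v5: int) -> int:
    """5 -> 8 bit-replicate: {v5, v5[4:2]}."""
    return ((v5 << 3) | (v5 >> 2)) & 0xFF


def decode_psmct32(data: int):
    """Return (r, g, b); alpha at [31:24] is dropped."""
    return data & 0xFF, (data >> 8) & 0xFF, (data >> 16) & 0xFF


def decode_psmct16(data: int):
    """RGB5A1 as {A[15], B[14:10], G[9:5], R[4:0]}; alpha bit is dropped."""
    r5, g5, b5 = data & 0x1F, (data >> 5) & 0x1F, (data >> 10) & 0x1F
    return expand5(r5), expand5(g5), expand5(b5)


def psmt4_nibble(byte: int, pixel_index: int) -> int:
    """Odd pixel_index selects the high nibble, even selects the low nibble."""
    return (byte >> 4) & 0xF if pixel_index & 1 else byte & 0xF


def clut_read_idx(idx: int, csa: int) -> int:
    """idx + (CSA << 4), assumed to wrap within the 256-entry CLUT."""
    return (idx + (csa << 4)) & 0xFF


def psmt4_gray(nibble: int) -> int:
    """clut_enable=0 fallback: replicate the nibble, {nibble, nibble}."""
    return ((nibble << 4) | nibble) & 0xFF
```

The replicate form is what makes full scale map to full scale: `expand5(0x1F)` is 0xFF and `psmt4_gray(0x5)` is 0x55, the same endpoints quoted above.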
Ch95 — gs_stub raster channel emits PSMCT16. The S2 stage
of the pipeline now packs ABGR → RGB5A1 (r5=R[7:3], g5=G[7:3],
b5=B[7:3], a1=A[7]) when ras_bpp_shift==1 (PSMCT16 / PSMCT16S
/ PSMZ16 / PSMZ16S — any 16-bit PSM). The packed 16-bit pixel
goes in the LOW halfword of raster_pixel_color_q[31:0], and a
new raster_pixel_be_q[3:0] selects which bytes vram_stub
commits: 4'b0011 for PSMCT16, 4'b1111 for PSMCT32. vram_stub
gates each byte write on write_be[i], so back-to-back PSMCT16
emits write 2 bytes each without halfword stomping. New
raster_pixel_psm_q[5:0] exposes the current PSM for trace.
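A minimal Python sketch of the S2 packing step, assuming the ABGR word layout used elsewhere in this section (R in [7:0], G in [15:8], B in [23:16], A in [31:24]). The function names are hypothetical; only the 16-bit and 32-bit cases described for Ch95 are modelled.

```python
# Sketch of the Ch95 emit packing; names are illustrative, not RTL signals.

def pack_rgb5a1(abgr: int) -> int:
    """ABGR8888 -> RGB5A1: r5=R[7:3], g5=G[7:3], b5=B[7:3], a1=A[7]."""
    r = abgr & 0xFF
    g = (abgr >> 8) & 0xFF
    b = (abgr >> 16) & 0xFF
    a = (abgr >> 24) & 0xFF
    return ((a >> 7) << 15) | ((b >> 3) << 10) | ((g >> 3) << 5) | (r >> 3)


def emit_lane(abgr: int, ras_bpp_shift: int):
    """Return (color, byte_enable) for the raster emit lane."""
    if ras_bpp_shift == 1:
        # 16-bit PSM: packed pixel in the low halfword, commit 2 bytes.
        return pack_rgb5a1(abgr), 0b0011
    # 32-bit PSM: pass the ABGR word through, commit all 4 bytes.
    return abgr & 0xFFFFFFFF, 0b1111
```

Because only the top five bits of each channel survive, a round trip through the 5→8 bit-replicate decode recovers the original value exactly only for channel values that are themselves bit-replicated (for example 0x00 and 0xFF).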
The Ch95 TB tb_gs_raster_psmct16.sv exercises the round trip:
gs_stub renders a 4×4 SPRITE with FRAME_1.PSM=PSMCT16, then VRAM
read-back verifies each pixel landed at the right halfword AND
that the halfword right after the sprite stays zero (no leak).
Ch105 extends the raster channel to PSMT8 (FRAME_1.PSM=0x13).
When ras_bpp_shift==0, S2 takes the natural ABGR's R channel
(low 8 bits) as the PSMT8 index — the same lane real PS2 hardware
writes when the destination FB is PSMT8 — places it in the LOW
byte of the emit lane, and sets raster_pixel_be_q = 4'b0001 so
vram_stub commits exactly the 1 byte at fb_addr. The 1-byte
commit works at any byte alignment because vram_stub gates each
byte lane independently. The Ch105 TB tb_gs_raster_psmt8.sv
renders a 5×3 SPRITE (chosen so the row spans byte lanes 1, 2, 3,
0, 1 — exercising every lane alignment) at FRAME_1.PSM=PSMT8 with
RGBAQ R=0x55, G=0xAA, B=0xBB, A=0xCC; asserts each sprite byte
reads back as 0x55, the bytes immediately left and right of the
sprite stay 0x00 (so 32-bit-aligned overwrite would be visible),
and a full-VRAM sweep finds NO byte equal to 0xAA / 0xBB / 0xCC
(channel-isolation: only R reaches VRAM at PSMT8).
Ch106 closes the indexed-write gap with PSMT4 (FRAME_1.PSM=0x14)
as a per-bit RMW into vram_stub. Three changes form the
mechanism:
- vram_stub gains a new write_mask[31:0] input (Ch106). The commit is now mem[i] <= (mem[i] & ~mask_i) | (data_i & mask_i) for each enabled byte. PSMCT32/16/PSMT8 tie mask = 0xFFFF_FFFF (no behavior change: full byte writes).
- gs_stub's S2 PSM-aware emit packing gets a PSMT4 branch: the byte address is pixel_index >> 1 (overrides the pixel_index << ras_bpp_shift form), the index is the low 4 bits of the natural ABGR's R channel, and the emit places that 4-bit value in either the low ({4'd0, idx}) or high ({idx, 4'd0}) nibble of write_data[7:0] based on pixel_index[0]. s2_emit_be = 4'b0001, s2_emit_mask = pixel_index[0] ? 0x0000_00F0 : 0x0000_000F.
- New raster_pixel_mask_q[31:0] output on gs_stub carries the mask through to vram_stub.write_mask.
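The three changes compose into a masked read-modify-write. The Python sketch below models that composition under stated assumptions: byte lane i of the write maps to mem[addr + i], FBP is taken as 0, and the function names are hypothetical rather than RTL identifiers.

```python
# Sketch of the Ch106 masked commit plus the PSMT4 emit packing.
# Assumptions: lane i -> mem[addr + i]; FBP = 0; names are illustrative.

def vram_write(mem: bytearray, addr: int, data: int, be: int, mask: int) -> None:
    """Per enabled byte: mem = (mem & ~mask) | (data & mask)."""
    for lane in range(4):
        if (be >> lane) & 1:
            d = (data >> (8 * lane)) & 0xFF
            m = (mask >> (8 * lane)) & 0xFF
            mem[addr + lane] = (mem[addr + lane] & ~m & 0xFF) | (d & m)


def psmt4_emit(pixel_index: int, abgr: int):
    """Return (byte_addr, write_data, be, mask) for one PSMT4 pixel."""
    idx = abgr & 0xF                       # low 4 bits of the R channel
    high = pixel_index & 1                 # odd pixel -> high nibble
    data = (idx << 4) if high else idx
    mask = 0x000000F0 if high else 0x0000000F
    return pixel_index >> 1, data, 0b0001, mask


# Usage: a high-nibble write into a byte preloaded with 0xA5.
mem = bytearray([0xA5] * 128)
vram_write(mem, *psmt4_emit(133, 0x07))    # pixel (5, 2) at 64 pixels per row
```

After the usage lines run, `mem[66]` holds 0x75: the high nibble changed from A to 7 and the low nibble kept its preloaded 5.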
The Ch106 TB tb_gs_raster_psmt4.sv is intentionally
adversarial about preservation. VRAM is preloaded with 0xA5
(high=A, low=5) at every byte the sprites will touch. Three
phases:
- Phase A: 4×2 SPRITE at (0,0)..(3,1), R=0x05 → idx=5. Both nibbles of each enclosing byte are written (8 emits across 4 bytes); each byte ends at 0x55 and the four neighbouring preloaded bytes (2..3, 34..35) remain 0xA5. This proves the back-to-back same-byte case (NBA chaining) and the neighbour-byte preservation in one go.
- Phase B: single-pixel SPRITE at (5, 2). x=5 odd → high nibble; pixel_index = 133, byte_addr = 66; idx=7. Preload mem[66] = 0xA5. Expected after raster: mem[66] = 0x75 (high nibble updated from A to 7, low nibble stays 5). Proves isolated high-nibble RMW preserves the low nibble.
- Phase C: single-pixel SPRITE at (4, 3). x=4 even → low nibble; pixel_index = 196, byte_addr = 98; idx=9. Preload mem[98] = 0xA5. Expected after: mem[98] = 0xA9 (low nibble updated from 5 to 9, high nibble stays A). Proves isolated low-nibble RMW preserves the high nibble.
Continuous observer asserts psm_q == 6'h14, be_q == 4'b0001,
and mask_q ∈ {0x0F, 0xF0} on every emit. Final aggregate
checks: 10 emits total, full-VRAM sweep finds NO byte equal to
0xAA / 0xBB / 0xCC (only R reaches the framebuffer at PSMT4).
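The phase numbers quoted for this TB can be re-derived from the sprite coordinates alone. The Python sketch below does that arithmetic; it assumes 64 pixels per row (FBW=1, which is what pixel_index = 133 at (5, 2) implies), inclusive sprite corners, and FBP=0. Helper names are hypothetical.

```python
# Sketch: recompute the Ch106 TB's pixel_index / byte_addr / emit counts.
# Assumptions: FBW = 1 (64 pixels per row), FBP = 0, inclusive corners.
PIXELS_PER_ROW = 64


def psmt4_slot(x: int, y: int):
    """Return (pixel_index, byte_addr, nibble) for a PSMT4 pixel."""
    pixel_index = y * PIXELS_PER_ROW + x
    return pixel_index, pixel_index >> 1, "high" if pixel_index & 1 else "low"


def sprite_pixels(x0: int, y0: int, x1: int, y1: int):
    """Enumerate every pixel of a sprite with inclusive corners."""
    return [(x, y) for y in range(y0, y1 + 1) for x in range(x0, x1 + 1)]


phases = {
    "A": sprite_pixels(0, 0, 3, 1),
    "B": sprite_pixels(5, 2, 5, 2),
    "C": sprite_pixels(4, 3, 4, 3),
}
touched_bytes = {
    name: sorted({psmt4_slot(x, y)[1] for x, y in pixels})
    for name, pixels in phases.items()
}
total_emits = sum(len(pixels) for pixels in phases.values())
```

Phase A touches bytes 0, 1, 32 and 33, so bytes 2..3 and 34..35 are the untouched neighbours the TB checks, and the three phases add up to the 10 emits in the aggregate check.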
DBX / DBY shift the VRAM origin: the pixel that appears at displayed (DX, DY) corresponds to VRAM (DBX, DBY). Real PS2 drivers use this for double-buffered framebuffers (alternate frames at different DBX/DBY) and offset display windows.
The following TBs lock these contracts:
- tb_gs_scanout_basic.sv: DBX=DBY=0, DISPLAY1 covers full active area, MAGH=MAGV=0 (1×): classic sprite-at-origin scanout.
- tb_gs_scanout_dbx_dby.sv: sprite at VRAM (4,2)..(7,5), DISPFB1.DBX=4/DBY=2, DISPLAY1 full active area, MAGH=MAGV=0: sprite shows at displayed (0..3, 0..3).
- tb_gs_scanout_display_window.sv: sprite at VRAM (0..3, 0..3), DBX=DBY=0, DISPLAY1 with DX=2/DY=1/DW=3/DH=3, MAGH=MAGV=0: sprite shows at displayed (2..5, 1..4); pixels outside the window are black even though pcrtc's raster passes through them.
- tb_gs_scanout_magh_magv.sv (Ch93): sprite at VRAM (0..3, 0..3), DBX=DBY=0, DISPLAY1 with DX=4/DY=2/DW=7/DH=7, MAGH=1/MAGV=1 (2×/2×): 4×4 VRAM sprite stretches to fill the 8×8 displayed window pixel-perfect; pixels outside the window are black.
- tb_gs_scanout_psm16.sv (Ch94): 4×4 RGB5A1 sprite written directly to vram_stub at PSMCT16 byte stride, DISPFB1.PSM=0x02: 5→8 bit-replicate decode produces the right (R8, G8, B8) at scanout. (No gs_stub instantiated; this TB exercises the PSM decode path in isolation.)
- tb_gs_scanout_psmt8.sv (Ch96): 4×4 PSMT8 sprite of indices 0x10..0x1F written directly to vram_stub at 1 byte/pixel stride. DISPFB1.PSM=0x13, DISPLAY1 with DX=4/DY=2/DW=7/DH=3 AND MAGH=1 (2× horizontal). Asserts each scan-out displayed pixel reads back as grayscale R=G=B=expected index, proving byte stride + display window + horizontal magnification all work at 1 byte/pixel.
- tb_gs_scanout_psmt8_clut.sv (Ch97): same 4×4 PSMT8 sprite, plus a programmed CLUT where CLUT[i] = ABGR(0xFF, i+0x80, i+0x40, i). With clut_enable=1 and clut_csa=0, asserts each scan-out pixel reads back as the CLUT entry for its index; PSMT8 storage + CLUT lookup compose correctly into real RGB. Three phases: full-frame CSA=0, single-pixel CSA=1 (idx 0x00 → CLUT[0x10]), and CSA=1 wrap (idx 0xF8 → CLUT[0x08]).
- tb_gs_tex0_clut.sv (Ch98): drives gs_stub's GIF reg# 0x06 (TEX0_1) and asserts the latch + sub-field decoders match the encoded payload (CBP/CPSM/CSM/CSA/CLD bit ranges).
Phase 2 wirespcrtc.clut_csafromgs_stub.tex0_1_csa_q(instead of TB-side sideband) and verifies the CSA value flows from a GIF register write into the CLUT lookup math at scan-out.tb_gs_clut_load.sv(Ch99) — full TEX0.CLD-driven VRAM→CLUT load round trip. Stages 256 PSMCT32 entries in VRAM atCBP*256(using the newvram_stubsecond read port), drives TEX0_1 withCBP=4, CPSM=PSMCT32, CSM=CSM2, CLD=1, waits forclut_loader_stub.load_busyto fall, then runs PSMT8 scanout and asserts each in-sprite pixel reads back as the CLUT entry the loader copied — no TB-direct CLUT writes needed. Also carries a Ch99-audit negative phase: a TEX0 write with CSM=0 (CSM1 swizzle, deferred) silently no-ops instead of laying down wrong linear bytes.tb_gs_clut_load_ct16.sv(Ch100) — CPSM=PSMCT16 variant of the Ch99 load TB. Stages 256 RGB5A1 entries (2 bytes each) in VRAM atCBP*256, drives TEX0_1 withCPSM=2. The loader now walks at 2-byte stride, unpacks RGB5A1 → PSMCT32 ABGR via 5→8 bit-replicate, and writes to clut_stub. PSMT8 scanout produces the expanded RGB. Ch100-audit alpha coverage: per-entrya1 = idx[0]varies the alpha bit so both{8{0}} = 0x00and{8{1}} = 0xFFare exercised; a TB-sideclut_wesnoop captures every loader write so alpha can be asserted directly without going through the RGB-only scanout path.tb_gs_clut_load_cld_modes.sv(Ch101 + Ch102) — conditional CLD-mode policy. Phases walk through CLD ∈ {0, 1, 2, 3, 4, 5, 6, 7} with varying CBP/CPSM/CSA, countingloader_busyrising edges to prove: CLD=0 never loads; CLD=1 always (full); CLD=2 loads only when CBP changed; CLD=3 loads when CBP/CPSM/CSA any-changed (CBP, CSA, and CPSM arms each isolated); CLD=4 always loads but only the 16-entry CSA window (Ch102 — write range correctness is locked bytb_gs_clut_load_csa_window); CLD ∈ {5, 6, 7} reserved no-ops.tb_gs_clut_load_csa_window.sv(Ch102) — CLD=4 write-range correctness. Phase 1 stages 256 distinct PSMCT32 entries in VRAM and runs CLD=1 to fill all 256 CLUT slots with pattern_a. 
Phase 2 stages 16 different entries at a new CBP, drives CLD=4 with CSA=2 (window = idx 32..47), and asserts via aclut_wesnoop that exactly 16 writes occurred AND the captured array contains: pattern_a(i) at i ∈ [0, 32) ∪ [48, 256), pattern_b(i-32) at i ∈ [32, 48). Proves 240 entries are preserved across the partial load. Audit-low extensions: Phase 3 covers the high-CSA wrap (CSA=16 → window-base wraps mod-256 to 0); Phase 4 covers CT16 partial (CPSM=PSMCT16, 2-byte stride, RGB5A1 unpack at the loader, window at idx 160..175).tb_gs_scanout_psmt4_clut.sv(Ch103) — PSMT4 scanout. Stages a 4×4 PSMT4 sprite (2 pixels/byte) and 16 CLUT entries. Phase 1 (clut_enable=1): asserts each pixel readsCLUT[zero-ext(nibble) + CSA*16]. Phase 2 (clut_enable=0): asserts the grayscale fallback replicates the 4-bit nibble across the 8-bit DAC value. Both phases verify byte-stride half-extraction (low/high nibble pick) at every active pixel. Audit-low Phase 3 locks PSMT4 + nonzero CSA (CSA=1, window 16..31) end-to-end: TB-direct CLUT writes plant a 0xDEAD_BEEF sentinel at entries 0..15 and a per-index formula at 16..31, scanout asserts each pixel reads the formula and never the sentinel.tb_gs_demo_psmt4_e2e.sv(Ch107) — first end-to-end demo for the GS/PCRTC stack. Scope is GS-side only: the post-GIF register stream (per-reg A+D writes viags_stub.gif_reg_*) plus privileged-block MMIO drive the pipeline;gif_packed_stub/ GIFtag-PACKED is BYPASSED — feeding the same demo through the GIF front-end is a future chapter. Step 1 stages 16 PSMCT32 palette entries in VRAM atCBP*256(modelled as a TB-direct write — DMA→GS image transfer is a future chapter, but the framebuffer itself is NOT TB-direct). Step 2 drives per-reg writes (PRIM/FRAME_1/RGBAQ/XYZ2) for four SPRITEs paying out a 4-quadrant 8×4 image (TL idx 0x5, TR idx 0x7, BL idx 0xA, BR idx 0xC) at FRAME_1.PSM=PSMT4 — all 32 framebuffer pixels arrive via the Ch106 raster channel. 
Step 3 drives TEX0_1 withCBP=palette, CPSM=PSMCT32, CSM=CSM2, CSA=0, CLD=4; loader writes clut_stub[0..15]. Step 4 brings up scanout via privileged-block writes to DISPFB1 (PSM=PSMT4) + DISPLAY1 + PMODE.EN1. Step 5 captures one full frame and asserts each pixel reads back asCLUT[quadrant_idx](orCLUT[0]outside the 8×4 image since vram_stub zero-init means nibble=0). Aggregate asserts: 32 PSMT4 emits, mask ∈ {0x0F, 0xF0} on every emit (channel-isolation locked architecturally — only R[3:0] ever reaches VRAM at PSMT4), loader fires exactly once, no raster_overflow / raster_degenerate. This TB is the first stack-wide proof that the GS-side post-GIF sequence — per-reg writes → indexed framebuffer → TEX0+CLD palette upload → PMODE/DISPFB/DISPLAY scanout — produces a coherent RGB frame end to end without TB sideband for the framebuffer pixels. Routing the same primitives through GIFtag/PACKED A+D viagif_packed_stubcloses the last sideband and is the natural Ch108 anchor.tb_gs_demo_psmt4_e2e_ee_full_bootlet.sv(Ch114) — extends Ch113's EE-driven control plane to ALSO drive the DMAC channel-2 setup from the same MIPS instruction stream. The EE program now writes the 4 GS-priv registers + the 3 DMAC ch2 registers (MADR / QWC / CHCR.start) via realswinstructions, then SYSCALLs to halt. Total: 7 EE-CPU MMIO writes (4 GS-priv + 3 DMAC) producing the same 16×8 captured frame. Architectural note: the program lives inbios_rom_stubat 0xBFC0_0000 / phys 0x1FC0_0000, NOT in RAM. A RAM-resident program would have its instruction fetches contend with the DMAC's RAM reads throughee_ram_stub's single read port (the map's CPU>DMAC arbitration silently corrupts DMAC data). Putting the program in BIOS decouples the two paths so EE and DMAC run truly in parallel. This also matches real PS2: the EE boots out of BIOS ROM. 
PASS criteria add to Ch113's: 3 EE-driven DMAC writes seen at the map's DMAC-ch2 decode; the existingdma=(1,36,1)event taxonomy still holds (those events are triggered by the EE's CHCR write, not a TB-direct write). The remaining TB-direct surfaces in the demo are now narrowly the GIF payload pre-stage in RAM (a real EE driver would itself stage this) and bios_rom_stub's program preload (which is the EE bootlet itself — not a runtime TB sideband).tb_gs_demo_psmt4_e2e_ee_program.sv(Ch113) — same demo as Ch112 but the 4 control-plane MMIO writes (PMODE / DISPFB1 / DISPLAY1 lo / DISPLAY1 hi) are no longer issued by the TB directly. Instead a 10-instruction MIPS program preloaded into ee_ram_stub at phys 0x800 (kseg0 0x80000800) is fetched and executed byee_core_stub(parameterized withPC_RESET=0x80000800). The program isLUI/ORI/SW × 4plus a SYSCALL terminator; the SW instructions target0x12000000+and flow throughee_memory_map_stub's GS-priv decode →ee_gs_priv_bridge_stub→gs_stub.reg_wr_*. Closes the very last TB-direct surface in the demo flow: every byte AND every register bit AND every control-plane decision now arrives from a real-shape source. PASS criteria add to Ch112's:core_halt_o == 1(asserts exactly once on the SYSCALL halt),core_trap == 0, EE program halts atEE_PROG_VA + 36 = 0x80000824(the SYSCALL slot). The TB still pre-stages the GIF payload and triggers the DMAC channel-2 transfer via TB-direct CHCR/MADR/QWC writes — a wider EE program that also drives DMAC bring-up is a separate future chapter.tb_gs_demo_psmt4_e2e_eemap.sv(Ch112) — same demo as Ch111 but the bridge is no longer driven by the TB directly. Instead the TB drivesee_memory_map_stub.ee_wr_*with full 32-bit physical addresses targeting the new GS-privileged-MMIO window at 0x1200_0000-0x1200_FFFF (64 KiB; phys[28:16] == 13'h1200). 
The map decodes the window, peels the 16-bit offset, and hands the 32-bit half-write toee_gs_priv_bridge_stub, which then fires gs_stub.reg_wr_* with the running 64-bit shadow value. Closes the last control-plane routing gap before a real EE instruction stream can drive the demo's bring-up: PMODE / DISPFB1 / DISPLAY1 are now reachable fromsw 0x1200_0080(...)- shaped writes rather than from a TB-shaped EE-MMIO port. PASS criteria identical to Ch111: 4 EE-MMIO writes / 4 bridge fires, same 16×8 captured frame. Architectural note: this chapter ALSO adds 4 new output ports toee_memory_map_stub(ee_gs_priv_wr_en/addr/data/be). Existing 56 ee_memory_map_ stub-using TBs leave those outputs unconnected (named-port instantiation tolerates omitted outputs); only the new Ch112 TB wires them through to the bridge.tb_gs_demo_psmt4_e2e_eemmio.sv(Ch111) — same demo as Ch110 but the privileged-block control writes (PMODE / DISPFB1 / DISPLAY1) now arrive throughee_gs_priv_bridge_stub(a new RTL module) driven by EE-shaped 32-bit MMIO writes from the TB, instead of TB-direct gs_stub.reg_wr_* pulses. The bridge accumulates 32-bit half-writes per 8-byte slot and fires a 64-bit gs_stub.reg_wr_* pulse on each EE half-write — single-half writes work for PMODE.EN1 and DISPFB1 (interesting bits in the low 32), and a pair of writes (lo+hi to consecutive 4-byte addresses) handles DISPLAY1 whose DW/DH live in the high 32. Bridge contract: full-word writes only —ee_wr_bemust be4'b1111; sub-word (per-byte) merging into the 64-bit shadow is intentionally out of scope and a$errorfires on any narrower be (control-plane GS registers are always written as full 32-bitswhalves of ansd). Scope precision: this chapter closes the TB-directgs_stub.reg_wr_*surface — i.e., the privileged-MMIO sink at the GS itself. 
The bridge is instantiated by the TB directly; it is NOT yet wired into ee_memory_map_stub, so the full EE-CPU / memory-map MMIO path (a real EE instruction stream reaching 0x12000000+ via sw) is a separate future chapter. PASS criteria add to Ch110's: 4 EE-MMIO writes (1 PMODE + 1 DISPFB1 + 2 DISPLAY1) and 4 bridge fires producing the same 16×8 captured frame as Ch110.
- tb_gs_demo_psmct32_swizzle_trxdir_e2e.sv (Ch124): companion to Ch123, with the same EE-bootlet → DMAC → GIF data plane and same all-three-gates-on instantiation, but the framebuffer is filled by a TRXDIR/IMAGE upload through gif_image_xfer_stub instead of by raster. The Ch121 image-xfer write-side swizzle gate becomes LOAD-BEARING inside the demo flow: every byte the GS produces comes out of the image-xfer engine at canonical PSMCT32 swizzled addresses, and the raster path is dormant. Payload: U1 (PACKED, NREG=4: BITBLTBUF{DBP=0, DBW=1, DPSM=PSMCT32} / TRXPOS{DSAX=DSAY=0} / TRXREG{RRW=16, RRH=8} / TRXDIR{XDIR=0}) + U2 (IMAGE, NLOOP=32: 32 IMAGE qwords carrying the 128 PSMCT32 pixels of the same four-quadrant pattern Ch123 used). DMAC QWC = 38. Verification mirrors Ch123: (1) full 16×8 scanout frame capture; (2) per-pixel byte readback at the canonical swizzled address via vram_stub's 2nd read port; (3) strict linear-vs-swizzled separator at byte 1024 stays 0. Aggregate counts: dma=(1,38,1) ee_dmac_wr=3 giftags=2 ad_writes=4 xfer_writes=128 ee_priv_wr=4 bridge_fires=4 core_halt=1 emits=0 frame=16x8. Ch123 + Ch124 together exercise BOTH PSMCT32 write-side paths (raster Ch122 + image-xfer Ch121) end-to-end through the same driver-shaped flow with the same swizzled-scanout (Ch120) read side.
tb_gs_demo_psmct32_swizzle_e2e.sv(Ch123) — full driver-shaped end-to-end demo with ALL THREE PSMCT32 swizzle gates flipped on simultaneously:gs_stub#(PSMCT32_SWIZZLE=1)(Ch122 raster),gif_image_xfer_stub#(PSMCT32_SWIZZLE=1)(Ch121 — instantiated but unused in this demo),gs_pcrtc_stub#(PSMCT32_SWIZZLE=1)(Ch120 read). The data plane is the same DMAC + GIF + EE-bootlet shape Ch107..Ch114 demos use: a BIOS-resident EE program (PC_RESET=0xBFC0_0000) configures GS-priv (DISPFB1, DISPLAY1 lo/hi, PMODE.EN1) viaswinstructions throughee_memory_map_stub→ee_gs_priv_bridge_stub→gs_stub.reg_wr_*, then kicks DMAC ch2 (MADR / QWC / CHCR) viaswto the DMAC reg window, thenSYSCALLhalts. DMAC delivers a 24-qword payload fromee_ram_stubtogif_packed_stub, which dispatches 4 SPRITE PACKED packets (1 GIFtag + 5 A+D each — PRIM, FRAME_1=PSMCT32, RGBAQ, XYZ2, XYZ2). The 4 sprites tile the 16×8 active area into 4 quadrants with unique RGB triples. With the raster gate on, all 128 per-pixel store addresses go throughgs_swizzle_psmct32_stub; with the pcrtc gate on, scanout reads from those same swizzled addresses. Two-phase verification: (1) scanout — every (x, y) in 16×8 captures its sprite's RGB; (2) byte readback via vram_stub's 2nd read port — for every (x, y), the 32-bit word atref_addr_psmct32(0, 1, x, y)equals the sprite's{A=0xFF, B, G, R}PSMCT32 word. Strict linear-vs-swizzled separator at byte 1024 (where the linear formula's y=4 row would land at stride=256) stays 0 — the swizzled write set for the 16×8 image stays in blocks (0,0) and (1,0) of page 0 (bytes 0..511), so a fall-through to linear would have placed sprite-2's color at byte 1024. Aggregate counts:dma=(1,24,1) ee_dmac_wr=3 giftags=4 ad_writes=20 xfer_writes=0 ee_priv_wr=4 bridge_fires=4 core_halt=1 emits=128 frame=16x8. 
This is the FIRST end-to-end demo where every PSMCT32 byte the GS produces lives at the canonical PCSX2 swizzled address AND the scanout reads from it — byte-accurate to real PS2 VRAM layout, end-to-end through the driver-shaped flow.tb_gs_raster_swizzle_psmct32.sv(Ch122) — focused contract for the newPSMCT32_SWIZZLEparameter ongs_stub. When the parameter is set to 1 AND the active raster PSM is PSMCT32 (ras_psm == 6'h00), the per-pixel raster emit address is routed through the Ch119gs_swizzle_psmct32_stub(FBP=ras_fbp, FBW=ras_fbw, x=s2_x_q[11:0], y=s2_y_q[11:0]) and its output is the absolute byte address (FBP*2048 already baked in). At Ch122 only, PSMCT16/PSMT8/PSMT4 raster emits always took the linear path. Ch128 later closed the PSMCT16 raster gate and Ch134 closed the PSMT8 raster gate (each with its own per-PSM parameter on this samegs_stub); PSMT4 raster still takes the linear path. Default 0 keeps every existing PSMCT32 raster TB unchanged. Three-phase verification: (1) origin SPRITE — drive a single 16×4 SPRITE at FRAME_1{FBP=0, FBW=1, PSMCT32} with RGBAQ R=0x55/G=0xAA/B=0xCC/A=0x77, expect 64 emits, per-pixel byte readback via vram_stub's 2nd read port at swizzled addresses confirms each pixel lands where the swizzle says. Strict linear-vs-swizzled separators at bytes 512 and 768 (the linear formula's y=2 / y=3 row starts) stay 0 — proves the gate is live. (2) scanout agreement — enable the Ch120 swizzled- pcrtc path on the same VRAM contents, capture the full 16×4 frame, assert each visible pixel reads back the SPRITE's RGB. Both gs_stub (Ch122 raster) and gs_pcrtc_stub (Ch120 scanout) instantiate the same swizzle module; a successful capture proves the two integrations agree at byte level — what raster wrote at swizzled addresses comes out on r/g/b at the same (x, y). 
(3) non-origin SPRITE — re-arm the raster with FRAME_1{FBP=4, FBW=2, PSMCT32} and an 8×2 SPRITE at (60, 4)..(67, 5) crossing the page-x boundary at x=64 (so page_index actually changes mid-row). Pins three contracts the origin transfer can't distinguish from a buggy implementation: (a)ras_fbpreaches the swizzle'sfbpinput (FBP=0 in Phase 1 would have masked a tied-zero regression), (b)ras_fbwreaches the swizzle'sfbwinput (FBW=1 would have masked a tied-one regression), (c) the swizzle gets the FULL absolute pixel coords (s2_x_q, s2_y_q) rather than bbox-local coords (Phase 1's sprite started at (0,0) so absolute and local were equal there). Strict linear-vs- swizzled separator at byte 10480 (where the linear formula would land Phase-3's first pixel) stays 0. Total emit count after all phases: 64 + 16 = 80. With Ch120 (read), Ch121 (TRXDIR upload), and Ch122 (raster emit) all live, the three major PSMCT32 paths are byte-consistent end-to-end.tb_gs_image_xfer_swizzle_psmct32.sv(Ch121) — focused contract for the newPSMCT32_SWIZZLEparameter ongif_image_xfer_stub. When the parameter is set to 1 AND the upload's PSM is PSMCT32, per-pixel VRAM byte addresses are routed through the Ch119gs_swizzle_psmct32_stub(FBP=0, FBW=DBW, x=DSAX+cur_x, y=DSAY+ cur_y) anddest_base_q (= DBP*256)is added back to anchor at the upload's DBP base. PSMCT16/PSMT8/PSMT4 always take the linear path. Default 0 keeps every existing image-xfer TB unchanged. Three-phase verification: (1) origin transfer — TRXDIR upload of a 16×4 PSMCT32 image at DBP=DSAX=DSAY=0, DBW=1, RRW=16, RRH=4 → 64 pixels, 16 IMAGE qwords. After the upload completes, the TB reads VRAM via vram_stub's 2nd read port at the SWIZZLED byte address (TB-sideref_addr()mirrors the swizzle module) and asserts each pixel landed where the swizzle says. 
Strict linear-vs-swizzled separator: bytes 512 and 768 (where linear y=2 and y=3 rows would land) stay 0 under swizzled, since the 16×4 image only fills blocks (0,0) and (1,0) which together cover bytes [0..127] ∪ [256..383]. (2) scanout agreement — enable the Ch120 swizzled-pcrtc path on the same VRAM contents, capture the full 16×4 frame, assert each scanned-out pixel matches its uploaded color. Both upload and scanout instantiate the samegs_swizzle_psmct32_stub, so a successful capture proves the two integrations agree at byte level — what was written by TRXDIR comes out on r/g/b at the same (x, y). (3) non-origin transfer — re-arm with NONZERO DBP, DSAX, and DSAY (DBP=8, DSAX=4, DSAY=2, RRW=8, RRH=4) and verify each uploaded pixel lands atDBP*256 + swizzle(0, DBW, DSAX+x_local, DSAY+y_local). Phase 3 pins TWO contracts the origin transfer can't distinguish from a buggy implementation: (a)dest_base_q (= DBP*256)is correctly ADDED ON TOP of the swizzle output (with DBP=0 a missing-add regression would still pass), and (b) the swizzle is fed the FULL effective coordinates (with DSAX=DSAY=0 a "feeds only cur_x/cur_y" regression would still pass). Strict linear-vs-swizzled separator at byte 3088 (where the linear formula's y=2 row of the P3 image would land) stays 0 under swizzled. NOTE: gs_stub raster writes still use linear addressing — that wiring is a follow-on chapter.tb_gs_scanout_swizzle_psmct32.sv(Ch120) — focused contract for the newPSMCT32_SWIZZLEparameter ongs_pcrtc_stub. When the parameter is set to 1 AND the active PSM is PSMCT32, PCRTC reads VRAM at swizzled addresses (via the Ch119 swizzle module instantiated inside pcrtc) instead of the legacy linear formula. Other PSMs (CT16/T8/T4) andPSMCT32_SWIZZLE=0keep the original linear path unchanged. 
Topology: TB drivesvram_stub.write_*directly with each pixel's color preloaded at the swizzled byte address (TB-sideref_addr()mirrors the DUT swizzle math), then pcrtc withPSMCT32_SWIZZLE=1scans out the frame and the TB asserts each captured pixel matches the preloaded color. Image is 16×4 PSMCT32 (covers blocks (0,0) AND (1,0) horizontally) at FBP=0/FBW=1; pcrtc active area is 8×4 (block (0,0) entirely), but the swizzle vs. linear distinction shows up at any y>0 (linear y=1 → byte 64; swizzled byte 32) so even the in-window region is a strict separator. Per-pixel color is unique ({A=0xFF, B=y<<4, G=x<<4, R=0x10|(y*8+x)}) so any wrong- address commit surfaces immediately. NOTE: at Ch120 ONLY, gs_stub raster writes and gif_image_xfer_stub uploads still used linear addressing — Ch120 was read-side only. Ch121 (image-xfer) and Ch122 (raster) closed the write-side gates, and Ch123 demonstrates all three running together end-to-end.tb_gs_demo_psmt8_swizzle_trxdir_e2e.sv(Ch136) — companion to Ch135: same EE-bootlet → DMAC → GIF data plane and same all- three-gates-on instantiation, but the framebuffer is filled by a TRXDIR/IMAGE upload throughgif_image_xfer_stubinstead of by raster. The Ch133 PSMT8 image-xfer write-side swizzle gate becomes LOAD-BEARING inside the demo flow — every byte the GS produces comes out of the image-xfer engine at canonical PSMT8 swizzled addresses, and the raster path is dormant. Mirrors Ch124 PSMCT32 + Ch130 PSMCT16 TRXDIR demos for the third PSM. Payload: U1 (PACKED, NREG=4: BITBLTBUF{DBP=0, DBW=2, DPSM=PSMT8} / TRXPOS{DSAX=DSAY=0} / TRXREG{RRW=16, RRH=8} / TRXDIR{XDIR=0}) + U2 (IMAGE, NLOOP=8: 8 IMAGE qwords each carrying 16 PSMT8 bytes for the 16×8 image, row-major). DBW=2 is the minimum even DBW for PSMT8. DMAC QWC=14. Per-quadrant byte indices Q0=0xA0/Q1=0x40/Q2=0xC0/Q3=0x60 reused from Ch135 so the verify side is unchanged. 
Newtrxdir_arms_seencounter asserts =1 (single TRX setup) + xfer-side per-emit observer asserts every xfer_we pulse fires with be=4'b0001, mask= 0xFFFFFFFF (PSMT8 single-byte commit shape). Verification mirrors Ch135: (1) full 16×8 scanout frame capture; (2) per- pixel BYTE readback at the canonical swizzled byte address (withaddr[1:0]selecting the right byte from the 32-bit word) via vram_stub's 2nd port; (3) strict separators at bytes 128 and 256 stay 0. Aggregate counts:dma=(1,14,1) ee_dmac_wr=3 giftags=2 ad_writes=4 trxdir_arms=1 xfer_writes=128 ee_priv_wr=4 bridge_fires=4 core_halt=1 emits=0 frame=16x8. First-attempt PASS errors=0. Ch135 + Ch136 together close the PSMT8 byte-accuracy milestone end- to-end through the full driver-shaped flow — same Ch123+Ch124 (PSMCT32) and Ch129+Ch130 (PSMCT16) shape.tb_gs_demo_psmt4_swizzle_trxdir_e2e.sv(Ch142) — companion to Ch141 (raster-driven PSMT4 e2e): same EE-bootlet → DMAC → GIF data plane and same all-three-gates-on instantiation, but the framebuffer is filled by a TRXDIR/IMAGE upload throughgif_image_xfer_stubinstead of by raster. The Ch139 PSMT4 image-xfer write-side swizzle gate becomes LOAD-BEARING inside the demo flow — every nibble the GS produces comes out of the image-xfer engine at canonical PSMT4 swizzled (addr, nibble_hi) slots, and the raster path is dormant. Mirrors Ch124's PSMCT32 TRXDIR demo, Ch130's PSMCT16 TRXDIR demo, and Ch136's PSMT8 TRXDIR demo for the fourth (and last) common GS PSM. Cloned from Ch136 and surgically retargeted to PSMT4. Payload: U1 (PACKED, NREG=4: BITBLTBUF{DBP=0, DBW=2, DPSM=PSMT4} / TRXPOS{DSAX=DSAY=0} / TRXREG{RRW=16, RRH=8} / TRXDIR{XDIR=0}) + U2 (IMAGE, NLOOP=4 EOP=1: 4 IMAGE qwords carrying 32 PSMT4 nibbles each — at RRW=16 each qword holds 2 rows: lanes 0..15 = row 2qi, lanes 16..31 = row 2qi+1, matching Ch139's focused-TB packing). Total QWC = 10 (5+5). EE-bootlet DISPFB1 immediate identical to Ch141 (LUI 0x000A; ORI 0x0400 → PSM=PSMT4). 
Per-quadrant nibbles match Ch141 verbatim (Q0=0xA → 0xAA, Q1=0x4 → 0x44, Q2=0xC → 0xCC, Q3=0x6 → 0x66) so the verify side reuses Ch141's pattern unchanged. Verification mirrors Ch141: (1) full 16×8 scanout frame capture via Ch138 swizzled-pcrtc; (2) per-pixel NIBBLE readback at the canonical swizzled (addr, nibble_hi) slot via vram_stub's 2nd port (addr[1:0]-keyed byte selection then nibble_hi-keyed nibble selection); (3) strict linear-vs-swizzled separator at byte 128 stays 0 (per-byte check, not full word: a neighbor byte may legitimately be touched); (4) per-emit observer asserts every image-xfer write is `be=4'b0001/mask ∈ {0x0F, 0xF0}` (PSMT4 nibble RMW shape) and the `trxdir_wr_q` arming pulse fires exactly once. Aggregate counts: `dma=(1,10,1) ee_dmac_wr=3 giftags=2 ad_writes=4 trxdir_arms=1 xfer_writes=128 ee_priv_wr=4 bridge_fires=4 core_halt=1 emits=0 frame=16x8`. Ch141 + Ch142 together exercise BOTH PSMT4 write-side paths (raster Ch140 + image-xfer Ch139) end-to-end through the same driver-shaped flow with the same swizzled-scanout (Ch138) read side — bringing PSMT4 to full parity with the PSMCT32, PSMCT16, and PSMT8 e2e coverage from Ch123+Ch124, Ch129+Ch130, and Ch135+Ch136. Architectural milestone: this is the first state of the project where ALL FOUR common GS PSMs (CT32 + CT16 + T8 + T4) have BOTH a raster-driven AND a TRXDIR-driven driver-shaped end-to-end byte-accuracy demo — closing the four-PSM × three-path × dual-driver-shape e2e foundation (8 demos total). The bug-fix iteration: TB-side `ref_col_idx4` was first written with a 7-bit case key `{yb[2:0], xb[3:0]}` covering yb=0..7 in indices 0..127, but the values for yb=4..7 were miscopied from Ch139's yb=12..15 row (Ch139 only exercises yb=0..3 and yb=12..15). Phase 2 readback failed for all 64 pixels in y=4..7 with `got=0 expected=0xC/0x6` — the engine wrote the right nibbles to the right addresses (scanout passed), but the TB's ref looked at the wrong slot.
Fix: switched to Ch141's 9-bit case key `{yb[3:0], xb[4:0]}` and used Ch141's verified yb=0..7 values verbatim. First-attempt PASS after the table fix.

`tb_gs_demo_psmt4_swizzle_e2e.sv` (Ch141) — first driver-shaped end-to-end PSMT4 demo with all three PSMT4 swizzle gates (Ch138 read-side pcrtc, Ch139 image-xfer write-side, Ch140 raster write-side) parameter-set to 1 simultaneously, but with the demo flow exercising only the raster (Ch140) + scanout (Ch138) paths as load-bearing. The Ch139 image-xfer gate is smoke-only here (parameter is set but `xfer_writes_seen == 0` is asserted, since no TRXDIR/IMAGE packet is delivered in the raster-driven payload); the Ch139 load-bearing variant is the Ch142 TRXDIR-driven PSMT4 e2e (mirrors Ch124/Ch130/Ch136). PSMT4 counterpart of Ch123's PSMCT32 / Ch129's PSMCT16 / Ch135's PSMT8 e2e demos. Same EE-bootlet → DMAC → GIF data plane: BIOS-resident EE program configures GS-priv (DISPFB1 PSMT4 with FBW=2, DISPLAY1, PMODE) via `sw` instructions → kicks DMAC ch2 → SYSCALL halts. DMAC delivers a 24-qword payload (4 SPRITE PACKED packets) through `gif_packed_stub` into `gs_stub` raster. The 4 sprites tile the 16×8 active area into 4 quadrants with per-quadrant unique RGBAQ.R[3:0] nibbles (Q0=0xA → 0xAA, Q1=0x4 → 0x44, Q2=0xC → 0xCC, Q3=0x6 → 0x66). PSMT4 raster (Ch106) takes RGBAQ.R[3:0] as the nibble that hits VRAM via the existing Ch106 nibble RMW machinery (write_be=4'b0001 + write_mask 0x0F or 0xF0); Ch140 keys the high/low nibble selector off the swizzle's `nibble_hi` output instead of `s2_pixel_index[0]`. PCRTC's Ch103 PSMT4 grayscale fallback (clut_enable=0) surfaces the nibble as r=g=b={n, n} at scanout, so each captured pixel IS the nibble we wrote (no CLUT setup needed for this demo; a CLUT-driven Ch141 variant is a future chapter). With the raster gate on, all 128 per-pixel nibble stores go through `gs_swizzle_psmt4_stub`; with the pcrtc gate on, scanout reads from those same swizzled (addr, nibble_hi) slots.
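The nibble read-modify-write shape and the grayscale fallback described in this entry can be modeled in a few lines of Python. This is an illustrative sketch only, not the RTL: the helper names `merge_masked`, `write_nibble`, and `gray_from_nibble` are hypothetical, and the sketch assumes the write data carries the nibble already shifted into the selected half of byte lane 0.

```python
def merge_masked(old_word: int, data: int, mask: int) -> int:
    """Per-bit masked merge: only bits set in `mask` take the new data."""
    return (old_word & ~mask & 0xFFFFFFFF) | (data & mask)


def write_nibble(old_word: int, nibble: int, nibble_hi: bool) -> int:
    """Model one PSMT4 store: write_mask 0xF0 (high nibble) or 0x0F (low nibble)."""
    mask = 0xF0 if nibble_hi else 0x0F
    # Assumption: the nibble is pre-shifted into the half selected by nibble_hi.
    data = (nibble & 0xF) << (4 if nibble_hi else 0)
    return merge_masked(old_word, data, mask)


def gray_from_nibble(n: int) -> int:
    """PSMT4 grayscale fallback: each channel is the nibble replicated, {n, n}."""
    return ((n & 0xF) << 4) | (n & 0xF)


word = write_nibble(0x00000000, 0xA, nibble_hi=False)  # low nibble first
word = write_nibble(word, 0x4, nibble_hi=True)         # neighbor keeps its value
print(hex(word))                   # 0x4a
print(hex(gray_from_nibble(0xA)))  # 0xaa
```

The second store leaves the low nibble untouched, which is the property the per-bit write mask is there to guarantee.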
Two-phase verification: (1) full-frame scanout asserts each (x, y) reads back its quadrant's nibble as PSMT4 grayscale r=g=b={n, n}; (2) per-pixel NIBBLE readback at the canonical swizzled address (with `addr[1:0]` selecting the right byte from the 32-bit word, then `nibble_hi` selecting which nibble of that byte) via vram_stub's 2nd port — the 16×8 PSMT4 image lives entirely in the upper-left of block (0,0) of page 0 (PSMT4 block = 32×16 px) and the within-block columnTable4 yb=0..7 / xb=0..15 exercises nibble_idx values [0..127]. Strict linear-vs-swizzled separator at byte 128 (linear y=2 row start at PSMT4 stride=64 with FBW=2) stays 0 — outside block (0,0)'s touched range. Per-emit observer locks PSM=0x14, be=4'b0001, mask ∈ {0x0F, 0xF0}. Aggregate counts: `dma=(1,24,1) ee_dmac_wr=3 giftags=4 ad_writes=20 xfer_writes=0 ee_priv_wr=4 bridge_fires=4 core_halt=1 emits=128 frame=16x8`. First-attempt PASS errors=0. Together with Ch123 (PSMCT32 e2e), Ch129 (PSMCT16 e2e), and Ch135 (PSMT8 e2e), this is the first state of the project where the full driver-shaped flow has end-to-end byte-accuracy demos for ALL FOUR common GS PSMs (CT32 + CT16 + T8 + T4) under software-shaped raster traffic. The TRXDIR-driven PSMT4 companion landed at Ch142 (mirror of Ch124/Ch130/Ch136 making Ch139 load-bearing), so Ch141 + Ch142 together close the PSMT4 byte-accuracy milestone end-to-end through both driver shapes — bringing PSMT4 to full parity with CT32/CT16/T8.

`tb_gs_demo_psmt8_swizzle_e2e.sv` (Ch135) — first driver-shaped end-to-end PSMT8 demo with all three PSMT8 swizzle gates (Ch132 read-side pcrtc, Ch133 image-xfer write-side, Ch134 raster write-side) parameter-set to 1 simultaneously, but with the demo flow exercising only the raster (Ch134) + scanout (Ch132) paths as load-bearing.
The Ch133 image-xfer gate is smoke-only here (parameter is set but `xfer_writes_seen == 0` is asserted, since no TRXDIR/IMAGE packet is delivered in the raster-driven payload); the Ch133 load-bearing variant is the Ch136 TRXDIR-driven PSMT8 e2e (mirror of Ch124/Ch130). PSMT8 counterpart of Ch123's PSMCT32 / Ch129's PSMCT16 e2e demos. Same EE-bootlet → DMAC → GIF data plane: BIOS-resident EE program configures GS-priv (DISPFB1 PSMT8 with FBW=2, DISPLAY1, PMODE) via `sw` instructions → kicks DMAC ch2 → SYSCALL halts. DMAC delivers a 24-qword payload (4 SPRITE PACKED packets) through `gif_packed_stub` into `gs_stub` raster. The 4 sprites tile the 16×8 active area into 4 quadrants with per-quadrant unique RGBAQ.R values (Q0=0xA0, Q1=0x40, Q2=0xC0, Q3=0x60). PSMT8 raster (Ch105) takes the natural ABGR's R channel as the byte index that hits VRAM; PCRTC's Ch96 grayscale fallback (clut_enable=0) surfaces the byte as R=G=B at scanout, so each captured pixel IS the byte we wrote (no CLUT setup needed for this demo; a CLUT-driven Ch135 variant is a future chapter). With the raster gate on, all 128 per-pixel byte stores go through `gs_swizzle_psmt8_stub`; with the pcrtc gate on, scanout reads from those same swizzled addresses. Two-phase verification: (1) full-frame scanout asserts each (x, y) reads back its quadrant's byte as PSMT8 grayscale R=G=B; (2) per-pixel BYTE readback at the canonical swizzled address (with `addr[1:0]` selecting the right byte from the 32-bit word) via vram_stub's 2nd port — the 16×8 PSMT8 image lives entirely in the upper half of block (0,0) of page 0 (PSMT8 block = 16×16 px) and the within-block columnTable8 yb=0..7 exercises byte values [0..127]. Strict linear-vs-swizzled separators at bytes 128 (linear y=1 row start at PSMT8 stride=128 with FBW=2) and 256 (linear y=2) stay 0 — both outside block (0,0)'s touched range. Aggregate counts: `dma=(1,24,1) ee_dmac_wr=3 giftags=4 ad_writes=20 xfer_writes=0 ee_priv_wr=4 bridge_fires=4 core_halt=1 emits=128 frame=16x8`.
Together with Ch123 (PSMCT32 e2e) and Ch129 (PSMCT16 e2e), this was the first state of the project where the full driver-shaped flow had end-to-end byte-accuracy demos for the CT32/CT16/T8 trio under software-shaped traffic. PSMT4 was the natural follow-on and landed at Ch141 (raster-driven, mirror of this demo) + Ch142 (TRXDIR-driven, mirror of Ch136), closing the four-PSM × dual-driver-shape e2e matrix.

`tb_gs_demo_psmct16_swizzle_trxdir_e2e.sv` (Ch130) — companion to Ch129: same EE-bootlet → DMAC → GIF data plane and same all-three-gates-on instantiation, but the framebuffer is filled by a TRXDIR/IMAGE upload through `gif_image_xfer_stub` instead of by raster. The Ch127 image-xfer write-side swizzle gate becomes LOAD-BEARING inside the demo flow — every byte the GS produces comes out of the image-xfer engine at canonical PSMCT16 swizzled addresses, and the raster path is dormant. Payload: U1 (PACKED, NREG=4: BITBLTBUF{DBP=0, DBW=1, DPSM=PSMCT16} / TRXPOS{DSAX=DSAY=0} / TRXREG{RRW=16, RRH=8} / TRXDIR{XDIR=0}) + U2 (IMAGE, NLOOP=16: 16 IMAGE qwords carrying the 128 PSMCT16
halfwords of the same four-quadrant pattern Ch129 used). DMAC QWC = 22. Verification mirrors Ch129: (1) full 16×8 scanout frame capture; (2) per-pixel halfword readback at the canonical swizzled byte address (with `addr[1]` selecting the right 16-bit slot) via vram_stub's 2nd read port; (3) strict linear-vs-swizzled separators at bytes 256 and 384 stay 0; (4) per-emit observer asserts every image-xfer write is `be=4'b0011/mask=0xFFFF_FFFF` (low halfword) and the `trxdir_wr_q` arming pulse fires exactly once. Aggregate counts: `dma=(1,22,1) ee_dmac_wr=3 giftags=2 ad_writes=4 trxdir_arms=1 xfer_writes=128 ee_priv_wr=4 bridge_fires=4 core_halt=1 emits=0 frame=16x8`. Ch129 + Ch130 together exercise BOTH PSMCT16 write-side paths (raster Ch128 + image-xfer Ch127) end-to-end through the same driver-shaped flow with the same swizzled-scanout (Ch126) read side — bringing PSMCT16 to full parity with the PSMCT32 e2e coverage from Ch123 + Ch124.
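The DMAC QWC values quoted for the three 16×8 TRXDIR demos (22 for PSMCT16, 14 for PSMT8, 10 for PSMT4) follow from simple qword arithmetic. The sketch below is a hedged reconstruction rather than TB code: the function names are hypothetical, and it assumes each packet costs one GIF tag qword plus its data qwords (four setup registers in U1, packed IMAGE data in U2).

```python
def image_qwords(width: int, height: int, bits_per_pixel: int) -> int:
    """Number of 128-bit IMAGE qwords needed for a width x height upload."""
    total_bits = width * height * bits_per_pixel
    assert total_bits % 128 == 0, "sketch assumes a whole number of qwords"
    return total_bits // 128


def trxdir_payload_qwc(width: int, height: int, bits_per_pixel: int,
                       setup_regs: int = 4) -> int:
    """QWC for U1 (GIF tag + setup registers) plus U2 (GIF tag + IMAGE data)."""
    u1 = 1 + setup_regs  # BITBLTBUF / TRXPOS / TRXREG / TRXDIR
    u2 = 1 + image_qwords(width, height, bits_per_pixel)
    return u1 + u2


for name, bpp in (("PSMCT16", 16), ("PSMT8", 8), ("PSMT4", 4)):
    print(name, trxdir_payload_qwc(16, 8, bpp))
# PSMCT16 22
# PSMT8 14
# PSMT4 10
```

The same arithmetic gives the NLOOP values in the payload descriptions: 16, 8, and 4 IMAGE qwords respectively.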

`tb_gs_demo_psmct16_swizzle_e2e.sv` (Ch129) — full driver-shaped end-to-end demo with all three PSMCT16 swizzle gates (Ch126 read-side pcrtc, Ch127 image-xfer write-side, Ch128 raster write-side) parameter-set to 1 simultaneously, but with the demo flow exercising only the raster (Ch128) + scanout (Ch126) paths as load-bearing. The Ch127 image-xfer gate is smoke-only here (parameter is set but `xfer_writes_seen == 0` is asserted, since no TRXDIR/IMAGE packet is delivered in the raster-driven payload); Ch130 (TRXDIR-driven PSMCT16 e2e) is the load-bearing image-xfer-side counterpart. PSMCT16 counterpart of Ch123's PSMCT32 e2e demo. Same EE-bootlet → DMAC → GIF data plane: BIOS-resident EE program configures GS-priv (DISPFB1 PSMCT16, DISPLAY1, PMODE) via `sw` instructions → kicks DMAC ch2 → SYSCALL halts. DMAC delivers a 24-qword payload (4 SPRITE PACKED packets) through `gif_packed_stub` into `gs_stub` raster. The 4 sprites tile the 16×8 active area into 4 quadrants with per-quadrant unique RGB5A1 colors picked so the 5→8 bit-replicate at PCRTC output produces unique 8-bit RGB triples. With the raster gate on, all 128 per-pixel halfword stores go through `gs_swizzle_psmct16_stub`; with the pcrtc gate on, scanout reads from those same swizzled addresses. Two-phase verification: (1) full-frame scanout asserts each (x, y) reads back its quadrant's 5→8-expanded RGB; (2) per-pixel halfword readback via vram_stub's 2nd port at swizzled addresses (with `addr[1]` selecting the right 16-bit slot) confirms each sprite halfword landed where the swizzle says — the 16×8 PSMCT16 image lives entirely in block (0,0) of page 0 (PSMCT16 block = 16×8 px), so the readback exercises ALL 16 xb × 8 yb entries of `columnTable16`. Strict linear-vs-swizzled separators at bytes 256 (linear y=2 row start at PSMCT16 stride=128) and 384 (linear y=3) stay 0 — both outside block (0,0)'s 256-byte range.
Aggregate counts: `dma=(1,24,1) ee_dmac_wr=3 giftags=4 ad_writes=20 xfer_writes=0 ee_priv_wr=4 bridge_fires=4 core_halt=1 emits=128 frame=16x8`. Together with Ch123 (PSMCT32 e2e), this is the first state of the project where the full driver-shaped flow has end-to-end byte-accuracy demos for BOTH direct-color PS2 PSMs.

`tb_gs_raster_swizzle_psmct16.sv` (Ch128) — focused contract for the new `PSMCT16_SWIZZLE` parameter on `gs_stub` (the raster emit surface). Mirrors Ch122's wiring shape but for PSMCT16: when the parameter is 1 AND the active raster PSM is PSMCT16 (ras_psm == 6'h02), the per-pixel raster emit address is routed through the Ch125 `gs_swizzle_psmct16_stub` (FBP=ras_fbp, FBW=ras_fbw, x=s2_x_q[11:0], y=s2_y_q[11:0]) — its output is the absolute byte address. PSMCT32 is gated by its own `PSMCT32_SWIZZLE` parameter (Ch122). At Ch128 only, PSMT8/PSMT4 raster emits stayed linear; Ch134 later closed the PSMT8 raster gate via `PSMT8_SWIZZLE` on this same `gs_stub`, and Ch140 closed the PSMT4 raster gate via `PSMT4_SWIZZLE`. Default 0 keeps every existing PSMCT16 raster TB (Ch95 etc.) unchanged. Three-phase verification: (1) origin SPRITE — drive a 16×4 PSMCT16 SPRITE at FRAME_1{FBP=0, FBW=1, PSMCT16} with RGBAQ {R=0xAA, G=0x50, B=0xC0, A=0x00} → halfword 0x6155 (R5=0x15, G5=0x0A, B5=0x18, A1=0). Per-pixel halfword readback via vram_stub's 2nd port (with `addr[1]` selecting the right 16-bit slot) confirms each lands at the swizzled byte. The 16×4 image lives in block (0,0) of page (0,0), so within-block columnTable16 rows 0..3 are exercised. Strict separators: bytes 128 (linear y=1 row start at PSMCT16 stride=128) and 256 (linear y=2) stay 0 — proves the gate is live, since a fall-through to the legacy linear path would put the SPRITE halfword there. (2) scanout agreement — enable the Ch126 swizzled-pcrtc path on the same VRAM contents, capture the full 16×4 frame, assert each visible pixel reads back the expected RGB after PCRTC's 5→8 bit-replicate (RGB = {0xAD, 0x52, 0xC6}).
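The Phase 1 color arithmetic above (RGBAQ → halfword 0x6155 → scanned-out RGB {0xAD, 0x52, 0xC6}) can be checked with a short Python sketch. The function names are hypothetical, and taking A1 from the top bit of the 8-bit alpha is an assumption made for illustration; the text only states that A=0x00 yields A1=0.

```python
def pack_rgb5a1(r8: int, g8: int, b8: int, a8: int) -> int:
    """Truncate 8-bit channels to 5 bits and pack as A1:B5:G5:R5."""
    r5, g5, b5 = r8 >> 3, g8 >> 3, b8 >> 3
    a1 = a8 >> 7  # assumption: alpha bit taken from the MSB of the 8-bit alpha
    return (a1 << 15) | (b5 << 10) | (g5 << 5) | r5


def expand5to8(v5: int) -> int:
    """5 -> 8 bit-replicate: top 3 bits of the field are repeated in the low bits."""
    return (v5 << 3) | (v5 >> 2)


def unpack_rgb888(halfword: int) -> tuple:
    """Expand a packed RGB5A1 halfword to an (R, G, B) triple of 8-bit values."""
    r5 = halfword & 0x1F
    g5 = (halfword >> 5) & 0x1F
    b5 = (halfword >> 10) & 0x1F
    return expand5to8(r5), expand5to8(g5), expand5to8(b5)


hw = pack_rgb5a1(0xAA, 0x50, 0xC0, 0x00)
print(hex(hw))                              # 0x6155
print([hex(c) for c in unpack_rgb888(hw)])  # ['0xad', '0x52', '0xc6']
```

Because the expansion is not the inverse of the truncation, the scanout assertion has to compare against the expanded triple, not the original RGBAQ bytes.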
Both gs_stub (Ch128 raster) and gs_pcrtc_stub (Ch126 scanout) instantiate the same swizzle module. (3) non-origin SPRITE — re-arm with FRAME_1{FBP=4, FBW=2, PSMCT16} and an 8×4 SPRITE at (60, 4)..(67, 7) with distinct color (halfword 0x9F8E). Crosses the PAGE-x boundary at x=64 (page (0,0) for x∈[60..63] — block (0,3) by swizzle table — vs page (1,0) for x∈[64..67] — block (0,0)) so page_index changes mid-row. Within-block column-table coords (xb=12..15 then 0..3, yb=4..7) cover columnTable16 rows 4..7 — a different region than Phase 1's yb=0..3. Pins three contracts Phase 1 can't: (a) `ras_fbp` reaches the swizzle's `fbp` input (FBP=0 in P1 would mask a tied-zero); (b) `ras_fbw` reaches `fbw` (FBW=1 in P1 would mask a tied-one); (c) the swizzle gets the FULL absolute pixel coords s2_x_q/s2_y_q rather than bbox-local (P1's sprite started at (0,0), so absolute and local were equal). Strict P3 separator at byte 9336 (linear formula's effective (60, 4) byte) stays 0 — outside the P3 swizzled write set, which lives in block (0,3) of page (0,0) (10914..11006) and block (0,0) of page (1,0) (16512..16604). Total emit count after all phases: 64 + 32 = 96. With Ch126 (read), Ch127 (TRXDIR upload), and Ch128 (raster emit) all live, the three major PSMCT16 paths are byte-consistent end-to-end — completes the byte-accuracy milestone for the second PSM, mirroring the Ch120/Ch121/Ch122 PSMCT32 closure.

`tb_gs_image_xfer_swizzle_psmct16.sv` (Ch127) — focused contract for the new `PSMCT16_SWIZZLE` parameter on `gif_image_xfer_stub`. Mirrors Ch121's wiring shape but for PSMCT16: when the parameter is 1 AND the upload's PSM is PSMCT16, per-pixel byte addresses route through the Ch125 `gs_swizzle_psmct16_stub` (FBP=0, FBW=DBW, x=DSAX+cur_x, y=DSAY+cur_y) and `dest_base_q (= DBP*256)` is added back to anchor at the upload's DBP base. PSMCT32 is gated by its own PSMCT32_SWIZZLE parameter (Ch121); PSMT8/T4 were always linear at Ch127 (their image-xfer gates landed later, at Ch133 and Ch139). Default 0 keeps every existing PSMCT16 image-xfer TB unchanged.
Three-phase verification: (1) origin transfer — TRXDIR upload of a 16×4 PSMCT16 image at DBP=DSAX=DSAY=0, DBW=1, RRW=16, RRH=4 → 64 pixels, 8 IMAGE qwords (8 px/qword for PSMCT16). After upload, the TB reads vram_stub's 2nd port at the SWIZZLED byte address (TB-side `ref_addr16/ref_block_idx16/ref_col_idx16` carry the verbatim PCSX2 tables locked at Ch125) and asserts each halfword landed where the swizzle says (selecting the right 16-bit slot inside the 32-bit word via `addr[1]`). Strict linear-vs-swizzled separators at bytes 128 (linear y=1) and 256 (linear y=2) stay 0 — swizzled writes for the 16×4 image fill only block (0,0) bytes [0..126]. (2) scanout agreement — enable the Ch126 swizzled-pcrtc path on the same VRAM contents, capture the full 16×4 frame, assert each scanned pixel matches the uploaded RGB5A1 → RGB888 5→8 bit-replicate. Both upload and scanout instantiate the same `gs_swizzle_psmct16_stub`. (3) non-origin transfer — re-arm with DBP=8, DSAX=12, DSAY=6, RRW=8, RRH=4. Effective coords (12..19, 6..9) cross block_x=0→1 at effective_x=16 AND block_y=0→1 at effective_y=8, exercising both block-table dimensions inside a single non-origin upload. Pins three contracts the origin transfer can't distinguish from a buggy implementation: (a) `dest_base_q (= DBP*256)` is added on top of the swizzle output (DBP=0 in P1 would mask a missing-add); (b) the swizzle is fed the FULL effective coords (DSAX=DSAY=0 in P1 would mask a "feeds only cur_x/cur_y" regression); (c) BOTH block_x and block_y propagate through `blockTable16[by][bx]` (block_x=0 throughout P1 would mask a tied-block_x regression). Strict P3 separator at byte 3096 (linear formula's effective (12, 8) byte) stays 0 — outside the P3 swizzled write set [2048..3071].
NOTE (now historical): PSMCT16 raster swizzle was deferred when Ch127 landed; it shipped at Ch128 (mirrors Ch122 for PSMCT32) so the PSMCT16 raster path is now byte-consistent with the image-xfer path documented here.

`tb_gs_raster_swizzle_psmt4.sv` (Ch140) — focused contract for the new `PSMT4_SWIZZLE` parameter on `gs_stub` (the raster emit surface). Mirrors Ch122/Ch128/Ch134 wiring shape but for the fourth (and last) PSM, and threads the Ch137 swizzle module's `nibble_hi` output into the existing Ch106 PSMT4 raster nibble RMW data lane (replacing `s2_pixel_index[0]` as the high/low nibble selector when the gate is on). When the parameter is 1 AND the active raster PSM is PSMT4 (ras_psm == 6'h14), the per-pixel raster emit address is routed through the Ch137 `gs_swizzle_psmt4_stub` (FBP=ras_fbp, FBW=ras_fbw, x=s2_x_q[11:0], y=s2_y_q[11:0]) — its `addr` output is the absolute byte address, AND its `nibble_hi` output keys `s2_emit_color64`'s nibble placement and `s2_emit_mask`'s high/low gating (write_be stays 4'b0001 for both paths). PSMCT32/PSMCT16/PSMT8 are gated by their own parameters; default 0 keeps every existing PSMT4 raster TB (Ch106 raster_psmt4, Ch107 PSMT4-e2e, Ch103 PSMT4+CLUT, Ch104 round-trip, etc.) on the original linear addressing. No new ports. Default-off smoke verification: ran Ch106 + Ch107 + Ch103 + Ch104 PSMT4 TBs before writing the new TB; all PASSed unchanged. Three-phase verification (mirrors Ch134 PSMT8 raster shape, with PSMT4 nibble adaptations + CLUT-disabled grayscale at scanout): (1) origin SPRITE at FBP=0/FBW=2 (FBW must be even per PCSX2 GSLocalMemory.h:560 — same as PSMT8). Drive a 16×4 PSMT4 SPRITE with RGBAQ.R=0xAA (PSMT4 raster channel takes R[3:0] as the nibble per Ch106 → nibble = 0xA). Per-pixel nibble readback via vram_stub's 2nd port (with `addr[1:0]`-keyed byte selection then `nibble_hi`-keyed nibble selection inside the byte) confirms each pixel landed at the correct (byte, nibble) slot.
The image lives in the upper-left of block (0,0) of page (0,0); within-block columnTable4 entries for yb=0..3, xb=0..15 cover nibble_idx values [0..127] → byte_in_block ∈ [0..63]. Strict separator: byte 64 (linear y=1 row start at PSMT4 FBW=2 stride 64) stays 0. (2) scanout agreement — enable Ch138 swizzled-pcrtc on the same VRAM, capture full 16×4 frame, assert each pixel reads back as PSMT4 grayscale R=G=B={0xA, 0xA} = 0xAA. Both gs_stub and gs_pcrtc_stub instantiate the same `gs_swizzle_psmt4_stub` AND thread its `nibble_hi` output through their respective nibble selectors — agreement at this layer means both integrations land at the same byte+nibble positions for PSMT4. (3) non-origin SPRITE at FBP=4/FBW=4 (bw_pg=2) drawing 8×4 SPRITE at (124, 4)..(131, 7) with R=0x55 (nibble = 0x5). Crosses PSMT4 PAGE-x at x=128 (page (0,0) for x∈[124..127], page (1,0) for x∈[128..131]). 2 blocks visited: blockTable4[0][3]=10 → page (0,0) block_base 10752; blockTable4[0][0]=0 → page (1,0) block_base 16384. Pins three contracts the origin transfer can't: ras_fbp reaches the swizzle's fbp input; ras_fbw reaches fbw; the swizzle gets the FULL absolute pixel coords s2_x_q/s2_y_q. Strict P3 separator at byte 8766 (linear (124, 4) at FBP=4/FBW=4) stays 0 — outside the P3 swizzled write set [10752..11007] + [16384..16639]. Total emit count: 64 + 32 = 96. First-attempt PASS errors=0. With Ch138 (read-side), Ch139 (TRXDIR upload), and Ch140 (raster emit) all live, the three major PSMT4 paths can be byte-consistent under the canonical swizzle when their gates are flipped on — completing the four-PSM × three-path byte-accuracy foundation (CT32 Ch120/Ch121/Ch122 + CT16 Ch126/Ch127/Ch128 + T8 Ch132/Ch133/Ch134 + T4 Ch138/Ch139/Ch140). End-to-end PSMT4 swizzled demos (mirroring Ch123/Ch124, Ch129/Ch130, Ch135/Ch136) are now possible.

`tb_gs_raster_swizzle_psmt8.sv` (Ch134) — focused contract for the new `PSMT8_SWIZZLE` parameter on `gs_stub` (the raster emit surface).
Mirrors Ch122's PSMCT32 + Ch128's PSMCT16 wiring shape but for the third PSM: when the parameter is 1 AND the active raster PSM is PSMT8 (ras_psm == 6'h13), the per-pixel raster emit address is routed through the Ch131 `gs_swizzle_psmt8_stub` (FBP=ras_fbp, FBW=ras_fbw, x=s2_x_q[11:0], y=s2_y_q[11:0]) — its output is the absolute byte address. PSMCT32/PSMCT16 are gated by their own parameters; PSMT4 stayed linear at Ch134 (its raster gate landed later, at Ch140). Default 0 keeps every existing PSMT8 raster TB (Ch105 raster_psmt8, Ch107 PSMT4-via-CT16-CLUT palette path, etc.) on the original linear addressing. No new ports — parameter-only API change. Default-off smoke verification: ran Ch105 `tb_gs_raster_psmt8` before writing the new TB; PASSed unchanged. Three-phase verification (mirrors Ch128 PSMCT16 raster shape): (1) origin SPRITE at FBP=0/FBW=2 (DBW must be even — PCSX2 asserts `(bw & 1) == 0` for PSMT8). Drive a 16×8 PSMT8 SPRITE with RGBAQ.R=0xA5 (PSMT8 raster channel uses R as the byte index per Ch105). Per-pixel byte readback via vram_stub's 2nd port confirms each lands at the swizzled byte. The 16×8 image lives in the upper half of block (0,0) of page (0,0); the within-block columnTable8 distributes the 128 bytes across yb rows 0..7 — byte values 0..127 within the block. Strict separators: bytes 128 (linear y=1 row start at PSMT8 stride=128) and 256 (linear y=2) stay 0 — proves the gate is live, since a fall-through to the legacy linear path would put the SPRITE byte there. (2) scanout agreement — enable the Ch132 swizzled-pcrtc path on the same VRAM, capture the full 16×8 frame, assert each pixel's PCRTC PSMT8 grayscale R=G=B matches `idx=0xA5`. Both gs_stub and gs_pcrtc_stub instantiate the same `gs_swizzle_psmt8_stub`, so success proves byte-level agreement. (3) non-origin SPRITE at FBP=4/FBW=4 (bw_pg=2) drawing 8×4 SPRITE at (124, 4)..(131, 7) with RGBAQ.R=0x5A.
Crosses PSMT8 PAGE-x at x=128 (x∈[124..127] is in page (0,0) block (0,7) by swizzle table; x∈[128..131] is in page (1,0) block (0,0)) so page_index changes mid-row. Pins three contracts the origin transfer can't: `ras_fbp` reaches the swizzle's fbp input (FBP=0 in P1 would mask a tied-zero); `ras_fbw` reaches fbw (FBW=2 would mask a tied-two); the swizzle gets the FULL absolute pixel coords s2_x_q/s2_y_q rather than bbox-local (P1 sprite started at (0,0) so absolute=local). PSMT8's page-x boundary at x=128 is different from CT32/CT16's x=64, so this exercises the PSMT8-specific x[7] wiring of the swizzle. Strict P3 separator at byte 9340 (linear (124, 4) at FBP=4/FBW=4) stays 0 — outside the P3 swizzled write set (page (0,0) block (0,7) at base 13568, page (1,0) block (0,0) at base 16384). Total emit count: 128 + 32 = 160. First-attempt PASS errors=0. With Ch132 (read-side), Ch133 (TRXDIR upload), and Ch134 (raster emit) all live, the three major PSMT8 paths can be byte-consistent under the canonical swizzle when their gates are flipped on — completing the third-PSM byte-accuracy milestone for ALL three integration points (mirrors the Ch120/Ch121/Ch122 PSMCT32 trio + the Ch126/Ch127/Ch128 PSMCT16 trio).

`tb_gs_image_xfer_swizzle_psmt4.sv` (Ch139) — focused contract for the new `PSMT4_SWIZZLE` parameter on `gif_image_xfer_stub`. Mirrors Ch121/Ch127/Ch133 wiring shape but for the fourth (and last) PSM, and threads the Ch137 swizzle module's `nibble_hi` output into the existing Ch118 nibble RMW data lane (replacing `x_eff[0]` as the high/low nibble selector when the gate is on). When the parameter is 1 AND the active DPSM is PSMT4, the per-pixel byte address is `dest_base_q (= DBP*256) + swizzle_psmt4(FBP=0, FBW=DBW, x=DSAX+cur_x, y=DSAY+cur_y).addr`, AND `cur_mask_c` is `0x0000_00F0` when `swizzle4_nibble_hi=1` (high nibble) or `0x0000_000F` when 0 (low nibble) — the per-bit write_mask machinery (vram_stub merges only the targeted nibble) layers on top of the swizzled address.
PSMCT32/PSMCT16/PSMT8 are gated by their own parameters. Default 0 keeps the legacy linear path for every existing PSMT4 image-xfer TB (Ch118 etc.). No new ports — parameter-only API change. Default-off smoke verification: ran Ch118 `tb_gs_image_xfer_psmt4` before writing the new TB; PASSed unchanged. Three-phase verification (mirrors Ch127/Ch133 audit-closed shape): (1) origin write-side lock at DBP=0/DBW=2/DSAX=DSAY=0 (DBW must be even per PCSX2 GSLocalMemory.h:560 — same FBW-evenness as PSMT8). 16×4 PSMT4 image upload via 2 IMAGE qwords (32 px/qword for PSMT4, so the 2 qwords cover 4 rows of 16 px at RRW=16). After upload, per-pixel nibble readback at the swizzled `(addr, nibble_hi)` slot asserts each nibble landed where the swizzle says. Strict separator: PSMT4 row stride at DBW=2 = `DBW*32` = 64 bytes, so linear y=1 starts at byte 64. Swizzled write set lives in [0..63] within block (0,0). Byte 64 stays 0 (verified via per-byte check, not full-word — the `check_byte_zero` task initially had a full-word bug that misreported neighbor-byte writes; fixed to check only the targeted byte via `addr[1:0]`-keyed case statement). (2) end-to-end agreement: enable Ch138 PSMT4 swizzled scanout on the same VRAM (PSMT4_SWIZZLE=1 on pcrtc, CLUT disabled), capture the 16×4 frame, verify each pixel's grayscale R=G=B={nibble, nibble} matches `nibble_at(xx, yy)`. Both modules instantiate the same `gs_swizzle_psmt4_stub` so success proves byte+nibble-level agreement under TRXDIR-style emit + scanout-style read. (3) non-origin transfer at DBP=8/DBW=2/DSAX=28/DSAY=12/RRW=8/RRH=8. Effective coords (28..35, 12..19) cross block_x=0→1 at effective_x=32 AND block_y=0→1 at effective_y=16 (PSMT4 block geometry: 32×16 px). All 4 corner blocks of page (0,0) at DBP=8 visited: blockTable4[0][0]=0, [0][1]=2, [1][0]=1, [1][1]=3 (block bases 2048/2560/2304/2816).
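The four Phase 3 block bases quoted above can be reproduced from the stated geometry. The Python sketch below is a hedged illustration, not the DUT math: `block_base` and `BLOCK_TABLE4_CORNER` are hypothetical names, only the four table entries quoted in the text are included (the full table is not reproduced), and a 256-byte block (32×16 px at 4 bits per pixel) is assumed.

```python
BLOCK_BYTES = 256            # 32 x 16 px at 4 bits per pixel
BLOCK_W, BLOCK_H = 32, 16    # PSMT4 block geometry in pixels

# Only the four corner entries quoted in the text, keyed as (block_y, block_x).
BLOCK_TABLE4_CORNER = {(0, 0): 0, (0, 1): 2, (1, 0): 1, (1, 1): 3}


def block_base(dbp: int, x: int, y: int) -> int:
    """Byte base of the block holding effective pixel (x, y) inside page (0,0)."""
    bx, by = x // BLOCK_W, y // BLOCK_H
    dest_base = dbp * 256  # added on top of the swizzle output
    return dest_base + BLOCK_TABLE4_CORNER[(by, bx)] * BLOCK_BYTES


# Effective coords of the non-origin transfer: x in 28..35, y in 12..19.
bases = sorted({block_base(8, x, y) for x in range(28, 36) for y in range(12, 20)})
print(bases)  # [2048, 2304, 2560, 2816]
```

All four bases appear only because the 8×8 window straddles both block boundaries, which is what lets this phase detect a tied-off block_x or block_y.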
Pins three contracts the origin transfer can't: dest_base_q ADDED ON TOP of the swizzle output (DBP=0 in P1 would mask a missing-add regression — fixed during bring-up after the TB initially passed P3_DBP directly to ref_pos_psmt4 instead of using fbp_v=0 + adding `DBP*256`); FULL effective coords; BOTH block_x and block_y propagate through `blockTable4[by][bx]`. Phase 3 strict separator: linear formula puts effective coord (28, 12) at byte 2830 — under linear, the neighboring pixel (29, 12) writes high nibble = 1 to that byte. Under swizzled, no Phase-3 pixel hits byte 2830 (cross-checked: col_idx_psmt4 for the 4-block × 16-pixel coord set never produces nibble_idx 28 or 29). Byte 2830 stays 0 → fall-through to linear would have stomped it with 0x10. PASS errors=0 after two bug-fix iterations: (a) ref_pos_psmt4(P3_DBP, ...) was wrong — engine feeds FBP=0 to the swizzle and adds DBP*256 separately, so TB must do the same; (b) check_byte_zero tested the full word instead of the targeted byte, producing false failures when a neighbor byte in the same word was independently touched. Counts: arms=2, writes=128 (P1 64 + P3 64). With Ch138 (read-side scanout) + Ch139 (image-xfer write-side) + Ch140 (raster write-side) all live, the Ch137 PSMT4 primitive now has all 3 integration points wired, and Ch141 closes the e2e demo.

`tb_gs_image_xfer_swizzle_psmt8.sv` (Ch133) — focused contract for the new `PSMT8_SWIZZLE` parameter on `gif_image_xfer_stub`. Mirrors Ch121's PSMCT32 + Ch127's PSMCT16 wiring shape but for the third PSM: when the parameter is 1 AND the active DPSM is PSMT8, the per-pixel byte address is `dest_base_q (= DBP*256) + swizzle_psmt8(FBP=0, FBW=DBW, x=DSAX+cur_x, y=DSAY+cur_y)`. PSMCT32/PSMCT16 are gated by their own parameters; PSMT4 stayed linear at Ch133 (its swizzle module landed at Ch137 and its image-xfer gate at Ch139). Default 0 keeps the legacy linear path for every existing PSMT8 image-xfer TB (Ch117 etc.). No new ports — parameter-only API change.
Default-off smoke verification: ran Ch117 `tb_gs_image_xfer_psmt8` before writing the new TB; PASSed unchanged. Three-phase verification (mirrors Ch127 audit-closed shape): (1) origin write-side lock at DBP=0/DBW=2 (DBW must be even per PCSX2 GSLocalMemory.h:553 — PSMT8 pages are 128 px wide vs FBW's 64-px units, so 2 FBW units per page → bw_pg=1 here). 16×8 PSMT8 image upload via 8 IMAGE qwords (16 px/qword). Per-pixel index `idx_at(x, y) = (y[2:0] << 4) | x[3:0]` ∈ [0x00..0x7F]. After upload, byte-readback at the swizzled address asserts each byte landed where the swizzle says. Strict separators: linear y=1 (byte 128) and y=2 (byte 256) row starts stay 0 — swizzled write set lives entirely in [0..127]. (2) end-to-end agreement: enable Ch132 swizzled scanout on the same VRAM, capture the frame, verify each visible pixel's PCRTC PSMT8 grayscale R=G=B matches `idx_at(x, y)`. Both modules instantiate the same `gs_swizzle_psmt8_stub` so success proves byte-level agreement under TRXDIR-style emit + scanout-style read. (3) non-origin transfer at DBP=8/DBW=2/DSAX=12/DSAY=10/RRW=8/RRH=8. Effective coords (12..19, 10..17) cross block_x=0→1 at effective_x=16 AND block_y=0→1 at effective_y=16, so all 4 corner blocks of page (0,0) at DBP=8 (blockTable8[0][0]=0, [0][1]=1, [1][0]=2, [1][1]=3 → block bases 2048/2304/2560/2816) are visited. Pins three contracts the origin transfer can't: `dest_base_q = DBP*256` ADDED ON TOP; the swizzle is fed FULL effective coords (DSAX/DSAY non-zero); BOTH block_x and block_y propagate through `blockTable8[by][bx]`. Phase 3 distinct-pixel pattern uses `p3_idx = 0x80 | idx` ∈ [0x80..0xFF] (disjoint from Phase 1's [0x00..0x7F]) so a P3 pixel landing at a P1 byte (or vice versa) surfaces as wrong RGB. Phase 3 strict separator: linear formula puts effective coord (12, 10) at byte `2048 + 10*128 + 12 = 3340` (outside swizzled set [2048..3071]); byte 3340 stays 0 — proves a fall-through to linear would have stomped that byte.
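The strict-separator addresses used by the image-xfer TBs all come from the same legacy linear formula. The Python sketch below reproduces three of the quoted values; it is an illustration under stated assumptions rather than the TB's reference code: `linear_byte` is a hypothetical name, the base is taken as `DBP*256`, and the row stride as `DBW*64` pixels at the format's bit depth.

```python
def linear_byte(dbp: int, dbw: int, bits_per_pixel: int, x: int, y: int) -> int:
    """Byte address of pixel (x, y) under the legacy linear layout."""
    base = dbp * 256                            # upload destination base
    stride = dbw * 64 * bits_per_pixel // 8     # bytes per row
    return base + y * stride + (x * bits_per_pixel) // 8


print(linear_byte(8, 2, 8, 12, 10))   # 3340  (PSMT8 separator, Ch133)
print(linear_byte(8, 1, 16, 12, 8))   # 3096  (PSMCT16 separator, Ch127)
print(linear_byte(8, 2, 4, 28, 12))   # 2830  (PSMT4 separator, Ch139)
```

A separator is only useful when the swizzled write set never touches that byte, so each TB picks a pixel whose linear address falls outside (or, for PSMT4, between) the swizzled writes.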
First-attempt PASS: arms=2, writes=192 (=128+64), errors=0. NOTE: at Ch133 only, PSMT8 raster-side emits via `gs_stub` still used linear addressing — Ch133 was image-xfer write-side only. Ch134 later closed the raster-side gate via `PSMT8_SWIZZLE` on `gs_stub` (mirrors Ch122 for PSMCT32 and Ch128 for PSMCT16) — see Ch134 row above.

`tb_gs_scanout_swizzle_psmt4.sv` (Ch138) — focused contract for the new `PSMT4_SWIZZLE` parameter on `gs_pcrtc_stub`. Mirrors Ch120/Ch126/Ch132's read-side-first wiring shape but adds the PSMT4-specific twist: the swizzle module outputs both an absolute byte address AND a `nibble_hi` selector (PSMT4 = 4 bits/pixel = 2 pixels per byte, and the canonical PCSX2 column table reorders nibbles within a block, so `pixel_index[0]` is no longer the right selector under the swizzled layout). When the parameter is 1 AND the active PSM is PSMT4, scanout reads go through the Ch137 `gs_swizzle_psmt4_stub` and the PSMT4 nibble extractor uses `swizzle4_nibble_hi` instead of `pixel_index[0]`. PSMCT32/PSMCT16/PSMT8 are gated by their own parameters; default 0 keeps every existing PSMT4 scanout TB (Ch103 PSMT4+CLUT, Ch104 PSMT4 round-trip, Ch107 PSMT4 e2e, etc.) on the legacy linear path. No new ports — parameter-only API change. Default-off smoke verification: ran Ch103 `tb_gs_scanout_psmt4_clut` + Ch104 `tb_gs_psmt4_round_trip` before writing the new TB; both PASSed unchanged. Two-phase verification (mirrors Ch132 closure shape; CLUT disabled so PCRTC's PSMT4 grayscale fallback gives `r=g=b={nibble, nibble}` at scanout): (1) origin at FBP=0/FBW=2/DBX=DBY=0 (FBW must be even per PCSX2 GSLocalMemory.h:560 because PSMT4 pages are 128 px wide, same as PSMT8). 16×4 region preloaded at swizzled bytes via a TB-side `byte_shadow` accumulator that lays each pixel's nibble at its `(addr, nibble_hi)` slot; bytes are then flushed to vram_stub via per-byte BE writes. Per-pixel nibble pattern `nibble_at(x, y) = ((y << 1) ^ x) & 4'h7` ∈ [0..7] gives eight distinct gray values that vary with both x and y across the 16×4 frame.
The image lives entirely in block (0,0) of page (0,0) and exercises within-block columnTable4 entries for yb=0..3, xb=0..15. Strict separator: byte 64 (linear y=1 row start at FBW=2 stride) pre-colored with sentinel 0xCC (gray=0xCC, unproducible by Phase 1's [0..7]-nibble pattern) — fall-through to linear would surface as RGB(0xCC, 0xCC, 0xCC). (2) non-origin at FBP=4/FBW=4 (bw_pg=2), DBX=120, DBY=126. Effective coords range x∈[120..135], y∈[126..129]. page_x crosses 0→1 at effective_x=128, page_y crosses 0→1 at effective_y=128 (PSMT4's 128-tall page boundary — different from PSMT8's 64-tall). All 4 corner pages of FBP=4/FBW=4 visited, each with a distinct blockTable4 lookup (blockTable4[7][3]=31 → page (0,0) block_base 16128; blockTable4[7][0]=21 → page (1,0) block_base 21760; blockTable4[0][3]=10 → page (0,1) block_base 27136; blockTable4[0][0]=0 → page (1,1) block_base 32768). A regression that tied any of {dispfb_fbp, dbx, dby, FBW, block_x, block_y, page_index, bw_pg=FBW/2, swizzle nibble_hi} to zero would NOT survive Phase 2. Strict P2 separator: byte 24380 (linear formula's place for (120, 126); outside all 4 swizzled chunks) pre-colored with sentinel 0xDD → fall-through to linear would surface as RGB(0xDD, 0xDD, 0xDD), unproducible by the Phase-2 pattern. PASS errors=0 after one bug-fix iteration: Phase 2's flush-loop initially hardcoded the wrong byte ranges due to ablockTable4[7][3]lookup mistake (the value is 31, not 15) — replaced with a shadow-array sweep [256..65535] that flushes any non-zero byte, eliminating the hardcode/lookup mismatch class entirely. NOTE (now historical): Ch138 was read-side only when introduced; the PSMT4 write-side is now live as well — Ch139 (image-xfer) + Ch140 (raster) + Ch141 (raster-driven e2e demo). 
With Ch138, all four common GS PSMs now have read- side byte-accuracy under their swizzle gates (CT32 Ch120 + CT16 Ch126 + T8 Ch132 + T4 Ch138).tb_gs_scanout_swizzle_psmt8.sv(Ch132) — focused contract for the newPSMT8_SWIZZLEparameter ongs_pcrtc_stub. Mirrors Ch120/Ch126's wiring shape but for PSMT8: when the parameter is 1 AND the active PSM is PSMT8, scanout reads go through the Ch131gs_swizzle_psmt8_stub(real PS2 GS page/block/column layout — 128×64 pixel pages, 4×8 block grid, 16×16 within-block bytes,bw_pg = FBW>>1) instead of the legacy linearFBW*64*y + xformula. PSMCT32/PSMCT16 are gated by their own parameters; PSMT4 stays linear (its swizzle math is future). Default PSMT8_SWIZZLE=0 keeps every existing PSMT8 scanout TB (Ch96 storage-only, Ch97 PSMT8+CLUT, Ch103 PSMT4-via-CT16-CLUT, Ch107 PSMT4-e2e palette path) on the original linear addressing. No new ports — parameter-only API change. Default-off smoke verification: ran Ch96tb_gs_scanout_psmt8before writing the new TB; PASSed unchanged, confirming the new instance + 4-way mux extension don't disturb the linear path. Two-phase verification (mirrors Ch126 PSMCT16 closure shape): (1) origin (FBP=0, FBW=2, DBX=DBY=0; FBW must be even — PCSX2 asserts(bw & 1) == 0for PSMT8 because pages are 128 px wide vs FBW's 64-px units, so 2 FBW units per page → bw_pg=1 here). 16×8 region preloaded at swizzled bytes; per-pixel indexidx = (y[2:0] << 4) | x[3:0]∈ [0x00..0x7F] surfaces as grayscale R=G=B=idx via PCRTC's PSMT8 fallback (Ch96). x∈[0..15] is entirely block_x_in_page=0, so the within-block test exercises ALL 16 xb positions ofcolumnTable8across yb rows 0..7. Strict separators: linear y=1 starts at byte 128 (FBW=2 stride) but swizzled lands at byte 8 (columnTable8[1][0]=8, no*2scale since PSMT8 is 1 byte/pixel); linear x=8,y=0 is byte 8 but swizzled is byte 2. (2) non-origin (FBP=4, FBW=4 → bw_pg=2, DBX=120, DBY=60). 
Effective coords range x∈[120..135], y∈[60..67] — page_x crosses 0→1 at effective_x=128 (proves x[7] reaches the page-x lane of the PSMT8 swizzle — different boundary from CT16/CT32's x[6]); page_y crosses 0→1 at effective_y=64; block_x and block_y both flip; ALL 4 pages (0,0)/(1,0)/(0,1)/(1,1) are visited, each with a distinct blockTable8 lookup ([3][7]=31, [3][0]=10, [0][7]=21, [0][0]=0). A regression that tied any of {dispfb_fbp, dbx, dby, FBW, block_x, block_y, page_index, bw_pg=FBW/2} to zero would NOT survive Phase 2. Sentinel separator: byte 24500 (inside linear range 23672..25479 for the Phase-2 effective region, outside ALL 4 swizzled write-set blocks) pre-colored with 0xFF → fall-through to linear would surface as RGB(0xFF, 0xFF, 0xFF), which is unproducible by the Phase-2 unique pattern (idx ∈ [0x00..0x7F]). First-attempt PASS errors=0 — no audit iteration needed because Phase 2's coord choices were designed up front to make all 7 chain-layer wires load-bearing AND the page-x crossing boundary is at PSMT8's specific x=128 (not the 64-px boundary the direct-color PSMs use). NOTE (now historical): Ch132 was read-side only when introduced; Ch133 then Ch134 closed the image-xfer + raster write sides for PSMT8, so all three PSMT8 swizzle integration points are now live (mirrors Ch120/Ch121/Ch122 for PSMCT32 and Ch126/Ch127/Ch128 for PSMCT16).tb_gs_scanout_swizzle_psmct16.sv(Ch126) — focused contract for the newPSMCT16_SWIZZLEparameter ongs_pcrtc_stub. Mirrors Ch120's wiring shape but for PSMCT16: when the parameter is 1 AND the active PSM is PSMCT16, scanout reads go through the Ch125gs_swizzle_psmct16_stub(real PS2 GS page/block/column layout) instead of the legacy linearFBW*64*y + x*2formula. PSMCT32 is gated by its ownPSMCT32_SWIZZLEparameter (Ch120); PSMT8/PSMT4 stay linear. Default 0 keeps every existing PSMCT16 scanout TB (Ch94/Ch95/Ch103/etc.) on the original linear addressing. 
Topology: TB drivesvram_stub.write_*directly with each pixel's RGB5A1 halfword preloaded at the swizzled byte address (TB-sideref_addr16()mirrors the swizzle math + the Ch125 source-table-locked tables); pcrtc withPSMCT16_SWIZZLE=1scans out the 16×8 frame and the TB asserts each captured pixel matches the preloaded color after 5→8 bit-replicate. Per-pixel pattern is unique per (x, y): R5=(x^y)&0xF, G5=x&0xF, B5=y&0xF, expanded to 8 bits via PCRTC's bit-replicate. The PSMCT16 swizzle vs. linear distinction shows up at any y>0 (linear y=1 → byte 128 with FBW=1, but swizzled within block (0,0) yb=1 → columnTable16[1][0]=4 → byte 8) and at x=8, y=0 (linear byte 16 vs swizzled byte 2) so even within the first row + first block, the gate is a strict separator. NOTE (now historical): Ch126 was read-side only when introduced; Ch127 (image-xfer) then Ch128 (raster) closed the PSMCT16 write sides, mirroring Ch121/Ch122 for PSMCT32.tb_gs_swizzle_psmt4.sv(Ch137) — focused contract for the newgs_swizzle_psmt4_stubmath primitive: a pure-comb module mapping(FBP, FBW, x, y)to a VRAM byte address + nibble_hi selector using the real PS2 GS PSMT4 layout (8 KiB pages organized as 128×128 PSMT4 pixels — 4× as many pixels per page as PSMT8 since each PSMT4 pixel is a NIBBLE; 32 blocks/page in an 8-rows × 4-cols grid (same orientation as PSMCT16's blockTable16); each block 32×16 pixels = 512 nibbles = 256 bytes; 512-entry within-block column table — 2× the entries of PSMT8's 256-entry table due to the doubled block area, indexed [yb][xb] with yb=0..15 + xb=0..31 → nibble 0..511). PSMT4 is the most complex of the four common GS PSMs because each pixel is HALF a byte, so the swizzle outputs both a byte address and anibble_hiselector (=0 for low nibble of the byte ataddr, =1 for high). PSMT4 reuses PSMT8's page-stride convention (bw_pg = FBW >> 1; PCSX2 asserts FBW must be even at GSLocalMemory.h:560 because PSMT4 pages are 128 px wide). 
Source-table provenance pinned:_blockTable4taken verbatim from pcsx2/GS/GSTables.cpp lines 61–69;columnTable4from same file lines 147–213. Master HEAD commit3000e113e2b3a76357c08dfa80d3c747f40e2706; file blob SHA3581209b8217378f473f9de22a9dbc8c45ca49b6(same blob Ch131 pinned). Cross-checked against GSLocalMemory.h:558BlockNumber4+ thepxOffsettemplate at GSTables.cpp:247–258 (blockSize=512 in NIBBLE units, pageSize=16384 nibble units = 8192 bytes, pageWidth=128). The existing per-bit write_mask 0x0F/0xF0 nibble RMW from Ch106/Ch118 will still apply on top of the swizzled byte address — the swizzle module doesn't touch the nibble merge logic; it just produces (addr, nibble_hi). Five-phase verification (mirrors Ch125/Ch131 shape, scaled up): (1) spot-checks at 15 hand-computed corners (origin, intra-block xb=1/8/16/yb=1/yb=2-with-hi-nibble, last nibble of block (0,0), first/second/third/fourth horizontal blocks, second-row-of-blocks origin, page-x at x=128 + page-y at y=128, FBP=4 origin, page0-last-pixel (127,127) → addr 8191 hi=1). (2a) INDEPENDENT column-table source lock — 32 hard-codedcheck_nibble()calls for yb=0 (literal-by-literal verbatim from PCSX2 columnTable4 row 0) PLUS a programmatic walk for yb=1..15 against the in-TB ref function (480 more checks); Phase 2a's literal yb=0 row + Phase 5's bijectivity sweep + Phase 3's literal block-table lock together pin the table. (3) INDEPENDENT block-table source lock — 32 hard-coded checks (one per block in page 0) with expected block index taken VERBATIM from PCSX2 blockTable4. (4) block-swizzle walk via in-TB ref_block_idx4. (5) bijectivity sweep over the 128×128 page — 16384 NIBBLE slots (vs PSMT8's 8192 byte slots), every pixel must hit a unique (byte_addr, nibble_hi) pair and agree with both the in-TB ref byte address AND ref nibble_hi. 
Plus multi-page sanity at FBW=4/bw_pg=2 (page-x crossing at x=192 → byte 10496 with `blockTable4[1][2]=9`, and page-y crossing at y=128 → byte 16384) and non-page-aligned FBP coverage at FBP ∈ {1, 2, 3}, including FBP=3 + FBW=4 + page-(1,1) intra-block at (129, 129) → byte 30732 (`= 6144 + 3*8192 + 0*256 + ref_col_idx4(1,1)/2 = 30720 + 12`). First-attempt PASS errors=0. NOTE: This module is NOT YET wired into `gs_pcrtc_stub` / `gif_image_xfer_stub` / `gs_stub`; those still use linear PSMT4 addressing as of Ch137. The math is locked here so follow-on chapters can wire `PSMT4_SWIZZLE` parameter gates into the existing address paths without disturbing the legacy linear-PSMT4 TBs (Ch103 / Ch106 / Ch107 / Ch118). With Ch119 PSMCT32 + Ch125 PSMCT16 + Ch131 PSMT8 + Ch137 PSMT4, all four common GS PSMs now have byte-accurate-to-real-PS2 swizzle math available as standalone primitives, so the four-PSM swizzle math foundation is complete. Future chapters can wire PSMT4 into pcrtc/image-xfer/raster behind a `PSMT4_SWIZZLE` parameter (mirroring Ch120→Ch124 / Ch126→Ch130 / Ch132→Ch136), with the existing nibble RMW machinery layered on top.
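
The block-base arithmetic quoted in the PSMT4 rows can be replayed with a small reference model. This is a minimal Python sketch, not the RTL or the in-TB reference function: the names `block_base_128px` and `psmt4_addr` are hypothetical, the block index and within-block nibble index are passed in rather than looked up (the tables themselves are not reproduced here), and the nibble index 24 used for (129, 129) is an assumption consistent with the stated `ref_col_idx4(1,1)/2 = 12`.

```python
PAGE_BYTES = 8192   # one GS page
BLOCK_BYTES = 256   # one GS block
FBP_UNIT = 2048     # this document's FBP unit, in bytes


def block_base_128px(fbp, fbw, page_x, page_y, block_idx):
    """Byte address of a block's first byte for PSMs with 128-px-wide pages
    (PSMT8 and PSMT4). block_idx is the value read from the block table."""
    if fbw & 1:
        raise ValueError("FBW must be even for 128-px-wide pages")
    bw_pg = fbw >> 1                       # pages per framebuffer row
    page_index = page_y * bw_pg + page_x
    return fbp * FBP_UNIT + page_index * PAGE_BYTES + block_idx * BLOCK_BYTES


def psmt4_addr(block_base, col_nibble):
    """Split a within-block nibble index (0..511) into (byte_addr, nibble_hi)."""
    return block_base + (col_nibble >> 1), col_nibble & 1


# Ch138 Phase 2 block bases at FBP=4 / FBW=4, block indices as quoted above.
print(block_base_128px(4, 4, 0, 0, 31))   # page (0,0), blockTable4[7][3]=31
print(block_base_128px(4, 4, 1, 1, 0))    # page (1,1), blockTable4[0][0]=0
# Ch137 spot-check: last pixel of page 0 is the last nibble of block 31.
print(psmt4_addr(block_base_128px(0, 2, 0, 0, 31), 511))
```

Keeping the table lookups outside the function means the sketch only pins the page/block composition, which is the part the multi-page and non-page-aligned FBP checks are about.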
tb_gs_swizzle_psmt8.sv(Ch131) — focused contract for the newgs_swizzle_psmt8_stubmath primitive: a pure-comb module mapping(FBP, FBW, x, y)to a VRAM byte address using the real PS2 GS PSMT8 layout (8 KiB pages organized as 128×64 PSMT8 pixels — 2× wider than CT16's 64×64 page; 32 blocks/page in a 4-rows × 8-cols grid; each block 16×16 pixels = 256 bytes; 256-entry within- block column table — 2× the entries of CT16's 128-entry table due to the doubled block area, indexed [yb][xb] with yb=0..15 + xb=0..15 → byte 0..255). PSMT8 also introduces a new page-stride constantbw_pg = FBW >> 1(PCSX2 asserts(bw & 1) == 0at GSLocalMemory.h:553 because PSMT8 pages are 128 px wide vs FBW's 64-px units → 2 FBW units per PSMT8 page, so FBW must be even). Source-table provenance pinned:blockTable8taken verbatim from pcsx2/GS/GSTables.cpp lines 53–59;columnTable8from same file lines 111–145. Master HEAD commit3000e113e2b3a76357c08dfa80d3c747f40e2706; file blob SHA3581209b8217378f473f9de22a9dbc8c45ca49b6. Cross-checked against GSLocalMemory.h:551BlockNumber8+ thepxOffsettemplate at GSTables.cpp:247–258 (blockSize=256, pageSize=8192, pageWidth=128). PCSX2'sbpis in 256-byte block-pointer units; in our FBP=2048-byte units,bp = FBP * 8sobp*256 = FBP*2048. Five-phase verification (mirrors Ch125 PSMCT16 shape): (1) spot-checks at 15 hand-computed corners (origin, intra- block xb=1/4/8/yb=1, last byte of block (0,0), first/second block origins, second row of blocks, third+fourth blocks, page-x at x=128 and page-y at y=64, FBP=4 origin); (2a) INDEPENDENT column-table source lock — 256 hard-codedcheck()calls (one per (yb, xb) inside block (0,0)) where the expected byte index is taken VERBATIM from PCSX2 columnTable8 with<literal>arithmetic, NOT derived from the in-TB ref function. 
Catches any case where DUT and ref share the same miscopy (the same trap Ch125 added Phase 2a for with PSMCT16's column table); (2b) within-block 16×16 walk via the in-TB ref_col_idx8 (self-check); (3) INDEPENDENT block-table source lock — 32 hard-coded checks (one per block in page 0) with the expected block index taken VERBATIM from PCSX2 blockTable8, NOT derived from the in-TB ref; (4) block-swizzle walk via in-TB ref_block_idx8; (5) bijectivity sweep over the 128×64 page — 8192 byte slots (vs CT16's 4096 halfword slots), every pixel must hit a unique byte address in[0, 8192)and agree with the in-TB reference. Plus multi-page sanity at FBW=4/bw_pg=2 (page-x crossing at x=192 and page-y crossing at y=64) and non-page-aligned FBP coverage at FBP ∈ {1, 2, 3}, including FBP=3+FBW=4+page-(1,1) intra-block crossing at (129, 65). First-attempt PASS errors=0. NOTE: This module is NOT YET wired intogs_pcrtc_stub/gif_image_xfer_stub/gs_stub— those still use linear PSMT8 addressing as of Ch131. The math is locked here so follow-on chapters can wirePSMT8_SWIZZLEparameter gates into the existing address paths without disturbing the legacy linear-PSMT8 TBs (Ch96 / Ch97 / Ch103 / Ch105 / Ch107 / Ch117). With Ch119 PSMCT32 + Ch125 PSMCT16 + Ch131 PSMT8, three of the four common GS PSMs now have byte- accurate-to-real-PS2 swizzle math available as standalone primitives; PSMT4 (with its 32×16 nibble intra-block layout) is the natural Ch132 candidate.tb_gs_swizzle_psmct16.sv(Ch125) — focused contract for the newgs_swizzle_psmct16_stubmath primitive: a pure-comb module mapping(FBP, FBW, x, y)to a VRAM byte address using the real PS2 GS PSMCT16 layout (8 KiB pages organized as 64×64 PSMCT16 pixels; 32 blocks/page in a 4×8 grid; each block 16×8 pixels = 256 bytes; non-trivial within-block column table — unlike PSMCT32 where within-block IS row-major halfwords by accident, PSMCT16 has 4 internal 16×2-pixel sub-columns with a 128-entry permutation). 
Source-table provenance pinned: `blockTable16` taken verbatim from pcsx2/GS/GSTables.cpp lines 29–39 (master HEAD commit 3d71e310; file-touch commit d983b2b0, 2026-01-12); `columnTable16` from same file lines 91–109. Cross-check against the older Debian-packaged GSdx `PixelAddressOrg16(x, y, bp, bw) = (BlockNumber16(...) << 7) + columnTable16[y & 7][x & 15]` confirms the address chain (`<< 7` lifts to halfword units, multiply by 2 for bytes; in our FBP=2048-byte units, `bp = FBP * 8` so `bp*256 = FBP*2048`). Five-phase verification: (1) spot-checks at 13 well-defined corners (origin, intra-block, first/second block, second row of blocks, page-x and page-y boundaries, FBP=4 origin); (2) within-block 16×8 walk asserting `byte = 2 * columnTable16[yb][xb]`, which locks the column table; a row-major-halfwords regression would fail; (3) source-table lock: 32 hard-coded address checks (one per block in page 0) with the expected block index taken VERBATIM from PCSX2 `blockTable16`, NOT derived from the in-TB reference function; (4) block-swizzle walk cross-checking the in-TB ref function against the DUT (the bijectivity sweep relies on it being correct); (5) bijectivity sweep over the 64×64 page: 4096 halfword slots, every pixel must hit a unique halfword address in `[0, 8192)` and agree with the in-TB reference. Plus multi-page sanity at FBW=2 and non-page-aligned FBP coverage at FBP ∈ {1, 2, 3} (real PS2 supports any 2048-byte-aligned FBP, the same broadening Ch119 adopted post-audit). NOTE: This module is NOT YET wired into `gs_pcrtc_stub` / `gif_image_xfer_stub` / `gs_stub`; those still use linear PSMCT16 addressing as of Ch125. 
The math is locked here so follow-on chapters can wirePSMCT16_SWIZZLEparameter gates into the existing address paths without disturbing the legacy linear-PSMCT16 TBs (Ch94 / Ch95 / Ch103 / Ch116).tb_gs_swizzle_psmct32.sv(Ch119) — focused contract for the newgs_swizzle_psmct32_stubmath primitive: a pure-combinational module mapping(FBP, FBW, x, y)to a VRAM byte address using the real PS2 GS PSMCT32 page/block swizzle layout (8 KiB pages, 4×8 grid of 8×8-pixel blocks per page, blocks ordered per the canonical PCSX2 PSMCT32 swizzle table, row-major within a block). Verification has five phases: (1) spot-checks on the well-defined corners — origin, intra-block walks, first/second block, second row of blocks, page-x and page-y boundaries, second page on x, and FBP=4 origin; (2) within-block 8×8 walk assertingbyte_in_block = yb*32 + xb*4; (3) source-table lock — 32 hard-coded address checks (one per block in page 0) where the expected block index is taken VERBATIM from PCSX2's PSMCT32 block table, NOT derived from the in-TB reference function. This proves the DUT'sswizzle_psmct32()table matches the canonical source; a copied-wrong table that happened to still be a valid permutation of 0..31 would fail this phase, while the bijectivity sweep below would pass it; (4) block-swizzle walk (redundant with phase 3, cross-checks ref_block_idx against the DUT — the bijectivity sweep relies on ref_block_idx being correct); (5) bijectivity sweep over the full 64×32 PSMCT32 page — every word slot in[0, 8192)reached exactly once (catches any swap/typo in the swizzle table). Plus a multi-page sanity check at FBW=2 (pixel (96, 16) → block (4,2) of page 1 → addr 14336) and a non-page- aligned FBP phase that drives FBP=1, 2, 3 (mid-page in the 8 KiB sense — real PS2 supports any 2048-byte-aligned FBP; our address formula is bit-correct for non-page-aligned FBP) plus FBP=3 with FBW=2 + intra-block crossing as a stress case. 
NOTE (now historical): at Ch119 this module was standalone math only; Ch120 (PCRTC read), Ch121 (image-xfer write), and Ch122 (raster write) wired it into the three integration points — the same shape that Ch125–Ch128 (PSMCT16), Ch131–Ch134 (PSMT8), and Ch137–Ch140 (PSMT4) followed for the other three PSMs.tb_gs_image_xfer_psmt4.sv(Ch118) — focused contract forgif_image_xfer_stub's PSMT4 path (the fourth and final supported PSM). PSMT4 packs 0.5 bytes/pixel (4-bit nibble per pixel = 2 px/byte), so each 128-bit IMAGE qword carries 32 pixels in 16 bytes. Each emit is a SUB-BYTE write:write_be = 4'b0001with a per-emit nibble mask (write_mask = 0x0000_000Ffor the LOW nibble,0x0000_00F0for the HIGH nibble), keyed by(DSAX+x)[0]; vram_stub's per-bit merge commits exactly the targeted nibble, preserving the OTHER nibble of the byte. Back-to-back emits to the same byte (e.g. x=0 + x=1 of the same row) chain through NBA semantics without bypass logic — the same trick the raster channel uses since Ch106. The TB is INTENTIONALLY adversarial: VRAM is preloaded with0xA5across every byte the engine will write (plus boundary bytes), then a single IMAGE qword (32 PSMT4 pixels) covers the entire 8×4 rect. Every byte ends as{nibble_high_pixel, nibble_low_pixel}(no trace of 0xA5); bytes immediately right of the rect on each row stay 0xA5 (proves no nibble leak past RRW); bytes before / after the destination region also stay 0xA5. Patternpixel(x,y) = 4'((y*8+x) & 0xF). Asserts: 1 trxdir arm, 32 vram writes, every emitbe=0001andmask ∈ {0x0F, 0xF0}, per-byte readback matches, boundary bytes preserved.tb_gs_image_xfer_psmt8.sv(Ch117) — focused contract forgif_image_xfer_stub's PSMT8 path. Pushes 2 IMAGE qwords (32 PSMT8 pixels = 16 px/qword × 2) through the engine after a TRXDIR-shaped GIF-A+D register sequence with DPSM=PSMT8 (=0x13). 
PSMT8 packs 1 byte/pixel (an 8-bit CLUT index), so each qword holds 16 pixels; the engine emits one 8-bit pixel per cycle withwrite_be = 4'b0001, the index in the LOW byte ofwrite_data, andwrite_mask = 0xFFFFFFFF; vram_stub commitsmem[write_addr] <= write_data[7:0]at any byte alignment. Pattern ispixel(x,y) = 8'(y*16 + x)— 32 distinct values across the 8×4 rect so a wrong-byte-lane commit shows up unambiguously. Asserts: 1 trxdir arm, 32 vram writes (allbe=0001,mask=0xFFFFFFFF), every pixel reads back atdest_base + y*64 + x, plus right-of-rect / before / after byte-zero boundary preservation. Each qword packs TWO rows of 8 pixels (lanes 0..7 = row y, lanes 8..15 = row y+1) — exercises the per-lane row-stride math at the qword boundary.tb_gs_image_xfer_psmct16.sv(Ch116) — focused contract forgif_image_xfer_stub's new PSMCT16 path. Pushes 4 IMAGE qwords (32 PSMCT16 pixels = 8 px/qword × 4) through the engine after a TRXDIR-shaped GIF-A+D register sequence (BITBLTBUF/TRXPOS/TRXREG/TRXDIR). PSMCT16 packs 2 bytes/pixel, so each qword holds 8 pixels (vs 4 for PSMCT32). The engine emits one 16-bit pixel per cycle to vram_stub withwrite_be = 4'b0011, the pixel value in the LOW halfword ofwrite_data, andwrite_mask = 0xFFFFFFFF; vram_stub commits the 2 bytes at the 2-byte-aligned destination address. Pattern ispixel(x,y) = 16'h{yyxx}{yyxx}— distinct per-pixel value so a wrong-lane commit shows up unambiguously. Asserts: 1 trxdir arm, 32 vram writes (allbe=0011,mask=0xFFFFFFFF), every pixel reads back atdest_base + y*row_stride + x*2, and the bytes immediately right of the rect on each row + before the dest region + after the dest region all stay zero (proves row-stride math + no halfword leak past RRW). 
PSMT8 image-xfer landed in Ch117 and PSMT4 image-xfer landed in Ch118; see those TB rows for their own per-byte / per-nibble contract coverage.

`tb_gs_demo_psmt4_e2e_trxdir.sv` (Ch110): driver-shaped PSMT4 demo with the palette upload now arriving via a real TRXDIR/TRXPOS/TRXREG/HWREG image-transfer GIF packet sequence instead of TB-direct vram_stub writes. Closes the LAST TB-direct path in the e2e demo flow: every byte the GS sees (framebuffer pixels AND palette source) now arrives through a driver-shaped GIF stream. The DMAC delivers 36 qwords total: U1 (PACKED, NREG=4): BITBLTBUF/TRXPOS/TRXREG/TRXDIR, where TRXDIR arms `gif_image_xfer_stub`. U2 (IMAGE, NLOOP=4): 4 qwords of 4 PSMCT32 entries each → 16 palette entries written into VRAM at `DBP*256` by `gif_image_xfer_stub`. Then 4 SPRITE PACKED packets + 1 TEX0_1 PACKED packet. PASS criteria add to Ch109's: 1 EV_DMA_START / 36 EV_DMA_BEAT / 1 EV_DMA_DONE, 7 GIFtag accepts (U1 + U2 + 4×SPRITE + TEX0), 25 PACKED A+D dispatches (4 TRX-setup + 20 SPRITE + 1 TEX0), 16 image-xfer VRAM writes from `gif_image_xfer_stub` (DBP=4, DBW=1, DPSM=PSMCT32, DSAX=DSAY=0, RRW=16, RRH=1). The vram_stub write port is muxed at TB level: `xfer_busy ? xfer_we : raster_pixel_emit` (sequenced: palette upload completes before sprites raster). Ch110 also added a backpressure path on `gif_packed_stub` (`image_data_ready` input) so the upstream DMA stalls while `gif_image_xfer_stub` is draining the previous IMAGE qword's 4 PSMCT32 lanes; outside S_IMAGE the gate is a no-op (in_ready stays high). Privileged-block MMIO (PMODE/DISPFB1/DISPLAY1) remains TB-direct because those are CPU MMIO writes in real hardware, not GIF traffic.
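
The Ch110 PASS counts follow directly from the packet list. As a minimal sketch of that bookkeeping (the `Packet` class and `totals` helper are hypothetical, and it assumes every packet is one GIFtag followed by its payload qwords, with one A+D dispatch per PACKED payload qword and four PSMCT32 lanes per IMAGE qword):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    name: str
    fmt: str       # "PACKED" or "IMAGE"
    payload: int   # qwords that follow the GIFtag

    @property
    def qwords(self):
        return 1 + self.payload   # the GIFtag itself plus its payload


def totals(packets, lanes_per_image_qword=4):
    """Derive the event counts a TB would assert for a packet list."""
    return {
        "dma_beats": sum(p.qwords for p in packets),
        "giftag_accepts": len(packets),
        "ad_dispatches": sum(p.payload for p in packets if p.fmt == "PACKED"),
        "xfer_writes": lanes_per_image_qword
        * sum(p.payload for p in packets if p.fmt == "IMAGE"),
    }


SPRITES = [Packet(f"SPRITE{i}", "PACKED", 5) for i in range(4)]
TEX0 = Packet("TEX0_1", "PACKED", 1)

CH109 = SPRITES + [TEX0]
CH110 = [Packet("U1", "PACKED", 4), Packet("U2", "IMAGE", 4)] + SPRITES + [TEX0]

print(totals(CH109))
print(totals(CH110))
```

Deriving the numbers from one manifest rather than hardcoding them separately keeps the DMA beat count, GIFtag count, and dispatch count from drifting apart when a packet is added.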
tb_gs_demo_psmt4_e2e_dmac.sv(Ch109) — same 4-quadrant PSMT4 demo as Ch108, but the GIFtag + PACKED A+D quadwords arrive atgif_packed_stubvia the DMAC channel-2 →ee_memory_map_stub→ee_ram_stubpath instead of being TB-driven directly. Closes the last GIF-side sideband from Ch108: the demo is now reachable the way real EE/IOP code reaches it. The TB pre-stages the same 26 qwords (4 SPRITE packets × 6 qwords + 1 TEX0_1 packet × 2 qwords) into RAM at PAYLOAD_MADR, then writes DMAC channel-2 MADR/QWC/CHCR; a single NORMAL transfer with QWC=26 streams them into the GIF. PASS criteria add to Ch108's: 1 EV_DMA_START / 26 EV_DMA_BEAT / 1 EV_DMA_DONE (DMA event taxonomy locked), with the same downstream chain — 5 GIFtag accepts, 21 A+D dispatches in the expected reg-num sequence, 32 PSMT4 emits, 1 loader_busy rise, identical 16×8 captured frame. Privileged- block MMIO and palette pre-stage stay TB-direct (NOT GIF-side); TRXDIR/HWREG image-transfer for palette upload is a separate future chapter.tb_gs_demo_psmt4_e2e_packed.sv(Ch108) — same 4-quadrant PSMT4 demo as Ch107 but routed through the GIFtag / PACKED A+D front-end (gif_packed_stubwith REAL_AD_REG_MAP=1). Closes the last bit of GS-side sideband from Ch107: instead of TB-drivinggs_stub.gif_reg_*directly, the TB pushes raw 128-bit GIFtag + PACKED A+D quadwords intogif_packed_stub. in_*exactly the way the real GIF would receive them from PATH3. Each SPRITE is a packet of 1 GIFtag (NLOOP=1, NREG=5, PACKED, REGS=0xEEEEE — 5×A+D in the low 5 nibble slots) + 5 PACKED A+D qwords (PRIM, FRAME_1=PSMT4, RGBAQ, XYZ2, XYZ2); TEX0_1 load is its own 1-tag/1-A+D packet. Total: 5 GIFtag accepts (4 SPRITEs + 1 TEX0_1) and 4×5 + 1×1 = 21 PACKED A+D register-write dispatches into gs_stub.gif_reg_*. 32 PSMT4 raster emits arrive (Ch106 RMW), loader fires exactly once on TEX0_1, and the captured 16×8 frame matches the same expected CLUT-decoded RGB as Ch107 — i.e. 
real-format GIF packets reach the GS register file with the same cadence the TB previously synthesised by hand. Privileged-block MMIO (PMODE/DISPFB1/DISPLAY1) and the palette pre-stage in VRAM remain TB-direct because they are NOT GIF-side; the palette upload via real-PS2 TRXDIR/TRXPOS/TRXREG/HWREG image-transfer packets is a separate future chapter, as is the DMAC channel-2 burst that would normally deliver the GIFtag qwords (this TB drivesgif_packed_stub.in_*directly to keep the demo narrow and deterministic; the full DMAC→RAM→GIF round trip is what the integration-tiertb_ee_core_gif_*family covers).tb_gs_psmt4_round_trip.sv(Ch104) — full driver-shaped PSMT4 + CLD=4 + CSA round trip. Wiresgs_stub+vram_stub+clut_stub+clut_loader_stub+gs_pcrtc_stubend-to-end withpcrtc.clut_csa = gs_stub.tex0_1_csa_q(the Ch98 sideband-free pattern). Phase 1: stages a 4×4 PSMT4 sprite in VRAM, plus a 16-entry pattern_a palette in VRAM atCBP_A*256. Drives TEX0_1 withCBP=4, CPSM=PSMCT32, CSM=CSM2, CSA=0, CLD=4; the loader writes pattern_a intoclut_stub[0..15]andpcrtc.clut_csais 0, so PSMT4 scanout reads pattern_a per nibble. Phase 2: stages a different pattern_b palette atCBP_B*256and drives TEX0_1 withCBP=8, CSA=4, CLD=4; the loader writes pattern_b intoclut_stub[64..79](the CSA=4 window) andpcrtc.clut_csaflips to 4, so the same VRAM sprite — same DISPFB1 / DISPLAY1 / PMODE — now reads pattern_b. Proves loader policy + clut_stub contents + PCRTC lookup are wired consistently.
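
The CSA window behavior this round trip relies on can be modelled in a few lines. This is a hedged sketch only: `clut_index` and `load_csa_window` are hypothetical names, the 256-entry CLUT is a plain Python list, and the 0..15 CSA range check is an assumption of the sketch rather than something the text states.

```python
CLUT_ENTRIES = 256
CSA_WINDOW = 16   # entries per CSA window


def clut_index(csa, nibble):
    """CLUT entry a PSMT4 nibble selects under a given CSA."""
    return csa * CSA_WINDOW + (nibble & 0xF)


def load_csa_window(clut, csa, palette16):
    """CLD=4-style partial load: replace only the 16-entry window picked by
    CSA and return a new list, leaving the other 240 entries untouched."""
    if len(palette16) != CSA_WINDOW:
        raise ValueError("expected exactly 16 palette entries")
    if not 0 <= csa < CLUT_ENTRIES // CSA_WINDOW:
        raise ValueError("CSA window out of range")
    base = csa * CSA_WINDOW
    out = list(clut)
    out[base:base + CSA_WINDOW] = palette16
    return out


pattern_a = [0xA0 + i for i in range(16)]   # illustrative values only
pattern_b = [0xB0 + i for i in range(16)]

clut = [0] * CLUT_ENTRIES
clut = load_csa_window(clut, 0, pattern_a)  # Phase 1: CSA=0
clut = load_csa_window(clut, 4, pattern_b)  # Phase 2: CSA=4 -> entries 64..79

# The same nibble resolves to a different entry once CSA flips from 0 to 4.
print(clut[clut_index(0, 5)], clut[clut_index(4, 5)])
```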
Scope (current, after Ch165):
- PSMCT32 (DISPFB1.PSM=0), PSMCT16 (PSM=2), PSMT8 (PSM=0x13), and PSMT4 (PSM=0x14) honored at BOTH the read and write sides (Ch94 + Ch95 + Ch96 + Ch97 + Ch103 + Ch105 + Ch106). PSMCT24/PSMCT16S/PSMZ32/etc. force scanout off and are not contract-tested at the raster channel. The write side (`gs_stub.raster_pixel_emit`) emits the four supported PSMs via `raster_pixel_be_q` (per-byte gate) and `raster_pixel_mask_q` (per-bit merge mask, Ch106): PSMCT32 = be `0xF` / mask `0xFFFFFFFF`, PSMCT16 = be `0x3` / mask `0xFFFFFFFF`, PSMT8 = be `0x1` / mask `0xFFFFFFFF`, PSMT4 = be `0x1` / mask `0x0F` or `0xF0`. The mask path is no-op for byte-or-larger PSMs (mem[i] = data[i] when mask_i = 0xFF) and only meaningful for PSMT4 sub-byte writes. PSMT8 / PSMT4 scanout surfaces the index/nibble as grayscale by default; with `clut_enable=1` (Ch97/Ch103) and a programmed `clut_stub`, the index/nibble looks up real RGB. CLUT contents come either from a TB-direct write OR (Ch99..Ch102) from a VRAM→CLUT load triggered by a TEX0_1 GIF write with `CSM == 1` (CSM2 linear), `CPSM` ∈ {PSMCT32, PSMCT16}, and a CLD value passing the policy: CLD=0 never; CLD=1 always (full 256-entry load); CLD=2 if CBP changed since last load (full); CLD=3 if CBP/CPSM/CSA any-changed (full); CLD=4 always but only the 16-entry CSA window at indices `CSA*16 + i` (Ch102; preserves the other 240 entries); CLD ∈ {5..7} silently no-op (reserved). `clut_loader_stub` walks the entries via `vram_stub`'s second read port; PSMCT16 entries are unpacked with the same 5→8 bit-replicate the scanout side uses (Ch94). CSM1 swizzle and CPSM ∉ {PSMCT32, PSMCT16} remain deferred.
- Single CRTC, single DISPFB. Real PS2 has two interlace-capable CRTCs (DISPFB1, DISPFB2). One context is enough for TBs to verify the round trip; PMODE.EN2 + DISPFB2 + DISPLAY2 is deferred.
- Read-side addressing. Linear by default (legacy formula `vram_read_addr = FBP*2048 + (effective_y*FBW*64 + effective_x) << bpp_shift`). Four OPTIONAL per-PSM swizzle paths gated by parameters on `gs_pcrtc_stub`: `PSMCT32_SWIZZLE=1` (Ch120) routes PSMCT32 reads through `gs_swizzle_psmct32_stub`; `PSMCT16_SWIZZLE=1` (Ch126) routes PSMCT16 reads through `gs_swizzle_psmct16_stub`; `PSMT8_SWIZZLE=1` (Ch132) routes PSMT8 reads through `gs_swizzle_psmt8_stub` (Ch131), where FBW must be even because PSMT8 pages are 128 px wide and the swizzle internally divides FBW by 2; `PSMT4_SWIZZLE=1` (Ch138) routes PSMT4 reads through `gs_swizzle_psmt4_stub` (Ch137), where FBW must be even (same as PSMT8). The four parameters are independent: enabling one doesn't affect the others. PSMT4's swizzle module also outputs a `nibble_hi` selector that PCRTC uses in place of `pixel_index[0]` to pick which nibble of the byte at the swizzled address holds this pixel (PSMT4 packs 2 pixels per byte and the canonical PCSX2 column table reorders nibbles within a block, so the linear formula's `pixel_index[0]` selector is no longer correct under the swizzled layout). All four swizzle parameter defaults are 0 so all existing PCRTC-using TBs see the legacy linear behavior unchanged. The PSMT4 image-xfer (Ch139) and raster (Ch140) write-side wiring is now live as well, completing the four-PSM × three-path swizzle integration. Both driver-shape e2e demos for PSMT4 are also live: raster-driven (Ch141) and TRXDIR-driven (Ch142). All four common GS PSMs now have BOTH driver-shape e2e demos (CT32 Ch123+Ch124, CT16 Ch129+Ch130, T8 Ch135+Ch136, T4 Ch141+Ch142), closing the four-PSM × three-path × dual-driver-shape e2e foundation.
- Parallel to `platform_video_stub`, not a replacement. We did not extend `platform_video_stub` (which would have rippled through 6 existing TBs). Pcrtc is the alternative video source for TBs that want VRAM-backed scanout. The legacy flood-fill module stays as-is.
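
The legacy linear read address is what every "strict separator" in the swizzle TBs is computed against, so a small model of it is useful when picking sentinel bytes. The sketch below is an assumption-labelled reference model, not the PCRTC RTL: `linear_read_addr` is a hypothetical name, it reads the formula as scaling only the pixel offset (not the FBP term) by the pixel size, and it treats `pixel_index[0] = 1` as the high nibble for PSMT4.

```python
FBP_UNIT = 2048   # bytes per FBP unit in this document
BITS_PER_PIXEL = {"PSMCT32": 32, "PSMCT16": 16, "PSMT8": 8, "PSMT4": 4}


def linear_read_addr(psm, fbp, fbw, x, y):
    """Return (byte_addr, nibble_hi) for the legacy linear layout.
    nibble_hi is only meaningful for PSMT4 and is 0 otherwise."""
    pixel_index = y * fbw * 64 + x              # FBW counts 64-pixel units
    bit_offset = pixel_index * BITS_PER_PIXEL[psm]
    byte_addr = fbp * FBP_UNIT + (bit_offset >> 3)
    nibble_hi = (pixel_index & 1) if psm == "PSMT4" else 0
    return byte_addr, nibble_hi


# Sentinel positions quoted in the TB descriptions:
print(linear_read_addr("PSMT4", 4, 4, 120, 126))   # Ch138 Phase 2 separator
print(linear_read_addr("PSMT8", 4, 4, 120, 60))    # start of Ch132 Phase 2 range
print(linear_read_addr("PSMCT16", 0, 1, 0, 1))     # Ch126 y=1 row start
```

A sentinel chosen this way sits where a fall-through to linear addressing would read, which is what makes it a useful tripwire once the swizzle gate is on.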
End-to-end demo manifest (Ch143)
Eight driver-shaped end-to-end byte-accurate demos cover the four common GS PSMs across both driver shapes (raster-driven PACKED-SPRITE payload + TRXDIR-driven IMAGE payload). Each demo runs the same EE-bootlet → DMAC → GIF → GS → vram → swizzled-PCRTC chain with all three same-PSM swizzle gates parameter-set to 1; the listed write-side path is load-bearing and the other write-side path is asserted dormant in the demo flow.
All eight demos emit a 16×8 framebuffer (128 pixels). The raster
column shows (emits, xfer_writes); the TRXDIR column shows
(xfer_writes, emits) — in both cases the load-bearing path
fires 128 times and the dormant path is asserted 0.
| PSM | Raster-driven e2e | TRXDIR-driven e2e |
|---|---|---|
| PSMCT32 | Ch123: `tb_gs_demo_psmct32_swizzle_e2e` (128, 0) | Ch124: `tb_gs_demo_psmct32_swizzle_trxdir_e2e` (128, 0) |
| PSMCT16 | Ch129: `tb_gs_demo_psmct16_swizzle_e2e` (128, 0) | Ch130: `tb_gs_demo_psmct16_swizzle_trxdir_e2e` (128, 0) |
| PSMT8 | Ch135: `tb_gs_demo_psmt8_swizzle_e2e` (128, 0) | Ch136: `tb_gs_demo_psmt8_swizzle_trxdir_e2e` (128, 0) |
| PSMT4 | Ch141: `tb_gs_demo_psmt4_swizzle_e2e` (128, 0) | Ch142: `tb_gs_demo_psmt4_swizzle_trxdir_e2e` (128, 0) |
For each row both demos use the same per-quadrant pixel pattern
(so the verify side is shared across the row), the same DBW-even
constraint where applicable (PSMT8 / PSMT4: 128-px-wide
pages → DBW=2 minimum even), and verification through the
freed-up vram_stub 2nd read port. Ch141 + Ch142 together
close the four-PSM × three-path × dual-driver-shape e2e
foundation, which is the foundation Ch143 manifests and seals.
Hardware-demo candidates:
- PSMCT32 swizzled raster e2e (Ch123). Simplest direct-color path: 4 SPRITE PACKED packets, RGBAQ.{R,G,B,A} mapped 1:1 to scanout RGB, no CLUT, no nibble RMW. The natural first hardware demo because every byte from EE-bootlet through the swizzled 16×8 framebuffer to PCRTC RGB is visible without any indirection. Build target: make tb_gs_demo_psmct32_swizzle_e2e.
- PSMT4 swizzled TRXDIR e2e (Ch142). Strongest indexed/CLUT-like stress path: U1 PACKED A+D TRX setup + U2 IMAGE NLOOP=4 with 32 PSMT4 nibbles per qword, image-xfer engine decoding the canonical PCSX2 columnTable4 (which reorders nibbles within a block, so the linear pixel_index[0] rule is wrong under swizzle), and per-pixel nibble RMW on vram_stub via write_be=4'b0001 + write_mask ∈ {0x0F, 0xF0} keyed by the swizzle's nibble_hi. Exercises the full sub-byte pipeline + the canonical-source-locked column table. Build target: make tb_gs_demo_psmt4_swizzle_trxdir_e2e.
First hardware-targeted top wrapper (Ch146)
Ch146 turns the Ch144 readiness audit + Ch145 BRAM-shrink groundwork
into a real top-level SystemVerilog module: rtl/top/top_psmct32_raster_demo.sv.
This is the module a board-level synthesis project would target
first. Board-level concerns (HDMI/VGA PHY, pin constraints, .mem
bake tooling, clock-domain crossings) are deliberately deferred —
Ch146 proves the design can be expressed as a single hardware-
shape module.
Top ports:
- clk / rst_n / core_go: clock, active-low synchronous reset, start pulse (a board reset-release sequencer can tie core_go high after rst_n deasserts).
- r / g / b / hsync / vsync / de: 8-bit RGB scanout from PCRTC.
- core_halt / dma_done_seen / frame_seen: debug/status bundle suitable for LEDs or a board-level state observer.
Top parameters: H_ACTIVE (default 16), V_ACTIVE (default 8),
BIOS_SIZE_BYTES, RAM_SIZE_BYTES, VRAM_BYTES,
USEG_SHADOW_WORDS_PARAM (default 1024 = 4 KiB per Ch145).
Image fixtures are passed via macros (iverilog-12 string-
parameter forwarding limitation):
TOP_PSMCT32_RASTER_DEMO_BIOS_IMAGE_FILE and
TOP_PSMCT32_RASTER_DEMO_PAYLOAD_IMAGE_FILE. The fixtures are
baked by sim/data/top_psmct32_raster_demo/bake.py
which writes:
- bios.mem: 18-word EE bootlet (one 32-bit hex word per line)
- payload.mem: 40 qwords for ee_ram_stub (16 zero qwords + 24 GIF qwords carrying 4 SPRITE PACKED packets)
The bake script is a deterministic Python rewrite of the
procedural ee_prog_word() + preload_qword() loops in the
Ch123 TB. Same bit-exact values, just baked into static repo
artifacts so a hardware top can $readmemh them.
Focused TB: sim/tb/top/tb_top_psmct32_raster_demo.sv.
Drives the top with the static fixtures, captures one full
PCRTC frame after the EE halts and DMAC completes, and asserts
the per-quadrant RGB matches the Ch123 frame exactly. Counts:
raster_emits=128, errors=0, core_halt=1, dma_done_seen=1, frame_seen=1.
Bug-fix iteration: the first bake had Y in XYZ2 placed at
bits[43:32] instead of bits[31:20] — a Python translation error
of the SystemVerilog {32'd0, y_int, 4'd0, x_int, 4'd0}
concatenation. Symptom: per-sprite emit count was 8 instead of
32 (each sprite drew one row), and VRAM held the per-sprite R
component scattered across 32 consecutive 4-byte cells. Caught
by adding a per-emit observer that printed
(addr, data, be, mask, color_q) for the first 10 emits.
Fix: y << 20 instead of y << 32 in bake.py. PASS after
the fix.
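The bit placement behind this fix can be checked with a minimal Python sketch. It assumes the concatenation's fields are 12-bit integer parts (y_int, x_int), each followed by 4 fractional bits, which is what the quoted bits[31:20] position implies. The function names pack_xyz2 and pack_xyz2_buggy are hypothetical and are not taken from bake.py.

```python
def pack_xyz2(x_int: int, y_int: int) -> int:
    """Pack integer X/Y into the low 32 bits of a 64-bit XYZ2 word.

    Mirrors {32'd0, y_int, 4'd0, x_int, 4'd0}, assuming 12-bit integer
    fields (an assumption inferred from the quoted bit positions).
    """
    assert 0 <= x_int < (1 << 12) and 0 <= y_int < (1 << 12)
    # x_int sits above its 4 fractional bits: bits [15:4].
    # y_int sits above the 16 X bits plus its own 4 fractional bits: bits [31:20].
    return (y_int << 20) | (x_int << 4)


def pack_xyz2_buggy(x_int: int, y_int: int) -> int:
    """The first-bake mistake: shifting Y by 32 lands it in bits [43:32]."""
    return (y_int << 32) | (x_int << 4)


good = pack_xyz2(x_int=8, y_int=4)
bad = pack_xyz2_buggy(x_int=8, y_int=4)
print(hex(good))                # 0x400080
print((bad >> 20) & 0xFFF)      # 0: the Y field the consumer reads is empty
```

With the buggy shift the Y field at bits [31:20] reads as zero for every sprite corner, which is consistent with the single-row-per-sprite symptom described above.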
What's still NOT in this chapter (deferred to Ch147+):
- Real .mem bake tooling integration (currently the bake.py is run manually before sim; a Makefile target or CI step that invokes it would belong in Ch147).
- Board-specific top: pin constraints, target FPGA family, PHY shim (HDMI/DVI/VGA), reset-release sequencer.
- A multi-PSM top (the Ch142 PSMT4 TRXDIR variant would be a natural second wrapper once the build flow is proven).
Fixture bake flow (Ch147)
Ch147 makes the Ch146 .mem bake first-class so the static
fixtures can't drift from bake.py. Three new Makefile targets:
| Target | Purpose |
|---|---|
| top_psmct32_raster_demo_mem | Re-runs bake.py; produces bios.mem + payload.mem atomically. |
| top_psmct32_raster_demo_mem_check | Verifies fixture sizes (bios.mem = 1024 lines, payload.mem = 256). |
| tb_top_psmct32_raster_demo (existing) | Now declares top_psmct32_raster_demo_mem as a prerequisite. |
The bake target uses Make's grouped-target syntax (&:) so a
single bake.py run produces both files atomically — they can
never be out-of-step.
The size-check target counts payload lines (skipping blanks +
// ... comment-only lines) and asserts the exact expected
counts. A non-matching count exits with status 1, surfacing a
fixture/script drift as a hard build failure.
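The counting rule can be illustrated with a short Python sketch. The actual Makefile recipe is not shown in this document, so this is only an illustration of the described behavior (skip blank lines and //-comment-only lines, compare against an exact count, return status 1 on mismatch). The names count_payload_lines and check_fixture are hypothetical.

```python
def count_payload_lines(text: str) -> int:
    """Count lines that carry data: skip blanks and //-comment-only lines."""
    count = 0
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("//"):
            continue
        count += 1
    return count


def check_fixture(text: str, expected: int) -> int:
    """Return a process-style status: 0 on an exact match, 1 on drift."""
    return 0 if count_payload_lines(text) == expected else 1


sample = "// header comment\n00000000\n\n3c080000\n"
print(count_payload_lines(sample), check_fixture(sample, 2))  # 2 0
```

For the real fixtures the expected counts would be 1024 for bios.mem and 256 for payload.mem, as listed in the table above.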
Deleting the fixtures and running the TB triggers the bake automatically:
$ make tb_top_psmct32_raster_demo
=== bake top_psmct32_raster_demo .mem fixtures ===
python3 .../bake.py
[bake] wrote bios.mem (1024 words, 18 active) and payload.mem (256 qwords, 40 active)
=== build tb_top_psmct32_raster_demo ===
...
[tb_top_psmct32_raster_demo] PASS
Synthesis-facing macros
When pointing a synthesis tool at rtl/top/top_psmct32_raster_demo.sv,
two preprocessor defines must be set so bios_rom_stub and
ee_ram_stub find their $readmemh images. These are macros
(NOT module parameters) per the iverilog-12 string-parameter
forwarding workaround documented in the Ch146 wrapper banner;
they map cleanly to FPGA-tool defines.
| Macro | Value |
|---|---|
| TOP_PSMCT32_RASTER_DEMO_BIOS_IMAGE_FILE | Absolute (or tool-relative) path to bios.mem |
| TOP_PSMCT32_RASTER_DEMO_PAYLOAD_IMAGE_FILE | Absolute (or tool-relative) path to payload.mem |
Both default to "" so the wrapper still elaborates without
fixtures (synthetic NOP-sled in bios_rom_stub + zero-init
ee_ram_stub, which produces no DMAC payload but a stable
PCRTC frame with r=g=b=0).
Vivado (preprocessor verilog_define on the synthesis +
implementation filesets — these are macros, not module
generics):
set_property verilog_define { \
TOP_PSMCT32_RASTER_DEMO_BIOS_IMAGE_FILE="$path/bios.mem" \
TOP_PSMCT32_RASTER_DEMO_PAYLOAD_IMAGE_FILE="$path/payload.mem" \
} [get_filesets sources_1]
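Repeat for the implementation fileset if it diverges from
sources_1. Tcl does not expand $path inside a brace-quoted
list, so substitute the literal fixture directory there.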
Quartus (project-level macro defines):
set_global_assignment -name VERILOG_MACRO \
"TOP_PSMCT32_RASTER_DEMO_BIOS_IMAGE_FILE=\"$path/bios.mem\""
set_global_assignment -name VERILOG_MACRO \
"TOP_PSMCT32_RASTER_DEMO_PAYLOAD_IMAGE_FILE=\"$path/payload.mem\""
Iverilog (sim): the Ch147 Makefile passes them via -D
flags in the tb_top_psmct32_raster_demo build rule —
-DTOP_PSMCT32_RASTER_DEMO_BIOS_IMAGE_FILE='"$(SIM_DIR)/data/... /bios.mem"' — and the top_psmct32_raster_demo_mem
prerequisite ensures the .mem files exist before the TB
elaborates.
DE25-Nano synthesis scaffold (Ch148)
Ch148 makes the Ch146 hardware top synthesis-addressable on DE25-Nano without committing to a video PHY shim or final pin constraints (those land in Ch149+).
| File / target | Purpose |
|---|---|
| synth/de25_nano/top_psmct32_raster_demo/files.f | RTL filelist: Ch123 dep tree only (~14 entries). |
| synth/de25_nano/top_psmct32_raster_demo/README.md | Top module + macros + fixtures + DE25-Nano clock/reset/video assumptions. |
| make top_psmct32_raster_demo_synth_check | Validates files.f paths + fixture presence. |
The synth-check target depends on top_psmct32_raster_demo_mem_check,
so a single command verifies fixture sizes AND that every file
referenced by the synth filelist exists. It exits non-zero on
any miss — surfacing both fixture drift (Ch147 size guard) and
filelist drift as hard build failures.
.qsf (Quartus pin assignments) is not committed in Ch148.
The README documents the board assumptions (clock domain,
reset polarity, core_go strategy, video-out path candidates,
LED status mapping) so the next chapter can author it without
inventing context. The point of Ch148 is that a Quartus project
import (or Vivado / verilator --lint-only) finds every file
the design needs, with the macros documented end-to-end.
DE25-Nano board wrapper (Ch149)
Ch149 turns the Ch146 board-agnostic top into a real board top without yet committing to pin assignments or a video PHY. New:
| Artifact | Purpose |
|---|---|
| rtl/top/de25_nano_psmct32_raster_demo_top.sv | Board wrapper: DE25-Nano signal names + reset sequencer + LED status. |
| sim/tb/top/tb_de25_nano_psmct32_raster_demo_top.sv | Smoke TB exercising clock/reset/core_go/LED/video pins. |
Top ports (matching the Terasic Golden_top.v conventions
from the DE25-Nano resource CD): CLOCK0_50 / CLOCK1_50 /
CLOCK2_50, KEY[1:0] (active-LOW), SW[3:0], LED[7:0]
(active-LOW), and raw VIDEO_R/G/B/HSYNC/VSYNC/DE outputs that
a future PHY shim will consume.
Reset bridge:
- ninit_done sourced from Terasic's reset_release IP under `ifdef USE_TERASIC_RESET_RELEASE_IP (default-off; sim uses an inline 16-cycle stub matching the IP's shape).
- KEY[0] + ninit_done feed an async-assert/sync-deassert 2-stage shift register on CLOCK2_50. Mirrors the retroDE_nes pattern at retroDE_nes.sv:170-177.
core_go sequencer: 16-cycle delay after core_rst_n
deasserts, then a one-cycle core_go pulse. Matches the
"recommended hardware path" documented in the Ch148 README and
the level-sensitive go_i semantics at ee_core_stub.sv:812-813.
LED status: the Ch146 wrapper's three sticky status outputs
drive LED[2:0] (active-LOW): LED[0] = ~core_halt,
LED[1] = ~dma_done_seen, LED[2] = ~frame_seen. LED[7:3]
tied HIGH (OFF).
Smoke TB counts: core_go_pulses=1, all three status LEDs
eventually latch (the actual fall-edge order is frame_seen
first, then core_halt, then dma_done_seen — frame_seen
is a "PCRTC alive" indicator that fires on the first empty
frame after reset, well before the bootlet runs), and
VIDEO_DE rises inside the active region. Standalone PASS.
.qsf (pin assignments), PLL, and video PHY shim remain
deferred (Ch150+). Ch149 makes the design board-shaped, not
yet board-pinned.
Quartus scaffold for DE25-Nano (Ch150)
Ch150 commits the first real Quartus artifacts for the Ch149
board wrapper — a minimal .qsf + .sdc pair, deliberately
PHY-light:
| File | Purpose |
|---|---|
| synth/de25_nano/top_psmct32_raster_demo/de25_nano_psmct32_raster_demo_top.qsf | Device + family + pin assignments + IO standards + .mem macros + file list. |
| synth/de25_nano/top_psmct32_raster_demo/de25_nano_psmct32_raster_demo_top.sdc | CLOCK2_50 50 MHz clock + reset-sync false-path + IO false-paths. |
| make top_psmct32_raster_demo_quartus_scaffold_check | Validates both files exist + top entity + pins + clock period. |
Device (sourced from retroDE_splash.qsf): Agilex 5
A5EB013BB23BE4SCS, package VPBGA. Top entity:
de25_nano_psmct32_raster_demo_top (the Ch149 board wrapper —
NOT the inner Ch146 module). Pin assignments match the
DE25-Nano board pinout used by retroDE_splash and
retroDE_nes: CLOCK2_50 → PIN_BF23, KEY[0] → PIN_C8,
LED[2:0] → PIN_DN22 / PIN_DJ32 / PIN_DF35. CLOCK0/1_50,
KEY[1], SW[3:0], and LED[7:3] are also assigned (their canonical
pins) so Quartus doesn't flag them as unconstrained inputs/
outputs even though the Ch149 wrapper ties them off.
SDC (sourced from retroDE_splash.sdc): a single 50 MHz
create_clock on CLOCK2_50, the standard reset-sync first-stage
false-path (set_false_path -to [get_registers -nowarn {*rst_sync[0]}]), and IO false paths for KEY[*], SW[*],
LED[*] plus the as-yet-unpinned VIDEO_* outputs (replaced
by real set_output_delay constraints when the PHY shim
lands).
.mem macros baked into the QSF (project-relative paths):
TOP_PSMCT32_RASTER_DEMO_BIOS_IMAGE_FILE = sim/data/top_psmct32_raster_demo/bios.mem
and the matching payload macro. Run make -C sim top_psmct32_raster_demo_mem before launching Quartus.
USE_TERASIC_RESET_RELEASE_IP is not defined in this
QSF — keeping the wrapper self-contained for the first project
import. To wire in Terasic's reset_release IP, define the
macro and add the IP file from
DE25_Nano_ResourceCD/Demonstration/FPGA/Board_Info_RTL/reset_release/.
Deferred to Ch151+: video PHY pins + shim (HDMI ADV7513 +
I²C config FSM, VGA DAC, or PMOD), PLL .ip config, LPDDR4 /
SDRAM / HPS / CAM / UART / GPIO assignments. Ch150 makes the
project Quartus-importable, not yet Quartus-buildable for video
output.
PLL + lock-gated reset (Ch151)
Ch151 adds the most conservative hardware bring-up step before
touching the video PHY: a board-clock PLL on the path between
CLOCK2_50 and the design clock, with the reset bridge gated
on PLL lock so the design can only leave reset once the PLL is
stable.
| Artifact | Purpose |
|---|---|
| rtl/top/de25_nano_pll_stub.sv | Sim stub matching the Quartus IOPLL pll module signature. |
| rtl/top/de25_nano_psmct32_raster_demo_top.sv (Ch151) | Reworked with PLL instantiation + lock-gated reset bridge + design_clk distribution to the Ch146 wrapper and core_go sequencer. |
| tb_de25_nano_psmct32_raster_demo_top (Ch151 update) | Adds rising-edge timestamps for pll_locked / core_rst_n / core_go and asserts the contract pll_locked < core_rst_n < core_go. |
PLL signature (matches retroDE_nes/ip/pll/pll_bb.v and
retroDE_splash/ip/sys_pll/sys_pll_bb.v):
module pll (
input wire refclk,
input wire rst,
output wire outclk_0,
output wire locked
);
Sim stub behavior: outclk_0 = refclk (pass-through, no
multiplication — sim doesn't need a different frequency, and a
pass-through still exercises the lock-gated reset bridge).
locked rises after 32 cycles with rst low; held LOW while
rst is HIGH.
Reset gating: the board top's rst_sync register
async-asserts on (ninit_done | ~pll_locked) — both FPGA init
AND PLL lock must complete before reset can deassert.
Synth swap: define USE_PLL_IP and add a Quartus IOPLL
.qip to the project; the board wrapper's `ifdef USE_PLL_IP
swaps the stub for the real IP. The QSF documents the swap
mechanism but ships with the IP commented out, keeping the
scaffold self-contained until the PLL chapter (Ch152+) commits
a frequency choice + IP file.
TB contract (smoke output): t_pll/rstn/go=(950000, 990000, 1330000) ps. The PLL locks at 950 ns, reset deasserts 40 ns
later (the 2-stage sync register propagation), and core_go fires
340 ns after that (the GO_DELAY=16 wait). Order assertions catch
any future regression of the gating.
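The ordering contract and the quoted gaps can be reproduced with a small Python sketch. It assumes the three timestamps are in picoseconds (the reading under which the 950 ns / 40 ns / 340 ns figures line up) and that the design clock is the 50 MHz CLOCK2_50 passed through by the sim stub. The helper name check_order is hypothetical and is not part of the TB.

```python
CLOCK_PERIOD_PS = 20_000  # assumed: 50 MHz clock passed through by the PLL stub


def check_order(t_pll: int, t_rstn: int, t_go: int) -> dict:
    """Check pll_locked < core_rst_n < core_go and report the gaps."""
    assert t_pll < t_rstn < t_go, "gating order violated"
    lock_to_reset_ps = t_rstn - t_pll
    reset_to_go_ps = t_go - t_rstn
    return {
        "lock_to_reset_ns": lock_to_reset_ps // 1000,
        "reset_to_go_ns": reset_to_go_ps // 1000,
        "lock_to_reset_cycles": lock_to_reset_ps // CLOCK_PERIOD_PS,
        "reset_to_go_cycles": reset_to_go_ps // CLOCK_PERIOD_PS,
    }


gaps = check_order(950_000, 990_000, 1_330_000)
print(gaps)
```

Under those assumptions the lock-to-reset gap is 2 clock periods, matching the 2-stage synchronizer, and the reset-to-go gap is 17 periods.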
Deferred to Ch152+: real PLL output frequency tuning (the
stub passes refclk through; a real build sets outclk_0 to
whatever the video PHY chapter needs), committing the actual
IOPLL .ip file under synth/de25_nano/.../ip/, the video
PHY shim itself.
First Quartus compile + baseline report (Ch152)
Ch152 is the chapter where the toolchain is finally asked the honest question: "does this DE25-Nano board top synthesize, fit, and pass static timing analysis?"
Driver: synth/de25_nano/top_psmct32_raster_demo/build_quartus.sh
runs quartus_syn → quartus_fit → quartus_sta against the Ch150
QSF + Ch151 PLL stub. quartus_asm (bitstream gen) is
deliberately skipped — Ch152 is a compile-and-report smoke,
not a deploy path. USE_PLL_IP is left undefined so the Ch151
self-contained PLL stub stays under test (per Codex framing).
Make targets:
| Target | Action |
|---|---|
| make quartus_compile | Full syn + fit + sta flow. |
| make quartus_compile_clean | Wipe outputs first, then full flow. |
| make quartus_syn_only | Synthesis only (~14 min smoke). |
| make quartus_compile_report | Run parse_reports.py on the latest output. |
Ch152 RTL fixes that landed before synthesis would even elaborate:
| Issue | Fix |
|---|---|
| QSF line-continuation (\) parse error in set_global_assignment -name VERILOG_MACRO | Collapsed each assignment onto a single line. |
| vram_stub.mem 8192-iter init loop exceeded Quartus's 5000-iter synthesizable-loop limit (Error 13356) | Wrapped initial block in // synthesis translate_off / _on pragmas. Real Altera/Intel BRAM is power-on-zero so the procedural loop is sim-only. |
| gs_pcrtc_stub / gif_image_xfer_stub / gs_stub unconditionally instantiate all four swizzle math primitives even when their gate is 0 | Added gs_swizzle_psmct16/8/4_stub.sv to the synth filelist + QSF (iverilog trimmed silently; Quartus errors). |
| gs_stub.interp_byte (Ch86 Gouraud TRI math) 64-bit signed divide hits Quartus Pro's lpm_divide LPM_WIDTHN ≤64 limit (Error 272006) | Wrapped divide in // synthesis translate_off; default fallback returns 0. The Ch123 SPRITE-only demo doesn't exercise Gouraud TRIs, so this is dead code in the build. A future Gouraud-TRI hardware demo would need a divider redesign sized for Agilex 5. |
| QSF SDC_FILE referenced via repo-root-relative path failed when the build script ran Quartus from a per-build work dir (Warning 16124) | Changed to basename-only, which works from either the repo root or the work dir (the script symlinks the SDC alongside the QSF). |
First successful synthesis: 0 errors, 3 warnings, 14:08 elapsed. 160 RAM segments + 26 DSP elements inferred.
Fitter result — design too large for the part (the chapter's honest answer):
Total dedicated logic registers : 121,176
Total pins : 17 / 351 ( 5 %)
Total block memory bits : 65,536 / 7,331,840 (<1 %)
Total RAM Blocks : 6 / 358 ( 2 %)
Total DSP Blocks : 20 / 188 (11 %)
Logic utilization (ALMs needed) : 155,104 / 46,800 (331 %)
The design needs 155,104 ALMs vs the part's 46,800 — 3.31×
oversized. Error (170011): Design contains 260,263 blocks of type combinational node. However, the device contains only 93,600 blocks.
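The oversize factors follow directly from the fitter numbers quoted above. A minimal Python sketch of the arithmetic (the helper name utilization is hypothetical):

```python
def utilization(needed: int, available: int) -> tuple[float, int]:
    """Return (ratio, rounded percent) for a resource request."""
    ratio = needed / available
    return ratio, round(ratio * 100)


alm_ratio, alm_pct = utilization(155_104, 46_800)
comb_ratio, comb_pct = utilization(260_263, 93_600)
print(f"ALMs: {alm_ratio:.2f}x ({alm_pct} %)")
print(f"combinational nodes: {comb_ratio:.2f}x ({comb_pct} %)")
```

The two ratios differ (about 3.31× for ALMs versus about 2.78× for combinational nodes) because they count different resources; both show the same failure, the design not fitting the part.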
Why so big (the precise picture, to be drilled into by Ch153+):
The synthesis log reports Info (22567): extracting RAM for
all four memory identifiers — ee_ram_stub.mem,
bios_rom_stub.mem, ee_memory_map_stub.useg_shadow_mem, and
vram_stub.mem — so Quartus did recognize each as a memory
structure at syn time. But the fit report shows only 65,536
bits / 6 RAM Blocks committed (roughly enough for BIOS 4 KB +
EE-RAM 4 KB). Something between syn and fit caused the larger
arrays — most likely vram_stub.mem (8 KB) and possibly
useg_shadow_mem (4 KB after Ch145's 1024-word shrink) — to
either (a) be replicated into combinational mux/decoder logic
because of their access-port shape, or (b) lose their RAM
attribute during fitter optimization and fall back to
flip-flop implementation. The 121,176 dedicated registers + the
260,263 combinational nodes are consistent with at least
u_vram getting massively unrolled.
Ch153's job is to isolate which array(s) and which port
shape(s) prevent compact block-RAM implementation. The
likely candidates: vram_stub's dual read ports + per-byte
write_be lane (Ch95's per-byte gate may not be RAM-block-
friendly on Agilex 5), and the EE memory map's wide arbitration
into the useg-shadow port. None of this is fixed in Ch152 —
surfacing the gap precisely is the chapter's deliverable.
Other notable findings (full list in
output_files/build_logs/):
- Critical Warning 20759: "Use the Reset Release IP in Agilex 5 designs to ensure a successful configuration." This is the Ch151 `ifdef USE_TERASIC_RESET_RELEASE_IP opt-in; enabling it (and committing the IP file) is a Ch153+ task.
- 6× Warning 16749: identifiers used before declaration in dmac_reg_stub, gif_packed_stub, gs_stub, gif_image_xfer_stub. Style/lint warnings, no functional impact; clean-up candidate for a future polish chapter.
- STA never ran because fit failed.
What Ch152 leaves for Ch153+:
- Resource reduction. Most likely candidates: BRAM-infer vram_stub.mem and useg_shadow_mem cleanly (Quartus attribute hints / restructure read ports), or shrink the EE core's MIPS decode (table-driven vs LUT-driven), or move to a larger Agilex 5 part if available.
- Enabling USE_TERASIC_RESET_RELEASE_IP and committing the Terasic reset_release IP file.
- The PHY shim chapter (VIDEO_* virtualized → real HDMI ADV7513 / VGA / PMOD pins).
- Cleaning up the 6× forward-reference style warnings.
Memory-shape forensics (Ch153)
Ch153 is a memory-forensics chapter (NOT a rewrite chapter): two
isolated tiny Quartus projects under synth/de25_nano/experiments/
target the same Agilex 5 part as the Ch150 board top so resource
deltas are apples-to-apples. The goal is to identify which feature(s)
of vram_stub's shape prevent compact block-RAM implementation and
drive the Ch152 size deficit.
| Experiment | Memory shape |
|---|---|
| exp_a_bram_friendly | 2048 × 32-bit, single port, sync read + sync write with byte-WE. Intel-friendly BRAM template. |
| exp_b_vram_shape | 8192 × 8-bit, dual COMBINATIONAL read, byte-WE + per-bit mask RMW. Exact vram_stub shape. |
The result is decisive:
| Metric | exp_a (BRAM-friendly) | exp_b (vram_stub-shape) |
|---|---|---|
| Fitter status | ✅ Successful | ❌ Failed |
| Logic utilization (ALMs) | 46 / 46,800 (< 1 %) | (fit failed — placement reports 257,986 combinational nodes vs 93,600 device max) |
| Total dedicated logic registers | 0 | 65,536 |
| Total RAM Blocks | 4 / 358 (1 %) | 0 / 358 (0 %) |
| Total block memory bits | 65,536 (8 KB) | 0 |
Interpretation:
- The Intel-friendly shape maps the same 8 KB to 4 RAM Blocks with zero combinational logic and zero registers beyond the read-output flop.
- The vram_stub shape maps the same 8 KB to zero RAM Blocks, 65,536 dedicated registers (one flip-flop per bit), and 257,986 combinational nodes (the 4-byte concatenation multiplexers for the dual combinational reads + the per-bit mask RMW gates).
- The 257,986 combinational-node figure for a single 8 KB memory almost exactly matches the 260,263 combinational-node figure Ch152 reported for the entire top-wrapper design, which is empirical confirmation that u_vram alone accounts for essentially all of the Ch152 size deficit.
Which feature is the dominant cost (the four candidates the shape diff isolates):
The exp_a vs exp_b diff folds four feature changes together
(byte-addressable storage, combinational reads, dual reads,
per-bit mask RMW). To pin down which feature(s) dominate, a
future chapter could insert intermediate experiments — but the
exp_a result already gives the upper bound on what BRAM-native
inference can buy: ~4 RAM blocks + ~50 ALMs for 8 KB. Anything
that gets vram_stub close to that bar wins back the entire
Ch152 fit headroom.
The most likely individual culprit is the per-bit mask RMW:
Agilex 5's M20K BRAM has byte-WE primitives but does NOT have
per-bit RMW. Quartus has to materialize the
(mem & ~mask) | (data & mask) arithmetic outside the BRAM,
which forces the storage out of BRAM and into per-bit flip-flops.
Combinational reads are the second most likely (BRAMs are
synchronous-read-only on Agilex 5; Quartus has to either insert
a register on the read path or materialize the storage as
discrete flip-flops to feed the comb output).
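The difference between the two write styles can be made concrete with a small Python model of one 32-bit word. This is a behavioral sketch only (not the RTL); the function names are hypothetical.

```python
def rmw_masked_write(old_word: int, data: int, mask: int) -> int:
    """Per-bit masked write: (mem & ~mask) | (data & mask) on 32 bits."""
    return ((old_word & ~mask) | (data & mask)) & 0xFFFF_FFFF


def byte_we_write(old_word: int, data: int, be: int) -> int:
    """Byte-enable write: each set bit of the 4-bit `be` replaces one byte lane."""
    result = old_word
    for lane in range(4):
        if (be >> lane) & 1:
            lane_mask = 0xFF << (8 * lane)
            result = (result & ~lane_mask) | (data & lane_mask)
    return result & 0xFFFF_FFFF


old = 0x1234_56AB
# A low-nibble splice needs a 4-bit mask; the masked write keeps the 0xA nibble.
print(hex(rmw_masked_write(old, 0x0000_0005, 0x0000_000F)))  # 0x123456a5
# The closest byte-enable write replaces the whole low byte and loses 0xA.
print(hex(byte_we_write(old, 0x0000_0005, 0b0001)))          # 0x12345605
```

The masked write needs the old contents of the word as an input, whereas the byte-enable write does not. That dependency on prior state is what a byte-WE-only memory primitive cannot provide in a single write.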
Make targets:
| Target | Action |
|---|---|
| make quartus_experiments | Compile every synth/.../experiments/exp_* project. |
| make quartus_experiments_clean | Wipe outputs first, then compile. |
| make quartus_experiments_report | Side-by-side resource summary (no recompile). |
What Ch153 leaves for Ch154+:
- Refactor vram_stub into a BRAM-friendly shape: replace combinational reads with sync (registered output) reads, replace per-bit mask RMW with byte-WE-only writes (move the per-pixel sub-byte merging logic into the writer module, most likely gs_stub.raster_pixel_emit for the PSMT4 nibble case), and switch to 32-bit word-addressable storage with byte-WE for the unaligned-byte case.
- Audit useg_shadow_mem next: it had Info (22567): extracting RAM at synthesis but didn't survive to fit. Likely culprits there: the Ch64/Ch65/Ch70 mirror-write features that turn the simple useg-shadow into a multi-port write structure.
BRAM-friendly vram sibling (Ch154)
Ch154 adds a hardware-friendly sibling of vram_stub —
rtl/gif_gs/vram_bram_stub.sv — that maps cleanly onto Agilex 5
M20K block-RAM. Per Codex's framing, the chapter's blast radius
stays narrow: add the sibling + prove it works + measure the
BRAM-inference win. The actual swap of the board top to use
the new module + the writer-side PSMT4 nibble-RMW rework lands
in Ch155+.
vram_bram_stub shape vs vram_stub:
| Feature | vram_stub (legacy / sim reference) | vram_bram_stub (Ch154, hw-friendly) |
|---|---|---|
| Storage | 8192 × 8-bit byte-addressable | 2048 × 32-bit word-addressable |
| Reads | Combinational; arbitrary alignment | Synchronous (1-cycle); word-aligned only |
| Read ports | 2 (combinational) | 2 (sync, true dual-port M20K) |
| Write granularity | byte-WE + per-bit write_mask RMW | byte-WE only |
| Per-bit mask RMW (Ch106) | yes: supports PSMT4 nibble splice | NO: caller must splice on writer side |
New equivalence TB: tb_vram_bram_stub_equivalence.
Drives both DUTs in lockstep with byte-WE-only writes
(write_mask = 0xFFFFFFFF on the legacy module so the per-bit
RMW path is a no-op), aligns sample times across the new
module's 1-cycle sync-read latency, and asserts data
equivalence across:
- 32-bit word writes (be=4'b1111)
- per-byte-lane writes (be=4'b0001 / 0010 / 0100 / 1000)
- per-byte non-wrapping admission near MAX_BASE
- dual-port read agreement
PASS standalone + in the full sim regression.
Quartus experiment exp_c_vram_bram_stub (synth/.../experiments/exp_c_vram_bram_stub/)
proves the new module infers BRAM cleanly. Side-by-side with
the Ch153 baselines, all on the same Agilex 5 part:
| Experiment | Fit | ALMs | Registers | RAM Blocks | Block memory bits |
|---|---|---|---|---|---|
| exp_a_bram_friendly | ✅ Success | 46 / 46,800 | 0 | 4 / 358 | 65,536 |
| exp_b_vram_shape | ❌ Failed | (261,578 comb nodes vs 93,600 device max) | 65,536 | 0 / 358 | 0 |
| exp_c_vram_bram_stub | ✅ Success | 190 / 46,800 | 2 | 8 / 358 | 131,072 |
Interpretation:
- exp_c lands close to exp_a's ideal (190 vs 46 ALMs; 8 vs 4 RAM Blocks). The slight overhead vs exp_a is the dual read port (M20K replicates storage to serve two independent read addresses simultaneously, hence 2× block memory bits) plus the per-byte non-wrapping admission gate Ch95 inherited from vram_stub.
- exp_c uses 2 dedicated registers, 16× fewer than the 32 it would need if read_data were reset; the canonical Quartus inference template demands no reset on the BRAM data register.
- Versus exp_b's 65,536 registers + 261,578 combinational nodes, swapping vram_stub → vram_bram_stub recovers essentially all of the Ch152 ALM headroom on the vram side. Useg-shadow is the next forensic target (likely similar shape).
Inference template gotcha (caught + fixed in this chapter):
the first cut of vram_bram_stub had a reset on read_data
inside the always_ff block AND an in-bounds gate guarding the
mem read. Quartus rejected BRAM inference with
Info (276007): RAM logic ... uninferred due to asynchronous read logic. Fix: simplified the read path to the canonical
template (always_ff @(posedge clk) read_data <= mem[idx];)
and moved bounds + alignment checks to a parallel read_valid
pipeline. After the fix, Quartus reported Implemented 64 RAM segments instead of 0.
Ch155+ surface — writer-side normalization for ALL sub-32-bit
PSMs, not just PSMT4: vram_bram_stub's contract is stricter
than vram_stub's — write_addr MUST be word-aligned
(write_addr[1:0] == 2'b00), and the byte lane(s) being written
are selected via write_be with the payload pre-shifted into
the right byte lane(s) of write_data[31:0]. Today's writer-
side RTL emits at sub-word boundaries:
- PSMCT16 raster + image-xfer write at halfword addresses (write_addr[1] == 1 for the high halfword) with be=4'b0011 or 4'b1100 and the 16-bit payload in write_data[15:0].
- PSMT8 raster + image-xfer write at byte addresses (any write_addr[1:0]) with be=4'b0001 and the 8-bit payload in write_data[7:0].
- PSMT4 raster + image-xfer write at byte addresses with be=4'b0001 + per-bit write_mask 0x0F or 0xF0 to splice one nibble.
- PSMCT32 raster + image-xfer write at word addresses with be=4'b1111 + the full 32-bit payload. This is the ONLY PSM that natively matches vram_bram_stub's contract today.
If we swap the board top to vram_bram_stub without writer-side
normalization, CT16/T8/T4 writes silently drop because
write_addr[1:0] != 0 fails admission. So Ch155 must rework
each writer to:
- Mask write_addr down to its word base (write_addr & ~32'd3).
- Shift the payload from its native byte lane into the appropriate byte lane(s) of a 32-bit write_data based on the original write_addr[1:0].
- Generate write_be with bits set only for the byte lanes the original sub-word address actually targets.
- For PSMT4 specifically: replace the per-bit write_mask nibble splice with a writer-side read-modify-write: read the existing byte first, splice the new nibble in, then issue a normal byte-WE write. Adds ~1 cycle of latency per nibble-write but that's well within the 16×8 demo budget.
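The first three steps above can be sketched in Python for the byte, halfword, and word cases. This is a behavioral illustration under stated assumptions, not the RTL: the function name normalize_subword is hypothetical, and returning write_be = 0 for a misaligned address is an assumption chosen here so that such a write becomes a no-op.

```python
def normalize_subword(write_addr: int, payload: int, width_bytes: int):
    """Return (word_addr, write_data, write_be) for a 1-, 2-, or 4-byte write.

    Assumption: a write that is not naturally aligned for its width is
    reported with write_be = 0 so the memory ignores it.
    """
    assert width_bytes in (1, 2, 4)
    lane = write_addr & 0x3
    word_addr = write_addr & ~0x3                        # step 1: word base
    if lane % width_bytes != 0:
        return word_addr, 0, 0b0000
    payload &= (1 << (8 * width_bytes)) - 1              # keep only the used bits
    write_data = (payload << (8 * lane)) & 0xFFFF_FFFF   # step 2: shift into lane
    write_be = ((1 << width_bytes) - 1) << lane          # step 3: lane enables
    return word_addr, write_data, write_be


# High halfword of the word at 0x100 (a PSMCT16-style write).
print(normalize_subword(0x102, 0xBEEF, 2))
# Top byte of the same word (a PSMT8-style write).
print(normalize_subword(0x103, 0x7F, 1))
```

The PSMT4 nibble case is not covered by this sketch because it additionally needs the old byte as an input.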
The rework lands inside gs_stub.raster_pixel_emit (Ch95/Ch105/
Ch106 wrote the legacy paths) and gif_image_xfer_stub's per-
PSM dispatch. A focused TB that drives sub-word writes through
the normalizer and asserts the resulting vram_bram_stub words
match the legacy vram_stub byte-/halfword-/nibble-level
state would be the cleanest proof.
Other Ch155+ work:
- Update scanout / debug TBs that sample VRAM via vram_stub's combinational reads to handle the 1-cycle sync-read latency (or keep them on vram_stub if they're sim-only).
- Swap the Ch146 board top to instantiate vram_bram_stub AFTER the writer-side normalization lands. Rerun the full Quartus compile and expect a dramatic ALM/register reduction.
- Audit useg_shadow_mem next: Ch64/Ch65/Ch70 mirror-write features may make it multi-port-write-shaped.
VRAM write normalizer + first BRAM integration (Ch155)
Ch155 lands the writer-side normalization layer that bridges
the contract gap between the legacy vram_stub (byte-addressed
sub-word writes + per-bit RMW) and the new vram_bram_stub
(word-aligned + byte-WE only). Per Codex's framing the chapter
keeps blast radius narrow: build the normalizer + verify it
standalone for all 4 PSMs + prove the easiest case (PSMCT32)
end-to-end through the new VRAM. RTL plumbing into
gs_stub.raster_pixel_emit and gif_image_xfer_stub lands in
Ch156+.
| Artifact | Purpose |
|---|---|
| rtl/gif_gs/vram_normalize_pkg.sv | Pure-comb normalize_write function: natural byte address + PSM + payload + (T4-only) old_byte → word-aligned write_addr + shifted write_data + write_be. |
| tb_vram_normalize_write | Focused unit TB: 17 cases across CT32 / CT16 / T8 / T4 lanes + misuse detection. |
| rtl/top/top_psmct32_raster_demo_bram.sv | Sibling of the Ch146 wrapper with vram_bram_stub swapped in. |
| tb_top_psmct32_raster_demo_bram | Integration TB: drives Ch146 fixtures + verifies VRAM contents at PSMCT32 swizzled addresses via hierarchical probe. |
Function contract (vram_normalize_pkg::normalize_write):
| PSM | byte_addr alignment | payload bits used | output write_be shape | extras |
|---|---|---|---|---|
| PSMCT32 | word (addr[1:0]==0) | payload[31:0] (full ABGR) | 4'b1111 | misuse → drop (be=0000) |
| PSMCT16 | halfword (addr[0]==0) | payload[15:0] (RGB5A1) | 4'b0011 (low) / 4'b1100 (high), keyed on addr[1] | misuse → drop |
| PSMT8 | byte (any) | payload[7:0] (index byte) | one of 4'b0001 / 0010 / 0100 / 1000, keyed on addr[1:0] | - |
| PSMT4 | byte (any) | payload[3:0] (nibble) | one of 4'b0001 / 0010 / 0100 / 1000, keyed on addr[1:0] | needs old_byte + nibble_hi; output is the spliced full byte at the addressed lane |
| any other | - | - | 4'b0000 | - |
PSMT4 splice math (the only PSM whose output depends on
prior memory state): given nibble_hi=0, the function returns
new_byte = {old_byte[7:4], payload[3:0]} — preserves the
upper nibble, replaces the lower. With nibble_hi=1,
new_byte = {payload[3:0], old_byte[3:0]}. The CALLER is
responsible for sourcing old_byte via a 1-cycle read of
mem[byte_addr] upstream of the write; the function itself is
purely combinational. The Ch156+ RTL plumbing chapter is
where that read pipeline lives inside
gs_stub.raster_pixel_emit and gif_image_xfer_stub.
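The splice rule can be written out as a short Python model. This is a behavioral sketch of the math described above, not the SystemVerilog function; the names splice_nibble and normalize_psmt4 are hypothetical, and the lane placement assumes the same addr[1:0]-keyed byte lanes as the PSMT8 row of the contract table.

```python
def splice_nibble(old_byte: int, payload: int, nibble_hi: int) -> int:
    """Merge a 4-bit payload into old_byte, preserving the other nibble."""
    nib = payload & 0xF
    if nibble_hi:
        return (nib << 4) | (old_byte & 0x0F)   # {payload[3:0], old_byte[3:0]}
    return (old_byte & 0xF0) | nib              # {old_byte[7:4], payload[3:0]}


def normalize_psmt4(byte_addr: int, payload: int, old_byte: int, nibble_hi: int):
    """Return (word_addr, write_data, write_be) for one PSMT4 pixel write.

    The caller supplies old_byte (read from memory beforehand); the spliced
    byte is then written with an ordinary single-lane byte enable.
    """
    lane = byte_addr & 0x3
    new_byte = splice_nibble(old_byte, payload, nibble_hi)
    return byte_addr & ~0x3, new_byte << (8 * lane), 1 << lane


print(hex(splice_nibble(0xAB, 0x5, nibble_hi=0)))  # 0xa5
print(hex(splice_nibble(0xAB, 0x5, nibble_hi=1)))  # 0x5b
```

Because the result is a whole byte, the write that follows needs no per-bit mask, which is the property the BRAM-friendly memory requires.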
top_psmct32_raster_demo_bram integration result: the new
sibling wrapper substitutes vram_bram_stub for vram_stub,
drops write_mask wiring (CT32's mask=0xFFFFFFFF makes the
per-bit RMW path a no-op so dropping it is functionally
equivalent), and accepts the 1-cycle sync-read latency on
PCRTC's vram_read_data path (so PCRTC scanout is 1-pixel
shifted; the integration TB skips frame capture and verifies
VRAM content via direct hierarchical probe). All 128 pixel
words at canonical PSMCT32 swizzled addresses match expected
ABGR. Standalone PASS.
Ch155 critical audit check: vram_normalize_write's
function-level misuse handling pins the contract — passing an
unaligned byte_addr for CT32 OR CT16 returns write_be=4'b0000,
which vram_bram_stub then drops cleanly. Combined with
Codex's stance that "no sub-32-bit writer is allowed to hand
an unaligned address directly to vram_bram_stub", the Ch156+
plumbing chapter has a hard contract to verify against.
Ch156+ surface:
- Insert a 1-cycle byte-read pipeline upstream of the PSMT4 raster emit + image-xfer paths inside gs_stub and gif_image_xfer_stub. The read returns old_byte for normalize_write's splice input.
- Apply normalize_write to all four PSM emit lanes inside both writers.
- Add focused TBs for PSMCT16 / PSMT8 / PSMT4 paths analogous to tb_top_psmct32_raster_demo_bram; each verifies the swizzled VRAM contents under the new normalizer + bram_stub.
- Add a 1-cycle address-stage register inside gs_pcrtc_stub so scanout consumers see a clean combinational-look read (addr→data with the BRAM's internal sync stage hidden).
- Once all four lanes pass, swap the Ch146 board top to use vram_bram_stub directly (or retire vram_stub outright).
- Audit useg_shadow_mem next: the Ch64/Ch65/Ch70 mirror-write features may make it multi-port-write-shaped, which is its own forensic exercise.

Writer-side normalize plumbing — CT16 + T8 (Ch156)
Ch156 plumbs the Ch155 vram_normalize_pkg::normalize_write
function into the BRAM-friendly path so PSMCT16 and PSMT8
raster emits land at the right vram_bram_stub byte lane. The
chapter intentionally keeps blast radius narrow — the function
is wired in at the wrapper site between the unmodified
writer engines (gs_stub.raster_pixel_emit) and
vram_bram_stub, so the legacy byte-addressable contract on
gs_stub's raster emit ports stays exactly as Ch128/Ch134 / etc.
defined them. PSMT4 still requires the read-modify-write
pipeline and is deferred to Ch157+.
| File / target | Role |
|---|---|
| rtl/top/top_psmct32_raster_demo_bram.sv | Wrapper updated: raster_pixel_psm_q exposed; bitbltbuf_q[61:56] provides the PSM during xfer; the muxed (byte_addr, psm, payload) triple is run through vram_normalize_pkg::normalize_write and the result feeds vram_bram_stub. CT32 path remains a passthrough; CT16/T8 paths now write to the right lane. |
| tb_gs_raster_bram_psmct16 | Focused CT16 integration TB: 16×4 SPRITE at FBP=0/FBW=1, halfword 0x6155. Drives gs_stub#(PSMCT16_SWIZZLE=1) directly; verifies all 64 swizzled halfwords land in u_vram.mem[byte_addr >> 2] at the addr[1]-keyed lane; pins the linear-stride separator at byte 0x80 = zero. |
| tb_gs_raster_bram_psmt8 | Focused PSMT8 integration TB: 16×8 SPRITE at FBP=0/FBW=2, byte index 0xA5. Drives gs_stub#(PSMT8_SWIZZLE=1) directly; verifies all 128 swizzled bytes land in u_vram.mem[byte_addr >> 2] at the addr[1:0]-keyed lane. |
Why wrapper-site, not in-engine: keeping gs_stub and
gif_image_xfer_stub byte-addressable preserves the contract
that every Ch128 / Ch134 / Ch140 swizzle TB (and the legacy
vram_stub) was written against. Ch156's only structural
change is that a top wrapper which targets vram_bram_stub
also runs normalize_write between the writer and VRAM. A
future chapter can promote the normalizer into the writer
engines once we've decided to retire vram_stub; until then
the function lives where it can be removed without changing
the writers.
PSMT4 deferral — explicit hard-gate (Ch156 audit Medium #1
fix; superseded by Ch157): when Ch156 closed, the wrapper
masked write_en off when the active PSM was PSMT4
(vram_psmt4_block = (vram_psm_pre == PSM_PSMT4),
vram_we_mux = vram_we_pre && !vram_psmt4_block). Without that
gate, normalize_write's PSMT4 branch returned a real one-byte
write spliced against old_byte=0, silently corrupting VRAM
on any T4 raster emit. The Ch156 focused TB
tb_gs_raster_bram_psmt4_gate drove a 16×4 PSMT4 SPRITE
through the wrapper-shape gate and asserted (1) raster_pixel_emit
pulses fired, (2) every pulse hit the gate (blocked == emit),
(3) VRAM stayed at sentinel 0xDEADBEEF — zero corruption.
Ch157 retires both the gate and that TB: the wrapper now
runs a real RMW pipeline (see "PSMT4 RMW pipeline" section
below) and supplies a live old_byte so the splice produces
correct bytes. The retired TB's coverage is replaced by
tb_gs_raster_bram_psmt4, which drives the same kind of PSMT4
SPRITE but verifies correct nibble splices instead of
absence of writes.
Adversarial coverage on the CT16 / PSMT8 TBs (Ch156 audit Medium #2 fix): both TBs originally drove a single uniform payload across the whole sprite, so a buggy normalizer that wrote all four byte lanes (or duplicated payload, or stomped neighboring lanes) could still leave every checked pixel matching. The TBs now split the image into TWO half-width SPRITEs with distinct payloads:
- tb_gs_raster_bram_psmct16 drives (0,0)..(7,3) with halfword 0x6155 (low halfword lane via PSMCT16 swizzle) and (8,0)..(15,3) with halfword 0x9F8E (high halfword lane of the same 32-bit words). Sentinel preload (0xDEADBEEF) on every VRAM word before the drive plus a linear-stride separator check at byte 0x80 (outside the swizzled set).
- tb_gs_raster_bram_psmt8 drives (0,0)..(7,7) with byte 0xA5 (lanes {0,1}) and (8,0)..(15,7) with byte 0x5A (lanes {2,3}). Same sentinel preload.

A normalizer that swaps lanes, sets be too wide, or fails to preserve the other halfword/byte lane(s) of the shared word now surfaces as a per-pixel mismatch.
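
To illustrate why the split payload matters, here is a minimal Python sketch that runs a correct CT16 normalizer and a deliberately broken one over a single sentinel-preloaded word. The buggy_ct16 function is a hypothetical failure mode invented for the illustration (payload duplicated, byte enables too wide), not a bug observed in the RTL.

```python
SENTINEL = 0xDEADBEEF


def apply_write(word, data, be):
    """Merge `data` into `word` under a 4-bit byte-enable mask."""
    for lane in range(4):
        if be & (1 << lane):
            mask = 0xFF << (8 * lane)
            word = (word & ~mask) | (data & mask)
    return word & 0xFFFFFFFF


def good_ct16(halfword, high):
    """Correct shape: one halfword lane with matching byte enables."""
    return (halfword << 16, 0b1100) if high else (halfword, 0b0011)


def buggy_ct16(halfword, high):
    """Hypothetical bug: payload duplicated across the word, be too wide."""
    return (halfword << 16) | halfword, 0b1111


def run(normalizer, low_payload, high_payload):
    """Write the low lane, then the high lane, of one sentinel word."""
    word = SENTINEL
    for payload, high in ((low_payload, False), (high_payload, True)):
        data, be = normalizer(payload, high)
        word = apply_write(word, data, be)
    return word


def expected(low_payload, high_payload):
    return (high_payload << 16) | low_payload


# Uniform payload: the buggy normalizer still produces the expected word.
uniform_caught = run(buggy_ct16, 0x6155, 0x6155) != expected(0x6155, 0x6155)
# Split payload: the stomped low lane shows up as a mismatch.
split_caught = run(buggy_ct16, 0x6155, 0x9F8E) != expected(0x6155, 0x9F8E)
```

In this sketch uniform_caught is False and split_caught is True, which is the gap the two-payload TBs close.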
Sim regression: 141 PASS / 0 FAIL after the audit fixes
(140 + the new tb_gs_raster_bram_psmt4_gate).
xfer-side coverage: gif_image_xfer_stub already feeds
the wrapper's pre-normalize mux during xfer_busy. CT32
TRXDIR uploads (no Ch156 TB exists yet, but the path is
wired) pass through the normalizer cleanly because xfer
emits CT32 word-aligned. CT16 + T8 xfer TBs that exercise
this path are a follow-on item — the wiring is already in
place; only a focused TB is missing.
Sim regression: 140 PASS / 0 FAIL after Ch156 (138 + 2 new BRAM-integration TBs).
PSMT4 RMW pipeline — vram_bram_stub writes enabled (Ch157)
Ch157 closes the last writer-PSM gap that Ch156 left behind: the
PSMT4 hard-gate is replaced by a wrapper-site read-modify-write
pipeline that supplies a LIVE old_byte from VRAM, splices the
new nibble against it, and commits a full-byte write through
vram_bram_stub's byte-WE (no per-bit RMW required). The
nibble splice itself uses the SAME math as vram_normalize_pkg's
PSMT4 branch (new = nibble_hi ? {nib, old[3:0]} : {old[7:4], nib})
but lives inline in the wrapper, not inside a call to
normalize_write — the function is pure-comb and would have
required old_byte to be combinationally available, whereas
vram_bram_stub's registered read port hands the byte back one
cycle later. The CT32/CT16/T8 paths still call normalize_write
directly (same-cycle, no read-back required). Goal Codex framed:
"all writer PSMs safe before swapping the board top."
Pipeline shape (inside
rtl/top/top_psmct32_raster_demo_bram.sv):
emit cycle N: is_t4_emit=1; vram_read2_addr = byte_addr & ~3;
pipe_q <= (byte_addr, nibble_hi, nibble[3:0]).
posedge → cycle N+1: vram_read2_data = mem[byte_addr] (sync read);
splice new_byte = nibble_hi
? {nibble, old[3:0]}
: {old[7:4], nibble};
drive vram_we_final=1, write_addr=byte_addr&~3,
write_data shifted to byte_addr[1:0] lane,
write_be one-hot to that lane.
posedge → cycle N+2: mem[byte_addr] commits new_byte.
old_byte is sourced from the lane-correct slice of
vram_read2_data. CT32/CT16/T8 emits skip the pipe entirely and
fall through vram_norm same-cycle (CT32 stays a passthrough,
existing TBs unaffected).
Forwarding hazard — back-to-back same-byte writes: a PSMT4
SPRITE rasters adjacent pixels at x=2k and x=2k+1 to the
SAME byte_addr (low + high nibble of a single byte). At cycle
N+1 the wrapper reads mem[byte_addr] for emit-2 in the SAME
posedge that emit-1's write commits. NBA semantics inside
vram_bram_stub (separate always_ff blocks for the write port
and the read port) make the read see the PRE-write value, so
emit-2 would splice against stale data. The Ch157 pipe carries
a 1-deep t4_prev_* register set (addr + new_byte from the
just-completed RMW) and forwards t4_prev_new_byte_q whenever
the in-flight emit's byte_addr matches the previous emit's
byte_addr. The forwarding chain extends across any number of
back-to-back same-byte emits — emit-N reads emit-(N-1)'s
new_byte from the forward register, splices on top, and
emit-(N+1) reads emit-N's new_byte from that same register.
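
The hazard and its fix can be reproduced in a cycle-level Python sketch. It is a simplification under stated assumptions: memory is modeled per byte rather than per 32-bit word, the names are illustrative, and the forward register is invalidated on idle cycles (the text does not say how the RTL treats t4_prev_* when no RMW completes).

```python
def splice(old_byte, nibble, nibble_hi):
    """PSMT4 nibble splice, same math as the wrapper description."""
    if nibble_hi:
        return ((nibble & 0xF) << 4) | (old_byte & 0x0F)
    return (old_byte & 0xF0) | (nibble & 0xF)


def run_pipe(mem, emits, forwarding=True):
    """emits: one entry per cycle, None or (byte_addr, nibble_hi, nibble)."""
    mem = dict(mem)
    pipe = None          # emit accepted last cycle, waiting for read data
    read_data_q = 0      # registered read-port output
    prev = None          # (byte_addr, new_byte) of the RMW that just committed
    for emit in list(emits) + [None, None]:   # extra cycles drain the pipe
        write = None
        if pipe is not None:
            addr, nibble_hi, nibble = pipe
            old = read_data_q
            if forwarding and prev is not None and prev[0] == addr:
                old = prev[1]                 # bypass the stale read
            write = (addr, splice(old, nibble, nibble_hi))
        # Clock edge: the read samples memory BEFORE this edge's write lands.
        if emit is not None:
            read_data_q = mem.get(emit[0], 0)
        if write is not None:
            mem[write[0]] = write[1]
        prev = write
        pipe = emit
    return mem


start = {0x40: 0xEF}
pair = [(0x40, 0, 0xA), (0x40, 1, 0xA)]       # low then high nibble, same byte
with_fwd = run_pipe(start, pair, forwarding=True)[0x40]
without_fwd = run_pipe(start, pair, forwarding=False)[0x40]
```

With forwarding the byte ends at 0xAA; without it the second emit splices against the stale 0xEF and the byte ends at 0xAF, losing the first nibble.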
| File / target | Role |
|---|---|
| rtl/top/top_psmct32_raster_demo_bram.sv | Ch156 hard-gate replaced by the RMW pipe + forwarding registers; vram_read2_addr driven on T4 emit cycles; vram_we_final mux selects T4 pipe write or non-T4 same-cycle path. |
| tb_gs_raster_bram_psmt4 | New positive-proof TB: drives a 16×4 LINEAR PSMT4 SPRITE (PSMT4_SWIZZLE=0 so adjacent x's hit the same byte) split into two halves with distinct nibbles (0xA / 0x5). 64 raster emits; verifies every byte under the sprite holds the expected pair of spliced nibbles (left half = 0xAA, right half = 0x55) plus sentinel preserved on bytes outside the sprite. PASS. |
| tb_gs_raster_bram_psmt4_gate | Retired: the gate it asserted no longer exists. |
Why LINEAR PSMT4 in the new TB: the linear address formula
(y*FBW*32) + (x>>1) puts adjacent x's at the same byte, which
is exactly the back-to-back same-byte forwarding hazard. The
swizzled path scatters bytes via columnTable4, so it touches
the forwarding logic less often. Linear coverage is strictly
stronger here.
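
A short Python sketch of the linear formula shows how often consecutive emits collide on a byte. It assumes x-major raster order, FBW=1, and that odd x selects the high nibble (consistent with the low + high pairing described earlier); none of these are confirmed TB settings.

```python
def linear_psmt4_addr(x, y, fbw):
    """Byte address and nibble select for pixel (x, y), linear PSMT4."""
    byte_addr = (y * fbw * 32) + (x >> 1)   # formula quoted in the text
    nibble_hi = x & 1                        # assumption: odd x = high nibble
    return byte_addr, nibble_hi


def same_byte_pairs(width, height, fbw):
    """Count consecutive emits (x-major order) that target the same byte."""
    order = [linear_psmt4_addr(x, y, fbw)[0]
             for y in range(height) for x in range(width)]
    return sum(1 for a, b in zip(order, order[1:]) if a == b)


pairs_16x4 = same_byte_pairs(16, 4, fbw=1)
```

Under these assumptions a 16×4 sprite produces 32 back-to-back same-byte pairs out of 63 consecutive emit transitions, so roughly every other emit exercises the forward path.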
Non-T4 TB cleanup: tb_gs_raster_bram_psmct16 and
tb_gs_raster_bram_psmt8 still mirror the non-T4 portion of
the wrapper-site plumbing, but they no longer carry the Ch156
PSMT4 hard-gate (now removed in the wrapper). Both wire
raster_pixel_emit straight to write_en and let
vram_norm drive addr/data/be — focused TBs verifying their
own PSM lane. Full pipe coverage lives in tb_gs_raster_bram_psmt4
and the top wrapper TB.
Sim regression: 141 PASS / 0 FAIL after Ch157 (141 after the Ch156
audit fixes + new tb_gs_raster_bram_psmt4 − retired tb_gs_raster_bram_psmt4_gate).
PCRTC sync-read alignment (Ch158)
Ch158 closes the last big blocker before swapping the board top
to vram_bram_stub: the PCRTC's data-decode + sync-output
pipeline is now aware that vram_bram_stub's read_data is
registered with 1-cycle latency, so the captured scanout no
longer trails the address stage by one column.
gs_pcrtc_stub change (in
rtl/gif_gs/gs_pcrtc_stub.sv):
new module parameter VRAM_SYNC_READ (default 0). When set to 1,
every hcnt/vcnt-derived signal that the data-decode comb consumes
is run through a 1-cycle register before the consumer sees it
(active_h_dec, active_v_dec, in_hsync_dec, in_vsync_dec,
in_display_window_dec, scanout_enable_dec, dispfb_psm_*_dec,
psm4_nibble_select_dec, end_of_frame_dec). The address-side
(vram_read_addr) keeps using the current (hcnt, vcnt) so the
read is issued one pixel "ahead"; the registered vram_read_data
arrives one cycle later, paired with the matching delayed counter
view. Outputs r/g/b/hsync/vsync/de come from the _dec signals,
so the entire output stream shifts right by exactly one clock
when VRAM_SYNC_READ=1. Default VRAM_SYNC_READ=0 is a pure
passthrough — every existing PCRTC TB written against the legacy
vram_stub (comb-read) shape is unaffected.
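
The alignment idea can be sketched in Python as a one-line scan with a registered read. The model is a simplification: one pixel per address, a single hcnt counter, and illustrative names; it only shows why the decode stage must use the delayed counter.

```python
def scan_line(mem, sync_aligned):
    """Return {column: pixel} as seen by the decode stage over one line."""
    read_data_q = None    # registered BRAM output
    hcnt_dec = None       # 1-cycle-delayed copy of hcnt
    seen = {}
    for hcnt in range(len(mem) + 1):          # one extra cycle drains the read
        # Decode stage: consume the data registered at the previous edge.
        if read_data_q is not None:
            column = hcnt_dec if sync_aligned else hcnt
            seen[column] = read_data_q
        # Clock edge: issue the read for the current hcnt, delay the counter.
        read_data_q = mem[hcnt] if hcnt < len(mem) else None
        hcnt_dec = hcnt
    return seen


line = [10, 11, 12, 13]
aligned = scan_line(line, sync_aligned=True)
shifted = scan_line(line, sync_aligned=False)
```

Pairing the data with the delayed counter puts pixel 10 at column 0; pairing it with the live counter puts it at column 1, which is the one-column trail Ch158 removes.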
top_psmct32_raster_demo_bram change: instantiates
gs_pcrtc_stub with .VRAM_SYNC_READ(1'b1). The wrapper banner
updates to drop the Ch155 caveat about scanout being 1 column
shifted — that caveat is now resolved.
tb_top_psmct32_raster_demo_bram extension: adds a Phase 2
frame-capture block that arms on the next vsync rising edge
after raster drain, captures one full frame's r/g/b into
cap_*[v][h] indexed by a 1-cycle-delayed copy of PCRTC's
address-stage counters (since the registered de aligns with
those delayed counters), and asserts each captured pixel's
post-decode r/g/b matches the expected ABGR for its quadrant.
Phase 1 (per-pixel VRAM probe via hierarchical mem[byte_addr >> 2])
is unchanged. PASS — 16×8 active region, all 128 pixels
captured + all 128 VRAM words probe-verified, frame_seen
latched.
Open Ch159+ items:
- xfer-side T4 coverage TB: the Ch157 wrapper handles xfer-side T4 emits identically (the mux feeds vram_psm_pre from bitbltbuf_q[61:56] during xfer_busy), but no focused TB exercises that path yet.
- Swap the Ch146 board top to instantiate vram_bram_stub and the Ch158 PCRTC-sync mode directly (or retire vram_stub outright). All four writer PSMs and PCRTC scanout are now proven correct against the BRAM-friendly contract; the remaining work is the integration commit on the board side.
- Audit useg_shadow_mem for the same BRAM-shape forensics that Ch153 ran on vram_stub (Ch64/Ch65/Ch70 mirror writes may make it multi-port-write-shaped).

Ch158 audit Medium fix — sub-word PSM lane selection: the
initial Ch158 cut shifted the data-decode pipeline by 1 cycle
to align with vram_bram_stub's registered output, but it
still extracted CT16 / PSMT8 / PSMT4 sub-word values from the
LOW lane of vram_read_data (i.e. [15:0] halfword and
[7:0] byte). That worked for vram_stub (byte-addressable;
the read returns 4 bytes starting at byte_addr so the
sub-word always lands at the low lane) but NOT for
vram_bram_stub (word-addressable; read_data is
mem[byte_addr >> 2] so the sub-word lives at lane
byte_addr[1:0] of the returned word). Codex Ch158 audit
called this out as a blocker for any sub-word PSM scanout
through the BRAM. The fix adds:
- vram_addr_lane_q: 1-cycle-delayed copy of vram_read_addr[1:0], paralleling the other _q decode-stage registers added in the original Ch158 cut.
- data_lane = VRAM_SYNC_READ ? vram_addr_lane_dec : 2'd0, which forces the legacy comb-read path to keep using the low lane (preserving every existing PCRTC TB's expectation) and resolves to the correct byte_addr-keyed lane in sync mode.
- psm16_pixel = data_lane[1] ? read_data[31:16] : read_data[15:0].
- A vram_byte_lane mux extracting one of 4 byte lanes for PSMT8 (psm8_idx) and PSMT4 (psm4_byte_lane → nibble splice).

Two new focused integration TBs prove the fix end-to-end with adversarial pre-loads:
| TB | Coverage |
|---|---|
| tb_gs_scanout_bram_psmct16 | 4-pixel CT16 scanout reading mem[0]/mem[1] with FOUR distinct halfwords across both halfword lanes (byte_addr[1]∈{0,1}); each pixel's captured 5→8-decoded RGB matches the expected halfword. PASS |
| tb_gs_scanout_bram_psmt8 | 4-pixel PSMT8 scanout reading mem[0] with FOUR distinct byte indices, one per byte lane (byte_addr[1:0] ∈ {0,1,2,3}); each pixel's grayscale RGB matches the expected byte. PASS |
Without the fix, both TBs would have failed: the CT16 TB would
emit the same pair of pixels twice (low halfword of each word),
and the PSMT8 TB would emit IDX_0 for all four pixels.
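
The lane selection can be sketched in Python as a pure extraction function over the 32-bit word the BRAM returns. The function name, the string PSM tags, and the lane_aware switch are illustrative; lane_aware=False reproduces the pre-fix behavior of always reading the low lane.

```python
def extract_subword(read_data, byte_addr, psm, lane_aware=True, nibble_hi=0):
    """Pick the sub-word pixel out of a word-addressed read."""
    lane = (byte_addr & 0x3) if lane_aware else 0
    if psm == "CT16":
        # Halfword lane is keyed on byte_addr[1].
        return (read_data >> 16) & 0xFFFF if (lane & 0x2) else read_data & 0xFFFF
    byte = (read_data >> (8 * lane)) & 0xFF   # byte lane keyed on byte_addr[1:0]
    if psm == "T8":
        return byte
    if psm == "T4":
        return (byte >> 4) & 0xF if nibble_hi else byte & 0xF
    raise ValueError("unsupported PSM tag")


word = 0x44332211
fixed_t8 = [extract_subword(word, a, "T8") for a in range(4)]
prefix_t8 = [extract_subword(word, a, "T8", lane_aware=False) for a in range(4)]
```

Here fixed_t8 walks all four bytes while prefix_t8 repeats byte 0 four times, the same symptom the PSMT8 scanout TB is built to catch.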
Sim regression: 143 PASS / 0 FAIL after Ch158 audit fixes (141 + 2 new BRAM scanout TBs).
Board-top swap to BRAM wrapper + Quartus fit recovery (Ch159)
Ch159 commits the integration step that the prior chapters
were building toward: the DE25-Nano board top
(rtl/top/de25_nano_psmct32_raster_demo_top.sv)
now instantiates top_psmct32_raster_demo_bram
instead of the legacy top_psmct32_raster_demo.
External port shape is identical so this is drop-in at the
board level; the BRAM-backed wrapper carries through every
Ch155-Ch158 fix (writer-side normalize + PSMT4 RMW pipe +
PCRTC sync-read alignment + sub-word lane select). The synth
file list (synth/de25_nano/top_psmct32_raster_demo/files.f)
and Quartus QSF gain vram_normalize_pkg.sv, vram_bram_stub.sv,
and top_psmct32_raster_demo_bram.sv; the legacy vram_stub + legacy top
stay on the project for back-compat with sim TBs that still target them.
Quartus fit recovery — vs Ch152 baseline: the headline of
this chapter. Ch152 fit FAILED at 155k ALMs needed (331% of the 46,800 available)
because vram_stub's 8 KiB byte-addressable + per-bit-RMW
storage didn't infer as M20K and landed as a 65,536-flip-flop
array, dragging 121k registers and 199k synthesis ALMs along
with it. Ch159 swap turns those numbers around:
| Metric | Ch152 (vram_stub) | Ch159 (vram_bram_stub) | Δ |
|---|---|---|---|
| Synthesis status | Successful | Successful | — |
| Synthesis ALMs estimate | 199,103 / 46,800 (425% of capacity) | 22,704 / 46,800 (49%) | −176,399 (−88.6%) |
| Synthesis registers | 101,457 | 36,008 | −65,449 (−64.5%) |
| Fit status | FAILED (155k / 331% of capacity) | Successful (30,364 / 65%) | ✅ fits |
| Fit registers | 121,176 | 39,085 | −82,091 (−67.7%) |
| Fit RAM blocks | 6 / 358 | 14 / 358 | +8 (BRAM-inferred VRAM) |
| Fit block memory bits | 65,536 | 196,608 | +131,072 (data in M20K) |
| Fit DSP blocks | 20 | 18 | −2 |
| STA status | DID NOT RUN (fit failed) | Successful (12 warnings) | ✅ STA reachable |
| STA setup slack worst (CLOCK2_50) | n/a | −6.950 ns | timing miss at 50 MHz |
| Fmax | n/a | 37.11 MHz | (Ch160+ tunes) |
The eight new RAM blocks are the same vram_bram_stub
footprint exp_c proved in Ch154 (8 RAM blocks for the dual-port
+ admission-gated 8 KiB shape; the 6 already in the Ch152
baseline came from bios_rom_stub + ee_ram_stub + useg_shadow_mem
correctly inferring as BRAM there). The register drop (121k → 39k) is essentially the entire VRAM flip-flop array vanishing.
Setup-slack reality check: STA reports −6.950 ns slack at the CLOCK2_50 50 MHz constraint (Fmax = 37.11 MHz). The critical path is somewhere in the Ch123 dep tree's longer combinational chains (likely the Gouraud divider or one of the swizzle muxers). That is NOT a Ch159 regression — it's a brand-new visibility unlocked by being able to run STA at all. Ch160+ owns timing closure (PLL down-clock to ≤30 MHz, critical-path pipelining, or both).
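
The slack-to-Fmax relation behind those two numbers is simple arithmetic, sketched here in Python. It uses the usual single-clock approximation (minimum period = constraint period minus worst setup slack); the function names are illustrative.

```python
def min_period_ns(period_ns, setup_slack_ns):
    """Shortest period the worst path could meet at this placement."""
    return period_ns - setup_slack_ns


def fmax_mhz(period_ns, setup_slack_ns):
    """Fmax implied by the worst setup slack at a given clock constraint."""
    return 1000.0 / min_period_ns(period_ns, setup_slack_ns)


# Ch159: 20.000 ns constraint (50 MHz) with -6.950 ns worst setup slack.
ch159_fmax = round(fmax_mhz(20.000, -6.950), 2)
```

The result is 37.11 MHz, matching the STA report: the worst path needs about 26.95 ns, which is why a 50 MHz constraint misses.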
Snapshots preserved: the Ch152 baseline reports are saved
under
synth/de25_nano/top_psmct32_raster_demo/baseline_ch152/
(syn / fit summaries + flow.rpt + parse_report.txt) so future
chapters can diff against them without re-running the failing
Ch152 baseline.
Sim regression: 143 PASS / 0 FAIL unchanged. The Ch149 board-wrapper TB exercises the same external behavior with the new core wrapper inside.
Down-clock target + first .sof bitstream (Ch160)
Ch160 closes the loop Codex framed at the end of Ch159 — "first add a down-clock PLL profile so we can get a real bitstream moving on hardware, then use the successful STA path report to decide whether to pipeline toward 50 MHz." The chapter is SDC- and build-flow-only; no RTL changes.
SDC retarget (synth/de25_nano/top_psmct32_raster_demo/de25_nano_psmct32_raster_demo_top.sdc)
relaxes the CLOCK2_50 period from 20.000 ns (50 MHz) to
33.333 ns (30 MHz). The DE25-Nano's CLOCK2_50 oscillator is
physically still 50 MHz; the SDC tells Quartus to ASSUME a
30 MHz input so the fitter closes timing at the down-clock
target. A real PLL .ip that divides 50 → 30 MHz on hardware
is the Ch161+ commit (the QSF's commented-out QIP_FILE
swap-point is staged for it). Until then, the .sof produced
under this constraint is structurally clean for 30 MHz
operation; programming it on a board where CLOCK2_50 is still
wired straight through gives an effective 50 MHz chip clock
that may show setup-violating behavior — Ch161 closes that
gap.
build_quartus.sh adds quartus_asm (synth/de25_nano/top_psmct32_raster_demo/build_quartus.sh)
gated on a clean STA, so a .sof bitstream is now produced
when the design fits and timing closes. The Make scaffold
check is loosened to accept either the 20.000 ns (legacy 50 MHz) or
33.333 ns (Ch160 30 MHz down-clock) period.
Quartus result vs Ch159:
| Metric | Ch159 (50 MHz target) | Ch160 (30 MHz target) |
|---|---|---|
| Synth ALMs estimate | 22,704 / 46,800 (49 %) | 22,704 / 46,800 (49 %) |
| Synth registers | 36,008 | 36,008 |
| Fit status | Successful | Successful |
| Fit ALMs | 30,364 / 46,800 (65 %) | 31,056 / 46,800 (66 %) |
| Fit registers | 39,085 | 37,381 |
| Fit RAM blocks | 14 / 358 | 14 / 358 |
| STA setup slack worst | −6.950 ns (timing miss) | +0.805 ns (closes) |
| Fmax (CLOCK2_50) | 37.11 MHz | 30.74 MHz |
| quartus_asm | (skipped) | Successful (.sof produced) |
The synth-side numbers are identical because no RTL changed —
the differences are entirely in the fitter's placement choices
under the looser timing constraint. Fmax dropped slightly
(37.11 → 30.74 MHz) because Quartus optimizes harder when the
target is tighter; the headline is that at the 30 MHz target
the design CLOSES (positive slack on every report) and a
real .sof is now generated.
Critical path (from
output_files/de25_nano_psmct32_raster_demo_top.sta.rpt,
worst-10 paths all in the same module hierarchy):
| Field | Value |
|---|---|
| Slack | +0.805 ns (worst path of 10 with this slack value) |
| From / To | `u_demo |
| Data Delay | 32.516 ns (out of 33.333 ns period) |
| Critical net | The EE core's auto-generated 64-bit signed divider (the Ch152-noted Gouraud TRI divider — dead code in the PSMCT32 raster demo because no RM_TRI primitive is dispatched). |
Ch161+ pipelining handoff: the path Codex's framing asked us to surface is now visible. Two options:
- Pipeline the divider: re-implement ee_core's 64-bit division as an N-cycle multi-cycle path. Quartus's auto-generated divider is a single-cycle ripple chain; making it 2-3 stage pipelined would put Fmax comfortably above 50 MHz.
- Strip it from the build: gate the Gouraud TRI divider behind a STRIP_GOURAUD_TRI parameter (default off), so the PSMCT32 raster demo's hardware build instances the EE core without it. Quartus removes the entire div_0_rtl_0 block; Fmax should jump dramatically.

Option 2 is the lower-blast-radius hardware bring-up move (removes ~32 ns of dead-code combinational chain); option 1 is the long-term correct fix once the Gouraud TRI path goes load-bearing.
Snapshots: Ch159 baseline reports preserved under
baseline_ch159/
(syn / fit / sta summaries + parse_report).
Sim regression: 143 PASS / 0 FAIL unchanged (no RTL changes). Scaffold check + Ch149 board TB + top BRAM TB all green under the new SDC.
Real PLL IP commit — .sof actually runs at 30 MHz (Ch161)
Ch161 retires the Ch160 hardware-honesty caveat by committing a
real Quartus IOPLL .ip configured for 50 MHz refclk → 30 MHz
outclk_0. The wrapper's `ifdef USE_PLL_IP (staged in Ch151) now flips to the IP-generated pll module on Quartus builds; sim TBs continue to use the pass-through de25_nano_pll_stub.
Files committed under
synth/de25_nano/top_psmct32_raster_demo/ip/:
- pll.ip: adapted from retroDE_nes/ip/audio_pll.ip (single-output Agilex 5 IOPLL template), retargeted to 50 MHz refclk → 30 MHz outclk_0.
- pll/pll.qip + pll/synth/pll.v + pll/pll_bb.v: Quartus IP-generated artifacts (quartus_ipgenerate de25_nano_psmct32_raster_demo_top --ip_file=ip/pll.ip --generate_ip_file --synthesis=verilog). The generated pll module exposes (refclk, rst, outclk_0, locked), exactly the Ch151 stub's signature, so the `ifdef swap is drop-in.

Wiring changes:
- de25_nano_psmct32_raster_demo_top.qsf uncommented the set_global_assignment -name QIP_FILE ip/pll/pll.qip swap-point and added set_global_assignment -name VERILOG_MACRO "USE_PLL_IP=1" so Quartus instantiates the IP pll instead of the de25_nano_pll_stub.
- de25_nano_psmct32_raster_demo_top.sdc reverted the Ch160 CLOCK2_50 period back to 20.000 ns (the physical 50 MHz oscillator). The IOPLL's auto-generated SDC inside the .qip declares the post-PLL outclk_0 clock at 30 MHz, so STA picks up two domains: u_pll|iopll_0_refclk (50 MHz, the pin) and u_pll|iopll_0_outclk0 (30 MHz, the design clock).
- build_quartus.sh symlinks the ip/ dir alongside the existing rtl/ and sim/ symlinks so the QIP_FILE's ip/pll/pll.qip path resolves from the work dir.

Quartus result vs Ch160:
| Metric | Ch160 (SDC profile only) | Ch161 (real PLL IP) |
|---|---|---|
| Fit ALMs | 31,056 / 46,800 (66 %) | 30,898 / 46,800 (66 %) |
| Fit registers | 37,381 | 37,352 |
| Fit PLLs | 0 / 11 | 1 / 11 (real IOPLL) |
| RAM blocks | 14 / 358 | 14 / 358 |
| Setup slack worst (design_clk) | +0.805 ns @ CLOCK2_50 | **+0.565 ns @ u_pll |
| Fmax (design_clk) | 30.74 MHz | 30.74 MHz |
| quartus_asm | Successful | Successful (.sof produced) |
The +1 PLL block is the real IOPLL on the chip; ALMs go down
slightly because the stub's clock-distribution path no longer
needs ALM glue. STA now reports BOTH clock domains: the refclk
(50 MHz, +19.249 ns slack — trivially fast) and the design_clk
(30 MHz post-PLL, +0.565 ns slack — comfortable margin). The
.sof produced under this configuration genuinely runs at
30 MHz on the DE25-Nano: the IOPLL takes the 50 MHz CLOCK2_50
input and divides to 30 MHz inside the chip, so the entire
design downstream of u_pll.outclk_0 operates at the
constrained frequency. (Setup slack landed at +0.914 ns on the
initial Ch161 build; the Ch161 audit's wider reset false-path
nudged the fitter into a slightly different placement, dropping
the worst-case setup slack to +0.565 ns. Recovery analysis on
the rst_sync stages — which had been hiding a real -0.079 ns
violation under the original *rst_sync[0] constraint — is now
gone from the .sta.summary entirely after the false-path was
widened to *rst_sync[*].)
Snapshots: Ch160 baseline (parse_report + summaries +
.sof) preserved under
baseline_ch160/.
Open Ch162+ items (Ch161 forward-ref, superseded by Ch162 below):
- Pipeline or strip the EE-core 64-bit Gouraud TRI divider: closed in Ch162 via STRIP_HW_DIVIDER (note: the actual divider is the Ch43 DIVU divider, not Gouraud TRI; the forward-ref's name was loose). The Ch162 strip retired the u_demo|u_core|div_0_rtl_0|... STA worst path entirely; see the Ch162 section below for the new critical path.
- xfer-side T4 coverage TB (open from Ch157+).
- useg_shadow_mem BRAM-shape forensics.
- Video PHY shim (HDMI / VGA / PMOD): VIDEO_* pins virtualized.

Sim regression: 143 PASS / 0 FAIL unchanged. Sim ignores
the `ifdef USE_PLL_IP branch (no +define+USE_PLL_IP in the
iverilog Makefile) so the stub stays active under sim.
Strip the EE-core hardware divider (Ch162)
Ch162 takes the lower-blast move from the Ch161 STA handoff: add a parameter that gates the EE-core's auto-inferred 32-bit hardware divider out of synthesis on the PSMCT32 SPRITE-only hardware build, then re-measure Fmax.
RTL change (rtl/ee/ee_core_stub.sv)
gains parameter bit STRIP_HW_DIVIDER = 1'b0. Two / and %
sites tied to the Ch43 DIVU instruction are gated by this
parameter: the writeback (lines ~932-935) and the retire-trace
arg3 mirror (lines ~1005-1014). Default 0 keeps
DIVU semantics intact for every existing sim TB
(tb_ee_core_divu_mflo is the only consumer; it stays at the
default). When the parameter is 1, the writeback becomes a
no-op (HI/LO unchanged, identical to the divisor==0 case the
spec calls undefined) and the retire-trace arg3 reports 0.
Quartus then has nothing to infer — the div_0_rtl_0 block
disappears.
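
The gated writeback can be sketched in Python as follows. It is a 32-bit view of the behavior described above (HI = remainder, LO = quotient, per the MIPS DIVU definition); the function name and argument order are illustrative and the model ignores any sign-extension the real core applies.

```python
def divu_writeback(hi, lo, rs, rt, strip_hw_divider=False):
    """Return the (HI, LO) pair after a DIVU retires."""
    rs &= 0xFFFFFFFF
    rt &= 0xFFFFFFFF
    if strip_hw_divider or rt == 0:
        return hi, lo                  # writeback is a no-op, HI/LO unchanged
    return rs % rt, rs // rt           # HI = remainder, LO = quotient


live = divu_writeback(0x1111, 0x2222, 17, 5)
stripped = divu_writeback(0x1111, 0x2222, 17, 5, strip_hw_divider=True)
```

With the default setting 17 / 5 writes HI=2, LO=3; with the strip enabled HI/LO keep their previous values, exactly like the divisor==0 case.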
Wrapper plumbing:
top_psmct32_raster_demo_bram
gains a matching STRIP_HW_DIVIDER parameter and forwards it
to ee_core_stub. The
DE25-Nano board top
sets .STRIP_HW_DIVIDER(1'b1) on its u_demo instantiation
(the bootlet doesn't execute DIVU, so this is behavior-neutral
for the demo). Sim TBs that instantiate the BRAM wrapper
directly use the default 0.
Quartus result vs Ch161 (real-PLL baseline):
| Metric | Ch161 (real PLL) | Ch162 (real PLL + strip) |
|---|---|---|
| Fit ALMs | 30,898 / 46,800 (66 %) | 30,006 / 46,800 (64 %) |
| Fit registers | 37,352 | 36,618 |
| Fit PLLs | 1 | 1 |
| RAM blocks | 14 | 14 |
| Setup slack worst (design) | +0.565 ns | +3.567 ns |
| Fmax (design domain) | 30.74 MHz | 33.6 MHz (+9.4 %) |
| quartus_asm | Successful | Successful (.sof produced) |
Stripping the divider freed 892 ALMs / 734 registers and yielded ~3 ns of new setup margin. Fmax climbs from 30.74 MHz to 33.6 MHz, a real jump but not enough to clear the 50 MHz target (which needs roughly +49 % on top of 33.6 MHz, or +63 % over the Ch161 figure). Codex's Ch162 framing predicted this branch: "if Fmax jumps, we have a clean path to a 50 MHz demo bitstream; if not, the next real critical path will reveal itself." We landed in the second branch: Fmax jumped, but not far enough.
New critical path (the Ch163+ handoff, from
output_files/de25_nano_psmct32_raster_demo_top.sta.rpt):
| Field | Value |
|---|---|
| Slack | +3.567 ns |
| From | `u_demo |
| To | `u_demo |
| Data delay | 38.443 ns of arrival vs 42.010 ns required (period 33.333 ns + clock skew + uncertainty) |
The PCRTC divider comes from
gs_pcrtc_stub.sv lines:
assign vram_x_unshift = {20'd0, hwin_rel} / hmag_factor;
assign vram_y_unshift = {20'd0, vwin_rel} / vmag_factor;
where hmag_factor = MAGH + 1 and vmag_factor = MAGV + 1.
For the demo MAGH = MAGV = 0, so the divisor is constant 1
— but Quartus doesn't constant-propagate through this
formulation and synthesizes a real 32-bit divider anyway. The
parallel Ch162 fix shape would be a STRIP_PCRTC_MAG_DIV
parameter (or a more general "demo doesn't use magnification"
hint that bypasses the divider when both MAGH and MAGV are
constant 0).
Snapshots: Ch161 baseline preserved under
baseline_ch161/
(syn / fit / sta summaries + parse_report + .sof) for diff.
Open Ch163+ items:
- Strip the PCRTC magnification divider on hardware builds (next critical path; same shape as Ch162's STRIP_HW_DIVIDER).
- Once Fmax climbs north of 50 MHz, retune the IOPLL .ip to outclk_0 = 50 MHz, retarget the SDC, and ship a 50 MHz bitstream.
- xfer-side T4 coverage TB (still open from Ch157+).
- useg_shadow_mem BRAM-shape forensics.
- Video PHY shim (HDMI / VGA / PMOD): VIDEO_* pins virtualized.

Sim regression: 143 PASS / 0 FAIL unchanged. Default
STRIP_HW_DIVIDER=0 preserves DIVU semantics for
tb_ee_core_divu_mflo; the board top's STRIP_HW_DIVIDER=1
goes through tb_de25_nano_psmct32_raster_demo_top cleanly
because the Ch149 board TB doesn't exercise DIVU.
Strip PCRTC magnification divider + 50 MHz close (Ch163)
Ch163 takes the next critical-path attack from the Ch162 STA report (the PCRTC magnification divider) and uses the resulting Fmax headroom to retune the PLL IP to 50 MHz output — closing the journey that started at the Ch152 fit failure with a real 50 MHz bitstream.
RTL change (rtl/gif_gs/gs_pcrtc_stub.sv)
gains parameter bit STRIP_PCRTC_MAG_DIV = 1'b0. The two /
operators are gated:
assign vram_x_unshift = STRIP_PCRTC_MAG_DIV
? {20'd0, hwin_rel}
: ({20'd0, hwin_rel} / hmag_factor);
assign vram_y_unshift = STRIP_PCRTC_MAG_DIV
? {20'd0, vwin_rel}
: ({20'd0, vwin_rel} / vmag_factor);
Default 0 keeps the live divider math for every Ch93-era
magnification scanout TB (tb_gs_scanout_magh_magv etc.). When
1, the math collapses to a passthrough — equivalent to the
MAGH=MAGV=0 case the demo always hits but with no inferred
divider for Quartus to synthesize.
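
The equivalence claim can be checked with a small Python model of the gated expression. The 12-bit range for the window-relative counter is inferred from the {20'd0, ...} zero-extension to 32 bits, and the function name is illustrative.

```python
def vram_unshift(win_rel, mag, strip_pcrtc_mag_div=False):
    """Model of vram_x_unshift / vram_y_unshift with the strip parameter."""
    if strip_pcrtc_mag_div:
        return win_rel                 # passthrough, no divider inferred
    return win_rel // (mag + 1)        # live math: divide by MAGH+1 or MAGV+1


# With MAGH = MAGV = 0 the divisor is 1, so both settings agree everywhere.
equivalent_at_mag0 = all(
    vram_unshift(v, 0) == vram_unshift(v, 0, strip_pcrtc_mag_div=True)
    for v in range(1 << 12)
)
# With any non-zero magnification the stripped build would diverge.
diverges_at_mag1 = vram_unshift(7, 1) != vram_unshift(7, 1, strip_pcrtc_mag_div=True)
```

Both flags evaluate to True, which is why the strip is safe only on builds that lock MAGH=MAGV=0 and why the parameter defaults to 0 for the magnification TBs.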
Wrapper plumbing:
top_psmct32_raster_demo_bram
gains a matching STRIP_PCRTC_MAG_DIV parameter that forwards
to gs_pcrtc_stub. The
DE25-Nano board top
sets .STRIP_PCRTC_MAG_DIV(1'b1) on its u_demo instantiation.
Quartus result, two stages:
Stage A — strip @ 30 MHz target (still on the Ch161 PLL .ip):
| Metric | Ch162 (strip EE divider only) | Ch163 (strip both, 30 MHz) |
|---|---|---|
| Fit ALMs | 30,006 / 46,800 (64 %) | 27,216 / 46,800 (58 %) |
| Setup slack worst | +3.567 ns | +21.113 ns |
| Fmax (design) | 33.6 MHz | 81.83 MHz (+143 %) |
The Ch163 strip alone freed +17.5 ns of margin and 2,790 ALMs — large enough to clear 50 MHz outright. Codex's Ch162 framing predicted both branches of the if-Fmax-jumps fork; Ch163 lands in the first branch ("clean path to a 50 MHz demo bitstream").
Stage B — retune PLL .ip from 30 MHz → 50 MHz output:
The pll.ip source's gui_output_clock_frequency0 and
gui_output_clock_frequency_ps0 are bumped (30.0 → 50.0 MHz;
33333.333 → 20000.0 ps). quartus_ipgenerate rebuilds the
.qip / synth files in-place. No SDC change needed: CLOCK2_50
stays pinned at the physical 50 MHz period; the IOPLL's auto-generated
SDC declares the new outclk_0 frequency.
| Metric | Ch163 strip @ 30 MHz target | Ch163 strip @ 50 MHz target |
|---|---|---|
| Fit ALMs | 27,216 / 46,800 (58 %) | 27,543 / 46,800 (59 %) |
| RAM blocks / PLLs | 14 / 1 | 14 / 1 |
| Setup slack worst | +21.113 ns | +7.500 ns |
| Fmax (design) | 81.83 MHz | 80.0 MHz |
| .sof produced | yes (30 MHz run on hw) | yes, 50 MHz on hw |
The .sof produced under Stage B genuinely runs at 50 MHz on the DE25-Nano — the IOPLL takes 50 MHz CLOCK2_50 in and emits 50 MHz outclk_0 (effectively a 1:1 relation through the real PLL hardware so the chip's clock distribution still goes through the IOPLL's clock network). All 8 timing classes positive; no recovery violations; build gate Successful.
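
The Fmax figures in both tables follow from the target period and the worst setup slack. The sketch below reproduces that arithmetic under a simplifying assumption: the worst setup path launches and captures on the same design clock with no multicycle exceptions, so the critical-path delay is simply the period minus the slack. The helper names are hypothetical.

```python
def period_ns(freq_mhz):
    """Clock period in nanoseconds for a frequency in MHz."""
    return 1000.0 / freq_mhz


def fmax_mhz(target_mhz, setup_slack_ns):
    """Estimate Fmax from the target clock and the worst setup slack."""
    critical_path_ns = period_ns(target_mhz) - setup_slack_ns
    return 1000.0 / critical_path_ns


runs = [
    ("Ch162, 30 MHz target", 30.0, 3.567),
    ("Ch163 Stage A, 30 MHz target", 30.0, 21.113),
    ("Ch163 Stage B, 50 MHz target", 50.0, 7.500),
]
for label, target, slack in runs:
    print(f"{label}: ~{fmax_mhz(target, slack):.1f} MHz")

# The PLL .ip period fields are the same conversion expressed in picoseconds.
print(f"30 MHz = {period_ns(30.0) * 1000:.3f} ps, 50 MHz = {period_ns(50.0) * 1000:.1f} ps")
```

Under that assumption the estimate lands on the 33.6, 81.83 and 80.0 MHz values quoted in the tables, which suggests the reported Fmax is slack-derived rather than limited by some other constraint.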
Snapshots:
- baseline_ch162/: Ch162 30 MHz state with EE divider stripped only.
- baseline_ch163_30mhz/: Ch163 strip-both at 30 MHz target (Stage A milestone).
Open Ch164+ items (the project has hit the major hardware milestone Codex called out at Ch157+; Ch164+ is post-launch):
- xfer-side T4 coverage TB (open from Ch157+).
- useg_shadow_mem BRAM-shape forensics.
- Video PHY shim (HDMI / VGA / PMOD): VIDEO_* pins still virtualized; this is the next big front-end deliverable before the demo can paint a real screen.
Sim regression: 143 PASS / 0 FAIL unchanged. Default-off
on STRIP_PCRTC_MAG_DIV preserves every Ch93 magnification
scanout TB; the board top's STRIP_PCRTC_MAG_DIV=1 propagates
cleanly through tb_de25_nano_psmct32_raster_demo_top since
the demo locks MAGH=MAGV=0.
HDMI pin shim — pixels off-chip (Ch164)
Ch164 is the first video-PHY chapter — Codex's framing was "small
PHY shim chapter, not a full display-stack leap. Get pixels off-
chip before making them pretty." Replace the abstract
VIDEO_R/G/B/HSYNC/VSYNC/DE virtual pins with real DE25-Nano
HDMI transmitter signals; the ADV7513 chip itself stays asleep
(its I²C wake-up FSM is the Ch165+ chapter), so the bitstream
makes the FPGA pins toggle correctly but a real monitor stays
dark until Ch165 lands.
Wrapper change (rtl/top/de25_nano_psmct32_raster_demo_top.sv):
five new top-level outputs added — HDMI_TX_CLK (= design_clk,
the 50 MHz pixel clock), HDMI_TX_D[23:0] packing
{VIDEO_R, VIDEO_G, VIDEO_B} (R in MSBs, ADV7513 default 24-bit
RGB), and HDMI_TX_HS / HDMI_TX_VS / HDMI_TX_DE mirroring the
abstract VIDEO_* signals. The VIDEO_* ports are kept on the
wrapper as VIRTUAL_PIN ON (the Ch149 board TB references them
via hierarchical probe).
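
The HDMI_TX_D packing can be written down as a short model. This is a sketch only: `pack_hdmi_tx_d` is a hypothetical name, and it assumes each of VIDEO_R, VIDEO_G and VIDEO_B is 8 bits wide (24 bits split three ways).

```python
def pack_hdmi_tx_d(r, g, b):
    """Model of HDMI_TX_D[23:0] = {VIDEO_R, VIDEO_G, VIDEO_B}, R in the MSBs."""
    for name, value in (("r", r), ("g", g), ("b", b)):
        if not 0 <= value <= 0xFF:
            raise ValueError(f"{name} out of 8-bit range: {value}")
    return (r << 16) | (g << 8) | b


# A pure-red pixel occupies only the top byte of the bus.
assert pack_hdmi_tx_d(0xFF, 0x00, 0x00) == 0xFF0000
```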
QSF change (synth/.../de25_nano_psmct32_raster_demo_top.qsf):
HDMI pinout sourced from
retroDE_nes/retroDE_nes.qsf
for the same DE25-Nano (Terasic Agilex 5) board — HDMI_TX_CLK
on PIN_DJ24 with 1.1-V IO standard (matches the on-board level
shifter), data + sync pins on 3.3-V LVCMOS. The companion
ADV7513 control pins (HDMI_I2C_SCL, HDMI_I2C_SDA,
HDMI_TX_INT, HDMI_MCLK) are intentionally NOT pinned — the
chip stays in standby on power-up and ignores its 24-bit RGB
input until the I²C wake-up FSM lands in Ch165+.
SDC change (synth/.../de25_nano_psmct32_raster_demo_top.sdc):
set_false_path -to for each HDMI output port. Proper
set_output_delay constraints with respect to a generated
HDMI_TX_CLK domain land alongside the Ch165+ wake-up FSM,
when the ADV7513's actual setup/hold window comes out of the
chip's datasheet pass.
Scaffold-check extension (sim/Makefile):
top_psmct32_raster_demo_quartus_scaffold_check now also
verifies HDMI_TX_CLK + HDMI_TX_D[0..23] + HS/VS/DE are
pin-assigned (sentinel set; not exhaustive) — fails the gate
if Quartus would auto-place them on arbitrary package pins.
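
A check of this kind can be sketched as a scan of the QSF text for `set_location_assignment ... -to <port>` lines. The code below is illustrative and is not the Makefile's actual implementation; `unpinned_ports` and `REQUIRED_PORTS` are hypothetical names, and the sentinel set is the one listed in the paragraph above.

```python
import re

# Sentinel set named in the text: clock, 24 data lanes, and three sync pins.
REQUIRED_PORTS = (
    ["HDMI_TX_CLK"]
    + [f"HDMI_TX_D[{i}]" for i in range(24)]
    + ["HDMI_TX_HS", "HDMI_TX_VS", "HDMI_TX_DE"]
)

# Matches uncommented lines such as:
#   set_location_assignment PIN_DJ24 -to HDMI_TX_CLK
_LOCATION_RE = re.compile(
    r"^\s*set_location_assignment\s+(\S+)\s+-to\s+(\S+)", re.MULTILINE
)


def unpinned_ports(qsf_text, required=REQUIRED_PORTS):
    """Return the required ports that have no location assignment."""
    assigned = {m.group(2).strip('"') for m in _LOCATION_RE.finditer(qsf_text)}
    return [port for port in required if port not in assigned]


# A QSF that pins only the clock leaves the other 27 sentinel ports unpinned.
example = "set_location_assignment PIN_DJ24 -to HDMI_TX_CLK\n"
assert len(unpinned_ports(example)) == 27
```

A gate built this way would fail whenever the returned list is non-empty. Lines commented out with `#` are not counted as assignments.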
Quartus result vs Ch163 (50 MHz):
| Metric | Ch163 (50 MHz, no HDMI pins) | Ch164 (50 MHz + HDMI pins) |
|---|---|---|
| Fit ALMs | 27,543 / 46,800 (59 %) | 27,271 / 46,800 (58 %) |
| Fit RAM / PLL blocks | 14 / 1 | 14 / 1 |
| Fit pins | 17 / 351 (5 %) | 45 / 351 (13 %) (+28 HDMI) |
| Setup slack worst (design) | +7.500 ns | +7.536 ns |
| Fmax (design domain) | 80.0 MHz | ~80 MHz (unchanged) |
| quartus_asm | Successful | Successful (.sof produced) |
The +28 pins are exactly the new HDMI shim — 24 RGB lanes, 1
clock, 3 sync (HS / VS / DE). Setup slack stays at ~+7.5 ns
because the new pins are false_path'd — STA doesn't time
anything against them yet. ALMs ticked down slightly as the
fitter rebalanced under the wider pin map.
Snapshot: Ch163 50 MHz baseline preserved at
baseline_ch163_50mhz/
(syn / fit / sta summaries + parse_report + .sof). The
baseline_ch163_30mhz/
30-MHz milestone is also preserved.
Open Ch165+ items:
- ADV7513 I²C wake-up FSM: without this the HDMI port outputs nothing on a real monitor. Ch165 owns the chip bring-up: pin HDMI_I2C_SCL / HDMI_I2C_SDA / HDMI_TX_INT / HDMI_MCLK, drop in an I²C master that walks the canonical ADV7513 register-set (sourced from retroDE_nes's working bring-up).
- Proper set_output_delay constraints once the ADV7513 setup/hold window is documented (replacing Ch164's false_path).
- Make the rendered pattern bigger than Ch123's 16×8 SPRITE so there's something visible to admire on a real screen.
- xfer-side T4 coverage TB (still open from Ch157+).
- useg_shadow_mem BRAM-shape forensics.
Sim regression: 143 PASS / 0 FAIL unchanged — no RTL
changes that touched sim semantics; the new HDMI ports are
combinational mirrors of existing VIDEO_* signals, and
tb_de25_nano_psmct32_raster_demo_top references VIDEO_*
unchanged.
Wake the ADV7513 — first .sof that drives a real HDMI monitor (Ch165)
Ch165 turns "FPGA pins toggling" into "monitor has a fighting chance of showing the tiny frame" — Codex's framing for the chapter. The ADV7513 chip stays in standby on power-up; an I²C master needs to walk a canonical register-write sequence to configure 24-bit RGB input + sync polarity + power-up + HPD override before the chip will accept the FPGA's HDMI_TX_* data and drive the connector.
Modules ported (Terasic DE-series reference design, free use on Terasic hardware per the license that ships with the DE25-Nano System CD; copyright retained):
- rtl/platform/I2C_Controller.v: bit-bang I²C master with 23-step transaction layout (start / slave-addr / sub-addr / data / stop, ~50 µs per SCL period at the derived 20 kHz I²C clock).
- rtl/platform/I2C_HDMI_Config.v: wake-up FSM that walks a 38-entry LUT of ADV7513 register writes (slave 0x72): power-up + HPD override + audio init + AVI InfoFrame for full-range RGB 444 + dither + clock-divide + HDMI mode select. Adapted from the retroDE_splash/rtl/platform/ versions (same DE25-Nano board); LUT customizations (HPD override, AVI InfoFrame for full-range RGB) carry through.
Wrapper changes (rtl/top/de25_nano_psmct32_raster_demo_top.sv):
- Four new top-level ports: inout HDMI_I2C_SCL, inout HDMI_I2C_SDA (open-drain I²C bus), input HDMI_TX_INT (chip's HPD / monitor-sense interrupt, active-low), and output HDMI_MCLK (audio sample-rate reference, driven by CLOCK2_50 since the demo is video-only).
- I2C_HDMI_Config u_hdmi_i2c instantiated. Clocked on CLOCK2_50 (NOT design_clk, so the wake-up runs even before the PLL locks); reset on ~ninit_done (raw async reset; the I²C bus stays held in a clean state until FPGA init completes). Output READY (= hdmi_init_done) goes high after the LUT walk; HDMI_TX_INT going low retriggers the walk so a late hot-plug after FPGA boot still wakes the chip.
- New status LED: LED[3] = ~hdmi_init_done (active-low; lit means the chip is configured). LED[7:4] are re-tied HIGH.
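
The active-low LED encoding is easy to misread, so here is a small decode sketch. The helper names are hypothetical; the bit assignments (LED[3] = ~hdmi_init_done, LED[7:4] tied high, LED[2:0] carrying the three earlier active-low status bits) are taken from this document.

```python
def led_bus(status_2_0, hdmi_init_done):
    """Compose the 8-bit LED bus; status_2_0 holds the existing LED[2:0] bits."""
    led3 = 0 if hdmi_init_done else 1          # LED[3] = ~hdmi_init_done
    return (0xF << 4) | (led3 << 3) | (status_2_0 & 0x7)


def lit_leds(bus, width=8):
    """Active-low LEDs: a 0 bit means the LED is lit."""
    return [i for i in range(width) if not (bus >> i) & 1]


# Before the LUT walk completes, only the three existing status LEDs are lit.
assert led_bus(0b000, hdmi_init_done=False) == 0b11111000
assert lit_leds(0b11111000) == [0, 1, 2]
```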
QSF + files.f + sim Makefile:
QSF
gains pin assignments for the 4 new control pins (sourced from
retroDE_nes: BT1 / BW2 / CF2 / CF1) plus IO standards
(3.3-V LVCMOS for everything). The two new platform Verilog
sources are added to the QSF source list, the synth
files.f,
and the sim Makefile's RTL_SRCS. The
scaffold-check
extends to verify all 4 control pins are pin-assigned + IO
standard'd, alongside the Ch164 24-pin HDMI data set.
SDC change
(de25_nano_psmct32_raster_demo_top.sdc):
set_false_path -to / -from on the new control pins. The I²C
bus runs at ~20 kHz (50 µs per SCL period) and is inherently
async to the design clock; HDMI_MCLK is driven by CLOCK2_50 and
sampled by the chip's audio PLL — both well below any
constraint on the fabric.
Quartus result vs Ch164:
| Metric | Ch164 (HDMI data only) | Ch165 (HDMI data + wake-up) |
|---|---|---|
| Fit ALMs | 27,271 / 46,800 (58 %) | 27,374 / 46,800 (58 %) |
| Fit RAM / PLL blocks | 14 / 1 | 14 / 1 (unchanged) |
| Fit pins | 45 / 351 | 49 / 351 (+4 control) |
| Setup slack worst | +7.536 ns | +7.198 ns |
| quartus_asm | Successful | Successful (.sof produced) |
The +103 ALMs are the I²C controller's bit-bang state machine
and the 38-entry LUT walker. STA stays positive on every
class — the wake-up FSM lives entirely on the I²C-clock domain
(slow), and Recovery analysis on iRST_N async-deassert is
cleanly +17.621 ns of slack.
TB note — tb_de25_nano_psmct32_raster_demo_top (the
Ch149 board smoke) wires up the new HDMI_TX_INT input
(tied high = no interrupt) and leaves the I²C SCL/SDA lines
floating; the wake-up FSM walks the LUT but full completion
takes ~125 ms simulated at the production divider
(controller-clock period ~100 µs × 33 phases per transaction × 38 transactions),
far longer than the existing 5 ms TB runtime. The board TB
doesn't observe hdmi_init_done directly — it pre-dates the
wake-up FSM and only smoke-tests the wrapper. The Ch165 audit
landed tb_hdmi_i2c_wake_smoke (sim/tb/top/), which
overrides CLK_Freq / I2C_Freq to collapse the divider so the
walk runs in microseconds and asserts the LUT walk + READY
rise + HDMI_TX_INT retrigger + open-drain SDA + the Ch166
sticky NACK watchdog. Ch167 added a bus-level byte-sequence
lock: the TB switched its SDA model from pulldown to
pullup + a phase-aware slave-ACK driver (drives strong-LOW
exactly when u_dut.u0.phase is PH_ACK0/1/2, releases
otherwise so the master's data bits are visible on the
wire). A decoder samples SDA on each SCL rising edge
between START and STOP, assembles the three bytes per
transaction into a 24-bit {dev_addr, reg, data} tuple,
and compares against u_dut.mI2C_DATA[23:0] snapshotted
on mI2C_GO rising edges. Asserts: 38 captured == 38
intent, every byte matches, every dev_addr is 8'h72.
The Phase 3 open-drain check also flipped semantics from
"SDA never strong-HIGH" to "SDA never 'x" (the right
violation test for the pullup bus).
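
The decode-and-compare idea can be illustrated with a software model of the bus. This is a sketch, not the testbench's decoder: both function names are hypothetical, the waveform is idealised (one sample per SCL level), and the decoder assumes any SCL rising edge after the 27th sampled bit is the STOP set-up edge.

```python
def encode_transaction(word24):
    """Idealised (scl, sda) samples for one 3-byte write, slave ACKing each byte."""
    samples = [(1, 1), (1, 0)]                 # idle bus, then START
    for shift in (16, 8, 0):
        byte = (word24 >> shift) & 0xFF
        bits = [(byte >> i) & 1 for i in range(7, -1, -1)] + [0]  # 8 data + ACK
        for bit in bits:
            samples += [(0, bit), (1, bit)]    # SDA only changes while SCL is low
    samples += [(0, 0), (1, 0), (1, 1)]        # STOP: SDA rises while SCL is high
    return samples


def decode_transactions(samples):
    """Return one 24-bit {dev_addr, reg, data} word per START..STOP frame."""
    words = []
    bits = None                                # None while the bus is idle
    prev_scl, prev_sda = 1, 1
    for scl, sda in samples:
        if prev_scl and scl and prev_sda and not sda:
            bits = []                          # START: SDA falls while SCL high
        elif prev_scl and scl and not prev_sda and sda and bits is not None:
            if len(bits) < 27:
                raise ValueError("short frame: %d bits" % len(bits))
            # Keep the first 27 samples (3 x (8 data + ACK)) and drop every
            # ninth bit, which is the ACK slot rather than payload.
            word = 0
            for i, bit in enumerate(bits[:27]):
                if i % 9 != 8:
                    word = (word << 1) | bit
            words.append(word)
            bits = None                        # STOP: back to idle
        elif not prev_scl and scl and bits is not None:
            bits.append(sda)                   # sample SDA on SCL rising edge
        prev_scl, prev_sda = scl, sda
    return words


# Hypothetical register/data bytes; only the 0x72 device address is from the text.
intent = [0x720102, 0x72A5FF, 0x723C00]
wire = [sample for word in intent for sample in encode_transaction(word)]
captured = decode_transactions(wire)
assert captured == intent
assert all((word >> 16) == 0x72 for word in captured)
```

The final two assertions mirror the shape of the testbench checks: captured tuples equal the intended ones, and every device address is 0x72.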
Snapshots: Ch164 baseline preserved at
baseline_ch164/;
Ch165 baseline at
baseline_ch165/.
Open Ch168+ items:
- Proper set_output_delay constraints on HDMI_TX_* once the ADV7513 setup/hold window is locked from the bring-up datasheet pass (replaces the Ch164 set_false_path -to).
- Make the rendered pattern bigger than Ch123's 16×8 SPRITE so there's something visible to admire on a real screen.
- xfer-side T4 coverage TB (still open from Ch157+).
- useg_shadow_mem BRAM-shape forensics.
Sim regression: 144 PASS / 0 FAIL.
tb_de25_nano_psmct32_raster_demo_top PASSES with the new
HDMI control ports wired up (HDMI_TX_INT held high in the
TB; LED=0b11111000 shows the existing 3 status LEDs lit
— LED[3] stays unlit because the LUT walk doesn't complete
in 5 ms of sim). tb_hdmi_i2c_wake_smoke PASSES the
accelerated bring-up + Ch166 NACK-watchdog assertions.
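
The reason LED[3] cannot light within the board TB follows from the walk-time estimate in the TB note. A minimal calculation, using only the figures quoted there (the constant names are hypothetical):

```python
CONTROLLER_CLK_PERIOD_US = 100   # "~100 µs" production controller-clock period
PHASES = 33                      # phase count quoted in the TB note
LUT_ENTRIES = 38                 # ADV7513 register writes in the LUT
TB_RUNTIME_MS = 5                # existing board-TB runtime


def lut_walk_ms(period_us=CONTROLLER_CLK_PERIOD_US, phases=PHASES, entries=LUT_ENTRIES):
    """Estimated time for one full LUT walk, in milliseconds."""
    return period_us * phases * entries / 1000.0


walk_ms = lut_walk_ms()
print(f"walk ~{walk_ms:.1f} ms, about {walk_ms / TB_RUNTIME_MS:.0f}x the TB runtime")
assert walk_ms > TB_RUNTIME_MS
```

Shrinking the period by overriding the divider parameters, as the smoke TB does, scales the walk time linearly.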
Hardware-readiness pass for the Ch123 PSMCT32 raster demo (Ch144)
Ch144 is a synthesis/FPGA-readiness audit around the first hardware-demo candidate (Ch123 PSMCT32 raster e2e, marked above). No RTL changes — Ch144 documents what a top-level FPGA wrapper needs to know before attempting a first build.
RTL dependency tree (Ch123-only) — what the demo actually
instantiates. The full RTL_SRCS list compiled by sim contains
~40 modules; Ch123 only reaches these 11, plus the swizzle math
primitive that the three swizzle-aware modules each instantiate
internally:
| Module | Role in Ch123 |
|---|---|
| bios_rom_stub | EE bootlet at 0xBFC0_0000 (~18 instructions) |
| ee_ram_stub | DMAC-side GIF payload (~24 qwords) |
| ee_memory_map_stub | EE-CPU + DMAC + bios + map's GS-priv decode |
| ee_core_stub | MIPS R5900 core running the bootlet |
| ee_gs_priv_bridge_stub | EE 32-bit MMIO → 64-bit GS-priv reg writes |
| dmac_reg_stub | DMAC ch2 NORMAL transfer |
| gif_packed_stub | GIFtag + PACKED A+D parser |
| gs_stub | GS register file + raster (PSMCT32_SWIZZLE=1) |
| gif_image_xfer_stub | TRXDIR/IMAGE engine (PSMCT32_SWIZZLE=1, dormant in Ch123) |
| vram_stub | 8 KiB VRAM (one PSMCT32 page) |
| gs_pcrtc_stub | PCRTC scanout (PSMCT32_SWIZZLE=1) |
| gs_swizzle_psmct32_stub | Pure-comb math, instantiated x3 inside the gates above |
Sim-only constructs audit (full sweep of the 12 modules above):
- bios_rom_stub.sv and ee_ram_stub.sv: $display / $readmemh inside initial begin. Both are synth-safe: Xilinx Vivado and Intel Quartus support $readmemh for BRAM initialization, and $display is silently ignored by all major synthesizers.
- vram_stub.sv L114-117: single $error parameter validator inside initial begin. Synth ignores it; the BYTES parameter must be set to a sane value at instantiation regardless.
- ee_gs_priv_bridge_stub.sv L118: runtime $error on unsupported byte enables, inside always_ff. Synth ignores the $error; the surrounding logic still synthesizes correctly.
- No $finish / $dumpfile / $random / force / release / real-typed signals / hierarchical refs in any module of the Ch123 dep tree. (TBs use hierarchical refs into bios_rom_stub to preload the bootlet; that's a TB-only concern, because on hardware the bootlet image is the BRAM init. Out-of-tree note: boot_install_agent_stub.sv (SIF subsystem, not in the Ch123 dep tree) contains a $fatal runtime validator, but it is never compiled into the Ch123 hardware build.)
Memory sizing:
| Memory | Default | Ch123 sim setting | Ch123 hw recommendation | FPGA fit |
|---|---|---|---|---|
| bios_rom_stub | 4 MiB | 4 KiB | 4 KiB | ≤1 BRAM tile |
| ee_ram_stub | 16 KiB | 4 KiB | 4 KiB | ≤1 BRAM tile |
| vram_stub | 64 KiB | 8 KiB | 8 KiB | ≤2 BRAM tiles (one PSMCT32 page) |
| ee_memory_map_stub.useg_shadow_mem (Ch145) | 4 MiB | 4 MiB | 4 KiB (override USEG_SHADOW_WORDS_PARAM=1024) | ≤1 BRAM tile when overridden |
The 16×8 framebuffer needs only 16×8×4 = 512 bytes; 8 KiB gives
the full first PSMCT32 page (FBP=0). For a more ambitious
hardware demo (multi-page framebuffers, textures), grow
vram_stub.BYTES toward 1 MiB / 4 MiB. Real PS2 has 4 MiB of
VRAM; a first hardware build can stay at 8 KiB.
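
The sizing argument can be checked with a few lines of arithmetic. The sketch below assumes the standard GS PSMCT32 page geometry of 64×32 pixels at 4 bytes per pixel (8 KiB per page) and page-granular framebuffer allocation; the helper names are hypothetical.

```python
import math

BYTES_PER_PIXEL = 4                     # PSMCT32: 32 bits per pixel
PAGE_W, PAGE_H = 64, 32                 # PSMCT32 page geometry in pixels
PAGE_BYTES = PAGE_W * PAGE_H * BYTES_PER_PIXEL


def framebuffer_bytes(width, height):
    """Raw pixel storage for a PSMCT32 framebuffer."""
    return width * height * BYTES_PER_PIXEL


def min_vram_bytes(width, height):
    """VRAM needed when the framebuffer is allocated in whole pages."""
    pages = math.ceil(width / PAGE_W) * math.ceil(height / PAGE_H)
    return pages * PAGE_BYTES


assert framebuffer_bytes(16, 8) == 512       # the 16x8 demo framebuffer
assert min_vram_bytes(16, 8) == 8 * 1024     # one full page, as in the text
```

The same helper gives a quick estimate of how far vram_stub.BYTES must grow for a larger framebuffer before textures are considered.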
Ch145 — useg_shadow_mem parameterization: pre-Ch145, the
ee_memory_map_stub's useg-shadow backing was a fixed 1M-word /
4 MiB array. That was correct for the BIOS-smoke chapters that
need full first-4-MiB-of-useg coverage, but it's wasted area
for the Ch123 hardware demo (which never touches useg — the
bootlet runs from BIOS at 0xBFC0_0000 and the GIF payload from
RAM at phys 0x100). Ch145 promotes USEG_SHADOW_WORDS from a
hardcoded localparam to the USEG_SHADOW_WORDS_PARAM module
parameter (default 1M words = 4 MiB → existing TBs unchanged).
For the Ch123 hardware demo, the top-level wrapper instantiates
ee_memory_map_stub with .USEG_SHADOW_WORDS_PARAM(1024) to
shrink the inferred BRAM footprint by ~1024×; correctness is
unaffected because no useg access ever happens in the Ch123
data plane.
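
The footprint reduction is straightforward to verify. A tiny sketch, assuming 32-bit (4-byte) shadow words as implied by "1M-word / 4 MiB"; the helper name is hypothetical:

```python
BYTES_PER_WORD = 4            # implied by 1M words == 4 MiB
DEFAULT_WORDS = 1 << 20       # USEG_SHADOW_WORDS_PARAM default
DEMO_WORDS = 1024             # override used by the Ch123 hardware wrapper


def shadow_bytes(words):
    """Backing-store size of the useg shadow for a given word count."""
    return words * BYTES_PER_WORD


assert shadow_bytes(DEFAULT_WORDS) == 4 * 1024 * 1024   # 4 MiB
assert shadow_bytes(DEMO_WORDS) == 4 * 1024             # 4 KiB
assert DEFAULT_WORDS // DEMO_WORDS == 1024              # the ~1024x shrink
```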
Clock / reset assumptions:
- Single clock domain (clk): all 12 modules share one input.
- Active-low synchronous reset input (rst_n): also a single shared input. No reset gating, no per-module variants. The reset is sampled inside always_ff @(posedge clk) via the if (!rst_n) pattern (NOT posedge clk or negedge rst_n), i.e., it is NOT an async reset despite being active-low. On FPGA this should be brought up via the device's reset bridge so the deasserting edge is synchronous to clk.
- No clock gating, no derived clocks. The PCRTC's hsync/vsync/de are regular clock-domain outputs, not separate clocks.
Swizzle gate parameter defaults:
- All four swizzle parameters (PSMCT32_SWIZZLE, PSMCT16_SWIZZLE, PSMT8_SWIZZLE, PSMT4_SWIZZLE) default to 1'b0 on gs_stub, gs_pcrtc_stub, and gif_image_xfer_stub. For the Ch123 hardware demo, instantiate these three modules with .PSMCT32_SWIZZLE(1'b1) and the other three left at 1'b0. The swizzle-math primitives (gs_swizzle_psmct32_stub etc.) are pure-comb and trim cleanly when their gate is off.
Top-level harness expectations (for a future
top_psmct32_raster_demo.sv):
- Inputs: clk, rst_n, plus board-level video-out connections (HDMI / DVI / VGA, driven by r/g/b/hsync/vsync/de from gs_pcrtc_stub).
- The EE bootlet image must be preloaded into bios_rom_stub via either IMAGE_FILE (→ $readmemh) or a bake-step that writes a .mem next to the synthesis project. The bootlet is 18 MIPS instructions (currently authored procedurally in the Ch123 TB body via ee_prog_word()); for hardware this needs to become a static .mem checked into the repo.
- The GIF payload must be preloaded into ee_ram_stub via the same mechanism: 24 qwords starting at PAYLOAD_MADR=0x100. Current TB authors them procedurally with preload_qword(); hardware needs a static .mem.
- The core_go signal must be tied high (or pulsed by a board reset-release sequencer) so the EE starts fetching from 0xBFC0_0000.
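
One possible shape for the bake step is a small script that turns a word list into `$readmemh`-compatible text (one hex word per line, in address order). This is a sketch rather than the project's tooling: `to_readmemh` is a hypothetical name, the example words are placeholders and not the Ch123 bootlet, and the 32-bit and 128-bit widths assume the BIOS array is word-wide and the RAM array is qword-wide.

```python
def to_readmemh(words, width_bits=32):
    """Render integers as $readmemh text: one zero-padded hex word per line."""
    digits = width_bits // 4
    lines = []
    for addr, word in enumerate(words):
        if not 0 <= word < (1 << width_bits):
            raise ValueError(f"word {addr} does not fit in {width_bits} bits")
        lines.append(f"{word:0{digits}x}")
    return "\n".join(lines) + "\n"


# Placeholder program: `lui $t0, 0x1200` followed by a NOP.
bootlet_mem = to_readmemh([0x3C081200, 0x00000000])
assert bootlet_mem == "3c081200\n00000000\n"

# Placeholder 128-bit qword payload entry.
payload_mem = to_readmemh([0x1], width_bits=128)
assert payload_mem == "0" * 31 + "1\n"
```

Writing the returned string to a file next to the synthesis project would give the static .mem the list above calls for, regardless of which flow consumes it.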
Known sim-only constructs that should NOT block first build:
- $display lines in BIOS/RAM init (synth ignores).
- $readmemh (synth tools handle it for BRAM init).
- $error parameter validators (synth ignores).
Known sim-only constructs that WOULD block first build:
- None found in the Ch123 dep tree.
Open questions for the hardware-build session (deliberately not answered here — they need a board-level decision):
- Target FPGA family + clock frequency (PCRTC was designed around 13.5 MHz pixel clock for the 16×8 active area; first build can run at any clock since the TB doesn't model real CRTC timing).
- Video-out PHY (HDMI core, VGA DAC, on-board HDMI transmitter chip).
- BIOS / payload bake step (Vivado update_compile_order + .mem files vs. a SystemVerilog localparam array preload).
- Whether to keep ee_core_stub's STRICT_UNSUPPORTED gate active on hardware (catches unknown opcodes by halt+latch; useful for debugging, but a hard failure on any unintended fetch).
The Ch90 white-box TB tb_gs_scanout_basic.sv exercises the
full round trip: instantiates gs_stub + vram_stub +
gs_pcrtc_stub, drives a 4×4 sprite through the GIF reg port,
waits for raster to fully drain, then enables scanout and
captures one full frame's (hcnt, vcnt) → (r, g, b) trace.
Asserts: every pixel inside the sprite reads as the emitted
color, every pixel outside reads as 0, and at least one EV_MODE
frame trace fires.
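
The two pixel assertions amount to a simple predicate over the captured trace. The sketch below models it in software; it is not the TB code. It assumes the trace has already been reduced to active-area pixel coordinates, and the frame size, sprite position and color are hypothetical. The EV_MODE trace check is not modelled.

```python
def check_frame(trace, sprite, color):
    """Return mismatches for a {(x, y): (r, g, b)} trace against one sprite.

    sprite is (x0, y0, width, height); pixels outside it must read as black.
    """
    x0, y0, w, h = sprite
    mismatches = []
    for (x, y), rgb in sorted(trace.items()):
        inside = x0 <= x < x0 + w and y0 <= y < y0 + h
        expected = color if inside else (0, 0, 0)
        if rgb != expected:
            mismatches.append(((x, y), rgb, expected))
    return mismatches


# Hypothetical 16x8 active area with a 4x4 sprite at the origin.
SPRITE = (0, 0, 4, 4)
COLOR = (0xFF, 0x80, 0x40)
frame = {
    (x, y): (COLOR if x < 4 and y < 4 else (0, 0, 0))
    for y in range(8)
    for x in range(16)
}
assert check_frame(frame, SPRITE, COLOR) == []
```

Returning the mismatch list rather than a boolean makes a failing frame easier to diagnose, since each entry names the pixel, the observed color and the expected one.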