thejayman77 ec82764bef Initial commit: retroDE_ps2 — first-of-its-kind PS2 GS FPGA core (DE25-Nano / Agilex 5)
RTL (GS rasterizer, EE core stub, platform bridge, LPDDR4B path), sim regression
(272 TBs), docs, and tooling. Copyrighted PS2 content (BIOS, game code, GS dumps,
and all dump-derived textures/traces) is excluded via .gitignore and stays local.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-29 20:10:50 -04:00

GIF/GS Contract

Status: Draft

Purpose

Define the graphics ingress and rendering/display boundary.

Owns

  • GIF path intake and arbitration,
  • GIF tag interpretation,
  • GS register decode,
  • GS VRAM-visible operations,
  • framebuffer/zbuffer/texture-visible state handling,
  • PCRTC/display output generation or a planned approximation layer.

Inputs

  • DMAC channel 2 traffic,
  • VIF/VU-generated graphics traffic,
  • privileged GS register writes,
  • reset and display configuration controls.

Outputs

  • VRAM updates,
  • display timing and pixel output,
  • status/interrupt signals,
  • packet and register trace events.

Questions to lock

  • What is the first output milestone:
    • GS privileged register acceptance only
    • static background color
    • minimal primitive draw
    • gsKit-style demo target
  • Is Phase 1 display based on a faithful GS/PCRTC path or a temporary adapter?
  • What VRAM organization assumptions must stay stable from the beginning?

Allowed early stubs

  • privileged-register-only GS stub,
  • BGCOLOR/test-pattern display path,
  • packet logger with no rendering.

Required debug visibility

  • GIF tags,
  • PATH source and arbitration result,
  • GS register writes,
  • VRAM write summaries,
  • display mode transitions.

First meaningful milestone

  • a known packet stream or direct privileged-register sequence produces a stable, visible, repeatable output and matching trace.

GS write-port contract (Ch75)

The GS model has two architecturally distinct write ports because real PS2 hardware exposes two unrelated register namespaces. Conflating them was a Ch74 mistake; Ch75 split them.

reg_wr_* — privileged GS/MMIO writes

  • Source: CPU MMIO writes to the 0x12000000 privileged-register block, e.g. via platform_video_stub or a direct test-harness path.
  • Address: reg_wr_addr[15:0] is the offset inside the privileged block.
  • Examples: BGCOLOR at offset 0x00E0, PMODE at 0x0000, SMODE2 at 0x0020, etc.
  • Currently latched: BGCOLOR only. Other offsets emit EV_MODE.

gif_reg_* — GIF A+D register-number writes

  • Source: gif_packed_stub consuming a PACKED A+D entry when run with REAL_AD_REG_MAP=1. This is the path to enable for real PS2 packets; the parameter itself still defaults to 0 for back-compat with the project-local Ch72/Ch73 PACKED-A+D layout.
  • Address: gif_reg_num[7:0] is the GIF A+D register number straight out of the PACKED entry's in_data[71:64]. Source-of-truth is PCSX2 pcsx2/GS/GSRegs.h.
  • Currently decoded: PRIM=0x00, RGBAQ=0x01, XYZF2=0x04, XYZ2=0x05, FRAME_1=0x4C, ZBUF_1=0x4E (not 0x4F — that is ZBUF_2). Each has a dedicated 64-bit latch output. Other reg numbers emit EV_MODE.

Event taxonomy

The two write paths emit different events. Read this carefully — arg2 semantics differ across emitters.

  • EV_BGCOLOR — emitted only by gs_stub on the privileged port when reg_wr_addr == 0x00E0. Carries the unpacked R/G/B in arg0/arg1/arg2. The privileged port has no per-register "selector" beyond this dedicated event; everything else on that port goes to EV_MODE with arg0=offset, arg1=data.

  • EV_WRITE — emitted in two places with different arg2 semantics:

    • gif_packed_stub on a PACKED A+D accept (REGS nibble = 0xE). Carries the raw PACKED address bits in arg2 ({48'd0, in_data[79:64]}). Under REAL_AD_REG_MAP=1 the low 8 bits are the real GIF reg# (in_data[71:64]); under REAL_AD_REG_MAP=0 the low 16 bits are the project-local privileged-style offset. Not a stable selector — it is the address half of the wire.
    • gs_stub on the gif_reg_* port for a tracked GIF reg (PRIM/RGBAQ/XYZF2/XYZ2/FRAME_1/ZBUF_1). Carries a stable per-register selector in arg2: 1=PRIM, 2=RGBAQ, 3=XYZF2, 4=XYZ2, 5=FRAME_1, 6=ZBUF_1, 7=TEX0_1 (Ch98). arg0=reg#, arg1=data. Use this selector for trace-side filtering; it does not depend on REAL_AD_REG_MAP.
    • Ch76 caveat: a tracked vertex commit (XYZ2 or XYZF2) on the gif_reg_* port that closes a primitive does NOT emit EV_WRITE that cycle — EV_PRIM_DRAW preempts it (see below). The xyz2_q / xyzf2_q latch still updates. Trace consumers counting "vertices seen" must sum EV_WRITE(selector=3 or 4) + EV_PRIM_DRAW to get the true total.
  • EV_PRIM_DRAW — Ch76 / Ch77. Fired by gs_stub once per primitive completion: when an XYZ2 or XYZF2 vertex commit on the gif_reg_* port closes a primitive under the current PRIM[2:0]. Preempts the EV_WRITE that the closing vertex would otherwise have emitted. Args: arg0=PRIM[2:0] (prim type), arg1=primary threshold, arg2=cumulative prim_complete_count post-increment, arg3=closing vertex data (the same 64 bits that latched into xyz2_q / xyzf2_q on this cycle).

    • Discrete primitives (vertices per draw N: POINT=1, LINE=2, TRIANGLE=3, SPRITE=2): one draw per N vertices; the vertex counter resets to 0 after each draw.
    • Strip / fan primitives (anchor count N: LINE_STRIP=2, TRI_STRIP=3, TRI_FAN=3): Ch77. Anchor on the first N vertices, then fire one draw per additional vertex commit. The vertex counter saturates at the primary threshold so every subsequent vertex closes another primitive. Ch78 adds vertex-identity tracking that distinguishes TRI_STRIP rolling triangles {v_n-2, v_n-1, v_n} from TRI_FAN pivot triangles {v_pivot, v_n-1, v_n}; see the next section.
    • Reserved (PRIM=7): no draw, vertex commits do not increment the counter, latches still update.
    • A PRIM write always resets the vertex counter so a fresh primitive type starts cleanly.
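
The counter rules above can be sketched as a small software model. This is a hypothetical Python reference model, not the gs_stub RTL; the class and method names are invented for illustration, and only the thresholds and reset/saturate behavior come from the text.

```python
# Vertex thresholds per PRIM[2:0], taken from the contract text above.
THRESHOLD = {0: 1, 1: 2, 2: 2, 3: 3, 4: 3, 5: 3, 6: 2}
STRIP_OR_FAN = {2, 4, 5}  # LINE_STRIP, TRI_STRIP, TRI_FAN


class PrimCounter:
    """Hypothetical model of the gs_stub vertex counter (names are assumptions)."""

    def __init__(self):
        self.prim = 0
        self.count = 0                # vertices seen toward the next close
        self.prim_complete_count = 0  # cumulative closes, as carried in arg2

    def write_prim(self, prim_type):
        self.prim = prim_type & 0x7
        self.count = 0                # a PRIM write always resets the counter

    def commit_vertex(self):
        """Return True if this XYZ2/XYZF2 commit closes a primitive."""
        if self.prim == 7:            # reserved: no count, no draw
            return False
        n = THRESHOLD[self.prim]
        self.count = min(self.count + 1, n)
        if self.count < n:
            return False
        self.prim_complete_count += 1
        if self.prim not in STRIP_OR_FAN:
            self.count = 0            # discrete types restart after each draw
        return True                   # strips and fans stay saturated at n
```

A trace consumer could run such a model next to the event stream and compare prim_complete_count against arg2 of each EV_PRIM_DRAW.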

Per-primitive vertex snapshot (Ch78)

Alongside EV_PRIM_DRAW, gs_stub exposes three 64-bit outputs — prim_v0_q, prim_v1_q, prim_v2_q — that hold the vertex tuple of the most recently closed primitive. Snapshot is registered on the same clock edge as the ev_valid pulse and held until the next prim_complete, so a TB can sample it at the same time it sees gs_ev_event == EV_PRIM_DRAW.

The number of valid slots is implicit in PRIM[2:0]:

| PRIM | type | valid slots | semantics |
| --- | --- | --- | --- |
| 0 | POINT | v0 | the single vertex |
| 1 | LINE | v0, v1 | endpoints |
| 2 | LINE_STRIP | v0, v1 | each segment uses {v_n-1, v_n} |
| 3 | TRIANGLE | v0, v1, v2 | the three vertices |
| 4 | TRI_STRIP | v0, v1, v2 | rolling: {v_n-2, v_n-1, v_n} |
| 5 | TRI_FAN | v0, v1, v2 | pivot+rolling: {v_pivot, v_n-1, v_n} |
| 6 | SPRITE | v0, v1 | top-left + bottom-right |
| 7 | reserved | | observer never closes |

The TRI_STRIP-vs-TRI_FAN distinction lives entirely in the saturated-extension path: a TRI_STRIP advances v0 each draw with the rolling window; a TRI_FAN pins v0 to v_pivot (the first vertex committed since the most recent PRIM write). On the anchor draw, v_pivot and the rolling v_prev happen to coincide, so TRI_STRIP and TRI_FAN report the same tuple for their first triangle.

A PRIM write clears the rolling window (v_curr / v_prev / v_prev_prev / v_pivot / pivot_seen) so a fresh primitive context starts with no residual vertex bleed. Slots not used by the current primitive type read 0.

The snapshot tracks identity, not geometry — the values written are the raw 64-bit gif_reg_data payloads of XYZ2 / XYZF2 commits, with no decoding into screen-space coordinates. Rasterization is still out of scope.
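
The TRI_STRIP versus TRI_FAN tuple selection can be illustrated with a minimal Python sketch. It is an assumption-laden model of the rolling window described above (the class name and method names are hypothetical); the internal names v_curr, v_prev, v_prev_prev, v_pivot, and pivot_seen follow the text.

```python
class VertexWindow:
    """Hypothetical model of the Ch78 rolling vertex window."""

    def __init__(self):
        self.clear()

    def clear(self):
        """Model of a PRIM write: drop all rolling state."""
        self.v_curr = self.v_prev = self.v_prev_prev = self.v_pivot = 0
        self.pivot_seen = False

    def commit(self, payload):
        """Shift in the raw 64-bit payload of an XYZ2/XYZF2 commit."""
        self.v_prev_prev, self.v_prev, self.v_curr = (
            self.v_prev, self.v_curr, payload)
        if not self.pivot_seen:       # first vertex since the last PRIM write
            self.v_pivot, self.pivot_seen = payload, True

    def triangle(self, is_fan):
        """Tuple (v0, v1, v2) reported when a triangle closes."""
        v0 = self.v_pivot if is_fan else self.v_prev_prev
        return (v0, self.v_prev, self.v_curr)
```

After three commits both primitive types report the same tuple; from the fourth commit onward the fan keeps v0 pinned while the strip rolls it.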

Per-primitive color snapshot (Ch79 / Ch80)

prim_color_q[63:0] is registered on the same edge as prim_v0_q / prim_v1_q / prim_v2_q and carries the value of rgbaq_q at the moment the primitive closed. RGBAQ writes are separate A+D entries from XYZ2 / XYZF2 commits (gif_packed_stub serializes A+D to one accept per cycle), so rgbaq_q is always settled to its draw-time value when prim_complete_now fires.

prim_color_q reads 0 if no RGBAQ has been written since reset; rgbaq_q itself is not cleared on a PRIM write — color carries forward across PRIM context switches, matching real GS behavior — but it does reset to 0 on rst_n.

Per-vertex Gouraud color (Ch80)

For real game streams that interleave RGBAQ writes with vertex commits to drive Gouraud shading, gs_stub exposes three additional outputs:

| Output | Slot semantics |
| --- | --- |
| prim_color_v0_q[63:0] | color of vertex 0 |
| prim_color_v1_q[63:0] | color of vertex 1 |
| prim_color_v2_q[63:0] | color of vertex 2 |

A parallel rolling color window (c_curr_q / c_prev_q / c_prev_prev_q / c_pivot_q, internal) samples rgbaq_q on every vertex commit, mirroring the Ch78 vertex-identity window. The snapshot layout matches the vertex layout exactly:

| PRIM | type | _v0_q color of | _v1_q color of | _v2_q color of |
| --- | --- | --- | --- | --- |
| 0 | POINT | the single vertex | 0 | 0 |
| 1 | LINE | first endpoint | closing | 0 |
| 2 | LINE_STRIP | previous vertex | closing | 0 |
| 3 | TRIANGLE | v_n-2 | v_n-1 | closing |
| 4 | TRI_STRIP | v_n-2 (rolls) | v_n-1 | closing |
| 5 | TRI_FAN, anchor | v1 (≡ pivot) | v2 | v3 |
| 5 | TRI_FAN, saturated | v_pivot (PINNED) | v_n-1 | closing |
| 6 | SPRITE | first endpoint | closing | 0 |

prim_color_q is exactly the closing-vertex color (≡ prim_color_v_close), kept as a convenience alias for consumers that don't care about Gouraud.

For flat-shaded primitives (RGBAQ written once before the strip), all per-vertex color slots used by the primitive equal each other and equal prim_color_q. For Gouraud-shaded primitives (RGBAQ rewritten between vertex commits), the slots may differ — capturing the per-vertex color identity needed to distinguish a strip's rolling colors from a fan's pivot color.

The color window is cleared on PRIM write (unlike rgbaq_q itself, which carries forward). This means per-vertex color identity stays tied to the current primitive context — a stream that switches PRIM types mid-context starts color tracking fresh for the new context. Slots not used by the current primitive type read 0.

Like the vertex snapshot, this captures identity, not interpolated geometry — the stored values are the raw 64-bit RGBAQ payloads (packing R, G, B, A, and the texture-coord divisor Q together); GS-style Gouraud interpolation across the primitive interior remains out of scope.

Structured-field decode (Ch81)

gs_stub exposes pre-decoded snapshot outputs alongside the raw 64-bit slots so a downstream rasterizer or pixel-emit path doesn't have to re-derive bit fields:

| Output | Type | Carries |
| --- | --- | --- |
| prim_v0_decoded_q / _v1_ / _v2_ | trace_pkg::vertex_t | x / y / z / fog / is_xyzf2 per slot |
| prim_v0_color_decoded_q / _v1_ / _v2_ | trace_pkg::color_t | r / g / b / a / q per slot |

The decoded outputs latch on the same edge as the raw snapshots, so a TB samples both atomically with EV_PRIM_DRAW.

vertex_t and the XYZ2 / XYZF2 distinction

typedef struct packed {
    logic        is_xyzf2;  // 1 = XYZF2 source, 0 = XYZ2
    logic [7:0]  fog;       // valid iff is_xyzf2; else 0
    logic [31:0] z;         // 32-bit (XYZ2) or zero-extended 24-bit (XYZF2)
    logic [15:0] y;         // 12.4 fixed-point screen Y
    logic [15:0] x;         // 12.4 fixed-point screen X
} vertex_t;

XYZ2 packs full 32-bit Z in data[63:32]. XYZF2 packs 24-bit Z in data[55:32] and an 8-bit fog byte in data[63:56]. The is_xyzf2 flag is registered in a parallel rolling format-flag window (xyzf2_curr_q / xyzf2_prev_q / xyzf2_prev_prev_q / xyzf2_pivot_q) that tracks the source format of each vertex through the rolling window — so when an XYZF2 vertex rolls into the v_prev slot of a TRI_STRIP saturated extension, its is_xyzf2 flag rolls with it.

Cleared on rst_n and on PRIM write, same as the vertex/color windows.

color_t

typedef struct packed {
    logic [31:0] q;  // texture-coord divisor (IEEE float)
    logic [7:0]  a;
    logic [7:0]  b;
    logic [7:0]  g;
    logic [7:0]  r;
} color_t;

Direct bit-slice of the RGBAQ payload — no interpretation. Q is carried verbatim as a 32-bit IEEE float (the GS uses it for texture coordinate division during rasterization, which remains out of scope).

Decode helper functions

trace_pkg exposes decode_vertex(data, is_xyzf2) and decode_color(data) so downstream code can re-decode raw 64-bit values consistently with the gs_stub snapshot.

The decoded outputs are an additive contract — the raw prim_v*_q and prim_color_v*_q outputs continue to work for consumers that don't care about per-channel decoding.
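
For trace post-processing outside the simulator, the same bit slicing can be mirrored in software. The sketch below is a hypothetical Python stand-in, not the trace_pkg source. The Z and fog positions and the color byte order follow the text and struct layouts above; placing X in data[15:0] and Y in data[31:16] is an assumption based on the standard GS XYZ2 register layout, since this section does not state those positions.

```python
def decode_vertex_ref(data, is_xyzf2):
    """Software mirror of vertex_t (X/Y bit positions are an assumption)."""
    x = data & 0xFFFF                    # 12.4 fixed-point screen X
    y = (data >> 16) & 0xFFFF            # 12.4 fixed-point screen Y
    if is_xyzf2:
        z = (data >> 32) & 0xFFFFFF      # 24-bit Z in data[55:32]
        fog = (data >> 56) & 0xFF        # fog byte in data[63:56]
    else:
        z = (data >> 32) & 0xFFFFFFFF    # full 32-bit Z in data[63:32]
        fog = 0
    return {"is_xyzf2": bool(is_xyzf2), "fog": fog, "z": z, "y": y, "x": x}


def decode_color_ref(data):
    """Software mirror of color_t: direct bit-slice of the RGBAQ payload."""
    return {
        "r": data & 0xFF,
        "g": (data >> 8) & 0xFF,
        "b": (data >> 16) & 0xFF,
        "a": (data >> 24) & 0xFF,
        "q": (data >> 32) & 0xFFFFFFFF,  # raw IEEE-float bits, not interpreted
    }


def screen_int(coord_12_4):
    """Integer part of a 12.4 coordinate (top 12 bits)."""
    return (coord_12_4 >> 4) & 0xFFF
```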

Minimal pixel emit (Ch82)

gs_stub exposes a per-primitive pixel emit — the smallest possible output that ties the recognition layer to a framebuffer destination. One pixel per closed primitive (the closing vertex, in screen-space integer coords), addressed by the latched frame_1_q register. No interpolation, no coverage, no rasterization — this is the contact point for a future raster chapter, not a substitute for one.

| Output | Width | Carries |
| --- | --- | --- |
| pixel_emit | 1 | 1-cycle strobe; pulses on the same edge as prim_complete |
| pixel_emit_count | 32 | Running tally of emits since reset |
| pixel_x_q / pixel_y_q | 12 | Closing vertex integer screen coords (top 12 bits of 12.4 fixed-point) |
| pixel_color_q | 64 | RGBAQ at the emit moment (= prim_color_q) |
| pixel_fbp_q | 9 | FRAME_1[8:0] (framebuffer base / 2048) |
| pixel_fbw_q | 6 | FRAME_1[21:16] (framebuffer width / 64 in pixels) |
| pixel_psm_q | 6 | FRAME_1[29:24] (pixel storage format) |
| pixel_fb_addr_q | 32 | Computed VRAM byte offset (see below) |

Address arithmetic

fb_addr = FBP * 2048 + (Y * FBW * 64 + X) * bytes_per_pixel

Ch83 added PSM-aware bytes_per_pixel derived from the latched FRAME_1[29:24] (PSM field):

| PSM (hex) | Format | bytes/pixel | Notes |
| --- | --- | --- | --- |
| 00, 01 | PSMCT32 / PSMCT24 | 4 | host-word |
| 02, 0A | PSMCT16 / PSMCT16S | 2 | |
| 13 | PSMT8 | 1 | indexed |
| 14 | PSMT4 | 4 here (host-word) | legacy pixel_emit channel only; see note below |
| 1B, 24, 2C | PSMT8H / PSMT4HL / PSMT4HH | 4 | host-word (high/low nibble of 32-bit slot) |
| 30, 31 | PSMZ32 / PSMZ24 | 4 | depth |
| 32, 3A | PSMZ16 / PSMZ16S | 2 | depth |
| other | | 4 (host-word fallback) | unrecognized PSM |

This table describes the legacy pixel_emit channel (the single-pixel-per-primitive debug strobe from Ch82/Ch83). That channel does not commit to vram_stub; it only emits a trace event. Its PSMT4 entry stays at host-word fallback — the recognition layer never tracked sub-byte position there.
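
As a cross-check for testbench expectations, the legacy-channel address arithmetic can be written out in a few lines. This is a minimal Python sketch of the formula and table above; the function names are hypothetical, and truncating the result to 32 bits is an assumption based on the stated width of pixel_fb_addr_q.

```python
# bytes_per_pixel for the legacy pixel_emit channel, keyed by FRAME_1[29:24].
BYTES_PER_PIXEL = {
    0x00: 4, 0x01: 4,              # PSMCT32 / PSMCT24 (host-word)
    0x02: 2, 0x0A: 2,              # PSMCT16 / PSMCT16S
    0x13: 1,                       # PSMT8
    0x14: 4,                       # PSMT4: host-word fallback on this channel
    0x1B: 4, 0x24: 4, 0x2C: 4,     # PSMT8H / PSMT4HL / PSMT4HH
    0x30: 4, 0x31: 4,              # PSMZ32 / PSMZ24
    0x32: 2, 0x3A: 2,              # PSMZ16 / PSMZ16S
}


def frame_1_fields(frame_1):
    """Return (FBP, FBW, PSM) sliced from a 64-bit FRAME_1 value."""
    return frame_1 & 0x1FF, (frame_1 >> 16) & 0x3F, (frame_1 >> 24) & 0x3F


def legacy_fb_addr(frame_1, x, y):
    """fb_addr = FBP*2048 + (Y*FBW*64 + X) * bytes_per_pixel."""
    fbp, fbw, psm = frame_1_fields(frame_1)
    bpp = BYTES_PER_PIXEL.get(psm, 4)   # unrecognized PSM: host-word fallback
    return (fbp * 2048 + (y * fbw * 64 + x) * bpp) & 0xFFFFFFFF
```

For example, with FBP=1, FBW=1, and PSMCT32, pixel (3, 2) resolves to 2048 + (2*64 + 3)*4 = 2572.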

The raster channel (raster_pixel_emit) does NOT use this table. It owns its own PSM-aware emit packing in S2 with full PSMT4 support after Ch106:

  • Byte address = pixel_index >> 1 (overrides the pixel_index << ras_bpp_shift form).
  • The 4-bit index from R[3:0] is placed in the targeted nibble (low/high keyed by pixel_index[0]) of write_data[7:0].
  • raster_pixel_be_q = 4'b0001, raster_pixel_mask_q = 0x0F or 0xF0 so vram_stub's per-bit merge updates only that nibble.

PSMT8H / PSMT4HL / PSMT4HH still address the host 32-bit slot, not the high/low byte/nibble within it; the extracted sub-byte is rasterizer/blit-specific and out of scope here.

pixel_psm_q is still exposed verbatim so consumers can apply their own sub-slot offset arithmetic if needed.
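
The PSMT4 packing on the raster channel can be sketched as follows. This is a hypothetical Python model, not the S2 RTL; it assumes that pixel_index[0] = 1 selects the high nibble, which matches the mask values listed above but is not spelled out in this section.

```python
def psmt4_emit(pixel_index, r_channel):
    """Return (byte_addr, write_data, byte_enable, mask) for one PSMT4 pixel.

    Assumption: an odd pixel_index targets the high nibble of the byte.
    """
    nibble = r_channel & 0xF          # 4-bit index from R[3:0]
    high = pixel_index & 1
    byte_addr = pixel_index >> 1      # two pixels share one byte
    write_data = (nibble << 4) if high else nibble
    mask = 0xF0 if high else 0x0F     # only the targeted nibble is merged
    return byte_addr, write_data, 0b0001, mask
```

Pixels 0 and 1 land in the same byte with complementary masks, which is what lets a per-bit merge in the VRAM model preserve the neighbouring nibble.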

Carry-forward semantics

frame_1_q is part of the standard GIF-context register file and carries forward across PRIM writes (matching real GS). A stream that sets FRAME_1 once and then emits multiple primitives correctly addresses all of them. A stream that never writes FRAME_1 lands every pixel at fb_addr=0; this is observable but not useful, and it behaves cleanly under reset.

rgbaq_q likewise carries forward, so pixel_color_q reflects the most recent RGBAQ write at emit time. If a Gouraud-style stream rewrites RGBAQ between vertices, pixel_color_q captures the closing-vertex color — same semantic as Ch79's prim_color_q.

Strobe channel, not trace event

pixel_emit is a dedicated 1-cycle strobe alongside the snapshot outputs, not a multiplexed event on the main ev_valid trace stream. This avoids contention with EV_PRIM_DRAW on the close cycle. A consumer that wants both can sample on pixel_emit posedge and read the snapshots atomically.

Minimal interior rasterizer (Ch84)

gs_stub adds a separate per-interior-pixel emit channel alongside the per-primitive pixel_emit of Ch82. The Ch82 strobe is unchanged (still pulses once per closed primitive); the new channel pulses once per pixel that the rasterizer determines is inside the closed primitive's interior.

| Output | Width | Carries |
| --- | --- | --- |
| raster_pixel_emit | 1 | 1-cycle strobe per emitted interior pixel |
| raster_pixel_emit_count | 32 | Cumulative interior pixels emitted since reset |
| raster_pixel_x_q / _y_q | 12 | Integer screen coords of the emitted pixel |
| raster_pixel_color_q | 64 | Per-pixel color: Gouraud-interpolated R/G/B/A for TRI/TRI_STRIP/TRI_FAN (Ch86), flat (= prim_color_q) for SPRITE. Q passes through from the closing vertex. |
| raster_pixel_fb_addr_q | 32 | Computed VRAM byte offset (PSM-aware, same math as Ch82/Ch83) |
| raster_active | 1 | High while the FSM is scanning a primitive |
| raster_overflow | 1 | Latches if a new primitive closes while the 2-entry raster FIFO is full and no concurrent pop frees a slot (Ch87 + audit-medium fix). See "Raster command queue (Ch87)" below for the back-to-back-close budget. |
| raster_degenerate | 1 | Latches if a TRI/STRIP/FAN closes with zero signed area (3 colinear vertices). SCAN is skipped; SPRITE never sets this. |

Per-primitive coverage

| PRIM | Raster behavior |
| --- | --- |
| 0 POINT | No raster emit; Ch82 closing-pixel only |
| 1 LINE | No raster emit; Ch82 closing-pixel only |
| 2 LINE_STRIP | No raster emit; Ch82 closing-pixel only |
| 3 TRIANGLE | Bounding-box scan with edge-function half-plane test |
| 4 TRI_STRIP | Same engine as TRIANGLE, fires per closed strip triangle |
| 5 TRI_FAN | Same engine as TRIANGLE, fires per closed fan triangle |
| 6 SPRITE | Bounding-box rectangle fill (every pixel inside) |
| 7 reserved | No raster emit |

Triangle edge-function math

For each candidate pixel p and each edge (vA, vB) of the triangle:

e(p) = (p.x - vA.x) * (vB.y - vA.y) - (p.y - vA.y) * (vB.x - vA.x)

32-bit signed math is used to avoid overflow at typical coord ranges.
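
A minimal Python sketch of the edge function follows. The edge ordering (v0,v1), (v1,v2), (v2,v0) is an assumption inferred from the barycentric weight mapping later in this document, and the function names are hypothetical. Python integers do not wrap, so this sketch does not model 32-bit overflow.

```python
def edge(p, va, vb):
    """e(p) for the edge (va, vb); points are (x, y) integer tuples."""
    return (p[0] - va[0]) * (vb[1] - va[1]) - (p[1] - va[1]) * (vb[0] - va[0])


def edge_values(p, v0, v1, v2):
    """Return (e0, e1, e2) for edges (v0,v1), (v1,v2), (v2,v0)."""
    return (edge(p, v0, v1), edge(p, v1, v2), edge(p, v2, v0))
```

For the triangle (0,0), (4,0), (0,4), the interior point (1,1) gives three negative values, while the exterior point (5,5) gives a positive value on the second edge.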

Top-left fill rule (Ch85)

Adjacent triangles that share an edge would double-paint pixels on that edge under a naïve same-sign test. Ch85 applies the standard D3D-style top-left fill rule so each shared-edge pixel is owned by exactly one of the two triangles.

At the IDLE→SCAN transition the FSM:

  1. Computes signed_area = (v1-v0) × (v2-v0).
  2. If signed_area == 0 → degenerate (3 colinear vertices); raster_degenerate latches and SCAN is skipped (no raster pixels emit). The Ch82 pixel_emit and prim_complete pulses still fire — only the interior raster is suppressed.
  3. If signed_area < 0 → CW winding; the FSM swaps v1 and v2 so the rule applies uniformly to a CCW-ordered triangle.
  4. For each edge of the post-swap CCW triangle, classifies it as top-or-left (inclusive) or right/bottom (exclusive):
    • Top edge: horizontal going right (dy == 0 && dx > 0).
    • Left edge: going down in Y-down screen (dy > 0).
    • Anything else is a right or bottom edge.

The inside test in SCAN becomes:

inside = (e[i] + bias[i] <= 0)  for all i in {0, 1, 2}

where bias[i] = 0 if edge i is top-or-left and bias[i] = 1 otherwise. The +1 bias converts the strict < 0 test for right/bottom edges into a non-strict <= 0 test on the biased value, keeping the math integer and uniform.

Result: for any two adjacent triangles sharing an edge, the edge's pixels are inclusive in exactly one triangle's bias configuration and exclusive in the other's — no double-paint.

Some shared-corner pixels may end up unpainted by either triangle. That's the standard top-left rule trade-off: non-overlap takes priority over coverage of every boundary pixel.
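
The biased inside test reduces to a one-line predicate once each edge has been classified. The sketch below is a hypothetical Python helper that takes the top-or-left classification as an input rather than deriving it, so it makes no claim about how the RTL computes dx and dy.

```python
def inside_biased(e, top_left):
    """Inside test with the top-left bias.

    e        -- the three unbiased edge-function values for the pixel
    top_left -- three booleans, True where the edge is top-or-left
    """
    return all(ei + (0 if tl else 1) <= 0 for ei, tl in zip(e, top_left))
```

A pixel lying exactly on an edge (e = 0) is accepted only when that edge is top-or-left, which is how a shared edge ends up owned by one of the two triangles.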

Per-pixel Gouraud color (Ch86)

Triangle interior pixels now use per-pixel Gouraud color interpolation instead of flat shading. The three per-vertex colors (the same Ch80 prim_color_v0_q / prim_color_v1_q / prim_color_v2_q slot mapping) are latched at SCAN init with the same v1↔v2 swap mirror as the vertex coords, so the post-swap CCW vertex order matches the latched color order.

For each interior pixel p, barycentric weights are derived directly from the unbiased edge functions:

L0(p) = -e1(p)   // weight for v0 = signed area of (p, v1, v2)
L1(p) = -e2(p)   // weight for v1
L2(p) = -e0(p)   // weight for v2
       —  L0 + L1 + L2 == sa  for all p inside the triangle

For each color channel ch ∈ {R, G, B, A}:

ch_out(p) = (L0(p)*c0.ch + L1(p)*c1.ch + L2(p)*c2.ch) / sa

Q (the texture-coord IEEE float in c2's upper 32 bits) is not interpolated — it passes through from the closing vertex's RGBAQ unchanged.

For a flat-shaded primitive (RGBAQ written once before all three vertices, all three vertex colors equal), L0+L1+L2 = sa, so the formula collapses to c0 exactly with no rounding error; existing flat-shaded raster TBs (raster_basic, raster_topleft) continue to pass.

The R/G/B/A division uses integer truncation toward zero. Real PS2 GS uses fixed-point with specific rounding rules; the recognition-layer stub is intentionally simpler. SPRITE keeps flat shading (only 2 vertices, no barycentric weights defined).
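
The per-channel interpolation can be checked with a short Python sketch. The helper names are hypothetical; the weight mapping and the truncation toward zero follow the text above. Python's `//` rounds toward negative infinity, so truncation is implemented explicitly.

```python
def trunc_div(n, d):
    """Integer division truncating toward zero (d must be non-zero)."""
    q = abs(n) // abs(d)
    return q if (n < 0) == (d < 0) else -q


def gouraud_channel(e, sa, c0, c1, c2):
    """Interpolate one 8-bit channel from unbiased edge values e = (e0, e1, e2)."""
    l0, l1, l2 = -e[1], -e[2], -e[0]   # weights for v0, v1, v2
    return trunc_div(l0 * c0 + l1 * c1 + l2 * c2, sa)
```

For the triangle (0,0), (4,0), (0,4) with sa = 16, the pixel (1,1) has e = (-4, -8, -4), so the weights are (8, 4, 4). Equal vertex colors return that color unchanged, and c0 = 255 with c1 = c2 = 0 yields 8*255/16 truncated to 127.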

Sprite rectangle fill

A SPRITE has two vertices forming opposite corners. The bounding box is computed via min/max of each axis; every pixel inside the box is emitted in row-major order.
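
A sketch of the sprite walk in Python, assuming integer corner coordinates and an inclusive bounding box on both axes. The inclusive bound is an assumption drawn from the scan terminating at (x_max, y_max); the function name is hypothetical.

```python
def sprite_pixels(v0, v1):
    """Yield every (x, y) of the sprite's bounding box in row-major order."""
    x_min, x_max = sorted((v0[0], v1[0]))
    y_min, y_max = sorted((v0[1], v1[1]))
    for y in range(y_min, y_max + 1):      # rows first
        for x in range(x_min, x_max + 1):  # then columns within the row
            yield (x, y)
```

The corner order does not matter: the min/max step normalizes it before the walk starts.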

FSM and scan timing

The FSM has two states, IDLE and SCAN. On prim_complete_now for an eligible primitive, the FSM latches the vertex tuple, color, FRAME_1 fields, and bounding box, then walks the box one pixel per cycle. For each pixel: combinational inside-test → if inside, pulse raster_pixel_emit and update the snapshot. Returns to IDLE when (ras_cur_x, ras_cur_y) == (x_max, y_max).

Color is Gouraud-interpolated per pixel for triangles (Ch86) and flat for sprites; see the dedicated subsections above for the fill-rule and Gouraud math. The closing-primitive flat color (prim_color_q) is still used as the SPRITE fill color and as a backward-compat reference for flat-shaded TRIs (when all three vertex colors are equal, the Gouraud formula reduces to that flat value with no rounding error).

Coordinates are integer — the 4-bit sub-pixel of 12.4 fixed-point is discarded. Sub-pixel edge adjustment is not modeled (top-left fill rule IS modeled — see Ch85 subsection above).

Raster command queue (Ch87) and raster_overflow

gs_stub has a 2-entry FIFO in front of the SCAN FSM. Every primitive close that targets the rasterizer (RM_TRI / RM_SPRITE) snapshots its full per-prim context (vertices, bias, signed area, per-vertex colors, FRAME_1 fields, bounding box) into the queue at the close cycle. The FSM dequeues the oldest entry whenever it's idle or finishing a scan. Effective concurrency is 1 in-flight + 2 queued = up to 3 back-to-back closes absorbed without drop.

raster_overflow now latches when a 4th close arrives while the FIFO is full (1 in-flight, both FIFO slots occupied). The 4th primitive is dropped. Earlier chapters' bound of "1 close mid-scan = overflow" is replaced by Ch87's "3 closes back-to-back = OK; 4th = overflow."

Degenerate triangles are filtered at enqueue: they set raster_degenerate and are not pushed into the queue. SPRITE never sets raster_degenerate. POINT/LINE/LINE_STRIP don't raster (RM_NONE) — they don't enqueue at all and the queue ignores them.

Pop happens at the IDLE→SCAN transition and at drain-done (Ch88; see below) when the queue has more work, so back-to-back scans run contiguously without an IDLE bubble. raster_active stays high across the boundary.

Real PS2 game streams emit thousands of primitives back-to-back; 3-deep concurrency is enough for most TRI_STRIP / TRI_FAN patterns with small bounding boxes. Larger sprites or larger triangles increase scan length and reduce headroom — a future chapter can grow the FIFO depth.
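
The back-to-back-close budget can be illustrated with an event-level Python model. This is a deliberately simplified sketch: it is not cycle-accurate, it does not model the concurrent-pop case mentioned for raster_overflow, and the class and method names are hypothetical.

```python
class RasterQueueModel:
    """Event-level sketch of the Ch87 queue: 1 in-flight scan + 2 FIFO slots."""

    FIFO_DEPTH = 2

    def __init__(self):
        self.in_flight = None
        self.fifo = []
        self.raster_overflow = False
        self.raster_degenerate = False

    def close(self, name, mode, signed_area=1):
        """Primitive close; mode is 'RM_TRI', 'RM_SPRITE' or 'RM_NONE'."""
        if mode == "RM_NONE":
            return                        # POINT/LINE/LINE_STRIP never enqueue
        if mode == "RM_TRI" and signed_area == 0:
            self.raster_degenerate = True
            return                        # filtered at enqueue
        if self.in_flight is None and not self.fifo:
            self.in_flight = name         # idle scanner picks it up at once
        elif len(self.fifo) < self.FIFO_DEPTH:
            self.fifo.append(name)
        else:
            self.raster_overflow = True   # sticky; this primitive is dropped

    def scan_done(self):
        """Scanner finished: pop the oldest queued entry, if any."""
        self.in_flight = self.fifo.pop(0) if self.fifo else None
```

Three closes in a row are absorbed (one in flight, two queued); a fourth close before any scan finishes sets raster_overflow and is dropped.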

Pixel pipeline (Ch88)

The SCAN body is 3 stages, throughput 1 candidate pixel per cycle:

| Stage | Source | Work |
| --- | --- | --- |
| S0 | ras_cur_x / ras_cur_y (bbox walker) | Generate the next candidate coord; advance the bbox walker; on bbox corner, fire ras_at_end_of_s0 and transition R_SCAN→R_DRAIN. |
| S1 | s1_x_q / s1_y_q (registered) | Combinational edge functions on (s1_x, s1_y) against the three triangle edges (or trivial-true for SPRITE), top-left bias, inside test → s1_pixel_inside. Latched into s2_inside_q. |
| S2 | s2_x_q / s2_y_q / s2_L0..L2_q / s2_inside_q | Compute Gouraud interp_byte(λ_i, c_i) × 4 channels and s2_fb_addr from PSM/FBP/FBW. If s2_valid_q && s2_inside_q, drive raster_pixel_emit with the resolved fb_addr / x / y / color. |

raster_state is now a 3-state FSM:

  • R_IDLE — no work; pop_ok fires on a non-empty FIFO.
  • R_SCAN — S0 produces one valid coord per cycle; S1/S2 latches propagate. On bbox corner, transitions to R_DRAIN.
  • R_DRAIN — S0 stops producing valids (s1_valid_q <= 0); S1 and S2 finish their in-flight pixels. When both pipeline valids are low (drain_done), the FSM either pops the next primitive (back-to-back contiguous SCAN) or returns to R_IDLE.

pop_ok = !fifo_empty && (R_IDLE || drain_done) — the end-of-scan pop is now drain-done, three cycles after S0 produces the corner. This guarantees the pipeline-tail pixels of the previous primitive are not overwritten by the next primitive's pop, while still keeping raster_active high across the seam.

Latency from pop_ok to first registered raster_pixel_emit is 3 stages of pipeline + 1 cycle of FIFO turnaround + 1 cycle of registered emit output = 5 posedges from the close cycle of the closing vertex (see sim/tb/gif_gs/tb_gs_raster_pipeline.sv for the cycle-exact contract).

  • EV_MODE — fired for any accept that did not resolve to a tracked register: REGLIST entries, IMAGE/DISABLE payload qwords, NOP-nibble PACKED slots, unknown privileged offsets, unknown GIF reg numbers. Reserved for "we know we saw something, we are intentionally not modeling it yet."

  • EV_GIFTAG — one per accepted GIFtag; carries flg/nreg/nloop/eop for stream-level checking.

When trace event semantics change, audit this section and the per-stub trace-schema header comments together.

VRAM persistence (Ch89)

vram_stub (rtl/gif_gs/vram_stub.sv) is the first persistence layer the rasterizer has had. Every raster_pixel_emit pulse writes 4 bytes of pixel data at raster_pixel_fb_addr_q into vram_stub's linear byte array. A combinational debug read port exposes read_data byte-addressably so testbenches can verify storage.

Wiring:

| vram_stub port | gs_stub source |
| --- | --- |
| write_en | raster_pixel_emit |
| write_addr | raster_pixel_fb_addr_q |
| write_data | raster_pixel_color_q[31:0] (the lower 32 bits; Q in the upper 32 is not framebuffer data) |
| write_be | raster_pixel_be_q (Ch95). Per-byte write enable: byte i (the byte at write_addr + i) is committed only when write_en && write_be[i]. Lets the same 32-bit write port serve PSMs of any byte width. |
| write_mask | raster_pixel_mask_q (Ch106). Per-bit merge mask: for each enabled byte, only the bits selected by mask_i are updated; the remaining bits of mem[i] are kept (mem[i] & ~mask_i). |
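
The write-port behavior can be sketched in Python as a byte-array update. This is a hypothetical model, not vram_stub itself: the merge expression (old & ~mask) | (data & mask) is the conventional read-modify-write form and is assumed here, as is an all-ones default mask for writes that do not need nibble merging.

```python
def vram_write(mem, addr, data, be, mask=0xFFFFFFFF):
    """Commit up to 4 bytes of `data` at `addr` into the bytearray `mem`.

    be   -- 4-bit per-byte enable; bit i gates the byte at addr + i
    mask -- 32-bit per-bit merge mask (assumed all-ones when unused)
    """
    for i in range(4):
        if (be >> i) & 1:
            d = (data >> (8 * i)) & 0xFF
            m = (mask >> (8 * i)) & 0xFF
            # Keep the unmasked bits of the old byte, take masked bits from data.
            mem[addr + i] = (mem[addr + i] & ~m & 0xFF) | (d & m)
```

Two PSMT4 writes to the same byte, one with mask 0x0F and one with mask 0xF0, leave both nibbles in place, and a halfword write with be = 0b0011 touches only two bytes at any alignment.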

Scope (current write-side support, after Ch105):

  • PSMCT32 + PSMCT16 + PSMT8 at the raster write port. The PSM width is selected by gs_stub's bpp_shift mux off FRAME_1.PSM and surfaced as raster_pixel_psm_q; gs_stub's S2 packs the pixel into the right byte lane and drives raster_pixel_be_q so vram_stub commits exactly the right bytes:
    • PSMCT32 (PSM=0x00) → 4 bytes/pixel, be = 4'b1111, ABGR in write_data[31:0].
    • PSMCT16 (PSM=0x02) → 2 bytes/pixel, be = 4'b0011, RGB5A1 packed in write_data[15:0] (Ch95). write_addr is the halfword byte address — per-byte be makes unaligned halfword writes safe.
    • PSMT8 (PSM=0x13) → 1 byte/pixel, be = 4'b0001, the natural ABGR's R channel goes into write_data[7:0] as the PSMT8 index (Ch105). write_addr is the exact byte address; vram_stub commits mem[write_addr] ← write_data[7:0] at any byte alignment without needing data-lane shifting.
    • PSMT4 (PSM=0x14) → 0.5 bytes/pixel (2 pixels per byte), be = 4'b0001, write_mask = 0x0000_000F (low nibble) or 0x0000_00F0 (high nibble) per pixel_index[0]. The 4-bit index (low nibble of natural ABGR's R) is placed in the targeted nibble position in write_data[7:0]. vram_stub merges only the masked bits — the OTHER nibble of the same byte is preserved (Ch106). Back-to-back same-byte emits (e.g. PSMT4 pixels x=0 and x=1, both landing in byte 0) chain through NBA semantics: the second NBA samples mem[addr] AFTER the prior commit, so both nibbles end up in the byte without a bypass-forwarding net.
    • PSMCT24 / PSMCT16S / PSMZ32 / PSMZ24 / PSMZ16 / PSMZ16S / PSMT8H / PSMT4HL / PSMT4HH — bpp_shift falls through to a host-word default (4 bytes); raster emit through these PSMs is not contract-tested.
  • Write-side addressing. Real PS2 VRAM is 4 MiB organized into pages × blocks × columns per PSM. By DEFAULT, both gs_stub raster emit and gif_image_xfer_stub TRXDIR uploads produce the linear-framebuffer layout PCSX2 calls "linear PSM". Optional per-PSM swizzle paths gated by parameters on each module:
    • PSMCT32: PSMCT32_SWIZZLE parameter on gs_pcrtc_stub (Ch120 read-side), gif_image_xfer_stub (Ch121 image-xfer write-side), and gs_stub (Ch122 raster write-side).
    • PSMCT16: PSMCT16_SWIZZLE parameter on gs_pcrtc_stub (Ch126 read-side), gif_image_xfer_stub (Ch127 image-xfer write-side), and gs_stub (Ch128 raster write-side). All three integration points are live, mirroring the PSMCT32 trio.
    • PSMT8: PSMT8_SWIZZLE, the same parameter name on all three modules. Full three-path integration as of Ch134, mirroring the PSMCT32/PSMCT16 trios: the standalone math primitive gs_swizzle_psmt8_stub (Ch131) is wired into gs_pcrtc_stub (Ch132 read-side), gif_image_xfer_stub (Ch133 write-side), and gs_stub (Ch134 raster emit).
    • PSMT4: PSMT4_SWIZZLE, the same parameter name on all three modules. Full three-path integration as of Ch140, mirroring the PSMCT32/PSMCT16/PSMT8 trios: the standalone math primitive gs_swizzle_psmt4_stub (Ch137) is wired into gs_pcrtc_stub (Ch138 read-side), gif_image_xfer_stub (Ch139 write-side), and gs_stub (Ch140 raster emit). The PSMT4 paths additionally thread the swizzle module's nibble_hi output through the existing Ch106 (raster) / Ch118 (image-xfer) nibble RMW machinery, replacing s2_pixel_index[0] / x_eff[0] as the high/low nibble selector when the gate is on.
    • When a swizzle parameter is on, byte addresses route through the per-PSM swizzle module (gs_swizzle_psmct32_stub / gs_swizzle_psmct16_stub). Image-xfer adds dest_base_q = DBP*256 on top of the swizzle output so any DBP works, while raster emit feeds the active ras_fbp directly so the swizzle output is already the absolute address.
    • Per-PSM parameters are independent: enabling one does not affect any other PSM. All parameter defaults are 0, so existing TBs see the legacy linear behavior.
    • All four common GS PSMs (CT32, CT16, T8, T4) now have complete three-path swizzle integration.
  • Stub-sized. Default BYTES = 65536. Real VRAM is 4 MiB; for TB purposes a small linear region is enough to verify that emitted pixels actually land at the addresses gs_stub computes.
  • Scanout path is provided by gs_pcrtc_stub (Ch90 — see below). The legacy platform_video_stub flood-fills BGCOLOR and is unaware of VRAM; TBs that want to verify the round trip use gs_pcrtc_stub instead.

The Ch89 white-box TB tb_gs_vram_writeback.sv exercises the contract end-to-end: drive a 4×4 SPRITE through gs_stub, capture the (fb_addr, color) of each raster_pixel_emit pulse, then read each fb_addr back from vram_stub and assert byte-exact match.

PCRTC scanout (Ch90)

gs_pcrtc_stub (rtl/gif_gs/gs_pcrtc_stub.sv) is the scanout side of the GS pipeline — its dual is gs_stub (the write side). It models a minimal PCRTC (Programmable CRT Controller): runs its own raster timing, generates a VRAM read address from the current (hcnt, vcnt) using the same fb_addr math as gs_stub, reads the byte returned by vram_stub's combinational debug port, and drives r/g/b for the active area. Together with Ch88's pipeline + Ch89's VRAM, this closes the loop:

raster_pixel_emit → vram_stub.write → vram_stub.read → pcrtc.r/g/b

Configuration (Ch91 — privileged-block CPU MMIO):

gs_pcrtc_stub consumes three real PS2 GS privileged display register latches directly from gs_stub:

| pcrtc input | gs_stub source | Layout |
|---|---|---|
| pmode_q[63:0] | privileged write at offset 0x0000 | bit 0 = EN1 (display 1 enable) |
| dispfb1_q[63:0] | privileged write at offset 0x0070 | FBP[8:0], FBW[14:9], PSM[19:15], DBX[42:32] (Ch91-audit), DBY[53:43] (Ch91-audit) |
| display1_q[63:0] (Ch92, Ch93) | privileged write at offset 0x0080 | DX[11:0], DY[22:12], MAGH[26:23] (Ch93 — H scale = MAGH+1), MAGV[28:27] (Ch93 — V scale = MAGV+1), DW[43:32] (width-1), DH[54:44] (height-1) |
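
As a rough illustration of the field layouts listed above, the following Python sketch pulls the named fields out of a 64-bit latch value. The helper names (`bits`, `decode_dispfb1`, `decode_display1`) are hypothetical and do not exist in the RTL; only the bit ranges are taken from this contract.

```python
def bits(value: int, hi: int, lo: int) -> int:
    """Return value[hi:lo] as an unsigned integer (Verilog-style part select)."""
    return (value >> lo) & ((1 << (hi - lo + 1)) - 1)


def decode_dispfb1(q: int) -> dict:
    """Split a DISPFB1 latch value into the fields listed in this contract."""
    return {
        "FBP": bits(q, 8, 0),
        "FBW": bits(q, 14, 9),
        "PSM": bits(q, 19, 15),
        "DBX": bits(q, 42, 32),
        "DBY": bits(q, 53, 43),
    }


def decode_display1(q: int) -> dict:
    """Split a DISPLAY1 latch value; DW/DH are stored as width-1 / height-1."""
    return {
        "DX": bits(q, 11, 0),
        "DY": bits(q, 22, 12),
        "MAGH": bits(q, 26, 23),
        "MAGV": bits(q, 28, 27),
        "DW": bits(q, 43, 32),
        "DH": bits(q, 54, 44),
    }
```

For instance, the DISPFB1 low word 0x000A_0400 that the PSMT4 demos load via LUI/ORI decodes under this sketch to FBP=0, FBW=2, PSM=0x14 (PSMT4).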

The Ch90 sideband ports (scanout_enable / dispfb_fbp / dispfb_fbw) are gone. TBs program scanout the way a real PS2 driver would: write DISPFB1, then DISPLAY1, then PMODE.EN1=1 (Ch92). Out of reset, all three registers are 0, so EN1 is low and pcrtc outputs 0.

scanout_enable inside pcrtc is derived combinationally from the latches: scanout_enable = pmode_q[0] & (PSM ∈ {0, 2, 0x13, 0x14}). PSMCT32 (=0), PSMCT16 (=2), PSMT8 (=0x13), and PSMT4 (=0x14) are honored at this scope; any other PSM forces scanout off rather than mis-decoding the byte layout.

DISPLAY1 (Ch92, Ch93) supplies the display window — the sub-rect inside the active area where pcrtc actually pulls pixels from VRAM — and the per-axis magnification: each VRAM column is shown for (MAGH+1) consecutive VCK pulses, each VRAM line for (MAGV+1) raster lines. Outside the window pcrtc drives r/g/b = 0 even with EN1=1. Pcrtc's H_TOTAL/V_TOTAL still come from module parameters at instantiation; only the active-area sub-rect gated by DX/DY/DW/DH is register-driven. Dual-display (PMODE.EN2 + DISPFB2 + DISPLAY2) is deferred.

Address math + display-window gating + magnification:

hmag_factor    = MAGH + 1                        // 1..16
vmag_factor    = MAGV + 1                        // 1..4
hwin_rel       = hcnt - DX                       // pixel offset inside the window
vwin_rel       = vcnt - DY
in_window      = (hcnt >= DX) && (hwin_rel <= DW)
              && (vcnt >= DY) && (vwin_rel <= DH)
fbp_bytes      = dispfb_fbp << 11               // FBP × 2048
pixels_per_row = dispfb_fbw << 6                // FBW × 64
vram_x_unshift = hwin_rel / hmag_factor          // 4 displayed pixels = 1 VRAM column at MAGH=3
vram_y_unshift = vwin_rel / vmag_factor
effective_x    = vram_x_unshift + DBX
effective_y    = vram_y_unshift + DBY
pixel_index    = effective_y × pixels_per_row + effective_x
bpp_shift      = (PSM == PSMCT32) ? 2 :
                 (PSM == PSMCT16) ? 1 :
                 (PSM == PSMT8)   ? 0 : 2
fb_addr        = fbp_bytes + (pixel_index << bpp_shift)
r/g/b drive    = (de && scanout_enable && in_window) ? decode(VRAM, PSM) : 0
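
The address math above can be mirrored in a small Python reference model, which may be handy for computing expected addresses in a TB-side script. This is a sketch only: the function name `scanout_addr` and its keyword arguments are hypothetical, and adding `fbp_bytes` to the PSMT4 `pixel_index >> 1` offset is an assumption made by analogy with the other PSMs.

```python
PSMCT32, PSMCT16, PSMT8, PSMT4 = 0x00, 0x02, 0x13, 0x14


def scanout_addr(hcnt, vcnt, *, fbp, fbw, psm,
                 dbx=0, dby=0, dx=0, dy=0, dw=0, dh=0, magh=0, magv=0):
    """Return the linear VRAM byte address for raster position (hcnt, vcnt),
    or None when the position falls outside the DISPLAY1 window."""
    hwin_rel = hcnt - dx
    vwin_rel = vcnt - dy
    in_window = (hcnt >= dx and hwin_rel <= dw and
                 vcnt >= dy and vwin_rel <= dh)
    if not in_window:
        return None
    # Undo magnification, then apply the DBX/DBY origin shift.
    effective_x = hwin_rel // (magh + 1) + dbx
    effective_y = vwin_rel // (magv + 1) + dby
    pixel_index = effective_y * (fbw << 6) + effective_x   # FBW x 64 pixels per row
    fbp_bytes = fbp << 11                                  # FBP x 2048
    if psm == PSMT4:
        # Two pixels per byte (assumption: the FBP base is added as for other PSMs).
        return fbp_bytes + (pixel_index >> 1)
    bpp_shift = {PSMCT32: 2, PSMCT16: 1, PSMT8: 0}.get(psm, 2)
    return fbp_bytes + (pixel_index << bpp_shift)
```

With DBX=4 and DBY=2 at FBW=1 and PSMCT32, displayed (0, 0) maps to VRAM pixel (4, 2), so the model returns (2 × 64 + 4) × 4 as the byte address.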

Per-PSM color decode at vram_read_data:

  • PSMCT32: r = data[7:0], g = data[15:8], b = data[23:16]. Alpha at [31:24] is dropped (no DAC channel).
  • PSMCT16 (Ch94): RGB5A1 packed into the lower 16 bits as {A[15], B[14:10], G[9:5], R[4:0]}. 5→8 expansion uses bit-replicate r8 = {r5, r5[4:2]} (so 5'h1F → 8'hFF, 5'h00 → 8'h00). Alpha bit dropped at the DAC.
  • PSMT8 (Ch96/Ch97): index in data[7:0]. With clut_enable=1 (Ch97), pcrtc presents clut_read_idx = idx + (CSA << 4) to the external clut_stub and decodes the returned PSMCT32 entry as r = clut_data[7:0], g = clut_data[15:8], b = clut_data[23:16]. With clut_enable=0 (Ch96 fallback), pcrtc surfaces the index as grayscale so the 8-bit storage lane is visually verifiable without programming a CLUT.
  • PSMT4 (Ch103): 2 pixels per byte. byte_offset = pixel_index >> 1 (overrides the standard pixel_index << bpp_shift math). nibble = pixel_index[0] ? data[7:4] : data[3:0] picks the 4-bit pixel; the zero-extended 8-bit value {4'd0, nibble} plus (CSA << 4) is presented on clut_read_idx. With clut_enable=1, pcrtc decodes the returned PSMCT32 entry the same way as PSMT8. With clut_enable=0, the fallback replicates the nibble across the 8-bit DAC value (r = g = b = {nibble, nibble}) so 4'hF → 0xFF and 4'h5 → 0x55. CSA is the natural per-palette-window selector for PSMT4 — multiple 16-entry palettes can share the 256-entry staging area, indexed by CSA.
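
The per-PSM decode rules above are compact enough to restate as a Python sketch. All function names here are hypothetical; the 8-bit wrap of the CLUT index is taken from the CSA wrap case exercised by the PSMT8 CLUT TB (idx 0xF8 with CSA=1 lands on CLUT[0x08]).

```python
def expand5(c5: int) -> int:
    """5 -> 8 bit-replicate, i.e. {c5, c5[4:2]}."""
    return ((c5 << 3) | (c5 >> 2)) & 0xFF


def decode_psmct32(data: int) -> tuple:
    """Return (r, g, b); alpha at [31:24] is dropped."""
    return (data & 0xFF, (data >> 8) & 0xFF, (data >> 16) & 0xFF)


def decode_psmct16(data: int) -> tuple:
    """RGB5A1 in the low 16 bits: {A[15], B[14:10], G[9:5], R[4:0]}."""
    r5 = data & 0x1F
    g5 = (data >> 5) & 0x1F
    b5 = (data >> 10) & 0x1F
    return (expand5(r5), expand5(g5), expand5(b5))


def clut_read_idx(index: int, csa: int) -> int:
    """Index presented to the CLUT: idx + (CSA << 4), wrapped to 8 bits."""
    return (index + (csa << 4)) & 0xFF


def psmt4_nibble(data: int, pixel_index: int) -> int:
    """Pick the high nibble for odd pixel indices, the low nibble for even ones."""
    byte = data & 0xFF
    return (byte >> 4) if (pixel_index & 1) else (byte & 0x0F)


def psmt4_gray(nibble: int) -> tuple:
    """clut_enable=0 fallback: replicate the nibble, r = g = b = {n, n}."""
    v = (nibble << 4) | nibble
    return (v, v, v)
```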

Ch95 — gs_stub raster channel emits PSMCT16. The S2 stage of the pipeline now packs ABGR → RGB5A1 (r5=R[7:3], g5=G[7:3], b5=B[7:3], a1=A[7]) when ras_bpp_shift==1 (PSMCT16 / PSMCT16S / PSMZ16 / PSMZ16S — any 16-bit PSM). The packed 16-bit pixel goes in the LOW halfword of raster_pixel_color_q[31:0], and a new raster_pixel_be_q[3:0] selects which bytes vram_stub commits: 4'b0011 for PSMCT16, 4'b1111 for PSMCT32. vram_stub gates each byte write on write_be[i], so back-to-back PSMCT16 emits write 2 bytes each without halfword stomping. New raster_pixel_psm_q[5:0] exposes the current PSM for trace.
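
A minimal Python sketch of the Ch95 packing, assuming the ABGR word layout used elsewhere in this contract (R in bits [7:0], A in bits [31:24]). The names `pack_rgb5a1` and `emit_byte_enable` are hypothetical and only illustrate the described behavior.

```python
def pack_rgb5a1(abgr: int) -> int:
    """Pack a 32-bit ABGR color into RGB5A1: {a1, b5, g5, r5}."""
    r5 = (abgr & 0xFF) >> 3          # R[7:3]
    g5 = ((abgr >> 8) & 0xFF) >> 3   # G[7:3]
    b5 = ((abgr >> 16) & 0xFF) >> 3  # B[7:3]
    a1 = ((abgr >> 24) & 0xFF) >> 7  # A[7]
    return (a1 << 15) | (b5 << 10) | (g5 << 5) | r5


def emit_byte_enable(ras_bpp_shift: int) -> int:
    """Byte-enable pattern for the emit lane: 4'b0011 at 16 bpp, 4'b1111 at 32 bpp."""
    return {1: 0b0011, 2: 0b1111}[ras_bpp_shift]
```

The packed value occupies the low halfword of the emit lane, so with byte enables 4'b0011 only the two bytes at fb_addr are committed.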

The Ch95 TB tb_gs_raster_psmct16.sv exercises the round trip: gs_stub renders a 4×4 SPRITE with FRAME_1.PSM=PSMCT16, then VRAM read-back verifies each pixel landed at the right halfword AND that the halfword right after the sprite stays zero (no leak).

Ch105 extends the raster channel to PSMT8 (FRAME_1.PSM=0x13). When ras_bpp_shift==0, S2 takes the natural ABGR's R channel (low 8 bits) as the PSMT8 index — the same lane real PS2 hardware writes when the destination FB is PSMT8 — places it in the LOW byte of the emit lane, and sets raster_pixel_be_q = 4'b0001 so vram_stub commits exactly the 1 byte at fb_addr. The 1-byte commit works at any byte alignment because vram_stub gates each byte lane independently. The Ch105 TB tb_gs_raster_psmt8.sv renders a 5×3 SPRITE (chosen so the row spans byte lanes 1, 2, 3, 0, 1 — exercising every lane alignment) at FRAME_1.PSM=PSMT8 with RGBAQ R=0x55, G=0xAA, B=0xBB, A=0xCC; asserts each sprite byte reads back as 0x55, the bytes immediately left and right of the sprite stay 0x00 (so 32-bit-aligned overwrite would be visible), and a full-VRAM sweep finds NO byte equal to 0xAA / 0xBB / 0xCC (channel-isolation: only R reaches VRAM at PSMT8).

Ch106 closes the indexed-write gap with PSMT4 (FRAME_1.PSM=0x14) as a per-bit RMW into vram_stub. Three changes form the mechanism:

  1. vram_stub gains a new write_mask[31:0] input (Ch106). The commit is now mem[i] <= (mem[i] & ~mask_i) | (data_i & mask_i) for each enabled byte. PSMCT32/16/PSMT8 tie mask=0xFFFF_FFFF (no behavior change — full byte writes).
  2. gs_stub's S2 PSM-aware emit packing gets a PSMT4 branch: the byte address is pixel_index >> 1 (overrides the pixel_index << ras_bpp_shift form), the index is the low 4 bits of the natural ABGR's R channel, and the emit places that 4-bit value in either the low ({4'd0, idx}) or high ({idx, 4'd0}) nibble of write_data[7:0] based on pixel_index[0]. s2_emit_be = 4'b0001, s2_emit_mask = pixel_index[0] ? 0x0000_00F0 : 0x0000_000F.
  3. New raster_pixel_mask_q[31:0] output on gs_stub carries the mask through to vram_stub.write_mask.
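
The three changes above can be modelled in a few lines of Python to sanity-check expected VRAM contents. This is a sketch under stated assumptions: `masked_write` and `psmt4_emit` are hypothetical names, byte lane i is assumed to map to address fb_addr + i, and the FBP base is assumed to be 0 unless passed in.

```python
def masked_write(mem: bytearray, addr: int, data: int, be: int, mask: int) -> None:
    """Per-bit RMW commit: mem[i] <= (mem[i] & ~mask_i) | (data_i & mask_i)
    for every byte lane enabled in be."""
    for lane in range(4):
        if (be >> lane) & 1:
            m = (mask >> (8 * lane)) & 0xFF
            d = (data >> (8 * lane)) & 0xFF
            mem[addr + lane] = (mem[addr + lane] & ~m & 0xFF) | (d & m)


def psmt4_emit(pixel_index: int, abgr: int, fbp_bytes: int = 0) -> tuple:
    """Return (byte_addr, write_data, be, mask) for one PSMT4 pixel."""
    idx = abgr & 0xF                 # low 4 bits of the R channel
    high = pixel_index & 1           # odd pixel -> high nibble
    addr = fbp_bytes + (pixel_index >> 1)
    data = (idx << 4) if high else idx
    mask = 0x000000F0 if high else 0x0000000F
    return addr, data, 0b0001, mask


# Example: one odd-x pixel written over a preloaded byte.
vram = bytearray([0xA5] * 128)
addr, data, be, mask = psmt4_emit(2 * 64 + 5, 0x07)   # pixel (5, 2) at 64 pixels per row
masked_write(vram, addr, data, be, mask)
```

In this model the odd pixel only touches the high nibble of its byte, so the preloaded low nibble survives the write.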

The Ch106 TB tb_gs_raster_psmt4.sv is intentionally adversarial about preservation. VRAM is preloaded with 0xA5 (high=A, low=5) at every byte the sprites will touch. Three phases:

  • Phase A: 4×2 SPRITE at (0,0)..(3,1), R=0x05 → idx=5. Both nibbles of each enclosing byte are written (8 emits across 4 bytes); each byte ends at 0x55 and the four neighbouring preloaded bytes (2..3, 34..35) remain 0xA5. This proves the back-to-back same-byte case (NBA chaining) and the neighbour-byte preservation in one go.
  • Phase B: single-pixel SPRITE at (5, 2). x=5 odd → high nibble; pixel_index = 133, byte_addr = 66; idx=7. Preload mem[66] = 0xA5. Expected after raster: mem[66] = 0x75 — high nibble updated from A to 7, low nibble stays 5. Proves isolated high-nibble RMW preserves the low nibble.
  • Phase C: single-pixel SPRITE at (4, 3). x=4 even → low nibble; pixel_index = 196, byte_addr = 98; idx=9. Preload mem[98] = 0xA5. Expected after: mem[98] = 0xA9 — low nibble updated from 5 to 9, high nibble stays A. Proves isolated low-nibble RMW preserves the high nibble.

Continuous observer asserts psm_q == 6'h14, be_q == 4'b0001, and mask_q ∈ {0x0F, 0xF0} on every emit. Final aggregate checks: 10 emits total, full-VRAM sweep finds NO byte equal to 0xAA / 0xBB / 0xCC (only R reaches the framebuffer at PSMT4).

DBX / DBY shift the VRAM origin: the pixel that appears at displayed (DX, DY) corresponds to VRAM (DBX, DBY). Real PS2 drivers use this for double-buffered framebuffers (alternate frames at different DBX/DBY) and offset display windows.

The following TBs lock these contracts:

  • tb_gs_scanout_basic.sv — DBX=DBY=0, DISPLAY1 covers full active area, MAGH=MAGV=0 (1×): classic sprite-at-origin scanout.
  • tb_gs_scanout_dbx_dby.sv — sprite at VRAM (4,2)..(7,5), DISPFB1.DBX=4/DBY=2, DISPLAY1 full active area, MAGH=MAGV=0: sprite shows at displayed (0..3, 0..3).
  • tb_gs_scanout_display_window.sv — sprite at VRAM (0..3, 0..3), DBX=DBY=0, DISPLAY1 with DX=2/DY=1/DW=3/DH=3, MAGH=MAGV=0: sprite shows at displayed (2..5, 1..4); pixels outside the window are black even though pcrtc's raster passes through them.
  • tb_gs_scanout_magh_magv.sv (Ch93) — sprite at VRAM (0..3, 0..3), DBX=DBY=0, DISPLAY1 with DX=4/DY=2/DW=7/DH=7, MAGH=1/MAGV=1 (2×/2×): 4×4 VRAM sprite stretches to fill the 8×8 displayed window pixel-perfect; pixels outside the window are black.
  • tb_gs_scanout_psm16.sv (Ch94) — 4×4 RGB5A1 sprite written directly to vram_stub at PSMCT16 byte stride, DISPFB1.PSM=0x02: 5→8 bit-replicate decode produces the right (R8, G8, B8) at scanout. (No gs_stub instantiated; this TB exercises the PSM decode path in isolation.)
  • tb_gs_scanout_psmt8.sv (Ch96) — 4×4 PSMT8 sprite of indices 0x10..0x1F written directly to vram_stub at 1 byte/pixel stride. DISPFB1.PSM=0x13, DISPLAY1 with DX=4/DY=2/DW=7/DH=3 AND MAGH=1 (2× horizontal). Asserts each scan-out displayed pixel reads back as grayscale R=G=B=expected index, proving byte stride + display window + horizontal magnification all work at 1 byte/pixel.
  • tb_gs_scanout_psmt8_clut.sv (Ch97) — same 4×4 PSMT8 sprite, plus a programmed CLUT where CLUT[i] = ABGR(0xFF, i+0x80, i+0x40, i). With clut_enable=1 and clut_csa=0, asserts each scan-out pixel reads back as the CLUT entry for its index — PSMT8 storage + CLUT lookup compose correctly into real RGB. Three phases: full-frame CSA=0, single-pixel CSA=1 (idx 0x00 → CLUT[0x10]), and CSA=1 wrap (idx 0xF8 → CLUT[0x08]).
  • tb_gs_tex0_clut.sv (Ch98) — drives gs_stub's GIF reg# 0x06 (TEX0_1) and asserts the latch + sub-field decoders match the encoded payload (CBP/CPSM/CSM/CSA/CLD bit ranges). Phase 2 wires pcrtc.clut_csa from gs_stub.tex0_1_csa_q (instead of TB-side sideband) and verifies the CSA value flows from a GIF register write into the CLUT lookup math at scan-out.
  • tb_gs_clut_load.sv (Ch99) — full TEX0.CLD-driven VRAM→CLUT load round trip. Stages 256 PSMCT32 entries in VRAM at CBP*256 (using the new vram_stub second read port), drives TEX0_1 with CBP=4, CPSM=PSMCT32, CSM=CSM2, CLD=1, waits for clut_loader_stub.load_busy to fall, then runs PSMT8 scanout and asserts each in-sprite pixel reads back as the CLUT entry the loader copied — no TB-direct CLUT writes needed. Also carries a Ch99-audit negative phase: a TEX0 write with CSM=0 (CSM1 swizzle, deferred) silently no-ops instead of laying down wrong linear bytes.
  • tb_gs_clut_load_ct16.sv (Ch100) — CPSM=PSMCT16 variant of the Ch99 load TB. Stages 256 RGB5A1 entries (2 bytes each) in VRAM at CBP*256, drives TEX0_1 with CPSM=2. The loader now walks at 2-byte stride, unpacks RGB5A1 → PSMCT32 ABGR via 5→8 bit-replicate, and writes to clut_stub. PSMT8 scanout produces the expanded RGB. Ch100-audit alpha coverage: per-entry a1 = idx[0] varies the alpha bit so both {8{0}} = 0x00 and {8{1}} = 0xFF are exercised; a TB-side clut_we snoop captures every loader write so alpha can be asserted directly without going through the RGB-only scanout path.
  • tb_gs_clut_load_cld_modes.sv (Ch101 + Ch102) — conditional CLD-mode policy. Phases walk through CLD ∈ {0, 1, 2, 3, 4, 5, 6, 7} with varying CBP/CPSM/CSA, counting loader_busy rising edges to prove: CLD=0 never loads; CLD=1 always (full); CLD=2 loads only when CBP changed; CLD=3 loads when CBP/CPSM/CSA any-changed (CBP, CSA, and CPSM arms each isolated); CLD=4 always loads but only the 16-entry CSA window (Ch102 — write range correctness is locked by tb_gs_clut_load_csa_window); CLD ∈ {5, 6, 7} reserved no-ops.
  • tb_gs_clut_load_csa_window.sv (Ch102) — CLD=4 write-range correctness. Phase 1 stages 256 distinct PSMCT32 entries in VRAM and runs CLD=1 to fill all 256 CLUT slots with pattern_a. Phase 2 stages 16 different entries at a new CBP, drives CLD=4 with CSA=2 (window = idx 32..47), and asserts via a clut_we snoop that exactly 16 writes occurred AND the captured array contains: pattern_a(i) at i ∈ [0, 32) ∪ [48, 256), pattern_b(i-32) at i ∈ [32, 48). Proves 240 entries are preserved across the partial load. Audit-low extensions: Phase 3 covers the high-CSA wrap (CSA=16 → window-base wraps mod-256 to 0); Phase 4 covers CT16 partial (CPSM=PSMCT16, 2-byte stride, RGB5A1 unpack at the loader, window at idx 160..175).
  • tb_gs_scanout_psmt4_clut.sv (Ch103) — PSMT4 scanout. Stages a 4×4 PSMT4 sprite (2 pixels/byte) and 16 CLUT entries. Phase 1 (clut_enable=1): asserts each pixel reads CLUT[zero-ext(nibble) + CSA*16]. Phase 2 (clut_enable=0): asserts the grayscale fallback replicates the 4-bit nibble across the 8-bit DAC value. Both phases verify byte-stride half-extraction (low/high nibble pick) at every active pixel. Audit-low Phase 3 locks PSMT4 + nonzero CSA (CSA=1, window 16..31) end-to-end: TB-direct CLUT writes plant a 0xDEAD_BEEF sentinel at entries 0..15 and a per-index formula at 16..31, scanout asserts each pixel reads the formula and never the sentinel.
  • tb_gs_demo_psmt4_e2e.sv (Ch107) — first end-to-end demo for the GS/PCRTC stack. Scope is GS-side only: the post-GIF register stream (per-reg A+D writes via gs_stub.gif_reg_*) plus privileged-block MMIO drive the pipeline; gif_packed_stub / GIFtag-PACKED is BYPASSED — feeding the same demo through the GIF front-end is a future chapter. Step 1 stages 16 PSMCT32 palette entries in VRAM at CBP*256 (modelled as a TB-direct write — DMA→GS image transfer is a future chapter, but the framebuffer itself is NOT TB-direct). Step 2 drives per-reg writes (PRIM/FRAME_1/RGBAQ/XYZ2) for four SPRITEs laying out a 4-quadrant 8×4 image (TL idx 0x5, TR idx 0x7, BL idx 0xA, BR idx 0xC) at FRAME_1.PSM=PSMT4 — all 32 framebuffer pixels arrive via the Ch106 raster channel. Step 3 drives TEX0_1 with CBP=palette, CPSM=PSMCT32, CSM=CSM2, CSA=0, CLD=4; loader writes clut_stub[0..15]. Step 4 brings up scanout via privileged-block writes to DISPFB1 (PSM=PSMT4) + DISPLAY1 + PMODE.EN1. Step 5 captures one full frame and asserts each pixel reads back as CLUT[quadrant_idx] (or CLUT[0] outside the 8×4 image since vram_stub zero-init means nibble=0). Aggregate asserts: 32 PSMT4 emits, mask ∈ {0x0F, 0xF0} on every emit (channel-isolation locked architecturally — only R[3:0] ever reaches VRAM at PSMT4), loader fires exactly once, no raster_overflow / raster_degenerate. This TB is the first stack-wide proof that the GS-side post-GIF sequence — per-reg writes → indexed framebuffer → TEX0+CLD palette upload → PMODE/DISPFB/DISPLAY scanout — produces a coherent RGB frame end to end without TB sideband for the framebuffer pixels. Routing the same primitives through GIFtag/PACKED A+D via gif_packed_stub closes the last sideband and is the natural Ch108 anchor.
  • tb_gs_demo_psmt4_e2e_ee_full_bootlet.sv (Ch114) — extends Ch113's EE-driven control plane to ALSO drive the DMAC channel-2 setup from the same MIPS instruction stream. The EE program now writes the 4 GS-priv registers + the 3 DMAC ch2 registers (MADR / QWC / CHCR.start) via real sw instructions, then SYSCALLs to halt. Total: 7 EE-CPU MMIO writes (4 GS-priv + 3 DMAC) producing the same 16×8 captured frame. Architectural note: the program lives in bios_rom_stub at 0xBFC0_0000 / phys 0x1FC0_0000, NOT in RAM. A RAM-resident program would have its instruction fetches contend with the DMAC's RAM reads through ee_ram_stub's single read port (the map's CPU>DMAC arbitration silently corrupts DMAC data). Putting the program in BIOS decouples the two paths so EE and DMAC run truly in parallel. This also matches real PS2: the EE boots out of BIOS ROM. PASS criteria add to Ch113's: 3 EE-driven DMAC writes seen at the map's DMAC-ch2 decode; the existing dma=(1,36,1) event taxonomy still holds (those events are triggered by the EE's CHCR write, not a TB-direct write). The remaining TB-direct surfaces in the demo are now narrowly the GIF payload pre-stage in RAM (a real EE driver would itself stage this) and bios_rom_stub's program preload (which is the EE bootlet itself — not a runtime TB sideband).
  • tb_gs_demo_psmt4_e2e_ee_program.sv (Ch113) — same demo as Ch112 but the 4 control-plane MMIO writes (PMODE / DISPFB1 / DISPLAY1 lo / DISPLAY1 hi) are no longer issued by the TB directly. Instead a 10-instruction MIPS program preloaded into ee_ram_stub at phys 0x800 (kseg0 0x80000800) is fetched and executed by ee_core_stub (parameterized with PC_RESET=0x80000800). The program is LUI/ORI/SW × 4 plus a SYSCALL terminator; the SW instructions target 0x12000000+ and flow through ee_memory_map_stub's GS-priv decode → ee_gs_priv_bridge_stub → gs_stub.reg_wr_*. Closes the very last TB-direct surface in the demo flow: every byte AND every register bit AND every control-plane decision now arrives from a real-shape source. PASS criteria add to Ch112's: core_halt_o == 1 (asserts exactly once on the SYSCALL halt), core_trap == 0, EE program halts at EE_PROG_VA + 36 = 0x80000824 (the SYSCALL slot). The TB still pre-stages the GIF payload and triggers the DMAC channel-2 transfer via TB-direct CHCR/MADR/QWC writes — a wider EE program that also drives DMAC bring-up is a separate future chapter.
  • tb_gs_demo_psmt4_e2e_eemap.sv (Ch112) — same demo as Ch111 but the bridge is no longer driven by the TB directly. Instead the TB drives ee_memory_map_stub.ee_wr_* with full 32-bit physical addresses targeting the new GS-privileged-MMIO window at 0x1200_0000-0x1200_FFFF (64 KiB; phys[28:16] == 13'h1200). The map decodes the window, peels the 16-bit offset, and hands the 32-bit half-write to ee_gs_priv_bridge_stub, which then fires gs_stub.reg_wr_* with the running 64-bit shadow value. Closes the last control-plane routing gap before a real EE instruction stream can drive the demo's bring-up: PMODE / DISPFB1 / DISPLAY1 are now reachable from sw 0x1200_0080(...)-shaped writes rather than from a TB-shaped EE-MMIO port. PASS criteria identical to Ch111: 4 EE-MMIO writes / 4 bridge fires, same 16×8 captured frame. Architectural note: this chapter ALSO adds 4 new output ports to ee_memory_map_stub (ee_gs_priv_wr_en/addr/data/be). Existing 56 ee_memory_map_stub-using TBs leave those outputs unconnected (named-port instantiation tolerates omitted outputs); only the new Ch112 TB wires them through to the bridge.
  • tb_gs_demo_psmt4_e2e_eemmio.sv (Ch111) — same demo as Ch110 but the privileged-block control writes (PMODE / DISPFB1 / DISPLAY1) now arrive through ee_gs_priv_bridge_stub (a new RTL module) driven by EE-shaped 32-bit MMIO writes from the TB, instead of TB-direct gs_stub.reg_wr_* pulses. The bridge accumulates 32-bit half-writes per 8-byte slot and fires a 64-bit gs_stub.reg_wr_* pulse on each EE half-write — single-half writes work for PMODE.EN1 and DISPFB1 (interesting bits in the low 32), and a pair of writes (lo+hi to consecutive 4-byte addresses) handles DISPLAY1 whose DW/DH live in the high 32. Bridge contract: full-word writes only — ee_wr_be must be 4'b1111; sub-word (per-byte) merging into the 64-bit shadow is intentionally out of scope and a $error fires on any narrower be (control-plane GS registers are always written as full 32-bit sw halves of an sd). Scope precision: this chapter closes the TB-direct gs_stub.reg_wr_* surface — i.e., the privileged-MMIO sink at the GS itself. The bridge is instantiated by the TB directly; it is NOT yet wired into ee_memory_map_stub, so the full EE-CPU / memory-map MMIO path (a real EE instruction stream reaching 0x12000000+ via sw) is a separate future chapter. PASS criteria add to Ch110's: 4 EE-MMIO writes (1 PMODE + 1 DISPFB1 + 2 DISPLAY1) and 4 bridge fires producing the same 16×8 captured frame as Ch110.
  • tb_gs_demo_psmct32_swizzle_trxdir_e2e.sv (Ch124) — companion to Ch123: same EE-bootlet → DMAC → GIF data plane and same all-three-gates-on instantiation, but the framebuffer is filled by a TRXDIR/IMAGE upload through gif_image_xfer_stub instead of by raster. The Ch121 image-xfer write-side swizzle gate becomes LOAD-BEARING inside the demo flow — every byte the GS produces comes out of the image-xfer engine at canonical PSMCT32 swizzled addresses, and the raster path is dormant. Payload: U1 (PACKED, NREG=4: BITBLTBUF{DBP=0, DBW=1, DPSM=PSMCT32} / TRXPOS{DSAX=DSAY=0} / TRXREG{RRW=16, RRH=8} / TRXDIR{XDIR=0}) + U2 (IMAGE, NLOOP=32: 32 IMAGE qwords carrying the 128 PSMCT32 pixels of the same four-quadrant pattern Ch123 used). DMAC QWC = 38. Verification mirrors Ch123: (1) full 16×8 scanout frame capture; (2) per-pixel byte readback at the canonical swizzled address via vram_stub's 2nd read port; (3) strict linear-vs-swizzled separator at byte 1024 stays 0. Aggregate counts: dma=(1,38,1) ee_dmac_wr=3 giftags=2 ad_writes=4 xfer_writes=128 ee_priv_wr=4 bridge_fires=4 core_halt=1 emits=0 frame=16x8. Ch123 + Ch124 together exercise BOTH PSMCT32 write-side paths (raster Ch122 + image-xfer Ch121) end-to-end through the same driver-shaped flow with the same swizzled-scanout (Ch120) read side.
  • tb_gs_demo_psmct32_swizzle_e2e.sv (Ch123) — full driver-shaped end-to-end demo with ALL THREE PSMCT32 swizzle gates flipped on simultaneously: gs_stub#(PSMCT32_SWIZZLE=1) (Ch122 raster), gif_image_xfer_stub#(PSMCT32_SWIZZLE=1) (Ch121 — instantiated but unused in this demo), gs_pcrtc_stub#(PSMCT32_SWIZZLE=1) (Ch120 read). The data plane is the same DMAC + GIF + EE-bootlet shape Ch107..Ch114 demos use: a BIOS-resident EE program (PC_RESET=0xBFC0_0000) configures GS-priv (DISPFB1, DISPLAY1 lo/hi, PMODE.EN1) via sw instructions through ee_memory_map_stub → ee_gs_priv_bridge_stub → gs_stub.reg_wr_*, then kicks DMAC ch2 (MADR / QWC / CHCR) via sw to the DMAC reg window, then SYSCALL halts. DMAC delivers a 24-qword payload from ee_ram_stub to gif_packed_stub, which dispatches 4 SPRITE PACKED packets (1 GIFtag + 5 A+D each — PRIM, FRAME_1=PSMCT32, RGBAQ, XYZ2, XYZ2). The 4 sprites tile the 16×8 active area into 4 quadrants with unique RGB triples. With the raster gate on, all 128 per-pixel store addresses go through gs_swizzle_psmct32_stub; with the pcrtc gate on, scanout reads from those same swizzled addresses. Two-phase verification: (1) scanout — every (x, y) in 16×8 captures its sprite's RGB; (2) byte readback via vram_stub's 2nd read port — for every (x, y), the 32-bit word at ref_addr_psmct32(0, 1, x, y) equals the sprite's {A=0xFF, B, G, R} PSMCT32 word. Strict linear-vs-swizzled separator at byte 1024 (where the linear formula's y=4 row would land at stride=256) stays 0 — the swizzled write set for the 16×8 image stays in blocks (0,0) and (1,0) of page 0 (bytes 0..511), so a fall-through to linear would have placed sprite-2's color at byte 1024. Aggregate counts: dma=(1,24,1) ee_dmac_wr=3 giftags=4 ad_writes=20 xfer_writes=0 ee_priv_wr=4 bridge_fires=4 core_halt=1 emits=128 frame=16x8.
This is the FIRST end-to-end demo where every PSMCT32 byte the GS produces lives at the canonical PCSX2 swizzled address AND the scanout reads from it — byte-accurate to real PS2 VRAM layout, end-to-end through the driver-shaped flow.
  • tb_gs_raster_swizzle_psmct32.sv (Ch122) — focused contract for the new PSMCT32_SWIZZLE parameter on gs_stub. When the parameter is set to 1 AND the active raster PSM is PSMCT32 (ras_psm == 6'h00), the per-pixel raster emit address is routed through the Ch119 gs_swizzle_psmct32_stub (FBP=ras_fbp, FBW=ras_fbw, x=s2_x_q[11:0], y=s2_y_q[11:0]) and its output is the absolute byte address (FBP*2048 already baked in). At Ch122 only, PSMCT16/PSMT8/PSMT4 raster emits always took the linear path. Ch128 later closed the PSMCT16 raster gate, Ch134 closed the PSMT8 raster gate, and Ch140 closed the PSMT4 raster gate (each with its own per-PSM parameter on this same gs_stub). Default 0 keeps every existing PSMCT32 raster TB unchanged. Three-phase verification: (1) origin SPRITE — drive a single 16×4 SPRITE at FRAME_1{FBP=0, FBW=1, PSMCT32} with RGBAQ R=0x55/G=0xAA/B=0xCC/A=0x77, expect 64 emits, per-pixel byte readback via vram_stub's 2nd read port at swizzled addresses confirms each pixel lands where the swizzle says. Strict linear-vs-swizzled separators at bytes 512 and 768 (the linear formula's y=2 / y=3 row starts) stay 0 — proves the gate is live. (2) scanout agreement — enable the Ch120 swizzled-pcrtc path on the same VRAM contents, capture the full 16×4 frame, assert each visible pixel reads back the SPRITE's RGB. Both gs_stub (Ch122 raster) and gs_pcrtc_stub (Ch120 scanout) instantiate the same swizzle module; a successful capture proves the two integrations agree at byte level — what raster wrote at swizzled addresses comes out on r/g/b at the same (x, y). (3) non-origin SPRITE — re-arm the raster with FRAME_1{FBP=4, FBW=2, PSMCT32} and an 8×2 SPRITE at (60, 4)..(67, 5) crossing the page-x boundary at x=64 (so page_index actually changes mid-row).
Pins three contracts the origin transfer can't distinguish from a buggy implementation: (a) ras_fbp reaches the swizzle's fbp input (FBP=0 in Phase 1 would have masked a tied-zero regression), (b) ras_fbw reaches the swizzle's fbw input (FBW=1 would have masked a tied-one regression), (c) the swizzle gets the FULL absolute pixel coords (s2_x_q, s2_y_q) rather than bbox-local coords (Phase 1's sprite started at (0,0) so absolute and local were equal there). Strict linear-vs-swizzled separator at byte 10480 (where the linear formula would land Phase-3's first pixel) stays 0. Total emit count after all phases: 64 + 16 = 80. With Ch120 (read), Ch121 (TRXDIR upload), and Ch122 (raster emit) all live, the three major PSMCT32 paths are byte-consistent end-to-end.
  • tb_gs_image_xfer_swizzle_psmct32.sv (Ch121) — focused contract for the new PSMCT32_SWIZZLE parameter on gif_image_xfer_stub. When the parameter is set to 1 AND the upload's PSM is PSMCT32, per-pixel VRAM byte addresses are routed through the Ch119 gs_swizzle_psmct32_stub (FBP=0, FBW=DBW, x=DSAX+cur_x, y=DSAY+cur_y) and dest_base_q (= DBP*256) is added back to anchor at the upload's DBP base. PSMCT16/PSMT8/PSMT4 always take the linear path. Default 0 keeps every existing image-xfer TB unchanged. Three-phase verification: (1) origin transfer — TRXDIR upload of a 16×4 PSMCT32 image at DBP=DSAX=DSAY=0, DBW=1, RRW=16, RRH=4 → 64 pixels, 16 IMAGE qwords. After the upload completes, the TB reads VRAM via vram_stub's 2nd read port at the SWIZZLED byte address (TB-side ref_addr() mirrors the swizzle module) and asserts each pixel landed where the swizzle says. Strict linear-vs-swizzled separator: bytes 512 and 768 (where linear y=2 and y=3 rows would land) stay 0 under swizzled, since the 16×4 image only fills blocks (0,0) and (1,0) which together cover bytes [0..127] ∪ [256..383]. (2) scanout agreement — enable the Ch120 swizzled-pcrtc path on the same VRAM contents, capture the full 16×4 frame, assert each scanned-out pixel matches its uploaded color. Both upload and scanout instantiate the same gs_swizzle_psmct32_stub, so a successful capture proves the two integrations agree at byte level — what was written by TRXDIR comes out on r/g/b at the same (x, y). (3) non-origin transfer — re-arm with NONZERO DBP, DSAX, and DSAY (DBP=8, DSAX=4, DSAY=2, RRW=8, RRH=4) and verify each uploaded pixel lands at DBP*256 + swizzle(0, DBW, DSAX+x_local, DSAY+y_local).
Phase 3 pins TWO contracts the origin transfer can't distinguish from a buggy implementation: (a) dest_base_q (= DBP*256) is correctly ADDED ON TOP of the swizzle output (with DBP=0 a missing-add regression would still pass), and (b) the swizzle is fed the FULL effective coordinates (with DSAX=DSAY=0 a "feeds only cur_x/cur_y" regression would still pass). Strict linear-vs-swizzled separator at byte 3088 (where the linear formula's y=2 row of the P3 image would land) stays 0 under swizzled. NOTE: at Ch121 ONLY, gs_stub raster writes still used linear addressing; Ch122 closed that gate.
  • tb_gs_scanout_swizzle_psmct32.sv (Ch120) — focused contract for the new PSMCT32_SWIZZLE parameter on gs_pcrtc_stub. When the parameter is set to 1 AND the active PSM is PSMCT32, PCRTC reads VRAM at swizzled addresses (via the Ch119 swizzle module instantiated inside pcrtc) instead of the legacy linear formula. Other PSMs (CT16/T8/T4) and PSMCT32_SWIZZLE=0 keep the original linear path unchanged. Topology: TB drives vram_stub.write_* directly with each pixel's color preloaded at the swizzled byte address (TB-side ref_addr() mirrors the DUT swizzle math), then pcrtc with PSMCT32_SWIZZLE=1 scans out the frame and the TB asserts each captured pixel matches the preloaded color. Image is 16×4 PSMCT32 (covers blocks (0,0) AND (1,0) horizontally) at FBP=0/FBW=1; pcrtc active area is 8×4 (block (0,0) entirely), but the swizzle vs. linear distinction shows up at any y>0 (linear y=1 → byte 64; swizzled byte 32) so even the in-window region is a strict separator. Per-pixel color is unique ({A=0xFF, B=y<<4, G=x<<4, R=0x10|(y*8+x)}) so any wrong-address commit surfaces immediately. NOTE: at Ch120 ONLY, gs_stub raster writes and gif_image_xfer_stub uploads still used linear addressing — Ch120 was read-side only. Ch121 (image-xfer) and Ch122 (raster) closed the write-side gates, and Ch123 demonstrates all three running together end-to-end.
  • tb_gs_demo_psmt8_swizzle_trxdir_e2e.sv (Ch136) — companion to Ch135: same EE-bootlet → DMAC → GIF data plane and same all-three-gates-on instantiation, but the framebuffer is filled by a TRXDIR/IMAGE upload through gif_image_xfer_stub instead of by raster. The Ch133 PSMT8 image-xfer write-side swizzle gate becomes LOAD-BEARING inside the demo flow — every byte the GS produces comes out of the image-xfer engine at canonical PSMT8 swizzled addresses, and the raster path is dormant. Mirrors Ch124 PSMCT32 + Ch130 PSMCT16 TRXDIR demos for the third PSM. Payload: U1 (PACKED, NREG=4: BITBLTBUF{DBP=0, DBW=2, DPSM=PSMT8} / TRXPOS{DSAX=DSAY=0} / TRXREG{RRW=16, RRH=8} / TRXDIR{XDIR=0}) + U2 (IMAGE, NLOOP=8: 8 IMAGE qwords each carrying 16 PSMT8 bytes for the 16×8 image, row-major). DBW=2 is the minimum even DBW for PSMT8. DMAC QWC=14. Per-quadrant byte indices Q0=0xA0/Q1=0x40/Q2=0xC0/Q3=0x60 reused from Ch135 so the verify side is unchanged. New trxdir_arms_seen counter asserts =1 (single TRX setup) + xfer-side per-emit observer asserts every xfer_we pulse fires with be=4'b0001, mask=0xFFFFFFFF (PSMT8 single-byte commit shape). Verification mirrors Ch135: (1) full 16×8 scanout frame capture; (2) per-pixel BYTE readback at the canonical swizzled byte address (with addr[1:0] selecting the right byte from the 32-bit word) via vram_stub's 2nd port; (3) strict separators at bytes 128 and 256 stay 0. Aggregate counts: dma=(1,14,1) ee_dmac_wr=3 giftags=2 ad_writes=4 trxdir_arms=1 xfer_writes=128 ee_priv_wr=4 bridge_fires=4 core_halt=1 emits=0 frame=16x8. First-attempt PASS errors=0. Ch135 + Ch136 together close the PSMT8 byte-accuracy milestone end-to-end through the full driver-shaped flow — same Ch123+Ch124 (PSMCT32) and Ch129+Ch130 (PSMCT16) shape.
  • tb_gs_demo_psmt4_swizzle_trxdir_e2e.sv (Ch142) — companion to Ch141 (raster-driven PSMT4 e2e): same EE-bootlet → DMAC → GIF data plane and same all-three-gates-on instantiation, but the framebuffer is filled by a TRXDIR/IMAGE upload through gif_image_xfer_stub instead of by raster. The Ch139 PSMT4 image-xfer write-side swizzle gate becomes LOAD-BEARING inside the demo flow — every nibble the GS produces comes out of the image-xfer engine at canonical PSMT4 swizzled (addr, nibble_hi) slots, and the raster path is dormant. Mirrors Ch124's PSMCT32 TRXDIR demo, Ch130's PSMCT16 TRXDIR demo, and Ch136's PSMT8 TRXDIR demo for the fourth (and last) common GS PSM. Cloned from Ch136 and surgically retargeted to PSMT4. Payload: U1 (PACKED, NREG=4: BITBLTBUF{DBP=0, DBW=2, DPSM=PSMT4} / TRXPOS{DSAX=DSAY=0} / TRXREG{RRW=16, RRH=8} / TRXDIR{XDIR=0}) + U2 (IMAGE, NLOOP=4 EOP=1: 4 IMAGE qwords carrying 32 PSMT4 nibbles each — at RRW=16 each qword holds 2 rows: lanes 0..15 = row 2qi, lanes 16..31 = row 2qi+1, matching Ch139's focused-TB packing). Total QWC = 10 (5+5). EE-bootlet DISPFB1 immediate identical to Ch141 (LUI 0x000A; ORI 0x0400 → PSM=PSMT4). Per-quadrant nibbles match Ch141 verbatim (Q0=0xA → 0xAA, Q1=0x4 → 0x44, Q2=0xC → 0xCC, Q3=0x6 → 0x66) so the verify side reuses Ch141's pattern unchanged. Verification mirrors Ch141: (1) full 16×8 scanout frame capture via Ch138 swizzled-pcrtc; (2) per-pixel NIBBLE readback at the canonical swizzled (addr, nibble_hi) slot via vram_stub's 2nd port (addr[1:0]-keyed byte selection then nibble_hi-keyed nibble selection); (3) strict linear-vs-swizzled separator at byte 128 stays 0 (per-byte check, not full word: a neighbor byte may legitimately be touched); (4) per-emit observer asserts every image-xfer write is be=4'b0001 / mask ∈ {0x0F, 0xF0} (PSMT4 nibble RMW shape) and the trxdir_wr_q arming pulse fires exactly once.
Aggregate counts: dma=(1,10,1) ee_dmac_wr=3 giftags=2 ad_writes=4 trxdir_arms=1 xfer_writes=128 ee_priv_wr=4 bridge_fires=4 core_halt=1 emits=0 frame=16x8. Ch141 + Ch142 together exercise BOTH PSMT4 write-side paths (raster Ch140 + image-xfer Ch139) end-to-end through the same driver-shaped flow with the same swizzled-scanout (Ch138) read side — bringing PSMT4 to full parity with the PSMCT32, PSMCT16, and PSMT8 e2e coverage from Ch123+Ch124, Ch129+Ch130, and Ch135+Ch136. Architectural milestone: this is the first state of the project where ALL FOUR common GS PSMs (CT32 + CT16 + T8 + T4) have BOTH a raster-driven AND a TRXDIR-driven driver-shaped end-to-end byte-accuracy demo — closing the four-PSM × three-path × dual-driver-shape e2e foundation (8 demos total). The bug-fix iteration: TB-side ref_col_idx4 was first written with a 7-bit case key {yb[2:0], xb[3:0]} covering yb=0..7 in indices 0..127, but the values for yb=4..7 were miscopied from Ch139's yb=12..15 rows (Ch139 only exercises yb=0..3 and yb=12..15). Phase 2 readback failed for all 64 pixels in y=4..7 with got=0 expected=0xC/0x6 — the engine wrote the right nibbles to the right addresses (scanout passed), but the TB's ref looked at the wrong slot. Fix: switched to Ch141's 9-bit case key {yb[3:0], xb[4:0]} and used Ch141's verified yb=0..7 values verbatim. PASS on the first run after the table fix.
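The two case keys named in the bug-fix note can be modelled directly. This Python sketch (function names are hypothetical) packs the keys the way a SystemVerilog concatenation would, with the first operand in the upper bits, and shows that the 7-bit key cannot tell yb=4 from yb=12 while the 9-bit key can:

```python
def key7(yb, xb):
    """Model of {yb[2:0], xb[3:0]}: 8 rows x 16 columns, keys 0..127."""
    return ((yb & 0x7) << 4) | (xb & 0xF)


def key9(yb, xb):
    """Model of {yb[3:0], xb[4:0]}: 16 rows x 32 columns (full PSMT4 block)."""
    return ((yb & 0xF) << 5) | (xb & 0x1F)


# With only three yb bits, rows 4 and 12 alias onto the same key.
aliased = key7(4, 3) == key7(12, 3)
# With four yb bits, every (yb, xb) pair in the block has its own key.
distinct = key9(4, 3) != key9(12, 3)
```

This only illustrates the key widths; it does not reproduce the column-table values themselves.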
  • tb_gs_demo_psmt4_swizzle_e2e.sv (Ch141) — first driver-shaped end-to-end PSMT4 demo with all three PSMT4 swizzle gates (Ch138 read-side pcrtc, Ch139 image-xfer write-side, Ch140 raster write-side) parameter-set to 1 simultaneously, but with the demo flow exercising only the raster (Ch140) + scanout (Ch138) paths as load-bearing. The Ch139 image-xfer gate is smoke-only here (parameter is set but xfer_writes_seen == 0 is asserted, since no TRXDIR/IMAGE packet is delivered in the raster-driven payload); the Ch139 load-bearing variant is the Ch142 TRXDIR-driven PSMT4 e2e (mirrors Ch124/Ch130/Ch136). PSMT4 counterpart of Ch123's PSMCT32 / Ch129's PSMCT16 / Ch135's PSMT8 e2e demos. Same EE-bootlet → DMAC → GIF data plane: BIOS-resident EE program configures GS-priv (DISPFB1 PSMT4 with FBW=2, DISPLAY1, PMODE) via sw instructions → kicks DMAC ch2 → SYSCALL halts. DMAC delivers a 24-qword payload (4 SPRITE PACKED packets) through gif_packed_stub into gs_stub raster. The 4 sprites tile the 16×8 active area into 4 quadrants with per-quadrant unique RGBAQ.R[3:0] nibbles (Q0=0xA → 0xAA, Q1=0x4 → 0x44, Q2=0xC → 0xCC, Q3=0x6 → 0x66). PSMT4 raster (Ch106) takes RGBAQ.R[3:0] as the nibble that hits VRAM via the existing Ch106 nibble RMW machinery (write_be=4'b0001 + write_mask 0x0F or 0xF0); Ch140 keys the high/low nibble selector off the swizzle's nibble_hi output instead of s2_pixel_index[0]. PCRTC's Ch103 PSMT4 grayscale fallback (clut_enable=0) surfaces the nibble as r=g=b={n, n} at scanout, so each captured pixel IS the nibble we wrote (no CLUT setup needed for this demo; a CLUT-driven Ch141 variant is a future chapter). With the raster gate on, all 128 per-pixel nibble stores go through gs_swizzle_psmt4_stub; with the pcrtc gate on, scanout reads from those same swizzled (addr, nibble_hi) slots. 
Two-phase verification: (1) full-frame scanout asserts each (x, y) reads back its quadrant's nibble as PSMT4 grayscale r=g=b={n, n}; (2) per-pixel NIBBLE readback at the canonical swizzled address (with addr[1:0] selecting the right byte from the 32-bit word, then nibble_hi selecting which nibble of that byte) via vram_stub's 2nd port — the 16×8 PSMT4 image lives entirely in the upper-left of block (0,0) of page 0 (PSMT4 block = 32×16 px) and the within-block columnTable4 yb=0..7 / xb=0..15 exercises nibble_idx values [0..127]. Strict linear-vs-swizzled separator at byte 128 (linear y=2 row start at PSMT4 stride=64 with FBW=2) stays 0 — outside block (0,0)'s touched range. Per-emit observer locks PSM=0x14, be=4'b0001, mask ∈ {0x0F, 0xF0}. Aggregate counts: dma=(1,24,1) ee_dmac_wr=3 giftags=4 ad_writes=20 xfer_writes=0 ee_priv_wr=4 bridge_fires=4 core_halt=1 emits=128 frame=16x8. First-attempt PASS errors=0. Together with Ch123 (PSMCT32 e2e), Ch129 (PSMCT16 e2e), and Ch135 (PSMT8 e2e), this is the first state of the project where the full driver-shaped flow has end-to-end byte-accuracy demos for ALL FOUR common GS PSMs (CT32 + CT16 + T8 + T4) under software-shaped raster traffic. The TRXDIR-driven PSMT4 companion landed at Ch142 (mirror of Ch124/Ch130/Ch136 making Ch139 load-bearing), so Ch141 + Ch142 together close the PSMT4 byte-accuracy milestone end-to-end through both driver shapes — bringing PSMT4 to full parity with CT32/CT16/T8.
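The two-step nibble readback and the grayscale fallback described above can be sketched in Python as follows. The function names are hypothetical, and the byte-lane ordering (addr[1:0]=0 selects bits 7:0 of the 32-bit word) is an assumption, not something the text states:

```python
def read_nibble(word32, addr, nibble_hi):
    """Pick the byte with addr[1:0], then the nibble with nibble_hi."""
    byte = (word32 >> (8 * (addr & 0x3))) & 0xFF  # assumed little-endian lanes
    return (byte >> 4) & 0xF if nibble_hi else byte & 0xF


def psmt4_gray(nibble):
    """CLUT-disabled fallback: r = g = b = {n, n}."""
    g = ((nibble & 0xF) << 4) | (nibble & 0xF)
    return (g, g, g)
```

For example, a quadrant nibble of 0xA expands to the gray triple (0xAA, 0xAA, 0xAA), which is why the captured pixel equals the nibble that was written.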
  • tb_gs_demo_psmt8_swizzle_e2e.sv (Ch135) — first driver-shaped end-to-end PSMT8 demo with all three PSMT8 swizzle gates (Ch132 read-side pcrtc, Ch133 image-xfer write-side, Ch134 raster write-side) parameter-set to 1 simultaneously, but with the demo flow exercising only the raster (Ch134) + scanout (Ch132) paths as load-bearing. The Ch133 image-xfer gate is smoke-only here (parameter is set but xfer_writes_seen == 0 is asserted, since no TRXDIR/IMAGE packet is delivered in the raster-driven payload); the Ch133 load-bearing variant is the Ch136 TRXDIR-driven PSMT8 e2e (mirror of Ch124/Ch130). PSMT8 counterpart of Ch123's PSMCT32 / Ch129's PSMCT16 e2e demos. Same EE-bootlet → DMAC → GIF data plane: BIOS-resident EE program configures GS-priv (DISPFB1 PSMT8 with FBW=2, DISPLAY1, PMODE) via sw instructions → kicks DMAC ch2 → SYSCALL halts. DMAC delivers a 24-qword payload (4 SPRITE PACKED packets) through gif_packed_stub into gs_stub raster. The 4 sprites tile the 16×8 active area into 4 quadrants with per-quadrant unique RGBAQ.R values (Q0=0xA0, Q1=0x40, Q2=0xC0, Q3=0x60). PSMT8 raster (Ch105) takes the natural ABGR's R channel as the byte index that hits VRAM; PCRTC's Ch96 grayscale fallback (clut_enable=0) surfaces the byte as R=G=B at scanout, so each captured pixel IS the byte we wrote (no CLUT setup needed for this demo; a CLUT-driven Ch135 variant is a future chapter). With the raster gate on, all 128 per-pixel byte stores go through gs_swizzle_psmt8_stub; with the pcrtc gate on, scanout reads from those same swizzled addresses. 
Two-phase verification: (1) full-frame scanout asserts each (x, y) reads back its quadrant's byte as PSMT8 grayscale R=G=B; (2) per-pixel BYTE readback at the canonical swizzled address (with addr[1:0] selecting the right byte from the 32-bit word) via vram_stub's 2nd port — the 16×8 PSMT8 image lives entirely in the upper half of block (0,0) of page 0 (PSMT8 block = 16×16 px) and the within-block columnTable8 yb=0..7 exercises byte offsets [0..127]. Strict linear-vs-swizzled separators at bytes 128 (linear y=1 row start at PSMT8 stride=128 with FBW=2) and 256 (linear y=2) stay 0 — both outside block (0,0)'s touched range. Aggregate counts: dma=(1,24,1) ee_dmac_wr=3 giftags=4 ad_writes=20 xfer_writes=0 ee_priv_wr=4 bridge_fires=4 core_halt=1 emits=128 frame=16x8. Together with Ch123 (PSMCT32 e2e) and Ch129 (PSMCT16 e2e), this was the first state of the project where the full driver-shaped flow had end-to-end byte-accuracy demos for the CT32/CT16/T8 trio under software-shaped traffic. PSMT4 was the natural follow-on and landed at Ch141 (raster-driven, mirror of this demo) + Ch142 (TRXDIR-driven, mirror of Ch136), closing the four-PSM × dual-driver-shape e2e matrix.
  • tb_gs_demo_psmct16_swizzle_trxdir_e2e.sv (Ch130) — companion to Ch129: same EE-bootlet → DMAC → GIF data plane and same all-three-gates-on instantiation, but the framebuffer is filled by a TRXDIR/IMAGE upload through gif_image_xfer_stub instead of by raster. The Ch127 image-xfer write-side swizzle gate becomes LOAD-BEARING inside the demo flow — every byte the GS produces comes out of the image-xfer engine at canonical PSMCT16 swizzled addresses, and the raster path is dormant. Payload: U1 (PACKED, NREG=4: BITBLTBUF{DBP=0, DBW=1, DPSM=PSMCT16} / TRXPOS{DSAX=DSAY=0} / TRXREG{RRW=16, RRH=8} / TRXDIR{XDIR=0}) + U2 (IMAGE, NLOOP=16: 16 IMAGE qwords carrying the 128 PSMCT16 halfwords of the same four-quadrant pattern Ch129 used). DMAC QWC = 22. Verification mirrors Ch129: (1) full 16×8 scanout frame capture; (2) per-pixel halfword readback at the canonical swizzled byte address (with addr[1] selecting the right 16-bit slot) via vram_stub's 2nd read port; (3) strict linear-vs-swizzled separators at bytes 256 and 384 stay 0; (4) per-emit observer asserts every image-xfer write is be=4'b0011 / mask=0xFFFF_FFFF (low halfword) and the trxdir_wr_q arming pulse fires exactly once. Aggregate counts: dma=(1,22,1) ee_dmac_wr=3 giftags=2 ad_writes=4 trxdir_arms=1 xfer_writes=128 ee_priv_wr=4 bridge_fires=4 core_halt=1 emits=0 frame=16x8. Ch129 + Ch130 together exercise BOTH PSMCT16 write-side paths (raster Ch128 + image-xfer Ch127) end-to-end through the same driver-shaped flow with the same swizzled-scanout (Ch126) read side — bringing PSMCT16 to full parity with the PSMCT32 e2e coverage from Ch123 + Ch124.
  • tb_gs_demo_psmct16_swizzle_e2e.sv (Ch129) — full driver-shaped end-to-end demo with all three PSMCT16 swizzle gates (Ch126 read-side pcrtc, Ch127 image-xfer write-side, Ch128 raster write-side) parameter-set to 1 simultaneously, but with the demo flow exercising only the raster (Ch128) + scanout (Ch126) paths as load-bearing. The Ch127 image-xfer gate is smoke-only here (parameter is set but xfer_writes_seen == 0 is asserted, since no TRXDIR/IMAGE packet is delivered in the raster-driven payload); Ch130 (TRXDIR-driven PSMCT16 e2e) is the load-bearing image-xfer-side counterpart. PSMCT16 counterpart of Ch123's PSMCT32 e2e demo. Same EE-bootlet → DMAC → GIF data plane: BIOS-resident EE program configures GS-priv (DISPFB1 PSMCT16, DISPLAY1, PMODE) via sw instructions → kicks DMAC ch2 → SYSCALL halts. DMAC delivers a 24-qword payload (4 SPRITE PACKED packets) through gif_packed_stub into gs_stub raster. The 4 sprites tile the 16×8 active area into 4 quadrants with per-quadrant unique RGB5A1 colors picked so the 5→8 bit-replicate at PCRTC output produces unique 8-bit RGB triples. With the raster gate on, all 128 per-pixel halfword stores go through gs_swizzle_psmct16_stub; with the pcrtc gate on, scanout reads from those same swizzled addresses. Two-phase verification: (1) full-frame scanout asserts each (x, y) reads back its quadrant's 5→8-expanded RGB; (2) per-pixel halfword readback via vram_stub's 2nd port at swizzled addresses (with addr[1] selecting the right 16-bit slot) confirms each sprite halfword landed where the swizzle says — the 16×8 PSMCT16 image lives entirely in block (0,0) of page 0 (PSMCT16 block = 16×8 px), so the readback exercises ALL 16 xb × 8 yb entries of columnTable16. Strict linear-vs-swizzled separators at bytes 256 (linear y=2 row start at PSMCT16 stride=128) and 384 (linear y=3) stay 0 — both outside block (0,0)'s 256-byte range. 
Aggregate counts: dma=(1,24,1) ee_dmac_wr=3 giftags=4 ad_writes=20 xfer_writes=0 ee_priv_wr=4 bridge_fires=4 core_halt=1 emits=128 frame=16x8. Together with Ch123 (PSMCT32 e2e), this is the first state of the project where the full driver-shaped flow has end-to-end byte-accuracy demos for BOTH direct-color PS2 PSMs.
  • tb_gs_raster_swizzle_psmct16.sv (Ch128) — focused contract for the new PSMCT16_SWIZZLE parameter on gs_stub (the raster emit surface). Mirrors Ch122's wiring shape but for PSMCT16: when the parameter is 1 AND the active raster PSM is PSMCT16 (ras_psm == 6'h02), the per-pixel raster emit address is routed through the Ch125 gs_swizzle_psmct16_stub (FBP=ras_fbp, FBW=ras_fbw, x=s2_x_q[11:0], y=s2_y_q[11:0]) — its output is the absolute byte address. PSMCT32 is gated by its own PSMCT32_SWIZZLE parameter (Ch122). At Ch128 only, PSMT8/PSMT4 raster emits stayed linear; Ch134 later closed the PSMT8 raster gate via PSMT8_SWIZZLE on this same gs_stub, and Ch140 did the same for PSMT4 via PSMT4_SWIZZLE. Default 0 keeps every existing PSMCT16 raster TB (Ch95 etc.) unchanged. Three-phase verification: (1) origin SPRITE — drive a 16×4 PSMCT16 SPRITE at FRAME_1{FBP=0, FBW=1, PSMCT16} with RGBAQ {R=0xAA, G=0x50, B=0xC0, A=0x00} → halfword 0x6155 (R5=0x15, G5=0x0A, B5=0x18, A1=0). Per-pixel halfword readback via vram_stub's 2nd port (with addr[1] selecting the right 16-bit slot) confirms each lands at the swizzled byte. The 16×4 image lives in block (0,0) of page (0,0), so within-block columnTable16 rows 0..3 are exercised. Strict separators: bytes 128 (linear y=1 row start at PSMCT16 stride=128) and 256 (linear y=2) stay 0 — proves the gate is live, since a fall-through to the legacy linear path would put the SPRITE halfword there. (2) scanout agreement — enable the Ch126 swizzled-pcrtc path on the same VRAM contents, capture the full 16×4 frame, assert each visible pixel reads back the expected RGB after PCRTC's 5→8 bit-replicate (RGB = {0xAD, 0x52, 0xC6}). Both gs_stub (Ch128 raster) and gs_pcrtc_stub (Ch126 scanout) instantiate the same swizzle module. (3) non-origin SPRITE — re-arm with FRAME_1{FBP=4, FBW=2, PSMCT16} and an 8×4 SPRITE at (60, 4)..(67, 7) with distinct color (halfword 0x9F8E). 
Crosses the PAGE-x boundary at x=64 (page (0,0) for x∈[60..63] — block (0,3) by swizzle table — vs page (1,0) for x∈[64..67] — block (0,0)) so page_index changes mid-row. Within-block column-table coords (xb=12..15 then 0..3, yb=4..7) cover columnTable16 rows 4..7 — a different region than Phase 1's yb=0..3. Pins three contracts Phase 1 can't: (a) ras_fbp reaches the swizzle's fbp input (FBP=0 in P1 would mask a tied-zero); (b) ras_fbw reaches fbw (FBW=1 in P1 would mask a tied-one); (c) the swizzle gets the FULL absolute pixel coords s2_x_q/s2_y_q rather than bbox-local (P1's sprite started at (0,0), so absolute and local were equal). Strict P3 separator at byte 9336 (linear formula's effective (60, 4) byte) stays 0 — outside the P3 swizzled write set, which lives in block (0,3) of page (0,0) (10914..11006) and block (0,0) of page (1,0) (16512..16604). Total emit count after all phases: 64 + 32 = 96. With Ch126 (read), Ch127 (TRXDIR upload), and Ch128 (raster emit) all live, the three major PSMCT16 paths are byte-consistent end-to-end — completes the byte-accuracy milestone for the second PSM, mirroring the Ch120/Ch121/Ch122 PSMCT32 closure.
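The color arithmetic quoted in this entry (RGBAQ → RGB5A1 halfword at raster, then 5→8 bit-replicate at scanout) can be checked with a short Python model. The function names are hypothetical, and taking A1 from the top bit of the 8-bit alpha is an assumption; the text only states that A=0x00 gives A1=0:

```python
def pack_rgb5a1(r8, g8, b8, a8):
    """Truncate each 8-bit channel to 5 bits and pack as A1:B5:G5:R5."""
    r5, g5, b5 = r8 >> 3, g8 >> 3, b8 >> 3
    a1 = a8 >> 7  # assumption: alpha bit is the MSB of the 8-bit alpha
    return (a1 << 15) | (b5 << 10) | (g5 << 5) | r5


def expand5(c5):
    """5 -> 8 bit-replicate: top 3 bits are copied into the low 3 bits."""
    return (c5 << 3) | (c5 >> 2)


def unpack_rgb(halfword):
    """Return the (R, G, B) triple as it would appear after bit-replicate."""
    return tuple(expand5((halfword >> shift) & 0x1F) for shift in (0, 5, 10))
```

With the Phase 1 color, `pack_rgb5a1(0xAA, 0x50, 0xC0, 0x00)` yields 0x6155 and `unpack_rgb(0x6155)` yields (0xAD, 0x52, 0xC6), the values quoted above.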
  • tb_gs_image_xfer_swizzle_psmct16.sv (Ch127) — focused contract for the new PSMCT16_SWIZZLE parameter on gif_image_xfer_stub. Mirrors Ch121's wiring shape but for PSMCT16: when the parameter is 1 AND the upload's PSM is PSMCT16, per-pixel byte addresses route through the Ch125 gs_swizzle_psmct16_stub (FBP=0, FBW=DBW, x=DSAX+cur_x, y=DSAY+cur_y) and dest_base_q (= DBP*256) is added back to anchor at the upload's DBP base. PSMCT32 is gated by its own PSMCT32_SWIZZLE parameter (Ch121); at Ch127, PSMT8/PSMT4 uploads stayed linear (their gates landed later, at Ch133 and Ch139). Default 0 keeps every existing PSMCT16 image-xfer TB unchanged. Three-phase verification: (1) origin transfer — TRXDIR upload of a 16×4 PSMCT16 image at DBP=DSAX=DSAY=0, DBW=1, RRW=16, RRH=4 → 64 pixels, 8 IMAGE qwords (8 px/qword for PSMCT16). After upload, the TB reads vram_stub's 2nd port at the SWIZZLED byte address (TB-side ref_addr16/ref_block_idx16/ref_col_idx16 carry the verbatim PCSX2 tables locked at Ch125) and asserts each halfword landed where the swizzle says (selecting the right 16-bit slot inside the 32-bit word via addr[1]). Strict linear-vs-swizzled separators at bytes 128 (linear y=1) and 256 (linear y=2) stay 0 — swizzled writes for the 16×4 image fill only block (0,0) bytes [0..126]. (2) scanout agreement — enable the Ch126 swizzled-pcrtc path on the same VRAM contents, capture the full 16×4 frame, assert each scanned pixel matches the uploaded RGB5A1 → RGB888 5→8 bit-replicate. Both upload and scanout instantiate the same gs_swizzle_psmct16_stub. (3) non-origin transfer — re-arm with DBP=8, DSAX=12, DSAY=6, RRW=8, RRH=4. Effective coords (12..19, 6..9) cross block_x=0→1 at effective_x=16 AND block_y=0→1 at effective_y=8, exercising both block-table dimensions inside a single non-origin upload. 
Pins three contracts the origin transfer can't distinguish from a buggy implementation: (a) dest_base_q (= DBP*256) is added on top of the swizzle output (DBP=0 in P1 would mask a missing-add); (b) the swizzle is fed the FULL effective coords (DSAX=DSAY=0 in P1 would mask a "feeds only cur_x/cur_y" regression); (c) BOTH block_x and block_y propagate through blockTable16[by][bx] (block_x=0 throughout P1 would mask a tied-block_x regression). Strict P3 separator at byte 3096 (linear formula's effective (12, 8) byte) stays 0 — outside the P3 swizzled write set [2048..3071]. NOTE (now historical): PSMCT16 raster swizzle was deferred when Ch127 landed; it shipped at Ch128 (mirrors Ch122 for PSMCT32) so the PSMCT16 raster path is now byte-consistent with the image-xfer path documented here.
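To see why a non-origin transfer is needed to exercise both block-table dimensions, the set of (block_y, block_x) indices a transfer touches can be enumerated. A minimal Python sketch with a hypothetical helper name; it assumes the whole rectangle stays inside one page, which holds for the transfers described in this document:

```python
def blocks_visited(dsax, dsay, rrw, rrh, block_w, block_h):
    """Return the sorted (block_y, block_x) pairs covered by a transfer.

    Effective coords are DSAX + cur_x and DSAY + cur_y, as fed to the
    swizzle; block_w / block_h are the per-PSM block dimensions in pixels.
    """
    return sorted({((dsay + y) // block_h, (dsax + x) // block_w)
                   for y in range(rrh) for x in range(rrw)})


# PSMCT16 (block = 16x8 px): origin transfer vs the Phase 3 transfer.
origin = blocks_visited(0, 0, 16, 4, 16, 8)
phase3 = blocks_visited(12, 6, 8, 4, 16, 8)
```

Here `origin` contains only block (0, 0), while `phase3` contains all four of (0, 0), (0, 1), (1, 0), (1, 1), so a block_x or block_y input tied to zero would be caught only by Phase 3.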
  • tb_gs_raster_swizzle_psmt4.sv (Ch140) — focused contract for the new PSMT4_SWIZZLE parameter on gs_stub (the raster emit surface). Mirrors Ch122/Ch128/Ch134 wiring shape but for the fourth (and last) PSM, and threads the Ch137 swizzle module's nibble_hi output into the existing Ch106 PSMT4 raster nibble RMW data lane (replacing s2_pixel_index[0] as the high/low nibble selector when the gate is on). When the parameter is 1 AND the active raster PSM is PSMT4 (ras_psm == 6'h14), the per-pixel raster emit address is routed through the Ch137 gs_swizzle_psmt4_stub (FBP=ras_fbp, FBW=ras_fbw, x=s2_x_q[11:0], y=s2_y_q[11:0]) — its addr output is the absolute byte address, AND its nibble_hi output keys s2_emit_color64's nibble placement and s2_emit_mask's high/low gating (write_be stays 4'b0001 for both paths). PSMCT32/PSMCT16/PSMT8 are gated by their own parameters; default 0 keeps every existing PSMT4 raster TB (Ch106 raster_psmt4, Ch107 PSMT4-e2e, Ch103 PSMT4+CLUT, Ch104 round-trip, etc.) on the original linear addressing. No new ports. Default-off smoke verification: ran Ch106 + Ch107 + Ch103 + Ch104 PSMT4 TBs before writing the new TB; all PASSed unchanged. Three-phase verification (mirrors Ch134 PSMT8 raster shape, with PSMT4 nibble adaptations + CLUT-disabled grayscale at scanout): (1) origin SPRITE at FBP=0/FBW=2 (FBW must be even per PCSX2 GSLocalMemory.h:560 — same as PSMT8). Drive a 16×4 PSMT4 SPRITE with RGBAQ.R=0xAA (PSMT4 raster channel takes R[3:0] as the nibble per Ch106 → nibble = 0xA). Per-pixel nibble readback via vram_stub's 2nd port (with addr[1:0]-keyed byte selection then nibble_hi-keyed nibble selection inside the byte) confirms each pixel landed at the correct (byte, nibble) slot. The image lives in the upper-left of block (0,0) of page (0,0); within-block columnTable4 entries for yb=0..3, xb=0..15 cover nibble_idx values [0..127] → byte_in_block ∈ [0..63]. Strict separator: byte 64 (linear y=1 row start at PSMT4 FBW=2 stride 64) stays 0. 
(2) scanout agreement — enable Ch138 swizzled-pcrtc on the same VRAM, capture full 16×4 frame, assert each pixel reads back as PSMT4 grayscale R=G=B={0xA, 0xA} = 0xAA. Both gs_stub and gs_pcrtc_stub instantiate the same gs_swizzle_psmt4_stub AND thread its nibble_hi output through their respective nibble selectors — agreement at this layer means both integrations land at the same byte+nibble positions for PSMT4. (3) non-origin SPRITE at FBP=4/FBW=4 (bw_pg=2) drawing 8×4 SPRITE at (124, 4)..(131, 7) with R=0x55 (nibble = 0x5). Crosses PSMT4 PAGE-x at x=128 (page (0,0) for x∈[124..127], page (1,0) for x∈[128..131]). 2 blocks visited: blockTable4[0][3]=10 → page (0,0) block_base 10752; blockTable4[0][0]=0 → page (1,0) block_base 16384. Pins three contracts the origin transfer can't: ras_fbp reaches the swizzle's fbp input; ras_fbw reaches fbw; the swizzle gets the FULL absolute pixel coords s2_x_q/s2_y_q. Strict P3 separator at byte 8766 (linear (124, 4) at FBP=4/FBW=4) stays 0 — outside the P3 swizzled write set [10752..11007] + [16384..16639]. Total emit count: 64 + 32 = 96. First-attempt PASS errors=0. With Ch138 (read-side), Ch139 (TRXDIR upload), and Ch140 (raster emit) all live, the three major PSMT4 paths can be byte-consistent under the canonical swizzle when their gates are flipped on — completing the four-PSM × three-path byte-accuracy foundation (CT32 Ch120/Ch121/Ch122 + CT16 Ch126/Ch127/Ch128 + T8 Ch132/Ch133/Ch134 + T4 Ch138/Ch139/Ch140). End-to-end PSMT4 swizzled demos (mirroring Ch123/Ch124, Ch129/Ch130, Ch135/Ch136) are now possible.
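The nibble read-modify-write that nibble_hi controls can be modelled at byte granularity. This Python sketch uses a hypothetical function name and mirrors only the mask behavior described here (0xF0 for the high nibble, 0x0F for the low nibble); it is not the RTL:

```python
def nibble_write(old_byte, nibble, nibble_hi):
    """Merge a 4-bit value into one half of a byte, leaving the other half."""
    mask = 0xF0 if nibble_hi else 0x0F
    data = ((nibble & 0xF) << 4) if nibble_hi else (nibble & 0xF)
    return (old_byte & ~mask & 0xFF) | (data & mask)


# Two PSMT4 pixels sharing one byte, written in either order,
# end up side by side without disturbing each other.
byte = nibble_write(0x00, 0xA, nibble_hi=True)
byte = nibble_write(byte, 0xA, nibble_hi=False)
```

After both writes `byte` is 0xAA, the value the Phase 1 SPRITE leaves in every touched byte once both of its pixels have landed.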
  • tb_gs_raster_swizzle_psmt8.sv (Ch134) — focused contract for the new PSMT8_SWIZZLE parameter on gs_stub (the raster emit surface). Mirrors Ch122's PSMCT32 + Ch128's PSMCT16 wiring shape but for the third PSM: when the parameter is 1 AND the active raster PSM is PSMT8 (ras_psm == 6'h13), the per-pixel raster emit address is routed through the Ch131 gs_swizzle_psmt8_stub (FBP=ras_fbp, FBW=ras_fbw, x=s2_x_q[11:0], y=s2_y_q[11:0]) — its output is the absolute byte address. PSMCT32/PSMCT16 are gated by their own parameters; PSMT4 stayed linear at Ch134 (its raster gate landed later, at Ch140). Default 0 keeps every existing PSMT8 raster TB (Ch105 raster_psmt8, Ch107 PSMT4-via-CT16-CLUT palette path, etc.) on the original linear addressing. No new ports — parameter-only API change. Default-off smoke verification: ran Ch105 tb_gs_raster_psmt8 before writing the new TB; PASSed unchanged. Three-phase verification (mirrors Ch128 PSMCT16 raster shape): (1) origin SPRITE at FBP=0/FBW=2 (FBW must be even — PCSX2 asserts (bw & 1) == 0 for PSMT8). Drive a 16×8 PSMT8 SPRITE with RGBAQ.R=0xA5 (PSMT8 raster channel uses R as the byte index per Ch105). Per-pixel byte readback via vram_stub's 2nd port confirms each lands at the swizzled byte. The 16×8 image lives in the upper half of block (0,0) of page (0,0); the within-block columnTable8 distributes the 128 bytes across yb rows 0..7 — byte offsets 0..127 within the block. Strict separators: bytes 128 (linear y=1 row start at PSMT8 stride=128) and 256 (linear y=2) stay 0 — proves the gate is live, since a fall-through to the legacy linear path would put the SPRITE byte there. (2) scanout agreement — enable the Ch132 swizzled-pcrtc path on the same VRAM, capture the full 16×8 frame, assert each pixel's PCRTC PSMT8 grayscale R=G=B matches idx=0xA5. Both gs_stub and gs_pcrtc_stub instantiate the same gs_swizzle_psmt8_stub, so success proves byte-level agreement. (3) non-origin SPRITE at FBP=4/FBW=4 (bw_pg=2) drawing 8×4 SPRITE at (124, 4)..(131, 7) with RGBAQ.R=0x5A. 
Crosses PSMT8 PAGE-x at x=128 (x∈[124..127] is in page (0,0) block (0,7) by swizzle table; x∈[128..131] is in page (1,0) block (0,0)) so page_index changes mid-row. Pins three contracts the origin transfer can't: ras_fbp reaches the swizzle's fbp input (FBP=0 in P1 would mask a tied-zero); ras_fbw reaches fbw (FBW=2 would mask a tied-two); the swizzle gets the FULL absolute pixel coords s2_x_q/s2_y_q rather than bbox-local (P1 sprite started at (0,0) so absolute=local). PSMT8's page-x boundary at x=128 is different from CT32/CT16's x=64, so this exercises the PSMT8-specific x[7] wiring of the swizzle. Strict P3 separator at byte 9340 (linear (124, 4) at FBP=4/FBW=4) stays 0 — outside the P3 swizzled write set (page (0,0) block (0,7) at base 13568, page (1,0) block (0,0) at base 16384). Total emit count: 128 + 32 = 160. First-attempt PASS errors=0. With Ch132 (read-side), Ch133 (TRXDIR upload), and Ch134 (raster emit) all live, the three major PSMT8 paths can be byte-consistent under the canonical swizzle when their gates are flipped on — completing the third-PSM byte-accuracy milestone for ALL three integration points (mirrors the Ch120/Ch121/Ch122 PSMCT32 trio + the Ch126/Ch127/Ch128 PSMCT16 trio).
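Every strict separator in these entries is the address the legacy linear path would have written. A Python sketch of that formula follows; the function name is hypothetical, and the base conventions (FBP × 2048 bytes for raster and scanout, DBP × 256 bytes for image transfer) are inferred from the separator values quoted in this document rather than read from the RTL:

```python
def linear_byte_addr(base_bytes, fbw, bits_per_pixel, x, y):
    """Legacy linear layout: rows are FBW * 64 pixels wide."""
    stride = fbw * 64 * bits_per_pixel // 8  # bytes per row
    return base_bytes + y * stride + (x * bits_per_pixel) // 8


# Raster separators at FBP=4 (assumed base = FBP * 2048 bytes).
psmt8_p3 = linear_byte_addr(4 * 2048, fbw=4, bits_per_pixel=8, x=124, y=4)
psmt4_p3 = linear_byte_addr(4 * 2048, fbw=4, bits_per_pixel=4, x=124, y=4)
psmct16_p3 = linear_byte_addr(4 * 2048, fbw=2, bits_per_pixel=16, x=60, y=4)
```

These evaluate to 9340, 8766, and 9336 respectively, the Phase 3 separator bytes quoted for the PSMT8, PSMT4, and PSMCT16 raster testbenches.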
  • tb_gs_image_xfer_swizzle_psmt4.sv (Ch139) — focused contract for the new PSMT4_SWIZZLE parameter on gif_image_xfer_stub. Mirrors Ch121/Ch127/Ch133 wiring shape but for the fourth (and last) PSM, and threads the Ch137 swizzle module's nibble_hi output into the existing Ch118 nibble RMW data lane (replacing x_eff[0] as the high/low nibble selector when the gate is on). When the parameter is 1 AND the active DPSM is PSMT4, the per-pixel byte address is dest_base_q (= DBP*256) + swizzle_psmt4(FBP=0, FBW=DBW, x=DSAX+cur_x, y=DSAY+cur_y).addr, AND cur_mask_c is 0x0000_00F0 when swizzle4_nibble_hi=1 (high nibble) or 0x0000_000F when 0 (low nibble) — the per-bit write_mask machinery (vram_stub merges only the targeted nibble) layers on top of the swizzled address. PSMCT32/PSMCT16/PSMT8 are gated by their own parameters. Default 0 keeps the legacy linear path for every existing PSMT4 image-xfer TB (Ch118 etc.). No new ports — parameter-only API change. Default-off smoke verification: ran Ch118 tb_gs_image_xfer_psmt4 before writing the new TB; PASSed unchanged. Three-phase verification (mirrors Ch127/Ch133 audit-closed shape): (1) origin write-side lock at DBP=0/DBW=2/DSAX=DSAY=0 (DBW must be even per PCSX2 GSLocalMemory.h:560 — same FBW-evenness as PSMT8). 16×4 PSMT4 image upload via 2 IMAGE qwords (32 px/qword for PSMT4, so 2 qwords carry 4 rows × 16 px at RRW=16). After upload, per-pixel nibble readback at the swizzled (addr, nibble_hi) slot asserts each nibble landed where the swizzle says. Strict separator: PSMT4 row stride at DBW=2 = DBW*32 = 64 bytes, so linear y=1 starts at byte 64. Swizzled write set lives in [0..63] within block (0,0). Byte 64 stays 0 (verified via per-byte check, not full-word — the check_byte_zero task initially had a full-word bug that misreported neighbor-byte writes; fixed to check only the targeted byte via addr[1:0]-keyed case statement). 
(2) end-to-end agreement: enable Ch138 PSMT4 swizzled scanout on the same VRAM (PSMT4_SWIZZLE=1 on pcrtc, CLUT disabled), capture the 16×4 frame, verify each pixel's grayscale R=G=B={nibble, nibble} matches nibble_at(xx, yy). Both modules instantiate the same gs_swizzle_psmt4_stub so success proves byte+nibble-level agreement under TRXDIR-style emit + scanout-style read. (3) non-origin transfer at DBP=8/DBW=2/DSAX=28/DSAY=12/RRW=8/RRH=8. Effective coords (28..35, 12..19) cross block_x=0→1 at effective_x=32 AND block_y=0→1 at effective_y=16 (PSMT4 block geometry: 32×16 px). All 4 corner blocks of page (0,0) at DBP=8 visited: blockTable4[0][0]=0, [0][1]=2, [1][0]=1, [1][1]=3 (block bases 2048/2560/2304/2816). Pins three contracts the origin transfer can't: dest_base_q ADDED ON TOP of the swizzle output (DBP=0 in P1 would mask a missing-add regression — fixed during bring-up after the TB initially passed P3_DBP directly to ref_pos_psmt4 instead of using fbp_v=0 + adding DBP*256); FULL effective coords; BOTH block_x and block_y propagate through blockTable4[by][bx]. Phase 3 strict separator: linear formula puts effective coord (28, 12) at byte 2830 — under linear, the neighboring pixel (29, 12) writes high nibble = 1 to that byte. Under swizzled, no Phase-3 pixel hits byte 2830 (cross-checked: col_idx_psmt4 for the 4-block × 16-pixel coord set never produces nibble_idx 28 or 29). Byte 2830 stays 0 → fall-through to linear would have stomped it with 0x10. PASS errors=0 after two bug-fix iterations: (a) ref_pos_psmt4(P3_DBP, ...) was wrong — engine feeds FBP=0 to the swizzle and adds DBP*256 separately, so TB must do the same; (b) check_byte_zero tested the full word instead of the targeted byte, producing false failures when a neighbor byte in the same word was independently touched. Counts: arms=2, writes=128 (P1 64 + P3 64). 
With Ch138 (read-side scanout) + Ch139 (image-xfer write-side) + Ch140 (raster write-side) all live, the Ch137 PSMT4 primitive now has all 3 integration points wired, and Ch141 closes the e2e demo.
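The check_byte_zero fix described in this entry comes down to testing one byte lane rather than the whole 32-bit word. A small Python model of the two behaviors; the names are hypothetical, and the lane ordering (addr[1:0]=0 selects bits 7:0) is an assumption:

```python
def word_is_zero(word32):
    """Original (buggy) check: fails if ANY byte in the word was written."""
    return word32 == 0


def byte_is_zero(word32, addr):
    """Fixed check: inspect only the byte lane selected by addr[1:0]."""
    return ((word32 >> (8 * (addr & 0x3))) & 0xFF) == 0


# Separator byte 64 is lane 0 of its word; suppose a legitimate swizzled
# write touched lane 2 of the same word.
word = 0x00AA0000
false_failure = not word_is_zero(word)    # the full-word check complains
separator_clean = byte_is_zero(word, 64)  # the per-byte check passes
```

The per-byte form keeps the separator strict for the byte under test while tolerating neighbor bytes that the swizzled layout may legitimately touch.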
  • tb_gs_image_xfer_swizzle_psmt8.sv (Ch133) — focused contract for the new PSMT8_SWIZZLE parameter on gif_image_xfer_stub. Mirrors Ch121's PSMCT32 + Ch127's PSMCT16 wiring shape but for the third PSM: when the parameter is 1 AND the active DPSM is PSMT8, the per-pixel byte address is dest_base_q (= DBP*256) + swizzle_psmt8(FBP=0, FBW=DBW, x=DSAX+cur_x, y=DSAY+cur_y). PSMCT32/PSMCT16 are gated by their own parameters; PSMT4 stayed linear at Ch133 (its image-xfer gate landed later, at Ch139). Default 0 keeps the legacy linear path for every existing PSMT8 image-xfer TB (Ch117 etc.). No new ports — parameter-only API change. Default-off smoke verification: ran Ch117 tb_gs_image_xfer_psmt8 before writing the new TB; PASSed unchanged. Three-phase verification (mirrors Ch127 audit-closed shape): (1) origin write-side lock at DBP=0/DBW=2 (DBW must be even per PCSX2 GSLocalMemory.h:553 — PSMT8 pages are 128 px wide vs FBW's 64-px units, so 2 FBW units per page → bw_pg=1 here). 16×8 PSMT8 image upload via 8 IMAGE qwords (16 px/qword). Per-pixel index idx_at(x, y) = (y[2:0] << 4) | x[3:0] ∈ [0x00..0x7F]. After upload, byte-readback at the swizzled address asserts each byte landed where the swizzle says. Strict separators: linear y=1 (byte 128) and y=2 (byte 256) row starts stay 0 — swizzled write set lives entirely in [0..127]. (2) end-to-end agreement: enable Ch132 swizzled scanout on the same VRAM, capture the frame, verify each visible pixel's PCRTC PSMT8 grayscale R=G=B matches idx_at(x, y). Both modules instantiate the same gs_swizzle_psmt8_stub so success proves byte-level agreement under TRXDIR-style emit + scanout-style read. (3) non-origin transfer at DBP=8/DBW=2/DSAX=12/DSAY=10/RRW=8/RRH=8. Effective coords (12..19, 10..17) cross block_x=0→1 at effective_x=16 AND block_y=0→1 at effective_y=16, so all 4 corner blocks of page (0,0) at DBP=8 (blockTable8[0][0]=0, [0][1]=1, [1][0]=2, [1][1]=3 → block bases 2048/2304/2560/2816) are visited. 
Pins three contracts the origin transfer can't: dest_base_q = DBP*256 ADDED ON TOP; the swizzle is fed FULL effective coords (DSAX/DSAY non-zero); BOTH block_x and block_y propagate through blockTable8[by][bx]. Phase 3 distinct-pixel pattern uses p3_idx = 0x80 | idx ∈ [0x80..0xFF] (disjoint from Phase 1's [0x00..0x7F]) so a P3 pixel landing at a P1 byte (or vice versa) surfaces as wrong RGB. Phase 3 strict separator: linear formula puts effective coord (12, 10) at byte 2048 + 10*128 + 12 = 3340 (outside swizzled set [2048..3071]); byte 3340 stays 0 — proves a fall-through to linear would have stomped that byte. First-attempt PASS: arms=2, writes=192 (=128+64), errors=0. NOTE: at Ch133 only, PSMT8 raster-side emits via gs_stub still used linear addressing — Ch133 was image-xfer write-side only. Ch134 later closed the raster-side gate via PSMT8_SWIZZLE on gs_stub (mirrors Ch122 for PSMCT32 and Ch128 for PSMCT16) — see Ch134 row above.
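The test patterns in this entry are chosen so that a misplaced pixel shows up as a wrong value. The Python sketch below models idx_at and the Phase 3 offset; the names are hypothetical apart from idx_at, and p3_idx takes the base index as its argument because the text does not say which coordinates Phase 3 feeds into it:

```python
def idx_at(x, y):
    """Phase 1 pattern: (y[2:0] << 4) | x[3:0], in 0x00..0x7F."""
    return ((y & 0x7) << 4) | (x & 0xF)


def p3_idx(idx):
    """Phase 3 pattern: set bit 7 so values land in 0x80..0xFF."""
    return 0x80 | (idx & 0x7F)


phase1_values = {idx_at(x, y) for y in range(8) for x in range(16)}
phase3_values = {p3_idx(i) for i in range(0x80)}
```

Every pixel of the 16×8 Phase 1 image gets its own value, and the two phases share no values, so a Phase 3 pixel landing on a Phase 1 byte (or the reverse) cannot read back as correct.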
  • tb_gs_scanout_swizzle_psmt4.sv (Ch138) — focused contract for the new PSMT4_SWIZZLE parameter on gs_pcrtc_stub. Mirrors Ch120/Ch126/Ch132's read-side-first wiring shape but adds the PSMT4-specific twist: the swizzle module outputs both an absolute byte address AND a nibble_hi selector (PSMT4 = 4 bits/pixel = 2 pixels per byte, and the canonical PCSX2 column table reorders nibbles within a block, so pixel_index[0] is no longer the right selector under the swizzled layout). When the parameter is 1 AND the active PSM is PSMT4, scanout reads go through the Ch137 gs_swizzle_psmt4_stub and the PSMT4 nibble extractor uses swizzle4_nibble_hi instead of pixel_index[0]. PSMCT32/PSMCT16/PSMT8 are gated by their own parameters; default 0 keeps every existing PSMT4 scanout TB (Ch103 PSMT4+CLUT, Ch104 PSMT4 round-trip, Ch107 PSMT4 e2e, etc.) on the legacy linear path. No new ports — parameter-only API change. Default-off smoke verification: ran Ch103 tb_gs_scanout_psmt4_clut + Ch104 tb_gs_psmt4_round_trip before writing the new TB; both PASSed unchanged. Two-phase verification (mirrors Ch132 closure shape; CLUT disabled so PCRTC's PSMT4 grayscale fallback gives r=g=b={nibble, nibble} at scanout): (1) origin at FBP=0/FBW=2/DBX=DBY=0 (FBW must be even per PCSX2 GSLocalMemory.h:560 because PSMT4 pages are 128 px wide, same as PSMT8). 16×4 region preloaded at swizzled bytes via a TB-side byte_shadow accumulator that lays each pixel's nibble at its (addr, nibble_hi) slot; bytes are then flushed to vram_stub via per-byte BE writes. Per-pixel nibble pattern nibble_at(x, y) = ((y << 1) ^ x) & 4'h7 ∈ [0..7] gives position-dependent gray values across the 16×4 frame (8 distinct nibbles, so values repeat). The image lives entirely in block (0,0) of page (0,0) and exercises within-block columnTable4 entries for yb=0..3, xb=0..15. 
Strict separator: byte 64 (linear y=1 row start at FBW=2 stride) pre-colored with sentinel 0xCC (gray=0xCC, unproducible by Phase 1's [0..7]-nibble pattern) — fall-through to linear would surface as RGB(0xCC, 0xCC, 0xCC). (2) non-origin at FBP=4/FBW=4 (bw_pg=2), DBX=120, DBY=126. Effective coords range x∈[120..135], y∈[126..129]. page_x crosses 0→1 at effective_x=128, page_y crosses 0→1 at effective_y=128 (PSMT4's 128-tall page boundary — different from PSMT8's 64-tall). All 4 corner pages of FBP=4/FBW=4 visited, each with a distinct blockTable4 lookup (blockTable4[7][3]=31 → page (0,0) block_base 16128; blockTable4[7][0]=21 → page (1,0) block_base 21760; blockTable4[0][3]=10 → page (0,1) block_base 27136; blockTable4[0][0]=0 → page (1,1) block_base 32768). A regression that tied any of {dispfb_fbp, dbx, dby, FBW, block_x, block_y, page_index, bw_pg=FBW/2, swizzle nibble_hi} to zero would NOT survive Phase 2. Strict P2 separator: byte 24380 (linear formula's place for (120, 126); outside all 4 swizzled chunks) pre-colored with sentinel 0xDD → fall-through to linear would surface as RGB(0xDD, 0xDD, 0xDD), unproducible by the Phase-2 pattern. PASS errors=0 after one bug-fix iteration: Phase 2's flush-loop initially hardcoded the wrong byte ranges due to a blockTable4[7][3] lookup mistake (the value is 31, not 15) — replaced with a shadow-array sweep [256..65535] that flushes any non-zero byte, eliminating the hardcode/lookup mismatch class entirely. NOTE (now historical): Ch138 was read-side only when introduced; the PSMT4 write-side is now live as well — Ch139 (image-xfer) + Ch140 (raster) + Ch141 (raster-driven e2e demo). With Ch138, all four common GS PSMs now have read-side byte-accuracy under their swizzle gates (CT32 Ch120 + CT16 Ch126 + T8 Ch132 + T4 Ch138).
  • tb_gs_scanout_swizzle_psmt8.sv (Ch132) — focused contract for the new PSMT8_SWIZZLE parameter on gs_pcrtc_stub. Mirrors Ch120/Ch126's wiring shape but for PSMT8: when the parameter is 1 AND the active PSM is PSMT8, scanout reads go through the Ch131 gs_swizzle_psmt8_stub (real PS2 GS page/block/column layout — 128×64 pixel pages, 4×8 block grid, 16×16 within-block bytes, bw_pg = FBW>>1) instead of the legacy linear FBW*64*y + x formula. PSMCT32/PSMCT16 are gated by their own parameters; PSMT4 stays linear (its swizzle math is future). Default PSMT8_SWIZZLE=0 keeps every existing PSMT8 scanout TB (Ch96 storage-only, Ch97 PSMT8+CLUT, Ch103 PSMT4-via-CT16-CLUT, Ch107 PSMT4-e2e palette path) on the original linear addressing. No new ports — parameter-only API change. Default-off smoke verification: ran Ch96 tb_gs_scanout_psmt8 before writing the new TB; PASSed unchanged, confirming the new instance + 4-way mux extension don't disturb the linear path. Two-phase verification (mirrors Ch126 PSMCT16 closure shape): (1) origin (FBP=0, FBW=2, DBX=DBY=0; FBW must be even — PCSX2 asserts (bw & 1) == 0 for PSMT8 because pages are 128 px wide vs FBW's 64-px units, so 2 FBW units per page → bw_pg=1 here). 16×8 region preloaded at swizzled bytes; per-pixel index idx = (y[2:0] << 4) | x[3:0] ∈ [0x00..0x7F] surfaces as grayscale R=G=B=idx via PCRTC's PSMT8 fallback (Ch96). x∈[0..15] is entirely block_x_in_page=0, so the within-block test exercises ALL 16 xb positions of columnTable8 across yb rows 0..7. Strict separators: linear y=1 starts at byte 128 (FBW=2 stride) but swizzled lands at byte 8 (columnTable8[1][0]=8, no *2 scale since PSMT8 is 1 byte/pixel); linear x=8,y=0 is byte 8 but swizzled is byte 2. (2) non-origin (FBP=4, FBW=4 → bw_pg=2, DBX=120, DBY=60). 
Effective coords range x∈[120..135], y∈[60..67] — page_x crosses 0→1 at effective_x=128 (proves x[7] reaches the page-x lane of the PSMT8 swizzle — different boundary from CT16/CT32's x[6]); page_y crosses 0→1 at effective_y=64; block_x and block_y both flip; ALL 4 pages (0,0)/(1,0)/(0,1)/(1,1) are visited, each with a distinct blockTable8 lookup ([3][7]=31, [3][0]=10, [0][7]=21, [0][0]=0). A regression that tied any of {dispfb_fbp, dbx, dby, FBW, block_x, block_y, page_index, bw_pg=FBW/2} to zero would NOT survive Phase 2. Sentinel separator: byte 24500 (inside linear range 23672..25479 for the Phase-2 effective region, outside ALL 4 swizzled write-set blocks) pre-colored with 0xFF → fall-through to linear would surface as RGB(0xFF, 0xFF, 0xFF), which is unproducible by the Phase-2 unique pattern (idx ∈ [0x00..0x7F]). First-attempt PASS errors=0 — no audit iteration needed because Phase 2's coord choices were designed up front to make all 7 chain-layer wires load-bearing AND the page-x crossing boundary is at PSMT8's specific x=128 (not the 64-px boundary the direct-color PSMs use). NOTE (now historical): Ch132 was read-side only when introduced; Ch133 then Ch134 closed the image-xfer + raster write sides for PSMT8, so all three PSMT8 swizzle integration points are now live (mirrors Ch120/Ch121/Ch122 for PSMCT32 and Ch126/Ch127/Ch128 for PSMCT16).
  • tb_gs_scanout_swizzle_psmct16.sv (Ch126) — focused contract for the new PSMCT16_SWIZZLE parameter on gs_pcrtc_stub. Mirrors Ch120's wiring shape but for PSMCT16: when the parameter is 1 AND the active PSM is PSMCT16, scanout reads go through the Ch125 gs_swizzle_psmct16_stub (real PS2 GS page/block/column layout) instead of the legacy linear FBW*64*y + x*2 formula. PSMCT32 is gated by its own PSMCT32_SWIZZLE parameter (Ch120); PSMT8/PSMT4 stay linear. Default 0 keeps every existing PSMCT16 scanout TB (Ch94/Ch95/Ch103/etc.) on the original linear addressing. Topology: TB drives vram_stub.write_* directly with each pixel's RGB5A1 halfword preloaded at the swizzled byte address (TB-side ref_addr16() mirrors the swizzle math + the Ch125 source-table-locked tables); pcrtc with PSMCT16_SWIZZLE=1 scans out the 16×8 frame and the TB asserts each captured pixel matches the preloaded color after 5→8 bit-replicate. Per-pixel pattern is unique per (x, y): R5=(x^y)&0xF, G5=x&0xF, B5=y&0xF, expanded to 8 bits via PCRTC's bit-replicate. The PSMCT16 swizzle vs. linear distinction shows up at any y>0 (linear y=1 → byte 128 with FBW=1, but swizzled within block (0,0) yb=1 → columnTable16[1][0]=4 → byte 8) and at x=8, y=0 (linear byte 16 vs swizzled byte 2) so even within the first row + first block, the gate is a strict separator. NOTE (now historical): Ch126 was read-side only when introduced; Ch127 (image-xfer) then Ch128 (raster) closed the PSMCT16 write sides, mirroring Ch121/Ch122 for PSMCT32.
  • tb_gs_swizzle_psmt4.sv (Ch137) — focused contract for the new gs_swizzle_psmt4_stub math primitive: a pure-comb module mapping (FBP, FBW, x, y) to a VRAM byte address + nibble_hi selector using the real PS2 GS PSMT4 layout (8 KiB pages organized as 128×128 PSMT4 pixels — 4× as many pixels per page as PSMT8 since each PSMT4 pixel is a NIBBLE; 32 blocks/page in an 8-rows × 4-cols grid (same orientation as PSMCT16's blockTable16); each block 32×16 pixels = 512 nibbles = 256 bytes; 512-entry within-block column table — 2× the entries of PSMT8's 256-entry table due to the doubled block area, indexed [yb][xb] with yb=0..15 + xb=0..31 → nibble 0..511). PSMT4 is the most complex of the four common GS PSMs because each pixel is HALF a byte, so the swizzle outputs both a byte address and a nibble_hi selector (=0 for low nibble of the byte at addr, =1 for high). PSMT4 reuses PSMT8's page-stride convention (bw_pg = FBW >> 1; PCSX2 asserts FBW must be even at GSLocalMemory.h:560 because PSMT4 pages are 128 px wide). Source-table provenance pinned: _blockTable4 taken verbatim from pcsx2/GS/GSTables.cpp lines 61-69; columnTable4 from same file lines 147-213. Master HEAD commit 3000e113e2b3a76357c08dfa80d3c747f40e2706; file blob SHA 3581209b8217378f473f9de22a9dbc8c45ca49b6 (same blob Ch131 pinned). Cross-checked against GSLocalMemory.h:558 BlockNumber4 + the pxOffset template at GSTables.cpp:247-258 (blockSize=512 in NIBBLE units, pageSize=16384 nibble units = 8192 bytes, pageWidth=128). The existing per-bit write_mask 0x0F/0xF0 nibble RMW from Ch106/Ch118 will still apply on top of the swizzled byte address — the swizzle module doesn't touch the nibble merge logic; it just produces (addr, nibble_hi).
Five-phase verification (mirrors Ch125/Ch131 shape, scaled up): (1) spot-checks at 15 hand-computed corners (origin, intra-block xb=1/8/16/yb=1/yb=2-with-hi-nibble, last nibble of block (0,0), first/second/third/fourth horizontal blocks, second-row-of-blocks origin, page-x at x=128 + page-y at y=128, FBP=4 origin, page0-last-pixel (127,127) → addr 8191 hi=1). (2a) INDEPENDENT column-table source lock — 32 hard-coded check_nibble() calls for yb=0 (literal-by-literal verbatim from PCSX2 columnTable4 row 0) PLUS a programmatic walk for yb=1..15 against the in-TB ref function (480 more checks); Phase 2a's literal yb=0 row + Phase 5's bijectivity sweep + Phase 3's literal block-table lock together pin the table. (3) INDEPENDENT block-table source lock — 32 hard-coded checks (one per block in page 0) with expected block index taken VERBATIM from PCSX2 blockTable4. (4) block-swizzle walk via in-TB ref_block_idx4. (5) bijectivity sweep over the 128×128 page — 16384 NIBBLE slots (vs PSMT8's 8192 byte slots), every pixel must hit a unique (byte_addr, nibble_hi) pair and agree with both the in-TB ref byte address AND ref nibble_hi. Plus multi-page sanity at FBW=4/bw_pg=2 (page-x crossing at x=192 → byte 10496 with blockTable4[1][2]=9, and page-y crossing at y=128 → byte 16384) and non-page-aligned FBP coverage at FBP ∈ {1,2,3}, including FBP=3+FBW=4+page-(1,1) intra-block at (129, 129) → byte 30732 (= 6144 + 3*8192 + 0*256 + ref_col_idx4(1,1)/2 = 30720 + 12). First-attempt PASS errors=0. NOTE: This module is NOT YET wired into gs_pcrtc_stub / gif_image_xfer_stub / gs_stub — those still use linear PSMT4 addressing as of Ch137. The math is locked here so follow-on chapters can wire PSMT4_SWIZZLE parameter gates into the existing address paths without disturbing the legacy linear-PSMT4 TBs (Ch103 / Ch106 / Ch107 / Ch118). With Ch119 PSMCT32 + Ch125 PSMCT16 + Ch131 PSMT8 + Ch137 PSMT4, all four common GS PSMs now have byte-accurate-to-real-PS2 swizzle math available as standalone primitives — the four-PSM swizzle math foundation is complete. Future chapters can wire PSMT4 into pcrtc/image-xfer/raster behind a PSMT4_SWIZZLE parameter (mirroring Ch120→Ch124 / Ch126→Ch130 / Ch132→Ch136), with the existing nibble RMW machinery layered on top.
  • tb_gs_swizzle_psmt8.sv (Ch131) — focused contract for the new gs_swizzle_psmt8_stub math primitive: a pure-comb module mapping (FBP, FBW, x, y) to a VRAM byte address using the real PS2 GS PSMT8 layout (8 KiB pages organized as 128×64 PSMT8 pixels — 2× wider than CT16's 64×64 page; 32 blocks/page in a 4-rows × 8-cols grid; each block 16×16 pixels = 256 bytes; 256-entry within-block column table — 2× the entries of CT16's 128-entry table due to the doubled block area, indexed [yb][xb] with yb=0..15 + xb=0..15 → byte 0..255). PSMT8 also introduces a new page-stride constant bw_pg = FBW >> 1 (PCSX2 asserts (bw & 1) == 0 at GSLocalMemory.h:553 because PSMT8 pages are 128 px wide vs FBW's 64-px units → 2 FBW units per PSMT8 page, so FBW must be even). Source-table provenance pinned: blockTable8 taken verbatim from pcsx2/GS/GSTables.cpp lines 53-59; columnTable8 from same file lines 111-145. Master HEAD commit 3000e113e2b3a76357c08dfa80d3c747f40e2706; file blob SHA 3581209b8217378f473f9de22a9dbc8c45ca49b6. Cross-checked against GSLocalMemory.h:551 BlockNumber8 + the pxOffset template at GSTables.cpp:247-258 (blockSize=256, pageSize=8192, pageWidth=128). PCSX2's bp is in 256-byte block-pointer units; in our FBP=2048-byte units, bp = FBP * 8 so bp*256 = FBP*2048. Five-phase verification (mirrors Ch125 PSMCT16 shape): (1) spot-checks at 15 hand-computed corners (origin, intra-block xb=1/4/8/yb=1, last byte of block (0,0), first/second block origins, second row of blocks, third+fourth blocks, page-x at x=128 and page-y at y=64, FBP=4 origin); (2a) INDEPENDENT column-table source lock — 256 hard-coded check() calls (one per (yb, xb) inside block (0,0)) where the expected byte index is taken VERBATIM from PCSX2 columnTable8 with <literal> arithmetic, NOT derived from the in-TB ref function.
Catches any case where DUT and ref share the same miscopy (the same trap Ch125 added Phase 2a for with PSMCT16's column table); (2b) within-block 16×16 walk via the in-TB ref_col_idx8 (self-check); (3) INDEPENDENT block-table source lock — 32 hard-coded checks (one per block in page 0) with the expected block index taken VERBATIM from PCSX2 blockTable8, NOT derived from the in-TB ref; (4) block-swizzle walk via in-TB ref_block_idx8; (5) bijectivity sweep over the 128×64 page — 8192 byte slots (vs CT16's 4096 halfword slots), every pixel must hit a unique byte address in [0, 8192) and agree with the in-TB reference. Plus multi-page sanity at FBW=4/bw_pg=2 (page-x crossing at x=192 and page-y crossing at y=64) and non-page-aligned FBP coverage at FBP ∈ {1, 2, 3}, including FBP=3+FBW=4+page-(1,1) intra-block crossing at (129, 65). First-attempt PASS errors=0. NOTE: This module is NOT YET wired into gs_pcrtc_stub / gif_image_xfer_stub / gs_stub — those still use linear PSMT8 addressing as of Ch131. The math is locked here so follow-on chapters can wire PSMT8_SWIZZLE parameter gates into the existing address paths without disturbing the legacy linear-PSMT8 TBs (Ch96 / Ch97 / Ch103 / Ch105 / Ch107 / Ch117). With Ch119 PSMCT32 + Ch125 PSMCT16 + Ch131 PSMT8, three of the four common GS PSMs now have byte-accurate-to-real-PS2 swizzle math available as standalone primitives; PSMT4 (with its 32×16 nibble intra-block layout) is the natural Ch132 candidate.
  • tb_gs_swizzle_psmct16.sv (Ch125) — focused contract for the new gs_swizzle_psmct16_stub math primitive: a pure-comb module mapping (FBP, FBW, x, y) to a VRAM byte address using the real PS2 GS PSMCT16 layout (8 KiB pages organized as 64×64 PSMCT16 pixels; 32 blocks/page in a 4×8 grid; each block 16×8 pixels = 256 bytes; non-trivial within-block column table — unlike PSMCT32 where within-block IS row-major halfwords by accident, PSMCT16 has 4 internal 16×2-pixel sub-columns with a 128-entry permutation). Source-table provenance pinned: blockTable16 taken verbatim from pcsx2/GS/GSTables.cpp lines 29-39 (master HEAD commit 3d71e310; file-touch commit d983b2b0, 2026-01-12); columnTable16 from same file lines 91-109. Cross-check against the older Debian-packaged GSdx PixelAddressOrg16(x, y, bp, bw) = (BlockNumber16(...) << 7) + columnTable16[y & 7][x & 15] confirms the address chain (<< 7 lifts to halfword units, multiply by 2 for bytes; in our FBP=2048-byte units, bp = FBP * 8 so bp*256 = FBP*2048). Five-phase verification: (1) spot-checks at 13 well-defined corners (origin, intra-block, first/second block, second row of blocks, page-x and page-y boundaries, FBP=4 origin); (2) within-block 16×8 walk asserting byte = 2 * columnTable16[yb][xb] — locks the column table; a row-major-halfwords regression would fail; (3) source-table lock — 32 hard-coded address checks (one per block in page 0) with the expected block index taken VERBATIM from PCSX2 blockTable16, NOT derived from the in-TB reference function; (4) block-swizzle walk cross-checking the in-TB ref function against the DUT (the bijectivity sweep relies on it being correct); (5) bijectivity sweep over the 64×64 page — 4096 halfword slots, every pixel must hit a unique halfword address in [0, 8192) and agree with the in-TB reference. Plus multi-page sanity at FBW=2 and non-page-aligned FBP coverage at FBP ∈ {1, 2, 3} (real PS2 supports any 2048-byte-aligned FBP — same broadening Ch119 adopted post-audit).
NOTE: This module is NOT YET wired into gs_pcrtc_stub / gif_image_xfer_stub / gs_stub — those still use linear PSMCT16 addressing as of Ch125. The math is locked here so follow-on chapters can wire PSMCT16_SWIZZLE parameter gates into the existing address paths without disturbing the legacy linear-PSMCT16 TBs (Ch94 / Ch95 / Ch103 / Ch116).
  • tb_gs_swizzle_psmct32.sv (Ch119) — focused contract for the new gs_swizzle_psmct32_stub math primitive: a pure-combinational module mapping (FBP, FBW, x, y) to a VRAM byte address using the real PS2 GS PSMCT32 page/block swizzle layout (8 KiB pages, 4×8 grid of 8×8-pixel blocks per page, blocks ordered per the canonical PCSX2 PSMCT32 swizzle table, row-major within a block). Verification has five phases: (1) spot-checks on the well-defined corners — origin, intra-block walks, first/second block, second row of blocks, page-x and page-y boundaries, second page on x, and FBP=4 origin; (2) within-block 8×8 walk asserting byte_in_block = yb*32 + xb*4; (3) source-table lock — 32 hard-coded address checks (one per block in page 0) where the expected block index is taken VERBATIM from PCSX2's PSMCT32 block table, NOT derived from the in-TB reference function. This proves the DUT's swizzle_psmct32() table matches the canonical source; a copied-wrong table that happened to still be a valid permutation of 0..31 would fail this phase, while the bijectivity sweep below would pass it; (4) block-swizzle walk (redundant with phase 3, cross-checks ref_block_idx against the DUT — the bijectivity sweep relies on ref_block_idx being correct); (5) bijectivity sweep over the full 64×32 PSMCT32 page — every word slot in [0, 8192) reached exactly once (catches any swap/typo in the swizzle table). Plus a multi-page sanity check at FBW=2 (pixel (96, 16) → block (4,2) of page 1 → addr 14336) and a non-page-aligned FBP phase that drives FBP=1, 2, 3 (mid-page in the 8 KiB sense — real PS2 supports any 2048-byte-aligned FBP; our address formula is bit-correct for non-page-aligned FBP) plus FBP=3 with FBW=2 + intra-block crossing as a stress case.
NOTE (now historical): at Ch119 this module was standalone math only; Ch120 (PCRTC read), Ch121 (image-xfer write), and Ch122 (raster write) wired it into the three integration points — the same shape that Ch125-Ch128 (PSMCT16), Ch131-Ch134 (PSMT8), and Ch137-Ch140 (PSMT4) followed for the other three PSMs.
  • tb_gs_image_xfer_psmt4.sv (Ch118) — focused contract for gif_image_xfer_stub's PSMT4 path (the fourth and final supported PSM). PSMT4 packs 0.5 bytes/pixel (4-bit nibble per pixel = 2 px/byte), so each 128-bit IMAGE qword carries 32 pixels in 16 bytes. Each emit is a SUB-BYTE write: write_be = 4'b0001 with a per-emit nibble mask (write_mask = 0x0000_000F for the LOW nibble, 0x0000_00F0 for the HIGH nibble), keyed by (DSAX+x)[0]; vram_stub's per-bit merge commits exactly the targeted nibble, preserving the OTHER nibble of the byte. Back-to-back emits to the same byte (e.g. x=0 + x=1 of the same row) chain through NBA semantics without bypass logic — the same trick the raster channel uses since Ch106. The TB is INTENTIONALLY adversarial: VRAM is preloaded with 0xA5 across every byte the engine will write (plus boundary bytes), then a single IMAGE qword (32 PSMT4 pixels) covers the entire 8×4 rect. Every byte ends as {nibble_high_pixel, nibble_low_pixel} (no trace of 0xA5); bytes immediately right of the rect on each row stay 0xA5 (proves no nibble leak past RRW); bytes before / after the destination region also stay 0xA5. Pattern pixel(x,y) = 4'((y*8+x) & 0xF). Asserts: 1 trxdir arm, 32 vram writes, every emit be=0001 and mask ∈ {0x0F, 0xF0}, per-byte readback matches, boundary bytes preserved.
  • tb_gs_image_xfer_psmt8.sv (Ch117) — focused contract for gif_image_xfer_stub's PSMT8 path. Pushes 2 IMAGE qwords (32 PSMT8 pixels = 16 px/qword × 2) through the engine after a TRXDIR-shaped GIF-A+D register sequence with DPSM=PSMT8 (=0x13). PSMT8 packs 1 byte/pixel (an 8-bit CLUT index), so each qword holds 16 pixels; the engine emits one 8-bit pixel per cycle with write_be = 4'b0001, the index in the LOW byte of write_data, and write_mask = 0xFFFFFFFF; vram_stub commits mem[write_addr] <= write_data[7:0] at any byte alignment. Pattern is pixel(x,y) = 8'(y*16 + x) — 32 distinct values across the 8×4 rect so a wrong-byte-lane commit shows up unambiguously. Asserts: 1 trxdir arm, 32 vram writes (all be=0001, mask=0xFFFFFFFF), every pixel reads back at dest_base + y*64 + x, plus right-of-rect / before / after byte-zero boundary preservation. Each qword packs TWO rows of 8 pixels (lanes 0..7 = row y, lanes 8..15 = row y+1) — exercises the per-lane row-stride math at the qword boundary.
  • tb_gs_image_xfer_psmct16.sv (Ch116) — focused contract for gif_image_xfer_stub's new PSMCT16 path. Pushes 4 IMAGE qwords (32 PSMCT16 pixels = 8 px/qword × 4) through the engine after a TRXDIR-shaped GIF-A+D register sequence (BITBLTBUF/TRXPOS/TRXREG/TRXDIR). PSMCT16 packs 2 bytes/pixel, so each qword holds 8 pixels (vs 4 for PSMCT32). The engine emits one 16-bit pixel per cycle to vram_stub with write_be = 4'b0011, the pixel value in the LOW halfword of write_data, and write_mask = 0xFFFFFFFF; vram_stub commits the 2 bytes at the 2-byte-aligned destination address. Pattern is pixel(x,y) = 16'h{yyxx}{yyxx} — distinct per-pixel value so a wrong-lane commit shows up unambiguously. Asserts: 1 trxdir arm, 32 vram writes (all be=0011, mask=0xFFFFFFFF), every pixel reads back at dest_base + y*row_stride + x*2, and the bytes immediately right of the rect on each row + before the dest region + after the dest region all stay zero (proves row-stride math + no halfword leak past RRW). PSMT8 image-xfer landed in Ch117 and PSMT4 image-xfer landed in Ch118 — see those TB rows for their own per-byte / per-nibble contract coverage.
  • tb_gs_demo_psmt4_e2e_trxdir.sv (Ch110) — driver-shaped PSMT4 demo with the palette upload now arriving via a real TRXDIR/TRXPOS/TRXREG/HWREG image-transfer GIF packet sequence instead of TB-direct vram_stub writes. Closes the LAST TB-direct path in the e2e demo flow: every byte the GS sees — framebuffer pixels AND palette source — now arrives through a driver-shaped GIF stream. The DMAC delivers 36 qwords total: U1 (PACKED, NREG=4): BITBLTBUF/TRXPOS/TRXREG/TRXDIR — TRXDIR arms gif_image_xfer_stub. U2 (IMAGE, NLOOP=4): 4 qwords of 4 PSMCT32 entries each → 16 palette entries written into VRAM at DBP*256 by gif_image_xfer_stub. Then 4 SPRITE PACKED packets + 1 TEX0_1 PACKED packet. PASS criteria add to Ch109's: 1 EV_DMA_START / 36 EV_DMA_BEAT / 1 EV_DMA_DONE, 7 GIFtag accepts (U1 + U2 + 4×SPRITE + TEX0), 25 PACKED A+D dispatches (4 TRX-setup + 20 SPRITE + 1 TEX0), 16 image-xfer VRAM writes from gif_image_xfer_stub (DBP=4, DBW=1, DPSM=PSMCT32, DSAX=DSAY=0, RRW=16, RRH=1). The vram_stub write port is muxed at TB level: xfer_busy ? xfer_we : raster_pixel_emit (sequenced — palette upload completes before sprites raster). Ch110 also added a backpressure path on gif_packed_stub (image_data_ready input) so the upstream DMA stalls while gif_image_xfer_stub is draining the previous IMAGE qword's 4 PSMCT32 lanes; outside S_IMAGE the gate is a no-op (in_ready stays high). Privileged-block MMIO (PMODE/DISPFB1/DISPLAY1) remains TB-direct because those are CPU MMIO writes in real hardware, not GIF traffic.
  • tb_gs_demo_psmt4_e2e_dmac.sv (Ch109) — same 4-quadrant PSMT4 demo as Ch108, but the GIFtag + PACKED A+D quadwords arrive at gif_packed_stub via the DMAC channel-2 → ee_memory_map_stub → ee_ram_stub path instead of being TB-driven directly. Closes the last GIF-side sideband from Ch108: the demo is now reachable the way real EE/IOP code reaches it. The TB pre-stages the same 26 qwords (4 SPRITE packets × 6 qwords + 1 TEX0_1 packet × 2 qwords) into RAM at PAYLOAD_MADR, then writes DMAC channel-2 MADR/QWC/CHCR; a single NORMAL transfer with QWC=26 streams them into the GIF. PASS criteria add to Ch108's: 1 EV_DMA_START / 26 EV_DMA_BEAT / 1 EV_DMA_DONE (DMA event taxonomy locked), with the same downstream chain — 5 GIFtag accepts, 21 A+D dispatches in the expected reg-num sequence, 32 PSMT4 emits, 1 loader_busy rise, identical 16×8 captured frame. Privileged-block MMIO and palette pre-stage stay TB-direct (NOT GIF-side); TRXDIR/HWREG image-transfer for palette upload is a separate future chapter.
  • tb_gs_demo_psmt4_e2e_packed.sv (Ch108) — same 4-quadrant PSMT4 demo as Ch107 but routed through the GIFtag / PACKED A+D front-end (gif_packed_stub with REAL_AD_REG_MAP=1). Closes the last bit of GS-side sideband from Ch107: instead of TB-driving gs_stub.gif_reg_* directly, the TB pushes raw 128-bit GIFtag + PACKED A+D quadwords into gif_packed_stub.in_* exactly the way the real GIF would receive them from PATH3. Each SPRITE is a packet of 1 GIFtag (NLOOP=1, NREG=5, PACKED, REGS=0xEEEEE — 5×A+D in the low 5 nibble slots) + 5 PACKED A+D qwords (PRIM, FRAME_1=PSMT4, RGBAQ, XYZ2, XYZ2); TEX0_1 load is its own 1-tag/1-A+D packet. Total: 5 GIFtag accepts (4 SPRITEs + 1 TEX0_1) and 4×5 + 1×1 = 21 PACKED A+D register-write dispatches into gs_stub.gif_reg_*. 32 PSMT4 raster emits arrive (Ch106 RMW), loader fires exactly once on TEX0_1, and the captured 16×8 frame matches the same expected CLUT-decoded RGB as Ch107 — i.e. real-format GIF packets reach the GS register file with the same cadence the TB previously synthesised by hand. Privileged-block MMIO (PMODE/DISPFB1/DISPLAY1) and the palette pre-stage in VRAM remain TB-direct because they are NOT GIF-side; the palette upload via real-PS2 TRXDIR/TRXPOS/TRXREG/HWREG image-transfer packets is a separate future chapter, as is the DMAC channel-2 burst that would normally deliver the GIFtag qwords (this TB drives gif_packed_stub.in_* directly to keep the demo narrow and deterministic; the full DMAC→RAM→GIF round trip is what the integration-tier tb_ee_core_gif_* family covers).
  • tb_gs_psmt4_round_trip.sv (Ch104) — full driver-shaped PSMT4 + CLD=4 + CSA round trip. Wires gs_stub + vram_stub + clut_stub + clut_loader_stub + gs_pcrtc_stub end-to-end with pcrtc.clut_csa = gs_stub.tex0_1_csa_q (the Ch98 sideband-free pattern). Phase 1: stages a 4×4 PSMT4 sprite in VRAM, plus a 16-entry pattern_a palette in VRAM at CBP_A*256. Drives TEX0_1 with CBP=4, CPSM=PSMCT32, CSM=CSM2, CSA=0, CLD=4; the loader writes pattern_a into clut_stub[0..15] and pcrtc.clut_csa is 0, so PSMT4 scanout reads pattern_a per nibble. Phase 2: stages a different pattern_b palette at CBP_B*256 and drives TEX0_1 with CBP=8, CSA=4, CLD=4; the loader writes pattern_b into clut_stub[64..79] (the CSA=4 window) and pcrtc.clut_csa flips to 4, so the same VRAM sprite — same DISPFB1 / DISPLAY1 / PMODE — now reads pattern_b. Proves loader policy + clut_stub contents + PCRTC lookup are wired consistently.
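
The PSMCT32 address chain that the Ch119 row describes (8 KiB pages, a 4×8 grid of 8×8-pixel blocks, row-major bytes inside a block) can be sketched in Python as a reading aid. This is a hypothetical reference model, not the RTL: the function name swizzle_psmct32 is borrowed from the TB description, and the block table below is transcribed from memory of the PCSX2 PSMCT32 block order, so it is an assumption that should be checked against the pinned source before anyone relies on it.

```python
# Block order inside one 8 KiB PSMCT32 page (4 rows x 8 columns of 8x8 blocks).
# ASSUMPTION: transcribed from memory of the PCSX2 table; verify before use.
BLOCK_TABLE32 = [
    [0, 1, 4, 5, 16, 17, 20, 21],
    [2, 3, 6, 7, 18, 19, 22, 23],
    [8, 9, 12, 13, 24, 25, 28, 29],
    [10, 11, 14, 15, 26, 27, 30, 31],
]


def swizzle_psmct32(fbp, fbw, x, y):
    """Map (FBP, FBW, x, y) to a VRAM byte address (illustrative model only)."""
    page_x, page_y = x // 64, y // 32        # PSMCT32 pages are 64x32 pixels
    bx, by = (x % 64) // 8, (y % 32) // 8    # block column / row inside the page
    xb, yb = x % 8, y % 8                    # pixel position inside the block
    page_index = page_y * fbw + page_x       # FBW counts 64-pixel units
    return (fbp * 2048                       # FBP is in 2048-byte units
            + page_index * 8192
            + BLOCK_TABLE32[by][bx] * 256
            + yb * 32 + xb * 4)              # row-major, 4 bytes per pixel


# Multi-page sanity point quoted in the Ch119 row.
print(swizzle_psmct32(0, 2, 96, 16))         # 14336

# Bijectivity over one page: 64*32 pixels must land on 2048 distinct word slots.
page = {swizzle_psmct32(0, 1, x, y) for y in range(32) for x in range(64)}
print(len(page))                             # 2048
```

As the Ch119 row points out, the bijectivity check alone would still pass for a wrong table that happens to be a valid permutation of 0..31, which is why the TB locks the table against literal values separately.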

Scope (current, after Ch165):

  • PSMCT32 (DISPFB1.PSM=0), PSMCT16 (PSM=2), PSMT8 (PSM=0x13), and PSMT4 (PSM=0x14) honored at BOTH the read and write sides (Ch94 + Ch95 + Ch96 + Ch97 + Ch103 + Ch105 + Ch106). PSMCT24/PSMCT16S/PSMZ32/etc. force scanout off and are not contract-tested at the raster channel. The write side (gs_stub.raster_pixel_emit) emits the four supported PSMs via raster_pixel_be_q (per-byte gate) and raster_pixel_mask_q (per-bit merge mask, Ch106): PSMCT32 = be 0xF / mask 0xFFFFFFFF, PSMCT16 = be 0x3 / mask 0xFFFFFFFF, PSMT8 = be 0x1 / mask 0xFFFFFFFF, PSMT4 = be 0x1 / mask 0x0F or 0xF0. The mask path is no-op for byte-or-larger PSMs (mem[i] = data[i] when mask_i = 0xFF) and only meaningful for PSMT4 sub-byte writes. PSMT8 / PSMT4 scanout surfaces the index/nibble as grayscale by default; with clut_enable=1 (Ch97/Ch103) and a programmed clut_stub, the index/nibble looks up real RGB. CLUT contents come either from a TB-direct write OR (Ch99..Ch102) from a VRAM→CLUT load triggered by a TEX0_1 GIF write with CSM == 1 (CSM2 linear), CPSM ∈ {PSMCT32, PSMCT16}, and a CLD value passing the policy: CLD=0 never; CLD=1 always (full 256-entry load); CLD=2 if CBP changed since last load (full); CLD=3 if CBP/CPSM/CSA any-changed (full); CLD=4 always but only the 16-entry CSA window at indices CSA*16 + i (Ch102 — preserves the other 240 entries); CLD ∈ {5..7} silently no-op (reserved). clut_loader_stub walks the entries via vram_stub's second read port; PSMCT16 entries are unpacked with the same 5→8 bit-replicate the scanout side uses (Ch94). CSM1 swizzle and CPSM ∉ {PSMCT32, PSMCT16} remain deferred.
  • Single CRTC, single DISPFB. Real PS2 has two interlace-capable CRTCs (DISPFB1, DISPFB2). One context is enough for TBs to verify the round trip; PMODE.EN2 + DISPFB2 + DISPLAY2 is deferred.
  • Read-side addressing. Linear by default (legacy formula vram_read_addr = FBP*2048 + ((effective_y*FBW*64 + effective_x) << bpp_shift)). Four OPTIONAL per-PSM swizzle paths gated by parameters on gs_pcrtc_stub: PSMCT32_SWIZZLE=1 (Ch120) routes PSMCT32 reads through gs_swizzle_psmct32_stub; PSMCT16_SWIZZLE=1 (Ch126) routes PSMCT16 reads through gs_swizzle_psmct16_stub; PSMT8_SWIZZLE=1 (Ch132) routes PSMT8 reads through gs_swizzle_psmt8_stub (Ch131) — FBW must be even because PSMT8 pages are 128 px wide and the swizzle internally divides FBW by 2; PSMT4_SWIZZLE=1 (Ch138) routes PSMT4 reads through gs_swizzle_psmt4_stub (Ch137); FBW must be even (same as PSMT8). The four parameters are independent — enabling one doesn't affect the others. PSMT4's swizzle module also outputs a nibble_hi selector that PCRTC uses in place of pixel_index[0] to pick which nibble of the byte at the swizzled address holds this pixel (PSMT4 packs 2 pixels per byte and the canonical PCSX2 column table reorders nibbles within a block, so the linear formula's pixel_index[0] selector is no longer correct under the swizzled layout). All four swizzle parameter defaults are 0 so all existing PCRTC-using TBs see the legacy linear behavior unchanged. The PSMT4 image-xfer (Ch139) and raster (Ch140) write-side wiring is now live as well, completing the four-PSM × three-path swizzle integration. Both driver-shape e2e demos for PSMT4 are also live: raster-driven (Ch141) and TRXDIR-driven (Ch142). All four common GS PSMs now have BOTH driver-shape e2e demos (CT32 Ch123+Ch124, CT16 Ch129+Ch130, T8 Ch135+Ch136, T4 Ch141+Ch142) — closing the four-PSM × three-path × dual-driver-shape e2e foundation.
  • Parallel to platform_video_stub, not a replacement. We did not extend platform_video_stub (which would have rippled through 6 existing TBs). Pcrtc is the alternative video source for TBs that want VRAM-backed scanout. The legacy flood-fill module stays as-is.
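
The CLD load policy listed in the first scope bullet reads naturally as a small decision function. The sketch below is a Python paraphrase of that policy only; clut_load_plan and its argument layout are hypothetical names, and how the "previous load" state is initialised at reset is deliberately left to the caller because the text above does not state it.

```python
PSMCT32, PSMCT16 = 0x00, 0x02   # PSM encodings quoted in the scope list


def clut_load_plan(cld, csm, cpsm, cbp, csa, last):
    """Return the CLUT entry indices a TEX0_1 write would reload.

    `last` is the (cbp, cpsm, csa) tuple of the previous load (caller-owned
    state). An empty list means the write triggers no load.
    """
    # Gate: only CSM == 1 (CSM2 linear) with a PSMCT32/PSMCT16 palette is handled.
    if csm != 1 or cpsm not in (PSMCT32, PSMCT16):
        return []
    full = list(range(256))
    if cld == 1:                               # always, full 256-entry load
        return full
    if cld == 2:                               # only if CBP changed
        return full if cbp != last[0] else []
    if cld == 3:                               # if any of CBP/CPSM/CSA changed
        return full if (cbp, cpsm, csa) != last else []
    if cld == 4:                               # always, 16-entry CSA window only
        return [csa * 16 + i for i in range(16)]
    return []                                  # CLD=0 never; CLD 5..7 no-op


# Ch104 phase 2: CBP=8, CSA=4, CLD=4 touches only clut_stub[64..79].
window = clut_load_plan(4, 1, PSMCT32, 8, 4, last=(4, PSMCT32, 0))
print(window[0], window[-1])                   # 64 79
```

Returning the index list rather than a boolean makes the CLD=4 case explicit: the other 240 entries are simply absent from the plan, which is the "preserves the other 240 entries" property the scope bullet calls out.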

End-to-end demo manifest (Ch143)

Eight driver-shaped end-to-end byte-accurate demos cover the four common GS PSMs across both driver shapes (raster-driven PACKED-SPRITE payload + TRXDIR-driven IMAGE payload). Each demo runs the same EE-bootlet → DMAC → GIF → GS → vram → swizzled-PCRTC chain with all three same-PSM swizzle gates parameter-set to 1; the listed write-side path is load-bearing and the other write-side path is asserted dormant in the demo flow.

All eight demos emit a 16×8 framebuffer (128 pixels). The raster column shows (emits, xfer_writes); the TRXDIR column shows (xfer_writes, emits) — in both cases the load-bearing path fires 128 times and the dormant path is asserted 0.

| PSM | Raster-driven e2e | TRXDIR-driven e2e |
|---|---|---|
| PSMCT32 | Ch123 — tb_gs_demo_psmct32_swizzle_e2e (128, 0) | Ch124 — tb_gs_demo_psmct32_swizzle_trxdir_e2e (128, 0) |
| PSMCT16 | Ch129 — tb_gs_demo_psmct16_swizzle_e2e (128, 0) | Ch130 — tb_gs_demo_psmct16_swizzle_trxdir_e2e (128, 0) |
| PSMT8 | Ch135 — tb_gs_demo_psmt8_swizzle_e2e (128, 0) | Ch136 — tb_gs_demo_psmt8_swizzle_trxdir_e2e (128, 0) |
| PSMT4 | Ch141 — tb_gs_demo_psmt4_swizzle_e2e (128, 0) | Ch142 — tb_gs_demo_psmt4_swizzle_trxdir_e2e (128, 0) |

For each row both demos use the same per-quadrant pixel pattern (so the verify side is shared across the row), the same DBW-even constraint where applicable (PSMT8 / PSMT4: 128-px-wide pages → DBW=2 minimum even), and verification through the freed-up vram_stub 2nd read port. Ch141 + Ch142 together close the four-PSM × three-path × dual-driver-shape e2e foundation — the foundation Ch143 manifests and seals.
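
The 128-pixel frame size and the per-PSM packing quoted in the image-xfer TB rows (4, 8, 16, and 32 pixels per 128-bit qword) fix how many IMAGE qwords a TRXDIR-driven payload has to carry. The following Python arithmetic sketch makes that relationship explicit. The helper name image_qwords is hypothetical, and only the PSMT4 result (4 qwords, matching the NLOOP=4 quoted for Ch142) is stated in this document; the other three counts are derived here and are not quoted from the TBs.

```python
BITS_PER_PIXEL = {"PSMCT32": 32, "PSMCT16": 16, "PSMT8": 8, "PSMT4": 4}


def image_qwords(psm, width, height):
    """Number of 128-bit IMAGE qwords needed to carry a width x height rect."""
    pixels_per_qword = 128 // BITS_PER_PIXEL[psm]
    total = width * height
    if total % pixels_per_qword:
        raise ValueError("rect does not fill a whole number of qwords")
    return total // pixels_per_qword


# 16x8 demo frame for each of the four supported PSMs.
for psm in BITS_PER_PIXEL:
    print(psm, image_qwords(psm, 16, 8))
```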

Hardware-demo candidates:

  • PSMCT32 swizzled raster e2e (Ch123) — simplest direct-color path: 4 SPRITE PACKED packets, RGBAQ.{R,G,B,A} mapped 1:1 to scanout RGB, no CLUT, no nibble RMW. The natural first hardware demo because every byte from EE-bootlet through the swizzled 16×8 framebuffer to PCRTC RGB is visible without any indirection. Build target: make tb_gs_demo_psmct32_swizzle_e2e.
  • PSMT4 swizzled TRXDIR e2e (Ch142) — strongest indexed/CLUT-like stress path: U1 PACKED A+D TRX setup + U2 IMAGE NLOOP=4 with 32 PSMT4 nibbles per qword, image-xfer engine decoding the canonical PCSX2 columnTable4 (which reorders nibbles within a block — the linear pixel_index[0] rule is wrong under swizzle), and per-pixel nibble RMW on vram_stub via write_be=4'b0001 + write_mask ∈ {0x0F, 0xF0} keyed by the swizzle's nibble_hi. Exercises the full sub-byte pipeline + the canonical-source-locked column table. Build target: make tb_gs_demo_psmt4_swizzle_trxdir_e2e.

First hardware-targeted top wrapper (Ch146)

Ch146 turns the Ch144 readiness audit + Ch145 BRAM-shrink groundwork into a real top-level SystemVerilog module: rtl/top/top_psmct32_raster_demo.sv. This is the module a board-level synthesis project would target first. Board-level concerns (HDMI/VGA PHY, pin constraints, .mem bake tooling, clock-domain crossings) are deliberately deferred; Ch146 proves the design can be expressed as a single hardware-shape module.

Top ports:

  • clk / rst_n / core_go — clock, active-low synchronous reset, start pulse (a board reset-release sequencer can tie core_go high after rst_n deasserts).
  • r/g/b/hsync/vsync/de — 8-bit RGB scanout from PCRTC.
  • core_halt / dma_done_seen / frame_seen — debug/status bundle suitable for LEDs or a board-level state observer.

Top parameters: H_ACTIVE (default 16), V_ACTIVE (default 8), BIOS_SIZE_BYTES, RAM_SIZE_BYTES, VRAM_BYTES, USEG_SHADOW_WORDS_PARAM (default 1024 = 4 KiB per Ch145).

Image fixtures are passed via macros (iverilog-12 string-parameter forwarding limitation): TOP_PSMCT32_RASTER_DEMO_BIOS_IMAGE_FILE and TOP_PSMCT32_RASTER_DEMO_PAYLOAD_IMAGE_FILE. The fixtures are baked by sim/data/top_psmct32_raster_demo/bake.py which writes:

  • bios.mem — 18-word EE bootlet (one 32-bit hex word per line)
  • payload.mem — 40 qwords for ee_ram_stub (16 zero qwords + 24 GIF qwords carrying 4 SPRITE PACKED packets)

The bake script is a deterministic Python rewrite of the procedural ee_prog_word() + preload_qword() loops in the Ch123 TB. Same bit-exact values, just baked into static repo artifacts so a hardware top can $readmemh them.
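As a rough illustration of the fixture shape described above, the sketch below renders a zero-padded $readmemh-style image in Python. It is not the actual bake.py: the function name format_mem_image, the placeholder contents, and the assumption that payload.mem holds one 128-bit qword (32 hex digits) per line are all hypothetical.

```python
def format_mem_image(active_words, depth, hex_digits):
    """Render a $readmemh-style image: one hex value per line, zero-padded to `depth` lines."""
    active = list(active_words)
    if len(active) > depth:
        raise ValueError("more active entries than the image depth")
    padded = active + [0] * (depth - len(active))
    limit = 1 << (4 * hex_digits)
    lines = []
    for value in padded:
        if not 0 <= value < limit:
            raise ValueError("value does not fit in the requested width")
        lines.append(f"{value:0{hex_digits}x}")
    return "\n".join(lines) + "\n"

# Placeholder contents only: the real bootlet words and GIF qwords are not reproduced here.
bios_text = format_mem_image(range(18), depth=1024, hex_digits=8)
# Assumption: one 128-bit qword per line, 16 zero qwords followed by 24 payload qwords.
payload_text = format_mem_image([0] * 16 + [1] * 24, depth=256, hex_digits=32)
```

Padding to the full depth keeps the line count fixed regardless of how many entries are active, which is what makes a simple line-count check meaningful.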

Focused TB: sim/tb/top/tb_top_psmct32_raster_demo.sv. Drives the top with the static fixtures, captures one full PCRTC frame after the EE halts and DMAC completes, and asserts the per-quadrant RGB matches the Ch123 frame exactly. Counts: raster_emits=128, errors=0, core_halt=1, dma_done_seen=1, frame_seen=1.

Bug-fix iteration: the first bake had Y in XYZ2 placed at bits[43:32] instead of bits[31:20] — a Python translation error of the SystemVerilog {32'd0, y_int, 4'd0, x_int, 4'd0} concatenation. Symptom: per-sprite emit count was 8 instead of 32 (each sprite drew one row), and VRAM held the per-sprite R component scattered across 32 consecutive 4-byte cells. Caught by adding a per-emit observer that printed (addr, data, be, mask, color_q) for the first 10 emits. Fix: y << 20 instead of y << 32 in bake.py. PASS after the fix.
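The bit placement behind this bug can be sketched in Python. The helper names below are hypothetical and the upper 32 bits are simply left at zero, matching the 32'd0 in the concatenation; this is an illustration of the shift arithmetic, not the code in bake.py.

```python
def pack_xyz2(x_int, y_int):
    """Pack integer pixel coordinates as {32'd0, y_int, 4'd0, x_int, 4'd0} (12-bit fields)."""
    if not (0 <= x_int < (1 << 12) and 0 <= y_int < (1 << 12)):
        raise ValueError("coordinates must fit in 12 bits")
    # x_int occupies bits [15:4], y_int occupies bits [31:20]; fraction nibbles are zero.
    return (y_int << 20) | (x_int << 4)

def pack_xyz2_buggy(x_int, y_int):
    """The mistranslation described above: Y lands in bits [43:32] instead of [31:20]."""
    return (y_int << 32) | (x_int << 4)

good = pack_xyz2(8, 4)
bad = pack_xyz2_buggy(8, 4)
y_seen_good = (good >> 20) & 0xFFF   # the Y field as read back from bits [31:20]
y_seen_bad = (bad >> 20) & 0xFFF     # reads back as zero, so every sprite collapses to one row
```

With the buggy shift the Y field at bits [31:20] is zero for both sprite corners, which is consistent with each sprite drawing a single row.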

What's still NOT in this chapter (deferred to Ch147+):

  • Real .mem bake tooling integration (currently the bake.py is run manually before sim; a Makefile target or CI step that invokes it would belong in Ch147).
  • Board-specific top: pin constraints, target FPGA family, PHY shim (HDMI/DVI/VGA), reset-release sequencer.
  • A multi-PSM top (the Ch142 PSMT4 TRXDIR variant would be a natural second wrapper once the build flow is proven).

Fixture bake flow (Ch147)

Ch147 makes the Ch146 .mem bake first-class so the static fixtures can't drift from bake.py. Three new Makefile targets:

| Target | Purpose |
| --- | --- |
| top_psmct32_raster_demo_mem | Re-runs bake.py; produces bios.mem + payload.mem atomically. |
| top_psmct32_raster_demo_mem_check | Verifies fixture sizes (bios.mem = 1024 lines, payload.mem = 256). |
| tb_top_psmct32_raster_demo (existing) | Now declares top_psmct32_raster_demo_mem as a prerequisite. |

The bake target uses Make's grouped-target syntax (&:) so a single bake.py run produces both files atomically — they can never be out-of-step.

The size-check target counts payload lines (skipping blanks + // ... comment-only lines) and asserts the exact expected counts. A non-matching count exits with status 1, surfacing a fixture/script drift as a hard build failure.
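The counting rule can be sketched as follows. This is a minimal Python model of the check as described, not the Makefile recipe itself; count_payload_lines and check_fixture are hypothetical names.

```python
def count_payload_lines(text):
    """Count lines that carry data, skipping blanks and //-style comment-only lines."""
    count = 0
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("//"):
            continue
        count += 1
    return count

def check_fixture(text, expected):
    """Return a process-style status: 0 when the count matches, 1 on drift."""
    return 0 if count_payload_lines(text) == expected else 1

sample = "// header comment\n\n00000000\n00000011\n   \n// trailer\n"
status_ok = check_fixture(sample, expected=2)
status_drift = check_fixture(sample, expected=3)
```

Returning 1 on any mismatch is what lets a build treat fixture drift as a hard failure rather than a warning.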

Deleting the fixtures and running the TB triggers the bake automatically:

$ make tb_top_psmct32_raster_demo
=== bake top_psmct32_raster_demo .mem fixtures ===
python3 .../bake.py
[bake] wrote bios.mem (1024 words, 18 active) and payload.mem (256 qwords, 40 active)
=== build tb_top_psmct32_raster_demo ===
...
[tb_top_psmct32_raster_demo] PASS

Synthesis-facing macros

When pointing a synthesis tool at rtl/top/top_psmct32_raster_demo.sv, two preprocessor defines must be set so bios_rom_stub and ee_ram_stub find their $readmemh images. These are macros (NOT module parameters) per the iverilog-12 string-parameter forwarding workaround documented in the Ch146 wrapper banner; they map cleanly to FPGA-tool defines.

| Macro | Value |
| --- | --- |
| TOP_PSMCT32_RASTER_DEMO_BIOS_IMAGE_FILE | Absolute (or tool-relative) path to bios.mem |
| TOP_PSMCT32_RASTER_DEMO_PAYLOAD_IMAGE_FILE | Absolute (or tool-relative) path to payload.mem |

Both default to "" so the wrapper still elaborates without fixtures (synthetic NOP-sled in bios_rom_stub + zero-init ee_ram_stub, which produces no DMAC payload but a stable PCRTC frame with r=g=b=0).

Vivado (preprocessor verilog_define on the synthesis + implementation filesets — these are macros, not module generics):

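# Built with [list ...] so $path is substituted; a braced list would pass the literal text "$path".
set_property verilog_define [list \
    TOP_PSMCT32_RASTER_DEMO_BIOS_IMAGE_FILE=\"$path/bios.mem\" \
    TOP_PSMCT32_RASTER_DEMO_PAYLOAD_IMAGE_FILE=\"$path/payload.mem\" \
] [get_filesets sources_1]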

Repeat for the implementation fileset if it diverges from sources_1.

Quartus (project-level macro defines):

set_global_assignment -name VERILOG_MACRO \
    "TOP_PSMCT32_RASTER_DEMO_BIOS_IMAGE_FILE=\"$path/bios.mem\""
set_global_assignment -name VERILOG_MACRO \
    "TOP_PSMCT32_RASTER_DEMO_PAYLOAD_IMAGE_FILE=\"$path/payload.mem\""

Iverilog (sim): the Ch147 Makefile passes them via -D flags in the tb_top_psmct32_raster_demo build rule — -DTOP_PSMCT32_RASTER_DEMO_BIOS_IMAGE_FILE='"$(SIM_DIR)/data/... /bios.mem"' — and the top_psmct32_raster_demo_mem prerequisite ensures the .mem files exist before the TB elaborates.

DE25-Nano synthesis scaffold (Ch148)

Ch148 makes the Ch146 hardware top synthesis-addressable on DE25-Nano without committing to a video PHY shim or final pin constraints (those land in Ch149+).

| File / target | Purpose |
| --- | --- |
| synth/de25_nano/top_psmct32_raster_demo/files.f | RTL filelist: Ch123 dep tree only (~14 entries). |
| synth/de25_nano/top_psmct32_raster_demo/README.md | Top module + macros + fixtures + DE25-Nano clock/reset/video assumptions. |
| make top_psmct32_raster_demo_synth_check | Validates files.f paths + fixture presence. |

The synth-check target depends on top_psmct32_raster_demo_mem_check, so a single command verifies fixture sizes AND that every file referenced by the synth filelist exists. It exits non-zero on any miss — surfacing both fixture drift (Ch147 size guard) and filelist drift as hard build failures.

.qsf (Quartus pin assignments) is not committed in Ch148. The README documents the board assumptions (clock domain, reset polarity, core_go strategy, video-out path candidates, LED status mapping) so the next chapter can author it without inventing context. The point of Ch148 is that a Quartus project import (or Vivado / verilator --lint-only) finds every file the design needs, with the macros documented end-to-end.

DE25-Nano board wrapper (Ch149)

Ch149 turns the Ch146 board-agnostic top into a real board top without yet committing to pin assignments or a video PHY. New:

| Artifact | Purpose |
| --- | --- |
| rtl/top/de25_nano_psmct32_raster_demo_top.sv | Board wrapper: DE25-Nano signal names + reset sequencer + LED status. |
| sim/tb/top/tb_de25_nano_psmct32_raster_demo_top.sv | Smoke TB exercising clock/reset/core_go/LED/video pins. |

Top ports (matching the Terasic Golden_top.v conventions from the DE25-Nano resource CD): CLOCK0_50 / CLOCK1_50 / CLOCK2_50, KEY[1:0] (active-LOW), SW[3:0], LED[7:0] (active-LOW), and raw VIDEO_R/G/B/HSYNC/VSYNC/DE outputs that a future PHY shim will consume.

Reset bridge:

  1. ninit_done sourced from Terasic's reset_release IP under `ifdef USE_TERASIC_RESET_RELEASE_IP (default-off; sim uses an inline 16-cycle stub matching the IP's shape).
  2. KEY[0] + ninit_done feed an async-assert/sync-deassert 2-stage shift register on CLOCK2_50. Mirrors the retroDE_nes pattern at retroDE_nes.sv:170-177.

core_go sequencer: 16-cycle delay after core_rst_n deasserts, then a one-cycle core_go pulse. Matches the "recommended hardware path" documented in the Ch148 README and the level-sensitive go_i semantics at ee_core_stub.sv:812-813.

LED status: the Ch146 wrapper's three sticky status outputs drive LED[2:0] (active-LOW): LED[0] = ~core_halt, LED[1] = ~dma_done_seen, LED[2] = ~frame_seen. LED[7:3] tied HIGH (OFF).
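The active-LOW mapping can be modeled in a few lines of Python. This is only a sketch of the bit arithmetic; led_bits is a hypothetical helper, not a signal or function in the RTL.

```python
def led_bits(core_halt, dma_done_seen, frame_seen):
    """Return the 8-bit LED[7:0] value for active-LOW LEDs (0 = lit, 1 = off)."""
    status = (
        (1 if core_halt else 0)
        | ((1 if dma_done_seen else 0) << 1)
        | ((1 if frame_seen else 0) << 2)
    )
    # LED[2:0] carry the inverted status bits; LED[7:3] are tied HIGH (off).
    return 0xF8 | (~status & 0x07)

all_off = led_bits(0, 0, 0)        # nothing latched yet: every LED off
frame_only = led_bits(0, 0, 1)     # only frame_seen latched: LED[2] lit
all_latched = led_bits(1, 1, 1)    # all three status LEDs lit
```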

Smoke TB counts: core_go_pulses=1, all three status LEDs eventually latch (the actual fall-edge order is frame_seen first, then core_halt, then dma_done_seen; frame_seen is a "PCRTC alive" indicator that fires on the first empty frame after reset, well before the bootlet runs), and VIDEO_DE rises inside the active region. Standalone PASS.

.qsf (pin assignments), PLL, and video PHY shim remain deferred (Ch150+). Ch149 makes the design board-shaped, not yet board-pinned.

Quartus scaffold for DE25-Nano (Ch150)

Ch150 commits the first real Quartus artifacts for the Ch149 board wrapper — a minimal .qsf + .sdc pair, deliberately PHY-light:

| File | Purpose |
| --- | --- |
| synth/de25_nano/top_psmct32_raster_demo/de25_nano_psmct32_raster_demo_top.qsf | Device + family + pin assignments + IO standards + .mem macros + file list. |
| synth/de25_nano/top_psmct32_raster_demo/de25_nano_psmct32_raster_demo_top.sdc | CLOCK2_50 50 MHz clock + reset-sync false-path + IO false-paths. |
| make top_psmct32_raster_demo_quartus_scaffold_check | Validates both files exist + top entity + pins + clock period. |

Device (sourced from retroDE_splash.qsf): Agilex 5 A5EB013BB23BE4SCS, package VPBGA. Top entity: de25_nano_psmct32_raster_demo_top (the Ch149 board wrapper, NOT the inner Ch146 module). Pin assignments match the DE25-Nano board pinout used by retroDE_splash and retroDE_nes: CLOCK2_50 → PIN_BF23, KEY[0] → PIN_C8, LED[2:0] → PIN_DN22 / PIN_DJ32 / PIN_DF35. CLOCK0/1_50, KEY[1], SW[3:0], and LED[7:3] are also assigned (their canonical pins) so Quartus doesn't flag them as unconstrained inputs/outputs even though the Ch149 wrapper ties them off.

SDC (sourced from retroDE_splash.sdc): a single 50 MHz create_clock on CLOCK2_50, the standard reset-sync first-stage false-path (set_false_path -to [get_registers -nowarn {*rst_sync[0]}]), and IO false paths for KEY[*], SW[*], LED[*] plus the as-yet-unpinned VIDEO_* outputs (replaced by real set_output_delay constraints when the PHY shim lands).

.mem macros baked into the QSF (project-relative paths): TOP_PSMCT32_RASTER_DEMO_BIOS_IMAGE_FILE = sim/data/top_psmct32_raster_demo/bios.mem and the matching payload macro. Run make -C sim top_psmct32_raster_demo_mem before launching Quartus.

USE_TERASIC_RESET_RELEASE_IP is not defined in this QSF — keeping the wrapper self-contained for the first project import. To wire in Terasic's reset_release IP, define the macro and add the IP file from DE25_Nano_ResourceCD/Demonstration/FPGA/Board_Info_RTL/reset_release/.

Deferred to Ch151+: video PHY pins + shim (HDMI ADV7513 + I²C config FSM, VGA DAC, or PMOD), PLL .ip config, LPDDR4 / SDRAM / HPS / CAM / UART / GPIO assignments. Ch150 makes the project Quartus-importable, not yet Quartus-buildable for video output.

PLL + lock-gated reset (Ch151)

Ch151 adds the most conservative hardware bring-up step before touching the video PHY: a board-clock PLL on the path between CLOCK2_50 and the design clock, with the reset bridge gated on PLL lock so the design can only leave reset once the PLL is stable.

| Artifact | Purpose |
| --- | --- |
| rtl/top/de25_nano_pll_stub.sv | Sim stub matching the Quartus IOPLL pll module signature. |
| rtl/top/de25_nano_psmct32_raster_demo_top.sv (Ch151) | Reworked with PLL instantiation + lock-gated reset bridge + design_clk distribution to the Ch146 wrapper and core_go sequencer. |
| tb_de25_nano_psmct32_raster_demo_top (Ch151 update) | Adds rising-edge timestamps for pll_locked / core_rst_n / core_go and asserts the contract pll_locked < core_rst_n < core_go. |

PLL signature (matches retroDE_nes/ip/pll/pll_bb.v and retroDE_splash/ip/sys_pll/sys_pll_bb.v):

module pll (
    input  wire  refclk,
    input  wire  rst,
    output wire  outclk_0,
    output wire  locked
);

Sim stub behavior: outclk_0 = refclk (pass-through, no multiplication — sim doesn't need a different frequency, and a pass-through still exercises the lock-gated reset bridge). locked rises after 32 cycles with rst low; held LOW while rst is HIGH.

Reset gating: the board top's rst_sync register async-asserts on (ninit_done | ~pll_locked) — both FPGA init AND PLL lock must complete before reset can deassert.

Synth swap: define USE_PLL_IP and add a Quartus IOPLL .qip to the project; the board wrapper's `ifdef USE_PLL_IP swaps the stub for the real IP. The QSF documents the swap mechanism but ships with the IP commented out, keeping the scaffold self-contained until the PLL chapter (Ch152+) commits a frequency choice + IP file.

TB contract (smoke output): t_pll/rstn/go=(950000, 990000, 1330000), with timestamps in ps. PLL locks at 950 ns, reset deasserts 40 ns later (the 2-stage sync register prop), core_go fires 340 ns later (the GO_DELAY=16 wait). Order assertions catch any future regression of the gating.
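The ordering contract and the quoted deltas can be checked with a small Python sketch. It reads the three timestamps as picoseconds, which is what the 950 ns / 40 ns / 340 ns figures imply, and assumes a 20 ns design clock (50 MHz passed through the PLL stub). The function name is hypothetical.

```python
CLK_PERIOD_PS = 20_000  # assumption: 50 MHz CLOCK2_50 passed straight through the sim PLL stub

def check_reset_order(t_pll_ps, t_rstn_ps, t_go_ps):
    """Enforce pll_locked < core_rst_n < core_go and return the two gaps in clock cycles."""
    if not (t_pll_ps < t_rstn_ps < t_go_ps):
        raise ValueError("gating order violated: expected pll_locked < core_rst_n < core_go")
    lock_to_rstn = (t_rstn_ps - t_pll_ps) // CLK_PERIOD_PS
    rstn_to_go = (t_go_ps - t_rstn_ps) // CLK_PERIOD_PS
    return lock_to_rstn, rstn_to_go

cycles = check_reset_order(950_000, 990_000, 1_330_000)
```

Under these assumptions the gaps come out as 2 and 17 cycles; the first matches the 2-stage synchronizer, and the second is one cycle more than the 16-cycle GO_DELAY, presumably the pulse register itself.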

Deferred to Ch152+: real PLL output frequency tuning (the stub passes refclk through; a real build sets outclk_0 to whatever the video PHY chapter needs), committing the actual IOPLL .ip file under synth/de25_nano/.../ip/, the video PHY shim itself.

First Quartus compile + baseline report (Ch152)

Ch152 is the chapter where the toolchain is finally asked the honest question: "does this DE25-Nano board top synthesize, fit, and pass static timing analysis?"

Driver: synth/de25_nano/top_psmct32_raster_demo/build_quartus.sh runs quartus_syn → quartus_fit → quartus_sta against the Ch150 QSF + Ch151 PLL stub. quartus_asm (bitstream gen) is deliberately skipped — Ch152 is a compile-and-report smoke, not a deploy path. USE_PLL_IP is left undefined so the Ch151 self-contained PLL stub stays under test (per Codex framing).

Make targets:

| Target | Action |
| --- | --- |
| make quartus_compile | Full syn + fit + sta flow. |
| make quartus_compile_clean | Wipe outputs first, then full flow. |
| make quartus_syn_only | Synthesis only (~14 min smoke). |
| make quartus_compile_report | Run parse_reports.py on the latest output. |

Ch152 RTL fixes that landed before synthesis would even elaborate:

| Issue | Fix |
| --- | --- |
| QSF line-continuation (\) parse error in set_global_assignment -name VERILOG_MACRO | Collapsed each assignment onto a single line. |
| vram_stub.mem 8192-iter init loop exceeded Quartus's 5000-iter synthesizable-loop limit (Error 13356) | Wrapped initial block in // synthesis translate_off / _on pragmas. Real Altera/Intel BRAM is power-on-zero so the procedural loop is sim-only. |
| gs_pcrtc_stub / gif_image_xfer_stub / gs_stub unconditionally instantiate all four swizzle math primitives even when their gate is 0 | Added gs_swizzle_psmct16/8/4_stub.sv to the synth filelist + QSF (iverilog trimmed silently; Quartus errors). |
| gs_stub.interp_byte (Ch86 Gouraud TRI math) 64-bit signed divide hits Quartus Pro's lpm_divide LPM_WIDTHN ≤64 limit (Error 272006) | Wrapped divide in // synthesis translate_off; default fallback returns 0. The Ch123 SPRITE-only demo doesn't exercise Gouraud TRIs, so this is dead code in the build. A future Gouraud-TRI hardware demo would need a divider redesign sized for Agilex 5. |
| QSF SDC_FILE referenced via repo-root-relative path failed when the build script ran Quartus from a per-build work dir (Warning 16124) | Changed to basename-only, which works from either the repo root or the work dir (the script symlinks the SDC alongside the QSF). |

First successful synthesis: 0 errors, 3 warnings, 14:08 elapsed. 160 RAM segments + 26 DSP elements inferred.

Fitter result — design too large for the part (the chapter's honest answer):

Total dedicated logic registers : 121,176
Total pins                      :      17 / 351      ( 5 %)
Total block memory bits         :  65,536 / 7,331,840 (<1 %)
Total RAM Blocks                :       6 / 358      ( 2 %)
Total DSP Blocks                :      20 / 188      (11 %)
Logic utilization (ALMs needed) : 155,104 / 46,800   (331 %)

The design needs 155,104 ALMs vs the part's 46,800 — 3.31× oversized. Error (170011): Design contains 260,263 blocks of type combinational node. However, the device contains only 93,600 blocks.

Why so big (the precise picture, to be drilled into by Ch153+):

The synthesis log reports Info (22567): extracting RAM for all four memory identifiers — ee_ram_stub.mem, bios_rom_stub.mem, ee_memory_map_stub.useg_shadow_mem, and vram_stub.mem — so Quartus did recognize each as a memory structure at syn time. But the fit report shows only 65,536 bits / 6 RAM Blocks committed (roughly enough for BIOS 4 KB + EE-RAM 4 KB). Something between syn and fit caused the larger arrays — most likely vram_stub.mem (8 KB) and possibly useg_shadow_mem (4 KB after Ch145's 1024-word shrink) — to either (a) be replicated into combinational mux/decoder logic because of their access-port shape, or (b) lose their RAM attribute during fitter optimization and fall back to flip-flop implementation. The 121,176 dedicated registers + the 260,263 combinational nodes are consistent with at least u_vram getting massively unrolled.
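The budget arithmetic behind this reading is easy to reproduce. The Python sketch below only restates sizes quoted in this document (4 KB BIOS, 4 KB EE-RAM, 8 KB vram, 4 KB useg shadow) and the fitter's ALM figures; the variable names are illustrative.

```python
KB = 1024

def bits(n_bytes):
    """Convert a byte count into a bit count."""
    return n_bytes * 8

committed_bits = 65_536                          # block memory bits in the fit report
bios_plus_ee_ram = bits(4 * KB) + bits(4 * KB)   # the two arrays that appear to have stayed in RAM
missing_bits = bits(8 * KB) + bits(4 * KB)       # vram_stub.mem + useg_shadow_mem, if both fell out
alm_ratio = 155_104 / 46_800                     # ALMs needed vs ALMs available
```

The committed bit count equals the BIOS + EE-RAM total exactly, which is why the two larger arrays are the natural suspects.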

Ch153's job is to isolate which array(s) and which port shape(s) prevent compact block-RAM implementation. The likely candidates: vram_stub's dual read ports + per-byte write_be lane (Ch95's per-byte gate may not be RAM-block-friendly on Agilex 5), and the EE memory map's wide arbitration into the useg-shadow port. None of this is fixed in Ch152; surfacing the gap precisely is the chapter's deliverable.

Other notable findings (full list in output_files/build_logs/):

  • Critical Warning 20759: "Use the Reset Release IP in Agilex 5 designs to ensure a successful configuration." This is the Ch151 `ifdef USE_TERASIC_RESET_RELEASE_IP opt-in; enabling it (and committing the IP file) is a Ch153+ task.
  • 6× Warning 16749: identifiers used before declaration in dmac_reg_stub, gif_packed_stub, gs_stub, gif_image_xfer_stub. Style/lint warnings, no functional impact; clean-up candidate for a future polish chapter.
  • STA never ran because fit failed.

What Ch152 leaves for Ch153+:

  • Resource reduction. Most likely candidates: BRAM-infer vram_stub.mem and useg_shadow_mem cleanly (Quartus attribute hints / restructure read ports), or shrink the EE core's MIPS decode (table-driven vs LUT-driven), or move to a larger Agilex 5 part if available.
  • Enabling USE_TERASIC_RESET_RELEASE_IP and committing the Terasic reset_release IP file.
  • The PHY shim chapter (VIDEO_* virtualized → real HDMI ADV7513 / VGA / PMOD pins).
  • Cleaning up the 6× forward-reference style warnings.

Memory-shape forensics (Ch153)

Ch153 is a memory-forensics chapter (NOT a rewrite chapter): two isolated tiny Quartus projects under synth/de25_nano/experiments/ target the same Agilex 5 part as the Ch150 board top so resource deltas are apples-to-apples. The goal is to identify which feature(s) of vram_stub's shape prevent compact block-RAM implementation and drive the Ch152 size deficit.

| Experiment | Memory shape |
| --- | --- |
| exp_a_bram_friendly | 2048 × 32-bit, single port, sync read + sync write with byte-WE. Intel-friendly BRAM template. |
| exp_b_vram_shape | 8192 × 8-bit, dual COMBINATIONAL read, byte-WE + per-bit mask RMW. Exact vram_stub shape. |

The result is decisive:

| Metric | exp_a (BRAM-friendly) | exp_b (vram_stub-shape) |
| --- | --- | --- |
| Fitter status | Successful | Failed |
| Logic utilization (ALMs) | 46 / 46,800 (< 1 %) | (fit failed; placement reports 257,986 combinational nodes vs 93,600 device max) |
| Total dedicated logic registers | 0 | 65,536 |
| Total RAM Blocks | 4 / 358 (1 %) | 0 / 358 (0 %) |
| Total block memory bits | 65,536 (8 KB) | 0 |

Interpretation:

  • The Intel-friendly shape maps the same 8 KB to 4 RAM Blocks with near-zero logic (46 ALMs) and zero registers beyond the read-output flop.
  • The vram_stub shape maps the same 8 KB to zero RAM Blocks, 65,536 dedicated registers (one flip-flop per bit: 8192 bytes × 8 bits), and 257,986 combinational nodes (the 4-byte concatenation multiplexers for the dual combinational reads + the per-bit mask RMW gates).
  • The 257,986 combinational-node figure for a single 8 KB memory almost exactly matches the 260,263 combinational-node figure Ch152 reported for the entire top-wrapper design — empirical confirmation that u_vram alone accounts for essentially all of the Ch152 size deficit.

Which feature is the dominant cost (the four candidates the shape diff isolates):

The exp_a vs exp_b diff folds four feature changes together (byte-addressable storage, combinational reads, dual reads, per-bit mask RMW). To pin down which feature(s) dominate, a future chapter could insert intermediate experiments — but the exp_a result already gives the upper bound on what BRAM-native inference can buy: ~4 RAM blocks + ~50 ALMs for 8 KB. Anything that gets vram_stub close to that bar wins back the entire Ch152 fit headroom.

The most likely individual culprit is the per-bit mask RMW: Agilex 5's M20K BRAM has byte-WE primitives but does NOT have per-bit RMW. Quartus has to materialize the (mem & ~mask) | (data & mask) arithmetic outside the BRAM, which forces the storage out of BRAM and into per-bit flip-flops. Combinational reads are the second most likely (BRAMs are synchronous-read-only on Agilex 5; Quartus has to either insert a register on the read path or materialize the storage as discrete flip-flops to feed the comb output).
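The difference between the two write styles can be seen in a small Python model of one 32-bit word. It is a behavioral sketch with hypothetical function names, not the vram_stub RTL: a byte-WE write is a masked write whose mask is restricted to whole byte lanes, while a nibble mask such as 0x0F has no byte-WE equivalent.

```python
WORD_MASK = 0xFFFFFFFF

def write_masked(old_word, data, mask):
    """Per-bit read-modify-write: (mem & ~mask) | (data & mask)."""
    return (old_word & ~mask & WORD_MASK) | (data & mask)

def byte_enable_to_mask(be):
    """Expand a 4-bit byte enable into the equivalent whole-byte bit mask."""
    mask = 0
    for lane in range(4):
        if (be >> lane) & 1:
            mask |= 0xFF << (8 * lane)
    return mask

def write_byte_we(old_word, data, be):
    """Byte-WE write expressed through the same masked-write formula."""
    return write_masked(old_word, data, byte_enable_to_mask(be))

nibble_splice = write_masked(0x000000AB, 0x0000000C, 0x0000000F)   # low nibble replaced
lane_write = write_byte_we(0x11223344, 0xAABBCCDD, 0b0101)          # lanes 0 and 2 replaced
# No byte enable produces the 0x0F mask, so the nibble case needs logic outside the RAM.
reachable_masks = {byte_enable_to_mask(be) for be in range(16)}
```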

Make targets:

| Target | Action |
| --- | --- |
| make quartus_experiments | Compile every synth/.../experiments/exp_* project. |
| make quartus_experiments_clean | Wipe outputs first, then compile. |
| make quartus_experiments_report | Side-by-side resource summary (no recompile). |

What Ch153 leaves for Ch154+:

  • Refactor vram_stub into a BRAM-friendly shape: replace combinational reads with sync (registered output) reads, replace per-bit mask RMW with byte-WE-only writes (move the per-pixel sub-byte merging logic into the writer module — most likely gs_stub.raster_pixel_emit for the PSMT4 nibble case), and switch to 32-bit word-addressable storage with byte-WE for the unaligned-byte case.
  • Audit useg_shadow_mem next — it had Info (22567): extracting RAM at synthesis but didn't survive to fit. Likely culprits there: the Ch64 / Ch65 / Ch70 mirror-write features that turn the simple useg-shadow into a multi-port write structure.

BRAM-friendly vram sibling (Ch154)

Ch154 adds a hardware-friendly sibling of vram_stub (rtl/gif_gs/vram_bram_stub.sv) that maps cleanly onto Agilex 5 M20K block-RAM. Per Codex's framing, the chapter's blast radius stays narrow: add the sibling + prove it works + measure the BRAM-inference win. The actual swap of the board top to use the new module + the writer-side PSMT4 nibble-RMW rework lands in Ch155+.

vram_bram_stub shape vs vram_stub:

| Feature | vram_stub (legacy / sim reference) | vram_bram_stub (Ch154, hw-friendly) |
| --- | --- | --- |
| Storage | 8192 × 8-bit byte-addressable | 2048 × 32-bit word-addressable |
| Reads | Combinational; arbitrary alignment | Synchronous (1-cycle); word-aligned only |
| Read ports | 2 (combinational) | 2 (sync, true dual-port M20K) |
| Write granularity | byte-WE + per-bit write_mask RMW | byte-WE only |
| Per-bit mask RMW (Ch106) | yes; supports PSMT4 nibble splice | NO; caller must splice on writer side |

New equivalence TB: tb_vram_bram_stub_equivalence. Drives both DUTs in lockstep with byte-WE-only writes (write_mask = 0xFFFFFFFF on the legacy module so the per-bit RMW path is a no-op), aligns sample times across the new module's 1-cycle sync-read latency, and asserts data equivalence across:

  • 32-bit word writes (be=4'b1111)
  • per-byte-lane writes (be=4'b0001 / 0010 / 0100 / 1000)
  • per-byte non-wrapping admission near MAX_BASE
  • dual-port read agreement

PASS standalone + in the full sim regression.
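The idea behind the lockstep comparison can be sketched with two behavioral Python models. This is not the TB: it assumes little-endian lane order (lane 0 is the lowest byte address), restricts itself to word-aligned addresses, and ignores both the 1-cycle read latency and the near-MAX_BASE admission rule. All class and function names are hypothetical.

```python
class ByteVram:
    """Byte-addressable storage, 8192 x 8-bit (legacy shape)."""
    def __init__(self, size=8192):
        self.mem = bytearray(size)
    def write(self, addr, data, be):
        for lane in range(4):
            if (be >> lane) & 1:
                self.mem[addr + lane] = (data >> (8 * lane)) & 0xFF
    def read32(self, addr):
        return int.from_bytes(self.mem[addr:addr + 4], "little")

class WordVram:
    """Word-addressable storage, 2048 x 32-bit with byte write enables."""
    def __init__(self, words=2048):
        self.mem = [0] * words
    def write(self, addr, data, be):
        if addr & 3:
            return  # unaligned writes are dropped in this model
        keep = 0
        for lane in range(4):
            if (be >> lane) & 1:
                keep |= 0xFF << (8 * lane)
        idx = addr >> 2
        self.mem[idx] = (self.mem[idx] & ~keep & 0xFFFFFFFF) | (data & keep)
    def read32(self, addr):
        return self.mem[addr >> 2]

def lockstep(ops):
    """Apply the same (addr, data, be) writes to both models and compare after each one."""
    legacy, bram = ByteVram(), WordVram()
    for addr, data, be in ops:
        legacy.write(addr, data, be)
        bram.write(addr, data, be)
        if legacy.read32(addr) != bram.read32(addr):
            return False
    return True

agree = lockstep([(0, 0xDEADBEEF, 0b1111), (0, 0x000000AA, 0b0001),
                  (4, 0x12345678, 0b0110), (8188, 0xCAFEF00D, 0b1100)])
```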

Quartus experiment exp_c_vram_bram_stub (synth/.../experiments/exp_c_vram_bram_stub/) proves the new module infers BRAM cleanly. Side-by-side with the Ch153 baselines, all on the same Agilex 5 part:

| Experiment | Fit | ALMs | Registers | RAM Blocks | Block memory bits |
| --- | --- | --- | --- | --- | --- |
| exp_a_bram_friendly | Success | 46 / 46,800 | 0 | 4 / 358 | 65,536 |
| exp_b_vram_shape | Failed | (261,578 comb nodes vs 93,600 device max) | 65,536 | 0 / 358 | 0 |
| exp_c_vram_bram_stub | Success | 190 / 46,800 | 2 | 8 / 358 | 131,072 |

Interpretation:

  • exp_c lands close to exp_a's ideal (190 vs 46 ALMs; 8 vs 4 RAM Blocks). The slight overhead vs exp_a is the dual read port (M20K replicates storage to serve two independent read addresses simultaneously, hence 2× block memory bits) plus the per-byte non-wrapping admission gate Ch95 inherited from vram_stub.
  • exp_c consumes 16× fewer dedicated registers than exp_a would have if read_data were reset (2 vs the 32 a reset would require); the canonical Quartus inference template demands no reset on the BRAM data register.
  • vs exp_b's 65,536 registers + 261,578 combinational nodes, swapping vram_stub → vram_bram_stub recovers essentially all of the Ch152 ALM headroom on the vram side. Useg-shadow is the next forensic target (likely similar shape).

Inference template gotcha (caught + fixed in this chapter): the first cut of vram_bram_stub had a reset on read_data inside the always_ff block AND an in-bounds gate guarding the mem read. Quartus rejected BRAM inference with Info (276007): RAM logic ... uninferred due to asynchronous read logic. Fix: simplified the read path to the canonical template (always_ff @(posedge clk) read_data <= mem[idx];) and moved bounds + alignment checks to a parallel read_valid pipeline. Then Implemented 64 RAM segments instead of 0.

Ch155+ surface (writer-side normalization for ALL sub-32-bit PSMs, not just PSMT4): vram_bram_stub's contract is stricter than vram_stub's. write_addr MUST be word-aligned (write_addr[1:0] == 2'b00), and the byte lane(s) being written are selected via write_be with the payload pre-shifted into the right byte lane(s) of write_data[31:0]. Today's writer-side RTL emits at sub-word boundaries:

  • PSMCT16 raster + image-xfer write at halfword addresses (write_addr[1] == 1 for the high halfword) with be=4'b0011 or 4'b1100 and the 16-bit payload in write_data[15:0].
  • PSMT8 raster + image-xfer write at byte addresses (any write_addr[1:0]) with be=4'b0001 and the 8-bit payload in write_data[7:0].
  • PSMT4 raster + image-xfer write at byte addresses with be=4'b0001 + per-bit write_mask 0x0F or 0xF0 to splice one nibble.
  • PSMCT32 raster + image-xfer write at word addresses with be=4'b1111 + the full 32-bit payload — the ONLY PSM that natively matches vram_bram_stub's contract today.

If we swap the board top to vram_bram_stub without writer-side normalization, CT16/T8/T4 writes silently drop because write_addr[1:0] != 0 fails admission. So Ch155 must rework each writer to:

  1. Mask write_addr down to its word base (write_addr & ~32'd3).
  2. Shift the payload from its native byte lane into the appropriate byte lane(s) of a 32-bit write_data based on the original write_addr[1:0].
  3. Generate write_be with bits set only for the byte lanes the original sub-word address actually targets.
  4. For PSMT4 specifically: replace the per-bit write_mask nibble splice with a writer-side read-modify-write — read the existing byte first, splice the new nibble in, then issue a normal byte-WE write. Adds ~1 cycle of latency per nibble-write but that's well within the 16×8 demo budget.
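Steps 1 to 3 above can be sketched as a small Python function for the whole-byte cases (CT32, CT16, T8). This is an illustrative model with a hypothetical name and signature, not the eventual RTL; returning a zero byte enable for a misaligned request is a choice made for this sketch.

```python
def normalize_subword(byte_addr, payload, size_bytes):
    """Map a natural sub-word write onto (word_addr, write_data, write_be).

    size_bytes is 4 for PSMCT32, 2 for PSMCT16, 1 for PSMT8.
    """
    if size_bytes not in (1, 2, 4):
        raise ValueError("size_bytes must be 1, 2 or 4")
    word_addr = byte_addr & ~3                      # step 1: word base
    if byte_addr % size_bytes:
        return word_addr, 0, 0b0000                 # misaligned for this size: nothing written
    lane = byte_addr & 3
    value = payload & ((1 << (8 * size_bytes)) - 1)
    write_data = value << (8 * lane)                # step 2: shift payload into its lane(s)
    write_be = ((1 << size_bytes) - 1) << lane      # step 3: enable only the targeted lanes
    return word_addr, write_data, write_be

ct16_high = normalize_subword(0x82, 0x6155, 2)      # high halfword of the word at 0x80
t8_lane3 = normalize_subword(0x13, 0xAB, 1)         # byte lane 3 of the word at 0x10
ct32_word = normalize_subword(0x40, 0x11223344, 4)  # already in the native contract
```

The PSMT4 case is left out on purpose because it also needs the previously stored byte (step 4).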

The rework lands inside gs_stub.raster_pixel_emit (Ch95/Ch105/Ch106 wrote the legacy paths) and gif_image_xfer_stub's per-PSM dispatch. A focused TB that drives sub-word writes through the normalizer and asserts the resulting vram_bram_stub words match the legacy vram_stub byte-/halfword-/nibble-level state would be the cleanest proof.

Other Ch155+ work:

  • Update scanout / debug TBs that sample VRAM via vram_stub's combinational reads to handle the 1-cycle sync-read latency (or keep them on vram_stub if they're sim-only).
  • Swap the Ch146 board top to instantiate vram_bram_stub AFTER the writer-side normalization lands. Rerun the full Quartus compile and expect a dramatic ALM/register reduction.
  • Audit useg_shadow_mem next — Ch64/Ch65/Ch70 mirror-write features may make it multi-port-write-shaped.

VRAM write normalizer + first BRAM integration (Ch155)

Ch155 lands the writer-side normalization layer that bridges the contract gap between the legacy vram_stub (byte-addressed sub-word writes + per-bit RMW) and the new vram_bram_stub (word-aligned + byte-WE only). Per Codex's framing the chapter keeps blast radius narrow: build the normalizer + verify it standalone for all 4 PSMs + prove the easiest case (PSMCT32) end-to-end through the new VRAM. RTL plumbing into gs_stub.raster_pixel_emit and gif_image_xfer_stub lands in Ch156+.

| Artifact | Purpose |
| --- | --- |
| rtl/gif_gs/vram_normalize_pkg.sv | Pure-comb normalize_write function: natural byte address + PSM + payload + (T4-only) old_byte → word-aligned write_addr + shifted write_data + write_be. |
| tb_vram_normalize_write | Focused unit TB: 17 cases across CT32 / CT16 / T8 / T4 lanes + misuse detection. |
| rtl/top/top_psmct32_raster_demo_bram.sv | Sibling of the Ch146 wrapper with vram_bram_stub swapped in. |
| tb_top_psmct32_raster_demo_bram | Integration TB: drives Ch146 fixtures + verifies VRAM contents at PSMCT32 swizzled addresses via hierarchical probe. |

Function contract (vram_normalize_pkg::normalize_write):

| PSM | byte_addr alignment | payload bits used | output write_be shape | extras |
| --- | --- | --- | --- | --- |
| PSMCT32 | word (addr[1:0]==0) | payload[31:0] (full ABGR) | 4'b1111 | misuse → drop (be=0000) |
| PSMCT16 | halfword (addr[0]==0) | payload[15:0] (RGB5A1) | 4'b0011 (low) / 4'b1100 (high), keyed on addr[1] | misuse → drop |
| PSMT8 | byte (any) | payload[7:0] (index byte) | one of 4'b0001 / 0010 / 0100 / 1000, keyed on addr[1:0] | |
| PSMT4 | byte (any) | payload[3:0] (nibble) | one of 4'b0001 / 0010 / 0100 / 1000, keyed on addr[1:0] | needs old_byte + nibble_hi; output is the spliced full byte at the addressed lane |
| any other | | | 4'b0000 | |

PSMT4 splice math (the only PSM whose output depends on prior memory state): given nibble_hi=0, the function returns new_byte = {old_byte[7:4], payload[3:0]} — preserves the upper nibble, replaces the lower. With nibble_hi=1, new_byte = {payload[3:0], old_byte[3:0]}. The CALLER is responsible for sourcing old_byte via a 1-cycle read of mem[byte_addr] upstream of the write; the function itself is purely combinational. The Ch156+ RTL plumbing chapter is where that read pipeline lives inside gs_stub.raster_pixel_emit and gif_image_xfer_stub.
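The splice math can be written out as a short Python model. The function names are hypothetical and the model assumes lane 0 is the lowest byte address; it mirrors the two concatenations quoted above rather than the package source.

```python
def splice_nibble(old_byte, payload, nibble_hi):
    """Merge a 4-bit payload into one half of old_byte, preserving the other half."""
    nibble = payload & 0xF
    if nibble_hi:
        return (nibble << 4) | (old_byte & 0x0F)   # {payload[3:0], old_byte[3:0]}
    return (old_byte & 0xF0) | nibble              # {old_byte[7:4], payload[3:0]}

def normalize_psmt4(byte_addr, payload, nibble_hi, old_byte):
    """Return (word_addr, write_data, write_be) for one PSMT4 pixel write."""
    lane = byte_addr & 3
    new_byte = splice_nibble(old_byte, payload, nibble_hi)
    return byte_addr & ~3, new_byte << (8 * lane), 1 << lane

low_splice = splice_nibble(0xAB, 0xC, nibble_hi=0)
high_splice = splice_nibble(0xAB, 0xC, nibble_hi=1)
t4_write = normalize_psmt4(0x12, 0x5, nibble_hi=1, old_byte=0xAB)
```

Because old_byte is an input, the caller has to have read it before the write is issued, which is the reason for the upstream read stage.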

top_psmct32_raster_demo_bram integration result: the new sibling wrapper substitutes vram_bram_stub for vram_stub, drops write_mask wiring (CT32's mask=0xFFFFFFFF makes the per-bit RMW path a no-op so dropping it is functionally equivalent), and accepts the 1-cycle sync-read latency on PCRTC's vram_read_data path (so PCRTC scanout is 1-pixel shifted; the integration TB skips frame capture and verifies VRAM content via direct hierarchical probe). All 128 pixel words at canonical PSMCT32 swizzled addresses match expected ABGR. Standalone PASS.

Ch155 critical audit check: vram_normalize_write's function-level misuse handling pins the contract — passing an unaligned byte_addr for CT32 OR CT16 returns write_be=4'b0000, which vram_bram_stub then drops cleanly. Combined with Codex's stance that "no sub-32-bit writer is allowed to hand an unaligned address directly to vram_bram_stub", the Ch156+ plumbing chapter has a hard contract to verify against.

Ch156+ surface:

  • Insert a 1-cycle byte-read pipeline upstream of the PSMT4 raster emit + image-xfer paths inside gs_stub and gif_image_xfer_stub. The read returns old_byte for normalize_write's splice input.
  • Apply normalize_write to all four PSM emit lanes inside both writers.
  • Add focused TBs for PSMCT16 / PSMT8 / PSMT4 paths analogous to tb_top_psmct32_raster_demo_bram — each verifies the swizzled VRAM contents under the new normalizer + bram_stub.
  • Add a 1-cycle address-stage register inside gs_pcrtc_stub so scanout consumers see a clean combinational-look read (addr-to-data, with the BRAM's internal sync stage hidden).
  • Once all four lanes pass, swap the Ch146 board top to use vram_bram_stub directly (or retire vram_stub outright).
  • Audit useg_shadow_mem next: the Ch64/Ch65/Ch70 mirror-write features may make it multi-port-write-shaped, which is its own forensic exercise.

Writer-side normalize plumbing — CT16 + T8 (Ch156)

Ch156 plumbs the Ch155 vram_normalize_pkg::normalize_write function into the BRAM-friendly path so PSMCT16 and PSMT8 raster emits land at the right vram_bram_stub byte lane. The chapter intentionally keeps blast radius narrow — the function is wired in at the wrapper site between the unmodified writer engines (gs_stub.raster_pixel_emit) and vram_bram_stub, so the legacy byte-addressable contract on gs_stub's raster emit ports stays exactly as Ch128/Ch134 / etc. defined them. PSMT4 still requires the read-modify-write pipeline and is deferred to Ch157+.

| File / target | Role |
| --- | --- |
| rtl/top/top_psmct32_raster_demo_bram.sv | Wrapper updated: raster_pixel_psm_q exposed; bitbltbuf_q[61:56] provides the PSM during xfer; the muxed (byte_addr, psm, payload) triple is run through vram_normalize_pkg::normalize_write and the result feeds vram_bram_stub. CT32 path remains a passthrough; CT16/T8 paths now write to the right lane. |
| tb_gs_raster_bram_psmct16 | Focused CT16 integration TB: 16×4 SPRITE at FBP=0/FBW=1, halfword 0x6155. Drives gs_stub#(PSMCT16_SWIZZLE=1) directly; verifies all 64 swizzled halfwords land in u_vram.mem[byte_addr >> 2] at the addr[1]-keyed lane; pins the linear-stride separator at byte 0x80 = zero. |
| tb_gs_raster_bram_psmt8 | Focused PSMT8 integration TB: 16×8 SPRITE at FBP=0/FBW=2, byte index 0xA5. Drives gs_stub#(PSMT8_SWIZZLE=1) directly; verifies all 128 swizzled bytes land in u_vram.mem[byte_addr >> 2] at the addr[1:0]-keyed lane. |

Why wrapper-site, not in-engine: keeping gs_stub and gif_image_xfer_stub byte-addressable preserves the contract that every Ch128 / Ch134 / Ch140 swizzle TB (and the legacy vram_stub) was written against. Ch156's only structural change is that a top wrapper which targets vram_bram_stub also runs normalize_write between the writer and VRAM. A future chapter can promote the normalizer into the writer engines once we've decided to retire vram_stub; until then the function lives where it can be removed without changing the writers.

PSMT4 deferral — explicit hard-gate (Ch156 audit Medium #1 fix; superseded by Ch157): when Ch156 closed, the wrapper masked write_en off when the active PSM was PSMT4 (vram_psmt4_block = (vram_psm_pre == PSM_PSMT4), vram_we_mux = vram_we_pre && !vram_psmt4_block). Without that gate, normalize_write's PSMT4 branch returned a real one-byte write spliced against old_byte=0, silently corrupting VRAM on any T4 raster emit. The Ch156 focused TB tb_gs_raster_bram_psmt4_gate drove a 16×4 PSMT4 SPRITE through the wrapper-shape gate and asserted (1) raster_pixel_emit pulses fired, (2) every pulse hit the gate (blocked == emit), (3) VRAM stayed at sentinel 0xDEADBEEF — zero corruption. Ch157 retires both the gate and that TB: the wrapper now runs a real RMW pipeline (see "PSMT4 RMW pipeline" section below) and supplies a live old_byte so the splice produces correct bytes. The retired TB's coverage is replaced by tb_gs_raster_bram_psmt4, which drives the same kind of PSMT4 SPRITE but verifies correct nibble splices instead of absence of writes.

Adversarial coverage on the CT16 / PSMT8 TBs (Ch156 audit Medium #2 fix): both TBs originally drove a single uniform payload across the whole sprite, so a buggy normalizer that wrote all four byte lanes (or duplicated payload, or stomped neighboring lanes) could still leave every checked pixel matching. The TBs now split the image into TWO half-width SPRITEs with distinct payloads:

  • tb_gs_raster_bram_psmct16 drives (0,0)..(7,3) with halfword 0x6155 (low halfword lane via PSMCT16 swizzle) and (8,0)..(15,3) with halfword 0x9F8E (high halfword lane of the same 32-bit words). Sentinel preload (0xDEADBEEF) on every VRAM word before the drive plus a linear-stride separator check at byte 0x80 (outside the swizzled set).
  • tb_gs_raster_bram_psmt8 drives (0,0)..(7,7) with byte 0xA5 (lanes {0,1}) and (8,0)..(15,7) with byte 0x5A (lanes {2,3}). Same sentinel preload.

A normalizer that swaps lanes, sets be too wide, or fails to preserve the other halfword/byte lane(s) of the shared word now surfaces as a per-pixel mismatch.
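Why the split payload matters can be shown with a toy model. The sketch below is an illustration under stated assumptions, not one of the TBs: write_word stands in for vram_bram_stub's byte-enable write, good_ct16 follows the CT16 lane rule, and buggy_ct16 is a hypothetical normalizer that duplicates the halfword and enables all four lanes.

```python
SENTINEL = 0xDEADBEEF

def write_word(mem, word_addr, data, be):
    """Byte-enable write into a word-addressed memory (list of 32-bit ints)."""
    mask = 0
    for lane in range(4):
        if be & (1 << lane):
            mask |= 0xFF << (8 * lane)
    mem[word_addr] = (mem[word_addr] & ~mask & 0xFFFFFFFF) | (data & mask)

def good_ct16(byte_addr, half):
    """CT16 lane rule: low halfword for addr[1]==0, high halfword otherwise."""
    if byte_addr & 0x2:
        return byte_addr >> 2, (half & 0xFFFF) << 16, 0b1100
    return byte_addr >> 2, half & 0xFFFF, 0b0011

def buggy_ct16(byte_addr, half):
    """Hypothetical bug: payload duplicated, all four byte lanes enabled."""
    half &= 0xFFFF
    return byte_addr >> 2, (half << 16) | half, 0b1111

def read_half(mem, byte_addr):
    shift = 16 if byte_addr & 0x2 else 0
    return (mem[byte_addr >> 2] >> shift) & 0xFFFF

def run(normalizer, low_payload, high_payload):
    """Write both halfwords of word 0, then check that each reads back intact."""
    mem = [SENTINEL]
    write_word(mem, *normalizer(0, low_payload))
    write_word(mem, *normalizer(2, high_payload))
    return (read_half(mem, 0) == low_payload
            and read_half(mem, 2) == high_payload)
```

In this model run(buggy_ct16, 0x6155, 0x6155) returns True, so a uniform payload hides the bug, while run(buggy_ct16, 0x6155, 0x9F8E) returns False. good_ct16 passes both cases.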

Sim regression: 141 PASS / 0 FAIL after the audit fixes (140 + the new tb_gs_raster_bram_psmt4_gate).

xfer-side coverage: gif_image_xfer_stub already feeds the wrapper's pre-normalize mux during xfer_busy. CT32 TRXDIR uploads (no Ch156 TB exists yet, but the path is wired) pass through the normalizer cleanly because xfer emits CT32 word-aligned. CT16 + T8 xfer TBs that exercise this path are a follow-on item — the wiring is already in place; only a focused TB is missing.

Sim regression: 140 PASS / 0 FAIL after Ch156 (138 + 2 new BRAM-integration TBs).

PSMT4 RMW pipeline — vram_bram_stub writes enabled (Ch157)

Ch157 closes the last writer-PSM gap that Ch156 left behind: the PSMT4 hard-gate is replaced by a wrapper-site read-modify-write pipeline that supplies a LIVE old_byte from VRAM, splices the new nibble against it, and commits a full-byte write through vram_bram_stub's byte-WE (no per-bit RMW required). The nibble splice itself uses the SAME math as vram_normalize_pkg's PSMT4 branch (new = nibble_hi ? {nib, old[3:0]} : {old[7:4], nib}) but lives inline in the wrapper, not inside a call to normalize_write — the function is pure-comb and would have required old_byte to be combinationally available, whereas vram_bram_stub's registered read port hands the byte back one cycle later. The CT32/CT16/T8 paths still call normalize_write directly (same-cycle, no read-back required). Goal Codex framed: "all writer PSMs safe before swapping the board top."

Pipeline shape (inside rtl/top/top_psmct32_raster_demo_bram.sv):

emit cycle N:        is_t4_emit=1; vram_read2_addr = byte_addr & ~3;
                     pipe_q <= (byte_addr, nibble_hi, nibble[3:0]).
posedge → cycle N+1: vram_read2_data = mem[byte_addr] (sync read);
                     splice new_byte = nibble_hi
                            ? {nibble, old[3:0]}
                            : {old[7:4], nibble};
                     drive vram_we_final=1, write_addr=byte_addr&~3,
                     write_data shifted to byte_addr[1:0] lane,
                     write_be one-hot to that lane.
posedge → cycle N+2: mem[byte_addr] commits new_byte.

old_byte is sourced from the lane-correct slice of vram_read2_data. CT32/CT16/T8 emits skip the pipe entirely and fall through vram_norm same-cycle (CT32 stays a passthrough, existing TBs unaffected).

Forwarding hazard — back-to-back same-byte writes: a PSMT4 SPRITE rasters adjacent pixels at x=2k and x=2k+1 to the SAME byte_addr (low + high nibble of a single byte). At cycle N+1 the wrapper reads mem[byte_addr] for emit-2 in the SAME posedge that emit-1's write commits. NBA semantics inside vram_bram_stub (separate always_ff blocks for the write port and the read port) make the read see the PRE-write value, so emit-2 would splice against stale data. The Ch157 pipe carries a 1-deep t4_prev_* register set (addr + new_byte from the just-completed RMW) and forwards t4_prev_new_byte_q whenever the in-flight emit's byte_addr matches the previous emit's byte_addr. The forwarding chain extends across any number of back-to-back same-byte emits — emit-N reads emit-(N-1)'s new_byte from the forward register, splices on top, and emit-(N+1) reads emit-N's new_byte from that same register.
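The hazard and the 1-deep forward register can be reproduced in a small cycle-level model. This is a sketch under assumptions, not the wrapper RTL: memory is a Python dict of bytes, one emit (or None) arrives per cycle, and the names run_t4_pipe, pipe, rd_data and prev are made up for the example. The read at each clock edge samples memory before that edge's write lands, mirroring the pre-write read described above.

```python
def splice(old_byte, nibble, nibble_hi):
    """Same splice math as the PSMT4 branch: replace one nibble, keep the other."""
    if nibble_hi:
        return ((nibble & 0xF) << 4) | (old_byte & 0x0F)
    return (old_byte & 0xF0) | (nibble & 0xF)

def run_t4_pipe(mem, emits, forwarding=True):
    """emits: one (byte_addr, nibble, nibble_hi) tuple or None per cycle.
    mem: dict of byte_addr -> byte, updated in place and returned."""
    pipe = None     # emit registered at the previous clock edge
    rd_data = 0     # registered read data (sync read port)
    prev = None     # (byte_addr, new_byte) committed at the previous edge
    for emit in list(emits) + [None]:   # one extra cycle drains the pipe
        write = None
        if pipe is not None:
            addr, nibble, nibble_hi = pipe
            old = rd_data
            if forwarding and prev is not None and prev[0] == addr:
                old = prev[1]           # bypass the stale BRAM read
            write = (addr, splice(old, nibble, nibble_hi))
        # Clock edge: the read samples memory BEFORE this edge's write lands.
        if emit is not None:
            rd_data = mem.get(emit[0], 0)
        if write is not None:
            mem[write[0]] = write[1]
        prev = write
        pipe = emit
    return mem
```

Starting from mem = {0x10: 0xEF} and driving a low-nibble 0xA followed immediately by a high-nibble 0x5 to the same byte, the forwarding model ends with 0x5A. With forwarding=False the second emit splices against the stale 0xEF and the byte ends as 0x5F, losing the first nibble.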

| File / target | Role |
| --- | --- |
| rtl/top/top_psmct32_raster_demo_bram.sv | Ch156 hard-gate replaced by the RMW pipe + forwarding registers; vram_read2_addr driven on T4 emit cycles; vram_we_final mux selects T4 pipe write or non-T4 same-cycle path. |
| tb_gs_raster_bram_psmt4 | New positive-proof TB: drives a 16×4 LINEAR PSMT4 SPRITE (PSMT4_SWIZZLE=0 so adjacent x's hit the same byte) split into two halves with distinct nibbles (0xA / 0x5). 64 raster emits; verifies every byte under the sprite holds the expected pair of spliced nibbles (left half = 0xAA, right half = 0x55) plus sentinel preserved on bytes outside the sprite. PASS. |
| tb_gs_raster_bram_psmt4_gate | Retired: the gate it asserted no longer exists. |

Why LINEAR PSMT4 in the new TB: the linear address formula (y*FBW*32) + (x>>1) puts adjacent x's at the same byte, which is exactly the back-to-back same-byte forwarding hazard. The swizzled path scatters bytes via columnTable4, so it touches the forwarding logic less often. Linear coverage is strictly stronger here.
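The same-byte collision follows directly from the linear formula. A minimal sketch, assuming FBP = 0 (no base offset) and assuming that odd x selects the upper nibble; the function name is hypothetical:

```python
def psmt4_linear_addr(x, y, fbw):
    """Return (byte_addr, nibble_hi) for a linear PSMT4 pixel.
    fbw is the buffer width in 64-pixel units, i.e. 32 bytes per unit at 4 bpp."""
    byte_addr = (y * fbw * 32) + (x >> 1)
    nibble_hi = bool(x & 1)   # assumption: even x = low nibble, odd x = high nibble
    return byte_addr, nibble_hi
```

With fbw = 1, pixels (4, 1) and (5, 1) both resolve to byte 34 and differ only in nibble_hi, which is the back-to-back pair the forwarding register has to cover.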

Non-T4 TB cleanup: tb_gs_raster_bram_psmct16 and tb_gs_raster_bram_psmt8 still mirror the non-T4 portion of the wrapper-site plumbing, but they no longer carry the Ch156 PSMT4 hard-gate (now removed in the wrapper). Both wire raster_pixel_emit straight to write_en and let vram_norm drive addr/data/be — focused TBs verifying their own PSM lane. Full pipe coverage lives in tb_gs_raster_bram_psmt4 and the top wrapper TB.

Sim regression: 141 PASS / 0 FAIL after Ch157 (140 + the new tb_gs_raster_bram_psmt4, with tb_gs_raster_bram_psmt4_gate retired).

PCRTC sync-read alignment (Ch158)

Ch158 closes the last big blocker before swapping the board top to vram_bram_stub: the PCRTC's data-decode + sync-output pipeline is now aware that vram_bram_stub's read_data is registered with 1-cycle latency, so the captured scanout no longer trails the address stage by one column.

gs_pcrtc_stub change (in rtl/gif_gs/gs_pcrtc_stub.sv): new module parameter VRAM_SYNC_READ (default 0). When set to 1, every hcnt/vcnt-derived signal that the data-decode comb consumes is run through a 1-cycle register before the consumer sees it (active_h_dec, active_v_dec, in_hsync_dec, in_vsync_dec, in_display_window_dec, scanout_enable_dec, dispfb_psm_*_dec, psm4_nibble_select_dec, end_of_frame_dec). The address-side (vram_read_addr) keeps using the current (hcnt, vcnt) so the read is issued one pixel "ahead"; the registered vram_read_data arrives one cycle later, paired with the matching delayed counter view. Outputs r/g/b/hsync/vsync/de come from the _dec signals, so the entire output stream shifts right by exactly one clock when VRAM_SYNC_READ=1. Default VRAM_SYNC_READ=0 is a pure passthrough — every existing PCRTC TB written against the legacy vram_stub (comb-read) shape is unaffected.
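The effect of delaying the counter view can be seen in a one-scanline model. This is an illustrative sketch, not gs_pcrtc_stub: mem is a list indexed directly by column, and the names scanout, rd_q and hcnt_q are assumptions. The delay_counters flag plays the role of the _dec registers.

```python
def scanout(mem, width, sync_read, delay_counters):
    """Return the (column, pixel) pairs the decode stage pairs up."""
    out = []
    rd_q = None      # registered BRAM output (used in sync mode)
    hcnt_q = None    # 1-cycle-delayed copy of the column counter
    for cycle in range(width + 1):
        hcnt = cycle if cycle < width else None   # address-side counter
        if sync_read:
            data = rd_q                           # data for last cycle's address
        else:
            data = mem[hcnt] if hcnt is not None else None   # comb read
        col = hcnt_q if delay_counters else hcnt
        if data is not None and col is not None:
            out.append((col, data))
        # Clock edge: register the read data and the counter.
        rd_q = mem[hcnt] if hcnt is not None else None
        hcnt_q = hcnt
    return out
```

For mem = [10, 20, 30, 40], scanout(mem, 4, True, True) pairs every column with its own pixel, whereas scanout(mem, 4, True, False) returns [(1, 10), (2, 20), (3, 30)]: the data trails the counter by one column, which is the misalignment this chapter removes.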

top_psmct32_raster_demo_bram change: instantiates gs_pcrtc_stub with .VRAM_SYNC_READ(1'b1). The wrapper banner updates to drop the Ch155 caveat about scanout being 1 column shifted — that caveat is now resolved.

tb_top_psmct32_raster_demo_bram extension: adds a Phase 2 frame-capture block that arms on the next vsync rising edge after raster drain, captures one full frame's r/g/b into cap_*[v][h] indexed by a 1-cycle-delayed copy of PCRTC's address-stage counters (since the registered de aligns with those delayed counters), and asserts each captured pixel's post-decode r/g/b matches the expected ABGR for its quadrant. Phase 1 (per-pixel VRAM probe via hierarchical mem[byte_addr >> 2]) is unchanged. PASS — 16×8 active region, all 128 pixels captured + all 128 VRAM words probe-verified, frame_seen latched.

Open Ch159+ items:

  • xfer-side T4 coverage TB — the Ch157 wrapper handles xfer-side T4 emits identically (the mux feeds vram_psm_pre from bitbltbuf_q[61:56] during xfer_busy), but no focused TB exercises that path yet.
  • Swap the Ch146 board top to instantiate vram_bram_stub and the Ch158 PCRTC-sync mode directly (or retire vram_stub outright). All four writer PSMs and PCRTC scanout are now proven correct against the BRAM-friendly contract; the remaining work is the integration commit on the board side.
  • Audit useg_shadow_mem for the same BRAM-shape forensics that Ch153 ran on vram_stub (Ch64/Ch65/Ch70 mirror writes may make it multi-port-write-shaped).

Ch158 audit Medium fix — sub-word PSM lane selection: the initial Ch158 cut shifted the data-decode pipeline by 1 cycle to align with vram_bram_stub's registered output, but it still extracted CT16 / PSMT8 / PSMT4 sub-word values from the LOW lane of vram_read_data (i.e. [15:0] halfword and [7:0] byte). That worked for vram_stub (byte-addressable; the read returns 4 bytes starting at byte_addr so the sub-word always lands at the low lane) but NOT for vram_bram_stub (word-addressable; read_data is mem[byte_addr >> 2] so the sub-word lives at lane byte_addr[1:0] of the returned word). Codex Ch158 audit called this out as a blocker for any sub-word PSM scanout through the BRAM. The fix adds:

  • vram_addr_lane_q: a 1-cycle-delayed copy of vram_read_addr[1:0], paralleling the other _q decode-stage registers added in the original Ch158 cut.
  • data_lane = VRAM_SYNC_READ ? vram_addr_lane_dec : 2'd0 — forces the legacy comb-read path to keep using the low lane (preserving every existing PCRTC TB's expectation), and resolves to the correct byte_addr-keyed lane in sync mode.
  • psm16_pixel = data_lane[1] ? read_data[31:16] : read_data[15:0].
  • A vram_byte_lane mux extracting one of 4 byte lanes for PSMT8 (psm8_idx) and PSMT4 (psm4_byte_lane → nibble splice).

Two new focused integration TBs prove the fix end-to-end with adversarial pre-loads:

| TB | Coverage |
| --- | --- |
| tb_gs_scanout_bram_psmct16 | 4-pixel CT16 scanout reading mem[0]/mem[1] with FOUR distinct halfwords across both halfword lanes (byte_addr[1] ∈ {0,1}); each pixel's captured 5→8-decoded RGB matches the expected halfword. PASS |
| tb_gs_scanout_bram_psmt8 | 4-pixel PSMT8 scanout reading mem[0] with FOUR distinct byte indices, one per byte lane (byte_addr[1:0] ∈ {0,1,2,3}); each pixel's grayscale RGB matches the expected byte. PASS |

Without the fix, both TBs would have failed: the CT16 TB would emit the same pair of pixels twice (low halfword of each word), and the PSMT8 TB would emit IDX_0 for all four pixels.
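The failure mode can be sketched by applying the lane mux to a word-addressed read. The function names and the sync_read flag are illustrative assumptions; passing sync_read=False against a word-addressed word reproduces the pre-fix behavior of always taking the low lane.

```python
def psm16_pixel(read_data, byte_addr, sync_read):
    """Pick the CT16 halfword; the lane is byte_addr[1] in sync mode, else low."""
    lane = (byte_addr & 0x3) if sync_read else 0
    if lane & 0x2:
        return (read_data >> 16) & 0xFFFF
    return read_data & 0xFFFF

def psm8_index(read_data, byte_addr, sync_read):
    """Pick the PSMT8 byte; the lane is byte_addr[1:0] in sync mode, else lane 0."""
    lane = (byte_addr & 0x3) if sync_read else 0
    return (read_data >> (8 * lane)) & 0xFF
```

With the word 0x9F8E6155, byte addresses 0 and 2 decode to 0x6155 and 0x9F8E in sync mode, but both decode to 0x6155 when the lane is forced low. Likewise, the word 0x44332211 yields 0x11, 0x22, 0x33, 0x44 across byte addresses 0..3 in sync mode and 0x11 for all four otherwise.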

Sim regression: 143 PASS / 0 FAIL after Ch158 audit fixes (141 + 2 new BRAM scanout TBs).

Board-top swap to BRAM wrapper + Quartus fit recovery (Ch159)

Ch159 commits the integration step that the prior chapters were building toward: the DE25-Nano board top (rtl/top/de25_nano_psmct32_raster_demo_top.sv) now instantiates top_psmct32_raster_demo_bram instead of the legacy top_psmct32_raster_demo. External port shape is identical so this is drop-in at the board level; the BRAM-backed wrapper carries through every Ch155-Ch158 fix (writer-side normalize + PSMT4 RMW pipe + PCRTC sync-read alignment + sub-word lane select). The synth file list (synth/de25_nano/top_psmct32_raster_demo/files.f) and Quartus QSF gain vram_normalize_pkg.sv, vram_bram_stub.sv, and top_psmct32_raster_demo_bram.sv; the legacy vram_stub and legacy top stay in the project for back-compat with sim TBs that still target them.

Quartus fit recovery vs the Ch152 baseline (the headline of this chapter): the Ch152 fit FAILED needing 155k ALMs (331% of the 46,800 available) because vram_stub's 8 KiB byte-addressable + per-bit-RMW storage didn't infer as M20K and landed as a 65,536-flip-flop array, dragging 121k registers and 199k synthesis ALMs along with it. The Ch159 swap turns those numbers around:

| Metric | Ch152 (vram_stub) | Ch159 (vram_bram_stub) | Δ |
| --- | --- | --- | --- |
| Synthesis status | Successful | Successful | |
| Synthesis ALMs estimate | 199,103 / 46,800 (425% of capacity) | 22,704 / 46,800 (49%) | -176,399 (-88.6%) |
| Synthesis registers | 101,457 | 36,008 | -65,449 (-64.5%) |
| Fit status | FAILED (155k needed, 331% of capacity) | Successful (30,364 / 65%) | fits |
| Fit registers | 121,176 | 39,085 | -82,091 (-67.7%) |
| Fit RAM blocks | 6 / 358 | 14 / 358 | +8 (BRAM-inferred VRAM) |
| Fit block memory bits | 65,536 | 196,608 | +131,072 (data in M20K) |
| Fit DSP blocks | 20 | 18 | -2 |
| STA status | DID NOT RUN (fit failed) | Successful (12 warnings) | STA reachable |
| STA setup slack worst (CLOCK2_50) | n/a | -6.950 ns | timing miss at 50 MHz |
| Fmax | n/a | 37.11 MHz | (Ch160+ tunes) |

The eight new RAM blocks are the same vram_bram_stub footprint exp_c proved in Ch154 (8 RAM blocks for the dual-port + admission-gated 8 KiB shape; the 6 already in the Ch152 baseline came from bios_rom_stub + ee_ram_stub + useg_shadow_mem correctly inferring as BRAM there). The register drop (121k → 39k) is essentially the entire VRAM flip-flop array vanishing.

Setup-slack reality check: STA reports -6.950 ns slack at the CLOCK2_50 50 MHz constraint (Fmax = 37.11 MHz). The critical path is somewhere in the Ch123 dep tree's longer combinational chains (likely the Gouraud divider or one of the swizzle muxers). That is NOT a Ch159 regression; it is new visibility unlocked by being able to run STA at all. Ch160+ owns timing closure (PLL down-clock to ≤30 MHz, critical-path pipelining, or both).
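The slack and Fmax figures are two views of the same critical-path delay. As a rough cross-check, assuming a single clock domain and ignoring clock skew and uncertainty (which the real STA report includes), slack is approximately the target period minus the Fmax period. The helper name is hypothetical:

```python
def setup_slack_ns(target_mhz, fmax_mhz):
    """Approximate worst setup slack in ns, ignoring skew and uncertainty."""
    target_period_ns = 1000.0 / target_mhz
    critical_path_ns = 1000.0 / fmax_mhz
    return target_period_ns - critical_path_ns
```

setup_slack_ns(50.0, 37.11) comes out at about -6.95 ns, matching the reported miss, and setup_slack_ns(30.0, 37.11) is positive, which is why a down-clock to 30 MHz or below is a plausible first route to closure. A re-fit under a different constraint can change placement, so this is an estimate rather than a prediction.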

Snapshots preserved: the Ch152 baseline reports are saved under synth/de25_nano/top_psmct32_raster_demo/baseline_ch152/ (syn / fit summaries + flow.rpt + parse_report.txt) so future chapters can diff against them without re-running the failing Ch152 baseline.

Sim regression: 143 PASS / 0 FAIL unchanged. The Ch149 board-wrapper TB exercises the same external behavior with the new core wrapper inside.

Down-clock target + first .sof bitstream (Ch160)

Ch160 closes the loop Codex framed at the end of Ch159 — "first add a down-clock PLL profile so we can get a real bitstream moving on hardware, then use the successful STA path report to decide whether to pipeline toward 50 MHz." The chapter is SDC- and build-flow-only; no RTL changes.

SDC retarget (synth/de25_nano/top_psmct32_raster_demo/de25_nano_psmct32_raster_demo_top.sdc) relaxes the CLOCK2_50 period from 20.000 ns (50 MHz) to 33.333 ns (30 MHz). The DE25-Nano's CLOCK2_50 oscillator is physically still 50 MHz; the SDC tells Quartus to ASSUME a 30 MHz input so the fitter closes timing at the down-clock target. A real PLL .ip that divides 50 → 30 MHz on hardware is the Ch161+ commit (the QSF's commented-out QIP_FILE swap-point is staged for it). Until then, the .sof produced under this constraint is structurally clean for 30 MHz operation; programming it on a board where CLOCK2_50 is still wired straight through gives an effective 50 MHz chip clock that may show setup-violating behavior — Ch161 closes that gap.

build_quartus.sh adds quartus_asm (synth/de25_nano/top_psmct32_raster_demo/build_quartus.sh) gated on a clean STA, so a .sof bitstream is now produced when the design fits and timing closes. The Make scaffold check is loosened to accept either the 50 MHz (legacy) or 33.333 ns (Ch160 down-clock) period.

Quartus result vs Ch159:

| Metric | Ch159 (50 MHz target) | Ch160 (30 MHz target) |
| --- | --- | --- |
| Synth ALMs estimate | 22,704 / 46,800 (49 %) | 22,704 / 46,800 (49 %) |
| Synth registers | 36,008 | 36,008 |
| Fit status | Successful | Successful |
| Fit ALMs | 30,364 / 46,800 (65 %) | 31,056 / 46,800 (66 %) |
| Fit registers | 39,085 | 37,381 |
| Fit RAM blocks | 14 / 358 | 14 / 358 |
| STA setup slack worst | -6.950 ns (timing miss) | +0.805 ns (closes) |
| Fmax (CLOCK2_50) | 37.11 MHz | 30.74 MHz |
| quartus_asm | (skipped) | Successful (.sof produced) |

The synth-side numbers are identical because no RTL changed — the differences are entirely in the fitter's placement choices under the looser timing constraint. Fmax dropped slightly (37.11 → 30.74 MHz) because Quartus optimizes harder when the target is tighter; the headline is that at the 30 MHz target the design CLOSES (positive slack on every report) and a real .sof is now generated.

Critical path (from output_files/de25_nano_psmct32_raster_demo_top.sta.rpt, worst-10 paths all in the same module hierarchy):

Field Value
Slack +0.805 ns (worst path of 10 with this slack value)
From / To `u_demo
Data Delay 32.516 ns (out of 33.333 ns period)
Critical net The EE core's auto-generated 64-bit signed divider (the Ch152-noted Gouraud TRI divider — dead code in the PSMCT32 raster demo because no RM_TRI primitive is dispatched).

Ch161+ pipelining handoff: the path Codex's framing asked us to surface is now visible. Two options:

  1. Pipeline the divider: re-implement ee_core's 64-bit division as an N-cycle multi-cycle path. Quartus's auto-generated divider is a single-cycle ripple chain; making it 2-3 stage pipelined would put Fmax comfortably above 50 MHz.
  2. Strip it from the build: gate the Gouraud TRI divider behind a STRIP_GOURAUD_TRI parameter (default off), so the PSMCT32 raster demo's hardware build instances the EE core without it. Quartus removes the entire div_0_rtl_0 block; Fmax should jump dramatically.

Option 2 is the lower-blast-radius hardware bring-up move (removes ~32 ns of dead-code combinational chain); option 1 is the long-term correct fix once the Gouraud TRI path goes load-bearing.

Snapshots: Ch159 baseline reports preserved under baseline_ch159/ (syn / fit / sta summaries + parse_report).

Sim regression: 143 PASS / 0 FAIL unchanged (no RTL changes). Scaffold check + Ch149 board TB + top BRAM TB all green under the new SDC.

Real PLL IP commit — .sof actually runs at 30 MHz (Ch161)

Ch161 retires the Ch160 hardware-honesty caveat by committing a real Quartus IOPLL .ip configured for 50 MHz refclk → 30 MHz outclk_0. The wrapper's `ifdef USE_PLL_IP (staged in Ch151) now flips to the IP-generated pll module on Quartus builds; sim TBs continue to use the pass-through de25_nano_pll_stub.

Files committed under synth/de25_nano/top_psmct32_raster_demo/ip/:

  • pll.ip: adapted from retroDE_nes/ip/audio_pll.ip (single-output Agilex 5 IOPLL template), retargeted to 50 MHz refclk → 30 MHz outclk_0.
  • pll/pll.qip + pll/synth/pll.v + pll/pll_bb.v: Quartus IP-generated artifacts (quartus_ipgenerate de25_nano_psmct32_raster_demo_top --ip_file=ip/pll.ip --generate_ip_file --synthesis=verilog). The generated pll module exposes (refclk, rst, outclk_0, locked), exactly the Ch151 stub's signature, so the `ifdef swap is drop-in.

Wiring changes:

  • de25_nano_psmct32_raster_demo_top.qsf uncommented the set_global_assignment -name QIP_FILE ip/pll/pll.qip swap-point and added set_global_assignment -name VERILOG_MACRO "USE_PLL_IP=1" so Quartus instantiates the IP pll instead of the de25_nano_pll_stub.
  • de25_nano_psmct32_raster_demo_top.sdc reverted the Ch160 CLOCK2_50 period back to 20.000 ns (the physical 50 MHz oscillator). The IOPLL's auto-generated SDC inside the .qip declares the post-PLL outclk_0 clock at 30 MHz, so STA picks up two domains: u_pll|iopll_0_refclk (50 MHz, the pin) and u_pll|iopll_0_outclk0 (30 MHz, the design clock).
  • build_quartus.sh symlinks the ip/ dir alongside the existing rtl/ and sim/ symlinks so the QIP_FILE's ip/pll/pll.qip path resolves from the work dir.

Quartus result vs Ch160:

| Metric | Ch160 (SDC profile only) | Ch161 (real PLL IP) |
| --- | --- | --- |
| Fit ALMs | 31,056 / 46,800 (66 %) | 30,898 / 46,800 (66 %) |
| Fit registers | 37,381 | 37,352 |
| Fit PLLs | 0 / 11 | 1 / 11 (real IOPLL) |
| RAM blocks | 14 / 358 | 14 / 358 |
| Setup slack worst (design_clk) | +0.805 ns @ CLOCK2_50 | +0.565 ns @ u_pll |
| Fmax (design_clk) | 30.74 MHz | 30.74 MHz |
| quartus_asm | Successful | Successful (.sof produced) |

The +1 PLL block is the real IOPLL on the chip; ALMs go down slightly because the stub's clock-distribution path no longer needs ALM glue. STA now reports BOTH clock domains: the refclk (50 MHz, +19.249 ns slack — trivially fast) and the design_clk (30 MHz post-PLL, +0.565 ns slack — comfortable margin). The .sof produced under this configuration genuinely runs at 30 MHz on the DE25-Nano: the IOPLL takes the 50 MHz CLOCK2_50 input and divides to 30 MHz inside the chip, so the entire design downstream of u_pll.outclk_0 operates at the constrained frequency. (Setup slack landed at +0.914 ns on the initial Ch161 build; the Ch161 audit's wider reset false-path nudged the fitter into a slightly different placement, dropping the worst-case setup slack to +0.565 ns. Recovery analysis on the rst_sync stages — which had been hiding a real -0.079 ns violation under the original *rst_sync[0] constraint — is now gone from the .sta.summary entirely after the false-path was widened to *rst_sync[*].)

Snapshots: Ch160 baseline (parse_report + summaries + .sof) preserved under baseline_ch160/.

Open Ch162+ items (Ch161 forward-ref, superseded by Ch162 below):

  • Pipeline or strip the EE-core 64-bit Gouraud TRI divider: closed in Ch162 via STRIP_HW_DIVIDER (note: the actual divider is the Ch43 DIVU divider, not Gouraud TRI; the forward-ref's name was loose). The Ch162 strip retired the u_demo|u_core|div_0_rtl_0|... STA worst path entirely; see the Ch162 section below for the new critical path.
  • xfer-side T4 coverage TB (open from Ch157+).
  • useg_shadow_mem BRAM-shape forensics.
  • Video PHY shim (HDMI / VGA / PMOD) — VIDEO_* pins virtualized.

Sim regression: 143 PASS / 0 FAIL unchanged. Sim ignores the `ifdef USE_PLL_IP (no +define+USE_PLL_IP in the iverilog Makefile) so the stub stays active under sim.

Strip the EE-core hardware divider (Ch162)

Ch162 takes the lower-blast move from the Ch161 STA handoff: add a parameter that gates the EE-core's auto-inferred 32-bit hardware divider out of synthesis on the PSMCT32 SPRITE-only hardware build, then re-measure Fmax.

RTL change (rtl/ee/ee_core_stub.sv) gains parameter bit STRIP_HW_DIVIDER = 1'b0. Two / and % sites tied to the Ch43 DIVU instruction are gated by this parameter: the writeback (lines ~932-935) and the retire-trace arg3 mirror (lines ~1005-1014). Default 0 keeps DIVU semantics intact for every existing sim TB (tb_ee_core_divu_mflo is the only consumer; it stays at the default). When the parameter is 1, the writeback becomes a no-op (HI/LO unchanged, identical to the divisor==0 case the spec calls undefined) and the retire-trace arg3 reports 0. Quartus then has nothing to infer, so the div_0_rtl_0 block disappears.
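The gated writeback can be summarised in a short behavioral model. This is a sketch of the semantics described above, not the ee_core_stub source; the function name and argument order are assumptions, and the divisor==0 branch follows this document's statement that HI/LO are left unchanged in that case.

```python
MASK32 = 0xFFFFFFFF

def divu_writeback(rs, rt, hi, lo, strip_hw_divider=False):
    """Return the (hi, lo) pair after a DIVU retires."""
    rs &= MASK32
    rt &= MASK32
    if strip_hw_divider or rt == 0:
        return hi, lo           # no-op: HI/LO keep their previous values
    return rs % rt, rs // rt    # HI = remainder, LO = quotient (unsigned)
```

divu_writeback(17, 5, 0, 0) gives (2, 3), while the same call with strip_hw_divider=True returns the incoming HI/LO untouched. That is only behavior-neutral for software that never executes DIVU, which is the property the board build relies on.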

Wrapper plumbing: top_psmct32_raster_demo_bram gains a matching STRIP_HW_DIVIDER parameter and forwards it to ee_core_stub. The DE25-Nano board top sets .STRIP_HW_DIVIDER(1'b1) on its u_demo instantiation (the bootlet doesn't execute DIVU, so this is behavior-neutral for the demo). Sim TBs that instantiate the BRAM wrapper directly use the default 0.

Quartus result vs Ch161 (real-PLL baseline):

| Metric | Ch161 (real PLL) | Ch162 (real PLL + strip) |
| --- | --- | --- |
| Fit ALMs | 30,898 / 46,800 (66 %) | 30,006 / 46,800 (64 %) |
| Fit registers | 37,352 | 36,618 |
| Fit PLLs | 1 | 1 |
| RAM blocks | 14 | 14 |
| Setup slack worst (design) | +0.565 ns | +3.567 ns |
| Fmax (design domain) | 30.74 MHz | 33.6 MHz (+9.4 %) |
| quartus_asm | Successful | Successful (.sof produced) |

Stripping the divider freed 892 ALMs / 734 registers and yielded ~3 ns of new setup margin. Fmax climbs from 30.74 MHz to 33.6 MHz — a real jump, but not enough to clear the 50 MHz target (which would need a +67 % jump). Codex's Ch162 framing predicted this branch: "if Fmax jumps, we have a clean path to a 50 MHz demo bitstream; if not, the next real critical path will reveal itself." We landed in the second branch — Fmax jumped, but not far enough.

New critical path (the Ch163+ handoff, from output_files/de25_nano_psmct32_raster_demo_top.sta.rpt):

Field Value
Slack +3.567 ns
From `u_demo
To `u_demo
Data delay 38.443 ns of arrival vs 42.010 ns required (period 33.333 ns + clock skew + uncertainty)

The PCRTC divider comes from gs_pcrtc_stub.sv lines:

assign vram_x_unshift = {20'd0, hwin_rel} / hmag_factor;
assign vram_y_unshift = {20'd0, vwin_rel} / vmag_factor;

where hmag_factor = MAGH + 1 and vmag_factor = MAGV + 1. For the demo MAGH = MAGV = 0, so the divisor is constant 1 — but Quartus doesn't constant-propagate through this formulation and synthesizes a real 32-bit divider anyway. The parallel Ch162 fix shape would be a STRIP_PCRTC_MAG_DIV parameter (or a more general "demo doesn't use magnification" hint that bypasses the divider when both MAGH and MAGV are constant 0).

Snapshots: Ch161 baseline preserved under baseline_ch161/ (syn / fit / sta summaries + parse_report + .sof) for diff.

Open Ch163+ items:

  • Strip the PCRTC magnification divider on hardware builds (next critical path; same shape as Ch162's STRIP_HW_DIVIDER).
  • Once Fmax climbs north of 50 MHz, retune the IOPLL .ip to outclk_0 = 50 MHz, retarget the SDC, and ship a 50 MHz bitstream.
  • xfer-side T4 coverage TB (still open from Ch157+).
  • useg_shadow_mem BRAM-shape forensics.
  • Video PHY shim (HDMI / VGA / PMOD) — VIDEO_* pins virtualized.

Sim regression: 143 PASS / 0 FAIL unchanged. Default STRIP_HW_DIVIDER=0 preserves DIVU semantics for tb_ee_core_divu_mflo; the board top's STRIP_HW_DIVIDER=1 goes through tb_de25_nano_psmct32_raster_demo_top cleanly because the Ch149 board TB doesn't exercise DIVU.

Strip PCRTC magnification divider + 50 MHz close (Ch163)

Ch163 takes the next critical-path attack from the Ch162 STA report (the PCRTC magnification divider) and uses the resulting Fmax headroom to retune the PLL IP to 50 MHz output — closing the journey that started at the Ch152 fit failure with a real 50 MHz bitstream.

RTL change (rtl/gif_gs/gs_pcrtc_stub.sv) gains parameter bit STRIP_PCRTC_MAG_DIV = 1'b0. The two / operators are gated:

assign vram_x_unshift = STRIP_PCRTC_MAG_DIV
                        ? {20'd0, hwin_rel}
                        : ({20'd0, hwin_rel} / hmag_factor);
assign vram_y_unshift = STRIP_PCRTC_MAG_DIV
                        ? {20'd0, vwin_rel}
                        : ({20'd0, vwin_rel} / vmag_factor);

Default 0 keeps the live divider math for every Ch93-era magnification scanout TB (tb_gs_scanout_magh_magv etc.). When 1, the math collapses to a passthrough — equivalent to the MAGH=MAGV=0 case the demo always hits but with no inferred divider for Quartus to synthesize.
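The equivalence the strip relies on is easy to check in a model. A minimal sketch, assuming hwin_rel/vwin_rel are 12-bit values (inferred from the {20'd0, ...} zero-extension to 32 bits); the function name is hypothetical:

```python
def vram_unshift(win_rel, mag, strip_pcrtc_mag_div=False):
    """Model of vram_x_unshift / vram_y_unshift for one axis."""
    if strip_pcrtc_mag_div:
        return win_rel              # passthrough, no divider inferred
    return win_rel // (mag + 1)     # mag_factor = MAG + 1
```

For mag = 0 the two paths agree on every 12-bit input, so the strip is behavior-neutral for the demo. For any non-zero magnification they diverge (vram_unshift(7, 1) is 3, but 7 with the strip enabled), which is why the parameter defaults to off.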

Wrapper plumbing: top_psmct32_raster_demo_bram gains a matching STRIP_PCRTC_MAG_DIV parameter that forwards to gs_pcrtc_stub. The DE25-Nano board top sets .STRIP_PCRTC_MAG_DIV(1'b1) on its u_demo instantiation.

Quartus result, two stages:

Stage A — strip @ 30 MHz target (still on the Ch161 PLL .ip):

| Metric | Ch162 (strip EE divider only) | Ch163 (strip both, 30 MHz) |
| --- | --- | --- |
| Fit ALMs | 30,006 / 46,800 (64 %) | 27,216 / 46,800 (58 %) |
| Setup slack worst | +3.567 ns | +21.113 ns |
| Fmax (design) | 33.6 MHz | 81.83 MHz (+143 %) |

The Ch163 strip alone freed +17.5 ns of margin and 2,790 ALMs — large enough to clear 50 MHz outright. Codex's Ch162 framing predicted both branches of the if-Fmax-jumps fork; Ch163 lands in the first branch ("clean path to a 50 MHz demo bitstream").

Stage B — retune PLL .ip from 30 MHz → 50 MHz output:

The pll.ip source's gui_output_clock_frequency0 and gui_output_clock_frequency_ps0 are bumped (30.0 → 50.0 MHz; 33333.333 → 20000.0 ps). quartus_ipgenerate rebuilds the .qip / synth files in-place. No SDC change is needed: CLOCK2_50 stays pinned at the physical 50 MHz period, and the IOPLL's auto-generated SDC declares the new outclk_0 frequency.

| Metric | Ch163 strip @ 30 MHz target | Ch163 strip @ 50 MHz target |
| --- | --- | --- |
| Fit ALMs | 27,216 / 46,800 (58 %) | 27,543 / 46,800 (59 %) |
| RAM blocks / PLLs | 14 / 1 | 14 / 1 |
| Setup slack worst | +21.113 ns | +7.500 ns |
| Fmax (design) | 81.83 MHz | 80.0 MHz |
| .sof produced | yes (30 MHz run on hw) | yes (50 MHz on hw) |

The .sof produced under Stage B genuinely runs at 50 MHz on the DE25-Nano — the IOPLL takes 50 MHz CLOCK2_50 in and emits 50 MHz outclk_0 (effectively a 1:1 relation through the real PLL hardware so the chip's clock distribution still goes through the IOPLL's clock network). All 8 timing classes positive; no recovery violations; build gate Successful.

Snapshots:

Open Ch164+ items (the project has hit the major hardware milestone Codex called out at Ch157+; Ch164+ is post-launch):

  • xfer-side T4 coverage TB (open from Ch157+).
  • useg_shadow_mem BRAM-shape forensics.
  • Video PHY shim (HDMI / VGA / PMOD) — VIDEO_* pins still virtualized; this is the next big front-end deliverable before the demo can paint a real screen.

Sim regression: 143 PASS / 0 FAIL unchanged. Default-off on STRIP_PCRTC_MAG_DIV preserves every Ch93 magnification scanout TB; the board top's STRIP_PCRTC_MAG_DIV=1 propagates cleanly through tb_de25_nano_psmct32_raster_demo_top since the demo locks MAGH=MAGV=0.

HDMI pin shim — pixels off-chip (Ch164)

Ch164 is the first video-PHY chapter. Codex's framing was "small PHY shim chapter, not a full display-stack leap. Get pixels off-chip before making them pretty." The chapter replaces the abstract VIDEO_R/G/B/HSYNC/VSYNC/DE virtual pins with real DE25-Nano HDMI transmitter signals; the ADV7513 chip itself stays asleep (its I²C wake-up FSM is the Ch165+ chapter), so the bitstream makes the FPGA pins toggle correctly but a real monitor stays dark until Ch165 lands.

Wrapper change (rtl/top/de25_nano_psmct32_raster_demo_top.sv): five new top-level outputs added — HDMI_TX_CLK (= design_clk, the 50 MHz pixel clock), HDMI_TX_D[23:0] packing {VIDEO_R, VIDEO_G, VIDEO_B} (R in MSBs, ADV7513 default 24-bit RGB), and HDMI_TX_HS / HDMI_TX_VS / HDMI_TX_DE mirroring the abstract VIDEO_* signals. The VIDEO_* ports are kept on the wrapper as VIRTUAL_PIN ON (the Ch149 board TB references them via hierarchical probe).
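The {VIDEO_R, VIDEO_G, VIDEO_B} packing can be sketched as a small reference model. It assumes 8 bits per channel (24 bits split three ways); `pack_hdmi_tx_d` is a hypothetical name used only for illustration.

```python
def pack_hdmi_tx_d(r: int, g: int, b: int) -> int:
    """Pack 8-bit R, G, B into a 24-bit word with R in the most significant byte."""
    for name, value in (("r", r), ("g", g), ("b", b)):
        if not 0 <= value <= 0xFF:
            raise ValueError(f"{name}={value} does not fit in 8 bits")
    return (r << 16) | (g << 8) | b


word = pack_hdmi_tx_d(0x12, 0x34, 0x56)
print(f"HDMI_TX_D = 0x{word:06X}")  # R lands in bits [23:16], B in bits [7:0]
```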

QSF change (synth/.../de25_nano_psmct32_raster_demo_top.qsf): HDMI pinout sourced from retroDE_nes/retroDE_nes.qsf for the same DE25-Nano (Terasic Agilex 5) board — HDMI_TX_CLK on PIN_DJ24 with 1.1-V IO standard (matches the on-board level shifter), data + sync pins on 3.3-V LVCMOS. The companion ADV7513 control pins (HDMI_I2C_SCL, HDMI_I2C_SDA, HDMI_TX_INT, HDMI_MCLK) are intentionally NOT pinned — the chip stays in standby on power-up and ignores its 24-bit RGB input until the I²C wake-up FSM lands in Ch165+.

SDC change (synth/.../de25_nano_psmct32_raster_demo_top.sdc): set_false_path -to for each HDMI output port. Proper set_output_delay constraints with respect to a generated HDMI_TX_CLK domain land alongside the Ch165+ wake-up FSM, when the ADV7513's actual setup/hold window comes out of the chip's datasheet pass.

Scaffold-check extension (sim/Makefile): top_psmct32_raster_demo_quartus_scaffold_check now also verifies HDMI_TX_CLK + HDMI_TX_D[0..23] + HS/VS/DE are pin-assigned (sentinel set; not exhaustive) — fails the gate if Quartus would auto-place them on arbitrary package pins.
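The idea behind the sentinel check can be sketched as follows. This is not the Makefile's actual implementation; it is a hypothetical stand-alone version that scans QSF text for `set_location_assignment ... -to <port>` lines and reports which sentinel ports have none. The commented-out pin name in the sample is a placeholder.

```python
import re

# Sentinel ports named in the text: clock, HS/VS/DE, and the 24 data lanes.
SENTINEL_PORTS = (
    ["HDMI_TX_CLK", "HDMI_TX_HS", "HDMI_TX_VS", "HDMI_TX_DE"]
    + [f"HDMI_TX_D[{i}]" for i in range(24)]
)

LOCATION_RE = re.compile(r"^\s*set_location_assignment\s+\S+\s+-to\s+(\S+)")


def unpinned_ports(qsf_text, ports=SENTINEL_PORTS):
    """Return the ports that have no set_location_assignment in the QSF text."""
    pinned = set()
    for line in qsf_text.splitlines():
        code = line.split("#", 1)[0]  # ignore QSF comments
        match = LOCATION_RE.match(code)
        if match:
            pinned.add(match.group(1).strip('"'))
    return [port for port in ports if port not in pinned]


sample_qsf = """
set_location_assignment PIN_DJ24 -to HDMI_TX_CLK
# set_location_assignment PIN_XX0 -to HDMI_TX_HS   (placeholder, commented out)
"""
missing = unpinned_ports(sample_qsf)
print(len(missing), "sentinel ports lack a pin assignment")
```

A gate built on this would fail whenever `missing` is non-empty.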

Quartus result vs Ch163 (50 MHz):

| Metric | Ch163 (50 MHz, no HDMI pins) | Ch164 (50 MHz + HDMI pins) |
|---|---|---|
| Fit ALMs | 27,543 / 46,800 (59 %) | 27,271 / 46,800 (58 %) |
| Fit RAM / PLL blocks | 14 / 1 | 14 / 1 |
| Fit pins | 17 / 351 (5 %) | 45 / 351 (13 %) (+28 HDMI) |
| Setup slack worst (design) | +7.500 ns | +7.536 ns |
| Fmax (design domain) | 80.0 MHz | ~80 MHz (unchanged) |
| quartus_asm | Successful | Successful (.sof produced) |

The +28 pins are exactly the new HDMI shim — 24 RGB lanes, 1 clock, 3 sync (HS / VS / DE). Setup slack stays at ~+7.5 ns because the new pins are false_path'd — STA doesn't time anything against them yet. ALMs ticked down slightly as the fitter rebalanced under the wider pin map.

Snapshot: Ch163 50 MHz baseline preserved at baseline_ch163_50mhz/ (syn / fit / sta summaries + parse_report + .sof). The baseline_ch163_30mhz/ 30-MHz milestone is also preserved.

Open Ch165+ items:

  • ADV7513 I²C wake-up FSM — without this the HDMI port outputs nothing on a real monitor. Ch165 owns the chip bring-up: pin HDMI_I2C_SCL / HDMI_I2C_SDA / HDMI_TX_INT / HDMI_MCLK, drop in an I²C master that walks the canonical ADV7513 register-set (sourced from retroDE_nes's working bring-up).
  • Proper set_output_delay constraints once the ADV7513 setup/hold window is documented (replacing Ch164's false_path).
  • Make the rendered pattern bigger than Ch123's 16×8 SPRITE so there's something visible to admire on a real screen.
  • xfer-side T4 coverage TB (still open from Ch157+).
  • useg_shadow_mem BRAM-shape forensics.

Sim regression: 143 PASS / 0 FAIL unchanged — no RTL changes that touched sim semantics; the new HDMI ports are combinational mirrors of existing VIDEO_* signals, and tb_de25_nano_psmct32_raster_demo_top references VIDEO_* unchanged.

Wake the ADV7513 — first .sof that drives a real HDMI monitor (Ch165)

Ch165 turns "FPGA pins toggling" into "monitor has a fighting chance of showing the tiny frame" — Codex's framing for the chapter. The ADV7513 chip stays in standby on power-up; an I²C master needs to walk a canonical register-write sequence to configure 24-bit RGB input + sync polarity + power-up + HPD override before the chip will accept the FPGA's HDMI_TX_* data and drive the connector.

Modules ported (Terasic DE-series reference design, free use on Terasic hardware per the license that ships with the DE25-Nano System CD; copyright retained):

  • rtl/platform/I2C_Controller.v: bit-bang I²C master with a 33-step transaction layout (start / slave-addr / sub-addr / data / stop, ~50 µs per bit at the derived 20 kHz I²C clock).
  • rtl/platform/I2C_HDMI_Config.v: wake-up FSM that walks a 38-entry LUT of ADV7513 register writes (slave 0x72): power-up + HPD override + audio init + AVI InfoFrame for full-range RGB 444 + dither + clock-divide + HDMI mode select. Adapted from the retroDE_splash/rtl/platform/ versions (same DE25-Nano board); LUT customizations (HPD override, AVI InfoFrame for full-range RGB) carry through.

Wrapper changes (rtl/top/de25_nano_psmct32_raster_demo_top.sv):

  • Four new top-level ports: inout HDMI_I2C_SCL, inout HDMI_I2C_SDA (open-drain I²C bus), input HDMI_TX_INT (chip's HPD / monitor-sense interrupt, active-low), and output HDMI_MCLK (audio sample-rate reference, driven by CLOCK2_50 since the demo is video-only).
  • I2C_HDMI_Config u_hdmi_i2c instantiated. Clocked on CLOCK2_50 (NOT design_clk — the wake-up runs even before the PLL locks); reset on ~ninit_done (raw async reset; the I²C bus stays held in a clean state until FPGA init completes). Output READY (= hdmi_init_done) goes high after the LUT walk; HDMI_TX_INT going low retriggers the walk so a late hot-plug after FPGA boot still wakes the chip.
  • New status LED: LED[3] = ~hdmi_init_done (active-low; lit means the chip is configured). LED[7:4] are re-tied high.
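The active-low LED encoding is easy to misread, so here is a small reference model. It assumes LED[2:0] are the three pre-existing status LEDs and treats them as a generic tuple; `led_byte` and its argument names are hypothetical.

```python
def led_byte(status_lit, hdmi_init_done):
    """Return the 8-bit LED value; a 0 bit lights the LED (active-low)."""
    value = 0xF0  # LED[7:4] tied high
    if not hdmi_init_done:
        value |= 1 << 3  # LED[3] = ~hdmi_init_done
    for bit, lit in enumerate(status_lit):  # LED[2:0]: existing status LEDs
        if not lit:
            value |= 1 << bit
    return value


before = led_byte((True, True, True), hdmi_init_done=False)
after = led_byte((True, True, True), hdmi_init_done=True)
print(f"{before:#010b} -> {after:#010b}")
```

With all three status LEDs lit and the LUT walk unfinished, the model gives 0b11111000, the value the board TB reports.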

QSF + files.f + sim Makefile: QSF gains pin assignments for the 4 new control pins (sourced from retroDE_nes: BT1 / BW2 / CF2 / CF1) plus IO standards (3.3-V LVCMOS for everything). The two new platform Verilog sources are added to the QSF source list, the synth files.f, and the sim Makefile's RTL_SRCS. The scaffold-check extends to verify all 4 control pins are pin-assigned + IO standard'd, alongside the Ch164 24-pin HDMI data set.

SDC change (de25_nano_psmct32_raster_demo_top.sdc): set_false_path -to / -from on the new control pins. The I²C bus runs at ~20 kHz (50 µs per SCL period) and is inherently async to the design clock; HDMI_MCLK is driven by CLOCK2_50 and sampled by the chip's audio PLL — both well below any constraint on the fabric.

Quartus result vs Ch164:

| Metric | Ch164 (HDMI data only) | Ch165 (HDMI data + wake-up) |
|---|---|---|
| Fit ALMs | 27,271 / 46,800 (58 %) | 27,374 / 46,800 (58 %) |
| Fit RAM / PLL blocks | 14 / 1 | 14 / 1 (unchanged) |
| Fit pins | 45 / 351 | 49 / 351 (+4 control) |
| Setup slack worst | +7.536 ns | +7.198 ns |
| quartus_asm | Successful | Successful (.sof produced) |

The +103 ALMs are the I²C controller's bit-bang state machine and the 38-entry LUT walker. STA stays positive on every class — the wake-up FSM lives entirely on the I²C-clock domain (slow), and Recovery analysis on iRST_N async-deassert is cleanly +17.621 ns of slack.

TB note: tb_de25_nano_psmct32_raster_demo_top (the Ch149 board smoke) wires up the new HDMI_TX_INT input (tied high = no interrupt) and leaves the I²C SCL/SDA lines floating. The wake-up FSM walks the LUT, but full completion takes ~125 ms simulated at the production divider (controller-clock period ~100 µs × 33 phases per register write × 38 register writes), far longer than the existing 5 ms TB runtime. The board TB doesn't observe hdmi_init_done directly; it pre-dates the wake-up FSM and only smoke-tests the wrapper.

The Ch165 audit landed tb_hdmi_i2c_wake_smoke (sim/tb/top/), which overrides CLK_Freq / I2C_Freq to collapse the divider so the walk runs in microseconds, and asserts the LUT walk, the READY rise, the HDMI_TX_INT retrigger, open-drain SDA, and the Ch166 sticky NACK watchdog.

Ch167 added a bus-level byte-sequence lock: the TB switched its SDA model from pulldown to pullup plus a phase-aware slave-ACK driver (drives strong-LOW exactly when u_dut.u0.phase is PH_ACK0/1/2, and releases otherwise so the master's data bits are visible on the wire). A decoder samples SDA on each SCL rising edge between START and STOP, assembles the three bytes per transaction into a 24-bit {dev_addr, reg, data} tuple, and compares against u_dut.mI2C_DATA[23:0] snapshotted on mI2C_GO rising edges. Asserts: 38 captured == 38 intent, every byte matches, every dev_addr is 8'h72. The Phase 3 open-drain check also flipped semantics from "SDA never strong-HIGH" to "SDA never 'x" (the right violation test for the pullup bus).
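The decoder's sampling rule can be illustrated with a small software model. This is a sketch, not the TB's SystemVerilog: it assumes every byte is ACKed, models the bus as a list of (scl, sda) levels, and uses placeholder register/value pairs rather than the project's real LUT. `encode_write` and `decode_writes` are hypothetical names.

```python
def encode_write(dev_addr, reg, data):
    """Build (scl, sda) samples for one three-byte write with every byte ACKed."""
    wave = [(1, 1), (1, 0), (0, 0)]  # idle, START (SDA falls while SCL is high), SCL low
    for byte in (dev_addr, reg, data):
        bits = [(byte >> i) & 1 for i in range(7, -1, -1)] + [0]  # MSB first, then ACK=0
        for bit in bits:
            wave += [(0, bit), (1, bit), (0, bit)]  # SDA is stable across the SCL pulse
    wave += [(0, 0), (1, 0), (1, 1)]  # STOP: SDA rises while SCL is high
    return wave


def decode_writes(samples):
    """Return one 24-bit {dev_addr, reg, data} word per START..STOP transaction."""
    words = []
    bits = None  # None while the bus is idle
    prev_scl, prev_sda = 1, 1
    for scl, sda in samples:
        if prev_scl and scl:
            if prev_sda and not sda:
                bits = []  # START condition
            elif sda and not prev_sda and bits is not None:
                # STOP condition. The SCL edge that sets up STOP adds one extra
                # sample, so only the first 27 bits (3 x (8 data + ACK)) count.
                if len(bits) < 27:
                    raise ValueError("short transaction")
                word = 0
                for start in (0, 9, 18):  # skip the ACK bit after each byte
                    for bit in bits[start:start + 8]:
                        word = (word << 1) | bit
                words.append(word)
                bits = None
        elif scl and not prev_scl and bits is not None:
            bits.append(sda)  # sample SDA on the SCL rising edge
        prev_scl, prev_sda = scl, sda
    return words


# Placeholder (reg, value) pairs, all addressed to slave 0x72.
intent = [(0x72 << 16) | (reg << 8) | val for reg, val in [(0x41, 0x10), (0x98, 0x03)]]
wave = []
for word in intent:
    wave += encode_write(word >> 16, (word >> 8) & 0xFF, word & 0xFF)
captured = decode_writes(wave)
print(len(captured) == len(intent) and captured == intent)
```

Comparing `captured` against `intent` mirrors the TB's "N captured == N intent, every byte matches" assertion at a much smaller scale.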

Snapshots: Ch164 baseline preserved at baseline_ch164/; Ch165 baseline at baseline_ch165/.

Open Ch168+ items:

  • Proper set_output_delay constraints on HDMI_TX_* once the ADV7513 setup/hold window is locked from the bring-up datasheet pass (replaces the Ch164 set_false_path -to).
  • Make the rendered pattern bigger than Ch123's 16×8 SPRITE so there's something visible to admire on a real screen.
  • xfer-side T4 coverage TB (still open from Ch157+).
  • useg_shadow_mem BRAM-shape forensics.

Sim regression: 144 PASS / 0 FAIL. tb_de25_nano_psmct32_raster_demo_top PASSES with the new HDMI control ports wired up (HDMI_TX_INT held high in the TB; LED=0b11111000 shows the existing 3 status LEDs lit — LED[3] stays unlit because the LUT walk doesn't complete in 5 ms of sim). tb_hdmi_i2c_wake_smoke PASSES the accelerated bring-up + Ch166 NACK-watchdog assertions.
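The reason LED[3] stays unlit in the board TB follows from the figures quoted in the TB note. A quick check, taking the ~100 µs controller-clock period, 33 controller phases per LUT entry, and 38 LUT entries as given:

```python
CTRL_CLK_PERIOD_US = 100   # production controller-clock period, approximate
PHASES_PER_ENTRY = 33      # controller phases spent on one LUT entry
LUT_ENTRIES = 38           # ADV7513 register writes in the wake-up LUT
TB_RUNTIME_MS = 5          # existing board-TB simulation window

walk_ms = CTRL_CLK_PERIOD_US * PHASES_PER_ENTRY * LUT_ENTRIES / 1000
print(f"LUT walk ~ {walk_ms:.1f} ms vs {TB_RUNTIME_MS} ms of sim "
      f"({walk_ms / TB_RUNTIME_MS:.0f}x too long)")
```

This is why tb_hdmi_i2c_wake_smoke collapses the divider instead of extending the board TB's runtime.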

Hardware-readiness pass for the Ch123 PSMCT32 raster demo (Ch144)

Ch144 is a synthesis/FPGA-readiness audit around the first hardware-demo candidate (Ch123 PSMCT32 raster e2e, marked above). No RTL changes — Ch144 documents what a top-level FPGA wrapper needs to know before attempting a first build.

RTL dependency tree (Ch123-only) — what the demo actually instantiates. The full RTL_SRCS list compiled by sim contains ~40 modules; Ch123 only reaches these 11, plus the swizzle math primitive that the three swizzle-aware modules each instantiate internally:

| Module | Role in Ch123 |
|---|---|
| bios_rom_stub | EE bootlet at 0xBFC0_0000 (~18 instructions) |
| ee_ram_stub | DMAC-side GIF payload (~24 qwords) |
| ee_memory_map_stub | EE-CPU + DMAC + bios + map's GS-priv decode |
| ee_core_stub | MIPS R5900 core running the bootlet |
| ee_gs_priv_bridge_stub | EE 32-bit MMIO → 64-bit GS-priv reg writes |
| dmac_reg_stub | DMAC ch2 NORMAL transfer |
| gif_packed_stub | GIFtag + PACKED A+D parser |
| gs_stub | GS register file + raster (PSMCT32_SWIZZLE=1) |
| gif_image_xfer_stub | TRXDIR/IMAGE engine (PSMCT32_SWIZZLE=1, dormant in Ch123) |
| vram_stub | 8 KiB VRAM (one PSMCT32 page) |
| gs_pcrtc_stub | PCRTC scanout (PSMCT32_SWIZZLE=1) |
| gs_swizzle_psmct32_stub | Pure-comb math, instantiated ×3 inside the gates above |

Sim-only constructs audit (full sweep of the 12 modules above):

  • bios_rom_stub.sv and ee_ram_stub.sv: $display / $readmemh inside initial begin. Both are synth-safe: Xilinx Vivado and Intel Quartus support $readmemh for BRAM initialization, and $display is silently ignored by all major synthesizers.
  • vram_stub.sv L114-117: single $error parameter validator inside initial begin. Synth ignores it; the BYTES parameter must be set to a sane value at instantiation regardless.
  • ee_gs_priv_bridge_stub.sv L118: runtime $error on unsupported byte enables, inside always_ff. Synth ignores the $error; the surrounding logic still synthesizes correctly.
  • No $finish / $dumpfile / $random / force / release / real-typed signals / hierarchical refs in any module of the Ch123 dep tree. (TBs use hierarchical refs into bios_rom_stub to preload the bootlet; that's a TB-only concern, since on hardware the bootlet image is the BRAM init. Out-of-tree note: boot_install_agent_stub.sv (SIF subsystem, not in the Ch123 dep tree) contains a $fatal runtime validator, but it is never compiled into the Ch123 hardware build.)
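A sweep like the one above can be approximated with a short script. The sketch below is a hypothetical helper, not the tool used for the audit: it strips comments and flags the blocking tokens listed in the last bullet. It does not detect hierarchical references, and it can false-positive on keywords that appear inside string literals.

```python
import re

# Tokens from the audit list; hierarchical references are out of scope here.
SIM_ONLY = re.compile(r"\$(?:finish|dumpfile|random)\b|\b(?:force|release|real)\b")


def strip_comments(sv_text):
    """Remove // and /* */ comments while keeping line numbers stable."""
    sv_text = re.sub(
        r"/\*.*?\*/",
        lambda m: "\n" * m.group(0).count("\n"),
        sv_text,
        flags=re.S,
    )
    return re.sub(r"//[^\n]*", "", sv_text)


def find_sim_only(sv_text):
    """Return (line_number, token) for every blocking construct found."""
    hits = []
    for lineno, line in enumerate(strip_comments(sv_text).splitlines(), start=1):
        for match in SIM_ONLY.finditer(line):
            hits.append((lineno, match.group(0)))
    return hits


sample = """module demo;
  initial begin
    $display("boot");  // $finish would be flagged if it were live code
  end
  real gain;
endmodule
"""
hits = find_sim_only(sample)
print(hits)
```

In the sample, the `$finish` inside a comment is ignored and only the `real` declaration is reported.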

Memory sizing:

| Memory | Default | Ch123 sim setting | Ch123 hw recommendation | FPGA fit |
|---|---|---|---|---|
| bios_rom_stub | 4 MiB | 4 KiB | 4 KiB | ≤1 BRAM tile |
| ee_ram_stub | 16 KiB | 4 KiB | 4 KiB | ≤1 BRAM tile |
| vram_stub | 64 KiB | 8 KiB | 8 KiB | ≤2 BRAM tiles (one PSMCT32 page) |
| ee_memory_map_stub.useg_shadow_mem (Ch145) | 4 MiB | 4 MiB | 4 KiB (override USEG_SHADOW_WORDS_PARAM=1024) | ≤1 BRAM tile when overridden |

The 16×8 framebuffer needs only 16×8×4 = 512 bytes; 8 KiB gives the full first PSMCT32 page (FBP=0). For a more ambitious hardware demo (multi-page framebuffers, textures), grow vram_stub.BYTES toward 1 MiB / 4 MiB. Real PS2 has 4 MiB of VRAM; a first hardware build can stay at 8 KiB.
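The sizing arithmetic can be checked with a few lines. The 64×32-pixel PSMCT32 page geometry is the standard GS layout and is stated here as background, not something the text above spells out; the helper name is hypothetical.

```python
BYTES_PER_PSMCT32_PIXEL = 4      # 32 bits per pixel
PAGE_W, PAGE_H = 64, 32          # PSMCT32 page dimensions in pixels (GS layout)
KIB = 1024
MIB = 1024 * KIB


def framebuffer_bytes(width, height):
    """Bytes needed for a PSMCT32 framebuffer of the given pixel size."""
    return width * height * BYTES_PER_PSMCT32_PIXEL


demo_fb = framebuffer_bytes(16, 8)               # the Ch123 sprite-sized frame
page_bytes = framebuffer_bytes(PAGE_W, PAGE_H)   # one full PSMCT32 page
pages_in_full_vram = (4 * MIB) // page_bytes     # real PS2 VRAM, in pages

print(demo_fb, page_bytes // KIB, pages_in_full_vram)
```

The demo framebuffer uses 512 of the 8,192 bytes in the first page, so the 8 KiB setting leaves ample headroom.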

Ch145 — useg_shadow_mem parameterization: pre-Ch145, the ee_memory_map_stub's useg-shadow backing was a fixed 1M-word / 4 MiB array. That was correct for the BIOS-smoke chapters that need full first-4-MiB-of-useg coverage, but it's wasted area for the Ch123 hardware demo (which never touches useg — the bootlet runs from BIOS at 0xBFC0_0000 and the GIF payload from RAM at phys 0x100). Ch145 promotes USEG_SHADOW_WORDS from a hardcoded localparam to the USEG_SHADOW_WORDS_PARAM module parameter (default 1M words = 4 MiB → existing TBs unchanged). For the Ch123 hardware demo, the top-level wrapper instantiates ee_memory_map_stub with .USEG_SHADOW_WORDS_PARAM(1024) to shrink the inferred BRAM footprint by ~1024×; correctness is unaffected because no useg access ever happens in the Ch123 data plane.
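The ~1024× shrink follows directly from the word counts, assuming 32-bit (4-byte) words as implied by "1M words = 4 MiB":

```python
BYTES_PER_WORD = 4                 # implied by 1M words = 4 MiB

default_words = 1 << 20            # USEG_SHADOW_WORDS_PARAM default (1M words)
demo_words = 1024                  # Ch123 hardware-demo override

default_shadow_bytes = default_words * BYTES_PER_WORD
demo_shadow_bytes = demo_words * BYTES_PER_WORD
shrink = default_shadow_bytes // demo_shadow_bytes

print(default_shadow_bytes // (1024 * 1024), "MiB ->", demo_shadow_bytes // 1024, "KiB,", f"{shrink}x smaller")
```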

Clock / reset assumptions:

  • Single clock domain (clk) — all 12 modules share one input.
  • Active-low synchronous reset input (rst_n) — also a single shared input. No reset gating, no per-module variants. The reset is sampled inside always_ff @(posedge clk) via the if (!rst_n) pattern (NOT posedge clk or negedge rst_n) — i.e., it is NOT an async reset despite being active-low. On FPGA this should be brought up via the device's reset bridge so the deasserting edge is synchronous to clk.
  • No clock gating, no derived clocks. The PCRTC's hsync/vsync/de are regular clock-domain outputs, not separate clocks.
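The reset requirement (deassertion synchronous to clk) is what a reset bridge or a two-flop reset synchronizer provides. The behavioural model below is only an illustration of that idea in software; the class is hypothetical, and the real bridge is a device primitive or RTL, not this code.

```python
class ResetSynchronizer:
    """Two-flop model: reset asserts immediately, deasserts after N clock edges."""

    def __init__(self, stages=2):
        self.ffs = [0] * stages  # rst_n chain; 0 means reset asserted

    def clock(self, rst_n_in):
        """Advance one rising clock edge and return the synchronized rst_n."""
        if not rst_n_in:
            self.ffs = [0] * len(self.ffs)   # assertion clears every stage
        else:
            self.ffs = [1] + self.ffs[:-1]   # release ripples through the chain
        return self.ffs[-1]


sync = ResetSynchronizer()
trace = [sync.clock(level) for level in (0, 1, 1, 1)]
print(trace)  # rst_n out goes high only after two clean clock edges
```

Downstream `if (!rst_n)` logic then sees the release on a clock edge, matching the synchronous-reset assumption in the list above.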

Swizzle gate parameter defaults:

  • All four swizzle parameters (PSMCT32_SWIZZLE, PSMCT16_SWIZZLE, PSMT8_SWIZZLE, PSMT4_SWIZZLE) default to 1'b0 on gs_stub, gs_pcrtc_stub, and gif_image_xfer_stub. For the Ch123 hardware demo, instantiate these three modules with PSMCT32_SWIZZLE(1'b1) and the other three left at 1'b0. The swizzle-math primitives (gs_swizzle_psmct32_stub etc.) are pure-comb and trim cleanly when their gate is off.

Top-level harness expectations (for a future top_psmct32_raster_demo.sv):

  • Inputs: clk, rst_n, plus board-level video-out connections (HDMI / DVI / VGA — driven by r/g/b/hsync/vsync/de from gs_pcrtc_stub).
  • The EE bootlet image must be preloaded into bios_rom_stub via either IMAGE_FILE (→ $readmemh) or a bake-step that writes a .mem next to the synthesis project. The bootlet is 18 MIPS instructions (currently authored procedurally in the Ch123 TB body via ee_prog_word()); for hardware this needs to become a static .mem checked into the repo.
  • The GIF payload must be preloaded into ee_ram_stub via the same mechanism — 24 qwords starting at PAYLOAD_MADR=0x100. Current TB authors them procedurally with preload_qword(); hardware needs a static .mem.
  • The core_go signal must be tied high (or pulsed by a board reset-release sequencer) so the EE starts fetching from 0xBFC0_0000.
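A bake step for the static .mem files could look like the sketch below. It emits one zero-padded hex word per line, which $readmemh accepts. The helper name, the word widths (32-bit for the bootlet, 128-bit for GIF qwords), and the sample values are assumptions; the real images would come from the words the Ch123 TB currently authors procedurally.

```python
def to_readmemh(words, width_bits=32):
    """Render integers as $readmemh text: one zero-padded hex word per line."""
    digits = width_bits // 4
    lines = []
    for index, word in enumerate(words):
        if not 0 <= word < (1 << width_bits):
            raise ValueError(f"word {index} does not fit in {width_bits} bits")
        lines.append(f"{word:0{digits}x}")
    return "\n".join(lines) + "\n"


# Placeholder contents only; these are not the real bootlet or GIF payload.
bootlet_mem = to_readmemh([0x00000000, 0xDEADBEEF], width_bits=32)
payload_mem = to_readmemh([0x1], width_bits=128)
print(bootlet_mem, end="")
```

Writing the returned strings to files checked into the repo would give synthesis the same image on every build.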

Known sim-only constructs that should NOT block first build:

  • $display lines in BIOS/RAM init (synth ignores).
  • $readmemh (synth tools handle it for BRAM init).
  • $error parameter validators (synth ignores).

Known sim-only constructs that WOULD block first build:

  • None found in the Ch123 dep tree.

Open questions for the hardware-build session (deliberately not answered here — they need a board-level decision):

  • Target FPGA family + clock frequency (PCRTC was designed around 13.5 MHz pixel clock for the 16×8 active area; first build can run at any clock since the TB doesn't model real CRTC timing).
  • Video-out PHY (HDMI core, VGA DAC, on-board HDMI transmitter chip).
  • BIOS / payload bake step (Vivado update_compile_order + .mem files vs. a SystemVerilog localparam array preload).
  • Whether to keep ee_core_stub's STRICT_UNSUPPORTED gate active on hardware (catches unknown opcodes by halt+latch — useful for debugging, but a hard failure on any unintended fetch).

The Ch90 white-box TB tb_gs_scanout_basic.sv exercises the full round trip: instantiates gs_stub + vram_stub + gs_pcrtc_stub, drives a 4×4 sprite through the GIF reg port, waits for raster to fully drain, then enables scanout and captures one full frame's (hcnt, vcnt) → (r, g, b) trace. Asserts: every pixel inside the sprite reads as the emitted color, every pixel outside reads as 0, and at least one EV_MODE frame trace fires.
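The inside/outside pixel assertion amounts to comparing a captured frame against a rectangle. A minimal software sketch of that check, with the frame size, sprite position, and color chosen arbitrarily for illustration (the function is hypothetical, not the TB's code):

```python
BLACK = (0, 0, 0)


def check_frame(frame, x0, y0, width, height, color):
    """Return (x, y, got, expected) for every pixel that breaks the sprite rule."""
    mismatches = []
    for y, row in enumerate(frame):
        for x, pixel in enumerate(row):
            inside = x0 <= x < x0 + width and y0 <= y < y0 + height
            expected = color if inside else BLACK
            if pixel != expected:
                mismatches.append((x, y, pixel, expected))
    return mismatches


color = (0xFF, 0x80, 0x00)
# 16x8 frame holding a 4x4 sprite at (2, 1); everything else is black.
frame = [[color if 2 <= x < 6 and 1 <= y < 5 else BLACK for x in range(16)]
         for y in range(8)]
clean = check_frame(frame, 2, 1, 4, 4, color)

frame[0][0] = color  # inject a stray pixel outside the sprite
dirty = check_frame(frame, 2, 1, 4, 4, color)
print(len(clean), len(dirty))
```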