# GIF/GS Contract

Status: `Draft`

## Purpose

Define the graphics ingress and rendering/display boundary.

## Owns

- GIF path intake and arbitration,
- GIF tag interpretation,
- GS register decode,
- GS VRAM-visible operations,
- framebuffer/zbuffer/texture-visible state handling,
- PCRTC/display output generation or a planned approximation layer.

## Inputs

- DMAC channel 2 traffic,
- VIF/VU-generated graphics traffic,
- privileged GS register writes,
- reset and display configuration controls.

## Outputs

- VRAM updates,
- display timing and pixel output,
- status/interrupt signals,
- packet and register trace events.

## Questions to lock

- What is the first output milestone:
  - GS privileged register acceptance only
  - static background color
  - minimal primitive draw
  - `gsKit`-style demo target
- Is Phase 1 display based on a faithful GS/PCRTC path or a temporary adapter?
- What VRAM organization assumptions must stay stable from the beginning?

## Allowed early stubs

- privileged-register-only GS stub,
- BGCOLOR/test-pattern display path,
- packet logger with no rendering.

## Required debug visibility

- GIF tags,
- PATH source and arbitration result,
- GS register writes,
- VRAM write summaries,
- display mode transitions.

## First meaningful milestone

- a known packet stream or direct privileged-register sequence produces a stable, visible, repeatable output and matching trace.

## GS write-port contract (Ch75)

The GS model has **two architecturally distinct write ports** because real PS2 hardware exposes two unrelated register namespaces. Conflating them was a Ch74 mistake; Ch75 split them.

### `reg_wr_*` — privileged GS/MMIO writes

- Source: CPU MMIO writes to the `0x12000000` privileged-register block, e.g. via `platform_video_stub` or a direct test-harness path.
- Address: `reg_wr_addr[15:0]` is the offset *inside* the privileged block.
- Examples: `BGCOLOR` at offset `0x00E0`, `PMODE` at `0x0000`, `SMODE2` at `0x0020`, etc.
- Currently latched: `BGCOLOR` only.
Other offsets emit `EV_MODE`.

### `gif_reg_*` — GIF A+D register-number writes

- Source: `gif_packed_stub` consuming a PACKED A+D entry when run with `REAL_AD_REG_MAP=1` (the new default-on path for real PS2 packets; parameter still defaults to `0` for back-compat with project-local Ch72/Ch73 PACKED-A+D layout).
- Address: `gif_reg_num[7:0]` is the **GIF A+D register number** straight out of the PACKED entry's `in_data[71:64]`. Source-of-truth is PCSX2 `pcsx2/GS/GSRegs.h`.
- Currently decoded: `PRIM=0x00`, `RGBAQ=0x01`, `XYZF2=0x04`, `XYZ2=0x05`, `FRAME_1=0x4C`, `ZBUF_1=0x4E` (**not `0x4F` — that is `ZBUF_2`**). Each has a dedicated 64-bit latch output. Other reg numbers emit `EV_MODE`.

### Event taxonomy

The two write paths emit different events. Read this carefully — `arg2` semantics differ across emitters.

- `EV_BGCOLOR` — emitted **only** by `gs_stub` on the privileged port when `reg_wr_addr == 0x00E0`. Carries the unpacked R/G/B in `arg0`/`arg1`/`arg2`. The privileged port has no per-register "selector" beyond this dedicated event; everything else on that port goes to `EV_MODE` with `arg0=offset`, `arg1=data`.
- `EV_WRITE` — emitted in two places with different `arg2` semantics:
  - **`gif_packed_stub`** on a PACKED A+D accept (REGS nibble = `0xE`). Carries the raw PACKED address bits in `arg2` (`{48'd0, in_data[79:64]}`). Under `REAL_AD_REG_MAP=1` the low 8 bits are the real GIF reg# (`in_data[71:64]`); under `REAL_AD_REG_MAP=0` the low 16 bits are the project-local privileged-style offset. **Not a stable selector — it is the address half of the wire.**
  - **`gs_stub`** on the `gif_reg_*` port for a tracked GIF reg (PRIM/RGBAQ/XYZF2/XYZ2/FRAME_1/ZBUF_1). Carries a **stable per-register selector** in `arg2`: `1=PRIM, 2=RGBAQ, 3=XYZF2, 4=XYZ2, 5=FRAME_1, 6=ZBUF_1, 7=TEX0_1` (Ch98). `arg0=reg#`, `arg1=data`. Use this selector for trace-side filtering; it does not depend on `REAL_AD_REG_MAP`.
  - **Ch76 caveat**: a tracked vertex commit (XYZ2 or XYZF2) on the `gif_reg_*` port that *closes* a primitive does NOT emit `EV_WRITE` that cycle — `EV_PRIM_DRAW` preempts it (see below). The `xyz2_q` / `xyzf2_q` latch still updates. Trace consumers counting "vertices seen" must sum `EV_WRITE` (selector=3 or 4) + `EV_PRIM_DRAW` to get the true total.
- `EV_PRIM_DRAW` — Ch76 / Ch77. Fired by `gs_stub` once per primitive completion: when an XYZ2 or XYZF2 vertex commit on the `gif_reg_*` port closes a primitive under the current `PRIM[2:0]`. Preempts the `EV_WRITE` that the closing vertex would otherwise have emitted. Args: `arg0=PRIM[2:0]` (prim type), `arg1=primary threshold`, `arg2=cumulative prim_complete_count` (post-increment), `arg3=closing vertex data` (the same 64 bits that latched into `xyz2_q` / `xyzf2_q` on this cycle).
  - **Discrete primitives** (POINT=1, LINE=2, TRIANGLE=3, SPRITE=2; the numbers are the vertex thresholds): one draw per N vertices; the vertex counter resets to 0 after each draw.
  - **Strip / fan primitives** (LINE_STRIP=2, TRI_STRIP=3, TRI_FAN=3): Ch77. Anchor on the first N vertices, then fire one draw per additional vertex commit. The vertex counter saturates at the primary threshold so every subsequent vertex closes another primitive. Ch78 adds **vertex-identity tracking** distinguishing TRI_STRIP rolling triangles `{v_n-2, v_n-1, v_n}` from TRI_FAN pivot triangles `{v_pivot, v_n-1, v_n}` — see the next section.
  - **Reserved** (PRIM=7): no draw, vertex commits do not increment the counter, latches still update.
  - A PRIM write always resets the vertex counter so a fresh primitive type starts cleanly.

### Per-primitive vertex snapshot (Ch78)

Alongside `EV_PRIM_DRAW`, `gs_stub` exposes three 64-bit outputs — `prim_v0_q`, `prim_v1_q`, `prim_v2_q` — that hold the *vertex tuple* of the most recently closed primitive.
Snapshot is registered on the same clock edge as the `ev_valid` pulse and held until the next `prim_complete`, so a TB can sample it at the same time it sees `gs_ev_event == EV_PRIM_DRAW`. The number of valid slots is implicit in `PRIM[2:0]`:

| `PRIM` | type | valid slots | semantics |
|---|---|---|---|
| 0 | POINT | `v0` | the single vertex |
| 1 | LINE | `v0`, `v1` | endpoints |
| 2 | LINE_STRIP | `v0`, `v1` | each segment uses `{v_n-1, v_n}` |
| 3 | TRIANGLE | `v0`, `v1`, `v2` | the three vertices |
| 4 | TRI_STRIP | `v0`, `v1`, `v2` | rolling: `{v_n-2, v_n-1, v_n}` |
| 5 | TRI_FAN | `v0`, `v1`, `v2` | pivot+rolling: `{v_pivot, v_n-1, v_n}` |
| 6 | SPRITE | `v0`, `v1` | top-left + bottom-right |
| 7 | reserved | — | observer never closes |

The TRI_STRIP-vs-TRI_FAN distinction lives entirely in the saturated-extension path: a TRI_STRIP advances `v0` each draw with the rolling window; a TRI_FAN pins `v0` to `v_pivot` (the first vertex committed since the most recent PRIM write). On the *anchor* draw, `v_pivot` and the rolling `v_prev` happen to coincide, so TRI_STRIP and TRI_FAN report the same tuple for their first triangle.

A PRIM write clears the rolling window (`v_curr` / `v_prev` / `v_prev_prev` / `v_pivot` / `pivot_seen`) so a fresh primitive context starts with no residual vertex bleed. Slots not used by the current primitive type read `0`.

The snapshot tracks identity, not geometry — the values written are the raw 64-bit `gif_reg_data` payloads of XYZ2 / XYZF2 commits, with no decoding into screen-space coordinates. Rasterization is still out of scope.

### Per-primitive color snapshot (Ch79 / Ch80)

`prim_color_q[63:0]` is registered on the same edge as `prim_v0_q` / `prim_v1_q` / `prim_v2_q` and carries the value of `rgbaq_q` at the moment the primitive closed.
RGBAQ writes are separate A+D entries from XYZ2 / XYZF2 commits (`gif_packed_stub` serializes A+D to one accept per cycle), so `rgbaq_q` is always settled to its draw-time value when `prim_complete_now` fires.

`prim_color_q` reads `0` if no RGBAQ has been written since reset; `rgbaq_q` itself is **not** cleared on a PRIM write — color carries forward across PRIM context switches, matching real GS behavior — but it does reset to `0` on `rst_n`.

#### Per-vertex Gouraud color (Ch80)

For real game streams that interleave RGBAQ writes with vertex commits to drive Gouraud shading, `gs_stub` exposes three additional outputs:

| Output | Slot semantics |
|---|---|
| `prim_color_v0_q[63:0]` | color of vertex 0 |
| `prim_color_v1_q[63:0]` | color of vertex 1 |
| `prim_color_v2_q[63:0]` | color of vertex 2 |

A parallel rolling color window (`c_curr_q` / `c_prev_q` / `c_prev_prev_q` / `c_pivot_q`, internal) samples `rgbaq_q` on every vertex commit, mirroring the Ch78 vertex-identity window. The snapshot layout matches the vertex layout exactly:

| `PRIM` | type | `_v0_q` color of | `_v1_q` color of | `_v2_q` color of |
|---|---|---|---|---|
| 0 | POINT | the single vertex | 0 | 0 |
| 1 | LINE | first endpoint | closing | 0 |
| 2 | LINE_STRIP | previous vertex | closing | 0 |
| 3 | TRIANGLE | `v_n-2` | `v_n-1` | closing |
| 4 | TRI_STRIP | `v_n-2` (rolls) | `v_n-1` | closing |
| 5 | TRI_FAN, anchor | `v1` (≡ pivot) | `v2` | `v3` |
| 5 | TRI_FAN, saturated | `v_pivot` (PINNED) | `v_n-1` | closing |
| 6 | SPRITE | first endpoint | closing | 0 |

`prim_color_q` is exactly the closing-vertex color (≡ `prim_color_v_close`), kept as a convenience alias for consumers that don't care about Gouraud.

For **flat-shaded** primitives (RGBAQ written once before the strip), all per-vertex color slots used by the primitive equal each other and equal `prim_color_q`.
For **Gouraud-shaded** primitives (RGBAQ rewritten between vertex commits), the slots may differ — capturing the per-vertex color identity needed to distinguish a strip's rolling colors from a fan's pivot color.

The color window is **cleared on PRIM write** (unlike `rgbaq_q` itself, which carries forward). This means per-vertex color identity stays tied to the current primitive context — a stream that switches PRIM types mid-context starts color tracking fresh for the new context. Slots not used by the current primitive type read `0`.

Like the vertex snapshot, this captures identity, not interpolated geometry — the stored values are the raw 64-bit RGBAQ payloads (packing R, G, B, A, and the texture-coord divisor Q together); GS-style Gouraud interpolation across the primitive interior remains out of scope.

### Structured-field decode (Ch81)

`gs_stub` exposes pre-decoded snapshot outputs alongside the raw 64-bit slots so a downstream rasterizer or pixel-emit path doesn't have to re-derive bit fields:

| Output | Type | Carries |
|---|---|---|
| `prim_v0_decoded_q` / `_v1_` / `_v2_` | `trace_pkg::vertex_t` | `x` / `y` / `z` / `fog` / `is_xyzf2` per slot |
| `prim_v0_color_decoded_q` / `_v1_` / `_v2_` | `trace_pkg::color_t` | `r` / `g` / `b` / `a` / `q` per slot |

The decoded outputs latch on the same edge as the raw snapshots, so a TB samples both atomically with `EV_PRIM_DRAW`.

#### vertex_t and the XYZ2 / XYZF2 distinction

```sv
typedef struct packed {
  logic        is_xyzf2;  // 1 = XYZF2 source, 0 = XYZ2
  logic [7:0]  fog;       // valid iff is_xyzf2; else 0
  logic [31:0] z;         // 32-bit (XYZ2) or zero-extended 24-bit (XYZF2)
  logic [15:0] y;         // 12.4 fixed-point screen Y
  logic [15:0] x;         // 12.4 fixed-point screen X
} vertex_t;
```

XYZ2 packs full 32-bit Z in `data[63:32]`. XYZF2 packs 24-bit Z in `data[55:32]` and an 8-bit fog byte in `data[63:56]`.
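The XYZ2 / XYZF2 packing above can be mirrored on the host side when checking snapshot values in a testbench log. The following is a minimal Python sketch, not the `trace_pkg` implementation: the names `Vertex`, `decode_vertex_model`, and `integer_xy` are hypothetical, and the X/Y slices (`data[15:0]` and `data[31:16]`) are assumed from the standard GS register layout, since only the Z and fog slices are spelled out here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Vertex:
    """Host-side stand-in for vertex_t (hypothetical name)."""
    x: int          # raw 12.4 fixed-point screen X
    y: int          # raw 12.4 fixed-point screen Y
    z: int          # 32-bit Z (XYZ2) or zero-extended 24-bit Z (XYZF2)
    fog: int        # 8-bit fog, 0 unless is_xyzf2
    is_xyzf2: bool


def decode_vertex_model(data: int, is_xyzf2: bool) -> Vertex:
    """Slice a raw 64-bit XYZ2 / XYZF2 payload into fields."""
    data &= (1 << 64) - 1
    x = data & 0xFFFF                 # assumed slice: data[15:0]
    y = (data >> 16) & 0xFFFF         # assumed slice: data[31:16]
    if is_xyzf2:
        z = (data >> 32) & 0xFFFFFF   # data[55:32]
        fog = (data >> 56) & 0xFF     # data[63:56]
    else:
        z = (data >> 32) & 0xFFFFFFFF  # data[63:32]
        fog = 0
    return Vertex(x, y, z, fog, is_xyzf2)


def integer_xy(v: Vertex) -> tuple:
    """Drop the 4 fractional bits of the 12.4 coordinates."""
    return v.x >> 4, v.y >> 4
```

The same 64-bit payload decodes differently depending on the format flag, which is why the flag has to travel with the vertex rather than being inferred from the data.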
The `is_xyzf2` flag is registered in a parallel rolling format-flag window (`xyzf2_curr_q` / `xyzf2_prev_q` / `xyzf2_prev_prev_q` / `xyzf2_pivot_q`) that tracks the source format of each vertex through the rolling window — so when an XYZF2 vertex rolls into the `v_prev` slot of a TRI_STRIP saturated extension, its `is_xyzf2` flag rolls with it. Cleared on `rst_n` and on PRIM write, same as the vertex/color windows.

#### color_t

```sv
typedef struct packed {
  logic [31:0] q;  // texture-coord divisor (IEEE float)
  logic [7:0]  a;
  logic [7:0]  b;
  logic [7:0]  g;
  logic [7:0]  r;
} color_t;
```

Direct bit-slice of the RGBAQ payload — no interpretation. Q is carried verbatim as a 32-bit IEEE float (the GS uses it for texture coordinate division during rasterization, which remains out of scope).

#### Decode helper functions

`trace_pkg` exposes `decode_vertex(data, is_xyzf2)` and `decode_color(data)` so downstream code can re-decode raw 64-bit values consistently with the `gs_stub` snapshot.

The decoded outputs are an additive contract — the raw `prim_v*_q` and `prim_color_v*_q` outputs continue to work for consumers that don't care about per-channel decoding.

### Minimal pixel emit (Ch82)

`gs_stub` exposes a per-primitive *pixel emit* — the smallest possible output that ties the recognition layer to a framebuffer destination. One pixel per closed primitive (the closing vertex, in screen-space integer coords), addressed by the latched `frame_1_q` register. No interpolation, no coverage, no rasterization — this is the contact point for a future raster chapter, not a substitute for one.
| Output | Width | Carries |
|---|---|---|
| `pixel_emit` | 1 | 1-cycle strobe; pulses on the same edge as `prim_complete` |
| `pixel_emit_count` | 32 | Running tally of emits since reset |
| `pixel_x_q` / `pixel_y_q` | 12 | Closing vertex integer screen coords (top 12 bits of 12.4 fixed-point) |
| `pixel_color_q` | 64 | RGBAQ at the emit moment (= `prim_color_q`) |
| `pixel_fbp_q` | 9 | `FRAME_1[8:0]` (framebuffer base / 2048) |
| `pixel_fbw_q` | 6 | `FRAME_1[21:16]` (framebuffer width / 64 in pixels) |
| `pixel_psm_q` | 6 | `FRAME_1[29:24]` (pixel storage format) |
| `pixel_fb_addr_q` | 32 | Computed VRAM byte offset (see below) |

#### Address arithmetic

```
fb_addr = FBP * 2048 + (Y * FBW * 64 + X) * bytes_per_pixel
```

Ch83 added PSM-aware `bytes_per_pixel` derived from the latched `FRAME_1[29:24]` (PSM field):

| PSM (hex) | Format | bytes/pixel | Notes |
|---|---|---|---|
| 00, 01 | PSMCT32 / PSMCT24 | 4 | host-word |
| 02, 0A | PSMCT16 / PSMCT16S | 2 | |
| 13 | PSMT8 | 1 | indexed |
| 14 | PSMT4 | 4 here (host-word) | **legacy `pixel_emit` channel only** — see note below |
| 1B, 24, 2C | PSMT8H / PSMT4HL / PSMT4HH | 4 | host-word (high/low nibble of 32-bit slot) |
| 30, 31 | PSMZ32 / PSMZ24 | 4 | depth |
| 32, 3A | PSMZ16 / PSMZ16S | 2 | depth |
| other | — | 4 (host-word fallback) | unrecognized PSM |

This table describes the **legacy `pixel_emit` channel** (the single-pixel-per-primitive debug strobe from Ch82/Ch83). That channel does not commit to `vram_stub`; it only emits a trace event. Its PSMT4 entry stays at host-word fallback — the recognition layer never tracked sub-byte position there.

The **raster channel (`raster_pixel_emit`)** does NOT use this table. It owns its own PSM-aware emit packing in S2 with full PSMT4 support after Ch106:

- Byte address = `pixel_index >> 1` (overrides the `pixel_index << ras_bpp_shift` form).
- The 4-bit index from R[3:0] is placed in the targeted nibble (low/high keyed by `pixel_index[0]`) of `write_data[7:0]`.
- `raster_pixel_be_q = 4'b0001`, `raster_pixel_mask_q = 0x0F` or `0xF0` so `vram_stub`'s per-bit merge updates only that nibble.

PSMT8H / PSMT4HL / PSMT4HH still address the host 32-bit slot, not the high/low byte/nibble within it; the extracted sub-byte is rasterizer/blit-specific and out of scope here. `pixel_psm_q` is still exposed verbatim so consumers can apply their own sub-slot offset arithmetic if needed.

#### Carry-forward semantics

`frame_1_q` is part of the standard GIF-context register file and carries forward across PRIM writes (matching real GS). A stream that sets `FRAME_1` once and then emits multiple primitives correctly addresses all of them. A stream that never writes `FRAME_1` lands every pixel at `fb_addr=0` — observable but not useful, behaves cleanly under reset.

`rgbaq_q` likewise carries forward, so `pixel_color_q` reflects the most recent RGBAQ write at emit time. If a Gouraud-style stream rewrites RGBAQ between vertices, `pixel_color_q` captures the closing-vertex color — same semantic as Ch79's `prim_color_q`.

#### Strobe channel, not trace event

`pixel_emit` is a dedicated 1-cycle strobe alongside the snapshot outputs, not a multiplexed event on the main `ev_valid` trace stream. This avoids contention with `EV_PRIM_DRAW` on the close cycle. A consumer that wants both can sample on `pixel_emit` posedge and read the snapshots atomically.

### Minimal interior rasterizer (Ch84)

`gs_stub` adds a *separate* per-interior-pixel emit channel alongside the per-primitive `pixel_emit` of Ch82. The Ch82 strobe is unchanged (still pulses once per closed primitive); the new channel pulses once per pixel that the rasterizer determines is inside the closed primitive's interior.
| Output | Width | Carries |
|---|---|---|
| `raster_pixel_emit` | 1 | 1-cycle strobe per emitted interior pixel |
| `raster_pixel_emit_count` | 32 | Cumulative interior pixels emitted since reset |
| `raster_pixel_x_q` / `_y_q` | 12 | Integer screen coords of the emitted pixel |
| `raster_pixel_color_q` | 64 | Per-pixel color: Gouraud-interpolated R/G/B/A for TRI/TRI_STRIP/TRI_FAN (Ch86), flat (= `prim_color_q`) for SPRITE. Q passes through from the closing vertex. |
| `raster_pixel_fb_addr_q` | 32 | Computed VRAM byte offset (PSM-aware, same math as Ch82/Ch83) |
| `raster_active` | 1 | High while the FSM is scanning a primitive |
| `raster_overflow` | 1 | Latches if a new primitive closes while the 2-entry raster FIFO is full and no concurrent pop frees a slot (Ch87 + audit-medium fix). See "Raster command queue (Ch87)" below for the back-to-back-close budget. |
| `raster_degenerate` | 1 | Latches if a TRI/STRIP/FAN closes with zero signed area (3 colinear vertices). SCAN is skipped; SPRITE never sets this. |

#### Per-primitive coverage

| `PRIM` | Raster behavior |
|---|---|
| 0 POINT | No raster emit — Ch82 closing-pixel only |
| 1 LINE | No raster emit — Ch82 closing-pixel only |
| 2 LINE_STRIP | No raster emit — Ch82 closing-pixel only |
| 3 TRIANGLE | Bounding-box scan with edge-function half-plane test |
| 4 TRI_STRIP | Same engine as TRIANGLE, fires per closed strip triangle |
| 5 TRI_FAN | Same engine as TRIANGLE, fires per closed fan triangle |
| 6 SPRITE | Bounding-box rectangle fill (every pixel inside) |
| 7 reserved | No raster emit |

#### Triangle edge-function math

For each candidate pixel `p` and each edge `(vA, vB)` of the triangle:

```
e(p) = (p.x - vA.x) * (vB.y - vA.y) - (p.y - vA.y) * (vB.x - vA.x)
```

32-bit signed math is used to avoid overflow at typical coord ranges.

##### Top-left fill rule (Ch85)

Adjacent triangles that share an edge would double-paint pixels on that edge under a naïve same-sign test.
Ch85 applies the standard D3D-style top-left fill rule so each shared-edge pixel is owned by exactly one of the two triangles.

At the IDLE→SCAN transition the FSM:

1. Computes `signed_area = (v1-v0) × (v2-v0)`.
2. If `signed_area == 0` → degenerate (3 colinear vertices); `raster_degenerate` latches and SCAN is skipped (no raster pixels emit). The Ch82 `pixel_emit` and `prim_complete` pulses still fire — only the interior raster is suppressed.
3. If `signed_area < 0` → CW winding; the FSM swaps `v1` and `v2` so the rule applies uniformly to a CCW-ordered triangle.
4. For each edge of the post-swap CCW triangle, classifies it as *top-or-left* (inclusive) or *right/bottom* (exclusive):
   - **Top edge**: horizontal going right (`dy == 0 && dx > 0`).
   - **Left edge**: going down in Y-down screen (`dy > 0`).
   - Anything else is a right or bottom edge.

The inside test in SCAN becomes:

```
inside = (e[i] + bias[i] <= 0) for all i in {0, 1, 2}
```

where `bias[i] = 0` if edge `i` is top-or-left and `bias[i] = 1` otherwise. The `+1` bias converts the strict `< 0` test for right/bottom edges into a non-strict `<= 0` test on the biased value, keeping the math integer and uniform.

Result: for any two adjacent triangles sharing an edge, the edge's pixels are inclusive in exactly one triangle's bias configuration and exclusive in the other's — no double-paint.

Some shared-corner pixels may end up unpainted by either triangle. That's the standard top-left rule trade-off: non-overlap takes priority over coverage of every boundary pixel.

##### Per-pixel Gouraud color (Ch86)

Triangle interior pixels now use **per-pixel Gouraud color interpolation** instead of flat shading. The three per-vertex colors (the same Ch80 `prim_color_v0_q` / `prim_color_v1_q` / `prim_color_v2_q` slot mapping) are latched at SCAN init with the same `v1↔v2` swap mirror as the vertex coords, so the post-swap CCW vertex order matches the latched color order.
For each interior pixel `p`, barycentric weights are derived directly from the unbiased edge functions:

```
L0(p) = -e1(p)   // weight for v0 = signed area of (p, v1, v2)
L1(p) = -e2(p)   // weight for v1
L2(p) = -e0(p)   // weight for v2
// L0 + L1 + L2 == sa for all p inside the triangle
```

For each color channel `ch` ∈ {R, G, B, A}:

```
ch_out(p) = (L0(p)*c0.ch + L1(p)*c1.ch + L2(p)*c2.ch) / sa
```

Q (the texture-coord IEEE float in c2's upper 32 bits) is **not** interpolated — it passes through from the closing vertex's RGBAQ unchanged.

For a flat-shaded primitive (RGBAQ written once before all three vertices, all three vertex colors equal), the normalized weights `Li / sa` sum to 1 (equivalently `L0 + L1 + L2 == sa`), so the formula collapses to `c0` exactly with no rounding error. Existing flat-shaded raster TBs (raster_basic, raster_topleft) continue to pass.

The R/G/B/A division uses **integer truncation toward zero**. Real PS2 GS uses fixed-point with specific rounding rules; the recognition-layer stub is intentionally simpler.

SPRITE keeps flat shading (only 2 vertices, no barycentric weights defined).

#### Sprite rectangle fill

A SPRITE has two vertices forming opposite corners. The bounding box is computed via `min`/`max` of each axis; every pixel inside the box is emitted in row-major order.

#### FSM and scan timing

The FSM is `IDLE` → `SCAN`. On `prim_complete_now` for an eligible primitive, the FSM latches the vertex tuple, color, FRAME_1 fields, and bounding box, then walks the box one pixel per cycle. For each pixel: combinational inside-test → if inside, pulse `raster_pixel_emit` and update the snapshot. Returns to `IDLE` when `(ras_cur_x, ras_cur_y) == (x_max, y_max)`.

Color is **Gouraud-interpolated per pixel** for triangles (Ch86) and **flat** for sprites; see the dedicated subsections above for the fill-rule and Gouraud math.
The closing-primitive flat color (`prim_color_q`) is still used as the SPRITE fill color and as a backward-compat reference for flat-shaded TRIs (when all three vertex colors are equal, the Gouraud formula reduces to that flat value with no rounding error).

Coordinates are **integer** — the 4-bit sub-pixel of 12.4 fixed-point is discarded. Sub-pixel edge adjustment is not modeled (top-left fill rule IS modeled — see Ch85 subsection above).

#### Raster command queue (Ch87) and `raster_overflow`

`gs_stub` has a **2-entry FIFO** in front of the SCAN FSM. Every primitive close that targets the rasterizer (`RM_TRI` / `RM_SPRITE`) snapshots its full per-prim context (vertices, bias, signed area, per-vertex colors, FRAME_1 fields, bounding box) into the queue at the close cycle. The FSM dequeues the oldest entry whenever it's idle or finishing a scan.

Effective concurrency is **1 in-flight + 2 queued = up to 3 back-to-back closes** absorbed without drop.

`raster_overflow` now latches when a 4th close arrives while the FIFO is **full** (1 in-flight, both FIFO slots occupied). The 4th primitive is dropped. Earlier chapters' bound of "1 close mid-scan = overflow" is replaced by Ch87's "3 closes back-to-back = OK; 4th = overflow."

Degenerate triangles are **filtered at enqueue**: they set `raster_degenerate` and are not pushed into the queue. SPRITE never sets `raster_degenerate`. POINT/LINE/LINE_STRIP don't raster (RM_NONE) — they don't enqueue at all and the queue ignores them.

Pop happens at `IDLE`→`SCAN` AND at drain-done (Ch88; see below) when the queue has more work, so back-to-back scans run contiguously without an `IDLE` bubble. `raster_active` stays high across the boundary.

Real PS2 game streams emit thousands of primitives back-to-back; 3-deep concurrency is enough for most TRI_STRIP / TRI_FAN patterns with small bounding boxes. Larger sprites or larger triangles increase scan length and reduce headroom — a future chapter can grow the FIFO depth.
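The back-to-back-close budget can be sanity-checked with a small event-level model. The following is a rough Python sketch under stated assumptions, not a description of the RTL: the class `RasterQueueModel` and its method names are hypothetical, the model advances per event rather than per clock, and it deliberately ignores the same-cycle push/pop case in which a concurrent pop frees a slot.

```python
from collections import deque


class RasterQueueModel:
    """Event-level model: 1 in-flight scan plus a 2-entry FIFO (hypothetical)."""

    FIFO_DEPTH = 2

    def __init__(self):
        self.fifo = deque()
        self.in_flight = None
        self.raster_overflow = False    # sticky, like the latched output
        self.raster_degenerate = False  # sticky, like the latched output

    def close(self, prim, rasterizes=True, degenerate=False):
        """A primitive closes. Returns True if it will eventually be scanned."""
        if not rasterizes:              # POINT / LINE / LINE_STRIP (RM_NONE)
            return False
        if degenerate:                  # filtered at enqueue, never queued
            self.raster_degenerate = True
            return False
        if self.in_flight is None and not self.fifo:
            self.in_flight = prim       # idle FSM picks it up right away
            return True
        if len(self.fifo) == self.FIFO_DEPTH:
            self.raster_overflow = True  # both slots occupied: drop
            return False
        self.fifo.append(prim)
        return True

    def scan_done(self):
        """The current scan drains; pop the oldest queued entry, if any."""
        self.in_flight = self.fifo.popleft() if self.fifo else None
        return self.in_flight
```

Feeding four closes with no intervening `scan_done()` accepts the first three and drops the fourth, which is the "3 closes back-to-back = OK; 4th = overflow" bound stated above.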
#### Pixel pipeline (Ch88)

The SCAN body is **3 stages, throughput 1 candidate pixel per cycle**:

| Stage | Source | Work |
|-------|--------|------|
| **S0** | `ras_cur_x` / `ras_cur_y` (bbox walker) | Generate the next candidate coord; advance the bbox walker; on bbox corner, fire `ras_at_end_of_s0` and transition R_SCAN→R_DRAIN. |
| **S1** | `s1_x_q` / `s1_y_q` (registered) | Combinational edge functions on `(s1_x, s1_y)` against the three triangle edges (or trivial-true for SPRITE), top-left bias, inside test → `s1_pixel_inside`. Latched into `s2_inside_q`. |
| **S2** | `s2_x_q` / `s2_y_q` / `s2_L0..L2_q` / `s2_inside_q` | Compute Gouraud `interp_byte(λ_i, c_i)` × 4 channels and `s2_fb_addr` from PSM/FBP/FBW. If `s2_valid_q && s2_inside_q`, drive `raster_pixel_emit` with the resolved fb_addr / x / y / color. |

`raster_state` is now a 3-state FSM:

- **R_IDLE** — no work; `pop_ok` fires on a non-empty FIFO.
- **R_SCAN** — S0 produces one valid coord per cycle; S1/S2 latches propagate. On bbox corner, transitions to R_DRAIN.
- **R_DRAIN** — S0 stops producing valids (`s1_valid_q <= 0`); S1 and S2 finish their in-flight pixels. When both pipeline valids are low (`drain_done`), the FSM either pops the next primitive (back-to-back contiguous SCAN) or returns to R_IDLE.

`pop_ok = !fifo_empty && (R_IDLE || drain_done)` — the end-of-scan pop is now drain-done, three cycles after S0 produces the corner. This guarantees the pipeline-tail pixels of the previous primitive are not overwritten by the next primitive's pop, while still keeping `raster_active` high across the seam.

Latency from `pop_ok` to first registered `raster_pixel_emit` is **3 stages of pipeline + 1 cycle of FIFO turnaround + 1 cycle of registered emit output = 5 posedges from the close cycle of the closing vertex** (see `sim/tb/gif_gs/tb_gs_raster_pipeline.sv` for the cycle-exact contract).
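What S1 and S2 compute per candidate pixel can be cross-checked against a software reference. The following Python sketch is an illustration under assumptions, not the pipeline itself: the function names are hypothetical, the edge numbering (`e0` = `v0→v1`, `e1` = `v1→v2`, `e2` = `v2→v0`) is chosen so the weights match the `L0`/`L1`/`L2` definitions above, `dx`/`dy` are taken as `vB - vA`, and only R/G/B/A are handled (Q pass-through is omitted).

```python
def edge(p, a, b):
    """Edge function e(p) for the directed edge a -> b."""
    return (p[0] - a[0]) * (b[1] - a[1]) - (p[1] - a[1]) * (b[0] - a[0])


def fill_bias(a, b):
    """0 for a top-or-left edge (inclusive), 1 otherwise (exclusive)."""
    dx, dy = b[0] - a[0], b[1] - a[1]   # assumption: dx, dy = vB - vA
    top_or_left = (dy == 0 and dx > 0) or dy > 0
    return 0 if top_or_left else 1


def rasterize_triangle(verts, colors):
    """Return [((x, y), (r, g, b, a)), ...] for integer vertices and colors."""
    v0, v1, v2 = verts
    c0, c1, c2 = colors
    sa = (v1[0] - v0[0]) * (v2[1] - v0[1]) - (v1[1] - v0[1]) * (v2[0] - v0[0])
    if sa == 0:
        return []                        # degenerate: SCAN skipped
    if sa < 0:                           # swap v1/v2 and their colors
        v1, v2, c1, c2, sa = v2, v1, c2, c1, -sa
    edges = ((v0, v1), (v1, v2), (v2, v0))
    biases = [fill_bias(a, b) for a, b in edges]
    xs = [v[0] for v in (v0, v1, v2)]
    ys = [v[1] for v in (v0, v1, v2)]
    out = []
    for y in range(min(ys), max(ys) + 1):        # bounding-box walk, row-major
        for x in range(min(xs), max(xs) + 1):
            e = [edge((x, y), a, b) for a, b in edges]
            if all(ei + bi <= 0 for ei, bi in zip(e, biases)):
                l0, l1, l2 = -e[1], -e[2], -e[0]
                # Operands are non-negative here, so // equals truncation.
                rgba = tuple((l0 * a + l1 * b + l2 * c) // sa
                             for a, b, c in zip(c0, c1, c2))
                out.append(((x, y), rgba))
    return out
```

Splitting a square into two triangles that share a diagonal and rasterizing both should yield disjoint pixel sets, which is a convenient property to assert in a testbench-side checker.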
- `EV_MODE` — fired for any accept that did not resolve to a tracked register: REGLIST entries, IMAGE/DISABLE payload qwords, NOP-nibble PACKED slots, unknown privileged offsets, unknown GIF reg numbers. Reserved for "we know we saw something, we are intentionally not modeling it yet."
- `EV_GIFTAG` — one per accepted GIFtag; carries `flg`/`nreg`/`nloop`/`eop` for stream-level checking.

When trace event semantics change, audit this section and the per-stub trace-schema header comments together.

#### VRAM persistence (Ch89)

`vram_stub` (`rtl/gif_gs/vram_stub.sv`) is the **first persistence layer** the rasterizer has had. Every `raster_pixel_emit` pulse writes 4 bytes of pixel data at `raster_pixel_fb_addr_q` into `vram_stub`'s linear byte array. A combinational debug read port exposes `read_data` byte-addressably so testbenches can verify storage.

Wiring:

| vram_stub port | gs_stub source |
|---|---|
| `write_en` | `raster_pixel_emit` |
| `write_addr` | `raster_pixel_fb_addr_q` |
| `write_data` | `raster_pixel_color_q[31:0]` (the lower 32 bits — Q in the upper 32 is not framebuffer data) |
| `write_be` | `raster_pixel_be_q` (Ch95) — per-byte write enable: byte i (the byte at `write_addr + i`) is committed only when `write_en && write_be[i]`. Lets the same 32-bit write port serve PSMs of any byte width. |
| `write_mask` | `raster_pixel_mask_q` (Ch106) — per-bit merge mask: for each enabled byte, `mem[i] <= (mem[i] & ~mask_i) \| (data_i & mask_i)`. Tied to `0xFFFFFFFF` for PSMs ≥ 1 byte/pixel (no behavior change). PSMT4 drives `0x0000_000F` or `0x0000_00F0` to preserve the un-targeted nibble in the same byte. |

Scope (current write-side support, after Ch105):

- **PSMCT32 + PSMCT16 + PSMT8** at the raster write port.
  The PSM width is selected by `gs_stub`'s `bpp_shift` mux off `FRAME_1.PSM` and surfaced as `raster_pixel_psm_q`; `gs_stub`'s S2 packs the pixel into the right byte lane and drives `raster_pixel_be_q` so `vram_stub` commits exactly the right bytes:
  - PSMCT32 (PSM=0x00) → 4 bytes/pixel, `be = 4'b1111`, ABGR in `write_data[31:0]`.
  - PSMCT16 (PSM=0x02) → 2 bytes/pixel, `be = 4'b0011`, RGB5A1 packed in `write_data[15:0]` (Ch95). `write_addr` is the halfword byte address — per-byte `be` makes unaligned halfword writes safe.
  - PSMT8 (PSM=0x13) → 1 byte/pixel, `be = 4'b0001`, the natural ABGR's R channel goes into `write_data[7:0]` as the PSMT8 index (Ch105). `write_addr` is the exact byte address; `vram_stub` commits `mem[write_addr] ← write_data[7:0]` at any byte alignment without needing data-lane shifting.
  - PSMT4 (PSM=0x14) → 0.5 bytes/pixel (2 pixels per byte), `be = 4'b0001`, `write_mask = 0x0000_000F` (low nibble) or `0x0000_00F0` (high nibble) per `pixel_index[0]`. The 4-bit index (low nibble of natural ABGR's R) is placed in the targeted nibble position in `write_data[7:0]`. `vram_stub` merges only the masked bits — the OTHER nibble of the same byte is preserved (Ch106). Back-to-back same-byte emits (e.g. PSMT4 pixels x=0 and x=1, both landing in byte 0) chain through NBA semantics: the second NBA samples `mem[addr]` AFTER the prior commit, so both nibbles end up in the byte without a bypass-forwarding net.
  - PSMCT24 / PSMCT16S / PSMZ32 / PSMZ24 / PSMZ16 / PSMZ16S / PSMT8H / PSMT4HL / PSMT4HH — `bpp_shift` falls through to a host-word default (4 bytes); raster emit through these PSMs is not contract-tested.
- **Write-side addressing**. Real PS2 VRAM is 4 MiB organized into pages × blocks × columns per PSM. By DEFAULT, both `gs_stub` raster emit and `gif_image_xfer_stub` TRXDIR uploads produce the linear-framebuffer layout PCSX2 calls "linear PSM".
  Optional per-PSM swizzle paths are gated by parameters on each module:
  * **PSMCT32**: `PSMCT32_SWIZZLE` parameter on `gs_pcrtc_stub` (Ch120 read-side), `gif_image_xfer_stub` (Ch121 image-xfer write-side), and `gs_stub` (Ch122 raster write-side).
  * **PSMCT16**: `PSMCT16_SWIZZLE` parameter on `gs_pcrtc_stub` (Ch126 read-side), `gif_image_xfer_stub` (Ch127 image-xfer write-side), and `gs_stub` (Ch128 raster write-side). All three integration points are live, mirroring the PSMCT32 trio.

  When on, byte addresses route through the per-PSM swizzle module (`gs_swizzle_psmct32_stub` / `gs_swizzle_psmct16_stub`); image-xfer adds `dest_base_q = DBP*256` on top of the swizzle output so any DBP works, while raster emit feeds the active `ras_fbp` directly so the swizzle output is already the absolute address. Per-PSM parameters are independent — enabling one doesn't affect the other PSM.

  **PSMT8** has its full three-path swizzle integration as of Ch134, mirroring the PSMCT32/PSMCT16 trios: standalone math primitive `gs_swizzle_psmt8_stub` (Ch131) wired into `gs_pcrtc_stub` (Ch132 read-side, `PSMT8_SWIZZLE`), `gif_image_xfer_stub` (Ch133 write-side), and `gs_stub` (Ch134 raster emit) — same parameter name on all three modules.

  **PSMT4** has its full three-path swizzle integration as of Ch140, mirroring the PSMCT32/PSMCT16/PSMT8 trios: standalone math primitive `gs_swizzle_psmt4_stub` (Ch137) wired into `gs_pcrtc_stub` (Ch138 read-side, `PSMT4_SWIZZLE`), `gif_image_xfer_stub` (Ch139 write-side), and `gs_stub` (Ch140 raster emit) — same parameter name on all three modules. The PSMT4 paths additionally thread the swizzle module's `nibble_hi` output through the existing Ch106 (raster) / Ch118 (image-xfer) nibble RMW machinery (replacing `s2_pixel_index[0]` / `x_eff[0]` as the high/low nibble selector when the gate is on).

  All parameter defaults are 0, so existing TBs see the legacy linear behavior.
  **All four common GS PSMs (CT32 + CT16 + T8 + T4) now have COMPLETE three-path swizzle integration foundation.**
- **Stub-sized**. Default `BYTES = 65536`. Real VRAM is 4 MiB; for TB purposes a small linear region is enough to verify that emitted pixels actually land at the addresses `gs_stub` computes.
- **Scanout path** is provided by `gs_pcrtc_stub` (Ch90 — see below). The legacy `platform_video_stub` flood-fills BGCOLOR and is unaware of VRAM; TBs that want to verify the round trip use `gs_pcrtc_stub` instead.

The Ch89 white-box TB `tb_gs_vram_writeback.sv` exercises the contract end-to-end: drive a 4×4 SPRITE through `gs_stub`, capture the (fb_addr, color) of each `raster_pixel_emit` pulse, then read each fb_addr back from `vram_stub` and assert byte-exact match.

#### PCRTC scanout (Ch90)

`gs_pcrtc_stub` (`rtl/gif_gs/gs_pcrtc_stub.sv`) is the **scanout side** of the GS pipeline — its dual is `gs_stub` (the write side). It models a minimal PCRTC (Programmable CRT Controller): runs its own raster timing, generates a VRAM read address from the current `(hcnt, vcnt)` using the same fb_addr math as `gs_stub`, reads the byte returned by `vram_stub`'s combinational debug port, and drives `r`/`g`/`b` for the active area.
Together with Ch88's pipeline + Ch89's VRAM, this closes the loop:

```
raster_pixel_emit → vram_stub.write → vram_stub.read → pcrtc.r/g/b
```

Configuration (Ch91 — privileged-block CPU MMIO): `gs_pcrtc_stub` consumes three real PS2 GS privileged display register latches directly from `gs_stub`:

| pcrtc input | gs_stub source | Layout |
|---|---|---|
| `pmode_q[63:0]` | privileged write at offset 0x0000 | bit 0 = EN1 (display 1 enable) |
| `dispfb1_q[63:0]` | privileged write at offset 0x0070 | FBP[8:0], FBW[14:9], PSM[19:15], DBX[42:32] (Ch91-audit), DBY[53:43] (Ch91-audit) |
| `display1_q[63:0]` (Ch92, Ch93) | privileged write at offset 0x0080 | DX[11:0], DY[22:12], MAGH[26:23] (Ch93 — H scale = MAGH+1), MAGV[28:27] (Ch93 — V scale = MAGV+1), DW[43:32] (width-1), DH[54:44] (height-1) |

The Ch90 sideband ports (`scanout_enable` / `dispfb_fbp` / `dispfb_fbw`) are **gone**. TBs program scanout the way a real PS2 driver would: write DISPFB1, then DISPLAY1, then PMODE.EN1=1 (Ch92). Out of reset, all three registers are 0, so EN1 is low and pcrtc outputs 0.

`scanout_enable` inside pcrtc is derived combinationally from the latches: `scanout_enable = pmode_q[0] & (PSM ∈ {0, 2, 0x13, 0x14})`. PSMCT32 (=0), PSMCT16 (=2), PSMT8 (=0x13), and PSMT4 (=0x14) are honored at this scope; any other PSM forces scanout off rather than mis-decoding the byte layout.

DISPLAY1 (Ch92, Ch93) supplies the **display window** — the sub-rect inside the active area where pcrtc actually pulls pixels from VRAM — and the **per-axis magnification**: each VRAM column is shown for (MAGH+1) consecutive VCK pulses, each VRAM line for (MAGV+1) raster lines. Outside the window pcrtc drives r/g/b = 0 even with EN1=1. Pcrtc's H_TOTAL/V_TOTAL still come from module parameters at instantiation; only the active-area sub-rect gated by DX/DY/DW/DH is register-driven. Dual-display (PMODE.EN2 + DISPFB2 + DISPLAY2) is deferred.
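The register layouts and the window/magnification rule above can be cross-checked with a small TB-side reference model. The following is a minimal Python sketch, not the RTL: every helper name (`bits`, `decode_dispfb1`, `decode_display1`, `scanout_enable`, `window_to_vram`) is hypothetical, and the inclusive window comparison assumes DW/DH hold width-1/height-1 as the table states.

```python
# Hypothetical reference-model helpers; names are illustrative, not taken from the RTL.
SCANOUT_PSMS = {0x00, 0x02, 0x13, 0x14}  # PSMCT32, PSMCT16, PSMT8, PSMT4


def bits(value, hi, lo):
    """Extract value[hi:lo] (inclusive, Verilog-style) as an integer."""
    return (value >> lo) & ((1 << (hi - lo + 1)) - 1)


def decode_dispfb1(v):
    """Split a 64-bit DISPFB1 latch into its fields."""
    return {"fbp": bits(v, 8, 0), "fbw": bits(v, 14, 9), "psm": bits(v, 19, 15),
            "dbx": bits(v, 42, 32), "dby": bits(v, 53, 43)}


def decode_display1(v):
    """Split a 64-bit DISPLAY1 latch into its fields."""
    return {"dx": bits(v, 11, 0), "dy": bits(v, 22, 12),
            "magh": bits(v, 26, 23), "magv": bits(v, 28, 27),
            "dw": bits(v, 43, 32), "dh": bits(v, 54, 44)}


def scanout_enable(pmode, dispfb1):
    """EN1 set AND the PSM is one of the four formats honored at this scope."""
    return bool(pmode & 1) and decode_dispfb1(dispfb1)["psm"] in SCANOUT_PSMS


def window_to_vram(hcnt, vcnt, dispfb1, display1):
    """Map a raster position to VRAM (x, y); None when outside the display window."""
    fb, d = decode_dispfb1(dispfb1), decode_display1(display1)
    hrel, vrel = hcnt - d["dx"], vcnt - d["dy"]
    if not (0 <= hrel <= d["dw"] and 0 <= vrel <= d["dh"]):
        return None
    # Each VRAM column/line is repeated (MAG+1) times; DBX/DBY shift the origin.
    return (hrel // (d["magh"] + 1) + fb["dbx"],
            vrel // (d["magv"] + 1) + fb["dby"])
```

For example, the DISPFB1 immediate `0x000A0400` decodes to PSM=0x14 (PSMT4) with FBW=2 and FBP=0, so `scanout_enable(1, 0x000A0400)` is true while the same value with EN1 clear is not.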
Address math + display-window gating + magnification:

```
hmag_factor    = MAGH + 1                 // 1..16
vmag_factor    = MAGV + 1                 // 1..4
hwin_rel       = hcnt - DX                // pixel offset inside the window
vwin_rel       = vcnt - DY
in_window      = (hcnt >= DX) && (hwin_rel <= DW) && (vcnt >= DY) && (vwin_rel <= DH)
fbp_bytes      = dispfb_fbp << 11         // FBP × 2048
pixels_per_row = dispfb_fbw << 6          // FBW × 64
vram_x_unshift = hwin_rel / hmag_factor   // 4 displayed pixels = 1 VRAM column at MAGH=3
vram_y_unshift = vwin_rel / vmag_factor
effective_x    = vram_x_unshift + DBX
effective_y    = vram_y_unshift + DBY
pixel_index    = effective_y × pixels_per_row + effective_x
bpp_shift      = (PSM == PSMCT32) ? 2 : (PSM == PSMCT16) ? 1 : (PSM == PSMT8) ? 0 : 2
fb_addr        = fbp_bytes + (pixel_index << bpp_shift)
r/g/b drive    = (de && scanout_enable && in_window) ? decode(VRAM, PSM) : 0
```

Per-PSM color decode at `vram_read_data`:

- **PSMCT32**: `r = data[7:0]`, `g = data[15:8]`, `b = data[23:16]`. Alpha at `[31:24]` is dropped (no DAC channel).
- **PSMCT16** (Ch94): RGB5A1 packed into the lower 16 bits as `{A[15], B[14:10], G[9:5], R[4:0]}`. 5→8 expansion uses bit-replicate `r8 = {r5, r5[4:2]}` (so 5'h1F → 8'hFF, 5'h00 → 8'h00). Alpha bit dropped at the DAC.
- **PSMT8** (Ch96/Ch97): index in `data[7:0]`. With `clut_enable=1` (Ch97), pcrtc presents `clut_read_idx = idx + (CSA << 4)` to the external `clut_stub` and decodes the returned PSMCT32 entry as `r = clut_data[7:0]`, `g = clut_data[15:8]`, `b = clut_data[23:16]`. With `clut_enable=0` (Ch96 fallback), pcrtc surfaces the index as grayscale so the 8-bit storage lane is visually verifiable without programming a CLUT.
- **PSMT4** (Ch103): 2 pixels per byte. `byte_offset = pixel_index >> 1` (overrides the standard `pixel_index << bpp_shift` math). `nibble = pixel_index[0] ? data[7:4] : data[3:0]` picks the 4-bit pixel; the zero-extended 8-bit value `{4'd0, nibble}` plus `(CSA << 4)` is presented on `clut_read_idx`.
With `clut_enable=1`, pcrtc decodes the returned PSMCT32 entry the same way as PSMT8. With `clut_enable=0`, the fallback replicates the nibble across the 8-bit DAC value (`r = g = b = {nibble, nibble}`) so 4'hF → 0xFF and 4'h5 → 0x55. CSA is the natural per-palette-window selector for PSMT4 — multiple 16-entry palettes can share the 256-entry staging area, indexed by CSA.

**Ch95 — gs_stub raster channel emits PSMCT16**. The S2 stage of the pipeline now packs ABGR → RGB5A1 (`r5=R[7:3]`, `g5=G[7:3]`, `b5=B[7:3]`, `a1=A[7]`) when `ras_bpp_shift==1` (PSMCT16 / PSMCT16S / PSMZ16 / PSMZ16S — any 16-bit PSM). The packed 16-bit pixel goes in the LOW halfword of `raster_pixel_color_q[31:0]`, and a new `raster_pixel_be_q[3:0]` selects which bytes vram_stub commits: `4'b0011` for PSMCT16, `4'b1111` for PSMCT32. vram_stub gates each byte write on `write_be[i]`, so back-to-back PSMCT16 emits write 2 bytes each without halfword stomping. New `raster_pixel_psm_q[5:0]` exposes the current PSM for trace.

The Ch95 TB `tb_gs_raster_psmct16.sv` exercises the round trip: gs_stub renders a 4×4 SPRITE with FRAME_1.PSM=PSMCT16, then VRAM read-back verifies each pixel landed at the right halfword AND that the halfword right after the sprite stays zero (no leak).

Ch105 extends the raster channel to PSMT8 (FRAME_1.PSM=0x13). When `ras_bpp_shift==0`, S2 takes the natural ABGR's R channel (low 8 bits) as the PSMT8 index — the same lane real PS2 hardware writes when the destination FB is PSMT8 — places it in the LOW byte of the emit lane, and sets `raster_pixel_be_q = 4'b0001` so vram_stub commits exactly the 1 byte at fb_addr. The 1-byte commit works at any byte alignment because vram_stub gates each byte lane independently.
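The emit packing and scanout decode rules above are easy to get subtly wrong, so a software cross-check can help when writing TB expectations. The sketch below is a minimal Python model under stated assumptions, not the RTL: all function names are hypothetical, ABGR is taken as `{A, B, G, R}` with R in bits 7:0, and byte lane `i` is assumed to map to `mem[addr + i]`.

```python
# Hypothetical software model of the S2 emit packing and the scanout decode.
PSMCT32, PSMCT16, PSMT8 = 0x00, 0x02, 0x13


def pack_emit(abgr, psm):
    """Return (write_data, byte_enable) for one raster emit."""
    r = abgr & 0xFF
    g = (abgr >> 8) & 0xFF
    b = (abgr >> 16) & 0xFF
    a = (abgr >> 24) & 0xFF
    if psm == PSMCT32:
        return abgr & 0xFFFFFFFF, 0b1111      # full 32-bit word
    if psm == PSMCT16:
        # RGB5A1: {A[15], B[14:10], G[9:5], R[4:0]} from the top bits of each channel.
        packed = ((a >> 7) << 15) | ((b >> 3) << 10) | ((g >> 3) << 5) | (r >> 3)
        return packed, 0b0011                 # low halfword only
    if psm == PSMT8:
        return r, 0b0001                      # R channel is the 8-bit index
    raise ValueError("PSM not modelled in this sketch")


def expand5(c5):
    """5-bit to 8-bit bit-replicate expansion: {c5, c5[4:2]}."""
    return ((c5 << 3) | (c5 >> 2)) & 0xFF


def decode_psmct16(half):
    """Return (r8, g8, b8) for an RGB5A1 halfword; the alpha bit is dropped."""
    return (expand5(half & 0x1F),
            expand5((half >> 5) & 0x1F),
            expand5((half >> 10) & 0x1F))


def decode_psmt4_fallback(byte, pixel_index):
    """clut_enable=0 fallback: pick the nibble, then replicate it to 8 bits."""
    nibble = (byte >> 4) & 0xF if pixel_index & 1 else byte & 0xF
    return (nibble << 4) | nibble


def vram_write(mem, addr, data, be):
    """Commit only the enabled byte lanes (lane i -> mem[addr + i], assumed)."""
    for lane in range(4):
        if (be >> lane) & 1:
            mem[addr + lane] = (data >> (8 * lane)) & 0xFF
```

With the RGBAQ values the Ch105 TB uses (R=0x55, G=0xAA, B=0xBB, A=0xCC, i.e. ABGR `0xCCBBAA55`), this model gives index `0x55` with `be=4'b0001` at PSMT8, which is the byte that TB expects to read back.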
The Ch105 TB `tb_gs_raster_psmt8.sv` renders a 5×3 SPRITE (chosen so the row spans byte lanes 1, 2, 3, 0, 1 — exercising every lane alignment) at FRAME_1.PSM=PSMT8 with RGBAQ R=0x55, G=0xAA, B=0xBB, A=0xCC. It asserts that each sprite byte reads back as 0x55, that the bytes immediately left and right of the sprite stay 0x00 (so a 32-bit-aligned overwrite would be visible), and that a full-VRAM sweep finds NO byte equal to 0xAA / 0xBB / 0xCC (channel-isolation: only R reaches VRAM at PSMT8).

Ch106 closes the indexed-write gap with PSMT4 (FRAME_1.PSM=0x14) as a per-bit RMW into `vram_stub`. Three changes form the mechanism:

1. `vram_stub` gains a new `write_mask[31:0]` input (Ch106). The commit is now `mem[i] <= (mem[i] & ~mask_i) | (data_i & mask_i)` for each enabled byte. PSMCT32/16/PSMT8 tie mask=`0xFFFF_FFFF` (no behavior change — full byte writes).
2. `gs_stub`'s S2 PSM-aware emit packing gets a PSMT4 branch: the byte address is `pixel_index >> 1` (overrides the `pixel_index << ras_bpp_shift` form), the index is the low 4 bits of the natural ABGR's R channel, and the emit places that 4-bit value in either the low (`{4'd0, idx}`) or high (`{idx, 4'd0}`) nibble of `write_data[7:0]` based on `pixel_index[0]`. `s2_emit_be = 4'b0001`, `s2_emit_mask = pixel_index[0] ? 0x0000_00F0 : 0x0000_000F`.
3. New `raster_pixel_mask_q[31:0]` output on `gs_stub` carries the mask through to `vram_stub.write_mask`.

The Ch106 TB `tb_gs_raster_psmt4.sv` is intentionally adversarial about preservation. VRAM is preloaded with `0xA5` (high=A, low=5) at every byte the sprites will touch. Three phases:

- **Phase A**: 4×2 SPRITE at (0,0)..(3,1), R=0x05 → idx=5. Both nibbles of each enclosing byte are written (8 emits across 4 bytes); each byte ends at `0x55` and the four neighbouring preloaded bytes (2..3, 34..35) remain `0xA5`. This proves the back-to-back same-byte case (NBA chaining) and the neighbour-byte preservation in one go.
- **Phase B**: single-pixel SPRITE at (5, 2).
x=5 odd → high nibble; pixel_index = 133, byte_addr = 66; idx=7. Preload `mem[66] = 0xA5`. Expected after raster: `mem[66] = 0x75` — high nibble updated from A to 7, low nibble stays 5. Proves isolated high-nibble RMW preserves the low nibble.
- **Phase C**: single-pixel SPRITE at (4, 3). x=4 even → low nibble; pixel_index = 196, byte_addr = 98; idx=9. Preload `mem[98] = 0xA5`. Expected after: `mem[98] = 0xA9` — low nibble updated from 5 to 9, high nibble stays A. Proves isolated low-nibble RMW preserves the high nibble.

Continuous observer asserts `psm_q == 6'h14`, `be_q == 4'b0001`, and `mask_q ∈ {0x0F, 0xF0}` on every emit. Final aggregate checks: 10 emits total, full-VRAM sweep finds NO byte equal to 0xAA / 0xBB / 0xCC (only R reaches the framebuffer at PSMT4).

DBX / DBY shift the VRAM origin: the pixel that appears at displayed (DX, DY) corresponds to VRAM (DBX, DBY). Real PS2 drivers use this for double-buffered framebuffers (alternate frames at different DBX/DBY) and offset display windows.

The following TBs lock these contracts:

- `tb_gs_scanout_basic.sv` — DBX=DBY=0, DISPLAY1 covers full active area, MAGH=MAGV=0 (1×): classic sprite-at-origin scanout.
- `tb_gs_scanout_dbx_dby.sv` — sprite at VRAM (4,2)..(7,5), DISPFB1.DBX=4/DBY=2, DISPLAY1 full active area, MAGH=MAGV=0: sprite shows at displayed (0..3, 0..3).
- `tb_gs_scanout_display_window.sv` — sprite at VRAM (0..3, 0..3), DBX=DBY=0, DISPLAY1 with DX=2/DY=1/DW=3/DH=3, MAGH=MAGV=0: sprite shows at displayed (2..5, 1..4); pixels outside the window are black even though pcrtc's raster passes through them.
- `tb_gs_scanout_magh_magv.sv` (Ch93) — sprite at VRAM (0..3, 0..3), DBX=DBY=0, DISPLAY1 with DX=4/DY=2/DW=7/DH=7, MAGH=1/MAGV=1 (2×/2×): 4×4 VRAM sprite stretches to fill the 8×8 displayed window pixel-perfect; pixels outside the window are black.
- `tb_gs_scanout_psm16.sv` (Ch94) — 4×4 RGB5A1 sprite written directly to vram_stub at PSMCT16 byte stride, DISPFB1.PSM=0x02: 5→8 bit-replicate decode produces the right (R8, G8, B8) at scanout. (No gs_stub instantiated; this TB exercises the PSM decode path in isolation.) - `tb_gs_scanout_psmt8.sv` (Ch96) — 4×4 PSMT8 sprite of indices 0x10..0x1F written directly to vram_stub at 1 byte/pixel stride. DISPFB1.PSM=0x13, DISPLAY1 with DX=4/DY=2/DW=7/DH=3 AND MAGH=1 (2× horizontal). Asserts each scan-out displayed pixel reads back as grayscale R=G=B=expected index, proving byte stride + display window + horizontal magnification all work at 1 byte/pixel. - `tb_gs_scanout_psmt8_clut.sv` (Ch97) — same 4×4 PSMT8 sprite, plus a programmed CLUT where `CLUT[i] = ABGR(0xFF, i+0x80, i+0x40, i)`. With `clut_enable=1` and `clut_csa=0`, asserts each scan-out pixel reads back as the CLUT entry for its index — PSMT8 storage + CLUT lookup compose correctly into real RGB. Three phases: full-frame CSA=0, single-pixel CSA=1 (idx 0x00 → CLUT[0x10]), and CSA=1 wrap (idx 0xF8 → CLUT[0x08]). - `tb_gs_tex0_clut.sv` (Ch98) — drives gs_stub's GIF reg# 0x06 (TEX0_1) and asserts the latch + sub-field decoders match the encoded payload (CBP/CPSM/CSM/CSA/CLD bit ranges). Phase 2 wires `pcrtc.clut_csa` from `gs_stub.tex0_1_csa_q` (instead of TB-side sideband) and verifies the CSA value flows from a GIF register write into the CLUT lookup math at scan-out. - `tb_gs_clut_load.sv` (Ch99) — full TEX0.CLD-driven VRAM→CLUT load round trip. Stages 256 PSMCT32 entries in VRAM at `CBP*256` (using the new `vram_stub` second read port), drives TEX0_1 with `CBP=4, CPSM=PSMCT32, CSM=CSM2, CLD=1`, waits for `clut_loader_stub.load_busy` to fall, then runs PSMT8 scanout and asserts each in-sprite pixel reads back as the CLUT entry the loader copied — no TB-direct CLUT writes needed. 
Also carries a Ch99-audit negative phase: a TEX0 write with CSM=0 (CSM1 swizzle, deferred) silently no-ops instead of laying down wrong linear bytes. - `tb_gs_clut_load_ct16.sv` (Ch100) — CPSM=PSMCT16 variant of the Ch99 load TB. Stages 256 RGB5A1 entries (2 bytes each) in VRAM at `CBP*256`, drives TEX0_1 with `CPSM=2`. The loader now walks at 2-byte stride, unpacks RGB5A1 → PSMCT32 ABGR via 5→8 bit-replicate, and writes to clut_stub. PSMT8 scanout produces the expanded RGB. Ch100-audit alpha coverage: per-entry `a1 = idx[0]` varies the alpha bit so both `{8{0}} = 0x00` and `{8{1}} = 0xFF` are exercised; a TB-side `clut_we` snoop captures every loader write so alpha can be asserted directly without going through the RGB-only scanout path. - `tb_gs_clut_load_cld_modes.sv` (Ch101 + Ch102) — conditional CLD-mode policy. Phases walk through CLD ∈ {0, 1, 2, 3, 4, 5, 6, 7} with varying CBP/CPSM/CSA, counting `loader_busy` rising edges to prove: CLD=0 never loads; CLD=1 always (full); CLD=2 loads only when CBP changed; CLD=3 loads when CBP/CPSM/CSA any-changed (CBP, CSA, and CPSM arms each isolated); CLD=4 always loads but only the 16-entry CSA window (Ch102 — write range correctness is locked by `tb_gs_clut_load_csa_window`); CLD ∈ {5, 6, 7} reserved no-ops. - `tb_gs_clut_load_csa_window.sv` (Ch102) — CLD=4 write-range correctness. Phase 1 stages 256 distinct PSMCT32 entries in VRAM and runs CLD=1 to fill all 256 CLUT slots with pattern_a. Phase 2 stages 16 different entries at a new CBP, drives CLD=4 with CSA=2 (window = idx 32..47), and asserts via a `clut_we` snoop that exactly 16 writes occurred AND the captured array contains: pattern_a(i) at i ∈ [0, 32) ∪ [48, 256), pattern_b(i-32) at i ∈ [32, 48). Proves 240 entries are preserved across the partial load. Audit-low extensions: Phase 3 covers the high-CSA wrap (CSA=16 → window-base wraps mod-256 to 0); Phase 4 covers CT16 partial (CPSM=PSMCT16, 2-byte stride, RGB5A1 unpack at the loader, window at idx 160..175). 
- `tb_gs_scanout_psmt4_clut.sv` (Ch103) — PSMT4 scanout. Stages a 4×4 PSMT4 sprite (2 pixels/byte) and 16 CLUT entries. Phase 1 (`clut_enable=1`): asserts each pixel reads `CLUT[zero-ext(nibble) + CSA*16]`. Phase 2 (`clut_enable=0`): asserts the grayscale fallback replicates the 4-bit nibble across the 8-bit DAC value. Both phases verify byte-stride half-extraction (low/high nibble pick) at every active pixel. Audit-low Phase 3 locks PSMT4 + nonzero CSA (CSA=1, window 16..31) end-to-end: TB-direct CLUT writes plant a 0xDEAD_BEEF sentinel at entries 0..15 and a per-index formula at 16..31; scanout asserts each pixel reads the formula and never the sentinel.
- `tb_gs_demo_psmt4_e2e.sv` (Ch107) — first end-to-end demo for the GS/PCRTC stack. **Scope is GS-side only**: the post-GIF register stream (per-reg A+D writes via `gs_stub.gif_reg_*`) plus privileged-block MMIO drive the pipeline; `gif_packed_stub` / GIFtag-PACKED is BYPASSED — feeding the same demo through the GIF front-end is a future chapter. Step 1 stages 16 PSMCT32 palette entries in VRAM at `CBP*256` (modelled as a TB-direct write — DMA→GS image transfer is a future chapter, but the framebuffer itself is NOT TB-direct). Step 2 drives per-reg writes (PRIM/FRAME_1/RGBAQ/XYZ2) for four SPRITEs laying out a 4-quadrant 8×4 image (TL idx 0x5, TR idx 0x7, BL idx 0xA, BR idx 0xC) at FRAME_1.PSM=PSMT4 — all 32 framebuffer pixels arrive via the Ch106 raster channel. Step 3 drives TEX0_1 with `CBP=palette, CPSM=PSMCT32, CSM=CSM2, CSA=0, CLD=4`; loader writes clut_stub[0..15]. Step 4 brings up scanout via privileged-block writes to DISPFB1 (PSM=PSMT4) + DISPLAY1 + PMODE.EN1. Step 5 captures one full frame and asserts each pixel reads back as `CLUT[quadrant_idx]` (or `CLUT[0]` outside the 8×4 image since vram_stub zero-init means nibble=0).
Aggregate asserts: 32 PSMT4 emits, mask ∈ {0x0F, 0xF0} on every emit (channel-isolation locked architecturally — only R[3:0] ever reaches VRAM at PSMT4), loader fires exactly once, no raster_overflow / raster_degenerate. This TB is the first stack-wide proof that the GS-side post-GIF sequence — per-reg writes → indexed framebuffer → TEX0+CLD palette upload → PMODE/DISPFB/DISPLAY scanout — produces a coherent RGB frame end to end without TB sideband for the framebuffer pixels. Routing the same primitives through GIFtag/PACKED A+D via `gif_packed_stub` closes the last sideband and is the natural Ch108 anchor. - `tb_gs_demo_psmt4_e2e_ee_full_bootlet.sv` (Ch114) — extends Ch113's EE-driven control plane to ALSO drive the DMAC channel-2 setup from the same MIPS instruction stream. The EE program now writes the 4 GS-priv registers + the 3 DMAC ch2 registers (MADR / QWC / CHCR.start) via real `sw` instructions, then SYSCALLs to halt. Total: 7 EE-CPU MMIO writes (4 GS-priv + 3 DMAC) producing the same 16×8 captured frame. **Architectural note**: the program lives in `bios_rom_stub` at 0xBFC0_0000 / phys 0x1FC0_0000, NOT in RAM. A RAM-resident program would have its instruction fetches contend with the DMAC's RAM reads through `ee_ram_stub`'s single read port (the map's CPU>DMAC arbitration silently corrupts DMAC data). Putting the program in BIOS decouples the two paths so EE and DMAC run truly in parallel. This also matches real PS2: the EE boots out of BIOS ROM. PASS criteria add to Ch113's: **3 EE-driven DMAC writes** seen at the map's DMAC-ch2 decode; the existing `dma=(1,36,1)` event taxonomy still holds (those events are triggered by the EE's CHCR write, not a TB-direct write). The remaining TB-direct surfaces in the demo are now narrowly the GIF payload pre-stage in RAM (a real EE driver would itself stage this) and bios_rom_stub's program preload (which is the EE bootlet itself — not a runtime TB sideband). 
- `tb_gs_demo_psmt4_e2e_ee_program.sv` (Ch113) — same demo as Ch112 but the 4 control-plane MMIO writes (PMODE / DISPFB1 / DISPLAY1 lo / DISPLAY1 hi) are no longer issued by the TB directly. Instead a 10-instruction MIPS program preloaded into ee_ram_stub at phys 0x800 (kseg0 0x80000800) is fetched and executed by `ee_core_stub` (parameterized with `PC_RESET=0x80000800`). The program is `LUI/ORI/SW × 4` plus a SYSCALL terminator; the SW instructions target `0x12000000+` and flow through `ee_memory_map_stub`'s GS-priv decode → `ee_gs_priv_bridge_stub` → `gs_stub.reg_wr_*`. Closes the very last TB-direct surface in the demo flow: every byte AND every register bit AND every control-plane decision now arrives from a real-shape source. PASS criteria add to Ch112's: `core_halt_o == 1` (asserts exactly once on the SYSCALL halt), `core_trap == 0`, EE program halts at `EE_PROG_VA + 36 = 0x80000824` (the SYSCALL slot). The TB still pre-stages the GIF payload and triggers the DMAC channel-2 transfer via TB-direct CHCR/MADR/QWC writes — a wider EE program that also drives DMAC bring-up is a separate future chapter.
- `tb_gs_demo_psmt4_e2e_eemap.sv` (Ch112) — same demo as Ch111 but the bridge is no longer driven by the TB directly. Instead the TB drives `ee_memory_map_stub.ee_wr_*` with full 32-bit physical addresses targeting the new GS-privileged-MMIO window at 0x1200_0000-0x1200_FFFF (64 KiB; phys[28:16] == 13'h1200). The map decodes the window, peels the 16-bit offset, and hands the 32-bit half-write to `ee_gs_priv_bridge_stub`, which then fires `gs_stub.reg_wr_*` with the running 64-bit shadow value. Closes the last control-plane routing gap before a real EE instruction stream can drive the demo's bring-up: PMODE / DISPFB1 / DISPLAY1 are now reachable from `sw 0x1200_0080(...)`-shaped writes rather than from a TB-shaped EE-MMIO port. PASS criteria identical to Ch111: 4 EE-MMIO writes / 4 bridge fires, same 16×8 captured frame.
**Architectural note**: this chapter ALSO adds 4 new output ports to `ee_memory_map_stub` (`ee_gs_priv_wr_en/addr/data/be`). The 56 existing `ee_memory_map_stub`-using TBs leave those outputs unconnected (named-port instantiation tolerates omitted outputs); only the new Ch112 TB wires them through to the bridge.
- `tb_gs_demo_psmt4_e2e_eemmio.sv` (Ch111) — same demo as Ch110 but the privileged-block control writes (PMODE / DISPFB1 / DISPLAY1) now arrive through `ee_gs_priv_bridge_stub` (a new RTL module) driven by EE-shaped 32-bit MMIO writes from the TB, instead of TB-direct `gs_stub.reg_wr_*` pulses. The bridge accumulates 32-bit half-writes per 8-byte slot and fires a 64-bit `gs_stub.reg_wr_*` pulse on each EE half-write — single-half writes work for PMODE.EN1 and DISPFB1 (interesting bits in the low 32), and a pair of writes (lo+hi to consecutive 4-byte addresses) handles DISPLAY1 whose DW/DH live in the high 32. **Bridge contract**: full-word writes only — `ee_wr_be` must be `4'b1111`; sub-word (per-byte) merging into the 64-bit shadow is intentionally out of scope and a `$error` fires on any narrower be (control-plane GS registers are always written as full 32-bit `sw` halves of an `sd`). **Scope precision**: this chapter closes the TB-direct `gs_stub.reg_wr_*` surface — i.e., the privileged-MMIO sink at the GS itself. The bridge is instantiated by the TB directly; it is NOT yet wired into `ee_memory_map_stub`, so the full EE-CPU / memory-map MMIO path (a real EE instruction stream reaching 0x12000000+ via `sw`) is a separate future chapter. PASS criteria add to Ch110's: **4 EE-MMIO writes** (1 PMODE + 1 DISPFB1 + 2 DISPLAY1) and **4 bridge fires** producing the same 16×8 captured frame as Ch110.
- `tb_gs_demo_psmct32_swizzle_trxdir_e2e.sv` (Ch124) — companion to Ch123: same EE-bootlet → DMAC → GIF data plane and same all-three-gates-on instantiation, but the framebuffer is filled by a TRXDIR/IMAGE upload through `gif_image_xfer_stub` instead of by raster.
The Ch121 image-xfer write-side swizzle gate becomes LOAD-BEARING inside the demo flow — every byte the GS produces comes out of the image-xfer engine at canonical PSMCT32 swizzled addresses, and the raster path is dormant. Payload: U1 (PACKED, NREG=4: BITBLTBUF{DBP=0, DBW=1, DPSM=PSMCT32} / TRXPOS{DSAX=DSAY=0} / TRXREG{RRW=16, RRH=8} / TRXDIR{XDIR=0}) + U2 (IMAGE, NLOOP=32: 32 IMAGE qwords carrying the 128 PSMCT32 pixels of the same four-quadrant pattern Ch123 used). DMAC QWC = 38. Verification mirrors Ch123: (1) full 16×8 scanout frame capture; (2) per-pixel byte readback at the canonical swizzled address via vram_stub's 2nd read port; (3) strict linear-vs-swizzled separator at byte 1024 stays 0. Aggregate counts: `dma=(1,38,1) ee_dmac_wr=3 giftags=2 ad_writes=4 xfer_writes=128 ee_priv_wr=4 bridge_fires=4 core_halt=1 emits=0 frame=16x8`. Ch123 + Ch124 together exercise BOTH PSMCT32 write-side paths (raster Ch122 + image-xfer Ch121) end-to-end through the same driver-shaped flow with the same swizzled-scanout (Ch120) read side.
- `tb_gs_demo_psmct32_swizzle_e2e.sv` (Ch123) — full driver-shaped end-to-end demo with ALL THREE PSMCT32 swizzle gates flipped on simultaneously: `gs_stub#(PSMCT32_SWIZZLE=1)` (Ch122 raster), `gif_image_xfer_stub#(PSMCT32_SWIZZLE=1)` (Ch121 — instantiated but unused in this demo), `gs_pcrtc_stub#(PSMCT32_SWIZZLE=1)` (Ch120 read). The data plane is the same DMAC + GIF + EE-bootlet shape Ch107..Ch114 demos use: a BIOS-resident EE program (PC_RESET=0xBFC0_0000) configures GS-priv (DISPFB1, DISPLAY1 lo/hi, PMODE.EN1) via `sw` instructions through `ee_memory_map_stub` → `ee_gs_priv_bridge_stub` → `gs_stub.reg_wr_*`, then kicks DMAC ch2 (MADR / QWC / CHCR) via `sw` to the DMAC reg window, then `SYSCALL` halts. DMAC delivers a 24-qword payload from `ee_ram_stub` to `gif_packed_stub`, which dispatches 4 SPRITE PACKED packets (1 GIFtag + 5 A+D each — PRIM, FRAME_1=PSMCT32, RGBAQ, XYZ2, XYZ2).
The 4 sprites tile the 16×8 active area into 4 quadrants with unique RGB triples. With the raster gate on, all 128 per-pixel store addresses go through `gs_swizzle_psmct32_stub`; with the pcrtc gate on, scanout reads from those same swizzled addresses. **Two-phase verification**: (1) **scanout** — every (x, y) in 16×8 captures its sprite's RGB; (2) **byte readback via vram_stub's 2nd read port** — for every (x, y), the 32-bit word at `ref_addr_psmct32(0, 1, x, y)` equals the sprite's `{A=0xFF, B, G, R}` PSMCT32 word. Strict linear-vs-swizzled separator at byte 1024 (where the linear formula's y=4 row would land at stride=256) stays 0 — the swizzled write set for the 16×8 image stays in blocks (0,0) and (1,0) of page 0 (bytes 0..511), so a fall-through to linear would have placed sprite-2's color at byte 1024. Aggregate counts: `dma=(1,24,1) ee_dmac_wr=3 giftags=4 ad_writes=20 xfer_writes=0 ee_priv_wr=4 bridge_fires=4 core_halt=1 emits=128 frame=16x8`. This is the FIRST end-to-end demo where every PSMCT32 byte the GS produces lives at the canonical PCSX2 swizzled address AND the scanout reads from it — byte-accurate to real PS2 VRAM layout, end-to-end through the driver-shaped flow.
- `tb_gs_raster_swizzle_psmct32.sv` (Ch122) — focused contract for the new `PSMCT32_SWIZZLE` parameter on `gs_stub`. When the parameter is set to 1 AND the active raster PSM is PSMCT32 (`ras_psm == 6'h00`), the per-pixel raster emit address is routed through the Ch119 `gs_swizzle_psmct32_stub` (FBP=ras_fbp, FBW=ras_fbw, x=s2_x_q[11:0], y=s2_y_q[11:0]) and its output is the absolute byte address (FBP*2048 already baked in). At Ch122 only, PSMCT16/PSMT8/PSMT4 raster emits always took the linear path. Ch128 later closed the PSMCT16 raster gate, Ch134 the PSMT8 raster gate, and Ch140 the PSMT4 raster gate (each with its own per-PSM parameter on this same `gs_stub`). Default 0 keeps every existing PSMCT32 raster TB unchanged.
**Three-phase verification**: (1) **origin SPRITE** — drive a single 16×4 SPRITE at FRAME_1{FBP=0, FBW=1, PSMCT32} with RGBAQ R=0x55/G=0xAA/B=0xCC/A=0x77, expect 64 emits; per-pixel byte readback via vram_stub's 2nd read port at swizzled addresses confirms each pixel lands where the swizzle says. Strict linear-vs-swizzled separators at bytes 512 and 768 (the linear formula's y=2 / y=3 row starts) stay 0 — proves the gate is live. (2) **scanout agreement** — enable the Ch120 swizzled-pcrtc path on the same VRAM contents, capture the full 16×4 frame, assert each visible pixel reads back the SPRITE's RGB. Both gs_stub (Ch122 raster) and gs_pcrtc_stub (Ch120 scanout) instantiate the same swizzle module; a successful capture proves the two integrations agree at byte level — what raster wrote at swizzled addresses comes out on r/g/b at the same (x, y). (3) **non-origin SPRITE** — re-arm the raster with FRAME_1{FBP=4, FBW=2, PSMCT32} and an 8×2 SPRITE at (60, 4)..(67, 5) crossing the page-x boundary at x=64 (so page_index actually changes mid-row). Pins three contracts the origin transfer can't distinguish from a buggy implementation: (a) `ras_fbp` reaches the swizzle's `fbp` input (FBP=0 in Phase 1 would have masked a tied-zero regression), (b) `ras_fbw` reaches the swizzle's `fbw` input (FBW=1 would have masked a tied-one regression), (c) the swizzle gets the FULL absolute pixel coords (s2_x_q, s2_y_q) rather than bbox-local coords (Phase 1's sprite started at (0,0) so absolute and local were equal there). Strict linear-vs-swizzled separator at byte 10480 (where the linear formula would land Phase-3's first pixel) stays 0. Total emit count after all phases: 64 + 16 = 80. With Ch120 (read), Ch121 (TRXDIR upload), and Ch122 (raster emit) all live, the three major PSMCT32 paths are byte-consistent end-to-end.
- `tb_gs_image_xfer_swizzle_psmct32.sv` (Ch121) — focused contract for the new `PSMCT32_SWIZZLE` parameter on `gif_image_xfer_stub`.
When the parameter is set to 1 AND the upload's PSM is PSMCT32, per-pixel VRAM byte addresses are routed through the Ch119 `gs_swizzle_psmct32_stub` (FBP=0, FBW=DBW, x=DSAX+cur_x, y=DSAY+cur_y) and `dest_base_q (= DBP*256)` is added back to anchor at the upload's DBP base. At Ch121 only, PSMCT16/PSMT8/PSMT4 uploads always took the linear path; Ch127, Ch133, and Ch139 later added per-PSM gates on this same `gif_image_xfer_stub`. Default 0 keeps every existing image-xfer TB unchanged. **Three-phase verification**: (1) **origin transfer** — TRXDIR upload of a 16×4 PSMCT32 image at DBP=DSAX=DSAY=0, DBW=1, RRW=16, RRH=4 → 64 pixels, 16 IMAGE qwords. After the upload completes, the TB reads VRAM via vram_stub's 2nd read port at the SWIZZLED byte address (TB-side `ref_addr()` mirrors the swizzle module) and asserts each pixel landed where the swizzle says. Strict linear-vs-swizzled separator: bytes 512 and 768 (where linear y=2 and y=3 rows would land) stay 0 under swizzled, since the 16×4 image only fills blocks (0,0) and (1,0) which together cover bytes [0..127] ∪ [256..383]. (2) **scanout agreement** — enable the Ch120 swizzled-pcrtc path on the same VRAM contents, capture the full 16×4 frame, assert each scanned-out pixel matches its uploaded color. Both upload and scanout instantiate the same `gs_swizzle_psmct32_stub`, so a successful capture proves the two integrations agree at byte level — what was written by TRXDIR comes out on r/g/b at the same (x, y). (3) **non-origin transfer** — re-arm with NONZERO DBP, DSAX, and DSAY (DBP=8, DSAX=4, DSAY=2, RRW=8, RRH=4) and verify each uploaded pixel lands at `DBP*256 + swizzle(0, DBW, DSAX+x_local, DSAY+y_local)`. Phase 3 pins TWO contracts the origin transfer can't distinguish from a buggy implementation: (a) `dest_base_q (= DBP*256)` is correctly ADDED ON TOP of the swizzle output (with DBP=0 a missing-add regression would still pass), and (b) the swizzle is fed the FULL effective coordinates (with DSAX=DSAY=0 a "feeds only cur_x/cur_y" regression would still pass).
Strict linear-vs-swizzled separator at byte 3088 (where the linear formula's y=2 row of the P3 image would land) stays 0 under swizzled. NOTE: at Ch121 ONLY, gs_stub raster writes still used linear addressing; Ch122 closed that gate.
- `tb_gs_scanout_swizzle_psmct32.sv` (Ch120) — focused contract for the new `PSMCT32_SWIZZLE` parameter on `gs_pcrtc_stub`. When the parameter is set to 1 AND the active PSM is PSMCT32, PCRTC reads VRAM at swizzled addresses (via the Ch119 swizzle module instantiated inside pcrtc) instead of the legacy linear formula. Other PSMs (CT16/T8/T4) and `PSMCT32_SWIZZLE=0` keep the original linear path unchanged. Topology: TB drives `vram_stub.write_*` directly with each pixel's color preloaded at the swizzled byte address (TB-side `ref_addr()` mirrors the DUT swizzle math), then pcrtc with `PSMCT32_SWIZZLE=1` scans out the frame and the TB asserts each captured pixel matches the preloaded color. Image is 16×4 PSMCT32 (covers blocks (0,0) AND (1,0) horizontally) at FBP=0/FBW=1; pcrtc active area is 8×4 (block (0,0) entirely), but the swizzle vs. linear distinction shows up at any y>0 (linear y=1 → byte 64; swizzled byte 32) so even the in-window region is a strict separator. Per-pixel color is unique (`{A=0xFF, B=y<<4, G=x<<4, R=0x10|(y*8+x)}`) so any wrong-address commit surfaces immediately. NOTE: at Ch120 ONLY, gs_stub raster writes and gif_image_xfer_stub uploads still used linear addressing — Ch120 was read-side only. Ch121 (image-xfer) and Ch122 (raster) closed the write-side gates, and Ch123 demonstrates all three running together end-to-end.
- `tb_gs_demo_psmt8_swizzle_trxdir_e2e.sv` (Ch136) — companion to Ch135: same EE-bootlet → DMAC → GIF data plane and same all-three-gates-on instantiation, but the framebuffer is filled by a TRXDIR/IMAGE upload through `gif_image_xfer_stub` instead of by raster.
The Ch133 PSMT8 image-xfer write-side swizzle gate becomes LOAD-BEARING inside the demo flow — every byte the GS produces comes out of the image-xfer engine at canonical PSMT8 swizzled addresses, and the raster path is dormant. Mirrors Ch124 PSMCT32 + Ch130 PSMCT16 TRXDIR demos for the third PSM. Payload: U1 (PACKED, NREG=4: BITBLTBUF{DBP=0, DBW=2, DPSM=PSMT8} / TRXPOS{DSAX=DSAY=0} / TRXREG{RRW=16, RRH=8} / TRXDIR{XDIR=0}) + U2 (IMAGE, NLOOP=8: 8 IMAGE qwords each carrying 16 PSMT8 bytes for the 16×8 image, row-major). DBW=2 is the minimum even DBW for PSMT8. DMAC QWC=14. Per-quadrant byte indices Q0=0xA0/Q1=0x40/Q2=0xC0/Q3=0x60 reused from Ch135 so the verify side is unchanged. New `trxdir_arms_seen` counter asserts =1 (single TRX setup) + xfer-side per-emit observer asserts every xfer_we pulse fires with be=4'b0001, mask=0xFFFFFFFF (PSMT8 single-byte commit shape). Verification mirrors Ch135: (1) full 16×8 scanout frame capture; (2) per-pixel BYTE readback at the canonical swizzled byte address (with `addr[1:0]` selecting the right byte from the 32-bit word) via vram_stub's 2nd port; (3) strict separators at bytes 128 and 256 stay 0. Aggregate counts: `dma=(1,14,1) ee_dmac_wr=3 giftags=2 ad_writes=4 trxdir_arms=1 xfer_writes=128 ee_priv_wr=4 bridge_fires=4 core_halt=1 emits=0 frame=16x8`. **First-attempt PASS** errors=0. Ch135 + Ch136 together close the PSMT8 byte-accuracy milestone end-to-end through the full driver-shaped flow — same Ch123+Ch124 (PSMCT32) and Ch129+Ch130 (PSMCT16) shape.
- `tb_gs_demo_psmt4_swizzle_trxdir_e2e.sv` (Ch142) — companion to Ch141 (raster-driven PSMT4 e2e): same EE-bootlet → DMAC → GIF data plane and same all-three-gates-on instantiation, but the framebuffer is filled by a TRXDIR/IMAGE upload through `gif_image_xfer_stub` instead of by raster.
The Ch139 PSMT4 image-xfer write-side swizzle gate becomes LOAD-BEARING inside the demo flow — every nibble the GS produces comes out of the image-xfer engine at canonical PSMT4 swizzled (addr, nibble_hi) slots, and the raster path is dormant. Mirrors Ch124's PSMCT32 TRXDIR demo, Ch130's PSMCT16 TRXDIR demo, and Ch136's PSMT8 TRXDIR demo for the fourth (and last) common GS PSM. Cloned from Ch136 and surgically retargeted to PSMT4. Payload: U1 (PACKED, NREG=4: BITBLTBUF{DBP=0, DBW=2, DPSM=PSMT4} / TRXPOS{DSAX=DSAY=0} / TRXREG{RRW=16, RRH=8} / TRXDIR{XDIR=0}) + U2 (IMAGE, NLOOP=4 EOP=1: 4 IMAGE qwords carrying 32 PSMT4 nibbles each — at RRW=16 each qword holds 2 rows: lanes 0..15 = row 2*qi, lanes 16..31 = row 2*qi+1, matching Ch139's focused-TB packing). Total QWC = 10 (5+5). EE-bootlet DISPFB1 immediate identical to Ch141 (LUI 0x000A; ORI 0x0400 → PSM=PSMT4). Per-quadrant nibbles match Ch141 verbatim (Q0=0xA → 0xAA, Q1=0x4 → 0x44, Q2=0xC → 0xCC, Q3=0x6 → 0x66) so the verify side reuses Ch141's pattern unchanged. Verification mirrors Ch141: (1) full 16×8 scanout frame capture via Ch138 swizzled-pcrtc; (2) per-pixel NIBBLE readback at the canonical swizzled (addr, nibble_hi) slot via vram_stub's 2nd port (addr[1:0]-keyed byte selection then nibble_hi-keyed nibble selection); (3) strict linear-vs-swizzled separator at byte 128 stays 0 (per-byte check, not full word: a neighbor byte may legitimately be touched); (4) per-emit observer asserts every image-xfer write is `be=4'b0001` / `mask ∈ {0x0F, 0xF0}` (PSMT4 nibble RMW shape) and the `trxdir_wr_q` arming pulse fires exactly once. Aggregate counts: `dma=(1,10,1) ee_dmac_wr=3 giftags=2 ad_writes=4 trxdir_arms=1 xfer_writes=128 ee_priv_wr=4 bridge_fires=4 core_halt=1 emits=0 frame=16x8`.
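
The QWC figures quoted for the TRXDIR-driven demos (22 for PSMCT16, 14 for PSMT8, 10 for PSMT4) follow from the packet shapes. The sketch below reproduces that arithmetic under stated assumptions: one GIF tag qword per packet, one qword per PACKED A+D entry, and 128-bit IMAGE qwords. The helper names are illustrative only and do not exist in the testbenches.

```python
def image_qwords(width, height, bits_per_pixel):
    """IMAGE qwords needed for a width x height upload (128 bits per qword)."""
    total_bits = width * height * bits_per_pixel
    # Assumption: the image fills a whole number of qwords, as in the demos.
    assert total_bits % 128 == 0
    return total_bits // 128


def trxdir_qwc(width, height, bits_per_pixel, setup_regs=4):
    """DMAC QWC for U1 (tag + A+D setup writes) plus U2 (tag + IMAGE data)."""
    u1 = 1 + setup_regs  # BITBLTBUF / TRXPOS / TRXREG / TRXDIR
    u2 = 1 + image_qwords(width, height, bits_per_pixel)
    return u1 + u2


if __name__ == "__main__":
    for name, bpp in [("PSMCT16", 16), ("PSMT8", 8), ("PSMT4", 4)]:
        print(name, image_qwords(16, 8, bpp), trxdir_qwc(16, 8, bpp))
```

For the 16×8 image this gives 16, 8, and 4 IMAGE qwords respectively, matching the NLOOP values in the three TRXDIR payloads.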
Ch141 + Ch142 together exercise BOTH PSMT4 write-side paths (raster Ch140 + image-xfer Ch139) end-to-end through the same driver-shaped flow with the same swizzled-scanout (Ch138) read side — bringing PSMT4 to full parity with the PSMCT32, PSMCT16, and PSMT8 e2e coverage from Ch123+Ch124, Ch129+Ch130, and Ch135+Ch136. **Architectural milestone**: this is the first state of the project where ALL FOUR common GS PSMs (CT32 + CT16 + T8 + T4) have BOTH a raster-driven AND a TRXDIR-driven driver-shaped end-to-end byte-accuracy demo — closing the **four-PSM × three-path × dual-driver-shape e2e foundation** (8 demos total). The bug-fix iteration: TB-side `ref_col_idx4` was first written with a 7-bit case key `{yb[2:0], xb[3:0]}` covering yb=0..7 in indices 0..127, but the values for yb=4..7 were miscopied from Ch139's yb=12..15 row (Ch139 only exercises yb=0..3 and yb=12..15). Phase 2 readback failed for all 64 pixels in y=4..7 with `got=0 expected=0xC/0x6` — the engine wrote the right nibbles to the right addresses (scanout passed), but the TB's ref looked at the wrong slot. Fix: switched to Ch141's 9-bit case key `{yb[3:0], xb[4:0]}` and used Ch141's verified yb=0..7 values verbatim. **PASS** after one bug-fix iteration (the table fix). - `tb_gs_demo_psmt4_swizzle_e2e.sv` (Ch141) — first driver-shaped end-to-end PSMT4 demo with all three PSMT4 swizzle gates (Ch138 read-side pcrtc, Ch139 image-xfer write-side, Ch140 raster write-side) parameter-set to 1 simultaneously, but with the demo flow exercising only the raster (Ch140) + scanout (Ch138) paths as load-bearing. The Ch139 image-xfer gate is smoke-only here (parameter is set but `xfer_writes_seen == 0` is asserted, since no TRXDIR/IMAGE packet is delivered in the raster-driven payload); the Ch139 load-bearing variant is the Ch142 TRXDIR-driven PSMT4 e2e (mirrors Ch124/Ch130/Ch136). PSMT4 counterpart of Ch123's PSMCT32 / Ch129's PSMCT16 / Ch135's PSMT8 e2e demos.
Same EE-bootlet → DMAC → GIF data plane: BIOS-resident EE program configures GS-priv (DISPFB1 PSMT4 with FBW=2, DISPLAY1, PMODE) via `sw` instructions → kicks DMAC ch2 → SYSCALL halts. DMAC delivers a 24-qword payload (4 SPRITE PACKED packets) through `gif_packed_stub` into `gs_stub` raster. The 4 sprites tile the 16×8 active area into 4 quadrants with per-quadrant unique RGBAQ.R[3:0] nibbles (Q0=0xA → 0xAA, Q1=0x4 → 0x44, Q2=0xC → 0xCC, Q3=0x6 → 0x66). PSMT4 raster (Ch106) takes RGBAQ.R[3:0] as the nibble that hits VRAM via the existing Ch106 nibble RMW machinery (write_be=4'b0001 + write_mask 0x0F or 0xF0); Ch140 keys the high/low nibble selector off the swizzle's `nibble_hi` output instead of `s2_pixel_index[0]`. PCRTC's Ch103 PSMT4 grayscale fallback (clut_enable=0) surfaces the nibble as r=g=b={n, n} at scanout, so each captured pixel IS the nibble we wrote (no CLUT setup needed for this demo; a CLUT-driven Ch141 variant is a future chapter). With the raster gate on, all 128 per-pixel nibble stores go through `gs_swizzle_psmt4_stub`; with the pcrtc gate on, scanout reads from those same swizzled (addr, nibble_hi) slots. **Two-phase verification**: (1) full-frame scanout asserts each (x, y) reads back its quadrant's nibble as PSMT4 grayscale r=g=b={n, n}; (2) per-pixel NIBBLE readback at the canonical swizzled address (with `addr[1:0]` selecting the right byte from the 32-bit word, then `nibble_hi` selecting which nibble of that byte) via vram_stub's 2nd port — the 16×8 PSMT4 image lives entirely in the upper-left of block (0,0) of page 0 (PSMT4 block = 32×16 px) and the within-block columnTable4 yb=0..7 / xb=0..15 exercises nibble_idx values [0..127]. Strict linear-vs-swizzled separator at byte 128 (linear y=2 row start at PSMT4 stride=64 with FBW=2) stays 0 — outside block (0,0)'s touched range. Per-emit observer locks PSM=0x14, be=4'b0001, mask ∈ {0x0F, 0xF0}. 
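
The nibble RMW commit and the grayscale fallback described above can be modeled in a few lines. This is a behavioral sketch, not the RTL: it assumes the written nibble is placed in the lane selected by `nibble_hi` and merged under the `0x0F`/`0xF0` mask, and that the CLUT-disabled scanout duplicates the nibble into both halves of each 8-bit channel. The function names are hypothetical.

```python
def nibble_mask(nibble_hi):
    """Write mask for one PSMT4 pixel inside its byte."""
    return 0xF0 if nibble_hi else 0x0F


def merge_nibble(old_byte, nibble, nibble_hi):
    """Read-modify-write: replace only the selected nibble of old_byte."""
    mask = nibble_mask(nibble_hi)
    data = ((nibble & 0xF) << 4) if nibble_hi else (nibble & 0xF)
    return (old_byte & ~mask & 0xFF) | (data & mask)


def gray_from_nibble(nibble):
    """CLUT-disabled PSMT4 scanout value per channel: {n, n}."""
    n = nibble & 0xF
    return (n << 4) | n


if __name__ == "__main__":
    byte = merge_nibble(0x00, 0xA, nibble_hi=False)  # low nibble first
    byte = merge_nibble(byte, 0xC, nibble_hi=True)   # neighbor in high nibble
    print(hex(byte))
    print([hex(gray_from_nibble(n)) for n in (0xA, 0x4, 0xC, 0x6)])
```

Writing the high nibble leaves the low nibble intact, which is why the separator checks in the PSMT4 testbenches compare a single byte rather than a full word.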
Aggregate counts: `dma=(1,24,1) ee_dmac_wr=3 giftags=4 ad_writes=20 xfer_writes=0 ee_priv_wr=4 bridge_fires=4 core_halt=1 emits=128 frame=16x8`. **First-attempt PASS** errors=0. Together with Ch123 (PSMCT32 e2e), Ch129 (PSMCT16 e2e), and Ch135 (PSMT8 e2e), this is the first state of the project where the full driver-shaped flow has end-to-end byte-accuracy demos for ALL FOUR common GS PSMs (CT32 + CT16 + T8 + T4) under software-shaped raster traffic. The TRXDIR-driven PSMT4 companion landed at Ch142 (mirror of Ch124/Ch130/Ch136 making Ch139 load-bearing), so Ch141 + Ch142 together close the PSMT4 byte-accuracy milestone end-to-end through both driver shapes — bringing PSMT4 to full parity with CT32/CT16/T8. - `tb_gs_demo_psmt8_swizzle_e2e.sv` (Ch135) — first driver-shaped end-to-end PSMT8 demo with all three PSMT8 swizzle gates (Ch132 read-side pcrtc, Ch133 image-xfer write-side, Ch134 raster write-side) parameter-set to 1 simultaneously, but with the demo flow exercising only the raster (Ch134) + scanout (Ch132) paths as load-bearing. The Ch133 image-xfer gate is smoke-only here (parameter is set but `xfer_writes_seen == 0` is asserted, since no TRXDIR/IMAGE packet is delivered in the raster-driven payload); the Ch133 load-bearing variant is the Ch136 TRXDIR-driven PSMT8 e2e (mirror of Ch124/Ch130). PSMT8 counterpart of Ch123's PSMCT32 / Ch129's PSMCT16 e2e demos. Same EE-bootlet → DMAC → GIF data plane: BIOS-resident EE program configures GS-priv (DISPFB1 PSMT8 with FBW=2, DISPLAY1, PMODE) via `sw` instructions → kicks DMAC ch2 → SYSCALL halts. DMAC delivers a 24-qword payload (4 SPRITE PACKED packets) through `gif_packed_stub` into `gs_stub` raster. The 4 sprites tile the 16×8 active area into 4 quadrants with per-quadrant unique RGBAQ.R values (Q0=0xA0, Q1=0x40, Q2=0xC0, Q3=0x60). 
PSMT8 raster (Ch105) takes the natural ABGR's R channel as the byte index that hits VRAM; PCRTC's Ch96 grayscale fallback (clut_enable=0) surfaces the byte as R=G=B at scanout, so each captured pixel IS the byte we wrote (no CLUT setup needed for this demo; a CLUT-driven Ch135 variant is a future chapter). With the raster gate on, all 128 per-pixel byte stores go through `gs_swizzle_psmt8_stub`; with the pcrtc gate on, scanout reads from those same swizzled addresses. **Two-phase verification**: (1) full-frame scanout asserts each (x, y) reads back its quadrant's byte as PSMT8 grayscale R=G=B; (2) per-pixel BYTE readback at the canonical swizzled address (with `addr[1:0]` selecting the right byte from the 32-bit word) via vram_stub's 2nd port — the 16×8 PSMT8 image lives entirely in the upper half of block (0,0) of page 0 (PSMT8 block = 16×16 px) and the within-block columnTable8 yb=0..7 exercises byte values [0..127]. Strict linear-vs-swizzled separators at bytes 128 (linear y=1 row start at PSMT8 stride=128 with FBW=2) and 256 (linear y=2) stay 0 — both outside block (0,0)'s touched range. Aggregate counts: `dma=(1,24,1) ee_dmac_wr=3 giftags=4 ad_writes=20 xfer_writes=0 ee_priv_wr=4 bridge_fires=4 core_halt=1 emits=128 frame=16x8`. Together with Ch123 (PSMCT32 e2e) and Ch129 (PSMCT16 e2e), this was the first state of the project where the full driver-shaped flow had end-to-end byte-accuracy demos for the CT32/CT16/T8 trio under software-shaped traffic. PSMT4 was the natural follow-on and landed at Ch141 (raster-driven, mirror of this demo) + Ch142 (TRXDIR-driven, mirror of Ch136), closing the four-PSM × dual-driver-shape e2e matrix. - `tb_gs_demo_psmct16_swizzle_trxdir_e2e.sv` (Ch130) — companion to Ch129: same EE-bootlet → DMAC → GIF data plane and same all-three-gates-on instantiation, but the framebuffer is filled by a TRXDIR/IMAGE upload through `gif_image_xfer_stub` instead of by raster.
The Ch127 image-xfer write-side swizzle gate becomes LOAD-BEARING inside the demo flow — every byte the GS produces comes out of the image-xfer engine at canonical PSMCT16 swizzled addresses, and the raster path is dormant. Payload: U1 (PACKED, NREG=4: BITBLTBUF{DBP=0, DBW=1, DPSM=PSMCT16} / TRXPOS{DSAX=DSAY=0} / TRXREG{RRW=16, RRH=8} / TRXDIR{XDIR=0}) + U2 (IMAGE, NLOOP=16: 16 IMAGE qwords carrying the 128 PSMCT16 halfwords of the same four-quadrant pattern Ch129 used). DMAC QWC = 22. Verification mirrors Ch129: (1) full 16×8 scanout frame capture; (2) per-pixel halfword readback at the canonical swizzled byte address (with `addr[1]` selecting the right 16-bit slot) via vram_stub's 2nd read port; (3) strict linear-vs-swizzled separators at bytes 256 and 384 stay 0; (4) per-emit observer asserts every image-xfer write is `be=4'b0011` / `mask=0xFFFF_FFFF` (low halfword) and the `trxdir_wr_q` arming pulse fires exactly once. Aggregate counts: `dma=(1,22,1) ee_dmac_wr=3 giftags=2 ad_writes=4 trxdir_arms=1 xfer_writes=128 ee_priv_wr=4 bridge_fires=4 core_halt=1 emits=0 frame=16x8`. Ch129 + Ch130 together exercise BOTH PSMCT16 write-side paths (raster Ch128 + image-xfer Ch127) end-to-end through the same driver-shaped flow with the same swizzled-scanout (Ch126) read side — bringing PSMCT16 to full parity with the PSMCT32 e2e coverage from Ch123 + Ch124. - `tb_gs_demo_psmct16_swizzle_e2e.sv` (Ch129) — full driver-shaped end-to-end demo with all three PSMCT16 swizzle gates (Ch126 read-side pcrtc, Ch127 image-xfer write-side, Ch128 raster write-side) parameter-set to 1 simultaneously, but with the demo flow exercising only the raster (Ch128) + scanout (Ch126) paths as load-bearing. The Ch127 image-xfer gate is smoke-only here (parameter is set but `xfer_writes_seen == 0` is asserted, since no TRXDIR/IMAGE packet is delivered in the raster-driven payload); Ch130 (TRXDIR-driven PSMCT16 e2e) is the load-bearing image-xfer-side counterpart.
PSMCT16 counterpart of Ch123's PSMCT32 e2e demo. Same EE-bootlet → DMAC → GIF data plane: BIOS-resident EE program configures GS-priv (DISPFB1 PSMCT16, DISPLAY1, PMODE) via `sw` instructions → kicks DMAC ch2 → SYSCALL halts. DMAC delivers a 24-qword payload (4 SPRITE PACKED packets) through `gif_packed_stub` into `gs_stub` raster. The 4 sprites tile the 16×8 active area into 4 quadrants with per-quadrant unique RGB5A1 colors picked so the 5→8 bit-replicate at PCRTC output produces unique 8-bit RGB triples. With the raster gate on, all 128 per-pixel halfword stores go through `gs_swizzle_psmct16_stub`; with the pcrtc gate on, scanout reads from those same swizzled addresses. **Two-phase verification**: (1) full-frame scanout asserts each (x, y) reads back its quadrant's 5→8-expanded RGB; (2) per-pixel halfword readback via vram_stub's 2nd port at swizzled addresses (with `addr[1]` selecting the right 16-bit slot) confirms each sprite halfword landed where the swizzle says — the 16×8 PSMCT16 image lives entirely in block (0,0) of page 0 (PSMCT16 block = 16×8 px), so the readback exercises ALL 16 xb × 8 yb entries of `columnTable16`. Strict linear-vs-swizzled separators at bytes 256 (linear y=2 row start at PSMCT16 stride=128) and 384 (linear y=3) stay 0 — both outside block (0,0)'s 256-byte range. Aggregate counts: `dma=(1,24,1) ee_dmac_wr=3 giftags=4 ad_writes=20 xfer_writes=0 ee_priv_wr=4 bridge_fires=4 core_halt=1 emits=128 frame=16x8`. Together with Ch123 (PSMCT32 e2e), this is the first state of the project where the full driver-shaped flow has end-to-end byte-accuracy demos for BOTH direct-color PS2 PSMs. - `tb_gs_raster_swizzle_psmct16.sv` (Ch128) — focused contract for the new `PSMCT16_SWIZZLE` parameter on `gs_stub` (the raster emit surface). 
Mirrors Ch122's wiring shape but for PSMCT16: when the parameter is 1 AND the active raster PSM is PSMCT16 (`ras_psm == 6'h02`), the per-pixel raster emit address is routed through the Ch125 `gs_swizzle_psmct16_stub` (FBP=ras_fbp, FBW=ras_fbw, x=s2_x_q[11:0], y=s2_y_q[11:0]) — its output is the absolute byte address. PSMCT32 is gated by its own `PSMCT32_SWIZZLE` parameter (Ch122). At Ch128 only, PSMT8/PSMT4 raster emits stayed linear; Ch134 later closed the PSMT8 raster gate via `PSMT8_SWIZZLE` on this same `gs_stub`, and Ch140 closed the PSMT4 raster gate via `PSMT4_SWIZZLE`. Default 0 keeps every existing PSMCT16 raster TB (Ch95 etc.) unchanged. **Three-phase verification**: (1) **origin SPRITE** — drive a 16×4 PSMCT16 SPRITE at FRAME_1{FBP=0, FBW=1, PSMCT16} with RGBAQ {R=0xAA, G=0x50, B=0xC0, A=0x00} → halfword 0x6155 (R5=0x15, G5=0x0A, B5=0x18, A1=0). Per-pixel halfword readback via vram_stub's 2nd port (with `addr[1]` selecting the right 16-bit slot) confirms each lands at the swizzled byte. The 16×4 image lives in block (0,0) of page (0,0), so within-block columnTable16 rows 0..3 are exercised. **Strict separators**: bytes 128 (linear y=1 row start at PSMCT16 stride=128) and 256 (linear y=2) stay 0 — proves the gate is live, since a fall-through to the legacy linear path would put the SPRITE halfword there. (2) **scanout agreement** — enable the Ch126 swizzled-pcrtc path on the same VRAM contents, capture the full 16×4 frame, assert each visible pixel reads back the expected RGB after PCRTC's 5→8 bit-replicate (RGB = {0xAD, 0x52, 0xC6}). Both gs_stub (Ch128 raster) and gs_pcrtc_stub (Ch126 scanout) instantiate the same swizzle module. (3) **non-origin SPRITE** — re-arm with FRAME_1{FBP=4, FBW=2, PSMCT16} and an 8×4 SPRITE at (60, 4)..(67, 7) with distinct color (halfword 0x9F8E). Crosses the PAGE-x boundary at x=64 (page (0,0) for x∈[60..63] — block (0,3) by swizzle table — vs page (1,0) for x∈[64..67] — block (0,0)) so page_index changes mid-row.
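
The Phase 1 halfword and the Phase 2 scanout triple quoted above can be reproduced with a short model of the RGB5A1 pack and the 5→8 bit-replicate. This sketch assumes the raster path keeps the top 5 bits of each 8-bit channel and takes A1 from the top bit of A; the second assumption is only consistent with the one A=0x00 case given here, and the function names are illustrative.

```python
def pack_rgb5a1(r8, g8, b8, a8):
    """Pack 8-bit channels into a PSMCT16 halfword (R in bits 4:0)."""
    r5, g5, b5 = r8 >> 3, g8 >> 3, b8 >> 3
    a1 = a8 >> 7  # assumption: alpha bit taken from A[7]
    return r5 | (g5 << 5) | (b5 << 10) | (a1 << 15)


def expand_5_to_8(v5):
    """Bit-replicate a 5-bit channel to 8 bits: {v5, v5[4:2]}."""
    return (v5 << 3) | (v5 >> 2)


def scanout_rgb(halfword):
    """RGB888 triple produced from an RGB5A1 halfword at scanout."""
    return tuple(expand_5_to_8((halfword >> shift) & 0x1F) for shift in (0, 5, 10))


if __name__ == "__main__":
    hw = pack_rgb5a1(0xAA, 0x50, 0xC0, 0x00)
    print(hex(hw), [hex(c) for c in scanout_rgb(hw)])
```

Because the low three bits of each channel are dropped on the write side, the scanout value (0xAD, 0x52, 0xC6) differs from the RGBAQ input (0xAA, 0x50, 0xC0); the testbench compares against the expanded value.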
Within-block column-table coords (xb=12..15 then 0..3, yb=4..7) cover columnTable16 rows 4..7 — a different region than Phase 1's yb=0..3. Pins three contracts Phase 1 can't: (a) `ras_fbp` reaches the swizzle's `fbp` input (FBP=0 in P1 would mask a tied-zero); (b) `ras_fbw` reaches `fbw` (FBW=1 in P1 would mask a tied-one); (c) the swizzle gets the FULL absolute pixel coords s2_x_q/s2_y_q rather than bbox-local (P1's sprite started at (0,0), so absolute and local were equal). Strict P3 separator at byte 9336 (linear formula's effective (60, 4) byte) stays 0 — outside the P3 swizzled write set, which lives in block (0,3) of page (0,0) (10914..11006) and block (0,0) of page (1,0) (16512..16604). Total emit count after all phases: 64 + 32 = 96. With Ch126 (read), Ch127 (TRXDIR upload), and Ch128 (raster emit) all live, the three major PSMCT16 paths are byte-consistent end-to-end — completes the byte-accuracy milestone for the second PSM, mirroring the Ch120/Ch121/Ch122 PSMCT32 closure. - `tb_gs_image_xfer_swizzle_psmct16.sv` (Ch127) — focused contract for the new `PSMCT16_SWIZZLE` parameter on `gif_image_xfer_stub`. Mirrors Ch121's wiring shape but for PSMCT16: when the parameter is 1 AND the upload's PSM is PSMCT16, per-pixel byte addresses route through the Ch125 `gs_swizzle_psmct16_stub` (FBP=0, FBW=DBW, x=DSAX+cur_x, y=DSAY+cur_y) and `dest_base_q (= DBP*256)` is added back to anchor at the upload's DBP base. PSMCT32 is gated by its own `PSMCT32_SWIZZLE` parameter (Ch121). At Ch127 only, PSMT8/T4 uploads stayed linear; Ch133 (`PSMT8_SWIZZLE`) and Ch139 (`PSMT4_SWIZZLE`) later added their gates on this same `gif_image_xfer_stub`. Default 0 keeps every existing PSMCT16 image-xfer TB unchanged. **Three-phase verification**: (1) **origin transfer** — TRXDIR upload of a 16×4 PSMCT16 image at DBP=DSAX=DSAY=0, DBW=1, RRW=16, RRH=4 → 64 pixels, 8 IMAGE qwords (8 px/qword for PSMCT16).
After upload, the TB reads vram_stub's 2nd port at the SWIZZLED byte address (TB-side `ref_addr16/ref_block_idx16/ref_col_idx16` carry the verbatim PCSX2 tables locked at Ch125) and asserts each halfword landed where the swizzle says (selecting the right 16-bit slot inside the 32-bit word via `addr[1]`). Strict linear-vs-swizzled separators at bytes 128 (linear y=1) and 256 (linear y=2) stay 0 — swizzled writes for the 16×4 image fill only block (0,0) bytes [0..126]. (2) **scanout agreement** — enable the Ch126 swizzled-pcrtc path on the same VRAM contents, capture the full 16×4 frame, assert each scanned pixel matches the uploaded RGB5A1 → RGB888 5→8 bit-replicate. Both upload and scanout instantiate the same `gs_swizzle_psmct16_stub`. (3) **non-origin transfer** — re-arm with DBP=8, DSAX=12, DSAY=6, RRW=8, RRH=4. Effective coords (12..19, 6..9) cross block_x=0→1 at effective_x=16 AND block_y=0→1 at effective_y=8, exercising both block-table dimensions inside a single non-origin upload. Pins three contracts the origin transfer can't distinguish from a buggy implementation: (a) `dest_base_q (= DBP*256)` is added on top of the swizzle output (DBP=0 in P1 would mask a missing-add); (b) the swizzle is fed the FULL effective coords (DSAX=DSAY=0 in P1 would mask a "feeds only cur_x/cur_y" regression); (c) BOTH block_x and block_y propagate through `blockTable16[by][bx]` (block_x=0 throughout P1 would mask a tied-block_x regression). Strict P3 separator at byte 3096 (linear formula's effective (12, 8) byte) stays 0 — outside the P3 swizzled write set [2048..3071]. NOTE (now historical): PSMCT16 raster swizzle was deferred when Ch127 landed; it shipped at Ch128 (mirrors Ch122 for PSMCT32) so the PSMCT16 raster path is now byte-consistent with the image-xfer path documented here. - `tb_gs_raster_swizzle_psmt4.sv` (Ch140) — focused contract for the new `PSMT4_SWIZZLE` parameter on `gs_stub` (the raster emit surface). 
Mirrors Ch122/Ch128/Ch134 wiring shape but for the fourth (and last) PSM, and threads the Ch137 swizzle module's `nibble_hi` output into the existing Ch106 PSMT4 raster nibble RMW data lane (replacing `s2_pixel_index[0]` as the high/low nibble selector when the gate is on). When the parameter is 1 AND the active raster PSM is PSMT4 (`ras_psm == 6'h14`), the per-pixel raster emit address is routed through the Ch137 `gs_swizzle_psmt4_stub` (FBP=ras_fbp, FBW=ras_fbw, x=s2_x_q[11:0], y=s2_y_q[11:0]) — its `addr` output is the absolute byte address, AND its `nibble_hi` output keys `s2_emit_color64`'s nibble placement and `s2_emit_mask`'s high/low gating (write_be stays 4'b0001 for both paths). PSMCT32/PSMCT16/PSMT8 are gated by their own parameters; default 0 keeps every existing PSMT4 raster TB (Ch106 raster_psmt4, Ch107 PSMT4-e2e, Ch103 PSMT4+CLUT, Ch104 round-trip, etc.) on the original linear addressing. No new ports. Default-off smoke verification: ran Ch106 + Ch107 + Ch103 + Ch104 PSMT4 TBs before writing the new TB; all PASSed unchanged. **Three-phase verification** (mirrors Ch134 PSMT8 raster shape, with PSMT4 nibble adaptations + CLUT-disabled grayscale at scanout): (1) **origin SPRITE** at FBP=0/FBW=2 (FBW must be even per PCSX2 GSLocalMemory.h:560 — same as PSMT8). Drive a 16×4 PSMT4 SPRITE with RGBAQ.R=0xAA (PSMT4 raster channel takes R[3:0] as the nibble per Ch106 → nibble = 0xA). Per-pixel nibble readback via vram_stub's 2nd port (with `addr[1:0]`-keyed byte selection then `nibble_hi`-keyed nibble selection inside the byte) confirms each pixel landed at the correct (byte, nibble) slot. The image lives in the upper-left of block (0,0) of page (0,0); within-block columnTable4 entries for yb=0..3, xb=0..15 cover nibble_idx values [0..127] → byte_in_block ∈ [0..63]. Strict separator: byte 64 (linear y=1 row start at PSMT4 FBW=2 stride 64) stays 0.
(2) **scanout agreement** — enable Ch138 swizzled-pcrtc on the same VRAM, capture full 16×4 frame, assert each pixel reads back as PSMT4 grayscale R=G=B={0xA, 0xA} = 0xAA. Both gs_stub and gs_pcrtc_stub instantiate the same `gs_swizzle_psmt4_stub` AND thread its `nibble_hi` output through their respective nibble selectors — agreement at this layer means both integrations land at the same byte+nibble positions for PSMT4. (3) **non-origin SPRITE** at FBP=4/FBW=4 (bw_pg=2) drawing 8×4 SPRITE at (124, 4)..(131, 7) with R=0x55 (nibble = 0x5). Crosses PSMT4 PAGE-x at x=128 (page (0,0) for x∈[124..127], page (1,0) for x∈[128..131]). 2 blocks visited: blockTable4[0][3]=10 → page (0,0) block_base 10752; blockTable4[0][0]=0 → page (1,0) block_base 16384. Pins three contracts the origin SPRITE can't: ras_fbp reaches the swizzle's fbp input; ras_fbw reaches fbw; the swizzle gets the FULL absolute pixel coords s2_x_q/s2_y_q. Strict P3 separator at byte 8766 (linear (124, 4) at FBP=4/FBW=4) stays 0 — outside the P3 swizzled write set [10752..11007] + [16384..16639]. Total emit count: 64 + 32 = 96. **First-attempt PASS** errors=0. With Ch138 (read-side), Ch139 (TRXDIR upload), and Ch140 (raster emit) all live, the three major PSMT4 paths can be byte-consistent under the canonical swizzle when their gates are flipped on — completing the **four-PSM × three-path byte-accuracy foundation** (CT32 Ch120/Ch121/Ch122 + CT16 Ch126/Ch127/Ch128 + T8 Ch132/Ch133/Ch134 + T4 Ch138/Ch139/Ch140). End-to-end PSMT4 swizzled demos (mirroring Ch123/Ch124, Ch129/Ch130, Ch135/Ch136) followed at Ch141 + Ch142. - `tb_gs_raster_swizzle_psmt8.sv` (Ch134) — focused contract for the new `PSMT8_SWIZZLE` parameter on `gs_stub` (the raster emit surface).
Mirrors Ch122's PSMCT32 + Ch128's PSMCT16 wiring shape but for the third PSM: when the parameter is 1 AND the active raster PSM is PSMT8 (`ras_psm == 6'h13`), the per-pixel raster emit address is routed through the Ch131 `gs_swizzle_psmt8_stub` (FBP=ras_fbp, FBW=ras_fbw, x=s2_x_q[11:0], y=s2_y_q[11:0]) — its output is the absolute byte address. PSMCT32/PSMCT16 are gated by their own parameters; at Ch134 only, PSMT4 raster stayed linear (Ch140 later closed that gate). Default 0 keeps every existing PSMT8 raster TB (Ch105 raster_psmt8, Ch107 PSMT4-via-CT16-CLUT palette path, etc.) on the original linear addressing. No new ports — parameter-only API change. Default-off smoke verification: ran Ch105 `tb_gs_raster_psmt8` before writing the new TB; PASSed unchanged. **Three-phase verification** (mirrors Ch128 PSMCT16 raster shape): (1) **origin SPRITE** at FBP=0/FBW=2 (FBW must be even — PCSX2 asserts `(bw & 1) == 0` for PSMT8). Drive a 16×8 PSMT8 SPRITE with RGBAQ.R=0xA5 (PSMT8 raster channel uses R as the byte index per Ch105). Per-pixel byte readback via vram_stub's 2nd port confirms each lands at the swizzled byte. The 16×8 image lives in the upper half of block (0,0) of page (0,0); the within-block columnTable8 distributes the 128 bytes across yb rows 0..7 — byte values 0..127 within the block. **Strict separators**: bytes 128 (linear y=1 row start at PSMT8 stride=128) and 256 (linear y=2) stay 0 — proves the gate is live, since a fall-through to the legacy linear path would put the SPRITE byte there. (2) **scanout agreement** — enable the Ch132 swizzled-pcrtc path on the same VRAM, capture the full 16×8 frame, assert each pixel's PCRTC PSMT8 grayscale R=G=B matches `idx=0xA5`. Both gs_stub and gs_pcrtc_stub instantiate the same `gs_swizzle_psmt8_stub`, so success proves byte-level agreement. (3) **non-origin SPRITE** at FBP=4/FBW=4 (bw_pg=2) drawing 8×4 SPRITE at (124, 4)..(131, 7) with RGBAQ.R=0x5A.
Crosses PSMT8 PAGE-x at x=128 (x∈[124..127] is in page (0,0) block (0,7) by swizzle table; x∈[128..131] is in page (1,0) block (0,0)) so page_index changes mid-row. Pins three contracts the origin SPRITE can't: `ras_fbp` reaches the swizzle's fbp input (FBP=0 in P1 would mask a tied-zero); `ras_fbw` reaches fbw (FBW=2 in P1 would mask a tied-two); the swizzle gets the FULL absolute pixel coords s2_x_q/s2_y_q rather than bbox-local (P1 sprite started at (0,0) so absolute=local). PSMT8's page-x boundary at x=128 is different from CT32/CT16's x=64, so this exercises the PSMT8-specific x[7] wiring of the swizzle. Strict P3 separator at byte 9340 (linear (124, 4) at FBP=4/FBW=4) stays 0 — outside the P3 swizzled write set (page (0,0) block (0,7) at base 13568, page (1,0) block (0,0) at base 16384). Total emit count: 128 + 32 = 160. **First-attempt PASS** errors=0. With Ch132 (read-side), Ch133 (TRXDIR upload), and Ch134 (raster emit) all live, the three major PSMT8 paths can be byte-consistent under the canonical swizzle when their gates are flipped on — completing the third-PSM byte-accuracy milestone for ALL three integration points (mirrors the Ch120/Ch121/Ch122 PSMCT32 trio + the Ch126/Ch127/Ch128 PSMCT16 trio). - `tb_gs_image_xfer_swizzle_psmt4.sv` (Ch139) — focused contract for the new `PSMT4_SWIZZLE` parameter on `gif_image_xfer_stub`. Mirrors Ch121/Ch127/Ch133 wiring shape but for the fourth (and last) PSM, and threads the Ch137 swizzle module's `nibble_hi` output into the existing Ch118 nibble RMW data lane (replacing `x_eff[0]` as the high/low nibble selector when the gate is on).
When the parameter is 1 AND the active DPSM is PSMT4, the per-pixel byte address is `dest_base_q (= DBP*256) + swizzle_psmt4(FBP=0, FBW=DBW, x=DSAX+cur_x, y=DSAY+cur_y).addr`, AND `cur_mask_c` is `0x0000_00F0` when `swizzle4_nibble_hi=1` (high nibble) or `0x0000_000F` when 0 (low nibble) — the per-bit write_mask machinery (vram_stub merges only the targeted nibble) layers on top of the swizzled address. PSMCT32/PSMCT16/PSMT8 are gated by their own parameters. Default 0 keeps the legacy linear path for every existing PSMT4 image-xfer TB (Ch118 etc.). No new ports — parameter-only API change. Default-off smoke verification: ran Ch118 `tb_gs_image_xfer_psmt4` before writing the new TB; PASSed unchanged. **Three-phase verification** (mirrors Ch127/Ch133 audit-closed shape): (1) **origin write-side lock** at DBP=0/DBW=2/DSAX=DSAY=0 (DBW must be even per PCSX2 GSLocalMemory.h:560 — same FBW-evenness as PSMT8). 16×4 PSMT4 image upload via 2 IMAGE qwords (32 px/qword for PSMT4, i.e. 2 rows of 16 px per qword at RRW=16, so 2 qwords cover the 4 rows). After upload, per-pixel nibble readback at the swizzled `(addr, nibble_hi)` slot asserts each nibble landed where the swizzle says. Strict separator: PSMT4 row stride at DBW=2 = DBW*32 = 64 bytes, so linear y=1 starts at byte 64. Swizzled write set lives in [0..63] within block (0,0). Byte 64 stays 0 (verified via per-byte check, not full-word — the `check_byte_zero` task initially had a full-word bug that misreported neighbor-byte writes; fixed to check only the targeted byte via `addr[1:0]`-keyed case statement). (2) **end-to-end agreement**: enable Ch138 PSMT4 swizzled scanout on the same VRAM (PSMT4_SWIZZLE=1 on pcrtc, CLUT disabled), capture the 16×4 frame, verify each pixel's grayscale R=G=B={nibble, nibble} matches `nibble_at(xx, yy)`. Both modules instantiate the same `gs_swizzle_psmt4_stub` so success proves byte+nibble-level agreement under TRXDIR-style emit + scanout-style read. (3) **non-origin transfer** at DBP=8/DBW=2/DSAX=28/DSAY=12/RRW=8/RRH=8.
Effective coords (28..35, 12..19) cross block_x=0→1 at effective_x=32 AND block_y=0→1 at effective_y=16 (PSMT4 block geometry: 32×16 px). All 4 corner blocks of page (0,0) at DBP=8 visited: blockTable4[0][0]=0, [0][1]=2, [1][0]=1, [1][1]=3 (block bases 2048/2560/2304/2816). Pins three contracts the origin transfer can't: dest_base_q ADDED ON TOP of the swizzle output (DBP=0 in P1 would mask a missing-add regression — fixed during bring-up after the TB initially passed P3_DBP directly to ref_pos_psmt4 instead of using fbp_v=0 + adding DBP*256); FULL effective coords; BOTH block_x and block_y propagate through `blockTable4[by][bx]`. Phase 3 strict separator: linear formula puts effective coord (28, 12) at byte 2830 — under linear, the neighboring pixel (29, 12) writes high nibble = 1 to that byte. Under swizzled, no Phase-3 pixel hits byte 2830 (cross-checked: col_idx_psmt4 for the 4-block × 16-pixel coord set never produces nibble_idx 28 or 29). Byte 2830 stays 0 → fall-through to linear would have stomped it with 0x10. **PASS** errors=0 after two bug-fix iterations: (a) ref_pos_psmt4(P3_DBP, ...) was wrong — engine feeds FBP=0 to the swizzle and adds DBP*256 separately, so TB must do the same; (b) check_byte_zero tested the full word instead of the targeted byte, producing false failures when a neighbor byte in the same word was independently touched. Counts: arms=2, writes=128 (P1 64 + P3 64). With Ch138 (read-side scanout) + Ch139 (image-xfer write-side) + Ch140 (raster write-side) all live, the Ch137 PSMT4 primitive now has all 3 integration points wired, and Ch141 closes the e2e demo. - `tb_gs_image_xfer_swizzle_psmt8.sv` (Ch133) — focused contract for the new `PSMT8_SWIZZLE` parameter on `gif_image_xfer_stub`. Mirrors Ch121's PSMCT32 + Ch127's PSMCT16 wiring shape but for the third PSM: when the parameter is 1 AND the active DPSM is PSMT8, the per-pixel byte address is `dest_base_q (= DBP*256) + swizzle_psmt8(FBP=0, FBW=DBW, x=DSAX+cur_x, y=DSAY+cur_y)`.
PSMCT32/PSMCT16 are gated by their own parameters; at Ch133 only, PSMT4 stayed linear (its swizzle math landed at Ch137 and its image-xfer gate at Ch139). Default 0 keeps the legacy linear path for every existing PSMT8 image-xfer TB (Ch117 etc.). No new ports — parameter-only API change. Default-off smoke verification: ran Ch117 `tb_gs_image_xfer_psmt8` before writing the new TB; PASSed unchanged. **Three-phase verification** (mirrors Ch127 audit-closed shape): (1) **origin write-side lock** at DBP=0/DBW=2 (DBW must be even per PCSX2 GSLocalMemory.h:553 — PSMT8 pages are 128 px wide vs FBW's 64-px units, so 2 FBW units per page → bw_pg=1 here). 16×8 PSMT8 image upload via 8 IMAGE qwords (16 px/qword). Per-pixel index `idx_at(x, y) = (y[2:0] << 4) | x[3:0]` ∈ [0x00..0x7F]. After upload, byte-readback at the swizzled address asserts each byte landed where the swizzle says. Strict separators: linear y=1 (byte 128) and y=2 (byte 256) row starts stay 0 — swizzled write set lives entirely in [0..127]. (2) **end-to-end agreement**: enable Ch132 swizzled scanout on the same VRAM, capture the frame, verify each visible pixel's PCRTC PSMT8 grayscale R=G=B matches `idx_at(x, y)`. Both modules instantiate the same `gs_swizzle_psmt8_stub` so success proves byte-level agreement under TRXDIR-style emit + scanout-style read. (3) **non-origin transfer** at DBP=8/DBW=2/DSAX=12/DSAY=10/RRW=8/RRH=8. Effective coords (12..19, 10..17) cross block_x=0→1 at effective_x=16 AND block_y=0→1 at effective_y=16, so all 4 corner blocks of page (0,0) at DBP=8 (blockTable8[0][0]=0, [0][1]=1, [1][0]=2, [1][1]=3 → block bases 2048/2304/2560/2816) are visited. Pins three contracts the origin transfer can't: `dest_base_q = DBP*256` ADDED ON TOP; the swizzle is fed FULL effective coords (DSAX/DSAY non-zero); BOTH block_x and block_y propagate through `blockTable8[by][bx]`.
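
The Phase 3 block coverage and block bases quoted above can be cross-checked with a small model. This sketch uses only the four `blockTable8` entries cited in this entry (not the full PCSX2 table), the 16×16 PSMT8 block size, a 256-byte block, and the `DBP*256` destination base; the function names are hypothetical and the within-block column table is deliberately left out.

```python
BLOCK_BYTES = 256
# Only the four corner entries quoted in the text, keyed as (by, bx).
BLOCK_TABLE8_CORNER = {(0, 0): 0, (0, 1): 1, (1, 0): 2, (1, 1): 3}


def blocks_visited(dsax, dsay, rrw, rrh, block_w=16, block_h=16):
    """Set of (block_y, block_x) touched by the effective coordinates."""
    return sorted({(y // block_h, x // block_w)
                   for y in range(dsay, dsay + rrh)
                   for x in range(dsax, dsax + rrw)})


def block_base(dbp, by, bx):
    """Byte base of a block: destination base plus block number * 256."""
    return dbp * 256 + BLOCK_TABLE8_CORNER[(by, bx)] * BLOCK_BYTES


def linear_byte_psmt8(dbp, dbw, x, y):
    """Legacy linear address: one byte per pixel, DBW*64 bytes per row."""
    return dbp * 256 + y * (dbw * 64) + x


if __name__ == "__main__":
    visited = blocks_visited(dsax=12, dsay=10, rrw=8, rrh=8)
    print(visited, [block_base(8, by, bx) for by, bx in visited])
    print(linear_byte_psmt8(dbp=8, dbw=2, x=12, y=10))
```

With DBP=8 the four bases come out as 2048, 2304, 2560, and 2816, so the swizzled write set stays inside [2048..3071], while the linear formula places (12, 10) at a byte beyond that range.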
Phase 3 distinct-pixel pattern uses `p3_idx = 0x80 | idx` ∈ [0x80..0xFF] (disjoint from Phase 1's [0x00..0x7F]) so a P3 pixel landing at a P1 byte (or vice versa) surfaces as wrong RGB. Phase 3 strict separator: linear formula puts effective coord (12, 10) at byte `2048 + 10*128 + 12 = 3340` (outside swizzled set [2048..3071]); byte 3340 stays 0 — proves a fall-through to linear would have stomped that byte. **First-attempt PASS**: arms=2, writes=192 (=128+64), errors=0. NOTE: at Ch133 only, PSMT8 raster-side emits via `gs_stub` still used linear addressing — Ch133 was image-xfer write-side only. Ch134 later closed the raster-side gate via `PSMT8_SWIZZLE` on `gs_stub` (mirrors Ch122 for PSMCT32 and Ch128 for PSMCT16) — see Ch134 row above. - `tb_gs_scanout_swizzle_psmt4.sv` (Ch138) — focused contract for the new `PSMT4_SWIZZLE` parameter on `gs_pcrtc_stub`. Mirrors Ch120/Ch126/Ch132's read-side-first wiring shape but adds the PSMT4-specific twist: the swizzle module outputs both an absolute byte address AND a `nibble_hi` selector (PSMT4 = 4 bits/pixel = 2 pixels per byte, and the canonical PCSX2 column table reorders nibbles within a block, so `pixel_index[0]` is no longer the right selector under the swizzled layout). When the parameter is 1 AND the active PSM is PSMT4, scanout reads go through the Ch137 `gs_swizzle_psmt4_stub` and the PSMT4 nibble extractor uses `swizzle4_nibble_hi` instead of `pixel_index[0]`. PSMCT32/PSMCT16/PSMT8 are gated by their own parameters; default 0 keeps every existing PSMT4 scanout TB (Ch103 PSMT4+CLUT, Ch104 PSMT4 round-trip, Ch107 PSMT4 e2e, etc.) on the legacy linear path. No new ports — parameter-only API change. Default-off smoke verification: ran Ch103 `tb_gs_scanout_psmt4_clut` + Ch104 `tb_gs_psmt4_round_trip` before writing the new TB; both PASSed unchanged.
**Two-phase verification** (mirrors Ch132 closure shape; CLUT disabled so PCRTC's PSMT4 grayscale fallback gives `r=g=b={nibble, nibble}` at scanout): (1) **origin** at FBP=0/FBW=2/DBX=DBY=0 (FBW must be even per PCSX2 GSLocalMemory.h:560 because PSMT4 pages are 128 px wide, same as PSMT8). 16×4 region preloaded at swizzled bytes via a TB-side `byte_shadow` accumulator that lays each pixel's nibble at its `(addr, nibble_hi)` slot; bytes are then flushed to vram_stub via per-byte BE writes. Per-pixel nibble pattern `nibble_at(x, y) = ((y << 1) ^ x) & 4'h7` ∈ [0..7] gives unique gray values across the 16×4 frame. The image lives entirely in block (0,0) of page (0,0) and exercises within-block columnTable4 entries for yb=0..3, xb=0..15. Strict separator: byte 64 (linear y=1 row start at FBW=2 stride) pre-colored with sentinel 0xCC (gray=0xCC, unproducible by Phase 1's [0..7]-nibble pattern) — fall-through to linear would surface as RGB(0xCC, 0xCC, 0xCC). (2) **non-origin** at FBP=4/FBW=4 (bw_pg=2), DBX=120, DBY=126. Effective coords range x∈[120..135], y∈[126..129]. page_x crosses 0→1 at effective_x=128, page_y crosses 0→1 at effective_y=128 (PSMT4's 128-tall page boundary — different from PSMT8's 64-tall). All 4 corner pages of FBP=4/FBW=4 visited, each with a distinct blockTable4 lookup (blockTable4[7][3]=31 → page (0,0) block_base 16128; blockTable4[7][0]=21 → page (1,0) block_base 21760; blockTable4[0][3]=10 → page (0,1) block_base 27136; blockTable4[0][0]=0 → page (1,1) block_base 32768). A regression that tied any of {dispfb_fbp, dbx, dby, FBW, block_x, block_y, page_index, bw_pg=FBW/2, swizzle nibble_hi} to zero would NOT survive Phase 2. Strict P2 separator: byte 24380 (linear formula's place for (120, 126); outside all 4 swizzled chunks) pre-colored with sentinel 0xDD → fall-through to linear would surface as RGB(0xDD, 0xDD, 0xDD), unproducible by the Phase-2 pattern. 
**PASS** errors=0 after one bug-fix iteration: Phase 2's flush-loop initially hardcoded the wrong byte ranges due to a `blockTable4[7][3]` lookup mistake (the value is 31, not 15) — replaced with a shadow-array sweep [256..65535] that flushes any non-zero byte, eliminating the hardcode/lookup mismatch class entirely. NOTE (now historical): Ch138 was read-side only when introduced; the PSMT4 write-side is now live as well — Ch139 (image-xfer) + Ch140 (raster) + Ch141 (raster-driven e2e demo). With Ch138, **all four common GS PSMs now have read- side byte-accuracy under their swizzle gates** (CT32 Ch120 + CT16 Ch126 + T8 Ch132 + T4 Ch138). - `tb_gs_scanout_swizzle_psmt8.sv` (Ch132) — focused contract for the new `PSMT8_SWIZZLE` parameter on `gs_pcrtc_stub`. Mirrors Ch120/Ch126's wiring shape but for PSMT8: when the parameter is 1 AND the active PSM is PSMT8, scanout reads go through the Ch131 `gs_swizzle_psmt8_stub` (real PS2 GS page/block/column layout — 128×64 pixel pages, 4×8 block grid, 16×16 within-block bytes, `bw_pg = FBW>>1`) instead of the legacy linear `FBW*64*y + x` formula. PSMCT32/PSMCT16 are gated by their own parameters; PSMT4 stays linear (its swizzle math is future). Default PSMT8_SWIZZLE=0 keeps every existing PSMT8 scanout TB (Ch96 storage-only, Ch97 PSMT8+CLUT, Ch103 PSMT4-via-CT16-CLUT, Ch107 PSMT4-e2e palette path) on the original linear addressing. No new ports — parameter-only API change. Default-off smoke verification: ran Ch96 `tb_gs_scanout_psmt8` before writing the new TB; PASSed unchanged, confirming the new instance + 4-way mux extension don't disturb the linear path. **Two-phase verification** (mirrors Ch126 PSMCT16 closure shape): (1) **origin** (FBP=0, FBW=2, DBX=DBY=0; FBW must be even — PCSX2 asserts `(bw & 1) == 0` for PSMT8 because pages are 128 px wide vs FBW's 64-px units, so 2 FBW units per page → bw_pg=1 here). 
16×8 region preloaded at swizzled bytes; per-pixel index `idx = (y[2:0] << 4) | x[3:0]` ∈ [0x00..0x7F] surfaces as grayscale R=G=B=idx via PCRTC's PSMT8 fallback (Ch96). x∈[0..15] is entirely block_x_in_page=0, so the within-block test exercises ALL 16 xb positions of `columnTable8` across yb rows 0..7. Strict separators: linear y=1 starts at byte 128 (FBW=2 stride) but swizzled lands at byte 8 (`columnTable8[1][0]=8`, no `*2` scale since PSMT8 is 1 byte/pixel); linear x=8,y=0 is byte 8 but swizzled is byte 2. (2) **non-origin** (FBP=4, FBW=4 → bw_pg=2, DBX=120, DBY=60). Effective coords range x∈[120..135], y∈[60..67] — page_x crosses 0→1 at effective_x=128 (proves x[7] reaches the page-x lane of the PSMT8 swizzle — different boundary from CT16/CT32's x[6]); page_y crosses 0→1 at effective_y=64; block_x and block_y both flip; ALL 4 pages (0,0)/(1,0)/(0,1)/(1,1) are visited, each with a distinct blockTable8 lookup ([3][7]=31, [3][0]=10, [0][7]=21, [0][0]=0). A regression that tied any of {dispfb_fbp, dbx, dby, FBW, block_x, block_y, page_index, bw_pg=FBW/2} to zero would NOT survive Phase 2. **Sentinel separator**: byte 24500 (inside linear range 23672..25479 for the Phase-2 effective region, outside ALL 4 swizzled write-set blocks) pre-colored with 0xFF → fall-through to linear would surface as RGB(0xFF, 0xFF, 0xFF), which is unproducible by the Phase-2 unique pattern (idx ∈ [0x00..0x7F]). **First-attempt PASS** errors=0 — no audit iteration needed because Phase 2's coord choices were designed up front to make all 7 chain-layer wires load-bearing AND the page-x crossing boundary is at PSMT8's specific x=128 (not the 64-px boundary the direct-color PSMs use). NOTE (now historical): Ch132 was read-side only when introduced; Ch133 then Ch134 closed the image-xfer + raster write sides for PSMT8, so all three PSMT8 swizzle integration points are now live (mirrors Ch120/Ch121/Ch122 for PSMCT32 and Ch126/Ch127/Ch128 for PSMCT16). 
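The Ch132 Phase-2 numbers above can be re-derived with a little arithmetic. The following is a minimal Python sketch, not the project's RTL or TB code. The helper names (`linear_psmt8`, `psmt8_block_base`) are hypothetical, and `BLOCK_TABLE8` is an assumed transcription of the PCSX2 block table that agrees with the four entries quoted above. It models page/block placement only and leaves out the within-block `columnTable8` step.

```python
# Hypothetical reference model for the Ch132 Phase-2 address arithmetic.
# BLOCK_TABLE8 is an ASSUMED transcription of PCSX2's blockTable8, indexed [by][bx];
# it matches the entries quoted in the text: [3][7]=31, [3][0]=10, [0][7]=21, [0][0]=0.
BLOCK_TABLE8 = [
    [0, 1, 4, 5, 16, 17, 20, 21],
    [2, 3, 6, 7, 18, 19, 22, 23],
    [8, 9, 12, 13, 24, 25, 28, 29],
    [10, 11, 14, 15, 26, 27, 30, 31],
]

PAGE_W, PAGE_H = 128, 64          # PSMT8 page size in pixels
BLOCK_W, BLOCK_H = 16, 16         # PSMT8 block size in pixels
PAGE_BYTES, BLOCK_BYTES = 8192, 256


def linear_psmt8(fbp, fbw, x, y):
    """Legacy linear PSMT8 byte address (1 byte per pixel)."""
    return fbp * 2048 + y * fbw * 64 + x


def psmt8_block_base(fbp, fbw, x, y):
    """Byte address of the swizzled block that holds pixel (x, y)."""
    bw_pg = fbw >> 1                              # pages per row; FBW must be even
    page = (y // PAGE_H) * bw_pg + (x // PAGE_W)
    bx = (x % PAGE_W) // BLOCK_W
    by = (y % PAGE_H) // BLOCK_H
    return fbp * 2048 + page * PAGE_BYTES + BLOCK_TABLE8[by][bx] * BLOCK_BYTES


# Phase-2 configuration quoted in the text.
FBP, FBW, DBX, DBY = 4, 4, 120, 60
region = [(x, y) for y in range(DBY, DBY + 8) for x in range(DBX, DBX + 16)]

block_bases = sorted({psmt8_block_base(FBP, FBW, x, y) for x, y in region})
linear_lo = min(linear_psmt8(FBP, FBW, x, y) for x, y in region)
linear_hi = max(linear_psmt8(FBP, FBW, x, y) for x, y in region)

SENTINEL = 24500
sentinel_in_linear = linear_lo <= SENTINEL <= linear_hi
sentinel_in_swizzled = any(b <= SENTINEL < b + BLOCK_BYTES for b in block_bases)

print(block_bases, linear_lo, linear_hi, sentinel_in_linear, sentinel_in_swizzled)
```

Under these assumptions the 16×8 region touches exactly four blocks, and the sentinel byte sits inside the linear span but outside every swizzled block, which is the property the separator relies on.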
- `tb_gs_scanout_swizzle_psmct16.sv` (Ch126) — focused contract for the new `PSMCT16_SWIZZLE` parameter on `gs_pcrtc_stub`. Mirrors Ch120's wiring shape but for PSMCT16: when the parameter is 1 AND the active PSM is PSMCT16, scanout reads go through the Ch125 `gs_swizzle_psmct16_stub` (real PS2 GS page/block/column layout) instead of the legacy linear `(FBW*64*y + x)*2` formula. PSMCT32 is gated by its own `PSMCT32_SWIZZLE` parameter (Ch120); PSMT8/PSMT4 stay linear. Default 0 keeps every existing PSMCT16 scanout TB (Ch94/Ch95/Ch103/etc.) on the original linear addressing. Topology: TB drives `vram_stub.write_*` directly with each pixel's RGB5A1 halfword preloaded at the swizzled byte address (TB-side `ref_addr16()` mirrors the swizzle math + the Ch125 source-table-locked tables); pcrtc with `PSMCT16_SWIZZLE=1` scans out the 16×8 frame and the TB asserts each captured pixel matches the preloaded color after 5→8 bit-replicate. Per-pixel pattern is unique per (x, y): R5=`(x^y)&0xF`, G5=`x&0xF`, B5=`y&0xF`, expanded to 8 bits via PCRTC's bit-replicate. The PSMCT16 swizzle vs. linear distinction shows up at any y>0 (linear y=1 → byte 128 with FBW=1, since one 64-pixel row is 128 bytes, but swizzled within block (0,0) yb=1 → columnTable16[1][0]=4 → byte 8) and at x=8, y=0 (linear byte 16 vs swizzled byte 2) so even within the first row + first block, the gate is a strict separator. NOTE (now historical): Ch126 was read-side only when introduced; Ch127 (image-xfer) then Ch128 (raster) closed the PSMCT16 write sides, mirroring Ch121/Ch122 for PSMCT32.
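The per-pixel pattern and the 5→8 expansion described for Ch126 can be sketched as follows. This is a minimal Python model under stated assumptions, not the TB's actual code. The function names are hypothetical, the RGB5A1 bit layout (R in bits 4:0, G in 9:5, B in 14:10, A in bit 15) is assumed, and the replicate is assumed to be the usual `(v << 3) | (v >> 2)` form.

```python
# Hypothetical model of the Ch126 pixel pattern and PCRTC 5->8 bit-replicate.

def pack_rgb5a1(r5, g5, b5, a1=0):
    """Pack 5-bit channels into one halfword (ASSUMED layout: R low, then G, B, A)."""
    return (r5 & 0x1F) | ((g5 & 0x1F) << 5) | ((b5 & 0x1F) << 10) | ((a1 & 1) << 15)


def expand5to8(v5):
    """ASSUMED bit-replicate: top 3 bits of the 5-bit value refill the low 3 bits."""
    return ((v5 << 3) | (v5 >> 2)) & 0xFF


def pattern(x, y):
    """Per-pixel halfword from the text: R5=(x^y)&0xF, G5=x&0xF, B5=y&0xF."""
    return pack_rgb5a1((x ^ y) & 0xF, x & 0xF, y & 0xF)


def expected_rgb(x, y):
    """8-bit RGB the scanout side should report for pattern(x, y)."""
    hw = pattern(x, y)
    return tuple(expand5to8((hw >> shift) & 0x1F) for shift in (0, 5, 10))


def linear_byte_psmct16(fbw, x, y):
    """Legacy linear byte offset: FBW*64 pixels per row, 2 bytes per pixel."""
    return (fbw * 64 * y + x) * 2


# The 16x8 frame: every (x, y) gets a distinct halfword because G carries x and B carries y.
frame = {(x, y): pattern(x, y) for y in range(8) for x in range(16)}
unique_values = len(set(frame.values()))

print(unique_values, hex(pattern(8, 1)), expected_rgb(8, 1))
print(linear_byte_psmct16(1, 0, 1), linear_byte_psmct16(1, 8, 0))
```

The last line reproduces the two linear offsets quoted above (byte 128 for y=1 and byte 16 for x=8 at FBW=1), which are the positions the swizzled layout is expected to disagree with.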
- `tb_gs_swizzle_psmt4.sv` (Ch137) — focused contract for the new `gs_swizzle_psmt4_stub` math primitive: a pure-comb module mapping `(FBP, FBW, x, y)` to a VRAM **byte address + nibble_hi selector** using the real PS2 GS PSMT4 layout (8 KiB pages organized as 128×128 PSMT4 pixels — 4× as many pixels per page as PSMT8 since each PSMT4 pixel is a NIBBLE; 32 blocks/page in an 8-rows × 4-cols grid (same orientation as PSMCT16's blockTable16); each block 32×16 pixels = 512 nibbles = 256 bytes; **512-entry within-block column table** — 2× the entries of PSMT8's 256-entry table due to the doubled block area, indexed [yb][xb] with yb=0..15 + xb=0..31 → nibble 0..511). PSMT4 is the most complex of the four common GS PSMs because each pixel is HALF a byte, so the swizzle outputs both a byte address and a `nibble_hi` selector (=0 for low nibble of the byte at `addr`, =1 for high). PSMT4 reuses PSMT8's page-stride convention (`bw_pg = FBW >> 1`; PCSX2 asserts FBW must be even at GSLocalMemory.h:560 because PSMT4 pages are 128 px wide). Source-table provenance pinned: `_blockTable4` taken verbatim from pcsx2/GS/GSTables.cpp lines 61–69; `columnTable4` from same file lines 147–213. Master HEAD commit `3000e113e2b3a76357c08dfa80d3c747f40e2706`; file blob SHA `3581209b8217378f473f9de22a9dbc8c45ca49b6` (same blob Ch131 pinned). Cross-checked against GSLocalMemory.h:558 `BlockNumber4` + the `pxOffset` template at GSTables.cpp:247–258 (blockSize=512 in NIBBLE units, pageSize=16384 nibble units = 8192 bytes, pageWidth=128). The existing per-bit write_mask 0x0F/0xF0 nibble RMW from Ch106/Ch118 will still apply on top of the swizzled byte address — the swizzle module doesn't touch the nibble merge logic; it just produces (addr, nibble_hi). 
**Five-phase verification** (mirrors Ch125/Ch131 shape, scaled up): (1) **spot-checks** at 15 hand-computed corners (origin, intra-block xb=1/8/16/yb=1/yb=2-with-hi-nibble, last nibble of block (0,0), first/second/third/fourth horizontal blocks, second-row-of-blocks origin, page-x at x=128 + page-y at y=128, FBP=4 origin, page0-last-pixel (127,127) → addr 8191 hi=1). (2a) **INDEPENDENT column-table source lock** — 32 hard-coded `check_nibble()` calls for yb=0 (literal-by-literal verbatim from PCSX2 columnTable4 row 0) PLUS a programmatic walk for yb=1..15 against the in-TB ref function (480 more checks); Phase 2a's literal yb=0 row + Phase 5's bijectivity sweep + Phase 3's literal block-table lock together pin the table. (3) **INDEPENDENT block-table source lock** — 32 hard-coded checks (one per block in page 0) with expected block index taken VERBATIM from PCSX2 blockTable4. (4) block-swizzle walk via in-TB ref_block_idx4. (5) **bijectivity sweep over the 128×128 page** — 16384 NIBBLE slots (vs PSMT8's 8192 byte slots), every pixel must hit a unique (byte_addr, nibble_hi) pair and agree with both the in-TB ref byte address AND ref nibble_hi. Plus multi-page sanity at FBW=4/bw_pg=2 (page-x crossing at x=192 → byte 10496 with blockTable4[1][2]=9, and page-y crossing at y=128 → byte 16384) and non-page-aligned FBP coverage at FBP ∈ {1,2,3}, including FBP=3+FBW=4+page-(1,1) intra-block at (129, 129) → byte 30732 (= 6144 + 3*8192 + 0*256 + ref_col_idx4(1,1)/2 = 30720 + 12). **First-attempt PASS** errors=0. NOTE: This module is NOT YET wired into `gs_pcrtc_stub` / `gif_image_xfer_stub` / `gs_stub` — those still use linear PSMT4 addressing as of Ch137. The math is locked here so follow-on chapters can wire `PSMT4_SWIZZLE` parameter gates into the existing address paths without disturbing the legacy linear-PSMT4 TBs (Ch103 / Ch106 / Ch107 / Ch118). 
With Ch119 PSMCT32 + Ch125 PSMCT16 + Ch131 PSMT8 + Ch137 PSMT4, **all four common GS PSMs now have byte-accurate- to-real-PS2 swizzle math available as standalone primitives** — the four-PSM swizzle math foundation is complete. Future chapters can wire PSMT4 into pcrtc/image-xfer/raster behind a PSMT4_SWIZZLE parameter (mirroring Ch120→Ch124 / Ch126→Ch130 / Ch132→Ch136), with the existing nibble RMW machinery layered on top. - `tb_gs_swizzle_psmt8.sv` (Ch131) — focused contract for the new `gs_swizzle_psmt8_stub` math primitive: a pure-comb module mapping `(FBP, FBW, x, y)` to a VRAM byte address using the real PS2 GS PSMT8 layout (8 KiB pages organized as 128×64 PSMT8 pixels — 2× wider than CT16's 64×64 page; 32 blocks/page in a 4-rows × 8-cols grid; each block 16×16 pixels = 256 bytes; **256-entry within- block column table** — 2× the entries of CT16's 128-entry table due to the doubled block area, indexed [yb][xb] with yb=0..15 + xb=0..15 → byte 0..255). PSMT8 also introduces a new page-stride constant `bw_pg = FBW >> 1` (PCSX2 asserts `(bw & 1) == 0` at GSLocalMemory.h:553 because PSMT8 pages are 128 px wide vs FBW's 64-px units → 2 FBW units per PSMT8 page, so FBW must be even). Source-table provenance pinned: `blockTable8` taken verbatim from pcsx2/GS/GSTables.cpp lines 53–59; `columnTable8` from same file lines 111–145. Master HEAD commit `3000e113e2b3a76357c08dfa80d3c747f40e2706`; file blob SHA `3581209b8217378f473f9de22a9dbc8c45ca49b6`. Cross-checked against GSLocalMemory.h:551 `BlockNumber8` + the `pxOffset` template at GSTables.cpp:247–258 (blockSize=256, pageSize=8192, pageWidth=128). PCSX2's `bp` is in 256-byte block-pointer units; in our FBP=2048-byte units, `bp = FBP * 8` so `bp*256 = FBP*2048`. 
**Five-phase verification** (mirrors Ch125 PSMCT16 shape): (1) **spot-checks** at 15 hand-computed corners (origin, intra- block xb=1/4/8/yb=1, last byte of block (0,0), first/second block origins, second row of blocks, third+fourth blocks, page-x at x=128 and page-y at y=64, FBP=4 origin); (2a) **INDEPENDENT column-table source lock** — 256 hard-coded `check()` calls (one per (yb, xb) inside block (0,0)) where the expected byte index is taken VERBATIM from PCSX2 columnTable8 with `` arithmetic, NOT derived from the in-TB ref function. Catches any case where DUT and ref share the same miscopy (the same trap Ch125 added Phase 2a for with PSMCT16's column table); (2b) within-block 16×16 walk via the in-TB ref_col_idx8 (self-check); (3) **INDEPENDENT block-table source lock** — 32 hard-coded checks (one per block in page 0) with the expected block index taken VERBATIM from PCSX2 blockTable8, NOT derived from the in-TB ref; (4) block-swizzle walk via in-TB ref_block_idx8; (5) **bijectivity sweep over the 128×64 page** — 8192 byte slots (vs CT16's 4096 halfword slots), every pixel must hit a unique byte address in `[0, 8192)` and agree with the in-TB reference. Plus multi-page sanity at FBW=4/bw_pg=2 (page-x crossing at x=192 and page-y crossing at y=64) and non-page-aligned FBP coverage at FBP ∈ {1, 2, 3}, including FBP=3+FBW=4+page-(1,1) intra-block crossing at (129, 65). **First-attempt PASS** errors=0. NOTE: This module is NOT YET wired into `gs_pcrtc_stub` / `gif_image_xfer_stub` / `gs_stub` — those still use linear PSMT8 addressing as of Ch131. The math is locked here so follow-on chapters can wire `PSMT8_SWIZZLE` parameter gates into the existing address paths without disturbing the legacy linear-PSMT8 TBs (Ch96 / Ch97 / Ch103 / Ch105 / Ch107 / Ch117). 
With Ch119 PSMCT32 + Ch125 PSMCT16 + Ch131 PSMT8, three of the four common GS PSMs now have byte- accurate-to-real-PS2 swizzle math available as standalone primitives; PSMT4 (with its 32×16 nibble intra-block layout) is the natural Ch132 candidate. - `tb_gs_swizzle_psmct16.sv` (Ch125) — focused contract for the new `gs_swizzle_psmct16_stub` math primitive: a pure-comb module mapping `(FBP, FBW, x, y)` to a VRAM byte address using the real PS2 GS PSMCT16 layout (8 KiB pages organized as 64×64 PSMCT16 pixels; 32 blocks/page in a 4×8 grid; each block 16×8 pixels = 256 bytes; **non-trivial within-block column table** — unlike PSMCT32 where within-block IS row-major halfwords by accident, PSMCT16 has 4 internal 16×2-pixel sub-columns with a 128-entry permutation). Source-table provenance pinned: `blockTable16` taken verbatim from pcsx2/GS/GSTables.cpp lines 29–39 (master HEAD commit 3d71e310; file-touch commit d983b2b0, 2026-01-12); `columnTable16` from same file lines 91–109. Cross-check against the older Debian-packaged GSdx `PixelAddressOrg16(x, y, bp, bw) = (BlockNumber16(...) << 7) + columnTable16[y & 7][x & 15]` confirms the address chain (`<< 7` lifts to halfword units, multiply by 2 for bytes; in our FBP=2048-byte units, bp = FBP * 8 so bp*256 = FBP*2048). 
**Five-phase verification**: (1) spot-checks at 13 well-defined corners (origin, intra-block, first/second block, second row of blocks, page-x and page-y boundaries, FBP=4 origin); (2) within-block 16×8 walk asserting `byte = 2 * columnTable16[yb][xb]` — locks the column table; a row-major-halfwords regression would fail; (3) **source-table lock** — 32 hard-coded address checks (one per block in page 0) with the expected block index taken VERBATIM from PCSX2 blockTable16, NOT derived from the in-TB reference function; (4) block-swizzle walk cross-checking the in-TB ref function against the DUT (the bijectivity sweep relies on it being correct); (5) **bijectivity sweep over the 64×64 page** — 4096 halfword slots, every pixel must hit a unique halfword address in `[0, 8192)` and agree with the in-TB reference. Plus multi-page sanity at FBW=2 and non-page-aligned FBP coverage at FBP ∈ {1, 2, 3} (real PS2 supports any 2048-byte-aligned FBP — same broadening Ch119 adopted post- audit). NOTE: This module is NOT YET wired into `gs_pcrtc_stub` / `gif_image_xfer_stub` / `gs_stub` — those still use linear PSMCT16 addressing as of Ch125. The math is locked here so follow-on chapters can wire `PSMCT16_SWIZZLE` parameter gates into the existing address paths without disturbing the legacy linear-PSMCT16 TBs (Ch94 / Ch95 / Ch103 / Ch116). - `tb_gs_swizzle_psmct32.sv` (Ch119) — focused contract for the new `gs_swizzle_psmct32_stub` math primitive: a pure-combinational module mapping `(FBP, FBW, x, y)` to a VRAM byte address using the real PS2 GS PSMCT32 page/block swizzle layout (8 KiB pages, 4×8 grid of 8×8-pixel blocks per page, blocks ordered per the canonical PCSX2 PSMCT32 swizzle table, row-major within a block). 
Verification has five phases: (1) spot-checks on the well-defined corners — origin, intra-block walks, first/second block, second row of blocks, page-x and page-y boundaries, second page on x, and FBP=4 origin; (2) within-block 8×8 walk asserting `byte_in_block = yb*32 + xb*4`; (3) **source-table lock** — 32 hard-coded address checks (one per block in page 0) where the expected block index is taken VERBATIM from PCSX2's PSMCT32 block table, NOT derived from the in-TB reference function. This proves the DUT's `swizzle_psmct32()` table matches the canonical source; a copied-wrong table that happened to still be a valid permutation of 0..31 would fail this phase, while the bijectivity sweep below would pass it; (4) block-swizzle walk (redundant with phase 3, cross-checks ref_block_idx against the DUT — the bijectivity sweep relies on ref_block_idx being correct); (5) bijectivity sweep over the full 64×32 PSMCT32 page — every word slot in `[0, 8192)` reached exactly once (catches any swap/typo in the swizzle table). Plus a multi-page sanity check at FBW=2 (pixel (96, 16) → block (4,2) of page 1 → addr 14336) and a **non-page- aligned FBP** phase that drives FBP=1, 2, 3 (mid-page in the 8 KiB sense — real PS2 supports any 2048-byte-aligned FBP; our address formula is bit-correct for non-page-aligned FBP) plus FBP=3 with FBW=2 + intra-block crossing as a stress case. NOTE (now historical): at Ch119 this module was standalone math only; Ch120 (PCRTC read), Ch121 (image-xfer write), and Ch122 (raster write) wired it into the three integration points — the same shape that Ch125–Ch128 (PSMCT16), Ch131–Ch134 (PSMT8), and Ch137–Ch140 (PSMT4) followed for the other three PSMs. - `tb_gs_image_xfer_psmt4.sv` (Ch118) — focused contract for `gif_image_xfer_stub`'s PSMT4 path (the fourth and final supported PSM). PSMT4 packs 0.5 bytes/pixel (4-bit nibble per pixel = 2 px/byte), so each 128-bit IMAGE qword carries 32 pixels in 16 bytes. 
Each emit is a SUB-BYTE write: `write_be = 4'b0001` with a per-emit nibble mask (`write_mask = 0x0000_000F` for the LOW nibble, `0x0000_00F0` for the HIGH nibble), keyed by `(DSAX+x)[0]`; vram_stub's per-bit merge commits exactly the targeted nibble, preserving the OTHER nibble of the byte. Back-to-back emits to the same byte (e.g. x=0 + x=1 of the same row) chain through NBA semantics without bypass logic — the same trick the raster channel uses since Ch106. The TB is INTENTIONALLY adversarial: VRAM is preloaded with `0xA5` across every byte the engine will write (plus boundary bytes), then a single IMAGE qword (32 PSMT4 pixels) covers the entire 8×4 rect. Every byte ends as `{nibble_high_pixel, nibble_low_pixel}` (no trace of 0xA5); bytes immediately right of the rect on each row stay 0xA5 (proves no nibble leak past RRW); bytes before / after the destination region also stay 0xA5. Pattern `pixel(x,y) = 4'((y*8+x) & 0xF)`. Asserts: 1 trxdir arm, 32 vram writes, every emit `be=0001` and `mask ∈ {0x0F, 0xF0}`, per-byte readback matches, boundary bytes preserved. - `tb_gs_image_xfer_psmt8.sv` (Ch117) — focused contract for `gif_image_xfer_stub`'s PSMT8 path. Pushes 2 IMAGE qwords (32 PSMT8 pixels = 16 px/qword × 2) through the engine after a TRXDIR-shaped GIF-A+D register sequence with DPSM=PSMT8 (=0x13). PSMT8 packs 1 byte/pixel (an 8-bit CLUT index), so each qword holds 16 pixels; the engine emits one 8-bit pixel per cycle with `write_be = 4'b0001`, the index in the LOW byte of `write_data`, and `write_mask = 0xFFFFFFFF`; vram_stub commits `mem[write_addr] <= write_data[7:0]` at any byte alignment. Pattern is `pixel(x,y) = 8'(y*16 + x)` — 32 distinct values across the 8×4 rect so a wrong-byte-lane commit shows up unambiguously. Asserts: 1 trxdir arm, 32 vram writes (all `be=0001`, `mask=0xFFFFFFFF`), every pixel reads back at `dest_base + y*64 + x`, plus right-of-rect / before / after byte-zero boundary preservation. 
Each qword packs TWO rows of 8 pixels (lanes 0..7 = row y, lanes 8..15 = row y+1) — exercises the per-lane row-stride math at the qword boundary. - `tb_gs_image_xfer_psmct16.sv` (Ch116) — focused contract for `gif_image_xfer_stub`'s new PSMCT16 path. Pushes 4 IMAGE qwords (32 PSMCT16 pixels = 8 px/qword × 4) through the engine after a TRXDIR-shaped GIF-A+D register sequence (BITBLTBUF/TRXPOS/TRXREG/TRXDIR). PSMCT16 packs 2 bytes/pixel, so each qword holds 8 pixels (vs 4 for PSMCT32). The engine emits one 16-bit pixel per cycle to vram_stub with `write_be = 4'b0011`, the pixel value in the LOW halfword of `write_data`, and `write_mask = 0xFFFFFFFF`; vram_stub commits the 2 bytes at the 2-byte-aligned destination address. Pattern is `pixel(x,y) = 16'h{yyxx}{yyxx}` — distinct per-pixel value so a wrong-lane commit shows up unambiguously. Asserts: 1 trxdir arm, 32 vram writes (all `be=0011`, `mask=0xFFFFFFFF`), every pixel reads back at `dest_base + y*row_stride + x*2`, and the bytes immediately right of the rect on each row + before the dest region + after the dest region all stay zero (proves row-stride math + no halfword leak past RRW). PSMT8 image-xfer landed in Ch117 and PSMT4 image-xfer landed in Ch118 — see those TB rows for their own per-byte / per-nibble contract coverage. - `tb_gs_demo_psmt4_e2e_trxdir.sv` (Ch110) — driver-shaped PSMT4 demo with the palette upload now arriving via a real TRXDIR/TRXPOS/TRXREG/HWREG image-transfer GIF packet sequence instead of TB-direct vram_stub writes. Closes the LAST TB-direct path in the e2e demo flow: every byte the GS sees — framebuffer pixels AND palette source — now arrives through a driver-shaped GIF stream. The DMAC delivers 36 qwords total: U1 (PACKED, NREG=4): BITBLTBUF/TRXPOS/TRXREG/TRXDIR — TRXDIR arms `gif_image_xfer_stub`. U2 (IMAGE, NLOOP=4): 4 qwords of 4 PSMCT32 entries each → 16 palette entries written into VRAM at DBP*256 by `gif_image_xfer_stub`. Then 4 SPRITE PACKED packets + 1 TEX0_1 PACKED packet. 
PASS criteria add to Ch109's: **1 EV_DMA_START / 36 EV_DMA_BEAT / 1 EV_DMA_DONE**, **7 GIFtag accepts** (U1 + U2 + 4×SPRITE + TEX0), **25 PACKED A+D dispatches** (4 TRX-setup + 20 SPRITE + 1 TEX0), **16 image-xfer VRAM writes** from `gif_image_xfer_stub` (DBP=4, DBW=1, DPSM=PSMCT32, DSAX=DSAY=0, RRW=16, RRH=1). The vram_stub write port is muxed at TB level: `xfer_busy ? xfer_we : raster_pixel_emit` (sequenced — palette upload completes before sprites raster). Ch110 also added a backpressure path on `gif_packed_stub` (`image_data_ready` input) so the upstream DMA stalls while `gif_image_xfer_stub` is draining the previous IMAGE qword's 4 PSMCT32 lanes; outside S_IMAGE the gate is a no-op (in_ready stays high). Privileged-block MMIO (PMODE/ DISPFB1/DISPLAY1) remains TB-direct because those are CPU MMIO writes in real hardware, not GIF traffic. - `tb_gs_demo_psmt4_e2e_dmac.sv` (Ch109) — same 4-quadrant PSMT4 demo as Ch108, but the GIFtag + PACKED A+D quadwords arrive at `gif_packed_stub` via the DMAC channel-2 → `ee_memory_map_stub` → `ee_ram_stub` path instead of being TB-driven directly. Closes the last GIF-side sideband from Ch108: the demo is now reachable the way real EE/IOP code reaches it. The TB pre-stages the same 26 qwords (4 SPRITE packets × 6 qwords + 1 TEX0_1 packet × 2 qwords) into RAM at PAYLOAD_MADR, then writes DMAC channel-2 MADR/QWC/CHCR; a single NORMAL transfer with QWC=26 streams them into the GIF. PASS criteria add to Ch108's: **1 EV_DMA_START / 26 EV_DMA_BEAT / 1 EV_DMA_DONE** (DMA event taxonomy locked), with the same downstream chain — 5 GIFtag accepts, 21 A+D dispatches in the expected reg-num sequence, 32 PSMT4 emits, 1 loader_busy rise, identical 16×8 captured frame. Privileged- block MMIO and palette pre-stage stay TB-direct (NOT GIF-side); TRXDIR/HWREG image-transfer for palette upload is a separate future chapter. 
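The qword, GIFtag, and A+D counts quoted for Ch109 and Ch110 follow from the packet shapes alone. The sketch below is a hypothetical Python tally, not project code. It assumes each PACKED packet is one GIFtag plus `NLOOP*NREG` A+D qwords, each IMAGE packet is one GIFtag plus `NLOOP` data qwords, and U1 uses NLOOP=1 (the text states only NREG=4).

```python
# Hypothetical packet accounting for the Ch109 / Ch110 demo streams.
# Each packet is (kind, nloop, nreg); nreg is ignored for IMAGE packets.
SPRITE = ("PACKED", 1, 5)    # PRIM, FRAME_1, RGBAQ, XYZ2, XYZ2
TEX0 = ("PACKED", 1, 1)      # single TEX0_1 A+D write
TRX_SETUP = ("PACKED", 1, 4) # BITBLTBUF/TRXPOS/TRXREG/TRXDIR (NLOOP=1 is an assumption)
PALETTE = ("IMAGE", 4, 0)    # 4 qwords of palette data

PSMCT32_PIXELS_PER_QWORD = 4  # 128-bit qword / 32-bit pixel


def summarize(packets):
    """Count total qwords, GIFtag accepts, A+D dispatches, and IMAGE data qwords."""
    totals = {"qwords": 0, "giftags": 0, "ad_writes": 0, "image_qwords": 0}
    for kind, nloop, nreg in packets:
        totals["giftags"] += 1
        if kind == "PACKED":
            payload = nloop * nreg
            totals["ad_writes"] += payload
        else:
            payload = nloop
            totals["image_qwords"] += payload
        totals["qwords"] += 1 + payload  # one GIFtag qword plus its payload
    return totals


ch109 = summarize([SPRITE] * 4 + [TEX0])
ch110 = summarize([TRX_SETUP, PALETTE] + [SPRITE] * 4 + [TEX0])
ch110_palette_writes = ch110["image_qwords"] * PSMCT32_PIXELS_PER_QWORD

print(ch109)
print(ch110, ch110_palette_writes)
```

Since every transferred qword is one DMA beat, the `qwords` totals line up with the EV_DMA_BEAT counts in the PASS criteria.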
- `tb_gs_demo_psmt4_e2e_packed.sv` (Ch108) — same 4-quadrant PSMT4 demo as Ch107 but routed through the GIFtag / PACKED A+D front-end (`gif_packed_stub` with REAL_AD_REG_MAP=1). Closes the last bit of GS-side sideband from Ch107: instead of TB-driving `gs_stub.gif_reg_*` directly, the TB pushes raw 128-bit GIFtag + PACKED A+D quadwords into `gif_packed_stub.in_*` exactly the way the real GIF would receive them from PATH3. Each SPRITE is a packet of 1 GIFtag (NLOOP=1, NREG=5, PACKED, REGS=0xEEEEE — 5×A+D in the low 5 nibble slots) + 5 PACKED A+D qwords (PRIM, FRAME_1=PSMT4, RGBAQ, XYZ2, XYZ2); TEX0_1 load is its own 1-tag/1-A+D packet. Total: 5 GIFtag accepts (4 SPRITEs + 1 TEX0_1) and 4×5 + 1×1 = 21 PACKED A+D register-write dispatches into `gs_stub.gif_reg_*`. 32 PSMT4 raster emits arrive (Ch106 RMW), loader fires exactly once on TEX0_1, and the captured 16×8 frame matches the same expected CLUT-decoded RGB as Ch107 — i.e. real-format GIF packets reach the GS register file with the same cadence the TB previously synthesised by hand. Privileged-block MMIO (PMODE/DISPFB1/DISPLAY1) and the palette pre-stage in VRAM remain TB-direct because they are NOT GIF-side; the palette upload via real-PS2 TRXDIR/TRXPOS/TRXREG/HWREG image-transfer packets is a separate future chapter, as is the DMAC channel-2 burst that would normally deliver the GIFtag qwords (this TB drives `gif_packed_stub.in_*` directly to keep the demo narrow and deterministic; the full DMAC→RAM→GIF round trip is what the integration-tier `tb_ee_core_gif_*` family covers).
- `tb_gs_psmt4_round_trip.sv` (Ch104) — full driver-shaped PSMT4 + CLD=4 + CSA round trip. Wires `gs_stub` + `vram_stub` + `clut_stub` + `clut_loader_stub` + `gs_pcrtc_stub` end-to-end with `pcrtc.clut_csa = gs_stub.tex0_1_csa_q` (the Ch98 sideband-free pattern). Phase 1: stages a 4×4 PSMT4 sprite in VRAM, plus a 16-entry pattern_a palette in VRAM at `CBP_A*256`.
Drives TEX0_1 with `CBP=4, CPSM=PSMCT32, CSM=CSM2, CSA=0, CLD=4`; the loader writes pattern_a into `clut_stub[0..15]` and `pcrtc.clut_csa` is 0, so PSMT4 scanout reads pattern_a per nibble. Phase 2: stages a different pattern_b palette at `CBP_B*256` and drives TEX0_1 with `CBP=8, CSA=4, CLD=4`; the loader writes pattern_b into `clut_stub[64..79]` (the CSA=4 window) and `pcrtc.clut_csa` flips to 4, so the same VRAM sprite — same DISPFB1 / DISPLAY1 / PMODE — now reads pattern_b. Proves loader policy + clut_stub contents + PCRTC lookup are wired consistently. Scope (current, after Ch165): - **PSMCT32 (DISPFB1.PSM=0), PSMCT16 (PSM=2), PSMT8 (PSM=0x13), and PSMT4 (PSM=0x14) honored at BOTH the read and write sides** (Ch94 + Ch95 + Ch96 + Ch97 + Ch103 + Ch105 + Ch106). PSMCT24/PSMCT16S/PSMZ32/etc. force scanout off and are not contract-tested at the raster channel. The write side (gs_stub.raster_pixel_emit) emits the four supported PSMs via `raster_pixel_be_q` (per-byte gate) and `raster_pixel_mask_q` (per-bit merge mask, Ch106): PSMCT32 = be `0xF` / mask `0xFFFFFFFF`, PSMCT16 = be `0x3` / mask `0xFFFFFFFF`, PSMT8 = be `0x1` / mask `0xFFFFFFFF`, PSMT4 = be `0x1` / mask `0x0F` or `0xF0`. The mask path is no-op for byte-or-larger PSMs (mem[i] = data[i] when mask_i = 0xFF) and only meaningful for PSMT4 sub-byte writes. PSMT8 / PSMT4 scanout surfaces the index/nibble as grayscale by default; with `clut_enable=1` (Ch97/Ch103) and a programmed `clut_stub`, the index/nibble looks up real RGB. 
  CLUT contents come either from a TB-direct write OR (Ch99..Ch102) from a VRAM→CLUT load triggered by a TEX0_1 GIF write with `CSM == 1` (CSM2 linear), `CPSM` ∈ {PSMCT32, PSMCT16}, and a CLD value passing the policy: CLD=0 never; CLD=1 always (full 256-entry load); CLD=2 if CBP changed since last load (full); CLD=3 if CBP/CPSM/CSA any-changed (full); CLD=4 always but only the 16-entry CSA window at indices `CSA*16 + i` (Ch102 — preserves the other 240 entries); CLD ∈ {5..7} silently no-op (reserved). `clut_loader_stub` walks the entries via `vram_stub`'s second read port; PSMCT16 entries are unpacked with the same 5→8 bit-replicate the scanout side uses (Ch94). CSM1 swizzle and CPSM ∉ {PSMCT32, PSMCT16} remain deferred.
- **Single CRTC, single DISPFB**. Real PS2 has two interlace-capable CRTCs (DISPFB1, DISPFB2). One context is enough for TBs to verify the round trip; PMODE.EN2 + DISPFB2 + DISPLAY2 is deferred.
- **Read-side addressing**. Linear by default (legacy formula `vram_read_addr = FBP*2048 + ((effective_y*FBW*64 + effective_x) << bpp_shift)`; the inner parentheses matter because `+` binds tighter than `<<`, and only the pixel index is scaled to bytes). Four OPTIONAL per-PSM swizzle paths gated by parameters on `gs_pcrtc_stub`: `PSMCT32_SWIZZLE=1` (Ch120) routes PSMCT32 reads through `gs_swizzle_psmct32_stub`; `PSMCT16_SWIZZLE=1` (Ch126) routes PSMCT16 reads through `gs_swizzle_psmct16_stub`; `PSMT8_SWIZZLE=1` (Ch132) routes PSMT8 reads through `gs_swizzle_psmt8_stub` (Ch131) — FBW must be even because PSMT8 pages are 128 px wide and the swizzle internally divides FBW by 2; `PSMT4_SWIZZLE=1` (Ch138) routes PSMT4 reads through `gs_swizzle_psmt4_stub` (Ch137); FBW must be even (same as PSMT8). The four parameters are independent — enabling one doesn't affect the others.
  PSMT4's swizzle module also outputs a `nibble_hi` selector that PCRTC uses in place of `pixel_index[0]` to pick which nibble of the byte at the swizzled address holds this pixel (PSMT4 packs 2 pixels per byte and the canonical PCSX2 column table reorders nibbles within a block, so the linear formula's `pixel_index[0]` selector is no longer correct under the swizzled layout). All four swizzle parameter defaults are 0 so all existing PCRTC-using TBs see the legacy linear behavior unchanged. The PSMT4 image-xfer (Ch139) and raster (Ch140) write-side wiring is now live as well, completing the four-PSM × three-path swizzle integration. Both driver-shape e2e demos for PSMT4 are also live: raster-driven (Ch141) and TRXDIR-driven (Ch142). All four common GS PSMs now have BOTH driver-shape e2e demos (CT32 Ch123+Ch124, CT16 Ch129+Ch130, T8 Ch135+Ch136, T4 Ch141+Ch142) — closing the four-PSM × three-path × dual-driver-shape e2e foundation.
- **Parallel to `platform_video_stub`, not a replacement**. We did not extend `platform_video_stub` (which would have rippled through 6 existing TBs). Pcrtc is the alternative video source for TBs that want VRAM-backed scanout. The legacy flood-fill module stays as-is.

### End-to-end demo manifest (Ch143)

Eight driver-shaped end-to-end byte-accurate demos cover the four common GS PSMs across both driver shapes (raster-driven PACKED-SPRITE payload + TRXDIR-driven IMAGE payload). Each demo runs the same EE-bootlet → DMAC → GIF → GS → vram → swizzled-PCRTC chain with all three same-PSM swizzle gates parameter-set to 1; the listed write-side path is load-bearing and the other write-side path is asserted dormant in the demo flow. All eight demos emit a 16×8 framebuffer (128 pixels). The raster column shows `(emits, xfer_writes)`; the TRXDIR column shows `(xfer_writes, emits)` — in both cases the load-bearing path fires 128 times and the dormant path is asserted 0.
| PSM | Raster-driven e2e | TRXDIR-driven e2e | |---------|---------------------------------|------------------------------------| | PSMCT32 | Ch123 — `tb_gs_demo_psmct32_swizzle_e2e` (128, 0) | Ch124 — `tb_gs_demo_psmct32_swizzle_trxdir_e2e` (128, 0) | | PSMCT16 | Ch129 — `tb_gs_demo_psmct16_swizzle_e2e` (128, 0) | Ch130 — `tb_gs_demo_psmct16_swizzle_trxdir_e2e` (128, 0) | | PSMT8 | Ch135 — `tb_gs_demo_psmt8_swizzle_e2e` (128, 0) | Ch136 — `tb_gs_demo_psmt8_swizzle_trxdir_e2e` (128, 0) | | PSMT4 | Ch141 — `tb_gs_demo_psmt4_swizzle_e2e` (128, 0) | Ch142 — `tb_gs_demo_psmt4_swizzle_trxdir_e2e` (128, 0) | For each row both demos use the same per-quadrant pixel pattern (so the verify side is shared across the row), the same DBW- even constraint where applicable (PSMT8 / PSMT4: 128-px-wide pages → DBW=2 minimum even), and verification through the freed-up `vram_stub` 2nd read port. Ch141 + Ch142 together close the four-PSM × three-path × dual-driver-shape e2e foundation — the foundation Ch143 manifests and seals. **Hardware-demo candidates**: - **PSMCT32 swizzled raster e2e (Ch123)** — simplest direct- color path: 4 SPRITE PACKED packets, RGBAQ.{R,G,B,A} mapped 1:1 to scanout RGB, no CLUT, no nibble RMW. The natural first hardware demo because every byte from EE-bootlet through the swizzled 16×8 framebuffer to PCRTC RGB is visible without any indirection. Build target: `make tb_gs_demo_psmct32_swizzle_e2e`. - **PSMT4 swizzled TRXDIR e2e (Ch142)** — strongest indexed/ CLUT-like stress path: U1 PACKED A+D TRX setup + U2 IMAGE NLOOP=4 with 32 PSMT4 nibbles per qword, image-xfer engine decoding the canonical PCSX2 columnTable4 (which reorders nibbles within a block — the linear `pixel_index[0]` rule is wrong under swizzle), and per-pixel nibble RMW on vram_stub via `write_be=4'b0001 + write_mask ∈ {0x0F, 0xF0}` keyed by the swizzle's `nibble_hi`. Exercises the full sub-byte pipeline + the canonical-source-locked column table. 
Build target: `make tb_gs_demo_psmt4_swizzle_trxdir_e2e`. ### First hardware-targeted top wrapper (Ch146) Ch146 turns the Ch144 readiness audit + Ch145 BRAM-shrink groundwork into a real top-level SystemVerilog module: [`rtl/top/top_psmct32_raster_demo.sv`](../../rtl/top/top_psmct32_raster_demo.sv). This is the module a board-level synthesis project would target first. Board-level concerns (HDMI/VGA PHY, pin constraints, .mem bake tooling, clock-domain crossings) are deliberately deferred — Ch146 proves the design can be expressed as a single hardware- shape module. **Top ports**: - `clk` / `rst_n` / `core_go` — clock, active-low synchronous reset, start pulse (a board reset-release sequencer can tie `core_go` high after `rst_n` deasserts). - `r/g/b/hsync/vsync/de` — 8-bit RGB scanout from PCRTC. - `core_halt` / `dma_done_seen` / `frame_seen` — debug/status bundle suitable for LEDs or a board-level state observer. **Top parameters**: `H_ACTIVE` (default 16), `V_ACTIVE` (default 8), `BIOS_SIZE_BYTES`, `RAM_SIZE_BYTES`, `VRAM_BYTES`, `USEG_SHADOW_WORDS_PARAM` (default 1024 = 4 KiB per Ch145). **Image fixtures** are passed via macros (iverilog-12 string- parameter forwarding limitation): `TOP_PSMCT32_RASTER_DEMO_BIOS_IMAGE_FILE` and `TOP_PSMCT32_RASTER_DEMO_PAYLOAD_IMAGE_FILE`. The fixtures are baked by [`sim/data/top_psmct32_raster_demo/bake.py`](../../sim/data/top_psmct32_raster_demo/bake.py) which writes: - `bios.mem` — 18-word EE bootlet (one 32-bit hex word per line) - `payload.mem` — 40 qwords for ee_ram_stub (16 zero qwords + 24 GIF qwords carrying 4 SPRITE PACKED packets) The bake script is a deterministic Python rewrite of the procedural `ee_prog_word()` + `preload_qword()` loops in the Ch123 TB. Same bit-exact values, just baked into static repo artifacts so a hardware top can `$readmemh` them. **Focused TB**: [`sim/tb/top/tb_top_psmct32_raster_demo.sv`](../../sim/tb/top/tb_top_psmct32_raster_demo.sv). 
Drives the top with the static fixtures, captures one full PCRTC frame after the EE halts and DMAC completes, and asserts the per-quadrant RGB matches the Ch123 frame exactly. Counts: `raster_emits=128, errors=0, core_halt=1, dma_done_seen=1, frame_seen=1`.

**Bug-fix iteration**: the first bake had Y in XYZ2 placed at bits[43:32] instead of bits[31:20] — a Python translation error of the SystemVerilog `{32'd0, y_int, 4'd0, x_int, 4'd0}` concatenation. Symptom: per-sprite emit count was 8 instead of 32 (each sprite drew one row), and VRAM held the per-sprite R component scattered across 32 consecutive 4-byte cells. Caught by adding a per-emit observer that printed `(addr, data, be, mask, color_q)` for the first 10 emits. Fix: `y << 20` instead of `y << 32` in `bake.py`. **PASS after the fix.**

**What's still NOT in this chapter** (deferred to Ch147+):

- Real `.mem` bake tooling integration (currently the `bake.py` is run manually before sim; a Makefile target or CI step that invokes it would belong in Ch147).
- Board-specific top: pin constraints, target FPGA family, PHY shim (HDMI/DVI/VGA), reset-release sequencer.
- A multi-PSM top (the Ch142 PSMT4 TRXDIR variant would be a natural second wrapper once the build flow is proven).

### Fixture bake flow (Ch147)

Ch147 makes the Ch146 `.mem` bake first-class so the static fixtures can't drift from `bake.py`. Three Makefile targets are involved (two new, one updated):

| Target | Purpose |
|-----------------------------------------|-----------------------------------------------------------------------|
| `top_psmct32_raster_demo_mem` | Re-runs `bake.py`; produces `bios.mem` + `payload.mem` atomically. |
| `top_psmct32_raster_demo_mem_check` | Verifies fixture sizes (bios.mem = 1024 lines, payload.mem = 256). |
| `tb_top_psmct32_raster_demo` (existing) | Now declares `top_psmct32_raster_demo_mem` as a prerequisite. |

The bake target uses Make's grouped-target syntax (`&:`) so a single `bake.py` run produces both files atomically — they can never be out-of-step. The size-check target counts payload lines (skipping blanks + `// ...` comment-only lines) and asserts the exact expected counts. A non-matching count exits with status 1, surfacing a fixture/script drift as a hard build failure.

Deleting the fixtures and running the TB triggers the bake automatically:

```
$ make tb_top_psmct32_raster_demo
=== bake top_psmct32_raster_demo .mem fixtures ===
python3 .../bake.py
[bake] wrote bios.mem (1024 words, 18 active) and payload.mem (256 qwords, 40 active)
=== build tb_top_psmct32_raster_demo ===
...
[tb_top_psmct32_raster_demo] PASS
```

#### Synthesis-facing macros

When pointing a synthesis tool at `rtl/top/top_psmct32_raster_demo.sv`, two preprocessor defines must be set so `bios_rom_stub` and `ee_ram_stub` find their `$readmemh` images. These are macros (NOT module parameters) per the iverilog-12 string-parameter forwarding workaround documented in the Ch146 wrapper banner; they map cleanly to FPGA-tool defines.

| Macro | Value |
|----------------------------------------------------|----------------------------------------------------------------|
| `TOP_PSMCT32_RASTER_DEMO_BIOS_IMAGE_FILE` | Absolute (or tool-relative) path to `bios.mem` |
| `TOP_PSMCT32_RASTER_DEMO_PAYLOAD_IMAGE_FILE` | Absolute (or tool-relative) path to `payload.mem` |

Both default to `""` so the wrapper still elaborates without fixtures (synthetic NOP-sled in `bios_rom_stub` + zero-init `ee_ram_stub`, which produces no DMAC payload but a stable PCRTC frame with `r=g=b=0`).
**Vivado** (preprocessor `verilog_define` on the synthesis + implementation filesets — these are macros, not module generics):

```
set_property verilog_define { \
    TOP_PSMCT32_RASTER_DEMO_BIOS_IMAGE_FILE="$path/bios.mem" \
    TOP_PSMCT32_RASTER_DEMO_PAYLOAD_IMAGE_FILE="$path/payload.mem" \
} [get_filesets sources_1]
```

Repeat for the implementation fileset if it diverges from `sources_1`.

**Quartus** (project-level macro defines):

```
set_global_assignment -name VERILOG_MACRO \
    "TOP_PSMCT32_RASTER_DEMO_BIOS_IMAGE_FILE=\"$path/bios.mem\""
set_global_assignment -name VERILOG_MACRO \
    "TOP_PSMCT32_RASTER_DEMO_PAYLOAD_IMAGE_FILE=\"$path/payload.mem\""
```

**Iverilog (sim)**: the Ch147 Makefile passes them via `-D` flags in the `tb_top_psmct32_raster_demo` build rule — `-DTOP_PSMCT32_RASTER_DEMO_BIOS_IMAGE_FILE='"$(SIM_DIR)/data/... /bios.mem"'` — and the `top_psmct32_raster_demo_mem` prerequisite ensures the .mem files exist before the TB elaborates.

### DE25-Nano synthesis scaffold (Ch148)

Ch148 makes the Ch146 hardware top synthesis-addressable on DE25-Nano without committing to a video PHY shim or final pin constraints (those land in Ch149+).

| File / target | Purpose |
|------------------------------------------------------------------|------------------------------------------------------------|
| `synth/de25_nano/top_psmct32_raster_demo/files.f` | RTL filelist — Ch123 dep tree only (~14 entries). |
| `synth/de25_nano/top_psmct32_raster_demo/README.md` | Top module + macros + fixtures + DE25-Nano clock/reset/video assumptions. |
| `make top_psmct32_raster_demo_synth_check` | Validates files.f paths + fixture presence. |

The synth-check target depends on `top_psmct32_raster_demo_mem_check`, so a single command verifies fixture sizes AND that every file referenced by the synth filelist exists. It exits non-zero on any miss — surfacing both fixture drift (Ch147 size guard) and filelist drift as hard build failures.
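To illustrate the kind of check `top_psmct32_raster_demo_synth_check` performs, here is a minimal Python sketch that validates a filelist and a set of fixtures. It is not the Makefile's actual implementation. The helper name `check_synth_inputs` and the assumed `files.f` layout (one path per line, relative to the filelist's directory, `//` comments allowed, lines starting with `+` or `-` treated as tool options) are assumptions made for this example.

```python
from pathlib import Path


def check_synth_inputs(filelist, fixtures):
    """Return a list of problems; an empty list means the check passed.

    Hypothetical helper: the files.f format handled here is an assumption,
    not the format the repo's Makefile necessarily parses.
    """
    filelist = Path(filelist)
    if not filelist.is_file():
        return [f"missing filelist: {filelist}"]

    problems = []
    for raw in filelist.read_text().splitlines():
        entry = raw.split("//", 1)[0].strip()  # drop trailing // comments
        if not entry or entry[0] in "+-":      # skip blanks and tool options
            continue
        if not (filelist.parent / entry).is_file():
            problems.append(f"missing RTL file: {entry}")

    # Fixture presence is checked separately from the RTL entries.
    for fixture in fixtures:
        if not Path(fixture).is_file():
            problems.append(f"missing fixture: {fixture}")
    return problems
```

A caller would exit non-zero whenever the returned list is non-empty, which mirrors the "hard build failure" behaviour described above.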
`.qsf` (Quartus pin assignments) is **not** committed in Ch148. The README documents the board assumptions (clock domain, reset polarity, `core_go` strategy, video-out path candidates, LED status mapping) so the next chapter can author it without inventing context. The point of Ch148 is that a Quartus project import (or Vivado / `verilator --lint-only`) finds every file the design needs, with the macros documented end-to-end. ### DE25-Nano board wrapper (Ch149) Ch149 turns the Ch146 board-agnostic top into a real board top without yet committing to pin assignments or a video PHY. New: | Artifact | Purpose | |-----------------------------------------------------------|------------------------------------------------------------------------| | `rtl/top/de25_nano_psmct32_raster_demo_top.sv` | Board wrapper — DE25-Nano signal names + reset sequencer + LED status. | | `sim/tb/top/tb_de25_nano_psmct32_raster_demo_top.sv` | Smoke TB exercising clock/reset/core_go/LED/video pins. | **Top ports** (matching the Terasic Golden_top.v conventions from the DE25-Nano resource CD): `CLOCK0_50` / `CLOCK1_50` / `CLOCK2_50`, `KEY[1:0]` (active-LOW), `SW[3:0]`, `LED[7:0]` (active-LOW), and raw `VIDEO_R/G/B/HSYNC/VSYNC/DE` outputs that a future PHY shim will consume. **Reset bridge**: 1. `ninit_done` sourced from Terasic's `reset_release` IP under `\`ifdef USE_TERASIC_RESET_RELEASE_IP` (default-off; sim uses an inline 16-cycle stub matching the IP's shape). 2. `KEY[0]` + `ninit_done` feed an async-assert/sync-deassert 2-stage shift register on CLOCK2_50. Mirrors the retroDE_nes pattern at `retroDE_nes.sv:170-177`. **`core_go` sequencer**: 16-cycle delay after `core_rst_n` deasserts, then a one-cycle `core_go` pulse. Matches the "recommended hardware path" documented in the Ch148 README and the level-sensitive `go_i` semantics at `ee_core_stub.sv:812-813`. 
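The reset bridge and `core_go` sequencer described above can be modelled at cycle level. The Python sketch below is an assumption-laden illustration, not the RTL: the function name `simulate_reset_and_go`, the choice to register `core_go` (so the pulse appears one clock after the counter reaches `go_delay`), and the 16-cycle `ninit_done` stub length are modelling choices for this example. The real wrapper may differ by a cycle.

```python
def simulate_reset_and_go(n_cycles=64, init_cycles=16, go_delay=16, key0=1):
    """Cycle-level model of a 2-stage reset synchronizer plus a one-shot
    core_go sequencer. Returns a list of (cycle, core_rst_n, core_go).

    key0 is active-LOW (1 = not pressed). All names are hypothetical.
    """
    sync = [0, 0]      # sync[1] drives core_rst_n
    count = 0          # cycles counted since core_rst_n went high
    core_go = 0        # modelled as a registered output
    fired = False      # ensures exactly one pulse per reset release
    trace = []

    for cycle in range(n_cycles):
        ninit_done = 1 if cycle < init_cycles else 0
        core_rst_n = sync[1]
        trace.append((cycle, core_rst_n, core_go))

        # Next state of the go sequencer (sees this cycle's core_rst_n).
        if not core_rst_n:
            count, core_go, fired = 0, 0, False
        elif core_go:
            core_go = 0                      # pulse lasts one cycle
        elif not fired:
            if count == go_delay:
                core_go, fired = 1, True
            else:
                count += 1

        # Next state of the synchronizer: assert immediately, release
        # through two flops.
        if ninit_done or not key0:
            sync = [0, 0]
        else:
            sync = [1, sync[0]]
    return trace


trace = simulate_reset_and_go()
rstn_cycle = next(c for c, rstn, _ in trace if rstn)
go_cycles = [c for c, _, go in trace if go]
print(rstn_cycle, go_cycles)
```

In this model `core_rst_n` rises two clocks after `ninit_done` falls and exactly one `core_go` pulse follows, matching the `core_go_pulses=1` style of check a smoke TB would make.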
**LED status**: the Ch146 wrapper's three sticky status outputs drive `LED[2:0]` (active-LOW): `LED[0] = ~core_halt`, `LED[1] = ~dma_done_seen`, `LED[2] = ~frame_seen`. `LED[7:3]` tied HIGH (OFF). **Smoke TB counts**: `core_go_pulses=1`, all three status LEDs eventually latch (the actual fall-edge order is `frame_seen` first, then `core_halt`, then `dma_done_seen` — `frame_seen` is a "PCRTC alive" indicator that fires on the first empty frame after reset, well before the bootlet runs), and `VIDEO_DE` rises inside the active region. Standalone PASS. `.qsf` (pin assignments), PLL, and video PHY shim remain deferred (Ch150+). Ch149 makes the design board-shaped, not yet board-pinned. ### Quartus scaffold for DE25-Nano (Ch150) Ch150 commits the first real Quartus artifacts for the Ch149 board wrapper — a minimal `.qsf` + `.sdc` pair, deliberately PHY-light: | File | Purpose | |-----------------------------------------------------------------|-------------------------------------------------------------------| | `synth/de25_nano/top_psmct32_raster_demo/de25_nano_psmct32_raster_demo_top.qsf` | Device + family + pin assignments + IO standards + .mem macros + file list. | | `synth/de25_nano/top_psmct32_raster_demo/de25_nano_psmct32_raster_demo_top.sdc` | CLOCK2_50 50 MHz clock + reset-sync false-path + IO false-paths. | | `make top_psmct32_raster_demo_quartus_scaffold_check` | Validates both files exist + top entity + pins + clock period. | **Device** (sourced from `retroDE_splash.qsf`): Agilex 5 `A5EB013BB23BE4SCS`, package `VPBGA`. **Top entity**: `de25_nano_psmct32_raster_demo_top` (the Ch149 board wrapper — NOT the inner Ch146 module). **Pin assignments** match the DE25-Nano board pinout used by `retroDE_splash` and `retroDE_nes`: `CLOCK2_50` → `PIN_BF23`, `KEY[0]` → `PIN_C8`, `LED[2:0]` → `PIN_DN22 / PIN_DJ32 / PIN_DF35`. 
CLOCK0/1_50, KEY[1], SW[3:0], and LED[7:3] are also assigned (their canonical pins) so Quartus doesn't flag them as unconstrained inputs/outputs even though the Ch149 wrapper ties them off.

**SDC** (sourced from `retroDE_splash.sdc`): a single 50 MHz `create_clock` on CLOCK2_50, the standard reset-sync first-stage false-path (`set_false_path -to [get_registers -nowarn {*rst_sync[0]}]`), and IO false paths for `KEY[*]`, `SW[*]`, `LED[*]` plus the as-yet-unpinned `VIDEO_*` outputs (replaced by real `set_output_delay` constraints when the PHY shim lands).

**`.mem` macros** baked into the QSF (project-relative paths): `TOP_PSMCT32_RASTER_DEMO_BIOS_IMAGE_FILE = sim/data/top_psmct32_raster_demo/bios.mem` and the matching payload macro. Run `make -C sim top_psmct32_raster_demo_mem` before launching Quartus.

**`USE_TERASIC_RESET_RELEASE_IP`** is **not** defined in this QSF — keeping the wrapper self-contained for the first project import. To wire in Terasic's `reset_release` IP, define the macro and add the IP file from `DE25_Nano_ResourceCD/Demonstration/FPGA/Board_Info_RTL/reset_release/`.

**Deferred to Ch151+**: video PHY pins + shim (HDMI ADV7513 + I²C config FSM, VGA DAC, or PMOD), PLL `.ip` config, LPDDR4 / SDRAM / HPS / CAM / UART / GPIO assignments. Ch150 makes the project Quartus-importable, not yet Quartus-buildable for video output.

### PLL + lock-gated reset (Ch151)

Ch151 adds the most conservative hardware bring-up step before touching the video PHY: a board-clock PLL on the path between `CLOCK2_50` and the design clock, with the reset bridge gated on PLL lock so the design can only leave reset once the PLL is stable.

| Artifact | Purpose |
|-------------------------------------------------------|----------------------------------------------------------------------|
| `rtl/top/de25_nano_pll_stub.sv` | Sim stub matching the Quartus IOPLL `pll` module signature. |
| `rtl/top/de25_nano_psmct32_raster_demo_top.sv` (Ch151) | Reworked with PLL instantiation + lock-gated reset bridge + `design_clk` distribution to the Ch146 wrapper and `core_go` sequencer. |
| `tb_de25_nano_psmct32_raster_demo_top` (Ch151 update) | Adds rising-edge timestamps for `pll_locked` / `core_rst_n` / `core_go` and asserts the contract `pll_locked < core_rst_n < core_go`. |

**PLL signature** (matches `retroDE_nes/ip/pll/pll_bb.v` and `retroDE_splash/ip/sys_pll/sys_pll_bb.v`):

```
module pll (
    input  wire refclk,
    input  wire rst,
    output wire outclk_0,
    output wire locked
);
```

**Sim stub behavior**: `outclk_0 = refclk` (pass-through, no multiplication — sim doesn't need a different frequency, and a pass-through still exercises the lock-gated reset bridge). `locked` rises after 32 cycles with `rst` low; held LOW while `rst` is HIGH.

**Reset gating**: the board top's `rst_sync` register async-asserts on `(ninit_done | ~pll_locked)` — both FPGA init AND PLL lock must complete before reset can deassert.

**Synth swap**: define `USE_PLL_IP` and add a Quartus IOPLL `.qip` to the project; the board wrapper's `\`ifdef USE_PLL_IP` swaps the stub for the real IP. The QSF documents the swap mechanism but ships with the IP commented out, keeping the scaffold self-contained until the PLL chapter (Ch152+) commits a frequency choice + IP file.

**TB contract** (smoke output, timestamps in ps): `t_pll/rstn/go=(950000,990000,1330000)`. The PLL locks at 950 ns, reset deasserts 40 ns later (the 2-stage sync register propagation, i.e. two 20 ns clocks), and `core_go` fires 340 ns after that (the GO_DELAY=16 wait). Order assertions catch any future regression of the gating.

**Deferred to Ch152+**: real PLL output frequency tuning (the stub passes refclk through; a real build sets `outclk_0` to whatever the video PHY chapter needs), committing the actual IOPLL `.ip` file under `synth/de25_nano/.../ip/`, the video PHY shim itself.
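The TB contract numbers in this section can be sanity-checked with a few lines of arithmetic. The sketch below treats the three timestamps as picoseconds, which is an assumption: it is the reading under which `950000` corresponds to the stated 950 ns. The helper name `check_reset_contract` is hypothetical, and the 20 ns period follows from the 50 MHz board clock passed through the sim PLL stub.

```python
CLK_PERIOD_PS = 20_000  # 50 MHz; the sim PLL stub passes refclk through


def check_reset_contract(t_pll, t_rstn, t_go, period_ps=CLK_PERIOD_PS):
    """Assert pll_locked < core_rst_n < core_go and return the two gaps
    expressed in clock cycles. Timestamps are assumed to be in ps."""
    if not (t_pll < t_rstn < t_go):
        raise AssertionError(
            f"ordering violated: pll={t_pll} rstn={t_rstn} go={t_go}"
        )
    sync_cycles = (t_rstn - t_pll) / period_ps
    go_cycles = (t_go - t_rstn) / period_ps
    return sync_cycles, go_cycles


sync_cycles, go_cycles = check_reset_contract(950_000, 990_000, 1_330_000)
print(sync_cycles, go_cycles)  # 2.0 17.0
```

Two cycles matches the 2-stage synchronizer. Seventeen cycles would be consistent with a 16-cycle wait plus one registered output stage, though that breakdown is an inference from the numbers rather than something the section states.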
### First Quartus compile + baseline report (Ch152) Ch152 is the chapter where the toolchain is finally asked the honest question: "does this DE25-Nano board top synthesize, fit, and pass static timing analysis?" **Driver**: [`synth/de25_nano/top_psmct32_raster_demo/build_quartus.sh`](../../synth/de25_nano/top_psmct32_raster_demo/build_quartus.sh) runs `quartus_syn → quartus_fit → quartus_sta` against the Ch150 QSF + Ch151 PLL stub. `quartus_asm` (bitstream gen) is deliberately skipped — Ch152 is a compile-and-report smoke, not a deploy path. `USE_PLL_IP` is left undefined so the Ch151 self-contained PLL stub stays under test (per Codex framing). **Make targets**: | Target | Action | |---------------------------------|-------------------------------------------------------------| | `make quartus_compile` | Full syn + fit + sta flow. | | `make quartus_compile_clean` | Wipe outputs first, then full flow. | | `make quartus_syn_only` | Synthesis only (~14 min smoke). | | `make quartus_compile_report` | Run [`parse_reports.py`](../../synth/de25_nano/top_psmct32_raster_demo/parse_reports.py) on the latest output. | **Ch152 RTL fixes that landed before synthesis would even elaborate**: | Issue | Fix | |------------------------------------------------------------------------------------|------------------------------------------------------------------------------| | QSF line-continuation (`\`) parse error in `set_global_assignment -name VERILOG_MACRO` | Collapsed to single-line lines. | | `vram_stub.mem` 8192-iter init loop exceeded Quartus's 5000-iter synthesizable-loop limit (Error 13356) | Wrapped initial block in `// synthesis translate_off` / `_on` pragmas. Real Altera/Intel BRAM is power-on-zero so the procedural loop is sim-only. 
| | `gs_pcrtc_stub` / `gif_image_xfer_stub` / `gs_stub` unconditionally instantiate all four swizzle math primitives even when their gate is 0 | Added `gs_swizzle_psmct16/8/4_stub.sv` to the synth filelist + QSF (iverilog trimmed silently; Quartus errors). | | `gs_stub.interp_byte` (Ch86 Gouraud TRI math) 64-bit signed divide hits Quartus Pro's lpm_divide LPM_WIDTHN ≤64 limit (Error 272006) | Wrapped divide in `// synthesis translate_off`; default fallback returns 0. The Ch123 SPRITE-only demo doesn't exercise Gouraud TRIs, so this is dead code in the build. A future Gouraud-TRI hardware demo would need a divider redesign sized for Agilex 5. | | QSF `SDC_FILE` referenced via repo-root-relative path failed when the build script ran Quartus from a per-build work dir (Warning 16124) | Changed to basename-only — works from either the repo root or the work dir (the script symlinks the SDC alongside the QSF). | **First successful synthesis**: 0 errors, 3 warnings, 14:08 elapsed. 160 RAM segments + 26 DSP elements inferred. **Fitter result — design too large for the part (the chapter's honest answer)**: ``` Total dedicated logic registers : 121,176 Total pins : 17 / 351 ( 5 %) Total block memory bits : 65,536 / 7,331,840 (<1 %) Total RAM Blocks : 6 / 358 ( 2 %) Total DSP Blocks : 20 / 188 (11 %) Logic utilization (ALMs needed) : 155,104 / 46,800 (331 %) ``` The design needs **155,104 ALMs vs the part's 46,800 — 3.31× oversized**. `Error (170011): Design contains 260,263 blocks of type combinational node. However, the device contains only 93,600 blocks.` **Why so big** (the precise picture, to be drilled into by Ch153+): The synthesis log reports `Info (22567): extracting RAM` for **all four** memory identifiers — `ee_ram_stub.mem`, `bios_rom_stub.mem`, `ee_memory_map_stub.useg_shadow_mem`, and `vram_stub.mem` — so Quartus *did* recognize each as a memory structure at syn time. 
But the fit report shows only **65,536 bits / 6 RAM Blocks** committed (roughly enough for BIOS 4 KB + EE-RAM 4 KB). Something between syn and fit caused the larger arrays — most likely `vram_stub.mem` (8 KB) and possibly `useg_shadow_mem` (4 KB after Ch145's 1024-word shrink) — to either (a) be replicated into combinational mux/decoder logic because of their access-port shape, or (b) lose their RAM attribute during fitter optimization and fall back to flip-flop implementation. The 121,176 dedicated registers + the 260,263 combinational nodes are consistent with at least `u_vram` getting massively unrolled. Ch153's job is to isolate **which array(s)** and **which port shape(s)** prevent compact block-RAM implementation. The likely candidates: `vram_stub`'s dual read ports + per-byte write_be lane (Ch95's per-byte gate may not be RAM-block- friendly on Agilex 5), and the EE memory map's wide arbitration into the useg-shadow port. None of this is fixed in Ch152 — surfacing the gap precisely is the chapter's deliverable. **Other notable findings** (full list in [`output_files/build_logs/`](../../synth/de25_nano/top_psmct32_raster_demo/output_files/build_logs/)): - **Critical Warning 20759**: "Use the Reset Release IP in Agilex 5 designs to ensure a successful configuration." This is the Ch151 `\`ifdef USE_TERASIC_RESET_RELEASE_IP` opt-in; enabling it (and committing the IP file) is a Ch153+ task. - **6× Warning 16749**: identifiers used before declaration in `dmac_reg_stub`, `gif_packed_stub`, `gs_stub`, `gif_image_xfer_stub`. Style/lint warnings, no functional impact; clean-up candidate for a future polish chapter. - **STA never ran** because fit failed. **What Ch152 leaves for Ch153+**: - Resource reduction. Most likely candidates: BRAM-infer `vram_stub.mem` and `useg_shadow_mem` cleanly (Quartus attribute hints / restructure read ports), or shrink the EE core's MIPS decode (table-driven vs LUT-driven), or move to a larger Agilex 5 part if available. 
- Enabling `USE_TERASIC_RESET_RELEASE_IP` and committing the Terasic `reset_release` IP file. - The PHY shim chapter (`VIDEO_*` virtualized → real HDMI ADV7513 / VGA / PMOD pins). - Cleaning up the 6× forward-reference style warnings. ### Memory-shape forensics (Ch153) Ch153 is a memory-forensics chapter (NOT a rewrite chapter): two isolated tiny Quartus projects under [`synth/de25_nano/experiments/`](../../synth/de25_nano/experiments/) target the same Agilex 5 part as the Ch150 board top so resource deltas are apples-to-apples. The goal is to identify which feature(s) of `vram_stub`'s shape prevent compact block-RAM implementation and drive the Ch152 size deficit. | Experiment | Memory shape | |-----------------------|-----------------------------------------------------------------------------------------------| | `exp_a_bram_friendly` | 2048 × 32-bit, single port, sync read + sync write with byte-WE. Intel-friendly BRAM template. | | `exp_b_vram_shape` | 8192 × 8-bit, dual COMBINATIONAL read, byte-WE + per-bit mask RMW. Exact `vram_stub` shape. | **The result is decisive**: | Metric | exp_a (BRAM-friendly) | exp_b (vram_stub-shape) | |---------------------------------|-----------------------|-------------------------| | Fitter status | ✅ **Successful** | ❌ **Failed** | | Logic utilization (ALMs) | **46** / 46,800 (< 1 %) | (fit failed — placement reports 257,986 combinational nodes vs 93,600 device max) | | Total dedicated logic registers | **0** | **65,536** | | Total RAM Blocks | **4** / 358 (1 %) | **0** / 358 (0 %) | | Total block memory bits | **65,536** (8 KB) | **0** | **Interpretation**: - The Intel-friendly shape maps the same 8 KB to **4 RAM Blocks** with **zero combinational logic and zero registers** beyond the read-output flop. 
- The `vram_stub` shape maps the same 8 KB to **zero RAM Blocks**, **65,536 dedicated registers** (one flip-flop per bit: 8192 bytes × 8), and **257,986 combinational nodes** (the 4-byte concatenation multiplexers for the dual combinational reads + the per-bit mask RMW gates).
- The 257,986 combinational-node figure for a single 8 KB memory almost exactly matches the 260,263 combinational-node figure Ch152 reported for the **entire top-wrapper design** — empirical confirmation that `u_vram` alone accounts for essentially all of the Ch152 size deficit.

**Which feature is the dominant cost** (the four candidates the shape diff isolates):

The exp_a vs exp_b diff folds four feature changes together (byte-addressable storage, combinational reads, dual reads, per-bit mask RMW). To pin down which feature(s) dominate, a future chapter could insert intermediate experiments — but the exp_a result already gives the upper bound on what BRAM-native inference can buy: ~4 RAM blocks + ~50 ALMs for 8 KB. Anything that gets `vram_stub` close to that bar wins back the entire Ch152 fit headroom.

The most likely individual culprit is the **per-bit mask RMW**: Agilex 5's M20K BRAM has byte-WE primitives but does NOT have per-bit RMW. Quartus has to materialize the `(mem & ~mask) | (data & mask)` arithmetic outside the BRAM, which forces the storage out of BRAM and into per-bit flip-flops. Combinational reads are the second most likely (BRAMs are synchronous-read-only on Agilex 5; Quartus has to either insert a register on the read path or materialize the storage as discrete flip-flops to feed the comb output).

**Make targets**:

| Target | Action |
|---------------------------------------|--------------------------------------------------------------|
| `make quartus_experiments` | Compile every `synth/.../experiments/exp_*` project. |
| `make quartus_experiments_clean` | Wipe outputs first, then compile. |
| `make quartus_experiments_report` | Side-by-side resource summary (no recompile). |

**What Ch153 leaves for Ch154+**:

- Refactor `vram_stub` into a BRAM-friendly shape: replace combinational reads with sync (registered output) reads, replace per-bit mask RMW with byte-WE-only writes (move the per-pixel sub-byte merging logic into the writer module — most likely `gs_stub.raster_pixel_emit` for the PSMT4 nibble case), and switch to 32-bit word-addressable storage with byte-WE for the unaligned-byte case.
- Audit `useg_shadow_mem` next — it had `Info (22567): extracting RAM` at synthesis but didn't survive to fit. Likely culprits there: the `Ch64` / `Ch65` / `Ch70` mirror-write features that turn the simple useg-shadow into a multi-port write structure.

### BRAM-friendly vram sibling (Ch154)

Ch154 adds a hardware-friendly sibling of `vram_stub` — [`rtl/gif_gs/vram_bram_stub.sv`](../../rtl/gif_gs/vram_bram_stub.sv) — that maps cleanly onto Agilex 5 M20K block-RAM. Per Codex's framing, the chapter's blast radius stays narrow: **add the sibling + prove it works + measure the BRAM-inference win**. The actual swap of the board top to use the new module + the writer-side PSMT4 nibble-RMW rework lands in Ch155+.

**`vram_bram_stub` shape vs `vram_stub`**:

| Feature | `vram_stub` (legacy / sim reference) | `vram_bram_stub` (Ch154, hw-friendly) |
|----------------------------|-------------------------------------|----------------------------------------|
| Storage | 8192 × 8-bit byte-addressable | 2048 × 32-bit word-addressable |
| Reads | Combinational; arbitrary alignment | Synchronous (1-cycle); word-aligned only |
| Read ports | 2 (combinational) | 2 (sync, true dual-port M20K) |
| Write granularity | byte-WE + per-bit `write_mask` RMW | byte-WE only |
| Per-bit mask RMW (Ch106) | yes — supports PSMT4 nibble splice | NO — caller must splice on writer side |

**New equivalence TB**: [`tb_vram_bram_stub_equivalence`](../../sim/tb/gif_gs/tb_vram_bram_stub_equivalence.sv).
Drives both DUTs in lockstep with byte-WE-only writes (`write_mask = 0xFFFFFFFF` on the legacy module so the per-bit RMW path is a no-op), aligns sample times across the new module's 1-cycle sync-read latency, and asserts data equivalence across:

- 32-bit word writes (`be=4'b1111`)
- per-byte-lane writes (`be=4'b0001 / 0010 / 0100 / 1000`)
- per-byte non-wrapping admission near MAX_BASE
- dual-port read agreement

PASS standalone + in the full sim regression.

**Quartus experiment `exp_c_vram_bram_stub`** ([synth/.../experiments/exp_c_vram_bram_stub/](../../synth/de25_nano/experiments/exp_c_vram_bram_stub/)) proves the new module infers BRAM cleanly. Side-by-side with the Ch153 baselines, all on the same Agilex 5 part:

| Experiment | Fit | ALMs | Registers | RAM Blocks | Block memory bits |
|------------------------|-----------|------|-----------|------------|-------------------|
| `exp_a_bram_friendly` | ✅ Success | **46** / 46,800 | **0** | **4** / 358 | 65,536 |
| `exp_b_vram_shape` | ❌ Failed | (261,578 comb nodes vs 93,600 device max) | **65,536** | **0** / 358 | 0 |
| `exp_c_vram_bram_stub` | ✅ Success | **190** / 46,800 | **2** | **8** / 358 | 131,072 |

**Interpretation**:

- `exp_c` lands close to `exp_a`'s ideal (190 vs 46 ALMs; 8 vs 4 RAM Blocks). The slight overhead vs `exp_a` is the dual read port (M20K replicates storage to serve two independent read addresses simultaneously, hence 2× block memory bits) plus the per-byte non-wrapping admission gate (Ch95) inherited from `vram_stub`.
- `exp_c` consumes **16× fewer** dedicated registers than it would if `read_data` were reset (2 vs the 32 a reset would require), because the canonical Quartus inference template demands no reset on the BRAM data register.
- vs `exp_b`'s **65,536 registers + 261,578 combinational nodes**, swapping `vram_stub` → `vram_bram_stub` recovers essentially all of the Ch152 ALM headroom on the vram side. Useg-shadow is the next forensic target (likely similar shape).
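The lockstep equivalence idea behind `tb_vram_bram_stub_equivalence` can be sketched in Python with two small behavioural models. This is an illustration under stated assumptions, not a port of the TB: the class names, the little-endian lane mapping (`be` bit 0 selects the byte at `addr + 0`, carried in `data[7:0]`), and the random stimulus are all choices made for this example. Only word-aligned, byte-WE-only writes are driven, mirroring the `write_mask = 0xFFFFFFFF` restriction above.

```python
import random


class ByteVramModel:
    """Legacy-shape model: byte-addressed storage, combinational 32-bit read."""

    def __init__(self, size_bytes=8192):
        self.mem = bytearray(size_bytes)

    def write(self, addr, data, be):
        for lane in range(4):
            if (be >> lane) & 1:
                self.mem[addr + lane] = (data >> (8 * lane)) & 0xFF

    def read(self, addr):
        return int.from_bytes(self.mem[addr:addr + 4], "little")


class WordVramModel:
    """BRAM-shape model: word storage, byte-WE writes, registered read."""

    def __init__(self, size_words=2048):
        self.mem = [0] * size_words
        self.read_data = 0  # valid one clock after the address is presented

    def write(self, addr, data, be):
        if addr & 3:
            return  # misaligned writes fail admission and are dropped
        word = self.mem[addr >> 2]
        for lane in range(4):
            if (be >> lane) & 1:
                mask = 0xFF << (8 * lane)
                word = (word & ~mask) | (data & mask)
        self.mem[addr >> 2] = word

    def clock_read(self, addr):
        self.read_data = self.mem[addr >> 2]


def run_lockstep(n_writes=500, seed=0):
    """Drive both models with identical writes; return the mismatch count."""
    rng = random.Random(seed)
    legacy, bram = ByteVramModel(), WordVramModel()
    touched = set()
    for _ in range(n_writes):
        addr = rng.randrange(0, 8192, 4)
        data = rng.getrandbits(32)
        be = rng.choice([0b1111, 0b0001, 0b0010, 0b0100, 0b1000])
        legacy.write(addr, data, be)
        bram.write(addr, data, be)
        touched.add(addr)
    mismatches = 0
    for addr in sorted(touched):
        bram.clock_read(addr)  # one-cycle latency modelled as an explicit step
        if bram.read_data != legacy.read(addr):
            mismatches += 1
    return mismatches
```

Calling `run_lockstep()` should return `0`. The explicit `clock_read` step is where a real TB has to realign its sample times against the combinational legacy read.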
**Inference template gotcha** (caught + fixed in this chapter): the first cut of `vram_bram_stub` had a reset on `read_data` inside the always_ff block AND an in-bounds gate guarding the `mem` read. Quartus rejected BRAM inference with `Info (276007): RAM logic ... uninferred due to asynchronous read logic`. Fix: simplified the read path to the canonical template (`always_ff @(posedge clk) read_data <= mem[idx];`) and moved bounds + alignment checks to a parallel `read_valid` pipeline. Then `Implemented 64 RAM segments` instead of 0.

**Ch155+ surface — writer-side normalization for ALL sub-32-bit PSMs, not just PSMT4**: `vram_bram_stub`'s contract is stricter than `vram_stub`'s — `write_addr` MUST be word-aligned (`write_addr[1:0] == 2'b00`), and the byte lane(s) being written are selected via `write_be` with the payload pre-shifted into the right byte lane(s) of `write_data[31:0]`. Today's writer-side RTL emits at sub-word boundaries:

- **PSMCT16** raster + image-xfer write at halfword addresses (`write_addr[1] == 1` for the high halfword) with `be=4'b0011` or `4'b1100` and the 16-bit payload in `write_data[15:0]`.
- **PSMT8** raster + image-xfer write at byte addresses (any `write_addr[1:0]`) with `be=4'b0001` and the 8-bit payload in `write_data[7:0]`.
- **PSMT4** raster + image-xfer write at byte addresses with `be=4'b0001` + per-bit `write_mask` 0x0F or 0xF0 to splice one nibble.
- **PSMCT32** raster + image-xfer write at word addresses with `be=4'b1111` + the full 32-bit payload — the ONLY PSM that natively matches `vram_bram_stub`'s contract today.

If we swap the board top to `vram_bram_stub` without writer-side normalization, **CT16/T8/T4 writes silently drop** because `write_addr[1:0] != 0` fails admission. So Ch155 must rework each writer to:

1. Mask `write_addr` down to its word base (`write_addr & ~32'd3`).
2. Shift the payload from its native byte lane into the appropriate byte lane(s) of a 32-bit `write_data` based on the original `write_addr[1:0]`.
3. Generate `write_be` with bits set only for the byte lanes the original sub-word address actually targets.
4. **For PSMT4 specifically**: replace the per-bit `write_mask` nibble splice with a writer-side read-modify-write — read the existing byte first, splice the new nibble in, then issue a normal byte-WE write. Adds ~1 cycle of latency per nibble-write but that's well within the 16×8 demo budget.

The rework lands inside `gs_stub.raster_pixel_emit` (Ch95/Ch105/Ch106 wrote the legacy paths) and `gif_image_xfer_stub`'s per-PSM dispatch. A focused TB that drives sub-word writes through the normalizer and asserts the resulting `vram_bram_stub` words match the legacy `vram_stub` byte-/halfword-/nibble-level state would be the cleanest proof.

**Other Ch155+ work**:

- Update scanout / debug TBs that sample VRAM via vram_stub's combinational reads to handle the 1-cycle sync-read latency (or keep them on `vram_stub` if they're sim-only).
- Swap the Ch146 board top to instantiate `vram_bram_stub` AFTER the writer-side normalization lands. Rerun the full Quartus compile and expect a dramatic ALM/register reduction.
- Audit `useg_shadow_mem` next — Ch64/Ch65/Ch70 mirror-write features may make it multi-port-write-shaped.

### VRAM write normalizer + first BRAM integration (Ch155)

Ch155 lands the writer-side normalization layer that bridges the contract gap between the legacy `vram_stub` (byte-addressed sub-word writes + per-bit RMW) and the new `vram_bram_stub` (word-aligned + byte-WE only). Per Codex's framing the chapter keeps blast radius narrow: build the normalizer + verify it standalone for all 4 PSMs + prove the easiest case (PSMCT32) end-to-end through the new VRAM. RTL plumbing into `gs_stub.raster_pixel_emit` and `gif_image_xfer_stub` lands in Ch156+.
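As a rough illustration of what such a normalizer computes, the four rework steps (mask to the word base, shift the payload, build byte enables, splice the T4 nibble) can be sketched as a behavioral Python model. This is only a sketch, not the project's SystemVerilog `normalize_write`: the string PSM names, the Python signature, and the `apply_write` helper are hypothetical, and the misuse-drop behavior for CT32/CT16 follows the contract stated in this document.

```python
# Behavioral sketch only. Names and signatures here are illustrative
# assumptions, not the RTL in vram_normalize_pkg.sv.

def normalize_write(psm, byte_addr, payload, old_byte=0, nibble_hi=False):
    """Return (word_addr, write_data, write_be) for a word-aligned, byte-WE VRAM."""
    word_addr = byte_addr & ~0x3      # step 1: mask down to the word base
    lane = byte_addr & 0x3            # original sub-word byte position

    if psm == "PSMCT32":
        if lane != 0:                 # misaligned CT32: drop (be = 0000)
            return word_addr, 0, 0b0000
        return word_addr, payload & 0xFFFFFFFF, 0b1111
    if psm == "PSMCT16":
        if byte_addr & 0x1:           # misaligned CT16: drop
            return word_addr, 0, 0b0000
        high = bool(byte_addr & 0x2)  # addr[1] picks the halfword lane
        shift = 16 if high else 0     # step 2: shift payload into its lane
        return word_addr, (payload & 0xFFFF) << shift, 0b1100 if high else 0b0011
    if psm == "PSMT8":
        # step 3: one-hot byte enable keyed on addr[1:0]
        return word_addr, (payload & 0xFF) << (8 * lane), 1 << lane
    if psm == "PSMT4":
        nib = payload & 0xF           # step 4: splice against the byte read earlier
        if nibble_hi:
            new_byte = (nib << 4) | (old_byte & 0x0F)
        else:
            new_byte = (old_byte & 0xF0) | nib
        return word_addr, new_byte << (8 * lane), 1 << lane
    return word_addr, 0, 0b0000       # unknown PSM: no write


def apply_write(mem, word_addr, write_data, write_be):
    """Apply a byte-enabled write to a dict of 32-bit words keyed by word index."""
    index = word_addr >> 2
    word = mem.get(index, 0)
    for lane in range(4):
        if write_be & (1 << lane):
            mask = 0xFF << (8 * lane)
            word = (word & ~mask) | (write_data & mask)
    mem[index] = word & 0xFFFFFFFF
    return mem
```

For example, a CT16 write of `0x6155` at byte address `0x06` comes out as word address `0x04`, data `0x61550000`, byte enables `1100`, so the low halfword of that word is left untouched.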
| Artifact | Purpose |
|----------|---------|
| `rtl/gif_gs/vram_normalize_pkg.sv` | Pure-comb `normalize_write` function — natural byte address + PSM + payload + (T4-only) old_byte → word-aligned write_addr + shifted write_data + write_be. |
| `tb_vram_normalize_write` | Focused unit TB — 17 cases across CT32 / CT16 / T8 / T4 lanes + misuse detection. |
| `rtl/top/top_psmct32_raster_demo_bram.sv` | Sibling of the Ch146 wrapper with `vram_bram_stub` swapped in. |
| `tb_top_psmct32_raster_demo_bram` | Integration TB — drives Ch146 fixtures + verifies VRAM contents at PSMCT32 swizzled addresses via hierarchical probe. |

**Function contract** (`vram_normalize_pkg::normalize_write`):

| PSM | byte_addr alignment | payload bits used | output `write_be` shape | extras |
|-----------|---------------------|-------------------|-------------------------|--------|
| PSMCT32 | word (`addr[1:0]==0`) | `payload[31:0]` (full ABGR) | `4'b1111` | misuse → drop (`be=0000`) |
| PSMCT16 | halfword (`addr[0]==0`) | `payload[15:0]` (RGB5A1) | `4'b0011` (low) / `4'b1100` (high), keyed on `addr[1]` | misuse → drop |
| PSMT8 | byte (any) | `payload[7:0]` (index byte) | one of `4'b0001 / 0010 / 0100 / 1000`, keyed on `addr[1:0]` | — |
| PSMT4 | byte (any) | `payload[3:0]` (nibble) | one of `4'b0001 / 0010 / 0100 / 1000`, keyed on `addr[1:0]` | needs `old_byte` + `nibble_hi`; output is the spliced full byte at the addressed lane |
| any other | — | — | `4'b0000` | — |

**PSMT4 splice math** (the only PSM whose output depends on prior memory state): given `nibble_hi=0`, the function returns `new_byte = {old_byte[7:4], payload[3:0]}` — preserves the upper nibble, replaces the lower. With `nibble_hi=1`, `new_byte = {payload[3:0], old_byte[3:0]}`.
The CALLER is responsible for sourcing `old_byte` via a 1-cycle read of `mem[byte_addr]` upstream of the write; the function itself is purely combinational. The Ch156+ RTL plumbing chapter is where that read pipeline lives inside `gs_stub.raster_pixel_emit` and `gif_image_xfer_stub`. **`top_psmct32_raster_demo_bram` integration result**: the new sibling wrapper substitutes `vram_bram_stub` for `vram_stub`, drops `write_mask` wiring (CT32's `mask=0xFFFFFFFF` makes the per-bit RMW path a no-op so dropping it is functionally equivalent), and accepts the 1-cycle sync-read latency on PCRTC's `vram_read_data` path (so PCRTC scanout is 1-pixel shifted; the integration TB skips frame capture and verifies VRAM content via direct hierarchical probe). All 128 pixel words at canonical PSMCT32 swizzled addresses match expected ABGR. Standalone PASS. **Ch155 critical audit check**: `vram_normalize_write`'s function-level misuse handling pins the contract — passing an unaligned `byte_addr` for CT32 OR CT16 returns `write_be=4'b0000`, which `vram_bram_stub` then drops cleanly. Combined with Codex's stance that "no sub-32-bit writer is allowed to hand an unaligned address directly to vram_bram_stub", the Ch156+ plumbing chapter has a hard contract to verify against. **Ch156+ surface**: - Insert a 1-cycle byte-read pipeline upstream of the PSMT4 raster emit + image-xfer paths inside `gs_stub` and `gif_image_xfer_stub`. The read returns `old_byte` for `normalize_write`'s splice input. - Apply `normalize_write` to all four PSM emit lanes inside both writers. - Add focused TBs for PSMCT16 / PSMT8 / PSMT4 paths analogous to `tb_top_psmct32_raster_demo_bram` — each verifies the swizzled VRAM contents under the new normalizer + bram_stub. - Add a 1-cycle address-stage register inside `gs_pcrtc_stub` so scanout consumers see a clean combinational-look read (`addr` → `data` with the BRAM's internal sync stage hidden). 
- Once all four lanes pass, swap the Ch146 board top to use `vram_bram_stub` directly (or retire `vram_stub` outright).
- Audit `useg_shadow_mem` next — the Ch64/Ch65/Ch70 mirror-write features may make it multi-port-write-shaped, which is its own forensic exercise.

### Writer-side normalize plumbing — CT16 + T8 (Ch156)

Ch156 plumbs the Ch155 `vram_normalize_pkg::normalize_write` function into the BRAM-friendly path so PSMCT16 and PSMT8 raster emits land at the right `vram_bram_stub` byte lane. The chapter intentionally keeps blast radius narrow — the function is wired in at the **wrapper site** between the unmodified writer engines (`gs_stub.raster_pixel_emit`) and `vram_bram_stub`, so the legacy byte-addressable contract on gs_stub's raster emit ports stays exactly as Ch128/Ch134 / etc. defined them. PSMT4 still requires the read-modify-write pipeline and is deferred to Ch157+.

| File / target | Role |
| ------------- | ---- |
| `rtl/top/top_psmct32_raster_demo_bram.sv` | Wrapper updated: `raster_pixel_psm_q` exposed; `bitbltbuf_q[61:56]` provides the PSM during xfer; the muxed `(byte_addr, psm, payload)` triple is run through `vram_normalize_pkg::normalize_write` and the result feeds `vram_bram_stub`. CT32 path remains a passthrough; CT16/T8 paths now write to the right lane. |
| `tb_gs_raster_bram_psmct16` | Focused CT16 integration TB — 16×4 SPRITE at FBP=0/FBW=1, halfword 0x6155. Drives gs_stub#(PSMCT16_SWIZZLE=1) directly; verifies all 64 swizzled halfwords land in `u_vram.mem[byte_addr >> 2]` at the addr[1]-keyed lane; pins the linear-stride separator at byte 0x80 = zero. |
| `tb_gs_raster_bram_psmt8` | Focused PSMT8 integration TB — 16×8 SPRITE at FBP=0/FBW=2, byte index 0xA5. Drives gs_stub#(PSMT8_SWIZZLE=1) directly; verifies all 128 swizzled bytes land in `u_vram.mem[byte_addr >> 2]` at the addr[1:0]-keyed lane. |

**Why wrapper-site, not in-engine**: keeping `gs_stub` and `gif_image_xfer_stub` byte-addressable preserves the contract that every Ch128 / Ch134 / Ch140 swizzle TB (and the legacy `vram_stub`) was written against. Ch156's only structural change is that a top wrapper which targets `vram_bram_stub` also runs `normalize_write` between the writer and VRAM. A future chapter can promote the normalizer into the writer engines once we've decided to retire `vram_stub`; until then the function lives where it can be removed without changing the writers.

**PSMT4 deferral — explicit hard-gate** (Ch156 audit Medium #1 fix; **superseded by Ch157**): when Ch156 closed, the wrapper masked `write_en` off when the active PSM was PSMT4 (`vram_psmt4_block = (vram_psm_pre == PSM_PSMT4)`, `vram_we_mux = vram_we_pre && !vram_psmt4_block`). Without that gate, `normalize_write`'s PSMT4 branch returned a real one-byte write spliced against `old_byte=0`, silently corrupting VRAM on any T4 raster emit. The Ch156 focused TB `tb_gs_raster_bram_psmt4_gate` drove a 16×4 PSMT4 SPRITE through the wrapper-shape gate and asserted (1) raster_pixel_emit pulses fired, (2) every pulse hit the gate (`blocked == emit`), (3) VRAM stayed at sentinel 0xDEADBEEF — zero corruption. **Ch157 retires both the gate and that TB**: the wrapper now runs a real RMW pipeline (see "PSMT4 RMW pipeline" section below) and supplies a live `old_byte` so the splice produces correct bytes. The retired TB's coverage is replaced by `tb_gs_raster_bram_psmt4`, which drives the same kind of PSMT4 SPRITE but verifies *correct* nibble splices instead of *absence* of writes.

**Adversarial coverage on the CT16 / PSMT8 TBs** (Ch156 audit Medium #2 fix): both TBs originally drove a single uniform payload across the whole sprite, so a buggy normalizer that wrote all four byte lanes (or duplicated payload, or stomped neighboring lanes) could still leave every checked pixel matching.
The TBs now split the image into TWO half-width SPRITEs with **distinct** payloads: - `tb_gs_raster_bram_psmct16` drives `(0,0)..(7,3)` with halfword 0x6155 (low halfword lane via PSMCT16 swizzle) and `(8,0)..(15,3)` with halfword 0x9F8E (high halfword lane of the same 32-bit words). Sentinel preload (0xDEADBEEF) on every VRAM word before the drive plus a linear-stride separator check at byte 0x80 (outside the swizzled set). - `tb_gs_raster_bram_psmt8` drives `(0,0)..(7,7)` with byte 0xA5 (lanes {0,1}) and `(8,0)..(15,7)` with byte 0x5A (lanes {2,3}). Same sentinel preload. A normalizer that swaps lanes, sets be too wide, or fails to preserve the other halfword/byte lane(s) of the shared word now surfaces as a per-pixel mismatch. **Sim regression**: 141 PASS / 0 FAIL after the audit fixes (140 + the new `tb_gs_raster_bram_psmt4_gate`). **xfer-side coverage**: `gif_image_xfer_stub` already feeds the wrapper's pre-normalize mux during `xfer_busy`. CT32 TRXDIR uploads (no Ch156 TB exists yet, but the path is wired) pass through the normalizer cleanly because xfer emits CT32 word-aligned. CT16 + T8 xfer TBs that exercise this path are a follow-on item — the wiring is already in place; only a focused TB is missing. **Sim regression**: 140 PASS / 0 FAIL after Ch156 (138 + 2 new BRAM-integration TBs). ### PSMT4 RMW pipeline — `vram_bram_stub` writes enabled (Ch157) Ch157 closes the last writer-PSM gap that Ch156 left behind: the PSMT4 hard-gate is replaced by a wrapper-site read-modify-write pipeline that supplies a LIVE `old_byte` from VRAM, splices the new nibble against it, and commits a full-byte write through `vram_bram_stub`'s byte-WE (no per-bit RMW required). The nibble splice itself uses the SAME math as `vram_normalize_pkg`'s PSMT4 branch (`new = nibble_hi ? 
{nib, old[3:0]} : {old[7:4], nib}`) but lives **inline in the wrapper**, not inside a call to `normalize_write` — the function is pure-comb and would have required `old_byte` to be combinationally available, whereas `vram_bram_stub`'s registered read port hands the byte back one cycle later. The CT32/CT16/T8 paths still call `normalize_write` directly (same-cycle, no read-back required). Goal Codex framed: "all writer PSMs safe before swapping the board top."

**Pipeline shape** (inside [`rtl/top/top_psmct32_raster_demo_bram.sv`](../../rtl/top/top_psmct32_raster_demo_bram.sv)):

```
emit cycle N:
    is_t4_emit=1; vram_read2_addr = byte_addr & ~3;
    pipe_q <= (byte_addr, nibble_hi, nibble[3:0]).
posedge → cycle N+1:
    vram_read2_data = mem[byte_addr] (sync read);
    splice new_byte = nibble_hi ? {nibble, old[3:0]}
                                : {old[7:4], nibble};
    drive vram_we_final=1, write_addr=byte_addr&~3,
          write_data shifted to byte_addr[1:0] lane,
          write_be one-hot to that lane.
posedge → cycle N+2:
    mem[byte_addr] commits new_byte.
```

`old_byte` is sourced from the lane-correct slice of `vram_read2_data`. CT32/CT16/T8 emits skip the pipe entirely and fall through `vram_norm` same-cycle (CT32 stays a passthrough, existing TBs unaffected).

**Forwarding hazard — back-to-back same-byte writes**: a PSMT4 SPRITE rasters adjacent pixels at `x=2k` and `x=2k+1` to the SAME `byte_addr` (low + high nibble of a single byte). At cycle N+1 the wrapper reads `mem[byte_addr]` for emit-2 in the SAME posedge that emit-1's write commits. NBA semantics inside `vram_bram_stub` (separate `always_ff` blocks for the write port and the read port) make the read see the PRE-write value, so emit-2 would splice against stale data. The Ch157 pipe carries a 1-deep `t4_prev_*` register set (addr + new_byte from the just-completed RMW) and forwards `t4_prev_new_byte_q` whenever the in-flight emit's `byte_addr` matches the previous emit's `byte_addr`.
The forwarding chain extends across any number of back-to-back same-byte emits — emit-N reads emit-(N-1)'s `new_byte` from the forward register, splices on top, and emit-(N+1) reads emit-N's new_byte from that same register. | File / target | Role | | ---------------------------------------------------------- | ------------------------------------------------------------------------------------------------------ | | `rtl/top/top_psmct32_raster_demo_bram.sv` | Ch156 hard-gate replaced by the RMW pipe + forwarding registers; `vram_read2_addr` driven on T4 emit cycles; `vram_we_final` mux selects T4 pipe write or non-T4 same-cycle path. | | `tb_gs_raster_bram_psmt4` | New positive-proof TB — drives a 16×4 LINEAR PSMT4 SPRITE (PSMT4_SWIZZLE=0 so adjacent x's hit the same byte) split into two halves with distinct nibbles (0xA / 0x5). 64 raster emits; verifies every byte under the sprite holds the expected pair of spliced nibbles (left half = 0xAA, right half = 0x55) plus sentinel preserved on bytes outside the sprite. **PASS**. | | `tb_gs_raster_bram_psmt4_gate` | Retired — the gate it asserted no longer exists. | **Why LINEAR PSMT4 in the new TB**: the linear address formula `(y*FBW*32) + (x>>1)` puts adjacent x's at the same byte, which is exactly the back-to-back same-byte forwarding hazard. The swizzled path scatters bytes via `columnTable4`, so it touches the forwarding logic less often. Linear coverage is strictly stronger here. **Non-T4 TB cleanup**: `tb_gs_raster_bram_psmct16` and `tb_gs_raster_bram_psmt8` still mirror the *non-T4* portion of the wrapper-site plumbing, but they no longer carry the Ch156 PSMT4 hard-gate (now removed in the wrapper). Both wire `raster_pixel_emit` straight to `write_en` and let `vram_norm` drive addr/data/be — focused TBs verifying their own PSM lane. Full pipe coverage lives in `tb_gs_raster_bram_psmt4` and the top wrapper TB. 
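The back-to-back same-byte hazard and the 1-deep forward register described in this section can be sketched as a small cycle-level Python model. It is a simplification under stated assumptions: memory is modelled as a byte array rather than 32-bit words with lanes, one emit may arrive per cycle, and the function and variable names are hypothetical rather than taken from the wrapper RTL.

```python
# Cycle-level sketch of the PSMT4 read-modify-write pipe with forwarding.
# Assumption: byte-granular memory; the real wrapper uses word + byte lane.

def run_t4_rmw(mem, emits, forward=True):
    """mem: bytearray. emits: list of (byte_addr, nibble_hi, nibble) or None per cycle."""
    pipe = None        # emit captured at the previous clock edge
    read_data = 0      # registered read-port output
    prev = None        # (addr, new_byte) of the RMW that just completed
    for emit in list(emits) + [None, None]:      # extra cycles drain the pipe
        # Combinational phase: splice using registered state only.
        write = None
        if pipe is not None:
            addr, hi, nib = pipe
            old = read_data
            if forward and prev is not None and prev[0] == addr:
                old = prev[1]                    # bypass the stale read
            new = ((nib << 4) | (old & 0x0F)) if hi else ((old & 0xF0) | nib)
            write = (addr, new)
        # Clock edge: the read samples memory BEFORE this edge's write lands.
        next_read = mem[emit[0]] if emit is not None else read_data
        if write is not None:
            mem[write[0]] = write[1]
        prev = write
        read_data = next_read
        pipe = emit
    return mem
```

Driving two emits at the same byte (low nibble `0xA`, then high nibble `0xA`) over a preloaded `0xFF` leaves `0xAA` with forwarding enabled. With `forward=False` the second splice uses the stale `0xFF` and the byte ends up `0xAF`, which is the corruption the forward register exists to prevent.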
**Sim regression**: 141 PASS / 0 FAIL after Ch157 (140 + new `tb_gs_raster_bram_psmt4` − retired `tb_gs_raster_bram_psmt4_gate`). ### PCRTC sync-read alignment (Ch158) Ch158 closes the last big blocker before swapping the board top to `vram_bram_stub`: the PCRTC's data-decode + sync-output pipeline is now aware that `vram_bram_stub`'s `read_data` is registered with 1-cycle latency, so the captured scanout no longer trails the address stage by one column. **`gs_pcrtc_stub` change** (in [`rtl/gif_gs/gs_pcrtc_stub.sv`](../../rtl/gif_gs/gs_pcrtc_stub.sv)): new module parameter `VRAM_SYNC_READ` (default 0). When set to 1, every hcnt/vcnt-derived signal that the data-decode comb consumes is run through a 1-cycle register before the consumer sees it (`active_h_dec`, `active_v_dec`, `in_hsync_dec`, `in_vsync_dec`, `in_display_window_dec`, `scanout_enable_dec`, `dispfb_psm_*_dec`, `psm4_nibble_select_dec`, `end_of_frame_dec`). The address-side (`vram_read_addr`) keeps using the current `(hcnt, vcnt)` so the read is issued one pixel "ahead"; the registered `vram_read_data` arrives one cycle later, paired with the matching delayed counter view. Outputs `r/g/b/hsync/vsync/de` come from the `_dec` signals, so the entire output stream shifts right by exactly one clock when `VRAM_SYNC_READ=1`. Default `VRAM_SYNC_READ=0` is a pure passthrough — every existing PCRTC TB written against the legacy `vram_stub` (comb-read) shape is unaffected. **`top_psmct32_raster_demo_bram` change**: instantiates `gs_pcrtc_stub` with `.VRAM_SYNC_READ(1'b1)`. The wrapper banner updates to drop the Ch155 caveat about scanout being 1 column shifted — that caveat is now resolved. 
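The alignment idea behind `VRAM_SYNC_READ` (issue the address from the current counters, decode the registered data against a one-cycle-delayed counter view) can be illustrated with a toy single-scanline model. This is a hedged sketch, not `gs_pcrtc_stub`: the function name, the `delay_counters` switch, and the one-pixel-per-address memory are assumptions made for illustration.

```python
# Toy scanline model: shows why the decode-side counters need the same
# 1-cycle delay as the registered VRAM read data.

def scan_line(mem, width, sync_read, delay_counters):
    out = [None] * width
    read_q = None      # registered read data (sync-read memory)
    h_q = None         # 1-cycle-delayed column counter
    for h in list(range(width)) + [None]:    # one extra cycle flushes the last pixel
        # Decode stage for this cycle.
        if sync_read:
            data = read_q
            col = h_q if delay_counters else h
        else:
            data = mem[h] if h is not None else None   # combinational read
            col = h
        if data is not None and col is not None:
            out[col] = data
        # Clock edge: memory output register and counter delay update together.
        read_q = mem[h] if h is not None else None
        h_q = h
    return out
```

With a registered read and delayed counters every pixel lands in its own column. With a registered read but undelayed counters the captured line trails the address stage by one column, which matches the scanout caveat this chapter resolves.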
**`tb_top_psmct32_raster_demo_bram` extension**: adds a Phase 2 frame-capture block that arms on the next vsync rising edge after raster drain, captures one full frame's r/g/b into `cap_*[v][h]` indexed by a 1-cycle-delayed copy of PCRTC's address-stage counters (since the registered `de` aligns with those delayed counters), and asserts each captured pixel's post-decode r/g/b matches the expected ABGR for its quadrant. Phase 1 (per-pixel VRAM probe via hierarchical `mem[byte_addr >> 2]`) is unchanged. **PASS** — 16×8 active region, all 128 pixels captured + all 128 VRAM words probe-verified, `frame_seen` latched. **Open Ch159+ items**: - xfer-side T4 coverage TB — the Ch157 wrapper handles xfer-side T4 emits identically (the mux feeds `vram_psm_pre` from `bitbltbuf_q[61:56]` during `xfer_busy`), but no focused TB exercises that path yet. - Swap the Ch146 board top to instantiate `vram_bram_stub` and the Ch158 PCRTC-sync mode directly (or retire `vram_stub` outright). All four writer PSMs and PCRTC scanout are now proven correct against the BRAM-friendly contract; the remaining work is the integration commit on the board side. - Audit `useg_shadow_mem` for the same BRAM-shape forensics that Ch153 ran on `vram_stub` (Ch64/Ch65/Ch70 mirror writes may make it multi-port-write-shaped). **Ch158 audit Medium fix — sub-word PSM lane selection**: the initial Ch158 cut shifted the data-decode pipeline by 1 cycle to align with `vram_bram_stub`'s registered output, but it still extracted CT16 / PSMT8 / PSMT4 sub-word values from the LOW lane of `vram_read_data` (i.e. `[15:0]` halfword and `[7:0]` byte). That worked for `vram_stub` (byte-addressable; the read returns 4 bytes starting at `byte_addr` so the sub-word always lands at the low lane) but NOT for `vram_bram_stub` (word-addressable; `read_data` is `mem[byte_addr >> 2]` so the sub-word lives at lane `byte_addr[1:0]` of the returned word). 
Codex Ch158 audit called this out as a blocker for any sub-word PSM scanout through the BRAM. The fix adds:

- `vram_addr_lane_q` — 1-cycle-delayed copy of `vram_read_addr[1:0]`, paralleling the other `_q` decode-stage registers added in the original Ch158 cut.
- `data_lane = VRAM_SYNC_READ ? vram_addr_lane_dec : 2'd0` — forces the legacy comb-read path to keep using the low lane (preserving every existing PCRTC TB's expectation), and resolves to the correct byte_addr-keyed lane in sync mode.
- `psm16_pixel = data_lane[1] ? read_data[31:16] : read_data[15:0]`.
- A `vram_byte_lane` mux extracting one of 4 byte lanes for PSMT8 (`psm8_idx`) and PSMT4 (`psm4_byte_lane` → nibble splice).

Two new focused integration TBs prove the fix end-to-end with adversarial pre-loads:

| TB | Coverage |
| -- | -------- |
| [`tb_gs_scanout_bram_psmct16`](../../sim/tb/gif_gs/tb_gs_scanout_bram_psmct16.sv) | 4-pixel CT16 scanout reading mem[0]/mem[1] with FOUR distinct halfwords across both halfword lanes (`byte_addr[1]∈{0,1}`); each pixel's captured 5→8-decoded RGB matches the expected halfword. **PASS** |
| [`tb_gs_scanout_bram_psmt8`](../../sim/tb/gif_gs/tb_gs_scanout_bram_psmt8.sv) | 4-pixel PSMT8 scanout reading mem[0] with FOUR distinct byte indices, one per byte lane (`byte_addr[1:0] ∈ {0,1,2,3}`); each pixel's grayscale RGB matches the expected byte. **PASS** |

Without the fix, both TBs would have failed: the CT16 TB would emit the same pair of pixels twice (low halfword of each word), and the PSMT8 TB would emit `IDX_0` for all four pixels.

**Sim regression**: 143 PASS / 0 FAIL after Ch158 audit fixes (141 + 2 new BRAM scanout TBs).
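The sub-word lane selection that the Ch158 audit fix introduces can be sketched in Python as follows. It is a minimal behavioral illustration, assuming string PSM names and a hypothetical `extract_pixel` helper. The `nibble_hi` argument stands in for the PCRTC's nibble select and its polarity (high means upper nibble) is an assumption borrowed from the writer-side splice convention.

```python
# Behavioral sketch of lane-keyed sub-word extraction from a 32-bit read word.

def extract_pixel(word, byte_addr, psm, sync_read=True, nibble_hi=False):
    # Legacy comb-read path always uses the low lane; sync mode keys on addr[1:0].
    lane = (byte_addr & 0x3) if sync_read else 0
    if psm == "PSMCT16":
        # addr[1] selects the halfword lane.
        return (word >> 16) & 0xFFFF if (lane & 0x2) else word & 0xFFFF
    byte = (word >> (8 * lane)) & 0xFF
    if psm == "PSMT8":
        return byte
    if psm == "PSMT4":
        return (byte >> 4) if nibble_hi else (byte & 0xF)
    return word & 0xFFFFFFFF      # PSMCT32: the whole word
```

Reading the word `0x9F8E6155` as CT16 at byte address 2 returns `0x9F8E` in sync mode, while forcing the low lane returns `0x6155` again, which is the repeated-pixel symptom described above.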
### Board-top swap to BRAM wrapper + Quartus fit recovery (Ch159)

Ch159 commits the integration step that the prior chapters were building toward: the DE25-Nano board top ([`rtl/top/de25_nano_psmct32_raster_demo_top.sv`](../../rtl/top/de25_nano_psmct32_raster_demo_top.sv)) now instantiates [`top_psmct32_raster_demo_bram`](../../rtl/top/top_psmct32_raster_demo_bram.sv) instead of the legacy [`top_psmct32_raster_demo`](../../rtl/top/top_psmct32_raster_demo.sv). External port shape is identical so this is drop-in at the board level; the BRAM-backed wrapper carries through every Ch155-Ch158 fix (writer-side normalize + PSMT4 RMW pipe + PCRTC sync-read alignment + sub-word lane select). The synth file list ([`synth/de25_nano/top_psmct32_raster_demo/files.f`](../../synth/de25_nano/top_psmct32_raster_demo/files.f)) and Quartus QSF gain `vram_normalize_pkg.sv`, `vram_bram_stub.sv`, and `top_psmct32_raster_demo_bram.sv`; the legacy `vram_stub` + legacy top stay on the project for back-compat with sim TBs that still target them.

**Quartus fit recovery — vs Ch152 baseline**: the headline of this chapter. Ch152 fit FAILED at 155k ALMs needed (331% of the 46,800 available) because `vram_stub`'s 8 KiB byte-addressable + per-bit-RMW storage didn't infer as M20K and landed as a 65,536-flip-flop array, dragging 121k registers and 199k synthesis ALMs along with it. The Ch159 swap turns those numbers around:

| Metric | Ch152 (vram_stub) | Ch159 (vram_bram_stub) | Δ |
| ------ | ----------------- | ---------------------- | - |
| Synthesis status | Successful | Successful | — |
| Synthesis ALMs estimate | 199,103 / 46,800 (425%) | **22,704 / 46,800 (49%)** | −176,399 (**−88.6%**) |
| Synthesis registers | 101,457 | **36,008** | −65,449 (**−64.5%**) |
| **Fit status** | **FAILED** (155k needed / 331%) | **Successful** (30,364 / 65%) | ✅ **fits** |
| Fit registers | 121,176 | **39,085** | −82,091 (**−67.7%**) |
| Fit RAM blocks | 6 / 358 | **14 / 358** | +8 (BRAM-inferred VRAM) |
| Fit block memory bits | 65,536 | **196,608** | +131,072 (data in M20K) |
| Fit DSP blocks | 20 | 18 | −2 |
| **STA status** | **DID NOT RUN** (fit failed) | **Successful** (12 warnings) | ✅ STA reachable |
| STA setup slack worst (CLOCK2_50) | n/a | −6.950 ns | timing miss at 50 MHz |
| Fmax | n/a | 37.11 MHz | (Ch160+ tunes) |

The eight new RAM blocks are the same `vram_bram_stub` footprint exp_c proved in Ch154 (8 RAM blocks for the dual-port + admission-gated 8 KiB shape; the +6 already in the Ch152 baseline came from `bios_rom_stub` + `ee_ram_stub` + `useg_shadow_mem` correctly inferring as BRAM there). The register drop (121k → 39k) is essentially the entire VRAM flip-flop array vanishing.

**Setup-slack reality check**: STA reports −6.950 ns slack at the CLOCK2_50 50 MHz constraint (Fmax = 37.11 MHz). The critical path is somewhere in the Ch123 dep tree's longer combinational chains (likely the Gouraud divider or one of the swizzle muxers). That is **NOT** a Ch159 regression — it's a brand-new visibility unlocked by being able to run STA at all. Ch160+ owns timing closure (PLL down-clock to ≤30 MHz, critical-path pipelining, or both).
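The relationship between the reported slack and Fmax is plain arithmetic: the longest path takes the constrained period minus the worst setup slack. The sketch below reproduces the Ch159 figures from the table. The helper names are hypothetical, and the simple formula ignores clock skew and uncertainty terms that a real STA report also folds in.

```python
# Worked arithmetic for the Ch159 numbers. Helper names are illustrative only.

def fmax_mhz(period_ns, worst_setup_slack_ns):
    """Fmax implied by a clock period and the worst setup slack under it."""
    critical_path_ns = period_ns - worst_setup_slack_ns
    return 1000.0 / critical_path_ns

def utilization_pct(used, available):
    """Resource use as a percentage of what the device provides."""
    return 100.0 * used / available

ch159_fmax = fmax_mhz(20.000, -6.950)          # 50 MHz constraint, -6.950 ns slack
ch159_alm = utilization_pct(30_364, 46_800)    # fitted ALMs
ch152_alm = utilization_pct(199_103, 46_800)   # Ch152 synthesis estimate

print(round(ch159_fmax, 2))   # 37.11
print(round(ch159_alm))       # 65
print(round(ch152_alm))       # 425
```

A 26.950 ns critical path also shows why a down-clock to 30 MHz (33.333 ns period) is a plausible first closure target.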
**Snapshots preserved**: the Ch152 baseline reports are saved under [`synth/de25_nano/top_psmct32_raster_demo/baseline_ch152/`](../../synth/de25_nano/top_psmct32_raster_demo/baseline_ch152/) (syn / fit summaries + flow.rpt + parse_report.txt) so future chapters can diff against them without re-running the failing Ch152 baseline. **Sim regression**: 143 PASS / 0 FAIL unchanged. The Ch149 board-wrapper TB exercises the same external behavior with the new core wrapper inside. ### Down-clock target + first .sof bitstream (Ch160) Ch160 closes the loop Codex framed at the end of Ch159 — "first add a down-clock PLL profile so we can get a real bitstream moving on hardware, then use the successful STA path report to decide whether to pipeline toward 50 MHz." The chapter is SDC- and build-flow-only; no RTL changes. **SDC retarget** ([`synth/de25_nano/top_psmct32_raster_demo/de25_nano_psmct32_raster_demo_top.sdc`](../../synth/de25_nano/top_psmct32_raster_demo/de25_nano_psmct32_raster_demo_top.sdc)) relaxes the CLOCK2_50 period from 20.000 ns (50 MHz) to 33.333 ns (30 MHz). The DE25-Nano's CLOCK2_50 oscillator is physically still 50 MHz; the SDC tells Quartus to ASSUME a 30 MHz input so the fitter closes timing at the down-clock target. A real PLL `.ip` that divides 50 → 30 MHz on hardware is the Ch161+ commit (the QSF's commented-out `QIP_FILE` swap-point is staged for it). Until then, the .sof produced under this constraint is structurally clean for 30 MHz operation; programming it on a board where CLOCK2_50 is still wired straight through gives an effective 50 MHz chip clock that may show setup-violating behavior — Ch161 closes that gap. **`build_quartus.sh` adds `quartus_asm`** ([`synth/de25_nano/top_psmct32_raster_demo/build_quartus.sh`](../../synth/de25_nano/top_psmct32_raster_demo/build_quartus.sh)) gated on a clean STA, so a `.sof` bitstream is now produced when the design fits and timing closes. 
The Make scaffold check is loosened to accept either the 20.000 ns (legacy 50 MHz) or the 33.333 ns (Ch160 down-clock) period.

**Quartus result vs Ch159**:

| Metric | Ch159 (50 MHz target) | Ch160 (30 MHz target) |
|--------|-----------------------|-----------------------|
| Synth ALMs estimate | 22,704 / 46,800 (49 %) | 22,704 / 46,800 (49 %) |
| Synth registers | 36,008 | 36,008 |
| Fit status | Successful | Successful |
| Fit ALMs | 30,364 / 46,800 (65 %) | 31,056 / 46,800 (66 %) |
| Fit registers | 39,085 | 37,381 |
| Fit RAM blocks | 14 / 358 | 14 / 358 |
| **STA setup slack worst** | **−6.950 ns** (timing miss) | **+0.805 ns** (closes) |
| **Fmax (CLOCK2_50)** | 37.11 MHz | 30.74 MHz |
| **`quartus_asm`** | (skipped) | **Successful — `.sof` produced** |

The synth-side numbers are identical because no RTL changed — the differences are entirely in the fitter's placement choices under the looser timing constraint. Fmax dropped slightly (37.11 → 30.74 MHz) because Quartus optimizes harder when the target is tighter; the headline is that **at the 30 MHz target the design CLOSES** (positive slack on every report) and a real `.sof` is now generated.

**Critical path** (from [`output_files/de25_nano_psmct32_raster_demo_top.sta.rpt`](../../synth/de25_nano/top_psmct32_raster_demo/output_files/de25_nano_psmct32_raster_demo_top.sta.rpt), worst-10 paths all in the same module hierarchy):

| Field | Value |
|-------|-------|
| Slack | +0.805 ns (worst path of 10 with this slack value) |
| From / To | `u_demo\|u_core\|div_0_rtl_0\|auto_generated\|divider\|divider\|...` (intra-divider register-to-register) |
| Data Delay | 32.516 ns (out of 33.333 ns period) |
| Critical net | The EE core's auto-generated 64-bit signed divider (the Ch152-noted Gouraud TRI divider — dead code in the PSMCT32 raster demo because no `RM_TRI` primitive is dispatched). |

**Ch161+ pipelining handoff**: the path Codex's framing asked us to surface is now visible. Two options:

1. **Pipeline the divider** — re-implement `ee_core`'s 64-bit division as an N-cycle multi-cycle path. Quartus's auto-generated divider is a single-cycle ripple chain; making it 2-3 stage pipelined would put Fmax comfortably above 50 MHz.
2. **Strip it from the build** — gate the Gouraud TRI divider behind a `STRIP_GOURAUD_TRI` parameter (default off), so the PSMCT32 raster demo's hardware build instances the EE core without it. Quartus removes the entire `div_0_rtl_0` block; Fmax should jump dramatically.

Option 2 is the lower-blast-radius hardware bring-up move (removes ~32 ns of dead-code combinational chain); option 1 is the long-term correct fix once the Gouraud TRI path goes load-bearing.

**Snapshots**: Ch159 baseline reports preserved under [`baseline_ch159/`](../../synth/de25_nano/top_psmct32_raster_demo/baseline_ch159/) (syn / fit / sta summaries + parse_report).

**Sim regression**: 143 PASS / 0 FAIL unchanged (no RTL changes). Scaffold check + Ch149 board TB + top BRAM TB all green under the new SDC.

### Real PLL IP commit — `.sof` actually runs at 30 MHz (Ch161)

Ch161 retires the Ch160 hardware-honesty caveat by committing a real Quartus IOPLL `.ip` configured for 50 MHz refclk → 30 MHz outclk_0. The wrapper's `\`ifdef USE_PLL_IP` (staged in Ch151) now flips to the IP-generated `pll` module on Quartus builds; sim TBs continue to use the pass-through `de25_nano_pll_stub`.

**Files committed under [`synth/de25_nano/top_psmct32_raster_demo/ip/`](../../synth/de25_nano/top_psmct32_raster_demo/ip/)**:

- `pll.ip` — adapted from `retroDE_nes/ip/audio_pll.ip` (single-output Agilex 5 IOPLL template), retargeted to 50 MHz refclk → 30 MHz outclk_0.
- `pll/pll.qip` + `pll/synth/pll.v` + `pll/pll_bb.v` — Quartus IP-generated artifacts (`quartus_ipgenerate de25_nano_psmct32_raster_demo_top --ip_file=ip/pll.ip --generate_ip_file --synthesis=verilog`).
The generated `pll` module exposes `(refclk, rst, outclk_0, locked)` — exactly the Ch151 stub's signature, so the `\`ifdef` swap is drop-in.

**Wiring changes**:

- `de25_nano_psmct32_raster_demo_top.qsf` uncommented the `set_global_assignment -name QIP_FILE ip/pll/pll.qip` swap-point and added `set_global_assignment -name VERILOG_MACRO "USE_PLL_IP=1"` so Quartus instantiates the IP `pll` instead of the `de25_nano_pll_stub`.
- `de25_nano_psmct32_raster_demo_top.sdc` reverted the Ch160 CLOCK2_50 period back to 20.000 ns (the physical 50 MHz oscillator). The IOPLL's auto-generated SDC inside the .qip declares the post-PLL `outclk_0` clock at 30 MHz, so STA picks up two domains: `u_pll|iopll_0_refclk` (50 MHz, the pin) and `u_pll|iopll_0_outclk0` (30 MHz, the design clock).
- `build_quartus.sh` symlinks the `ip/` dir alongside the existing `rtl/` and `sim/` symlinks so the QIP_FILE's `ip/pll/pll.qip` path resolves from the work dir.

**Quartus result vs Ch160**:

| Metric | Ch160 (SDC profile only) | Ch161 (real PLL IP) |
|--------|--------------------------|---------------------|
| Fit ALMs | 31,056 / 46,800 (66 %) | 30,898 / 46,800 (66 %) |
| Fit registers | 37,381 | 37,352 |
| **Fit PLLs** | **0 / 11** | **1 / 11** (real IOPLL) |
| RAM blocks | 14 / 358 | 14 / 358 |
| Setup slack worst (design_clk) | +0.805 ns @ CLOCK2_50 | **+0.565 ns @ u_pll\|iopll_0_outclk0** |
| Fmax (design_clk) | 30.74 MHz | **30.74 MHz** |
| `quartus_asm` | Successful | Successful (`.sof` produced) |

The `+1` PLL block is the real IOPLL on the chip; ALMs go down slightly because the stub's clock-distribution path no longer needs ALM glue. STA now reports BOTH clock domains: the refclk (50 MHz, +19.249 ns slack — trivially fast) and the design_clk (30 MHz post-PLL, +0.565 ns slack — comfortable margin).
The `.sof` produced under this configuration **genuinely runs at 30 MHz on the DE25-Nano**: the IOPLL takes the 50 MHz CLOCK2_50 input and divides to 30 MHz inside the chip, so the entire design downstream of `u_pll.outclk_0` operates at the constrained frequency. (Setup slack landed at +0.914 ns on the initial Ch161 build; the Ch161 audit's wider reset false-path nudged the fitter into a slightly different placement, dropping the worst-case setup slack to +0.565 ns. Recovery analysis on the rst_sync stages — which had been hiding a real -0.079 ns violation under the original `*rst_sync[0]` constraint — is now gone from the .sta.summary entirely after the false-path was widened to `*rst_sync[*]`.) **Snapshots**: Ch160 baseline (parse_report + summaries + `.sof`) preserved under [`baseline_ch160/`](../../synth/de25_nano/top_psmct32_raster_demo/baseline_ch160/). **Open Ch162+ items** (Ch161 forward-ref, **superseded by Ch162 below**): - ~~Pipeline or strip the EE-core 64-bit Gouraud TRI divider~~ — **closed in Ch162** via `STRIP_HW_DIVIDER` (note: the actual divider is the Ch43 DIVU divider, not Gouraud TRI; the forward-ref's name was loose). The Ch162 strip retired the `u_demo|u_core|div_0_rtl_0|...` STA worst path entirely; see the Ch162 section below for the new critical path. - xfer-side T4 coverage TB (open from Ch157+). - `useg_shadow_mem` BRAM-shape forensics. - Video PHY shim (HDMI / VGA / PMOD) — `VIDEO_*` pins virtualized. **Sim regression**: 143 PASS / 0 FAIL unchanged. Sim ignores the `\`ifdef USE_PLL_IP` (no `+define+USE_PLL_IP` in the iverilog Makefile) so the stub stays active under sim. ### Strip the EE-core hardware divider (Ch162) Ch162 takes the lower-blast move from the Ch161 STA handoff: add a parameter that gates the EE-core's auto-inferred 32-bit hardware divider out of synthesis on the PSMCT32 SPRITE-only hardware build, then re-measure Fmax. 
**RTL change** ([rtl/ee/ee_core_stub.sv](../../rtl/ee/ee_core_stub.sv)) gains `parameter bit STRIP_HW_DIVIDER = 1'b0`. Two `/` and `%` sites tied to the Ch43 DIVU instruction are gated by this parameter: the writeback (lines ~932-935) and the retire-trace `arg3` mirror (lines ~1005-1014). Default `0` keeps DIVU semantics intact for every existing sim TB (`tb_ee_core_divu_mflo` is the only consumer; it stays at the default). When the parameter is `1`, the writeback becomes a no-op (HI/LO unchanged, identical to the divisor==0 case the spec calls undefined) and the retire-trace `arg3` reports 0. Quartus then has nothing to infer, and the `div_0_rtl_0` block disappears.

**Wrapper plumbing**: [`top_psmct32_raster_demo_bram`](../../rtl/top/top_psmct32_raster_demo_bram.sv) gains a matching `STRIP_HW_DIVIDER` parameter and forwards it to `ee_core_stub`. The [DE25-Nano board top](../../rtl/top/de25_nano_psmct32_raster_demo_top.sv) sets `.STRIP_HW_DIVIDER(1'b1)` on its `u_demo` instantiation (the bootlet doesn't execute DIVU, so this is behavior-neutral for the demo). Sim TBs that instantiate the BRAM wrapper directly use the default 0.

**Quartus result vs Ch161 (real-PLL baseline)**:

| Metric                         | Ch161 (real PLL)       | Ch162 (real PLL + strip)     |
|--------------------------------|------------------------|------------------------------|
| Fit ALMs                       | 30,898 / 46,800 (66 %) | 30,006 / 46,800 (64 %)       |
| Fit registers                  | 37,352                 | 36,618                       |
| Fit PLLs                       | 1                      | 1                            |
| RAM blocks                     | 14                     | 14                           |
| **Setup slack worst (design)** | +0.565 ns              | **+3.567 ns**                |
| **Fmax (design domain)**       | 30.74 MHz              | **33.6 MHz** (+9.4 %)        |
| `quartus_asm`                  | Successful             | Successful (`.sof` produced) |

Stripping the divider freed 892 ALMs / 734 registers and yielded ~3 ns of new setup margin. **Fmax climbs from 30.74 MHz to 33.6 MHz**: a real jump, but **not enough to clear the 50 MHz target** (which would need a +67 % jump).
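As a rough cross-check of the Ch162 column, Fmax can be estimated from the constrained period and the worst setup slack as `1 / (period - slack)`. The sketch below is a first-order approximation, not how Quartus derives its reported figure, and the helper names are illustrative only. It also suggests that the +67 % figure is measured from the 30 MHz constraint rather than from the achieved Fmax.

```python
def fmax_mhz(period_ns: float, setup_slack_ns: float) -> float:
    """First-order Fmax estimate: the shortest period the worst path could still meet."""
    return 1000.0 / (period_ns - setup_slack_ns)


def required_jump_pct(from_mhz: float, to_mhz: float) -> float:
    """Percentage increase needed to go from one frequency to another."""
    return (to_mhz / from_mhz - 1.0) * 100.0


PERIOD_30MHZ_NS = 33.333  # constrained design-clock period from the STA report

ch162_est = fmax_mhz(PERIOD_30MHZ_NS, 3.567)
print(f"Ch162 estimate:       {ch162_est:.1f} MHz")
print(f"30 MHz -> 50 MHz:     +{required_jump_pct(30.0, 50.0):.0f} %")
print(f"{ch162_est:.1f} MHz -> 50 MHz:   +{required_jump_pct(ch162_est, 50.0):.0f} %")
```

Under these assumptions the estimate agrees with the 33.6 MHz in the table, and the remaining gap from the achieved Fmax to 50 MHz comes out near 49 %.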
Codex's Ch162 framing predicted this branch: "if Fmax jumps, we have a clean path to a 50 MHz demo bitstream; if not, the next real critical path will reveal itself." We landed in the second branch: Fmax jumped, but not far enough.

**New critical path** (the Ch163+ handoff, from [`output_files/de25_nano_psmct32_raster_demo_top.sta.rpt`](../../synth/de25_nano/top_psmct32_raster_demo/output_files/de25_nano_psmct32_raster_demo_top.sta.rpt)):

| Field      | Value |
|------------|-------|
| Slack      | +3.567 ns |
| From       | `u_demo\|u_pcrtc\|div_1_rtl_0\|auto_generated\|divider\|divider\|...` (PCRTC magnification divider) |
| To         | `u_demo\|u_vram\|mem_rtl_0\|auto_generated\|altera_syncram_impl1\|ram_block2a15~reg0` (VRAM port input) |
| Data delay | 38.443 ns of arrival vs 42.010 ns required (period 33.333 ns + clock skew + uncertainty) |

The PCRTC divider comes from [`gs_pcrtc_stub.sv`](../../rtl/gif_gs/gs_pcrtc_stub.sv) lines:

```
assign vram_x_unshift = {20'd0, hwin_rel} / hmag_factor;
assign vram_y_unshift = {20'd0, vwin_rel} / vmag_factor;
```

where `hmag_factor = MAGH + 1` and `vmag_factor = MAGV + 1`. For the demo `MAGH = MAGV = 0`, so the divisor is constant 1, but Quartus doesn't constant-propagate through this formulation and synthesizes a real 32-bit divider anyway. The parallel Ch162 fix shape would be a `STRIP_PCRTC_MAG_DIV` parameter (or a more general "demo doesn't use magnification" hint that bypasses the divider when both MAGH and MAGV are constant 0).

**Snapshots**: Ch161 baseline preserved under [`baseline_ch161/`](../../synth/de25_nano/top_psmct32_raster_demo/baseline_ch161/) (syn / fit / sta summaries + parse_report + .sof) for diff.

**Open Ch163+ items**:

- Strip the PCRTC magnification divider on hardware builds (next critical path; same shape as Ch162's `STRIP_HW_DIVIDER`).
- Once Fmax climbs north of 50 MHz, retune the IOPLL `.ip` to outclk_0 = 50 MHz, retarget the SDC, and ship a 50 MHz bitstream.
- xfer-side T4 coverage TB (still open from Ch157+).
- `useg_shadow_mem` BRAM-shape forensics.
- Video PHY shim (HDMI / VGA / PMOD); `VIDEO_*` pins virtualized.

**Sim regression**: 143 PASS / 0 FAIL unchanged. Default `STRIP_HW_DIVIDER=0` preserves DIVU semantics for `tb_ee_core_divu_mflo`; the board top's `STRIP_HW_DIVIDER=1` goes through `tb_de25_nano_psmct32_raster_demo_top` cleanly because the Ch149 board TB doesn't exercise DIVU.

### Strip PCRTC magnification divider + 50 MHz close (Ch163)

Ch163 takes the next critical-path attack from the Ch162 STA report (the PCRTC magnification divider) and uses the resulting Fmax headroom to retune the PLL IP to 50 MHz output, closing the journey that started at the Ch152 fit failure with a real 50 MHz bitstream.

**RTL change** ([rtl/gif_gs/gs_pcrtc_stub.sv](../../rtl/gif_gs/gs_pcrtc_stub.sv)) gains `parameter bit STRIP_PCRTC_MAG_DIV = 1'b0`. The two `/` operators are gated:

```
assign vram_x_unshift = STRIP_PCRTC_MAG_DIV
    ? {20'd0, hwin_rel}
    : ({20'd0, hwin_rel} / hmag_factor);
assign vram_y_unshift = STRIP_PCRTC_MAG_DIV
    ? {20'd0, vwin_rel}
    : ({20'd0, vwin_rel} / vmag_factor);
```

Default `0` keeps the live divider math for every Ch93-era magnification scanout TB (`tb_gs_scanout_magh_magv` etc.). When `1`, the math collapses to a passthrough, equivalent to the MAGH=MAGV=0 case the demo always hits but with no inferred divider for Quartus to synthesize.

**Wrapper plumbing**: [`top_psmct32_raster_demo_bram`](../../rtl/top/top_psmct32_raster_demo_bram.sv) gains a matching `STRIP_PCRTC_MAG_DIV` parameter that forwards to `gs_pcrtc_stub`. The [DE25-Nano board top](../../rtl/top/de25_nano_psmct32_raster_demo_top.sv) sets `.STRIP_PCRTC_MAG_DIV(1'b1)` on its `u_demo` instantiation.
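The equivalence claim behind `STRIP_PCRTC_MAG_DIV` can be checked with a small behavioral model. This is a sketch in Python, not the RTL: `vram_coord` is a hypothetical helper, and the 12-bit coordinate range assumes `hwin_rel` / `vwin_rel` fill the 32-bit expression after the 20-bit zero extension.

```python
def vram_coord(win_rel: int, mag: int, strip_div: bool = False) -> int:
    """Model of the vram_x_unshift / vram_y_unshift expression.

    win_rel   -- window-relative coordinate (hwin_rel or vwin_rel)
    mag       -- MAGH or MAGV register field; the divisor is mag + 1
    strip_div -- models STRIP_PCRTC_MAG_DIV = 1 (passthrough, no divider)
    """
    if strip_div:
        return win_rel
    return win_rel // (mag + 1)


# With MAGH = MAGV = 0 the divisor is 1, so both forms agree everywhere.
assert all(
    vram_coord(x, 0) == vram_coord(x, 0, strip_div=True) for x in range(1 << 12)
)

# With real magnification they differ, which is why the parameter defaults to 0.
print(vram_coord(10, 1), vram_coord(10, 1, strip_div=True))  # 5 10
```

The second line illustrates why the magnification scanout TBs must keep the default: with `MAGH = 1` the stripped form would return the unscaled coordinate.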
**Quartus result, two stages**:

*Stage A: strip @ 30 MHz target (still on the Ch161 PLL .ip)*:

| Metric            | Ch162 (strip EE divider only) | Ch163 (strip both, 30 MHz) |
|-------------------|-------------------------------|----------------------------|
| Fit ALMs          | 30,006 / 46,800 (64 %)        | 27,216 / 46,800 (58 %)     |
| Setup slack worst | +3.567 ns                     | +21.113 ns                 |
| **Fmax (design)** | 33.6 MHz                      | **81.83 MHz** (+143 %)     |

The Ch163 strip alone freed +17.5 ns of margin and 2,790 ALMs, large enough to clear 50 MHz outright. Codex's Ch162 framing predicted both branches of the if-Fmax-jumps fork; Ch163 lands in the **first** branch ("clean path to a 50 MHz demo bitstream").

*Stage B: retune PLL .ip from 30 MHz → 50 MHz output*:

The `pll.ip` source's `gui_output_clock_frequency0` and `gui_output_clock_frequency_ps0` are bumped (30.0 → 50.0 MHz; 33333.333 → 20000.0 ps). `quartus_ipgenerate` rebuilds the .qip / synth files in-place. No SDC change needed: CLOCK2_50 stays pinned at the physical 50 MHz period; the IOPLL's auto-generated SDC declares the new outclk_0 frequency.

| Metric                | Ch163 strip @ 30 MHz target | Ch163 strip @ 50 MHz target |
|-----------------------|-----------------------------|-----------------------------|
| Fit ALMs              | 27,216 / 46,800 (58 %)      | 27,543 / 46,800 (59 %)      |
| RAM blocks / PLLs     | 14 / 1                      | 14 / 1                      |
| **Setup slack worst** | +21.113 ns                  | **+7.500 ns**               |
| **Fmax (design)**     | 81.83 MHz                   | **80.0 MHz**                |
| `.sof` produced       | yes (30 MHz run on hw)      | **yes, 50 MHz on hw**       |

**The .sof produced under Stage B genuinely runs at 50 MHz on the DE25-Nano**: the IOPLL takes 50 MHz CLOCK2_50 in and emits 50 MHz outclk_0 (effectively a 1:1 relation through the real PLL hardware so the chip's clock distribution still goes through the IOPLL's clock network). All 8 timing classes positive; no recovery violations; build gate Successful.
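The two `pll.ip` values that Stage B bumps are the same quantity in different units, so the picosecond figure can be derived rather than typed by hand. A minimal sketch of that conversion follows; `period_ps` is an illustrative helper, not part of the Quartus tooling, and the headroom line uses the Stage B Fmax quoted above.

```python
def period_ps(freq_mhz: float) -> float:
    """Clock period in picoseconds for a frequency given in MHz."""
    return 1e6 / freq_mhz


# Values for gui_output_clock_frequency0 -> gui_output_clock_frequency_ps0.
for mhz in (30.0, 50.0):
    print(f"{mhz:.1f} MHz -> {period_ps(mhz):.3f} ps")

# Headroom of the Stage B Fmax (80.0 MHz) over the new 50 MHz target.
headroom_pct = (80.0 / 50.0 - 1.0) * 100.0
print(f"Fmax headroom over target: {headroom_pct:.0f} %")
```

The printed periods match the 33333.333 ps and 20000.0 ps quoted for the retune.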
**Snapshots**:

- [`baseline_ch162/`](../../synth/de25_nano/top_psmct32_raster_demo/baseline_ch162/): Ch162 30 MHz state with EE divider stripped only.
- [`baseline_ch163_30mhz/`](../../synth/de25_nano/top_psmct32_raster_demo/baseline_ch163_30mhz/): Ch163 strip-both at 30 MHz target (Stage A milestone).

**Open Ch164+ items** (the project has hit the major hardware milestone Codex called out at Ch157+; Ch164+ is post-launch):

- xfer-side T4 coverage TB (open from Ch157+).
- `useg_shadow_mem` BRAM-shape forensics.
- Video PHY shim (HDMI / VGA / PMOD); `VIDEO_*` pins still virtualized. This is the next big front-end deliverable before the demo can paint a real screen.

**Sim regression**: 143 PASS / 0 FAIL unchanged. Default-off on `STRIP_PCRTC_MAG_DIV` preserves every Ch93 magnification scanout TB; the board top's `STRIP_PCRTC_MAG_DIV=1` propagates cleanly through `tb_de25_nano_psmct32_raster_demo_top` since the demo locks MAGH=MAGV=0.

### HDMI pin shim — pixels off-chip (Ch164)

Ch164 is the first video-PHY chapter. Codex's framing was "small PHY shim chapter, not a full display-stack leap. Get pixels off-chip before making them pretty." Replace the abstract `VIDEO_R/G/B/HSYNC/VSYNC/DE` virtual pins with real DE25-Nano HDMI transmitter signals; the ADV7513 chip itself stays asleep (its I²C wake-up FSM is the Ch165+ chapter), so the bitstream makes the FPGA pins toggle correctly but a real monitor stays dark until Ch165 lands.

**Wrapper change** ([rtl/top/de25_nano_psmct32_raster_demo_top.sv](../../rtl/top/de25_nano_psmct32_raster_demo_top.sv)): five new top-level outputs added: `HDMI_TX_CLK` (= `design_clk`, the 50 MHz pixel clock), `HDMI_TX_D[23:0]` packing `{VIDEO_R, VIDEO_G, VIDEO_B}` (R in MSBs, ADV7513 default 24-bit RGB), and `HDMI_TX_HS / HDMI_TX_VS / HDMI_TX_DE` mirroring the abstract VIDEO_* signals. The VIDEO_* ports are kept on the wrapper as `VIRTUAL_PIN ON` (the Ch149 board TB references them via hierarchical probe).
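The `{VIDEO_R, VIDEO_G, VIDEO_B}` concatenation can be illustrated with a small bit-packing model. This sketch assumes each VIDEO channel is 8 bits wide (24 bits split three ways); `pack_hdmi_tx_d` is a hypothetical helper used only for illustration.

```python
def pack_hdmi_tx_d(r: int, g: int, b: int) -> int:
    """Pack 8-bit channels as {R, G, B}: R in bits [23:16], G in [15:8], B in [7:0]."""
    for name, value in (("r", r), ("g", g), ("b", b)):
        if not 0 <= value <= 0xFF:
            raise ValueError(f"{name} must fit in 8 bits, got {value}")
    return (r << 16) | (g << 8) | b


# R lands in the most significant byte, matching the "R in MSBs" ordering.
print(f"{pack_hdmi_tx_d(0x12, 0x34, 0x56):06X}")  # 123456
```

A pure-red pixel therefore appears as `0xFF0000` on the 24-bit bus under this ordering.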
**QSF change** ([synth/.../de25_nano_psmct32_raster_demo_top.qsf](../../synth/de25_nano/top_psmct32_raster_demo/de25_nano_psmct32_raster_demo_top.qsf)): HDMI pinout sourced from [`retroDE_nes/retroDE_nes.qsf`](../../../retroDE_nes/retroDE_nes.qsf) for the same DE25-Nano (Terasic Agilex 5) board: `HDMI_TX_CLK` on `PIN_DJ24` with 1.1-V IO standard (matches the on-board level shifter), data + sync pins on 3.3-V LVCMOS. The companion ADV7513 control pins (`HDMI_I2C_SCL`, `HDMI_I2C_SDA`, `HDMI_TX_INT`, `HDMI_MCLK`) are intentionally NOT pinned; the chip stays in standby on power-up and ignores its 24-bit RGB input until the I²C wake-up FSM lands in Ch165+.

**SDC change** ([synth/.../de25_nano_psmct32_raster_demo_top.sdc](../../synth/de25_nano/top_psmct32_raster_demo/de25_nano_psmct32_raster_demo_top.sdc)): `set_false_path -to` for each HDMI output port. Proper `set_output_delay` constraints with respect to a generated `HDMI_TX_CLK` domain land alongside the Ch165+ wake-up FSM, when the ADV7513's actual setup/hold window comes out of the chip's datasheet pass.

**Scaffold-check extension** ([sim/Makefile](../../sim/Makefile)): `top_psmct32_raster_demo_quartus_scaffold_check` now also verifies `HDMI_TX_CLK + HDMI_TX_D[0..23] + HS/VS/DE` are pin-assigned (sentinel set; not exhaustive) and fails the gate if Quartus would auto-place them on arbitrary package pins.
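The idea behind the scaffold check can be sketched as a scan of the QSF for `set_location_assignment` lines. The real check lives in `sim/Makefile` and its implementation is not shown here, so the Python below is only an assumed equivalent; `missing_pin_assignments` and the `PIN_XX1` placeholder are hypothetical, while `PIN_DJ24` is the `HDMI_TX_CLK` pin named above.

```python
import re

# Matches uncommented lines of the form:
#   set_location_assignment PIN_<name> -to <port>
_LOCATION_RE = re.compile(
    r"^\s*set_location_assignment\s+PIN_\w+\s+-to\s+(\S+)", re.MULTILINE
)


def missing_pin_assignments(qsf_text: str, required: list[str]) -> list[str]:
    """Return the required ports that have no location assignment in the QSF text."""
    assigned = {m.group(1).strip('"') for m in _LOCATION_RE.finditer(qsf_text)}
    return [port for port in required if port not in assigned]


required = ["HDMI_TX_CLK", "HDMI_TX_HS", "HDMI_TX_VS", "HDMI_TX_DE"]
required += [f"HDMI_TX_D[{i}]" for i in range(24)]

sample_qsf = """
set_location_assignment PIN_DJ24 -to HDMI_TX_CLK
# set_location_assignment PIN_XX1 -to HDMI_TX_HS
"""

missing = missing_pin_assignments(sample_qsf, required)
print(len(missing), "ports would be auto-placed")
```

Commented-out assignments are deliberately not counted, since Quartus would still auto-place those ports.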
**Quartus result vs Ch163 (50 MHz)**:

| Metric                     | Ch163 (50 MHz, no HDMI pins) | Ch164 (50 MHz + HDMI pins)     |
|----------------------------|------------------------------|--------------------------------|
| Fit ALMs                   | 27,543 / 46,800 (59 %)       | 27,271 / 46,800 (58 %)         |
| Fit RAM / PLL blocks       | 14 / 1                       | 14 / 1                         |
| **Fit pins**               | **17 / 351 (5 %)**           | **45 / 351 (13 %)** (+28 HDMI) |
| Setup slack worst (design) | +7.500 ns                    | +7.536 ns                      |
| Fmax (design domain)       | 80.0 MHz                     | ~80 MHz (unchanged)            |
| `quartus_asm`              | Successful                   | Successful (`.sof` produced)   |

The +28 pins are exactly the new HDMI shim: 24 RGB lanes, 1 clock, 3 sync (HS / VS / DE). Setup slack stays at ~+7.5 ns because the new pins are `false_path`'d; STA doesn't time anything against them yet. ALMs ticked down slightly as the fitter rebalanced under the wider pin map.

**Snapshot**: Ch163 50 MHz baseline preserved at [`baseline_ch163_50mhz/`](../../synth/de25_nano/top_psmct32_raster_demo/baseline_ch163_50mhz/) (syn / fit / sta summaries + parse_report + .sof). The [`baseline_ch163_30mhz/`](../../synth/de25_nano/top_psmct32_raster_demo/baseline_ch163_30mhz/) 30-MHz milestone is also preserved.

**Open Ch165+ items**:

- **ADV7513 I²C wake-up FSM**: without this the HDMI port outputs nothing on a real monitor. Ch165 owns the chip bring-up: pin `HDMI_I2C_SCL` / `HDMI_I2C_SDA` / `HDMI_TX_INT` / `HDMI_MCLK`, drop in an I²C master that walks the canonical ADV7513 register-set (sourced from `retroDE_nes`'s working bring-up).
- Proper `set_output_delay` constraints once the ADV7513 setup/hold window is documented (replacing Ch164's `false_path`).
- Make the rendered pattern bigger than Ch123's 16×8 SPRITE so there's something visible to admire on a real screen.
- xfer-side T4 coverage TB (still open from Ch157+).
- `useg_shadow_mem` BRAM-shape forensics.
**Sim regression**: 143 PASS / 0 FAIL unchanged; no RTL changes that touched sim semantics. The new HDMI ports are combinational mirrors of existing VIDEO_* signals, and `tb_de25_nano_psmct32_raster_demo_top` references VIDEO_* unchanged.

### Wake the ADV7513 — first .sof that drives a real HDMI monitor (Ch165)

Ch165 turns "FPGA pins toggling" into "monitor has a fighting chance of showing the tiny frame", Codex's framing for the chapter. The ADV7513 chip stays in standby on power-up; an I²C master needs to walk a canonical register-write sequence to configure 24-bit RGB input + sync polarity + power-up + HPD override before the chip will accept the FPGA's HDMI_TX_* data and drive the connector.

**Modules ported** (Terasic DE-series reference design, free use on Terasic hardware per the license that ships with the DE25-Nano System CD; copyright retained):

- [`rtl/platform/I2C_Controller.v`](../../rtl/platform/I2C_Controller.v): bit-bang I²C master with 23-step transaction layout (start / slave-addr / sub-addr / data / stop, ~50 µs per byte at the derived 20 kHz I²C clock).
- [`rtl/platform/I2C_HDMI_Config.v`](../../rtl/platform/I2C_HDMI_Config.v): wake-up FSM that walks a 38-entry LUT of ADV7513 register writes (slave 0x72): power-up + HPD override + audio init + AVI InfoFrame for full-range RGB 444 + dither + clock-divide + HDMI mode select. Adapted from the `retroDE_splash/rtl/platform/` versions (same DE25-Nano board); LUT customizations (HPD override, AVI InfoFrame for full-range RGB) carry through.

**Wrapper changes** ([rtl/top/de25_nano_psmct32_raster_demo_top.sv](../../rtl/top/de25_nano_psmct32_raster_demo_top.sv)):

- Four new top-level ports: `inout HDMI_I2C_SCL`, `inout HDMI_I2C_SDA` (open-drain I²C bus), `input HDMI_TX_INT` (chip's HPD / monitor-sense interrupt, active-low), and `output HDMI_MCLK` (audio sample-rate reference, driven by CLOCK2_50 since the demo is video-only).
- `I2C_HDMI_Config u_hdmi_i2c` instantiated.
  Clocked on `CLOCK2_50` (NOT `design_clk`: the wake-up runs even before the PLL locks); reset on `~ninit_done` (raw async reset; the I²C bus stays held in a clean state until FPGA init completes). Output `READY` (= `hdmi_init_done`) goes high after the LUT walk; `HDMI_TX_INT` going low retriggers the walk so a late hot-plug after FPGA boot still wakes the chip.
- New status LED: `LED[3] = ~hdmi_init_done` (active-low; lit means the chip is configured). `LED[7:4]` are re-tied HIGH.

**QSF + files.f + sim Makefile**: [QSF](../../synth/de25_nano/top_psmct32_raster_demo/de25_nano_psmct32_raster_demo_top.qsf) gains pin assignments for the 4 new control pins (sourced from `retroDE_nes`: `BT1` / `BW2` / `CF2` / `CF1`) plus IO standards (3.3-V LVCMOS for everything). The two new platform Verilog sources are added to the QSF source list, the synth [files.f](../../synth/de25_nano/top_psmct32_raster_demo/files.f), and the sim Makefile's `RTL_SRCS`. The [scaffold-check](../../sim/Makefile) extends to verify all 4 control pins are pin-assigned and have an IO standard set, alongside the Ch164 24-pin HDMI data set.

**SDC change** ([de25_nano_psmct32_raster_demo_top.sdc](../../synth/de25_nano/top_psmct32_raster_demo/de25_nano_psmct32_raster_demo_top.sdc)): `set_false_path -to / -from` on the new control pins. The I²C bus runs at ~20 kHz (50 µs per SCL period) and is inherently async to the design clock; HDMI_MCLK is driven by CLOCK2_50 and sampled by the chip's audio PLL. Both are well below any constraint on the fabric.
**Quartus result vs Ch164**:

| Metric               | Ch164 (HDMI data only) | Ch165 (HDMI data + wake-up)  |
|----------------------|------------------------|------------------------------|
| Fit ALMs             | 27,271 / 46,800 (58 %) | 27,374 / 46,800 (58 %)       |
| Fit RAM / PLL blocks | 14 / 1                 | 14 / 1 (unchanged)           |
| **Fit pins**         | **45 / 351**           | **49 / 351** (+4 control)    |
| Setup slack worst    | +7.536 ns              | +7.198 ns                    |
| `quartus_asm`        | Successful             | Successful (`.sof` produced) |

The +103 ALMs are the I²C controller's bit-bang state machine and the 38-entry LUT walker. STA stays positive on every class; the wake-up FSM lives entirely on the I²C-clock domain (slow), and Recovery analysis on the `iRST_N` async-deassert shows a clean +17.621 ns of slack.

**TB note**: `tb_de25_nano_psmct32_raster_demo_top` (the Ch149 board smoke) wires up the new HDMI_TX_INT input (tied high = no interrupt) and leaves the I²C SCL/SDA lines floating; the wake-up FSM walks the LUT but full completion takes ~125 ms simulated at the production divider (controller-clock period ~100 µs × 33 phases/byte × 38 bytes), far longer than the existing 5 ms TB runtime. The board TB doesn't observe `hdmi_init_done` directly; it pre-dates the wake-up FSM and only smoke-tests the wrapper. The Ch165 audit landed `tb_hdmi_i2c_wake_smoke` (`sim/tb/top/`), which overrides `CLK_Freq / I2C_Freq` to collapse the divider so the walk runs in microseconds and asserts the LUT walk + READY rise + HDMI_TX_INT retrigger + open-drain SDA + the Ch166 sticky NACK watchdog. Ch167 added a bus-level byte-sequence lock: the TB switched its SDA model from pulldown to pullup + a phase-aware slave-ACK driver (drives strong-LOW exactly when `u_dut.u0.phase` is `PH_ACK0/1/2`, releases otherwise so the master's data bits are visible on the wire).
A decoder samples SDA on each SCL rising edge between START and STOP, assembles the three bytes per transaction into a 24-bit `{dev_addr, reg, data}` tuple, and compares against `u_dut.mI2C_DATA[23:0]` snapshotted on `mI2C_GO` rising edges. Asserts: 38 captured == 38 intent, every byte matches, every dev_addr is `8'h72`. The Phase 3 open-drain check also flipped semantics from "SDA never strong-HIGH" to "SDA never `'x`" (the right violation test for the pullup bus).

**Snapshots**: Ch164 baseline preserved at [`baseline_ch164/`](../../synth/de25_nano/top_psmct32_raster_demo/baseline_ch164/); Ch165 baseline at [`baseline_ch165/`](../../synth/de25_nano/top_psmct32_raster_demo/baseline_ch165/).

**Open Ch168+ items**:

- Proper `set_output_delay` constraints on HDMI_TX_* once the ADV7513 setup/hold window is locked from the bring-up datasheet pass (replaces the Ch164 `set_false_path -to`).
- Make the rendered pattern bigger than Ch123's 16×8 SPRITE so there's something visible to admire on a real screen.
- xfer-side T4 coverage TB (still open from Ch157+).
- `useg_shadow_mem` BRAM-shape forensics.

**Sim regression**: 144 PASS / 0 FAIL. `tb_de25_nano_psmct32_raster_demo_top` PASSES with the new HDMI control ports wired up (HDMI_TX_INT held high in the TB; LED=`0b11111000` shows the existing 3 status LEDs lit; LED[3] stays unlit because the LUT walk doesn't complete in 5 ms of sim). `tb_hdmi_i2c_wake_smoke` PASSES the accelerated bring-up + Ch166 NACK-watchdog assertions.

### Hardware-readiness pass for the Ch123 PSMCT32 raster demo (Ch144)

Ch144 is a synthesis/FPGA-readiness audit around the first hardware-demo candidate (Ch123 PSMCT32 raster e2e, marked above). No RTL changes: Ch144 documents what a top-level FPGA wrapper needs to know before attempting a first build.

**RTL dependency tree (Ch123-only)**: what the demo *actually* instantiates.
The full `RTL_SRCS` list compiled by sim contains ~40 modules; Ch123 only reaches these 11, plus the swizzle math primitive that the three swizzle-aware modules each instantiate internally:

| Module                    | Role in Ch123                                               |
|---------------------------|-------------------------------------------------------------|
| `bios_rom_stub`           | EE bootlet at 0xBFC0_0000 (~18 instructions)                |
| `ee_ram_stub`             | DMAC-side GIF payload (~24 qwords)                          |
| `ee_memory_map_stub`      | EE-CPU + DMAC + bios + map's GS-priv decode                 |
| `ee_core_stub`            | MIPS R5900 core running the bootlet                         |
| `ee_gs_priv_bridge_stub`  | EE 32-bit MMIO → 64-bit GS-priv reg writes                  |
| `dmac_reg_stub`           | DMAC ch2 NORMAL transfer                                    |
| `gif_packed_stub`         | GIFtag + PACKED A+D parser                                  |
| `gs_stub`                 | GS register file + raster (`PSMCT32_SWIZZLE=1`)             |
| `gif_image_xfer_stub`     | TRXDIR/IMAGE engine (`PSMCT32_SWIZZLE=1`, dormant in Ch123) |
| `vram_stub`               | 8 KiB VRAM (one PSMCT32 page)                               |
| `gs_pcrtc_stub`           | PCRTC scanout (`PSMCT32_SWIZZLE=1`)                         |
| `gs_swizzle_psmct32_stub` | Pure-comb math, instantiated x3 inside the gates above      |

**Sim-only constructs audit** (full sweep of the 12 modules above):

- `bios_rom_stub.sv` and `ee_ram_stub.sv`: `$display` / `$readmemh` inside `initial begin`. Both are synth-safe: Xilinx Vivado and Intel Quartus support `$readmemh` for BRAM initialization, and `$display` is silently ignored by all major synthesizers.
- `vram_stub.sv` L114-117: single `$error` parameter validator inside `initial begin`. Synth ignores it; the BYTES parameter must be set to a sane value at instantiation regardless.
- `ee_gs_priv_bridge_stub.sv` L118: runtime `$error` on unsupported byte enables, inside `always_ff`. Synth ignores the `$error`; the surrounding logic still synthesizes correctly.
- **No** `$finish` / `$dumpfile` / `$random` / `force` / `release` / `real`-typed signals / hierarchical refs in any module of the **Ch123 dep tree**.
  (TBs use hierarchical refs into `bios_rom_stub` to preload the bootlet; that's a TB-only concern, and on hardware the bootlet image is the BRAM init. Out-of-tree note: `boot_install_agent_stub.sv` (SIF subsystem, not in the Ch123 dep tree) contains a `$fatal` runtime validator, but it is never compiled into the Ch123 hardware build.)

**Memory sizing**:

| Memory | Default | Ch123 sim setting | Ch123 hw recommendation | FPGA fit |
|--------|---------|-------------------|-------------------------|----------|
| `bios_rom_stub` | 4 MiB | 4 KiB | 4 KiB | ≤1 BRAM tile |
| `ee_ram_stub` | 16 KiB | 4 KiB | 4 KiB | ≤1 BRAM tile |
| `vram_stub` | 64 KiB | 8 KiB | 8 KiB | ≤2 BRAM tiles (one PSMCT32 page) |
| `ee_memory_map_stub.useg_shadow_mem` (Ch145) | 4 MiB | 4 MiB | **4 KiB** (override `USEG_SHADOW_WORDS_PARAM=1024`) | ≤1 BRAM tile when overridden |

The 16×8 framebuffer needs only 16×8×4 = 512 bytes; 8 KiB gives the full first PSMCT32 page (FBP=0). For a more ambitious hardware demo (multi-page framebuffers, textures), grow `vram_stub.BYTES` toward 1 MiB / 4 MiB. Real PS2 has 4 MiB of VRAM; a first hardware build can stay at 8 KiB.

**Ch145, `useg_shadow_mem` parameterization**: pre-Ch145, the ee_memory_map_stub's useg-shadow backing was a fixed 1M-word / 4 MiB array. That was correct for the BIOS-smoke chapters that need full first-4-MiB-of-useg coverage, but it's wasted area for the Ch123 hardware demo (which never touches useg: the bootlet runs from BIOS at 0xBFC0_0000 and the GIF payload from RAM at phys 0x100). Ch145 promotes `USEG_SHADOW_WORDS` from a hardcoded `localparam` to the `USEG_SHADOW_WORDS_PARAM` module parameter (default 1M words = 4 MiB → existing TBs unchanged).
For the Ch123 hardware demo, the top-level wrapper instantiates `ee_memory_map_stub` with `.USEG_SHADOW_WORDS_PARAM(1024)` to shrink the inferred BRAM footprint by ~1024×; correctness is unaffected because no useg access ever happens in the Ch123 data plane.

**Clock / reset assumptions**:

- Single clock domain (`clk`): all 12 modules share one input.
- Active-low synchronous reset input (`rst_n`), also a single shared input. No reset gating, no per-module variants. The reset is sampled inside `always_ff @(posedge clk)` via the `if (!rst_n)` pattern (NOT `posedge clk or negedge rst_n`), i.e., it is NOT an async reset despite being active-low. On FPGA this should be brought up via the device's reset bridge so the deasserting edge is synchronous to `clk`.
- No clock gating, no derived clocks. The PCRTC's hsync/vsync/de are regular clock-domain outputs, not separate clocks.

**Swizzle gate parameter defaults**:

- All four swizzle parameters (`PSMCT32_SWIZZLE`, `PSMCT16_SWIZZLE`, `PSMT8_SWIZZLE`, `PSMT4_SWIZZLE`) default to `1'b0` on `gs_stub`, `gs_pcrtc_stub`, and `gif_image_xfer_stub`. For the Ch123 hardware demo, instantiate these three modules with **`PSMCT32_SWIZZLE(1'b1)`** and the other three left at `1'b0`. The swizzle-math primitives (`gs_swizzle_psmct32_stub` etc.) are pure-comb and trim cleanly when their gate is off.

**Top-level harness expectations** (for a future `top_psmct32_raster_demo.sv`):

- Inputs: `clk`, `rst_n`, plus board-level video-out connections (HDMI / DVI / VGA, driven by `r/g/b/hsync/vsync/de` from `gs_pcrtc_stub`).
- The EE bootlet image must be preloaded into `bios_rom_stub` via either `IMAGE_FILE` (→ `$readmemh`) or a bake-step that writes a `.mem` next to the synthesis project. The bootlet is 18 MIPS instructions (currently authored procedurally in the Ch123 TB body via `ee_prog_word()`); for hardware this needs to become a static `.mem` checked into the repo.
- The GIF payload must be preloaded into `ee_ram_stub` via the same mechanism: 24 qwords starting at `PAYLOAD_MADR=0x100`. Current TB authors them procedurally with `preload_qword()`; hardware needs a static `.mem`.
- The `core_go` signal must be tied high (or pulsed by a board reset-release sequencer) so the EE starts fetching from `0xBFC0_0000`.

**Known sim-only constructs that should NOT block first build**:

- `$display` lines in BIOS/RAM init (synth ignores).
- `$readmemh` (synth tools handle it for BRAM init).
- `$error` parameter validators (synth ignores).

**Known sim-only constructs that WOULD block first build**:

- None found in the Ch123 dep tree.

**Open questions for the hardware-build session** (deliberately not answered here; they need a board-level decision):

- Target FPGA family + clock frequency (PCRTC was designed around 13.5 MHz pixel clock for the 16×8 active area; first build can run at any clock since the TB doesn't model real CRTC timing).
- Video-out PHY (HDMI core, VGA DAC, on-board HDMI transmitter chip).
- BIOS / payload bake step (Vivado `update_compile_order` + `.mem` files vs. a SystemVerilog `localparam` array preload).
- Whether to keep `ee_core_stub`'s `STRICT_UNSUPPORTED` gate active on hardware (catches unknown opcodes by halt+latch: useful for debugging, but a hard failure on any unintended fetch).

The Ch90 white-box TB `tb_gs_scanout_basic.sv` exercises the full round trip: instantiates `gs_stub` + `vram_stub` + `gs_pcrtc_stub`, drives a 4×4 sprite through the GIF reg port, waits for raster to fully drain, then enables scanout and captures one full frame's `(hcnt, vcnt) → (r, g, b)` trace. Asserts: every pixel inside the sprite reads as the emitted color, every pixel outside reads as 0, and at least one EV_MODE frame trace fires.