retroDE_ps2/docs/contracts/gif_gs.md

# GIF/GS Contract

Status: `Draft`

## Purpose

Define the graphics ingress and rendering/display boundary.

## Owns

- GIF path intake and arbitration,
- GIF tag interpretation,
- GS register decode,
- GS VRAM-visible operations,
- framebuffer/zbuffer/texture-visible state handling,
- PCRTC/display output generation or a planned approximation layer.

## Inputs

- DMAC channel 2 traffic,
- VIF/VU-generated graphics traffic,
- privileged GS register writes,
- reset and display configuration controls.

## Outputs

- VRAM updates,
- display timing and pixel output,
- status/interrupt signals,
- packet and register trace events.

## Questions to lock

- What is the first output milestone:
  - GS privileged register acceptance only
  - static background color
  - minimal primitive draw
  - `gsKit`-style demo target
- Is Phase 1 display based on a faithful GS/PCRTC path or a temporary adapter?
- What VRAM organization assumptions must stay stable from the beginning?

## Allowed early stubs

- privileged-register-only GS stub,
- BGCOLOR/test-pattern display path,
- packet logger with no rendering.

## Required debug visibility

- GIF tags,
- PATH source and arbitration result,
- GS register writes,
- VRAM write summaries,
- display mode transitions.

## First meaningful milestone

- a known packet stream or direct privileged-register sequence produces a stable,
  visible, repeatable output and matching trace.

## GS write-port contract (Ch75)

The GS model has **two architecturally distinct write ports** because real PS2
hardware exposes two unrelated register namespaces. Conflating them was a Ch74
mistake; Ch75 split them.

### `reg_wr_*` — privileged GS/MMIO writes

- Source: CPU MMIO writes to the `0x12000000` privileged-register block, e.g.
  via `platform_video_stub` or a direct test-harness path.
- Address: `reg_wr_addr[15:0]` is the offset *inside* the privileged block.
- Examples: `BGCOLOR` at offset `0x00E0`, `PMODE` at `0x0000`,
  `SMODE2` at `0x0020`, etc.
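- Latched as of Ch75: `BGCOLOR` only; other offsets emit `EV_MODE`.
  Ch91 adds display-register latches such as `pmode_q` for
  `gs_pcrtc_stub` (see "PCRTC scanout (Ch90)").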

### `gif_reg_*` — GIF A+D register-number writes

- Source: `gif_packed_stub` consuming a PACKED A+D entry when run with
  `REAL_AD_REG_MAP=1` (the path to enable for real PS2 packets; the
  parameter itself still defaults to `0` for back-compat with the
  project-local Ch72/Ch73 PACKED-A+D layout).
- Address: `gif_reg_num[7:0]` is the **GIF A+D register number** straight
  out of the PACKED entry's `in_data[71:64]`. Source-of-truth is PCSX2
  `pcsx2/GS/GSRegs.h`.
- Currently decoded: `PRIM=0x00`, `RGBAQ=0x01`, `XYZF2=0x04`, `XYZ2=0x05`,
  `FRAME_1=0x4C`, `ZBUF_1=0x4E` (**not `0x4F` — that is `ZBUF_2`**).
  Each has a dedicated 64-bit latch output. Other reg numbers emit `EV_MODE`.
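
As a host-side illustration of the `gif_reg_*` address path, the sketch below splits a 128-bit PACKED A+D qword into register number and payload, then maps the tracked register numbers listed above. It is a Python model for trace tooling, not the RTL. The names `TRACKED_GIF_REGS`, `split_packed_ad`, and `classify` are hypothetical, and taking the payload from `in_data[63:0]` is an assumption based on the standard A+D layout.

```python
# Host-side sketch (hypothetical names); mirrors the REAL_AD_REG_MAP=1 view.
TRACKED_GIF_REGS = {
    0x00: "PRIM",
    0x01: "RGBAQ",
    0x04: "XYZF2",
    0x05: "XYZ2",
    0x4C: "FRAME_1",
    0x4E: "ZBUF_1",   # 0x4F would be ZBUF_2, which is not tracked
}

def split_packed_ad(in_data):
    """Return (gif_reg_num, data) from a 128-bit PACKED A+D qword."""
    reg_num = (in_data >> 64) & 0xFF            # in_data[71:64]
    data = in_data & 0xFFFF_FFFF_FFFF_FFFF      # in_data[63:0] (assumed payload)
    return reg_num, data

def classify(in_data):
    """Name of the tracked register, or 'EV_MODE' for anything untracked."""
    reg_num, _ = split_packed_ad(in_data)
    return TRACKED_GIF_REGS.get(reg_num, "EV_MODE")
```

A trace checker could call `classify` on each accepted A+D qword to predict whether `gs_stub` should latch it or report `EV_MODE`.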

### Event taxonomy

The two write paths emit different events. Read this carefully — `arg2`
semantics differ across emitters.

- `EV_BGCOLOR` — emitted **only** by `gs_stub` on the privileged port
  when `reg_wr_addr == 0x00E0`. Carries the unpacked R/G/B in
  `arg0`/`arg1`/`arg2`. The privileged port has no per-register
  "selector" beyond this dedicated event; everything else on that port
  goes to `EV_MODE` with `arg0=offset`, `arg1=data`.

- `EV_WRITE` — emitted in two places with different `arg2` semantics:
  - **`gif_packed_stub`** on a PACKED A+D accept (REGS nibble = `0xE`).
    Carries the raw PACKED address bits in `arg2` (`{48'd0,
    in_data[79:64]}`). Under `REAL_AD_REG_MAP=1` the low 8 bits are the
    real GIF reg# (`in_data[71:64]`); under `REAL_AD_REG_MAP=0` the low
    16 bits are the project-local privileged-style offset. **Not a
    stable selector — it is the address half of the wire.**
  - **`gs_stub`** on the `gif_reg_*` port for a tracked GIF reg
    (PRIM/RGBAQ/XYZF2/XYZ2/FRAME_1/ZBUF_1, plus TEX0_1 as of Ch98).
    Carries a **stable per-register selector** in `arg2`: `1=PRIM,
    2=RGBAQ, 3=XYZF2, 4=XYZ2, 5=FRAME_1, 6=ZBUF_1, 7=TEX0_1`.
    `arg0=reg#`, `arg1=data`. Use this selector for trace-side
    filtering; it does not depend on `REAL_AD_REG_MAP`.
  - **Ch76 caveat**: a tracked vertex commit (XYZ2 or XYZF2) on the
    `gif_reg_*` port that *closes* a primitive does NOT emit EV_WRITE
    that cycle — `EV_PRIM_DRAW` preempts it (see below). The xyz2_q /
    xyzf2_q latch still updates. Trace consumers counting "vertices
    seen" must sum `EV_WRITE`(selector=3 or 4) + `EV_PRIM_DRAW` to get
    the true total.

- `EV_PRIM_DRAW` — Ch76 / Ch77. Fired by `gs_stub` once per primitive
  completion: when an XYZ2 or XYZF2 vertex commit on the `gif_reg_*`
  port closes a primitive under the current `PRIM[2:0]`. Preempts the
  EV_WRITE that the closing vertex would otherwise have emitted.
  Args: `arg0=PRIM[2:0]` (prim type), `arg1=primary threshold`,
  `arg2=prim_complete_count` (cumulative, post-increment),
  `arg3=closing vertex data` (the same 64 bits that latched into
  xyz2_q / xyzf2_q on this cycle).
  - **Discrete primitives** (primary threshold N, not the PRIM code:
    POINT=1, LINE=2, TRIANGLE=3, SPRITE=2): one draw per N vertices;
    the vertex counter resets to 0 after each draw.
  - **Strip / fan primitives** (primary threshold N: LINE_STRIP=2,
    TRI_STRIP=3, TRI_FAN=3):
    Ch77. Anchor on the first N vertices, then fire one draw per
    additional vertex commit. The vertex counter saturates at the
    primary threshold so every subsequent vertex closes another
    primitive. Ch78 adds **vertex-identity tracking** distinguishing
    TRI_STRIP rolling triangles `{v_n-2, v_n-1, v_n}` from TRI_FAN
    pivot triangles `{v_pivot, v_n-1, v_n}` — see the next section.
  - **Reserved** (PRIM=7): no draw, vertex commits do not increment
    the counter, latches still update.
  - A PRIM write always resets the vertex counter so a fresh
    primitive type starts cleanly.
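
The counter rules above can be sketched as a small trace-side model. This is a Python illustration under stated assumptions, not the `gs_stub` RTL: the names `THRESHOLD`, `STRIP_OR_FAN`, and `events_for` are hypothetical, and it assumes a vertex commit under the reserved PRIM=7 still reports `EV_WRITE`.

```python
# Hypothetical trace-side model of the vertex counter (not the RTL).
THRESHOLD = {0: 1, 1: 2, 2: 2, 3: 3, 4: 3, 5: 3, 6: 2}   # PRIM[2:0] -> N
STRIP_OR_FAN = {2, 4, 5}          # LINE_STRIP, TRI_STRIP, TRI_FAN

def events_for(prim, n_vertices):
    """Event emitted for each vertex commit following one PRIM write."""
    events = []
    count = 0                                 # a PRIM write resets the counter
    for _ in range(n_vertices):
        if prim == 7:                         # reserved: no count, no draw
            events.append("EV_WRITE")         # assumption: still a tracked write
            continue
        n = THRESHOLD[prim]
        count = min(count + 1, n)             # saturate at the threshold
        if count == n:
            events.append("EV_PRIM_DRAW")     # preempts EV_WRITE this cycle
            if prim not in STRIP_OR_FAN:
                count = 0                     # discrete: start a new primitive
        else:
            events.append("EV_WRITE")
    return events
```

Every vertex yields exactly one event, which is why a consumer counting vertices has to add the `EV_PRIM_DRAW` count to the `EV_WRITE` count for selectors 3 and 4.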

### Per-primitive vertex snapshot (Ch78)

Alongside `EV_PRIM_DRAW`, `gs_stub` exposes three 64-bit outputs —
`prim_v0_q`, `prim_v1_q`, `prim_v2_q` — that hold the *vertex tuple*
of the most recently closed primitive. Snapshot is registered on the
same clock edge as the `ev_valid` pulse and held until the next
`prim_complete`, so a TB can sample it at the same time it sees
`gs_ev_event == EV_PRIM_DRAW`.

The number of valid slots is implicit in `PRIM[2:0]`:

| `PRIM` | type | valid slots | semantics |
|---|---|---|---|
| 0 | POINT | `v0` | the single vertex |
| 1 | LINE | `v0`, `v1` | endpoints |
| 2 | LINE_STRIP | `v0`, `v1` | each segment uses `{v_n-1, v_n}` |
| 3 | TRIANGLE | `v0`, `v1`, `v2` | the three vertices |
| 4 | TRI_STRIP | `v0`, `v1`, `v2` | rolling: `{v_n-2, v_n-1, v_n}` |
| 5 | TRI_FAN | `v0`, `v1`, `v2` | pivot+rolling: `{v_pivot, v_n-1, v_n}` |
| 6 | SPRITE | `v0`, `v1` | top-left + bottom-right |
| 7 | reserved | — | observer never closes |

The TRI_STRIP-vs-TRI_FAN distinction lives entirely in the
saturated-extension path: a TRI_STRIP advances `v0` each draw with
the rolling window; a TRI_FAN pins `v0` to `v_pivot` (the first
vertex committed since the most recent PRIM write). On the *anchor*
draw, `v_pivot` and the rolling `v_prev` happen to coincide, so
TRI_STRIP and TRI_FAN report the same tuple for their first
triangle.

A PRIM write clears the rolling window (`v_curr` / `v_prev` /
`v_prev_prev` / `v_pivot` / `pivot_seen`) so a fresh primitive
context starts with no residual vertex bleed. Slots not used by the
current primitive type read `0`.

The snapshot tracks identity, not geometry — the values written are
the raw 64-bit `gif_reg_data` payloads of XYZ2 / XYZF2 commits, with
no decoding into screen-space coordinates. Rasterization is still
out of scope.
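
To make the strip-versus-fan tuple rule concrete, here is a minimal Python reference model for the triangle-class primitives. It is a sketch for testbench-side expectations only: `triangle_snapshots` is a hypothetical helper, and it tracks the whole vertex history rather than the four-register rolling window the text describes.

```python
# Hypothetical reference model of the Ch78 snapshot for PRIM 3, 4 and 5.
def triangle_snapshots(prim, vertices):
    """Return one (v0, v1, v2) tuple per closed primitive.

    `vertices` are the raw 64-bit payloads committed after a single
    PRIM write; prim is 3 (TRIANGLE), 4 (TRI_STRIP) or 5 (TRI_FAN).
    """
    out = []
    window = []                        # vertices seen since the PRIM write
    for v in vertices:
        window.append(v)
        if prim == 3:                  # discrete: every third vertex closes
            if len(window) == 3:
                out.append(tuple(window))
                window = []
        elif len(window) >= 3:
            if prim == 4:              # rolling {v_n-2, v_n-1, v_n}
                out.append((window[-3], window[-2], window[-1]))
            elif prim == 5:            # pinned {v_pivot, v_n-1, v_n}
                out.append((window[0], window[-2], window[-1]))
    return out
```

For the same five vertices, the strip and the fan agree on the anchor triangle and diverge from the second draw onward, which is the property a Ch78-style testbench would check.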

### Per-primitive color snapshot (Ch79 / Ch80)

`prim_color_q[63:0]` is registered on the same edge as
`prim_v0_q` / `prim_v1_q` / `prim_v2_q` and carries the value of
`rgbaq_q` at the moment the primitive closed. RGBAQ writes are
separate A+D entries from XYZ2 / XYZF2 commits (gif_packed_stub
serializes A+D to one accept per cycle), so `rgbaq_q` is always
settled to its draw-time value when `prim_complete_now` fires.

`prim_color_q` reads `0` if no RGBAQ has been written since reset;
`rgbaq_q` itself is **not** cleared on a PRIM write — color carries
forward across PRIM context switches, matching real GS behavior —
but it does reset to `0` on `rst_n`.

#### Per-vertex Gouraud color (Ch80)

For real game streams that interleave RGBAQ writes with vertex
commits to drive Gouraud shading, `gs_stub` exposes three
additional outputs:

| Output | Slot semantics |
|---|---|
| `prim_color_v0_q[63:0]` | color of vertex 0 |
| `prim_color_v1_q[63:0]` | color of vertex 1 |
| `prim_color_v2_q[63:0]` | color of vertex 2 |

A parallel rolling color window (`c_curr_q` / `c_prev_q` /
`c_prev_prev_q` / `c_pivot_q`, internal) samples `rgbaq_q` on
every vertex commit, mirroring the Ch78 vertex-identity window.
The snapshot layout matches the vertex layout exactly:

| `PRIM` | type | `_v0_q` color of | `_v1_q` color of | `_v2_q` color of |
|---|---|---|---|---|
| 0 | POINT | the single vertex | 0 | 0 |
| 1 | LINE | first endpoint | closing | 0 |
| 2 | LINE_STRIP | previous vertex | closing | 0 |
| 3 | TRIANGLE | `v_n-2` | `v_n-1` | closing |
| 4 | TRI_STRIP | `v_n-2` (rolls) | `v_n-1` | closing |
| 5 | TRI_FAN, anchor | `v1` (≡ pivot) | `v2` | `v3` |
| 5 | TRI_FAN, saturated | `v_pivot` (PINNED) | `v_n-1` | closing |
| 6 | SPRITE | first endpoint | closing | 0 |

`prim_color_q` is exactly the closing-vertex color (≡
`prim_color_v_close`), kept as a convenience alias for consumers
that don't care about Gouraud.

For **flat-shaded** primitives (RGBAQ written once before the
strip), all per-vertex color slots used by the primitive equal
each other and equal `prim_color_q`. For **Gouraud-shaded**
primitives (RGBAQ rewritten between vertex commits), the slots
may differ — capturing the per-vertex color identity needed to
distinguish a strip's rolling colors from a fan's pivot color.

The color window is **cleared on PRIM write** (unlike `rgbaq_q`
itself, which carries forward). This means per-vertex color
identity stays tied to the current primitive context — a stream
that switches PRIM types mid-context starts color tracking fresh
for the new context. Slots not used by the current primitive type
read `0`.

Like the vertex snapshot, this captures identity, not interpolated
geometry — the stored values are the raw 64-bit RGBAQ payloads
(packing R, G, B, A, and the texture-coord divisor Q together);
GS-style Gouraud interpolation across the primitive interior
remains out of scope.
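
The difference between flat and Gouraud streams comes down to when `rgbaq_q` is sampled. The sketch below models that for discrete TRIANGLE primitives only; it is a hedged Python illustration, with `triangle_colors` and the `(reg, value)` stream encoding invented for the example.

```python
# Hypothetical stream model: rgbaq_q carries forward, the color window does not.
def triangle_colors(stream):
    """`stream` is a list of (reg, value), reg in {"PRIM", "RGBAQ", "XYZ2"}.

    Returns one (c_v0, c_v1, c_v2) tuple per closed discrete TRIANGLE.
    """
    rgbaq = 0          # rgbaq_q: 0 after reset, survives PRIM writes
    prim = None
    colors = []        # per-vertex colors sampled since the last PRIM write
    closed = []
    for reg, value in stream:
        if reg == "PRIM":
            prim = value & 0x7
            colors = []                    # color window cleared on PRIM write
        elif reg == "RGBAQ":
            rgbaq = value
        elif reg == "XYZ2" and prim == 3:
            colors.append(rgbaq)           # sample rgbaq_q at the vertex commit
            if len(colors) == 3:
                closed.append(tuple(colors))
                colors = []
    return closed
```

In this model the last element of each tuple is the closing-vertex color, which is the value `prim_color_q` is described as carrying.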

### Structured-field decode (Ch81)

`gs_stub` exposes pre-decoded snapshot outputs alongside the raw
64-bit slots so a downstream rasterizer or pixel-emit path doesn't
have to re-derive bit fields:

| Output | Type | Carries |
|---|---|---|
| `prim_v0_decoded_q` / `_v1_` / `_v2_` | `trace_pkg::vertex_t` | `x` / `y` / `z` / `fog` / `is_xyzf2` per slot |
| `prim_v0_color_decoded_q` / `_v1_` / `_v2_` | `trace_pkg::color_t` | `r` / `g` / `b` / `a` / `q` per slot |

The decoded outputs latch on the same edge as the raw snapshots, so
a TB samples both atomically with `EV_PRIM_DRAW`.

#### vertex_t and the XYZ2 / XYZF2 distinction

```sv
typedef struct packed {
    logic        is_xyzf2;  // 1 = XYZF2 source, 0 = XYZ2
    logic [7:0]  fog;       // valid iff is_xyzf2; else 0
    logic [31:0] z;         // 32-bit (XYZ2) or zero-extended 24-bit (XYZF2)
    logic [15:0] y;         // 12.4 fixed-point screen Y
    logic [15:0] x;         // 12.4 fixed-point screen X
} vertex_t;
```

XYZ2 packs full 32-bit Z in `data[63:32]`. XYZF2 packs 24-bit Z in
`data[55:32]` and an 8-bit fog byte in `data[63:56]`. The `is_xyzf2`
flag is registered in a parallel rolling format-flag window
(`xyzf2_curr_q` / `xyzf2_prev_q` / `xyzf2_prev_prev_q` /
`xyzf2_pivot_q`) that tracks the source format of each vertex
through the rolling window — so when an XYZF2 vertex rolls into
the `v_prev` slot of a TRI_STRIP saturated extension, its
`is_xyzf2` flag rolls with it.

Cleared on `rst_n` and on PRIM write, same as the vertex/color
windows.

#### color_t

```sv
typedef struct packed {
    logic [31:0] q;  // texture-coord divisor (IEEE float)
    logic [7:0]  a;
    logic [7:0]  b;
    logic [7:0]  g;
    logic [7:0]  r;
} color_t;
```

Direct bit-slice of the RGBAQ payload — no interpretation. Q is
carried verbatim as a 32-bit IEEE float (the GS uses it for
texture coordinate division during rasterization, which remains
out of scope).

#### Decode helper functions

`trace_pkg` exposes `decode_vertex(data, is_xyzf2)` and
`decode_color(data)` so downstream code can re-decode raw 64-bit
values consistently with the `gs_stub` snapshot.

The decoded outputs are an additive contract — the raw `prim_v*_q`
and `prim_color_v*_q` outputs continue to work for consumers that
don't care about per-channel decoding.
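
For host-side trace tooling, the same bit slices can be mirrored in Python. The sketch below is not the `trace_pkg` source. It follows the Z and fog positions stated above, and it assumes the standard GS layout for the remaining fields: X in `data[15:0]`, Y in `data[31:16]`, and R/G/B/A in the low four bytes of RGBAQ.

```python
# Python mirror of the documented bit layouts (an illustration, not trace_pkg).
def decode_vertex(data, is_xyzf2):
    """Split a raw XYZ2 / XYZF2 payload into the vertex_t fields."""
    x = data & 0xFFFF                       # data[15:0], 12.4 fixed-point (assumed)
    y = (data >> 16) & 0xFFFF               # data[31:16], 12.4 fixed-point (assumed)
    if is_xyzf2:
        z = (data >> 32) & 0xFF_FFFF        # data[55:32], zero-extended 24-bit Z
        fog = (data >> 56) & 0xFF           # data[63:56]
    else:
        z = (data >> 32) & 0xFFFF_FFFF      # data[63:32], full 32-bit Z
        fog = 0                             # fog is only valid for XYZF2
    return {"is_xyzf2": bool(is_xyzf2), "fog": fog, "z": z, "y": y, "x": x}

def decode_color(data):
    """Split a raw RGBAQ payload into the color_t fields; Q stays raw bits."""
    return {
        "r": data & 0xFF,
        "g": (data >> 8) & 0xFF,
        "b": (data >> 16) & 0xFF,
        "a": (data >> 24) & 0xFF,
        "q": (data >> 32) & 0xFFFF_FFFF,    # IEEE float bit pattern, not converted
    }
```

Decoding the same raw payload with both values of `is_xyzf2` shows why the format flag has to travel with each vertex: the top byte is either fog or the high byte of Z.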

### Minimal pixel emit (Ch82)

`gs_stub` exposes a per-primitive *pixel emit* — the smallest
possible output that ties the recognition layer to a framebuffer
destination. One pixel per closed primitive (the closing vertex,
in screen-space integer coords), addressed by the latched
`frame_1_q` register. No interpolation, no coverage, no
rasterization — this is the contact point for a future raster
chapter, not a substitute for one.

| Output | Width | Carries |
|---|---|---|
| `pixel_emit` | 1 | 1-cycle strobe; pulses on the same edge as `prim_complete` |
| `pixel_emit_count` | 32 | Running tally of emits since reset |
| `pixel_x_q` / `pixel_y_q` | 12 | Closing vertex integer screen coords (top 12 bits of 12.4 fixed-point) |
| `pixel_color_q` | 64 | RGBAQ at the emit moment (= `prim_color_q`) |
| `pixel_fbp_q` | 9 | `FRAME_1[8:0]` (framebuffer base / 2048) |
| `pixel_fbw_q` | 6 | `FRAME_1[21:16]` (framebuffer width / 64 in pixels) |
| `pixel_psm_q` | 6 | `FRAME_1[29:24]` (pixel storage format) |
| `pixel_fb_addr_q` | 32 | Computed VRAM byte offset (see below) |

#### Address arithmetic

```
fb_addr = FBP * 2048 + (Y * FBW * 64 + X) * bytes_per_pixel
```

Ch83 added PSM-aware `bytes_per_pixel` derived from the latched
`FRAME_1[29:24]` (PSM field):

| PSM (hex) | Format | bytes/pixel | Notes |
|---|---|---|---|
| 00, 01 | PSMCT32 / PSMCT24 | 4 | host-word |
| 02, 0A | PSMCT16 / PSMCT16S | 2 | |
| 13 | PSMT8 | 1 | indexed |
| 14 | PSMT4 | 4 here (host-word) | **legacy `pixel_emit` channel only** — see note below |
| 1B, 24, 2C | PSMT8H / PSMT4HL / PSMT4HH | 4 | host-word (high/low nibble of 32-bit slot) |
| 30, 31 | PSMZ32 / PSMZ24 | 4 | depth |
| 32, 3A | PSMZ16 / PSMZ16S | 2 | depth |
| other | — | 4 (host-word fallback) | unrecognized PSM |

This table describes the **legacy `pixel_emit` channel** (the
single-pixel-per-primitive debug strobe from Ch82/Ch83). That
channel does not commit to `vram_stub`; it only emits a trace
event. Its PSMT4 entry stays at host-word fallback — the
recognition layer never tracked sub-byte position there.
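
The address arithmetic and the bytes-per-pixel table for this legacy channel can be written out as a short Python sketch. The names `BYTES_PER_PIXEL` and `legacy_fb_addr` are hypothetical, and the helper takes the raw 12.4 fixed-point coordinates and drops the fraction itself.

```python
# Sketch of the legacy pixel_emit address math (hypothetical helper names).
BYTES_PER_PIXEL = {
    0x00: 4, 0x01: 4,               # PSMCT32 / PSMCT24
    0x02: 2, 0x0A: 2,               # PSMCT16 / PSMCT16S
    0x13: 1,                        # PSMT8
    0x14: 4,                        # PSMT4: host-word fallback on this channel
    0x1B: 4, 0x24: 4, 0x2C: 4,      # PSMT8H / PSMT4HL / PSMT4HH
    0x30: 4, 0x31: 4,               # PSMZ32 / PSMZ24
    0x32: 2, 0x3A: 2,               # PSMZ16 / PSMZ16S
}

def legacy_fb_addr(frame_1, x_fixed, y_fixed):
    """fb_addr = FBP * 2048 + (Y * FBW * 64 + X) * bytes_per_pixel."""
    fbp = frame_1 & 0x1FF                   # FRAME_1[8:0]
    fbw = (frame_1 >> 16) & 0x3F            # FRAME_1[21:16]
    psm = (frame_1 >> 24) & 0x3F            # FRAME_1[29:24]
    x = (x_fixed >> 4) & 0xFFF              # integer part of 12.4 fixed-point
    y = (y_fixed >> 4) & 0xFFF
    bpp = BYTES_PER_PIXEL.get(psm, 4)       # unrecognized PSM: host-word fallback
    return fbp * 2048 + (y * fbw * 64 + x) * bpp
```

With `FBP=1`, `FBW=1`, PSMCT32, and integer coordinates `(3, 2)`, this gives `2048 + (2*64 + 3) * 4 = 2572`.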

The **raster channel (`raster_pixel_emit`)** does NOT use this
table. It owns its own PSM-aware emit packing in S2 with full
PSMT4 support after Ch106:
- Byte address = `pixel_index >> 1` (overrides the
  `pixel_index << ras_bpp_shift` form).
- The 4-bit index from R[3:0] is placed in the targeted nibble
  (low/high keyed by `pixel_index[0]`) of `write_data[7:0]`.
- `raster_pixel_be_q = 4'b0001`, `raster_pixel_mask_q = 0x0F`
  or `0xF0` so `vram_stub`'s per-bit merge updates only that
  nibble.
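
A minimal Python sketch of that PSMT4 pack follows. It is an illustration rather than the S2 RTL: `psmt4_emit` is a hypothetical name, `pixel_index` is assumed to be the linear pixel number within the framebuffer, and any framebuffer base offset is left out.

```python
# Hypothetical model of the S2 PSMT4 pack for one raster pixel.
def psmt4_emit(pixel_index, r_channel):
    """Return (byte_addr, write_data, write_be, write_mask)."""
    index4 = r_channel & 0xF                # 4-bit index from R[3:0]
    high = pixel_index & 1                  # pixel_index[0] selects the nibble
    byte_addr = pixel_index >> 1            # two pixels share one byte
    write_data = (index4 << 4) if high else index4
    write_mask = 0xF0 if high else 0x0F     # only the targeted nibble merges
    return byte_addr, write_data, 0b0001, write_mask
```

Pixels 0 and 1 land in the same byte with complementary masks, so neither write disturbs the other nibble.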

PSMT8H / PSMT4HL / PSMT4HH still address the host 32-bit slot,
not the high/low byte/nibble within it; the extracted sub-byte
is rasterizer/blit-specific and out of scope here.

`pixel_psm_q` is still exposed verbatim so consumers can apply
their own sub-slot offset arithmetic if needed.

#### Carry-forward semantics

`frame_1_q` is part of the standard GIF-context register file and
carries forward across PRIM writes (matching real GS). A stream
that sets `FRAME_1` once and then emits multiple primitives
correctly addresses all of them. A stream that never writes
`FRAME_1` lands every pixel at `fb_addr=0`, which is observable but
not useful, and behaves cleanly under reset.

`rgbaq_q` likewise carries forward, so `pixel_color_q` reflects
the most recent RGBAQ write at emit time. If a Gouraud-style
stream rewrites RGBAQ between vertices, `pixel_color_q` captures
the closing-vertex color — same semantic as Ch79's
`prim_color_q`.

#### Strobe channel, not trace event

`pixel_emit` is a dedicated 1-cycle strobe alongside the snapshot
outputs, not a multiplexed event on the main `ev_valid` trace
stream. This avoids contention with `EV_PRIM_DRAW` on the close
cycle. A consumer that wants both can sample on `pixel_emit`
posedge and read the snapshots atomically.

### Minimal interior rasterizer (Ch84)

`gs_stub` adds a *separate* per-interior-pixel emit channel
alongside the per-primitive `pixel_emit` of Ch82. The Ch82
strobe is unchanged (still pulses once per closed primitive); the
new channel pulses once per pixel that the rasterizer determines
is inside the closed primitive's interior.

| Output | Width | Carries |
|---|---|---|
| `raster_pixel_emit` | 1 | 1-cycle strobe per emitted interior pixel |
| `raster_pixel_emit_count` | 32 | Cumulative interior pixels emitted since reset |
| `raster_pixel_x_q` / `_y_q` | 12 | Integer screen coords of the emitted pixel |
| `raster_pixel_color_q` | 64 | Per-pixel color: Gouraud-interpolated R/G/B/A for TRI/TRI_STRIP/TRI_FAN (Ch86), flat (= `prim_color_q`) for SPRITE. Q passes through from the closing vertex. |
| `raster_pixel_fb_addr_q` | 32 | Computed VRAM byte offset (PSM-aware, same math as Ch82/Ch83) |
| `raster_active` | 1 | High while the FSM is scanning a primitive |
| `raster_overflow` | 1 | Latches if a new primitive closes while the 2-entry raster FIFO is full and no concurrent pop frees a slot (Ch87 + audit-medium fix). See "Raster command queue (Ch87)" below for the back-to-back-close budget. |
| `raster_degenerate` | 1 | Latches if a TRI/STRIP/FAN closes with zero signed area (3 colinear vertices). SCAN is skipped; SPRITE never sets this. |

#### Per-primitive coverage

| `PRIM` | Raster behavior |
|---|---|
| 0 POINT | No raster emit — Ch82 closing-pixel only |
| 1 LINE | No raster emit — Ch82 closing-pixel only |
| 2 LINE_STRIP | No raster emit — Ch82 closing-pixel only |
| 3 TRIANGLE | Bounding-box scan with edge-function half-plane test |
| 4 TRI_STRIP | Same engine as TRIANGLE, fires per closed strip triangle |
| 5 TRI_FAN | Same engine as TRIANGLE, fires per closed fan triangle |
| 6 SPRITE | Bounding-box rectangle fill (every pixel inside) |
| 7 reserved | No raster emit |

#### Triangle edge-function math

For each candidate pixel `p` and each edge `(vA, vB)` of the
triangle:

```
e(p) = (p.x - vA.x) * (vB.y - vA.y) - (p.y - vA.y) * (vB.x - vA.x)
```

32-bit signed math is used to avoid overflow at typical coord
ranges.

##### Top-left fill rule (Ch85)

Adjacent triangles that share an edge would double-paint pixels
on that edge under a naïve same-sign test. Ch85 applies the
standard D3D-style top-left fill rule so each shared-edge pixel
is owned by exactly one of the two triangles.

At the IDLE→SCAN transition the FSM:

1. Computes `signed_area = (v1-v0) × (v2-v0)`.
2. If `signed_area == 0` → degenerate (3 colinear vertices);
   `raster_degenerate` latches and SCAN is skipped (no
   raster pixels emit). The Ch82 `pixel_emit` and `prim_complete`
   pulses still fire — only the interior raster is suppressed.
3. If `signed_area < 0` → CW winding; the FSM swaps `v1` and
   `v2` so the rule applies uniformly to a CCW-ordered triangle.
4. For each edge of the post-swap CCW triangle, classifies it as
   *top-or-left* (inclusive) or *right/bottom* (exclusive):
   - **Top edge**: horizontal going right (`dy == 0 && dx > 0`).
   - **Left edge**: going down in Y-down screen (`dy > 0`).
   - Anything else is a right or bottom edge.

The inside test in SCAN becomes:

```
inside = (e[i] + bias[i] <= 0)  for all i in {0, 1, 2}
```

where `bias[i] = 0` if edge `i` is top-or-left and `bias[i] = 1`
otherwise. The `+1` bias converts the strict `< 0` test for
right/bottom edges into a non-strict `<= 0` test on the biased
value, keeping the math integer and uniform.

Result: for any two adjacent triangles sharing an edge, the
edge's pixels are inclusive in exactly one triangle's bias
configuration and exclusive in the other's — no double-paint.

Some shared-corner pixels may end up unpainted by either
triangle. That's the standard top-left rule trade-off:
non-overlap takes priority over coverage of every boundary
pixel.
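
The edge function, winding normalization, and biased inside test can be sketched in integer Python. This is a hedged illustration, not the FSM: the helper names are hypothetical, the edges are assumed to be ordered `(v0, v1)`, `(v1, v2)`, `(v2, v0)`, and the per-edge `bias` flags are supplied by the caller instead of being derived from `dx` / `dy`.

```python
# Integer sketch of the documented edge test (helper names are hypothetical).
def edge(p, va, vb):
    """e(p) for edge (va, vb); points are (x, y) integer tuples."""
    return (p[0] - va[0]) * (vb[1] - va[1]) - (p[1] - va[1]) * (vb[0] - va[0])

def signed_area(v0, v1, v2):
    """(v1 - v0) x (v2 - v0); zero means three colinear vertices."""
    return (v1[0] - v0[0]) * (v2[1] - v0[1]) - (v1[1] - v0[1]) * (v2[0] - v0[0])

def normalize(v0, v1, v2):
    """Swap v1/v2 on negative area; return None for a degenerate triangle."""
    sa = signed_area(v0, v1, v2)
    if sa == 0:
        return None
    return (v0, v2, v1) if sa < 0 else (v0, v1, v2)

def inside(p, tri, bias):
    """bias[i] is 0 for a top-or-left edge i and 1 for a right/bottom edge."""
    v0, v1, v2 = tri
    edges = ((v0, v1), (v1, v2), (v2, v0))      # assumed edge order e0, e1, e2
    return all(edge(p, a, b) + bias[i] <= 0 for i, (a, b) in enumerate(edges))
```

A pixel lying exactly on an edge has `e == 0` there, so it is kept when that edge's bias is 0 and rejected when the bias is 1, which is how a shared edge ends up owned by only one triangle.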

##### Per-pixel Gouraud color (Ch86)

Triangle interior pixels now use **per-pixel Gouraud color
interpolation** instead of flat shading. The three per-vertex
colors (the same Ch80 `prim_color_v0_q` / `prim_color_v1_q` /
`prim_color_v2_q` slot mapping) are latched at SCAN init with
the same `v1↔v2` swap mirror as the vertex coords, so the
post-swap CCW vertex order matches the latched color order.

For each interior pixel `p`, barycentric weights are derived
directly from the unbiased edge functions:

```
L0(p) = -e1(p)   // weight for v0 = signed area of (p, v1, v2)
L1(p) = -e2(p)   // weight for v1
L2(p) = -e0(p)   // weight for v2
       —  L0 + L1 + L2 == sa  for all p inside the triangle
```

For each color channel `ch` ∈ {R, G, B, A}:

```
ch_out(p) = (L0(p)*c0.ch + L1(p)*c1.ch + L2(p)*c2.ch) / sa
```

Q (the texture-coord IEEE float in c2's upper 32 bits) is **not**
interpolated — it passes through from the closing vertex's RGBAQ
unchanged.

For a flat-shaded primitive (RGBAQ written once before all three
vertices, all three vertex colors equal), `L0 + L1 + L2 == sa`, so
the numerator is `sa * c0.ch` and the formula collapses to `c0`
exactly with no rounding error. Existing flat-shaded raster TBs
(raster_basic, raster_topleft) continue to pass.

The R/G/B/A division uses **integer truncation toward zero**.
Real PS2 GS uses fixed-point with specific rounding rules; the
recognition-layer stub is intentionally simpler. SPRITE keeps
flat shading (only 2 vertices, no barycentric weights defined).
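
The weight and channel formulas above can be checked with a small Python model. It is a sketch under assumptions: `gouraud` and `trunc_div` are hypothetical names, the triangle is assumed to be already normalized to positive signed area, and the edge order `(v0, v1)`, `(v1, v2)`, `(v2, v0)` for `e0`, `e1`, `e2` is inferred from the weight comments.

```python
# Hypothetical integer model of the Ch86 per-pixel interpolation.
def edge(p, va, vb):
    return (p[0] - va[0]) * (vb[1] - va[1]) - (p[1] - va[1]) * (vb[0] - va[0])

def trunc_div(n, d):
    """Integer division truncating toward zero (Python's // floors instead)."""
    q = abs(n) // abs(d)
    return q if (n >= 0) == (d > 0) else -q

def gouraud(p, tri, colors):
    """tri = (v0, v1, v2) with positive signed area; colors = three
    (r, g, b, a) tuples in the same vertex order."""
    v0, v1, v2 = tri
    l0 = -edge(p, v1, v2)           # L0 = -e1: weight for v0
    l1 = -edge(p, v2, v0)           # L1 = -e2: weight for v1
    l2 = -edge(p, v0, v1)           # L2 = -e0: weight for v2
    sa = l0 + l1 + l2               # equals the signed area for any p
    return tuple(
        trunc_div(l0 * c0 + l1 * c1 + l2 * c2, sa)
        for c0, c1, c2 in zip(*colors)
    )
```

At a vertex the weights reduce to `(sa, 0, 0)` and the result is that vertex's color; with three equal colors the result equals that color at every pixel.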

#### Sprite rectangle fill

A SPRITE has two vertices forming opposite corners. The bounding
box is computed via `min`/`max` of each axis; every pixel inside
the box is emitted in row-major order.

#### FSM and scan timing

The FSM is `IDLE` → `SCAN`. On `prim_complete_now` for an eligible
primitive, the FSM latches the vertex tuple, color, FRAME_1
fields, and bounding box, then walks the box one pixel per cycle.
For each pixel: combinational inside-test → if inside, pulse
`raster_pixel_emit` and update the snapshot. Returns to `IDLE`
when `(ras_cur_x, ras_cur_y) == (x_max, y_max)`.

Color is **Gouraud-interpolated per pixel** for triangles
(Ch86) and **flat** for sprites; see the dedicated subsections
above for the fill-rule and Gouraud math. The closing-primitive
flat color (`prim_color_q`) is still used as the SPRITE fill
color and as a backward-compat reference for flat-shaded TRIs
(when all three vertex colors are equal, the Gouraud formula
reduces to that flat value with no rounding error).

Coordinates are **integer** — the 4-bit sub-pixel of 12.4
fixed-point is discarded. Sub-pixel edge adjustment is not
modeled (top-left fill rule IS modeled — see Ch85 subsection
above).

#### Raster command queue (Ch87) and `raster_overflow`

`gs_stub` has a **2-entry FIFO** in front of the SCAN FSM. Every
primitive close that targets the rasterizer (`RM_TRI` /
`RM_SPRITE`) snapshots its full per-prim context (vertices,
bias, signed area, per-vertex colors, FRAME_1 fields, bounding
box) into the queue at the close cycle. The FSM dequeues the
oldest entry whenever it's idle or finishing a scan. Effective
concurrency is **1 in-flight + 2 queued = up to 3 back-to-back
closes** absorbed without drop.

`raster_overflow` now latches when a 4th close arrives while the
FIFO is **full** (1 in-flight, both FIFO slots occupied). The
4th primitive is dropped. Earlier chapters' bound of "1 close
mid-scan = overflow" is replaced by Ch87's "3 closes
back-to-back = OK; 4th = overflow."

Degenerate triangles are **filtered at enqueue**: they set
`raster_degenerate` and are not pushed into the queue. SPRITE
never sets `raster_degenerate`. POINT/LINE/LINE_STRIP don't
raster (RM_NONE) — they don't enqueue at all and the queue
ignores them.

Pop happens at `IDLE`→`SCAN` AND at drain-done (Ch88; see below)
when the queue has more work, so back-to-back scans run
contiguously without an `IDLE` bubble. `raster_active` stays
high across the boundary.

Real PS2 game streams emit thousands of primitives back-to-back;
3-deep concurrency is enough for most TRI_STRIP / TRI_FAN
patterns with small bounding boxes. Larger sprites or larger
triangles increase scan length and reduce headroom — a future
chapter can grow the FIFO depth.
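
The back-to-back-close budget can be illustrated with a simple occupancy model. This Python sketch is not cycle-accurate and is not the RTL: `RasterQueue` and its methods are hypothetical, a close pops immediately when the scanner is idle, and the concurrent push-and-pop corner case is ignored.

```python
from collections import deque

class RasterQueue:
    """Hypothetical occupancy model: one in-flight scan plus a 2-entry FIFO."""
    DEPTH = 2

    def __init__(self):
        self.fifo = deque()
        self.in_flight = None
        self.overflow = False       # sticky, like raster_overflow
        self.degenerate = False     # sticky, like raster_degenerate

    def _pop_if_idle(self):
        if self.in_flight is None and self.fifo:
            self.in_flight = self.fifo.popleft()

    def close(self, prim_id, signed_area=1):
        """A rasterizable primitive closes; signed_area == 0 is a degenerate TRI."""
        if signed_area == 0:
            self.degenerate = True          # filtered at enqueue, never queued
            return
        if len(self.fifo) == self.DEPTH:
            self.overflow = True            # FIFO full: this primitive is dropped
            return
        self.fifo.append(prim_id)
        self._pop_if_idle()

    def scan_done(self):
        """The in-flight scan has drained; start the next one if queued."""
        self.in_flight = None
        self._pop_if_idle()
```

Three closes with no scan completing leave one primitive in flight and two queued; a fourth sets `overflow` and is lost.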

#### Pixel pipeline (Ch88)

The SCAN body is **3 stages, throughput 1 candidate pixel per
cycle**:

| Stage | Source | Work |
|-------|--------|------|
| **S0** | `ras_cur_x` / `ras_cur_y` (bbox walker) | Generate the next candidate coord; advance the bbox walker; on bbox corner, fire `ras_at_end_of_s0` and transition R_SCAN→R_DRAIN. |
| **S1** | `s1_x_q` / `s1_y_q` (registered) | Combinational edge functions on `(s1_x, s1_y)` against the three triangle edges (or trivial-true for SPRITE), top-left bias, inside test → `s1_pixel_inside`. Latched into `s2_inside_q`. |
| **S2** | `s2_x_q` / `s2_y_q` / `s2_L0..L2_q` / `s2_inside_q` | Compute Gouraud `interp_byte(λ_i, c_i)` × 4 channels and `s2_fb_addr` from PSM/FBP/FBW. If `s2_valid_q && s2_inside_q`, drive `raster_pixel_emit` with the resolved fb_addr / x / y / color. |

`raster_state` is now a 3-state FSM:

- **R_IDLE** — no work; `pop_ok` fires on a non-empty FIFO.
- **R_SCAN** — S0 produces one valid coord per cycle; S1/S2
  latches propagate. On bbox corner, transitions to R_DRAIN.
- **R_DRAIN** — S0 stops producing valids (`s1_valid_q <= 0`);
  S1 and S2 finish their in-flight pixels. When both pipeline
  valids are low (`drain_done`), the FSM either pops the next
  primitive (back-to-back contiguous SCAN) or returns to R_IDLE.

`pop_ok = !fifo_empty && (R_IDLE || drain_done)` — the
end-of-scan pop is now drain-done, three cycles after S0
produces the corner. This guarantees the pipeline-tail pixels
of the previous primitive are not overwritten by the next
primitive's pop, while still keeping `raster_active` high
across the seam.

Latency from `pop_ok` to first registered `raster_pixel_emit`
is **3 stages of pipeline + 1 cycle of FIFO turnaround + 1
cycle of registered emit output = 5 posedges from the close
cycle of the closing vertex** (see
`sim/tb/gif_gs/tb_gs_raster_pipeline.sv` for the cycle-exact
contract).
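
The state transitions and the `pop_ok` expression can be summarized as a next-state function. This is a Python sketch of the description above, not the RTL and not a timing model: the function name and argument names are hypothetical, and `drain_done` is taken to mean "in R_DRAIN with both pipeline valids low".

```python
# Hypothetical next-state sketch for the 3-state raster FSM.
R_IDLE, R_SCAN, R_DRAIN = "R_IDLE", "R_SCAN", "R_DRAIN"

def next_state(state, fifo_empty, at_end_of_s0, s1_valid, s2_valid):
    """Return (next_state, pop_ok) for one clock."""
    drain_done = state == R_DRAIN and not s1_valid and not s2_valid
    pop_ok = (not fifo_empty) and (state == R_IDLE or drain_done)
    if pop_ok:
        return R_SCAN, True             # start a scan, or chain into the next
    if state == R_SCAN and at_end_of_s0:
        return R_DRAIN, False           # S0 has walked the bbox corner
    if drain_done:
        return R_IDLE, False            # pipeline empty and nothing queued
    return state, False
```

The R_DRAIN to R_SCAN path is the seam that keeps `raster_active` high between back-to-back primitives.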

The event taxonomy above has two further stream-level entries:

- `EV_MODE` — fired for any accept that did not resolve to a tracked
  register: REGLIST entries, IMAGE/DISABLE payload qwords, NOP-nibble
  PACKED slots, unknown privileged offsets, unknown GIF reg numbers.
  Reserved for "we know we saw something, we are intentionally not
  modeling it yet."

- `EV_GIFTAG` — one per accepted GIFtag; carries `flg`/`nreg`/`nloop`/`eop`
  for stream-level checking.

When trace event semantics change, audit this section and the per-stub
trace-schema header comments together.

#### VRAM persistence (Ch89)

`vram_stub` (`rtl/gif_gs/vram_stub.sv`) is the **first persistence
layer** the rasterizer has had. Every `raster_pixel_emit` pulse
writes up to 4 bytes of pixel data (gated per byte by `write_be`
since Ch95) at `raster_pixel_fb_addr_q` into `vram_stub`'s linear
byte array. A combinational debug read port exposes `read_data`
byte-addressably so testbenches can verify storage.

Wiring:

| vram_stub port | gs_stub source |
|---|---|
| `write_en`   | `raster_pixel_emit` |
| `write_addr` | `raster_pixel_fb_addr_q` |
| `write_data` | `raster_pixel_color_q[31:0]` (the lower 32 bits — Q in the upper 32 is not framebuffer data) |
| `write_be`   | `raster_pixel_be_q` (Ch95) — per-byte write enable: byte i (the byte at `write_addr + i`) is committed only when `write_en && write_be[i]`. Lets the same 32-bit write port serve PSMs of any byte width. |
| `write_mask` | `raster_pixel_mask_q` (Ch106) — per-bit merge mask: for each enabled byte, `mem[i] <= (mem[i] & ~mask_i) | (data_i & mask_i)`. Tied to `0xFFFFFFFF` for PSMs ≥ 1 byte/pixel (no behavior change). PSMT4 drives `0x0000_000F` or `0x0000_00F0` to preserve the un-targeted nibble in the same byte. |
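
The combined effect of `write_be` and `write_mask` can be modeled in a few lines of Python. This is a behavioral sketch of the merge rule in the table, not the `vram_stub` source; `vram_write` is a hypothetical name and `mem` stands in for the linear byte array.

```python
# Behavioral sketch of one vram_stub write pulse (hypothetical helper).
def vram_write(mem, write_addr, write_data, write_be, write_mask=0xFFFF_FFFF):
    """Apply one write_en pulse to `mem` (a bytearray) in place."""
    for i in range(4):
        if not (write_be >> i) & 1:
            continue                                # byte lane i is disabled
        data_i = (write_data >> (8 * i)) & 0xFF
        mask_i = (write_mask >> (8 * i)) & 0xFF
        addr = write_addr + i                       # byte i lands at write_addr + i
        mem[addr] = (mem[addr] & ~mask_i & 0xFF) | (data_i & mask_i)
```

Two PSMT4-style writes to the same byte with masks `0x0F` and `0xF0` leave both nibbles in place, and a `4'b0011` write at an odd address touches exactly two bytes.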

Scope (current write-side support, after Ch106):

- **PSMCT32 + PSMCT16 + PSMT8 + PSMT4** at the raster write port.
  The PSM width is selected by `gs_stub`'s `bpp_shift` mux off
  `FRAME_1.PSM` and surfaced as `raster_pixel_psm_q`; `gs_stub`'s
  S2 packs the pixel into the right byte lane and drives
  `raster_pixel_be_q` so `vram_stub` commits exactly the right
  bytes:
  - PSMCT32 (PSM=0x00) → 4 bytes/pixel, `be = 4'b1111`, ABGR in
    `write_data[31:0]`.
  - PSMCT16 (PSM=0x02) → 2 bytes/pixel, `be = 4'b0011`, RGB5A1
    packed in `write_data[15:0]` (Ch95). `write_addr` is the
    halfword byte address — per-byte `be` makes unaligned
    halfword writes safe.
  - PSMT8 (PSM=0x13) → 1 byte/pixel, `be = 4'b0001`, the natural
    ABGR's R channel goes into `write_data[7:0]` as the PSMT8
    index (Ch105). `write_addr` is the exact byte address;
    `vram_stub` commits `mem[write_addr] ← write_data[7:0]` at
    any byte alignment without needing data-lane shifting.
  - PSMT4 (PSM=0x14) → 0.5 bytes/pixel (2 pixels per byte),
    `be = 4'b0001`, `write_mask = 0x0000_000F` (low nibble) or
    `0x0000_00F0` (high nibble) per `pixel_index[0]`. The 4-bit
    index (low nibble of natural ABGR's R) is placed in the
    targeted nibble position in `write_data[7:0]`. vram_stub
    merges only the masked bits — the OTHER nibble of the same
    byte is preserved (Ch106). Back-to-back same-byte emits
    (e.g. PSMT4 pixels x=0 and x=1, both landing in byte 0)
    chain through NBA semantics: the second NBA samples
    mem[addr] AFTER the prior commit, so both nibbles end up in
    the byte without a bypass-forwarding net.
  - PSMCT24 / PSMCT16S / PSMZ32 / PSMZ24 / PSMZ16 / PSMZ16S /
    PSMT8H / PSMT4HL / PSMT4HH — `bpp_shift` falls through to a
    host-word default (4 bytes); raster emit through these PSMs
    is not contract-tested.
- **Write-side addressing**. Real PS2 VRAM is 4 MiB organized
  into pages × blocks × columns per PSM. By DEFAULT, both
  `gs_stub` raster emit and `gif_image_xfer_stub` TRXDIR uploads
  produce the linear-framebuffer layout PCSX2 calls "linear PSM".
  Optional per-PSM swizzle paths gated by parameters on each
  module:
    * **PSMCT32**: `PSMCT32_SWIZZLE` parameter on `gs_pcrtc_stub`
      (Ch120 read-side), `gif_image_xfer_stub` (Ch121 image-xfer
      write-side), and `gs_stub` (Ch122 raster write-side).
    * **PSMCT16**: `PSMCT16_SWIZZLE` parameter on `gs_pcrtc_stub`
      (Ch126 read-side), `gif_image_xfer_stub` (Ch127 image-xfer
      write-side), and `gs_stub` (Ch128 raster write-side). All
      three integration points live, mirroring the PSMCT32 trio.
  When on, byte addresses route through the per-PSM swizzle module
  (`gs_swizzle_psmct32_stub` / `gs_swizzle_psmct16_stub`); image-xfer
  adds `dest_base_q = DBP*256` on top of the swizzle output so any
  DBP works, while raster emit feeds the active `ras_fbp` directly
  so the swizzle output is already the absolute address. Per-PSM
  parameters are independent — enabling one doesn't affect the
  other PSM. **PSMT8** has its full three-path swizzle integration
  as of Ch134, mirroring the PSMCT32/PSMCT16 trios: standalone
  math primitive `gs_swizzle_psmt8_stub` (Ch131) wired into
  `gs_pcrtc_stub` (Ch132 read-side, `PSMT8_SWIZZLE`),
  `gif_image_xfer_stub` (Ch133 write-side), and `gs_stub` (Ch134
  raster emit) — same parameter name on all three modules.
  **PSMT4** has its full three-path swizzle integration as of
  Ch140, mirroring the PSMCT32/PSMCT16/PSMT8 trios: standalone
  math primitive `gs_swizzle_psmt4_stub` (Ch137) wired into
  `gs_pcrtc_stub` (Ch138 read-side, `PSMT4_SWIZZLE`),
  `gif_image_xfer_stub` (Ch139 write-side), and `gs_stub` (Ch140
  raster emit) — same parameter name on all three modules. The
  PSMT4 paths additionally thread the swizzle module's
  `nibble_hi` output through the existing Ch106 (raster) /
  Ch118 (image-xfer) nibble RMW machinery (replacing
  `s2_pixel_index[0]` / `x_eff[0]` as the high/low nibble
  selector when the gate is on). All parameter defaults are 0,
  so existing TBs see the legacy linear behavior. **All four
  common GS PSMs (CT32 + CT16 + T8 + T4) now have complete
  three-path swizzle integration.**
- **Stub-sized**. Default `BYTES = 65536`. Real VRAM is 4 MiB; for
  TB purposes a small linear region is enough to verify that
  emitted pixels actually land at the addresses gs_stub computes.
- **Scanout path** is provided by `gs_pcrtc_stub` (Ch90 — see
  below). The legacy `platform_video_stub` flood-fills BGCOLOR
  and is unaware of VRAM; TBs that want to verify the round trip
  use `gs_pcrtc_stub` instead.
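
As a rough illustration of the base-address rule in the write-side addressing bullet, the sketch below composes the two write paths around an injected swizzle function. The function names and the `toy_swizzle` stand-in are hypothetical; the stand-in is plain linear PSMCT32 math, used only to make the example runnable, and is NOT the mapping implemented by `gs_swizzle_psmct32_stub`.

```python
from typing import Callable

# (fbp, fbw, x, y) -> byte address; stands in for a per-PSM swizzle module.
Swizzle = Callable[[int, int, int, int], int]


def toy_swizzle(fbp: int, fbw: int, x: int, y: int) -> int:
    """Placeholder only: linear PSMCT32 math, not the real swizzle."""
    return fbp * 2048 + ((y * (fbw * 64) + x) << 2)


def image_xfer_addr(swizzle: Swizzle, dbp: int, dbw: int, x: int, y: int) -> int:
    """Image-xfer path: swizzle with FBP=0, then add dest_base_q = DBP*256."""
    return dbp * 256 + swizzle(0, dbw, x, y)


def raster_addr(swizzle: Swizzle, fbp: int, fbw: int, x: int, y: int) -> int:
    """Raster path: the active FBP feeds the swizzle, so its output is absolute."""
    return swizzle(fbp, fbw, x, y)
```

FBP counts 2048-byte units and DBP counts 256-byte units, so DBP=8 and FBP=1 name the same base address (byte 2048) and the two helpers agree there for any swizzle that treats FBP as an additive base.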

The Ch89 white-box TB `tb_gs_vram_writeback.sv` exercises the
contract end-to-end: drive a 4×4 SPRITE through gs_stub, capture
the (fb_addr, color) of each `raster_pixel_emit` pulse, then
read each fb_addr back from `vram_stub` and assert byte-exact
match.

#### PCRTC scanout (Ch90)

`gs_pcrtc_stub` (`rtl/gif_gs/gs_pcrtc_stub.sv`) is the **scanout
side** of the GS pipeline — its dual is `gs_stub` (the write
side). It models a minimal PCRTC (Programmable CRT Controller):
runs its own raster timing, generates a VRAM read address from
the current `(hcnt, vcnt)` using the same fb_addr math as
gs_stub, reads the data returned by `vram_stub`'s combinational
debug port, and drives `r`/`g`/`b` for the active area. Together
with Ch88's pipeline + Ch89's VRAM, this closes the loop:

```
raster_pixel_emit → vram_stub.write → vram_stub.read → pcrtc.r/g/b
```

Configuration (Ch91 — privileged-block CPU MMIO):

`gs_pcrtc_stub` consumes three real PS2 GS privileged display
register latches directly from `gs_stub`:

| pcrtc input | gs_stub source | Layout |
|---|---|---|
| `pmode_q[63:0]` | privileged write at offset 0x0000 | bit 0 = EN1 (display 1 enable) |
| `dispfb1_q[63:0]` | privileged write at offset 0x0070 | FBP[8:0], FBW[14:9], PSM[19:15], DBX[42:32] (Ch91-audit), DBY[53:43] (Ch91-audit) |
| `display1_q[63:0]` (Ch92, Ch93) | privileged write at offset 0x0080 | DX[11:0], DY[22:12], MAGH[26:23] (Ch93 — H scale = MAGH+1), MAGV[28:27] (Ch93 — V scale = MAGV+1), DW[43:32] (width-1), DH[54:44] (height-1) |

The Ch90 sideband ports (`scanout_enable` / `dispfb_fbp` /
`dispfb_fbw`) are **gone**. TBs program scanout the way a real
PS2 driver would: write DISPFB1, then DISPLAY1, then PMODE.EN1=1
(Ch92). Out of reset, all three registers are 0, so EN1 is low
and pcrtc outputs 0.
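
A minimal sketch of how a TB-side helper might build the three 64-bit register values from the layout table and issue them in the stated order. The helper names (`pack_pmode`, `pack_dispfb1`, `pack_display1`, `scanout_bringup`) are hypothetical and not part of the RTL; only the bit positions and offsets come from the table above.

```python
def pack_pmode(en1: int) -> int:
    """PMODE: bit 0 = EN1."""
    return en1 & 1


def pack_dispfb1(fbp: int, fbw: int, psm: int, dbx: int = 0, dby: int = 0) -> int:
    """DISPFB1: FBP[8:0], FBW[14:9], PSM[19:15], DBX[42:32], DBY[53:43]."""
    return ((fbp & 0x1FF)
            | ((fbw & 0x3F) << 9)
            | ((psm & 0x1F) << 15)
            | ((dbx & 0x7FF) << 32)
            | ((dby & 0x7FF) << 43))


def pack_display1(dx: int, dy: int, magh: int, magv: int,
                  width: int, height: int) -> int:
    """DISPLAY1: DW and DH are stored as width-1 and height-1."""
    return ((dx & 0xFFF)
            | ((dy & 0x7FF) << 12)
            | ((magh & 0xF) << 23)
            | ((magv & 0x3) << 27)
            | (((width - 1) & 0xFFF) << 32)
            | (((height - 1) & 0x7FF) << 44))


def scanout_bringup(fbp: int, fbw: int, psm: int, width: int, height: int):
    """Return (offset, value) pairs in the order DISPFB1, DISPLAY1, PMODE."""
    return [
        (0x0070, pack_dispfb1(fbp, fbw, psm)),
        (0x0080, pack_display1(0, 0, 0, 0, width, height)),
        (0x0000, pack_pmode(1)),
    ]
```

Writing PMODE last means EN1 only goes high after the framebuffer and window latches already hold their final values.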

`scanout_enable` inside pcrtc is derived combinationally from
the latches:
`scanout_enable = pmode_q[0] & (PSM ∈ {0, 2, 0x13, 0x14})`.
PSMCT32 (=0), PSMCT16 (=2), PSMT8 (=0x13), and PSMT4 (=0x14) are
honored at this scope; any other PSM forces scanout off rather
than mis-decoding the byte layout.
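
The gating rule can be written out as a small reference function. This is a sketch of the stated equation only; the function and constant names are illustrative, and the PSM field position is taken from the DISPFB1 layout (bits [19:15]).

```python
PSMCT32, PSMCT16, PSMT8, PSMT4 = 0x00, 0x02, 0x13, 0x14
SUPPORTED_PSMS = {PSMCT32, PSMCT16, PSMT8, PSMT4}


def scanout_enable(pmode_q: int, dispfb1_q: int) -> bool:
    """EN1 (PMODE bit 0) gated by a supported DISPFB1.PSM."""
    en1 = pmode_q & 1
    psm = (dispfb1_q >> 15) & 0x1F
    return bool(en1) and psm in SUPPORTED_PSMS
```

With all latches at their reset value of 0 the function returns `False`, and a PSM outside the supported set keeps scanout off even when EN1 is set.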

DISPLAY1 (Ch92, Ch93) supplies the **display window** — the
sub-rect inside the active area where pcrtc actually pulls
pixels from VRAM — and the **per-axis magnification**: each
VRAM column is shown for (MAGH+1) consecutive VCK pulses, each
VRAM line for (MAGV+1) raster lines. Outside the window pcrtc
drives r/g/b = 0 even with EN1=1. Pcrtc's H_TOTAL/V_TOTAL still
come from module parameters at instantiation; only the
active-area sub-rect gated by DX/DY/DW/DH is register-driven.
Dual-display (PMODE.EN2 + DISPFB2 + DISPLAY2) is deferred.

Address math + display-window gating + magnification:

```
hmag_factor    = MAGH + 1                        // 1..16
vmag_factor    = MAGV + 1                        // 1..4
hwin_rel       = hcnt - DX                       // pixel offset inside the window
vwin_rel       = vcnt - DY
in_window      = (hcnt >= DX) && (hwin_rel <= DW)
              && (vcnt >= DY) && (vwin_rel <= DH)
fbp_bytes      = dispfb_fbp << 11               // FBP × 2048
pixels_per_row = dispfb_fbw << 6                // FBW × 64
vram_x_unshift = hwin_rel / hmag_factor          // 4 displayed pixels = 1 VRAM column at MAGH=3
vram_y_unshift = vwin_rel / vmag_factor
effective_x    = vram_x_unshift + DBX
effective_y    = vram_y_unshift + DBY
pixel_index    = effective_y × pixels_per_row + effective_x
bpp_shift      = (PSM == PSMCT32) ? 2 :
                 (PSM == PSMCT16) ? 1 :
                 (PSM == PSMT8)   ? 0 : 2
fb_addr        = fbp_bytes + (pixel_index << bpp_shift)
r/g/b drive    = (de && scanout_enable && in_window) ? decode(VRAM, PSM) : 0
```
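
The pseudocode above translates fairly directly into a software reference model, which can be handy for computing expected addresses in a TB. The sketch below is an illustration under assumptions: the function name and keyword arguments are hypothetical, it covers only the linear (non-swizzled) path, and it folds in the PSMT4 `pixel_index >> 1` override described in the per-PSM decode notes.

```python
PSMCT32, PSMCT16, PSMT8, PSMT4 = 0x00, 0x02, 0x13, 0x14


def pcrtc_fb_addr(hcnt, vcnt, *, fbp=0, fbw=1, psm=PSMCT32, dbx=0, dby=0,
                  dx=0, dy=0, magh=0, magv=0, dw=0, dh=0):
    """Linear VRAM byte address for (hcnt, vcnt), or None outside the window."""
    hwin_rel = hcnt - dx
    vwin_rel = vcnt - dy
    in_window = (hcnt >= dx and hwin_rel <= dw
                 and vcnt >= dy and vwin_rel <= dh)
    if not in_window:
        return None
    pixels_per_row = fbw << 6                      # FBW x 64
    effective_x = hwin_rel // (magh + 1) + dbx     # undo H magnification
    effective_y = vwin_rel // (magv + 1) + dby     # undo V magnification
    pixel_index = effective_y * pixels_per_row + effective_x
    fbp_bytes = fbp << 11                          # FBP x 2048
    if psm == PSMT4:
        return fbp_bytes + (pixel_index >> 1)      # 2 pixels per byte
    bpp_shift = {PSMCT32: 2, PSMCT16: 1, PSMT8: 0}.get(psm, 2)
    return fbp_bytes + (pixel_index << bpp_shift)
```

For example, with DX=4, DY=2, DW=7, DH=7, MAGH=MAGV=1, FBW=1 and PSMCT32, displayed (11, 9) maps to VRAM (3, 3), which is byte 780; displayed (3, 2) is left of the window and yields `None`.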

Per-PSM color decode at `vram_read_data`:

- **PSMCT32**: `r = data[7:0]`, `g = data[15:8]`, `b = data[23:16]`. Alpha at `[31:24]` is dropped (no DAC channel).
- **PSMCT16** (Ch94): RGB5A1 packed into the lower 16 bits as `{A[15], B[14:10], G[9:5], R[4:0]}`. 5→8 expansion uses bit-replicate `r8 = {r5, r5[4:2]}` (so 5'h1F → 8'hFF, 5'h00 → 8'h00). Alpha bit dropped at the DAC.
- **PSMT8** (Ch96/Ch97): index in `data[7:0]`. With `clut_enable=1` (Ch97), pcrtc presents `clut_read_idx = idx + (CSA << 4)` to the external `clut_stub` and decodes the returned PSMCT32 entry as `r = clut_data[7:0]`, `g = clut_data[15:8]`, `b = clut_data[23:16]`. With `clut_enable=0` (Ch96 fallback), pcrtc surfaces the index as grayscale so the 8-bit storage lane is visually verifiable without programming a CLUT.
- **PSMT4** (Ch103): 2 pixels per byte. `byte_offset = pixel_index >> 1` (overrides the standard `pixel_index << bpp_shift` math). `nibble = pixel_index[0] ? data[7:4] : data[3:0]` picks the 4-bit pixel; the zero-extended 8-bit value `{4'd0, nibble}` plus `(CSA << 4)` is presented on `clut_read_idx`. With `clut_enable=1`, pcrtc decodes the returned PSMCT32 entry the same way as PSMT8. With `clut_enable=0`, the fallback replicates the nibble across the 8-bit DAC value (`r = g = b = {nibble, nibble}`) so 4'hF → 0xFF and 4'h5 → 0x55. CSA is the natural per-palette-window selector for PSMT4 — multiple 16-entry palettes can share the 256-entry staging area, indexed by CSA.
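
The four decode rules can be sketched as a software reference for TB expectations. The helper names are hypothetical, and the mod-256 wrap in `clut_index` is an assumption drawn from the CSA wrap cases the TBs lock (idx 0xF8 with CSA=1 reading CLUT[0x08]).

```python
def expand5(v5: int) -> int:
    """5->8 bit-replicate: {v5, v5[4:2]}."""
    return (v5 << 3) | (v5 >> 2)


def decode_psmct32(data: int):
    """Return (r, g, b); alpha at [31:24] is dropped."""
    return data & 0xFF, (data >> 8) & 0xFF, (data >> 16) & 0xFF


def decode_psmct16(data: int):
    """RGB5A1 in the low 16 bits: {A[15], B[14:10], G[9:5], R[4:0]}."""
    return (expand5(data & 0x1F),
            expand5((data >> 5) & 0x1F),
            expand5((data >> 10) & 0x1F))


def psmt4_nibble(byte: int, pixel_index: int) -> int:
    """Odd pixel_index selects the high nibble, even selects the low nibble."""
    return (byte >> 4) & 0xF if pixel_index & 1 else byte & 0xF


def clut_index(idx: int, csa: int) -> int:
    """clut_read_idx = idx + (CSA << 4), wrapped to 256 entries (assumed)."""
    return (idx + (csa << 4)) & 0xFF


def decode_indexed(idx: int, csa: int, clut, clut_enable: bool, is_psmt4: bool):
    """PSMT8/PSMT4 decode: CLUT lookup, or the grayscale fallback."""
    if clut_enable:
        return decode_psmct32(clut[clut_index(idx, csa)])
    gray = ((idx << 4) | idx) if is_psmt4 else idx   # PSMT4 replicates the nibble
    return gray, gray, gray
```

A PSMT4 nibble of 0x5 with `clut_enable=0` decodes to (0x55, 0x55, 0x55), matching the fallback rule stated for that format.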

**Ch95 — gs_stub raster channel emits PSMCT16**. The S2 stage
of the pipeline now packs ABGR → RGB5A1 (`r5=R[7:3]`, `g5=G[7:3]`,
`b5=B[7:3]`, `a1=A[7]`) when `ras_bpp_shift==1` (PSMCT16 / PSMCT16S
/ PSMZ16 / PSMZ16S — any 16-bit PSM). The packed 16-bit pixel
goes in the LOW halfword of `raster_pixel_color_q[31:0]`, and a
new `raster_pixel_be_q[3:0]` selects which bytes vram_stub
commits: `4'b0011` for PSMCT16, `4'b1111` for PSMCT32. vram_stub
gates each byte write on `write_be[i]`, so back-to-back PSMCT16
emits write 2 bytes each without halfword stomping. New
`raster_pixel_psm_q[5:0]` exposes the current PSM for trace.
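
A small sketch of the S2 packing step as described, useful for deriving expected halfwords in a TB. The function names are illustrative; the ABGR word is assumed to hold R in bits [7:0], G in [15:8], B in [23:16] and A in [31:24], consistent with the PSMCT32 decode used at scanout.

```python
def pack_rgb5a1(abgr: int) -> int:
    """Pack 32-bit ABGR into RGB5A1: {A[15], B[14:10], G[9:5], R[4:0]}."""
    r5 = (abgr >> 3) & 0x1F     # R[7:3]
    g5 = (abgr >> 11) & 0x1F    # G[7:3]
    b5 = (abgr >> 19) & 0x1F    # B[7:3]
    a1 = (abgr >> 31) & 0x1     # A[7]
    return (a1 << 15) | (b5 << 10) | (g5 << 5) | r5


def emit_word_and_be(abgr: int, ras_bpp_shift: int):
    """Return (color word, byte enables) for the 16-bit and 32-bit cases only."""
    if ras_bpp_shift == 1:
        return pack_rgb5a1(abgr), 0b0011    # packed pixel in the low halfword
    return abgr & 0xFFFFFFFF, 0b1111        # PSMCT32: all four bytes
```

For RGBAQ R=0x55, G=0xAA, B=0xBB, A=0xCC the packed halfword is 0xDEAA (r5=0x0A, g5=0x15, b5=0x17, a1=1).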

The Ch95 TB `tb_gs_raster_psmct16.sv` exercises the round trip:
gs_stub renders a 4×4 SPRITE with FRAME_1.PSM=PSMCT16, then VRAM
read-back verifies each pixel landed at the right halfword AND
that the halfword right after the sprite stays zero (no leak).

Ch105 extends the raster channel to PSMT8 (FRAME_1.PSM=0x13).
When `ras_bpp_shift==0`, S2 takes the natural ABGR's R channel
(low 8 bits) as the PSMT8 index — the same lane real PS2 hardware
writes when the destination FB is PSMT8 — places it in the LOW
byte of the emit lane, and sets `raster_pixel_be_q = 4'b0001` so
vram_stub commits exactly the 1 byte at fb_addr. The 1-byte
commit works at any byte alignment because vram_stub gates each
byte lane independently. The Ch105 TB `tb_gs_raster_psmt8.sv`
renders a 5×3 SPRITE (chosen so the row spans byte lanes 1, 2, 3,
0, 1 — exercising every lane alignment) at FRAME_1.PSM=PSMT8 with
RGBAQ R=0x55, G=0xAA, B=0xBB, A=0xCC; asserts each sprite byte
reads back as 0x55, the bytes immediately left and right of the
sprite stay 0x00 (so 32-bit-aligned overwrite would be visible),
and a full-VRAM sweep finds NO byte equal to 0xAA / 0xBB / 0xCC
(channel-isolation: only R reaches VRAM at PSMT8).
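
The byte-gated commit and the channel-isolation property can be sketched in a few lines. This is a behavioral approximation, not the RTL: it assumes byte enable `i` commits to `mem[addr + i]`, which matches the statement that exactly one byte lands at fb_addr, and the function names are hypothetical.

```python
def vram_write(mem: bytearray, addr: int, data: int, be: int) -> None:
    """Commit byte i of `data` to mem[addr + i] only when be[i] is set."""
    for i in range(4):
        if (be >> i) & 1:
            mem[addr + i] = (data >> (8 * i)) & 0xFF


def emit_psmt8(mem: bytearray, fb_addr: int, abgr: int) -> None:
    """PSMT8 emit: the R channel (low 8 bits) is the index; be = 4'b0001."""
    vram_write(mem, fb_addr, abgr & 0xFF, 0b0001)


def render_row(mem: bytearray, start: int, width: int, abgr: int) -> None:
    """Emit `width` consecutive PSMT8 pixels starting at byte `start`."""
    for x in range(width):
        emit_psmt8(mem, start + x, abgr)
```

Rendering a 5-pixel row that starts at byte 1 touches byte lanes 1, 2, 3, 0, 1 in turn and leaves bytes 0 and 6 untouched, which is the alignment coverage the 5-wide sprite is chosen for.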

Ch106 closes the indexed-write gap with PSMT4 (FRAME_1.PSM=0x14)
as a per-bit RMW into `vram_stub`. Three changes form the
mechanism:

1. `vram_stub` gains a new `write_mask[31:0]` input (Ch106). The
   commit is now `mem[i] <= (mem[i] & ~mask_i) | (data_i & mask_i)`
   for each enabled byte. PSMCT32/16/PSMT8 tie mask=`0xFFFF_FFFF`
   (no behavior change — full byte writes).
2. `gs_stub`'s S2 PSM-aware emit packing gets a PSMT4 branch:
   the byte address is `pixel_index >> 1` (overrides the
   `pixel_index << ras_bpp_shift` form), the index is the low
   4 bits of the natural ABGR's R channel, and the emit places
   that 4-bit value in either the low (`{4'd0, idx}`) or high
   (`{idx, 4'd0}`) nibble of `write_data[7:0]` based on
   `pixel_index[0]`. `s2_emit_be = 4'b0001`,
   `s2_emit_mask = pixel_index[0] ? 0x0000_00F0 : 0x0000_000F`.
3. New `raster_pixel_mask_q[31:0]` output on `gs_stub` carries
   the mask through to `vram_stub.write_mask`.
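
The three changes combine into a masked read-modify-write, sketched below as a behavioral model. Names are illustrative and the frame base offset is left out for brevity, so `byte_addr` is relative to the start of the framebuffer.

```python
def vram_write_masked(mem: bytearray, addr: int, data: int,
                      be: int, mask: int) -> None:
    """mem[i] <= (mem[i] & ~mask_i) | (data_i & mask_i) for each enabled byte."""
    for i in range(4):
        if (be >> i) & 1:
            m = (mask >> (8 * i)) & 0xFF
            d = (data >> (8 * i)) & 0xFF
            mem[addr + i] = (mem[addr + i] & ~m & 0xFF) | (d & m)


def emit_psmt4(mem: bytearray, pixel_index: int, abgr: int) -> None:
    """PSMT4 emit: low 4 bits of R go into the nibble chosen by pixel_index[0]."""
    idx = abgr & 0xF
    byte_addr = pixel_index >> 1            # 2 pixels per byte
    if pixel_index & 1:
        data, mask = idx << 4, 0x000000F0   # odd pixel: high nibble
    else:
        data, mask = idx, 0x0000000F        # even pixel: low nibble
    vram_write_masked(mem, byte_addr, data, 0b0001, mask)
```

Starting from a byte preloaded with 0xA5, writing index 7 at an odd `pixel_index` leaves 0x75, and writing index 9 at an even `pixel_index` leaves 0xA9; the untouched nibble survives in both cases.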

The Ch106 TB `tb_gs_raster_psmt4.sv` is intentionally
adversarial about preservation. VRAM is preloaded with `0xA5`
(high=A, low=5) at every byte the sprites will touch. Three
phases:

- **Phase A**: 4×2 SPRITE at (0,0)..(3,1), R=0x05 → idx=5. Both
  nibbles of each enclosing byte are written (8 emits across 4
  bytes); each byte ends at `0x55` and the four neighbouring
  preloaded bytes (2..3, 34..35) remain `0xA5`. This proves the
  back-to-back same-byte case (NBA chaining) and the neighbour-
  byte preservation in one go.
- **Phase B**: single-pixel SPRITE at (5, 2). x=5 odd → high
  nibble; pixel_index = 133, byte_addr = 66; idx=7. Preload
  `mem[66] = 0xA5`. Expected after raster: `mem[66] = 0x75` —
  high nibble updated from A to 7, low nibble stays 5. Proves
  isolated high-nibble RMW preserves the low nibble.
- **Phase C**: single-pixel SPRITE at (4, 3). x=4 even → low
  nibble; pixel_index = 196, byte_addr = 98; idx=9. Preload
  `mem[98] = 0xA5`. Expected after: `mem[98] = 0xA9` — low
  nibble updated from 5 to 9, high nibble stays A. Proves
  isolated low-nibble RMW preserves the high nibble.

Continuous observer asserts `psm_q == 6'h14`, `be_q == 4'b0001`,
and `mask_q ∈ {0x0F, 0xF0}` on every emit. Final aggregate
checks: 10 emits total, full-VRAM sweep finds NO byte equal to
0xAA / 0xBB / 0xCC (only R reaches the framebuffer at PSMT4).

DBX / DBY shift the VRAM origin: the pixel that appears at
displayed (DX, DY) corresponds to VRAM (DBX, DBY). Real PS2
drivers use this for double-buffered framebuffers (alternate
frames at different DBX/DBY) and offset display windows.

The following TBs lock these contracts:

- `tb_gs_scanout_basic.sv` — DBX=DBY=0, DISPLAY1 covers full
  active area, MAGH=MAGV=0 (1×): classic sprite-at-origin scanout.
- `tb_gs_scanout_dbx_dby.sv` — sprite at VRAM (4,2)..(7,5),
  DISPFB1.DBX=4/DBY=2, DISPLAY1 full active area, MAGH=MAGV=0:
  sprite shows at displayed (0..3, 0..3).
- `tb_gs_scanout_display_window.sv` — sprite at VRAM (0..3, 0..3),
  DBX=DBY=0, DISPLAY1 with DX=2/DY=1/DW=3/DH=3, MAGH=MAGV=0:
  sprite shows at displayed (2..5, 1..4); pixels outside the
  window are black even though pcrtc's raster passes through them.
- `tb_gs_scanout_magh_magv.sv` (Ch93) — sprite at VRAM (0..3, 0..3),
  DBX=DBY=0, DISPLAY1 with DX=4/DY=2/DW=7/DH=7, MAGH=1/MAGV=1
  (2×/2×): 4×4 VRAM sprite stretches to fill the 8×8 displayed
  window pixel-perfect; pixels outside the window are black.
- `tb_gs_scanout_psm16.sv` (Ch94) — 4×4 RGB5A1 sprite written
  directly to vram_stub at PSMCT16 byte stride, DISPFB1.PSM=0x02:
  5→8 bit-replicate decode produces the right (R8, G8, B8) at
  scanout. (No gs_stub instantiated; this TB exercises the PSM
  decode path in isolation.)
- `tb_gs_scanout_psmt8.sv` (Ch96) — 4×4 PSMT8 sprite of indices
  0x10..0x1F written directly to vram_stub at 1 byte/pixel
  stride. DISPFB1.PSM=0x13, DISPLAY1 with DX=4/DY=2/DW=7/DH=3
  AND MAGH=1 (2× horizontal). Asserts each scan-out displayed
  pixel reads back as grayscale R=G=B=expected index, proving
  byte stride + display window + horizontal magnification all
  work at 1 byte/pixel.
- `tb_gs_scanout_psmt8_clut.sv` (Ch97) — same 4×4 PSMT8 sprite,
  plus a programmed CLUT where `CLUT[i] = ABGR(0xFF, i+0x80, i+0x40, i)`.
  With `clut_enable=1` and `clut_csa=0`, asserts each scan-out
  pixel reads back as the CLUT entry for its index — PSMT8
  storage + CLUT lookup compose correctly into real RGB. Three
  phases: full-frame CSA=0, single-pixel CSA=1 (idx 0x00 →
  CLUT[0x10]), and CSA=1 wrap (idx 0xF8 → CLUT[0x08]).
- `tb_gs_tex0_clut.sv` (Ch98) — drives gs_stub's GIF reg# 0x06
  (TEX0_1) and asserts the latch + sub-field decoders match the
  encoded payload (CBP/CPSM/CSM/CSA/CLD bit ranges). Phase 2
  wires `pcrtc.clut_csa` from `gs_stub.tex0_1_csa_q` (instead
  of TB-side sideband) and verifies the CSA value flows from a
  GIF register write into the CLUT lookup math at scan-out.
- `tb_gs_clut_load.sv` (Ch99) — full TEX0.CLD-driven VRAM→CLUT
  load round trip. Stages 256 PSMCT32 entries in VRAM at
  `CBP*256` (using the new `vram_stub` second read port), drives
  TEX0_1 with `CBP=4, CPSM=PSMCT32, CSM=CSM2, CLD=1`, waits for
  `clut_loader_stub.load_busy` to fall, then runs PSMT8 scanout
  and asserts each in-sprite pixel reads back as the CLUT entry
  the loader copied — no TB-direct CLUT writes needed. Also
  carries a Ch99-audit negative phase: a TEX0 write with CSM=0
  (CSM1 swizzle, deferred) silently no-ops instead of laying
  down wrong linear bytes.
- `tb_gs_clut_load_ct16.sv` (Ch100) — CPSM=PSMCT16 variant of the
  Ch99 load TB. Stages 256 RGB5A1 entries (2 bytes each) in VRAM
  at `CBP*256`, drives TEX0_1 with `CPSM=2`. The loader now
  walks at 2-byte stride, unpacks RGB5A1 → PSMCT32 ABGR via 5→8
  bit-replicate, and writes to clut_stub. PSMT8 scanout produces
  the expanded RGB. Ch100-audit alpha coverage: per-entry `a1 = idx[0]`
  varies the alpha bit so both `{8{0}} = 0x00` and `{8{1}} = 0xFF`
  are exercised; a TB-side `clut_we` snoop captures every loader
  write so alpha can be asserted directly without going through
  the RGB-only scanout path.
- `tb_gs_clut_load_cld_modes.sv` (Ch101 + Ch102) — conditional
  CLD-mode policy. Phases walk through CLD ∈ {0, 1, 2, 3, 4, 5,
  6, 7} with varying CBP/CPSM/CSA, counting `loader_busy` rising
  edges to prove: CLD=0 never loads; CLD=1 always (full); CLD=2
  loads only when CBP changed; CLD=3 loads when CBP/CPSM/CSA
  any-changed (CBP, CSA, and CPSM arms each isolated); CLD=4
  always loads but only the 16-entry CSA window (Ch102 — write
  range correctness is locked by `tb_gs_clut_load_csa_window`);
  CLD ∈ {5, 6, 7} reserved no-ops.
- `tb_gs_clut_load_csa_window.sv` (Ch102) — CLD=4 write-range
  correctness. Phase 1 stages 256 distinct PSMCT32 entries in
  VRAM and runs CLD=1 to fill all 256 CLUT slots with pattern_a.
  Phase 2 stages 16 different entries at a new CBP, drives CLD=4
  with CSA=2 (window = idx 32..47), and asserts via a `clut_we`
  snoop that exactly 16 writes occurred AND the captured array
  contains: pattern_a(i) at i ∈ [0, 32) ∪ [48, 256), pattern_b(i-32)
  at i ∈ [32, 48). Proves 240 entries are preserved across the
  partial load. Audit-low extensions: Phase 3 covers the
  high-CSA wrap (CSA=16 → window-base wraps mod-256 to 0); Phase
  4 covers CT16 partial (CPSM=PSMCT16, 2-byte stride, RGB5A1
  unpack at the loader, window at idx 160..175).
- `tb_gs_scanout_psmt4_clut.sv` (Ch103) — PSMT4 scanout. Stages
  a 4×4 PSMT4 sprite (2 pixels/byte) and 16 CLUT entries.
  Phase 1 (`clut_enable=1`): asserts each pixel reads
  `CLUT[zero-ext(nibble) + CSA*16]`. Phase 2 (`clut_enable=0`):
  asserts the grayscale fallback replicates the 4-bit nibble
  across the 8-bit DAC value. Both phases verify byte-stride
  half-extraction (low/high nibble pick) at every active pixel.
  Audit-low Phase 3 locks PSMT4 + nonzero CSA (CSA=1, window
  16..31) end-to-end: TB-direct CLUT writes plant a 0xDEAD_BEEF
  sentinel at entries 0..15 and a per-index formula at 16..31,
  scanout asserts each pixel reads the formula and never the
  sentinel.
- `tb_gs_demo_psmt4_e2e.sv` (Ch107) — first end-to-end demo for
  the GS/PCRTC stack. **Scope is GS-side only**: the post-GIF
  register stream (per-reg A+D writes via `gs_stub.gif_reg_*`)
  plus privileged-block MMIO drive the pipeline; `gif_packed_stub`
  / GIFtag-PACKED is BYPASSED — feeding the same demo through
  the GIF front-end is a future chapter. Step 1 stages 16
  PSMCT32 palette entries in VRAM at `CBP*256` (modelled as a
  TB-direct write — DMA→GS image transfer is a future chapter,
  but the framebuffer itself is NOT TB-direct). Step 2 drives
  per-reg writes (PRIM/FRAME_1/RGBAQ/XYZ2) for four SPRITEs
  laying out a 4-quadrant 8×4 image (TL idx 0x5, TR idx 0x7,
  BL idx 0xA, BR idx 0xC) at FRAME_1.PSM=PSMT4 — all 32
  framebuffer pixels arrive via the Ch106 raster channel.
  Step 3 drives TEX0_1 with `CBP=palette, CPSM=PSMCT32,
  CSM=CSM2, CSA=0, CLD=4`; loader writes clut_stub[0..15].
  Step 4 brings up scanout via privileged-block writes to
  DISPFB1 (PSM=PSMT4) + DISPLAY1 + PMODE.EN1. Step 5 captures
  one full frame and asserts each pixel reads back as
  `CLUT[quadrant_idx]` (or `CLUT[0]` outside the 8×4 image
  since vram_stub zero-init means nibble=0). Aggregate asserts:
  32 PSMT4 emits, mask ∈ {0x0F, 0xF0} on every emit
  (channel-isolation locked architecturally — only R[3:0] ever
  reaches VRAM at PSMT4), loader fires exactly once, no
  raster_overflow / raster_degenerate. This TB is the first
  stack-wide proof that the GS-side post-GIF sequence —
  per-reg writes → indexed framebuffer → TEX0+CLD palette
  upload → PMODE/DISPFB/DISPLAY scanout — produces a coherent
  RGB frame end to end without TB sideband for the framebuffer
  pixels. Routing the same primitives through GIFtag/PACKED A+D
  via `gif_packed_stub` closes the last sideband and is the
  natural Ch108 anchor.
- `tb_gs_demo_psmt4_e2e_ee_full_bootlet.sv` (Ch114) — extends
  Ch113's EE-driven control plane to ALSO drive the DMAC
  channel-2 setup from the same MIPS instruction stream. The EE
  program now writes the 4 GS-priv registers + the 3 DMAC ch2
  registers (MADR / QWC / CHCR.start) via real `sw`
  instructions, then SYSCALLs to halt. Total: 7 EE-CPU MMIO
  writes (4 GS-priv + 3 DMAC) producing the same 16×8 captured
  frame. **Architectural note**: the program lives in
  `bios_rom_stub` at 0xBFC0_0000 / phys 0x1FC0_0000, NOT in
  RAM. A RAM-resident program would have its instruction
  fetches contend with the DMAC's RAM reads through
  `ee_ram_stub`'s single read port (the map's CPU>DMAC
  arbitration silently corrupts DMAC data). Putting the program
  in BIOS decouples the two paths so EE and DMAC run truly in
  parallel. This also matches real PS2: the EE boots out of
  BIOS ROM. PASS criteria add to Ch113's: **3 EE-driven DMAC
  writes** seen at the map's DMAC-ch2 decode; the existing
  `dma=(1,36,1)` event taxonomy still holds (those events are
  triggered by the EE's CHCR write, not a TB-direct write).
  The remaining TB-direct surfaces in the demo are now narrowly
  the GIF payload pre-stage in RAM (a real EE driver would
  itself stage this) and bios_rom_stub's program preload (which
  is the EE bootlet itself — not a runtime TB sideband).
- `tb_gs_demo_psmt4_e2e_ee_program.sv` (Ch113) — same demo as
  Ch112 but the 4 control-plane MMIO writes (PMODE / DISPFB1 /
  DISPLAY1 lo / DISPLAY1 hi) are no longer issued by the TB
  directly. Instead a 10-instruction MIPS program preloaded into
  ee_ram_stub at phys 0x800 (kseg0 0x80000800) is fetched and
  executed by `ee_core_stub` (parameterized with
  `PC_RESET=0x80000800`). The program is `LUI/ORI/SW × 4` plus a
  SYSCALL terminator; the SW instructions target `0x12000000+`
  and flow through `ee_memory_map_stub`'s GS-priv decode →
  `ee_gs_priv_bridge_stub` → `gs_stub.reg_wr_*`. Closes the
  very last TB-direct surface in the demo flow: every byte AND
  every register bit AND every control-plane decision now
  arrives from a real-shape source. PASS criteria add to
  Ch112's: `core_halt_o == 1` (asserts exactly once on the
  SYSCALL halt), `core_trap == 0`, EE program halts at
  `EE_PROG_VA + 36 = 0x80000824` (the SYSCALL slot). The TB
  still pre-stages the GIF payload and triggers the DMAC
  channel-2 transfer via TB-direct CHCR/MADR/QWC writes — a
  wider EE program that also drives DMAC bring-up is a
  separate future chapter.
- `tb_gs_demo_psmt4_e2e_eemap.sv` (Ch112) — same demo as Ch111
  but the bridge is no longer driven by the TB directly. Instead
  the TB drives `ee_memory_map_stub.ee_wr_*` with full 32-bit
  physical addresses targeting the new GS-privileged-MMIO window
  at 0x1200_0000-0x1200_FFFF (64 KiB; phys[28:16] == 13'h1200).
  The map decodes the window, peels the 16-bit offset, and hands
  the 32-bit half-write to `ee_gs_priv_bridge_stub`, which then
  fires gs_stub.reg_wr_* with the running 64-bit shadow value.
  Closes the last control-plane routing gap before a real EE
  instruction stream can drive the demo's bring-up: PMODE /
  DISPFB1 / DISPLAY1 are now reachable from `sw 0x1200_0080(...)`-
  shaped writes rather than from a TB-shaped EE-MMIO port.
  PASS criteria identical to Ch111: 4 EE-MMIO writes / 4 bridge
  fires, same 16×8 captured frame. **Architectural note**: this
  chapter ALSO adds 4 new output ports to `ee_memory_map_stub`
  (`ee_gs_priv_wr_en/addr/data/be`). The 56 existing TBs that use
  `ee_memory_map_stub` leave those outputs unconnected (named-port
  instantiation tolerates omitted outputs); only the new Ch112
  TB wires them through to the bridge.
- `tb_gs_demo_psmt4_e2e_eemmio.sv` (Ch111) — same demo as
  Ch110 but the privileged-block control writes (PMODE / DISPFB1
  / DISPLAY1) now arrive through `ee_gs_priv_bridge_stub` (a new
  RTL module) driven by EE-shaped 32-bit MMIO writes from the
  TB, instead of TB-direct gs_stub.reg_wr_* pulses. The bridge
  accumulates 32-bit half-writes per 8-byte slot and fires a
  64-bit gs_stub.reg_wr_* pulse on each EE half-write —
  single-half writes work for PMODE.EN1 and DISPFB1 (interesting
  bits in the low 32), and a pair of writes (lo+hi to
  consecutive 4-byte addresses) handles DISPLAY1 whose DW/DH
  live in the high 32. **Bridge contract**: full-word writes
  only — `ee_wr_be` must be `4'b1111`; sub-word (per-byte)
  merging into the 64-bit shadow is intentionally out of scope
  and a `$error` fires on any narrower be (control-plane GS
  registers are always written as full 32-bit `sw` halves of an
  `sd`). **Scope precision**: this chapter closes the TB-direct
  `gs_stub.reg_wr_*` surface — i.e., the privileged-MMIO sink at
  the GS itself. The bridge is instantiated by the TB directly;
  it is NOT yet wired into `ee_memory_map_stub`, so the full
  EE-CPU / memory-map MMIO path (a real EE instruction stream
  reaching 0x12000000+ via `sw`) is a separate future chapter.
  PASS criteria add to Ch110's: **4 EE-MMIO writes** (1 PMODE +
  1 DISPFB1 + 2 DISPLAY1) and **4 bridge fires** producing the
  same 16×8 captured frame as Ch110.
- `tb_gs_demo_psmct32_swizzle_trxdir_e2e.sv` (Ch124) — companion
  to Ch123: same EE-bootlet → DMAC → GIF data plane and same all-
  three-gates-on instantiation, but the framebuffer is filled by
  a TRXDIR/IMAGE upload through `gif_image_xfer_stub` instead of
  by raster. The Ch121 image-xfer write-side swizzle gate becomes
  LOAD-BEARING inside the demo flow — every byte the GS produces
  comes out of the image-xfer engine at canonical PSMCT32
  swizzled addresses, and the raster path is dormant. Payload:
  U1 (PACKED, NREG=4: BITBLTBUF{DBP=0, DBW=1, DPSM=PSMCT32} /
  TRXPOS{DSAX=DSAY=0} / TRXREG{RRW=16, RRH=8} / TRXDIR{XDIR=0})
  + U2 (IMAGE, NLOOP=32: 32 IMAGE qwords carrying the 128 PSMCT32
  pixels of the same four-quadrant pattern Ch123 used). DMAC QWC
  = 38. Verification mirrors Ch123: (1) full 16×8 scanout frame
  capture; (2) per-pixel byte readback at the canonical swizzled
  address via vram_stub's 2nd read port; (3) strict linear-vs-
  swizzled separator at byte 1024 stays 0. Aggregate counts:
  `dma=(1,38,1) ee_dmac_wr=3 giftags=2 ad_writes=4
  xfer_writes=128 ee_priv_wr=4 bridge_fires=4 core_halt=1
  emits=0 frame=16x8`. Ch123 + Ch124 together exercise BOTH
  PSMCT32 write-side paths (raster Ch122 + image-xfer Ch121)
  end-to-end through the same driver-shaped flow with the
  same swizzled-scanout (Ch120) read side.
- `tb_gs_demo_psmct32_swizzle_e2e.sv` (Ch123) — full driver-shaped
  end-to-end demo with ALL THREE PSMCT32 swizzle gates flipped
  on simultaneously: `gs_stub#(PSMCT32_SWIZZLE=1)` (Ch122 raster),
  `gif_image_xfer_stub#(PSMCT32_SWIZZLE=1)` (Ch121 — instantiated
  but unused in this demo), `gs_pcrtc_stub#(PSMCT32_SWIZZLE=1)`
  (Ch120 read). The data plane is the same DMAC + GIF + EE-bootlet
  shape Ch107..Ch114 demos use: a BIOS-resident EE program
  (PC_RESET=0xBFC0_0000) configures GS-priv (DISPFB1, DISPLAY1
  lo/hi, PMODE.EN1) via `sw` instructions through
  `ee_memory_map_stub` → `ee_gs_priv_bridge_stub` →
  `gs_stub.reg_wr_*`, then kicks DMAC ch2 (MADR / QWC / CHCR)
  via `sw` to the DMAC reg window, then `SYSCALL` halts. DMAC
  delivers a 24-qword payload from `ee_ram_stub` to
  `gif_packed_stub`, which dispatches 4 SPRITE PACKED packets
  (1 GIFtag + 5 A+D each — PRIM, FRAME_1=PSMCT32, RGBAQ, XYZ2,
  XYZ2). The 4 sprites tile the 16×8 active area into 4 quadrants
  with unique RGB triples. With the raster gate on, all 128
  per-pixel store addresses go through `gs_swizzle_psmct32_stub`;
  with the pcrtc gate on, scanout reads from those same swizzled
  addresses. **Two-phase verification**: (1) **scanout** — every
  (x, y) in 16×8 captures its sprite's RGB; (2) **byte readback
  via vram_stub's 2nd read port** — for every (x, y), the 32-bit
  word at `ref_addr_psmct32(0, 1, x, y)` equals the sprite's
  `{A=0xFF, B, G, R}` PSMCT32 word. Strict linear-vs-swizzled
  separator at byte 1024 (where the linear formula's y=4 row
  would land at stride=256) stays 0 — the swizzled write set
  for the 16×8 image stays in blocks (0,0) and (1,0) of page 0
  (bytes 0..511), so a fall-through to linear would have placed
  sprite-2's color at byte 1024. Aggregate counts:
  `dma=(1,24,1) ee_dmac_wr=3 giftags=4 ad_writes=20 xfer_writes=0
  ee_priv_wr=4 bridge_fires=4 core_halt=1 emits=128 frame=16x8`.
  This is the FIRST end-to-end demo where every PSMCT32 byte
  the GS produces lives at the canonical PCSX2 swizzled address
  AND the scanout reads from it — byte-accurate to real PS2
  VRAM layout, end-to-end through the driver-shaped flow.
- `tb_gs_raster_swizzle_psmct32.sv` (Ch122) — focused contract
  for the new `PSMCT32_SWIZZLE` parameter on `gs_stub`. When the
  parameter is set to 1 AND the active raster PSM is PSMCT32
  (`ras_psm == 6'h00`), the per-pixel raster emit address is
  routed through the Ch119 `gs_swizzle_psmct32_stub` (FBP=ras_fbp,
  FBW=ras_fbw, x=s2_x_q[11:0], y=s2_y_q[11:0]) and its output is
  the absolute byte address (FBP*2048 already baked in).
  At Ch122 only, PSMCT16/PSMT8/PSMT4 raster emits always took
  the linear path. Ch128 later closed the PSMCT16 raster gate,
  Ch134 closed the PSMT8 raster gate, and Ch140 closed the PSMT4
  raster gate (each with its own per-PSM parameter on this same
  `gs_stub`). Default 0 keeps every existing PSMCT32 raster TB
  unchanged.
  **Three-phase verification**: (1) **origin SPRITE** — drive a
  single 16×4 SPRITE at FRAME_1{FBP=0, FBW=1, PSMCT32} with RGBAQ
  R=0x55/G=0xAA/B=0xCC/A=0x77, expect 64 emits, per-pixel byte
  readback via vram_stub's 2nd read port at swizzled addresses
  confirms each pixel lands where the swizzle says. Strict
  linear-vs-swizzled separators at bytes 512 and 768 (the linear
  formula's y=2 / y=3 row starts) stay 0 — proves the gate is
  live. (2) **scanout agreement** — enable the Ch120 swizzled-
  pcrtc path on the same VRAM contents, capture the full 16×4
  frame, assert each visible pixel reads back the SPRITE's RGB.
  Both gs_stub (Ch122 raster) and gs_pcrtc_stub (Ch120 scanout)
  instantiate the same swizzle module; a successful capture
  proves the two integrations agree at byte level — what raster
  wrote at swizzled addresses comes out on r/g/b at the same
  (x, y). (3) **non-origin SPRITE** — re-arm the raster with
  FRAME_1{FBP=4, FBW=2, PSMCT32} and an 8×2 SPRITE at
  (60, 4)..(67, 5) crossing the page-x boundary at x=64 (so
  page_index actually changes mid-row). Pins three contracts
  the origin transfer can't distinguish from a buggy
  implementation: (a) `ras_fbp` reaches the swizzle's `fbp` input
  (FBP=0 in Phase 1 would have masked a tied-zero regression),
  (b) `ras_fbw` reaches the swizzle's `fbw` input (FBW=1 would
  have masked a tied-one regression), (c) the swizzle gets the
  FULL absolute pixel coords (s2_x_q, s2_y_q) rather than
  bbox-local coords (Phase 1's sprite started at (0,0) so
  absolute and local were equal there). Strict linear-vs-
  swizzled separator at byte 10480 (where the linear formula
  would land Phase-3's first pixel) stays 0. Total emit count
  after all phases: 64 + 16 = 80. With Ch120 (read), Ch121
  (TRXDIR upload), and Ch122 (raster emit) all live, the three
  major PSMCT32 paths are byte-consistent end-to-end.
- `tb_gs_image_xfer_swizzle_psmct32.sv` (Ch121) — focused contract
  for the new `PSMCT32_SWIZZLE` parameter on `gif_image_xfer_stub`.
  When the parameter is set to 1 AND the upload's PSM is PSMCT32,
  per-pixel VRAM byte addresses are routed through the Ch119
  `gs_swizzle_psmct32_stub` (FBP=0, FBW=DBW, x=DSAX+cur_x, y=DSAY+
  cur_y) and `dest_base_q (= DBP*256)` is added back to anchor at
  the upload's DBP base. At Ch121, PSMCT16/PSMT8/PSMT4 uploads
  always took the linear path; Ch127, Ch133, and Ch139 later added
  their own per-PSM gates. Default 0 keeps every existing image-xfer TB
  unchanged. **Three-phase verification**: (1) **origin transfer**
  — TRXDIR upload of a 16×4 PSMCT32 image at DBP=DSAX=DSAY=0,
  DBW=1, RRW=16, RRH=4 → 64 pixels, 16 IMAGE qwords. After the
  upload completes, the TB reads VRAM via vram_stub's 2nd read
  port at the SWIZZLED byte address (TB-side `ref_addr()` mirrors
  the swizzle module) and asserts each pixel landed where the
  swizzle says. Strict linear-vs-swizzled separator: bytes 512
  and 768 (where linear y=2 and y=3 rows would land) stay 0 under
  swizzled, since the 16×4 image only fills blocks (0,0) and (1,0)
  which together cover bytes [0..127] ∪ [256..383]. (2) **scanout
  agreement** — enable the Ch120 swizzled-pcrtc path on the same
  VRAM contents, capture the full 16×4 frame, assert each
  scanned-out pixel matches its uploaded color. Both upload and
  scanout instantiate the same `gs_swizzle_psmct32_stub`, so a
  successful capture proves the two integrations agree at byte
  level — what was written by TRXDIR comes out on r/g/b at the
  same (x, y). (3) **non-origin transfer** — re-arm with NONZERO
  DBP, DSAX, and DSAY (DBP=8, DSAX=4, DSAY=2, RRW=8, RRH=4) and
  verify each uploaded pixel lands at `DBP*256 + swizzle(0, DBW,
  DSAX+x_local, DSAY+y_local)`. Phase 3 pins TWO contracts the
  origin transfer can't distinguish from a buggy implementation:
  (a) `dest_base_q (= DBP*256)` is correctly ADDED ON TOP of the
  swizzle output (with DBP=0 a missing-add regression would still
  pass), and (b) the swizzle is fed the FULL effective coordinates
  (with DSAX=DSAY=0 a "feeds only cur_x/cur_y" regression would
  still pass). Strict linear-vs-swizzled separator at byte 3088
  (where the linear formula's y=2 row of the P3 image would
  land) stays 0 under swizzled. NOTE: at Ch121 ONLY, gs_stub
  raster writes still used linear addressing; Ch122 closed that
  gate.
- `tb_gs_scanout_swizzle_psmct32.sv` (Ch120) — focused contract
  for the new `PSMCT32_SWIZZLE` parameter on `gs_pcrtc_stub`. When
  the parameter is set to 1 AND the active PSM is PSMCT32, PCRTC
  reads VRAM at swizzled addresses (via the Ch119 swizzle module
  instantiated inside pcrtc) instead of the legacy linear formula.
  Other PSMs (CT16/T8/T4) and `PSMCT32_SWIZZLE=0` keep the original
  linear path unchanged. Topology: TB drives `vram_stub.write_*`
  directly with each pixel's color preloaded at the swizzled byte
  address (TB-side `ref_addr()` mirrors the DUT swizzle math), then
  pcrtc with `PSMCT32_SWIZZLE=1` scans out the frame and the TB
  asserts each captured pixel matches the preloaded color. Image
  is 16×4 PSMCT32 (covers blocks (0,0) AND (1,0) horizontally) at
  FBP=0/FBW=1; pcrtc active area is 8×4 (block (0,0) entirely),
  but the swizzle vs. linear distinction shows up at any y>0
  (linear y=1 → byte 64; swizzled byte 32) so even the in-window
  region is a strict separator. Per-pixel color is unique
  (`{A=0xFF, B=y<<4, G=x<<4, R=0x10|(y*8+x)}`) so any wrong-
  address commit surfaces immediately. NOTE: at Ch120 ONLY,
  gs_stub raster writes and gif_image_xfer_stub uploads still
  used linear addressing — Ch120 was read-side only. Ch121
  (image-xfer) and Ch122 (raster) closed the write-side gates,
  and Ch123 demonstrates all three running together end-to-end.
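
The per-pixel color pattern used by the Ch120 TB can be checked for uniqueness offline. A small sketch, assuming the `{A, B, G, R}` concatenation places R in the low byte of the 32-bit word (the function name `pixel_color` is illustrative only):

```python
def pixel_color(x, y):
    """Pack {A, B, G, R} into one 32-bit word, R in the low byte (assumed order)."""
    a = 0xFF
    b = (y << 4) & 0xFF
    g = (x << 4) & 0xFF
    r = (0x10 | (y * 8 + x)) & 0xFF
    return (a << 24) | (b << 16) | (g << 8) | r

WIDTH, HEIGHT = 16, 4
colors = {(x, y): pixel_color(x, y) for y in range(HEIGHT) for x in range(WIDTH)}

# Every pixel must carry a distinct word, otherwise a swapped address
# could go unnoticed by the scanout comparison.
unique_count = len(set(colors.values()))
print(unique_count)
```

Because G encodes x and B encodes y, the words stay distinct even where the R expression alone repeats.
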
- `tb_gs_demo_psmt8_swizzle_trxdir_e2e.sv` (Ch136) — companion to
  Ch135: same EE-bootlet → DMAC → GIF data plane and same all-
  three-gates-on instantiation, but the framebuffer is filled by
  a TRXDIR/IMAGE upload through `gif_image_xfer_stub` instead of
  by raster. The Ch133 PSMT8 image-xfer write-side swizzle gate
  becomes LOAD-BEARING inside the demo flow — every byte the GS
  produces comes out of the image-xfer engine at canonical PSMT8
  swizzled addresses, and the raster path is dormant. Mirrors
  Ch124 PSMCT32 + Ch130 PSMCT16 TRXDIR demos for the third PSM.
  Payload: U1 (PACKED, NREG=4: BITBLTBUF{DBP=0, DBW=2,
  DPSM=PSMT8} / TRXPOS{DSAX=DSAY=0} / TRXREG{RRW=16, RRH=8} /
  TRXDIR{XDIR=0}) + U2 (IMAGE, NLOOP=8: 8 IMAGE qwords each
  carrying 16 PSMT8 bytes for the 16×8 image, row-major). DBW=2
  is the minimum even DBW for PSMT8. DMAC QWC=14. Per-quadrant
  byte indices Q0=0xA0/Q1=0x40/Q2=0xC0/Q3=0x60 reused from Ch135
  so the verify side is unchanged. New `trxdir_arms_seen` counter
  asserts =1 (single TRX setup) + xfer-side per-emit observer
  asserts every xfer_we pulse fires with be=4'b0001, mask=
  0xFFFFFFFF (PSMT8 single-byte commit shape). Verification
  mirrors Ch135: (1) full 16×8 scanout frame capture; (2) per-
  pixel BYTE readback at the canonical swizzled byte address
  (with `addr[1:0]` selecting the right byte from the 32-bit
  word) via vram_stub's 2nd port; (3) strict separators at bytes
  128 and 256 stay 0. Aggregate counts: `dma=(1,14,1)
  ee_dmac_wr=3 giftags=2 ad_writes=4 trxdir_arms=1
  xfer_writes=128 ee_priv_wr=4 bridge_fires=4 core_halt=1
  emits=0 frame=16x8`. **First-attempt PASS** errors=0. Ch135 +
  Ch136 together close the PSMT8 byte-accuracy milestone end-
  to-end through the full driver-shaped flow — same Ch123+Ch124
  (PSMCT32) and Ch129+Ch130 (PSMCT16) shape.
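
The DMAC QWC values quoted for the TRXDIR demos can be reproduced from the packet shapes. A rough sketch, assuming one GIFtag qword per packet, NLOOP=1 for the PACKED setup packet (consistent with `ad_writes=4`), and 128-bit IMAGE qwords; the helper names are hypothetical:

```python
def packed_qwc(nloop, nreg):
    """One GIFtag qword plus NLOOP*NREG register qwords (PACKED mode)."""
    return 1 + nloop * nreg

def image_qwc(width, height, bits_per_pixel):
    """One GIFtag qword plus the IMAGE data qwords (128 bits each)."""
    total_bits = width * height * bits_per_pixel
    assert total_bits % 128 == 0, "image must fill whole qwords"
    return 1 + total_bits // 128

setup = packed_qwc(nloop=1, nreg=4)   # BITBLTBUF / TRXPOS / TRXREG / TRXDIR

qwc = {
    "PSMCT16 (Ch130)": setup + image_qwc(16, 8, 16),
    "PSMT8 (Ch136)": setup + image_qwc(16, 8, 8),
    "PSMT4 (Ch142)": setup + image_qwc(16, 8, 4),
}
print(qwc)
```

The same accounting gives the raster demos' 24-qword payload as four packets of one GIFtag plus five A+D writes each.
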
- `tb_gs_demo_psmt4_swizzle_trxdir_e2e.sv` (Ch142) — companion to
  Ch141 (raster-driven PSMT4 e2e): same EE-bootlet → DMAC → GIF
  data plane and same all-three-gates-on instantiation, but the
  framebuffer is filled by a TRXDIR/IMAGE upload through
  `gif_image_xfer_stub` instead of by raster. The Ch139 PSMT4
  image-xfer write-side swizzle gate becomes LOAD-BEARING inside
  the demo flow — every nibble the GS produces comes out of the
  image-xfer engine at canonical PSMT4 swizzled (addr,
  nibble_hi) slots, and the raster path is dormant. Mirrors
  Ch124's PSMCT32 TRXDIR demo, Ch130's PSMCT16 TRXDIR demo, and
  Ch136's PSMT8 TRXDIR demo for the fourth (and last) common
  GS PSM. Cloned from Ch136 and surgically retargeted to
  PSMT4. Payload: U1 (PACKED, NREG=4: BITBLTBUF{DBP=0, DBW=2,
  DPSM=PSMT4} / TRXPOS{DSAX=DSAY=0} / TRXREG{RRW=16, RRH=8} /
  TRXDIR{XDIR=0}) + U2 (IMAGE, NLOOP=4 EOP=1: 4 IMAGE qwords
  carrying 32 PSMT4 nibbles each — at RRW=16 each qword holds
  2 rows: lanes 0..15 = row 2*qi, lanes 16..31 = row 2*qi+1,
  matching Ch139's focused-TB packing). Total QWC = 10 (5+5).
  EE-bootlet DISPFB1 immediate identical to Ch141 (LUI 0x000A;
  ORI 0x0400 → PSM=PSMT4). Per-quadrant nibbles match Ch141
  verbatim (Q0=0xA → 0xAA, Q1=0x4 → 0x44, Q2=0xC → 0xCC,
  Q3=0x6 → 0x66) so the verify side reuses Ch141's pattern
  unchanged. Verification mirrors Ch141: (1) full 16×8 scanout
  frame capture via Ch138 swizzled-pcrtc; (2) per-pixel NIBBLE
  readback at the canonical swizzled (addr, nibble_hi) slot
  via vram_stub's 2nd port (addr[1:0]-keyed byte selection
  then nibble_hi-keyed nibble selection); (3) strict linear-
  vs-swizzled separator at byte 128 stays 0 (per-byte check,
  not full word: a neighbor byte may legitimately be touched);
  (4) per-emit observer asserts every image-xfer write is
  `be=4'b0001` / `mask ∈ {0x0F, 0xF0}` (PSMT4 nibble RMW
  shape) and the `trxdir_wr_q` arming pulse fires exactly
  once. Aggregate counts: `dma=(1,10,1) ee_dmac_wr=3
  giftags=2 ad_writes=4 trxdir_arms=1 xfer_writes=128
  ee_priv_wr=4 bridge_fires=4 core_halt=1 emits=0 frame=16x8`.
  Ch141 + Ch142 together exercise BOTH PSMT4 write-side paths
  (raster Ch140 + image-xfer Ch139) end-to-end through the
  same driver-shaped flow with the same swizzled-scanout
  (Ch138) read side — bringing PSMT4 to full parity with the
  PSMCT32, PSMCT16, and PSMT8 e2e coverage from Ch123+Ch124,
  Ch129+Ch130, and Ch135+Ch136. **Architectural milestone**:
  this is the first state of the project where ALL FOUR
  common GS PSMs (CT32 + CT16 + T8 + T4) have BOTH a raster-
  driven AND a TRXDIR-driven driver-shaped end-to-end byte-
  accuracy demo — closing the **four-PSM × three-path × dual-
  driver-shape e2e foundation** (8 demos total). The bug-fix
  iteration: TB-side `ref_col_idx4` was first written with a
  7-bit case key `{yb[2:0], xb[3:0]}` covering yb=0..7 in
  indices 0..127, but the values for yb=4..7 were miscopied
  from Ch139's yb=12..15 row (Ch139 only exercises yb=0..3
  and yb=12..15). Phase 2 readback failed for all 64 pixels
  in y=4..7 with `got=0 expected=0xC/0x6` — the engine wrote
  the right nibbles to the right addresses (scanout passed),
  but the TB's ref looked at the wrong slot. Fix: switched to
  Ch141's 9-bit case key `{yb[3:0], xb[4:0]}` and used
  Ch141's verified yb=0..7 values verbatim. **PASS** on the first
  run after the table fix.
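
The Ch142 IMAGE packing (two 16-pixel rows per qword) can be sketched as follows. This is an illustration under stated assumptions: lane `k` is taken to occupy bits `[4k+3:4k]` of the qword, and the quadrant layout (Q0/Q1 on top, Q2/Q3 below) is inferred from the failure report for y=4..7; neither is confirmed by the RTL here.

```python
QUADRANT_NIBBLE = {0: 0xA, 1: 0x4, 2: 0xC, 3: 0x6}   # Q0..Q3 from the text
WIDTH, HEIGHT = 16, 8

def nibble_at(x, y):
    """Quadrant layout is an assumption: Q0/Q1 on top, Q2/Q3 below."""
    quadrant = (2 if y >= HEIGHT // 2 else 0) + (1 if x >= WIDTH // 2 else 0)
    return QUADRANT_NIBBLE[quadrant]

def pack_image_qwords():
    """Pack 32 nibbles per 128-bit qword; lane k sits at bits [4k+3:4k] (assumed)."""
    qwords = []
    for qi in range(WIDTH * HEIGHT // 32):
        value = 0
        for lane in range(32):
            # lanes 0..15 carry row 2*qi, lanes 16..31 carry row 2*qi+1
            x, y = lane % WIDTH, 2 * qi + lane // WIDTH
            value |= nibble_at(x, y) << (4 * lane)
        qwords.append(value)
    return qwords

qwords = pack_image_qwords()
print([hex(q) for q in qwords])
```
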
- `tb_gs_demo_psmt4_swizzle_e2e.sv` (Ch141) — first driver-shaped
  end-to-end PSMT4 demo with all three PSMT4 swizzle gates
  (Ch138 read-side pcrtc, Ch139 image-xfer write-side, Ch140
  raster write-side) parameter-set to 1 simultaneously, but with
  the demo flow exercising only the raster (Ch140) + scanout
  (Ch138) paths as load-bearing. The Ch139 image-xfer gate is
  smoke-only here (parameter is set but `xfer_writes_seen == 0`
  is asserted, since no TRXDIR/IMAGE packet is delivered in the
  raster-driven payload); the Ch139 load-bearing variant is
  the Ch142 TRXDIR-driven PSMT4 e2e (mirrors Ch124/Ch130/Ch136).
  PSMT4 counterpart of Ch123's PSMCT32 /
  Ch129's PSMCT16 / Ch135's PSMT8 e2e demos. Same EE-bootlet →
  DMAC → GIF data plane: BIOS-resident EE program configures
  GS-priv (DISPFB1 PSMT4 with FBW=2, DISPLAY1, PMODE) via `sw`
  instructions → kicks DMAC ch2 → SYSCALL halts. DMAC delivers
  a 24-qword payload (4 SPRITE PACKED packets) through
  `gif_packed_stub` into `gs_stub` raster. The 4 sprites tile
  the 16×8 active area into 4 quadrants with per-quadrant unique
  RGBAQ.R[3:0] nibbles (Q0=0xA → 0xAA, Q1=0x4 → 0x44,
  Q2=0xC → 0xCC, Q3=0x6 → 0x66). PSMT4 raster (Ch106) takes
  RGBAQ.R[3:0] as the nibble that hits VRAM via the existing
  Ch106 nibble RMW machinery (write_be=4'b0001 + write_mask
  0x0F or 0xF0); Ch140 keys the high/low nibble selector off the
  swizzle's `nibble_hi` output instead of `s2_pixel_index[0]`.
  PCRTC's Ch103 PSMT4 grayscale fallback (clut_enable=0)
  surfaces the nibble as r=g=b={n, n} at scanout, so each
  captured pixel IS the nibble we wrote (no CLUT setup needed
  for this demo; a CLUT-driven Ch141 variant is a future
  chapter). With the raster gate on, all 128 per-pixel nibble
  stores go through `gs_swizzle_psmt4_stub`; with the pcrtc
  gate on, scanout reads from those same swizzled (addr,
  nibble_hi) slots. **Two-phase verification**: (1) full-frame
  scanout asserts each (x, y) reads back its quadrant's nibble
  as PSMT4 grayscale r=g=b={n, n}; (2) per-pixel NIBBLE readback
  at the canonical swizzled address (with `addr[1:0]` selecting
  the right byte from the 32-bit word, then `nibble_hi`
  selecting which nibble of that byte) via vram_stub's 2nd
  port — the 16×8 PSMT4 image lives entirely in the upper-left
  of block (0,0) of page 0 (PSMT4 block = 32×16 px) and the
  within-block columnTable4 yb=0..7 / xb=0..15 exercises
  nibble_idx values within [0..255], i.e. block bytes
  [0..127]. Strict linear-vs-swizzled separator at byte 128
  (linear y=2 row start at PSMT4 stride=64 with FBW=2) stays
  0, just outside block (0,0)'s touched range. Per-emit
  observer locks PSM=0x14, be=4'b0001,
  mask ∈ {0x0F, 0xF0}. Aggregate counts: `dma=(1,24,1)
  ee_dmac_wr=3 giftags=4 ad_writes=20 xfer_writes=0
  ee_priv_wr=4 bridge_fires=4 core_halt=1 emits=128 frame=16x8`.
  **First-attempt PASS** errors=0. Together with Ch123 (PSMCT32
  e2e), Ch129 (PSMCT16 e2e), and Ch135 (PSMT8 e2e), this is the
  first state of the project where the full driver-shaped flow
  has end-to-end byte-accuracy demos for ALL FOUR common GS
  PSMs (CT32 + CT16 + T8 + T4) under software-shaped raster
  traffic. The TRXDIR-driven PSMT4 companion landed at Ch142
  (mirror of Ch124/Ch130/Ch136 making Ch139 load-bearing), so
  Ch141 + Ch142 together close the PSMT4 byte-accuracy
  milestone end-to-end through both driver shapes — bringing
  PSMT4 to full parity with CT32/CT16/T8.
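
The PSMT4 nibble read-modify-write and the CLUT-disabled grayscale fallback described for Ch141 reduce to a few bit operations. A minimal sketch that models only the targeted byte (the function names are hypothetical, and the byte-lane routing inside `vram_stub` is deliberately left out):

```python
def nibble_write(old_byte, nibble, nibble_hi):
    """Merge one 4-bit pixel into a byte using the 0x0F / 0xF0 write mask."""
    mask = 0xF0 if nibble_hi else 0x0F
    data = (nibble & 0xF) << (4 if nibble_hi else 0)
    return (old_byte & ~mask & 0xFF) | (data & mask)

def psmt4_gray(nibble):
    """CLUT-disabled fallback: replicate the nibble as {n, n} on r, g, b."""
    value = ((nibble & 0xF) << 4) | (nibble & 0xF)
    return (value, value, value)

# Two neighboring pixels of quadrant Q0 land in the same byte.
byte = nibble_write(0x00, 0xA, nibble_hi=False)
byte = nibble_write(byte, 0xA, nibble_hi=True)
print(hex(byte), psmt4_gray(0xA))
```

This is why the text writes the quadrant values as `0xA → 0xAA`: the stored byte and the scanned-out channel value coincide when both nibbles of a byte belong to the same quadrant.
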
- `tb_gs_demo_psmt8_swizzle_e2e.sv` (Ch135) — first driver-shaped
  end-to-end PSMT8 demo with all three PSMT8 swizzle gates
  (Ch132 read-side pcrtc, Ch133 image-xfer write-side, Ch134
  raster write-side) parameter-set to 1 simultaneously, but with
  the demo flow exercising only the raster (Ch134) + scanout
  (Ch132) paths as load-bearing. The Ch133 image-xfer gate is
  smoke-only here (parameter is set but `xfer_writes_seen == 0`
  is asserted, since no TRXDIR/IMAGE packet is delivered in the
  raster-driven payload); the Ch133 load-bearing variant is the
  Ch136 TRXDIR-driven PSMT8 e2e (mirror of Ch124/Ch130). PSMT8
  counterpart of Ch123's PSMCT32 / Ch129's PSMCT16 e2e demos.
  Same EE-bootlet → DMAC → GIF data plane:
  BIOS-resident EE program configures GS-priv (DISPFB1 PSMT8
  with FBW=2, DISPLAY1, PMODE) via `sw` instructions → kicks
  DMAC ch2 → SYSCALL halts. DMAC delivers a 24-qword payload
  (4 SPRITE PACKED packets) through `gif_packed_stub` into
  `gs_stub` raster. The 4 sprites tile the 16×8 active area into
  4 quadrants with per-quadrant unique RGBAQ.R values
  (Q0=0xA0, Q1=0x40, Q2=0xC0, Q3=0x60). PSMT8 raster (Ch105)
  takes the natural ABGR's R channel as the byte index that
  hits VRAM; PCRTC's Ch96 grayscale fallback (clut_enable=0)
  surfaces the byte as R=G=B at scanout, so each captured pixel
  IS the byte we wrote (no CLUT setup needed for this demo;
  a CLUT-driven Ch135 variant is a future chapter). With the
  raster gate on, all 128 per-pixel byte stores go through
  `gs_swizzle_psmt8_stub`; with the pcrtc gate on, scanout
  reads from those same swizzled addresses. **Two-phase
  verification**: (1) full-frame scanout asserts each (x, y)
  reads back its quadrant's byte as PSMT8 grayscale R=G=B; (2)
  per-pixel BYTE readback at the canonical swizzled address
  (with `addr[1:0]` selecting the right byte from the 32-bit
  word) via vram_stub's 2nd port — the 16×8 PSMT8 image lives
  entirely in the upper half of block (0,0) of page 0 (PSMT8
  block = 16×16 px) and the within-block columnTable8 yb=0..7
  exercises byte values [0..127]. Strict linear-vs-swizzled
  separators at bytes 128 (linear y=1 row start at PSMT8
  stride=128 with FBW=2) and 256 (linear y=2) stay 0 — both
  outside block (0,0)'s touched range. Aggregate counts:
  `dma=(1,24,1) ee_dmac_wr=3 giftags=4 ad_writes=20
  xfer_writes=0 ee_priv_wr=4 bridge_fires=4 core_halt=1
  emits=128 frame=16x8`. Together with Ch123 (PSMCT32 e2e) and
  Ch129 (PSMCT16 e2e), this was the first state of the project
  where the full driver-shaped flow had end-to-end byte-accuracy
  demos for the CT32/CT16/T8 trio under software-shaped traffic.
  PSMT4 was the natural follow-on and landed at Ch141 (raster-
  driven, mirror of this demo) + Ch142 (TRXDIR-driven, mirror
  of Ch136), closing the four-PSM × dual-driver-shape e2e
  matrix.
- `tb_gs_demo_psmct16_swizzle_trxdir_e2e.sv` (Ch130) — companion
  to Ch129: same EE-bootlet → DMAC → GIF data plane and same all-
  three-gates-on instantiation, but the framebuffer is filled by
  a TRXDIR/IMAGE upload through `gif_image_xfer_stub` instead of
  by raster. The Ch127 image-xfer write-side swizzle gate becomes
  LOAD-BEARING inside the demo flow — every byte the GS produces
  comes out of the image-xfer engine at canonical PSMCT16
  swizzled addresses, and the raster path is dormant. Payload:
  U1 (PACKED, NREG=4: BITBLTBUF{DBP=0, DBW=1, DPSM=PSMCT16} /
  TRXPOS{DSAX=DSAY=0} / TRXREG{RRW=16, RRH=8} / TRXDIR{XDIR=0})
  + U2 (IMAGE, NLOOP=16: 16 IMAGE qwords carrying the 128 PSMCT16
  halfwords of the same four-quadrant pattern Ch129 used). DMAC
  QWC = 22. Verification mirrors Ch129: (1) full 16×8 scanout
  frame capture; (2) per-pixel halfword readback at the canonical
  swizzled byte address (with `addr[1]` selecting the right 16-bit
  slot) via vram_stub's 2nd read port; (3) strict linear-vs-
  swizzled separators at bytes 256 and 384 stay 0; (4) per-emit
  observer asserts every image-xfer write is `be=4'b0011` /
  `mask=0xFFFF_FFFF` (low halfword) and the `trxdir_wr_q` arming
  pulse fires exactly once. Aggregate counts: `dma=(1,22,1)
  ee_dmac_wr=3 giftags=2 ad_writes=4 trxdir_arms=1
  xfer_writes=128 ee_priv_wr=4 bridge_fires=4 core_halt=1
  emits=0 frame=16x8`. Ch129 + Ch130 together exercise BOTH
  PSMCT16 write-side paths (raster Ch128 + image-xfer Ch127)
  end-to-end through the same driver-shaped flow with the same
  swizzled-scanout (Ch126) read side — bringing PSMCT16 to
  full parity with the PSMCT32 e2e coverage from Ch123 + Ch124.
- `tb_gs_demo_psmct16_swizzle_e2e.sv` (Ch129) — full driver-shaped
  end-to-end demo with all three PSMCT16 swizzle gates
  (Ch126 read-side pcrtc, Ch127 image-xfer write-side, Ch128
  raster write-side) parameter-set to 1 simultaneously, but with
  the demo flow exercising only the raster (Ch128) + scanout
  (Ch126) paths as load-bearing. The Ch127 image-xfer gate is
  smoke-only here (parameter is set but `xfer_writes_seen == 0`
  is asserted, since no TRXDIR/IMAGE packet is delivered in the
  raster-driven payload); Ch130 (TRXDIR-driven PSMCT16 e2e) is
  the load-bearing image-xfer-side counterpart.
  PSMCT16 counterpart of Ch123's PSMCT32 e2e demo. Same EE-bootlet → DMAC
  → GIF data plane: BIOS-resident EE program configures GS-priv
  (DISPFB1 PSMCT16, DISPLAY1, PMODE) via `sw` instructions →
  kicks DMAC ch2 → SYSCALL halts. DMAC delivers a 24-qword
  payload (4 SPRITE PACKED packets) through `gif_packed_stub`
  into `gs_stub` raster. The 4 sprites tile the 16×8 active area
  into 4 quadrants with per-quadrant unique RGB5A1 colors picked
  so the 5→8 bit-replicate at PCRTC output produces unique 8-bit
  RGB triples. With the raster gate on, all 128 per-pixel
  halfword stores go through `gs_swizzle_psmct16_stub`; with the
  pcrtc gate on, scanout reads from those same swizzled
  addresses. **Two-phase verification**: (1) full-frame scanout
  asserts each (x, y) reads back its quadrant's 5→8-expanded
  RGB; (2) per-pixel halfword readback via vram_stub's 2nd port
  at swizzled addresses (with `addr[1]` selecting the right
  16-bit slot) confirms each sprite halfword landed where the
  swizzle says — the 16×8 PSMCT16 image lives entirely in block
  (0,0) of page 0 (PSMCT16 block = 16×8 px), so the readback
  exercises ALL 16 xb × 8 yb entries of `columnTable16`. Strict
  linear-vs-swizzled separators at bytes 256 (linear y=2 row
  start at PSMCT16 stride=128) and 384 (linear y=3) stay 0 —
  both outside block (0,0)'s 256-byte range. Aggregate counts:
  `dma=(1,24,1) ee_dmac_wr=3 giftags=4 ad_writes=20
  xfer_writes=0 ee_priv_wr=4 bridge_fires=4 core_halt=1
  emits=128 frame=16x8`. Together with Ch123 (PSMCT32 e2e),
  this is the first state of the project where the full
  driver-shaped flow has end-to-end byte-accuracy demos for
  BOTH direct-color PS2 PSMs.
- `tb_gs_raster_swizzle_psmct16.sv` (Ch128) — focused contract for
  the new `PSMCT16_SWIZZLE` parameter on `gs_stub` (the raster emit
  surface). Mirrors Ch122's wiring shape but for PSMCT16: when the
  parameter is 1 AND the active raster PSM is PSMCT16
  (`ras_psm == 6'h02`), the per-pixel raster emit address is routed
  through the Ch125 `gs_swizzle_psmct16_stub` (FBP=ras_fbp, FBW=
  ras_fbw, x=s2_x_q[11:0], y=s2_y_q[11:0]) — its output is the
  absolute byte address. PSMCT32 is gated by its own
  `PSMCT32_SWIZZLE` parameter (Ch122). At Ch128 only, PSMT8/PSMT4
  raster emits stayed linear; Ch134 later closed the PSMT8 raster
  gate via `PSMT8_SWIZZLE` and Ch140 the PSMT4 raster gate via
  `PSMT4_SWIZZLE`, both on this same `gs_stub`.
  Default 0 keeps every existing PSMCT16 raster TB (Ch95 etc.)
  unchanged. **Three-phase verification**: (1) **origin SPRITE**
  — drive a 16×4 PSMCT16 SPRITE at FRAME_1{FBP=0, FBW=1, PSMCT16}
  with RGBAQ {R=0xAA, G=0x50, B=0xC0, A=0x00} → halfword 0x6155
  (R5=0x15, G5=0x0A, B5=0x18, A1=0). Per-pixel halfword readback
  via vram_stub's 2nd port (with `addr[1]` selecting the right
  16-bit slot) confirms each lands at the swizzled byte. The
  16×4 image lives in block (0,0) of page (0,0), so within-block
  columnTable16 rows 0..3 are exercised. **Strict separators**:
  bytes 128 (linear y=1 row start at PSMCT16 stride=128) and 256
  (linear y=2) stay 0 — proves the gate is live, since a fall-
  through to the legacy linear path would put the SPRITE
  halfword there. (2) **scanout agreement** — enable the Ch126
  swizzled-pcrtc path on the same VRAM contents, capture the
  full 16×4 frame, assert each visible pixel reads back the
  expected RGB after PCRTC's 5→8 bit-replicate (RGB = {0xAD,
  0x52, 0xC6}). Both gs_stub (Ch128 raster) and gs_pcrtc_stub
  (Ch126 scanout) instantiate the same swizzle module. (3)
  **non-origin SPRITE** — re-arm with FRAME_1{FBP=4, FBW=2,
  PSMCT16} and an 8×4 SPRITE at (60, 4)..(67, 7) with distinct
  color (halfword 0x9F8E). Crosses the PAGE-x boundary at x=64
  (page (0,0) for x∈[60..63] — block (0,3) by swizzle table —
  vs page (1,0) for x∈[64..67] — block (0,0)) so page_index
  changes mid-row. Within-block column-table coords (xb=12..15
  left of the boundary, xb=0..3 right of it, yb=4..7) cover
  columnTable16 rows 4..7, a different region than Phase 1's
  yb=0..3. Pins three contracts Phase 1 can't:
  (a) `ras_fbp` reaches the swizzle's `fbp` input (FBP=0 in P1
  would mask a tied-zero); (b) `ras_fbw` reaches `fbw` (FBW=1
  in P1 would mask a tied-one); (c) the swizzle gets the FULL
  absolute pixel coords s2_x_q/s2_y_q rather than bbox-local
  (P1's sprite started at (0,0), so absolute and local were
  equal). Strict P3 separator at byte 9336 (linear formula's
  effective (60, 4) byte) stays 0 — outside the P3 swizzled
  write set, which lives in block (0,3) of page (0,0)
  (10914..11006) and block (0,0) of page (1,0) (16512..16604).
  Total emit count after all phases: 64 + 32 = 96. With Ch126
  (read), Ch127 (TRXDIR upload), and Ch128 (raster emit) all
  live, the three major PSMCT16 paths are byte-consistent
  end-to-end — completes the byte-accuracy milestone for the
  second PSM, mirroring the Ch120/Ch121/Ch122 PSMCT32 closure.
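
The Ch128 color numbers (RGBAQ to halfword `0x6155`, then scanout RGB `{0xAD, 0x52, 0xC6}`) can be reproduced with a short sketch. It assumes the raster truncates each 8-bit channel to its top 5 bits, takes A1 from alpha bit 7, and packs as A1:B5:G5:R5; the function names are illustrative and not taken from the RTL.

```python
def pack_rgb5a1(r8, g8, b8, a8):
    """Truncate 8-bit channels to 5/5/5/1 and pack as A1:B5:G5:R5 (assumed)."""
    r5, g5, b5, a1 = r8 >> 3, g8 >> 3, b8 >> 3, a8 >> 7
    return (a1 << 15) | (b5 << 10) | (g5 << 5) | r5

def expand_5to8(c5):
    """Bit-replicate: the top 3 bits of the 5-bit value fill the low 3 bits."""
    return (c5 << 3) | (c5 >> 2)

def scanout_rgb(halfword):
    """Unpack R5/G5/B5 and expand each to 8 bits, as PCRTC does at output."""
    r5 = halfword & 0x1F
    g5 = (halfword >> 5) & 0x1F
    b5 = (halfword >> 10) & 0x1F
    return tuple(expand_5to8(c) for c in (r5, g5, b5))

halfword = pack_rgb5a1(0xAA, 0x50, 0xC0, 0x00)
rgb = scanout_rgb(halfword)
print(hex(halfword), [hex(c) for c in rgb])
```
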
- `tb_gs_image_xfer_swizzle_psmct16.sv` (Ch127) — focused contract
  for the new `PSMCT16_SWIZZLE` parameter on `gif_image_xfer_stub`.
  Mirrors Ch121's wiring shape but for PSMCT16: when the parameter
  is 1 AND the upload's PSM is PSMCT16, per-pixel byte addresses
  route through the Ch125 `gs_swizzle_psmct16_stub` (FBP=0,
  FBW=DBW, x=DSAX+cur_x, y=DSAY+cur_y) and `dest_base_q
  (= DBP*256)` is added back to anchor at the upload's DBP base.
  PSMCT32 is gated by its own `PSMCT32_SWIZZLE` parameter (Ch121);
  at Ch127, PSMT8/T4 uploads were always linear (Ch133 and Ch139
  later added their gates). Default 0 keeps every existing PSMCT16
  image-xfer TB unchanged. **Three-phase verification**: (1)
  **origin transfer** — TRXDIR upload of a 16×4 PSMCT16 image at
  DBP=DSAX=DSAY=0, DBW=1, RRW=16, RRH=4 → 64 pixels, 8 IMAGE
  qwords (8 px/qword for PSMCT16). After upload, the TB reads
  vram_stub's 2nd port at the SWIZZLED byte address (TB-side
  `ref_addr16/ref_block_idx16/ref_col_idx16` carry the verbatim
  PCSX2 tables locked at Ch125) and asserts each halfword landed
  where the swizzle says (selecting the right 16-bit slot inside
  the 32-bit word via `addr[1]`). Strict linear-vs-swizzled
  separators at bytes 128 (linear y=1) and 256 (linear y=2) stay
  0 — swizzled writes for the 16×4 image fill only block (0,0)
  bytes [0..126]. (2) **scanout agreement** — enable the Ch126
  swizzled-pcrtc path on the same VRAM contents, capture the
  full 16×4 frame, assert each scanned pixel matches the uploaded
  RGB5A1 → RGB888 5→8 bit-replicate. Both upload and scanout
  instantiate the same `gs_swizzle_psmct16_stub`. (3) **non-origin
  transfer** — re-arm with DBP=8, DSAX=12, DSAY=6, RRW=8, RRH=4.
  Effective coords (12..19, 6..9) cross block_x=0→1 at
  effective_x=16 AND block_y=0→1 at effective_y=8, exercising
  both block-table dimensions inside a single non-origin upload.
  Pins three contracts the origin transfer can't distinguish from
  a buggy implementation: (a) `dest_base_q (= DBP*256)` is added
  on top of the swizzle output (DBP=0 in P1 would mask a
  missing-add); (b) the swizzle is fed the FULL effective coords
  (DSAX=DSAY=0 in P1 would mask a "feeds only cur_x/cur_y"
  regression); (c) BOTH block_x and block_y propagate through
  `blockTable16[by][bx]` (block_x=0 throughout P1 would mask a
  tied-block_x regression). Strict P3 separator at byte 3096
  (linear formula's effective (12, 8) byte) stays 0 — outside
  the P3 swizzled write set [2048..3071]. NOTE (now historical):
  PSMCT16 raster swizzle was deferred when Ch127 landed; it
  shipped at Ch128 (mirrors Ch122 for PSMCT32) so the PSMCT16
  raster path is now byte-consistent with the image-xfer path
  documented here.
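
Why the Ch127 Phase 3 parameters exercise both block-table dimensions can be seen by enumerating the effective coordinates. A sketch under assumptions: PSMCT16 blocks are 16×8 px (as stated for Ch129), DBW=1 in Phase 3, and the legacy linear formula is `DBP*256 + y*DBW*128 + x*2`; all helper names are hypothetical.

```python
BLOCK_W, BLOCK_H = 16, 8           # PSMCT16 block size in pixels

def effective_coords(dsax, dsay, rrw, rrh):
    """Absolute coordinates fed to the swizzle: (DSAX + cur_x, DSAY + cur_y)."""
    return [(dsax + x, dsay + y) for y in range(rrh) for x in range(rrw)]

def blocks_touched(coords):
    """Set of (block_x, block_y) pairs; all coords here stay inside page 0."""
    return {(x // BLOCK_W, y // BLOCK_H) for x, y in coords}

def linear_addr_psmct16(dbp, dbw, x, y):
    """Assumed legacy row-major formula: 2 bytes per pixel, DBW*64 pixels per row."""
    return dbp * 256 + y * (dbw * 64 * 2) + x * 2

p1_blocks = blocks_touched(effective_coords(0, 0, 16, 4))
p3_blocks = blocks_touched(effective_coords(dsax=12, dsay=6, rrw=8, rrh=4))
p3_separator = linear_addr_psmct16(8, 1, 12, 8)
print(p1_blocks, sorted(p3_blocks), p3_separator)
```

Phase 1 stays inside a single block, so only Phase 3 can expose a tied `block_x` or `block_y` input.
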
- `tb_gs_raster_swizzle_psmt4.sv` (Ch140) — focused contract for
  the new `PSMT4_SWIZZLE` parameter on `gs_stub` (the raster emit
  surface). Mirrors Ch122/Ch128/Ch134 wiring shape but for the
  fourth (and last) PSM, and threads the Ch137 swizzle module's
  `nibble_hi` output into the existing Ch106 PSMT4 raster nibble
  RMW data lane (replacing `s2_pixel_index[0]` as the high/low
  nibble selector when the gate is on). When the parameter is 1
  AND the active raster PSM is PSMT4 (`ras_psm == 6'h14`), the
  per-pixel raster emit address is routed through the Ch137
  `gs_swizzle_psmt4_stub` (FBP=ras_fbp, FBW=ras_fbw,
  x=s2_x_q[11:0], y=s2_y_q[11:0]) — its `addr` output is the
  absolute byte address, AND its `nibble_hi` output keys
  `s2_emit_color64`'s nibble placement and `s2_emit_mask`'s
  high/low gating (write_be stays 4'b0001 for both paths).
  PSMCT32/PSMCT16/PSMT8 are gated by their own parameters;
  default 0 keeps every existing PSMT4 raster TB (Ch106
  raster_psmt4, Ch107 PSMT4-e2e, Ch103 PSMT4+CLUT, Ch104 round-
  trip, etc.) on the original linear addressing. No new ports.
  Default-off smoke verification: ran Ch106 + Ch107 + Ch103 +
  Ch104 PSMT4 TBs before writing the new TB; all PASSed
  unchanged. **Three-phase verification** (mirrors Ch134 PSMT8
  raster shape, with PSMT4 nibble adaptations + CLUT-disabled
  grayscale at scanout):
  (1) **origin SPRITE** at FBP=0/FBW=2 (FBW must be even per
  PCSX2 GSLocalMemory.h:560 — same as PSMT8). Drive a 16×4 PSMT4
  SPRITE with RGBAQ.R=0xAA (PSMT4 raster channel takes R[3:0] as
  the nibble per Ch106 → nibble = 0xA). Per-pixel nibble readback
  via vram_stub's 2nd port (with `addr[1:0]`-keyed byte
  selection then `nibble_hi`-keyed nibble selection inside the
  byte) confirms each pixel landed at the correct (byte, nibble)
  slot. The image lives in the upper-left of block (0,0) of page
  (0,0); within-block columnTable4 entries for yb=0..3, xb=0..15
  cover nibble_idx values [0..127] → byte_in_block ∈ [0..63].
  Strict separator: byte 64 (linear y=1 row start at PSMT4
  FBW=2 stride 64) stays 0.
  (2) **scanout agreement** — enable Ch138 swizzled-pcrtc on
  the same VRAM, capture full 16×4 frame, assert each pixel
  reads back as PSMT4 grayscale R=G=B={0xA, 0xA} = 0xAA. Both
  gs_stub and gs_pcrtc_stub instantiate the same
  `gs_swizzle_psmt4_stub` AND thread its `nibble_hi` output
  through their respective nibble selectors — agreement at this
  layer means both integrations land at the same byte+nibble
  positions for PSMT4.
  (3) **non-origin SPRITE** at FBP=4/FBW=4 (bw_pg=2) drawing
  8×4 SPRITE at (124, 4)..(131, 7) with R=0x55 (nibble = 0x5).
  Crosses PSMT4 PAGE-x at x=128 (page (0,0) for x∈[124..127],
  page (1,0) for x∈[128..131]). 2 blocks visited:
  blockTable4[0][3]=10 → page (0,0) block_base 10752;
  blockTable4[0][0]=0 → page (1,0) block_base 16384. Pins three
  contracts the origin transfer can't: ras_fbp reaches the
  swizzle's fbp input; ras_fbw reaches fbw; the swizzle gets
  the FULL absolute pixel coords s2_x_q/s2_y_q. Strict P3
  separator at byte 8766 (linear (124, 4) at FBP=4/FBW=4) stays
  0 — outside the P3 swizzled write set [10752..11007] +
  [16384..16639]. Total emit count: 64 + 32 = 96. **First-
  attempt PASS** errors=0.
  With Ch138 (read-side), Ch139 (TRXDIR upload), and Ch140
  (raster emit) all live, the three major PSMT4 paths can be
  byte-consistent under the canonical swizzle when their gates
  are flipped on — completing the **four-PSM × three-path
  byte-accuracy foundation** (CT32 Ch120/Ch121/Ch122 + CT16
  Ch126/Ch127/Ch128 + T8 Ch132/Ch133/Ch134 + T4 Ch138/Ch139/
  Ch140). End-to-end PSMT4 swizzled demos (mirroring Ch123/
  Ch124, Ch129/Ch130, Ch135/Ch136) became possible at this point
  and landed at Ch141/Ch142.
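
The Ch140 Phase 3 addresses can be cross-checked with a little arithmetic. In this sketch the frame base `FBP*2048` bytes is inferred from the quoted numbers (FBP=4 gives 8192) and is an assumption about the stub, as are the 8192-byte page, the 256-byte block, and the legacy linear formula; only the two `blockTable4` entries quoted in the text are used.

```python
PAGE_BYTES, BLOCK_BYTES = 8192, 256
PAGE_W, PAGE_H = 128, 128                     # PSMT4 page size in pixels
BLOCK_INDEX = {(0, 3): 10, (0, 0): 0}         # blockTable4[by][bx] entries from the text

def frame_base(fbp):
    """Inferred from the quoted numbers (FBP=4 -> 8192); an assumption."""
    return fbp * 2048

def block_base(fbp, bw_pg, x, y, by_bx):
    """Base byte of the block holding (x, y), given its (by, bx) table slot."""
    page_index = (y // PAGE_H) * bw_pg + x // PAGE_W
    return frame_base(fbp) + page_index * PAGE_BYTES + BLOCK_INDEX[by_bx] * BLOCK_BYTES

def linear_addr_psmt4(fbp, fbw, x, y):
    """Assumed legacy formula: FBW*64 pixels per row, two pixels per byte."""
    return frame_base(fbp) + y * (fbw * 64 // 2) + x // 2

left = block_base(4, 2, 124, 4, (0, 3))       # x in [124..127], page (0,0)
right = block_base(4, 2, 128, 4, (0, 0))      # x in [128..131], page (1,0)
separator = linear_addr_psmt4(4, 4, 124, 4)
print(left, right, separator)
```

The separator falls below both swizzled block ranges, so it stays 0 unless the gate falls through to the linear path.
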
- `tb_gs_raster_swizzle_psmt8.sv` (Ch134) — focused contract for
  the new `PSMT8_SWIZZLE` parameter on `gs_stub` (the raster emit
  surface). Mirrors Ch122's PSMCT32 + Ch128's PSMCT16 wiring shape
  but for the third PSM: when the parameter is 1 AND the active
  raster PSM is PSMT8 (`ras_psm == 6'h13`), the per-pixel raster
  emit address is routed through the Ch131 `gs_swizzle_psmt8_stub`
  (FBP=ras_fbp, FBW=ras_fbw, x=s2_x_q[11:0], y=s2_y_q[11:0]) —
  its output is the absolute byte address. PSMCT32/PSMCT16 are
  gated by their own parameters; at Ch134, PSMT4 stayed linear
  (Ch140 later added its gate). Default 0
  keeps every existing PSMT8 raster TB (Ch105 raster_psmt8, Ch107
  PSMT4-via-CT16-CLUT palette path, etc.) on the original linear
  addressing. No new ports — parameter-only API change. Default-
  off smoke verification: ran Ch105 `tb_gs_raster_psmt8` before
  writing the new TB; PASSed unchanged. **Three-phase verification**
  (mirrors Ch128 PSMCT16 raster shape):
  (1) **origin SPRITE** at FBP=0/FBW=2 (FBW must be even: PCSX2
  asserts `(bw & 1) == 0` for PSMT8). Drive a 16×8 PSMT8 SPRITE
  with RGBAQ.R=0xA5 (PSMT8 raster channel uses R as the byte
  index per Ch105). Per-pixel byte readback via vram_stub's 2nd
  port confirms each lands at the swizzled byte. The 16×8 image
  lives in the upper half of block (0,0) of page (0,0); the
  within-block columnTable8 distributes the 128 bytes across yb
  rows 0..7 — byte values 0..127 within the block. **Strict
  separators**: bytes 128 (linear y=1 row start at PSMT8
  stride=128) and 256 (linear y=2) stay 0 — proves the gate is
  live, since a fall-through to the legacy linear path would put
  the SPRITE byte there. (2) **scanout agreement** — enable the
  Ch132 swizzled-pcrtc path on the same VRAM, capture the full
  16×8 frame, assert each pixel's PCRTC PSMT8 grayscale R=G=B
  matches `idx=0xA5`. Both gs_stub and gs_pcrtc_stub instantiate
  the same `gs_swizzle_psmt8_stub`, so success proves byte-level
  agreement. (3) **non-origin SPRITE** at FBP=4/FBW=4 (bw_pg=2)
  drawing 8×4 SPRITE at (124, 4)..(131, 7) with RGBAQ.R=0x5A.
  Crosses PSMT8 PAGE-x at x=128 (x∈[124..127] is in page (0,0)
  block (0,7) by swizzle table; x∈[128..131] is in page (1,0)
  block (0,0)) so page_index changes mid-row. Pins three
  contracts the origin transfer can't: `ras_fbp` reaches the
  swizzle's fbp input (FBP=0 in P1 would mask a tied-zero);
  `ras_fbw` reaches fbw (FBW=2 in P1 would mask a tied-two); the
  swizzle gets the FULL absolute pixel coords s2_x_q/s2_y_q
  rather than bbox-local (P1 sprite started at (0,0) so
  absolute=local). PSMT8's page-x boundary at x=128 is different
  from CT32/CT16's x=64, so this exercises the PSMT8-specific
  x[7] wiring of the swizzle. Strict P3 separator at byte 9340
  (linear (124, 4) at FBP=4/FBW=4) stays 0 — outside the P3
  swizzled write set (page (0,0) block (0,7) at base 13568, page
  (1,0) block (0,0) at base 16384). Total emit count: 128 + 32 =
  160. **First-attempt PASS** errors=0. With Ch132 (read-side),
  Ch133 (TRXDIR upload), and Ch134 (raster emit) all live, the
  three major PSMT8 paths can be byte-consistent under the
  canonical swizzle when their gates are flipped on — completing
  the third-PSM byte-accuracy milestone for ALL three integration
  points (mirrors the Ch120/Ch121/Ch122 PSMCT32 trio + the
  Ch126/Ch127/Ch128 PSMCT16 trio).
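
The difference between the PSMT8 page-x boundary (x=128) and the CT32/CT16 boundary (x=64) mentioned for Ch134 can be illustrated by grouping sprite columns by horizontal page index. A small sketch using the page widths stated in this section; the table and function names are hypothetical:

```python
PAGE_W = {"PSMCT32": 64, "PSMCT16": 64, "PSMT8": 128, "PSMT4": 128}

def page_x(psm, x):
    """Horizontal page index of absolute pixel column x."""
    return x // PAGE_W[psm]

def split_by_page(psm, x0, x1):
    """Group the inclusive column range [x0, x1] by horizontal page index."""
    groups = {}
    for x in range(x0, x1 + 1):
        groups.setdefault(page_x(psm, x), []).append(x)
    return groups

psmt8_p3 = split_by_page("PSMT8", 124, 131)      # Ch134 Phase 3 columns
psmct16_p3 = split_by_page("PSMCT16", 60, 67)    # Ch128 Phase 3 columns
psmt8_at_64 = split_by_page("PSMT8", 60, 67)     # same columns, no crossing
print(psmt8_p3, psmct16_p3, psmt8_at_64)
```

A PSMT8 sprite placed at the Ch128 columns would never change page, which is why the Ch134 TB moves the sprite to x=124..131.
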
- `tb_gs_image_xfer_swizzle_psmt4.sv` (Ch139) — focused contract
  for the new `PSMT4_SWIZZLE` parameter on `gif_image_xfer_stub`.
  Mirrors Ch121/Ch127/Ch133 wiring shape but for the fourth (and
  last) PSM, and threads the Ch137 swizzle module's `nibble_hi`
  output into the existing Ch118 nibble RMW data lane (replacing
  `x_eff[0]` as the high/low nibble selector when the gate is
  on). When the parameter is 1 AND the active DPSM is PSMT4, the
  per-pixel byte address is `dest_base_q (= DBP*256) +
  swizzle_psmt4(FBP=0, FBW=DBW, x=DSAX+cur_x, y=DSAY+cur_y).addr`,
  AND `cur_mask_c` is `0x0000_00F0` when `swizzle4_nibble_hi=1`
  (high nibble) or `0x0000_000F` when 0 (low nibble) — the
  per-bit write_mask machinery (vram_stub merges only the
  targeted nibble) layers on top of the swizzled address. PSMCT32
  /PSMCT16/PSMT8 are gated by their own parameters. Default 0
  keeps the legacy linear path for every existing PSMT4 image-
  xfer TB (Ch118 etc.). No new ports — parameter-only API
  change. Default-off smoke verification: ran Ch118
  `tb_gs_image_xfer_psmt4` before writing the new TB; PASSed
  unchanged. **Three-phase verification** (mirrors Ch127/Ch133
  audit-closed shape): (1) **origin write-side lock** at DBP=0/
  DBW=2/DSAX=DSAY=0 (DBW must be even per PCSX2 GSLocalMemory.h:
  560 — same FBW-evenness as PSMT8). 16×4 PSMT4 image upload via
  2 IMAGE qwords (32 px/qword for PSMT4, i.e. 2 rows of 16 px
  each at RRW=16). After upload, per-pixel nibble readback at the
  swizzled `(addr, nibble_hi)` slot asserts each nibble landed
  where the swizzle says. Strict separator: PSMT4 row stride at
  DBW=2 = DBW*32 = 64 bytes, so linear y=1 starts at byte 64.
  Swizzled write set lives in [0..63] within block (0,0). Byte
  64 stays 0 (verified via per-byte check, not full-word — the
  `check_byte_zero` task initially had a full-word bug that
  misreported neighbor-byte writes; fixed to check only the
  targeted byte via `addr[1:0]`-keyed case statement).
  (2) **end-to-end agreement**: enable Ch138 PSMT4 swizzled
  scanout on the same VRAM (PSMT4_SWIZZLE=1 on pcrtc, CLUT
  disabled), capture the 16×4 frame, verify each pixel's grayscale
  R=G=B={nibble, nibble} matches `nibble_at(xx, yy)`. Both
  modules instantiate the same `gs_swizzle_psmt4_stub` so success
  proves byte+nibble-level agreement under TRXDIR-style emit +
  scanout-style read.
  (3) **non-origin transfer** at DBP=8/DBW=2/DSAX=28/DSAY=12/
  RRW=8/RRH=8. Effective coords (28..35, 12..19) cross block_x=
  0→1 at effective_x=32 AND block_y=0→1 at effective_y=16 (PSMT4
  block geometry: 32×16 px). All 4 corner blocks of page (0,0)
  at DBP=8 visited: blockTable4[0][0]=0, [0][1]=2, [1][0]=1,
  [1][1]=3 (block bases 2048/2560/2304/2816). Pins three
  contracts the origin transfer can't: dest_base_q ADDED ON TOP
  of the swizzle output (DBP=0 in P1 would mask a missing-add
  regression — fixed during bring-up after the TB initially
  passed P3_DBP directly to ref_pos_psmt4 instead of using
  fbp_v=0 + adding DBP*256); FULL effective coords; BOTH
  block_x and block_y propagate through `blockTable4[by][bx]`.
  Phase 3 strict separator: linear formula puts effective coord
  (28, 12) at byte 2830 — under linear, the neighboring pixel
  (29, 12) writes high nibble = 1 to that byte. Under swizzled,
  no Phase-3 pixel hits byte 2830 (cross-checked: col_idx_psmt4
  for the 4-block × 16-pixel coord set never produces nibble_idx
  28 or 29). Byte 2830 stays 0 → fall-through to linear would
  have stomped it with 0x10. **PASS** errors=0 after two bug-fix
  iterations: (a) ref_pos_psmt4(P3_DBP, ...) was wrong — engine
  feeds FBP=0 to the swizzle and adds DBP*256 separately, so TB
  must do the same; (b) check_byte_zero tested the full word
  instead of the targeted byte, producing false failures when a
  neighbor byte in the same word was independently touched.
  Counts: arms=2, writes=128 (P1 64 + P3 64). With Ch138 (read-
  side scanout) + Ch139 (image-xfer write-side) + Ch140 (raster
  write-side) all live, the Ch137 PSMT4 primitive now has all 3
  integration points wired, and Ch141 closes the e2e demo.
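
The strict-separator bytes quoted in the row above follow from the legacy linear PSMT4 formula. A minimal Python sketch, assuming the linear path is `DBP*256 + y*(DBW*32) + x/2` with the nibble selected by `x[0]`; the function name `linear_psmt4` is hypothetical and is not an RTL identifier:

```python
def linear_psmt4(dbp, dbw, x, y):
    """Return (byte_addr, nibble_hi) under the assumed legacy linear PSMT4 layout."""
    row_stride = dbw * 32                  # DBW is in 64-px units; PSMT4 packs 2 px per byte
    byte_addr = dbp * 256 + y * row_stride + (x >> 1)
    return byte_addr, x & 1                # odd x selects the high nibble

# Phase 1 separator: linear y=1 row start at DBP=0/DBW=2.
print(linear_psmt4(0, 2, 0, 1))            # (64, 0)
# Phase 3 separator: effective coords (28, 12) and (29, 12) at DBP=8/DBW=2.
print(linear_psmt4(8, 2, 28, 12))          # (2830, 0)
print(linear_psmt4(8, 2, 29, 12))          # (2830, 1)
```

Both Phase 3 pixels share byte 2830 under the linear formula, which is why a fall-through to linear would leave a non-zero value there.
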
- `tb_gs_image_xfer_swizzle_psmt8.sv` (Ch133) — focused contract
  for the new `PSMT8_SWIZZLE` parameter on `gif_image_xfer_stub`.
  Mirrors Ch121's PSMCT32 + Ch127's PSMCT16 wiring shape but for
  the third PSM: when the parameter is 1 AND the active DPSM is
  PSMT8, the per-pixel byte address is `dest_base_q (= DBP*256) +
  swizzle_psmt8(FBP=0, FBW=DBW, x=DSAX+cur_x, y=DSAY+cur_y)`.
  PSMCT32/PSMCT16 are gated by their own parameters; PSMT4 stays
  linear here (its swizzle math landed later, in Ch137). Default 0 keeps the legacy
  linear path for every existing PSMT8 image-xfer TB (Ch117 etc.).
  No new ports — parameter-only API change. Default-off smoke
  verification: ran Ch117 `tb_gs_image_xfer_psmt8` before writing
  the new TB; PASSed unchanged. **Three-phase verification**
  (mirrors Ch127 audit-closed shape):
  (1) **origin write-side lock** at DBP=0/DBW=2 (DBW must be even
  per PCSX2 GSLocalMemory.h:553 — PSMT8 pages are 128 px wide vs
  FBW's 64-px units, so 2 FBW units per page → bw_pg=1 here).
  16×8 PSMT8 image upload via 8 IMAGE qwords (16 px/qword). Per-
  pixel index `idx_at(x, y) = (y[2:0] << 4) | x[3:0]` ∈
  [0x00..0x7F]. After upload, byte-readback at the swizzled
  address asserts each byte landed where the swizzle says. Strict
  separators: linear y=1 (byte 128) and y=2 (byte 256) row starts
  stay 0 — swizzled write set lives entirely in [0..127].
  (2) **end-to-end agreement**: enable Ch132 swizzled scanout on
  the same VRAM, capture the frame, verify each visible pixel's
  PCRTC PSMT8 grayscale R=G=B matches `idx_at(x, y)`. Both modules
  instantiate the same `gs_swizzle_psmt8_stub` so success proves
  byte-level agreement under TRXDIR-style emit + scanout-style
  read. (3) **non-origin transfer** at DBP=8/DBW=2/DSAX=12/DSAY=10/
  RRW=8/RRH=8. Effective coords (12..19, 10..17) cross block_x=0→1
  at effective_x=16 AND block_y=0→1 at effective_y=16, so all 4
  corner blocks of page (0,0) at DBP=8 (blockTable8[0][0]=0,
  [0][1]=1, [1][0]=2, [1][1]=3 → block bases 2048/2304/2560/2816)
  are visited. Pins three contracts the origin transfer can't:
  `dest_base_q = DBP*256` ADDED ON TOP; the swizzle is fed FULL
  effective coords (DSAX/DSAY non-zero); BOTH block_x and block_y
  propagate through `blockTable8[by][bx]`. Phase 3 distinct-pixel
  pattern uses `p3_idx = 0x80 | idx` ∈ [0x80..0xFF] (disjoint
  from Phase 1's [0x00..0x7F]) so a P3 pixel landing at a P1
  byte (or vice versa) surfaces as wrong RGB. Phase 3 strict
  separator: linear formula puts effective coord (12, 10) at
  byte `2048 + 10*128 + 12 = 3340` (outside swizzled set
  [2048..3071]); byte 3340 stays 0 — proves a fall-through to
  linear would have stomped that byte. **First-attempt PASS**:
  arms=2, writes=192 (=128+64), errors=0. NOTE: at Ch133 only,
  PSMT8 raster-side emits via `gs_stub` still used linear
  addressing — Ch133 was image-xfer write-side only. Ch134 later
  closed the raster-side gate via `PSMT8_SWIZZLE` on `gs_stub`
  (mirrors Ch122 for PSMCT32 and Ch128 for PSMCT16) — see Ch134
  row above.
- `tb_gs_scanout_swizzle_psmt4.sv` (Ch138) — focused contract for
  the new `PSMT4_SWIZZLE` parameter on `gs_pcrtc_stub`. Mirrors
  Ch120/Ch126/Ch132's read-side-first wiring shape but adds the
  PSMT4-specific twist: the swizzle module outputs both an
  absolute byte address AND a `nibble_hi` selector (PSMT4 = 4
  bits/pixel = 2 pixels per byte, and the canonical PCSX2 column
  table reorders nibbles within a block, so `pixel_index[0]`
  is no longer the right selector under the swizzled layout).
  When the parameter is 1 AND the active PSM is PSMT4, scanout
  reads go through the Ch137 `gs_swizzle_psmt4_stub` and the
  PSMT4 nibble extractor uses `swizzle4_nibble_hi` instead of
  `pixel_index[0]`. PSMCT32/PSMCT16/PSMT8 are gated by their own
  parameters; default 0 keeps every existing PSMT4 scanout TB
  (Ch103 PSMT4+CLUT, Ch104 PSMT4 round-trip, Ch107 PSMT4 e2e,
  etc.) on the legacy linear path. No new ports — parameter-
  only API change. Default-off smoke verification: ran Ch103
  `tb_gs_scanout_psmt4_clut` + Ch104 `tb_gs_psmt4_round_trip`
  before writing the new TB; both PASSed unchanged. **Two-phase
  verification** (mirrors Ch132 closure shape; CLUT disabled so
  PCRTC's PSMT4 grayscale fallback gives `r=g=b={nibble,
  nibble}` at scanout):
  (1) **origin** at FBP=0/FBW=2/DBX=DBY=0 (FBW must be even per
  PCSX2 GSLocalMemory.h:560 because PSMT4 pages are 128 px wide,
  same as PSMT8). 16×4 region preloaded at swizzled bytes via a
  TB-side `byte_shadow` accumulator that lays each pixel's
  nibble at its `(addr, nibble_hi)` slot; bytes are then flushed
  to vram_stub via per-byte BE writes. Per-pixel nibble pattern
  `nibble_at(x, y) = ((y << 1) ^ x) & 4'h7` ∈ [0..7] gives
  position-dependent gray values across the 16×4 frame. The image lives entirely
  in block (0,0) of page (0,0) and exercises within-block
  columnTable4 entries for yb=0..3, xb=0..15. Strict separator:
  byte 64 (linear y=1 row start at FBW=2 stride) pre-colored
  with sentinel 0xCC (gray=0xCC, unproducible by Phase 1's
  [0..7]-nibble pattern) — fall-through to linear would surface
  as RGB(0xCC, 0xCC, 0xCC).
  (2) **non-origin** at FBP=4/FBW=4 (bw_pg=2), DBX=120, DBY=126.
  Effective coords range x∈[120..135], y∈[126..129]. page_x
  crosses 0→1 at effective_x=128, page_y crosses 0→1 at
  effective_y=128 (PSMT4's 128-tall page boundary — different
  from PSMT8's 64-tall). All 4 corner pages of FBP=4/FBW=4
  visited, each with a distinct blockTable4 lookup
  (blockTable4[7][3]=31 → page (0,0) block_base 16128;
  blockTable4[7][0]=21 → page (1,0) block_base 21760;
  blockTable4[0][3]=10 → page (0,1) block_base 27136;
  blockTable4[0][0]=0 → page (1,1) block_base 32768). A
  regression that tied any of {dispfb_fbp, dbx, dby, FBW,
  block_x, block_y, page_index, bw_pg=FBW/2, swizzle
  nibble_hi} to zero would NOT survive Phase 2. Strict P2
  separator: byte 24380 (linear formula's place for (120, 126);
  outside all 4 swizzled chunks) pre-colored with sentinel 0xDD
  → fall-through to linear would surface as RGB(0xDD, 0xDD,
  0xDD), unproducible by the Phase-2 pattern. **PASS** errors=0
  after one bug-fix iteration: Phase 2's flush-loop initially
  hardcoded the wrong byte ranges due to a `blockTable4[7][3]`
  lookup mistake (the value is 31, not 15) — replaced with a
  shadow-array sweep [256..65535] that flushes any non-zero
  byte, eliminating the hardcode/lookup mismatch class entirely.
  NOTE (now historical): Ch138 was read-side only when
  introduced; the PSMT4 write-side is now live as well — Ch139
  (image-xfer) + Ch140 (raster) + Ch141 (raster-driven e2e
  demo). With Ch138, **all four common GS PSMs now have read-
  side byte-accuracy under their swizzle gates** (CT32 Ch120 +
  CT16 Ch126 + T8 Ch132 + T4 Ch138).
- `tb_gs_scanout_swizzle_psmt8.sv` (Ch132) — focused contract for
  the new `PSMT8_SWIZZLE` parameter on `gs_pcrtc_stub`. Mirrors
  Ch120/Ch126's wiring shape but for PSMT8: when the parameter is
  1 AND the active PSM is PSMT8, scanout reads go through the
  Ch131 `gs_swizzle_psmt8_stub` (real PS2 GS page/block/column
  layout — 128×64 pixel pages, 4×8 block grid, 16×16 within-block
  bytes, `bw_pg = FBW>>1`) instead of the legacy linear
  `FBW*64*y + x` formula. PSMCT32/PSMCT16 are gated by their own
  parameters; PSMT4 stays linear here (its swizzle math landed later, in Ch137).
  Default PSMT8_SWIZZLE=0 keeps every existing PSMT8 scanout TB
  (Ch96 storage-only, Ch97 PSMT8+CLUT, Ch103 PSMT4-via-CT16-CLUT,
  Ch107 PSMT4-e2e palette path) on the original linear addressing.
  No new ports — parameter-only API change. Default-off smoke
  verification: ran Ch96 `tb_gs_scanout_psmt8` before writing the
  new TB; PASSed unchanged, confirming the new instance + 4-way
  mux extension don't disturb the linear path. **Two-phase
  verification** (mirrors Ch126 PSMCT16 closure shape):
  (1) **origin** (FBP=0, FBW=2, DBX=DBY=0; FBW must be even —
  PCSX2 asserts `(bw & 1) == 0` for PSMT8 because pages are 128 px
  wide vs FBW's 64-px units, so 2 FBW units per page → bw_pg=1
  here). 16×8 region preloaded at swizzled bytes; per-pixel index
  `idx = (y[2:0] << 4) | x[3:0]` ∈ [0x00..0x7F] surfaces as
  grayscale R=G=B=idx via PCRTC's PSMT8 fallback (Ch96). x∈[0..15]
  is entirely block_x_in_page=0, so the within-block test
  exercises ALL 16 xb positions of `columnTable8` across yb rows
  0..7. Strict separators: linear y=1 starts at byte 128 (FBW=2
  stride) but swizzled lands at byte 8 (`columnTable8[1][0]=8`,
  no `*2` scale since PSMT8 is 1 byte/pixel); linear x=8,y=0 is
  byte 8 but swizzled is byte 2. (2) **non-origin** (FBP=4,
  FBW=4 → bw_pg=2, DBX=120, DBY=60). Effective coords range
  x∈[120..135], y∈[60..67] — page_x crosses 0→1 at effective_x=128
  (proves x[7] reaches the page-x lane of the PSMT8 swizzle —
  different boundary from CT16/CT32's x[6]); page_y crosses 0→1
  at effective_y=64; block_x and block_y both flip; ALL 4 pages
  (0,0)/(1,0)/(0,1)/(1,1) are visited, each with a distinct
  blockTable8 lookup ([3][7]=31, [3][0]=10, [0][7]=21, [0][0]=0).
  A regression that tied any of {dispfb_fbp, dbx, dby, FBW,
  block_x, block_y, page_index, bw_pg=FBW/2} to zero would NOT
  survive Phase 2. **Sentinel separator**: byte 24500 (inside
  linear range 23672..25479 for the Phase-2 effective region,
  outside ALL 4 swizzled write-set blocks) pre-colored with 0xFF
  → fall-through to linear would surface as RGB(0xFF, 0xFF, 0xFF),
  which is unproducible by the Phase-2 unique pattern (idx ∈
  [0x00..0x7F]). **First-attempt PASS** errors=0 — no audit
  iteration needed because Phase 2's coord choices were designed
  up front to make all 7 chain-layer wires load-bearing AND the
  page-x crossing boundary is at PSMT8's specific x=128 (not the
  64-px boundary the direct-color PSMs use). NOTE (now historical):
  Ch132 was read-side only when introduced; Ch133 then Ch134
  closed the image-xfer + raster write sides for PSMT8, so all
  three PSMT8 swizzle integration points are now live (mirrors
  Ch120/Ch121/Ch122 for PSMCT32 and Ch126/Ch127/Ch128 for PSMCT16).
- `tb_gs_scanout_swizzle_psmct16.sv` (Ch126) — focused contract
  for the new `PSMCT16_SWIZZLE` parameter on `gs_pcrtc_stub`.
  Mirrors Ch120's wiring shape but for PSMCT16: when the
  parameter is 1 AND the active PSM is PSMCT16, scanout reads
  go through the Ch125 `gs_swizzle_psmct16_stub` (real PS2 GS
  page/block/column layout) instead of the legacy linear
  `(FBW*64*y + x)*2` formula. PSMCT32 is gated by its own
  `PSMCT32_SWIZZLE` parameter (Ch120); PSMT8/PSMT4 stay linear.
  Default 0 keeps every existing PSMCT16 scanout TB
  (Ch94/Ch95/Ch103/etc.) on the original linear addressing.
  Topology: TB drives `vram_stub.write_*` directly with each
  pixel's RGB5A1 halfword preloaded at the swizzled byte address
  (TB-side `ref_addr16()` mirrors the swizzle math + the Ch125
  source-table-locked tables); pcrtc with `PSMCT16_SWIZZLE=1`
  scans out the 16×8 frame and the TB asserts each captured
  pixel matches the preloaded color after 5→8 bit-replicate.
  Per-pixel pattern is unique per (x, y): R5=`(x^y)&0xF`,
  G5=`x&0xF`, B5=`y&0xF`, expanded to 8 bits via PCRTC's
  bit-replicate. The PSMCT16 swizzle vs. linear distinction
  shows up at any y>0 (linear y=1 → byte 128 with FBW=1, but
  swizzled within block (0,0) yb=1 → columnTable16[1][0]=4
  → byte 8) and at x=8, y=0 (linear byte 16 vs swizzled byte 2)
  so even within the first row + first block, the gate is a
  strict separator. NOTE (now historical): Ch126 was read-side
  only when introduced; Ch127 (image-xfer) then Ch128 (raster)
  closed the PSMCT16 write sides, mirroring Ch121/Ch122 for
  PSMCT32.
- `tb_gs_swizzle_psmt4.sv` (Ch137) — focused contract for the new
  `gs_swizzle_psmt4_stub` math primitive: a pure-comb module mapping
  `(FBP, FBW, x, y)` to a VRAM **byte address + nibble_hi selector**
  using the real PS2 GS PSMT4 layout (8 KiB pages organized as
  128×128 PSMT4 pixels, 2× as many pixels per page as PSMT8 since
  each PSMT4 pixel is a NIBBLE; 32 blocks/page in an 8-rows × 4-cols
  grid (same orientation as PSMCT16's blockTable16); each block
  32×16 pixels = 512 nibbles = 256 bytes; **512-entry within-block
  column table** — 2× the entries of PSMT8's 256-entry table due to
  the doubled block area, indexed [yb][xb] with yb=0..15 + xb=0..31
  → nibble 0..511). PSMT4 is the most complex of the four common GS
  PSMs because each pixel is HALF a byte, so the swizzle outputs
  both a byte address and a `nibble_hi` selector (=0 for low
  nibble of the byte at `addr`, =1 for high). PSMT4 reuses PSMT8's
  page-stride convention (`bw_pg = FBW >> 1`; PCSX2 asserts FBW
  must be even at GSLocalMemory.h:560 because PSMT4 pages are 128
  px wide). Source-table provenance pinned: `_blockTable4` taken
  verbatim from pcsx2/GS/GSTables.cpp lines 61–69; `columnTable4`
  from same file lines 147–213. Master HEAD commit
  `3000e113e2b3a76357c08dfa80d3c747f40e2706`; file blob SHA
  `3581209b8217378f473f9de22a9dbc8c45ca49b6` (same blob Ch131
  pinned). Cross-checked against GSLocalMemory.h:558
  `BlockNumber4` + the `pxOffset` template at GSTables.cpp:247–258
  (blockSize=512 in NIBBLE units, pageSize=16384 nibble units =
  8192 bytes, pageWidth=128). The existing per-bit write_mask
  0x0F/0xF0 nibble RMW from Ch106/Ch118 will still apply on top
  of the swizzled byte address — the swizzle module doesn't touch
  the nibble merge logic; it just produces (addr, nibble_hi).
  **Five-phase verification** (mirrors Ch125/Ch131 shape, scaled
  up): (1) **spot-checks** at 15 hand-computed corners (origin,
  intra-block xb=1/8/16/yb=1/yb=2-with-hi-nibble, last nibble of
  block (0,0), first/second/third/fourth horizontal blocks,
  second-row-of-blocks origin, page-x at x=128 + page-y at y=128,
  FBP=4 origin, page0-last-pixel (127,127) → addr 8191 hi=1).
  (2a) **INDEPENDENT column-table source lock** — 32 hard-coded
  `check_nibble()` calls for yb=0 (literal-by-literal verbatim
  from PCSX2 columnTable4 row 0) PLUS a programmatic walk for
  yb=1..15 against the in-TB ref function (480 more checks);
  Phase 2a's literal yb=0 row + Phase 5's bijectivity sweep +
  Phase 3's literal block-table lock together pin the table.
  (3) **INDEPENDENT block-table source lock** — 32 hard-coded
  checks (one per block in page 0) with expected block index
  taken VERBATIM from PCSX2 blockTable4. (4) block-swizzle walk
  via in-TB ref_block_idx4. (5) **bijectivity sweep over the
  128×128 page** — 16384 NIBBLE slots (vs PSMT8's 8192 byte
  slots), every pixel must hit a unique (byte_addr, nibble_hi)
  pair and agree with both the in-TB ref byte address AND
  ref nibble_hi. Plus multi-page sanity at FBW=4/bw_pg=2
  (page-x crossing at x=192 → byte 10496 with blockTable4[1][2]=9,
  and page-y crossing at y=128 → byte 16384) and non-page-aligned
  FBP coverage at FBP ∈ {1,2,3}, including FBP=3+FBW=4+page-(1,1)
  intra-block at (129, 129) → byte 30732 (= 6144 + 3*8192 + 0*256
  + ref_col_idx4(1,1)/2 = 30720 + 12). **First-attempt PASS**
  errors=0. NOTE: This module is NOT YET wired into
  `gs_pcrtc_stub` / `gif_image_xfer_stub` / `gs_stub` — those
  still use linear PSMT4 addressing as of Ch137. The math is
  locked here so follow-on chapters can wire `PSMT4_SWIZZLE`
  parameter gates into the existing address paths without
  disturbing the legacy linear-PSMT4 TBs (Ch103 / Ch106 / Ch107
  / Ch118). With Ch119 PSMCT32 + Ch125 PSMCT16 + Ch131 PSMT8 +
  Ch137 PSMT4, **all four common GS PSMs now have byte-accurate-
  to-real-PS2 swizzle math available as standalone primitives** —
  the four-PSM swizzle math foundation is complete. Future
  chapters can wire PSMT4 into pcrtc/image-xfer/raster behind a
  PSMT4_SWIZZLE parameter (mirroring Ch120→Ch124 / Ch126→Ch130
  / Ch132→Ch136), with the existing nibble RMW machinery layered
  on top.
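
The page and block part of the PSMT4 address chain described above can be sketched in Python. This is an illustration under stated assumptions, not the DUT: `BLOCK_TABLE4` is an assumed transcription of PCSX2's `blockTable4` (only the entries quoted in this document are confirmed by the text), `psmt4_block_base` is a hypothetical name, and the within-block `columnTable4` lookup is deliberately left out.

```python
# Assumed transcription of blockTable4, indexed [block_y][block_x] (8 rows x 4 cols).
BLOCK_TABLE4 = [
    [0, 2, 8, 10],
    [1, 3, 9, 11],
    [4, 6, 12, 14],
    [5, 7, 13, 15],
    [16, 18, 24, 26],
    [17, 19, 25, 27],
    [20, 22, 28, 30],
    [21, 23, 29, 31],
]

def psmt4_block_base(fbp, fbw, x, y):
    """Byte address of the 256-byte block holding PSMT4 pixel (x, y)."""
    bw_pg = fbw >> 1                                   # FBW must be even: 128-px-wide pages
    page = (y // 128) * bw_pg + (x // 128)             # 128x128 px per 8 KiB page
    block = BLOCK_TABLE4[(y % 128) // 16][(x % 128) // 32]  # 32x16 px per block
    return fbp * 2048 + page * 8192 + block * 256

# Multi-page sanity values quoted above (FBW=4, bw_pg=2).
print(psmt4_block_base(0, 4, 192, 16))    # 10496
print(psmt4_block_base(0, 4, 0, 128))     # 16384
print(psmt4_block_base(3, 4, 129, 129))   # 30720 (the quoted 30732 adds the column offset 12)
```
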
- `tb_gs_swizzle_psmt8.sv` (Ch131) — focused contract for the new
  `gs_swizzle_psmt8_stub` math primitive: a pure-comb module mapping
  `(FBP, FBW, x, y)` to a VRAM byte address using the real PS2 GS
  PSMT8 layout (8 KiB pages organized as 128×64 PSMT8 pixels — 2×
  wider than CT16's 64×64 page; 32 blocks/page in a 4-rows × 8-cols
  grid; each block 16×16 pixels = 256 bytes; **256-entry within-
  block column table** — 2× the entries of CT16's 128-entry table
  due to the doubled block area, indexed [yb][xb] with yb=0..15 +
  xb=0..15 → byte 0..255). PSMT8 also introduces a new page-stride
  constant `bw_pg = FBW >> 1` (PCSX2 asserts `(bw & 1) == 0` at
  GSLocalMemory.h:553 because PSMT8 pages are 128 px wide vs FBW's
  64-px units → 2 FBW units per PSMT8 page, so FBW must be even).
  Source-table provenance pinned: `blockTable8` taken verbatim from
  pcsx2/GS/GSTables.cpp lines 53–59; `columnTable8` from same file
  lines 111–145. Master HEAD commit
  `3000e113e2b3a76357c08dfa80d3c747f40e2706`; file blob SHA
  `3581209b8217378f473f9de22a9dbc8c45ca49b6`. Cross-checked against
  GSLocalMemory.h:551 `BlockNumber8` + the `pxOffset` template at
  GSTables.cpp:247–258 (blockSize=256, pageSize=8192, pageWidth=128).
  PCSX2's `bp` is in 256-byte block-pointer units; in our
  FBP=2048-byte units, `bp = FBP * 8` so `bp*256 = FBP*2048`.
  **Five-phase verification** (mirrors Ch125 PSMCT16 shape):
  (1) **spot-checks** at 15 hand-computed corners (origin, intra-
  block xb=1/4/8/yb=1, last byte of block (0,0), first/second block
  origins, second row of blocks, third+fourth blocks, page-x at
  x=128 and page-y at y=64, FBP=4 origin); (2a) **INDEPENDENT
  column-table source lock** — 256 hard-coded `check()` calls (one
  per (yb, xb) inside block (0,0)) where the expected byte index is
  taken VERBATIM from PCSX2 columnTable8 with `<literal>` arithmetic,
  NOT derived from the in-TB ref function. Catches any case where
  DUT and ref share the same miscopy (the same trap Ch125 added
  Phase 2a for with PSMCT16's column table); (2b) within-block
  16×16 walk via the in-TB ref_col_idx8 (self-check); (3)
  **INDEPENDENT block-table source lock** — 32 hard-coded checks
  (one per block in page 0) with the expected block index taken
  VERBATIM from PCSX2 blockTable8, NOT derived from the in-TB ref;
  (4) block-swizzle walk via in-TB ref_block_idx8; (5)
  **bijectivity sweep over the 128×64 page** — 8192 byte slots
  (vs CT16's 4096 halfword slots), every pixel must hit a unique
  byte address in `[0, 8192)` and agree with the in-TB reference.
  Plus multi-page sanity at FBW=4/bw_pg=2 (page-x crossing at
  x=192 and page-y crossing at y=64) and non-page-aligned FBP
  coverage at FBP ∈ {1, 2, 3}, including FBP=3+FBW=4+page-(1,1)
  intra-block crossing at (129, 65). **First-attempt PASS**
  errors=0. NOTE: This module is NOT YET wired into
  `gs_pcrtc_stub` / `gif_image_xfer_stub` / `gs_stub` — those
  still use linear PSMT8 addressing as of Ch131. The math is
  locked here so follow-on chapters can wire `PSMT8_SWIZZLE`
  parameter gates into the existing address paths without
  disturbing the legacy linear-PSMT8 TBs (Ch96 / Ch97 / Ch103 /
  Ch105 / Ch107 / Ch117). With Ch119 PSMCT32 + Ch125 PSMCT16 +
  Ch131 PSMT8, three of the four common GS PSMs now have byte-
  accurate-to-real-PS2 swizzle math available as standalone
  primitives; PSMT4 (with its 32×16 nibble intra-block layout) was
  the natural follow-on candidate (its math landed in Ch137).
- `tb_gs_swizzle_psmct16.sv` (Ch125) — focused contract for the
  new `gs_swizzle_psmct16_stub` math primitive: a pure-comb module
  mapping `(FBP, FBW, x, y)` to a VRAM byte address using the real
  PS2 GS PSMCT16 layout (8 KiB pages organized as 64×64 PSMCT16
  pixels; 32 blocks/page in an 8-rows × 4-cols grid; each block 16×8 pixels =
  256 bytes; **non-trivial within-block column table** — unlike
  PSMCT32 where within-block IS row-major words by accident,
  PSMCT16 has 4 internal 16×2-pixel sub-columns with a 128-entry
  permutation). Source-table provenance pinned: `blockTable16`
  taken verbatim from pcsx2/GS/GSTables.cpp lines 29–39
  (master HEAD commit 3d71e310; file-touch commit d983b2b0,
  2026-01-12); `columnTable16` from same file lines 91–109.
  Cross-check against the older Debian-packaged GSdx
  `PixelAddressOrg16(x, y, bp, bw) = (BlockNumber16(...) << 7) +
  columnTable16[y & 7][x & 15]` confirms the address chain
  (`<< 7` lifts to halfword units, multiply by 2 for bytes; in
  our FBP=2048-byte units, bp = FBP * 8 so bp*256 = FBP*2048).
  **Five-phase verification**: (1) spot-checks at 13 well-defined
  corners (origin, intra-block, first/second block, second row of
  blocks, page-x and page-y boundaries, FBP=4 origin); (2)
  within-block 16×8 walk asserting `byte = 2 * columnTable16[yb][xb]`
  — locks the column table; a row-major-halfwords regression would
  fail; (3) **source-table lock** — 32 hard-coded address checks
  (one per block in page 0) with the expected block index taken
  VERBATIM from PCSX2 blockTable16, NOT derived from the in-TB
  reference function; (4) block-swizzle walk cross-checking the
  in-TB ref function against the DUT (the bijectivity sweep
  relies on it being correct); (5) **bijectivity sweep over the
  64×64 page** — 4096 halfword slots, every pixel must hit a
  unique halfword address in `[0, 8192)` and agree with the in-TB
  reference. Plus multi-page sanity at FBW=2 and non-page-aligned
  FBP coverage at FBP ∈ {1, 2, 3} (real PS2 supports any
  2048-byte-aligned FBP — same broadening Ch119 adopted post-
  audit). NOTE: This module is NOT YET wired into `gs_pcrtc_stub`
  / `gif_image_xfer_stub` / `gs_stub` — those still use linear
  PSMCT16 addressing as of Ch125. The math is locked here so
  follow-on chapters can wire `PSMCT16_SWIZZLE` parameter gates
  into the existing address paths without disturbing the legacy
  linear-PSMCT16 TBs (Ch94 / Ch95 / Ch103 / Ch116).
- `tb_gs_swizzle_psmct32.sv` (Ch119) — focused contract for the
  new `gs_swizzle_psmct32_stub` math primitive: a pure-combinational
  module mapping `(FBP, FBW, x, y)` to a VRAM byte address using
  the real PS2 GS PSMCT32 page/block swizzle layout (8 KiB pages,
  4-rows × 8-cols grid of 8×8-pixel blocks per page, blocks ordered per the
  canonical PCSX2 PSMCT32 swizzle table, row-major within a block).
  Verification has five phases: (1) spot-checks on the well-defined
  corners — origin, intra-block walks, first/second block, second
  row of blocks, page-x and page-y boundaries, second page on x,
  and FBP=4 origin; (2) within-block 8×8 walk asserting
  `byte_in_block = yb*32 + xb*4`; (3) **source-table lock** — 32
  hard-coded address checks (one per block in page 0) where the
  expected block index is taken VERBATIM from PCSX2's PSMCT32 block
  table, NOT derived from the in-TB reference function. This proves
  the DUT's `swizzle_psmct32()` table matches the canonical source;
  a copied-wrong table that happened to still be a valid permutation
  of 0..31 would fail this phase, while the bijectivity sweep below
  would pass it; (4) block-swizzle walk (redundant with phase 3,
  cross-checks ref_block_idx against the DUT — the bijectivity
  sweep relies on ref_block_idx being correct); (5) bijectivity
  sweep over the full 64×32 PSMCT32 page — every word slot in
  `[0, 8192)` reached exactly once (catches any swap/typo in the
  swizzle table). Plus a multi-page sanity check at FBW=2 (pixel
  (96, 16) → block (4,2) of page 1 → addr 14336) and a **non-page-
  aligned FBP** phase that drives FBP=1, 2, 3 (mid-page in the 8 KiB
  sense — real PS2 supports any 2048-byte-aligned FBP; our address
  formula is bit-correct for non-page-aligned FBP) plus FBP=3 with
  FBW=2 + intra-block crossing as a stress case. NOTE (now
  historical): at Ch119 this module was standalone math only;
  Ch120 (PCRTC read), Ch121 (image-xfer write), and Ch122
  (raster write) wired it into the three integration points —
  the same shape that Ch125–Ch128 (PSMCT16), Ch131–Ch134
  (PSMT8), and Ch137–Ch140 (PSMT4) followed for the other
  three PSMs.
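
Why the source-table lock (phase 3) is needed in addition to the bijectivity sweep (phase 5) can be shown with a short Python sketch. Assumptions: `BLOCK_TABLE32` is a transcription of the PCSX2 PSMCT32 block table, the within-block layout is the row-major `yb*32 + xb*4` stated above, and all function names are hypothetical stand-ins for the TB tasks.

```python
# Assumed transcription of the PSMCT32 block table, indexed [block_y][block_x].
BLOCK_TABLE32 = [
    [0, 1, 4, 5, 16, 17, 20, 21],
    [2, 3, 6, 7, 18, 19, 22, 23],
    [8, 9, 12, 13, 24, 25, 28, 29],
    [10, 11, 14, 15, 26, 27, 30, 31],
]

def swizzle_psmct32(fbp, fbw, x, y, table=BLOCK_TABLE32):
    """Byte address of a PSMCT32 pixel: 64x32 px pages, 8x8 px blocks."""
    page = (y // 32) * fbw + (x // 64)
    block = table[(y % 32) // 8][(x % 64) // 8]
    return fbp * 2048 + page * 8192 + block * 256 + (y % 8) * 32 + (x % 8) * 4

def is_bijective(table):
    """Phase-5 style sweep: every word slot of one page is reached exactly once."""
    addrs = {swizzle_psmct32(0, 1, x, y, table) for y in range(32) for x in range(64)}
    return addrs == set(range(0, 8192, 4))

# A miscopied table with two entries swapped is still a permutation of 0..31.
miscopied = [row[:] for row in BLOCK_TABLE32]
miscopied[0][0], miscopied[0][1] = miscopied[0][1], miscopied[0][0]

print(is_bijective(BLOCK_TABLE32), is_bijective(miscopied))   # True True
# A literal check at the origin of block (1, 0) separates the two tables.
print(swizzle_psmct32(0, 1, 8, 0), swizzle_psmct32(0, 1, 8, 0, miscopied))  # 256 0
# Multi-page sanity value quoted above: pixel (96, 16) at FBW=2.
print(swizzle_psmct32(0, 2, 96, 16))                          # 14336
```

The sweep accepts both tables, so only the hard-coded literal comparison catches the swap.
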
- `tb_gs_image_xfer_psmt4.sv` (Ch118) — focused contract for
  `gif_image_xfer_stub`'s PSMT4 path (the fourth and final
  supported PSM). PSMT4 packs 0.5 bytes/pixel (4-bit nibble per
  pixel = 2 px/byte), so each 128-bit IMAGE qword carries 32
  pixels in 16 bytes. Each emit is a SUB-BYTE write: `write_be
  = 4'b0001` with a per-emit nibble mask
  (`write_mask = 0x0000_000F` for the LOW nibble,
  `0x0000_00F0` for the HIGH nibble), keyed by `(DSAX+x)[0]`;
  vram_stub's per-bit merge commits exactly the targeted
  nibble, preserving the OTHER nibble of the byte.
  Back-to-back emits to the same byte (e.g. x=0 + x=1 of the
  same row) chain through NBA semantics without bypass logic
  — the same trick the raster channel uses since Ch106. The TB
  is INTENTIONALLY adversarial: VRAM is preloaded with `0xA5`
  across every byte the engine will write (plus boundary
  bytes), then a single IMAGE qword (32 PSMT4 pixels) covers
  the entire 8×4 rect. Every byte ends as
  `{nibble_high_pixel, nibble_low_pixel}` (no trace of 0xA5);
  bytes immediately right of the rect on each row stay 0xA5
  (proves no nibble leak past RRW); bytes before / after the
  destination region also stay 0xA5. Pattern
  `pixel(x,y) = 4'((y*8+x) & 0xF)`. Asserts: 1 trxdir arm, 32
  vram writes, every emit `be=0001` and `mask ∈ {0x0F, 0xF0}`,
  per-byte readback matches, boundary bytes preserved.
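
The per-nibble merge that the adversarial `0xA5` preload exercises can be modelled in a few lines of Python. This is a sketch, not the `vram_stub` code: it assumes the pixel is pre-shifted into the target nibble lane of the low byte of `write_data`, and the helper names are hypothetical.

```python
def nibble_mask(x):
    """write_mask for the pixel at effective x: 0x0F for even x, 0xF0 for odd x."""
    return 0xF0 if (x & 1) else 0x0F

def merge_nibble(old_byte, pixel, x):
    """Per-bit merge of one 4-bit pixel into a byte, preserving the other nibble."""
    mask = nibble_mask(x)
    data = (pixel & 0xF) << (4 if (x & 1) else 0)
    return (old_byte & ~mask & 0xFF) | (data & mask)

byte = 0xA5                              # adversarial preload
byte = merge_nibble(byte, 0x0, 0)        # pixel(0,0) = 0 lands in the low nibble
print(hex(byte))                         # 0xa0: the high nibble of the preload survives
byte = merge_nibble(byte, 0x1, 1)        # pixel(1,0) = 1 lands in the high nibble
print(hex(byte))                         # 0x10: no trace of 0xA5 remains
```
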
- `tb_gs_image_xfer_psmt8.sv` (Ch117) — focused contract for
  `gif_image_xfer_stub`'s PSMT8 path. Pushes 2 IMAGE qwords
  (32 PSMT8 pixels = 16 px/qword × 2) through the engine after
  a TRXDIR-shaped GIF-A+D register sequence with DPSM=PSMT8
  (=0x13). PSMT8 packs 1 byte/pixel (an 8-bit CLUT index), so
  each qword holds 16 pixels; the engine emits one 8-bit pixel
  per cycle with `write_be = 4'b0001`, the index in the LOW
  byte of `write_data`, and `write_mask = 0xFFFFFFFF`;
  vram_stub commits `mem[write_addr] <= write_data[7:0]` at
  any byte alignment. Pattern is `pixel(x,y) = 8'(y*16 + x)` —
  32 distinct values across the 8×4 rect so a wrong-byte-lane
  commit shows up unambiguously. Asserts: 1 trxdir arm, 32
  vram writes (all `be=0001`, `mask=0xFFFFFFFF`), every pixel
  reads back at `dest_base + y*64 + x`, plus right-of-rect /
  before / after byte-zero boundary preservation. Each qword
  packs TWO rows of 8 pixels (lanes 0..7 = row y, lanes 8..15
  = row y+1) — exercises the per-lane row-stride math at the
  qword boundary.
- `tb_gs_image_xfer_psmct16.sv` (Ch116) — focused contract for
  `gif_image_xfer_stub`'s new PSMCT16 path. Pushes 4 IMAGE
  qwords (32 PSMCT16 pixels = 8 px/qword × 4) through the
  engine after a TRXDIR-shaped GIF-A+D register sequence
  (BITBLTBUF/TRXPOS/TRXREG/TRXDIR). PSMCT16 packs 2 bytes/pixel,
  so each qword holds 8 pixels (vs 4 for PSMCT32). The engine
  emits one 16-bit pixel per cycle to vram_stub with
  `write_be = 4'b0011`, the pixel value in the LOW halfword of
  `write_data`, and `write_mask = 0xFFFFFFFF`; vram_stub commits
  the 2 bytes at the 2-byte-aligned destination address. Pattern
  is `pixel(x,y) = 16'h{yyxx}{yyxx}` — distinct per-pixel value
  so a wrong-lane commit shows up unambiguously. Asserts:
  1 trxdir arm, 32 vram writes (all `be=0011`, `mask=0xFFFFFFFF`),
  every pixel reads back at `dest_base + y*row_stride + x*2`,
  and the bytes immediately right of the rect on each row +
  before the dest region + after the dest region all stay zero
  (proves row-stride math + no halfword leak past RRW). PSMT8
  image-xfer landed in Ch117 and PSMT4 image-xfer landed in
  Ch118 — see those TB rows for their own per-byte / per-nibble
  contract coverage.
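
The qword counts quoted in the image-xfer rows follow directly from the pixel width of each PSM. A small Python sketch of that arithmetic (the names are hypothetical and only the four PSMs discussed here are covered):

```python
BITS_PER_PIXEL = {"PSMCT32": 32, "PSMCT16": 16, "PSMT8": 8, "PSMT4": 4}

def pixels_per_qword(psm):
    """Pixels carried by one 128-bit IMAGE qword."""
    return 128 // BITS_PER_PIXEL[psm]

def image_qwords(psm, rrw, rrh):
    """IMAGE qwords needed to cover an rrw x rrh rect (ceiling division)."""
    per = pixels_per_qword(psm)
    return (rrw * rrh + per - 1) // per

print(image_qwords("PSMCT16", 8, 4))   # 4 (Ch116)
print(image_qwords("PSMT8", 8, 4))     # 2 (Ch117)
print(image_qwords("PSMT4", 8, 4))     # 1 (Ch118)
print(image_qwords("PSMCT32", 16, 1))  # 4 (Ch110 palette upload)
```
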
- `tb_gs_demo_psmt4_e2e_trxdir.sv` (Ch110) — driver-shaped
  PSMT4 demo with the palette upload now arriving via a real
  TRXDIR/TRXPOS/TRXREG/HWREG image-transfer GIF packet sequence
  instead of TB-direct vram_stub writes. Closes the LAST
  TB-direct path in the e2e demo flow: every byte the GS sees —
  framebuffer pixels AND palette source — now arrives through a
  driver-shaped GIF stream. The DMAC delivers 36 qwords total:
  U1 (PACKED, NREG=4): BITBLTBUF/TRXPOS/TRXREG/TRXDIR — TRXDIR
  arms `gif_image_xfer_stub`. U2 (IMAGE, NLOOP=4): 4 qwords of 4
  PSMCT32 entries each → 16 palette entries written into VRAM at
  DBP*256 by `gif_image_xfer_stub`. Then 4 SPRITE PACKED packets
  + 1 TEX0_1 PACKED packet. PASS criteria add to Ch109's:
  **1 EV_DMA_START / 36 EV_DMA_BEAT / 1 EV_DMA_DONE**, **7
  GIFtag accepts** (U1 + U2 + 4×SPRITE + TEX0), **25 PACKED A+D
  dispatches** (4 TRX-setup + 20 SPRITE + 1 TEX0), **16
  image-xfer VRAM writes** from `gif_image_xfer_stub` (DBP=4,
  DBW=1, DPSM=PSMCT32, DSAX=DSAY=0, RRW=16, RRH=1). The vram_stub
  write port is muxed at TB level: `xfer_busy ? xfer_we :
  raster_pixel_emit` (sequenced — palette upload completes before
  sprites raster). Ch110 also added a backpressure path on
  `gif_packed_stub` (`image_data_ready` input) so the upstream
  DMA stalls while `gif_image_xfer_stub` is draining the previous
  IMAGE qword's 4 PSMCT32 lanes; outside S_IMAGE the gate is a
  no-op (in_ready stays high). Privileged-block MMIO (PMODE/
  DISPFB1/DISPLAY1) remains TB-direct because those are CPU MMIO
  writes in real hardware, not GIF traffic.
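
The Ch110 PASS counts can be re-derived from the packet list. A minimal Python sketch, assuming each packet is one GIFtag followed by its payload qwords and that SPRITE and TEX0_1 packets have the Ch108 shape (5 and 1 A+D qwords); the tuple layout is purely illustrative:

```python
# (label, format, payload qwords following the GIFtag)
packets = (
    [("U1 TRX setup", "PACKED", 4), ("U2 palette", "IMAGE", 4)]
    + [("SPRITE", "PACKED", 5)] * 4
    + [("TEX0_1", "PACKED", 1)]
)

giftag_accepts = len(packets)
dma_beats = sum(1 + payload for _, _, payload in packets)
ad_dispatches = sum(payload for _, fmt, payload in packets if fmt == "PACKED")
# Each IMAGE qword carries 4 PSMCT32 palette entries.
image_writes = sum(payload * 4 for _, fmt, payload in packets if fmt == "IMAGE")

print(dma_beats, giftag_accepts, ad_dispatches, image_writes)  # 36 7 25 16
```
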
- `tb_gs_demo_psmt4_e2e_dmac.sv` (Ch109) — same 4-quadrant
  PSMT4 demo as Ch108, but the GIFtag + PACKED A+D quadwords
  arrive at `gif_packed_stub` via the DMAC channel-2 →
  `ee_memory_map_stub` → `ee_ram_stub` path instead of being
  TB-driven directly. Closes the last GIF-side sideband from
  Ch108: the demo is now reachable the way real EE/IOP code
  reaches it. The TB pre-stages the same 26 qwords (4 SPRITE
  packets × 6 qwords + 1 TEX0_1 packet × 2 qwords) into RAM at
  PAYLOAD_MADR, then writes DMAC channel-2 MADR/QWC/CHCR; a
  single NORMAL transfer with QWC=26 streams them into the GIF.
  PASS criteria add to Ch108's: **1 EV_DMA_START / 26
  EV_DMA_BEAT / 1 EV_DMA_DONE** (DMA event taxonomy locked),
  with the same downstream chain — 5 GIFtag accepts, 21 A+D
  dispatches in the expected reg-num sequence, 32 PSMT4 emits,
  1 loader_busy rise, identical 16×8 captured frame. Privileged-
  block MMIO and palette pre-stage stay TB-direct (NOT GIF-side);
  TRXDIR/HWREG image-transfer for palette upload was deferred to
  a later chapter (Ch110).
- `tb_gs_demo_psmt4_e2e_packed.sv` (Ch108) — same 4-quadrant
  PSMT4 demo as Ch107 but routed through the GIFtag / PACKED
  A+D front-end (`gif_packed_stub` with REAL_AD_REG_MAP=1).
  Closes the last bit of GS-side sideband from Ch107: instead
  of TB-driving `gs_stub.gif_reg_*` directly, the TB pushes raw
  128-bit GIFtag + PACKED A+D quadwords into `gif_packed_stub.
  in_*` exactly the way the real GIF would receive them from
  PATH3. Each SPRITE is a packet of 1 GIFtag (NLOOP=1, NREG=5,
  PACKED, REGS=0xEEEEE — 5×A+D in the low 5 nibble slots) +
  5 PACKED A+D qwords (PRIM, FRAME_1=PSMT4, RGBAQ, XYZ2, XYZ2);
  TEX0_1 load is its own 1-tag/1-A+D packet. Total: 5 GIFtag
  accepts (4 SPRITEs + 1 TEX0_1) and 4×5 + 1×1 = 21 PACKED A+D
  register-write dispatches into gs_stub.gif_reg_*. 32 PSMT4
  raster emits arrive (Ch106 RMW), loader fires exactly once
  on TEX0_1, and the captured 16×8 frame matches the same
  expected CLUT-decoded RGB as Ch107 — i.e. real-format GIF
  packets reach the GS register file with the same cadence the
  TB previously synthesised by hand. Privileged-block MMIO
  (PMODE/DISPFB1/DISPLAY1) and the palette pre-stage in VRAM
  remain TB-direct because they are NOT GIF-side; the palette
  upload via real-PS2 TRXDIR/TRXPOS/TRXREG/HWREG image-transfer
  packets is a separate future chapter, as is the DMAC channel-2
  burst that would normally deliver the GIFtag qwords (this TB
  drives `gif_packed_stub.in_*` directly to keep the demo
  narrow and deterministic; the full DMAC→RAM→GIF round trip
  is what the integration-tier `tb_ee_core_gif_*` family
  covers).
- `tb_gs_psmt4_round_trip.sv` (Ch104) — full driver-shaped
  PSMT4 + CLD=4 + CSA round trip. Wires `gs_stub` +
  `vram_stub` + `clut_stub` + `clut_loader_stub` + `gs_pcrtc_stub`
  end-to-end with `pcrtc.clut_csa = gs_stub.tex0_1_csa_q` (the
  Ch98 sideband-free pattern). Phase 1: stages a 4×4 PSMT4 sprite
  in VRAM, plus a 16-entry pattern_a palette in VRAM at
  `CBP_A*256`. Drives TEX0_1 with `CBP=4, CPSM=PSMCT32, CSM=CSM2,
  CSA=0, CLD=4`; the loader writes pattern_a into `clut_stub[0..15]`
  and `pcrtc.clut_csa` is 0, so PSMT4 scanout reads pattern_a per
  nibble. Phase 2: stages a different pattern_b palette at
  `CBP_B*256` and drives TEX0_1 with `CBP=8, CSA=4, CLD=4`; the
  loader writes pattern_b into `clut_stub[64..79]` (the CSA=4
  window) and `pcrtc.clut_csa` flips to 4, so the same VRAM
  sprite — same DISPFB1 / DISPLAY1 / PMODE — now reads pattern_b.
  Proves loader policy + clut_stub contents + PCRTC lookup are
  wired consistently.
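
As a rough illustration of the packet shape the Ch108/Ch109 testbenches feed into `gif_packed_stub`, the sketch below packs a 128-bit GIFtag using the standard GS field layout (NLOOP in bits 14:0, EOP in bit 15, FLG in bits 59:58, NREG in bits 63:60, REGS in bits 127:64) and recounts the Ch109 payload. The helper names `make_giftag` and `ad_regs` are hypothetical, and the EOP value is an assumption; the TBs' own helpers may differ.

```python
AD_REG = 0xE      # PACKED-mode register descriptor for A+D
FLG_PACKED = 0    # FLG encoding for PACKED mode

def ad_regs(nreg: int) -> int:
    """Fill the low `nreg` nibble slots of REGS with the A+D descriptor."""
    regs = 0
    for slot in range(nreg):
        regs |= AD_REG << (4 * slot)
    return regs

def make_giftag(nloop: int, nreg: int, regs: int, eop: int = 1) -> int:
    """Pack a 128-bit GIFtag; PRE/PRIM are left at 0. EOP=1 is an assumption."""
    assert 0 <= nloop < (1 << 15) and 0 <= nreg < 16
    low = nloop | (eop << 15) | (FLG_PACKED << 58) | (nreg << 60)
    return (regs << 64) | low

sprite_tag = make_giftag(nloop=1, nreg=5, regs=ad_regs(5))  # REGS = 0xEEEEE
tex0_tag = make_giftag(nloop=1, nreg=1, regs=ad_regs(1))

# Qword budget for the Ch109 DMA payload.
SPRITES = 4
total_qwords = SPRITES * (1 + 5) + (1 + 1)   # 4 SPRITE packets + 1 TEX0_1 packet
ad_dispatches = SPRITES * 5 + 1              # PACKED A+D register writes
giftag_accepts = SPRITES + 1
```

The three totals reproduce the counts the PASS criteria quote: 26 qwords (QWC), 21 A+D dispatches, and 5 GIFtag accepts.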

Scope (current, after Ch165):

- **PSMCT32 (DISPFB1.PSM=0), PSMCT16 (PSM=2), PSMT8 (PSM=0x13),
  and PSMT4 (PSM=0x14) honored at BOTH the read and write
  sides** (Ch94 + Ch95 + Ch96 + Ch97 + Ch103 + Ch105 + Ch106).
  PSMCT24/PSMCT16S/PSMZ32/etc. force scanout off and are not
  contract-tested at the raster channel. The write side
  (gs_stub.raster_pixel_emit) emits the four supported PSMs via
  `raster_pixel_be_q` (per-byte gate) and `raster_pixel_mask_q`
  (per-bit merge mask, Ch106): PSMCT32 = be `0xF` / mask
  `0xFFFFFFFF`, PSMCT16 = be `0x3` / mask `0xFFFFFFFF`, PSMT8 =
  be `0x1` / mask `0xFFFFFFFF`, PSMT4 = be `0x1` / mask `0x0F`
  or `0xF0`. The mask path is no-op for byte-or-larger PSMs
  (mem[i] = data[i] when mask_i = 0xFF) and only meaningful for
  PSMT4 sub-byte writes. PSMT8 / PSMT4 scanout surfaces the
  index/nibble as grayscale by default; with `clut_enable=1`
  (Ch97/Ch103) and a programmed `clut_stub`, the index/nibble
  looks up real RGB. CLUT contents come either from a TB-direct
  write OR
  (Ch99..Ch102) from a VRAM→CLUT load triggered by a TEX0_1 GIF
  write with `CSM == 1` (CSM2 linear), `CPSM` ∈ {PSMCT32,
  PSMCT16}, and a CLD value passing the policy: CLD=0 never;
  CLD=1 always (full 256-entry load); CLD=2 if CBP changed since
  last load (full); CLD=3 if CBP/CPSM/CSA any-changed (full);
  CLD=4 always but only the 16-entry CSA window at indices
  `CSA*16 + i` (Ch102 — preserves the other 240 entries);
  CLD ∈ {5..7} silently no-op (reserved). `clut_loader_stub`
  walks the entries via `vram_stub`'s second read port; PSMCT16
  entries are unpacked with the same 5→8 bit-replicate the
  scanout side uses (Ch94). CSM1 swizzle and CPSM ∉ {PSMCT32,
  PSMCT16} remain deferred.
- **Single CRTC, single DISPFB**. Real PS2 has two interlace-
  capable CRTCs (DISPFB1, DISPFB2). One context is enough for
  TBs to verify the round trip; PMODE.EN2 + DISPFB2 + DISPLAY2
  is deferred.
- **Read-side addressing**. Linear by default (legacy formula
  `vram_read_addr = FBP*2048 + (effective_y*FBW*64 + effective_x)
  << bpp_shift`). Four OPTIONAL per-PSM swizzle paths gated by
  parameters on `gs_pcrtc_stub`: `PSMCT32_SWIZZLE=1` (Ch120)
  routes PSMCT32 reads through `gs_swizzle_psmct32_stub`;
  `PSMCT16_SWIZZLE=1` (Ch126) routes PSMCT16 reads through
  `gs_swizzle_psmct16_stub`; `PSMT8_SWIZZLE=1` (Ch132) routes
  PSMT8 reads through `gs_swizzle_psmt8_stub` (Ch131) — FBW must
  be even because PSMT8 pages are 128 px wide and the swizzle
  internally divides FBW by 2; `PSMT4_SWIZZLE=1` (Ch138) routes
  PSMT4 reads through `gs_swizzle_psmt4_stub` (Ch137); FBW must
  be even (same as PSMT8). The four parameters are independent —
  enabling one doesn't affect the others. PSMT4's swizzle module
  also outputs a `nibble_hi` selector that PCRTC uses in place of
  `pixel_index[0]` to pick which nibble of the byte at the
  swizzled address holds this pixel (PSMT4 packs 2 pixels per
  byte and the canonical PCSX2 column table reorders nibbles
  within a block, so the linear formula's `pixel_index[0]`
  selector is no longer correct under the swizzled layout). All
  four swizzle parameter defaults are 0 so all existing PCRTC-
  using TBs see the legacy linear behavior unchanged. The
  PSMT4 image-xfer (Ch139) and raster (Ch140) write-side
  wiring is now live as well, completing the four-PSM × three-
  path swizzle integration. Both driver-shape e2e demos for
  PSMT4 are also live: raster-driven (Ch141) and TRXDIR-driven
  (Ch142). All four common GS PSMs now have BOTH driver-shape
  e2e demos (CT32 Ch123+Ch124, CT16 Ch129+Ch130, T8 Ch135+
  Ch136, T4 Ch141+Ch142) — closing the four-PSM × three-path
  × dual-driver-shape e2e foundation.
- **Parallel to `platform_video_stub`, not a replacement**. We
  did not extend `platform_video_stub` (which would have
  rippled through 6 existing TBs). `gs_pcrtc_stub` is the
  alternative video source for TBs that want VRAM-backed scanout.
  The legacy
  flood-fill module stays as-is.
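
The CLD load policy listed in the scope bullets above can be sketched as a small decision function. This is a minimal model, not the `clut_loader_stub` RTL: the names `LoaderState` and `clut_load_plan` are hypothetical, the `CSM` / `CPSM` gating is omitted, and treating "no previous load" as "changed" is an assumption.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class LoaderState:
    """TEX0_1 CLUT fields seen at the last load (None before any load)."""
    cbp: Optional[int] = None
    cpsm: Optional[int] = None
    csa: Optional[int] = None

def clut_load_plan(cld: int, cbp: int, cpsm: int, csa: int,
                   last: LoaderState) -> Optional[Tuple[int, int]]:
    """Return (first_index, entry_count) to load, or None for no load."""
    if cld == 1:
        return (0, 256)                       # always, full table
    if cld == 2 and cbp != last.cbp:
        return (0, 256)                       # CBP changed, full table
    if cld == 3 and (cbp, cpsm, csa) != (last.cbp, last.cpsm, last.csa):
        return (0, 256)                       # any of CBP/CPSM/CSA changed
    if cld == 4:
        return (csa * 16, 16)                 # always, CSA window only
    return None                               # CLD=0, unchanged 2/3, reserved 5..7
```

With `CSA=4, CLD=4` the plan is `(64, 16)`, which matches the `clut_stub[64..79]` window exercised by `tb_gs_psmt4_round_trip.sv`. A caller would update `last` after each load that actually fires.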

### End-to-end demo manifest (Ch143)

Eight driver-shaped end-to-end byte-accurate demos cover the
four common GS PSMs across both driver shapes (raster-driven
PACKED-SPRITE payload + TRXDIR-driven IMAGE payload). Each demo
runs the same EE-bootlet → DMAC → GIF → GS → vram → swizzled-
PCRTC chain with all three same-PSM swizzle gates parameter-set
to 1; the listed write-side path is load-bearing and the other
write-side path is asserted dormant in the demo flow.

All eight demos emit a 16×8 framebuffer (128 pixels). The raster
column shows `(emits, xfer_writes)`; the TRXDIR column shows
`(xfer_writes, emits)` — in both cases the load-bearing path
fires 128 times and the dormant path is asserted 0.

| PSM     | Raster-driven e2e               | TRXDIR-driven e2e                  |
|---------|---------------------------------|------------------------------------|
| PSMCT32 | Ch123 — `tb_gs_demo_psmct32_swizzle_e2e`         (128, 0) | Ch124 — `tb_gs_demo_psmct32_swizzle_trxdir_e2e` (128, 0) |
| PSMCT16 | Ch129 — `tb_gs_demo_psmct16_swizzle_e2e`         (128, 0) | Ch130 — `tb_gs_demo_psmct16_swizzle_trxdir_e2e` (128, 0) |
| PSMT8   | Ch135 — `tb_gs_demo_psmt8_swizzle_e2e`           (128, 0) | Ch136 — `tb_gs_demo_psmt8_swizzle_trxdir_e2e`   (128, 0) |
| PSMT4   | Ch141 — `tb_gs_demo_psmt4_swizzle_e2e`           (128, 0) | Ch142 — `tb_gs_demo_psmt4_swizzle_trxdir_e2e`   (128, 0) |

For each row both demos use the same per-quadrant pixel pattern
(so the verify side is shared across the row), the same even-DBW
constraint where applicable (PSMT8 / PSMT4: pages are 128 px wide,
so DBW must be even, minimum 2), and verification through the
freed-up `vram_stub` 2nd read port. Ch141 + Ch142 together
close the four-PSM × three-path × dual-driver-shape e2e
foundation that Ch143 manifests and seals.

**Hardware-demo candidates**:

- **PSMCT32 swizzled raster e2e (Ch123)** — simplest direct-
  color path: 4 SPRITE PACKED packets, RGBAQ.{R,G,B,A} mapped
  1:1 to scanout RGB, no CLUT, no nibble RMW. The natural first
  hardware demo because every byte from EE-bootlet through the
  swizzled 16×8 framebuffer to PCRTC RGB is visible without
  any indirection. Build target: `make tb_gs_demo_psmct32_swizzle_e2e`.
- **PSMT4 swizzled TRXDIR e2e (Ch142)** — strongest indexed/
  CLUT-like stress path: U1 PACKED A+D TRX setup + U2 IMAGE
  NLOOP=4 with 32 PSMT4 nibbles per qword, image-xfer engine
  decoding the canonical PCSX2 columnTable4 (which reorders
  nibbles within a block — the linear `pixel_index[0]` rule is
  wrong under swizzle), and per-pixel nibble RMW on vram_stub
  via `write_be=4'b0001 + write_mask ∈ {0x0F, 0xF0}` keyed by
  the swizzle's `nibble_hi`. Exercises the full sub-byte
  pipeline + the canonical-source-locked column table. Build
  target: `make tb_gs_demo_psmt4_swizzle_trxdir_e2e`.
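
The sub-byte write that the PSMT4 path depends on is a per-bit merge under `write_mask`. The sketch below is an illustrative software model only; the function names are hypothetical, and it assumes the write data is already positioned in the target nibble lane.

```python
def masked_byte_write(old: int, data: int, mask: int) -> int:
    """Merge `data` into `old` under a per-bit mask (1 = take the new bit)."""
    return (old & ~mask & 0xFF) | (data & mask)

def psmt4_write(mem: bytearray, addr: int, nibble: int, nibble_hi: bool) -> None:
    """Write one 4-bit PSMT4 pixel without disturbing its neighbour nibble."""
    mask = 0xF0 if nibble_hi else 0x0F
    data = (nibble & 0xF) << (4 if nibble_hi else 0)
    mem[addr] = masked_byte_write(mem[addr], data, mask)

vram = bytearray(4)
psmt4_write(vram, 0, 0xA, nibble_hi=False)   # low nibble
psmt4_write(vram, 0, 0x5, nibble_hi=True)    # high nibble, low nibble preserved
```

For byte-or-larger PSMs the same merge with `mask = 0xFF` reduces to a plain byte write, which is why the mask path is a no-op there.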

### First hardware-targeted top wrapper (Ch146)

Ch146 turns the Ch144 readiness audit + Ch145 BRAM-shrink groundwork
into a real top-level SystemVerilog module: [`rtl/top/top_psmct32_raster_demo.sv`](../../rtl/top/top_psmct32_raster_demo.sv).
This is the module a board-level synthesis project would target
first. Board-level concerns (HDMI/VGA PHY, pin constraints, .mem
bake tooling, clock-domain crossings) are deliberately deferred —
Ch146 proves the design can be expressed as a single hardware-
shape module.

**Top ports**:
- `clk` / `rst_n` / `core_go` — clock, active-low synchronous reset,
  start pulse (a board reset-release sequencer can tie `core_go`
  high after `rst_n` deasserts).
- `r/g/b/hsync/vsync/de` — 8-bit RGB scanout from PCRTC.
- `core_halt` / `dma_done_seen` / `frame_seen` — debug/status bundle
  suitable for LEDs or a board-level state observer.

**Top parameters**: `H_ACTIVE` (default 16), `V_ACTIVE` (default 8),
`BIOS_SIZE_BYTES`, `RAM_SIZE_BYTES`, `VRAM_BYTES`,
`USEG_SHADOW_WORDS_PARAM` (default 1024 = 4 KiB per Ch145).

**Image fixtures** are passed via macros (iverilog-12 string-
parameter forwarding limitation):
`TOP_PSMCT32_RASTER_DEMO_BIOS_IMAGE_FILE` and
`TOP_PSMCT32_RASTER_DEMO_PAYLOAD_IMAGE_FILE`. The fixtures are
baked by [`sim/data/top_psmct32_raster_demo/bake.py`](../../sim/data/top_psmct32_raster_demo/bake.py)
which writes:
- `bios.mem` — 18-word EE bootlet (one 32-bit hex word per line)
- `payload.mem` — 40 qwords for ee_ram_stub (16 zero qwords +
  24 GIF qwords carrying 4 SPRITE PACKED packets)

The bake script is a deterministic Python rewrite of the
procedural `ee_prog_word()` + `preload_qword()` loops in the
Ch123 TB. Same bit-exact values, just baked into static repo
artifacts so a hardware top can `$readmemh` them.

**Focused TB**: [`sim/tb/top/tb_top_psmct32_raster_demo.sv`](../../sim/tb/top/tb_top_psmct32_raster_demo.sv).
Drives the top with the static fixtures, captures one full
PCRTC frame after the EE halts and DMAC completes, and asserts
the per-quadrant RGB matches the Ch123 frame exactly. Counts:
`raster_emits=128, errors=0, core_halt=1, dma_done_seen=1,
frame_seen=1`.

**Bug-fix iteration**: the first bake had Y in XYZ2 placed at
bits[43:32] instead of bits[31:20] — a Python translation error
of the SystemVerilog `{32'd0, y_int, 4'd0, x_int, 4'd0}`
concatenation. Symptom: per-sprite emit count was 8 instead of
32 (each sprite drew one row), and VRAM held the per-sprite R
component scattered across 32 consecutive 4-byte cells. Caught
by adding a per-emit observer that printed
`(addr, data, be, mask, color_q)` for the first 10 emits.
Fix: `y << 20` instead of `y << 32` in `bake.py`. **PASS after
the fix.**
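
The bit placement behind this fix can be reproduced in a few lines. The sketch below is illustrative and is not the contents of `bake.py`; the function names are hypothetical, and the 12-bit width of `x_int` / `y_int` is inferred from the `[31:20]` field position rather than stated in the source.

```python
def pack_xyz2(x_int: int, y_int: int) -> int:
    """Mirror {32'd0, y_int, 4'd0, x_int, 4'd0}, assuming 12-bit x_int / y_int."""
    assert 0 <= x_int < (1 << 12) and 0 <= y_int < (1 << 12)
    return (y_int << 20) | (x_int << 4)

def pack_xyz2_buggy(x_int: int, y_int: int) -> int:
    """The first-bake mistake: Y lands in bits [43:32]."""
    return (y_int << 32) | (x_int << 4)

def y_field(word: int) -> int:
    """Integer part of Y as read back from bits [31:20]."""
    return (word >> 20) & 0xFFF

good = pack_xyz2(8, 4)
bad = pack_xyz2_buggy(8, 4)
```

With the buggy packing the `[31:20]` field reads 0 for every vertex, which is consistent with the one-row-per-sprite symptom described above.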

**What's still NOT in this chapter** (deferred to Ch147+):
- Real `.mem` bake tooling integration (currently the
  `bake.py` is run manually before sim; a Makefile target or
  CI step that invokes it would belong in Ch147).
- Board-specific top: pin constraints, target FPGA family,
  PHY shim (HDMI/DVI/VGA), reset-release sequencer.
- A multi-PSM top (the Ch142 PSMT4 TRXDIR variant would be a
  natural second wrapper once the build flow is proven).

### Fixture bake flow (Ch147)

Ch147 makes the Ch146 `.mem` bake first-class so the static
fixtures can't drift from `bake.py`. It adds two new Makefile
targets and a new prerequisite on one existing target:

| Target                                  | Purpose                                                               |
|-----------------------------------------|-----------------------------------------------------------------------|
| `top_psmct32_raster_demo_mem`           | Re-runs `bake.py`; produces `bios.mem` + `payload.mem` atomically.    |
| `top_psmct32_raster_demo_mem_check`     | Verifies fixture sizes (bios.mem = 1024 lines, payload.mem = 256).    |
| `tb_top_psmct32_raster_demo` (existing) | Now declares `top_psmct32_raster_demo_mem` as a prerequisite.         |

The bake target uses GNU Make's grouped-target syntax (`&:`,
available since GNU Make 4.3) so a single `bake.py` run produces
both files together; they can never be out of step.

The size-check target counts payload lines (skipping blanks +
`// ...` comment-only lines) and asserts the exact expected
counts. A non-matching count exits with status 1, surfacing a
fixture/script drift as a hard build failure.
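
The counting rule can be expressed as a short script. This is a hedged sketch of the check, not the Makefile recipe itself; `count_data_lines` and `check_fixture` are hypothetical names, and the expected counts are the ones quoted in the table above.

```python
import sys

EXPECTED = {"bios.mem": 1024, "payload.mem": 256}

def count_data_lines(text: str) -> int:
    """Count lines that carry data, skipping blanks and `//` comment-only lines."""
    count = 0
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("//"):
            continue
        count += 1
    return count

def check_fixture(path: str, expected: int) -> bool:
    """Return True when the fixture at `path` holds exactly `expected` data lines."""
    with open(path, encoding="ascii") as handle:
        actual = count_data_lines(handle.read())
    if actual != expected:
        print(f"{path}: expected {expected} data lines, found {actual}",
              file=sys.stderr)
    return actual == expected
```

A wrapper would call `check_fixture` for each entry in `EXPECTED` and exit with status 1 if any call returns `False`, mirroring the hard-failure behaviour described above.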

Deleting the fixtures and running the TB triggers the bake
automatically:
```
$ make tb_top_psmct32_raster_demo
=== bake top_psmct32_raster_demo .mem fixtures ===
python3 .../bake.py
[bake] wrote bios.mem (1024 words, 18 active) and payload.mem (256 qwords, 40 active)
=== build tb_top_psmct32_raster_demo ===
...
[tb_top_psmct32_raster_demo] PASS
```

#### Synthesis-facing macros

When pointing a synthesis tool at `rtl/top/top_psmct32_raster_demo.sv`,
two preprocessor defines must be set so `bios_rom_stub` and
`ee_ram_stub` find their `$readmemh` images. These are macros
(NOT module parameters) per the iverilog-12 string-parameter
forwarding workaround documented in the Ch146 wrapper banner;
they map cleanly to FPGA-tool defines.

| Macro                                              | Value                                                          |
|----------------------------------------------------|----------------------------------------------------------------|
| `TOP_PSMCT32_RASTER_DEMO_BIOS_IMAGE_FILE`          | Absolute (or tool-relative) path to `bios.mem`                 |
| `TOP_PSMCT32_RASTER_DEMO_PAYLOAD_IMAGE_FILE`       | Absolute (or tool-relative) path to `payload.mem`              |

Both default to `""` so the wrapper still elaborates without
fixtures (synthetic NOP-sled in `bios_rom_stub` + zero-init
`ee_ram_stub`, which produces no DMAC payload but a stable
PCRTC frame with `r=g=b=0`).

**Vivado** (preprocessor `verilog_define` on the synthesis +
implementation filesets — these are macros, not module
generics):
```
# NOTE: $path is a placeholder. Tcl does not substitute variables
# inside braces, so write the literal fixture directory here.
set_property verilog_define { \
    TOP_PSMCT32_RASTER_DEMO_BIOS_IMAGE_FILE="$path/bios.mem" \
    TOP_PSMCT32_RASTER_DEMO_PAYLOAD_IMAGE_FILE="$path/payload.mem" \
} [get_filesets sources_1]
```
Repeat for the implementation fileset if it diverges from
`sources_1`.

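**Quartus** (project-level macro defines; keep each assignment on
one line, because the Ch152 compile found that `\` line
continuations in these assignments fail to parse in the QSF):
```
set_global_assignment -name VERILOG_MACRO "TOP_PSMCT32_RASTER_DEMO_BIOS_IMAGE_FILE=\"$path/bios.mem\""
set_global_assignment -name VERILOG_MACRO "TOP_PSMCT32_RASTER_DEMO_PAYLOAD_IMAGE_FILE=\"$path/payload.mem\""
```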

**Iverilog (sim)**: the Ch147 Makefile passes them via `-D`
flags in the `tb_top_psmct32_raster_demo` build rule —
`-DTOP_PSMCT32_RASTER_DEMO_BIOS_IMAGE_FILE='"$(SIM_DIR)/data/...
/bios.mem"'` — and the `top_psmct32_raster_demo_mem`
prerequisite ensures the .mem files exist before the TB
elaborates.

### DE25-Nano synthesis scaffold (Ch148)

Ch148 makes the Ch146 hardware top synthesis-addressable on
DE25-Nano without committing to a video PHY shim or final pin
constraints (those land in Ch149+).

| File / target                                                    | Purpose                                                    |
|------------------------------------------------------------------|------------------------------------------------------------|
| `synth/de25_nano/top_psmct32_raster_demo/files.f`                | RTL filelist — Ch123 dep tree only (~14 entries).          |
| `synth/de25_nano/top_psmct32_raster_demo/README.md`              | Top module + macros + fixtures + DE25-Nano clock/reset/video assumptions. |
| `make top_psmct32_raster_demo_synth_check`                       | Validates files.f paths + fixture presence.                |

The synth-check target depends on `top_psmct32_raster_demo_mem_check`,
so a single command verifies fixture sizes AND that every file
referenced by the synth filelist exists. It exits non-zero on
any miss — surfacing both fixture drift (Ch147 size guard) and
filelist drift as hard build failures.
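
The filelist half of that check amounts to "every entry must resolve to a file". The sketch below assumes a simple `files.f` format (one path per line, `//` or `#` comments, `+`/`-` tool options ignored); both that format and the name `missing_from_filelist` are assumptions, not a description of the actual Makefile recipe.

```python
from pathlib import Path
from typing import List

def missing_from_filelist(filelist: Path, root: Path) -> List[str]:
    """Return filelist entries that do not exist as files under `root`."""
    missing = []
    for raw in filelist.read_text().splitlines():
        entry = raw.split("//", 1)[0].strip()      # drop trailing // comment
        if not entry or entry.startswith(("#", "+", "-")):
            continue                               # blank, comment, or tool option
        if not (root / entry).is_file():
            missing.append(entry)
    return missing
```

A check target would fail the build whenever the returned list is non-empty.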

`.qsf` (Quartus pin assignments) is **not** committed in Ch148.
The README documents the board assumptions (clock domain,
reset polarity, `core_go` strategy, video-out path candidates,
LED status mapping) so the next chapter can author it without
inventing context. The point of Ch148 is that a Quartus project
import (or Vivado / `verilator --lint-only`) finds every file
the design needs, with the macros documented end-to-end.

### DE25-Nano board wrapper (Ch149)

Ch149 turns the Ch146 board-agnostic top into a real board top
without yet committing to pin assignments or a video PHY. New:

| Artifact                                                  | Purpose                                                                |
|-----------------------------------------------------------|------------------------------------------------------------------------|
| `rtl/top/de25_nano_psmct32_raster_demo_top.sv`            | Board wrapper — DE25-Nano signal names + reset sequencer + LED status. |
| `sim/tb/top/tb_de25_nano_psmct32_raster_demo_top.sv`      | Smoke TB exercising clock/reset/core_go/LED/video pins.                |

**Top ports** (matching the Terasic Golden_top.v conventions
from the DE25-Nano resource CD): `CLOCK0_50` / `CLOCK1_50` /
`CLOCK2_50`, `KEY[1:0]` (active-LOW), `SW[3:0]`, `LED[7:0]`
(active-LOW), and raw `VIDEO_R/G/B/HSYNC/VSYNC/DE` outputs that
a future PHY shim will consume.

**Reset bridge**:
1. `ninit_done` sourced from Terasic's `reset_release` IP under
   `` `ifdef USE_TERASIC_RESET_RELEASE_IP `` (default-off; sim uses
   an inline 16-cycle stub matching the IP's shape).
2. `KEY[0]` + `ninit_done` feed an async-assert/sync-deassert
   2-stage shift register on CLOCK2_50. Mirrors the retroDE_nes
   pattern at `retroDE_nes.sv:170-177`.

**`core_go` sequencer**: 16-cycle delay after `core_rst_n`
deasserts, then a one-cycle `core_go` pulse. Matches the
"recommended hardware path" documented in the Ch148 README and
the level-sensitive `go_i` semantics at `ee_core_stub.sv:812-813`.

**LED status**: the Ch146 wrapper's three sticky status outputs
drive `LED[2:0]` (active-LOW): `LED[0] = ~core_halt`,
`LED[1] = ~dma_done_seen`, `LED[2] = ~frame_seen`. `LED[7:3]`
tied HIGH (OFF).
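
When reading the board, the active-LOW mapping is easy to get backwards. The following sketch computes the expected `LED[7:0]` value from the three sticky flags; `led_bus` is a hypothetical helper used only for illustration.

```python
def led_bus(core_halt: bool, dma_done_seen: bool, frame_seen: bool) -> int:
    """Return the 8-bit active-LOW LED[7:0] value for the three sticky flags."""
    lit = (int(core_halt) << 0) | (int(dma_done_seen) << 1) | (int(frame_seen) << 2)
    return ~lit & 0xFF   # a lit LED reads 0; LED[7:3] stay HIGH (off)
```

For example, a bus value of `0xF8` means all three flags have latched, while `0xFB` means only `frame_seen` has.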

**Smoke TB counts**: `core_go_pulses=1`, all three status LEDs
eventually latch (the actual fall-edge order is `frame_seen`
first, then `core_halt`, then `dma_done_seen` — `frame_seen`
is a "PCRTC alive" indicator that fires on the first empty
frame after reset, well before the bootlet runs), and
`VIDEO_DE` rises inside the active region. Standalone PASS.

`.qsf` (pin assignments), PLL, and video PHY shim remain
deferred (Ch150+). Ch149 makes the design board-shaped, not
yet board-pinned.

### Quartus scaffold for DE25-Nano (Ch150)

Ch150 commits the first real Quartus artifacts for the Ch149
board wrapper — a minimal `.qsf` + `.sdc` pair, deliberately
PHY-light:

| File                                                            | Purpose                                                           |
|-----------------------------------------------------------------|-------------------------------------------------------------------|
| `synth/de25_nano/top_psmct32_raster_demo/de25_nano_psmct32_raster_demo_top.qsf` | Device + family + pin assignments + IO standards + .mem macros + file list. |
| `synth/de25_nano/top_psmct32_raster_demo/de25_nano_psmct32_raster_demo_top.sdc` | CLOCK2_50 50 MHz clock + reset-sync false-path + IO false-paths.  |
| `make top_psmct32_raster_demo_quartus_scaffold_check`           | Validates both files exist + top entity + pins + clock period.    |

**Device** (sourced from `retroDE_splash.qsf`): Agilex 5
`A5EB013BB23BE4SCS`, package `VPBGA`. **Top entity**:
`de25_nano_psmct32_raster_demo_top` (the Ch149 board wrapper —
NOT the inner Ch146 module). **Pin assignments** match the
DE25-Nano board pinout used by `retroDE_splash` and
`retroDE_nes`: `CLOCK2_50` → `PIN_BF23`, `KEY[0]` → `PIN_C8`,
`LED[2:0]` → `PIN_DN22 / PIN_DJ32 / PIN_DF35`. CLOCK0/1_50,
KEY[1], SW[3:0], and LED[7:3] are also assigned (their canonical
pins) so Quartus doesn't flag them as unconstrained inputs/
outputs even though the Ch149 wrapper ties them off.

**SDC** (sourced from `retroDE_splash.sdc`): a single 50 MHz
`create_clock` on CLOCK2_50, the standard reset-sync first-stage
false-path (`set_false_path -to [get_registers -nowarn
{*rst_sync[0]}]`), and IO false paths for `KEY[*]`, `SW[*]`,
`LED[*]` plus the as-yet-unpinned `VIDEO_*` outputs (replaced
by real `set_output_delay` constraints when the PHY shim
lands).

**`.mem` macros** baked into the QSF (project-relative paths):
`TOP_PSMCT32_RASTER_DEMO_BIOS_IMAGE_FILE = sim/data/top_psmct32_raster_demo/bios.mem`
and the matching payload macro. Run `make -C sim
top_psmct32_raster_demo_mem` before launching Quartus.

**`USE_TERASIC_RESET_RELEASE_IP`** is **not** defined in this
QSF — keeping the wrapper self-contained for the first project
import. To wire in Terasic's `reset_release` IP, define the
macro and add the IP file from
`DE25_Nano_ResourceCD/Demonstration/FPGA/Board_Info_RTL/reset_release/`.

**Deferred to Ch151+**: video PHY pins + shim (HDMI ADV7513 +
I²C config FSM, VGA DAC, or PMOD), PLL `.ip` config, LPDDR4 /
SDRAM / HPS / CAM / UART / GPIO assignments. Ch150 makes the
project Quartus-importable, not yet Quartus-buildable for video
output.

### PLL + lock-gated reset (Ch151)

Ch151 adds the most conservative hardware bring-up step before
touching the video PHY: a board-clock PLL on the path between
`CLOCK2_50` and the design clock, with the reset bridge gated
on PLL lock so the design can only leave reset once the PLL is
stable.

| Artifact                                              | Purpose                                                              |
|-------------------------------------------------------|----------------------------------------------------------------------|
| `rtl/top/de25_nano_pll_stub.sv`                       | Sim stub matching the Quartus IOPLL `pll` module signature.          |
| `rtl/top/de25_nano_psmct32_raster_demo_top.sv` (Ch151) | Reworked with PLL instantiation + lock-gated reset bridge + `design_clk` distribution to the Ch146 wrapper and `core_go` sequencer. |
| `tb_de25_nano_psmct32_raster_demo_top` (Ch151 update) | Adds rising-edge timestamps for `pll_locked` / `core_rst_n` / `core_go` and asserts the contract `pll_locked < core_rst_n < core_go`. |

**PLL signature** (matches `retroDE_nes/ip/pll/pll_bb.v` and
`retroDE_splash/ip/sys_pll/sys_pll_bb.v`):
```
module pll (
    input  wire  refclk,
    input  wire  rst,
    output wire  outclk_0,
    output wire  locked
);
```

**Sim stub behavior**: `outclk_0 = refclk` (pass-through, no
multiplication — sim doesn't need a different frequency, and a
pass-through still exercises the lock-gated reset bridge).
`locked` rises after 32 cycles with `rst` low; held LOW while
`rst` is HIGH.

**Reset gating**: the board top's `rst_sync` register
async-asserts on `(ninit_done | ~pll_locked)` — both FPGA init
AND PLL lock must complete before reset can deassert.

**Synth swap**: define `USE_PLL_IP` and add a Quartus IOPLL
`.qip` to the project; the board wrapper's `` `ifdef USE_PLL_IP ``
swaps the stub for the real IP. The QSF documents the swap
mechanism but ships with the IP commented out, keeping the
scaffold self-contained until the PLL chapter (Ch152+) commits
a frequency choice + IP file.

**TB contract** (smoke output): `t_pll/rstn/go=(950000,990000,1330000)`
(timestamps in ps): the PLL locks at 950 ns, reset deasserts
40 ns later (the 2-stage sync register propagation), and
`core_go` fires 340 ns after that (the GO_DELAY=16 wait). Order
assertions catch any future regression of the gating.
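
The ordering contract can be illustrated with a small cycle-level model of the lock-gated reset bridge and the `core_go` sequencer. This is a behavioural sketch, not cycle-exact to the RTL; `first_rises` and its parameters are hypothetical, and the example lock cycle (16 init cycles + 32 lock cycles) is an assumption about how the stubs chain.

```python
def first_rises(init_done_cycle, pll_lock_cycle,
                sync_stages=2, go_delay=16, max_cycles=256):
    """Return {signal: first cycle it is high} for pll_locked, core_rst_n, core_go."""
    sync = [0] * sync_stages
    go_wait = 0
    rises = {}
    for cycle in range(max_cycles):
        ninit_done = cycle < init_done_cycle     # high while FPGA init is in progress
        pll_locked = cycle >= pll_lock_cycle
        if ninit_done or not pll_locked:
            sync = [0] * sync_stages             # asynchronous assert
        else:
            sync = [1] + sync[:-1]               # synchronous deassert
        core_rst_n = sync[-1]
        if core_rst_n:
            core_go = int(go_wait == go_delay)   # one-cycle pulse after the delay
            go_wait += 1
        else:
            core_go, go_wait = 0, 0
        for name, value in (("pll_locked", pll_locked),
                            ("core_rst_n", core_rst_n),
                            ("core_go", core_go)):
            if value and name not in rises:
                rises[name] = cycle
    return rises

rises = first_rises(init_done_cycle=16, pll_lock_cycle=48)
assert rises["pll_locked"] < rises["core_rst_n"] < rises["core_go"]
```

The final assertion is the same `pll_locked < core_rst_n < core_go` ordering the smoke TB checks with timestamps.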

**Deferred to Ch152+**: real PLL output frequency tuning (the
stub passes refclk through; a real build sets `outclk_0` to
whatever the video PHY chapter needs), committing the actual
IOPLL `.ip` file under `synth/de25_nano/.../ip/`, the video
PHY shim itself.

### First Quartus compile + baseline report (Ch152)

Ch152 is the chapter where the toolchain is finally asked the
honest question: "does this DE25-Nano board top synthesize, fit,
and pass static timing analysis?"

**Driver**: [`synth/de25_nano/top_psmct32_raster_demo/build_quartus.sh`](../../synth/de25_nano/top_psmct32_raster_demo/build_quartus.sh)
runs `quartus_syn → quartus_fit → quartus_sta` against the Ch150
QSF + Ch151 PLL stub. `quartus_asm` (bitstream gen) is
deliberately skipped — Ch152 is a compile-and-report smoke,
not a deploy path. `USE_PLL_IP` is left undefined so the Ch151
self-contained PLL stub stays under test (per Codex framing).

**Make targets**:

| Target                          | Action                                                      |
|---------------------------------|-------------------------------------------------------------|
| `make quartus_compile`          | Full syn + fit + sta flow.                                  |
| `make quartus_compile_clean`    | Wipe outputs first, then full flow.                         |
| `make quartus_syn_only`         | Synthesis only (~14 min smoke).                             |
| `make quartus_compile_report`   | Run [`parse_reports.py`](../../synth/de25_nano/top_psmct32_raster_demo/parse_reports.py) on the latest output. |

**Ch152 RTL fixes that landed before synthesis would even
elaborate**:

| Issue                                                                              | Fix                                                                          |
|------------------------------------------------------------------------------------|------------------------------------------------------------------------------|
| QSF line-continuation (`\`) parse error in `set_global_assignment -name VERILOG_MACRO` | Collapsed each assignment onto a single line.                                |
| `vram_stub.mem` 8192-iter init loop exceeded Quartus's 5000-iter synthesizable-loop limit (Error 13356) | Wrapped initial block in `// synthesis translate_off` / `_on` pragmas. Real Altera/Intel BRAM is power-on-zero so the procedural loop is sim-only. |
| `gs_pcrtc_stub` / `gif_image_xfer_stub` / `gs_stub` unconditionally instantiate all four swizzle math primitives even when their gate is 0 | Added `gs_swizzle_psmct16/8/4_stub.sv` to the synth filelist + QSF (iverilog trimmed silently; Quartus errors). |
| `gs_stub.interp_byte` (Ch86 Gouraud TRI math) 64-bit signed divide hits Quartus Pro's lpm_divide LPM_WIDTHN ≤64 limit (Error 272006) | Wrapped divide in `// synthesis translate_off`; default fallback returns 0. The Ch123 SPRITE-only demo doesn't exercise Gouraud TRIs, so this is dead code in the build. A future Gouraud-TRI hardware demo would need a divider redesign sized for Agilex 5. |
| QSF `SDC_FILE` referenced via repo-root-relative path failed when the build script ran Quartus from a per-build work dir (Warning 16124) | Changed to basename-only — works from either the repo root or the work dir (the script symlinks the SDC alongside the QSF). |

**First successful synthesis**: 0 errors, 3 warnings, 14:08
elapsed. 160 RAM segments + 26 DSP elements inferred.

**Fitter result — design too large for the part (the chapter's
honest answer)**:

```
Total dedicated logic registers : 121,176
Total pins                      :      17 / 351      ( 5 %)
Total block memory bits         :  65,536 / 7,331,840 (<1 %)
Total RAM Blocks                :       6 / 358      ( 2 %)
Total DSP Blocks                :      20 / 188      (11 %)
Logic utilization (ALMs needed) : 155,104 / 46,800   (331 %)
```

The design needs **155,104 ALMs vs the part's 46,800 — 3.31×
oversized**. `Error (170011): Design contains 260,263 blocks of
type combinational node. However, the device contains only
93,600 blocks.`
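
The headline ratios follow directly from the fitter numbers quoted above; the short calculation below only restates that arithmetic and introduces no new measurements.

```python
ALMS_NEEDED, ALMS_AVAILABLE = 155_104, 46_800
COMB_NEEDED, COMB_AVAILABLE = 260_263, 93_600

alm_ratio = ALMS_NEEDED / ALMS_AVAILABLE          # ALM oversubscription factor
comb_ratio = COMB_NEEDED / COMB_AVAILABLE         # combinational-node oversubscription
nodes_per_alm = COMB_AVAILABLE / ALMS_AVAILABLE   # device comb nodes per ALM

print(f"ALM demand:       {alm_ratio:.2f}x ({alm_ratio:.0%})")
print(f"Comb-node demand: {comb_ratio:.2f}x")
print(f"Comb nodes / ALM: {nodes_per_alm:.0f}")
```

The device's combinational-node budget is exactly two per ALM, so Error 170011 and the 331 % ALM figure describe the same overflow from two angles.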

**Why so big** (the precise picture, to be drilled into by Ch153+):

The synthesis log reports `Info (22567): extracting RAM` for
**all four** memory identifiers — `ee_ram_stub.mem`,
`bios_rom_stub.mem`, `ee_memory_map_stub.useg_shadow_mem`, and
`vram_stub.mem` — so Quartus *did* recognize each as a memory
structure at syn time. But the fit report shows only **65,536
bits / 6 RAM Blocks** committed (roughly enough for BIOS 4 KB +
EE-RAM 4 KB). Something between syn and fit caused the larger
arrays — most likely `vram_stub.mem` (8 KB) and possibly
`useg_shadow_mem` (4 KB after Ch145's 1024-word shrink) — to
either (a) be replicated into combinational mux/decoder logic
because of their access-port shape, or (b) lose their RAM
attribute during fitter optimization and fall back to
flip-flop implementation. The 121,176 dedicated registers + the
260,263 combinational nodes are consistent with at least
`u_vram` getting massively unrolled.

Ch153's job is to isolate **which array(s)** and **which port
shape(s)** prevent compact block-RAM implementation. The
likely candidates: `vram_stub`'s dual read ports + per-byte
write_be lane (Ch95's per-byte gate may not be RAM-block-
friendly on Agilex 5), and the EE memory map's wide arbitration
into the useg-shadow port. None of this is fixed in Ch152 —
surfacing the gap precisely is the chapter's deliverable.

**Other notable findings** (full list in
[`output_files/build_logs/`](../../synth/de25_nano/top_psmct32_raster_demo/output_files/build_logs/)):
- **Critical Warning 20759**: "Use the Reset Release IP in
  Agilex 5 designs to ensure a successful configuration." This
  is the Ch151 `` `ifdef USE_TERASIC_RESET_RELEASE_IP `` opt-in;
  enabling it (and committing the IP file) is a Ch153+ task.
- **6× Warning 16749**: identifiers used before declaration in
  `dmac_reg_stub`, `gif_packed_stub`, `gs_stub`,
  `gif_image_xfer_stub`. Style/lint warnings, no functional
  impact; clean-up candidate for a future polish chapter.
- **STA never ran** because fit failed.

**What Ch152 leaves for Ch153+**:
- Resource reduction. Most likely candidates: BRAM-infer
  `vram_stub.mem` and `useg_shadow_mem` cleanly (Quartus
  attribute hints / restructure read ports), or shrink the EE
  core's MIPS decode (table-driven vs LUT-driven), or move to
  a larger Agilex 5 part if available.
- Enabling `USE_TERASIC_RESET_RELEASE_IP` and committing the
  Terasic `reset_release` IP file.
- The PHY shim chapter (`VIDEO_*` virtualized → real HDMI
  ADV7513 / VGA / PMOD pins).
- Cleaning up the 6× forward-reference style warnings.

### Memory-shape forensics (Ch153)

Ch153 is a memory-forensics chapter (NOT a rewrite chapter): two
isolated tiny Quartus projects under [`synth/de25_nano/experiments/`](../../synth/de25_nano/experiments/)
target the same Agilex 5 part as the Ch150 board top so resource
deltas are apples-to-apples. The goal is to identify which feature(s)
of `vram_stub`'s shape prevent compact block-RAM implementation and
drive the Ch152 size deficit.

| Experiment            | Memory shape                                                                                  |
|-----------------------|-----------------------------------------------------------------------------------------------|
| `exp_a_bram_friendly` | 2048 × 32-bit, single port, sync read + sync write with byte-WE. Intel-friendly BRAM template. |
| `exp_b_vram_shape`    | 8192 × 8-bit, dual COMBINATIONAL read, byte-WE + per-bit mask RMW. Exact `vram_stub` shape.    |

**The result is decisive**:

| Metric                          | exp_a (BRAM-friendly) | exp_b (vram_stub-shape) |
|---------------------------------|-----------------------|-------------------------|
| Fitter status                   | ✅ **Successful**      | ❌ **Failed**           |
| Logic utilization (ALMs)        | **46** / 46,800 (< 1 %) | (fit failed — placement reports 257,986 combinational nodes vs 93,600 device max) |
| Total dedicated logic registers | **0**                  | **65,536**              |
| Total RAM Blocks                | **4** / 358 (1 %)      | **0** / 358 (0 %)       |
| Total block memory bits         | **65,536** (8 KB)      | **0**                   |

**Interpretation**:
- The Intel-friendly shape maps the same 8 KB to **4 RAM Blocks**
  with **zero combinational logic and zero registers** beyond the
  read-output flop.
- The `vram_stub` shape maps the same 8 KB to **zero RAM Blocks**,
  **65,536 dedicated registers** (one flip-flop per bit), and
  **257,986 combinational nodes** (the 4-byte concatenation
  multiplexers for the dual combinational reads + the per-bit
  mask RMW gates).
- The 257,986 combinational-node figure for a single 8 KB memory
  almost exactly matches the 260,263 combinational-node figure
  Ch152 reported for the **entire top-wrapper design** —
  empirical confirmation that `u_vram` alone accounts for
  essentially all of the Ch152 size deficit.
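
The 4-block figure for `exp_a` is consistent with a simple capacity estimate. A minimal Python sketch, assuming one M20K holds 20,480 bits (20 Kb); `m20k_blocks` is a hypothetical helper, not a Quartus API, and real packing also depends on the width/depth mode the fitter picks:

```python
import math

M20K_BITS = 20 * 1024  # assumed capacity of one M20K block, in bits


def m20k_blocks(depth, width):
    """Capacity-only lower bound on M20K blocks for a depth x width memory."""
    return math.ceil(depth * width / M20K_BITS)


# exp_a: 2048 x 32-bit storage inferred as block RAM.
exp_a_blocks = m20k_blocks(2048, 32)

# exp_b: the same number of bits, but every bit becomes a discrete flip-flop.
exp_b_flops = 8192 * 8

print(exp_a_blocks, exp_b_flops)
```

Both shapes store the same 65,536 bits; only the first lets Quartus pack them into dedicated memory.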

**Which feature is the dominant cost** (four candidates, not yet
separated by the shape diff):

The exp_a vs exp_b diff folds four feature changes together
(byte-addressable storage, combinational reads, dual reads,
per-bit mask RMW). To pin down which feature(s) dominate, a
future chapter could insert intermediate experiments — but the
exp_a result already gives the upper bound on what BRAM-native
inference can buy: ~4 RAM blocks + ~50 ALMs for 8 KB. Anything
that gets `vram_stub` close to that bar wins back the entire
Ch152 fit headroom.

The most likely individual culprit is the **per-bit mask RMW**:
Agilex 5's M20K BRAM has byte-WE primitives but does NOT have
per-bit RMW. Quartus has to materialize the
`(mem & ~mask) | (data & mask)` arithmetic outside the BRAM,
which forces the storage out of BRAM and into per-bit flip-flops.
Combinational reads are the second most likely (BRAMs are
synchronous-read-only on Agilex 5; Quartus has to either insert
a register on the read path or materialize the storage as
discrete flip-flops to feed the comb output).
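
To make the per-bit RMW point concrete, here is a small Python sketch of the two write styles. The function names are illustrative only (not RTL identifiers); the mask expression is the one quoted above.

```python
def masked_rmw(old_word, data, mask):
    """Per-bit mask write: (mem & ~mask) | (data & mask), on a 32-bit word."""
    return (old_word & ~mask & 0xFFFFFFFF) | (data & mask)


def byte_we_write(old_word, data, be):
    """Byte-enable write: each set bit of `be` replaces one whole byte lane."""
    mask = 0
    for lane in range(4):
        if (be >> lane) & 1:
            mask |= 0xFF << (8 * lane)
    return masked_rmw(old_word, data, mask)


def mask_is_byte_we_expressible(mask):
    """True when every byte lane of `mask` is all-ones or all-zeros."""
    return all(((mask >> (8 * lane)) & 0xFF) in (0x00, 0xFF) for lane in range(4))


# A PSMT4 nibble splice needs mask 0x0F, which no byte-enable pattern can express.
print(mask_is_byte_we_expressible(0x0000000F))
print(mask_is_byte_we_expressible(0x0000FFFF))
```

Any mask that fails `mask_is_byte_we_expressible` needs the old data merged outside the RAM primitive, which is the cost the paragraph above describes.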

**Make targets**:

| Target                                | Action                                                       |
|---------------------------------------|--------------------------------------------------------------|
| `make quartus_experiments`            | Compile every `synth/.../experiments/exp_*` project.         |
| `make quartus_experiments_clean`      | Wipe outputs first, then compile.                            |
| `make quartus_experiments_report`     | Side-by-side resource summary (no recompile).                |

**What Ch153 leaves for Ch154+**:
- Refactor `vram_stub` into a BRAM-friendly shape: replace
  combinational reads with sync (registered output) reads,
  replace per-bit mask RMW with byte-WE-only writes (move the
  per-pixel sub-byte merging logic into the writer module —
  most likely `gs_stub.raster_pixel_emit` for the PSMT4 nibble
  case), and switch to 32-bit word-addressable storage with
  byte-WE for the unaligned-byte case.
- Audit `useg_shadow_mem` next — it had `Info (22567): extracting RAM`
  at synthesis but didn't survive to fit. Likely culprits there:
  the `Ch64` / `Ch65` / `Ch70` mirror-write features that turn
  the simple useg-shadow into a multi-port write structure.

### BRAM-friendly vram sibling (Ch154)

Ch154 adds a hardware-friendly sibling of `vram_stub` —
[`rtl/gif_gs/vram_bram_stub.sv`](../../rtl/gif_gs/vram_bram_stub.sv) — that maps cleanly onto Agilex 5
M20K block-RAM. Per Codex's framing, the chapter's blast radius
stays narrow: **add the sibling + prove it works + measure the
BRAM-inference win**. The actual swap of the board top to use
the new module + the writer-side PSMT4 nibble-RMW rework lands
in Ch155+.

**`vram_bram_stub` shape vs `vram_stub`**:

| Feature                    | `vram_stub` (legacy / sim reference) | `vram_bram_stub` (Ch154, hw-friendly) |
|----------------------------|-------------------------------------|----------------------------------------|
| Storage                    | 8192 × 8-bit byte-addressable        | 2048 × 32-bit word-addressable         |
| Reads                      | Combinational; arbitrary alignment    | Synchronous (1-cycle); word-aligned only |
| Read ports                 | 2 (combinational)                     | 2 (sync, true dual-port M20K)          |
| Write granularity          | byte-WE + per-bit `write_mask` RMW    | byte-WE only                           |
| Per-bit mask RMW (Ch106)   | yes — supports PSMT4 nibble splice    | NO — caller must splice on writer side |

**New equivalence TB**: [`tb_vram_bram_stub_equivalence`](../../sim/tb/gif_gs/tb_vram_bram_stub_equivalence.sv).
Drives both DUTs in lockstep with byte-WE-only writes
(`write_mask = 0xFFFFFFFF` on the legacy module so the per-bit
RMW path is a no-op), aligns sample times across the new
module's 1-cycle sync-read latency, and asserts data
equivalence across:
- 32-bit word writes (`be=4'b1111`)
- per-byte-lane writes (`be=4'b0001 / 0010 / 0100 / 1000`)
- per-byte non-wrapping admission near MAX_BASE
- dual-port read agreement

PASS standalone + in the full sim regression.

**Quartus experiment `exp_c_vram_bram_stub`** ([synth/.../experiments/exp_c_vram_bram_stub/](../../synth/de25_nano/experiments/exp_c_vram_bram_stub/))
proves the new module infers BRAM cleanly. Side-by-side with
the Ch153 baselines, all on the same Agilex 5 part:

| Experiment             | Fit       | ALMs | Registers | RAM Blocks | Block memory bits |
|------------------------|-----------|------|-----------|------------|-------------------|
| `exp_a_bram_friendly`  | ✅ Success | **46** / 46,800 | **0** | **4** / 358 | 65,536 |
| `exp_b_vram_shape`     | ❌ Failed  | (261,578 comb nodes vs 93,600 device max) | **65,536** | **0** / 358 | 0 |
| `exp_c_vram_bram_stub` | ✅ Success | **190** / 46,800 | **2** | **8** / 358 | 131,072 |

**Interpretation**:
- `exp_c` lands close to `exp_a`'s ideal (190 vs 46 ALMs; 8 vs
  4 RAM Blocks). The slight overhead vs `exp_a` is the dual
  read port (M20K replicates storage to serve two independent
  read addresses simultaneously, hence 2× block memory bits)
  plus the Ch95 per-byte non-wrapping admission gate inherited
  from `vram_stub`.
- `exp_c` needs only **2** dedicated registers, **16× fewer**
  than the 32 a reset on `read_data` would require, because the
  canonical Quartus inference template demands no reset on the
  BRAM data register.
- vs `exp_b`'s **65,536 registers + 261,578 combinational nodes**,
  swapping `vram_stub` → `vram_bram_stub` recovers essentially
  all of the Ch152 ALM headroom on the vram side. Useg-shadow
  is the next forensic target (likely similar shape).

**Inference template gotcha** (caught + fixed in this chapter):
the first cut of `vram_bram_stub` had a reset on `read_data`
inside the always_ff block AND an in-bounds gate guarding the
`mem` read. Quartus rejected BRAM inference with
`Info (276007): RAM logic ... uninferred due to asynchronous
read logic`. Fix: simplified the read path to the canonical
template (`always_ff @(posedge clk) read_data <= mem[idx];`)
and moved bounds + alignment checks to a parallel `read_valid`
pipeline. After the fix, Quartus reported
`Implemented 64 RAM segments` instead of 0.
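
The shape of that fix can be sketched as a behavioral Python model (not the RTL). The class and method names are hypothetical, and wrapping the index with a modulo is an assumption standing in for address truncation:

```python
class SyncReadRam:
    """Registered read with no reset, plus a separate read_valid pipeline."""

    def __init__(self, words):
        self.mem = [0] * words
        self.read_data = 0        # models the BRAM output register (never reset)
        self.read_valid = False   # models the parallel bounds/alignment pipeline

    def clock(self, read_addr):
        """One rising edge with byte address `read_addr` on the read port."""
        word = read_addr >> 2
        # The memory read itself is unconditional: read_data <= mem[idx].
        # Assumption: out-of-range indices wrap, as a truncated address would.
        self.read_data = self.mem[word % len(self.mem)]
        # Bounds and alignment are checked beside the read, not around it.
        self.read_valid = word < len(self.mem) and (read_addr & 3) == 0


ram = SyncReadRam(2048)
ram.mem[1] = 0x12345678
ram.clock(4)
print(hex(ram.read_data), ram.read_valid)
```

Consumers qualify `read_data` with `read_valid` instead of relying on a gated or reset data register.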

**Ch155+ surface — writer-side normalization for ALL sub-32-bit
PSMs, not just PSMT4**: `vram_bram_stub`'s contract is stricter
than `vram_stub`'s — `write_addr` MUST be word-aligned
(`write_addr[1:0] == 2'b00`), and the byte lane(s) being written
are selected via `write_be` with the payload pre-shifted into
the right byte lane(s) of `write_data[31:0]`. Today's writer-
side RTL emits at sub-word boundaries:
- **PSMCT16** raster + image-xfer write at halfword addresses
  (`write_addr[1] == 1` for the high halfword) with `be=4'b0011`
  or `4'b1100` and the 16-bit payload in `write_data[15:0]`.
- **PSMT8** raster + image-xfer write at byte addresses
  (any `write_addr[1:0]`) with `be=4'b0001` and the 8-bit payload
  in `write_data[7:0]`.
- **PSMT4** raster + image-xfer write at byte addresses with
  `be=4'b0001` + per-bit `write_mask` 0x0F or 0xF0 to splice
  one nibble.
- **PSMCT32** raster + image-xfer write at word addresses with
  `be=4'b1111` + the full 32-bit payload — the ONLY PSM that
  natively matches `vram_bram_stub`'s contract today.

If we swap the board top to `vram_bram_stub` without writer-side
normalization, **every CT16/T8/T4 write with
`write_addr[1:0] != 0` silently drops** because it fails
admission. So Ch155 must rework each writer to:
1. Mask `write_addr` down to its word base (`write_addr & ~32'd3`).
2. Shift the payload from its native byte lane into the
   appropriate byte lane(s) of a 32-bit `write_data` based on
   the original `write_addr[1:0]`.
3. Generate `write_be` with bits set only for the byte lanes
   the original sub-word address actually targets.
4. **For PSMT4 specifically**: replace the per-bit `write_mask`
   nibble splice with a writer-side read-modify-write — read
   the existing byte first, splice the new nibble in, then
   issue a normal byte-WE write. Adds ~1 cycle of latency per
   nibble-write but that's well within the 16×8 demo budget.

The rework lands inside `gs_stub.raster_pixel_emit` (Ch95/Ch105/
Ch106 wrote the legacy paths) and `gif_image_xfer_stub`'s per-
PSM dispatch. A focused TB that drives sub-word writes through
the normalizer and asserts the resulting `vram_bram_stub` words
match the legacy `vram_stub` byte-/halfword-/nibble-level
state would be the cleanest proof.
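
A rough Python sketch of that equivalence check, covering steps 1-3 only (no PSMT4 nibble splice). It assumes little-endian byte lanes, legacy byte enables that are relative to the unaligned address, and writes that never cross a word boundary; all function names are illustrative:

```python
def legacy_write(byte_mem, addr, data, be):
    """vram_stub-style write: lane i of `data` lands at byte addr + i."""
    for lane in range(4):
        if (be >> lane) & 1:
            byte_mem[addr + lane] = (data >> (8 * lane)) & 0xFF


def normalize(addr, data, be):
    """Word-align the address, then shift payload and enables by addr[1:0]."""
    shift = addr & 3
    return addr & ~3, (data << (8 * shift)) & 0xFFFFFFFF, (be << shift) & 0xF


def bram_write(word_mem, addr, data, be):
    """vram_bram_stub-style write: word-aligned address, byte-WE only."""
    if addr & 3:
        return  # unaligned writes fail admission and are dropped
    word = word_mem[addr >> 2]
    for lane in range(4):
        if (be >> lane) & 1:
            mask = 0xFF << (8 * lane)
            word = (word & ~mask) | (data & mask)
    word_mem[addr >> 2] = word


def as_bytes(word_mem):
    """Flatten the word memory into little-endian bytes for comparison."""
    return bytearray(b for w in word_mem for b in w.to_bytes(4, "little"))


byte_mem, word_mem = bytearray(16), [0] * 4
writes = [(0x0, 0xAABBCCDD, 0b1111),  # CT32 word
          (0x6, 0x6155, 0b0011),      # CT16 high halfword
          (0x9, 0xA5, 0b0001)]        # T8 byte in lane 1
for addr, data, be in writes:
    legacy_write(byte_mem, addr, data, be)
    bram_write(word_mem, *normalize(addr, data, be))
print(as_bytes(word_mem) == byte_mem)
```

A real TB would drive the same comparison through the RTL; the value of the sketch is only that it states the expected end state unambiguously.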

**Other Ch155+ work**:
- Update scanout / debug TBs that sample VRAM via vram_stub's
  combinational reads to handle the 1-cycle sync-read latency
  (or keep them on `vram_stub` if they're sim-only).
- Swap the Ch146 board top to instantiate `vram_bram_stub`
  AFTER the writer-side normalization lands. Rerun the full
  Quartus compile and expect a dramatic ALM/register reduction.
- Audit `useg_shadow_mem` next — Ch64/Ch65/Ch70 mirror-write
  features may make it multi-port-write-shaped.

### VRAM write normalizer + first BRAM integration (Ch155)

Ch155 lands the writer-side normalization layer that bridges
the contract gap between the legacy `vram_stub` (byte-addressed
sub-word writes + per-bit RMW) and the new `vram_bram_stub`
(word-aligned + byte-WE only). Per Codex's framing the chapter
keeps blast radius narrow: build the normalizer + verify it
standalone for all 4 PSMs + prove the easiest case (PSMCT32)
end-to-end through the new VRAM. RTL plumbing into
`gs_stub.raster_pixel_emit` and `gif_image_xfer_stub` lands in
Ch156+.

| Artifact                                                   | Purpose                                                          |
|------------------------------------------------------------|------------------------------------------------------------------|
| `rtl/gif_gs/vram_normalize_pkg.sv`                         | Pure-comb `normalize_write` function — natural byte address + PSM + payload + (T4-only) old_byte → word-aligned write_addr + shifted write_data + write_be. |
| `tb_vram_normalize_write`                                  | Focused unit TB — 17 cases across CT32 / CT16 / T8 / T4 lanes + misuse detection. |
| `rtl/top/top_psmct32_raster_demo_bram.sv`                  | Sibling of the Ch146 wrapper with `vram_bram_stub` swapped in.   |
| `tb_top_psmct32_raster_demo_bram`                          | Integration TB — drives Ch146 fixtures + verifies VRAM contents at PSMCT32 swizzled addresses via hierarchical probe. |

**Function contract** (`vram_normalize_pkg::normalize_write`):

| PSM       | byte_addr alignment | payload bits used | output `write_be` shape | extras |
|-----------|---------------------|-------------------|-------------------------|--------|
| PSMCT32   | word (`addr[1:0]==0`) | `payload[31:0]` (full ABGR) | `4'b1111` | misuse → drop (`be=0000`) |
| PSMCT16   | halfword (`addr[0]==0`) | `payload[15:0]` (RGB5A1) | `4'b0011` (low) / `4'b1100` (high), keyed on `addr[1]` | misuse → drop |
| PSMT8     | byte (any)           | `payload[7:0]` (index byte)  | one of `4'b0001 / 0010 / 0100 / 1000`, keyed on `addr[1:0]` | — |
| PSMT4     | byte (any)           | `payload[3:0]` (nibble)      | one of `4'b0001 / 0010 / 0100 / 1000`, keyed on `addr[1:0]` | needs `old_byte` + `nibble_hi`; output is the spliced full byte at the addressed lane |
| any other | —                    | —                            | `4'b0000`                | — |

**PSMT4 splice math** (the only PSM whose output depends on
prior memory state): given `nibble_hi=0`, the function returns
`new_byte = {old_byte[7:4], payload[3:0]}` — preserves the
upper nibble, replaces the lower. With `nibble_hi=1`,
`new_byte = {payload[3:0], old_byte[3:0]}`. The CALLER is
responsible for sourcing `old_byte` via a 1-cycle read of
`mem[byte_addr]` upstream of the write; the function itself is
purely combinational. The Ch156+ RTL plumbing chapter is
where that read pipeline lives inside
`gs_stub.raster_pixel_emit` and `gif_image_xfer_stub`.
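
The contract table and splice math can be restated as a Python reference model. This is a sketch, not a transcription of `vram_normalize_pkg`: the PSM codes use the standard GS encodings, and returning the word-aligned address with zero data on a dropped write is an assumption.

```python
from enum import IntEnum


class Psm(IntEnum):
    PSMCT32 = 0x00
    PSMCT16 = 0x02
    PSMT8 = 0x13
    PSMT4 = 0x14


def normalize_write(byte_addr, psm, payload, old_byte=0, nibble_hi=False):
    """Return (write_addr, write_data, write_be) for a word-aligned byte-WE VRAM."""
    word_addr = byte_addr & ~3
    lane = byte_addr & 3
    drop = (word_addr, 0, 0b0000)  # assumed drop encoding: be == 0
    if psm == Psm.PSMCT32:
        if lane != 0:
            return drop  # misuse: CT32 must be word-aligned
        return word_addr, payload & 0xFFFFFFFF, 0b1111
    if psm == Psm.PSMCT16:
        if byte_addr & 1:
            return drop  # misuse: CT16 must be halfword-aligned
        return word_addr, (payload & 0xFFFF) << (8 * lane), 0b0011 << lane
    if psm == Psm.PSMT8:
        return word_addr, (payload & 0xFF) << (8 * lane), 1 << lane
    if psm == Psm.PSMT4:
        nib = payload & 0xF
        if nibble_hi:
            new_byte = (nib << 4) | (old_byte & 0x0F)  # {payload[3:0], old[3:0]}
        else:
            new_byte = (old_byte & 0xF0) | nib         # {old[7:4], payload[3:0]}
        return word_addr, new_byte << (8 * lane), 1 << lane
    return drop  # any other PSM


print(normalize_write(0x6, Psm.PSMCT16, 0x6155))
print(normalize_write(0x5, Psm.PSMT4, 0xA, old_byte=0x57, nibble_hi=True))
```

Only the PSMT4 branch reads `old_byte`, which is why it is the only lane that needs a read ahead of the write.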

**`top_psmct32_raster_demo_bram` integration result**: the new
sibling wrapper substitutes `vram_bram_stub` for `vram_stub`,
drops `write_mask` wiring (CT32's `mask=0xFFFFFFFF` makes the
per-bit RMW path a no-op so dropping it is functionally
equivalent), and accepts the 1-cycle sync-read latency on
PCRTC's `vram_read_data` path (so PCRTC scanout is 1-pixel
shifted; the integration TB skips frame capture and verifies
VRAM content via direct hierarchical probe). All 128 pixel
words at canonical PSMCT32 swizzled addresses match expected
ABGR. Standalone PASS.

**Ch155 critical audit check**: `normalize_write`'s
function-level misuse handling pins the contract — passing an
unaligned `byte_addr` for CT32 OR CT16 returns `write_be=4'b0000`,
which `vram_bram_stub` then drops cleanly. Combined with
Codex's stance that "no sub-32-bit writer is allowed to hand
an unaligned address directly to vram_bram_stub", the Ch156+
plumbing chapter has a hard contract to verify against.

**Ch156+ surface**:
- Insert a 1-cycle byte-read pipeline upstream of the PSMT4
  raster emit + image-xfer paths inside `gs_stub` and
  `gif_image_xfer_stub`. The read returns `old_byte` for
  `normalize_write`'s splice input.
- Apply `normalize_write` to all four PSM emit lanes inside
  both writers.
- Add focused TBs for PSMCT16 / PSMT8 / PSMT4 paths analogous
  to `tb_top_psmct32_raster_demo_bram` — each verifies the
  swizzled VRAM contents under the new normalizer + bram_stub.
- Add a 1-cycle address-stage register inside
  `gs_pcrtc_stub` so scanout consumers see a clean
  combinational-look read (`addr` → `data` with the BRAM's
  internal sync stage hidden).
- Once all four lanes pass, swap the Ch146 board top to use
  `vram_bram_stub` directly (or retire `vram_stub` outright).
- Audit `useg_shadow_mem` next — the Ch64/Ch65/Ch70 mirror-
  write features may make it multi-port-write-shaped, which
  is its own forensic exercise.

### Writer-side normalize plumbing — CT16 + T8 (Ch156)

Ch156 plumbs the Ch155 `vram_normalize_pkg::normalize_write`
function into the BRAM-friendly path so PSMCT16 and PSMT8
raster emits land at the right `vram_bram_stub` byte lane. The
chapter intentionally keeps blast radius narrow — the function
is wired in at the **wrapper site** between the unmodified
writer engines (`gs_stub.raster_pixel_emit`) and
`vram_bram_stub`, so the legacy byte-addressable contract on
gs_stub's raster emit ports stays exactly as Ch128/Ch134 / etc.
defined them. PSMT4 still requires the read-modify-write
pipeline and is deferred to Ch157+.

| File / target                                              | Role                                                                                                  |
| ---------------------------------------------------------- | ----------------------------------------------------------------------------------------------------- |
| `rtl/top/top_psmct32_raster_demo_bram.sv`                  | Wrapper updated: `raster_pixel_psm_q` exposed; `bitbltbuf_q[61:56]` provides the PSM during xfer; the muxed `(byte_addr, psm, payload)` triple is run through `vram_normalize_pkg::normalize_write` and the result feeds `vram_bram_stub`. CT32 path remains a passthrough; CT16/T8 paths now write to the right lane. |
| `tb_gs_raster_bram_psmct16`                                | Focused CT16 integration TB — 16×4 SPRITE at FBP=0/FBW=1, halfword 0x6155. Drives gs_stub#(PSMCT16_SWIZZLE=1) directly; verifies all 64 swizzled halfwords land in `u_vram.mem[byte_addr >> 2]` at the addr[1]-keyed lane; pins the linear-stride separator at byte 0x80 = zero. |
| `tb_gs_raster_bram_psmt8`                                  | Focused PSMT8 integration TB — 16×8 SPRITE at FBP=0/FBW=2, byte index 0xA5. Drives gs_stub#(PSMT8_SWIZZLE=1) directly; verifies all 128 swizzled bytes land in `u_vram.mem[byte_addr >> 2]` at the addr[1:0]-keyed lane. |

**Why wrapper-site, not in-engine**: keeping `gs_stub` and
`gif_image_xfer_stub` byte-addressable preserves the contract
that every Ch128 / Ch134 / Ch140 swizzle TB (and the legacy
`vram_stub`) was written against. Ch156's only structural
change is that a top wrapper which targets `vram_bram_stub`
also runs `normalize_write` between the writer and VRAM. A
future chapter can promote the normalizer into the writer
engines once we've decided to retire `vram_stub`; until then
the function lives where it can be removed without changing
the writers.

**PSMT4 deferral — explicit hard-gate** (Ch156 audit Medium #1
fix; **superseded by Ch157**): when Ch156 closed, the wrapper
masked `write_en` off when the active PSM was PSMT4
(`vram_psmt4_block = (vram_psm_pre == PSM_PSMT4)`,
`vram_we_mux = vram_we_pre && !vram_psmt4_block`). Without that
gate, `normalize_write`'s PSMT4 branch returned a real one-byte
write spliced against `old_byte=0`, silently corrupting VRAM
on any T4 raster emit. The Ch156 focused TB
`tb_gs_raster_bram_psmt4_gate` drove a 16×4 PSMT4 SPRITE
through the wrapper-shape gate and asserted (1) raster_pixel_emit
pulses fired, (2) every pulse hit the gate (`blocked == emit`),
(3) VRAM stayed at sentinel 0xDEADBEEF — zero corruption.
**Ch157 retires both the gate and that TB**: the wrapper now
runs a real RMW pipeline (see "PSMT4 RMW pipeline" section
below) and supplies a live `old_byte` so the splice produces
correct bytes. The retired TB's coverage is replaced by
`tb_gs_raster_bram_psmt4`, which drives the same kind of PSMT4
SPRITE but verifies *correct* nibble splices instead of
*absence* of writes.

**Adversarial coverage on the CT16 / PSMT8 TBs** (Ch156 audit
Medium #2 fix): both TBs originally drove a single uniform
payload across the whole sprite, so a buggy normalizer that
wrote all four byte lanes (or duplicated payload, or stomped
neighboring lanes) could still leave every checked pixel
matching. The TBs now split the image into TWO half-width
SPRITEs with **distinct** payloads:
- `tb_gs_raster_bram_psmct16` drives `(0,0)..(7,3)` with
  halfword 0x6155 (low halfword lane via PSMCT16 swizzle) and
  `(8,0)..(15,3)` with halfword 0x9F8E (high halfword lane of
  the same 32-bit words). Sentinel preload (0xDEADBEEF) on
  every VRAM word before the drive plus a linear-stride
  separator check at byte 0x80 (outside the swizzled set).
- `tb_gs_raster_bram_psmt8` drives `(0,0)..(7,7)` with byte
  0xA5 (lanes {0,1}) and `(8,0)..(15,7)` with byte 0x5A
  (lanes {2,3}). Same sentinel preload.

A normalizer that swaps lanes, sets be too wide, or fails to
preserve the other halfword/byte lane(s) of the shared word
now surfaces as a per-pixel mismatch.
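
Why distinct payloads matter can be shown with a toy Python model. It is a linear-address simplification (the real TBs drive swizzled sprites), and `buggy_t8` is a hypothetical defect invented for illustration:

```python
def write_word(mem, word_idx, data, be):
    """Byte-WE write into a list of 32-bit words."""
    for lane in range(4):
        if (be >> lane) & 1:
            mask = 0xFF << (8 * lane)
            mem[word_idx] = (mem[word_idx] & ~mask) | (data & mask)


def good_t8(addr, byte):
    lane = addr & 3
    return addr >> 2, byte << (8 * lane), 1 << lane


def buggy_t8(addr, byte):
    # Hypothetical bug: payload replicated to every lane and be too wide.
    return addr >> 2, byte * 0x01010101, 0b1111


def uniform(addr):
    return 0xA5


def split(addr):
    # Lanes {0,1} get 0xA5, lanes {2,3} get 0x5A.
    return 0xA5 if (addr & 3) < 2 else 0x5A


def run(normalizer, payload_for):
    mem = [0xDEADBEEF] * 2  # sentinel preload
    for addr in range(8):
        write_word(mem, *normalizer(addr, payload_for(addr)))
    return mem


print(run(good_t8, uniform) == run(buggy_t8, uniform))  # bug hidden
print(run(good_t8, split) == run(buggy_t8, split))      # bug exposed
```

With a uniform payload the over-wide write leaves the same final memory image as the correct one; the split payload makes the stomped lanes visible.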

**Sim regression**: 141 PASS / 0 FAIL after the audit fixes
(140 + the new `tb_gs_raster_bram_psmt4_gate`).

**xfer-side coverage**: `gif_image_xfer_stub` already feeds
the wrapper's pre-normalize mux during `xfer_busy`. CT32
TRXDIR uploads (no Ch156 TB exists yet, but the path is
wired) pass through the normalizer cleanly because xfer
emits CT32 word-aligned. CT16 + T8 xfer TBs that exercise
this path are a follow-on item — the wiring is already in
place; only a focused TB is missing.

**Sim regression (initial Ch156 cut, before the audit fixes)**:
140 PASS / 0 FAIL (138 + 2 new BRAM-integration TBs).

### PSMT4 RMW pipeline — `vram_bram_stub` writes enabled (Ch157)

Ch157 closes the last writer-PSM gap that Ch156 left behind: the
PSMT4 hard-gate is replaced by a wrapper-site read-modify-write
pipeline that supplies a LIVE `old_byte` from VRAM, splices the
new nibble against it, and commits a full-byte write through
`vram_bram_stub`'s byte-WE (no per-bit RMW required). The
nibble splice itself uses the SAME math as `vram_normalize_pkg`'s
PSMT4 branch (`new = nibble_hi ? {nib, old[3:0]} : {old[7:4], nib}`)
but lives **inline in the wrapper**, not inside a call to
`normalize_write` — the function is pure-comb and would have
required `old_byte` to be combinationally available, whereas
`vram_bram_stub`'s registered read port hands the byte back one
cycle later. The CT32/CT16/T8 paths still call `normalize_write`
directly (same-cycle, no read-back required). Goal Codex framed:
"all writer PSMs safe before swapping the board top."

**Pipeline shape** (inside
[`rtl/top/top_psmct32_raster_demo_bram.sv`](../../rtl/top/top_psmct32_raster_demo_bram.sv)):

```
emit cycle N:        is_t4_emit=1; vram_read2_addr = byte_addr & ~3;
                     pipe_q <= (byte_addr, nibble_hi, nibble[3:0]).
posedge → cycle N+1: vram_read2_data = mem[byte_addr >> 2] (sync read);
                     splice new_byte = nibble_hi
                            ? {nibble, old[3:0]}
                            : {old[7:4], nibble};
                     drive vram_we_final=1, write_addr=byte_addr&~3,
                     write_data shifted to byte_addr[1:0] lane,
                     write_be one-hot to that lane.
posedge → cycle N+2: mem[byte_addr] commits new_byte.
```

`old_byte` is sourced from the lane-correct slice of
`vram_read2_data`. CT32/CT16/T8 emits skip the pipe entirely and
fall through `vram_norm` same-cycle (CT32 stays a passthrough,
existing TBs unaffected).

**Forwarding hazard — back-to-back same-byte writes**: a PSMT4
SPRITE rasters adjacent pixels at `x=2k` and `x=2k+1` to the
SAME `byte_addr` (low + high nibble of a single byte). At cycle
N+1 the wrapper reads `mem[byte_addr]` for emit-2 in the SAME
posedge that emit-1's write commits. NBA semantics inside
`vram_bram_stub` (separate `always_ff` blocks for the write port
and the read port) make the read see the PRE-write value, so
emit-2 would splice against stale data. The Ch157 pipe carries
a 1-deep `t4_prev_*` register set (addr + new_byte from the
just-completed RMW) and forwards `t4_prev_new_byte_q` whenever
the in-flight emit's `byte_addr` matches the previous emit's
`byte_addr`. The forwarding chain extends across any number of
back-to-back same-byte emits — emit-N reads emit-(N-1)'s
`new_byte` from the forward register, splices on top, and
emit-(N+1) reads emit-N's new_byte from that same register.
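
The hazard and the forwarding fix can be reproduced in a cycle-level Python sketch. It simplifies VRAM to one byte per address (the real pipe reads a word and slices a lane), and `run_t4_pipe` with its variable names is illustrative, not the wrapper's RTL:

```python
def run_t4_pipe(mem, emits, forwarding=True):
    """Run back-to-back PSMT4 emits (byte_addr, nibble_hi, nibble), one per cycle."""
    pipe = None    # emit captured at the previous edge, now in its splice stage
    read_q = 0     # registered read data belonging to `pipe`
    prev = None    # (byte_addr, new_byte) of the most recently committed RMW
    for emit in list(emits) + [None]:
        write = None
        if pipe is not None:
            addr, hi, nib = pipe
            old = read_q
            if forwarding and prev is not None and prev[0] == addr:
                old = prev[1]  # bypass the stale registered read
            new = ((nib << 4) | (old & 0x0F)) if hi else ((old & 0xF0) | nib)
            write = (addr, new)
        # Rising edge: the read samples pre-write memory, then the write commits.
        if emit is not None:
            read_q = mem[emit[0]]
        if write is not None:
            mem[write[0]] = write[1]
            prev = write
        pipe = emit
    return mem


emits = [(0, False, 0xA), (0, True, 0xA)]  # low then high nibble of byte 0
print(hex(run_t4_pipe(bytearray([0xEF]), emits, forwarding=True)[0]))
print(hex(run_t4_pipe(bytearray([0xEF]), emits, forwarding=False)[0]))
```

Without forwarding, the second emit splices against the pre-write byte and the first emit's nibble is lost.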

| File / target                                              | Role                                                                                                   |
| ---------------------------------------------------------- | ------------------------------------------------------------------------------------------------------ |
| `rtl/top/top_psmct32_raster_demo_bram.sv`                  | Ch156 hard-gate replaced by the RMW pipe + forwarding registers; `vram_read2_addr` driven on T4 emit cycles; `vram_we_final` mux selects T4 pipe write or non-T4 same-cycle path. |
| `tb_gs_raster_bram_psmt4`                                  | New positive-proof TB — drives a 16×4 LINEAR PSMT4 SPRITE (PSMT4_SWIZZLE=0 so adjacent x's hit the same byte) split into two halves with distinct nibbles (0xA / 0x5). 64 raster emits; verifies every byte under the sprite holds the expected pair of spliced nibbles (left half = 0xAA, right half = 0x55) plus sentinel preserved on bytes outside the sprite. **PASS**. |
| `tb_gs_raster_bram_psmt4_gate`                             | Retired — the gate it asserted no longer exists. |

**Why LINEAR PSMT4 in the new TB**: the linear address formula
`(y*FBW*32) + (x>>1)` puts adjacent x's at the same byte, which
is exactly the back-to-back same-byte forwarding hazard. The
swizzled path scatters bytes via `columnTable4`, so it touches
the forwarding logic less often. Linear coverage is strictly
stronger here.
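
A short Python sketch of the linear mapping, using the formula quoted above. `psmt4_linear` is an illustrative name and `fbw=1` is just an example value:

```python
def psmt4_linear(x, y, fbw):
    """Linear PSMT4 mapping: byte = y*FBW*32 + (x >> 1); odd x is the high nibble."""
    byte_addr = y * fbw * 32 + (x >> 1)
    nibble_hi = bool(x & 1)
    return byte_addr, nibble_hi


# Count adjacent-x pairs in one 16-pixel row that land on the same byte.
row = [psmt4_linear(x, 0, fbw=1) for x in range(16)]
same_byte_pairs = sum(1 for a, b in zip(row, row[1:]) if a[0] == b[0])
print(same_byte_pairs)
```

Every even/odd pixel pair in a row shares a byte, so a linear sprite exercises the forwarding path on half of all consecutive emits.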

**Non-T4 TB cleanup**: `tb_gs_raster_bram_psmct16` and
`tb_gs_raster_bram_psmt8` still mirror the *non-T4* portion of
the wrapper-site plumbing, but they no longer carry the Ch156
PSMT4 hard-gate (now removed in the wrapper). Both wire
`raster_pixel_emit` straight to `write_en` and let
`vram_norm` drive addr/data/be — focused TBs verifying their
own PSM lane. Full pipe coverage lives in `tb_gs_raster_bram_psmt4`
and the top wrapper TB.

**Sim regression**: 141 PASS / 0 FAIL after Ch157 (140 + new
`tb_gs_raster_bram_psmt4` − retired `tb_gs_raster_bram_psmt4_gate`).

### PCRTC sync-read alignment (Ch158)

Ch158 closes the last big blocker before swapping the board top
to `vram_bram_stub`: the PCRTC's data-decode + sync-output
pipeline is now aware that `vram_bram_stub`'s `read_data` is
registered with 1-cycle latency, so the captured scanout no
longer trails the address stage by one column.

**`gs_pcrtc_stub` change** (in
[`rtl/gif_gs/gs_pcrtc_stub.sv`](../../rtl/gif_gs/gs_pcrtc_stub.sv)):
new module parameter `VRAM_SYNC_READ` (default 0). When set to 1,
every hcnt/vcnt-derived signal that the data-decode comb consumes
is run through a 1-cycle register before the consumer sees it
(`active_h_dec`, `active_v_dec`, `in_hsync_dec`, `in_vsync_dec`,
`in_display_window_dec`, `scanout_enable_dec`, `dispfb_psm_*_dec`,
`psm4_nibble_select_dec`, `end_of_frame_dec`). The address-side
(`vram_read_addr`) keeps using the current `(hcnt, vcnt)` so the
read is issued one pixel "ahead"; the registered `vram_read_data`
arrives one cycle later, paired with the matching delayed counter
view. Outputs `r/g/b/hsync/vsync/de` come from the `_dec` signals,
so the entire output stream shifts right by exactly one clock
when `VRAM_SYNC_READ=1`. Default `VRAM_SYNC_READ=0` is a pure
passthrough — every existing PCRTC TB written against the legacy
`vram_stub` (comb-read) shape is unaffected.
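
The alignment idea reduces to pairing registered read data with a delayed copy of the counter. A minimal Python sketch for one scanline, assuming a RAM with exactly one cycle of read latency; `scanout` and its variables are illustrative, not PCRTC signal names:

```python
def scanout(line, sync_read_aware):
    """Return {column: pixel} as captured from a 1-cycle-latency RAM."""
    captured = {}
    read_data = None  # registered RAM output
    hcnt_dec = None   # 1-cycle-delayed copy of hcnt
    for hcnt in range(len(line) + 1):
        # Decode stage: pair the current read_data with a column index.
        col = hcnt_dec if sync_read_aware else hcnt
        if read_data is not None and col is not None and col < len(line):
            captured[col] = read_data
        # Rising edge: the RAM registers the word addressed by the current hcnt.
        read_data = line[hcnt] if hcnt < len(line) else None
        hcnt_dec = hcnt
    return captured


line = [10, 20, 30, 40]
print(scanout(line, sync_read_aware=True))
print(scanout(line, sync_read_aware=False))
```

With the delayed counter every column receives its own pixel; without it each pixel is attributed to the next column and column 0 is never filled.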

**`top_psmct32_raster_demo_bram` change**: instantiates
`gs_pcrtc_stub` with `.VRAM_SYNC_READ(1'b1)`. The wrapper banner
updates to drop the Ch155 caveat about scanout being 1 column
shifted — that caveat is now resolved.

**`tb_top_psmct32_raster_demo_bram` extension**: adds a Phase 2
frame-capture block that arms on the next vsync rising edge
after raster drain, captures one full frame's r/g/b into
`cap_*[v][h]` indexed by a 1-cycle-delayed copy of PCRTC's
address-stage counters (since the registered `de` aligns with
those delayed counters), and asserts each captured pixel's
post-decode r/g/b matches the expected ABGR for its quadrant.
Phase 1 (per-pixel VRAM probe via hierarchical `mem[byte_addr >> 2]`)
is unchanged. **PASS** — 16×8 active region, all 128 pixels
captured + all 128 VRAM words probe-verified, `frame_seen`
latched.

**Open Ch159+ items**:
- xfer-side T4 coverage TB — the Ch157 wrapper handles xfer-side
  T4 emits identically (the mux feeds `vram_psm_pre` from
  `bitbltbuf_q[61:56]` during `xfer_busy`), but no focused TB
  exercises that path yet.
- Swap the Ch146 board top to instantiate `vram_bram_stub` and
  the Ch158 PCRTC-sync mode directly (or retire `vram_stub`
  outright). All four writer PSMs and PCRTC scanout are now
  proven correct against the BRAM-friendly contract; the
  remaining work is the integration commit on the board side.
- Audit `useg_shadow_mem` for the same BRAM-shape forensics that
  Ch153 ran on `vram_stub` (Ch64/Ch65/Ch70 mirror writes may
  make it multi-port-write-shaped).

**Ch158 audit Medium fix — sub-word PSM lane selection**: the
initial Ch158 cut shifted the data-decode pipeline by 1 cycle
to align with `vram_bram_stub`'s registered output, but it
still extracted CT16 / PSMT8 / PSMT4 sub-word values from the
LOW lane of `vram_read_data` (i.e. `[15:0]` halfword and
`[7:0]` byte). That worked for `vram_stub` (byte-addressable;
the read returns 4 bytes starting at `byte_addr` so the
sub-word always lands at the low lane) but NOT for
`vram_bram_stub` (word-addressable; `read_data` is
`mem[byte_addr >> 2]` so the sub-word lives at lane
`byte_addr[1:0]` of the returned word). Codex Ch158 audit
called this out as a blocker for any sub-word PSM scanout
through the BRAM. The fix adds:

- `vram_addr_lane_q` — 1-cycle-delayed copy of
  `vram_read_addr[1:0]`, paralleling the other `_q` decode-
  stage registers added in the original Ch158 cut.
- `data_lane = VRAM_SYNC_READ ? vram_addr_lane_dec : 2'd0` —
  forces the legacy comb-read path to keep using the low lane
  (preserving every existing PCRTC TB's expectation), and
  resolves to the correct byte_addr-keyed lane in sync mode.
- `psm16_pixel = data_lane[1] ? read_data[31:16] : read_data[15:0]`.
- A `vram_byte_lane` mux extracting one of 4 byte lanes for
  PSMT8 (`psm8_idx`) and PSMT4 (`psm4_byte_lane` → nibble
  splice).
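
The lane selection above can be modeled in a few lines of Python. The function names mirror the signals for readability but are otherwise illustrative; `sync_read` stands in for the `VRAM_SYNC_READ` parameter:

```python
def data_lane(byte_addr, sync_read):
    """Lane used by the decode stage: byte_addr[1:0] in sync mode, else lane 0."""
    return (byte_addr & 3) if sync_read else 0


def psm16_pixel(read_data, byte_addr, sync_read):
    """Pick the halfword keyed on lane bit 1."""
    if data_lane(byte_addr, sync_read) & 2:
        return (read_data >> 16) & 0xFFFF
    return read_data & 0xFFFF


def vram_byte_lane(read_data, byte_addr, sync_read):
    """Pick one of the four byte lanes (PSMT8 index or PSMT4 source byte)."""
    return (read_data >> (8 * data_lane(byte_addr, sync_read))) & 0xFF


def psm4_nibble(read_data, byte_addr, nibble_hi, sync_read):
    """Select the high or low nibble of the addressed byte lane."""
    byte = vram_byte_lane(read_data, byte_addr, sync_read)
    return (byte >> 4) if nibble_hi else (byte & 0xF)


word = 0x9F8E6155
print(hex(psm16_pixel(word, 2, sync_read=True)))
print(hex(psm16_pixel(word, 2, sync_read=False)))
```

In legacy mode the lane is forced to 0, which matches a byte-addressable read that already starts at `byte_addr`.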

Two new focused integration TBs prove the fix end-to-end with
adversarial pre-loads:

| TB                                                                                | Coverage                                                                                                       |
| --------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------- |
| [`tb_gs_scanout_bram_psmct16`](../../sim/tb/gif_gs/tb_gs_scanout_bram_psmct16.sv) | 4-pixel CT16 scanout reading mem[0]/mem[1] with FOUR distinct halfwords across both halfword lanes (`byte_addr[1]∈{0,1}`); each pixel's captured 5→8-decoded RGB matches the expected halfword. **PASS** |
| [`tb_gs_scanout_bram_psmt8`](../../sim/tb/gif_gs/tb_gs_scanout_bram_psmt8.sv)     | 4-pixel PSMT8 scanout reading mem[0] with FOUR distinct byte indices, one per byte lane (`byte_addr[1:0] ∈ {0,1,2,3}`); each pixel's grayscale RGB matches the expected byte. **PASS**                  |

Without the fix, both TBs would have failed: the CT16 TB would
emit each word's low halfword for both of its pixels, and the
PSMT8 TB would emit `IDX_0` for all four pixels.

**Sim regression**: 143 PASS / 0 FAIL after Ch158 audit fixes
(141 + 2 new BRAM scanout TBs).

### Board-top swap to BRAM wrapper + Quartus fit recovery (Ch159)

Ch159 commits the integration step that the prior chapters
were building toward: the DE25-Nano board top
([`rtl/top/de25_nano_psmct32_raster_demo_top.sv`](../../rtl/top/de25_nano_psmct32_raster_demo_top.sv))
now instantiates [`top_psmct32_raster_demo_bram`](../../rtl/top/top_psmct32_raster_demo_bram.sv)
instead of the legacy [`top_psmct32_raster_demo`](../../rtl/top/top_psmct32_raster_demo.sv).
External port shape is identical so this is drop-in at the
board level; the BRAM-backed wrapper carries through every
Ch155-Ch158 fix (writer-side normalize + PSMT4 RMW pipe +
PCRTC sync-read alignment + sub-word lane select). The synth
file list ([`synth/de25_nano/top_psmct32_raster_demo/files.f`](../../synth/de25_nano/top_psmct32_raster_demo/files.f))
and Quartus QSF gain `vram_normalize_pkg.sv`, `vram_bram_stub.sv`,
and `top_psmct32_raster_demo_bram.sv`; the legacy `vram_stub` +
legacy top stay in the project for back-compat with sim TBs
that still target them.

**Quartus fit recovery vs Ch152 baseline**: the headline of
this chapter. The Ch152 fit FAILED needing 155k ALMs (331% of
the device's 46,800) because `vram_stub`'s 8 KiB
byte-addressable + per-bit-RMW storage didn't infer as M20K and
landed as a 65,536-flip-flop array, dragging 121k registers and
199k synthesis ALMs along with it. The Ch159 swap turns those
numbers around:

| Metric                             | Ch152 (vram_stub)            | Ch159 (vram_bram_stub)        | Δ                       |
| ---------------------------------- | ---------------------------- | ----------------------------- | ----------------------- |
| Synthesis status                   | Successful                   | Successful                    | —                       |
| Synthesis ALMs estimate            | 199,103 / 46,800 (425%)      | **22,704 / 46,800 (49%)**     | −176,399 (**−88.6%**)   |
| Synthesis registers                | 101,457                      | **36,008**                    | −65,449 (**−64.5%**)    |
| **Fit status**                     | **FAILED** (155k / 331%)     | **Successful** (30,364 / 65%) | ✅ **fits**              |
| Fit registers                      | 121,176                      | **39,085**                    | −82,091 (**−67.7%**)    |
| Fit RAM blocks                     | 6 / 358                      | **14 / 358**                  | +8 (BRAM-inferred VRAM) |
| Fit block memory bits              | 65,536                       | **196,608**                   | +131,072 (data in M20K) |
| Fit DSP blocks                     | 20                           | 18                            | −2                      |
| **STA status**                     | **DID NOT RUN** (fit failed) | **Successful** (12 warnings)  | ✅ STA reachable         |
| STA setup slack worst (CLOCK2_50)  | n/a                          | −6.950 ns                     | timing miss at 50 MHz   |
| Fmax                               | n/a                          | 37.11 MHz                     | (Ch160+ tunes)          |

The eight new RAM blocks are the same `vram_bram_stub`
footprint exp_c proved in Ch154 (8 RAM blocks for the dual-port,
admission-gated 8 KiB shape); the 6 already in the Ch152
baseline came from `bios_rom_stub` + `ee_ram_stub` +
`useg_shadow_mem` correctly inferring as BRAM there. The
register drop (121k → 39k) is essentially the entire VRAM
flip-flop array vanishing.
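
As a cross-check on the utilization and Δ columns above, here is a minimal sketch of the arithmetic. The 46,800-ALM capacity and the counts are taken from the table; the helper names are illustrative and not part of the repo.

```python
# Device ALM capacity as quoted in the tables of this section.
ALM_CAPACITY = 46_800


def utilization_pct(alms_needed, capacity=ALM_CAPACITY):
    """Return ALM demand as a percentage of device capacity."""
    return 100.0 * alms_needed / capacity


def reduction_pct(before, after):
    """Return the relative drop from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before


# Ch152 synthesis estimate vs Ch159 synthesis estimate.
print(f"Ch152 synth: {utilization_pct(199_103):.0f}% of capacity")
print(f"Ch159 synth: {utilization_pct(22_704):.0f}% of capacity")
print(f"Synth ALM reduction: {reduction_pct(199_103, 22_704):.1f}%")
```

Read this way, the Ch152 percentages are utilization figures (demand divided by capacity), which is the same convention the Ch159 column uses.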

**Setup-slack reality check**: STA reports −6.950 ns slack at
the CLOCK2_50 50 MHz constraint (Fmax = 37.11 MHz). The
critical path is somewhere in the Ch123 dep tree's longer
combinational chains (likely the Gouraud divider or one of
the swizzle muxers). That is **NOT** a Ch159 regression: the
miss is newly visible only because STA can now run at all.
Ch160+ owns timing closure (PLL down-clock to ≤30 MHz,
critical-path pipelining, or both).
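
The relation between the constrained period, the worst setup slack, and the reported Fmax can be sketched as below. This is a first-order model that treats `period - slack` as the minimum workable period and ignores clock uncertainty and skew; the function name is illustrative.

```python
def fmax_mhz(period_ns, setup_slack_ns):
    """Estimate Fmax (MHz) from the constrained period and worst setup slack.

    Simplification: (period - slack) is taken as the shortest period the
    worst path could meet; clock uncertainty and skew are not modeled.
    """
    min_period_ns = period_ns - setup_slack_ns
    return 1000.0 / min_period_ns


# Ch159: 20.000 ns constraint (50 MHz) with -6.950 ns worst setup slack.
ch159 = fmax_mhz(period_ns=20.000, setup_slack_ns=-6.950)
print(f"Ch159 Fmax estimate: {ch159:.2f} MHz")
```

With the Ch159 numbers this lands on the 37.11 MHz that STA reports, which is why any down-clock target at or below roughly 37 MHz is a candidate.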

**Snapshots preserved**: the Ch152 baseline reports are saved
under
[`synth/de25_nano/top_psmct32_raster_demo/baseline_ch152/`](../../synth/de25_nano/top_psmct32_raster_demo/baseline_ch152/)
(syn / fit summaries + flow.rpt + parse_report.txt) so future
chapters can diff against them without re-running the failing
Ch152 baseline.

**Sim regression**: 143 PASS / 0 FAIL unchanged. The Ch149
board-wrapper TB exercises the same external behavior with
the new core wrapper inside.

### Down-clock target + first .sof bitstream (Ch160)

Ch160 closes the loop Codex framed at the end of Ch159 — "first
add a down-clock PLL profile so we can get a real bitstream
moving on hardware, then use the successful STA path report to
decide whether to pipeline toward 50 MHz." The chapter is SDC-
and build-flow-only; no RTL changes.

**SDC retarget** ([`synth/de25_nano/top_psmct32_raster_demo/de25_nano_psmct32_raster_demo_top.sdc`](../../synth/de25_nano/top_psmct32_raster_demo/de25_nano_psmct32_raster_demo_top.sdc))
relaxes the CLOCK2_50 period from 20.000 ns (50 MHz) to
33.333 ns (30 MHz). The DE25-Nano's CLOCK2_50 oscillator is
physically still 50 MHz; the SDC tells Quartus to ASSUME a
30 MHz input so the fitter closes timing at the down-clock
target. A real PLL `.ip` that divides 50 → 30 MHz on hardware
is the Ch161+ commit (the QSF's commented-out `QIP_FILE`
swap-point is staged for it). Until then, the .sof produced
under this constraint is structurally clean for 30 MHz
operation; programming it on a board where CLOCK2_50 is still
wired straight through gives an effective 50 MHz chip clock
that may show setup-violating behavior — Ch161 closes that
gap.

**`build_quartus.sh` adds `quartus_asm`** ([`synth/de25_nano/top_psmct32_raster_demo/build_quartus.sh`](../../synth/de25_nano/top_psmct32_raster_demo/build_quartus.sh))
gated on a clean STA, so a `.sof` bitstream is now produced
when the design fits and timing closes. The Make scaffold
check is loosened to accept either the 20.000 ns (legacy
50 MHz) or the 33.333 ns (Ch160 30 MHz down-clock) period.

**Quartus result vs Ch159**:

| Metric                        | Ch159 (50 MHz target)         | Ch160 (30 MHz target)         |
|-------------------------------|-------------------------------|-------------------------------|
| Synth ALMs estimate           | 22,704 / 46,800 (49 %)        | 22,704 / 46,800 (49 %)        |
| Synth registers               | 36,008                        | 36,008                        |
| Fit status                    | Successful                    | Successful                    |
| Fit ALMs                      | 30,364 / 46,800 (65 %)        | 31,056 / 46,800 (66 %)        |
| Fit registers                 | 39,085                        | 37,381                        |
| Fit RAM blocks                | 14 / 358                      | 14 / 358                      |
| **STA setup slack worst**     | **−6.950 ns** (timing miss)   | **+0.805 ns** (closes)        |
| **Fmax (CLOCK2_50)**          | 37.11 MHz                     | 30.74 MHz                     |
| **`quartus_asm`**             | (skipped)                     | **Successful — `.sof` produced** |

The synth-side numbers are identical because no RTL changed;
the differences are entirely in the fitter's placement choices
under the looser timing constraint. Fmax dropped by about 17 %
(37.11 → 30.74 MHz) because Quartus optimizes harder when the
target is tighter; the headline is that **at the 30 MHz target
the design CLOSES** (positive slack on every report) and a
real `.sof` is now generated.
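
The hardware caveat from the SDC retarget can be made concrete with a rough estimate of what the same placement would see if it were actually clocked at 50 MHz. This is a sketch under the assumption that path delays stay as placed; a real 50 MHz STA run would re-place the design and report different numbers. The function name is illustrative.

```python
def slack_at_period(new_period_ns, constrained_period_ns, slack_ns):
    """Re-express a reported setup slack against a different clock period.

    Assumes the worst path's required time scales only with the period,
    i.e. the placement and path delays are unchanged.
    """
    worst_path_ns = constrained_period_ns - slack_ns
    return new_period_ns - worst_path_ns


# Ch160 closes at 33.333 ns with +0.805 ns; what if the chip runs at 20 ns?
estimate = slack_at_period(20.000, 33.333, 0.805)
print(f"Estimated setup slack at 50 MHz: {estimate:.3f} ns")
```

A strongly negative result here is the reason the Ch160 `.sof` should not be trusted on a board where CLOCK2_50 still feeds the design directly.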

**Critical path** (from
[`output_files/de25_nano_psmct32_raster_demo_top.sta.rpt`](../../synth/de25_nano/top_psmct32_raster_demo/output_files/de25_nano_psmct32_raster_demo_top.sta.rpt),
worst-10 paths all in the same module hierarchy):

| Field        | Value                                                                                    |
|--------------|------------------------------------------------------------------------------------------|
| Slack        | +0.805 ns (worst path of 10 with this slack value)                                       |
| From / To    | `u_demo\|u_core\|div_0_rtl_0\|auto_generated\|divider\|divider\|...` (intra-divider register-to-register) |
| Data Delay   | 32.516 ns (out of 33.333 ns period)                                                      |
| Critical net | The EE core's auto-generated 64-bit signed divider (the Ch152-noted Gouraud TRI divider — dead code in the PSMCT32 raster demo because no `RM_TRI` primitive is dispatched). |

**Ch161+ pipelining handoff**: the path Codex's framing asked
us to surface is now visible. Two options:

1. **Pipeline the divider** — re-implement `ee_core`'s 64-bit
   division as an N-cycle multi-cycle path. Quartus's auto-
   generated divider is a single-cycle ripple chain; making it
   2-3 stage pipelined would put Fmax comfortably above 50 MHz.
2. **Strip it from the build** — gate the Gouraud TRI
   divider behind a `STRIP_GOURAUD_TRI` parameter (default
   off), so the PSMCT32 raster demo's hardware build instantiates
   the EE core without it. Quartus removes the entire
   `div_0_rtl_0` block; Fmax should jump dramatically.

Option 2 is the lower-blast-radius hardware bring-up move
(removes ~32 ns of dead-code combinational chain); option 1
is the long-term correct fix once the Gouraud TRI path goes
load-bearing.
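
For option 1, a back-of-envelope estimate shows why two or three stages would be enough. This is an idealized sketch: it assumes the 32.516 ns data delay splits into perfectly balanced stages, and `reg_overhead_ns` is a hypothetical per-stage register cost, not a measured value.

```python
def staged_fmax_mhz(comb_delay_ns, stages, reg_overhead_ns=0.0):
    """Ideal Fmax if a combinational path is cut into equal pipeline stages.

    `reg_overhead_ns` is a hypothetical per-stage cost (clock-to-out plus
    setup); real stages are rarely perfectly balanced.
    """
    if stages < 1:
        raise ValueError("stages must be >= 1")
    per_stage_ns = comb_delay_ns / stages + reg_overhead_ns
    return 1000.0 / per_stage_ns


# 32.516 ns is the data delay of the worst Ch160 path (the divider).
for n in (1, 2, 3):
    print(f"{n} stage(s): {staged_fmax_mhz(32.516, n, reg_overhead_ns=1.0):.1f} MHz")
```

Even with a pessimistic overhead, two stages already clear 50 MHz in this model, but only the divider path is considered; the next-worst path sets the real limit.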

**Snapshots**: Ch159 baseline reports preserved under
[`baseline_ch159/`](../../synth/de25_nano/top_psmct32_raster_demo/baseline_ch159/)
(syn / fit / sta summaries + parse_report).

**Sim regression**: 143 PASS / 0 FAIL unchanged (no RTL
changes). Scaffold check + Ch149 board TB + top BRAM TB all
green under the new SDC.

### Real PLL IP commit — `.sof` actually runs at 30 MHz (Ch161)

Ch161 retires the Ch160 hardware-honesty caveat by committing a
real Quartus IOPLL `.ip` configured for 50 MHz refclk → 30 MHz
outclk_0. The wrapper's `` `ifdef USE_PLL_IP `` (staged in Ch151)
now flips to the IP-generated `pll` module on Quartus builds;
sim TBs continue to use the pass-through `de25_nano_pll_stub`.

**Files committed under
[`synth/de25_nano/top_psmct32_raster_demo/ip/`](../../synth/de25_nano/top_psmct32_raster_demo/ip/)**:

- `pll.ip` — adapted from `retroDE_nes/ip/audio_pll.ip` (single-
  output Agilex 5 IOPLL template), retargeted to 50 MHz refclk
  → 30 MHz outclk_0.
- `pll/pll.qip` + `pll/synth/pll.v` + `pll/pll_bb.v` — Quartus
  IP-generated artifacts (`quartus_ipgenerate de25_nano_psmct32_raster_demo_top --ip_file=ip/pll.ip --generate_ip_file --synthesis=verilog`).
  The generated `pll` module exposes
  `(refclk, rst, outclk_0, locked)` — exactly the Ch151 stub's
  signature, so the `` `ifdef `` swap is drop-in.

**Wiring changes**:

- `de25_nano_psmct32_raster_demo_top.qsf` uncommented the
  `set_global_assignment -name QIP_FILE ip/pll/pll.qip` swap-
  point and added
  `set_global_assignment -name VERILOG_MACRO "USE_PLL_IP=1"` so
  Quartus instantiates the IP `pll` instead of the
  `de25_nano_pll_stub`.
- `de25_nano_psmct32_raster_demo_top.sdc` reverted the Ch160
  CLOCK2_50 period back to 20.000 ns (the physical 50 MHz
  oscillator). The IOPLL's auto-generated SDC inside the .qip
  declares the post-PLL `outclk_0` clock at 30 MHz, so STA
  picks up two domains: `u_pll|iopll_0_refclk` (50 MHz, the
  pin) and `u_pll|iopll_0_outclk0` (30 MHz, the design clock).
- `build_quartus.sh` symlinks the `ip/` dir alongside the
  existing `rtl/` and `sim/` symlinks so the QIP_FILE's
  `ip/pll/pll.qip` path resolves from the work dir.

**Quartus result vs Ch160**:

| Metric                            | Ch160 (SDC profile only)     | Ch161 (real PLL IP)          |
|-----------------------------------|------------------------------|------------------------------|
| Fit ALMs                          | 31,056 / 46,800 (66 %)       | 30,898 / 46,800 (66 %)       |
| Fit registers                     | 37,381                       | 37,352                       |
| **Fit PLLs**                      | **0 / 11**                   | **1 / 11** (real IOPLL)      |
| RAM blocks                        | 14 / 358                     | 14 / 358                     |
| Setup slack worst (design_clk)    | +0.805 ns @ CLOCK2_50         | **+0.565 ns @ u_pll\|iopll_0_outclk0** |
| Fmax (design_clk)                 | 30.74 MHz                    | **30.74 MHz**                |
| `quartus_asm`                     | Successful                   | Successful (`.sof` produced) |

The `+1` PLL block is the real IOPLL on the chip; ALMs go down
slightly because the stub's clock-distribution path no longer
needs ALM glue. STA now reports BOTH clock domains: the refclk
(50 MHz, +19.249 ns slack, trivially met) and the design_clk
(30 MHz post-PLL, +0.565 ns slack, thin but positive). The
`.sof` produced under this configuration **genuinely runs at
30 MHz on the DE25-Nano**: the IOPLL takes the 50 MHz CLOCK2_50
input and synthesizes 30 MHz inside the chip, so the entire
design downstream of `u_pll.outclk_0` operates at the
constrained frequency. (Setup slack landed at +0.914 ns on the
initial Ch161 build; the Ch161 audit's wider reset false-path
nudged the fitter into a slightly different placement, dropping
the worst-case setup slack to +0.565 ns. Recovery analysis on
the rst_sync stages, which had been hiding a real −0.079 ns
violation under the original `*rst_sync[0]` constraint, is gone
from the .sta.summary entirely now that the false-path is
widened to `*rst_sync[*]`.)

**Snapshots**: Ch160 baseline (parse_report + summaries +
`.sof`) preserved under
[`baseline_ch160/`](../../synth/de25_nano/top_psmct32_raster_demo/baseline_ch160/).

**Open Ch162+ items** (Ch161 forward-ref, **superseded by
Ch162 below**):

- ~~Pipeline or strip the EE-core 64-bit Gouraud TRI divider~~ —
  **closed in Ch162** via `STRIP_HW_DIVIDER` (note: the actual
  divider is the Ch43 DIVU divider, not Gouraud TRI; the
  forward-ref's name was loose). The Ch162 strip retired the
  `u_demo|u_core|div_0_rtl_0|...` STA worst path entirely; see
  the Ch162 section below for the new critical path.
- xfer-side T4 coverage TB (open from Ch157+).
- `useg_shadow_mem` BRAM-shape forensics.
- Video PHY shim (HDMI / VGA / PMOD) — `VIDEO_*` pins
  virtualized.

**Sim regression**: 143 PASS / 0 FAIL unchanged. Sim ignores
the `` `ifdef USE_PLL_IP `` (no `+define+USE_PLL_IP` in the
iverilog Makefile) so the stub stays active under sim.

### Strip the EE-core hardware divider (Ch162)

Ch162 takes the lower-blast-radius move from the Ch161 STA
handoff: add a parameter that gates the EE-core's auto-inferred
32-bit hardware divider out of synthesis on the PSMCT32
SPRITE-only hardware build, then re-measure Fmax.

**RTL change** ([rtl/ee/ee_core_stub.sv](../../rtl/ee/ee_core_stub.sv))
gains `parameter bit STRIP_HW_DIVIDER = 1'b0`. Two `/` and `%`
sites tied to the Ch43 DIVU instruction are gated by this
parameter — the writeback (lines ~932-935) and the retire-
trace `arg3` mirror (lines ~1005-1014). Default `0` keeps
DIVU semantics intact for every existing sim TB
(`tb_ee_core_divu_mflo` is the only consumer; it stays at the
default). When the parameter is `1`, the writeback becomes a
no-op (HI/LO unchanged, identical to the divisor==0 case the
spec calls undefined) and the retire-trace `arg3` reports 0.
Quartus then has nothing to infer — the `div_0_rtl_0` block
disappears.
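
The writeback semantics described above can be modeled behaviorally as follows. This is a sketch, not the RTL: it assumes 32-bit operands and the standard MIPS DIVU convention (LO = quotient, HI = remainder), and the function name is illustrative.

```python
MASK32 = 0xFFFF_FFFF


def divu_writeback(hi, lo, rs, rt, strip_hw_divider=False):
    """Return the (HI, LO) pair after a DIVU retires.

    With the strip parameter set, or with a zero divisor, HI/LO are left
    unchanged, mirroring the no-op writeback described in the text.
    """
    rs &= MASK32
    rt &= MASK32
    if strip_hw_divider or rt == 0:
        return hi, lo
    # Standard MIPS DIVU: HI receives the remainder, LO the quotient.
    return rs % rt, rs // rt


print(divu_writeback(0xAAAA, 0xBBBB, 17, 5))
print(divu_writeback(0xAAAA, 0xBBBB, 17, 5, strip_hw_divider=True))
```

The second call shows why the strip is only behavior-neutral for software that never executes DIVU: the stale HI/LO values survive the instruction.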

**Wrapper plumbing**:
[`top_psmct32_raster_demo_bram`](../../rtl/top/top_psmct32_raster_demo_bram.sv)
gains a matching `STRIP_HW_DIVIDER` parameter and forwards it
to `ee_core_stub`. The
[DE25-Nano board top](../../rtl/top/de25_nano_psmct32_raster_demo_top.sv)
sets `.STRIP_HW_DIVIDER(1'b1)` on its `u_demo` instantiation
(the bootlet doesn't execute DIVU, so this is behavior-neutral
for the demo). Sim TBs that instantiate the BRAM wrapper
directly use the default 0.

**Quartus result vs Ch161 (real-PLL baseline)**:

| Metric                            | Ch161 (real PLL)              | Ch162 (real PLL + strip)      |
|-----------------------------------|-------------------------------|-------------------------------|
| Fit ALMs                          | 30,898 / 46,800 (66 %)        | 30,006 / 46,800 (64 %)        |
| Fit registers                     | 37,352                        | 36,618                        |
| Fit PLLs                          | 1                             | 1                             |
| RAM blocks                        | 14                            | 14                            |
| **Setup slack worst (design)**    | +0.565 ns                     | **+3.567 ns**                 |
| **Fmax (design domain)**          | 30.74 MHz                     | **33.6 MHz** (+9.4 %)         |
| `quartus_asm`                     | Successful                    | Successful (`.sof` produced)  |

Stripping the divider freed 892 ALMs / 734 registers and
yielded ~3 ns of new setup margin. **Fmax climbs from 30.74
MHz to 33.6 MHz**, a real jump, but **not enough to clear the
50 MHz target** (50 MHz is +67 % over the 30 MHz design clock
and still +49 % above the new Fmax). Codex's
Ch162 framing predicted this branch: "if Fmax jumps, we have a
clean path to a 50 MHz demo bitstream; if not, the next real
critical path will reveal itself." We landed in the second
branch: Fmax jumped, but not far enough.

**New critical path** (the Ch163+ handoff, from
[`output_files/de25_nano_psmct32_raster_demo_top.sta.rpt`](../../synth/de25_nano/top_psmct32_raster_demo/output_files/de25_nano_psmct32_raster_demo_top.sta.rpt)):

| Field      | Value                                                                                                             |
|------------|-------------------------------------------------------------------------------------------------------------------|
| Slack      | +3.567 ns                                                                                                         |
| From       | `u_demo\|u_pcrtc\|div_1_rtl_0\|auto_generated\|divider\|divider\|...` (PCRTC magnification divider)                     |
| To         | `u_demo\|u_vram\|mem_rtl_0\|auto_generated\|altera_syncram_impl1\|ram_block2a15~reg0` (VRAM port input)                |
| Data delay | 38.443 ns of arrival vs 42.010 ns required (period 33.333 ns + clock skew + uncertainty)                          |

The PCRTC divider comes from
[`gs_pcrtc_stub.sv`](../../rtl/gif_gs/gs_pcrtc_stub.sv) lines:
```
assign vram_x_unshift = {20'd0, hwin_rel} / hmag_factor;
assign vram_y_unshift = {20'd0, vwin_rel} / vmag_factor;
```
where `hmag_factor = MAGH + 1` and `vmag_factor = MAGV + 1`.
For the demo `MAGH = MAGV = 0`, so the divisor is constant 1,
but Quartus doesn't constant-propagate through this
formulation and synthesizes a real 32-bit divider anyway. The
fix that parallels Ch162 would be a `STRIP_PCRTC_MAG_DIV`
parameter (or a more general "demo doesn't use magnification"
hint that bypasses the divider when both MAGH and MAGV are
constant 0).

**Snapshots**: Ch161 baseline preserved under
[`baseline_ch161/`](../../synth/de25_nano/top_psmct32_raster_demo/baseline_ch161/)
(syn / fit / sta summaries + parse_report + .sof) for diff.

**Open Ch163+ items**:
- Strip the PCRTC magnification divider on hardware builds
  (next critical path; same shape as Ch162's
  `STRIP_HW_DIVIDER`).
- Once Fmax climbs north of 50 MHz, retune the IOPLL `.ip` to
  outclk_0 = 50 MHz, retarget the SDC, and ship a 50 MHz
  bitstream.
- xfer-side T4 coverage TB (still open from Ch157+).
- `useg_shadow_mem` BRAM-shape forensics.
- Video PHY shim (HDMI / VGA / PMOD) — `VIDEO_*` pins
  virtualized.

**Sim regression**: 143 PASS / 0 FAIL unchanged. Default
`STRIP_HW_DIVIDER=0` preserves DIVU semantics for
`tb_ee_core_divu_mflo`; the board top's `STRIP_HW_DIVIDER=1`
goes through `tb_de25_nano_psmct32_raster_demo_top` cleanly
because the Ch149 board TB doesn't exercise DIVU.

### Strip PCRTC magnification divider + 50 MHz close (Ch163)

Ch163 attacks the next critical path from the Ch162 STA
report (the PCRTC magnification divider) and uses the resulting
Fmax headroom to retune the PLL IP to 50 MHz output, closing
the journey that started at the Ch152 fit failure with a real
50 MHz bitstream.

**RTL change** ([rtl/gif_gs/gs_pcrtc_stub.sv](../../rtl/gif_gs/gs_pcrtc_stub.sv))
gains `parameter bit STRIP_PCRTC_MAG_DIV = 1'b0`. The two `/`
operators are gated:
```
assign vram_x_unshift = STRIP_PCRTC_MAG_DIV
                        ? {20'd0, hwin_rel}
                        : ({20'd0, hwin_rel} / hmag_factor);
assign vram_y_unshift = STRIP_PCRTC_MAG_DIV
                        ? {20'd0, vwin_rel}
                        : ({20'd0, vwin_rel} / vmag_factor);
```
Default `0` keeps the live divider math for every Ch93-era
magnification scanout TB (`tb_gs_scanout_magh_magv` etc.). When
`1`, the math collapses to a passthrough — equivalent to the
MAGH=MAGV=0 case the demo always hits but with no inferred
divider for Quartus to synthesize.
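
The equivalence claim can be checked with a small behavioral model. This is a sketch of the gated expression only; the 12-bit range for the window-relative coordinate is an assumption inferred from the `{20'd0, ...}` zero-extension, and the function name is illustrative.

```python
def vram_unshift(win_rel, mag, strip_pcrtc_mag_div=False):
    """Model of the gated PCRTC un-magnify step.

    `mag` is the raw MAGH/MAGV field; the divisor is mag + 1.
    With the strip parameter set, the coordinate passes straight through.
    """
    if strip_pcrtc_mag_div:
        return win_rel
    return win_rel // (mag + 1)


# For MAGH = MAGV = 0 both settings agree on every coordinate
# (12-bit coordinate range assumed here).
assert all(
    vram_unshift(x, 0) == vram_unshift(x, 0, strip_pcrtc_mag_div=True)
    for x in range(1 << 12)
)

# With any non-zero magnification the stripped build diverges.
print(vram_unshift(10, 1), vram_unshift(10, 1, strip_pcrtc_mag_div=True))
```

The divergence in the last line is the reason the parameter defaults to `0` and is only set on the board top, where the demo never programs a non-zero magnification.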

**Wrapper plumbing**:
[`top_psmct32_raster_demo_bram`](../../rtl/top/top_psmct32_raster_demo_bram.sv)
gains a matching `STRIP_PCRTC_MAG_DIV` parameter that forwards
to `gs_pcrtc_stub`. The
[DE25-Nano board top](../../rtl/top/de25_nano_psmct32_raster_demo_top.sv)
sets `.STRIP_PCRTC_MAG_DIV(1'b1)` on its `u_demo` instantiation.

**Quartus result, two stages**:

*Stage A — strip @ 30 MHz target (still on the Ch161 PLL .ip)*:

| Metric                | Ch162 (strip EE divider only) | Ch163 (strip both, 30 MHz) |
|-----------------------|-------------------------------|----------------------------|
| Fit ALMs              | 30,006 / 46,800 (64 %)        | 27,216 / 46,800 (58 %)     |
| Setup slack worst     | +3.567 ns                     | +21.113 ns                 |
| **Fmax (design)**     | 33.6 MHz                      | **81.83 MHz** (+143 %)     |

The Ch163 strip alone freed +17.5 ns of margin and 2,790 ALMs
— large enough to clear 50 MHz outright. Codex's Ch162 framing
predicted both branches of the if-Fmax-jumps fork; Ch163 lands
in the **first** branch ("clean path to a 50 MHz demo
bitstream").

*Stage B — retune PLL .ip from 30 MHz → 50 MHz output*:

The `pll.ip` source's `gui_output_clock_frequency0` and
`gui_output_clock_frequency_ps0` are bumped (30.0 → 50.0 MHz;
33333.333 → 20000.0 ps). `quartus_ipgenerate` rebuilds the
.qip / synth files in-place. No SDC change needed — CLOCK2_50
stays pinned at the physical 50 MHz period; the IOPLL's auto-
generated SDC declares the new outclk_0 frequency.
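
The two values that get bumped are the same quantity in different units. A minimal sketch of the conversion, using the two field names quoted above as dictionary keys (the helper functions themselves are illustrative and do not edit the `.ip` file):

```python
def period_ps(freq_mhz):
    """Convert a clock frequency in MHz to its period in picoseconds."""
    return 1_000_000.0 / freq_mhz


def pll_output_fields(freq_mhz):
    """Return the frequency/period pair that must be kept consistent."""
    return {
        "gui_output_clock_frequency0": freq_mhz,
        "gui_output_clock_frequency_ps0": round(period_ps(freq_mhz), 3),
    }


print(pll_output_fields(30.0))
print(pll_output_fields(50.0))
```

Keeping both fields derived from one number avoids a mismatch between the MHz and ps entries when the target changes again.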

| Metric                | Ch163 strip @ 30 MHz target | Ch163 strip @ 50 MHz target  |
|-----------------------|-----------------------------|------------------------------|
| Fit ALMs              | 27,216 / 46,800 (58 %)      | 27,543 / 46,800 (59 %)       |
| RAM blocks / PLLs     | 14 / 1                      | 14 / 1                       |
| **Setup slack worst** | +21.113 ns                  | **+7.500 ns**                |
| **Fmax (design)**     | 81.83 MHz                   | **80.0 MHz**                 |
| `.sof` produced       | yes (30 MHz run on hw)      | **yes — 50 MHz on hw**       |

**The .sof produced under Stage B genuinely runs at 50 MHz on
the DE25-Nano**: the IOPLL takes 50 MHz CLOCK2_50 in and
emits 50 MHz outclk_0 (a 1:1 frequency relation, but still
through the real PLL hardware, so the chip's clock distribution
goes through the IOPLL's clock network). All 8 timing classes
positive; no recovery violations; build gate Successful.

**Snapshots**:
- [`baseline_ch162/`](../../synth/de25_nano/top_psmct32_raster_demo/baseline_ch162/)
  — Ch162 30 MHz state with EE divider stripped only.
- [`baseline_ch163_30mhz/`](../../synth/de25_nano/top_psmct32_raster_demo/baseline_ch163_30mhz/)
  — Ch163 strip-both at 30 MHz target (Stage A milestone).

**Open Ch164+ items** (the project has hit the major hardware
milestone Codex called out at Ch157+; Ch164+ is post-launch):
- xfer-side T4 coverage TB (open from Ch157+).
- `useg_shadow_mem` BRAM-shape forensics.
- Video PHY shim (HDMI / VGA / PMOD) — `VIDEO_*` pins still
  virtualized; this is the next big front-end deliverable
  before the demo can paint a real screen.

**Sim regression**: 143 PASS / 0 FAIL unchanged. Default-off
on `STRIP_PCRTC_MAG_DIV` preserves every Ch93 magnification
scanout TB; the board top's `STRIP_PCRTC_MAG_DIV=1` propagates
cleanly through `tb_de25_nano_psmct32_raster_demo_top` since
the demo locks MAGH=MAGV=0.

### HDMI pin shim — pixels off-chip (Ch164)

Ch164 is the first video-PHY chapter — Codex's framing was "small
PHY shim chapter, not a full display-stack leap. Get pixels off-
chip before making them pretty." Replace the abstract
`VIDEO_R/G/B/HSYNC/VSYNC/DE` virtual pins with real DE25-Nano
HDMI transmitter signals; the ADV7513 chip itself stays asleep
(its I²C wake-up FSM is the Ch165+ chapter), so the bitstream
makes the FPGA pins toggle correctly but a real monitor stays
dark until Ch165 lands.

**Wrapper change** ([rtl/top/de25_nano_psmct32_raster_demo_top.sv](../../rtl/top/de25_nano_psmct32_raster_demo_top.sv)):
five new top-level outputs added — `HDMI_TX_CLK` (= `design_clk`,
the 50 MHz pixel clock), `HDMI_TX_D[23:0]` packing
`{VIDEO_R, VIDEO_G, VIDEO_B}` (R in MSBs, ADV7513 default 24-bit
RGB), and `HDMI_TX_HS / HDMI_TX_VS / HDMI_TX_DE` mirroring the
abstract VIDEO_* signals. The VIDEO_* ports are kept on the
wrapper as `VIRTUAL_PIN ON` (the Ch149 board TB references them
via hierarchical probe).
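
The `HDMI_TX_D` packing can be illustrated with a short model. This is a sketch that assumes each of `VIDEO_R`, `VIDEO_G`, and `VIDEO_B` is 8 bits wide (24 bits split three ways); the function names are illustrative.

```python
def pack_hdmi_tx_d(video_r, video_g, video_b):
    """Pack three 8-bit channels as {VIDEO_R, VIDEO_G, VIDEO_B}.

    R lands in bits 23:16, G in 15:8, B in 7:0.
    """
    for name, value in (("video_r", video_r), ("video_g", video_g), ("video_b", video_b)):
        if not 0 <= value <= 0xFF:
            raise ValueError(f"{name} out of 8-bit range: {value}")
    return (video_r << 16) | (video_g << 8) | video_b


def unpack_hdmi_tx_d(word):
    """Inverse of pack_hdmi_tx_d: return (R, G, B)."""
    return (word >> 16) & 0xFF, (word >> 8) & 0xFF, word & 0xFF


print(hex(pack_hdmi_tx_d(0x12, 0x34, 0x56)))
```

A pure-red pixel therefore drives only the top eight data pins, which is a handy sanity check when probing the bus.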

**QSF change** ([synth/.../de25_nano_psmct32_raster_demo_top.qsf](../../synth/de25_nano/top_psmct32_raster_demo/de25_nano_psmct32_raster_demo_top.qsf)):
HDMI pinout sourced from
[`retroDE_nes/retroDE_nes.qsf`](../../../retroDE_nes/retroDE_nes.qsf)
for the same DE25-Nano (Terasic Agilex 5) board — `HDMI_TX_CLK`
on `PIN_DJ24` with 1.1-V IO standard (matches the on-board level
shifter), data + sync pins on 3.3-V LVCMOS. The companion
ADV7513 control pins (`HDMI_I2C_SCL`, `HDMI_I2C_SDA`,
`HDMI_TX_INT`, `HDMI_MCLK`) are intentionally NOT pinned — the
chip stays in standby on power-up and ignores its 24-bit RGB
input until the I²C wake-up FSM lands in Ch165+.

**SDC change** ([synth/.../de25_nano_psmct32_raster_demo_top.sdc](../../synth/de25_nano/top_psmct32_raster_demo/de25_nano_psmct32_raster_demo_top.sdc)):
`set_false_path -to` for each HDMI output port. Proper
`set_output_delay` constraints with respect to a generated
`HDMI_TX_CLK` domain land alongside the Ch165+ wake-up FSM,
when the ADV7513's actual setup/hold window comes out of the
chip's datasheet pass.

**Scaffold-check extension** ([sim/Makefile](../../sim/Makefile)):
`top_psmct32_raster_demo_quartus_scaffold_check` now also
verifies `HDMI_TX_CLK + HDMI_TX_D[0..23] + HS/VS/DE` are
pin-assigned (sentinel set; not exhaustive) — fails the gate
if Quartus would auto-place them on arbitrary package pins.
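
The idea behind the check can be sketched in a few lines. This is not the Makefile's actual implementation (which is not shown here); it assumes the QSF uses the usual `set_location_assignment PIN_x -to <port>` form, and the function name is hypothetical.

```python
import re

# Ports the Ch164 shim adds: clock, three syncs, 24 data lanes.
REQUIRED = ["HDMI_TX_CLK", "HDMI_TX_HS", "HDMI_TX_VS", "HDMI_TX_DE"] + [
    f"HDMI_TX_D[{i}]" for i in range(24)
]

# Matches uncommented location assignments; lines starting with '#' are skipped
# because the pattern is anchored at the start of the line.
LOC_RE = re.compile(
    r"^[ \t]*set_location_assignment[ \t]+(PIN_\w+)[ \t]+-to[ \t]+(\S+)",
    re.MULTILINE,
)


def unassigned_pins(qsf_text, required=REQUIRED):
    """Return the required ports that have no location assignment."""
    assigned = {m.group(2) for m in LOC_RE.finditer(qsf_text)}
    return [port for port in required if port not in assigned]


# Only the HDMI_TX_CLK pin name below comes from the text; the rest is a stub.
demo = "set_location_assignment PIN_DJ24 -to HDMI_TX_CLK\n"
print(len(unassigned_pins(demo)), "ports still unassigned")
```

An empty list means the gate passes; any leftover name is a port Quartus would otherwise be free to auto-place.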

**Quartus result vs Ch163 (50 MHz)**:

| Metric                      | Ch163 (50 MHz, no HDMI pins)  | Ch164 (50 MHz + HDMI pins)    |
|-----------------------------|-------------------------------|-------------------------------|
| Fit ALMs                    | 27,543 / 46,800 (59 %)        | 27,271 / 46,800 (58 %)        |
| Fit RAM / PLL blocks        | 14 / 1                        | 14 / 1                        |
| **Fit pins**                | **17 / 351 (5 %)**            | **45 / 351 (13 %)** (+28 HDMI) |
| Setup slack worst (design)  | +7.500 ns                     | +7.536 ns                     |
| Fmax (design domain)        | 80.0 MHz                      | ~80 MHz (unchanged)           |
| `quartus_asm`               | Successful                    | Successful (`.sof` produced)  |

The +28 pins are exactly the new HDMI shim — 24 RGB lanes, 1
clock, 3 sync (HS / VS / DE). Setup slack stays at ~+7.5 ns
because the new pins are `false_path`'d — STA doesn't time
anything against them yet. ALMs ticked down slightly as the
fitter rebalanced under the wider pin map.

**Snapshot**: Ch163 50 MHz baseline preserved at
[`baseline_ch163_50mhz/`](../../synth/de25_nano/top_psmct32_raster_demo/baseline_ch163_50mhz/)
(syn / fit / sta summaries + parse_report + .sof). The
[`baseline_ch163_30mhz/`](../../synth/de25_nano/top_psmct32_raster_demo/baseline_ch163_30mhz/)
30-MHz milestone is also preserved.

**Open Ch165+ items**:
- **ADV7513 I²C wake-up FSM** — without this the HDMI port
  outputs nothing on a real monitor. Ch165 owns the chip
  bring-up: pin `HDMI_I2C_SCL` / `HDMI_I2C_SDA` /
  `HDMI_TX_INT` / `HDMI_MCLK`, drop in an I²C master that
  walks the canonical ADV7513 register-set (sourced from
  `retroDE_nes`'s working bring-up).
- Proper `set_output_delay` constraints once the ADV7513
  setup/hold window is documented (replacing Ch164's
  `false_path`).
- Make the rendered pattern bigger than Ch123's 16×8 SPRITE so
  there's something visible to admire on a real screen.
- xfer-side T4 coverage TB (still open from Ch157+).
- `useg_shadow_mem` BRAM-shape forensics.

**Sim regression**: 143 PASS / 0 FAIL unchanged — no RTL
changes that touched sim semantics; the new HDMI ports are
combinational mirrors of existing VIDEO_* signals, and
`tb_de25_nano_psmct32_raster_demo_top` references VIDEO_*
unchanged.

### Wake the ADV7513 — first .sof that drives a real HDMI monitor (Ch165)

Ch165 turns "FPGA pins toggling" into "monitor has a fighting
chance of showing the tiny frame" — Codex's framing for the
chapter. The ADV7513 chip stays in standby on power-up; an I²C
master needs to walk a canonical register-write sequence to
configure 24-bit RGB input + sync polarity + power-up + HPD
override before the chip will accept the FPGA's HDMI_TX_*
data and drive the connector.

**Modules ported** (Terasic DE-series reference design, free
use on Terasic hardware per the license that ships with the
DE25-Nano System CD; copyright retained):

- [`rtl/platform/I2C_Controller.v`](../../rtl/platform/I2C_Controller.v)
  — bit-bang I²C master with 23-step transaction layout (start /
  slave-addr / sub-addr / data / stop, ~50 µs per bit at the
  derived 20 kHz I²C clock).
- [`rtl/platform/I2C_HDMI_Config.v`](../../rtl/platform/I2C_HDMI_Config.v)
  — wake-up FSM that walks a 38-entry LUT of ADV7513 register
  writes (slave 0x72): power-up + HPD override + audio init +
  AVI InfoFrame for full-range RGB 444 + dither + clock-divide +
  HDMI mode select. Adapted from the
  `retroDE_splash/rtl/platform/` versions (same DE25-Nano
  board); LUT customizations (HPD override, AVI InfoFrame for
  full-range RGB) carry through.
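
A rough estimate of how long the LUT walk takes follows from the figures quoted in this chapter's TB note (~100 µs controller-clock period, 33 phases per LUT entry, 38 entries). The sketch below just multiplies them out; the parameter names are illustrative, and the collapsed-divider period in the second call is a hypothetical value, not the one the smoke TB uses.

```python
def lut_walk_ms(ctrl_clk_period_us=100.0, phases_per_entry=33, lut_entries=38):
    """Estimate the full wake-up walk duration in milliseconds."""
    return ctrl_clk_period_us * phases_per_entry * lut_entries / 1000.0


production = lut_walk_ms()
print(f"production divider: ~{production:.1f} ms")

# Hypothetical collapsed divider for simulation: 40 ns controller clock.
collapsed = lut_walk_ms(ctrl_clk_period_us=0.04)
print(f"collapsed divider:  ~{collapsed * 1000:.1f} us")
```

The production figure is well beyond a few-millisecond testbench budget, which is the motivation for overriding the divider in a dedicated smoke TB.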

**Wrapper changes** ([rtl/top/de25_nano_psmct32_raster_demo_top.sv](../../rtl/top/de25_nano_psmct32_raster_demo_top.sv)):

- Four new top-level ports: `inout HDMI_I2C_SCL`,
  `inout HDMI_I2C_SDA` (open-drain I²C bus), `input HDMI_TX_INT`
  (chip's HPD / monitor-sense interrupt, active-low), and
  `output HDMI_MCLK` (audio sample-rate reference, driven by
  CLOCK2_50 since the demo is video-only).
- `I2C_HDMI_Config u_hdmi_i2c` instantiated. Clocked on
  `CLOCK2_50` (NOT `design_clk` — the wake-up runs even before
  the PLL locks); reset on `~ninit_done` (raw async reset; the
  I²C bus stays held in a clean state until FPGA init
  completes). Output `READY` (= `hdmi_init_done`) goes high
  after the LUT walk; `HDMI_TX_INT` going low retriggers the
  walk so a late hot-plug after FPGA boot still wakes the chip.
- New status LED: `LED[3] = ~hdmi_init_done` (active-low; lit
  means the chip is configured). `LED[7:4]` are retied HIGH.

**QSF + files.f + sim Makefile**:
[QSF](../../synth/de25_nano/top_psmct32_raster_demo/de25_nano_psmct32_raster_demo_top.qsf)
gains pin assignments for the 4 new control pins (sourced from
`retroDE_nes`: `BT1` / `BW2` / `CF2` / `CF1`) plus IO standards
(3.3-V LVCMOS for everything). The two new platform Verilog
sources are added to the QSF source list, the synth
[files.f](../../synth/de25_nano/top_psmct32_raster_demo/files.f),
and the sim Makefile's `RTL_SRCS`. The
[scaffold-check](../../sim/Makefile)
extends to verify all 4 control pins are pin-assigned + IO
standard'd, alongside the Ch164 24-pin HDMI data set.

**SDC change**
([de25_nano_psmct32_raster_demo_top.sdc](../../synth/de25_nano/top_psmct32_raster_demo/de25_nano_psmct32_raster_demo_top.sdc)):
`set_false_path -to / -from` on the new control pins. The I²C
bus runs at ~20 kHz (50 µs per SCL period) and is inherently
async to the design clock; HDMI_MCLK is driven by CLOCK2_50 and
sampled by the chip's audio PLL. Neither path needs a fabric
timing constraint.

**Quartus result vs Ch164**:

| Metric                  | Ch164 (HDMI data only)        | Ch165 (HDMI data + wake-up)   |
|-------------------------|-------------------------------|-------------------------------|
| Fit ALMs                | 27,271 / 46,800 (58 %)        | 27,374 / 46,800 (58 %)        |
| Fit RAM / PLL blocks    | 14 / 1                        | 14 / 1 (unchanged)            |
| **Fit pins**            | **45 / 351**                  | **49 / 351** (+4 control)     |
| Setup slack worst       | +7.536 ns                     | +7.198 ns                     |
| `quartus_asm`           | Successful                    | Successful (`.sof` produced)  |

The +103 ALMs are the I²C controller's bit-bang state machine
and the 38-entry LUT walker. Slack stays positive in every STA
analysis class: the wake-up FSM lives entirely on the slow
I²C-clock domain, and Recovery analysis on the `iRST_N` async
deassert shows +17.621 ns of slack.

**TB note** — `tb_de25_nano_psmct32_raster_demo_top` (the
Ch149 board smoke) wires up the new HDMI_TX_INT input
(tied high = no interrupt) and leaves the I²C SCL/SDA lines
floating; the wake-up FSM walks the LUT but full completion
takes ~125 ms simulated at the production divider
(controller-clock period ~100 µs × 33 phases/transaction × 38
transactions), far longer than the existing 5 ms TB runtime. The board TB
doesn't observe `hdmi_init_done` directly — it pre-dates the
wake-up FSM and only smoke-tests the wrapper. The Ch165 audit
landed `tb_hdmi_i2c_wake_smoke` (`sim/tb/top/`), which
overrides `CLK_Freq / I2C_Freq` to collapse the divider so the
walk runs in microseconds and asserts the LUT walk + READY
rise + HDMI_TX_INT retrigger + open-drain SDA + the Ch166
sticky NACK watchdog. Ch167 added a bus-level byte-sequence
lock: the TB switched its SDA model from pulldown to
pullup + a phase-aware slave-ACK driver (drives strong-LOW
exactly when `u_dut.u0.phase` is `PH_ACK0/1/2`, releases
otherwise so the master's data bits are visible on the
wire). A decoder samples SDA on each SCL rising edge
between START and STOP, assembles the three bytes per
transaction into a 24-bit `{dev_addr, reg, data}` tuple,
and compares against `u_dut.mI2C_DATA[23:0]` snapshotted
on `mI2C_GO` rising edges. Asserts: the 38 captured tuples
match the 38 intended ones byte for byte, and every dev_addr
is `8'h72`.
The Phase 3 open-drain check also flipped semantics from
"SDA never strong-HIGH" to "SDA never `'x`" (the right
violation test for the pullup bus).
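
The decoder's bit bookkeeping can be modelled outside the simulator. The
sketch below is not the TB's code: it assumes standard I²C write framing
(MSB-first bytes, one ACK slot after each byte, so 27 SCL rising edges
between START and STOP), and the function names and the register/data
values are illustrative only. `0x72` is the device address the TB asserts.

```python
def encode_transaction(dev_addr: int, reg: int, data: int) -> list[int]:
    """Return the SDA levels seen on the 27 SCL rising edges of one write.

    Each byte goes out MSB-first and is followed by one ACK slot, which
    the slave drives LOW (0) on a healthy bus.
    """
    bits = []
    for byte in (dev_addr, reg, data):
        bits.extend((byte >> shift) & 1 for shift in range(7, -1, -1))
        bits.append(0)  # slave ACK slot
    return bits


def decode_transaction(samples: list[int]) -> int:
    """Assemble 27 SDA samples into a 24-bit {dev_addr, reg, data} tuple."""
    if len(samples) != 27:
        raise ValueError(f"expected 27 samples, got {len(samples)}")
    word = 0
    for i in range(3):
        chunk = samples[9 * i : 9 * i + 9]
        if chunk[8] != 0:
            raise ValueError(f"byte {i} was NACKed")
        for bit in chunk[:8]:  # skip the ACK slot, keep the 8 data bits
            word = (word << 1) | bit
    return word


# Illustrative values only (not taken from the project's LUT).
captured = decode_transaction(encode_transaction(0x72, 0x41, 0x10))
intent = 0x724110
assert captured == intent
assert (captured >> 16) == 0x72  # dev_addr check, as in the TB assertion
```

Comparing whole 24-bit tuples rather than individual bits mirrors the
TB's approach of checking each capture against the snapshotted intent.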

**Snapshots**: Ch164 baseline preserved at
[`baseline_ch164/`](../../synth/de25_nano/top_psmct32_raster_demo/baseline_ch164/);
Ch165 baseline at
[`baseline_ch165/`](../../synth/de25_nano/top_psmct32_raster_demo/baseline_ch165/).

**Open Ch168+ items**:
- Proper `set_output_delay` constraints on HDMI_TX_* once the
  ADV7513 setup/hold window is locked from the bring-up
  datasheet pass (replaces the Ch164 `set_false_path -to`).
- Make the rendered pattern bigger than Ch123's 16×8 SPRITE
  so there's something visible to admire on a real screen.
- xfer-side T4 coverage TB (still open from Ch157+).
- `useg_shadow_mem` BRAM-shape forensics.

**Sim regression**: 144 PASS / 0 FAIL.
`tb_de25_nano_psmct32_raster_demo_top` PASSES with the new
HDMI control ports wired up (HDMI_TX_INT held high in the
TB; LED=`0b11111000` shows the existing 3 status LEDs lit
— LED[3] stays unlit because the LUT walk doesn't complete
in 5 ms of sim). `tb_hdmi_i2c_wake_smoke` PASSES the
accelerated bring-up + Ch166 NACK-watchdog assertions.
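
As a cross-check on why the board TB never sees the walk finish, the
arithmetic from the TB note can be reproduced directly. This is a sketch
using only the figures quoted above (~100 µs controller-clock period, 33
phases, 38 steps, 5 ms TB runtime); the constant and function names are
illustrative and do not correspond to RTL parameters.

```python
# Figures quoted in the TB note; names are illustrative only.
CONTROLLER_CLOCK_PERIOD_S = 100e-6  # ~100 µs at the production divider
PHASES_PER_STEP = 33                # controller phases per walk step
WALK_STEPS = 38                     # length of the LUT walk
TB_RUNTIME_S = 5e-3                 # existing board-TB runtime


def walk_time_s(period_s: float) -> float:
    """Total LUT-walk time for a given controller-clock period."""
    return period_s * PHASES_PER_STEP * WALK_STEPS


def max_period_for_budget_s(budget_s: float) -> float:
    """Longest controller-clock period that still fits the walk in budget_s."""
    return budget_s / (PHASES_PER_STEP * WALK_STEPS)


production = walk_time_s(CONTROLLER_CLOCK_PERIOD_S)
print(f"production walk: {production * 1e3:.1f} ms")           # 125.4 ms
print(f"fits in TB runtime: {production <= TB_RUNTIME_S}")     # False
print(f"period needed: {max_period_for_budget_s(TB_RUNTIME_S) * 1e6:.2f} µs")  # 3.99 µs
```

The walk is roughly 25× longer than the board TB runs, which is why the
accelerated smoke TB overrides the divider instead of extending runtime.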

### Hardware-readiness pass for the Ch123 PSMCT32 raster demo (Ch144)

Ch144 is a synthesis/FPGA-readiness audit around the first
hardware-demo candidate (Ch123 PSMCT32 raster e2e, marked above).
No RTL changes — Ch144 documents what a top-level FPGA wrapper
needs to know before attempting a first build.

**RTL dependency tree (Ch123-only)** — what the demo *actually*
instantiates. The full `RTL_SRCS` list compiled by sim contains
~40 modules; Ch123 only reaches these 11, plus the swizzle math
primitive that the three swizzle-aware modules each instantiate
internally:

| Module                       | Role in Ch123                                               |
|------------------------------|-------------------------------------------------------------|
| `bios_rom_stub`              | EE bootlet at 0xBFC0_0000 (~18 instructions)                |
| `ee_ram_stub`                | DMAC-side GIF payload (~24 qwords)                          |
| `ee_memory_map_stub`         | EE-CPU + DMAC + bios + map's GS-priv decode                 |
| `ee_core_stub`               | MIPS R5900 core running the bootlet                         |
| `ee_gs_priv_bridge_stub`     | EE 32-bit MMIO → 64-bit GS-priv reg writes                  |
| `dmac_reg_stub`              | DMAC ch2 NORMAL transfer                                    |
| `gif_packed_stub`            | GIFtag + PACKED A+D parser                                  |
| `gs_stub`                    | GS register file + raster (`PSMCT32_SWIZZLE=1`)             |
| `gif_image_xfer_stub`        | TRXDIR/IMAGE engine (`PSMCT32_SWIZZLE=1`, dormant in Ch123) |
| `vram_stub`                  | 8 KiB VRAM (one PSMCT32 page)                               |
| `gs_pcrtc_stub`              | PCRTC scanout (`PSMCT32_SWIZZLE=1`)                         |
| `gs_swizzle_psmct32_stub`    | Pure-comb math, one instance in each swizzle-gated module   |

**Sim-only constructs audit** (full sweep of the 12 modules
above):
- `bios_rom_stub.sv` and `ee_ram_stub.sv` — `$display` /
  `$readmemh` inside `initial begin`. Both are synth-safe:
  Xilinx Vivado and Intel Quartus support `$readmemh` for BRAM
  initialization, and `$display` generates no logic (the tools
  skip it or report it as an elaboration-time message).
- `vram_stub.sv` L114-117 — single `$error` parameter validator
  inside `initial begin`. Synth ignores it; the BYTES parameter
  must be set to a sane value at instantiation regardless.
- `ee_gs_priv_bridge_stub.sv` L118 — runtime `$error` on
  unsupported byte enables, inside `always_ff`. Synth ignores
  the `$error`; the surrounding logic still synthesizes
  correctly.
- **No** `$finish` / `$dumpfile` / `$random` / `force` /
  `release` / `real`-typed signals / hierarchical refs in any
  module of the **Ch123 dep tree**. (TBs use hierarchical refs
  into `bios_rom_stub` to preload the bootlet; that's a TB-only
  concern. On hardware the bootlet image is the BRAM init.
  Out-of-tree note: `boot_install_agent_stub.sv` (SIF subsystem,
  not in the Ch123 dep tree) contains a `$fatal` runtime
  validator, but it is never compiled into the Ch123 hardware
  build.)

**Memory sizing**:

| Memory             | Default       | Ch123 sim setting | Ch123 hw recommendation | FPGA fit                         |
|--------------------|---------------|-------------------|-------------------------|----------------------------------|
| `bios_rom_stub`    | 4 MiB         | 4 KiB             | 4 KiB                   | ≤1 BRAM tile                     |
| `ee_ram_stub`      | 16 KiB        | 4 KiB             | 4 KiB                   | ≤1 BRAM tile                     |
| `vram_stub`        | 64 KiB        | 8 KiB             | 8 KiB                   | ≤2 BRAM tiles (one PSMCT32 page) |
| `ee_memory_map_stub.useg_shadow_mem` (Ch145) | 4 MiB | 4 MiB | **4 KiB** (override `USEG_SHADOW_WORDS_PARAM=1024`) | ≤1 BRAM tile when overridden |

The 16×8 framebuffer needs only 16×8×4 = 512 bytes; 8 KiB gives
the full first PSMCT32 page (FBP=0). For a more ambitious
hardware demo (multi-page framebuffers, textures), grow
`vram_stub.BYTES` toward 1 MiB / 4 MiB. Real PS2 has 4 MiB of
VRAM; a first hardware build can stay at 8 KiB.
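
The sizing figures above can be re-derived in a few lines. This is a
sketch for checking the arithmetic only; it assumes the GS PSMCT32 page
geometry of 64×32 pixels at 4 bytes per pixel and 4-byte shadow words
(consistent with the 1M-word / 4 MiB default), and the helper name is
hypothetical.

```python
KIB = 1024
MIB = 1024 * KIB


def framebuffer_bytes(width: int, height: int, bytes_per_pixel: int = 4) -> int:
    """Linear framebuffer size; PSMCT32 stores 4 bytes per pixel."""
    return width * height * bytes_per_pixel


# One PSMCT32 page is 64x32 pixels.
PSMCT32_PAGE_BYTES = framebuffer_bytes(64, 32)

demo_fb = framebuffer_bytes(16, 8)             # the Ch123 16x8 sprite target
vram_pages = (8 * KIB) // PSMCT32_PAGE_BYTES   # pages in the 8 KiB vram_stub

# useg shadow: 1024-word override versus the 4 MiB default.
shadow_override_bytes = 1024 * 4
shadow_shrink = (4 * MIB) // shadow_override_bytes

print(demo_fb, PSMCT32_PAGE_BYTES, vram_pages)  # 512 8192 1
print(shadow_override_bytes, shadow_shrink)     # 4096 1024
```

The 512-byte framebuffer occupies a small part of the single page, so the
8 KiB setting is driven by page granularity rather than by pixel count.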

**Ch145 — `useg_shadow_mem` parameterization**: pre-Ch145, the
ee_memory_map_stub's useg-shadow backing was a fixed 1M-word /
4 MiB array. That was correct for the BIOS-smoke chapters that
need full first-4-MiB-of-useg coverage, but it's wasted area
for the Ch123 hardware demo (which never touches useg — the
bootlet runs from BIOS at 0xBFC0_0000 and the GIF payload from
RAM at phys 0x100). Ch145 promotes `USEG_SHADOW_WORDS` from a
hardcoded `localparam` to the `USEG_SHADOW_WORDS_PARAM` module
parameter (default 1M words = 4 MiB → existing TBs unchanged).
For the Ch123 hardware demo, the top-level wrapper instantiates
`ee_memory_map_stub` with `.USEG_SHADOW_WORDS_PARAM(1024)` to
shrink the inferred BRAM footprint by ~1024×; correctness is
unaffected because no useg access ever happens in the Ch123
data plane.

**Clock / reset assumptions**:
- Single clock domain (`clk`) — all 12 modules share one input.
- Active-low synchronous reset input (`rst_n`) — also a single
  shared input. No reset gating, no per-module variants. The
  reset is sampled inside `always_ff @(posedge clk)` via the
  `if (!rst_n)` pattern (NOT `posedge clk or negedge rst_n`) —
  i.e., it is NOT an async reset despite being active-low. On
  FPGA this should be brought up via the device's reset bridge
  so the deasserting edge is synchronous to `clk`.
- No clock gating, no derived clocks. The PCRTC's hsync/vsync/de
  are regular clock-domain outputs, not separate clocks.

**Swizzle gate parameter defaults**:
- All four swizzle parameters (`PSMCT32_SWIZZLE`,
  `PSMCT16_SWIZZLE`, `PSMT8_SWIZZLE`, `PSMT4_SWIZZLE`) default
  to `1'b0` on `gs_stub`, `gs_pcrtc_stub`, and
  `gif_image_xfer_stub`. For the Ch123 hardware demo,
  instantiate these three modules with **`.PSMCT32_SWIZZLE(1'b1)`**
  and leave the other three at `1'b0`. The swizzle-math
  primitives (`gs_swizzle_psmct32_stub` etc.) are pure-comb and
  trim cleanly when their gate is off.

**Top-level harness expectations** (for a future
`top_psmct32_raster_demo.sv`):
- Inputs: `clk`, `rst_n`, plus board-level video-out connections
  (HDMI / DVI / VGA — driven by `r/g/b/hsync/vsync/de` from
  `gs_pcrtc_stub`).
- The EE bootlet image must be preloaded into `bios_rom_stub`
  via either `IMAGE_FILE` (→ `$readmemh`) or a bake-step that
  writes a `.mem` next to the synthesis project. The bootlet is
  18 MIPS instructions (currently authored procedurally in the
  Ch123 TB body via `ee_prog_word()`); for hardware this needs
  to become a static `.mem` checked into the repo.
- The GIF payload must be preloaded into `ee_ram_stub` via the
  same mechanism — 24 qwords starting at `PAYLOAD_MADR=0x100`.
  Current TB authors them procedurally with `preload_qword()`;
  hardware needs a static `.mem`.
- The `core_go` signal must be tied high (or pulsed by a board
  reset-release sequencer) so the EE starts fetching from
  `0xBFC0_0000`.
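
One possible shape for the bake step mentioned above is a small script
that renders word lists as `$readmemh`-compatible text. This is a sketch,
not the project's tooling: the file names, the 32-bit ROM word width, the
128-bit RAM word width, and the qword-indexed load address are all
assumptions, and the word values are placeholders rather than the real
bootlet or payload.

```python
import tempfile
from pathlib import Path


def format_mem(words, width_bits, start_index=0):
    """Render words as $readmemh text: one zero-padded hex word per line."""
    digits = (width_bits + 3) // 4
    lines = []
    if start_index:
        # $readmemh addresses count memory elements, not bytes.
        lines.append(f"@{start_index:x}")
    for i, word in enumerate(words):
        if not 0 <= word < (1 << width_bits):
            raise ValueError(f"word {i} does not fit in {width_bits} bits")
        lines.append(f"{word:0{digits}x}")
    return "\n".join(lines) + "\n"


def bake(path, words, width_bits, start_index=0):
    """Write a .mem image to path."""
    Path(path).write_text(format_mem(words, width_bits, start_index))


if __name__ == "__main__":
    bootlet = [0x00000000] * 18  # placeholder: 18 MIPS NOPs, not the bootlet
    payload = [0] * 24           # placeholder: 24 zero qwords
    with tempfile.TemporaryDirectory() as out_dir:
        rom = Path(out_dir) / "bootlet.mem"      # hypothetical file name
        ram = Path(out_dir) / "gif_payload.mem"  # hypothetical file name
        bake(rom, bootlet, 32)
        # Assumes a 128-bit-wide RAM array, so byte 0x100 is element 0x10.
        bake(ram, payload, 128, start_index=0x100 // 16)
        print(len(rom.read_text().splitlines()), "ROM lines")
        print(len(ram.read_text().splitlines()), "RAM lines")
```

Generating the images from a script keeps the checked-in `.mem` files
reproducible; the element width and start index would need to match the
actual array declarations in `bios_rom_stub` and `ee_ram_stub`.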

**Known sim-only constructs that should NOT block first build**:
- `$display` lines in BIOS/RAM init (synth ignores).
- `$readmemh` (synth tools handle it for BRAM init).
- `$error` parameter validators (synth ignores).

**Known sim-only constructs that WOULD block first build**:
- None found in the Ch123 dep tree.

**Open questions for the hardware-build session** (deliberately
not answered here — they need a board-level decision):
- Target FPGA family + clock frequency (PCRTC was designed
  around 13.5 MHz pixel clock for the 16×8 active area; first
  build can run at any clock since the TB doesn't model real
  CRTC timing).
- Video-out PHY (HDMI core, VGA DAC, on-board HDMI
  transmitter chip).
- BIOS / payload bake step (Vivado `update_compile_order` +
  `.mem` files vs. a SystemVerilog `localparam` array
  preload).
- Whether to keep `ee_core_stub`'s `STRICT_UNSUPPORTED` gate
  active on hardware (catches unknown opcodes by halt+latch —
  useful for debugging, but a hard failure on any unintended
  fetch).

The Ch90 white-box TB `tb_gs_scanout_basic.sv` exercises the
full round trip: instantiates `gs_stub` + `vram_stub` +
`gs_pcrtc_stub`, drives a 4×4 sprite through the GIF reg port,
waits for raster to fully drain, then enables scanout and
captures one full frame's `(hcnt, vcnt) → (r, g, b)` trace.
Asserts: every pixel inside the sprite reads as the emitted
color, every pixel outside reads as 0, and at least one EV_MODE
frame trace fires.